CN113159312B - Method for compressing neural network model, computer system and storage medium - Google Patents
- Publication number: CN113159312B (application CN202110066485.4A)
- Authority: CN (China)
- Prior art keywords: neural network, weight, unified, weight coefficients, computer
- Prior art date
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The present disclosure provides a method, computer system, and storage medium for compressing a neural network model. The method includes reordering at least one index corresponding to a multi-dimensional tensor associated with the neural network, determining a set of weight coefficients associated with the at least one reordered index, and compressing the model of the neural network according to the determined set of weight coefficients.
Description
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 62/964,996, filed with the United States Patent and Trademark Office on January 23, 2020, and U.S. patent application No. 17/088,061, filed with the United States Patent and Trademark Office on November 3, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to the field of data processing, and more particularly to neural networks.
Background
The International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) (JTC 1/SC 29/WG 11) has been actively searching for potential requirements for standardization of future video codec technology for visual analysis and understanding. In 2015, ISO adopted the Compact Descriptors for Visual Search (CDVS) standard as a still-image standard; it extracts feature representations for image similarity matching. The CDVS standard is listed as Part 15 of MPEG 7 and ISO/IEC 15938-15 and was finalized in 2018; it extracts global and local, hand-designed and deep neural network (Deep Neural Networks, DNN) based feature descriptors of video clips. The success of DNN in many video applications, such as semantic classification, object detection/recognition, object tracking, and video quality enhancement, has created a strong need to compress DNN models.
Disclosure of Invention
Embodiments of the present disclosure relate to a method, system, and computer-readable storage medium for compressing a neural network model, which may compress the neural network model and improve the computational efficiency of the neural network model.
According to one aspect, a method for compressing a neural network model is provided. The method may include reordering at least one index corresponding to a multi-dimensional tensor associated with a neural network, determining a set of weight coefficients associated with the at least one reordered index, and compressing the model of the neural network according to the determined set of weight coefficients.
According to another aspect, a computer system for compressing a neural network model is provided. The computer system may include a reordering module configured to reorder at least one index corresponding to a multi-dimensional tensor associated with a neural network. A set of weight coefficients associated with the at least one reordered index is determined, and a model of the neural network is compressed according to the determined set of weight coefficients.
According to yet another aspect, a non-transitory computer-readable medium for compressing a neural network model is provided. The non-transitory computer-readable medium may store program instructions that are executable by a processor to perform a method. The method may accordingly include reordering at least one index corresponding to a multidimensional tensor associated with the neural network, determining a set of weight coefficients associated with the at least one reordered index, and compressing the model of the neural network based on the determined set of weight coefficients.
With the method, the system, and the computer-readable storage medium for compressing the neural network model, the efficiency of compressing the learned weight coefficients can be improved, which accelerates computation using the optimized weight coefficients; the neural network model can thus be significantly compressed and its computational efficiency improved.
Drawings
These and other objects, features, and advantages will become apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings. Because the drawings are intended to facilitate a clear understanding by those skilled in the art in conjunction with the detailed description, the various features of the drawings are not to scale. In the drawings:
FIG. 1 illustrates a networked computer environment in accordance with at least one embodiment;
FIG. 2 is a block diagram of a neural network model compression system, in accordance with at least one embodiment;
FIG. 3 illustrates an operational flow diagram of steps performed by a program that compresses a neural network model, in accordance with at least one embodiment;
FIG. 4 is a block diagram of the internal and external components of the computer and server depicted in FIG. 1, in accordance with at least one embodiment;
FIG. 5 is a block diagram of an exemplary cloud computing environment including the computer system depicted in FIG. 1, in accordance with at least one embodiment; and
FIG. 6 is a block diagram of functional layers of the exemplary cloud computing environment of FIG. 5 in accordance with at least one embodiment.
Detailed Description
Detailed embodiments of the claimed structures and methods are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods, which may be embodied in various forms. These structures and methods should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.
Embodiments of the present disclosure relate generally to the field of data processing, and more particularly to neural networks. The exemplary embodiments described below provide systems, methods, and computer programs for compressing neural network models. Thus, some embodiments have the ability to improve the computational domain by allowing for improved compression efficiency of learned weighting coefficients, which can significantly reduce the deep neural network model size.
As previously mentioned, ISO/IEC MPEG (JTC 1/SC 29/WG 11) has been actively searching for potential requirements for standardization of future video codec technology for visual analysis and understanding. In 2015, ISO adopted the CDVS standard as a still-image standard; it extracts feature representations for image similarity matching. The CDVS standard is listed as Part 15 of MPEG 7 and ISO/IEC 15938-15 and was finalized in 2018; it extracts global and local, hand-designed and DNN-based feature descriptors of video clips. The success of DNN in many video applications, such as semantic classification, object detection/recognition, object tracking, and video quality enhancement, has created a strong need to compress DNN models.
Thus, MPEG is actively working on the Coded Representation of Neural Networks (NNR) standard, which encodes DNN models to save both storage and computation. There are several methods of learning compact DNN models. Their goal is to remove non-significant weight coefficients, under the assumption that the smaller the value of a weight coefficient, the lower its significance. Several network pruning methods have been proposed to pursue this goal explicitly, either by adding sparsity-promoting regularization terms to the network training objective or by greedily deleting network parameters. From the perspective of compressing the DNN model, after a compact network model has been learned, the weight coefficients may be further compressed by quantization followed by entropy coding. Such further compression can significantly reduce the storage size of the DNN model, which is essential for deploying the model on mobile devices, chips, and the like.
Unified regularization of the weights can improve compression efficiency in the subsequent compression process. An iterative network retraining/refinement framework is used to jointly optimize the original training objective and the weight unification loss, which includes a compression rate loss, a unification distortion loss, and a computation speed loss, so that the learned network weight coefficients maintain the performance of the original objective, are suitable for further compression, and can accelerate computation using the learned weight coefficients. The proposed method can be applied to compress an original pre-trained DNN model. It can also be used as an additional processing module to further compress any pruned DNN model.
Unified regularization can improve the efficiency of further compressing the learned weight coefficients, thereby accelerating computation using the optimized weight coefficients. This can significantly reduce the DNN model size and speed up inference computation. Through the iterative retraining process, the performance of the original training target can be maintained, which allows for both compression and computational efficiency. The iterative retraining process also gives the flexibility to introduce different losses at different times, so that the system focuses on different targets during the optimization process. The methods, computer systems, and computer programs disclosed herein are generally applicable to data sets having different forms of data. The input/output data is typically a 4D tensor, which may be a real video clip, an image, or an extracted feature map.
Aspects are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer-readable media according to various embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
Referring now to FIG. 1, a functional block diagram of a networked computer environment illustrates a neural network model compression system 100 (hereinafter "system") for compressing a neural network model. It should be understood that fig. 1 provides only a schematic representation of one embodiment and is not meant to be limiting in any way with respect to the environments in which the different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements.
The system 100 may include a computer 102 and a server computer 114. The computer 102 may communicate with a server computer 114 via a communication network 110 (hereinafter "network"). The computer 102 includes a processor 104 and a software program 108 stored on a data storage device 106 and is capable of interfacing with a user and communicating with a server computer 114. As will be discussed below with reference to fig. 4, the computer 102 may include an internal component 800A and an external component 900A, respectively, and the server computer 114 may include an internal component 800B and an external component 900B, respectively. The computer 102 may be, for example, a mobile device, a telephone, a personal digital assistant, a netbook, a laptop computer, a tablet computer, a desktop computer, or any type of computing device capable of running a program, accessing a network, and accessing a database.
As discussed below in connection with fig. 5 and 6, the server computer 114 may also operate in a cloud computing service model, such as software as a service (Software as a Service, saaS), platform as a service (Platform as a Service, paaS), or infrastructure as a service (Infrastructure as a Service, iaaS). The server computer 114 may also be located in a cloud computing deployment model, such as a private cloud, community cloud, public cloud, or hybrid cloud.
The server computer 114 for compressing the neural network model is capable of running a neural network model compression program (hereinafter referred to as a "program") 116 that interacts with the database 112. The neural network model compression method will be explained in more detail below in connection with fig. 3. In one embodiment, computer 102 may operate as an input device including a user interface, and program 116 may run primarily on server computer 114. In alternative embodiments, the program 116 may run primarily on at least one computer 102, and the server computer 114 may be used to process and store data used by the program 116. It should be noted that the program 116 may be a stand-alone program or may be integrated into a larger neural network model compression program.
However, it should be noted that in some instances, the processing of program 116 may be shared between computer 102 and server computer 114 in any proportion. In another embodiment, the program 116 may operate on more than one computer, a server computer, or some combination of computers and server computers, such as multiple computers 102 in communication with a single server computer 114 over the network 110. In another embodiment, for example, the program 116 may operate on a plurality of server computers 114, the plurality of server computers 114 in communication with a plurality of client computers via the network 110. Alternatively, the program may run on a web server that communicates with the server and the plurality of client computers over a network.
Network 110 may include wired connections, wireless connections, fiber optic connections, or some combination thereof. In general, network 110 may be any combination of connections and protocols that support communications between computer 102 and server computer 114. The network 110 may include various types of networks such as, for example, a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN) (e.g., the internet), a telecommunications network (e.g., public switched telephone network (Public Switched Telephone Network, PSTN)), a wireless network, a public switched network, a satellite network, a cellular network (e.g., a fifth generation (the fifth generation, 5G) network, a Long term evolution (Long-Term Evolution LTE) network, a third generation (the third generation, 3G) network, a code division multiple access (Code Division Multiple Access, CDMA) network, etc.), a public land mobile network (Public Land Mobile Network, PLMN), a metropolitan area network (Metropolitan Area Network, MAN), a private network, an ad hoc network, an intranet, a fiber-based network, etc., and/or combinations of these or other types of networks.
The number and arrangement of devices and networks shown in fig. 1 are provided as examples. Indeed, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or devices and/or networks having a different arrangement than the devices and/or networks shown in fig. 1. Furthermore, two or more devices shown in fig. 1 may be implemented in a single device, or a single device shown in fig. 1 may be implemented as a plurality of distributed devices. Additionally or alternatively, a set of devices (e.g., at least one device) of system 100 may perform at least one function that is described as being performed by another set of devices of system 100.
Referring now to fig. 2, a neural network model compression system 200 is described. The neural network model compression system 200 may be used as a framework for an iterative learning process. The neural network model compression system 200 may include a unified indexing order and method selection module 202, a weight unification module 204, a network forwarding calculation module 206, a calculation target loss module 208, a calculation gradient module 210, and a back propagation and weight update module 212.
Let $\mathcal{D}=\{(x,y)\}$ denote a data set in which a target $y$ is assigned to an input $x$. Let $\Theta=\{w\}$ denote the set of weight coefficients of a DNN. The goal of neural network training is to learn an optimal set of weight coefficients $\Theta$ so that a target loss $\pounds(\mathcal{D}|\Theta)$ is minimized. For example, in previous network pruning methods, the target loss $\pounds(\mathcal{D}|\Theta)$ has two parts, an empirical data loss $\pounds_D(\mathcal{D}|\Theta)$ and a sparsity-promoting regularization loss $\pounds_R(\Theta)$:

$$\pounds(\mathcal{D}|\Theta)=\pounds_D(\mathcal{D}|\Theta)+\lambda_R\,\pounds_R(\Theta),$$

where $\lambda_R\geq 0$ is a hyperparameter that balances the contributions of the data loss and the regularization loss.
Sparsity-promoting regularization loss places regularization over the entire weight coefficient, and the resulting sparse weights have a weak relationship with inference efficiency or computational acceleration. From another perspective, after pruning, the sparse weights may further undergo another network training process, in which an optimal set of weight coefficients may be learned, which may improve the efficiency of further model compression.
The present disclosure proposes a weight unification loss $\pounds_U(\Theta)$ that is optimized jointly with the original target loss:

$$\pounds(\mathcal{D}|\Theta)=\pounds_D(\mathcal{D}|\Theta)+\lambda_R\,\pounds_R(\Theta)+\lambda_U\,\pounds_U(\Theta),$$

where $\lambda_U\geq 0$ is a hyperparameter that balances the contributions of the original training target and the weight unification. By jointly optimizing this loss, an optimal set of weight coefficients can be obtained that greatly aids the efficiency of further compression. The weight unification loss takes into account how the convolution operation is carried out as the underlying GEMM matrix multiplication process, thereby producing optimized weight coefficients and greatly speeding up computation. Notably, our weight unification loss can be regarded as an additional regularization term of the general target loss, with ($\lambda_R>0$) or without ($\lambda_R=0$) general regularization. Moreover, our method can be flexibly applied to any regularization loss $\pounds_R(\Theta)$.
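A minimal sketch of how such a joint objective could be assembled is given below. The NumPy usage, the function name, and the default hyperparameter values are illustrative assumptions and not part of the disclosed embodiments; $\pounds_R$ is assumed here to be a plain L2 penalty.

```python
import numpy as np

def joint_loss(data_loss, weights, unification_loss, lambda_r=1e-4, lambda_u=1e-2):
    """Sketch of the joint objective L_D + lambda_R * L_R + lambda_U * L_U.

    data_loss: scalar empirical data loss L_D computed elsewhere.
    weights: list of weight arrays, Theta = {w}.
    unification_loss: scalar weight unification loss L_U computed elsewhere.
    """
    # Assumed general regularization term L_R(Theta): L2 penalty over all weights.
    reg_loss = sum(float(np.sum(w ** 2)) for w in weights)
    return data_loss + lambda_r * reg_loss + lambda_u * unification_loss
```

For example, `joint_loss(0.35, [np.ones((3, 3))], 0.12)` combines a data loss of 0.35 with the two regularization terms; setting `lambda_r=0` reduces the objective to the case without general regularization noted above.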
In at least one embodiment, the weight unification loss $\pounds_U(\Theta)$ further includes a compression rate loss $\pounds_C(\Theta)$, a unification distortion loss $\pounds_I(\Theta)$, and a computation speed loss $\pounds_S(\Theta)$:

$$\pounds_U(\Theta)=\pounds_I(\Theta)+\lambda_C\,\pounds_C(\Theta)+\lambda_S\,\pounds_S(\Theta).$$
These loss terms are described in detail in later sections. For both learning effectiveness and learning efficiency, an iterative optimization process is further proposed. In each iteration, the partial weight coefficients that satisfy the desired structure may be fixed, and the non-fixed portion of the weight coefficients may be updated by back-propagating the training loss. By iteratively performing these two steps, more and more weights are gradually fixed, and the joint loss is effectively optimized step by step.
Moreover, in at least one embodiment, each layer is compressed separately, and $\pounds_U(\Theta)$ can be further written as:

$$\pounds_U(\Theta)=\sum_{j=1}^{N} L_U(W_j),$$

where $L_U(W_j)$ is the unification loss defined on the $j$-th layer, $N$ is the total number of layers over which the loss is measured, and $W_j$ denotes the weight coefficients of the $j$-th layer. Because $L_U(W_j)$ is calculated independently for each layer, the subscript $j$ is omitted in the remainder of this disclosure without loss of generality.
For each network layer, its weight coefficient $W$ is a general 5-dimensional (5D) tensor of size $(c_i,k_1,k_2,k_3,c_o)$. The input of the layer is a 4-dimensional (4D) tensor $A$ of size $(h_i,w_i,d_i,c_i)$, and the output of the layer is a 4D tensor $B$ of size $(h_o,w_o,d_o,c_o)$. The sizes $c_i$, $k_1$, $k_2$, $k_3$, $c_o$, $h_i$, $w_i$, $d_i$, $h_o$, $w_o$, $d_o$ are integers greater than or equal to 1. When any of these sizes takes the value 1, the corresponding tensor reduces to a lower dimension. Each entry in each tensor is a floating-point number. Let $M$ denote a 5D binary mask of the same size as $W$, where each entry in $M$ is a binary 0/1 number indicating whether the corresponding weight coefficient is pruned or retained. $M$ is introduced in association with $W$ to handle the case in which $W$ comes from a pruned DNN model, where some connections between neurons in the network have been removed from the computation. When $W$ comes from an original unpruned pre-trained model, all entries in $M$ take the value 1. The output $B$ is computed through the convolution of $A$ with the masked weights $M\odot W$ (the element-wise product of the mask and the weights).

The parameters $h_i$, $w_i$, and $d_i$ ($h_o$, $w_o$, and $d_o$) are the height, width, and depth of the input tensor $A$ (output tensor $B$). The parameter $c_i$ ($c_o$) is the number of input (output) channels. The parameters $k_1$, $k_2$, and $k_3$ are the sizes of the convolution kernel along the height, width, and depth axes, respectively. That is, for each output channel $v=1,\ldots,c_o$, the operation can be seen as convolving the input $A$ with a 4D weight tensor $W_v$ of size $(c_i,k_1,k_2,k_3)$.
The order of the summation operations may be changed. Accordingly, the 5D weight tensor can be reshaped into a 3D tensor of size $(c_i,c_o,k)$, where $k=k_1\cdot k_2\cdot k_3$. The order of the reshaped indices along the $k$-axis is determined by a reshaping algorithm during the reshaping process, which is described in detail later.
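A hedged sketch of this reshaping step is shown below, assuming NumPy and row-major reshaping; the function name and the optional `order` argument (standing in for the reordered index $I(W)$) are illustrative assumptions.

```python
import numpy as np

def reshape_weight(W, order=None):
    """Reshape a 5D weight tensor (c_i, k1, k2, k3, c_o) into a 3D tensor
    (c_i, c_o, k) with k = k1 * k2 * k3, optionally reordering the k-axis."""
    c_i, k1, k2, k3, c_o = W.shape
    k = k1 * k2 * k3
    W3 = W.reshape(c_i, k, c_o).transpose(0, 2, 1)   # collapse kernel axes -> (c_i, c_o, k)
    if order is not None:                            # apply the reordered index I(W)
        W3 = W3[:, :, order]
    return W3
```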
The desired structure of the weight coefficients can be designed by considering two aspects. First, the structure of the weight coefficients should be consistent with the underlying GEMM matrix multiplication process by which the convolution operation is implemented, so as to accelerate inference computation using the learned weight coefficients. Second, the structure of the weight coefficients should help improve the quantization and entropy coding efficiency of further compression. In at least one embodiment, a block-wise structure of the weight coefficients in each layer may be used in the 3D reshaped weight tensor. In particular, the 3D tensor may be partitioned into blocks of size $(g_i,g_o,g_k)$, and all coefficients within a block may be unified. The unified weights in a block are set to follow a predefined unification rule; for example, all values are set to be the same, so that one value can represent the entire block in the quantization process, yielding high efficiency. There may be multiple weight unification rules, each associated with a unification distortion loss that measures the error introduced by applying the rule. For example, instead of setting the weights to be the same, the weights may be set to have the same absolute value while keeping their original signs. Given the designed structure, during an iteration, the partial weight coefficients to be fixed may be determined by taking into account the unification distortion loss, the estimated compression rate loss, and the estimated speed loss. A neural network training process is then performed to update the remaining non-fixed weight coefficients through a back-propagation mechanism.
The overall framework of the iterative retraining/fine-tuning process iteratively alternates between two steps to gradually optimize the joint loss. Given a pre-trained DNN model with weight coefficients $W$ and mask $M$, which may be a pruned sparse model or an unpruned non-sparse model, in the first step the order of indices $I(W)=[i_0,\ldots,i_k]$, with $k=k_1\cdot k_2\cdot k_3$, may be determined by the unified index order and method selection module 202 in order to reshape the weight coefficients $W$ (and the corresponding mask $M$) into a 3D tensor. Specifically, the reshaped 3D tensor of the weight $W$ may be partitioned into a number of superblocks of size $(g_i,g_o,g_k)$. Let $S$ denote a superblock. Based on the weight unification loss $\pounds_U$ of the weight coefficients within the superblock $S$, $I(W)$ is determined separately for each superblock $S$. The size of the superblock is typically chosen according to the later compression method. For example, in at least one embodiment, a superblock of size (64, 64, 2) may be selected to be consistent with the 3-Dimension Coding Tree Unit (CTU3D) used by the later compression process.
Each superblock $S$ is further divided into blocks of size $(d_i,d_o,d_k)$. Weight unification occurs within a block. For each superblock $S$, a weight unifier is used to unify the weight coefficients within the blocks of $S$. Let $b$ denote a block in $S$; there may be different ways to unify the weight coefficients in $b$. For example, the weight unifier may set all weights in $b$ to be the same, e.g., to the mean of all weights in $b$. In this case, an $L_N$ norm of the weight coefficients in $b$ (e.g., the $L_2$ norm, i.e., the variance of the weights in $b$) reflects the unification distortion loss $\pounds_I(b)$ of using the mean to represent the entire block. Alternatively, the weight unifier may set all weights to have the same absolute value while keeping their original signs. In this case, an $L_N$ norm of the absolute values of the weights in $b$ can be used to measure $L_I(b)$. In other words, given a weight unification method $u$, the weight unifier unifies the weights in $b$ using the method $u$ with an associated unification distortion loss $L_I(u,b)$. The unification distortion loss of the entire superblock $S$, $L_I(u,S)$, can be computed by averaging $L_I(u,b)$ over all blocks in $S$, i.e., $L_I(u,S)=\mathrm{average}_b(L_I(u,b))$.
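The two unification rules described above can be sketched as follows; the function name, the use of NumPy, and the exact distortion measures (variance and mean squared deviation of the magnitudes) are illustrative assumptions standing in for the $L_N$ norms mentioned in the text.

```python
import numpy as np

def unify_block(block, method="mean"):
    """Unify one block b of the reshaped weight tensor and return the
    unified block together with its unification distortion loss L_I(u, b)."""
    if method == "mean":
        # rule 1: all weights in the block are set to their mean value
        unified = np.full_like(block, block.mean())
        distortion = float(np.var(block))                    # L2-style distortion
    elif method == "abs_mean":
        # rule 2: same absolute value, original signs preserved
        mag = float(np.abs(block).mean())
        unified = np.sign(block) * mag
        distortion = float(np.mean((np.abs(block) - mag) ** 2))
    else:
        raise ValueError(f"unknown unification method: {method}")
    return unified, distortion
```

The superblock-level distortion $L_I(u,S)$ is then simply the average of the per-block distortions returned here.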
Similarly, the compression rate loss $\pounds_C(u,S)$ reflects the compression efficiency of the unified weights in the superblock $S$ using method $u$. For example, when all weights in a block are set to be the same, only one number is used to represent the entire block, and the compression rate is $r_{compression}=g_i\cdot g_o\cdot g_k$. $\pounds_C(u,S)$ can then be defined as $1/r_{compression}$.
The speed loss $\pounds_S(u,S)$ reflects the estimated computation speed of using the weight coefficients unified by method $u$ in $S$; $\pounds_S(u,S)$ is a function of the number of multiplications in the computation using the unified weight coefficients.
So far, for each possible way of reordering the indices to generate the 3D tensor of the weights $W$, and for each possible method $u$ of unifying the weights by the weight unifier, the weight unification loss $\pounds_U(u,S)$ can be calculated based on $\pounds_I(u,S)$, $\pounds_C(u,S)$, and $\pounds_S(u,S)$. An optimal weight unification method $u$ and an optimal reordering index $I(W)$ can be selected, the combination of which has the minimal weight unification loss $\pounds_U^*(u,S)$. When $k$ is small, the best $I(W)$ and $u$ can be found by exhaustive search. For large $k$, other methods may be used to find suboptimal $I(W)$ and $u$. The present disclosure is not limited in any way to the specific manner in which $I(W)$ and $u$ are determined.
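A simple sketch of this exhaustive selection over candidate index orders and unification methods is shown below; the way the per-superblock loss is combined and the default $\lambda$ values are illustrative assumptions.

```python
def unification_loss(distortion, block_size, num_multiplies, lambda_c=1.0, lambda_s=1.0):
    """L_U(u, S) = L_I + lambda_C * L_C + lambda_S * L_S for one superblock,
    with L_C taken as 1 / r_compression and L_S as a multiplication-count proxy."""
    l_c = 1.0 / block_size                  # one value represents the whole block
    l_s = float(num_multiplies)             # assumed proxy for the computation speed loss
    return distortion + lambda_c * l_c + lambda_s * l_s

def select_best(candidates):
    """candidates: iterable of (method_u, index_order, loss) tuples;
    returns the combination with the minimal weight unification loss."""
    return min(candidates, key=lambda c: c[2])
```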
Once the order of indices $I(W)$ and the weight unification method $u$ are determined for each superblock $S$, the goal turns to finding an updated optimal set of weight coefficients $W$ and the corresponding weight mask $M$ by iteratively minimizing the joint loss. Specifically, for the $t$-th iteration, the current weight coefficients $W(t-1)$ and mask $M(t-1)$ are available. Furthermore, a weight unification mask $Q(t-1)$ may be maintained throughout the training process. The weight unification mask $Q(t-1)$ has the same shape as $W(t-1)$ and records whether the corresponding weight coefficients have been unified. Then, unified weight coefficients $W_U(t-1)$ and a new unification mask $Q(t-1)$ are computed by the weight unification module 204. In the weight unification module 204, the weight coefficients in $S$ may be reordered according to the determined order of indices $I(W)$, and the superblocks may be sorted in ascending order of their unification loss $\pounds_U(u,S)$. Given a hyperparameter $q$, the top $q$ superblocks are selected for unification. The weight unifier then unifies the blocks in each selected superblock $S$ using the corresponding determined method $u$, yielding unified weights $W_U(t-1)$ and a weight mask $M_U(t-1)$. The corresponding entries in the unification mask $Q(t-1)$ are marked as unified. In at least one embodiment, $M_U(t-1)$ differs from $M(t-1)$: for blocks containing both pruned and unpruned weight coefficients, the weight unifier resets the originally pruned weight coefficients to have non-zero values and changes the corresponding entries in $M_U(t-1)$. For other types of blocks, $M_U(t-1)$ naturally remains unchanged.
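The selection of the top $q$ superblocks can be sketched as follows; the function name and the representation of the per-superblock losses as a plain list are assumptions for illustration.

```python
def select_superblocks(superblock_losses, q):
    """Pick the q superblocks with the smallest unification loss L_U(u, S);
    their blocks are then unified and fixed for the rest of the iteration."""
    order = sorted(range(len(superblock_losses)), key=lambda i: superblock_losses[i])
    return order[:q]
```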
The weight coefficients marked as unified in $Q(t-1)$ may be fixed, and the remaining non-fixed weight coefficients of $W(t-1)$ may be updated by a neural network training process, resulting in updated $W(t)$ and $M(t)$.
Let $\mathcal{D}'=\{(x,y)\}$ denote a training data set, where $\mathcal{D}'$ can be the same as the original data set $\mathcal{D}$ on which the pre-trained weight coefficients $W$ were obtained. $\mathcal{D}'$ may also be a different data set from $\mathcal{D}$, but with the same data distribution as the original data set. In the second step, using the current unified weight coefficients $W_U(t-1)$ and mask $M_U(t-1)$, each input $x$ is passed through the current network via the network forwarding computation module 206, producing an estimated output $\bar{y}$. Based on the ground-truth annotation $y$ and the estimated output $\bar{y}$, the calculate target loss module 208 computes the target training loss $\pounds_D(\mathcal{D}'|\Theta)$. The gradient $G(W_U(t-1))$ of the target loss can then be computed by the calculate gradients module 210. An automatic gradient computation method used by a deep learning framework (such as TensorFlow or PyTorch) can be used to compute $G(W_U(t-1))$. Based on the gradient $G(W_U(t-1))$ and the unification mask $Q(t-1)$, the non-fixed weight coefficients of $W_U(t-1)$ and the corresponding mask $M_U(t-1)$ may be updated through back propagation using the back propagation and weight update module 212. The retraining process may be iterative; multiple iterations are typically performed to update the non-fixed portion of $W_U(t-1)$ and the corresponding $M(t-1)$, e.g., until the target loss converges. The system then proceeds to the next iteration $t$, in which, given a new hyperparameter $q(t)$, new unified weight coefficients $W_U(t)$, a mask $M_U(t)$, and a corresponding unification mask $Q(t)$ are computed by the weight unification process according to $W_U(t-1)$, $u$, and $I(W)$.
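One back-propagation update step under this scheme might look like the following sketch; plain gradient descent, the function name, and the learning rate are illustrative assumptions (the actual optimizer used by the training framework may differ).

```python
import numpy as np

def masked_update(W_u, grad, Q, learning_rate=1e-3):
    """One retraining step: entries marked as unified in Q (Q == 1) stay fixed,
    all other entries of W_U(t-1) are updated from the gradient G(W_U(t-1))."""
    not_fixed = (Q == 0)
    W_new = W_u.copy()
    W_new[not_fixed] -= learning_rate * grad[not_fixed]
    return W_new
```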
In at least one embodiment, the value of the hyper-parameter q (t) increases with increasing t during each iteration, such that more and more weight coefficients are unified and fixed throughout the iterative learning process.
Referring now to fig. 3, an operational flow diagram illustrating steps of a method 300 of compressing a neural network model is depicted. In some embodiments, at least one process block of fig. 3 may be performed by computer 102 (fig. 1) and server computer 114 (fig. 1). In some embodiments, at least one of the process blocks of fig. 3 may be performed by another device or group of devices separate from computer 102 and server computer 114 or comprising computer 102 and server computer 114.
At 302, the method 300 includes reordering at least one index corresponding to a multi-dimensional tensor associated with a neural network.
At 304, the method 300 includes determining a set of weight coefficients associated with at least one reordered index.
In some embodiments, determining the set of weight coefficients associated with at least one reordered index comprises: quantizing the weight coefficient; and selecting a weight coefficient that minimizes a uniform loss value, wherein the uniform loss value is associated with the weight coefficient.
In some embodiments, the minimized unified loss value is back-propagated, and the neural network is trained in accordance with the back-propagated minimized unified loss value.

In some embodiments, the minimized unified loss value is back-propagated, and at least one of the weight coefficients is fixed according to the back-propagated minimized unified loss value.
In some embodiments, a gradient and a unified mask associated with the set of weight coefficients are determined, and at least one non-fixed weight coefficient of the weight coefficients is updated according to the gradient and the unified mask.
In some embodiments, the set of weight coefficients is compressed by quantizing and entropy encoding the weight coefficients.
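As one hedged illustration of this compression step, a uniform scalar quantizer followed by an entropy-based size estimate could look as follows; the step size, the empirical-entropy estimate standing in for an actual entropy coder, and the use of NumPy are all assumptions for illustration.

```python
import numpy as np
from collections import Counter

def quantize(weights, step=0.02):
    """Uniform scalar quantization of the unified weight coefficients."""
    return np.round(weights / step).astype(np.int32)

def entropy_bits(symbols):
    """Estimate the coded size (in bits) of the quantized symbols from their
    empirical entropy, as a stand-in for an actual entropy coder."""
    counts = Counter(symbols.ravel().tolist())
    total = symbols.size
    probs = np.array([c / total for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)) * total)
```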
In some embodiments, the unified weight coefficient set includes at least one weight coefficient having the same absolute value.
At 306, the method 300 includes compressing a model of the neural network according to the determined set of weight coefficients.
It will be appreciated that fig. 3 provides only an illustration of one embodiment and is not meant to be any limitation on how the different examples may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements.
With the method for compressing the neural network model, the efficiency of compressing the learned weight coefficients is improved, which accelerates computation using the optimized weight coefficients; the neural network model can thus be significantly compressed and its computational efficiency improved.
Fig. 4 is a block diagram 400 of the internal and external components of the computer depicted in fig. 1, in accordance with an exemplary embodiment. It should be understood that fig. 4 provides only a schematic representation of one embodiment and is not meant to be limiting in any way with respect to the environments in which the different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements.
The computer 102 (FIG. 1) and the server computer 114 (FIG. 1) may include respective sets of internal components 800A, 800B and external components 900A, 900B as shown in FIG. 4. Each of the respective sets of internal components 800 includes at least one processor 820, at least one computer-readable RAM 822, and at least one computer-readable ROM 824 connected by at least one bus 826, as well as at least one operating system 828 and at least one computer-readable tangible storage device 830.
Processor 820 is implemented in hardware, firmware, or a combination of hardware and software. Processor 820 is a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), an acceleration processing unit (Accelerated Processing Unit, APU), a microprocessor, a microcontroller, a digital signal processor (Digital Signal Processor, DSP), a Field-programmable gate array (Field-Programmable Gate Array, FPGA), an Application-specific integrated circuit (Application-Specific Integrated Circuit ASIC), or another type of processing component. In some implementations, the processor 820 includes at least one processor that can be programmed to perform functions. Bus 826 includes components that allow communication between internal components 800A, 800B.
At least one operating system 828, software programs 108 (fig. 1), and neural network model compression programs 116 (fig. 1) on the server computer 114 (fig. 1) are stored in at least one corresponding computer-readable tangible storage device 830 for execution by at least one corresponding processor 820 via at least one corresponding RAM822 (which typically includes a cache memory). In the embodiment shown in FIG. 4, each computer readable tangible storage device 830 is a disk storage device of an internal hard drive. Alternatively, each computer-readable tangible storage device 830 is a semiconductor storage device, such as ROM824, EPROM, flash memory, an optical disk, a magneto-optical disk, a solid state disk, a Compact Disk (CD), a digital versatile disk (Digital Versatile Disc, DVD), a floppy disk, a magnetic disk cartridge, a magnetic tape, and/or another type of non-volatile computer-readable tangible storage device that can store a computer program and digital information.
Each set of internal components 800A, 800B also includes an R/W drive or interface 832 to read from or write to at least one portable computer-readable tangible storage device 936 (e.g., a CD-ROM, DVD, memory stick, tape, magnetic disk, optical disk, or semiconductor storage device). Software programs, such as software program 108 (fig. 1) and neural network model compression program 116 (fig. 1), may be stored on at least one respective portable computer-readable tangible storage device 936, read via a respective R/W drive or interface 832, and loaded into a respective hard drive 830.
Each set of internal components 800A, 800B also includes a network adapter or interface 836, such as a TCP/IP adapter card, a wireless Wi-Fi interface card, or a 3G, 4G, or 5G wireless interface card or other wired or wireless communication link. The software program 108 (FIG. 1) and the neural network model compression program 116 (FIG. 1) on the server computer 114 (FIG. 1) may be downloaded to the computer 102 (FIG. 1) from an external computer via a network (e.g., the Internet, a local area network, or other wide area network) and the corresponding network adapter or interface 836. From the network adapter or interface 836, the software program 108 and the neural network model compression program 116 on the server computer 114 are loaded into the corresponding hard disk drive 830. The network may include copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
Each set of external components 900A, 900B may include a computer display monitor 920, a keyboard 930, and a computer mouse 934. The external components 900A, 900B may also include a touch screen, virtual keyboard, touch pad, pointing device, and other human interface devices. Each set of internal components 800A, 800B also includes a device driver 840 to interface with a computer display monitor 920, a keyboard 930, and a computer mouse 934. The device driver 840, R/W driver or interface 832, and network adapter or interface 836 include hardware and software (stored in the storage device 830 and/or ROM 824).
It should be understood in advance that although the present disclosure includes a detailed description of cloud computing, embodiments of the teachings recited herein are not limited to cloud computing environments. Rather, some embodiments can be implemented in connection with any other type of computing environment, now known or later developed.
Cloud computing is a service delivery model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processes, memory, storage, applications, virtual machines, and services) that can be quickly deployed and released with minimal administrative effort or interaction with service providers. The cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
The characteristics are as follows:
on-demand self-service (On-demand self-service): cloud users can unilaterally provide computing functions, such as server time and network storage, automatically as needed without manual interaction with the service provider.
Extensive network access (Broad network access): the functionality may be used over a network and accessed through standard mechanisms that facilitate the use of heterogeneous thin client platforms or thick client platforms (e.g., mobile phones, laptops, and PDAs).
Resource pooling (Resource pooling): the computing resources of the provider may be aggregated by a multi-tenant model to serve multiple consumers and dynamically allocate and reallocate different physical and virtual resources as needed. Typically, the consumer is not able to control or know the exact location of the provided resources, but may be able to specify the location of a higher level of abstraction (e.g., country, state, or data center) and therefore have location independence.
Rapid elasticity: the functionality can be deployed quickly and elastically, in some cases automatically, to scale out rapidly, and released quickly to scale in rapidly. To the user, the functionality available for deployment often appears unlimited and can be purchased in any quantity at any time.
Measured service (Measured service): cloud systems automatically control and optimize resource usage by leveraging metering capabilities at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage may be monitored, controlled, and reported to provide transparency to both the provider and the user of the service being used.
The service model is as follows:
software as a service (Software as a Service, saaS): the functionality provided to the consumer is to use an application that the provider runs on the cloud infrastructure. Applications may be accessed from various client devices through a thin client interface such as a web browser (e.g., web-based email). In addition to potentially limiting user-specific application configuration settings, users do not manage nor control the underlying cloud infrastructure including networks, servers, operating systems, storage, or even individual application functions.
Platform as a service (Platform as a Service, paaS): the functionality provided to the consumer is to deploy consumer-created or acquired applications onto the cloud infrastructure, which are created using programming languages and tools supported by the provider. The user does not manage nor control the underlying cloud infrastructure, including the network, servers, operating systems, or storage, but has control over the deployed applications and possibly also over the application hosting environment configuration.
Infrastructure as a service (Infrastructure as a Service, iaaS): the functionality provided to the consumer is to provide processing, storage, networking, and other underlying computing resources that enable the consumer to deploy and run any software therein, including operating systems and applications. The user does not manage nor control the underlying cloud infrastructure, but has control over the operating system, storage, deployed applications, and possibly limited control over the selection of network components (e.g., host firewalls).
The deployment model is as follows:
private cloud: the cloud infrastructure alone runs for some organization. It may be managed by the organization or a third party and may exist internally (on-premois) or externally (off-premois).
Community cloud: the cloud infrastructure is shared by several organizations and supports specific communities with common points of interest, such as tasks, security requirements, policies, and compliance notes (compliance considerations). It may be managed by the organization or a third party and may exist internally or externally.
Public cloud: the cloud infrastructure is available to the public or large industry groups and owned by the organization selling the cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
A cloud computing environment is service-oriented, with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the core of cloud computing is an infrastructure comprising a network of interconnected nodes.
Referring to fig. 5, an exemplary cloud computing environment 500 is depicted. As shown, cloud computing environment 500 includes at least one cloud computing node 10 with which a local computing device used by a cloud user may communicate, such as, for example, a personal digital assistant (Personal Digital Assistant, PDA) or cellular telephone 54A, desktop computer 54B, laptop computer 54C, and/or automobile computer system 54N. Cloud computing nodes 10 may communicate with each other. They may be physically or virtually grouped (not shown) in at least one network, such as a private cloud, community cloud, public cloud, or hybrid cloud, or a combination thereof, as described above. This allows cloud computing environment 500 to provide infrastructure as a service, platform as a service, and/or software as a service, without requiring cloud consumers to maintain resources on local computing devices. It should be appreciated that the types of computing devices 54A-N shown in fig. 5 are intended to be exemplary only, and that cloud computing node 10 and cloud computing environment 500 may communicate with any type of computerized device over any type of network and/or network-addressable connection (e.g., using a web browser).

Referring to FIG. 6, a set of functional abstraction layers 600 provided by cloud computing environment 500 (FIG. 5) is shown. It should be understood in advance that the components, layers, and functions shown in fig. 6 are merely exemplary, and embodiments are not limited thereto. As shown, the following layers and corresponding functions are provided:
The hardware and software layer 60 includes hardware components and software components. Examples of hardware components include: a host 61; a server 62 based on a reduced instruction set computer (Reduced Instruction Set Computer, RISC) architecture; a server 63; blade server 64; a storage device 65; and a network and networking component 66. In some embodiments, the software components include web application server software 67 and database software 68.
The virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: a virtual server 71; virtual memory 72; a virtual network 73 including a virtual private network; virtual applications and an operating system 74; and a virtual client 75.
In one example, management layer 80 may provide the functionality described below. Resource provisioning 81 provides dynamic acquisition of computing resources and other resources for performing tasks within a cloud computing environment. When resources are utilized in a cloud computing environment, metering and pricing 82 cost tracks the use of resources and provides billing and invoices for the consumption of these resources. In one example, the resources may include application software licenses. Security provides authentication for cloud users and tasks, as well as protection of data and other resources. User portal 83 provides users and system administrators with access to the cloud computing environment. Service level management 84 provides cloud computing resource allocation and management to meet the required service level. Service level agreement (Service Level Agreement, SLA) planning and implementation 85 provides for the pre-arrangement and procurement of cloud computing resources that anticipate future demands according to the SLA.
Workload layer 90 provides an example of functionality that may utilize a cloud computing environment. Examples of workloads and functions that may be provided from this layer include: mapping (mapping) and navigation 91; software development and lifecycle management 92; virtual classroom teaching delivery 93; a data analysis process 94; transaction processing 95; neural network model compression 96. Neural network model compression 96 may compress the neural network model.
Some embodiments may relate to systems, methods, and/or computer-readable media that integrate at any possible level of technical detail. The computer readable medium may include a computer readable non-volatile storage medium having computer readable program instructions for causing a processor to perform operations.
The computer readable storage medium may be a tangible device that can store and forward instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: portable computer floppy disk, hard disk, random access Memory (Random Access Memory, RAM), read-Only Memory (ROM), erasable programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM or flash Memory), static random access Memory (Static Random Access Memory, SRAM), portable compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM), digital versatile disk (Digital Versatile Disk, DVD), memory stick, floppy disk, mechanical coding means such as punch cards or embossed structures in grooves having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, a computer-readable storage medium is not to be construed as a transitory signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., an optical pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a corresponding computing/processing device or to an external computer or external storage device via a network (e.g., the internet, a local area network, a wide area network, and/or a wireless network). The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
The computer readable program code/instructions for performing an operation may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for an integrated circuit, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and a procedural programming language such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (Local Area Network, LAN) or a wide area network (Wide Area Network, WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can perform aspects or operations of the present disclosure by utilizing state information of the computer readable program instructions to execute the computer readable program instructions to personalize the electronic circuitry.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer readable media according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises at least one executable instruction for implementing the specified logical function(s). Methods, computer systems, and computer readable media may include additional blocks, fewer blocks, different blocks, or blocks arranged differently than those depicted in the figures. In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or operations, or combinations of special purpose hardware and computer instructions.
It is to be understood that the systems and/or methods described herein may be implemented in various forms of hardware, firmware, or a combination thereof. The actual specialized control hardware or software code implementing the systems and/or methods is not limiting of the embodiments. Thus, the operation and behavior of the systems and/or methods are described herein without reference to specific software code; it is understood that software and hardware can be designed to implement the systems and/or methods based on the description herein.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Furthermore, as used herein, the articles "a" and "an" are intended to include at least one item, and are used interchangeably with "at least one". Furthermore, as used herein, the term "set" is intended to include at least one item (e.g., related items, unrelated items, a combination of related and unrelated items, etc.), and is used interchangeably with "at least one". When only one item is intended, the term "one" or similar language is used. Further, as used herein, the terms "has", "have", "having", and the like are intended to be open-ended terms. Further, the phrase "in accordance with" is intended to mean "in accordance with, at least in part," unless explicitly stated otherwise.
The description of the various aspects and embodiments has been presented for purposes of illustration, but is not intended to be exhaustive or limited to the disclosed embodiments. Although combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible embodiments. Indeed, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each of the dependent claims listed below may be directly subordinate to only one claim, the disclosure of possible embodiments includes each dependent claim in combination with every other claim of the claim set. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (9)
1. A method for compressing a neural network model for video processing, wherein an input and an output of the neural network model are a video or an image, an original pre-trained neural network model is compressed through weight unification regularization, and feature descriptors are extracted using the compressed neural network model, the method comprising:
reordering at least one index corresponding to a multi-dimensional tensor, the multi-dimensional tensor corresponding to weight coefficients of each network layer of the neural network, wherein the at least one index is determined based on a weight unification loss of the weight coefficients;
quantizing the weight coefficients and selecting the weight coefficients that minimize a unified loss value associated with the weight coefficients, to determine a set of weight coefficients associated with the at least one reordered index; and
compressing a model of the neural network according to the determined weight coefficient set;
wherein the method further comprises:
back-propagating the minimized unified loss value, and fixing at least one of the weight coefficients according to the back-propagated minimized unified loss value, wherein the unified loss value comprises a compression rate loss, a unified distortion loss and a computation speed loss;
determining a gradient and a unified mask associated with the set of weight coefficients, and updating at least one non-fixed weight coefficient of the weight coefficients according to the gradient and the unified mask.
2. The method of claim 1, further comprising back-propagating the minimized unified loss value, wherein the neural network is trained in accordance with the back-propagated minimized unified loss value.
3. The method of any of claims 1-2, further comprising compressing the set of weight coefficients by quantizing and entropy encoding the weight coefficients.
4. The method according to any of claims 1-2, wherein the unified set of weight coefficients comprises a plurality of weight coefficients having the same absolute value.
5. A computer system for compressing a neural network model for video processing, wherein an input and an output of the neural network model are a video or an image, an original pre-trained neural network model is compressed through weight unification regularization, and feature descriptors are extracted using the compressed neural network model, the computer system comprising:
a reordering module for reordering at least one index corresponding to a multi-dimensional tensor, the multi-dimensional tensor corresponding to weight coefficients of each network layer of the neural network, wherein the at least one index is determined based on a weight unification loss of the weight coefficients;
a unifying module for determining a set of weight coefficients associated with the at least one reordered index, the unifying module comprising a quantizing module for quantizing the weight coefficients and a selection module for selecting the weight coefficients that minimize a unified loss value associated with the weight coefficients; and
a compression module for compressing a model of the neural network according to the determined set of weight coefficients;
wherein the computer system further comprises an update module for back-propagating the minimized unified loss value and fixing at least one of the weight coefficients according to the back-propagated minimized unified loss value, the unified loss value including a compression rate loss, a unified distortion loss, and a computation speed loss;
the update module is further configured to determine a gradient and a unified mask associated with the set of weight coefficients, and to update at least one non-fixed weight coefficient of the weight coefficients according to the gradient and the unified mask.
6. The computer system of claim 5, further comprising a training module for back-propagating the minimized unified loss value, wherein the neural network is trained in accordance with the back-propagated minimized unified loss value.
7. The computer system of claim 5, further comprising a compression module for compressing the set of weight coefficients by quantizing and entropy encoding the weight coefficients.
8. A non-transitory computer readable medium having stored thereon a computer program for compressing a neural network model, the computer program for causing at least one computer processor to perform the method of any one of claims 1-4.
9. A computing device comprising a processor and a memory; the memory stores a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 4.
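For readers who want a concrete picture of the claimed steps, the following is a minimal, non-normative Python sketch of a weight unification update and of the subsequent quantization and entropy coding. It assumes PyTorch; the helper names (unified_loss, unification_step, compress), the choice of a per-row structure, the stand-in loss terms, and the zlib entropy coder are illustrative assumptions and are not taken from the patent itself.

```python
# Illustrative sketch only (assumed PyTorch; helper names are hypothetical).
import zlib
import torch

def unified_loss(weights, rate_w=1.0, distortion_w=1.0, speed_w=0.1):
    # One possible combined loss with a compression-rate term, a unified
    # distortion term (distance of each coefficient's magnitude from the mean
    # magnitude of its row, taken here as the "structure"), and a computation
    # speed term. The patent's exact formulation is not reproduced here.
    target = weights.abs().mean(dim=-1, keepdim=True)
    distortion = ((weights.abs() - target) ** 2).mean()
    rate = weights.abs().mean()              # stand-in for compression-rate loss
    speed = (weights != 0).float().mean()    # stand-in for computation-speed loss
    return rate_w * rate + distortion_w * distortion + speed_w * speed

def unification_step(weight, grad, fix_ratio=0.5, lr=1e-3):
    # Fix the coefficients whose per-coefficient unification loss is lowest
    # (mask = 0) and update only the non-fixed coefficients (mask = 1) with the
    # back-propagated gradient, as in the final step of claim 1.
    with torch.no_grad():
        target = weight.abs().mean(dim=-1, keepdim=True)
        per_coeff = (weight.abs() - target) ** 2
        k = max(1, int(fix_ratio * per_coeff.numel()))
        threshold = per_coeff.flatten().kthvalue(k).values
        mask = (per_coeff > threshold).float()   # unified mask: 1 = trainable, 0 = fixed
        weight -= lr * grad * mask               # masked gradient update
    return weight, mask

def compress(weights, step=0.02):
    # Claim 3: quantize the (unified) coefficients and entropy-encode them.
    # Uniform quantization followed by zlib as a stand-in entropy coder.
    q = torch.round(weights / step).to(torch.int16)
    return zlib.compress(q.numpy().tobytes()), q.shape, step

# Example usage on a single layer's weight tensor:
w = torch.randn(64, 128)
g = torch.randn_like(w)                          # gradient of the unified loss w.r.t. w
print(float(unified_loss(w)))
w, mask = unification_step(w, g)
blob, shape, step = compress(w)
print(len(blob), "bytes for", w.numel(), "weights")
```

The index reordering recited in claims 1 and 5 is omitted from this sketch; in practice it would permute the tensor dimensions before the per-structure statistics above are computed.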
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202062964996P | 2020-01-23 | 2020-01-23 | |
US62/964,996 | 2020-01-23 | ||
US17/088,061 | 2020-11-03 | ||
US17/088,061 US20210232891A1 (en) | 2020-01-23 | 2020-11-03 | Neural network model compression with structured weight unification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113159312A CN113159312A (en) | 2021-07-23 |
CN113159312B true CN113159312B (en) | 2023-08-18 |
Family
ID=76878521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110066485.4A Active CN113159312B (en) | 2020-01-23 | 2021-01-19 | Method for compressing neural network model, computer system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113159312B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018013887A (en) * | 2016-07-19 | 2018-01-25 | 日本電信電話株式会社 | Feature selection device, tag relevant area extraction device, method, and program |
CN109948794A (en) * | 2019-02-28 | 2019-06-28 | 清华大学 | Neural network structure pruning method, pruning device and electronic equipment |
CN110516803A (en) * | 2018-05-21 | 2019-11-29 | 畅想科技有限公司 | Traditional computer vision algorithm is embodied as neural network |
WO2020014590A1 (en) * | 2018-07-12 | 2020-01-16 | Futurewei Technologies, Inc. | Generating a compressed representation of a neural network with proficient inference speed and power consumption |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11651223B2 (en) * | 2017-10-27 | 2023-05-16 | Baidu Usa Llc | Systems and methods for block-sparse recurrent neural networks |
- 2021-01-19 CN CN202110066485.4A patent/CN113159312B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018013887A (en) * | 2016-07-19 | 2018-01-25 | 日本電信電話株式会社 | Feature selection device, tag relevant area extraction device, method, and program |
CN110516803A (en) * | 2018-05-21 | 2019-11-29 | 畅想科技有限公司 | Traditional computer vision algorithm is embodied as neural network |
WO2020014590A1 (en) * | 2018-07-12 | 2020-01-16 | Futurewei Technologies, Inc. | Generating a compressed representation of a neural network with proficient inference speed and power consumption |
CN109948794A (en) * | 2019-02-28 | 2019-06-28 | 清华大学 | Neural network structure pruning method, pruning device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113159312A (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11836576B2 (en) | Distributed machine learning at edge nodes | |
US11100399B2 (en) | Feature extraction using multi-task learning | |
US11580671B2 (en) | Hash-based attribute prediction for point cloud coding | |
CN112668690B (en) | Method and computer system for compressing neural network model | |
US11443228B2 (en) | Job merging for machine and deep learning hyperparameter tuning | |
US9152921B2 (en) | Computing regression models | |
US11861469B2 (en) | Code generation for Auto-AI | |
US20200125926A1 (en) | Dynamic Batch Sizing for Inferencing of Deep Neural Networks in Resource-Constrained Environments | |
JP7368623B2 (en) | Point cloud processing method, computer system, program and computer readable storage medium | |
CN113557534A (en) | Deep forest model development and training | |
US20200082026A1 (en) | Graph data processing | |
US20180060737A1 (en) | Adaptive analytical modeling tool | |
US20170193391A1 (en) | Iterative interpolation of maximum entropy models | |
US11935271B2 (en) | Neural network model compression with selective structured weight unification | |
CN114616825B (en) | Video data decoding method, computer system and storage medium | |
US11496775B2 (en) | Neural network model compression with selective structured weight unification | |
US11526791B2 (en) | Methods and systems for diverse instance generation in artificial intelligence planning | |
CN113159312B (en) | Method for compressing neural network model, computer system and storage medium | |
US20210232891A1 (en) | Neural network model compression with structured weight unification | |
TWI844931B (en) | Boosting classification and regression tree performance with dimension reduction | |
CN113052309A (en) | Method, computer system and storage medium for compressing neural network model | |
US20210201157A1 (en) | Neural network model compression with quantizability regularization | |
CN113112012B (en) | Method, apparatus and computer device for video image processing | |
CN113286143A (en) | Method, computer system and storage medium for compressing neural network model | |
US12112249B2 (en) | Multi-objective automated machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40048702; Country of ref document: HK |
GR01 | Patent grant | ||