CN113365072A - Feature map compression method, feature map compression device and storage medium


Info

Publication number
CN113365072A
Authority
CN
China
Prior art keywords
dnn
loss
feature map
unified
optimal
Prior art date
Legal status
Granted
Application number
CN202110230937.8A
Other languages
Chinese (zh)
Other versions
CN113365072B (en)
Inventor
蒋薇
王炜
刘杉
Current Assignee
Tencent America LLC
Original Assignee
Tencent America LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/096,126 (US11948090B2)
Application filed by Tencent America LLC
Publication of CN113365072A
Application granted
Publication of CN113365072B
Legal status: Active (current)
Anticipated expiration


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/124: Quantisation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/90: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N 19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a feature map compression method, apparatus, and storage medium, where the feature map is generated by passing a first input to and through a deep neural network (DNN). The feature map compression method includes the following steps: determining the respective optimal index order and optimal unification method of each superblock, each superblock being obtained by partitioning the feature map; determining a selective structured unification (SSU) layer according to the respective optimal index order and optimal unification method of each superblock, the SSU layer being added to the DNN to form an updated DNN and being used to perform unification operations on the feature map; and determining a first estimated output and providing it as the compressed feature map, the first estimated output being generated by passing the first input to and through the updated DNN.

Description

Feature map compression method, feature map compression device and storage medium
Incorporation by reference
The present disclosure claims priority to U.S. Provisional Application No. 62/986,330, "Feature Map Compression with Structured Unification," filed on March 6, 2020, and U.S. Application No. 17/096,126, filed on November 12, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure describes embodiments that generally relate to neural network model compression.
Background
The background description provided herein is intended to generally present the context of the present application. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present application.
The International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) (JTC 1/SC 29/WG 11) has been actively searching for potential needs for the standardization of future video codec technologies for visual analysis and understanding. ISO adopted the Compact Descriptors for Visual Search (CDVS) standard as a still-image standard in 2015, which extracts feature representations for image similarity matching. The Compact Descriptors for Video Analysis (CDVA) standard is listed as Part 15 of MPEG 7 and ISO/IEC 15938-15 and was finalized in 2018; it extracts global and local, hand-designed and deep neural network (DNN)-based feature descriptors of video segments. The success of DNNs in a wide range of video applications, such as semantic classification, target detection/recognition, target tracking, and video quality enhancement, creates a strong need for compressing DNN models. Accordingly, MPEG is actively working on the Coded Representation of Neural Networks standard (NNR), which encodes DNN models to save both storage and computation.
In July 2019, a group was formed for the Video Coding for Machines (VCM) standard to explore the topic of "compression coding for machine vision and compression in human-machine hybrid systems," with the goal of developing a standard that can be implemented in a chip for widespread use with any video-related Internet of Things (IoT) device. In contrast to the earlier CDVA and CDVS, VCM is an emerging video-for-machines standard, which can be viewed as a superset of CDVA. By combining multiple feature maps of a neural network backbone, VCM can handle more advanced visual analysis tasks such as semantic segmentation and video restoration.
In the present disclosure, an iterative network retraining/fine-tuning framework is used to jointly optimize the original training target and a feature unification loss comprising a compression rate loss, a unification distortion loss, and a computation speed loss, such that the learned DNN maintains the original target performance while generating feature maps that are both compression-friendly (they can be further compressed efficiently for storage and transmission) and computation-friendly (they can speed up the computation that uses the generated feature maps).
Disclosure of Invention
To address the technical problem of how to provide a feature map compression method that can generate compression-friendly and computation-friendly feature maps, the present disclosure includes a method of compressing feature maps generated by DNNs, for example by using a unification regularization method in an iterative network retraining/fine-tuning framework. The feature unification loss may include a compression rate loss, a unification distortion loss, and a computation speed loss. Through the iterative retraining/fine-tuning process, the feature unification loss can be optimized jointly with the original network training objectives.
According to an aspect of the present disclosure, a feature map compression method and apparatus are provided, in which the feature map is generated by passing a first input to and through a DNN. The respective optimal index order and optimal unification method of each superblock can be determined, each superblock being obtained by partitioning the feature map. A selective structured unification (SSU) layer is then determined according to the respective optimal index order and optimal unification method of each superblock; the SSU layer is added to the DNN to form an updated DNN and is used to perform unification operations on the feature map. Further, a first estimated output may be determined and provided as the compressed feature map, the first estimated output being generated by passing the first input to and through the updated DNN.
In some embodiments, the network coefficients of the DNN may be updated based on the first desired output and the first estimated output.
To determine the respective optimal index order and optimal unification method of each superblock, at least one unification method may be defined for each superblock, and at least one feature unification loss may be obtained according to the at least one unification method for each superblock. The respective optimal index order and optimal unification method are then determined for each superblock according to the at least one feature unification loss, the optimal index order and optimal unification method of each superblock corresponding to the minimum feature unification loss among the at least one feature unification loss.
In some embodiments, a first feature unification loss of the feature unification losses is obtained for a first superblock, the first feature unification loss being equal to the sum of a compression rate loss, a unification distortion loss, and a computation speed loss.
To obtain the first feature unification loss, the first superblock is divided into a plurality of blocks, each of the plurality of blocks having a respective sub-compression rate loss, sub-unification distortion loss, and sub-computation speed loss; the compression rate loss of the first superblock is obtained as the sum of the sub-compression rate losses of the plurality of blocks; the unification distortion loss of the first superblock is obtained as the sum of the sub-unification distortion losses of the plurality of blocks; and the computation speed loss of the first superblock is obtained as the sum of the sub-computation speed losses of the plurality of blocks.
In some embodiments, each sub-unification distortion loss is equal to the standard deviation of the absolute values of the features in the corresponding block; each sub-compression rate loss is equal to the inverse of the product of the height, width, and depth of the corresponding block; and each sub-computation speed loss is a function of the number of multiplication operations in the computation using the unification method for the corresponding block.
In some embodiments, to update the network coefficients of the DNN, the feature map is generated by passing the first input to and through an extraction layer of the updated DNN; a unified feature map is generated by passing the feature map to and through the SSU layer located after the extraction layer; and the first estimated output is generated by passing the unified feature map to and through the remaining layers of the updated DNN.
To generate the unified feature map, the superblocks of the feature map are sorted in ascending order according to their feature unification losses, each superblock being reordered along the depth of the feature map according to its corresponding index order; a unification ratio q is defined; and the first q% of the superblocks in the ascending order are unified using the index order and the unification method that yields the minimum feature unification loss among the feature unification losses of the superblock.
To update the network coefficients of the DNN according to the first desired output and the first estimated output, a target loss may be defined according to the first estimated output and the first desired output; the gradient of the target loss is obtained; and the network coefficients of the DNN are updated according to the gradient of the target loss through a back-propagation and weight update process.
In some embodiments, a second estimated output is determined, generated by passing a second input to and through the updated DNN, and the network coefficients of the DNN are updated according to a second desired output and the second estimated output.
In some examples, the feature map compression apparatus includes: a first determining module, configured to determine the respective optimal index order and optimal unification method of each superblock, each superblock being obtained by partitioning a feature map, the feature map being generated by passing a first input to and through a deep neural network (DNN); a second determining module, configured to determine a selective structured unification (SSU) layer according to the respective optimal index order and optimal unification method of each superblock, the SSU layer being added to the DNN to form an updated DNN and being configured to perform unification operations on the feature map; and a third determining module, configured to determine a first estimated output, generated by passing the first input to and through the updated DNN, and to provide the first estimated output as the compressed feature map.
Aspects of the disclosure also provide a non-transitory computer-readable medium storing instructions that, when executed by a computer for parallel processing of data streams, cause the computer to perform at least one of the methods described above.
The feature map compression method and apparatus provided by the embodiments of the present disclosure add a structured feature unification layer to the deep neural network within an iterative network retraining/fine-tuning framework, so that the learned DNN maintains the original target performance and can generate feature maps that are both compression-friendly (they can be further compressed efficiently for storage and transmission) and computation-friendly (they can speed up the computation that uses the generated feature maps).
Drawings
Other features, properties, and various advantages of the disclosed subject matter will be further apparent from the following detailed description and the accompanying drawings, in which:
Fig. 1 is a schematic diagram of a deep learning model, according to some embodiments of the present disclosure;
Fig. 2 is a schematic diagram of a DNN model, according to some embodiments of the present disclosure;
Figs. 3A-3O show example unification methods for unifying weights in 4 x 4 blocks, according to some embodiments of the present disclosure;
Fig. 4 is an exemplary overall workflow of compressing a feature map generated by a DNN, according to some embodiments of the present disclosure;
Fig. 5 shows a flowchart outlining an example of a process, according to some embodiments of the present disclosure;
Fig. 6 is a schematic illustration of a computer system, according to an embodiment.
Detailed Description
Aspects of the present disclosure include various techniques for compressing neural network models. For example, the feature maps generated by DNN may be compressed by using unified regularization in an iterative network retraining/tuning framework.
Artificial neural networks can be used for a wide range of tasks in multimedia analysis and processing, media codec, data analysis, and many other fields. The success of using artificial neural networks is based on the feasibility of processing much larger and more complex neural networks than in the past (deep neural networks, DNNs), and the availability of large-scale training data sets. Thus, a trained neural network may contain a large number of parameters and weights, resulting in a significant data size (e.g., hundreds of MB). Many applications require deployment of a particular trained network instance, potentially to more devices, which may have limitations in processing power and memory (e.g., mobile devices or smart cameras) and in communication bandwidth.
DNN may combine multiple non-linear processing layers, using simple elements operating in parallel, inspired by the biological nervous system. The DNN may consist of one input layer, several hidden layers and one output layer. The layers are interconnected by nodes or neurons, each hidden layer using the output of the previous layer as its input. Fig. 1 is a schematic diagram of a DNN (100), which may include at least one input (102), an input layer (106), a plurality of hidden layers (108), an output layer (110), and at least one output (104). As shown in fig. 1, at least one input (102) may pass through an input layer (106), a hidden layer (108), an output layer (110), where each of these layers (106, 108, 110) is a mathematical operation, and the DNN (100) may find the correct mathematical operation to convert the at least one input (102) into at least one output (104). The mathematical operation may include a linear relationship or a non-linear relationship. The DNN (100) moves through the layers (106, 108, 110) by calculating a probability of output for each of the layers (106, 108, 110).
Fig. 2 is a schematic diagram of a DNN model (200) that applies convolution operations to convert an input into an output. The DNN may be composed of an input layer, an output layer, and a number of hidden layers in between. Viewed another way, the DNN may include a feature detection layer and a classification layer. As shown in Fig. 2, the DNN (200) may include a feature detection layer (202) and a classification layer (204). The feature detection layer (202) may perform operations on the data such as convolution, pooling, or rectified linear unit (ReLU) operations. The convolution operation (or convolutional layer) passes the input image (or input) (201) through a set of convolution filters, each of which may activate certain features in the input image (201). Convolutional layers are the main building blocks used in DNNs and are mainly used for convolution operations. To perform a convolution operation, a filter (or kernel) needs to be defined; the filter is typically a matrix. The convolution operation multiplies each pixel of the local image region covered by the filter by the corresponding filter entry and then accumulates the products. Repeatedly applying the same filter across the input produces an activation map, referred to as a feature map, which indicates the locations and strengths of the features detected in the input, e.g., an image.
The pooling operation (or pooling layer) simplifies the output by performing non-linear down-sampling, reducing the number of parameters that the DNN (200) needs to learn. The rectified linear unit (ReLU) operation (or ReLU layer) enables faster and more effective training by mapping negative values to zero and keeping positive values. These three operations are repeated in the feature detection layer (202), which may include tens or hundreds of layers, each learning to detect different features. After feature detection (or the feature detection layer (202)), the architecture of the DNN (200) moves to classification (or the classification layer (204)). The classification layer (204) may perform operations on the data such as flattening, fully connected operations, and the softmax function. As shown in Fig. 2, the flattening operation (or flattening layer) changes the shape of the data from a two-dimensional (or three-dimensional) matrix into the correct format for interpretation by the fully connected layer. The fully connected (FC) layer (or FC operation) outputs a vector of K dimensions, where K is the number of classes that the DNN (200) can predict. In some embodiments, a fully connected layer is used as the output layer of the DNN (200). The K-dimensional vector may contain the probability of each class for the image being classified. The softmax layer (or softmax operation) of the DNN (200) applies a softmax function to provide the classification output.
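As a concrete illustration of the feature detection and classification stages just described, consider the following minimal PyTorch sketch (all layer sizes and the class count are hypothetical choices for illustration, not values from the disclosure; the feature_detection part plays the role of the backbone network discussed below):

    import torch
    import torch.nn as nn

    # Minimal sketch of the DNN of Fig. 2: a feature detection stage
    # (convolution + ReLU + pooling) followed by a classification stage
    # (flatten + fully connected + softmax). All sizes are hypothetical.
    class SimpleDNN(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.feature_detection = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution
                nn.ReLU(),                                   # rectified linear unit
                nn.MaxPool2d(2),                             # pooling (down-sampling)
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classification = nn.Sequential(
                nn.Flatten(),                        # reshape for the FC layer
                nn.Linear(32 * 8 * 8, num_classes),  # K-dimensional output
                nn.Softmax(dim=1),                   # class probabilities
            )

        def forward(self, x):
            feature_map = self.feature_detection(x)  # the feature map F
            return self.classification(feature_map)

    probs = SimpleDNN()(torch.randn(1, 3, 32, 32))   # one 32x32 RGB input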
In a DNN (e.g., DNN 200), the data set may be represented as D = {(x, y)}, where an input x is passed through the DNN (represented by its weight coefficients Θ) to generate the feature map F. The DNN Θ may be part of a larger DNN with weight coefficients Θ_O that is trained for a certain task using a data set D_O = {(x_O, y_O)}, with each input x_O associated with an annotation y_O. For example, for a semantic segmentation task, each input x_O may be a color image, and y_O may be a segmentation map with the same resolution as x_O, where each entry of the segmentation map is the index of the semantic category to which the corresponding pixel of x_O is assigned. For a super-resolution task, x_O may be a low-resolution image generated from a true high-resolution image y_O. The DNN Θ can be the first few layers of the larger DNN Θ_O and is commonly referred to as the backbone (or feature extraction) network of Θ_O. The data set D may be the same as D_O, or it may be a different data set with a data distribution similar to that of D_O. For example, x and x_O may have the same dimensions and the same probability distribution, p(x, y) = p(x_O, y_O), where y is the ground-truth annotation associated with x.
The feature map F may be a general 4D tensor of size (h, w, c, t). Here, h and w may be the height and width of the feature map F; for example, for semantic segmentation or super-resolution tasks, h and w may be the same as the original height and width of the input image (or input) x, which determines the resolution of the feature map F. Further, t may be the length of the feature map; for example, for video classification, t may be the length of a video segment. When t = 1, the feature map reduces to a 3D tensor. Further, c may be the depth of the feature map F, typically corresponding to c channels. Due to the nature of the DNN computation, the feature map F may have a smoothness characteristic, since neighboring features in the feature map F tend to have similar values in the (h, w) plane. This follows from the smoothness of the original input image x and from the smoothness-preserving convolution operations of the DNN computation. From a feature extraction perspective, features along different channels extract different aspects of information to represent the input x, and the feature map F may have low responses (small feature values) in unimportant regions. For example, for a semantic segmentation task, the feature map of one channel may have a high response to objects of one semantic class (e.g., cars) and a low response to all other regions, while the feature map of another channel may have a high response to objects of another semantic class (e.g., buildings) and a low response to all other regions. Thus, the feature map may be fairly sparse, and for most of the sparse feature map it is reasonable to pursue local smoothness along the channel dimension, since insignificant small responses may be set to the same value without significantly affecting the overall prediction.
In the present disclosure, a selective structured unification (SSU) layer is provided. In some embodiments, the SSU layer may be placed in the feature detection layer (202) illustrated in Fig. 2. The SSU layer exploits the smoothness characteristic described above: it selectively unifies some features in the feature map F according to a desired unification structure, so that encoding (i.e., quantization and entropy coding) of the feature map F requires less storage, and inference is accelerated with fewer computations, while the original prediction performance of the DNN is maintained. For the t dimension, whether a smoothness property should be pursued depends on the DNN itself. For example, if t corresponds to t consecutive frames of a video segment, it is reasonable to also pursue smoothness along the t axis.
In the DNN, Θ = {W} may represent the set of weight coefficients of the DNN used to generate the feature map F for a given input x. The DNN may be part of a larger DNN with weight coefficients Θ_O = {W_O}, whose training goal is to learn an optimal set of weight coefficients Θ*_O = {W*_O} that minimizes a target loss £(D_O|Θ_O). Usually, the target loss £(D_O|Θ_O) has two parts: an empirical data loss £_D(D_O|Θ_O), e.g., the cross-entropy loss for classification tasks, and a regularization loss £_R(Θ_O), e.g., a sparsity-promoting regularization. The target loss £(D_O|Θ_O) can be described by equation (1):

£(D_O|Θ_O) = £_D(D_O|Θ_O) + λ_R · £_R(Θ_O)    (1)

where λ_R ≥ 0 is a hyperparameter that balances the contributions of the empirical data loss and the regularization loss.
In the present disclosure, a feature unification loss £_U(D|Θ) may further be provided, which can be optimized jointly with the original target loss £(D_O|Θ_O). The joint loss £_joint can therefore be described by equation (2):

£_joint(D_O|Θ_O) = £(D_O|Θ_O) + λ_U · £_U(D|Θ)    (2)

where λ_U > 0 balances the contributions of the original prediction target £(D_O|Θ_O) and the feature unification loss £_U(D|Θ). By optimizing the joint loss of equation (2), an optimal DNN Θ_O (and hence Θ) can be obtained that preserves the original prediction performance while generating, for each input x, a feature map F(x, Θ) that can be effectively compressed by further encoding. Furthermore, the disclosed feature unification loss £_U takes into account how the feature map F(x, Θ) is used in subsequent convolution operations, which may be performed as a GEMM (general matrix multiplication) process. Therefore, the computation using the feature maps F(x, Θ) generated by the optimal DNN Θ can be greatly accelerated. Notably, the feature unification loss £_U can be regarded as an additional regularization on top of the general target loss, which has (when λ_R > 0) or does not have (when λ_R = 0) a general regularization term. Moreover, the method of the present disclosure can be flexibly applied to any regularization loss £_R.
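For illustration, the assembly of the joint loss of equation (2) in a training step can be sketched in Python as follows (the hyperparameter values and the sparsity regularizer are illustrative assumptions; the unification loss term is a placeholder supplied by the selection process described later):

    import torch.nn.functional as F

    # Sketch of equations (1) and (2): the target loss is the empirical
    # data loss plus a weighted regularization loss, and the joint loss
    # adds the weighted feature unification loss on top.
    def joint_loss(logits, labels, weights, unification_loss,
                   lambda_r=1e-4, lambda_u=0.1):
        data_loss = F.cross_entropy(logits, labels)        # empirical loss
        reg_loss = sum(w.abs().sum() for w in weights)     # sparsity-promoting
        target_loss = data_loss + lambda_r * reg_loss      # equation (1)
        return target_loss + lambda_u * unification_loss   # equation (2)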
In an exemplary embodiment, the feature unification loss £_U may further include a compression rate loss £_C, a unification distortion loss £_I, and a computation speed loss £_S, as described in equation (3):

£_U(D|Θ) = £_C(D|Θ) + £_I(D|Θ) + £_S(D|Θ)    (3)

These loss terms are described in detail in later sections. An iterative optimization process may further be used, for both learning effectiveness and learning efficiency. In the first step of the iterative optimization process, the unification of the feature map in the SSU layer can be determined based on the feature unification loss £_U. In the second step, the SSU layer with the determined feature map unification method may be fixed, and the DNN weight coefficients may be updated by back-propagating the training loss (e.g., the joint loss of equation (2)). By iteratively performing these two steps, the joint loss can be efficiently optimized step by step, and the learned DNN can generate feature maps with the desired characteristics.
Given a general 4D feature map tensor of size (h, w, c, t), the feature map can generally be used as the input of another set of network layers, such as some fully connected layers, or some convolutional layers followed by fully connected layers, etc. In the DNN, W may represent the weight coefficients of a layer that uses the feature map as its input; the output O of that layer may be computed by a convolution operation, and the convolution operation may be implemented as a GEMM matrix multiplication process. Corresponding to the general 4D feature map tensor F, the weight coefficient W may be a general 5D tensor of size (c, k1, k2, k3, c_o), and the output O may be a general 4D tensor of size (h_o, w_o, c_o, t_o). When any of c, k1, k2, k3, c_o, h, w, t, h_o, w_o, or t_o is 1, the corresponding tensor reduces to a lower dimension. Each entry of each tensor is a floating-point number. The output O can be computed from the convolution of F and W as in equation (4):

O(l′, m′, n′, c_o) = Σ_c Σ_{i=1..k1} Σ_{j=1..k2} Σ_{k=1..k3} F(l_i, m_j, n_k, c) · W(c, i, j, k, c_o)    (4)

where the parameters k1, k2, and k3 correspond to the sizes of the convolution kernel along the height, width, and depth axes, respectively, and the indices l_i, m_j, n_k range over the kernel support located according to the output position (l′, m′, n′). The relationship between l and l′, m and m′, and n and n′ is determined by the stride along each axis. For example, when the stride along the height axis is 1, l = l′, and when the stride along the width axis is 2, m = 2m′. The order of the above summation operations may be changed, which corresponds to a matrix multiplication of a reshaped F and a reshaped W.
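The equivalence between the convolution of equation (4) and a GEMM over reshaped tensors can be illustrated with a small 2D example (a sketch under simplifying assumptions: t = 1, k3 = 1, stride 1; PyTorch's unfold performs the reshaping, often called im2col):

    import torch
    import torch.nn.functional as tf  # aliased to avoid clashing with F

    # The convolution of equation (4) as a GEMM: reshape (im2col) the
    # feature map, multiply by the reshaped weights, reshape the result.
    x = torch.randn(1, 8, 16, 16)     # feature map: (batch, c, h, w)
    w = torch.randn(4, 8, 3, 3)       # weights: (c_o, c, k1, k2)

    cols = tf.unfold(x, kernel_size=3, padding=1)   # (1, c*k1*k2, h*w)
    gemm = w.reshape(4, -1) @ cols                  # GEMM: (1, c_o, h*w)
    out = gemm.reshape(1, 4, 16, 16)                # output tensor O

    assert torch.allclose(out, tf.conv2d(x, w, padding=1), atol=1e-4)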
The feature map F may be further partitioned into 4D superblocks of size (h_s, w_s, c, t), where 1 ≤ h_s ≤ h and 1 ≤ w_s ≤ w. In an exemplary embodiment, a 3-Dimensional Coding Tree Unit (CTU3D) structure may be employed as the superblock. In that case, the feature map F can be regarded as t 3D tensors of size (h, w, c), and each 3D tensor can be further divided into 3D units of size (h_s, w_s, c). Let S denote a superblock; an index order I(S) may be determined for S to reorder its c channels along the depth axis, generating a reordered superblock. The reordered superblock may then be further partitioned into blocks of size (h_b, w_b, c_b), where 1 ≤ h_b ≤ h_s, 1 ≤ w_b ≤ w_s, and 1 ≤ c_b ≤ c. Let B denote a block; a unification method can be used to unify the features within block B. The features may be unified in a number of ways. For example, in one embodiment, the features of block B may all be set to the same value, so that the entire block B can be efficiently represented in a subsequent encoding process. In another embodiment, the features of block B may be set to share the same absolute value while maintaining their original signs.
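The following Python sketch illustrates this partitioning for one 3D tensor of the feature map (the superblock and block sizes and the channel ordering are illustrative assumptions, not values mandated by the disclosure):

    import numpy as np

    # Partition a 3D feature map tensor (h, w, c) into superblocks of
    # size (h_s, w_s, c); reorder each superblock's channels by an index
    # order I(S); split the result into blocks of size (h_b, w_b, c_b).
    def partition_superblocks(fmap, h_s=4, w_s=4):
        h, w, _ = fmap.shape
        return [fmap[i:i + h_s, j:j + w_s, :]
                for i in range(0, h, h_s) for j in range(0, w, w_s)]

    def reorder_and_blockify(superblock, index_order, h_b=2, w_b=2, c_b=2):
        s = superblock[:, :, list(index_order)]    # reorder channels by I(S)
        h_s, w_s, c = s.shape
        return [s[i:i + h_b, j:j + w_b, k:k + c_b]
                for i in range(0, h_s, h_b)
                for j in range(0, w_s, w_b)
                for k in range(0, c, c_b)]

    fmap = np.random.randn(8, 8, 4)
    order = np.argsort(fmap.mean(axis=(0, 1)))     # a hypothetical I(S)
    blocks = reorder_and_blockify(partition_superblocks(fmap)[0], order)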
Figs. 3A-3O show exemplary unification methods U(S) for unifying the weights (or features) in 4 x 4 blocks, according to some embodiments of the present disclosure. As shown in Figs. 3A-3O, 15 exemplary unification methods can be applied to unify the features of a 4 x 4 block, with the features of the block set to the values represented by a, b, c, and d. Given a unification method U(S), a unification distortion loss £_I(B, U(S)) can be calculated by measuring the error introduced by unifying the features of block B using the method U(S).

For example, the features of block B may be set to share the same absolute value while maintaining their original signs. The shared absolute value V(B) may be the mean of the absolute values of all features in block B: V(B) = mean_{v∈B}(abs(v)). In this case, the standard deviation of the absolute values abs(v) of all features of block B may be used to define the unification distortion loss as £_I(B, U(S)) = std_{v∈B}(abs(v)). Given the index order I(S) and the unification method U(S), the unification distortion loss of the superblock S can be calculated as in equation (5):

£_I(S, I(S), U(S)) = Σ_{B∈S} £_I(B, U(S))    (5)

According to the unification method U(S), a compression rate loss £_C(B, U(S)) of block B can be calculated to measure the compression efficiency of the unified features in block B. For example, when all features of block B are set to the same value, only one number is needed to represent the entire block B, and the compression rate may be defined as r_compression = h_b · w_b · c_b. The loss £_C(B, U(S)) can then be defined as 1/r_compression, and the compression rate loss of the superblock S can be calculated as in equation (6):

£_C(S, I(S), U(S)) = Σ_{B∈S} £_C(B, U(S))    (6)

A computation speed loss £_S(B, U(S)) of block B can be calculated to measure the estimated inference speed of computation using the features unified according to the unification method U(S). The speed loss £_S(B, U(S)) may be a function of the number of multiplication operations in the DNN computation using the unified features. For example, when all features of block B are set to the same value, multiplication operations can be omitted when obtaining the matrix multiplication output involving block B, by sharing intermediate outputs. The speed loss of the whole superblock S is given by equation (7):

£_S(S, I(S), U(S)) = Σ_{B∈S} £_S(B, U(S))    (7)

Based on equations (5), (6), and (7), for each superblock S of the feature map F, the unification loss provided in equation (3) can be obtained for a corresponding unification method U(S) as £_U(S, I(S), U(S)) = £_C + £_I + £_S. When more than one unification method (e.g., the unification methods of Figs. 3A-3O) is applied, a corresponding set of unification losses is obtained according to equation (3). An optimal index order I*(S) and an optimal unification method U*(S) may then be determined based on these unification losses; they correspond to the minimum feature unification loss among them and give the optimal unification loss £*_U(S) = min_{I(S),U(S)} £_U(S, I(S), U(S)).
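A compact Python sketch of these per-block loss terms and of the minimum-loss selection follows (the speed term is a stand-in, since the disclosure only requires it to be a function of the number of multiplications saved; reorder_and_blockify is the hypothetical helper from the earlier partitioning sketch):

    import numpy as np

    # Per-block loss terms of equations (5)-(7) for the shared-absolute-
    # value unification method, summed into equation (3).
    def distortion_loss(block):     # L_I: std of the absolute values
        return np.abs(block).std()

    def compression_loss(block):    # L_C: 1 / (h_b * w_b * c_b)
        return 1.0 / block.size

    def speed_loss(block):          # L_S: assumed inversely proportional
        return 1.0 / block.size     # to the shareable multiplications

    def unification_loss(blocks):   # equation (3) via equations (5)-(7)
        return sum(compression_loss(b) + distortion_loss(b) + speed_loss(b)
                   for b in blocks)

    def select_optimal_order(superblock, candidate_orders):
        # exhaustive search over candidate index orders for one method,
        # returning the order with the minimum unification loss
        return min(candidate_orders,
                   key=lambda o: unification_loss(
                       reorder_and_blockify(superblock, o)))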
Fig. 4 shows the overall framework of the iterative retraining/fine-tuning process, which iteratively alternates two processing steps to gradually optimize the joint loss of equation (2). As shown in Fig. 4, a (typically pre-trained) DNN model (402) may be provided with weight coefficients Θ = {W}, which may be part of a larger DNN with weight coefficients Θ_O = {W_O}; Θ = {W} may correspond to the feature extraction portion of the larger DNN (e.g., (202) in Fig. 2). In the first processing step, an optimal index order I*(S) and an optimal unification method U*(S) may be determined through a unification index order and method selection process (406), to unify the features of each superblock in the set {S} of superblocks partitioned from the feature map generated by the DNN (402) with weight coefficients Θ. The size of the superblock may be determined according to the compression method. In an exemplary embodiment, the CTU3D structure may be applied as the superblock, and the t 3D feature map tensors may be processed separately. Specifically, each CTU3D unit (or structure) S may have size (64, 64, c), which may be further divided into a plurality of blocks, each block B having size (2, 2, c_b) with c_b ≤ 2, depending on the shape of the feature map F. For example, if the feature map has only one channel (c = 1), then c_b = 1.
To determine the optimal index order I*(S) and the optimal unification method U*(S) for each superblock S, a training data set D = {(x, y)} may be applied in the unification index order and method selection module (406). For the t-th iteration, the current weights may be denoted Θ(t-1) = {W(t-1)}. Each input x can thus be passed to and through the DNN Θ(t-1) via a network forward computation process (402) to generate a feature map F(x, Θ(t-1)) (404). As described in the previous section, the unification index order and method selection module (or process) (406) can compute the unification loss £_U(x, Θ(t-1), I(S), U(S)) described in equation (3). The unification losses of the different inputs can be accumulated to determine the optimal index order I*(S) and the optimal unification method U*(S), which together correspond to the minimum unification loss. Different configurations may be applied to define the optimal index order and the optimal unification method across all input feature maps. For example, all feature maps of all inputs may be set to share the same optimal unification method U*(S); or to share both the same optimal unification method U*(S) and the same optimal index order I*(S); or each may have its own index order and unification method, etc. In an exemplary embodiment, all feature maps of all inputs may be set to share the same unification method U(S) while maintaining their respective optimal index orders I*(S). That is, for each candidate unification method U(S), a unification loss £_U(x, Θ(t-1), I*(S), U(S)) may be calculated, where I*(S) is described by equation (8):

I*(S) = argmin_{I(S)} £_U(x, Θ(t-1), I(S), U(S))    (8)

By accumulating the unification losses over the different inputs, the unification method for all inputs can then be defined by equation (9):

U*(S) = argmin_{U(S)} Σ_x £_U(x, Θ(t-1), I*(S), U(S))    (9)

When the number of channels c is small, an exhaustive search may be conducted for the optimal I*(S) and U*(S) by comparing the unification losses over a set of possible combinations (e.g., all possible combinations) of index orders and unification methods. When c is large, other methods can be used to find sub-optimal I(S) and U(S). The present disclosure places no limitation on the manner in which the optimal I*(S) and U*(S) are determined.
Once the index order I*(S) and the feature unification method U*(S) are determined for each superblock S, the SSU layer (408) may be fixed to unify the feature map F of the DNN Θ = {W}.
Still referring to Fig. 4, in the second step, the goal turns to finding an updated optimal set of DNN weight coefficients Θ* = {W*} by iteratively optimizing the target loss of the original prediction task described in equation (1).
To learn the DNN weights through network training, an annotation y is given for each corresponding input x, i.e., D = {(x, y)}. For the t-th iteration, the current weights are Θ(t-1) = {W(t-1)} and Θ_O(t-1) = {W_O(t-1)}. The fixed SSU layer (which has no learnable weights) for the t-th iteration can thus be added to the overall DNN Θ_O(t-1) = {W_O(t-1)}; specifically, it may be added after the layer whose output is the feature map, that is, immediately after the last layer of Θ(t-1) = {W(t-1)}. Through the network forward computation and feature unification process (410), each input x can be passed to and through the new/updated DNN, which has the same weight coefficients Θ_O(t-1) as the original overall DNN but contains the additional SSU layer (408), to generate an estimated output ȳ.

Specifically, the network forward computation and feature unification process (410) computes ȳ in three steps. In step one, a feature map F(x, Θ(t-1)) is generated from the input x as the output of the DNN component Θ(t-1) = {W(t-1)}. In step two, a unified feature map F_U(x, Θ(t-1)) is computed by passing the feature map F(x, Θ(t-1)) through the SSU layer, using the feature unification module of the network forward computation and feature unification process (410). In the feature unification module, all superblocks {S} of F(x, Θ(t-1)) can be arranged in ascending order based on their feature unification losses £_U(U*, I*, S). Then, given a unification ratio q(t-1) as a hyperparameter, the top q(t-1)% of the superblocks can be unified using the corresponding index order and unification method. At the same time, a feature unification mask Q(t-1) with the same shape as F(x, Θ(t-1)) may be maintained, which records whether the corresponding features in the generated feature map are unified. In step three, the unified feature map F_U(x, Θ(t-1)) is passed to and through the remaining layers of the overall DNN Θ_O(t-1) to generate the final output ȳ.
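A small Python sketch of this ranking-and-unification step follows (the unify function applies the shared-absolute-value method as one example; all names are hypothetical):

    import numpy as np

    # SSU forward step: sort superblocks by ascending unification loss,
    # unify the first q percent, and record the unification mask Q.
    def unify(block):
        return np.sign(block) * np.abs(block).mean()   # keep signs, share |v|

    def ssu_forward(superblocks, losses, q):
        order = np.argsort(losses)                     # ascending loss
        n_unify = int(len(superblocks) * q / 100.0)    # top q% of the ranking
        mask = np.zeros(len(superblocks), dtype=bool)  # unification mask Q
        out = list(superblocks)
        for idx in order[:n_unify]:
            out[idx] = unify(superblocks[idx])
            mask[idx] = True
        return out, mask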
Based on the ground-truth annotation y and the estimated output ȳ, the target loss of equation (1), £(D_O|Θ_O(t-1)), may be computed by the calculate target loss process (412). The gradient of the target loss can then be calculated using the calculate gradient process (414) to obtain the gradient G(Θ_O(t-1)). The automatic gradient computation methods used by deep learning frameworks (such as TensorFlow or PyTorch) can be used to compute G(Θ_O(t-1)). Based on the gradient G(Θ_O(t-1)), the network coefficients can be updated through back-propagation using the back-propagation and weight update process (416), obtaining updated coefficients Θ_O(t) = {W_O(t)} (and Θ(t) = {W(t)} as part of the overall DNN).
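For illustration, the loss/gradient/update steps of processes (412)-(416) can be sketched with the autograd machinery of such a framework (a toy stand-in model, not the disclosed network):

    import torch

    # Compute the target loss (412), obtain its gradient by automatic
    # differentiation (414), and update the coefficients (416).
    model = torch.nn.Linear(4, 2)                          # stand-in for the DNN
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

    loss = torch.nn.functional.cross_entropy(model(x), y)  # target loss
    loss.backward()                                        # gradient G
    opt.step(); opt.zero_grad()                            # weight update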
The above retraining process for the DNN may itself be iterative, as marked by the dashed box (420) in Fig. 4. In general, multiple iterations may be performed to update the network weight coefficients, for example until the target training loss of equation (1) converges. The back-propagation and weight update process (416) may choose to accumulate the gradients G(Θ_O(t-1)) over a batch of inputs and update the network coefficients only with the accumulated gradients. The batch size may be a predefined hyperparameter, and the system iterates over all training data multiple times, where each pass may be referred to as an epoch. The system runs for multiple epochs until the loss optimization converges. The system then proceeds to the next iteration t: given a new unification ratio q(t) and based on the updated Θ_O(t) (and Θ(t)), a new set of index orders and unification methods can be determined, and an updated SSU layer can be generated to compute a new unified feature map F_U(x, Θ(t)) for each input x.
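The control flow of this alternating procedure can be sketched as a runnable PyTorch toy (not the disclosed system: here the "SSU" unifies whole low-variance channels rather than superblocks, and the sizes, ratio schedule, and loss are illustrative assumptions):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # Θ
    head = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8 * 8, 2))  # rest of Θ_O
    opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()),
                          lr=0.01)
    x, y = torch.randn(16, 3, 8, 8), torch.randint(0, 2, (16,))

    def ssu(fmap, q):
        # step-1 proxy: unify the q% of channels cheapest to unify
        # (lowest variance), replacing each with its spatial mean
        var = fmap.var(dim=(0, 2, 3))
        idx = var.argsort()[:int(len(var) * q / 100)]
        out = fmap.clone()
        out[:, idx] = fmap[:, idx].mean(dim=(2, 3), keepdim=True)
        return out

    for t in range(3):                    # outer iterations of the framework
        q = 30 + 10 * t                   # unification ratio schedule q(t)
        for epoch in range(5):            # step 2: retrain with the SSU fixed
            loss = nn.functional.cross_entropy(head(ssu(backbone(x), q)), y)
            opt.zero_grad(); loss.backward(); opt.step()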
Various embodiments described herein may provide several advantages over related examples. The disclosed structural feature unification may improve the efficiency of further encoding of the feature map, which may significantly reduce the size of the feature map for efficient storage and transmission. Through an iterative retraining process, the learned DNN may be customized to extract a feature map that may efficiently perform the original prediction and that is suitable for subsequent compression. The iterative retraining framework may also provide the flexibility to introduce different penalties at different processing steps, for example, for optimizing a unified feature map or for optimizing a prediction objective so that the system is focused on different tasks at different times. The disclosed methods may be generally applied to data sets having different data forms. The input data x may be a general 4D tensor, which may be a video segment, a color image or a grayscale image. The disclosed method can be flexibly applied to compress different types of feature maps, such as 2D, 3D, and 4D feature maps, with different designs of superblocks and block structures. The disclosed framework can be generally applied to different tasks of extracting feature maps from a trained backbone network, such as semantic segmentation, image/video classification, object detection, image/video super-resolution, etc.
Fig. 5 shows a flowchart outlining a feature map compression method (500) according to an embodiment of the present disclosure, the feature map being generated by passing a first input to and through a DNN. The method (500) begins (S501) and proceeds to (S510).
At (S510), the respective optimal index order and optimal unification method of each superblock are determined, the superblocks being partitioned from the feature map. To determine the respective optimal index order and optimal unification method of each superblock, at least one unification method may be defined for each superblock. At least one feature unification loss may further be determined based on the at least one unification method for each superblock. The optimal index order and optimal unification method are then determined for each superblock according to the at least one feature unification loss, the optimal index order and optimal unification method of each superblock corresponding to the minimum feature unification loss among the at least one feature unification loss.
In some embodiments, to obtain the feature unification losses, a first feature unification loss may be obtained for a first superblock, where the first feature unification loss may be equal to the sum of a compression rate loss, a unification distortion loss, and a computation speed loss.

In some embodiments, to obtain the first feature unification loss, the first superblock may be divided into a plurality of blocks, where each block may have a respective sub-compression rate loss, sub-unification distortion loss, and sub-computation speed loss. The compression rate loss of the first superblock may be determined as the sum of the sub-compression rate losses of the plurality of blocks. The unification distortion loss of the first superblock may likewise be determined as the sum of the sub-unification distortion losses of the plurality of blocks. Further, the computation speed loss of the first superblock may be determined as the sum of the sub-computation speed losses of the plurality of blocks.

In some embodiments, each sub-unification distortion loss may be equal to the standard deviation of the absolute values of the features in the corresponding block, and each sub-compression rate loss may be equal to the inverse of the product of the height, width, and depth of the corresponding block, consistent with the definitions given for equations (5) and (6). Each sub-computation speed loss may be a function of the number of multiplication operations in the computation using the unification method for the corresponding block.
At (S520), a selective structured unification (SSU) layer is determined according to the respective optimal index order and optimal unification method of each superblock. The SSU layer may be added to the DNN to form an updated DNN and may be used to perform unification operations on the feature map. In some embodiments, the SSU layer may be added after the extraction layer of the DNN that generates the feature map.
At (S530), a first estimated output is determined and provided as the compressed feature map. The first estimated output is generated by passing the first input to and through the updated DNN.
In some embodiments, the method (500) may further include updating the network coefficients of the DNN based on the first expected output and the first estimated output. In some embodiments, to update the network coefficients of the DNN, the feature map may be generated by passing the first input to and through an extraction layer of the updated DNN. The unified feature map may then be generated by passing the feature map to and through an SSU layer located after the extraction layer. The unified feature map may further be passed to and through the remaining layers of the updated DNN to generate a first estimation output.
To generate the unified feature map, the superblocks of the feature map may be arranged in ascending order according to their feature unification losses, where each superblock may be reordered along the depth of the feature map according to its corresponding index order. A unification ratio q may be defined. The first q% of the superblocks in the ascending order are then unified using the index order and the unification method that yields the minimum feature unification loss among the feature unification losses of the superblock.
To update the network coefficients of the DNN based on the first expected output and the first estimated output, a target loss may be defined based on the first estimated output and the first expected output. The gradient of the target loss may further be obtained. The network coefficients of the DNN may then be updated according to the gradient of the target loss through a back-propagation and weight update process.
Corresponding to the feature map compression method described above, an embodiment of the present application further provides a feature map compression apparatus, where the apparatus includes:
a first determining module, configured to determine an optimal index order and an optimal unification method for each superblock, where each superblock is obtained by dividing a feature map, and the feature map is generated by passing a first input to and through a deep neural network, DNN;
a second determining module, configured to determine, according to the respective optimal indexing order and optimal unification method for each superblock, a selective structural unification, SSU, layer, the SSU layer being added to the DNN to form an updated DNN, the SSU layer being configured to perform unification operations on the feature map; and
a third determination module to determine a first estimated output, the first estimated output generated by passing the first input to and through the updated DNN.
In some embodiments, the apparatus further comprises: an update module; the updating module is configured to update the network coefficients of the DNN according to a first desired output and the first estimated output.
The techniques described above may be implemented as computer software via computer readable instructions and physically stored in one or more computer readable media. For example, fig. 6 illustrates a computer system (600) suitable for implementing certain embodiments of the disclosed subject matter.
The computer software may be encoded in any suitable machine code or computer language, and by assembly, compilation, linking, etc., mechanisms create code that includes instructions that are directly executable by one or more computer Central Processing Units (CPUs), Graphics Processing Units (GPUs), etc., or by way of transcoding, microcode, etc.
The instructions may be executed on various types of computing devices or components thereof, including, for example, personal computers, tablets, servers, smartphones, gaming devices, internet of things devices, and so forth.
The components illustrated in FIG. 6 for the computer system (600) are exemplary in nature and are not intended to limit the scope of use or functionality of the computer software implementing the embodiments of the application in any way. Neither should the configuration of components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiments of the computer system (600).
The computer system (600) may include some human interface input devices. Such human interface input devices may respond to input from one or more human users through tactile input (e.g., keyboard input, swipe, data glove movement), audio input (e.g., sound, applause), visual input (e.g., gestures), olfactory input (not shown). The human-machine interface device may also be used to capture media that does not necessarily directly relate to human conscious input, such as audio (e.g., voice, music, ambient sounds), images (e.g., scanned images, photographic images obtained from still-image cameras), video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).
The human interface input device may include one or more of the following (only one of which is depicted): keyboard (601), mouse (602), touch pad (603), touch screen (610), data glove (not shown), joystick (605), microphone (606), scanner (607), camera (608).
The computer system (600) may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile outputs, sounds, light, and olfactory/gustatory sensations. Such human interface output devices may include tactile output devices (e.g., tactile feedback through a touch screen (610), data glove (not shown), or joystick (605), but there may also be tactile feedback devices that do not act as input devices), audio output devices (e.g., speaker (609), headphones (not shown)), visual output devices (e.g., screens (610) including cathode ray tube screens, liquid crystal screens, plasma screens, organic light emitting diode screens, each with or without touch screen input functionality, each with or without haptic feedback functionality-some of which may output two-dimensional visual output or more than three-dimensional output by means such as stereoscopic picture output; virtual reality glasses (not shown), holographic displays and smoke boxes (not shown)), and printers (not shown).
The computer system (600) may also include human-accessible storage devices and their associated media, such as optical media including compact disc read-only/rewritable (CD/DVD ROM/RW) (620) with CD/DVD or similar media (621), thumb drive (622), removable hard drive or solid state drive (623), conventional magnetic media such as magnetic tape and floppy disk (not shown), ROM/ASIC/PLD based application specific devices such as security dongle (not shown), and the like.
Those skilled in the art will also appreciate that the term "computer-readable medium" used in connection with the disclosed subject matter does not include transmission media, carrier waves, or other transitory signals.
The computer system (600) may also include an interface to one or more communication networks. For example, the network may be wireless, wired, optical. The network may also be a local area network, a wide area network, a metropolitan area network, a vehicular network, an industrial network, a real-time network, a delay tolerant network, and so forth. The network also includes ethernet, wireless local area networks, local area networks such as cellular networks (GSM, 3G, 4G, 5G, LTE, etc.), television wired or wireless wide area digital networks (including cable, satellite, and terrestrial broadcast television), automotive and industrial networks (including CANBus), and so forth. Some networks typically require external network interface adapters for connecting to some general purpose data ports or peripheral buses (649) (e.g., USB ports of the computer system (600)); other systems are typically integrated into the core of the computer system (600) by connecting to a system bus as described below (e.g., an ethernet interface to a PC computer system or a cellular network interface to a smart phone computer system). Using any of these networks, the computer system (600) may communicate with other entities. The communication may be unidirectional, for reception only (e.g., wireless television), unidirectional for transmission only (e.g., CAN bus to certain CAN bus devices), or bidirectional, for example, to other computer systems over a local or wide area digital network. Each of the networks and network interfaces described above may use certain protocols and protocol stacks.
The human interface device, human accessible storage device, and network interface described above may be connected to the core (640) of the computer system (600).
The core (640) may include one or more Central Processing Units (CPUs) (641), Graphics Processing Units (GPUs) (642), specialized programmable processing units in the form of Field Programmable Gate Arrays (FPGAs) (643), hardware accelerators (644) for specific tasks, and so forth. These devices, together with Read-Only Memory (ROM) (645), Random Access Memory (RAM) (646), internal mass storage such as internal non-user-accessible hard drives and solid-state drives (647), and the like, may be connected via a system bus (648). In some computer systems, the system bus (648) is accessible in the form of one or more physical plugs to enable extension by additional CPUs, GPUs, and the like. Peripheral devices may be attached either directly to the core's system bus (648) or through a peripheral bus (649). Peripheral bus architectures include Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), and the like.
The CPU (641), GPU (642), FPGA (643), and accelerator (644) may execute certain instructions that, in combination, may constitute the computer code described above. The computer code may be stored in ROM (645) or RAM (646). Transient data may also be stored in RAM (646), while persistent data may be stored, for example, in internal mass storage (647). Fast storage and retrieval for any of the memory devices may be enabled through the use of caches, which may be closely associated with one or more of the CPU (641), GPU (642), mass storage (647), ROM (645), RAM (646), and so forth.
The computer-readable medium may have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present application, or they may be of the kind well known and available to those having skill in the computer software arts.
By way of example, and not limitation, a computer system having the architecture (600), and in particular the core (640), may provide functionality as a result of processors (including CPUs, GPUs, FPGAs, accelerators, and the like) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with the user-accessible mass storage described above, as well as certain non-transitory storage of the core (640), such as core-internal mass storage (647) or ROM (645). Software implementing various embodiments of the present application may be stored in such devices and executed by the core (640). A computer-readable medium may include one or more memory devices or chips, according to particular needs. The software may cause the core (640), and in particular the processors therein (including CPUs, GPUs, FPGAs, and the like), to perform certain processes or certain portions of certain processes described herein, including defining data structures stored in RAM (646) and modifying such data structures according to processes defined by the software. Additionally or alternatively, the computer system may provide functionality as a result of logic hardwired or otherwise embodied in circuitry (for example, the accelerator (644)), which may operate in place of or together with software to perform certain processes or certain portions of certain processes described herein. Where appropriate, reference to software may encompass logic, and vice versa. Where appropriate, reference to a computer-readable medium may encompass a circuit storing executable software (such as an Integrated Circuit (IC)), a circuit embodying executable logic, or both. The present application encompasses any suitable combination of hardware and software.
While the application has described several exemplary embodiments, various modifications, arrangements, and equivalents of the embodiments are within the scope of the application. It will thus be appreciated that those skilled in the art will be able to devise various systems and methods which, although not explicitly shown or described herein, embody the principles of the application and are thus within its spirit and scope.

Claims (15)

1. A method of feature map compression, the feature map being generated by passing a first input to and through a deep neural network (DNN), the method comprising:
determining a respective optimal index order and optimal unification method for each superblock, wherein each superblock is obtained by partitioning the feature map;
determining a selective structured unification (SSU) layer according to the respective optimal index order and optimal unification method of each superblock, the SSU layer being added to the DNN to form an updated DNN and being used to perform unification operations on the feature map; and
determining a first estimated output, the first estimated output being generated by passing the first input to and through the updated DNN and serving as a compressed feature map.
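For illustration only (this sketch is not part of the claims): a minimal example of the partitioning step in claim 1, assuming the feature map is a (height, width, depth) NumPy array; the superblock dimensions are hypothetical parameters, since the claims do not fix them.

```python
import numpy as np

def partition_into_superblocks(feature_map, sb_h, sb_w, sb_d):
    """Split a (H, W, D) feature map into a list of superblocks.

    sb_h/sb_w/sb_d are illustrative superblock dimensions; edge
    superblocks may be smaller if the sizes do not divide evenly.
    """
    H, W, D = feature_map.shape
    superblocks = []
    for i in range(0, H, sb_h):
        for j in range(0, W, sb_w):
            for k in range(0, D, sb_d):
                # NumPy slicing returns views, so later in-place
                # unification of a superblock also updates the feature map.
                superblocks.append(feature_map[i:i+sb_h, j:j+sb_w, k:k+sb_d])
    return superblocks

# Example: a 64x64x128 feature map split into 16x16x32 superblocks.
fmap = np.random.randn(64, 64, 128).astype(np.float32)
sbs = partition_into_superblocks(fmap, 16, 16, 32)
```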
2. The method of claim 1, further comprising:
updating network coefficients of the DNN based on a first desired output and the first estimated output.
3. The method of claim 1, wherein determining the respective optimal index order and optimal unification method for each superblock further comprises:
defining at least one unification method for each superblock;
obtaining at least one feature unification loss according to the at least one unification method of each superblock; and
determining the respective optimal index order and optimal unification method for each superblock according to the at least one feature unification loss, wherein the optimal index order and optimal unification method of each superblock correspond to the minimum of the at least one feature unification loss.
4. The method of claim 3, wherein obtaining the at least one feature unification loss according to the at least one unification method for each superblock further comprises:
obtaining a first feature unification loss of the at least one feature unification loss for a first superblock, the first feature unification loss being equal to the sum of a compression rate loss, a unification distortion loss, and a computation speed loss.
5. The method of claim 4, wherein obtaining the first feature unification loss for the first superblock further comprises:
dividing the first superblock into a plurality of blocks, each of the plurality of blocks having a respective sub-compression rate loss, sub-unification distortion loss, and sub-computation speed loss;
obtaining the compression rate loss of the first superblock as the sum of the sub-compression rate losses of the plurality of blocks;
obtaining the unification distortion loss of the first superblock as the sum of the sub-unification distortion losses of the plurality of blocks; and
obtaining the computation speed loss of the first superblock as the sum of the sub-computation speed losses of the plurality of blocks.
6. The method of claim 5, wherein each of the sub-compression rate losses is equal to the standard deviation of the absolute values of the features in the corresponding block;
each of the sub-unification distortion losses is equal to the reciprocal of the product of the height, width, and depth of the corresponding block; and
each of the sub-computation speed losses is a function of the number of multiplication operations required when computing with the unification method for the corresponding block.
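An illustrative sketch (not part of the claims) of the feature unification loss of claims 3 through 6: per-block sub-losses are computed and summed over each superblock, and the method with the smallest loss is selected. The candidate method names, their multiplication counts, the block shape, and the speed-loss weighting are assumptions; only the three sub-loss formulas come from claim 6, and the index-order selection of claim 3 is omitted for brevity.

```python
import numpy as np

# Assumed per-feature multiplication counts for two hypothetical unification
# methods; the claims say only that the sub-computation speed loss is a
# function of the multiplication count.
MULT_COUNT = {"set_to_mean": 1, "set_to_zero": 0}

def feature_unification_loss(superblock, method, block_shape=(4, 4, 8),
                             speed_weight=1e-6):
    """Sum the per-block sub-losses over one superblock (claims 4-6).

    block_shape and speed_weight are illustrative choices.
    """
    bh, bw, bd = block_shape
    H, W, D = superblock.shape
    total = 0.0
    for i in range(0, H, bh):
        for j in range(0, W, bw):
            for k in range(0, D, bd):
                blk = superblock[i:i+bh, j:j+bw, k:k+bd]
                h, w, d = blk.shape
                total += float(np.std(np.abs(blk)))    # sub-compression rate loss
                total += 1.0 / (h * w * d)             # sub-unification distortion loss
                total += speed_weight * MULT_COUNT[method] * blk.size  # sub-speed loss
    return total

def select_optimal_method(superblock, methods=("set_to_mean", "set_to_zero")):
    """Pick the unification method with the smallest loss (claim 3)."""
    return min(methods, key=lambda m: feature_unification_loss(superblock, m))
```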
7. The method of claim 2, wherein the SSU layer is added after an extraction layer in the DNN that generates the feature map.
8. The method of claim 7, wherein updating the network coefficients of the DNN further comprises:
generating the feature map by passing the first input to and through the extraction layer of the updated DNN;
generating a unified feature map by passing the feature map to and through the SSU layer located after the extraction layer; and
passing the unified feature map to and through the remaining layers of the updated DNN, generating the first estimated output.
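For illustration, a sketch of the claim 8 forward pass, assuming the updated DNN can be decomposed into three placeholder callables; the patent does not prescribe this decomposition.

```python
def forward_updated_dnn(first_input, extraction_layers, ssu_layer, remaining_layers):
    """Claim 8 forward pass: input -> extraction layers -> SSU layer -> remaining layers."""
    feature_map = extraction_layers(first_input)   # generate the feature map
    unified_map = ssu_layer(feature_map)           # unified (compressed) feature map
    first_estimated_output = remaining_layers(unified_map)
    return first_estimated_output, unified_map
```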
9. The method of claim 8, wherein generating the unified feature map further comprises:
ordering the superblocks of the feature map in ascending order of their feature unification losses, wherein each of the superblocks is reordered along the depth of the feature map according to its corresponding index order;
defining a unification ratio q; and
unifying the first q% of the superblocks in the ascending order using the corresponding index order and unification method, the unification method being the one that yields the smallest of the feature unification losses for that superblock.
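An illustrative sketch of the selective unification in claim 9, reusing the assumed method names from the sketch after claim 6; depth reordering by index order is again omitted, and the two unification operations are assumed examples, not the patent's.

```python
def unify_superblocks(superblocks, losses, methods, q=50.0):
    """Unify the q% of superblocks with the smallest feature unification
    losses (claim 9). 'methods' maps each superblock index to its
    optimal unification method."""
    order = sorted(range(len(superblocks)), key=lambda i: losses[i])
    n_unify = int(round(len(superblocks) * q / 100.0))
    for i in order[:n_unify]:
        sb = superblocks[i]             # a view into the feature map
        if methods[i] == "set_to_mean":
            sb[...] = sb.mean()         # every feature replaced by the block mean
        elif methods[i] == "set_to_zero":
            sb[...] = 0.0               # block pruned entirely
    return superblocks
```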
10. The method of claim 9, wherein updating the network coefficients of the DNN according to the first desired output and the first estimated output further comprises:
defining a target loss based on the first estimated output and the first desired output;
computing the gradient of the target loss; and
updating the network coefficients of the DNN according to the gradient of the target loss through a back-propagation and weight-update process.
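For illustration, a PyTorch-style sketch of the claim 10 update; the L2 target loss and plain SGD are assumptions, since the claims leave the loss form and optimizer unspecified.

```python
import torch

def update_dnn_coefficients(updated_dnn, first_input, first_desired_output, lr=1e-3):
    """Claim 10: define a target loss, take its gradient, and update the
    network coefficients by back-propagation (assumed L2 loss + SGD)."""
    optimizer = torch.optim.SGD(updated_dnn.parameters(), lr=lr)
    first_estimated_output = updated_dnn(first_input)  # forward pass
    target_loss = torch.mean((first_estimated_output - first_desired_output) ** 2)
    optimizer.zero_grad()
    target_loss.backward()   # gradient of the target loss
    optimizer.step()         # weight-update step
    return float(target_loss)
```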
11. The method of claim 10, further comprising:
determining a second estimated output, the second estimated output generated by passing a second input to and through the updated DNN; and
updating the network coefficients of the DNN according to a second desired output and the second estimated output.
12. A feature map compression apparatus, comprising:
a first determining module, configured to determine a respective optimal index order and optimal unification method for each superblock, wherein each superblock is obtained by partitioning a feature map, and the feature map is generated by passing a first input to and through a deep neural network (DNN);
a second determining module, configured to determine a selective structured unification (SSU) layer according to the respective optimal index order and optimal unification method for each superblock, wherein the SSU layer is added to the DNN to form an updated DNN and is used to perform unification operations on the feature map; and
a third determining module, configured to determine a first estimated output generated by passing the first input to and through the updated DNN and to treat the first estimated output as a compressed feature map.
13. The apparatus of claim 12, further comprising an update module configured to update the network coefficients of the DNN according to a first desired output and the first estimated output.
14. A computing device comprising a processor and a memory; the memory stores a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 11.
15. A non-transitory computer-readable medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 11.
CN202110230937.8A 2020-03-06 2021-03-02 Feature map compression method and device, computing equipment and storage medium Active CN113365072B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062986330P 2020-03-06 2020-03-06
US62/986,330 2020-03-06
US17/096,126 US11948090B2 (en) 2020-03-06 2020-11-12 Method and apparatus for video coding
US17/096,126 2020-11-12

Publications (2)

Publication Number Publication Date
CN113365072A true CN113365072A (en) 2021-09-07
CN113365072B CN113365072B (en) 2022-07-01

Family

ID=77524826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110230937.8A Active CN113365072B (en) 2020-03-06 2021-03-02 Feature map compression method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113365072B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130271A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for block-sparse recurrent neural networks
US20190147298A1 (en) * 2017-11-14 2019-05-16 Magic Leap, Inc. Meta-learning for multi-task learning for neural networks
CN109948794A (en) * 2019-02-28 2019-06-28 清华大学 Neural network structure pruning method, pruning device and electronic equipment
WO2020014590A1 (en) * 2018-07-12 2020-01-16 Futurewei Technologies, Inc. Generating a compressed representation of a neural network with proficient inference speed and power consumption


Also Published As

Publication number Publication date
CN113365072B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
JP7417640B2 (en) Real-time video ultra-high resolution
US20230359865A1 (en) Modeling Dependencies with Global Self-Attention Neural Networks
CN111291212A (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN114973049B (en) Lightweight video classification method with unified convolution and self-attention
US20240119697A1 (en) Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
WO2023280113A1 (en) Data processing method, training method for neural network model, and apparatus
CN115222998B (en) Image classification method
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
US11948090B2 (en) Method and apparatus for video coding
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
CN113850262A (en) RGB-D image semantic segmentation method based on extensible 2.5D convolution and two-way gate fusion
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
CN113365072B (en) Feature map compression method and device, computing equipment and storage medium
Yu et al. Deep learning-based RGB-thermal image denoising: review and applications
US20230254230A1 (en) Processing a time-varying signal
CN116977885A (en) Video text task processing method and device, electronic equipment and readable storage medium
Tan et al. 3D detection transformer: Set prediction of objects using point clouds
Zhang et al. SE-DCGAN: a new method of semantic image restoration
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
Zhang et al. Overview of RGBD semantic segmentation based on deep learning
CN113052309A (en) Method, computer system and storage medium for compressing neural network model
US11755883B2 (en) Systems and methods for machine-learned models having convolution and attention
Zhang et al. ParaNet: Deep regular representation for 3D point clouds
Cheng et al. FFA-Net: fast feature aggregation network for 3D point cloud segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (country: HK; legal event code: DE; document number: 40051856)
GR01 Patent grant