CN113762479A - Neural network optimization method and device - Google Patents


Info

Publication number
CN113762479A
Authority
CN
China
Prior art keywords
convolution kernel
fusible
residual
branch
residual error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111060216.3A
Other languages
Chinese (zh)
Inventor
徐友庆
高成
关晨
孟祥峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Park Sheng Intelligent Technology Co ltd
Original Assignee
Shenzhen Park Sheng Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Park Sheng Intelligent Technology Co ltd filed Critical Shenzhen Park Sheng Intelligent Technology Co ltd
Priority to CN202111060216.3A
Publication of CN113762479A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Abstract

The invention discloses a neural network optimization method and device. The method comprises the following steps: performing model training based on a multi-branch fusible residual structure and extracting the trained model parameters; performing structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure; and deploying the single-branch residual structure to a target device and executing the inference step of the target task. By designing a fusible residual module and structurally replacing the original residual module with it, the invention exploits the advantages of both multi-branch and single-branch structures, improves memory efficiency and parallelism when the deployed network runs, reduces network resource consumption, and accelerates network inference; parameter compression is achieved through re-parameterization, which alleviates the accuracy degradation caused by pruning parameters and connections.

Description

Neural network optimization method and device
Technical Field
Embodiments of the invention relate to the technical field of neural networks, and in particular to a neural network optimization method and device.
Background
In recent years, with the rapid development of deep learning, deep learning has achieved excellent performance on many tasks and is increasingly applied in everyday life and in industrial fields. At present, a deep neural network model can be deployed in an online mode or an offline mode. Offline deployment is used in most practical industrial production environments; it processes data locally without passing through a network, so security and real-time performance can be guaranteed. However, for embedded end-side devices with limited computational resources, the massive computational demands of deep neural networks are unacceptable. Moreover, for battery-powered embedded mobile devices, heavy computation quickly drains the limited battery capacity.
To resolve the deployment dilemma of deep neural networks on embedded devices, conventional approaches have run into bottlenecks. Simply increasing the DRAM capacity of embedded equipment or enhancing CPU computing power cannot keep pace with the development of neural networks. In many industrial scenarios there are also strict volume and power-consumption limits on embedded devices, which pose a huge challenge to deploying neural networks on them. To satisfy the memory and power-consumption constraints that embedded devices place on neural network deployment, a feasible deployment scheme for limited embedded hardware resources has emerged, namely neural network model compression.
However, conventional neural network model compression methods prune redundant connections and parameters from the trained network model to reduce the number of parameters. Because these compression methods do not change the overall architecture of the network and only cut off redundant connections and parameters, the model loses some accuracy; in addition, a traditional neural network architecture cannot simultaneously exploit the advantages of multi-branch and single-branch structures, so neural network inference efficiency is low.
Disclosure of Invention
The invention provides a neural network optimization method and device, which are used to effectively reduce model parameters and improve the inference efficiency of a neural network.
In a first aspect, an embodiment of the present invention provides a neural network optimization method, including:
performing model training based on a multi-branch fusible residual structure, and extracting the trained model parameters;
performing structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure;
and deploying the single-branch residual structure to a target device and executing the inference step of the target task.
Optionally, the fusible residual structure is obtained from the residual structure by removing the ReLU layer between two consecutive convolution kernels.
Optionally, the convolution kernel structure in the fusible residual structure includes: a 1 by 1 convolution kernel, a 3 by 3 convolution kernel following the 1 by 1 convolution kernel, and a 1 by 1 convolution kernel following the 3 by 3 convolution kernel.
Optionally, performing structure conversion on the trained fusible residual structure by using a fusion operator includes:
traversing all fusible residual structures in the neural network;
and substituting the output of the convolution kernel in the fusible residual structure into the formula of the batch normalization layer to obtain a convolution kernel fused with the batch normalization layer.
Optionally, performing structure conversion on the trained fusible residual structure by using a fusion operator includes:
each convolution kernel in the fusible residual structure taking the output of the previous convolution kernel layer as input and feeding its own output to the next convolution kernel, so as to merge consecutive convolution kernels.
Optionally, performing structure conversion on the trained fusible residual structure by using a fusion operator includes:
for a fusible residual structure with downsampling, expanding the 1 by 1 convolution kernel on the direct connection into a 3 by 3 convolution kernel;
and adding the 3 by 3 convolution kernel to the center of the expanded 3 by 3 convolution kernel to complete the horizontal merging.
In a second aspect, an embodiment of the present invention further provides a neural network optimization apparatus, including:
a training module, configured to perform model training based on the multi-branch fusible residual structure and extract the trained model parameters;
a fusion module, configured to perform structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure;
and a deployment and inference module, configured to deploy the single-branch residual structure to a target device and execute the inference steps of a target task.
Aiming at the memory-inefficient and low-parallelism structure of multi-branch networks, the invention provides a fusible residual module and adopts a re-parameterization technique. For ResNet-like networks, the residual module is structurally replaced by the fusible residual module, and the residual structure is fused into a single convolution at deployment time. This avoids the extra memory consumption caused by the multi-branch structure, reduces the network depth, improves memory efficiency and parallelism when the network is deployed, saves network resources, and accelerates network inference. In addition, several equivalent convolution structures and anisotropic convolution structures are provided to enhance the performance of the fusible residual module.
Drawings
Fig. 1 is a flowchart of a neural network optimization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a fusible residual structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an equivalent expansion of a 1 by 1 convolution kernel according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a neural network optimization device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Examples
Fig. 1 is a flowchart of a neural network optimization method provided in an embodiment of the present invention, which specifically includes the following steps:
s110, model training is carried out based on the multi-branch fusible residual structure, and trained model parameters are extracted.
Referring to fig. 2, fig. 2 is a schematic diagram of a fusible residual structure according to an embodiment of the present invention. The fusible residual structure in this embodiment removes the ReLU layer between two consecutive convolutional layers, eliminating the nonlinear relationship between them so that they can be fused. Further, the fusible residual structure adopts a "131" structure, i.e., a 1 by 1 convolution kernel, a 3 by 3 convolution kernel following the 1 by 1 convolution kernel, and a 1 by 1 convolution kernel following the 3 by 3 convolution kernel.
In this embodiment, the number of channels of the 3 by 3 convolution kernel is widened to mitigate the accuracy degradation caused by removing the ReLU layer.
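As a concrete illustration, the following is a minimal PyTorch sketch of such a "131" fusible residual block under the assumptions just described: each convolution is followed by batch normalization, no ReLU is placed between the convolutions, a ReLU is applied only after the branch addition, and the 3 by 3 stage is widened by a configurable factor. The class name and the widen parameter are illustrative choices, not details fixed by this embodiment.

import torch.nn as nn

class FusibleResidualBlock(nn.Module):
    # "131" fusible residual block: 1x1 -> 3x3 -> 1x1, each followed by batch
    # normalization, with no ReLU between the convolutions so the chain stays
    # linear and fusible; a ReLU is applied only after the branch addition.
    def __init__(self, in_channels, out_channels, stride=1, widen=2):
        super().__init__()
        mid = out_channels * widen  # widened 3x3 stage to offset removing the ReLU layers
        self.conv1 = nn.Conv2d(in_channels, mid, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # skip branch: identity, or a 1x1 convolution when downsampling or changing width
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        else:
            self.skip = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = self.bn2(self.conv2(y))   # no ReLU between the convolutions
        y = self.bn3(self.conv3(y))
        return self.relu(y + self.skip(x))

During training the block behaves like an ordinary multi-branch residual block; the linearity between the three convolutions is what later allows them to collapse into a single convolution.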
S120, structure conversion is performed on the trained fusible residual structure by using a fusion operator to obtain a single-branch residual structure.
Specifically, the structure conversion of the trained model parameters with the fusion operator mainly comprises three parts: merging a convolution kernel with a batch normalization layer, merging a convolution kernel with a convolution kernel, and merging convolution kernels horizontally.
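Before detailing these three parts, the sketch below shows one way the conversion could be driven in PyTorch: the trained network is traversed and each fusible residual block is replaced in place by its single-branch equivalent. The fuse() method is an illustrative name standing for the three merging steps described next; it is not an interface defined by this embodiment.

import torch
import torch.nn as nn

@torch.no_grad()
def convert_to_single_branch(model: nn.Module) -> nn.Module:
    # Walk the trained network and replace every fusible residual block with its
    # single-branch equivalent. "fuse()" is an assumed, illustrative method name.
    for name, child in model.named_children():
        if hasattr(child, "fuse"):
            setattr(model, name, child.fuse())   # swap in the fused single-branch convolution
        else:
            convert_to_single_branch(child)      # recurse into nested containers
    return model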
(1) Merging a convolution kernel with a batch normalization layer
In this embodiment, all fusible residual structures in the neural network are traversed, and the output of each convolution kernel in a fusible residual structure is substituted into the formula of the batch normalization layer to obtain a convolution kernel fused with the batch normalization layer.
Specifically, the formula of the convolution kernel is:
Conv(X)=WX+b
where X is the input image matrix, W is the parameter matrix, and b is the bias matrix.
Substituting the output of the convolution kernel into the formula of the batch normalization layer gives the following expression:
BN(Conv(X)) = γ·(W·X + b - mean)/√(var + ε) + β
where mean and var are the running mean and variance tracked by the batch normalization layer for its input Conv(X), γ and β are the scaling factor and bias of the normalization layer, and ε is a small constant for numerical stability.
Let:
W_fused = γ·W/√(var + ε),  B_fused = γ·(b - mean)/√(var + ε) + β
where W_fused is the fused parameter matrix and B_fused is the fused bias matrix.
The following expression is then obtained, which is in fact the expression of a convolution kernel fused with batch normalization:
Conv_fused(X) = BN(Conv(X)) = W_fused·X + B_fused
where Conv_fused is the convolution kernel obtained by fusing the batch normalization layer into the convolution kernel, and is determined by W_fused and B_fused.
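For reference, the same fusion can be performed directly on PyTorch layer parameters. The sketch below folds a BatchNorm2d into the preceding Conv2d following the W_fused and B_fused formulas above; it assumes the normalization layer immediately follows the convolution and uses that layer's running statistics.

import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold a BatchNorm2d into the preceding Conv2d, per the W_fused / B_fused formulas.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)         # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))    # W_fused
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)    # B_fused
    return fused

After this step every branch of the fusible residual structure consists of convolutions only, so the remaining merges operate purely on convolution weights.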
(2) Merging a convolution kernel with a convolution kernel
In this embodiment, after the batch normalization layers have been fused into the convolution kernel layers, the convolution kernel layers in fig. 2 are directly connected: each convolution kernel layer takes the output of the previous convolution kernel layer as its input and feeds its own output to the next convolution kernel layer, which allows consecutive convolution kernels to be merged.
The specific expression is as follows:
Conv_2(Conv_1(X)) = W_2·(W_1·X + b_1) + b_2
                  = W_2·W_1·X + W_2·b_1 + b_2
                  = (W_2·W_1)·X + (W_2·b_1 + b_2)
Let:
W_fused = W_2·W_1,  b_fused = W_2·b_1 + b_2
The following expression is then obtained, which is in fact the equivalent expression of two consecutive convolution kernels fused into one:
Conv_fused(X) = W_fused·X + b_fused
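As an illustration of this step, the sketch below merges a k by k convolution followed by a 1 by 1 convolution into a single k by k convolution, i.e., the W_fused = W_2·W_1, b_fused = W_2·b_1 + b_2 case written over PyTorch tensors. It assumes stride 1 on the 1 by 1 convolution, no grouped convolution, and that batch normalization has already been folded into the biases; merging a 1 by 1 convolution that precedes a 3 by 3 convolution requires an analogous but separate transform and is not shown here.

import torch
import torch.nn as nn

@torch.no_grad()
def merge_kxk_then_1x1(conv_kxk: nn.Conv2d, conv_1x1: nn.Conv2d) -> nn.Conv2d:
    # Merge a k-by-k convolution followed by a 1-by-1 convolution into one k-by-k
    # convolution: W_fused is the channel-wise product W2*W1, b_fused = W2*b1 + b2.
    w1 = conv_kxk.weight                                   # (mid, in, k, k)
    w2 = conv_1x1.weight[:, :, 0, 0]                       # (out, mid)
    b1 = conv_kxk.bias if conv_kxk.bias is not None else torch.zeros(w1.shape[0])
    b2 = conv_1x1.bias if conv_1x1.bias is not None else torch.zeros(w2.shape[0])
    fused = nn.Conv2d(conv_kxk.in_channels, conv_1x1.out_channels, conv_kxk.kernel_size,
                      conv_kxk.stride, conv_kxk.padding, bias=True)
    fused.weight.copy_(torch.einsum('om,mikl->oikl', w2, w1))   # sum over the middle channels
    fused.bias.copy_(w2 @ b1 + b2)
    return fused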
(3) Merging convolution kernels horizontally
For a fusible residual structure with downsampling, the 1 by 1 convolution kernel on the skip connection needs to be merged horizontally with the main branch. Specifically, to merge horizontally, the 1 by 1 convolution kernel on the direct connection is first equivalently expanded into a 3 by 3 convolution kernel so that the sizes match, as shown in fig. 3. A 1 by 1 convolution kernel can be regarded as a special case of a 3 by 3 convolution kernel, i.e., it can be represented by a 3 by 3 convolution kernel: as shown in fig. 3, the 1 by 1 convolution kernel is expanded into a 3 by 3 convolution kernel by padding zeros around it, so that its only nonzero entry lies at the center point. The two parallel 3 by 3 convolution kernels can then be combined into one 3 by 3 convolution kernel by adding the main-branch 3 by 3 kernel and the expanded kernel element-wise.
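The sketch below carries out this horizontal merge in PyTorch: the 1 by 1 skip kernel is zero-padded to 3 by 3 and added to the main-branch 3 by 3 kernel, so the two parallel branches collapse into a single convolution. It assumes both branches receive the same input, have matching input and output channels and stride, and have had their batch normalization folded in beforehand.

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def merge_parallel_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    # Zero-pad the 1x1 skip kernel to 3x3 and add it to the main-branch 3x3 kernel,
    # so the two parallel branches become a single convolution.
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, kernel_size=3,
                      stride=conv3x3.stride, padding=1, bias=True)
    padded = F.pad(conv1x1.weight, [1, 1, 1, 1])           # (out, in, 1, 1) -> (out, in, 3, 3)
    fused.weight.copy_(conv3x3.weight + padded)            # 1x1 value lands at the 3x3 center
    b3 = conv3x3.bias if conv3x3.bias is not None else torch.zeros(conv3x3.out_channels)
    b1 = conv1x1.bias if conv1x1.bias is not None else torch.zeros(conv1x1.out_channels)
    fused.bias.copy_(b3 + b1)
    return fused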
S130, deploying the single-branch residual structure to a target device and executing the inference step of the target task.
For example, a target task may be to automatically assess mineral flotation froth grade on an embedded device. For this scenario, the accuracy of the fused ResNet network is retained during cloud training; at deployment time the network is converted into a single-branch structure and deployed on the embedded device, which significantly accelerates inference and reduces per-inference latency.
The target task may also be to guard against and detect malicious traffic in a software-defined network. For this scenario, applying the fused ResNet network effectively improves its inference speed, thereby shortening the interval between successive network traffic scans and improving the overall security of the software-defined network.
Further, the embodiment of the present invention also provides corresponding experimental verification results, as follows:
1. Experimental setup
Training was carried out in PyTorch on the Cifar10 and Cifar100 datasets with simple data augmentation for 120 epochs, using a cosine-annealing learning-rate schedule with a 5-epoch warm-up and a training batch size of 256. For testing, PyTorch was used as the software environment, the server graphics card was an NVIDIA V100, the embedded device was an NVIDIA TX2, and speed is reported in samples per second. In the comparison, the proposed branch-fusion method for the residual structure is applied to ResNet and compared with the original ResNet in terms of running speed, model accuracy, and memory consumption.
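As an aside, the learning-rate schedule just described (a 5-epoch warm-up into cosine annealing over 120 epochs) can be expressed in PyTorch roughly as follows; the optimizer choice, base learning rate and placeholder model are assumptions, not values reported by this embodiment.

import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(epoch, warmup_epochs=5, total_epochs=120):
    # Linear warm-up for the first 5 epochs, then cosine annealing until epoch 120.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # assumed base learning rate
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)   # call scheduler.step() once per epoch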
OS: Ubuntu 16.04 Xenial
CPU: 2× Intel Xeon E5-2620 v4 @ 32×3 GHz
GPU: 2× NVIDIA Tesla V100
RAM: 256 GB DDR4
Table 1 Training server configuration
The training server used for the experiments in this embodiment is an Intel Xeon E5 server equipped with two NVIDIA V100 graphics cards; its specific configuration is shown in Table 1.
Table 2 NVIDIA TX2 configuration (table contents not reproduced)
Testing at deployment time was also performed on an embedded platform, using an NVIDIA TX2 as the deployment environment. It carries a quad-core ARM Cortex-A57 MPCore CPU and 8 GB of 256-bit LPDDR4 memory, and runs Ubuntu 18.04. Its specific configuration is shown in Table 2.
2. Results of the experiment
Model        V100 speed (FPS)   TX2 speed (FPS)   Deployed parameters (MB)
ResNet18         1644.34            159.54              45
ResNet18*        3038.67            300.22              21
ResNet34         1641.48            158.51              84
ResNet34*        3031.32            298.60              39
ResNet50          474.71             48.23              98
ResNet50*        2054.89            189.00              40
ResNet101         277.84             28.86             171
ResNet101*       1200.04            112.75              78
ResNet152         192.23             20.30             231
ResNet152*        834.63             79.34             110
Table 3 Deployment speed comparison on V100 and TX2 (models marked with * are deployed with branch fusion)
Table 3 compares inference speeds in actual deployment on the server side and the embedded side. In this test, ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152 deployed with branch fusion are compared with the original models, with a batch size of 64 during inference. The speed-up ratio of the fusible residual module relative to BasicBlock (shallow ResNet) is about 1.84, the speed-up ratio relative to Bottleneck (deep ResNet) is about 4, and the number of parameters is roughly half that of the original ResNet.
Table 4 Comparison of training results on CIFAR10 and CIFAR100 (table contents not reproduced)
Table 4 shows the training results on Cifar10 and Cifar100. In this test, ResNet18, ResNet34, ResNet50 and ResNet101 deployed with branch fusion are compared with the original models, with a VGG network added for comparison, and the performance lost by removing the nonlinear layer is recovered by attaching the fusible extension module. Models marked with "-" (for example, ResNet50-) are networks generated by directly replacing the corresponding residual modules of ResNet with fusible residual modules; it can be seen that directly removing the nonlinear ReLU layer from the residual module lowers network performance by 1%-2% compared with the original network, so the fused models add multi-path extension branches to the fusible residual module to improve performance. The experiments show that, with the fusible extension module, the fusible residual module of this embodiment is essentially as accurate as the original ResNet network.
3. Analysis of experimental results
Considering that a model has different priorities during training and during deployment, this embodiment draws on the idea of re-parameterization and proposes a fusible residual module for the residual structure, targeting hardware efficiency during network inference and optimizing the inference efficiency and memory efficiency of the residual network model at deployment time. By removing the nonlinear layer in the residual structure and fusing the multi-branch structure before deployment, the branch structure of the model is eliminated and the number of layers is reduced, which improves memory efficiency and running efficiency during deployment. The advantages and limitations of linear and multi-branch network structures are first discussed; then, by slightly adjusting the ResNet structure, training and deployment of the network are decoupled: the multi-branch residual network structure is used during training and converted into a linear network structure at deployment, exploiting the advantages of both single-branch and multi-branch networks while avoiding their drawbacks. Compared with the original ResNet network, the resulting model achieves equivalent accuracy with half the parameters and a speed-up ratio of 1.8-4.4.
With continued reference to fig. 4, fig. 4 shows a neural network optimization apparatus according to an embodiment of the present invention. The apparatus includes:
a training module 210, configured to perform model training based on a multi-branch fusible residual structure and extract the trained model parameters;
a fusion module 220, configured to perform structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure;
and a deployment and inference module 230, configured to deploy the single-branch residual structure to a target device and perform the inference steps of a target task.
Optionally, the fusible residual structure is obtained from the residual structure by removing the ReLU layer between two consecutive convolution kernels.
Optionally, the convolution kernel structure in the fusible residual structure includes: a 1 by 1 convolution kernel, a 3 by 3 convolution kernel following the 1 by 1 convolution kernel, and a 1 by 1 convolution kernel following the 3 by 3 convolution kernel.
The fusion module 220 is specifically configured to: traverse all fusible residual structures in the neural network;
and substitute the output of the convolution kernel in the fusible residual structure into the formula of the batch normalization layer to obtain a convolution kernel fused with the batch normalization layer.
The fusion module 220 is further configured such that each convolution kernel in the fusible residual structure takes the output of the previous convolution kernel layer as input and feeds its own output to the next convolution kernel, so as to merge consecutive convolution kernels.
The fusion module 220 is further configured to: for a fusible residual structure with downsampling, expand the 1 by 1 convolution kernel on the direct connection into a 3 by 3 convolution kernel;
and add the 3 by 3 convolution kernel to the center of the expanded 3 by 3 convolution kernel to complete the horizontal merging.
The neural network optimization apparatus provided by the embodiment of the present invention can execute the neural network optimization method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method; details are not repeated here.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (7)

1. A neural network optimization method, comprising:
performing model training based on a multi-branch fusible residual structure, and extracting the trained model parameters;
performing structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure;
and deploying the single-branch residual structure to a target device and executing the inference step of the target task.
2. The method of claim 1, wherein the fusible residual structure is obtained from the residual structure by removing the ReLU layer between two successive convolution kernels.
3. The method of claim 1, wherein the convolution kernel structure in the fusible residual structure comprises: a 1 by 1 convolution kernel, a 3 by 3 convolution kernel following the 1 by 1 convolution kernel, and a 1 by 1 convolution kernel following the 3 by 3 convolution kernel.
4. The method of claim 1, wherein performing structure conversion on the trained fusible residual structure by using a fusion operator comprises:
traversing all fusible residual structures in the neural network;
and substituting the output of the convolution kernel in the fusible residual structure into the formula of the batch normalization layer to obtain a convolution kernel fused with the batch normalization layer.
5. The method of claim 1, wherein performing structure conversion on the trained fusible residual structure by using a fusion operator comprises:
each convolution kernel in the fusible residual structure taking the output of the previous convolution kernel layer as input and feeding its own output to the next convolution kernel, so as to merge consecutive convolution kernels.
6. The method of claim 2, wherein performing structure conversion on the trained fusible residual structure by using a fusion operator comprises:
for a fusible residual structure with downsampling, expanding the 1 by 1 convolution kernel on the direct connection into a 3 by 3 convolution kernel;
and adding the 3 by 3 convolution kernel to the center of the expanded 3 by 3 convolution kernel to complete the horizontal merging.
7. An apparatus for neural network optimization, comprising:
a training module, configured to perform model training based on a multi-branch fusible residual structure and extract the trained model parameters;
a fusion module, configured to perform structure conversion on the trained fusible residual structures by using a fusion operator to obtain a single-branch residual structure;
and a deployment and inference module, configured to deploy the single-branch residual structure to a target device and execute the inference steps of a target task.
CN202111060216.3A 2021-09-10 2021-09-10 Neural network optimization method and device Pending CN113762479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060216.3A CN113762479A (en) 2021-09-10 2021-09-10 Neural network optimization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111060216.3A CN113762479A (en) 2021-09-10 2021-09-10 Neural network optimization method and device

Publications (1)

Publication Number Publication Date
CN113762479A 2021-12-07

Family

ID=78794622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111060216.3A Pending CN113762479A (en) 2021-09-10 2021-09-10 Neural network optimization method and device

Country Status (1)

Country Link
CN (1) CN113762479A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114511A1 (en) * 2017-10-16 2019-04-18 Illumina, Inc. Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks
CN110929697A (en) * 2019-12-17 2020-03-27 中国人民解放军海军航空大学 Neural network target identification method and system based on residual error structure
CN111242862A (en) * 2020-01-09 2020-06-05 西安理工大学 Multi-scale fusion parallel dense residual convolution neural network image denoising method
US20210264278A1 (en) * 2020-02-24 2021-08-26 Adobe Inc. Neural network architecture pruning
CN111861870A (en) * 2020-07-16 2020-10-30 南通大学 End-to-end parallel generator network construction method for image translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏书伟; 曾上游; 潘兵; 王新娇: "Design of a lightweight convolutional neural network based on diversified structures" (基于多样化结构的轻量型卷积神经网络设计), Modern Electronics Technique (现代电子技术), no. 12

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293337A (en) * 2022-10-09 2022-11-04 深圳比特微电子科技有限公司 Method and device for constructing neural network, computing equipment and storage medium
CN115293337B (en) * 2022-10-09 2022-12-30 深圳比特微电子科技有限公司 Method and device for constructing neural network, computing equipment and storage medium
CN115600653A (en) * 2022-12-07 2023-01-13 Honor Device Co., Ltd. (CN) Deployment method and device of neural network model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination