CN110458841B - Method for improving image segmentation running speed

Method for improving image segmentation running speed

Info

Publication number: CN110458841B (application CN201910535642.4A)
Authority: CN (China)
Prior art keywords: convolution, size, kernel, image, convolution kernel
Legal status: Active (granted)
Other versions: CN110458841A (application publication)
Other languages: Chinese (zh)
Inventors: 张烨, 樊一超, 郭艺玲
Assignee (current and original): Zhejiang University of Technology ZJUT
Priority/filing date: 2019-06-20
Application publication date: 2019-11-15 (CN110458841A)
Grant publication date: 2021-06-08 (CN110458841B)

Classifications

    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing › G06F18/21 Design or setup of recognition systems or techniques)
    • G06N3/045 Combinations of networks (G PHYSICS › G06 COMPUTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology)
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining (G PHYSICS › G06 COMPUTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T1/00 General purpose image data processing)
    • G06T7/10 Segmentation; Edge detection (G PHYSICS › G06 COMPUTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00 Image analysis)

Abstract

A method for improving the running speed of image segmentation, comprising: designing a multi-scale hole convolution kernel; designing a channel convolution network; and designing a full convolution connection and deconvolution network. Through the deconvolution and full convolution operations, the network can accept any image size, performs semantic analysis on each pixel of the image, achieves the aim of rapidly segmenting the image, and locates image features quickly and accurately.

Description

Method for improving image segmentation running speed
Technical Field
The invention relates to a method for improving the running speed of image segmentation.
Background Art
In recent years, with the rapid development of computer science and technology, computer-based image processing and image target detection have also developed at an unprecedented pace. Deep learning, which learns from massive digital image data and extracts key target features, has surpassed human performance in target detection and brought further surprises to the industry. With the renewed rise of neural networks, video image methods based on convolutional neural networks have become the mainstream technology for image segmentation and recognition, achieving accurate image recognition by means of template matching, edge feature extraction, gradient histograms and the like. Although image feature detection based on neural networks can effectively identify target features in complex scenes, with results far better than those of traditional methods, it still has the following shortcomings: (1) the noise immunity is weak; (2) the Dropout method alleviates overfitting and improves the convolutional neural network model and its parameters, but the precision decreases slightly; (3) variable convolution and separable convolution structures improve the generalization of the model and strengthen its feature extraction capability, but target recognition performance in complex scenes remains poor; (4) the newer end-to-end image segmentation methods directly predict pixel classification information and achieve pixel-level positioning of the target object, but such models suffer from large parameter quantities, low efficiency and rough segmentation. In short, both the traditional detection methods and the video image methods suffer from complex operation, low recognition precision, low recognition efficiency and rough segmentation.
Disclosure of Invention
To overcome the above-mentioned deficiencies of the prior art, the present invention provides a full-convolution method for improving the running speed of image segmentation. The invention adopts a deep learning framework and optimizes and improves the convolutional neural network: a channel convolution method reduces the parameter quantity of the model, and multi-scale hole convolution (dilated convolution) enriches the extracted image features and solves the problem of the small receptive field of traditional networks.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for improving the running speed of image segmentation comprises the following steps:
designing a multi-scale hole convolution kernel;
In order to avoid relying on the traditional convolution and maximum pooling method to increase the receptive field, the invention adopts the hole convolution kernel: the sampling rate is increased on the basis of the traditional convolution kernel, expanding the original convolution kernel by inserting holes between its sampling points.
Thus, while the original calculation amount is kept, the receptive field is increased so that the image segmentation information is accurate enough. The calculation formula for the receptive field size based on the hole convolution kernel is
$$F_i = F_{i-1} + (k_i - 1) \cdot rate \cdot \prod_{j=1}^{i-1} s_j \qquad (1)$$
In the formula: f is the size of the current layer receptive field; the rate is the sampling rate of the hole convolution kernel, i.e. the number of intervals, and the rate of the conventional convolution kernel can be regarded as 1, and the sampling rate of the hole convolution can be regarded as 2. The traditional convolution receptive field calculation formula is
$$F_i = F_{i-1} + (k_i - 1) \cdot \prod_{j=1}^{i-1} s_j, \quad i = 1, 2, \ldots, n \qquad (2)$$
In the formula: fi-1The size of the receptive field of the previous layer; k is a radical ofiIs convolution of ith layerNuclear or pooled nuclear size; n is the total number of layers of convolution; siIs the convolution step size Stride of the i-th layer convolution kernel.
The multi-scale hole convolution is designed using the idea of multi-scale image transformation: the sampling rate and the convolution kernel size are diversified so that the kernel can adapt to the feature extraction process for targets of different sizes. The multi-scale hole convolution is calculated as
$$y[i] = \sum_{k \in K} x[i + rate \cdot k] \cdot w[k] \qquad (3)$$
In the formula: y[i] is the convolution summation result at the ith sliding position; x is the input feature; K is the convolution kernel; k is the coordinate position of a parameter within the kernel, with k ∈ K; w[k] is the convolution kernel weight; rate is the sampling rate and takes the values 1, 2 and 3.
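A minimal PyTorch sketch of this multi-scale hole convolution follows: three parallel 3 × 3 branches with sampling rates 1, 2 and 3. The channel counts, and the choice to sum rather than concatenate the branches, are assumptions made for illustration; the patent text does not fix them.

```python
import torch
import torch.nn as nn

class MultiScaleHoleConv(nn.Module):
    """Parallel 3x3 hole (dilated) convolutions with sampling rates 1, 2, 3."""

    def __init__(self, in_ch=64, out_ch=64, rates=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = rate keeps the spatial size unchanged for a 3x3 kernel
            nn.Conv2d(in_ch, out_ch, kernel_size=3, dilation=r, padding=r)
            for r in rates
        ])

    def forward(self, x):
        # Each output position aggregates features sampled at several
        # receptive-field scales, applying formula (3) per branch.
        return sum(branch(x) for branch in self.branches)

x = torch.randn(1, 64, 56, 56)
print(MultiScaleHoleConv()(x).shape)  # torch.Size([1, 64, 56, 56])
```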
Designing a channel convolution network;
Since the conventional convolution mode is a dimension-increasing operation, a channel convolution mode can be adopted to reduce the dimension of the feature convolution. The traditional convolution is first split into two convolution layers, similar to the group operation in ResNet. On the premise of not affecting accuracy, the new structure shortens the calculation time to about 1/8 and reduces the parameter quantity to about 1/9; it can therefore be applied well on mobile terminals for real-time target detection, and the model compression effect is obvious.
For the conventional convolution, assume that the number of input feature channels is M, that the width and height of the convolution kernel are both D_k, and that the number of convolution kernels is N. Then each position the convolution slides over involves N kernels with M·D_k·D_k parameters each, and the sliding step size is set to s. The calculation formula for the image size after sliding is
$$h' = \frac{h - D_k + 2 \cdot pad}{s} + 1 \qquad (4)$$

$$w' = \frac{w - D_k + 2 \cdot pad}{s} + 1 \qquad (5)$$
In the formula: h and w are the height and width before convolution; h' and w' are the height and width after convolution; pad is the boundary filling of the width and height. Each point of the h'·w' feature map after convolution therefore corresponds to N × M·D_k·D_k parameters, so the total parameter quantity is

$$N \cdot M \cdot D_k \cdot D_k \cdot h' \cdot w' \qquad (6)$$
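As a quick numeric check of formulas (4) to (6); every concrete value here (input size 224, D_k = 3, s = 2, pad = 1, M = 32, N = 64) is an illustrative assumption:

```python
# Hypothetical sizes: 224x224 input, 3x3 kernel, stride 2, padding 1,
# M = 32 input channels, N = 64 kernels.
h, w, Dk, s, pad, M, N = 224, 224, 3, 2, 1, 32, 64

h_out = (h - Dk + 2 * pad) // s + 1      # formula (4) -> 112
w_out = (w - Dk + 2 * pad) // s + 1      # formula (5) -> 112
total = N * M * Dk * Dk * h_out * w_out  # formula (6)

print(h_out, w_out, total)  # 112 112 231211008
```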
The improved channel convolution mode divides the convolution into two steps:
1) A D_k·D_k convolution is applied to each of the M channels separately. Sliding with the same step size s, the dimension after convolution is h' × w', and the parameter quantity generated by this step is

$$D_k \cdot D_k \cdot M \cdot h' \cdot w' \qquad (7)$$

2) N convolution kernels of size 1 × 1 are set to perform the dimension-raising feature extraction. The feature map obtained in step 1) is then convolved again with step size 1, the original M channel features being processed by the N convolution kernels respectively, and the calculated total parameter quantity is

$$M \cdot N \cdot h' \cdot w' \cdot 1 \cdot 1 \qquad (8)$$
Integrating the convolution structures of the two steps, the final parameter quantity of the channel convolution is

$$D_k \cdot D_k \cdot M \cdot h' \cdot w' + M \cdot N \cdot h' \cdot w' \qquad (9)$$
As previously mentioned, comparing the parameter quantity of the conventional convolution kernel with that of the improved channel convolution gives

$$\frac{D_k \cdot D_k \cdot M \cdot h' \cdot w' + M \cdot N \cdot h' \cdot w'}{N \cdot M \cdot D_k \cdot D_k \cdot h' \cdot w'} = \frac{1}{N} + \frac{1}{D_k^{2}} \qquad (10)$$
From the analysis of equation (10), if a convolution kernel size of 3 × 3 is used, the channel convolution operation reduces the parameter quantity to roughly 1/9, since the 1/N term is negligible for large N.
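The 1/9 figure can be verified with a minimal PyTorch sketch of the two-step channel convolution, i.e. a depthwise convolution followed by a 1 × 1 pointwise convolution; M = 32 and N = 64 are assumed channel counts chosen only for illustration:

```python
import torch.nn as nn

M, N, Dk = 32, 64, 3  # assumed channel counts; 3x3 kernel as in the text

standard = nn.Conv2d(M, N, Dk, padding=1, bias=False)             # formula (6)
depthwise = nn.Conv2d(M, M, Dk, padding=1, groups=M, bias=False)  # step 1, (7)
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)            # step 2, (8)

def params(m):
    return sum(p.numel() for p in m.parameters())

ratio = (params(depthwise) + params(pointwise)) / params(standard)
print(ratio, 1 / N + 1 / Dk**2)  # both 0.1267..., roughly 1/9 for a 3x3 kernel
```

The h'·w' factor cancels in the ratio, which is why the parameter counts of the layers alone reproduce formula (10).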
Designing a full convolution connection and deconvolution network;
the final layer of the traditional network structure adopts a fixed size, so that an input picture needs to be converted into a fixed size in advance, and the acquisition of the vehicle length coordinate of the logistics vehicle is not facilitated; in addition, the traditional full-connection layer network has the defects that the determined digit space coordinate is lost, so that the image space information is distorted, and the target cannot be effectively and accurately positioned. In order to solve the problem of information loss, the invention adopts a full convolution connection mode to accurately position the position coordinates of the features in the picture.
In a traditional network, the full connection converts the convolution feature map [b, c, h, w] of the preceding part into [b, c·h·w], i.e. [b, 4096], and then into [b, cls], where b denotes the batch size and cls the number of classes. The full convolutional network instead follows the convolution layers with corresponding 1 × 1 convolutions and has no fully connected layer, hence the name full convolutional network. The calculation method of the full convolution is
$$y_n[i][j] = f_{k_n,\,s}\left(x\left[s_i \cdot i + \delta_i\right]\left[s_j \cdot j + \delta_j\right]\right) \qquad (11)$$
In the formula: 1 ≤ n ≤ N; y_n[i][j] is the convolution result of the nth convolution kernel at position (i, j); f_{k_n,s}(·) denotes convolving with the nth kernel k_n at step size s; s_i is the convolution step size in the horizontal direction; s_j is the convolution step size in the vertical direction; k_n is the nth convolution kernel; D_k is the convolution kernel width and height, the kernel size corresponding to D_k·D_k in step 2; δ_i and δ_j are positions within the convolution kernel, with 0 ≤ δ_i, δ_j ≤ D_k, the layer having N different types of convolution kernels in total. The sliding convolution operation of the kernel can thus be converted into a multiplication of two matrices. The result of convolving the corresponding image pixels can be expressed as
$$\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_N \end{bmatrix} \cdot \begin{bmatrix} I_{11} & I_{12} & \cdots \\ I_{21} & I_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \qquad (12)$$
Wherein: the matrix dimension on the left is [N, M·D_k·D_k]; the matrix dimension on the right is [M·D_k·D_k, w'·h']; the dimension after convolution is [N, w'·h']. In the matrix on the right, I denotes the image (img), and its subscripts are the image width and image height in turn, i.e. I_{wh}.
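The matrix form can be checked numerically with an im2col-style unfolding: the kernel matrix of dimension [N, M·D_k·D_k] multiplied by the unrolled image matrix of dimension [M·D_k·D_k, w'·h'] reproduces the sliding convolution. All sizes below are illustrative assumptions, a sketch rather than the patent's own implementation:

```python
import torch
import torch.nn.functional as F

M, N, Dk, h, w = 3, 8, 3, 10, 10  # assumed channels, kernels, kernel/image size
x = torch.randn(1, M, h, w)
weight = torch.randn(N, M, Dk, Dk)

# Right matrix: the image unrolled into [M*Dk*Dk, w'*h'] patch columns.
cols = F.unfold(x, kernel_size=Dk)[0]       # shape [M*Dk*Dk, w'*h']
# Left matrix: the N kernels flattened to [N, M*Dk*Dk], as in formula (12).
out_mm = weight.view(N, -1) @ cols          # shape [N, w'*h']

out_conv = F.conv2d(x, weight).view(N, -1)  # ordinary sliding convolution
print(torch.allclose(out_mm, out_conv, atol=1e-5))  # True
```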
Finally, through the deconvolution operation, [N, w'·h'] is converted back to the size of the input image, so the specific semantic information represented by each pixel is accurately identified and the loss of spatial information is avoided. The deconvolution is equivalent to the inverse operation of the convolution, i.e. the convolution is inverted with the transposed kernels:
$$X' = \begin{bmatrix} k_1^{\top} & k_2^{\top} & \cdots & k_N^{\top} \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \qquad (13)$$
In the formula: k_1, …, k_N are the weights corresponding to each convolution kernel, and y_n is the nth row of the convolution result. The weight matrix is changed from the original

$$\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_N \end{bmatrix} \quad \left(\text{dimension } [N,\; M \cdot D_k \cdot D_k]\right)$$

into

$$\begin{bmatrix} k_1^{\top} & k_2^{\top} & \cdots & k_N^{\top} \end{bmatrix} \quad \left(\text{dimension } [M \cdot D_k \cdot D_k,\; N]\right)$$
The weights are adjusted through training and carry the image semantic information characteristics.
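A minimal sketch of the deconvolution step: a transposed convolution maps the [N, w'·h'] score map back to the input image size, with trainable weights playing the role of the transposed kernel matrix in formula (13). The 28 × 28 feature size, the stride of 8 and cls = 4 output channels (matching the four vehicle classes of the embodiment below) are assumptions:

```python
import torch
import torch.nn as nn

cls = 4                              # assumed number of classes
feat = torch.randn(1, cls, 28, 28)   # [N, w', h'] class scores, N = cls here

# Upsample 28 -> 224: output size = (28 - 1) * 8 - 2 * 4 + 16 = 224, so every
# pixel of the original image receives a semantic score.
deconv = nn.ConvTranspose2d(cls, cls, kernel_size=16, stride=8, padding=4)
print(deconv(feat).shape)  # torch.Size([1, 4, 224, 224])
```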
Therefore, the network built from the deconvolution and full convolution operations can accommodate any image size, performs semantic analysis on every pixel of the image, achieves the purpose of rapidly segmenting the image, and locates image features quickly and accurately.
The invention has the advantages that:
aiming at the sample problem, the invention adopts a full convolution method to improve the image segmentation running speed, and has the most prominent characteristics that the image is subjected to lightweight processing, the segmentation efficiency of the model is improved under the condition of ensuring the segmentation precision, and the parameter quantity of the model is reduced in a channel convolution mode; and a multi-scale cavity convolution kernel is arranged, so that the receptive field of the model is reasonably and simply improved, and the generalization of the model is enhanced. The algorithm can be widely applied to the field of image positioning identification, such as logistics park vehicle identification and the like.
Drawings
FIG. 1 is a diagram illustrating a conventional convolution kernel convolution operation;
FIG. 2 is a schematic diagram of the convolution operation of the improved hole convolution kernel of the present invention;
FIGS. 3a to 3c show the multi-scale hole convolution kernels of the present invention, with FIG. 3a being a hole convolution kernel with a sampling rate of 1, FIG. 3b a sampling rate of 2, and FIG. 3c a sampling rate of 3;
FIG. 4 is a prior art convolution scheme;
FIG. 5 is a channel convolution scheme of the present invention;
FIG. 6 is a channel convolution structure of the present invention;
FIG. 7 is a full convolution network design structure of the present invention;
FIG. 8 is a schematic diagram of a full convolution matrix calculation process according to the present invention.
Note: in FIG. 6, DW is a channel convolution group, representing the fixed collocation formed by channel convolution kernels; BN is the batch normalization operation, which alleviates the shifting distribution of intermediate-layer data during training; Conv is the convolution layer operation; ReLU is a rectified linear unit, used as the activation function.
Note: in FIG. 8, k_1, …, k_N are the convolution kernels, and w_n^{δ_i δ_j} (notation reconstructed from the context of formula (11)) is the position weight of the nth convolution kernel at kernel position (δ_i, δ_j).
Detailed Description
The invention is implemented according to the technical scheme set forth above in the disclosure of the invention, namely by designing the multi-scale hole convolution kernel, the channel convolution network, and the full convolution connection and deconvolution network.
In order to verify the superiority of the invention, a logistics park vehicle is taken as an example, the following network model is constructed, and a comparison experiment is carried out:
firstly, network construction is carried out: four types of logistics vehicles, namely van trucks, traction trucks, dump trucks and tank trucks, are collected from the logistics park and are divided into 8000 training sets, 2000 testing sets and 1000 testing sets. The configuration of each parameter of the constructed network model structure is shown in the following table 1.
In Table 1: k is the convolution kernel size; s is the step size; p is the padding size; DW is a channel convolution group, representing the fixed collocation formed by channel convolution kernels; residual summation is used to facilitate gradient transfer in the large network; the per-layer activations and batch normalization (BN) help accelerate network training; ReLU is a rectified linear unit, used as the activation function.
TABLE 1 design of parameters of network model architecture
[Table 1 appears as an image in the original publication; the layer-by-layer parameter values are not recoverable from the source.]
The computer used in this example is configured with an NVIDIA GTX 1080 Ti graphics card with 11 GB of video memory at 1607 MHz.
Finally, the model test performance of the example network and the conventional network were compared, and the results are shown in table 2.
TABLE 2 comparison of lightweight segmentation model Performance
[Table 2 appears as an image in the original publication; the per-model metric values are not recoverable from the source, apart from the improvements summarized below.]
The evaluation index MPA in Table 2 denotes the mean pixel accuracy; MA denotes the ratio of the foreground area to the label area (mean accuracy); MIOU denotes the mean intersection over union, i.e. the ratio of the correctly predicted region to the union of the predicted area and the label area; the unit M·pic^-1 denotes the video memory, in megabytes (M), occupied by training one picture; the unit ms·iter^-1 denotes the time required per iteration, in milliseconds (ms). After the channel convolution is adopted, the occupied video memory is reduced by 51%, the training speed is improved by 78%, the testing speed is improved by 79%, every evaluation index of segmentation and positioning is greatly improved, and MIOU improves by the largest margin.
This embodiment verifies that the improved method raises the test performance of the model, namely the running speed of image segmentation.
The embodiments described in this specification merely illustrate the inventive concept; the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers the equivalents that those skilled in the art may conceive on the basis of the inventive concept.

Claims (1)

1. A method for improving the running speed of image segmentation comprises the following steps:
designing a multi-scale hole convolution kernel;
in order to avoid relying on the traditional convolution and maximum pooling method to increase the receptive field, a hole convolution kernel is adopted: the sampling rate is increased on the basis of the traditional convolution kernel, and the original convolution kernel is expanded by inserting holes between its sampling points;
thus, while the original calculation amount is kept, the receptive field is increased so that the image segmentation information is accurate enough; the calculation formula for the receptive field size based on the hole convolution kernel is
$$F_i = F_{i-1} + (k_i - 1) \cdot rate \cdot \prod_{j=1}^{i-1} s_j \qquad (1)$$
In the formula: f is the size of the current layer receptive field; the rate is the sampling rate of the void convolution kernel, i.e. the number of intervals, and can be regarded as 1 for the rate of the conventional convolution kernel and 2 for the rate of the void convolution; the traditional convolution receptive field calculation formula is
$$F_i = F_{i-1} + (k_i - 1) \cdot \prod_{j=1}^{i-1} s_j, \quad i = 1, 2, \ldots, n \qquad (2)$$
In the formula: fi-1The size of the receptive field of the previous layer; k is a radical ofiIs the convolution or pooling kernel size of the ith layer; n is the total number of layers of convolution; siConvolution step size Stride being the i-th layer of convolution kernel;
the multi-scale cavity convolution is designed by using the thought of multi-scale image change, and the sampling rate and the convolution kernel size are subjected to diversified processing, so that the method can adapt to the characteristic extraction process of targets with different sizes; the calculation of the multi-scale hole convolution is
$$y[i] = \sum_{k \in K} x[i + rate \cdot k] \cdot w[k] \qquad (3)$$
In the formula: y[i] is the convolution summation result at the ith sliding position; x is the input feature; K is the convolution kernel; k is the coordinate position of a parameter within the kernel, with k ∈ K; w[k] is the convolution kernel weight; rate is the sampling rate and takes the values 1, 2 and 3;
designing a channel convolution network;
because the traditional convolution mode is a dimension-increasing operation, a channel convolution mode is first adopted to achieve feature-convolution dimension reduction; the traditional convolution is split into two convolution layers, similar to the group operation in ResNet; on the premise of not affecting accuracy, the new structure shortens the calculation time to about 1/8 and reduces the parameter quantity to about 1/9, so it can be applied well on mobile terminals for real-time target detection, and the model compression effect is obvious;
for the conventional convolution, assume that the number of input feature channels is M, that the width and height of the convolution kernel are both D_k, and that the number of convolution kernels is N; then each position the convolution slides over involves N kernels with M·D_k·D_k parameters each, and the sliding step size is set to s; the calculation formula for the image size after sliding is
$$h' = \frac{h - D_k + 2 \cdot pad}{s} + 1 \qquad (4)$$

$$w' = \frac{w - D_k + 2 \cdot pad}{s} + 1 \qquad (5)$$
In the formula: h and w are the height and width before convolution; h' and w' are the height and width after convolution; pad is the boundary filling of the width and height; each point of the h'·w' feature map after convolution therefore corresponds to N × M·D_k·D_k parameters, so the total parameter quantity is

$$N \cdot M \cdot D_k \cdot D_k \cdot h' \cdot w' \qquad (6)$$
the improved channel convolution mode divides the convolution into two steps:
1) a D_k·D_k convolution is applied to each of the M channels separately; sliding with the same step size s, the dimension after convolution is h' × w', and the parameter quantity generated by this step is

$$D_k \cdot D_k \cdot M \cdot h' \cdot w' \qquad (7)$$
2) N convolution kernels of size 1 × 1 are set to perform the dimension-raising feature extraction; with step size 1, the feature map obtained in step 1) is subjected to feature extraction again, the original M channel features being processed by the N convolution kernels, and the calculated total parameter quantity is

$$M \cdot N \cdot h' \cdot w' \cdot 1 \cdot 1 \qquad (8)$$
integrating the convolution structures of the two steps, the final parameter quantity of the channel convolution is

$$D_k \cdot D_k \cdot M \cdot h' \cdot w' + M \cdot N \cdot h' \cdot w' \qquad (9)$$
as previously mentioned, comparing the parameter quantity of the conventional convolution kernel with that of the improved channel convolution gives

$$\frac{D_k \cdot D_k \cdot M \cdot h' \cdot w' + M \cdot N \cdot h' \cdot w'}{N \cdot M \cdot D_k \cdot D_k \cdot h' \cdot w'} = \frac{1}{N} + \frac{1}{D_k^{2}} \qquad (10)$$
From the analysis of equation (10), the channel convolution operation reduces the parameter amount;
designing a full convolution connection and deconvolution network;
the final layer of the traditional network structure adopts a fixed size, so an input picture must be converted to a fixed size in advance, which is unfavorable for acquiring the vehicle-length coordinates of logistics vehicles; in addition, the traditional fully connected layers discard spatial coordinates, so image spatial information is distorted and the target cannot be located effectively and accurately; to solve this information-loss problem, a full convolution connection mode is adopted to accurately locate the position coordinates of features in the picture;
the full connection of a traditional network converts the convolution feature map [b, c, h, w] of the preceding part into [b, c·h·w], i.e. [b, 4096], and then into [b, cls], where b denotes the batch size and cls the number of classes; the full convolutional network instead follows the convolution layers with corresponding 1 × 1 convolutions and has no fully connected layer, hence the term full convolutional network; the calculation method of the full convolution is
$$y_n[i][j] = f_{k_n,\,s}\left(x\left[s_i \cdot i + \delta_i\right]\left[s_j \cdot j + \delta_j\right]\right) \qquad (11)$$
In the formula: 1 ≤ n ≤ N; y_n[i][j] is the convolution result of the nth convolution kernel at position (i, j); f_{k_n,s}(·) denotes convolving with the nth kernel k_n at step size s; s_i is the convolution step size in the horizontal direction; s_j is the convolution step size in the vertical direction; k_n is the nth convolution kernel; D_k is the convolution kernel width and height, the kernel size corresponding to D_k·D_k in the second step; δ_i and δ_j are positions within the convolution kernel, the full convolution connection network layer having N convolution kernels of different types in total, with 0 ≤ δ_i, δ_j ≤ D_k; the sliding convolution operation of the kernel can thus be converted into two matrix multiplication operations; the result of convolving the corresponding image pixels can be expressed as
$$\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_N \end{bmatrix} \cdot \begin{bmatrix} I_{11} & I_{12} & \cdots \\ I_{21} & I_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \qquad (12)$$
Wherein: the matrix dimension on the left is [N, M·D_k·D_k]; the matrix dimension on the right is [M·D_k·D_k, w'·h']; the dimension after convolution is [N, w'·h']; in the matrix on the right, I denotes the image (img), and its subscripts are the image width and image height in turn, i.e. I_{wh};
finally, through the deconvolution operation, [N, w'·h'] is converted back to the size of the input image, so the specific semantic information represented by each pixel is accurately identified and the loss of spatial information is avoided; the deconvolution is equivalent to the inverse operation of the convolution, i.e. the convolution is inverted with the transposed kernels:
$$X' = \begin{bmatrix} k_1^{\top} & k_2^{\top} & \cdots & k_N^{\top} \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \qquad (13)$$
In the formula: k_1, …, k_N are the weights corresponding to each convolution kernel, and y_n is the nth row of the convolution result; the weight matrix is changed from the original

$$\begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_N \end{bmatrix} \quad \left(\text{dimension } [N,\; M \cdot D_k \cdot D_k]\right)$$

into

$$\begin{bmatrix} k_1^{\top} & k_2^{\top} & \cdots & k_N^{\top} \end{bmatrix} \quad \left(\text{dimension } [M \cdot D_k \cdot D_k,\; N]\right);$$
the weights are adjusted through training and carry the image semantic information characteristics.
Priority and publication data

Application: CN201910535642.4A, filed 2019-06-20 (priority date 2019-06-20), by Zhejiang University of Technology ZJUT
Family ID: 68480779
Publications: CN110458841A, published 2019-11-15; CN110458841B, granted 2021-06-08
Country: China (CN)
Status: Active




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant