CN111179272A - Rapid semantic segmentation method for road scene - Google Patents
- Publication number
- CN111179272A (application CN201911256375.3A)
- Authority
- CN
- China
- Prior art keywords
- model
- semantic segmentation
- image
- input
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a road scene-oriented rapid semantic segmentation method, which specifically comprises the following steps: step 1, constructing a model based on a convolutional neural network; step 2, training the model constructed in step 1 with training data; step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the resulting loss; step 4, updating the model parameters according to the gradient obtained in step 3. The segmentation method provided by the invention segments images rapidly while achieving higher precision.
Description
Technical Field
The invention belongs to the technical field of computer vision, and relates to a road scene-oriented rapid semantic segmentation method.
Background
In recent years, image semantic segmentation has become a research hotspot in computer vision and can be applied to scenarios such as robot vision and environment perception for intelligent driving. Semantic segmentation means accurately separating the multiple target elements in a captured picture along their respective outlines. To a computer, an image is a multi-channel matrix of values. For a computer to understand and segment target elements, it must find the numerical features of each target element in the original matrix, determine from those features which target elements the image contains, and then analyze at a deeper level how those elements are combined into the picture. In general, the purpose of image semantic segmentation is to identify the objects in a picture and their interrelationships, which simulates the visual system of the human brain. Humans understand their surroundings through the visual perception system, and image segmentation technology acquires, understands and recognizes the information in pictures by imitating that system. As a very important task in image processing and pattern recognition, image semantic segmentation is a preliminary step for many computer vision techniques. Detection and tracking based on the segmentation result can work on a smaller search area and may even be given the outlines of some objects directly; image captioning based on the segmentation result can add a large amount of positional information about the targets; and image style transfer based on the segmentation result can quickly locate and replace the background area or a specific target. Image semantic segmentation therefore has strong theoretical and research value.
Early semantic segmentation methods relied on hand-crafted features, for example using a random decision forest to predict classification probabilities and a conditional random field (CRF) probabilistic model to handle uncertainty and integrate context information in the image. In recent years, convolutional neural networks (CNNs) have developed rapidly in computer vision thanks to large-scale training data sets and high-performance graphics processing units (GPUs). In addition, open-source deep learning frameworks such as Caffe, MXNet and TensorFlow have facilitated the development of deep learning algorithms. Powerful deep neural networks have greatly reduced classification error, and semantic segmentation has made great progress along the way.
At present, many researchers try to obtain the highest possible accuracy with the smallest computation and parameter counts so that the model can run on a vehicle-mounted terminal platform; examples include the image classification models SqueezeNet, ShuffleNet and MobileNet. In semantic segmentation, the input size is 3 × H × W and the output size is C × H × W: the output has exactly the same width and height as the input, and C is often much larger than 3, so producing the output requires a large number of operations. Meanwhile, to obtain higher segmentation accuracy, the image is downsampled only a few times, so the intermediate features remain wide and high, which further increases the computation. For these two reasons, semantic segmentation is a very computation-intensive task.
There are generally two ways to reduce the computation of semantic segmentation: reducing the picture size and reducing the model complexity. Reducing the picture size cuts computation most directly, but the image loses a lot of detail, which hurts accuracy. Reducing the model complexity weakens the feature extraction capability of the model, which also hurts segmentation accuracy. Existing semantic segmentation models for road scenes therefore struggle to balance real-time performance against high segmentation precision.
Most existing semantic segmentation frameworks are based on the fully convolutional network (FCN). The FCN improved semantic segmentation performance by converting a classification network into a fully convolutional one, that is, by replacing the fully connected layers of the classification model with convolutional layers. However, the FCN does not consider the relationships between pixels: it omits the spatial regularization step used in general pixel-classification-based segmentation methods and therefore lacks spatial consistency.
Disclosure of Invention
The invention aims to provide a road scene-oriented fast semantic segmentation method, which solves the problems of low image precision and weak feature extraction capability of the existing segmentation method.
The technical scheme adopted by the invention is a rapid semantic segmentation method for road scenes, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the model loss trained in the step 2 by using a loss function, and calculating the gradient according to the obtained model loss result;
step 4, updating the model parameters according to the gradient obtained in the step 3.
The invention is also characterized in that:
the specific process of the step 1 is as follows: an input image is processed by a convolutional neural network composed of a plurality of convolution kernels, so that 3 × H × W input data yields a 1 × H × W prediction output, wherein H is the height of the input image and W is the width of the input image;
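the model is constructed according to the following equation (1):
$$F_{out}^{(i)} = F_{in} * K_i + b, \quad i = 1, \dots, N \qquad (1)$$
wherein $F_{out}$ is the output feature and $F_{out}^{(i)}$ is its i-th channel, $F_{in}$ is the input feature, $K_i$ is the i-th convolution kernel, $*$ denotes the sliding multiply-add (convolution) operation, N is the number of output channels, and b is the offset (bias);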
since the image is two-dimensional data, the size of the input feature is $C_{in} \times H_{in} \times W_{in}$; using a convolution kernel of size $C_{out} \times C_{in} \times H_k \times W_k$, the output feature obtained is $C_{out} \times H_{out} \times W_{out}$;
wherein $C_{in}$ and $C_{out}$ are the numbers of channels of the input and output features, $H_{in}$ and $W_{in}$ are the height and width of the input features, $H_k$ and $W_k$ are the height and width of the convolution kernel, and $H_{out}$ and $W_{out}$ are the height and width of the output features;
for an input feature of size $C_{in} \times H_{in} \times W_{in}$, $C_{out}$ convolution kernels of size $C_{in} \times H_k \times W_k$ perform a sliding multiply-add operation over the input feature to obtain $C_{out}$ features of size $H_{out} \times W_{out}$.
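The height and width of the output features in step 1 are calculated as follows:
$$H_{out} = \left\lfloor \frac{H_{in} + 2p - H_k}{s} \right\rfloor + 1 \qquad (2)$$
$$W_{out} = \left\lfloor \frac{W_{in} + 2p - W_k}{s} \right\rfloor + 1 \qquad (3)$$
wherein p is the padding width on each border and s is the stride.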
The specific process of the step 2 is as follows: the training data comprises manually collected images and the label images corresponding to them;
the training process is as follows: the model learns to produce the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.
In step 2, the gray value of each pixel of the label image directly represents the class to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
The specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameters $\omega$ is $f_\omega$, and the loss function used is L, the model target is:
$$\omega = \arg\min_{\omega} L(f_\omega(X), Y) \qquad (4);$$
step 3.2, determining the semantic segmentation loss function from the model target obtained in step 3.1:
semantic segmentation is essentially a classification task, and each pixel of the image has one and only one class, that is, exactly one value in the pixel label is 1 and the rest are 0; the loss function for semantic segmentation is then:
$$L(f_\omega(X), Y) = -\log f_\omega(x_t) \qquad (8);$$
wherein $f_\omega(x_t)$ is the predicted probability of the class whose label is 1;
step 3.3, in order to reduce the complexity of the model, weight decay is added to the model, and the loss function with weight decay is:
$$L(f_\omega(X), Y) = -\log f_\omega(x_t) + \alpha \cdot \omega^2 \qquad (9);$$
wherein $\alpha$ is the weight of the decay term;
step 3.4, normalizing the output of the model with the following formula:
$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \qquad (10)$$
wherein C is the number of channels, $x_i$ is the value output by the model for channel i at a given pixel position, and $y_i$ is the corresponding predicted probability;
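step 3.5, based on the result of step 3.4, obtaining the gradient value of each parameter participating in the operation by the chain rule of differentiation, wherein the gradient is calculated by the following formula (11):
$$\frac{\partial L}{\partial \omega} = -\frac{\partial \log y_t}{\partial \omega} = \sum_{k=1}^{C} y_k \frac{\partial x_k}{\partial \omega} - \frac{\partial x_t}{\partial \omega} \qquad (11)$$
wherein $y_t$ is the predicted probability of the labeled class, and $x_k$ and $x_t$ are both obtained by convolution with the parameter $\omega$.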
The specific process of step 4 is as follows:
parameters in the model are updated according to the following formulas (12) and (13):
$$m_{t+1} = \rho \cdot m_t + \Delta\omega \qquad (12);$$
$$\omega_{t+1} = \omega_t - lr \cdot m_{t+1} \qquad (13);$$
wherein $m_t$ is the current momentum, $m_{t+1}$ is the updated momentum, $\omega_t$ is the current parameter, $\omega_{t+1}$ is the updated parameter, $\Delta\omega$ is the gradient, $\rho$ is a weight parameter between 0 and 1, and lr is the update step size (learning rate).
The invention has the following advantages: the model is built on a convolutional neural network, and a multi-level convolution module both speeds up feature extraction and replaces the 3 × 3 convolution to extract more and deeper high-level features, so that images can be segmented rapidly while achieving higher precision.
Drawings
FIG. 1 is a schematic diagram of the image semantic segmentation process;
FIG. 2 is a schematic structural diagram of an MConvblock adopted in a semantic segmentation model of an embodiment of a road scene-oriented fast semantic segmentation method of the invention;
FIG. 3 is a schematic diagram of an MResblock structure in a semantic segmentation model according to an embodiment of the road scene-oriented fast semantic segmentation method of the present invention;
FIG. 4 is a schematic network structure diagram of Road Scene Deeplabv3+ in the semantic segmentation model according to the embodiment of the Road Scene-oriented fast semantic segmentation method of the present invention;
FIG. 5(a) - (c) are experimental result diagrams of the road scene-oriented fast semantic segmentation method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Semantic segmentation, which is a pixel-level segmentation task providing rich target information, is an important research problem in the field of robot perception, and has been widely applied in many fields, such as automatic driving and robot navigation. In the application of road scene understanding, the semantic segmentation model should accurately describe the appearances and the categories of different objects, and in addition, the semantic segmentation model needs to understand the spatial relationship between different objects.
Image segmentation is the division of an image into several disjoint regions containing a single object, typically in a bottom-up manner. The semantic segmentation is to divide each pixel in the image into a certain category with semantics, which is usually implemented in a top-down manner. The final goal of semantic segmentation is to obtain a model in a top-down manner, so that it can accurately predict a label with semantic meaning for each pixel in an input image, and a schematic diagram thereof is shown in fig. 1.
The invention relates to a road scene-oriented rapid semantic segmentation method, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
Convolutional neural networks are also referred to as area-dependent networks, because each convolution output is computed from a local region of the input features.
The convolutional neural network is composed of a series of convolutional kernels with fixed height, width and parameters, the convolutional kernels perform sliding operation on input features, and multiplication and addition operation is performed on numerical values of covered areas to obtain output features.
The output features can serve as the input features of the next convolution, and stacking multiple convolution kernels in this way to process the features further forms a convolutional neural network.
This stacking can reach tens or even hundreds of layers, forming a feature extraction process for input images that goes from low-level area correlation to high-level semantic correlation.
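The model constructed based on the convolutional neural network is given by:
$$F_{out}^{(i)} = F_{in} * K_i + b, \quad i = 1, \dots, N \qquad (1)$$
wherein $F_{out}$ is the output feature and $F_{out}^{(i)}$ is its i-th channel, $F_{in}$ is the input feature, $K_i$ is the i-th convolution kernel, $*$ denotes the sliding multiply-add (convolution) operation, N is the number of output channels, and b is the offset (bias).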
Since the image is two-dimensional data, the input feature size is $C_{in} \times H_{in} \times W_{in}$; using a convolution kernel of size $C_{out} \times C_{in} \times H_k \times W_k$, the output feature obtained is $C_{out} \times H_{out} \times W_{out}$. Here $C_{in}$ and $C_{out}$ are the numbers of channels of the input and output features, $H_{in}$ and $W_{in}$ are the height and width of the input features, $H_k$ and $W_k$ are the height and width of the convolution kernel, and $H_{out}$ and $W_{out}$ are the height and width of the output features.
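The width and height of the output features are calculated as follows:
$$H_{out} = \left\lfloor \frac{H_{in} + 2p - H_k}{s} \right\rfloor + 1 \qquad (2)$$
$$W_{out} = \left\lfloor \frac{W_{in} + 2p - W_k}{s} \right\rfloor + 1 \qquad (3)$$
where p is the padding width on each border and s is the stride.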
For an input feature of size $C_{in} \times H_{in} \times W_{in}$, $C_{out}$ convolution kernels of size $C_{in} \times H_k \times W_k$ perform a sliding multiply-add operation over the input feature to obtain $C_{out}$ features of size $H_{out} \times W_{out}$.
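The sliding multiply-add operation can be illustrated with a deliberately naive sketch in plain Python. The function name `conv2d`, the nested-list data layout and the single shared scalar `bias` are assumptions made for illustration only; they are not taken from the patent, and a practical model would use an optimized framework operator instead.

```python
def conv2d(f_in, kernels, bias=0.0, stride=1, pad=0):
    """Naive sliding multiply-add convolution (illustrative sketch).

    f_in:    input features as nested lists, shape [C_in][H_in][W_in]
    kernels: C_out kernels as nested lists, shape [C_out][C_in][H_k][W_k]
    Returns output features of shape [C_out][H_out][W_out].
    """
    c_in, h_in, w_in = len(f_in), len(f_in[0]), len(f_in[0][0])
    h_k, w_k = len(kernels[0][0]), len(kernels[0][0][0])

    # Zero-pad the border of every input channel.
    padded = [
        [[0.0] * (w_in + 2 * pad) for _ in range(pad)]
        + [[0.0] * pad + list(row) + [0.0] * pad for row in channel]
        + [[0.0] * (w_in + 2 * pad) for _ in range(pad)]
        for channel in f_in
    ]

    h_out = (h_in + 2 * pad - h_k) // stride + 1
    w_out = (w_in + 2 * pad - w_k) // stride + 1

    f_out = []
    for kernel in kernels:  # one kernel per output channel
        channel_out = []
        for i in range(h_out):
            row_out = []
            for j in range(w_out):
                acc = bias
                # Multiply-add over the region covered by the kernel.
                for c in range(c_in):
                    for u in range(h_k):
                        for v in range(w_k):
                            acc += (padded[c][i * stride + u][j * stride + v]
                                    * kernel[c][u][v])
                row_out.append(acc)
            channel_out.append(row_out)
        f_out.append(channel_out)
    return f_out
```

For example, a 1 × 3 × 3 input convolved with one 1 × 2 × 2 kernel and no padding gives a 1 × 2 × 2 output, and the same input with a 3 × 3 kernel and `pad=1` keeps its 3 × 3 size.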
To keep the width and height of the output features consistent with the input features during convolution, zeros are usually padded around the border of the input features, with the amount of padding chosen so that the width and height are unchanged after convolution.
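As a small worked example of the output-size rule and of size-preserving zero padding, the following sketch evaluates the standard formula (input size plus twice the padding, minus the kernel size, divided by the stride and rounded down, plus one). The helper names `conv_output_size` and `same_padding` are hypothetical, and `same_padding` assumes an odd kernel size with stride 1.

```python
def conv_output_size(size_in, kernel, pad=0, stride=1):
    """Output height or width of a convolution along one dimension."""
    return (size_in + 2 * pad - kernel) // stride + 1


def same_padding(kernel):
    """Zero padding that keeps the size unchanged at stride 1.

    Only defined here for odd kernel sizes, where the padding is symmetric.
    """
    if kernel % 2 == 0:
        raise ValueError("symmetric size-preserving padding needs an odd kernel")
    return (kernel - 1) // 2


if __name__ == "__main__":
    p = same_padding(3)
    # A 3x3 kernel with padding 1 and stride 1 preserves a 1024-pixel side.
    print(conv_output_size(1024, 3, pad=p, stride=1))
    # The same kernel with stride 2 halves it.
    print(conv_output_size(1024, 3, pad=p, stride=2))
```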
Semantic segmentation requires an output with the same width and height as the input image, in which each pixel corresponds to a class prediction.
The semantic segmentation process therefore uses a convolutional neural network built from multiple convolution kernels to process the input image, so that 3 × H × W input data yields a 1 × H × W prediction output, where H is the height of the input image and W is the width of the input image.
In practice, a probability is usually assigned to every possible class: the convolutional neural network produces a C × H × W prediction, and the class with the highest probability at each pixel position is then selected as the final class, giving the 1 × H × W prediction output, where C is the total number of classes.
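The reduction from a C × H × W probability volume to a single-channel class map is a per-pixel argmax. A minimal sketch follows; the function name `predict_classes` and the nested-list layout are assumptions for illustration.

```python
def predict_classes(probs):
    """Pick the most probable class at every pixel.

    probs: per-class probabilities as nested lists, shape [C][H][W]
    Returns an [H][W] map of class indices in the range 0..C-1.
    """
    c, h, w = len(probs), len(probs[0]), len(probs[0][0])
    return [
        [max(range(c), key=lambda k: probs[k][i][j]) for j in range(w)]
        for i in range(h)
    ]


if __name__ == "__main__":
    # Three classes, an image of one row and two pixels.
    probs = [
        [[0.1, 0.7]],  # class 0
        [[0.6, 0.2]],  # class 1
        [[0.3, 0.1]],  # class 2
    ]
    print(predict_classes(probs))
```

Here the first pixel is assigned class 1 and the second pixel class 0, since those classes have the highest probability at each position.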
Step 2, training the model constructed in the step 1 by using training data;
Training data for semantic segmentation usually consists of a series of manually collected images and their corresponding label images, and the purpose of model training is to make the model fit the provided data, i.e., to obtain the label images from the input images. The input image is an ordinary color RGB image, while the label image is a single-channel grayscale image in which the gray value of a pixel directly represents the class to which the pixel belongs. Assuming that there are C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
Step 3, calculating the model loss trained in the step 2 by using a loss function, and calculating the gradient according to the obtained model loss result;
The loss function is a function that measures how closely the prediction result matches the label data.
Generally, the closer the prediction result is to the label data, the smaller the value returned by the loss function. Assuming that the input data of the model is X, the label data is Y, the model with parameters $\omega$ is $f_\omega$, and the loss function used is L, the model target is:
$$\omega = \arg\min_{\omega} L(f_\omega(X), Y) \qquad (4);$$
that is, to find the parameters $\omega$ for which the model $f_\omega$ gives the minimum value of the loss function L between the prediction result and the label.
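The loss functions commonly used in deep learning are mean absolute error (MAE), mean square error (MSE), and cross entropy. Cross entropy is the most widely used in classification tasks; it measures the difference between two probability distributions.
The loss function formula for MAE is:
$$L(f_\omega(X), Y) = \frac{1}{n} \sum_{i=1}^{n} \left| f_\omega(x_i) - Y_i \right| \qquad (5);$$
the loss function formula for MSE is:
$$L(f_\omega(X), Y) = \frac{1}{n} \sum_{i=1}^{n} \left( f_\omega(x_i) - Y_i \right)^2 \qquad (6);$$
the cross-entropy loss function is formulated as:
$$L(f_\omega(X), Y) = -\sum_{i=1}^{C} Y_i \log f_\omega(x_i) \qquad (7);$$
wherein $f_\omega(x_i)$ is the i-th predicted value, $Y_i$ is the corresponding label value, n is the number of values, and in (7) the sum runs over the C classes, with $f_\omega(x_i)$ the predicted probability of class i.
Among these loss functions, cross entropy is the one commonly used;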
semantic segmentation is essentially a classification task, and each pixel has one and only one class, i.e., exactly one value in its label is 1 and the rest are 0. The cross-entropy loss function can then be simplified in the semantic segmentation task to:
$$L(f_\omega(X), Y) = -\log f_\omega(x_t) \qquad (8);$$
wherein $f_\omega(x_t)$ is the predicted probability of the class whose label is 1.
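The three losses and the one-hot simplification can be checked numerically with a short sketch. The function names and the list-based inputs are assumptions for illustration; the definitions used are the standard ones for MAE, MSE and cross entropy.

```python
import math


def mae(pred, label):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - y) for p, y in zip(pred, label)) / len(pred)


def mse(pred, label):
    """Mean square error between two equally long sequences."""
    return sum((p - y) ** 2 for p, y in zip(pred, label)) / len(pred)


def cross_entropy(probs, one_hot):
    """Cross entropy between predicted class probabilities and a label.

    Terms whose label value is 0 contribute nothing, so they are skipped.
    """
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)


def simplified_cross_entropy(probs, t):
    """With a one-hot label, only the labeled class t survives."""
    return -math.log(probs[t])


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]   # predicted probabilities for one pixel
    one_hot = [1, 0, 0]       # the pixel belongs to class 0
    print(cross_entropy(probs, one_hot))
    print(simplified_cross_entropy(probs, 0))
```

Both calls print the same value, which is why the per-pixel loss only needs the predicted probability of the labeled class.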
Weight decay is usually added to the model to reduce its complexity. Generally, a squared-error term on the parameters is used as the weight-decay loss; with cross entropy as the main loss function, this gives:
$$L(f_\omega(X), Y) = -\log f_\omega(x_t) + \alpha \cdot \omega^2 \qquad (9);$$
where $\alpha$ is the weight of the decay term, typically set to 0.0001.
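A minimal sketch of the loss with weight decay follows. It assumes that the squared parameter term is read as the sum of squares over all parameters, which is the usual interpretation but is not spelled out in the text; the function name `loss_with_weight_decay` is hypothetical.

```python
import math


def loss_with_weight_decay(prob_true_class, weights, alpha=1e-4):
    """Cross-entropy term for one pixel plus an L2 weight-decay penalty.

    prob_true_class: predicted probability of the labeled class
    weights:         flat list of model parameters (assumed layout)
    alpha:           weight of the decay term
    """
    data_term = -math.log(prob_true_class)
    decay_term = alpha * sum(w * w for w in weights)  # assumed: sum of squares
    return data_term + decay_term


if __name__ == "__main__":
    # The penalty grows with the magnitude of the parameters.
    print(loss_with_weight_decay(0.7, [0.5, -0.5]))
    print(loss_with_weight_decay(0.7, [5.0, -5.0]))
```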
Since the result of the last convolution of the model can take any value, it needs to be normalized to between 0 and 1 so that the values sum to 1 over the channels, i.e., the predicted probabilities of all classes at each pixel position should sum to 1 in order to correspond to the provided label.
To achieve this, the output is typically normalized using softmax, whose formula is:
$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \qquad (10)$$
wherein C is the number of channels, $x_i$ is the value output by the model for channel i at a given pixel position, and $y_i$ is the corresponding predicted probability.
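The softmax normalization for one pixel position can be sketched as below. Subtracting the maximum before exponentiating is a common numerical-stability measure that is not mentioned in the text; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(x):
    """Normalize per-channel model outputs at one pixel into probabilities."""
    m = max(x)  # stability shift; cancels out in the ratio below
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    y = softmax([2.0, 1.0, 0.1])
    print(y)        # each value lies between 0 and 1
    print(sum(y))   # the values sum to 1 (up to rounding)
```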
After the predicted probabilities are obtained, the loss function is used to calculate the difference between the prediction and the label, and the gradient value of each parameter participating in the operation is obtained by the chain rule of differentiation.
In the invention, the gradient of the parameter $\omega$ is calculated from the cross-entropy loss; letting $y_t$ be the predicted probability of the labeled class:
$$\frac{\partial L}{\partial \omega} = -\frac{\partial \log y_t}{\partial \omega} = \sum_{k=1}^{C} y_k \frac{\partial x_k}{\partial \omega} - \frac{\partial x_t}{\partial \omega} \qquad (11)$$
wherein $x_k$ and $x_t$ are both obtained by convolution with the parameter $\omega$. The operations involved are multiply-add operations and differentiable nonlinear operations, i.e., every operation is differentiable, so $\partial x_k / \partial \omega$ and $\partial x_t / \partial \omega$ can be obtained by the chain rule according to the specific structure of the model.
Therefore, once the network model and the loss function are determined, the formula of the gradient is determined. When the network has more than one layer, the way the gradients of the model parameters are obtained closely mirrors the way the features are obtained: the gradients are computed by traversing the network in reverse along the path by which the features were computed.
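For the softmax and cross-entropy combination, the derivative of the loss with respect to each pre-softmax value $x_k$ has a simple closed form: the predicted probability $y_k$, minus 1 for the labeled class t. The chain rule then multiplies this factor by the derivative of $x_k$ with respect to each parameter. The sketch below checks the closed form against central finite differences; all function names are assumptions for illustration.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(x, t):
    """Cross-entropy loss -log(y_t) for pre-softmax values x and label t."""
    return -math.log(softmax(x)[t])


def ce_grad(x, t):
    """Closed-form gradient of the loss w.r.t. x: y_k - 1[k == t]."""
    y = softmax(x)
    return [y_k - (1.0 if k == t else 0.0) for k, y_k in enumerate(y)]


def numeric_grad(x, t, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += eps
        lo[k] -= eps
        grads.append((ce_loss(hi, t) - ce_loss(lo, t)) / (2 * eps))
    return grads


if __name__ == "__main__":
    x, t = [2.0, -1.0, 0.5], 0
    print(ce_grad(x, t))
    print(numeric_grad(x, t))
```

The two printed vectors agree to within the finite-difference error, and the closed-form entries sum to zero because the probabilities sum to 1.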
Step 4, updating the model parameters according to the gradient obtained in the step 3.
After the gradients of the parameters are obtained, the parameters in the model are updated according to them. The invention uses stochastic gradient descent (SGD) with momentum, the most common optimizer in semantic segmentation. The word "stochastic" refers to the random selection of samples during training. Because the training samples are usually too numerous to be placed on the GPU all at once, a fixed number of samples is randomly selected from the training set at each training step. The invention draws data from the training set randomly without replacement and trains the model on the drawn data. After each training step yields the gradient, the parameters in the model are updated according to the following rules:
$$m_{t+1} = \rho \cdot m_t + \Delta\omega \qquad (12);$$
$$\omega_{t+1} = \omega_t - lr \cdot m_{t+1} \qquad (13);$$
wherein $m_t$ is the current momentum, $m_{t+1}$ is the updated momentum, $\omega_t$ is the current parameter, $\omega_{t+1}$ is the updated parameter, $\Delta\omega$ is the gradient obtained by back propagation, $\rho$ is a weight parameter between 0 and 1, and lr is the update step size (learning rate), generally between 0 and 1.
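The momentum update and the sampling without replacement can be sketched in a few lines. The function names, the flat-list parameter layout, the fixed seed and the default values of `lr` and `rho` are assumptions for illustration, not values stated in the patent.

```python
import random


def minibatches(samples, batch_size, seed=0):
    """Yield randomly ordered batches; each sample is drawn once per pass."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def sgd_momentum_step(params, grads, momentum, lr=0.01, rho=0.9):
    """One update: m <- rho * m + grad, then w <- w - lr * m."""
    new_momentum = [rho * m + g for m, g in zip(momentum, grads)]
    new_params = [w - lr * m for w, m in zip(params, new_momentum)]
    return new_params, new_momentum


if __name__ == "__main__":
    # Toy problem: minimize w^2, whose gradient is 2w, starting from w = 1.
    w, m = [1.0], [0.0]
    for _ in range(200):
        grad = [2.0 * w[0]]
        w, m = sgd_momentum_step(w, grad, m, lr=0.1, rho=0.9)
    print(w)  # approaches 0
```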
After the model, data, loss function and parameter update rule are determined, the model can be trained iteratively on the data set. Training can stop at a target number of iterations or when the accuracy reaches a given standard. During training, part of the data in the training set is set aside as a validation set that does not participate in training. After each iteration, the model is evaluated on the validation set, and the training status is judged by comparing the performance of the model on the training set and the validation set. For example, in semantic segmentation, if the model obtains a significantly higher intersection over union (IoU) on the training data than on the validation data, the model is usually overfitting, and training should be stopped so that the model can be readjusted. If the results on both the training set and the validation set stay poor, the model parameters are failing to converge, and training should be stopped to change the training strategy. If the precision on the training set and the validation set differs little and keeps increasing, training is going well. After training is completed, the model can be run on the test set to observe how it performs in actual application.
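The cross-over ratio used to compare training and validation performance is the intersection over union (IoU): for each class, the number of pixels where prediction and label both have that class, divided by the number of pixels where either has it. A minimal sketch of the mean IoU over classes follows; the function name `mean_iou` and the choice to average only over classes that actually occur are assumptions for illustration.

```python
def mean_iou(pred, label, num_classes):
    """Mean intersection over union between two [H][W] class-index maps.

    Classes that appear in neither map are left out of the average.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, label_row in zip(pred, label):
        for p, y in zip(pred_row, label_row):
            if p == y:
                inter[p] += 1
                union[p] += 1
            else:
                # The pixel is in the predicted set of p and the label set of y.
                union[p] += 1
                union[y] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [[0, 1], [1, 1]]
    label = [[0, 1], [0, 1]]
    # Class 0: 1 shared pixel out of 2; class 1: 2 shared pixels out of 3.
    print(mean_iou(pred, label, num_classes=2))
```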
Examples
Following the road scene-oriented rapid semantic segmentation method, a fast semantic segmentation model based on a multi-convolution module is designed, with Deeplabv3+ as the overall framework of the model. To improve the running speed of the model, the invention designs a shallow network as the framework network of Deeplabv3+; the network is named Road Scene Deeplabv3+. Table 1 shows the network structure parameters of Road Scene Deeplabv3+. The structure is similar to the feature extraction part of Darknet-53 but has fewer layers and fewer parameters. The invention also designs a multi-level convolution module, MConvblock, to obtain multi-level features.
TABLE 1 Road Scene Deeplabv3+ structural parameters
Name | Number of repetitions | Output channels | Operation |
Conv1 | 1 | 16 | MConvblock |
Sub1 | 1 | 32 | 3×3 conv + BN + Leaky, stride=2 |
Res1 | 1 | 32 | MResblock |
Sub2 | 1 | 64 | 3×3conv+BN+Leaky,stride=2 |
Res2 | 2 | 64 | MResblock |
Sub3 | 1 | 128 | 3×3conv+BN+Leaky,stride=2 |
Res3 | 2 | 128 | MResblock |
Sub4 | 1 | 256 | 3×3conv+BN+Leaky,stride=2 |
Res4 | 2 | 256 | MResblock |
Sub5 | 1 | 512 | 3×3conv+BN+Leaky,stride=2 |
Res5 | 1 | 512 | MResblock |
Conv2 | 1 | 1024 | MConvblock |
The structures of MConvblock and MResblock are shown in fig. 2 and 3, respectively, where K is the number of input channels and C is the number of output channels. Here it can be seen that MResblock is a residual network structure designed based on MConvblock, so the output size and the input size of MResblock are identical.
Fig. 4 shows the network structure of Road Scene Deeplabv3+ designed by the present invention, in which Block1 outputs 4× downsampled features, Block2 outputs 16× downsampled features, and C denotes the number of channels of the input features. Conv1 to Res2 and Sub3 to Conv2 correspond to Table 1. So that the framework network downsamples by a factor of only 16, i.e., four successive 2× downsamplings, the present invention replaces the convolution stride in Sub5 with 2 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with a dilation rate of 2. To extract more high-level features, the invention adds an atrous convolution with a dilation rate of 24 to the Atrous Spatial Pyramid Pooling (ASPP) module and replaces the 3 × 3 convolution in the ASPP with a multi-level convolution module, where MConv denotes the MConvblock of Fig. 2; each convolution inside the multi-level convolution module in the ASPP is dilated accordingly. Each module in the ASPP of the present invention has 64 output channels, far fewer than the 256 of the original Deeplabv3+, which reduces the amount of computation. Likewise, the present invention reduces the parameters of each convolution in the upsampling path, further reducing the amount of computation. After the high-level features of different scales are fused, they are upsampled 4× by bilinear interpolation and fused with the 4× downsampled features. Finally, three convolutions are applied, and the final semantic segmentation result is obtained by another 4× bilinear interpolation upsampling. With a GTX 1080 Ti GPU as the running platform, the model of the present invention processes images of size 2048 × 1024 × 3 at 0.057 s per frame and 720p images, i.e., of size 1920 × 720 × 3, at 0.03 s per frame.
Meanwhile, when the model parameters are stored with 32-bit floating point precision, they occupy only 20 MB of storage space. Overall, the fast semantic segmentation model designed by the invention meets the requirements of road scene semantic segmentation in both speed and storage space.
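The spatial sizes along the path described above can be traced with a short sketch. It is illustrative only: the function name `trace_sizes` is hypothetical, the input size is assumed to be divisible by 16, and only the 4× and 16× downsampling factors and the two 4× upsampling steps are taken from the text.

```python
def trace_sizes(height, width):
    """Trace feature-map sizes (height, width) through the described path."""
    block1 = (height // 4, width // 4)        # Block1: 4x downsampled features
    block2 = (height // 16, width // 16)      # Block2 and ASPP: 16x downsampled features
    fused = (block2[0] * 4, block2[1] * 4)    # 4x bilinear upsampling, now matches Block1
    output = (fused[0] * 4, fused[1] * 4)     # final 4x bilinear upsampling
    return {"block1": block1, "block2": block2, "fused": fused, "output": output}


# A 2048 x 1024 image has height 1024 and width 2048.
sizes = trace_sizes(1024, 2048)
for name, size in sizes.items():
    print(name, size)
```

The upsampled high-level features and the Block1 features have the same spatial size, which is what allows them to be fused before the last three convolutions.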
The features of the above embodiment are as follows:
(1) To increase the running speed of the model, a shallow network, Road Scene Deeplabv3+, is designed as the framework network for fast semantic segmentation. Table 1 lists the network structure parameters of Road Scene Deeplabv3+, which resemble the feature extraction part of DarkNet53 but with fewer layers and fewer parameters.
(2) So that the framework network of the present embodiment performs 16× downsampling, i.e., four successive 2× downsamplings, on the input image, the present embodiment replaces the convolution stride in Sub5 with 2 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with a dilation rate of 2.
(3) To extract more high-level features, the present embodiment adds an atrous convolution with a dilation rate of 24 to the Atrous Spatial Pyramid Pooling (ASPP) module.
(4) The present embodiment also replaces the 3 × 3 convolution in the ASPP with a multi-level convolution module, where MConv denotes the MConvblock of Fig. 2; each convolution inside the multi-level convolution module in the ASPP is dilated accordingly. Each module in the ASPP of the present embodiment has 64 output channels, far fewer than the 256 of the original Deeplabv3+, which reduces the amount of calculation.
(5) The present embodiment likewise reduces the parameters of each convolution in the upsampling path, further reducing the amount of calculation. After the high-level features of different scales are fused, they are upsampled 4× by bilinear interpolation and fused with the 4× downsampled features. Finally, three convolutions are applied, and the final semantic segmentation result is obtained by another 4× bilinear interpolation upsampling.
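To give a sense of what the dilation rates in items (2) and (3) mean, the sketch below computes the spatial extent covered by a dilated kernel and the padding that preserves the feature size at stride 1. The function names are hypothetical, and the formulas are the standard ones for dilated convolution rather than anything stated in the embodiment.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent along one axis covered by a dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding per side that keeps the output size equal to the input size
    at stride 1 (assumes an odd kernel size)."""
    return dilation * (kernel_size - 1) // 2


for rate in (1, 2, 24):
    extent = effective_kernel_size(3, rate)
    print(f"rate={rate}: 3x3 kernel covers {extent}x{extent}, padding={same_padding(3, rate)}")
```

A 3 × 3 kernel with a dilation rate of 24 therefore spans a 49 × 49 window while still using only nine weights per channel pair, which is how the added ASPP branch sees more context without more parameters.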
In summary, existing semantic segmentation models struggle to balance real-time performance against high segmentation accuracy in road scene applications, so a fast road-scene-oriented semantic segmentation model is designed based on the Road Scene Deeplabv3+ network structure. First, a shallow network is designed as the framework network of Deeplabv3+ for feature extraction, and a multi-level convolution module is designed for this shallow network to speed up feature extraction. The multi-level convolution module is then used to replace the 3 × 3 convolution in Deeplabv3+ to extract more and deeper high-level features. Compared with other methods, the method segments images quickly while achieving higher segmentation accuracy.
To verify the validity of the road scene semantic segmentation model provided in this embodiment, several experiments were performed. A schematic diagram of the experimental results is shown in Fig. 5, where column (a) shows the input images, column (b) shows the label images, and column (c) shows the output images. Compared with image data sets collected directly in the real world, the Cityscapes data set has undergone a degree of manual processing, and both its original images and its annotated images are of high quality; the Cityscapes data set, which is well suited to road scene semantic segmentation experiments, was therefore selected.
With a GTX 1080 Ti GPU as the running platform, the semantic segmentation model of the present embodiment processes images of size 2048 × 1024 × 3 at 0.057 s per frame and 720p images, i.e., of size 1920 × 720 × 3, at 0.03 s per frame. Meanwhile, the model parameters of this embodiment occupy only 20 MB of storage space when stored with 32-bit floating point precision. Overall, the fast semantic segmentation model designed in the embodiment meets the requirements of road scene semantic segmentation in both speed and storage space.
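The reported figures can be converted into more familiar units with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper names are hypothetical, and 1 MB is assumed to mean 10^6 bytes (with 2^20 bytes the parameter count would be slightly higher).

```python
def frames_per_second(seconds_per_frame):
    """Throughput implied by a per-frame processing time."""
    return 1.0 / seconds_per_frame


def approx_parameter_count(size_megabytes, bytes_per_param=4, bytes_per_mb=10**6):
    """Number of parameters that fit in the given storage.

    bytes_per_param=4 corresponds to 32-bit floating point storage;
    bytes_per_mb=10**6 is an assumption about how "MB" is defined.
    """
    return size_megabytes * bytes_per_mb // bytes_per_param


print(f"2048x1024 input: {frames_per_second(0.057):.1f} frames per second")
print(f"1920x720 input:  {frames_per_second(0.03):.1f} frames per second")
print(f"20 MB at 32 bits: about {approx_parameter_count(20):,} parameters")
```

Under these assumptions the stated timings correspond to roughly 17.5 and 33.3 frames per second, and 20 MB of 32-bit weights corresponds to about five million parameters.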
Claims (7)
1. A road scene-oriented rapid semantic segmentation method, characterized in that the method comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in step 1 by using training data;
step 3, calculating the loss of the model trained in step 2 by using a loss function, and calculating the gradient according to the obtained loss;
step 4, updating the model parameters according to the gradient obtained in step 3.
2. The method for fast semantic segmentation of road scenes according to claim 1, characterized in that: the specific process of step 1 is as follows: an input image is processed by a convolutional neural network formed of a plurality of convolution kernels, so that 3 × H × W input data yields a 1 × H × W prediction output, where H is the height of the input image and W is the width of the input image;
the model is constructed according to the following equation (1):
wherein $F_{out}$ is the output feature, $F_{in}$ is the input feature, $K_i$ is the i-th convolution kernel, N is the number of output channels, and b is the bias;
since the image is two-dimensional data, the size of the input feature is $C_{in} \times H_{in} \times W_{in}$; using a convolution kernel of size $C_{out} \times C_{in} \times H_k \times W_k$, the output feature obtained has size $C_{out} \times H_{out} \times W_{out}$;
wherein $C_{in}$ and $C_{out}$ are the numbers of channels of the input and output features, $H_{in}$ and $W_{in}$ are the height and width of the input feature, $H_k$ and $W_k$ are the height and width of the convolution kernel, and $H_{out}$ and $W_{out}$ are the height and width of the output feature;
for an input feature of size $C_{in} \times H_{in} \times W_{in}$, $C_{out}$ convolution kernels of size $C_{in} \times H_k \times W_k$ perform a sliding multiply-add operation over the input feature to obtain $C_{out}$ features of size $H_{out} \times W_{out}$.
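The sliding multiply-add described above can be illustrated with a deliberately naive implementation. This is a sketch under stated assumptions: stride 1 and no padding are assumed (the claim does not fix either), a single scalar bias is shared by all output channels for brevity, and the function name `conv2d` is chosen here for illustration.

```python
def conv2d(features, kernels, bias=0.0):
    """Naive convolution (cross-correlation) by sliding multiply-add.

    features: nested lists of shape C_in x H_in x W_in
    kernels:  nested lists of shape C_out x C_in x H_k x W_k
    returns:  nested lists of shape C_out x H_out x W_out
    Assumes stride 1 and no padding, so H_out = H_in - H_k + 1 (same for width).
    """
    c_in, h_in, w_in = len(features), len(features[0]), len(features[0][0])
    h_k, w_k = len(kernels[0][0]), len(kernels[0][0][0])
    h_out, w_out = h_in - h_k + 1, w_in - w_k + 1
    output = []
    for kernel in kernels:  # each kernel of size C_in x H_k x W_k yields one output channel
        channel = [[0.0] * w_out for _ in range(h_out)]
        for y in range(h_out):
            for x in range(w_out):
                acc = bias
                for c in range(c_in):
                    for dy in range(h_k):
                        for dx in range(w_k):
                            acc += kernel[c][dy][dx] * features[c][y + dy][x + dx]
                channel[y][x] = acc
        output.append(channel)
    return output


# Example: a 3 x 4 x 5 input of ones and two 3 x 3 x 3 kernels of ones.
features = [[[1.0] * 5 for _ in range(4)] for _ in range(3)]
kernels = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]
out = conv2d(features, kernels)
print(len(out), len(out[0]), len(out[0][0]))  # C_out, H_out, W_out
```

Each output value here sums 3 × 3 × 3 = 27 products, and the two kernels give two output channels, matching the $C_{out} \times H_{out} \times W_{out}$ shape in the claim.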
4. The method for fast semantic segmentation of road scenes according to claim 3, characterized in that: the specific process of step 2 is as follows: the training data comprises manually collected images and the label images corresponding to the collected images;
the training process is as follows: a label image is obtained from an input image, where the input image is a color RGB image and the label image is a single-channel grayscale image.
5. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: in step 2, the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C categories to be segmented, each pixel value in the label image lies in the range 0 to C−1.
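The label encoding described above, where each gray value is a class index from 0 to C−1, can be related to a per-pixel one-hot vector with a small sketch. The helper names and the 2 × 2 example image are hypothetical and serve only to illustrate the encoding.

```python
def one_hot_pixel(label, num_classes):
    """Turn a class index in [0, num_classes - 1] into a one-hot list."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside 0..{num_classes - 1}")
    return [1 if c == label else 0 for c in range(num_classes)]


def one_hot_image(label_image, num_classes):
    """Encode a single-channel label image (rows of class indices) as H x W x C."""
    return [[one_hot_pixel(value, num_classes) for value in row] for row in label_image]


# Hypothetical 2 x 2 label image with C = 3 classes.
label_image = [[0, 2], [1, 1]]
encoded = one_hot_image(label_image, 3)
print(encoded)
```

Every pixel ends up with exactly one entry equal to 1, which is the property the segmentation loss relies on.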
6. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: the specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameters ω is $f_\omega$, and the loss function used is L, the goal of the model is:
$\omega = \arg\min_{\omega} L(f_\omega(X), Y)$ (4);
step 3.2, determining the semantic segmentation loss function from the model target obtained in step 3.1:
semantic segmentation is essentially a classification task, and each pixel in the image has one and only one category, that is, one and only one value in the pixel label is 1 and the rest are 0; the loss function for semantic segmentation is then:
$L(f_\omega(X), Y) = -\log f_\omega(x_t)$ (8);
wherein $f_\omega(x_t)$ is the predicted probability of the category whose label is 1;
step 3.3, to reduce the complexity of the model, weight decay is added to the model, and the loss function after adding the weight decay is:
$L(f_\omega(X), Y) = -\log f_\omega(x_t) + \alpha \cdot \omega^2$ (9);
wherein α is a weight;
step 3.4, normalizing the model output by using the following formula:
wherein C is the number of channels, $x_i$ is the value output by the model at a given pixel location, and $y_i$ is the corresponding predicted probability;
step 3.5, according to the result obtained in step 3.4, obtaining the gradient value corresponding to each parameter participating in the operation based on the chain rule of differentiation, wherein the gradient calculation process is shown in the following formula (11):
wherein $x_k$ and $x_t$ are both obtained by convolution with the parameter ω.
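The loss and gradient steps of this claim can be sketched numerically for a single pixel. The sketch rests on assumptions that should be read as such: the normalization of step 3.4 is taken to be the standard softmax, $y_i = e^{x_i} / \sum_{j=1}^{C} e^{x_j}$; the weight decay term of formula (9) is taken to be α times the sum of squared weights; and all function names are hypothetical. Under the softmax assumption the gradient of $-\log y_t$ with respect to the model output $x_k$ is $y_k - 1$ when $k = t$ and $y_k$ otherwise, which the code checks against finite differences.

```python
import math


def softmax(logits):
    """Assumed form of the step 3.4 normalization: y_i = exp(x_i) / sum_j exp(x_j)."""
    shift = max(logits)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_loss(logits, target, weights=(), alpha=0.0):
    """-log y_t as in formula (8), plus alpha * sum(w^2) as read from formula (9)."""
    probs = softmax(logits)
    return -math.log(probs[target]) + alpha * sum(w * w for w in weights)


def loss_gradient_wrt_logits(logits, target):
    """Gradient of -log y_t with respect to each x_k: y_k - 1 if k == t, else y_k."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]


logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for one pixel, C = 3
target = 0                 # index of the class whose label is 1
analytic = loss_gradient_wrt_logits(logits, target)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = []
for k in range(len(logits)):
    raised = list(logits)
    lowered = list(logits)
    raised[k] += eps
    lowered[k] -= eps
    numeric.append((pixel_loss(raised, target) - pixel_loss(lowered, target)) / (2 * eps))

print(analytic)
print(numeric)
```

The gradient with respect to the parameters ω then follows by the chain rule through the convolutions that produce $x_k$ and $x_t$, plus 2αω from the weight decay term under the reading used here.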
7. The method for fast semantic segmentation of road scenes according to claim 6, characterized in that: the specific process of step 4 is as follows:
the parameters in the model are updated according to the following formulas (12) and (13):
$m_{t+1} = \rho \cdot m_t + \Delta\omega$ (12);
$\omega_{t+1} = \omega_t - lr \cdot m_{t+1}$ (13);
wherein $m_t$ is the current momentum, $m_{t+1}$ is the updated momentum, $\omega_t$ is the current parameter, $\omega_{t+1}$ is the updated parameter, Δω is the gradient, ρ is a weight parameter with a value between 0 and 1, and lr is the update step size.
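Formulas (12) and (13) amount to gradient descent with momentum. The sketch below applies them to a single scalar parameter on a toy quadratic objective; the objective, the values ρ = 0.9 and lr = 0.1, and the function name `momentum_step` are assumptions chosen for illustration, not values taken from the claims.

```python
def momentum_step(param, momentum, grad, rho=0.9, lr=0.1):
    """One parameter update following formulas (12) and (13)."""
    new_momentum = rho * momentum + grad       # (12): m_{t+1} = rho * m_t + gradient
    new_param = param - lr * new_momentum      # (13): w_{t+1} = w_t - lr * m_{t+1}
    return new_param, new_momentum


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w, m = 0.0, 0.0
for _ in range(300):
    grad = 2.0 * (w - 3.0)
    w, m = momentum_step(w, m, grad)

print(w)
```

Because the momentum accumulates past gradients, the parameter overshoots and oscillates around the minimum before settling, which is the expected behavior for ρ close to 1.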
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911256375.3A CN111179272B (en) | 2019-12-10 | 2019-12-10 | Rapid semantic segmentation method for road scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111179272A (en) | 2020-05-19 |
CN111179272B (en) | 2024-01-05 |
Family
ID=70657206
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201911256375.3A | Active, CN111179272B (en) | 2019-12-10 | 2019-12-10 | Rapid semantic segmentation method for road scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111179272B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112819000A (en) * | 2021-02-24 | 2021-05-18 | 长春工业大学 | Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium |
CN113763392A (en) * | 2021-11-10 | 2021-12-07 | 北京中科慧眼科技有限公司 | Model prediction method and system for road surface flatness detection and intelligent terminal |
CN115018857A (en) * | 2022-08-10 | 2022-09-06 | 南昌昂坤半导体设备有限公司 | Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107016415A (en) * | 2017-04-12 | 2017-08-04 | 合肥工业大学 | A kind of coloured image Color Semantic sorting technique based on full convolutional network |
WO2018081537A1 (en) * | 2016-10-31 | 2018-05-03 | Konica Minolta Laboratory U.S.A., Inc. | Method and system for image segmentation using controlled feedback |
CN109543502A (en) * | 2018-09-27 | 2019-03-29 | 天津大学 | A kind of semantic segmentation method based on the multiple dimensioned neural network of depth |
CN110147794A (en) * | 2019-05-21 | 2019-08-20 | 东北大学 | A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning |
CN110210350A (en) * | 2019-05-22 | 2019-09-06 | 北京理工大学 | A kind of quick parking space detection method based on deep learning |
Non-Patent Citations (1)
Title |
---|
GUOQING XU: "Dynamic Modeling of Driver Control Strategy of Lane-Change Behavior and Trajectory Planning for Collision Prediction", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, vol. 13, no. 3, pages 1138 - 1154 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112819000A (en) * | 2021-02-24 | 2021-05-18 | 长春工业大学 | Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium |
CN113763392A (en) * | 2021-11-10 | 2021-12-07 | 北京中科慧眼科技有限公司 | Model prediction method and system for road surface flatness detection and intelligent terminal |
CN113763392B (en) * | 2021-11-10 | 2022-03-18 | 北京中科慧眼科技有限公司 | Model prediction method and system for road surface flatness detection and intelligent terminal |
CN115018857A (en) * | 2022-08-10 | 2022-09-06 | 南昌昂坤半导体设备有限公司 | Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment |
CN115018857B (en) * | 2022-08-10 | 2022-11-11 | 南昌昂坤半导体设备有限公司 | Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111179272B (en) | 2024-01-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||