CN111179272A - Rapid semantic segmentation method for road scene

Rapid semantic segmentation method for road scene

Info

Publication number
CN111179272A
CN111179272A
Authority
CN
China
Prior art keywords
model
semantic segmentation
image
input
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911256375.3A
Other languages
Chinese (zh)
Other versions
CN111179272B (en)
Inventor
欧勇盛
彭远哲
王志扬
熊荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911256375.3A priority Critical patent/CN111179272B/en
Publication of CN111179272A publication Critical patent/CN111179272A/en
Application granted granted Critical
Publication of CN111179272B publication Critical patent/CN111179272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10024 - Color image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a road-scene-oriented rapid semantic segmentation method, which specifically comprises the following steps: step 1, constructing a model based on a convolutional neural network; step 2, training the model constructed in step 1 with training data; step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating gradients from the obtained loss; and step 4, updating the model parameters according to the gradients obtained in step 3. The segmentation method provided by the invention can segment images rapidly while obtaining higher accuracy.

Description

Rapid semantic segmentation method for road scene
Technical Field
The invention belongs to the technical field of computer vision, and relates to a road scene-oriented rapid semantic segmentation method.
Background
In recent years, image semantic segmentation has become a research hotspot in computer vision and can be applied in many scenarios, such as robot vision and environment perception for intelligent driving. Semantic segmentation means accurately separating the multiple target elements in a captured picture along their respective outlines. To a computer, an image is a multi-channel matrix of numerical values. For a computer to understand and segment target elements, the numerical features of each target element must be found in the original numerical matrix, those features must be used to determine which target elements the image contains, and the target elements must then be analyzed at a deeper level to determine how they combine into a picture. In general, the purpose of image semantic segmentation is to identify the objects in a picture and their interrelationships; it is a simulation of the human visual system. Humans understand their surroundings through the visual perception system, and image segmentation technology acquires, understands and recognizes the information in pictures by imitating it. Image semantic segmentation, a very important task in the fields of image processing and pattern recognition, is a preliminary step for many computer vision techniques. Detection and tracking performed on the result of semantic segmentation can narrow the detection and tracking region, and the outlines of some objects can even be given directly; image captioning based on the result of semantic segmentation can add a large amount of positional information between targets; and image style transfer based on the result of semantic segmentation can quickly locate and replace the background region, or replace a specific target. Therefore, image semantic segmentation has strong theoretical and research value.
Early semantic segmentation methods relied on hand-crafted features, for example using random decision forests to predict classification probabilities and using probabilistic conditional random field (CRF) models to handle uncertainty and integrate context information in the image. In recent years, Convolutional Neural Networks (CNNs) have developed rapidly in the field of computer vision thanks to the advent of large-scale training datasets and high-performance Graphics Processing Units (GPUs). In addition, excellent open-source deep learning frameworks such as Caffe, MXNet and TensorFlow have facilitated the development of deep learning algorithms. Powerful deep neural networks have greatly reduced classification error, and semantic segmentation has made great progress in the process.
At present, many researchers try to obtain the highest possible accuracy with the smallest computation and parameter counts, so that models can run on vehicle-mounted terminal platforms; examples include the image classification models SqueezeNet, ShuffleNet and MobileNet. In semantic segmentation, the input size is 3 × H × W and the output size is C × H × W: the output has exactly the same width and height as the input, and C is often much larger than 3, so producing the output requires a large number of operations. Meanwhile, to obtain higher segmentation accuracy, the image is downsampled only a few times, so the width and height of the intermediate features remain large, which further increases the computation. Taken together, these two points make semantic segmentation an extremely computation-heavy task.
To reduce the computation generated by semantic segmentation, there are generally two approaches: reducing the picture size and reducing the model complexity. Reducing the picture size cuts computation most directly, but the image loses a lot of detail, which hurts accuracy. Reducing the model complexity weakens the feature extraction capability of the model, which likewise hurts segmentation accuracy. At present, road-scene semantic segmentation models struggle to reconcile the conflict between real-time performance and high segmentation accuracy.
Most existing semantic segmentation frameworks are based on fully convolutional networks, which successfully improved semantic segmentation performance by transforming classification networks into fully convolutional form; in other words, a fully convolutional network replaces the fully connected layers of a classification model with convolutional layers. However, fully convolutional networks do not consider the relationships between pixels: they neglect the spatial regularization step used in typical pixel-classification-based segmentation methods and lack spatial consistency.
Disclosure of Invention
The invention aims to provide a road-scene-oriented fast semantic segmentation method that addresses the low segmentation accuracy and weak feature extraction capability of existing segmentation methods.
The technical scheme adopted by the invention is that a rapid semantic segmentation method facing to a road scene specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the model loss trained in the step 2 by using a loss function, and calculating the gradient according to the obtained model loss result;
and 4, updating the model parameters according to the gradient obtained in the step 3.
The invention is also characterized in that:
the specific process of the step 1 is as follows: processing an input image by using a convolution neural network formed by a plurality of convolution kernels, thereby realizing that 3 multiplied by H multiplied by W data is input to obtain 1 multiplied by H multiplied by W prediction output, wherein H is the height of the input image, and W is the width of the input image;
the model was constructed according to the following equation (1):
Figure BDA0002310373270000021
wherein, FoutFor output characteristics, FinAs input features, KiIs the ith convolution kernel, N is the number of output channels, and b is the offset;
since the image is two-dimensional data, the size of the input feature is Cin×Hin×WinUsing a convolution kernel of size Cout×Cin×Hk×WkThe output characteristic obtained is Cout×Hout×Wout
Wherein, CinAnd CoutNumber of channels characteristic of input and output, HinAnd WinHeight and width of input features, HkAnd WkHeight and width of convolution kernel, HoutAnd WoutHeight and width of the output features;
for input of Cin×Hin×WinIs characterized by the use of CoutSize is Cin×Hk×WkThe convolution kernel performs sliding multiply-add operation on the input characteristics to obtain CoutSize is Hout×WoutThe characteristics of (1).
The calculation process of the height and width of the output features in step 1 is as follows:
Figure BDA0002310373270000031
Figure BDA0002310373270000032
wherein p is the frame width and s is the step length.
The specific process of the step 2 is as follows: the training data comprises artificially acquired images and label images corresponding to the acquired images;
the training process is as follows: a label image is obtained from an input image, wherein the input image is a color RGB image, and the label image is a single-channel gray image.
In step 2, the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
The specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model is f_ω with parameters ω, and the loss function used is L, the target of the model is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, according to the step 3.1, determining a semantic segmentation loss function by the obtained model target:
the semantic segmentation essentially belongs to a classification task, and each pixel on the image has one and only one class, i.e., exactly one value in the pixel label is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the category whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight attenuation is added to the model, and the loss function after the weight attenuation is added is as follows:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, carrying out normalization processing on the model by adopting the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a given pixel position, and y_i is the corresponding predicted probability value;
step 3.5, obtaining the gradient value of each parameter participating in the operation by the chain rule of differentiation, based on the result obtained in step 3.4; the gradient is calculated as shown in formula (11):
∂L/∂ω = Σ_{k≠t} y_k·∂x_k/∂ω + (y_t - 1)·∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
The specific process of step 4 is as follows:
parameters in the model are updated according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter with value between 0 and 1, and lr is the update step length.
The method has the advantages that the model is constructed with a convolutional neural network, a multi-level convolution module is used to accelerate feature extraction, and the multi-level convolution module replaces the 3 × 3 convolution to extract more and deeper high-level features, so that images can be segmented rapidly while obtaining higher accuracy.
Drawings
FIG. 1 is a schematic illustration of the image semantic segmentation process;
FIG. 2 is a schematic structural diagram of an MConvblock adopted in a semantic segmentation model of an embodiment of a road scene-oriented fast semantic segmentation method of the invention;
FIG. 3 is a schematic diagram of an MResblock structure in a semantic segmentation model according to an embodiment of the road scene-oriented fast semantic segmentation method of the present invention;
FIG. 4 is a schematic network structure diagram of Road Scene Deeplabv3+ in the semantic segmentation model according to the embodiment of the Road Scene-oriented fast semantic segmentation method of the present invention;
fig. 5(a) - (c) are experimental result diagrams of the road scene-oriented fast semantic segmentation method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Semantic segmentation, which is a pixel-level segmentation task providing rich target information, is an important research problem in the field of robot perception, and has been widely applied in many fields, such as automatic driving and robot navigation. In the application of road scene understanding, the semantic segmentation model should accurately describe the appearances and the categories of different objects, and in addition, the semantic segmentation model needs to understand the spatial relationship between different objects.
Image segmentation is the division of an image into several disjoint regions containing a single object, typically in a bottom-up manner. The semantic segmentation is to divide each pixel in the image into a certain category with semantics, which is usually implemented in a top-down manner. The final goal of semantic segmentation is to obtain a model in a top-down manner, so that it can accurately predict a label with semantic meaning for each pixel in an input image, and a schematic diagram thereof is shown in fig. 1.
The invention relates to a road scene-oriented rapid semantic segmentation method, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
convolutional neural networks are also referred to as area-dependent networks because convolution is achieved by computing a region in the input features.
The convolutional neural network is composed of a series of convolutional kernels with fixed height, width and parameters, the convolutional kernels perform sliding operation on input features, and multiplication and addition operation is performed on numerical values of covered areas to obtain output features.
The output features can be used as input features of the next convolution for further operation, and a plurality of convolution kernels are overlapped in the mode to carry out more processing on the features so as to form a convolution neural network.
This form of accumulation can reach tens or even hundreds of layers, thus forming a feature extraction process for input images from low-layer area correlation to high-layer semantic correlation.
The model formula constructed based on the convolutional neural network is:
F_out,i = K_i ∗ F_in + b, i = 1, 2, …, N (1)
wherein F_out is the output feature, F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the offset.
Since the image is two-dimensional data, an input feature of size C_in × H_in × W_in convolved with a kernel of size C_out × C_in × H_k × W_k yields an output feature of size C_out × H_out × W_out, where C_in and C_out are the numbers of input and output channels, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature.
The width and height of the output feature are calculated as:
H_out = (H_in + 2p - H_k)/s + 1 (2)
W_out = (W_in + 2p - W_k)/s + 1 (3)
where p is the padding (frame) width and s is the stride (step length).
For an input feature of size C_in × H_in × W_in, C_out convolution kernels of size C_in × H_k × W_k perform sliding multiply-add operations on the input feature to obtain C_out features of size H_out × W_out.
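As a check on equations (2) and (3), the following minimal sketch (PyTorch is assumed here; all sizes are illustrative examples rather than values prescribed by the invention) compares the formula against the shape an actual convolution produces:

```python
import torch
import torch.nn as nn

C_in, H_in, W_in = 3, 64, 64       # input feature of size C_in x H_in x W_in
C_out, H_k, W_k = 16, 3, 3         # C_out kernels of size C_in x H_k x W_k
p, s = 1, 2                        # padding (frame width) and stride (step length)

conv = nn.Conv2d(C_in, C_out, kernel_size=(H_k, W_k), stride=s, padding=p)
x = torch.randn(1, C_in, H_in, W_in)           # a batch of one image
y = conv(x)

H_out = (H_in + 2 * p - H_k) // s + 1          # equation (2)
W_out = (W_in + 2 * p - W_k) // s + 1          # equation (3)
assert y.shape == (1, C_out, H_out, W_out)     # torch.Size([1, 16, 32, 32])
```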
In order to keep the width and height of the output features consistent with the input features as far as possible during convolution, the border of the input feature is usually zero-padded, with the amount of padding chosen so that the width and height are unchanged after convolution.
Semantic segmentation requires an output that is as wide and high as the input image, and each pixel corresponds to a class of predictions.
Therefore, the semantic segmentation process is to use a plurality of convolution kernels to form a convolution neural network to process the input image, so as to realize that the input 3 × H × W data obtains a 1 × H × W prediction output, where H is the height of the input image and W is the width of the input image.
In actual operation, a probability is usually given to all possible classes, so a convolutional neural network is usually used to obtain a prediction of C × H × W, and then the class with the highest probability is selected from each pixel position as a final class to obtain a prediction output of 1 × H × W, where C is the total number of classes.
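A minimal sketch of this per-pixel class selection (PyTorch assumed; C = 19 is only an example count, as in Cityscapes):

```python
import torch

C, H, W = 19, 512, 1024            # e.g. 19 classes, sizes are examples
scores = torch.randn(C, H, W)      # C x H x W per-class scores from the network
pred = scores.argmax(dim=0, keepdim=True)   # 1 x H x W map of class indices
assert pred.shape == (1, H, W)
```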
Step 2, training the model constructed in the step 1 by using training data;
training data for semantic segmentation is usually composed of a series of artificially acquired images and corresponding label images, and the purpose of model training is to enable a model to fit the provided data, i.e., to obtain the label images from input images. The input image is a commonly acquired color RGB image, while the label image is a single-channel grayscale image, and the grayscale value of a pixel directly represents the class to which the pixel belongs. Assuming that there are C classes to be segmented, each pixel value in the label image is between 0 and C-1.
Step 3, calculating the model loss trained in the step 2 by using a loss function, and calculating the gradient according to the obtained model loss result;
the loss function is a function for calculating similarity to the prediction result and the tag data.
Generally, the closer the prediction result is to the label data, the smaller the value computed by the loss function. Assuming that the input data of the model is X, the label data is Y, the model is f_ω with parameters ω, and the loss function used is L, the target of the model is:
ω = argmin_ω L(f_ω(X), Y) (4);
obtaining a model f with omega parametersωSo that the predicted result and the labeled result can obtain the minimum value under the loss function L.
The loss functions commonly used in deep learning are Mean Absolute Error (MAE), Mean Square Error (MSE), and cross entropy. Cross entropy is used the most in classification tasks; it computes the similarity between two probability distributions.
The loss function formula for MAE is:
L(f_ω(X), Y) = (1/n) Σ_{i=1}^{n} |f_ω(x_i) - y_i| (5);
The loss function formula for MSE is:
L(f_ω(X), Y) = (1/n) Σ_{i=1}^{n} (f_ω(x_i) - y_i)² (6);
The cross-entropy loss function formula is:
L(f_ω(X), Y) = -Σ_i y_i log f_ω(x_i) (7);
The cross-entropy loss is the one used in the present invention.
semantic segmentation essentially belongs to the classification task, and each pixel has one and only one class, i.e. one and only one value in its label is 1, and the rest are 0. Then the cross entropy loss function can be simplified in the semantic segmentation task as:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the category whose label is 1.
A weight decay term is usually added to the model to reduce its complexity. The decay term is an MSE-style penalty on the weights; with cross entropy as the main loss function, this gives:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
where α is a weight, typically set to 0.0001.
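A short sketch (PyTorch assumed; the layer and data are stand-ins, not the model of the invention) of the combined loss of equation (9):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Conv2d(3, 19, kernel_size=3, padding=1)   # stand-in for the model
x = torch.randn(2, 3, 32, 32)                        # dummy images
target = torch.randint(0, 19, (2, 32, 32))           # dummy per-pixel labels

alpha = 1e-4                                         # weight of the decay term
ce = F.cross_entropy(layer(x), target)               # -log f_w(x_t)
l2 = sum((w ** 2).sum() for w in layer.parameters())
loss = ce + alpha * l2                               # equation (9)
```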
Since the result of the last convolution of the model can take values over an arbitrary range, it needs to be normalized to between 0 and 1 in such a way that the values sum to 1 across the channels, i.e., at each pixel position the predicted probabilities over all classes should sum to 1 so that they can correspond to the provided label.
To achieve this, the output result is typically normalized using softmax, whose formula is:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
where C is the number of channels, x_i is the value output by the model at a given pixel position, and y_i is the corresponding predicted probability value.
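A minimal sketch (PyTorch assumed; sizes are illustrative) confirming that the per-pixel probabilities of equation (10) sum to one across the channels:

```python
import torch

scores = torch.randn(19, 512, 1024)     # C x H x W output of the last convolution
prob = scores.softmax(dim=0)            # equation (10), applied per pixel
assert torch.allclose(prob.sum(dim=0), torch.ones(512, 1024), atol=1e-5)
```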
After the prediction probability value is obtained, the difference between the prediction probability value and the label can be calculated by using a loss function, and the gradient value corresponding to each parameter participating in the operation is obtained through a chain rule in derivation.
In the present invention, the gradient with respect to a single parameter ω is calculated using the cross-entropy loss; let y_t be the predicted probability of the corresponding label:
∂L/∂ω = Σ_{k≠t} y_k·∂x_k/∂ω + (y_t - 1)·∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω, and the operations involved are multiply-add operations and differentiable nonlinear operations, i.e., every operation is differentiable, so the partial derivatives ∂x_k/∂ω and ∂x_t/∂ω can be obtained by the chain rule according to the concrete structure of the model.
Therefore, once the network model and the loss function are determined, the formula of the gradient is determined. When the network has more than one layer, the way the gradients of the model parameters are obtained closely mirrors the way the model computes its features: the gradient propagates backwards through the network along the same paths the features followed forwards.
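The softmax-cross-entropy derivative behind equation (11) can also be verified numerically. In the following sketch (PyTorch assumed; a single pixel with C = 5 classes), autograd reproduces the derivative y_k - 1 at the labelled class and y_k elsewhere:

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, requires_grad=True)     # scores at one pixel, C = 5
t = 2                                      # index of the labelled class
loss = -F.log_softmax(x, dim=0)[t]         # equation (8) at this pixel
loss.backward()

y = x.detach().softmax(dim=0)              # predicted probabilities
expected = y.clone()
expected[t] -= 1.0                         # y_k - 1 at k = t, y_k elsewhere
assert torch.allclose(x.grad, expected, atol=1e-6)
```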
And 4, updating the model parameters according to the gradient obtained in the step 3.
After the gradients of the parameters are obtained, the parameters in the model need to be updated according to them. The invention mainly uses stochastic gradient descent (SGD) with momentum, the most common optimizer in semantic segmentation. The word "stochastic" refers to the random selection of samples during training: training sets are often too large to fit entirely on the GPU, so a fixed number of samples is randomly drawn from the training set for each training step. The invention trains the model by randomly drawing data from the training set without replacement. After the gradient is obtained in each step, the parameters in the model are updated according to the following rules:
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient obtained by back-propagation, ρ is a weight parameter with value between 0 and 1, and lr is the update step length, generally between 0 and 1.
After the model, data, loss function and parameter-update rule are determined, the model can be trained iteratively on the dataset. The training process can target a number of iterations, or an accuracy reaching some standard. During training, part of the data in the training set is set aside as a validation set, which does not participate in training. After each iteration, the model is evaluated on the validation set, and its training condition is judged by observing its performance on the training and validation sets. For example, in semantic segmentation, if the model obtains a significantly higher intersection-over-union (IoU) on the training data than on the validation data, the model has usually overfitted, and training should be stopped and the model readjusted. If the results on both the training and validation sets remain poor, the model parameters are failing to converge, and training should be stopped and the training strategy changed. If the accuracies on the training and validation sets differ little and keep increasing, training is proceeding well. After training is complete, the model can be tested on the test set to observe its practical performance.
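Putting these pieces together, the following hedged training-loop sketch (PyTorch assumed; model, train_set, val_set and all hyper-parameters are placeholders) follows the scheme described above: random mini-batch sampling, momentum SGD with weight decay, and per-epoch monitoring on a held-out validation set:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs=100):
    # momentum SGD with weight decay, per equations (9), (12) and (13)
    opt = torch.optim.SGD(model.parameters(), lr=0.01,
                          momentum=0.9, weight_decay=1e-4)
    loader = DataLoader(train_set, batch_size=8, shuffle=True)  # random sampling
    for epoch in range(epochs):
        model.train()
        for image, label in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(image), label)   # equation (8)
            loss.backward()                               # gradients by chain rule
            opt.step()
        model.eval()                       # monitor the held-out validation set
        with torch.no_grad():
            val_loss = sum(F.cross_entropy(model(x), y).item()
                           for x, y in DataLoader(val_set, batch_size=8))
        print(epoch, val_loss)             # watch for divergence / overfitting
```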
Examples
According to the road-scene-oriented fast semantic segmentation method of the invention, a fast semantic segmentation model based on multi-convolution modules is designed, with Deeplabv3+ as the overall framework. To improve the running speed of the model, the invention designs a shallow network as the backbone of Deeplabv3+, named Road Scene Deeplabv3+. Table 1 lists the network structure parameters of Road Scene Deeplabv3+; the structure is similar to the feature extraction part of DarkNet-53, but with fewer layers and fewer parameters. The invention also designs a multi-layer convolution module, MConvblock, to obtain multi-layer features.
TABLE 1 Road Scene Deeplabv3+ structural parameters
Name    Repetitions    Output channels    Operation
Conv1   1              16                 MConvblock
Sub1    1              32                 3×3 conv + BN + Leaky, stride = 2
Res1    1              32                 MResblock
Sub2    1              64                 3×3 conv + BN + Leaky, stride = 2
Res2    2              64                 MResblock
Sub3    1              128                3×3 conv + BN + Leaky, stride = 2
Res3    2              128                MResblock
Sub4    1              256                3×3 conv + BN + Leaky, stride = 2
Res4    2              256                MResblock
Sub5    1              512                3×3 conv + BN + Leaky, stride = 2
Res5    1              512                MResblock
Conv2   1              1024               MConvblock
The structures of MConvblock and MResblock are shown in Figs. 2 and 3, respectively, where K is the number of input channels and C is the number of output channels. It can be seen that MResblock is a residual structure designed on the basis of MConvblock, so the output size of MResblock is identical to its input size.
Fig. 4 shows the network structure of Road Scene Deeplabv3+ designed by the invention, in which Block1 outputs 4× downsampled features, Block2 outputs 16× downsampled features, and C denotes the number of channels of the input features. Conv1 to Res2 and Sub3 to Conv2 correspond to Table 1. So that the backbone network downsamples by only a factor of 16, i.e., four 2× downsamplings, the invention modifies the stride-2 convolution in Sub5 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with dilation rate 2. To extract more high-level features, the invention adds an atrous convolution with dilation rate 24 to the Atrous Spatial Pyramid Pooling (ASPP) and replaces the 3 × 3 convolution in the ASPP with the multi-level convolution module, where MConv denotes the MConvblock of Fig. 2, so each convolution in the multi-level convolution module in the ASPP is dilated correspondingly. Each module in the ASPP of the invention has 64 output channels, far fewer than the 256 in the original Deeplabv3+, which reduces the computation. Likewise, the invention also reduces the parameters of every convolution in the upsampling path, further reducing the computation. After the high-level features of different scales are fused, the features are upsampled by 4× bilinear interpolation and fused with the 4× downsampled features. Finally, three convolutions are applied, and 4× bilinear interpolation upsampling is used once more to obtain the final semantic segmentation result. With a GTX 1080 Ti GPU as the running platform, the model of the invention achieves a processing speed of 0.057 s per frame on images of size 2048 × 1024 × 3 and 0.03 s per frame on 720p images, i.e., size 1920 × 720 × 3. Meanwhile, the model parameters occupy only 20 MB of storage when stored in 32-bit floating-point precision. Overall, the fast semantic segmentation model designed by the invention meets the requirements of road-scene semantic segmentation in both speed and storage space.
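Since the exact MConvblock and MResblock topologies are defined only by Figs. 2 and 3, the following sketch (PyTorch assumed) illustrates just one piece that the text does specify: a 3 × 3 atrous ASPP branch with dilation rate 24 and the reduced 64-channel output; the 1024 input channels follow Conv2 in Table 1, and everything else is an assumption:

```python
import torch.nn as nn

# One 3x3 atrous branch of the ASPP, dilation rate 24, reduced to 64 channels.
# padding = dilation keeps the spatial size of the feature map unchanged.
aspp_branch = nn.Sequential(
    nn.Conv2d(1024, 64, kernel_size=3, padding=24, dilation=24, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1),   # Table 1 pairs convolutions with BN + Leaky activations
)
```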
The features of the above embodiment are as follows:
(1) To increase the running speed of the model, a shallow network, Road Scene Deeplabv3+, is designed as the backbone for fast semantic segmentation. Table 1 lists its network structure parameters; the structure is similar to the feature extraction part of DarkNet-53, but with fewer layers and fewer parameters.
(2) So that the backbone network of this embodiment downsamples the input image by only a factor of 16, i.e., four 2× downsamplings, this embodiment modifies the stride-2 convolution in Sub5 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with dilation rate 2.
(3) To extract more high-level features, this embodiment adds an atrous convolution with dilation rate 24 to the Atrous Spatial Pyramid Pooling (ASPP).
(4) This embodiment also replaces the 3 × 3 convolution in the ASPP with the multi-level convolution module, where MConv denotes the MConvblock of Fig. 2, so each convolution in the multi-level convolution module in the ASPP is dilated correspondingly. Each module in the ASPP of this embodiment has 64 output channels, far fewer than the 256 in the original Deeplabv3+, which reduces the computation.
(5) This embodiment also reduces the parameters of every convolution in the upsampling path accordingly, further reducing the computation. After the high-level features of different scales are fused, the features are upsampled by 4× bilinear interpolation and fused with the 4× downsampled features. Finally, three convolutions are applied, and 4× bilinear interpolation upsampling is used once more to obtain the final semantic segmentation result.
To sum up, current semantic segmentation models struggle to balance real-time performance against high segmentation accuracy in road-scene applications, and a road-scene-oriented fast semantic segmentation model is therefore designed based on the Road Scene Deeplabv3+ network structure. First, a shallow network is designed as the backbone of Deeplabv3+ for feature extraction, and a multi-level convolution module is designed for it to accelerate feature extraction. The multi-level convolution module then replaces the 3 × 3 convolution in Deeplabv3+ to extract more and deeper high-level features. Compared with other methods, the method can segment images rapidly while obtaining higher segmentation accuracy.
To verify the validity of the road-scene semantic segmentation model provided in this embodiment, a number of experiments were performed; the results are shown in Fig. 5, where the column in Fig. 5(a) shows the input images, the column in Fig. 5(b) shows the ground-truth label images, and the column in Fig. 5(c) shows the output images. Compared with image datasets collected directly from the real world, the Cityscapes dataset has undergone a certain amount of manual processing, and the quality of both its original and its annotated images is high, so the Cityscapes dataset, well suited to road-scene semantic segmentation experiments, was selected for the experiments.
With a GTX 1080 Ti GPU as the running platform, the semantic segmentation model of this embodiment achieves a processing speed of 0.057 s per frame on images of size 2048 × 1024 × 3 and 0.03 s per frame on 720p images, i.e., size 1920 × 720 × 3. Meanwhile, the model parameters occupy only 20 MB of storage when stored in 32-bit floating-point precision. Overall, the fast semantic segmentation model designed in this embodiment meets the requirements of road-scene semantic segmentation in both speed and storage space.

Claims (7)

1. A road scene-oriented rapid semantic segmentation method is characterized by comprising the following steps: the method specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the model loss trained in the step 2 by using a loss function, and calculating the gradient according to the obtained model loss result;
and 4, updating the model parameters according to the gradient obtained in the step 3.
2. The method for fast semantic segmentation of road scenes according to claim 1, characterized in that: the specific process of step 1 is as follows: processing an input image with a convolutional neural network formed by a plurality of convolution kernels, so that 3 × H × W input data yields a 1 × H × W prediction output, wherein H is the height of the input image and W is the width of the input image;
the model is constructed according to the following formula (1):
F_out,i = K_i ∗ F_in + b, i = 1, 2, …, N (1)
wherein F_out is the output feature, F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the offset;
since the image is two-dimensional data, an input feature of size C_in × H_in × W_in convolved with a kernel of size C_out × C_in × H_k × W_k yields an output feature of size C_out × H_out × W_out;
wherein C_in and C_out are the numbers of input and output channels, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature;
for an input feature of size C_in × H_in × W_in, C_out convolution kernels of size C_in × H_k × W_k perform sliding multiply-add operations on the input feature to obtain C_out features of size H_out × W_out.
3. The method for fast semantic segmentation of road scenes according to claim 2, characterized in that: the calculation process of the height and width of the output features in the step 1 is as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2)
W_out = (W_in + 2p - W_k)/s + 1 (3)
wherein p is the padding (frame) width and s is the stride (step length).
4. The method for fast semantic segmentation of road scenes according to claim 3, characterized in that: the specific process of the step 2 is as follows: the training data comprises artificially acquired images and label images corresponding to the acquired images;
the training process is as follows: a label image is obtained from an input image, wherein the input image is a color RGB image, and the label image is a single-channel gray image.
5. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: in step 2, the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
6. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: the specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model is f_ω with parameters ω, and the loss function used is L, the target of the model is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, according to the step 3.1, determining a semantic segmentation loss function by the obtained model target:
the semantic segmentation essentially belongs to a classification task, and each pixel on the image has one and only one class, i.e., exactly one value in the pixel label is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the category whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight attenuation is added to the model, and the loss function after the weight attenuation is added is as follows:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, carrying out normalization processing on the model by adopting the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a given pixel position, and y_i is the corresponding predicted probability value;
step 3.5, obtaining a gradient value corresponding to each parameter participating in the operation based on a chain rule in derivation according to the result obtained in step 3.4, wherein the gradient value calculation process is shown as the following formula (11):
∂L/∂ω = Σ_{k≠t} y_k·∂x_k/∂ω + (y_t - 1)·∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
7. The method for fast semantic segmentation of road scenes according to claim 6, characterized in that: the specific process of the step 4 is as follows:
parameters in the model are updated according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter with value between 0 and 1, and lr is the update step length.
CN201911256375.3A 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene Active CN111179272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Publications (2)

Publication Number Publication Date
CN111179272A true CN111179272A (en) 2020-05-19
CN111179272B CN111179272B (en) 2024-01-05

Family

ID=70657206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256375.3A Active CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Country Status (1)

Country Link
CN (1) CN111179272B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819000A (en) * 2021-02-24 2021-05-18 长春工业大学 Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN113763392A (en) * 2021-11-10 2021-12-07 北京中科慧眼科技有限公司 Model prediction method and system for road surface flatness detection and intelligent terminal
CN115018857A (en) * 2022-08-10 2022-09-06 南昌昂坤半导体设备有限公司 Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guoqing Xu et al., "Dynamic Modeling of Driver Control Strategy of Lane-Change Behavior and Trajectory Planning for Collision Prediction," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1138-1154 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819000A (en) * 2021-02-24 2021-05-18 长春工业大学 Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN113763392A (en) * 2021-11-10 2021-12-07 北京中科慧眼科技有限公司 Model prediction method and system for road surface flatness detection and intelligent terminal
CN113763392B (en) * 2021-11-10 2022-03-18 北京中科慧眼科技有限公司 Model prediction method and system for road surface flatness detection and intelligent terminal
CN115018857A (en) * 2022-08-10 2022-09-06 南昌昂坤半导体设备有限公司 Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment
CN115018857B (en) * 2022-08-10 2022-11-11 南昌昂坤半导体设备有限公司 Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment

Also Published As

Publication number Publication date
CN111179272B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
US11354906B2 (en) Temporally distributed neural networks for video semantic segmentation
CN113011499B (en) Hyperspectral remote sensing image classification method based on double-attention machine system
CN110135267B (en) Large-scene SAR image fine target detection method
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
CN110210551A (en) A kind of visual target tracking method based on adaptive main body sensitivity
CN107092870A (en) A kind of high resolution image semantics information extracting method and system
CN111680695A (en) Semantic segmentation method based on reverse attention model
CN111523521A (en) Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN111179272B (en) Rapid semantic segmentation method for road scene
CN112560733B (en) Multitasking system and method for two-stage remote sensing image
CN112560966B (en) Polarized SAR image classification method, medium and equipment based on scattering map convolution network
CN110826411B (en) Vehicle target rapid identification method based on unmanned aerial vehicle image
Li et al. Robust deep neural networks for road extraction from remote sensing images
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN115311550B (en) Remote sensing image semantic change detection method and device, electronic equipment and storage medium
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN111739037A (en) Semantic segmentation method for indoor scene RGB-D image
CN116740362A (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN105678798A (en) Multi-target fuzzy clustering image segmentation method combining local spatial information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant