CN111179272B - Rapid semantic segmentation method for road scene - Google Patents

Rapid semantic segmentation method for road scene

Info

Publication number
CN111179272B
CN111179272B
Authority
CN
China
Prior art keywords
model
image
semantic segmentation
input
segmentation method
Prior art date
Legal status
Active
Application number
CN201911256375.3A
Other languages
Chinese (zh)
Other versions
CN111179272A (en)
Inventor
欧勇盛
彭远哲
王志扬
熊荣
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911256375.3A
Publication of CN111179272A
Application granted
Publication of CN111179272B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention discloses a rapid semantic segmentation method for a road scene, which specifically comprises the following steps: step 1, constructing a model based on a convolutional neural network; step 2, training the model constructed in step 1 with training data; step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss; and step 4, updating the model parameters according to the gradient obtained in step 3. The segmentation method provided by the invention can segment images rapidly while achieving high accuracy.

Description

Rapid semantic segmentation method for road scene
Technical Field
The invention belongs to the technical field of computer vision, and relates to a rapid semantic segmentation method for a road scene.
Background
In recent years, image semantic segmentation has become a research hotspot in computer vision and can be applied to scenes such as robot vision and environment perception for intelligent driving. Semantic segmentation aims to understand the target elements in a captured picture and to segment them precisely along their respective outlines. To a computer, an image is a multi-channel matrix of numerical values. For a computer to understand and segment the target elements, it must find the numerical characteristics of each target element in the original matrix, infer from these characteristics which target elements the image contains, and then analyse at a deeper level how these elements combine to form the picture. In general, the purpose of image semantic segmentation is to identify the objects in a picture and their interrelationships; it is a simulation of the human visual system. Humans understand their surroundings through visual perception, and image segmentation techniques acquire, understand and recognize the information in pictures by mimicking this perception. Image semantic segmentation is therefore a very important task in image processing and pattern recognition, and a preliminary step for many computer vision techniques. Performing detection and tracking on top of semantic segmentation results can reduce the search region and even directly provide the outlines of some objects; generating image text descriptions from segmentation results can add a large amount of inter-object position information; and performing image style transfer from segmentation results makes it possible to quickly locate and replace the background region or a specific target. Image semantic segmentation therefore has strong theoretical and practical value.
Early semantic segmentation methods relied on hand-crafted features, for example using random decision forests to predict classification probabilities and conditional random field probabilistic models to handle uncertainty and integrate contextual information in the image. In recent years, convolutional neural networks (CNNs) have developed rapidly in the field of computer vision thanks to the advent of large-scale training datasets and high-performance graphics processing units (GPUs). In addition, excellent open-source deep learning frameworks such as Caffe, MXNet and TensorFlow have facilitated the development of deep learning algorithms. Powerful deep neural networks have greatly reduced classification errors, and semantic segmentation has made great progress in the process.
Many researchers try to achieve accuracy as high as possible with as little computation and as few parameters as possible, so that models can run on vehicle terminal platforms; examples include the image classification models SqueezeNet, ShuffleNet and MobileNet. In semantic segmentation, the input size is 3×H×W and the output size is C×H×W; because the output has exactly the same width and height as the input and C is often much larger than 3, producing the output requires a large amount of computation. Meanwhile, to obtain higher segmentation accuracy, the image is downsampled fewer times, so the width and height of the intermediate features remain large, which further increases the amount of computation. Combining these two points, semantic segmentation is a task with a very large computational load.
There are generally two ways to reduce the amount of computation in semantic segmentation: reducing the picture size and reducing the model complexity. Reducing the picture size lowers the computation most directly, but the image loses a lot of detail, which affects accuracy. Reducing the model complexity weakens the feature extraction capability of the model, which also affects segmentation accuracy. At present, semantic segmentation models for road scenes find it difficult to balance real-time performance against high segmentation accuracy.
Moreover, most existing semantic segmentation frameworks are based on fully convolutional networks. Fully convolutional networks improved semantic segmentation performance by converting classification networks into fully convolutional form; in other words, the fully connected layers of the classification model are replaced with convolutional layers. However, fully convolutional networks do not take the relationships between pixels into account, omit the spatial regularization step used in conventional pixel-classification-based segmentation methods, and lack spatial consistency.
Disclosure of Invention
The invention aims to provide a rapid semantic segmentation method for road scenes, which solves the problems of low segmentation accuracy and weak feature extraction capability in existing segmentation methods.
The technical scheme adopted by the invention is a rapid semantic segmentation method for a road scene, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
and 4, updating the model parameters according to the gradient obtained in the step 3.
The invention is also characterized in that:
the specific process of the step 1 is as follows: processing an input image by using a convolutional neural network formed by a plurality of convolutional kernels, so as to realize that 3 XH XW data are input to obtain 1 XH XW prediction output, wherein H is the height of the input image, and W is the width of the input image;
constructing a model according to the following formula (1):
wherein F is out To output features, F in For inputting features, K i For the ith convolution kernel, N is the number of output channels, and b is offset;
since the image is two-dimensional data, the size of the input feature is C in ×H in ×W in The convolution kernel used has a size of C out ×C in ×H k ×W k The obtained output is characterized as C out ×H out ×W out
Wherein C is in And C out Channel number for input and output features, H in And W is in To input the height and width of the feature, H k And W is k Is the height and width of convolution kernel, H out And W is out Height and width for the output feature;
for input of C in ×H in ×W in Is characterized by using C out With a size of C in ×H k ×W k Is subjected to sliding multiplication and addition operation on the input features to obtain C out The size is H out ×W out Is characterized by (3).
The height and width of the output feature in step 1 are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
wherein p is the frame (padding) width and s is the step size (stride).
The specific process of step 2 is as follows: the training data comprise manually collected images and the label images corresponding to the collected images;
the training process is to obtain the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.
In step 2, the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
The specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameter ω is f_ω, and the loss function used is L, the model target is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, determining a semantic segmentation loss function according to the obtained model target in step 3.1:
semantic segmentation is essentially a classification task, and each pixel on the image has one and only one class, i.e. in each pixel label one and only one value is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight decay is added to the model; the loss function after adding the weight decay is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, normalizing the model output with the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value;
step 3.5, according to the result obtained in step 3.4, calculating the gradient value corresponding to each parameter participating in the operation based on the chain rule of differentiation, the gradient being calculated as shown in formula (11):
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
The specific process of the step 4 is as follows:
updating parameters in the model according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size.
The rapid semantic segmentation method for road scenes of the invention has the advantage that the model is constructed with a convolutional neural network; the multi-level convolution module accelerates feature extraction and replaces the 3×3 convolution to extract more and deeper high-level features, so the method can segment images rapidly while achieving high accuracy.
Drawings
FIG. 1 is a schematic diagram of the semantic segmentation process;
FIG. 2 is a schematic view of MConvblock structure adopted in a semantic segmentation model of an embodiment of a rapid semantic segmentation method for road scene according to the present invention;
FIG. 3 is a schematic view of MResblock structure in a semantic segmentation model of an embodiment of a rapid semantic segmentation method for road scene according to the present invention;
FIG. 4 is a schematic diagram of the network structure of Road Scene Deeplabv3+ in the semantic segmentation model of an embodiment of the rapid semantic segmentation method for a road scene according to the present invention;
FIGS. 5 (a)-(c) are diagrams of experimental results of the rapid semantic segmentation method for road scenes.
Detailed Description
The invention will be described in detail with reference to the drawings and the detailed description.
Semantic segmentation, which is a pixel-level classification task providing rich target information, is an important research problem in the field of robot perception, and has been widely used in many fields such as autopilot and robot navigation. In road scene understanding applications, the semantic segmentation model should accurately describe the appearance and class of different objects, and in addition, the semantic segmentation model needs to understand the spatial relationship between different objects.
Image segmentation is the division of an image into several disjoint regions each containing a single object, typically in a bottom-up fashion. Semantic segmentation assigns each pixel in an image to a category with semantic meaning, and is usually implemented in a top-down manner. The final goal of semantic segmentation is to obtain, in a top-down fashion, a model that can accurately predict a semantic label for each pixel in the input image; a schematic diagram of the process is shown in fig. 1.
The invention discloses a rapid semantic segmentation method for a road scene, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
convolutional neural networks are also known as region-dependent networks because convolution is performed by computing a region in an input feature.
The convolutional neural network is composed of a series of convolution kernels with fixed height and width and fixed parameters, the convolution kernels perform sliding operation on input features, and the values of covered areas are multiplied and added to obtain output features.
These output features can be used as input features for the next convolution to further operate, and multiple convolution kernels are superimposed in this fashion to further process the features to form a convolutional neural network.
This form of accumulation can reach tens or even hundreds of layers, thus creating a feature extraction process for the input image from low-level region correlation to high-level semantic correlation.
The model constructed based on the convolutional neural network is formulated as follows:
F_out^i = F_in ⊗ K_i + b, i = 1, 2, …, N (1);
wherein F_out is the output feature (F_out^i is its i-th channel), F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the bias.
Since the image is two-dimensional data, the input feature size is C_in×H_in×W_in, the convolution kernels used have size C_out×C_in×H_k×W_k, and the resulting output feature has size C_out×H_out×W_out, wherein C_in and C_out are the numbers of channels of the input and output features, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature.
The width and height of the output feature are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
where p is the frame (padding) width and s is the step size (stride).
For an input feature of size C_in×H_in×W_in, C_out convolution kernels of size C_in×H_k×W_k slide over the input feature and perform multiply-add operations, yielding C_out output features of size H_out×W_out.
To keep the width and height of the output feature consistent with those of the input feature during convolution, the border of the input feature is usually padded with zeros, and the amount of zero padding is chosen on the principle that the width and height of the feature are unchanged after convolution.
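As an illustration of the size rule above, the following sketch (written in PyTorch purely for illustration; the patent does not prescribe a framework, and the concrete sizes are assumptions) shows a 3×3 convolution whose zero padding p=1 keeps the width and height of the feature unchanged:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 512, 1024)             # input feature of size C_in x H_in x W_in = 3 x 512 x 1024
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
    y = conv(x)                                   # output feature of size C_out x H_out x W_out
    print(y.shape)                                # torch.Size([1, 16, 512, 1024])

    # The same sizes follow from formulas (2) and (3): H_out = (H_in + 2p - H_k)/s + 1.
    h_out = (512 + 2 * 1 - 3) // 1 + 1            # 512
    w_out = (1024 + 2 * 1 - 3) // 1 + 1           # 1024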
Semantic segmentation requires that an output be obtained that is as wide and high as the input image, and that each pixel corresponds to a class of predictions.
Therefore, the semantic segmentation process uses a convolutional neural network formed by a plurality of convolution kernels to process the input image, so that 3×H×W input data yield a 1×H×W prediction output, where H is the height of the input image and W is the width of the input image.
In practice, a probability is usually assigned to every possible class, so the convolutional neural network first produces a C×H×W prediction and then, at each pixel position, the class with the highest probability is selected as the final class, giving the 1×H×W prediction output, where C is the total number of classes.
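A minimal sketch of this last step, under the same illustrative assumptions as above (PyTorch, hypothetical sizes and class count), is:

    import torch

    C, H, W = 19, 512, 1024                       # e.g. 19 road-scene classes; sizes are illustrative
    logits = torch.randn(1, C, H, W)              # C x H x W prediction produced by the network
    pred = logits.argmax(dim=1, keepdim=True)     # pick the most probable class at every pixel
    print(pred.shape)                             # torch.Size([1, 1, 512, 1024]) -> the 1 x H x W prediction output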
Step 2, training the model constructed in the step 1 by using training data;
the training data of semantic segmentation is usually composed of a series of images collected by human and corresponding label images, and the purpose of model training is to enable the model to fit the data provided, i.e. to obtain the label images from the input images. The input image is a color RGB image that is usually acquired, while the label image is a single-channel gray image, and the gray value of a pixel directly represents the class to which the pixel belongs. Assuming that there are C classes to segment, then each pixel value in the label image is between 0 and C-1.
Step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
the penalty function is a function used to calculate the similarity to the prediction and tag data.
In general, the closer the predicted result is to the tag data, the smaller the numerical result calculated from the loss function. Assuming that the input data of the model is X, the tag data is Y, and the model is f with the parameter omega ω The loss function used is L, then the model targets:
ω=argmin ω L(f ω (X),Y) (4);
obtaining a model f with omega parameters ω So that it can get the minimum value of the prediction result and the label result under the loss function L.
The loss functions commonly used in deep learning are the mean absolute error (Mean Absolute Error, MAE), the mean square error (Mean Square Error, MSE) and the cross entropy. Cross entropy is used most in classification tasks; it measures the similarity between two probability distributions.
The loss function of MAE is:
L(f_ω(X), Y) = (1/n)·Σ_{i=1}^{n} |f_ω(x_i) - y_i| (5);
The loss function of MSE is:
L(f_ω(X), Y) = (1/n)·Σ_{i=1}^{n} (f_ω(x_i) - y_i)² (6);
The loss function of the cross entropy is:
L(f_ω(X), Y) = -Σ_{i} y_i·log f_ω(x_i) (7);
where n is the number of samples. The cross-entropy loss function is the one adopted here.
Semantic segmentation is essentially a classification task, and each pixel has one and only one class, i.e. in its label one and only one value is 1 and the rest are 0. The cross-entropy loss function can then be simplified to:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1.
A weight decay is usually added to the model to reduce its complexity. MSE is generally used as the loss function of the weight decay, so with the cross entropy as the main loss function there is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
where α is a weight, typically set to 0.0001.
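A compact sketch of this loss, again a PyTorch-style illustration under assumed shapes and a hypothetical 19-class setting (the stand-in model below is not the model of the invention), could read:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Conv2d(3, 19, kernel_size=3, padding=1)           # stand-in for the segmentation model

    def segmentation_loss(logits, labels, model, alpha=1e-4):
        """Formula (9): per-pixel cross entropy -log f_w(x_t) plus the weight decay term alpha * w^2."""
        ce = F.cross_entropy(logits, labels)                      # softmax + negative log-probability of the labelled class
        wd = sum(p.pow(2).sum() for p in model.parameters())      # squared norm of all model parameters
        return ce + alpha * wd

    images = torch.randn(2, 3, 64, 128)                           # input images
    labels = torch.randint(0, 19, (2, 64, 128))                   # per-pixel class ids in 0..C-1
    loss = segmentation_loss(model(images), labels, model)
    loss.backward()                                               # gradients for every parameter via the chain rule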
Since the output of the last convolution of the model is not constrained in its range of values, it needs to be normalized to between 0 and 1, and the sum over the channel dimension must be 1; that is, at each pixel location the sum of the predicted probabilities over all categories should be 1 so that it can correspond to the provided label.
To achieve this, the output is usually normalized with the softmax function, whose formula is:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value.
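A small sketch of this normalization follows (illustrative shapes; subtracting the per-pixel maximum is a standard numerical-stability trick and not part of the patent text):

    import torch

    def softmax_over_channels(x):
        """Formula (10): y_i = exp(x_i) / sum_j exp(x_j) over the C channels at every pixel position."""
        x = x - x.max(dim=1, keepdim=True).values   # stability only; the result is unchanged
        e = x.exp()
        return e / e.sum(dim=1, keepdim=True)

    scores = torch.randn(1, 19, 4, 4)                # C x H x W output of the last convolution (illustrative sizes)
    probs = softmax_over_channels(scores)
    print(probs.sum(dim=1))                          # each pixel position now sums to 1 over the channels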
After the predicted probability values are obtained, the difference between the prediction and the label can be calculated with the loss function, and the gradient value corresponding to each parameter participating in the operation is obtained with the chain rule of differentiation.
The invention uses the cross-entropy loss function to calculate the gradient of a single parameter ω as an example, and lets y_t be the predicted probability of the corresponding label:
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω. The operations involved are multiply-add operations and differentiable nonlinear operations, i.e. every operation is differentiable, so ∂x_k/∂ω and ∂x_t/∂ω can be obtained with the chain rule according to the specific structure of the model.
Thus, once the network model and the loss function are determined, the formula for their gradients is determined. When the network has more than one layer, the way the model parameters obtain their gradients is very similar to the way the model obtains its features: the gradients can be obtained in sequence by traversing the feature-extraction path of the network in the reverse direction.
And 4, updating the model parameters according to the gradient obtained in the step 3.
After the gradients of the parameters are obtained, the parameters in the model need to be updated according to the gradients. The invention mainly uses stochastic gradient descent with momentum, which is the most commonly used method in semantic segmentation. The word "stochastic" in stochastic gradient descent comes from the random selection of samples during training: because the number of training samples is often huge and they cannot all be placed on the GPU to train the model, a fixed number of samples is randomly selected from the training set at each training step. The invention randomly draws data from the training set without replacement and uses the drawn data to train the model. After the gradient is obtained in each training step, the parameters in the model are updated according to the following rules:
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the back-propagated gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size, whose value generally lies between 0 and 1.
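A direct implementation of this update rule (a sketch only; the lr and ρ values are merely examples) could be:

    import torch

    @torch.no_grad()
    def momentum_sgd_step(params, momenta, lr=0.01, rho=0.9):
        """Update rule of formulas (12) and (13): m_{t+1} = rho*m_t + dw, w_{t+1} = w_t - lr*m_{t+1}."""
        for p, m in zip(params, momenta):
            if p.grad is None:
                continue
            m.mul_(rho).add_(p.grad)      # formula (12)
            p.sub_(lr * m)                # formula (13)

    w = torch.randn(3, 3, requires_grad=True)
    momenta = [torch.zeros_like(w)]
    (w ** 2).sum().backward()             # some loss producing w.grad
    momentum_sgd_step([w], momenta)
    # torch.optim.SGD(..., lr=0.01, momentum=0.9) implements the same update rule.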
After the model, the data, the loss function and the parameter update rule are determined, the model can be trained iteratively on the dataset. The training process can target a number of iterations or a given accuracy. During training, a part of the data is taken from the training set as a validation set, and this validation set does not participate in training. After each iteration is completed, the model is evaluated on the validation set, and the training state of the model is judged by observing its performance on the training set and the validation set. For example, in semantic segmentation, if the model obtains a much higher intersection-over-union (IoU) on the training data than on the validation data, the model has usually over-fitted and training should be stopped to readjust the model. If the results on both the training set and the validation set remain poor, the model parameters have not converged and training should be stopped to change the training strategy. If the accuracies on the training set and the validation set are close and keep growing, the training state is good. After training is completed, the model can be tested on a test set to observe how it behaves in practical application.
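The training procedure described above could be organised roughly as follows; this is a sketch with hypothetical names, and the optimizer is assumed to implement the update of formulas (12) and (13):

    import torch

    def train(model, optimizer, train_loader, val_loader, loss_fn, epochs=100, device="cuda"):
        """Iterative training as described above; every name here is illustrative, not fixed by the patent."""
        model.to(device)
        for epoch in range(epochs):
            model.train()
            for images, labels in train_loader:                   # randomly drawn mini-batches
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()                                    # back-propagated gradients
                optimizer.step()                                   # update rule of formulas (12) and (13)
            model.eval()
            with torch.no_grad():
                val_loss = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                               for x, y in val_loader) / max(len(val_loader), 1)
            print(f"epoch {epoch}: validation loss {val_loss:.4f}")
            # Comparing training and validation metrics (e.g. IoU) at this point exposes
            # over-fitting or non-convergence, as discussed above.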
Examples
According to the rapid semantic segmentation method for road scenes of the invention, a rapid semantic segmentation model based on multi-level convolution modules is designed, with Deeplabv3+ as the overall framework of the model. In order to improve the running speed of the model, the invention designs a shallow network as the backbone of Deeplabv3+, named Road Scene Deeplabv3+; its network structure parameters are given in Table 1. The structure is similar to the feature extraction part of DarkNet53, but with fewer layers and fewer parameters. Meanwhile, the invention also designs a multi-level convolution module, MConvblock, to acquire multi-level features.
Table 1 Road Scene Deeplabv3+ structural parameters

Name    Repetitions    Output channels    Operation form
Conv1   1              16                 MConvblock
Sub1    1              32                 3×3 conv + BN + Leaky, stride=2
Res1    1              32                 MResblock
Sub2    1              64                 3×3 conv + BN + Leaky, stride=2
Res2    2              64                 MResblock
Sub3    1              128                3×3 conv + BN + Leaky, stride=2
Res3    2              128                MResblock
Sub4    1              256                3×3 conv + BN + Leaky, stride=2
Res4    2              256                MResblock
Sub5    1              512                3×3 conv + BN + Leaky, stride=2
Res5    1              512                MResblock
Conv2   1              1024               MConvblock
The structures of MConvblock and MResblock are shown in fig. 2 and 3, respectively, where K is the number of input channels and C is the number of output channels. It can be seen here that MResblock is a residual network structure based on MConvblock design, so that the output size and the input size of MResblock are exactly the same.
Fig. 4 shows the network structure of Road Scene Deeplabv3+ according to the invention, wherein Block1 outputs 4×-downsampled features, Block2 outputs 16×-downsampled features, and C denotes the number of channels of the input features. Conv1-Res2 and Sub3-Conv2 correspond to Table 1. In order for the backbone network of the invention to downsample the input image only 16 times, i.e. four 2× downsamplings, the stride of the convolution in Sub5 is replaced with 1, and the convolutions in Res5 and Conv2 are replaced with dilated (atrous) convolutions with a dilation rate of 2. In order to extract more high-level features, the invention adds a dilated convolution with a dilation rate of 24 to the atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP), and replaces the 3×3 convolutions in ASPP with the multi-level convolution module, where MConv denotes the MConvblock of fig. 2, so each convolution in the multi-level convolution module in ASPP is dilated accordingly. The output channels of each module in the ASPP of the invention are 64, much fewer than the 256 in the original Deeplabv3+, in order to reduce the amount of computation. Likewise, the invention correspondingly reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After these high-level features of different scales are fused, the features are up-sampled with 4× bilinear interpolation and fused with the 4×-downsampled features. Finally, three convolutions are applied and the result is up-sampled again with 4× bilinear interpolation to obtain the final semantic segmentation result. Using a GTX 1080 Ti GPU as the running platform, the model of the invention achieves a processing speed of 0.057 s per frame on images of size 2048×1024×3, and 0.03 s per frame on 720p images, i.e. of size 1920×720×3. Meanwhile, when the model parameters are stored with 32-bit floating-point precision, only 20 MB of storage space is required. In general, the rapid semantic segmentation model designed by the invention meets the requirements of road-scene semantic segmentation in terms of both speed and storage space.
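To make the ASPP modification concrete, the following hedged sketch assembles a 64-channel ASPP head with an added dilation-24 branch. It is an illustration only: the internal structure of MConvblock is given in fig. 2 and is replaced here by a plain conv-BN-LeakyReLU block, the dilation rates 6/12/18 follow common Deeplabv3+ practice rather than the patent text, and the global-pooling branch of the original ASPP is omitted.

    import torch
    import torch.nn as nn

    class ConvBNLeaky(nn.Module):
        """Stand-in for MConvblock; the real block is defined in fig. 2 and is not reproduced here."""
        def __init__(self, c_in, c_out, dilation=1):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1, inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    class ASPP(nn.Module):
        """ASPP head with 64-channel branches and an extra dilation-24 branch, as described above."""
        def __init__(self, c_in=1024, c_out=64, rates=(6, 12, 18, 24)):
            super().__init__()
            branches = [nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                      nn.BatchNorm2d(c_out),
                                      nn.LeakyReLU(0.1, inplace=True))]
            branches += [ConvBNLeaky(c_in, c_out, dilation=r) for r in rates]
            self.branches = nn.ModuleList(branches)
            self.project = nn.Conv2d(c_out * len(branches), c_out, 1)

        def forward(self, x):
            return self.project(torch.cat([b(x) for b in self.branches], dim=1))

    feature = torch.randn(1, 1024, 64, 128)    # 16x-downsampled backbone feature (output of Conv2)
    print(ASPP()(feature).shape)               # torch.Size([1, 64, 64, 128])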
The features of the above embodiment are as follows:
(1) In order to increase the running speed of the model, this embodiment designs a shallow network, Road Scene Deeplabv3+, as the backbone network for rapid semantic segmentation; Table 1 gives the Road Scene Deeplabv3+ network structure parameters. The structure is similar to the feature extraction part of DarkNet53, but with fewer layers and fewer parameters.
(2) In order for the backbone network of this embodiment to downsample the input image only 16 times, i.e. four 2× downsamplings, this embodiment replaces the stride of the convolution in Sub5 with 1 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with a dilation rate of 2.
(3) In order to extract more high-level features, this embodiment adds a dilated convolution with a dilation rate of 24 to the atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP).
(4) This embodiment also replaces the 3×3 convolutions in ASPP with the multi-level convolution module, where MConv denotes the MConvblock of fig. 2, so each convolution in the multi-level convolution module in ASPP is dilated accordingly. The output channels of each module in the ASPP of this embodiment are 64, much fewer than the 256 in the original Deeplabv3+, in order to reduce the amount of computation.
(5) This embodiment also correspondingly reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After these high-level features of different scales are fused, the features are up-sampled with 4× bilinear interpolation and fused with the 4×-downsampled features. Finally, three convolutions are applied and the result is up-sampled again with 4× bilinear interpolation to obtain the final semantic segmentation result.
In summary, current semantic segmentation models find it difficult to balance real-time performance and high segmentation accuracy in road-scene applications, so the invention designs a rapid semantic segmentation model for road scenes based on the Road Scene Deeplabv3+ network structure. First, a shallow network is designed as the backbone of Deeplabv3+ to extract features, and a multi-level convolution module is designed for this shallow network to accelerate feature extraction. The multi-level convolution module then replaces the 3×3 convolution in Deeplabv3+ to extract more and deeper high-level features. Compared with other methods, the method can segment images quickly and obtain higher segmentation accuracy.
In order to verify the validity of the road-scene semantic segmentation model provided by this embodiment, a number of experiments were performed; the experimental results are shown schematically in fig. 5, where column (a) of fig. 5 shows the input images, column (b) shows the label images, and column (c) shows the output images. Compared with image datasets collected directly from the real world, the Cityscapes dataset has undergone a certain amount of manual processing and the quality of its original and annotated images is higher, so the Cityscapes dataset, which is suitable for road-scene semantic segmentation experiments, was selected for the experiments.
Using a GTX 1080 Ti GPU as the running platform, the semantic segmentation model of this embodiment achieves a processing speed of 0.057 s per frame on images of size 2048×1024×3, and 0.03 s per frame on 720p images, i.e. of size 1920×720×3. Meanwhile, the model parameters of this embodiment occupy only 20 MB of storage space when stored with 32-bit floating-point precision. In general, the rapid semantic segmentation model designed in this embodiment meets the requirements of road-scene semantic segmentation in terms of both speed and storage space.

Claims (6)

1. A rapid semantic segmentation method for a road scene, characterized in that the method specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
step 4, updating the model parameters according to the gradient obtained in the step 3;
the specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameter ω is f_ω, and the loss function used is L, the model target is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, determining a semantic segmentation loss function according to the obtained model target in step 3.1:
semantic segmentation is essentially a classification task, and each pixel on the image has one and only one class, i.e. in each pixel label one and only one value is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight decay is added to the model; the loss function after adding the weight decay is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, normalizing the model output with the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value;
step 3.5, according to the result obtained in step 3.4, calculating the gradient value corresponding to each parameter participating in the operation based on the chain rule of differentiation, the gradient being calculated as shown in formula (11):
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
2. The rapid semantic segmentation method for a road scene according to claim 1, wherein the specific process of step 1 is as follows: an input image is processed with a convolutional neural network formed by a plurality of convolution kernels, so that 3×H×W input data yield a 1×H×W prediction output, wherein H is the height of the input image and W is the width of the input image;
the model is constructed according to the following formula (1):
F_out^i = F_in ⊗ K_i + b, i = 1, 2, …, N (1);
wherein F_out is the output feature (F_out^i is its i-th channel), F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the bias;
since the image is two-dimensional data, the size of the input feature is C_in×H_in×W_in, the convolution kernels used have size C_out×C_in×H_k×W_k, and the resulting output feature has size C_out×H_out×W_out;
wherein C_in and C_out are the numbers of channels of the input and output features, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature;
for an input feature of size C_in×H_in×W_in, C_out convolution kernels of size C_in×H_k×W_k slide over the input feature and perform multiply-add operations, yielding C_out output features of size H_out×W_out.
3. The rapid semantic segmentation method for a road scene according to claim 2, wherein the height and width of the output feature in step 1 are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
wherein p is the frame (padding) width and s is the step size (stride).
4. A rapid semantic segmentation method for road scenes according to claim 3, wherein the specific process of step 2 is as follows: the training data comprise manually collected images and the label images corresponding to the collected images;
the training process is to obtain the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.
5. The rapid semantic segmentation method for a road scene according to claim 4, wherein in step 2 the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
6. The rapid semantic segmentation method for a road scene according to claim 1, wherein the specific process of step 4 is as follows:
updating parameters in the model according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size.
CN201911256375.3A 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene Active CN111179272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Publications (2)

Publication Number Publication Date
CN111179272A CN111179272A (en) 2020-05-19
CN111179272B true CN111179272B (en) 2024-01-05

Family

ID=70657206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256375.3A Active CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Country Status (1)

Country Link
CN (1) CN111179272B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819000A (en) * 2021-02-24 2021-05-18 长春工业大学 Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN113763392B (en) * 2021-11-10 2022-03-18 北京中科慧眼科技有限公司 Model prediction method and system for road surface flatness detection and intelligent terminal
CN115018857B (en) * 2022-08-10 2022-11-11 南昌昂坤半导体设备有限公司 Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dynamic Modeling of Driver Control Strategy of Lane-Change Behavior and Trajectory Planning for Collision Prediction;Guoqing Xu;IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS;第13卷(第3期);1138-1154 *

Also Published As

Publication number Publication date
CN111179272A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN110188765B (en) Image semantic segmentation model generation method, device, equipment and storage medium
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN109886121B (en) Human face key point positioning method for shielding robustness
CN108229468B (en) Vehicle appearance feature recognition and vehicle retrieval method and device, storage medium and electronic equipment
US20210397876A1 (en) Similarity propagation for one-shot and few-shot image segmentation
US11354906B2 (en) Temporally distributed neural networks for video semantic segmentation
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN114202672A (en) Small target detection method based on attention mechanism
CN111179272B (en) Rapid semantic segmentation method for road scene
CN112036447B (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN107506792B (en) Semi-supervised salient object detection method
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111739037B (en) Semantic segmentation method for indoor scene RGB-D image
CN116645592B (en) Crack detection method based on image processing and storage medium
CN109657538B (en) Scene segmentation method and system based on context information guidance
Wu et al. A deep residual convolutional neural network for facial keypoint detection with missing labels
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
Li et al. Transformer helps identify kiwifruit diseases in complex natural environments
CN114863348A (en) Video target segmentation method based on self-supervision
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant