CN111179272B - Rapid semantic segmentation method for road scene - Google Patents

Rapid semantic segmentation method for road scene

Info

Publication number
CN111179272B
CN111179272B
Authority
CN
China
Prior art keywords
model
image
semantic segmentation
input
segmentation method
Prior art date
Legal status
Active
Application number
CN201911256375.3A
Other languages
Chinese (zh)
Other versions
CN111179272A (en)
Inventor
欧勇盛
彭远哲
王志扬
熊荣
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911256375.3A
Publication of CN111179272A
Application granted
Publication of CN111179272B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention discloses a rapid semantic segmentation method for a road scene, which specifically comprises the following steps: step 1, constructing a model based on a convolutional neural network; step 2, training the model constructed in step 1 with training data; step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss; and step 4, updating the model parameters according to the gradient obtained in step 3. The segmentation method provided by the invention can segment images rapidly while achieving high accuracy.

Description

Rapid semantic segmentation method for road scene
Technical Field
The invention belongs to the technical field of computer vision, and relates to a rapid semantic segmentation method for a road scene.
Background
In recent years, image semantic segmentation has become a research hotspot in computer vision and can be applied to scenes such as robot vision and environment perception for intelligent driving. Semantic segmentation aims to understand the target elements in a captured picture and to segment them precisely along their respective outlines. To a computer, an image is a multi-channel matrix of numerical values. For a computer to understand and segment the target elements, it must find the numerical characteristics of each target element in the original matrix, infer from these characteristics which target elements the image contains, and then analyse at a deeper level how these elements combine to form the picture. In general, the purpose of image semantic segmentation is to identify the objects in a picture and their interrelationships; it is a simulation of the human visual system. Humans understand their surroundings through visual perception, and image segmentation techniques acquire, understand and recognize the information in pictures by mimicking this perception. Image semantic segmentation is therefore a very important task in image processing and pattern recognition, and a preliminary step for many computer vision techniques. Performing detection and tracking on top of semantic segmentation results can reduce the search region and even directly provide the outlines of some objects; generating image text descriptions from segmentation results can add a large amount of inter-object position information; and performing image style transfer from segmentation results makes it possible to quickly locate and replace the background region or a specific target. Image semantic segmentation therefore has strong theoretical and practical value.
Early semantic segmentation methods relied on hand-crafted features, for example using random decision forests to predict classification probabilities and conditional random field probabilistic models to handle uncertainty and integrate contextual information in the image. In recent years, convolutional neural networks (CNNs) have developed rapidly in the field of computer vision thanks to the advent of large-scale training datasets and high-performance graphics processing units (GPUs). In addition, excellent open-source deep learning frameworks such as Caffe, MXNet and TensorFlow have facilitated the development of deep learning algorithms. Powerful deep neural networks have greatly reduced classification errors, and semantic segmentation has made great progress in the process.
Many researchers try to achieve accuracy as high as possible with as little computation and as few parameters as possible, so that models can run on vehicle terminal platforms; examples include the image classification models SqueezeNet, ShuffleNet and MobileNet. In semantic segmentation, the input size is 3×H×W and the output size is C×H×W; because the output has exactly the same width and height as the input and C is often much larger than 3, producing the output requires a large amount of computation. Meanwhile, to obtain higher segmentation accuracy, the image is downsampled fewer times, so the width and height of the intermediate features remain large, which further increases the amount of computation. Combining these two points, semantic segmentation is a task with a very large computational load.
There are generally two ways to reduce the amount of computation in semantic segmentation: reducing the picture size and reducing the model complexity. Reducing the picture size lowers the computation most directly, but the image loses a lot of detail, which affects accuracy. Reducing the model complexity weakens the feature extraction capability of the model, which also affects segmentation accuracy. At present, semantic segmentation models for road scenes find it difficult to balance real-time performance against high segmentation accuracy.
Moreover, most existing semantic segmentation frameworks are based on fully convolutional networks. Fully convolutional networks improved semantic segmentation performance by converting classification networks into fully convolutional form; in other words, the fully connected layers of the classification model are replaced with convolutional layers. However, fully convolutional networks do not take the relationships between pixels into account, omit the spatial regularization step used in conventional pixel-classification-based segmentation methods, and lack spatial consistency.
Disclosure of Invention
The invention aims to provide a rapid semantic segmentation method for road scenes, which solves the problems of low segmentation accuracy and weak feature extraction capability in existing segmentation methods.
The technical scheme adopted by the invention is a rapid semantic segmentation method for a road scene, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
and 4, updating the model parameters according to the gradient obtained in the step 3.
The invention is also characterized in that:
the specific process of the step 1 is as follows: processing an input image by using a convolutional neural network formed by a plurality of convolutional kernels, so as to realize that 3 XH XW data are input to obtain 1 XH XW prediction output, wherein H is the height of the input image, and W is the width of the input image;
constructing a model according to the following formula (1):
wherein F is out To output features, F in For inputting features, K i For the ith convolution kernel, N is the number of output channels, and b is offset;
since the image is two-dimensional data, the size of the input feature is C in ×H in ×W in The convolution kernel used has a size of C out ×C in ×H k ×W k The obtained output is characterized as C out ×H out ×W out
Wherein C is in And C out Channel number for input and output features, H in And W is in To input the height and width of the feature, H k And W is k Is the height and width of convolution kernel, H out And W is out Height and width for the output feature;
for input of C in ×H in ×W in Is characterized by using C out With a size of C in ×H k ×W k Is subjected to sliding multiplication and addition operation on the input features to obtain C out The size is H out ×W out Is characterized by (3).
The height and width of the output feature in step 1 are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
wherein p is the frame (padding) width and s is the step size (stride).
The specific process of step 2 is as follows: the training data comprise manually collected images and the label images corresponding to the collected images;
the training process is to obtain the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.
In step 2, the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
The specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameter ω is f_ω, and the loss function used is L, the model target is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, determining a semantic segmentation loss function according to the obtained model target in step 3.1:
semantic segmentation is essentially a classification task, and each pixel on the image has one and only one class, i.e. in each pixel label one and only one value is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight decay is added to the model; the loss function after adding the weight decay is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, normalizing the model output with the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value;
step 3.5, according to the result obtained in step 3.4, calculating the gradient value corresponding to each parameter participating in the operation based on the chain rule of differentiation, the gradient being calculated as shown in formula (11):
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
The specific process of the step 4 is as follows:
updating parameters in the model according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size.
The rapid semantic segmentation method for road scenes of the invention has the advantage that the model is constructed with a convolutional neural network; the multi-level convolution module accelerates feature extraction and replaces the 3×3 convolution to extract more and deeper high-level features, so the method can segment images rapidly while achieving high accuracy.
Drawings
FIG. 1 is a schematic diagram of the semantic segmentation process;
FIG. 2 is a schematic view of MConvblock structure adopted in a semantic segmentation model of an embodiment of a rapid semantic segmentation method for road scene according to the present invention;
FIG. 3 is a schematic view of MResblock structure in a semantic segmentation model of an embodiment of a rapid semantic segmentation method for road scene according to the present invention;
FIG. 4 is a schematic diagram of the network structure of Road Scene Deeplabv3+ in the semantic segmentation model of an embodiment of the rapid semantic segmentation method for a road scene according to the present invention;
FIGS. 5 (a)-(c) are diagrams of experimental results of the rapid semantic segmentation method for road scenes.
Detailed Description
The invention will be described in detail with reference to the drawings and the detailed description.
Semantic segmentation, which is a pixel-level classification task providing rich target information, is an important research problem in the field of robot perception, and has been widely used in many fields such as autopilot and robot navigation. In road scene understanding applications, the semantic segmentation model should accurately describe the appearance and class of different objects, and in addition, the semantic segmentation model needs to understand the spatial relationship between different objects.
Image segmentation is the division of an image into several disjoint regions each containing a single object, typically in a bottom-up fashion. Semantic segmentation assigns each pixel in an image to a category with semantic meaning, and is usually implemented in a top-down manner. The final goal of semantic segmentation is to obtain, in a top-down fashion, a model that can accurately predict a semantic label for each pixel in the input image; a schematic diagram of the process is shown in fig. 1.
The invention discloses a rapid semantic segmentation method for a road scene, which specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
convolutional neural networks are also known as region-dependent networks because convolution is performed by computing a region in an input feature.
The convolutional neural network is composed of a series of convolution kernels with fixed height and width and fixed parameters, the convolution kernels perform sliding operation on input features, and the values of covered areas are multiplied and added to obtain output features.
These output features can be used as input features for the next convolution to further operate, and multiple convolution kernels are superimposed in this fashion to further process the features to form a convolutional neural network.
This form of accumulation can reach tens or even hundreds of layers, thus creating a feature extraction process for the input image from low-level region correlation to high-level semantic correlation.
The model constructed based on the convolutional neural network is formulated as follows:
F_out^i = F_in ⊗ K_i + b, i = 1, 2, …, N (1);
wherein F_out is the output feature (F_out^i is its i-th channel), F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the bias.
Since the image is two-dimensional data, the input feature size is C_in×H_in×W_in, the convolution kernels used have size C_out×C_in×H_k×W_k, and the resulting output feature has size C_out×H_out×W_out, wherein C_in and C_out are the numbers of channels of the input and output features, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature.
The width and height of the output feature are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
where p is the frame (padding) width and s is the step size (stride).
For an input feature of size C_in×H_in×W_in, C_out convolution kernels of size C_in×H_k×W_k slide over the input feature and perform multiply-add operations, yielding C_out output features of size H_out×W_out.
To keep the width and height of the output feature consistent with those of the input feature during convolution, the border of the input feature is usually padded with zeros, and the amount of zero padding is chosen on the principle that the width and height of the feature are unchanged after convolution.
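As an illustration of the size rule above, the following sketch (written in PyTorch purely for illustration; the patent does not prescribe a framework, and the concrete sizes are assumptions) shows a 3×3 convolution whose zero padding p=1 keeps the width and height of the feature unchanged:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 512, 1024)             # input feature of size C_in x H_in x W_in = 3 x 512 x 1024
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
    y = conv(x)                                   # output feature of size C_out x H_out x W_out
    print(y.shape)                                # torch.Size([1, 16, 512, 1024])

    # The same sizes follow from formulas (2) and (3): H_out = (H_in + 2p - H_k)/s + 1.
    h_out = (512 + 2 * 1 - 3) // 1 + 1            # 512
    w_out = (1024 + 2 * 1 - 3) // 1 + 1           # 1024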
Semantic segmentation requires that an output be obtained that is as wide and high as the input image, and that each pixel corresponds to a class of predictions.
Therefore, the semantic segmentation process uses a convolutional neural network formed by a plurality of convolution kernels to process the input image, so that 3×H×W input data yield a 1×H×W prediction output, where H is the height of the input image and W is the width of the input image.
In practice, a probability is usually assigned to every possible class, so the convolutional neural network first produces a C×H×W prediction and then, at each pixel position, the class with the highest probability is selected as the final class, giving the 1×H×W prediction output, where C is the total number of classes.
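A minimal sketch of this last step, under the same illustrative assumptions as above (PyTorch, hypothetical sizes and class count), is:

    import torch

    C, H, W = 19, 512, 1024                       # e.g. 19 road-scene classes; sizes are illustrative
    logits = torch.randn(1, C, H, W)              # C x H x W prediction produced by the network
    pred = logits.argmax(dim=1, keepdim=True)     # pick the most probable class at every pixel
    print(pred.shape)                             # torch.Size([1, 1, 512, 1024]) -> the 1 x H x W prediction output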
Step 2, training the model constructed in the step 1 by using training data;
the training data of semantic segmentation is usually composed of a series of images collected by human and corresponding label images, and the purpose of model training is to enable the model to fit the data provided, i.e. to obtain the label images from the input images. The input image is a color RGB image that is usually acquired, while the label image is a single-channel gray image, and the gray value of a pixel directly represents the class to which the pixel belongs. Assuming that there are C classes to segment, then each pixel value in the label image is between 0 and C-1.
Step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
the penalty function is a function used to calculate the similarity to the prediction and tag data.
In general, the closer the predicted result is to the tag data, the smaller the numerical result calculated from the loss function. Assuming that the input data of the model is X, the tag data is Y, and the model is f with the parameter omega ω The loss function used is L, then the model targets:
ω=argmin ω L(f ω (X),Y) (4);
obtaining a model f with omega parameters ω So that it can get the minimum value of the prediction result and the label result under the loss function L.
The loss functions commonly used in deep learning are the mean absolute error (Mean Absolute Error, MAE), the mean square error (Mean Square Error, MSE) and the cross entropy. Cross entropy is used most in classification tasks; it measures the similarity between two probability distributions.
The loss function of MAE is:
L(f_ω(X), Y) = (1/n)·Σ_{i=1}^{n} |f_ω(x_i) - y_i| (5);
The loss function of MSE is:
L(f_ω(X), Y) = (1/n)·Σ_{i=1}^{n} (f_ω(x_i) - y_i)² (6);
The loss function of the cross entropy is:
L(f_ω(X), Y) = -Σ_{i} y_i·log f_ω(x_i) (7);
where n is the number of samples. The cross-entropy loss function is the one adopted here.
Semantic segmentation is essentially a classification task, and each pixel has one and only one class, i.e. in its label one and only one value is 1 and the rest are 0. The cross-entropy loss function can then be simplified to:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1.
A weight decay is usually added to the model to reduce its complexity. MSE is generally used as the loss function of the weight decay, so with the cross entropy as the main loss function there is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
where α is a weight, typically set to 0.0001.
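A compact sketch of this loss, again a PyTorch-style illustration under assumed shapes and a hypothetical 19-class setting (the stand-in model below is not the model of the invention), could read:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Conv2d(3, 19, kernel_size=3, padding=1)           # stand-in for the segmentation model

    def segmentation_loss(logits, labels, model, alpha=1e-4):
        """Formula (9): per-pixel cross entropy -log f_w(x_t) plus the weight decay term alpha * w^2."""
        ce = F.cross_entropy(logits, labels)                      # softmax + negative log-probability of the labelled class
        wd = sum(p.pow(2).sum() for p in model.parameters())      # squared norm of all model parameters
        return ce + alpha * wd

    images = torch.randn(2, 3, 64, 128)                           # input images
    labels = torch.randint(0, 19, (2, 64, 128))                   # per-pixel class ids in 0..C-1
    loss = segmentation_loss(model(images), labels, model)
    loss.backward()                                               # gradients for every parameter via the chain rule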
Since the output of the last convolution of the model is not constrained in its range of values, it needs to be normalized to between 0 and 1, and the sum over the channel dimension must be 1; that is, at each pixel location the sum of the predicted probabilities over all categories should be 1 so that it can correspond to the provided label.
To achieve this, the output is usually normalized with the softmax function, whose formula is:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value.
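A small sketch of this normalization follows (illustrative shapes; subtracting the per-pixel maximum is a standard numerical-stability trick and not part of the patent text):

    import torch

    def softmax_over_channels(x):
        """Formula (10): y_i = exp(x_i) / sum_j exp(x_j) over the C channels at every pixel position."""
        x = x - x.max(dim=1, keepdim=True).values   # stability only; the result is unchanged
        e = x.exp()
        return e / e.sum(dim=1, keepdim=True)

    scores = torch.randn(1, 19, 4, 4)                # C x H x W output of the last convolution (illustrative sizes)
    probs = softmax_over_channels(scores)
    print(probs.sum(dim=1))                          # each pixel position now sums to 1 over the channels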
After the predicted probability values are obtained, the difference between the prediction and the label can be calculated with the loss function, and the gradient value corresponding to each parameter participating in the operation is obtained with the chain rule of differentiation.
The invention uses the cross-entropy loss function to calculate the gradient of a single parameter ω as an example, and lets y_t be the predicted probability of the corresponding label:
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω. The operations involved are multiply-add operations and differentiable nonlinear operations, i.e. every operation is differentiable, so ∂x_k/∂ω and ∂x_t/∂ω can be obtained with the chain rule according to the specific structure of the model.
Thus, once the network model and the loss function are determined, the formula for their gradients is determined. When the network has more than one layer, the way the model parameters obtain their gradients is very similar to the way the model obtains its features: the gradients can be obtained in sequence by traversing the feature-extraction path of the network in the reverse direction.
And 4, updating the model parameters according to the gradient obtained in the step 3.
After the gradients of the parameters are obtained, the parameters in the model need to be updated according to the gradients. The invention mainly uses stochastic gradient descent with momentum, which is the most commonly used method in semantic segmentation. The word "stochastic" in stochastic gradient descent comes from the random selection of samples during training: because the number of training samples is often huge and they cannot all be placed on the GPU to train the model, a fixed number of samples is randomly selected from the training set at each training step. The invention randomly draws data from the training set without replacement and uses the drawn data to train the model. After the gradient is obtained in each training step, the parameters in the model are updated according to the following rules:
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the back-propagated gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size, whose value generally lies between 0 and 1.
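A direct implementation of this update rule (a sketch only; the lr and ρ values are merely examples) could be:

    import torch

    @torch.no_grad()
    def momentum_sgd_step(params, momenta, lr=0.01, rho=0.9):
        """Update rule of formulas (12) and (13): m_{t+1} = rho*m_t + dw, w_{t+1} = w_t - lr*m_{t+1}."""
        for p, m in zip(params, momenta):
            if p.grad is None:
                continue
            m.mul_(rho).add_(p.grad)      # formula (12)
            p.sub_(lr * m)                # formula (13)

    w = torch.randn(3, 3, requires_grad=True)
    momenta = [torch.zeros_like(w)]
    (w ** 2).sum().backward()             # some loss producing w.grad
    momentum_sgd_step([w], momenta)
    # torch.optim.SGD(..., lr=0.01, momentum=0.9) implements the same update rule.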
After the model, the data, the loss function and the parameter update rule are determined, the model can be trained iteratively on the dataset. The training process can target a number of iterations or a given accuracy. During training, a part of the data is taken from the training set as a validation set, and this validation set does not participate in training. After each iteration is completed, the model is evaluated on the validation set, and the training state of the model is judged by observing its performance on the training set and the validation set. For example, in semantic segmentation, if the model obtains a much higher intersection-over-union (IoU) on the training data than on the validation data, the model has usually over-fitted and training should be stopped to readjust the model. If the results on both the training set and the validation set remain poor, the model parameters have not converged and training should be stopped to change the training strategy. If the accuracies on the training set and the validation set are close and keep growing, the training state is good. After training is completed, the model can be tested on a test set to observe how it behaves in practical application.
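The training procedure described above could be organised roughly as follows; this is a sketch with hypothetical names, and the optimizer is assumed to implement the update of formulas (12) and (13):

    import torch

    def train(model, optimizer, train_loader, val_loader, loss_fn, epochs=100, device="cuda"):
        """Iterative training as described above; every name here is illustrative, not fixed by the patent."""
        model.to(device)
        for epoch in range(epochs):
            model.train()
            for images, labels in train_loader:                   # randomly drawn mini-batches
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()                                    # back-propagated gradients
                optimizer.step()                                   # update rule of formulas (12) and (13)
            model.eval()
            with torch.no_grad():
                val_loss = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                               for x, y in val_loader) / max(len(val_loader), 1)
            print(f"epoch {epoch}: validation loss {val_loss:.4f}")
            # Comparing training and validation metrics (e.g. IoU) at this point exposes
            # over-fitting or non-convergence, as discussed above.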
Examples
According to the rapid semantic segmentation method for road scenes of the invention, a rapid semantic segmentation model based on multi-level convolution modules is designed, with Deeplabv3+ as the overall framework of the model. In order to improve the running speed of the model, the invention designs a shallow network as the backbone of Deeplabv3+, named Road Scene Deeplabv3+; its network structure parameters are given in Table 1. The structure is similar to the feature extraction part of DarkNet53, but with fewer layers and fewer parameters. Meanwhile, the invention also designs a multi-level convolution module, MConvblock, to acquire multi-level features.
Table 1 Road Scene Deeplabv3+ structural parameters

Name    Repetitions    Output channels    Operation form
Conv1   1              16                 MConvblock
Sub1    1              32                 3×3 conv + BN + Leaky, stride=2
Res1    1              32                 MResblock
Sub2    1              64                 3×3 conv + BN + Leaky, stride=2
Res2    2              64                 MResblock
Sub3    1              128                3×3 conv + BN + Leaky, stride=2
Res3    2              128                MResblock
Sub4    1              256                3×3 conv + BN + Leaky, stride=2
Res4    2              256                MResblock
Sub5    1              512                3×3 conv + BN + Leaky, stride=2
Res5    1              512                MResblock
Conv2   1              1024               MConvblock
The structures of MConvblock and MResblock are shown in fig. 2 and 3, respectively, where K is the number of input channels and C is the number of output channels. It can be seen here that MResblock is a residual network structure based on MConvblock design, so that the output size and the input size of MResblock are exactly the same.
Fig. 4 shows the network structure of Road Scene Deeplabv3+ according to the invention, wherein Block1 outputs 4×-downsampled features, Block2 outputs 16×-downsampled features, and C denotes the number of channels of the input features. Conv1-Res2 and Sub3-Conv2 correspond to Table 1. In order for the backbone network of the invention to downsample the input image only 16 times, i.e. four 2× downsamplings, the stride of the convolution in Sub5 is replaced with 1, and the convolutions in Res5 and Conv2 are replaced with dilated (atrous) convolutions with a dilation rate of 2. In order to extract more high-level features, the invention adds a dilated convolution with a dilation rate of 24 to the atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP), and replaces the 3×3 convolutions in ASPP with the multi-level convolution module, where MConv denotes the MConvblock of fig. 2, so each convolution in the multi-level convolution module in ASPP is dilated accordingly. The output channels of each module in the ASPP of the invention are 64, much fewer than the 256 in the original Deeplabv3+, in order to reduce the amount of computation. Likewise, the invention correspondingly reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After these high-level features of different scales are fused, the features are up-sampled with 4× bilinear interpolation and fused with the 4×-downsampled features. Finally, three convolutions are applied and the result is up-sampled again with 4× bilinear interpolation to obtain the final semantic segmentation result. Using a GTX 1080 Ti GPU as the running platform, the model of the invention achieves a processing speed of 0.057 s per frame on images of size 2048×1024×3, and 0.03 s per frame on 720p images, i.e. of size 1920×720×3. Meanwhile, when the model parameters are stored with 32-bit floating-point precision, only 20 MB of storage space is required. In general, the rapid semantic segmentation model designed by the invention meets the requirements of road-scene semantic segmentation in terms of both speed and storage space.
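To make the ASPP modification concrete, the following hedged sketch assembles a 64-channel ASPP head with an added dilation-24 branch. It is an illustration only: the internal structure of MConvblock is given in fig. 2 and is replaced here by a plain conv-BN-LeakyReLU block, the dilation rates 6/12/18 follow common Deeplabv3+ practice rather than the patent text, and the global-pooling branch of the original ASPP is omitted.

    import torch
    import torch.nn as nn

    class ConvBNLeaky(nn.Module):
        """Stand-in for MConvblock; the real block is defined in fig. 2 and is not reproduced here."""
        def __init__(self, c_in, c_out, dilation=1):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1, inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    class ASPP(nn.Module):
        """ASPP head with 64-channel branches and an extra dilation-24 branch, as described above."""
        def __init__(self, c_in=1024, c_out=64, rates=(6, 12, 18, 24)):
            super().__init__()
            branches = [nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                      nn.BatchNorm2d(c_out),
                                      nn.LeakyReLU(0.1, inplace=True))]
            branches += [ConvBNLeaky(c_in, c_out, dilation=r) for r in rates]
            self.branches = nn.ModuleList(branches)
            self.project = nn.Conv2d(c_out * len(branches), c_out, 1)

        def forward(self, x):
            return self.project(torch.cat([b(x) for b in self.branches], dim=1))

    feature = torch.randn(1, 1024, 64, 128)    # 16x-downsampled backbone feature (output of Conv2)
    print(ASPP()(feature).shape)               # torch.Size([1, 64, 64, 128])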
The features of the above embodiment are as follows:
(1) In order to increase the running speed of the model, this embodiment designs a shallow network, Road Scene Deeplabv3+, as the backbone network for rapid semantic segmentation; Table 1 gives the Road Scene Deeplabv3+ network structure parameters. The structure is similar to the feature extraction part of DarkNet53, but with fewer layers and fewer parameters.
(2) In order for the backbone network of this embodiment to downsample the input image only 16 times, i.e. four 2× downsamplings, this embodiment replaces the stride of the convolution in Sub5 with 1 and replaces the convolutions in Res5 and Conv2 with dilated (atrous) convolutions with a dilation rate of 2.
(3) In order to extract more high-level features, this embodiment adds a dilated convolution with a dilation rate of 24 to the atrous spatial pyramid pooling (Atrous Spatial Pyramid Pooling, ASPP).
(4) This embodiment also replaces the 3×3 convolutions in ASPP with the multi-level convolution module, where MConv denotes the MConvblock of fig. 2, so each convolution in the multi-level convolution module in ASPP is dilated accordingly. The output channels of each module in the ASPP of this embodiment are 64, much fewer than the 256 in the original Deeplabv3+, in order to reduce the amount of computation.
(5) This embodiment also correspondingly reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After these high-level features of different scales are fused, the features are up-sampled with 4× bilinear interpolation and fused with the 4×-downsampled features. Finally, three convolutions are applied and the result is up-sampled again with 4× bilinear interpolation to obtain the final semantic segmentation result.
In summary, current semantic segmentation models find it difficult to balance real-time performance and high segmentation accuracy in road-scene applications, so the invention designs a rapid semantic segmentation model for road scenes based on the Road Scene Deeplabv3+ network structure. First, a shallow network is designed as the backbone of Deeplabv3+ to extract features, and a multi-level convolution module is designed for this shallow network to accelerate feature extraction. The multi-level convolution module then replaces the 3×3 convolution in Deeplabv3+ to extract more and deeper high-level features. Compared with other methods, the method can segment images quickly and obtain higher segmentation accuracy.
In order to verify the validity of the road-scene semantic segmentation model provided by this embodiment, a number of experiments were performed; the experimental results are shown schematically in fig. 5, where column (a) of fig. 5 shows the input images, column (b) shows the label images, and column (c) shows the output images. Compared with image datasets collected directly from the real world, the Cityscapes dataset has undergone a certain amount of manual processing and the quality of its original and annotated images is higher, so the Cityscapes dataset, which is suitable for road-scene semantic segmentation experiments, was selected for the experiments.
Using a GTX 1080 Ti GPU as the running platform, the semantic segmentation model of this embodiment achieves a processing speed of 0.057 s per frame on images of size 2048×1024×3, and 0.03 s per frame on 720p images, i.e. of size 1920×720×3. Meanwhile, the model parameters of this embodiment occupy only 20 MB of storage space when stored with 32-bit floating-point precision. In general, the rapid semantic segmentation model designed in this embodiment meets the requirements of road-scene semantic segmentation in terms of both speed and storage space.

Claims (6)

1. A rapid semantic segmentation method for a road scene, characterized in that the method specifically comprises the following steps:
step 1, constructing a model based on a convolutional neural network;
step 2, training the model constructed in the step 1 by using training data;
step 3, calculating the loss of the model trained in step 2 with a loss function, and calculating the gradient from the obtained loss;
step 4, updating the model parameters according to the gradient obtained in the step 3;
the specific process of the step 3 is as follows:
step 3.1, determining a model target:
assuming that the input data of the model is X, the label data is Y, the model with parameter ω is f_ω, and the loss function used is L, the model target is:
ω = argmin_ω L(f_ω(X), Y) (4);
step 3.2, determining a semantic segmentation loss function according to the obtained model target in step 3.1:
semantic segmentation is essentially a classification task, and each pixel on the image has one and only one class, i.e. in each pixel label one and only one value is 1 and the rest are 0; the loss function for semantic segmentation is then:
L(f_ω(X), Y) = -log f_ω(x_t) (8);
wherein f_ω(x_t) is the predicted probability value of the class whose label is 1;
step 3.3, in order to reduce the complexity of the model, a weight decay is added to the model; the loss function after adding the weight decay is:
L(f_ω(X), Y) = -log f_ω(x_t) + α·ω² (9);
wherein α is a weight;
step 3.4, normalizing the model output with the following formula:
y_i = e^{x_i} / Σ_{j=1}^{C} e^{x_j} (10);
wherein C is the number of channels, x_i is the value output by the model at a pixel location for channel i, and y_i is the corresponding predicted probability value;
step 3.5, according to the result obtained in step 3.4, calculating the gradient value corresponding to each parameter participating in the operation based on the chain rule of differentiation, the gradient being calculated as shown in formula (11):
Δω = ∂L/∂ω = Σ_{k=1}^{C} y_k·(∂x_k/∂ω) - ∂x_t/∂ω (11);
wherein x_k and x_t are both obtained by convolution with the parameter ω.
2. The rapid semantic segmentation method for a road scene according to claim 1, wherein the specific process of step 1 is as follows: an input image is processed with a convolutional neural network formed by a plurality of convolution kernels, so that 3×H×W input data yield a 1×H×W prediction output, wherein H is the height of the input image and W is the width of the input image;
the model is constructed according to the following formula (1):
F_out^i = F_in ⊗ K_i + b, i = 1, 2, …, N (1);
wherein F_out is the output feature (F_out^i is its i-th channel), F_in is the input feature, K_i is the i-th convolution kernel, N is the number of output channels, and b is the bias;
since the image is two-dimensional data, the size of the input feature is C_in×H_in×W_in, the convolution kernels used have size C_out×C_in×H_k×W_k, and the resulting output feature has size C_out×H_out×W_out;
wherein C_in and C_out are the numbers of channels of the input and output features, H_in and W_in are the height and width of the input feature, H_k and W_k are the height and width of the convolution kernel, and H_out and W_out are the height and width of the output feature;
for an input feature of size C_in×H_in×W_in, C_out convolution kernels of size C_in×H_k×W_k slide over the input feature and perform multiply-add operations, yielding C_out output features of size H_out×W_out.
3. The rapid semantic segmentation method for a road scene according to claim 2, wherein the height and width of the output feature in step 1 are calculated as follows:
H_out = (H_in + 2p - H_k)/s + 1 (2);
W_out = (W_in + 2p - W_k)/s + 1 (3);
wherein p is the frame (padding) width and s is the step size (stride).
4. A rapid semantic segmentation method for road scenes according to claim 3, wherein the specific process of step 2 is as follows: the training data comprise manually collected images and the label images corresponding to the collected images;
the training process is to obtain the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.
5. The rapid semantic segmentation method for a road scene according to claim 4, wherein in step 2 the gray value of an image pixel directly represents the category to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image lies between 0 and C-1.
6. The rapid semantic segmentation method for a road scene according to claim 1, wherein the specific process of step 4 is as follows:
updating parameters in the model according to the following formulas (12) and (13):
m_{t+1} = ρ·m_t + Δω (12);
ω_{t+1} = ω_t - lr·m_{t+1} (13);
wherein m_t is the current momentum, m_{t+1} is the updated momentum, ω_t is the current parameter, ω_{t+1} is the updated parameter, Δω is the gradient, ρ is a weight parameter between 0 and 1, and lr is the update step size.
CN201911256375.3A 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene Active CN111179272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256375.3A CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Publications (2)

Publication Number Publication Date
CN111179272A CN111179272A (en) 2020-05-19
CN111179272B true CN111179272B (en) 2024-01-05

Family

ID=70657206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256375.3A Active CN111179272B (en) 2019-12-10 2019-12-10 Rapid semantic segmentation method for road scene

Country Status (1)

Country Link
CN (1) CN111179272B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819000A (en) * 2021-02-24 2021-05-18 长春工业大学 Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN113763392B (en) * 2021-11-10 2022-03-18 北京中科慧眼科技有限公司 Model prediction method and system for road surface flatness detection and intelligent terminal
CN115018857B (en) * 2022-08-10 2022-11-11 南昌昂坤半导体设备有限公司 Image segmentation method, image segmentation device, computer-readable storage medium and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018081537A1 (en) * 2016-10-31 2018-05-03 Konica Minolta Laboratory U.S.A., Inc. Method and system for image segmentation using controlled feedback
CN107016415A (en) * 2017-04-12 2017-08-04 合肥工业大学 A kind of coloured image Color Semantic sorting technique based on full convolutional network
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning
CN110210350A (en) * 2019-05-22 2019-09-06 北京理工大学 A kind of quick parking space detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dynamic Modeling of Driver Control Strategy of Lane-Change Behavior and Trajectory Planning for Collision Prediction;Guoqing Xu;IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS;第13卷(第3期);1138-1154 *

Also Published As

Publication number Publication date
CN111179272A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN110188765B (en) Image semantic segmentation model generation method, device, equipment and storage medium
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN109886121B (en) Human face key point positioning method for shielding robustness
CN108229468B (en) Vehicle appearance feature recognition and vehicle retrieval method and device, storage medium and electronic equipment
US20210397876A1 (en) Similarity propagation for one-shot and few-shot image segmentation
US11354906B2 (en) Temporally distributed neural networks for video semantic segmentation
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN114202672A (en) Small target detection method based on attention mechanism
CN111179272B (en) Rapid semantic segmentation method for road scene
CN112036447B (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN107506792B (en) Semi-supervised salient object detection method
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111739037B (en) Semantic segmentation method for indoor scene RGB-D image
CN116645592B (en) Crack detection method based on image processing and storage medium
CN109657538B (en) Scene segmentation method and system based on context information guidance
Wu et al. A deep residual convolutional neural network for facial keypoint detection with missing labels
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
Li et al. Transformer helps identify kiwifruit diseases in complex natural environments
CN114863348A (en) Video target segmentation method based on self-supervision
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant