CN111179272A

CN111179272A - Rapid semantic segmentation method for road scene

Info

Publication number: CN111179272A
Application number: CN201911256375.3A
Authority: CN
Inventors: 欧勇盛; 彭远哲; 王志扬; 熊荣
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-05-19
Anticipated expiration: 2039-12-10
Also published as: CN111179272B

Abstract

The invention discloses a fast semantic segmentation method for road scenes, comprising the following steps: step 1, constructing a model based on a convolutional neural network; step 2, training the model constructed in step 1 with training data; step 3, computing the loss of the model trained in step 2 with a loss function, and computing the gradient from the resulting loss; and step 4, updating the model parameters according to the gradient obtained in step 3. The segmentation method provided by the invention segments images rapidly while achieving higher accuracy.

Description

Rapid semantic segmentation method for road scene

Technical Field

The invention belongs to the technical field of computer vision, and relates to a road scene-oriented rapid semantic segmentation method.

Background

In recent years, image semantic segmentation has become a research hotspot in computer vision, with applications in scenarios such as robot vision and environment perception for intelligent driving. Semantic segmentation means accurately segmenting the multiple target elements in a captured picture along their respective contours. To a computer, an image is a multi-channel matrix of values. For a computer to understand and segment target elements, it must find the numerical features of each target element in the original numerical matrix, determine from those features which target elements the image contains, and then analyze at a deeper level how those target elements are combined to compose the picture. In general, the purpose of image semantic segmentation is to identify the objects in a picture and their interrelationships, which simulates the human visual system. Humans understand their surroundings through the visual perception system, and image segmentation technology acquires, understands and recognizes the information in pictures by imitating that system. Image semantic segmentation is a very important task in image processing and pattern recognition and a preliminary step for many computer vision techniques. Detection and tracking based on the segmentation result can narrow the search area and can even obtain the contours of some objects directly; image captioning based on the segmentation result can add a large amount of positional information about the targets; and image style transfer based on the segmentation result can quickly locate and replace the background area or a specific target. Image semantic segmentation therefore has substantial theoretical and research value.

Early semantic segmentation methods relied on hand-crafted features, for example using random decision forests to predict class probabilities and using conditional random fields, a probabilistic model, to handle uncertainty and integrate contextual information in the image. In recent years, convolutional neural networks (CNNs) have developed rapidly in computer vision thanks to the advent of large-scale training datasets and high-performance graphics processing units (GPUs). In addition, excellent open-source deep learning frameworks such as Caffe, MXNet and TensorFlow have facilitated the development of deep learning algorithms. Powerful deep neural networks have greatly reduced classification error, and semantic segmentation has made great progress in the process.

At present, many researchers try to obtain the highest possible accuracy with the smallest amount of computation and the fewest parameters, so that models can run on vehicle-mounted terminal platforms; examples include the image classification models SqueezeNet, ShuffleNet and MobileNet. In semantic segmentation, the input size is 3 × H × W and the output size is C × H × W: the output has exactly the same width and height as the input, and C is often much larger than 3, so producing the output requires a large number of operations. Meanwhile, to obtain higher segmentation accuracy the image is down-sampled only a few times, so the width and height of the intermediate features remain large, which increases the amount of computation further. For these two reasons, semantic segmentation is a very computation-intensive task.

There are generally two ways to reduce the computation required by semantic segmentation: reducing the picture size and reducing the model complexity. Reducing the picture size reduces computation most directly, but the image loses a lot of detail, which hurts accuracy. Reducing the model complexity weakens the feature extraction capability of the model, which also hurts segmentation accuracy. Existing semantic segmentation models for road scenes therefore struggle to reconcile real-time performance with high segmentation accuracy.

Most existing semantic segmentation frameworks are based on the fully convolutional network (FCN). The FCN improved semantic segmentation performance by converting a classification network into a fully convolutional one, that is, by replacing the fully connected layers of the classification model with convolutional layers. However, the FCN does not consider the relationships between pixels: it omits the spatial regularization step used in conventional segmentation methods based on pixel classification, and therefore lacks spatial consistency.

Disclosure of Invention

The invention aims to provide a fast semantic segmentation method for road scenes, which solves the problems of low segmentation accuracy and weak feature extraction capability in existing segmentation methods.

The technical scheme adopted by the invention is a fast semantic segmentation method for road scenes, which specifically comprises the following steps:

step 1, constructing a model based on a convolutional neural network;

step 2, training the model constructed in the step 1 by using training data;

step 3, computing the loss of the model trained in step 2 with a loss function, and computing the gradient from the resulting loss;

step 4, updating the model parameters according to the gradient obtained in step 3.

The invention is also characterized in that:

the specific process of step 1 is as follows: an input image is processed by a convolutional neural network formed from multiple convolution kernels, so that 3 × H × W input data yields a 1 × H × W prediction output, wherein H is the height of the input image and W is the width of the input image;

the model was constructed according to the following equation (1):

wherein $F_{out}$ is the output feature, $F_{in}$ is the input feature, $K_i$ is the $i$-th convolution kernel, $N$ is the number of output channels, and $b$ is the bias;

since the image is two-dimensional data, the size of the input feature is $C_{in} \times H_{in} \times W_{in}$; using a convolution kernel of size $C_{out} \times C_{in} \times H_k \times W_k$, the output feature obtained has size $C_{out} \times H_{out} \times W_{out}$;

wherein $C_{in}$ and $C_{out}$ are the numbers of channels of the input and output features, $H_{in}$ and $W_{in}$ are the height and width of the input feature, $H_k$ and $W_k$ are the height and width of the convolution kernel, and $H_{out}$ and $W_{out}$ are the height and width of the output feature;

for an input feature of size $C_{in} \times H_{in} \times W_{in}$, $C_{out}$ convolution kernels of size $C_{in} \times H_k \times W_k$ perform a sliding multiply-add operation over the input feature to obtain $C_{out}$ features of size $H_{out} \times W_{out}$.

The calculation process of the height and width of the output features in step 1 is as follows:

wherein p is the padding width and s is the stride.

The specific process of step 2 is as follows: the training data comprises manually collected images and the label images corresponding to the collected images;

the model is trained to produce the label image from the input image, wherein the input image is a color RGB image and the label image is a single-channel grayscale image.

In step 2, the gray value of an image pixel directly represents the class to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image is between 0 and C-1.

The specific process of the step 3 is as follows:

step 3.1, determining a model target:

assuming that the input data of the model is $X$, the label data is $Y$, the model with parameters $\omega$ is $f_{\omega}$, and the loss function used is $L$, the target of the model is:

$$\omega = \arg\min_{\omega} L(f_{\omega}(X), Y) \tag{4}$$

step 3.2, determining the semantic segmentation loss function according to the model target obtained in step 3.1:

semantic segmentation is essentially a classification task in which each pixel of the image belongs to exactly one class, that is, exactly one value in the pixel's label is 1 and the rest are 0; the loss function for semantic segmentation is then:

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) \tag{8}$$

wherein $f_{\omega}(x_t)$ is the predicted probability of the class whose label is 1;

step 3.3, in order to reduce the complexity of the model, weight decay is added to the model, and the loss function with weight decay is:

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) + \alpha \cdot \omega^{2} \tag{9}$$

wherein $\alpha$ is a weight;

step 3.4, normalizing the model output with the following formula:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \tag{10}$$

wherein $C$ is the number of channels, $x_i$ is the value output by the model at a given pixel position, and $y_i$ is the corresponding predicted probability;

step 3.5, according to the result obtained in step 3.4, obtaining the gradient value corresponding to each parameter involved in the computation based on the chain rule of differentiation, wherein the gradient is calculated as shown in the following formula (11):

wherein $x_k$ and $x_t$ are both obtained by convolution with the parameter $\omega$.

The specific process of step 4 is as follows:

parameters in the model are updated according to the following formulas (12) and (13):

$$m_{t+1} = \rho \cdot m_t + \Delta\omega \tag{12}$$

$$\omega_{t+1} = \omega_t - lr \cdot m_{t+1} \tag{13}$$

wherein $m_t$ is the current momentum, $m_{t+1}$ is the updated momentum, $\omega_t$ is the current parameter, $\omega_{t+1}$ is the updated parameter, $\Delta\omega$ is the gradient, $\rho$ is a weight parameter with a value between 0 and 1, and $lr$ is the update step size.

The beneficial effects of the invention are that the model is constructed with a convolutional neural network, a multi-level convolution module is used to speed up feature extraction, and the multi-level convolution module replaces the 3 × 3 convolution to extract more and deeper high-level features, so that images can be segmented rapidly while achieving higher accuracy.

Drawings

FIG. 1 is a schematic diagram of the image semantic segmentation process;

FIG. 2 is a schematic structural diagram of an MConvblock adopted in a semantic segmentation model of an embodiment of a road scene-oriented fast semantic segmentation method of the invention;

FIG. 3 is a schematic diagram of an MResblock structure in a semantic segmentation model according to an embodiment of the road scene-oriented fast semantic segmentation method of the present invention;

FIG. 4 is a schematic network structure diagram of Road Scene Deeplabv3+ in the semantic segmentation model according to the embodiment of the Road Scene-oriented fast semantic segmentation method of the present invention;

FIG. 5(a)-(c) are diagrams of experimental results of the road scene-oriented fast semantic segmentation method of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Semantic segmentation, which is a pixel-level segmentation task providing rich target information, is an important research problem in the field of robot perception, and has been widely applied in many fields, such as automatic driving and robot navigation. In the application of road scene understanding, the semantic segmentation model should accurately describe the appearances and the categories of different objects, and in addition, the semantic segmentation model needs to understand the spatial relationship between different objects.

Image segmentation is the division of an image into several disjoint regions containing a single object, typically in a bottom-up manner. The semantic segmentation is to divide each pixel in the image into a certain category with semantics, which is usually implemented in a top-down manner. The final goal of semantic segmentation is to obtain a model in a top-down manner, so that it can accurately predict a label with semantic meaning for each pixel in an input image, and a schematic diagram thereof is shown in fig. 1.

The invention relates to a road scene-oriented rapid semantic segmentation method, which specifically comprises the following steps:

step 1, constructing a model based on a convolutional neural network;

Convolutional neural networks are also referred to as region-dependent networks, because each convolution output is computed from a local region of the input features.

A convolutional neural network is composed of a series of convolution kernels, each with a fixed height, width and set of parameters. A kernel slides over the input features and performs a multiply-add operation on the values of the region it covers to obtain the output features.

The output features can serve as the input features of the next convolution, and stacking multiple convolution kernels in this way to process the features further forms a convolutional neural network.

Such stacking can reach tens or even hundreds of layers, forming a feature extraction process that progresses from low-level region correlation to high-level semantic correlation of the input image.

The model formula constructed based on the convolutional neural network is as follows:

wherein $F_{out}$ is the output feature, $F_{in}$ is the input feature, $K_i$ is the $i$-th convolution kernel, $N$ is the number of output channels, and $b$ is the bias.

Since the image is two-dimensional data, the input feature has size $C_{in} \times H_{in} \times W_{in}$; using a convolution kernel of size $C_{out} \times C_{in} \times H_k \times W_k$, the output feature obtained has size $C_{out} \times H_{out} \times W_{out}$. Here $C_{in}$ and $C_{out}$ are the numbers of channels of the input and output features, $H_{in}$ and $W_{in}$ are the height and width of the input feature, $H_k$ and $W_k$ are the height and width of the convolution kernel, and $H_{out}$ and $W_{out}$ are the height and width of the output feature.
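The sliding multiply-add described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `conv2d`, the nested-list layout `[channel][row][column]`, and the use of one bias value per output channel are all choices made for the example. It corresponds to one common per-channel reading of the symbols above, $F_{out}^{(i)} = K_i * F_{in} + b_i$ for $i = 1, \dots, N$, which is itself an assumption because equation (1) is not reproduced in this text.

```python
def conv2d(f_in, kernels, bias, stride=1, pad=0):
    """Naive multi-channel 2D convolution (cross-correlation, as in CNN libraries).

    f_in:    input feature, nested lists shaped [C_in][H_in][W_in]
    kernels: nested lists shaped [C_out][C_in][H_k][W_k]
    bias:    list of C_out values (one per output channel; an assumption)
    """
    c_in, h_in, w_in = len(f_in), len(f_in[0]), len(f_in[0][0])
    h_k, w_k = len(kernels[0][0]), len(kernels[0][0][0])

    # Zero-pad the border of every input channel.
    padded = [
        [[0.0] * (w_in + 2 * pad) for _ in range(h_in + 2 * pad)]
        for _ in range(c_in)
    ]
    for c in range(c_in):
        for y in range(h_in):
            for x in range(w_in):
                padded[c][y + pad][x + pad] = f_in[c][y][x]

    h_out = (h_in + 2 * pad - h_k) // stride + 1
    w_out = (w_in + 2 * pad - w_k) // stride + 1

    f_out = []
    for k, b in zip(kernels, bias):          # one kernel per output channel
        channel = []
        for oy in range(h_out):
            row = []
            for ox in range(w_out):
                acc = b
                for c in range(c_in):        # multiply-add over the covered region
                    for ky in range(h_k):
                        for kx in range(w_k):
                            acc += (k[c][ky][kx]
                                    * padded[c][oy * stride + ky][ox * stride + kx])
                row.append(acc)
            channel.append(row)
        f_out.append(channel)
    return f_out


# A 1 x 3 x 3 input of ones, one 3 x 3 kernel of ones, padding 1, stride 1.
ones = [[[1.0] * 3 for _ in range(3)]]
kernel = [[[[1.0] * 3 for _ in range(3)]]]
result = conv2d(ones, kernel, [0.0], stride=1, pad=1)
print(result[0][1][1], result[0][0][0])  # 9.0 4.0 (centre sees 9 ones, corner sees 4)
```

The four nested loops make the cost visible: every output value needs $C_{in} \times H_k \times W_k$ multiply-adds, which is why large feature maps and many channels dominate the computation.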

The width and height of the output feature are calculated as follows:

where p is the padding width and s is the stride.

In order to keep the width and height of the output features consistent with those of the input features during convolution, zeros are usually padded around the border of the input features, with the amount of padding chosen so that the width and height are unchanged after the convolution.
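The output-size rule itself is not reproduced in this text. The standard relation for a convolution with padding $p$ and stride $s$, assumed here, is $H_{out} = \lfloor (H_{in} + 2p - H_k)/s \rfloor + 1$, and likewise for the width. The sketch below applies that relation and shows the zero padding that leaves the size unchanged; `conv_output_size` and `same_padding` are hypothetical helper names.

```python
def conv_output_size(size_in, kernel, pad, stride):
    """Output height (or width) of a convolution along one axis."""
    return (size_in + 2 * pad - kernel) // stride + 1


def same_padding(kernel):
    """Padding that keeps the size unchanged for stride 1 and an odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("size-preserving symmetric padding needs an odd kernel")
    return (kernel - 1) // 2


# A 3x3 convolution with stride 1 and padding 1 keeps a 1024 x 2048 map unchanged.
p = same_padding(3)
print(conv_output_size(1024, 3, p, 1), conv_output_size(2048, 3, p, 1))  # 1024 2048

# The same kernel with stride 2 halves each side.
print(conv_output_size(1024, 3, p, 2), conv_output_size(2048, 3, p, 2))  # 512 1024
```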

Semantic segmentation requires an output with the same width and height as the input image, with a class prediction for each pixel.

The semantic segmentation process therefore uses a convolutional neural network formed from multiple convolution kernels to process the input image, so that 3 × H × W input data yields a 1 × H × W prediction output, where H is the height of the input image and W is the width of the input image.

In practice, a probability is usually assigned to every possible class, so the convolutional neural network usually produces a C × H × W prediction, and the class with the highest probability at each pixel position is then selected as the final class to obtain the 1 × H × W prediction output, where C is the total number of classes.
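The last step, reducing a C × H × W prediction to a 1 × H × W label map, is a per-pixel argmax over the class axis. A minimal sketch follows; `predict_labels` is a hypothetical name and the nested-list layout `[class][row][column]` is an assumption made for illustration.

```python
def predict_labels(scores):
    """Collapse a C x H x W score volume into an H x W label map.

    scores: nested lists shaped [C][H][W]; larger means more likely.
    Returns the index (0 .. C-1) of the highest-scoring class per pixel.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Three classes on a 1 x 2 image.
scores = [
    [[0.1, 0.7]],   # class 0
    [[0.6, 0.2]],   # class 1
    [[0.3, 0.1]],   # class 2
]
print(predict_labels(scores))  # [[1, 0]]
```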

Step 2, training the model constructed in the step 1 by using training data;

Training data for semantic segmentation usually consists of a series of manually collected images and their corresponding label images, and the purpose of model training is to make the model fit the provided data, that is, to produce the label images from the input images. The input image is an ordinary color RGB image, while the label image is a single-channel grayscale image in which the gray value of a pixel directly represents the class to which the pixel belongs. Assuming that there are C classes to be segmented, each pixel value in the label image is between 0 and C-1.

Step 3, computing the loss of the model trained in step 2 with a loss function, and computing the gradient from the resulting loss;

The loss function measures how close the prediction result is to the label data.

Generally, the closer the prediction result is to the label data, the smaller the value of the loss function. Assuming that the input data of the model is $X$, the label data is $Y$, the model with parameters $\omega$ is $f_{\omega}$, and the loss function used is $L$, the target of the model is:

$$\omega = \arg\min_{\omega} L(f_{\omega}(X), Y) \tag{4}$$

that is, to obtain a model $f_{\omega}$ with parameters $\omega$ for which the loss $L$ between the prediction result and the label is minimal.

The loss functions commonly used in deep learning are mean absolute error (MAE), mean squared error (MSE) and cross entropy. Cross entropy is the most widely used in classification tasks; it measures the similarity between two probability distributions.

The loss function of MAE is:

The loss function of MSE is:

The cross-entropy loss function is:

Of these, the cross-entropy loss is the one most commonly used.
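Equations (5) to (7) are not reproduced in this text. For reference, the three losses are commonly written as follows, where $n$ is the number of predictions, $\hat{y}_i$ is a prediction and $y_i$ is its label, and in the cross entropy the index runs over the $C$ classes of one pixel. These textbook forms, including the $1/n$ averaging, are an assumption and may differ in detail from the original equations:

$$L_{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$

$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}$$

$$L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

When the label is one-hot, with $y_t = 1$ and all other $y_i = 0$, every term of the cross-entropy sum vanishes except the one for class $t$, leaving $-\log \hat{y}_t$. This is the simplification used in equation (8).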

Semantic segmentation is essentially a classification task in which each pixel belongs to exactly one class, that is, exactly one value in its label is 1 and the rest are 0. The cross-entropy loss function can therefore be simplified in the semantic segmentation task to:

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) \tag{8}$$

wherein $f_{\omega}(x_t)$ is the predicted probability of the class whose label is 1.

Weight decay is usually added to the model to reduce its complexity. Generally, a squared (MSE-style) penalty on the parameters is used as the weight-decay term; with cross entropy as the main loss function, this gives:

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) + \alpha \cdot \omega^{2} \tag{9}$$

where $\alpha$ is a weight, typically set to 0.0001.

Since the result of the last convolution of the model is not confined to a fixed range, it needs to be normalized to between 0 and 1 so that the values sum to 1 over the channels; that is, the predicted probabilities of all classes at each pixel position should sum to 1 in order to correspond to the provided label.

To achieve this, the output is typically normalized with softmax, whose formula is:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \tag{10}$$

wherein $C$ is the number of channels, $x_i$ is the value output by the model at a given pixel position, and $y_i$ is the corresponding predicted probability.
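The normalization and the loss of equations (8) and (9) can be combined for a single pixel as in the sketch below. It is an illustration only: `softmax` and `pixel_loss` are hypothetical names, the `weights` argument stands in for the model parameters ω as a flat list, and subtracting the maximum before exponentiating is a standard numerical-stability step that the source does not mention.

```python
import math


def softmax(logits):
    """Normalize C raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_loss(logits, target, weights=(), alpha=1e-4):
    """Cross entropy for one pixel plus a squared weight-decay term.

    logits:  C raw model outputs x_i at this pixel
    target:  index t of the labeled class
    weights: flat iterable of model parameters (hypothetical; empty disables decay)
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])          # -log f(x_t), equation (8)
    decay = alpha * sum(w * w for w in weights)       # alpha * omega^2, equation (9)
    return cross_entropy + decay


probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))        # 1.0
print(pixel_loss([0.0, 0.0], 0))   # -log(0.5), about 0.693
```

With two equal logits the predicted probability of either class is 0.5, so the loss is log 2; a confident correct prediction drives the loss towards 0.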

After the predicted probabilities are obtained, the difference between them and the label can be computed with the loss function, and the gradient value corresponding to each parameter involved in the computation is obtained via the chain rule of differentiation.

In the invention, the gradient with respect to the parameter $\omega$ is calculated from the cross-entropy loss function; letting $y_t$ be the predicted probability of the labeled class:

wherein $x_k$ and $x_t$ are both obtained by convolution with the parameter $\omega$, and the operations involved are multiply-add operations and differentiable nonlinear operations, that is, every operation is differentiable, so that $\partial x_k / \partial \omega$ and $\partial x_t / \partial \omega$ can be obtained with the chain rule according to the specific structure of the model.
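Equation (11) is not reproduced in this text. Under softmax normalization and the cross-entropy loss of equation (8), and leaving the weight-decay term aside, the gradient can be derived as follows. This is the standard derivation, offered as a sketch consistent with the symbols $x_k$, $x_t$ and $y_t$ used here rather than as the original equation:

$$L = -\log y_t = -\log \frac{e^{x_t}}{\sum_{k=1}^{C} e^{x_k}} = -x_t + \log \sum_{k=1}^{C} e^{x_k}$$

$$\frac{\partial L}{\partial \omega} = -\frac{\partial x_t}{\partial \omega} + \sum_{k=1}^{C} y_k \frac{\partial x_k}{\partial \omega}, \qquad y_k = \frac{e^{x_k}}{\sum_{j=1}^{C} e^{x_j}}$$

Equivalently, the derivative of the loss with respect to each model output is $\partial L / \partial x_k = y_k - 1$ for $k = t$ and $\partial L / \partial x_k = y_k$ otherwise, so the only model-specific factors left are the derivatives of the outputs $x_k$ with respect to $\omega$.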

Therefore, once the network model and the loss function are determined, the formula of the gradient is determined. When the network has more than one layer, the gradients of the model parameters are computed in much the same way as the features: the gradient is propagated through the network backwards along the path by which the features were computed.

Step 4, updating the model parameters according to the gradient obtained in step 3;

After the gradients of the parameters are obtained, the parameters in the model are updated according to them. The invention mainly uses stochastic gradient descent (SGD) with momentum, the method most commonly used in semantic segmentation. The word "stochastic" refers to the random selection of samples during training. Because the training set is often too large to be loaded into the GPU all at once, a fixed number of samples is randomly selected from the training set at each training step. The invention draws data from the training set at random without replacement and trains the model on the drawn data. After the gradient is obtained at each training step, the parameters in the model are updated according to the following rules:

$$m_{t+1} = \rho \cdot m_t + \Delta\omega \tag{12}$$

$$\omega_{t+1} = \omega_t - lr \cdot m_{t+1} \tag{13}$$

wherein $m_t$ is the current momentum, $m_{t+1}$ is the updated momentum, $\omega_t$ is the current parameter, $\omega_{t+1}$ is the updated parameter, $\Delta\omega$ is the gradient obtained by back propagation, $\rho$ is a weight parameter with a value between 0 and 1, and $lr$ is the update step size, generally also between 0 and 1.
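Equations (12) and (13) translate directly into code. The sketch below applies them element-wise to flat lists of parameters; `sgd_momentum_step` is a hypothetical name, and the values `lr=0.01` and `rho=0.9` are illustrative choices that the source does not specify.

```python
def sgd_momentum_step(params, grads, momentum, lr=0.01, rho=0.9):
    """One update of equations (12) and (13), applied element-wise.

    params, grads, momentum: equal-length lists of floats.
    Returns (new_params, new_momentum) without mutating the inputs.
    """
    new_momentum = [rho * m + g for m, g in zip(momentum, grads)]      # (12)
    new_params = [w - lr * m for w, m in zip(params, new_momentum)]    # (13)
    return new_params, new_momentum


# Minimize L(w) = w^2, whose gradient is 2w, starting from w = 1.
w, m = [1.0], [0.0]
for _ in range(200):
    grad = [2.0 * w[0]]
    w, m = sgd_momentum_step(w, grad, m, lr=0.01, rho=0.9)
print(abs(w[0]) < 1e-3)  # True
```

The momentum term accumulates past gradients, so consecutive steps in the same direction grow larger while oscillating components partly cancel.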

After the model, data, loss function and parameter update rule are determined, the model can be trained iteratively on the dataset. Training may stop after a target number of iterations or once the accuracy reaches a given standard. During training, part of the data in the training set is set aside as a validation set that does not participate in training. After each iteration, the model is evaluated on the validation set, and the state of training is judged by observing the model's performance on the training set and the validation set. For example, in semantic segmentation, if the model achieves a significantly higher intersection-over-union (IoU) on the training data than on the validation data, the model has usually overfitted, and training should be stopped and the model readjusted. If the results on both the training set and the validation set remain poor, the model parameters are failing to converge, and training should be stopped and the training strategy changed. If the accuracy on the training set and the validation set differs little and keeps increasing, training is going well. After training is complete, the model can be evaluated on the test set to observe how it performs in actual use.
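The monitoring described above relies on the intersection-over-union between predicted and labeled maps. A minimal sketch of the metric and of a simple overfitting check follows. The names `mean_iou` and `looks_overfit` are hypothetical, averaging over classes is one common convention, and the 0.10 gap threshold is an arbitrary illustrative value that does not come from the source.

```python
def mean_iou(pred, label, num_classes):
    """Mean intersection-over-union between two H x W label maps.

    Classes absent from both prediction and label are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, label_row in zip(pred, label):
            for p, t in zip(pred_row, label_row):
                inter += (p == c) and (t == c)
                union += (p == c) or (t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def looks_overfit(train_miou, val_miou, gap=0.10):
    """Flag a run whose training mIoU exceeds validation mIoU by more than `gap`."""
    return train_miou - val_miou > gap


pred = [[0, 0], [1, 1]]
label = [[0, 1], [1, 1]]
print(mean_iou(pred, label, num_classes=2))  # (1/2 + 2/3) / 2, about 0.583
print(looks_overfit(0.90, 0.70))             # True
```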

Examples

According to the fast semantic segmentation method for road scenes of the invention, a fast semantic segmentation model based on a multi-level convolution module is designed, with Deeplabv3+ as the overall framework of the model. In order to improve the running speed of the model, the invention designs a shallow network as the backbone network of Deeplabv3+, named Road Scene Deeplabv3+. Table 1 shows the network structure parameters of Road Scene Deeplabv3+; the structure is similar to the feature extraction part of Darknet-53, but has fewer layers and fewer parameters. The invention also designs a multi-level convolution module, MConvblock, to obtain multi-level features.

TABLE 1 Road Scene Deeplabv3+ structural parameters

Name	Number of repetitions	Output channels	Operation
Conv1	1	16	MConvblock
Sub1	1	32	3×3 conv + BN + Leaky, stride=2
Res1	1	32	MResblock
Sub2	1	64	3×3 conv + BN + Leaky, stride=2
Res2	2	64	MResblock
Sub3	1	128	3×3 conv + BN + Leaky, stride=2
Res3	2	128	MResblock
Sub4	1	256	3×3 conv + BN + Leaky, stride=2
Res4	2	256	MResblock
Sub5	1	512	3×3 conv + BN + Leaky, stride=2
Res5	1	512	MResblock
Conv2	1	1024	MConvblock

The structures of MConvblock and MResblock are shown in fig. 2 and 3, respectively, where K is the number of input channels and C is the number of output channels. Here it can be seen that MResblock is a residual network structure designed based on MConvblock, so the output size and the input size of MResblock are identical.

FIG. 4 shows the network structure of Road Scene Deeplabv3+ designed by the present invention, wherein Block1 outputs 4× down-sampled features, Block2 outputs 16× down-sampled features, and C represents the number of channels of the input features. Conv1 through Res2 and Sub3 through Conv2 correspond to Table 1. So that the backbone network of the present invention down-samples by only 16×, that is, four 2× down-samplings, the present invention changes the convolution stride in Sub5 from 2 to 1 and replaces the convolutions in Res5 and Conv2 with atrous (dilated) convolutions with a dilation rate of 2. In order to extract more high-level features, the invention adds an atrous convolution with a dilation rate of 24 to the Atrous Spatial Pyramid Pooling (ASPP) module and replaces the 3 × 3 convolutions in the ASPP with multi-level convolution modules, where MConv denotes the MConvblock of FIG. 2; each convolution in the multi-level convolution modules in the ASPP is dilated accordingly. The number of output channels of each module in the ASPP of the present invention is 64, far fewer than the 256 in the original Deeplabv3+, which reduces the amount of computation. Likewise, the present invention reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After the high-level features of different scales are fused, they are up-sampled 4× by bilinear interpolation and fused with the 4× down-sampled features. Finally, three convolutions are applied, and the final semantic segmentation result is obtained by another 4× bilinear interpolation up-sampling. With a GTX 1080 Ti GPU as the running platform, the model of the present invention reaches a processing speed of 0.057 s per frame on images of size 2048 × 1024 × 3 and 0.03 s per frame on 720p images, that is, images of size 1920 × 720 × 3. Moreover, when the model parameters are stored at 32-bit floating-point precision, they occupy only 20 MB of storage. Overall, the fast semantic segmentation model designed by the invention meets the requirements of road-scene semantic segmentation in both speed and storage.
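The down-sampling arithmetic of this embodiment can be checked with a few lines of Python. The layer names and strides are taken from Table 1; treating Sub5 as running at stride 1 in the modified network is an assumption inferred from the stated total of four 2× down-samplings, and `output_stride` is a hypothetical helper name.

```python
# Stride of each down-sampling layer in Table 1 (all other layers keep the size).
table1_strides = {"Sub1": 2, "Sub2": 2, "Sub3": 2, "Sub4": 2, "Sub5": 2}


def output_stride(strides):
    """Total down-sampling factor: the product of the layer strides."""
    total = 1
    for s in strides.values():
        total *= s
    return total


print(output_stride(table1_strides))  # 32

# Running Sub5 at stride 1 (with dilation 2 afterwards) keeps the total at 16.
modified = dict(table1_strides, Sub5=1)
print(output_stride(modified))  # 16

# Feature-map size for a 1024 x 2048 input at output stride 16.
print(1024 // output_stride(modified), 2048 // output_stride(modified))  # 64 128
```

Keeping the deepest features at 1/16 rather than 1/32 of the input resolution preserves more spatial detail, while the dilated convolutions that follow keep the receptive field from shrinking.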

The features of the above embodiment are as follows:

(1) In order to increase the running speed of the model, a shallow network, Road Scene Deeplabv3+, is designed as the backbone network for fast semantic segmentation. Table 1 shows the network structure parameters of Road Scene Deeplabv3+, which is similar to the feature extraction part of Darknet-53 but has fewer layers and fewer parameters.

(2) So that the backbone network of the present embodiment down-samples the input image by 16×, that is, four 2× down-samplings, the present embodiment changes the convolution stride in Sub5 from 2 to 1 and replaces the convolutions in Res5 and Conv2 with atrous (dilated) convolutions with a dilation rate of 2.

(3) In order to extract more high-level features, the present embodiment adds an atrous convolution with a dilation rate of 24 to the Atrous Spatial Pyramid Pooling (ASPP) module.

(4) The present embodiment also replaces the 3 × 3 convolutions in the ASPP with multi-level convolution modules, where MConv denotes the MConvblock of FIG. 2; each convolution in the multi-level convolution modules in the ASPP is dilated accordingly. The number of output channels of each module in the ASPP of the present embodiment is 64, far fewer than the 256 in the original Deeplabv3+, which reduces the amount of computation.

(5) The present embodiment also reduces the parameters of each convolution in the up-sampling path, further reducing the amount of computation. After the high-level features of different scales are fused, they are up-sampled 4× by bilinear interpolation and fused with the 4× down-sampled features. Finally, three convolutions are applied, and the final semantic segmentation result is obtained by another 4× bilinear interpolation up-sampling.

To sum up, current semantic segmentation models struggle to balance real-time performance against high segmentation accuracy in road scene applications, so a fast semantic segmentation model for road scenes is designed based on the Road Scene Deeplabv3+ network structure. First, a shallow network is designed as the backbone network of Deeplabv3+ for feature extraction, and a multi-level convolution module is designed for the shallow network to speed up feature extraction. The multi-level convolution module then replaces the 3 × 3 convolution in Deeplabv3+ to extract more and deeper high-level features. Compared with other methods, the method segments images rapidly and achieves higher segmentation accuracy.

In order to verify the validity of the road scene semantic segmentation model provided in this embodiment, several experiments were performed. The experimental results are shown in FIG. 5, where the column in FIG. 5(a) shows the input images, the column in FIG. 5(b) shows the label images, and the column in FIG. 5(c) shows the output images. Compared with image datasets collected directly in the real world, the Cityscapes dataset has undergone a degree of manual processing, and both its original images and its annotations are of high quality; the Cityscapes dataset, which is well suited to road scene semantic segmentation experiments, was therefore selected for the experiments.

With a GTX 1080 Ti GPU as the running platform, the semantic segmentation model of the present embodiment reaches a processing speed of 0.057 s per frame on images of size 2048 × 1024 × 3 and 0.03 s per frame on 720p images, that is, images of size 1920 × 720 × 3. Moreover, the model parameters of this embodiment occupy only 20 MB of storage when stored at 32-bit floating-point precision. Overall, the fast semantic segmentation model designed in this embodiment meets the requirements of road-scene semantic segmentation in both speed and storage.
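The reported figures can be converted into more familiar units with simple arithmetic. The sketch below assumes that 1 MB means 10^6 bytes and that the 20 MB consists entirely of 32-bit parameters; both are assumptions, and the helper names are hypothetical.

```python
def frames_per_second(seconds_per_frame):
    """Throughput implied by a per-frame processing time."""
    return 1.0 / seconds_per_frame


def parameter_count(storage_bytes, bytes_per_param=4):
    """Number of parameters that fit in the given storage at 32-bit precision."""
    return storage_bytes // bytes_per_param


print(round(frames_per_second(0.057), 1))  # 17.5
print(round(frames_per_second(0.03), 1))   # 33.3
print(parameter_count(20 * 10**6))         # 5000000
```

Under these assumptions the model runs at roughly 17.5 frames per second on the larger images and about 33 frames per second on the smaller ones, with on the order of five million parameters.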

Claims

1. A fast semantic segmentation method for road scenes, characterized in that the method specifically comprises the following steps:

step 1, constructing a model based on a convolutional neural network;

step 2, training the model constructed in the step 1 by using training data;

2. The method for fast semantic segmentation of road scenes according to claim 1, characterized in that: the specific process of step 1 is as follows: an input image is processed by a convolutional neural network formed from multiple convolution kernels, so that 3 × H × W input data yields a 1 × H × W prediction output, wherein H is the height of the input image and W is the width of the input image;

the model was constructed according to the following equation (1):

3. The method for fast semantic segmentation of road scenes according to claim 2, characterized in that: the calculation process of the height and width of the output features in the step 1 is as follows:

wherein p is the padding width and s is the stride.

4. The method for fast semantic segmentation of road scenes according to claim 3, characterized in that: the specific process of step 2 is as follows: the training data comprises manually collected images and the label images corresponding to the collected images;

5. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: in step 2, the gray value of an image pixel directly represents the class to which the pixel belongs; when the image has C classes to be segmented, each pixel value in the label image is between 0 and C-1.

6. The method for fast semantic segmentation of road scenes according to claim 4, characterized in that: the specific process of the step 3 is as follows:

step 3.1, determining a model target:

assuming that the input data of the model is $X$, the label data is $Y$, the model with parameters $\omega$ is $f_{\omega}$, and the loss function used is $L$, the target of the model is:

$$\omega = \arg\min_{\omega} L(f_{\omega}(X), Y) \tag{4}$$

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) \tag{8}$$

$$L(f_{\omega}(X), Y) = -\log f_{\omega}(x_t) + \alpha \cdot \omega^{2} \tag{9}$$

wherein $\alpha$ is a weight;

wherein $x_k$ and $x_t$ are both obtained by convolution with the parameter $\omega$.

7. The method for fast semantic segmentation of road scenes according to claim 6, characterized in that: the specific process of the step 4 is as follows:

$$m_{t+1} = \rho \cdot m_t + \Delta\omega \tag{12}$$

$$\omega_{t+1} = \omega_t - lr \cdot m_{t+1} \tag{13}$$