CN113436210B - Road image segmentation method fusing context progressive sampling

Road image segmentation method fusing context progressive sampling

Info

Publication number
CN113436210B
CN113436210B (application number CN202110706637.2A)
Authority
CN
China
Prior art keywords
layer
output
result
feature map
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110706637.2A
Other languages
Chinese (zh)
Other versions
CN113436210A (en)
Inventor
陆彦钊 (Lu Yanzhao)
刘惠义 (Liu Huiyi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110706637.2A priority Critical patent/CN113436210B/en
Publication of CN113436210A publication Critical patent/CN113436210A/en
Application granted granted Critical
Publication of CN113436210B publication Critical patent/CN113436210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/12: Image analysis; Segmentation; Edge-based segmentation
    • G06T 7/13: Image analysis; Segmentation; Edge detection
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N 3/045: Neural networks; Architecture; Combinations of networks
    • G06N 3/048: Neural networks; Architecture; Activation functions
    • G06N 3/08: Neural networks; Learning methods
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20221: Image combination; Image fusion; Image merging
    • G06T 2207/30256: Vehicle exterior; Lane; Road marking

Abstract

The invention discloses a road image segmentation method that fuses context with progressive upsampling, comprising the following steps: preprocessing a set of acquired road images to obtain segmentation pictures; feeding each segmentation picture into a constructed Xception model to extract a deep feature map and several shallow feature maps; feeding the shallow feature maps into a constructed CBAM attention model to amplify small-target features, and passing the output to a constructed HRNet module for fusion; feeding the deep feature map into a constructed ASPP pyramid module for pooling; and fusing feature maps of matching resolution from the fusion result and the pooling result, upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original image size. The method improves segmentation accuracy and produces finer segmentation of image details.

Description

Road image segmentation method fusing context progressive sampling
Technical Field
The invention relates to a road image segmentation method that fuses context with progressive sampling, and belongs to the technical field of image segmentation.
Background
Image semantic segmentation is a key problem in computing today and an important direction of computer vision research. Early image segmentation in computer vision generally relied on cues such as edges and gradients and offered no pixel-level understanding, so segmentation accuracy was low and the techniques could not be applied to fields such as intelligent driving. With the deepening of convolutional neural network research in recent years, the pixel-level comprehension of computers has grown much stronger, networks for semantic segmentation have become increasingly capable, and the technique has broad application prospects in fields such as autonomous driving, human-computer interaction and virtual reality.
Early image semantic segmentation generally used methods based on thresholds, edges or regions. Although these methods are convenient and easy to understand, they discard a great deal of spatial information, so the segmentation effect is poor. To address these problems, Jonathan Long et al. proposed the Fully Convolutional Network (FCN) built on the CNN. The network removes the final fully connected layer of the CNN, deconvolves the last feature map to perform upsampling, and then enlarges the upsampled map to the original image size to achieve pixel-level classification. The work of Jonathan Long et al. was a major breakthrough in image semantic segmentation. However, because the FCN reduces the original image by a factor of 32 before enlarging it, pooling causes information loss, and no probabilistic model between labels is applied. Chen et al. proposed DeepLabV1, which employs atrous (dilated) convolution to enlarge the receptive field and reduces the number of pooling layers, avoiding the loss of detail caused by excessive pooling. At the same time, by adopting a CRF (conditional random field), edges are further refined, improving the segmentation of complex boundaries such as trees and bicycles. Building on DeepLabV1, Liang-Chieh Chen et al. proposed DeepLabV2; compared with DeepLabV1, the VGG16 backbone is replaced by ResNet and an ASPP (Atrous Spatial Pyramid Pooling) module is added. ASPP applies atrous convolution layers with several sampling rates in parallel and fuses global and local features to improve segmentation. The later DeepLabV3+ introduces an encoder-decoder structure, fuses the backbone output with shallow features, and reconstructs spatial information step by step to better capture object details, while employing depthwise separable convolutions to reduce computation. Although DeepLabV3+ captures context information well, its edge segmentation accuracy for small-scale objects is still not high.
To solve these problems, the present application provides a road image segmentation method that fuses context with progressive sampling.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a road image segmentation method that fuses context with progressive sampling, which identifies small target objects on the road more accurately and noticeably improves the segmentation of image details.
To achieve this purpose, the invention adopts the following technical scheme:
A road image segmentation method fusing context with progressive sampling comprises the following steps:
preprocessing the acquired road images to obtain segmentation pictures;
inputting each segmentation picture into a constructed Xception model to extract a deep feature map and shallow feature maps;
inputting the shallow feature maps into a constructed CBAM attention model to amplify small-target features, and feeding the output into a constructed HRNet module for fusion;
inputting the deep feature map into a constructed ASPP pyramid module for pooling;
and fusing feature maps of the same resolution from the fusion result and the pooling result, upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original image size.
Preferably, preprocessing the acquired road images to obtain the segmentation pictures includes:
cutting each road image into 1024 × 1024 pixel pictures and uniformly saving them in jpg format;
performing semantic annotation on each picture to obtain a segmentation picture;
the semantic annotation classes comprise background, car, person, sky, road, grass, wall, building and pedestrian crossing.
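As an illustration only, the nine annotation classes could be mapped to integer labels as follows; the particular index assignment is an assumption and is not specified by the invention:

```python
# Hypothetical mapping of the nine annotation classes to integer labels;
# the index assignment is an assumption for illustration only.
LABEL_MAP = {
    "background": 0, "car": 1, "person": 2, "sky": 3, "road": 4,
    "grass": 5, "wall": 6, "building": 7, "pedestrian_crossing": 8,
}
```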
Preferably, the construction of the Xception model includes:
constructing a block1 intermediate feature layer, which consists of a 3 × 3 convolution layer with 32 channels, a ReLU activation layer, a 3 × 3 convolution layer with 64 channels and a ReLU activation layer;
constructing a block2 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 128 channels, a ReLU activation layer and a max pooling layer;
constructing a block3 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 256 channels, a ReLU activation layer and a max pooling layer;
constructing a block4 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 728 channels, a ReLU activation layer and a max pooling layer;
constructing block5-block13 intermediate feature layers, each composed of three 3 × 3 depthwise separable convolution layers with 728 channels and three ReLU activation layers;
after the block1 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; after the block2 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; after the block3 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
Preferably, inputting the segmentation picture into the constructed Xception model to extract the deep feature map and the shallow feature maps includes: the Xception model extracts the deep feature map of the segmentation picture at the block13 intermediate feature layer, and extracts the shallow feature maps at the block2, block3 and block4 intermediate feature layers.
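For reference, the following is a minimal TensorFlow/Keras-style sketch of the backbone described above; the framework choice, the stride of the first block1 convolution, the pooling parameters, the ReLU placement and the stride and width of the 1 × 1 residual convolutions are assumptions chosen so the residual additions are shape-compatible and the spatial sizes match those reported in the embodiment below. It is an illustration, not the exact network of the invention.

```python
import tensorflow as tf
from tensorflow.keras import layers

def _sep_convs(x, filters, reps=2):
    """`reps` 3x3 depthwise-separable convolutions followed by one ReLU (placement simplified)."""
    for _ in range(reps):
        x = layers.SeparableConv2D(filters, 3, padding="same")(x)
    return layers.ReLU()(x)

def build_xception_backbone(input_shape=(1024, 1024, 3)):
    inp = layers.Input(shape=input_shape)

    # block1: 3x3 conv (32 ch) + ReLU, 3x3 conv (64 ch) + ReLU; the stride-2
    # first conv is an assumption taken from the 1024 -> 512 reduction in the embodiment.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
    b1 = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

    def entry_block(prev, filters):
        # main branch: two separable convs + ReLU + max pooling
        x = _sep_convs(prev, filters)
        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
        # residual branch: 1x1 conv on the previous block's output; stride 2 and
        # the channel count are assumptions needed for the Add to be shape-valid
        res = layers.Conv2D(filters, 1, strides=2, padding="same")(prev)
        return layers.Add()([x, res])

    b2 = entry_block(b1, 128)     # shallow feature map, 512 -> 256
    b3 = entry_block(b2, 256)     # shallow feature map, 256 -> 128
    b4 = entry_block(b3, 728)     # shallow feature map, 128 -> 64

    # blocks 5-13: three 3x3 separable convolutions (728 ch) with ReLU, no pooling
    x = b4
    for _ in range(9):
        x = _sep_convs(x, 728, reps=3)
    deep = x                      # deep feature map from block13

    return tf.keras.Model(inp, [b2, b3, b4, deep])
```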
Preferably, inputting the shallow feature maps into the constructed CBAM attention model to amplify small-target features and feeding the output into the constructed HRNet module for fusion includes:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify small-target features, producing outputs out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps hrout1, hrout2 and hrout3 at three resolutions;
a small target is an object whose area in the segmentation picture is less than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Preferably, constructing the CBAM attention model comprises constructing a channel attention mechanism and a spatial attention mechanism;
the channel attention mechanism comprises:
performing one max pooling and one average pooling on the input feature map in the channel dimension, so as to extract the maximum weight and the average weight of each channel;
sending the maximum weight and the average weight separately through two fully connected layers;
adding the two results and activating them with a sigmoid function to obtain an importance weight matrix over the channels;
multiplying the channel importance weight matrix with the input feature map to obtain an output with amplified channel features;
max pooling takes the maximum pixel value of each channel, average pooling takes the average pixel value of each channel, and the sigmoid activation is used so that larger input values are emphasized and smaller ones are suppressed;
the spatial attention mechanism comprises:
performing one max pooling and one average pooling on the channel-amplified output in the spatial dimension to extract the maximum weight and the average weight of each pixel;
convolving the maximum weight and the average weight with a 3 × 3 convolution layer, activating with a sigmoid function, and outputting an importance weight matrix over the pixels;
and multiplying the pixel importance weight matrix with the channel-amplified output to obtain an output with amplified pixel features.
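A minimal sketch of such a CBAM block is given below. The shared two-layer MLP and the channel-reduction ratio of 8 are assumptions (the text only specifies two fully connected layers); the 3 × 3 spatial-attention convolution follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, reduction=8):
    """Channel attention: per-channel max/avg pooling, shared two-layer MLP, sigmoid."""
    channels = x.shape[-1]
    dense1 = layers.Dense(channels // reduction, activation="relu")
    dense2 = layers.Dense(channels)

    avg = dense2(dense1(layers.GlobalAveragePooling2D()(x)))  # average weight per channel
    mx = dense2(dense1(layers.GlobalMaxPooling2D()(x)))       # maximum weight per channel

    w = layers.Activation("sigmoid")(layers.Add()([avg, mx]))
    w = layers.Reshape((1, 1, channels))(w)
    return layers.Multiply()([x, w])          # amplify informative channels

def spatial_attention(x):
    """Spatial attention: per-pixel max/avg over channels, 3x3 conv, sigmoid."""
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    mx = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    w = layers.Concatenate()([avg, mx])
    w = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(w)
    return layers.Multiply()([x, w])          # amplify informative pixels

def cbam(x):
    """CBAM block: channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x))
```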
Preferably, the ASPP pyramid module comprises 3 × 3 atrous convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises the following steps:
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the average pooling layer with stride 1, and outputting the result;
and merging the above outputs to obtain the final pooled output of the ASPP pyramid module.
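The ASPP module described above could be sketched as follows; the 3 × 3 window of the stride-1 average-pooling branch and the extra convolution that brings that branch to the common channel width are assumptions consistent with the embodiment described later.

```python
from tensorflow.keras import layers

def aspp(x, filters=256):
    """ASPP: parallel 3x3 atrous convolutions (rates 6/12/18) plus a pooling branch."""
    branches = []
    for rate in (6, 12, 18):
        b = layers.Conv2D(filters, 3, padding="same", dilation_rate=rate)(x)
        # the two follow-up 1x1, stride-1 convolutions described above
        b = layers.Conv2D(filters, 1)(b)
        b = layers.Conv2D(filters, 1)(b)
        branches.append(b)

    # stride-1 average-pooling branch; the 3x3 window and the 3x3 conv that
    # brings it to `filters` channels are assumptions (cf. step 4.4 below)
    pool = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
    pool = layers.Conv2D(filters, 3, padding="same")(pool)
    branches.append(pool)

    merged = layers.Concatenate()(branches)               # e.g. 64 x 64 x 1024
    return layers.Conv2D(filters, 3, padding="same")(merged)
```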
Preferably, fusing feature maps of the same resolution from the fusion result and the pooling result and upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original image size includes:
convolving the pooled output of the ASPP pyramid module once, upsampling it by a factor of 2, and merging it with hrout3;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout2;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout1;
and convolving the merged result twice, upsampling by a factor of 2, and activating with a softmax function to obtain the final output.
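A sketch of this progressive-upsampling decoder is given below. The 256-channel width of the intermediate convolutions, the bilinear interpolation mode and the final 1 × 1 classification convolution before the softmax are assumptions for illustration.

```python
from tensorflow.keras import layers

def progressive_decoder(aspp_out, hrout1, hrout2, hrout3, num_classes=9):
    """Stage-by-stage 2x upsampling, fusing the HRNet outputs of matching resolution."""
    x = layers.Conv2D(256, 3, padding="same")(aspp_out)        # convolve once
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)    # 2x upsample
    x = layers.Concatenate()([x, hrout3])                      # merge with hrout3

    x = layers.Conv2D(256, 3, padding="same")(x)
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)
    x = layers.Concatenate()([x, hrout2])                      # merge with hrout2

    x = layers.Conv2D(256, 3, padding="same")(x)
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)
    x = layers.Concatenate()([x, hrout1])                      # merge with hrout1

    x = layers.Conv2D(256, 3, padding="same")(x)               # convolve twice
    x = layers.Conv2D(256, 3, padding="same")(x)
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)    # back to input size
    return layers.Conv2D(num_classes, 1, activation="softmax")(x)
```

Fusing at each resolution before the next 2× upsampling is what lets the decoder recover spatial detail progressively rather than in a single large jump.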
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a road image segmentation method fusing context and step-by-step sampling, which fuses different levels of features by utilizing a HRNet mode, adds a CBAM attention mechanism in front of an HRNet module, enhances a beneficial feature channel, weakens a useless feature channel, and finally samples the output of an ASPP pyramid module and the fused different levels of features step by step. The experimental results show that: the method for integrating context and up-sampling step by step is more accurate in identifying small target objects on the road and has obvious improvement on image detail segmentation; the invention can help the automobile to identify the type, position and size of the road surface object, and can effectively pre-judge the distant small target pedestrians in advance because the identification of the small target object is more accurate, thereby having a large play space in the intelligent driving direction.
Drawings
Fig. 1 is a flowchart of a road image segmentation method fusing context progressive sampling according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The first embodiment is as follows:
the embodiment provides a road image segmentation method, which comprises the following steps:
step 1, preprocessing a plurality of acquired road images to obtain segmented pictures;
cutting the road image into 1024-by-1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian crossing.
Step 2, inputting each segmentation picture into the constructed Xception model to extract a deep feature map and shallow feature maps;
the Xception model extracts the deep feature map of the segmentation picture at the block13 intermediate feature layer, and extracts the shallow feature maps at the block2, block3 and block4 intermediate feature layers.
Step 3, inputting the shallow feature maps into the constructed CBAM attention model to amplify small-target features, and feeding the output into the constructed HRNet module for fusion;
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify small-target features, producing outputs out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps hrout1, hrout2 and hrout3 at three resolutions;
a small target is an object whose area in the segmentation picture is less than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Step 4, inputting the deep feature map into the constructed ASPP pyramid module for pooling;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1, and outputting the result;
and merging the above outputs to obtain the final pooled output of the ASPP pyramid module.
Step 5, fusing feature maps of the same resolution from the fusion result and the pooling result, and upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original image size;
convolving the pooled output of the ASPP pyramid module once, upsampling it by a factor of 2, and merging it with hrout3;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout2;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout1;
and convolving the merged result twice, upsampling by a factor of 2, and activating with a softmax function to obtain the final output.
The Xception model is a network structure for image classification proposed by Google; in this embodiment, the construction of the Xception model includes:
constructing a block1 intermediate feature layer, which consists of a 3 × 3 convolution layer with 32 channels, a ReLU activation layer, a 3 × 3 convolution layer with 64 channels and a ReLU activation layer;
constructing a block2 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 128 channels, a ReLU activation layer and a max pooling layer;
constructing a block3 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 256 channels, a ReLU activation layer and a max pooling layer;
constructing a block4 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 728 channels, a ReLU activation layer and a max pooling layer;
constructing block5-block13 intermediate feature layers, each composed of three 3 × 3 depthwise separable convolution layers with 728 channels and three ReLU activation layers;
after the block1 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; after the block2 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; after the block3 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
The Convolutional Block Attention Module (CBAM) is an attention model that combines spatial and channel attention and can improve the recognition of small target objects by exploiting the spatial and channel information among pixels. In this embodiment, constructing the CBAM attention model includes constructing a channel attention mechanism and a spatial attention mechanism;
the channel attention mechanism comprises:
performing one max pooling and one average pooling on the input feature map in the channel dimension to extract the maximum weight and the average weight of each channel;
sending the maximum weight and the average weight separately through two fully connected layers;
adding the two results and activating them with a sigmoid function to obtain an importance weight matrix over the channels;
multiplying the channel importance weight matrix with the input feature map to obtain an output with amplified channel features;
max pooling takes the maximum pixel value of each channel, average pooling takes the average pixel value of each channel, and the sigmoid activation is used so that larger input values are emphasized and smaller ones are suppressed;
the spatial attention mechanism comprises:
performing one max pooling and one average pooling on the channel-amplified output in the spatial dimension to extract the maximum weight and the average weight of each pixel;
convolving the maximum weight and the average weight with a 3 × 3 convolution layer, activating with a sigmoid function, and outputting an importance weight matrix over the pixels;
and multiplying the pixel importance weight matrix with the channel-amplified output to obtain an output with amplified pixel features.
The ASPP pyramid module enlarges the receptive field by using atrous (dilated) convolutions with different dilation rates, avoiding the loss of resolution that conventional methods accept in order to obtain a larger receptive field. In this embodiment, the ASPP pyramid module includes 3 × 3 atrous convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1.
In this embodiment, the construction of the HRNet module includes:
combining out1, the 2× upsampled out2 and the 4× upsampled out3 into out11; combining the 2× downsampled out1, out2 and the 2× upsampled out3 into out22; combining the 4× downsampled out1, the 2× downsampled out2 and out3 into out33;
combining out11, the 2× upsampled out22 and the 4× upsampled out33 into out111; combining the 2× downsampled out11, out22 and the 2× upsampled out33 into out222; and combining the 4× downsampled out11, the 2× downsampled out22 and out33 into out333.
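The two fusion rounds could be sketched as follows; average pooling is assumed for the downsampling operations (the text does not specify the method), and the per-resolution channel widths (128, 256, 512) and the two 3 × 3 convolutions after each merge follow steps 5.1-5.6 below.

```python
from tensorflow.keras import layers

def _fuse(high, mid, low, channels=(128, 256, 512)):
    """One round of cross fusion between three resolutions (high = largest map)."""
    def convs(x, ch):
        # two 3x3 convolutions after each merge, as in steps 5.1-5.6
        x = layers.Conv2D(ch, 3, padding="same")(x)
        return layers.Conv2D(ch, 3, padding="same")(x)

    f_high = layers.Concatenate()([high,
                                   layers.UpSampling2D(2)(mid),
                                   layers.UpSampling2D(4)(low)])
    f_mid = layers.Concatenate()([layers.AveragePooling2D(2)(high),   # 2x down-sampling (pooling type assumed)
                                  mid,
                                  layers.UpSampling2D(2)(low)])
    f_low = layers.Concatenate()([layers.AveragePooling2D(4)(high),
                                  layers.AveragePooling2D(2)(mid),
                                  low])
    return (convs(f_high, channels[0]),
            convs(f_mid, channels[1]),
            convs(f_low, channels[2]))

def hrnet_fusion(out1, out2, out3):
    """Two fusion rounds: out11/out22/out33, then out111/out222/out333 (hrout1-hrout3)."""
    out11, out22, out33 = _fuse(out1, out2, out3)
    return _fuse(out11, out22, out33)
```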
Taking German city street views as an example, the data set contains 9 broad categories: background, car, person, sky, road, grass, wall, building and pedestrian crossing. The data set has 1300 road street-view pictures from 10 German cities; 1000 samples are used for training and 300 for testing. Each picture is 2048 × 1024 pixels. Training is performed on a Tesla P100 GPU with 16 GB of video memory, using mini-batch stochastic optimization with the Adam optimizer; the learning rate is 0.001 for the first 500 epochs and is adjusted to 0.0001 for the last 200 epochs. The loss function is the cross-entropy loss (categorical_crossentropy).
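As a hypothetical illustration of this training configuration (the model assembly and data loading are placeholders passed in as arguments, not part of the patent):

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # 0.001 for the first 500 epochs, then 0.0001 for the remaining 200
    return 1e-3 if epoch < 500 else 1e-4

def train(model, train_images, train_labels):
    """Compile and train with the hyper-parameters stated above."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_images, train_labels, epochs=700,
                     callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```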
The training data set is put into step (1) for image preprocessing as follows:
1.1 cut the 1000 training pictures into 2000 pictures of 1024 × 1024 pixels each.
1.2 convert the 2000 cropped pictures from step 1.1 into 3-channel array format, yielding 2000 matrices of size 1024 × 1024 × 3.
1.3 stack the 2000 three-dimensional matrices from step 1.2 into a four-dimensional matrix of size 2000 × 1024 × 1024 × 3.
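Steps 1.1-1.3 could be implemented roughly as follows; the file layout and the use of PIL/NumPy are assumptions.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def crop_tiles(image):
    """Split one 2048 x 1024 street-view image into two 1024 x 1024 tiles (step 1.1)."""
    arr = np.asarray(image)                      # shape (1024, 2048, 3)
    return arr[:, :1024, :], arr[:, 1024:, :]

def load_batch(tile_dir):
    """Read the cropped 1024 x 1024 jpg tiles and stack them into one
    N x 1024 x 1024 x 3 array (steps 1.2-1.3); the directory layout is assumed."""
    tiles = []
    for path in sorted(Path(tile_dir).glob("*.jpg")):
        img = Image.open(path).convert("RGB")    # 3-channel array format
        tiles.append(np.asarray(img, dtype=np.float32))
    return np.stack(tiles, axis=0)               # e.g. 2000 x 1024 x 1024 x 3
```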
The result of step 1.3 is put into step (2), and features are extracted with the Xception network as follows:
2.1 the height and width of the four-dimensional matrix are zero-padded by 2 pixels on each side to form a 2000 × 1028 × 1028 × 3 matrix, which is put into block1 of the Xception network to obtain a matrix of size 2000 × 512 × 512 × 64.
2.2 put the output of step 2.1 into block2 to obtain a shallow feature map matrix of size 2000 × 256 × 256 × 128.
2.3 put the output of step 2.2 into block3 to obtain a shallow feature map matrix of size 2000 × 128 × 128 × 256.
2.4 put the output of step 2.3 into block4 to obtain a shallow feature map matrix of size 2000 × 64 × 64 × 728.
2.5 put the output of step 2.4 through block5, block6, block7, block8, block9, block10, block11, block12 and block13 in turn to obtain a deep feature map matrix of size 2000 × 64 × 64 × 728.
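Using the backbone sketch given after the Xception construction above (and assuming that sketch's layer parameters), the feature-map sizes of steps 2.2-2.5 can be checked as follows, with the batch dimension left symbolic:

```python
# Hypothetical shape check against the sizes reported in steps 2.2-2.5.
backbone = build_xception_backbone(input_shape=(1024, 1024, 3))
for t in backbone.outputs:
    print(t.shape)
# expected: (None, 256, 256, 128), (None, 128, 128, 256),
#           (None, 64, 64, 728), (None, 64, 64, 728)
```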
The outputs of steps 2.2, 2.3 and 2.4 are each put into step (3), and the small-target features are amplified with the CBAM attention mechanism as follows:
3.1 the feature matrix before the block2 pooling in step 2.2 passes in turn through the channel attention mechanism and the spatial attention mechanism of the CBAM module, giving an output matrix with amplified small-target features of size 2000 × 512 × 512 × 128.
3.2 the feature matrix before the block3 pooling in step 2.3 passes in turn through the channel attention mechanism and the spatial attention mechanism of the CBAM module, giving an output matrix with amplified small-target features of size 2000 × 256 × 256 × 256.
3.3 the feature matrix before the block4 pooling in step 2.4 passes in turn through the channel attention mechanism and the spatial attention mechanism of the CBAM module, giving an output matrix with amplified small-target features of size 2000 × 128 × 128 × 512.
The block13 deep feature map from step 2.5 is put into step (4), and a larger receptive field is obtained with the ASPP pyramid module as follows:
4.1 put the output feature matrix of block13 into a 3 × 3 atrous convolution layer with 256 channels and dilation rate 6, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.2 put the output feature matrix of block13 into a 3 × 3 atrous convolution layer with 256 channels and dilation rate 12, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.3 put the output feature matrix of block13 into a 3 × 3 atrous convolution layer with 256 channels and dilation rate 18, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.4 put the output feature matrix of block13 into the stride-1 pooling layer and then through a 3 × 3 convolution layer with 256 channels, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.5 merge the outputs of steps 4.1, 4.2, 4.3 and 4.4 into a feature matrix of size 2000 × 64 × 64 × 1024, then pass it through a 3 × 3 convolution layer with 256 channels to obtain a feature matrix of size 2000 × 64 × 64 × 256.
The shallow feature matrices from step (3) are put into step (5) and cross-fused with the HRNet module as follows:
5.1 the output of step 3.2 is upsampled by 2× to 2000 × 512 × 512 × 256 and the output of step 3.3 is upsampled by 4× to 2000 × 512 × 512 × 512; these two results are merged with the result of step 3.1 to give a matrix of size 2000 × 512 × 512 × 896, which then passes through two 3 × 3 convolution layers with 128 channels to give a feature matrix of size 2000 × 512 × 512 × 128.
5.2 the output of step 3.1 is downsampled by 2× to 2000 × 256 × 256 × 128 and the output of step 3.3 is upsampled by 2× to 2000 × 256 × 256 × 512; these two results are merged with the result of step 3.2 to give a matrix of size 2000 × 256 × 256 × 896, which then passes through two 3 × 3 convolution layers with 256 channels to give a feature matrix of size 2000 × 256 × 256 × 256.
5.3 the output of step 3.1 is downsampled by 4× to 2000 × 128 × 128 × 128 and the output of step 3.2 is downsampled by 2× to 2000 × 128 × 128 × 256; these two results are merged with the result of step 3.3 to give a matrix of size 2000 × 128 × 128 × 896, which then passes through two 3 × 3 convolution layers with 512 channels to give a feature matrix of size 2000 × 128 × 128 × 512.
5.4 the output of step 5.2 is upsampled by 2× to 2000 × 512 × 512 × 256 and the output of step 5.3 is upsampled by 4× to 2000 × 512 × 512 × 512; these two results are merged with the result of step 5.1 to give a matrix of size 2000 × 512 × 512 × 896, which then passes through two 3 × 3 convolution layers with 128 channels to give a feature matrix of size 2000 × 512 × 512 × 128.
5.5 the output of step 5.1 is downsampled by 2× to 2000 × 256 × 256 × 128 and the output of step 5.3 is upsampled by 2× to 2000 × 256 × 256 × 512; these two results are merged with the result of step 5.2 to give a matrix of size 2000 × 256 × 256 × 896, which then passes through two 3 × 3 convolution layers with 256 channels to give a feature matrix of size 2000 × 256 × 256 × 256.
5.6 the output of step 5.1 is downsampled by 4× to 2000 × 128 × 128 × 128 and the output of step 5.2 is downsampled by 2× to 2000 × 128 × 128 × 256; these two results are merged with the result of step 5.3 to give a matrix of size 2000 × 128 × 128 × 896, which then passes through two 3 × 3 convolution layers with 512 channels to give a feature matrix of size 2000 × 128 × 128 × 512.
The outputs of steps 4.5, 5.4, 5.5 and 5.6 are sent into step (6), and the feature map is progressively upsampled and enlarged as follows:
6.1 put the output of step 5.6 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 128 × 128 × 80.
6.2 upsample the output of step 4.5 by a factor of 2 and merge it with the output of step 6.1 to obtain a matrix of size 2000 × 128 × 128 × 336; pass it through a 3 × 3 convolution layer with 256 channels and upsample by a factor of 2 to obtain a matrix of size 2000 × 256 × 256 × 256.
6.3 put the output of step 5.5 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 256 × 256 × 80.
6.4 merge the output of step 6.3 with the output of step 6.2 to obtain a matrix of size 2000 × 256 × 256 × 336; pass it through a 3 × 3 convolution layer with 256 channels and upsample by a factor of 2 to obtain a matrix of size 2000 × 512 × 512 × 256.
6.5 put the output of step 5.4 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 512 × 512 × 80.
6.6 merge the output of step 6.5 with the output of step 6.4 to obtain a matrix of size 2000 × 512 × 512 × 336; pass it through a 3 × 3 convolution layer with 256 channels and upsample by a factor of 2 to obtain a matrix of size 2000 × 1024 × 1024 × 256.
6.7 put the output matrix of step 6.6 into a 1 × 1 convolution layer with 9 channels and activate it with the softmax function, obtaining a matrix of size 2000 × 1024 × 1024 × 9.
6.8 compare the difference between the output matrix and the annotated picture matrix, continuously optimize the network parameters by gradient descent using the cross-entropy loss function, and obtain the final network after 700 training epochs.
Step (7) outputs the segmented pictures, specifically as follows:
7.1 testing is performed with the 300 test pictures of 2048 × 1024 pixels; each picture is cropped into 2 pictures of 1024 × 1024 pixels.
7.2 the cropped pictures obtained in step 7.1 are fed into the network in one pass to obtain a matrix of size 600 × 1024 × 1024 × 9; the last dimension is reduced by one-hot decoding to give a matrix of size 600 × 1024 × 1024, i.e. 600 pictures of size 1024 × 1024 in which every pixel carries a label from 0 to 8 representing the 9 classes: background, car, person, sky, road, grass, wall, building and pedestrian crossing.
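Step 7.2 could be implemented roughly as follows; the "one-hot decoding" is realized here as an argmax over the class dimension, which is an interpretation of the text.

```python
import numpy as np

def predict_labels(model, tiles):
    """Run the trained network on the cropped test tiles and collapse the
    softmax output to per-pixel labels 0-8 (step 7.2)."""
    probs = model.predict(tiles)                         # N x 1024 x 1024 x 9
    return np.argmax(probs, axis=-1).astype(np.uint8)    # N x 1024 x 1024 label maps
```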
TABLE 1 Comparison of the method of the invention with other methods
Method      DeepLabV1   DeepLabV2   DeepLabV3+   Method of the invention
Accuracy    79.5%       83.32%      88.48%       90.02%
As can be seen from Table 1, the method of the invention achieves better road image segmentation accuracy than the existing mainstream segmentation networks. In particular, its ability to recognize small targets is stronger and its segmentation of object edges is more precise. The method uses deep learning to discover and classify the distinguishing features of different objects, and can be widely applied to road recognition, road scene segmentation and similar tasks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A road image segmentation method, comprising:
preprocessing the acquired road images to obtain segmentation pictures;
inputting each segmentation picture into a constructed Xception model to extract a deep feature map and shallow feature maps;
inputting the shallow feature maps into a constructed CBAM attention model to amplify small-target features, and feeding the output into a constructed HRNet module for fusion;
inputting the deep feature map into a constructed ASPP pyramid module for pooling;
fusing feature maps of the same resolution from the fusion result and the pooling result, and upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original image size;
wherein the construction of the Xception model comprises:
constructing a block1 intermediate feature layer, which consists of a 3 × 3 convolution layer with 32 channels, a ReLU activation layer, a 3 × 3 convolution layer with 64 channels and a ReLU activation layer;
constructing a block2 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 128 channels, a ReLU activation layer and a max pooling layer;
constructing a block3 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 256 channels, a ReLU activation layer and a max pooling layer;
constructing a block4 intermediate feature layer, which consists of two 3 × 3 depthwise separable convolution layers with 728 channels, a ReLU activation layer and a max pooling layer;
constructing block5-block13 intermediate feature layers, each composed of three 3 × 3 depthwise separable convolution layers with 728 channels and three ReLU activation layers;
after the block1 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; after the block2 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; after the block3 intermediate feature layer produces its output, the output is also sent to a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer;
wherein inputting the segmentation picture into the constructed Xception model to extract the deep feature map and the shallow feature maps comprises: the Xception model extracts the deep feature map of the segmentation picture at the block13 intermediate feature layer, and extracts the shallow feature maps at the block2, block3 and block4 intermediate feature layers;
wherein inputting the shallow feature maps into the constructed CBAM attention model to amplify small-target features and feeding the output into the constructed HRNet module for fusion comprises:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify small-target features, producing outputs out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps hrout1, hrout2 and hrout3 at three resolutions;
wherein a small target is an object whose area in the segmentation picture is less than 10 × 10 pixels;
and the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
2. The road image segmentation method according to claim 1, wherein preprocessing the acquired road images to obtain segmentation pictures comprises:
cutting each road image into 1024 × 1024 pixel pictures and uniformly saving them in jpg format;
performing semantic annotation on each picture to obtain a segmentation picture;
wherein the semantic annotation classes comprise background, car, person, sky, road, grass, wall, building and pedestrian crossing.
3. The road image segmentation method according to claim 1, wherein constructing the CBAM attention model comprises: constructing a channel attention mechanism and a spatial attention mechanism;
the channel attention mechanism comprises:
performing one max pooling and one average pooling on the input feature map in the channel dimension to extract the maximum weight and the average weight of each channel;
sending the maximum weight and the average weight separately through two fully connected layers;
adding the two results and activating them with a sigmoid function to obtain an importance weight matrix over the channels;
multiplying the channel importance weight matrix with the input feature map to obtain an output with amplified channel features;
wherein max pooling takes the maximum pixel value of each channel, average pooling takes the average pixel value of each channel, and the sigmoid activation is used so that larger input values are emphasized and smaller ones are suppressed;
the spatial attention mechanism comprises:
performing one max pooling and one average pooling on the channel-amplified output in the spatial dimension to extract the maximum weight and the average weight of each pixel;
convolving the maximum weight and the average weight with a 3 × 3 convolution layer, activating with a sigmoid function, and outputting an importance weight matrix over the pixels;
and multiplying the pixel importance weight matrix with the channel-amplified output to obtain an output with amplified pixel features.
4. The road image segmentation method according to claim 1, wherein the ASPP pyramid module comprises 3 × 3 atrous convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; and inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises:
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1, and outputting the result;
sending the deep feature map extracted from the block13 intermediate feature layer into the average pooling layer with stride 1, and outputting the result;
and merging the above outputs to obtain the final pooled output of the ASPP pyramid module.
5. The road image segmentation method according to claim 4, wherein fusing feature maps of the same resolution from the fusion result and the pooling result and upsampling by a factor of 2 at each stage until the feature maps are enlarged back to the original size comprises:
convolving the pooled output of the ASPP pyramid module once, upsampling it by a factor of 2, and merging it with hrout3;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout2;
convolving the merged result once, upsampling it by a factor of 2, and merging it with hrout1;
and convolving the merged result twice, upsampling by a factor of 2, and activating with a softmax function to obtain the final output.
CN202110706637.2A 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling Active CN113436210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Publications (2)

Publication Number Publication Date
CN113436210A CN113436210A (en) 2021-09-24
CN113436210B true CN113436210B (en) 2022-10-11

Family

ID=77754090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706637.2A Active CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Country Status (1)

Country Link
CN (1) CN113436210B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842333B (en) * 2022-04-14 2022-10-28 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Also Published As

Publication number Publication date
CN113436210A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110956094A (en) RGB-D multi-mode fusion personnel detection method based on asymmetric double-current network
CN113436210B (en) Road image segmentation method fusing context progressive sampling
CN110837786B (en) Density map generation method and device based on spatial channel, electronic terminal and medium
CN110378398B (en) Deep learning network improvement method based on multi-scale feature map jump fusion
Lyu et al. Small object recognition algorithm of grain pests based on SSD feature fusion
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN116343043B (en) Remote sensing image change detection method with multi-scale feature fusion function
CN115331183A (en) Improved YOLOv5s infrared target detection method
CN111611861A (en) Image change detection method based on multi-scale feature association
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN111583265A (en) Method for realizing phishing behavior detection processing based on codec structure and corresponding semantic segmentation network system
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN113361528B (en) Multi-scale target detection method and system
CN110728178B (en) Event camera lane line extraction method based on deep learning
CN116597270A (en) Road damage target detection method based on attention mechanism integrated learning network
CN114926826A (en) Scene text detection system
CN114943902A (en) Urban vegetation unmanned aerial vehicle remote sensing classification method based on multi-scale feature perception network
CN114639067A (en) Multi-scale full-scene monitoring target detection method based on attention mechanism
CN113989718A (en) Human body target detection method facing radar signal heat map
CN111881914B (en) License plate character segmentation method and system based on self-learning threshold

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant