CN112699889A - Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Info

Publication number
CN112699889A
Authority
CN
China
Prior art keywords: image, layer, road scene, semantic segmentation, real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110017471.3A
Other languages
Chinese (zh)
Inventor
周武杰
林鑫杨
钱小鸿
万健
甘兴利
叶宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110017471.3A
Publication of CN112699889A
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses an unmanned real-time road scene semantic segmentation method based on multi-task supervision, applied in the technical field of road scene semantic segmentation, which comprises the following steps: selecting the color images and thermal images of Q original road scene images and the corresponding real semantic segmentation images to form a training set; constructing a convolutional neural network that uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features; inputting the color images and thermal images of the original road scene images in the training set into the convolutional neural network for training to obtain predicted images; calculating the loss function values between the predicted images and the corresponding real images; and obtaining the final weight vector and final bias term according to the loss function values. The invention improves image segmentation efficiency and accuracy and meets real-time requirements.

Description

Unmanned real-time road scene semantic segmentation method based on multi-task supervision
Technical Field
The invention relates to the technical field of semantic segmentation of unmanned road scenes, in particular to an unmanned real-time road scene semantic segmentation method based on multi-task supervision.
Background
With the continuous development of autonomous driving, computer vision and natural language processing technologies, unmanned vehicles are gradually becoming a common part of daily life. While driving, an unmanned vehicle must accurately understand its surroundings in real time and react quickly to emergencies in order to avoid traffic accidents. Efficient and accurate road scene semantic segmentation has therefore become one of the research hot spots in the field of computer vision.
The semantic segmentation task is a basic task of image understanding and an important problem to be solved in the field of computer vision. Over the past few years, deep learning techniques, particularly convolutional neural networks, have shown great potential in semantic segmentation. In general, the fully convolutional network architectures used for semantic segmentation can be divided into two categories: those based on an encoder-decoder structure and those based on a dilated (atrous) convolution structure. An encoder-decoder architecture first uses the encoder to extract image features and then uses the decoder to recover the spatial resolution; a dilated convolution structure uses dilated convolutions to enlarge the overall receptive field while reducing the loss of spatial information in the encoding stage, so that the model can take the global semantic information into account.
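As a minimal illustration of the trade-off between these two designs, the short PyTorch sketch below (channel counts and tensor sizes are arbitrary assumptions, not taken from this disclosure) shows that a 3 × 3 convolution with a dilation rate of 4 covers a 9 × 9 neighbourhood per output pixel while the feature map keeps its full 480 × 640 resolution, which is exactly what makes the dilated design memory-hungry:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 480, 640)        # feature map kept at full resolution

# A standard 3x3 convolution covers a 3x3 neighbourhood per output pixel.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# A 3x3 convolution with dilation 4 covers a 9x9 neighbourhood (larger
# receptive field), while the output stays 480x640 because the padding is
# matched to the dilation rate; the full-resolution activations must be
# kept in memory for every such layer.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

print(conv(x).shape, dilated(x).shape)  # both: torch.Size([1, 64, 480, 640])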
Although the dilated convolution structure has the advantage of preserving spatial information, keeping a high spatial resolution throughout the network without downsampling consumes much more memory, greatly slows down the inference of the model, and cannot meet real-time requirements. In addition, because a convolutional network learns richer features as its depth increases, the high memory consumption makes it difficult to give the network a deeper structure.
Therefore, a problem that urgently needs to be solved by those skilled in the art is to provide an unmanned real-time road scene semantic segmentation method with high segmentation efficiency and high segmentation accuracy that can meet real-time requirements.
Disclosure of Invention
For night road scenes, in which poor illumination conditions bring great challenges to scene understanding, the invention provides an unmanned real-time road scene semantic segmentation method based on multi-task supervision, which combines low-level and high-level feature information, uses a dense connection structure for image decoding, uses the MobileNetV2 lightweight network as a feature extractor, and uses an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image.
In order to achieve the above purpose, the invention provides the following technical scheme:
a multitask supervision-based unmanned real-time road scene semantic segmentation method comprises the following specific steps:
selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features;
inputting the color images and thermal images of the original road scene images in the training set as original input images into the convolutional neural network for training to obtain the corresponding foreground-background prediction images, semantic segmentation prediction images and boundary prediction images;
calculating the loss function values between the foreground-background prediction images, semantic segmentation prediction images and boundary prediction images obtained by training and the corresponding real foreground-background images, real semantic segmentation images and real boundary images;
and repeating the training and the calculation of the loss function values, and taking the last training result as the final weight vector and the final bias term.
Further, the Q original road scene images are images in a road scene image database reported in the MFNet.
Further, the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multi-task output layer;
the input layer comprises a color image input layer and a thermal image input layer, which receive the color image and the thermal image, respectively;
the feature extraction layer performs layer-by-layer feature extraction on the color image and the thermal image and extracts the deep semantic features of the images;
the feature fusion layer fuses multi-level features using a dense connection structure;
and the multi-task output layer outputs the foreground-background prediction image, the semantic segmentation prediction image and the boundary prediction image.
Further, the MobileNetV2 network removes the last two inverted residual structures and the classification layer, and the remaining part is divided into 3 blocks, wherein the color image input branch corresponds to R_Block_i, i = 1, 2, 3, and the thermal image input branch corresponds to T_Block_i, i = 1, 2, 3.
Furthermore, the output result of each module of the thermal image input branch and the output result of the corresponding module of the color image input branch are fused by element-wise addition of the corresponding features.
Further, the dense connection structure comprises a fused upsampling module, which consists of a 1 × 1 convolutional layer, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer and a batch normalization layer.
The efficient atrous spatial feature pyramid structure comprises a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer, a ReLU6 activation function, three parallel shallow structures and one deep structure. The first shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer and a ReLU6 activation function; the second shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer and a ReLU6 activation function; and the third shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 8 and a dilation rate of 8. The deep structure comprises a 3 × 3 depthwise convolutional layer with a stride of 2 and padding of 1, a batch normalization layer and a ReLU6 activation function, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and a parallel structure. The parallel structure comprises the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer, a ReLU6 activation function and double upsampling; the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer, a ReLU6 activation function and double upsampling; and the combination of an adaptive maximum pooling layer and 10-fold upsampling. Finally, the outputs are concatenated and passed through a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function.
According to the above technical solution, compared with the prior art, the invention has the following beneficial effects:
1) The method takes the thermal image information as a supplement to the color image information and fuses the thermal image features with the color image features, so that objects can be predicted accurately even at night.
2) The method uses the MobileNetV2 lightweight network as the feature extractor, so that the model can meet real-time requirements, and uses an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, which further improves the accuracy of the convolutional neural network model.
3) When constructing the convolutional neural network, the method combines low-level and high-level feature information and uses a dense connection structure to decode the image and fuse multi-level features, so that classification targets of various sizes in the road scene can be segmented accurately, which effectively improves the semantic segmentation accuracy of road scene images.
4) The method uses multi-task supervision to improve model performance through the correlation among the multiple tasks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a structural diagram of the efficient atrous spatial feature pyramid (eASPP) used in the method of the present invention;
FIG. 3 is a block diagram of a fused upsampling module FU of the method of the present invention;
FIGS. 4a and 4b are the 1st original road scene color image and thermal image of the same scene;
FIGS. 4c, 4d and 4e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 4a and 4b with the method of the present invention;
FIGS. 5a and 5b are the 2nd original road scene color image and thermal image of the same scene;
FIGS. 5c, 5d and 5e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 5a and 5b with the method of the present invention;
FIGS. 6a and 6b are the 3rd original road scene color image and thermal image of the same scene;
FIGS. 6c, 6d and 6e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 6a and 6b with the method of the present invention;
FIGS. 7a and 7b are the 4th original road scene color image and thermal image of the same scene;
FIGS. 7c, 7d and 7e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 7a and 7b with the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multitask supervision-based unmanned real-time road scene semantic segmentation method, and the overall implementation block diagram is shown in figure 1.
The method comprises a training stage and a testing stage, wherein the training stage comprises the following specific steps:
step S101: selecting a color image and a thermal image of Q original road scene images, a corresponding real foreground background image, a real semantic segmentation image and a real boundary image to form a training set;
step S102: constructing a convolutional neural network, wherein the convolutional neural network uses a MobileNet V2 lightweight network as a feature extractor, uses an improved high-efficiency void space feature pyramid structure to extract deep semantic features of an image, and uses a dense connection structure to fuse multi-level features;
step S103: inputting color images and thermal images of original road scene images in a training set as original input images into a convolutional neural network for training to obtain corresponding foreground and background prediction images, semantic segmentation prediction images and boundary prediction images;
step S104: calculating a loss function value between a predicted image obtained by training and a corresponding original road scene image;
step S105: and repeating training and calculating a loss function value, and determining the last training result as a final weight vector and a final bias item.
In the embodiment of the present invention, step S101 specifically includes: selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set. The q-th original road scene color image in the training set is denoted {I_q^RGB(i,j)}, the corresponding thermal image is denoted {I_q^T(i,j)}, the corresponding real semantic segmentation image is denoted {G_q^seg(i,j)}, and the corresponding real foreground-background image and real boundary image are denoted {G_q^fb(i,j)} and {G_q^bd(i,j)}, respectively. Here Q = 1176 is the number of training samples, q is a positive integer with 1 ≤ q ≤ Q, 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W is the width of the input image and H is the height of the input image (e.g. W = 640, H = 480); I_q^RGB(i,j), I_q^T(i,j), G_q^fb(i,j), G_q^seg(i,j) and G_q^bd(i,j) denote the pixel values of the pixel at coordinate position (i,j) in the respective images. In this embodiment, 1176 images in the road scene image database reported in MFNet are directly selected as the original road scene images.
Step S102: constructing a convolutional neural network, wherein the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multitask output layer;
the characteristic extraction layer consists of two main networks of MobileNet V2 and an improved high-efficiency hollow space characteristic pyramid structure; the feature fusion layer uses a dense connection structure to repeatedly utilize high-level features for image decoding; the multitask output layer uses the fusion features as input and outputs a semantic prediction graph, a boundary prediction graph and a foreground and background prediction graph.
Specifically, the input layer includes a color image input layer and a thermal image input layer, which receive an RGB color image and a thermal image, respectively; both input images are required to have a width of W and a height of H.
For the feature extraction layer, the method uses MobileNetV2 as the backbone feature extractor, removes its last two inverted residual structures and the classification layer, and divides the remaining part into 3 blocks; the detailed partition is shown in Table 1. As shown in FIG. 1, for the color image input branch the corresponding structures are defined as R_Block_i, i = 1, 2, 3; for the thermal image input branch the corresponding structures are defined as T_Block_i, i = 1, 2, 3. The output of each module of the thermal image branch is fused with the output of the corresponding module of the color image branch by element-wise addition, and the fused features of the different levels are defined, from shallow to deep, as feature O_4, feature O_3 and feature O_2. In Table 1, t is the internal parameter of the bottleneck layer, c is the number of output channels, n is the number of times the module is repeated, and s is the downsampling factor of the block.
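A minimal sketch of this two-branch feature extractor is given below. It relies on the torchvision implementation of MobileNetV2; the split points of the three blocks are assumptions, since the exact partition is given in Table 1 (published as an image), and feeding the fused result into the next color-branch block is likewise an assumed wiring. The thermal input is assumed to be replicated to three channels so that the standard MobileNetV2 stem can consume it.

import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class DualBranchEncoder(nn.Module):
    # Two MobileNetV2 backbones (classifier and last stages excluded by the
    # slicing below); outputs of corresponding blocks are fused by
    # element-wise addition, giving features O4, O3, O2 from shallow to deep.
    def __init__(self):
        super().__init__()
        def make_blocks():
            feats = mobilenet_v2().features
            # assumed split points; the patent's partition is in Table 1
            return nn.ModuleList([feats[:4], feats[4:7], feats[7:14]])
        self.r_blocks = make_blocks()   # R_Block1..3, color branch
        self.t_blocks = make_blocks()   # T_Block1..3, thermal branch

    def forward(self, rgb, thermal):
        fused = []
        r, t = rgb, thermal             # thermal assumed replicated to 3 channels
        for r_blk, t_blk in zip(self.r_blocks, self.t_blocks):
            r, t = r_blk(r), t_blk(t)
            r = r + t                   # element-wise addition of corresponding features
            fused.append(r)
        o4, o3, o2 = fused              # shallow -> deep
        return o4, o3, o2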
Further, in order to enlarge the receptive field of the model, the invention uses the efficient atrous spatial feature pyramid structure (eASPP) shown in FIG. 2 to extract the deep semantic features of the image. It takes feature O_2 as input and produces feature O_1. A depthwise convolution is defined as a grouped convolution whose number of groups equals the number of input feature channels. Feature O_2 is first passed through a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and the resulting feature is then fed into three parallel shallow structures and one deep structure. The first shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer and a ReLU6 activation function; the second shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer and a ReLU6 activation function; and the third shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 8 and a dilation rate of 8. The deep structure comprises a 3 × 3 depthwise convolutional layer with a stride of 2 and padding of 1, a batch normalization layer and a ReLU6 activation function, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and a parallel structure. The parallel structure comprises the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer, a ReLU6 activation function and double upsampling; the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer, a ReLU6 activation function and double upsampling; and the combination of an adaptive maximum pooling layer and 10-fold upsampling. Finally, all output features are concatenated and fed into a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function to obtain feature O_1.
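A simplified PyTorch sketch of the eASPP module described above follows. It keeps the 1 × 1 reduction, the three dilated depthwise branches, the downsampled deep branch and the final 1 × 1 projection; the deep-branch outputs are resized back to the input resolution by bilinear interpolation rather than by the fixed 2-fold and 10-fold upsampling of the original design, so those factors, and the pooled output size, are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_conv(ch, dilation=1, stride=1, padding=None):
    # 3x3 depthwise convolution (groups == channels) + BN + ReLU6
    if padding is None:
        padding = dilation
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=stride, padding=padding,
                  dilation=dilation, groups=ch, bias=False),
        nn.BatchNorm2d(ch), nn.ReLU6(inplace=True))

def pw_conv(cin, cout):
    # 1x1 pointwise convolution + BN + ReLU6
    return nn.Sequential(nn.Conv2d(cin, cout, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU6(inplace=True))

class EASPP(nn.Module):
    def __init__(self, in_ch, mid_ch=192, out_ch=96):
        super().__init__()
        self.reduce = pw_conv(in_ch, mid_ch)
        # three parallel shallow branches, dilation rates 2, 4, 8
        self.shallow = nn.ModuleList([dw_conv(mid_ch, d) for d in (2, 4, 8)])
        # deep branch: stride-2 depthwise conv, 1x1 projection, then two
        # dilated branches and a pooled branch
        self.down = dw_conv(mid_ch, stride=2, padding=1)
        self.deep_proj = pw_conv(mid_ch, mid_ch)
        self.deep_d2 = dw_conv(mid_ch, 2)
        self.deep_d4 = dw_conv(mid_ch, 4)
        self.pool = nn.AdaptiveMaxPool2d(1)
        # 3 shallow + 2 dilated deep + 1 pooled deep branch -> concatenation
        self.project = pw_conv(mid_ch * 6, out_ch)

    def forward(self, x):
        x = self.reduce(x)
        size = x.shape[-2:]
        outs = [branch(x) for branch in self.shallow]
        d = self.deep_proj(self.down(x))
        for feat in (self.deep_d2(d), self.deep_d4(d), self.pool(d)):
            outs.append(F.interpolate(feat, size=size, mode='bilinear',
                                      align_corners=False))
        return self.project(torch.cat(outs, dim=1))   # -> feature O1

Under the split assumed in the earlier encoder sketch, feature O_2 would have 96 channels, so the module would be instantiated as EASPP(96).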
TABLE 1. MobileNetV2 backbone network partitioning (the table is provided as an image in the original publication)
For the feature fusion layer, the invention uses a dense connection structure to fuse multi-level features. The fused upsampling module (FU) in the dense connection structure is shown in FIG. 3. The feature fusion layer first concatenates feature O_1 and feature O_2 and inputs the result into the fused upsampling module FU3 to obtain feature F_1. FU3 consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer with a stride of 1 and padding of 1, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 64 filters and a batch normalization layer. Next, the result of upsampling feature O_1 by a factor of 2 is concatenated with feature F_1 and feature O_3, and the result is input into the fused upsampling module FU2 to obtain feature F_2. FU2 consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer with a stride of 1 and padding of 1, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 32 filters and a batch normalization layer. Finally, the result of upsampling feature O_1 by a factor of 4, the result of upsampling feature F_1 by a factor of 2, feature F_2 and feature O_4 are concatenated and fed into a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function to obtain feature F_3.
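A sketch of the fused upsampling module (FU) matching the description above; the input channel counts of FU3 and FU2 depend on the channels of the concatenated features and are therefore left open here:

import torch.nn as nn

class FusedUpsample(nn.Module):
    # FU: 1x1 conv + BN + ReLU6, 2x upsampling, 3x3 depthwise conv + BN +
    # ReLU6, then 1x1 conv + BN (no activation after the last convolution).
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.block(x)

# FU3 and FU2 as described in the text; in_ch is a placeholder that depends
# on the concatenated inputs:
# fu3 = FusedUpsample(in_ch, mid_ch=192, out_ch=64)
# fu2 = FusedUpsample(in_ch, mid_ch=96,  out_ch=32)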
The multi-task output layer comprises a foreground-background prediction branch, a semantic segmentation prediction branch and a boundary prediction branch, which take feature F_3 as input and output the foreground-background prediction map, the semantic segmentation prediction map and the boundary prediction map. The foreground-background prediction branch consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 94 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 1 filter, and a 4-fold upsampling and foreground-background output layer, and outputs the foreground-background prediction image. The semantic segmentation prediction branch consists of 2-fold upsampling, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 9 filters, and a 2-fold upsampling and semantic classification output layer. The output of the 3 × 3 convolutional layer of the foreground-background prediction branch is passed through a Sigmoid activation function and multiplied with feature F_3, and the product is used as the input of the semantic segmentation prediction branch, which outputs the semantic segmentation prediction map. The boundary prediction branch first upsamples feature F_3 by a factor of 2 and concatenates the result with the activated output of the 1 × 1 convolutional layer of the semantic segmentation prediction branch, and the concatenated feature is used as its input. The branch consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 1 filter, and a 2-fold upsampling and boundary output layer, and outputs the boundary prediction map.
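The sketch below mirrors the three output branches just described, including the Sigmoid gating of feature F_3 by the foreground-background response and the concatenation that feeds the boundary branch. The channel counts (94 and 376 filters, 9 semantic classes, a 376-channel F_3) follow the text; returning raw logits instead of dedicated output layers is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu6(cin, cout, k):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU6(inplace=True))

def up(x, scale):
    return F.interpolate(x, scale_factor=scale, mode='bilinear',
                         align_corners=False)

class MultiTaskHead(nn.Module):
    def __init__(self, in_ch=376, num_classes=9):
        super().__init__()
        # foreground-background branch
        self.fb_reduce = conv_bn_relu6(in_ch, 94, 1)
        self.fb_out = nn.Conv2d(94, 1, 3, padding=1)
        # semantic segmentation branch
        self.seg_reduce = conv_bn_relu6(in_ch, 376, 1)
        self.seg_out = nn.Conv2d(376, num_classes, 3, padding=1)
        # boundary branch (upsampled F3 concatenated with the segmentation feature)
        self.bd_reduce = conv_bn_relu6(in_ch + 376, 376, 1)
        self.bd_out = nn.Conv2d(376, 1, 3, padding=1)

    def forward(self, f3):
        fb_logit = self.fb_out(self.fb_reduce(f3))
        fb_pred = up(fb_logit, 4)                      # foreground-background map
        # segmentation input: F3 gated by the foreground-background response
        seg_mid = self.seg_reduce(up(f3 * torch.sigmoid(fb_logit), 2))
        seg_pred = up(self.seg_out(seg_mid), 2)        # semantic segmentation map
        # boundary input: 2x-upsampled F3 concatenated with the segmentation feature
        bd_in = torch.cat([up(f3, 2), seg_mid], dim=1)
        bd_pred = up(self.bd_out(self.bd_reduce(bd_in)), 2)
        return fb_pred, seg_pred, bd_pred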
Step S103: each original road scene image in the training set is input, as an original input image, into the convolutional neural network for training, so as to obtain the foreground-background prediction image, semantic segmentation prediction image and boundary prediction image corresponding to each original road scene image in the training set, denoted {P_q^fb(i,j)}, {P_q^seg(i,j)} and {P_q^bd(i,j)}, respectively.
Step S104: the loss function values between the prediction images and the corresponding real images of each original road scene image in the training set are calculated and denoted Loss_1, Loss_2 and Loss_3, where Loss_1 and Loss_3 are binary cross-entropy loss functions and Loss_2 is a multi-class cross-entropy loss function.
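A minimal sketch of the multi-task loss assumed from this description: binary cross-entropy for the two binary tasks and multi-class cross-entropy for segmentation, with an unweighted sum because the text does not state how the three terms are combined:

import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # expects logits and float targets of the same shape
ce = nn.CrossEntropyLoss()     # expects (N, C, H, W) logits and (N, H, W) long targets

def multi_task_loss(fb_pred, seg_pred, bd_pred, fb_gt, seg_gt, bd_gt):
    loss1 = bce(fb_pred, fb_gt)    # foreground-background, 2-class
    loss2 = ce(seg_pred, seg_gt)   # semantic segmentation, multi-class
    loss3 = bce(bd_pred, bd_gt)    # boundary, 2-class
    return loss1 + loss2 + loss3   # equal weighting assumed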
Step S105: steps S103 and S104 are repeatedly executed V times, and the model is trained using the Adam optimization method; the weight vector and bias term corresponding to the last training result are taken as the final weight vector and final bias term of the convolutional neural network classification training model, denoted W_best and b_best, respectively, where V > 1, and in this embodiment V = 300.
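A training-loop sketch for steps S103 to S105, assuming a model with the two-input interface sketched earlier, the multi_task_loss helper above, and a data loader over the MFNet training split; the Adam hyperparameters are left at the PyTorch defaults because the text does not specify them:

import torch

def train(model, train_loader, device='cuda', epochs=300):   # V = 300 epochs
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):
        for rgb, thermal, fb_gt, seg_gt, bd_gt in train_loader:
            rgb, thermal = rgb.to(device), thermal.to(device)
            fb_gt, seg_gt, bd_gt = (fb_gt.to(device), seg_gt.to(device),
                                    bd_gt.to(device))
            fb_pred, seg_pred, bd_pred = model(rgb, thermal)
            loss = multi_task_loss(fb_pred, seg_pred, bd_pred,
                                   fb_gt, seg_gt, bd_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # the parameters after the final epoch serve as W_best / b_best
    torch.save(model.state_dict(), 'model_best.pth')          # file name assumed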
The specific steps of the test phase include:
step S201: order to
Figure BDA0002887214070000104
Representing a road scene color image and a thermal image to be semantically segmented; wherein, i' is more than or equal to 1 and less than or equal to W', 1. ltoreq. j '. ltoreq.H ', W ' denotes the width of the image, H ' denotes the height of the image,
Figure BDA0002887214070000105
respectively represent
Figure BDA0002887214070000106
And the middle coordinate position is the pixel value of the pixel point of (i, j).
Step S202: inputting the color image and the thermal image into a convolutional neural network classification training model and utilizing WbestAnd bbestPredicting, and recording the corresponding prediction semantic segmentation image as
Figure BDA0002887214070000107
Wherein the content of the first and second substances,
Figure BDA0002887214070000108
to represent
Figure BDA0002887214070000109
And the pixel value of the pixel point with the middle coordinate position of (i ', j').
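An inference sketch for the test phase (steps S201 and S202); the saved-weights file name follows the training sketch above and is an assumption:

import torch

def predict_segmentation(model, rgb, thermal, weights_path='model_best.pth'):
    # Load W_best / b_best and predict the semantic segmentation image for
    # one color/thermal pair (each given as a 3D tensor C x H' x W').
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))
    model.eval()
    with torch.no_grad():
        _, seg_pred, _ = model(rgb.unsqueeze(0), thermal.unsqueeze(0))
    return seg_pred.argmax(dim=1)[0]   # per-pixel class labels, H' x W'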
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed:
and (3) building a convolutional neural network architecture by using a python-based deep learning library pytorch. The road scene image database test set reported in the MFNet is adopted to analyze how the segmentation effect of the road scene image (393 road scene images) predicted by the method is. Here, the segmentation performance of the predicted semantic segmentation image is evaluated by using 2 common objective parameters of the evaluated semantic segmentation method as evaluation indexes, namely, a Class average accuracy (Class accuracy) and a ratio of Intersection and Union of the segmentation image and the label image (Mean Intersection over Union, mlou). The number of predicted images per second (FPS) was used to evaluate the speed of the model.
Each road scene image in the test set is predicted using the method of the invention to obtain the corresponding predicted semantic segmentation image. The class average accuracy CA, the mean intersection over union mIoU between the segmentation images and the label images, and the number of predicted images per second FPS, which reflect the semantic segmentation performance of the method, are listed in Table 2. As can be seen from the data listed in Table 2, the method of the present invention achieves good segmentation results and a fast prediction speed on the road scene images, which indicates that obtaining the predicted semantic segmentation images corresponding to the road scene images by the method of the present invention is feasible and effective.
Table 2. Evaluation results on the test set using the method of the present invention
CA 67.7%
mIoU 54.8%
FPS 54.06
FIGS. 4a and 4b show the 1st original road scene color image and thermal image of the same scene, and FIGS. 4c, 4d and 4e show the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image obtained by predicting these original road scene images with the method of the present invention; FIGS. 5a and 5b show the 2nd original road scene color image and thermal image of the same scene, and FIGS. 5c, 5d and 5e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image; FIGS. 6a and 6b show the 3rd original road scene color image and thermal image of the same scene, and FIGS. 6c, 6d and 6e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image; FIGS. 7a and 7b show the 4th original road scene color image and thermal image of the same scene, and FIGS. 7c, 7d and 7e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image. The segmentation precision of the predicted semantic segmentation images obtained by the method is high.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A multitask supervision-based unmanned real-time road scene semantic segmentation method is characterized by comprising the following specific steps:
selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features;
inputting the color images and thermal images of the original road scene images in the training set as original input images into the convolutional neural network for training to obtain the corresponding foreground-background prediction images, semantic segmentation prediction images and boundary prediction images;
calculating the loss function values between the foreground-background prediction images, semantic segmentation prediction images and boundary prediction images obtained by training and the corresponding real foreground-background images, real semantic segmentation images and real boundary images;
and repeating the training and the calculation of the loss function values, and taking the last training result as the final weight vector and the final bias term.
2. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the Q original road scene images are selected from images in a road scene image database reported in MFNet.
3. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multi-task output layer;
the input layer comprises a color image input layer and a thermal image input layer, which receive the color image and the thermal image, respectively;
the feature extraction layer performs layer-by-layer feature extraction on the color image and the thermal image and extracts the deep semantic features of the images;
the feature fusion layer fuses multi-level features using a dense connection structure;
and the multi-task output layer outputs the foreground-background prediction image, the semantic segmentation prediction image and the boundary prediction image.
4. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the MobileNetV2 lightweight network removes the last two inverted residual structures and the classification layer, and the remaining part is divided into 3 blocks, wherein the color image input branch corresponds to R_Block_i, i = 1, 2, 3, and the thermal image input branch corresponds to T_Block_i, i = 1, 2, 3.
5. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 4, wherein the output result of each module of the thermal image input branch and the output result of the corresponding module of the color image input branch are fused by element-wise addition of the corresponding features.
6. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the dense connection structure comprises a fused upsampling module, and the fused upsampling module comprises a 1 × 1 convolutional layer, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer and a batch normalization layer.
CN202110017471.3A 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision Withdrawn CN112699889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110017471.3A CN112699889A (en) 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision


Publications (1)

Publication Number Publication Date
CN112699889A true CN112699889A (en) 2021-04-23

Family

ID=75515032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110017471.3A Withdrawn CN112699889A (en) 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Country Status (1)

Country Link
CN (1) CN112699889A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408462A (en) * 2021-06-29 2021-09-17 西南交通大学 Landslide remote sensing information extraction method based on convolutional neural network and classification thermodynamic diagram
CN113408462B (en) * 2021-06-29 2023-05-02 西南交通大学 Landslide remote sensing information extraction method based on convolutional neural network and class thermodynamic diagram
CN113420848A (en) * 2021-08-24 2021-09-21 深圳市信润富联数字科技有限公司 Neural network model training method and device and gesture recognition method and device
CN115410189A (en) * 2022-10-31 2022-11-29 松立控股集团股份有限公司 Complex scene license plate detection method
CN115410189B (en) * 2022-10-31 2023-01-24 松立控股集团股份有限公司 Complex scene license plate detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210423