CN114943835A - Real-time semantic segmentation method for aerial images of ice slush unmanned aerial vehicle in yellow river - Google Patents
- Publication number
- CN114943835A (application number CN202210415977.4A)
- Authority
- CN
- China
- Prior art keywords
- module
- feature map
- output
- channels
- convolution
- Legal status
- Granted (the status listed is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V20/17—Terrestrial scenes taken from planes or by drones
Abstract
The invention discloses a real-time semantic segmentation method for aerial images of Yellow River ice captured by an unmanned aerial vehicle (UAV). A Yellow River ice semantic segmentation data set is constructed from the collected UAV aerial ice images, comprising the images and their label data; the segmentation network FastICENet is then trained on this data set to obtain the final semantic segmentation model. Even when the ice floes in an image vary in size and shape, the detection results remain accurate; at accuracy comparable to other semantic segmentation networks, the segmentation speed of the proposed network far exceeds theirs.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a real-time semantic segmentation method for an aerial image.
Background
Semantic segmentation is an important field of computer vision: it classifies an image at the pixel level, i.e. it marks the object class to which each pixel belongs, with the goal of predicting a class label for every pixel in the image. River ice condition monitoring is of great significance for river management and the shipping industry, and accurate ice segmentation is one of the most important techniques in ice condition monitoring research. Lightweight semantic segmentation is especially important here: the input image must be analyzed quickly so that auxiliary systems can interact with the external environment in time. Specifically, an input Yellow River ice image must be segmented rapidly and accurately so that the ice condition in the river can be monitored in real time and early warnings issued promptly. It is therefore necessary to design an accurate, real-time, lightweight semantic segmentation network.
Early ice semantic segmentation methods mainly addressed the poor accuracy of existing ice detection methods. For example, one approach constructs a segmentation network comprising a shallow branch and a deep branch, adds a channel attention module to the deep branch and a position attention module to the shallow branch, and uses a fusion module to fuse the two branches. The training data are fed into the network in batches, and the network is trained with a cross-entropy loss and the RMSprop optimizer; finally, a test image is input and evaluated with the trained model. Such a method can selectively perform multi-level, multi-scale feature fusion, captures context information through the attention mechanism, obtains higher-resolution feature maps, and achieves better segmentation results. However, its segmentation speed is slow: the network cannot run in real time on low-power devices, and it is difficult to meet the practical deployment requirements of Yellow River ice segmentation.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention provides a real-time semantic segmentation method for UAV aerial images of Yellow River ice. A Yellow River ice semantic segmentation data set is constructed from the collected UAV aerial ice images, comprising the images and their label data; the segmentation network FastICENet is then trained on this data set to obtain the final semantic segmentation model. Even when the ice floes in an image vary in size and shape, the detection results remain accurate; at accuracy comparable to other semantic segmentation networks, the segmentation speed of the proposed network far exceeds theirs.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: constructing a yellow river ice semantic segmentation data set according to the collected unmanned aerial vehicle aerial ice image, wherein the data set comprises the yellow river unmanned aerial vehicle aerial ice image and label data; dividing a data set into a training set, a verification set and a test set;
step 2: constructing a semantic segmentation model FastICENet;
the semantic segmentation model FastICENet comprises a shallow detail branch, a deep semantic branch and a fusion upsampling module; the shallow detail branch is used for extracting low-level detail information of the slush image, the deep semantic branch is used for extracting deep semantic information of the slush image, and finally the deep semantic branch and the shallow detail branch are fused and sampled by the fusion upsampling module to obtain a semantic segmentation result with the same size as the original image;
step 2-1: the shallow detail branch is specifically as follows: an input image of size h×w, where h and w are the image height and width, passes sequentially through the first, second and third convolution modules; after these three convolution modules the feature map resolution is h/8×w/8;
step 2-2: the deep semantic branch is specifically as follows:
step 2-2-1: an input image of size h×w passes sequentially through the first, second and third downsampling modules, yielding a feature map of resolution h/8×w/8;
step 2-2-2: the feature map obtained in step 2-2-1 is input into the first dense connection module based on the phantom feature map; the resolution of the output feature map remains h/8×w/8;
step 2-2-3: the feature map obtained in step 2-2-2 is input into the fourth downsampling module; the resolution of the output feature map is h/16×w/16;
step 2-2-4: the feature map obtained in step 2-2-3 is input into the second dense connection module based on the phantom feature map, and the output feature map is fed separately into the first attention refinement module and the mean pooling module; their outputs are stacked along the channel dimension, and the resulting feature map is the output of step 2-2-4;
step 2-2-5: the feature map obtained in step 2-2-4 is passed through the first upsampling module; the size of the output feature map is h/8×w/8;
step 2-2-6: the outputs of step 2-2-2 and step 2-2-5 are jointly input into the second attention module; the resolution of the output feature map is h/8×w/8;
step 2-3: the fusion upsampling module is specifically as follows: the outputs of the shallow detail branch and the deep semantic branch are jointly input into the feature fusion module, producing an output feature map of size h/8×w/8; the output of the feature fusion module is restored to the original size h×w by the second upsampling module, and the segmentation result is predicted;
step 3: training the semantic segmentation model FastICENet using the training set and the verification set to obtain the final semantic segmentation model, and testing its performance on the test set.
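The patent does not specify the loss function or optimizer used to train FastICENet (cross-entropy and RMSprop are mentioned only for an earlier method in the background). As a hedged illustration only, a minimal PyTorch epoch loop with pixel-wise cross-entropy over the three classes (ice, water, river bank) might look like:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one epoch; pixel-wise cross-entropy is an assumed choice of loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    running = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)            # (B, 3, H, W) class scores
        loss = criterion(logits, labels)  # labels: (B, H, W), dtype long
        loss.backward()
        optimizer.step()
        running += loss.item()
    return running / max(len(loader), 1)
```

The returned value is the average batch loss, which can be monitored on the verification set for model selection.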
Preferably, the step 1 specifically comprises:
step 1-1: collecting UAV aerial Yellow River ice images over multiple periods and regions;
step 1-2: the collected images are cropped to 1600×640, and each image is manually labeled pixel by pixel with three classes: ice, water and river bank;
step 1-3: the Yellow River ice images and their class labels obtained in step 1-2 are divided into a training set, a verification set and a test set in a 3:1:1 ratio.
Preferably, the convolution kernel size of the first convolution module is 7×7 with stride 2 and padding 3, followed by batch normalization and ReLU; the second and third convolution modules use 3×3 kernels with stride 2 and padding 1, each likewise followed by batch normalization and ReLU.
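A minimal PyTorch sketch of the shallow detail branch as described (one 7×7 stride-2 module followed by two 3×3 stride-2 modules, each with batch normalization and ReLU). The channel width of 64 is an assumption, since the patent does not state the branch's channel counts:

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride, pad):
    """Convolution module: conv -> batch normalization -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=pad),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ShallowDetailBranch(nn.Module):
    """Three stride-2 convolution modules: overall 8x spatial reduction."""
    def __init__(self, in_ch=3, ch=64):  # ch=64 is an illustrative width
        super().__init__()
        self.block1 = conv_bn_relu(in_ch, ch, 7, 2, 3)  # h/2 x w/2
        self.block2 = conv_bn_relu(ch, ch, 3, 2, 1)     # h/4 x w/4
        self.block3 = conv_bn_relu(ch, ch, 3, 2, 1)     # h/8 x w/8
    def forward(self, x):
        return self.block3(self.block2(self.block1(x)))
```

With an h×w input, the output resolution is h/8×w/8, matching step 2-1.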
Preferably, the first down-sampling module, the second down-sampling module, the third down-sampling module and the fourth down-sampling module all adopt the following structures:
the number of input channels, the number of output channels and the number of convolution-layer output channels of the feature map in a downsampling module are denoted Win, Wout and Wconv respectively;
in a downsampling module, when Wout > Win, the input feature map first passes in parallel through a convolution layer with 3×3 kernels and a 2×2 max pooling layer, both with stride 2; the convolution layer outputs Wconv = Wout − Win channels and the max pooling layer outputs Win channels; the two outputs are then stacked along the channel dimension, batch normalized, and activated by ReLU, realizing 2× downsampling of the feature map;
when Wout ≤ Win, the input feature map passes only through a convolution layer with 3×3 kernels and stride 2, followed by batch normalization and ReLU activation, realizing 2× downsampling by convolution alone;
the first downsampling module has 3 input channels and 15 output channels; the second has 15 input channels and 30 output channels; the third has 30 input channels and 60 output channels; the fourth has 160 input channels and 160 output channels.
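The two cases of the downsampling module map directly onto a short PyTorch sketch; `DownsampleModule` is an illustrative name, and the behavior follows the description above (parallel conv/max-pool branches when Wout > Win, a single strided convolution otherwise):

```python
import torch
import torch.nn as nn

class DownsampleModule(nn.Module):
    """2x downsampling: conv + max-pool branches concatenated when the
    output width exceeds the input width, else one strided convolution."""
    def __init__(self, w_in, w_out):
        super().__init__()
        self.expand = w_out > w_in
        conv_out = w_out - w_in if self.expand else w_out
        self.conv = nn.Conv2d(w_in, conv_out, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2) if self.expand else None
        self.bn = nn.BatchNorm2d(w_out)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        y = self.conv(x)
        if self.expand:
            # channels: (w_out - w_in) from conv + w_in from pooling = w_out
            y = torch.cat([y, self.pool(x)], dim=1)
        return self.act(self.bn(y))
```

With `DownsampleModule(3, 15)` the module takes the expanding path; with `DownsampleModule(160, 160)` (the fourth module) it reduces spatial size by a strided convolution alone.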
Preferably, the first and second dense connection modules based on the phantom feature map share the same structure, defined as follows:
Define the phantom module: a single convolution generates m original feature maps Y′ ∈ R^(h′×w′×m):
Y′ = X ∗ f′
where Y′ is the output of the convolution layer, X is its input, f′ ∈ R^(c×k×k×m) is the convolution kernel, and m ≤ n, where n is the number of feature maps actually required at this point in the network model;
a series of linear operations is then applied to each original feature map in Y′ to generate s phantom feature maps:
y_ij = Φ_i,j(y′_i), i = 1, …, m, j = 1, …, s
where y′_i is the i-th original feature map in Y′ and Φ_i,j is the j-th linear operation, generating the phantom feature map y_ij;
the linear operations thus yield n = m·s feature maps Y = [y_11, y_12, …, y_ij, …, y_ms] as the output data of the phantom module; finally, the original feature maps and the phantom feature maps are stacked along the channel dimension, and the stacked result is the output of the phantom module;
a dense connection is used over the successive phantom modules: the input to each phantom module is the channel-wise stack of the dense connection module's original input feature map and the output feature maps of all preceding phantom modules;
the first dense connection module based on the phantom feature map has 60 input channels and 160 output channels and uses 5 densely connected phantom modules;
the second dense connection module based on the phantom feature map has 160 input channels and 320 output channels and uses 8 densely connected phantom modules;
each of the 13 phantom modules adds 10 channels through its convolution layer and 10 channels through the linear operations, so the output of each phantom module has 20 more channels than its input.
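A hedged PyTorch sketch of a phantom (ghost) module and the densely connected block built from it, matching the channel arithmetic above (each module contributes 10 convolution channels plus 10 cheap channels, so 60 + 5·20 = 160 and 160 + 8·20 = 320). Using a 3×3 depthwise convolution as the cheap linear operation follows the detailed description's choice of depthwise convolution; its kernel size is an assumption:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """m primary maps from a 1x1 conv, m phantom maps from a depthwise
    convolution (the cheap linear operation), concatenated: 2m channels."""
    def __init__(self, in_ch, m=10):
        super().__init__()
        self.primary = nn.Conv2d(in_ch, m, 1)
        self.cheap = nn.Conv2d(m, m, 3, padding=1, groups=m)  # depthwise
    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GhostDenseBlock(nn.Module):
    """Dense connectivity: each phantom module sees the block input plus
    all earlier phantom outputs; each module adds 20 channels."""
    def __init__(self, in_ch, n_modules):
        super().__init__()
        self.mods = nn.ModuleList(
            [GhostModule(in_ch + 20 * i) for i in range(n_modules)])
    def forward(self, x):
        feats = [x]
        for mod in self.mods:
            feats.append(mod(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```

`GhostDenseBlock(60, 5)` and `GhostDenseBlock(160, 8)` reproduce the two modules' stated channel counts.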
Preferably, the first attention module and the second attention module are implemented as follows: the input feature map is passed sequentially through global average pooling, a 1×1 convolution and batch normalization; a sigmoid then yields a channel attention vector, which is multiplied element-wise with the input feature map, and the product is added to the input feature map to obtain the channel-weighted feature map.
Preferably, the first and second upsampling modules share the same structure, implemented as follows: assume the input feature map has size h1×w1×C, where h1 and w1 are its height and width and C is its number of channels; the input feature map is passed through a convolution layer of N 1×1 kernels, producing a new feature map of size h1×w1×N; the new feature map is then reshaped into an output feature map of size rh1×rw1×(N/r²), where r is the upsampling factor. The first upsampling module uses an upsampling factor of 2, and the second upsampling module uses an upsampling factor of 8.
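The conv-then-reshape upsampler described above is, under the assumption that the reshape distributes N = r²·C′ channels over an r-times larger spatial grid, equivalent to sub-pixel (pixel-shuffle) upsampling; a sketch:

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """1x1 conv to out_ch * r * r channels, then pixel-shuffle reshape
    to r-times spatial resolution (conv-then-reshape upsampling)."""
    def __init__(self, in_ch, out_ch, r):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, 1)
        self.shuffle = nn.PixelShuffle(r)  # (B, C*r*r, H, W) -> (B, C, rH, rW)
    def forward(self, x):
        return self.shuffle(self.conv(x))
```

With r = 2 this matches the first upsampling module; with r = 8 it restores the h/8×w/8 fused map to full resolution, as the second module does.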
Preferably, the structure of the feature fusion module is as follows: first, the feature maps output by the shallow detail branch and the deep semantic branch are stacked along the channel dimension and passed through a 1×1 convolution with stride 1, followed by batch normalization and a ReLU activation; second, the output of the first step is globally pooled, passed through a 1×1 convolution with stride 1 and a ReLU activation, then through another 1×1 convolution with stride 1 and a sigmoid activation, and the result is multiplied element-wise with the output of the first step; third, the product from the second step is added to the output of the first step, and the sum is the output of the feature fusion module.
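The three steps of the feature fusion module can be sketched as follows; the channel counts in the usage example are illustrative, since the branch widths are not fixed by this paragraph:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concat + 1x1 conv fusion, then squeeze-excitation style channel
    reweighting with a residual addition, per the three steps above."""
    def __init__(self, ch_detail, ch_semantic, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(                      # step one
            nn.Conv2d(ch_detail + ch_semantic, out_ch, 1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(                      # step two
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )
    def forward(self, detail, semantic):
        fused = self.fuse(torch.cat([detail, semantic], dim=1))
        return fused * self.attn(fused) + fused         # step three
```

Both branch outputs must share the h/8×w/8 spatial size before fusion, as guaranteed by steps 2-1 and 2-2.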
The invention has the following beneficial effects:
1) the invention provides a double-branch lightweight semantic segmentation network which is used for real-time semantic segmentation of the ice of the yellow river;
2) even when the ice floes in the image to be segmented vary in size and shape, the detection results remain accurate;
3) at accuracy comparable to other semantic segmentation networks, the segmentation speed of the proposed network far exceeds theirs.
Drawings
FIG. 1 is a diagram of a semantic segmentation model architecture of the present invention.
Fig. 2 is a block diagram of a downsampling module of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention aims to provide a real-time semantic segmentation method for Yellow River ice that improves the accuracy of the segmentation model through a dual-branch structure and overcomes slow segmentation speed by adopting lightweight modules.
A real-time semantic segmentation method for UAV aerial images of Yellow River ice comprises the following steps:
step 1: constructing a yellow river ice semantic segmentation data set according to the collected unmanned aerial vehicle aerial ice image, wherein the data set comprises the yellow river unmanned aerial vehicle aerial ice image and label data;
step 1-1: collecting UAV aerial Yellow River ice images over multiple periods and regions, and selecting clear, well-illuminated images from them;
step 1-2: the collected images are cropped to 1600×640, and each image is manually labeled pixel by pixel with three classes: ice, water and river bank;
step 1-3: the Yellow River ice images and their class labels obtained in step 1-2 are divided into a training set, a validation set and a test set in a 3:1:1 ratio.
Step 2: constructing a semantic segmentation model FastICENet;
the semantic segmentation model FastICENet comprises a shallow detail branch, a deep semantic branch and a fusion upsampling module; the shallow detail branch is used for extracting low-level detail information and texture information of the slush image, the deep semantic branch is used for extracting deep semantic information of the slush image, and finally the deep semantic branch and the shallow detail branch are fused and sampled by a fusion upsampling module to obtain a semantic segmentation result with the same size as the original image;
step 2-1: the shallow detail branch is specifically as follows: an input image of size h×w, where h and w are the image height and width, passes sequentially through the first, second and third convolution modules; after these three convolution modules the feature map resolution is h/8×w/8;
step 2-2: the deep semantic branch is specifically as follows:
step 2-2-1: an input image of size h×w passes sequentially through the first, second and third downsampling modules, yielding a feature map of resolution h/8×w/8;
step 2-2-2: the feature map obtained in step 2-2-1 is input into the first dense connection module based on the phantom feature map; the resolution of the output feature map remains h/8×w/8;
step 2-2-3: the feature map obtained in step 2-2-2 is input into the fourth downsampling module; the resolution of the output feature map is h/16×w/16;
step 2-2-4: the feature map obtained in step 2-2-3 is input into the second dense connection module based on the phantom feature map, and the output feature map is fed separately into the first attention refinement module and the mean pooling module; their outputs are stacked along the channel dimension, and the resulting feature map is the output of step 2-2-4;
step 2-2-5: the feature map obtained in step 2-2-4 is passed through the first upsampling module; the size of the output feature map is h/8×w/8;
step 2-2-6: the outputs of step 2-2-2 and step 2-2-5 are jointly input into the second attention module; the resolution of the output feature map is h/8×w/8;
step 2-3: the fusion upsampling module is specifically as follows: the outputs of the shallow detail branch and the deep semantic branch are jointly input into the feature fusion module, producing an output feature map of size h/8×w/8; the output of the feature fusion module is restored to the original size h×w by the second upsampling module, and the segmentation result is predicted;
step 3: training the semantic segmentation model FastICENet using the training set and the verification set to obtain the final semantic segmentation model, and testing its performance on the test set.
Preferably, the convolution kernel size of the first convolution module is 7×7 with stride 2 and padding 3, followed by batch normalization and ReLU; the second and third convolution modules use 3×3 kernels with stride 2 and padding 1, each likewise followed by batch normalization and ReLU.
Preferably, the first down-sampling module, the second down-sampling module, the third down-sampling module and the fourth down-sampling module all adopt the following structures:
the number of input channels, the number of output channels and the number of convolution-layer output channels of the feature map in a downsampling module are denoted Win, Wout and Wconv respectively;
in a downsampling module, when Wout > Win, the input feature map first passes in parallel through a convolution layer with 3×3 kernels and a 2×2 max pooling layer, both with stride 2; the convolution layer outputs Wconv = Wout − Win channels and the max pooling layer outputs Win channels; the two outputs are then stacked along the channel dimension, batch normalized, and activated by ReLU, realizing 2× downsampling of the feature map;
when Wout ≤ Win, the input feature map passes only through a convolution layer with 3×3 kernels and stride 2, followed by batch normalization and ReLU activation, realizing 2× downsampling by convolution alone;
the first downsampling module has 3 input channels and 15 output channels; the second has 15 input channels and 30 output channels; the third has 30 input channels and 60 output channels; the fourth has 160 input channels and 160 output channels.
Preferably, the first and second dense connection modules based on the phantom feature map share the same structure, defined as follows:
Define the phantom module: a single convolution generates m original feature maps Y′ ∈ R^(h′×w′×m):
Y′ = X ∗ f′
where Y′ is the output of the convolution layer, X is its input, f′ ∈ R^(c×k×k×m) is the convolution kernel, and m ≤ n, where n is the number of feature maps actually required in the network model; the bias term is omitted for simplicity. The hyperparameters, i.e. kernel size, stride and padding, are the same as in an ordinary convolution, so that the spatial size (h′ and w′) of the output feature map is preserved.
A series of linear operations is then applied to each original feature map in Y′ to generate s phantom feature maps:
y_ij = Φ_i,j(y′_i), i = 1, …, m, j = 1, …, s
where y′_i is the i-th original feature map in Y′ and Φ_i,j is the j-th linear operation, generating the phantom feature map y_ij;
the linear operations yield n = m·s feature maps Y = [y_11, y_12, …, y_ij, …, y_ms] as the output data of the phantom module. In the invention, the convolution layer uses 1×1 kernels, and the linear operation Φ is a depthwise convolution applied to the original feature maps to generate the phantom feature maps; finally, the original feature maps and the phantom feature maps are stacked along the channel dimension, and the stacked result is the output of the phantom module;
a dense connection is used over the successive phantom modules: the input to each phantom module is the channel-wise stack of the dense connection module's original input feature map and the output feature maps of all preceding phantom modules;
the first dense connection module based on the phantom feature map has 60 input channels and 160 output channels and uses 5 densely connected phantom modules;
the second dense connection module based on the phantom feature map has 160 input channels and 320 output channels and uses 8 densely connected phantom modules;
each of the 13 phantom modules adds 10 channels through its convolution layer and 10 channels through the linear operations, so the output of each phantom module has 20 more channels than its input.
Preferably, the first attention module and the second attention module are implemented as follows: the input feature map is passed sequentially through global average pooling, a 1×1 convolution and batch normalization; a sigmoid then yields a channel attention vector, which is multiplied element-wise with the input feature map, and the product is added to the input feature map to obtain the channel-weighted feature map.
Preferably, the first and second upsampling modules share the same structure, implemented as follows: assume the input feature map has size h1×w1×C, where h1 and w1 are its height and width and C is its number of channels; the input feature map is passed through a convolution layer of N 1×1 kernels, producing a new feature map of size h1×w1×N; the new feature map is then reshaped into an output feature map of size rh1×rw1×(N/r²), where r is the upsampling factor. The first upsampling module uses an upsampling factor of 2, and the second upsampling module uses an upsampling factor of 8.
Preferably, the structure of the feature fusion module is as follows: first, the feature maps output by the shallow detail branch and the deep semantic branch are stacked along the channel dimension and passed through a 1×1 convolution with stride 1, followed by batch normalization and a ReLU activation; second, the output of the first step is globally pooled, passed through a 1×1 convolution with stride 1 and a ReLU activation, then through another 1×1 convolution with stride 1 and a sigmoid activation, and the result is multiplied element-wise with the output of the first step; third, the product from the second step is added to the output of the first step, and the sum is the output of the feature fusion module.
The specific embodiment is as follows:
To verify the effectiveness of the method, it is compared with four existing deep learning methods; Table 1 shows the performance (accuracy and speed) of the method of the invention and of the other deep learning based methods.
TABLE 1 comparison of the method of the present invention with four other deep learning methods
As can be seen from Table 1, while the accuracy (mIoU) of the proposed method is comparable to that of the other four methods, its speed greatly exceeds theirs, reaching 94.840 FPS.
Claims (8)
1. A real-time semantic segmentation method for Yellow River ice unmanned aerial vehicle aerial images, characterized by comprising the following steps:
step 1: constructing a yellow river ice semantic segmentation data set according to the collected unmanned aerial vehicle aerial ice image, wherein the data set comprises the yellow river unmanned aerial vehicle aerial ice image and label data; dividing a data set into a training set, a verification set and a test set;
step 2: constructing a semantic segmentation model FastICENet;
the semantic segmentation model FastICENet comprises a shallow detail branch, a deep semantic branch and a fusion upsampling module; the shallow detail branch extracts low-level detail information from the ice image, the deep semantic branch extracts deep semantic information, and the fusion upsampling module finally fuses and upsamples the two branches to obtain a semantic segmentation result of the same size as the original image;
step 2-1: the shallow detail branch is specifically as follows: an input image of size h × w, where h and w are the height and width of the image respectively, passes sequentially through convolution module I, convolution module II and convolution module III; after the three convolution modules the resolution of the feature map is h/8 × w/8;
step 2-2: the deep semantic branch is specifically as follows:
step 2-2-1: an input image of size h × w passes sequentially through the first, second and third down-sampling modules; after the three down-sampling modules the resolution of the obtained feature map is h/8 × w/8;
step 2-2-2: inputting the feature map obtained in the step 2-2-1 into a dense connection module I based on the phantom feature map, wherein the resolution of the output feature map is still h/8 xw/8;
step 2-2-3: inputting the feature map obtained in the step 2-2-2 into a fourth down-sampling module, wherein the resolution of the output feature map is h/16 multiplied by w/16;
step 2-2-4: inputting the feature map obtained in step 2-2-3 into the second dense connection module based on the phantom feature map, and feeding its output feature map into the first attention module and an average pooling module respectively; the outputs of the first attention module and of the average pooling module are stacked along the channel dimension, and the resulting feature map is taken as the output of step 2-2-4;
step 2-2-5: passing the feature map obtained in step 2-2-4 through the first up-sampling module; the size of the output feature map is h/8 × w/8;
step 2-2-6: the outputs of the step 2-2-2 and the step 2-2-5 are jointly input into a second attention module, and the resolution of an output feature map is h/8 xw/8;
step 2-3: the fusion upsampling module is specifically as follows: the outputs of the shallow detail branch and the deep semantic branch are jointly input into the feature fusion module, and the size of the output feature map is h/8 × w/8; the output of the feature fusion module is restored to the original size h × w by the second up-sampling module, and the segmentation result is predicted;
step 3: training the semantic segmentation model FastICENet using the training set and the verification set to obtain the final semantic segmentation model, and testing the performance of the final model using the test set.
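As a quick sanity check, the feature-map resolutions stated in claim 1 can be traced with a short script (a sketch; the 1600 × 640 crop size is taken from claim 2):

```python
# Shape walkthrough of FastICENet as described in claim 1, tracking
# only spatial resolution (channel counts follow claims 4 and 5).
h, w = 640, 1600  # the crop size from claim 2

# Shallow detail branch: three stride-2 convolution modules -> 1/8
detail = (h // 8, w // 8)

# Deep semantic branch: three stride-2 down-sampling modules -> 1/8,
# a fourth down-sampling module -> 1/16, then x2 up-sampling -> 1/8
semantic = (h // 16 * 2, w // 16 * 2)

# Fusion happens at 1/8 resolution, then x8 up-sampling restores
# the full input size
assert detail == semantic == (h // 8, w // 8)
output = (detail[0] * 8, detail[1] * 8)
print(output)  # (640, 1600)
```

The two branches meet at the same 1/8 resolution by construction, which is what allows the feature fusion module to stack them channel-wise.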
2. The real-time semantic segmentation method for the aerial images of the yellow river ice unmanned aerial vehicle according to claim 1, wherein the step 1 specifically comprises:
step 1-1: collecting multi-period and multi-region aerial yellow river ice images of an unmanned aerial vehicle;
step 1-2: cropping the collected images to a size of 1600 × 640, and manually labeling each image pixel by pixel with three classes: ice, water and river bank;
step 1-3: dividing the Yellow River ice images and their classification labels obtained in step 1-2 into a training set, a validation set and a test set at a ratio of 3:1:1.
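Step 1-3 can be sketched as follows (a minimal sketch; the shuffling and fixed seed are assumptions not stated in the claim, and the function name is illustrative):

```python
import random

def split_dataset(samples, ratios=(3, 1, 1), seed=0):
    """Split a sequence of samples 3:1:1 into train/val/test."""
    items = list(samples)
    # Shuffle reproducibly so the split is random but repeatable
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(500))
print(len(train), len(val), len(test))  # 300 100 100
```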
3. The real-time semantic segmentation method for the aerial images of the Yellow River ice unmanned aerial vehicle according to claim 1, characterized in that the convolution kernel of convolution module I is of size 7 × 7 with stride 2 and padding 3, followed by the combination of batch normalization and ReLU; the convolution kernels of convolution module II and convolution module III are of size 3 × 3 with stride 2 and padding 1, likewise followed by the combination of batch normalization and ReLU.
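Each of these kernel/stride/padding choices halves the spatial size, which can be checked with the standard convolution output-size formula:

```python
def conv_out(size, kernel, stride, padding):
    # Standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

# Convolution module I: 7x7 kernel, stride 2, padding 3 -> halves the size
assert conv_out(1600, 7, 2, 3) == 800
# Convolution modules II and III: 3x3 kernel, stride 2, padding 1
assert conv_out(800, 3, 2, 1) == 400
assert conv_out(400, 3, 2, 1) == 200  # 1600 -> 1/8, as stated in step 2-1
print(conv_out(1600, 7, 2, 3))  # 800
```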
4. The real-time semantic segmentation method for the aerial image of the yellow river ice unmanned aerial vehicle as claimed in claim 1, wherein the down-sampling module I, the down-sampling module II, the down-sampling module III and the down-sampling module IV all adopt the following structures:
the number of input channels of the feature map, the number of output channels, and the number of output channels of the convolution layer in the down-sampling module are denoted Win, Wout and Wconv, respectively;
in the down-sampling module, when Wout is greater than Win, the input feature map first passes in parallel through a convolution layer with 3 × 3 kernels and a 2 × 2 max pooling layer, both with stride 2; the number of output channels of the convolution layer is Wconv = Wout − Win, and the number of output channels of the max pooling layer is Win; the outputs of the convolution layer and the max pooling layer are then channel-stacked, batch-normalized and ReLU-activated, achieving 2× down-sampling of the feature map;
in the down-sampling module, when Wout is less than Win, the input feature map passes only through a convolution layer with 3 × 3 kernels and stride 2, followed by batch normalization and ReLU activation, achieving 2× down-sampling by convolution alone;
the first down-sampling module has an input feature map with 3 channels and an output feature map with 15 channels; the second down-sampling module has 15 input channels and 30 output channels; the third down-sampling module has 30 input channels and 60 output channels; the fourth down-sampling module has 160 input channels and 160 output channels.
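The channel bookkeeping of this down-sampling module can be sketched as follows (a sketch of the counting logic only, not of the convolutions; the claim leaves the Wout = Win case unstated, and the convolution-only path is assumed here for the fourth module):

```python
def downsample_channels(w_in, w_out):
    """Channel counts for the down-sampling module described above."""
    if w_out > w_in:
        # Parallel 3x3 stride-2 conv (w_out - w_in channels) and
        # 2x2 stride-2 max pool (w_in channels), channel-stacked
        w_conv = w_out - w_in
        return {'conv': w_conv, 'pool': w_in, 'out': w_conv + w_in}
    # Otherwise a single 3x3 stride-2 convolution maps w_in -> w_out
    return {'conv': w_out, 'pool': 0, 'out': w_out}

# The four modules from the claim
for w_in, w_out in [(3, 15), (15, 30), (30, 60), (160, 160)]:
    assert downsample_channels(w_in, w_out)['out'] == w_out
print(downsample_channels(3, 15))  # {'conv': 12, 'pool': 3, 'out': 15}
```

Reusing the pooled input channels in the stacked output is what keeps the convolution branch small when the channel count grows, a common trick in real-time segmentation backbones.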
5. The real-time semantic segmentation method for aerial images of the yellow river ice slush unmanned aerial vehicle according to claim 1, wherein the first dense connection module based on the phantom feature diagram and the second dense connection module based on the phantom feature diagram have the same structure and are defined as follows:
defining a phantom module: m original feature maps Y′ ∈ R^(h′×w′×m) are first generated by one ordinary convolution:

Y′ = X ∗ f′

where Y′ is the feature map output by the convolution layer, X is the convolution input, ∗ is the convolution operation, and f′ ∈ R^(c×k×k×m) is the convolution kernel; m ≤ n, where n is the number of feature maps actually required in the network model;
a series of cheap linear operations is then applied to each original feature map in Y′ to generate s phantom feature maps per original map:

y_ij = Φ_(i,j)(y′_i),  i = 1, …, m,  j = 1, …, s

where y′_i is the i-th original feature map in Y′ and Φ_(i,j) is the j-th linear operation, which generates the phantom feature map y_ij; the linear operations thus yield n = m · s feature maps Y = [y_11, y_12, …, y_ms] as the output data of the phantom module; finally the original feature maps and the phantom feature maps are channel-stacked, and the stacked result is taken as the output of the phantom module;
a plurality of phantom modules are connected in a dense manner, i.e. the input of each phantom module is the channel-stack of the input feature map of the dense connection module and the output feature maps of all preceding phantom modules;
the first dense connection module based on the phantom feature map has an input feature map with 60 channels and an output feature map with 160 channels, and uses 5 densely connected phantom modules;
the second dense connection module based on the phantom feature map has an input feature map with 160 channels and an output feature map with 320 channels, and uses 8 densely connected phantom modules;
the 13 phantom modules are added with 10 channels through the convolution layer and 10 channels through linear operation, so that the number of output channels of each phantom module is increased by 20 channels relative to the input channels of the phantom module.
6. The real-time semantic segmentation method for the aerial images of the Yellow River ice unmanned aerial vehicle according to claim 1, wherein the first attention module and the second attention module are implemented as follows: the input feature map is passed sequentially through global average pooling, a 1 × 1 convolution and batch normalization, and finally through a sigmoid to obtain a channel attention vector; the channel attention vector is then multiplied element-wise with the input feature map, and the product is added to the input feature map to obtain a channel-weighted feature map.
7. The real-time semantic segmentation method for the aerial images of the Yellow River ice unmanned aerial vehicle according to claim 1, wherein the first up-sampling module and the second up-sampling module have the same structure, implemented as follows: assume the input feature map has size H × W × C, where H and W are the height and width of the feature map and C is its number of channels; the input feature map is passed through a convolution layer with N convolution kernels of size 1 × 1, producing a new feature map of size H × W × N; the new feature map is then reshaped into an output feature map of size rH × rW × N/r², where r is the up-sampling factor; the first up-sampling module uses an up-sampling factor of 2, and the second up-sampling module uses an up-sampling factor of 8.
8. The real-time semantic segmentation method for the aerial images of the Yellow River ice unmanned aerial vehicle according to claim 1, wherein the feature fusion module has the following structure: first, the feature fusion module channel-stacks the feature maps output by the shallow detail branch and the deep semantic branch, applies a convolution with stride 1 and kernel size 1 × 1, then batch normalization and a ReLU activation; second, the output of the first step is globally pooled, passed through a convolution with stride 1 and kernel size 1 × 1 and a ReLU activation, then through another convolution with stride 1 and kernel size 1 × 1 and a sigmoid activation, and the result is multiplied element-wise with the output of the first step; third, the product from the second step is added to the output of the first step, and the sum is the output of the feature fusion module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210415977.4A CN114943835B (en) | 2022-04-20 | 2022-04-20 | Real-time semantic segmentation method for yellow river ice unmanned aerial vehicle aerial image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114943835A true CN114943835A (en) | 2022-08-26 |
CN114943835B CN114943835B (en) | 2024-03-12 |
Family
ID=82908048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210415977.4A Active CN114943835B (en) | 2022-04-20 | 2022-04-20 | Real-time semantic segmentation method for yellow river ice unmanned aerial vehicle aerial image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114943835B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710863A (en) * | 2018-05-24 | 2018-10-26 | 东北大学 | Unmanned plane Scene Semantics dividing method based on deep learning and system |
CN111160311A (en) * | 2020-01-02 | 2020-05-15 | 西北工业大学 | Yellow river ice semantic segmentation method based on multi-attention machine system double-flow fusion network |
WO2020101448A1 (en) * | 2018-08-28 | 2020-05-22 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
CN111259898A (en) * | 2020-01-08 | 2020-06-09 | 西安电子科技大学 | Crop segmentation method based on unmanned aerial vehicle aerial image |
WO2020215236A1 (en) * | 2019-04-24 | 2020-10-29 | 哈尔滨工业大学(深圳) | Image semantic segmentation method and system |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
CN113361373A (en) * | 2021-06-02 | 2021-09-07 | 武汉理工大学 | Real-time semantic segmentation method for aerial image in agricultural scene |
CN113658189A (en) * | 2021-09-01 | 2021-11-16 | 北京航空航天大学 | Cross-scale feature fusion real-time semantic segmentation method and system |
Non-Patent Citations (3)
Title |
---|
Li Shuai; Guo Yanyan; Wei Xia: "Semantic segmentation of remote sensing images by downsampling-based feature fusion", Journal of Test and Measurement Technology, no. 04, 31 December 2020 (2020-12-31), pages 61 - 67 *
Xiong Wei; Cai Mi; Lyu Yafei; Pei Jiazheng: "Sea-land semantic segmentation method for remote sensing images based on neural networks", Computer Engineering and Applications, no. 15, 31 December 2020 (2020-12-31), pages 227 - 233 *
Qing Chen; Yu Jing; Xiao Chuangbai; Duan Juan: "Research progress on image semantic segmentation based on deep convolutional neural networks", Journal of Image and Graphics, no. 06, 16 June 2020 (2020-06-16), pages 32 - 33 *
Also Published As
Publication number | Publication date |
---|---|
CN114943835B (en) | 2024-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||