CN116416534A - Unmanned aerial vehicle spare area identification method facing protection target - Google Patents
- Publication number: CN116416534A
- Application number: CN202310139757.8A
- Authority: CN (China)
- Prior art keywords: attention, module, feature map, feature, global feature
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/17 — Terrestrial scenes taken from planes or by drones
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
- Y02T10/40 — Engine management systems
Abstract
The invention relates to a protection-target-oriented unmanned aerial vehicle spare landing area identification method, which comprises the following steps: collecting historical aerial image data of the unmanned aerial vehicle, screening the data and labeling it pixel by pixel to form an aerial image dataset; inputting the aerial image dataset into a target recognition network to obtain context features. The target recognition network comprises a multi-layer semantic segmentation model and a unified attention fusion module connected to it: after the aerial image dataset is input into the semantic segmentation model, the global feature maps obtained from some of its layers are fed into the unified attention fusion module to produce a context feature map. The context feature map is then input into a semantic segmentation head and a target detection head respectively, and their outputs are fused into the identification result. The invention detects pedestrians and vehicles in the spare landing area by means of semantic segmentation in computer vision, so that the unmanned aerial vehicle can land safely in the spare landing area.
Description
Technical Field
The invention relates to the technical field of semantic segmentation in computer vision, and in particular to a protection-target-oriented unmanned aerial vehicle spare landing area identification method.
Background
Semantic segmentation of unmanned aerial vehicle aerial images applies semantic segmentation technology to aerial photography so that the unmanned aerial vehicle gains intelligent perception of the targets in a scene. For aerial images the scene is extremely complex: the spare landing areas to be identified include horizontal roofs, horizontal floors, horizontal grasslands and the like. Pedestrians and vehicles are detected in each candidate area, and an area can be judged a usable spare landing area only if no pedestrians or vehicles are present in it.
Disclosure of Invention
The invention aims to detect pedestrians and vehicles in a spare landing area through semantic segmentation in computer vision, so as to ensure that an unmanned aerial vehicle can land safely in the spare landing area, and provides a protection-target-oriented unmanned aerial vehicle spare landing area identification method.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
A protection-target-oriented unmanned aerial vehicle spare landing area identification method comprises the following steps:
step 1, collecting historical aerial image data of the unmanned aerial vehicle, screening the data and labeling it pixel by pixel to form an aerial image dataset;
step 2, inputting the aerial image dataset into a target recognition network to obtain context features; the target recognition network comprises a multi-layer semantic segmentation model and a unified attention fusion module connected to it, and after the aerial image dataset is input into the semantic segmentation model, the global feature maps obtained from some of its layers are fed into the unified attention fusion module to produce a context feature map;
and step 3, inputting the context feature map into a semantic segmentation head and a target detection head respectively, and fusing their outputs into the identification result.
Compared with the prior art, the invention has the beneficial effects that:
The invention uses semantic segmentation to segment and identify the spare landing area of the unmanned aerial vehicle. Because semantic segmentation is a pixel-level image understanding method, identification of the spare landing area is more accurate and more efficient, and the STDC-BiSeNet network model is a leading technique in the current field of real-time semantic segmentation, which reflects the soundness and generality of the method.
The invention identifies pedestrians and vehicles in the spare landing area with good recognition performance, thereby safeguarding the life and property of pedestrians on the ground.
According to the invention, the STDC-BiSeNet backbone network is shared between semantic segmentation and target detection, so the parameter count of the overall task model is reduced, the whole model becomes lightweight, and rapid deployment of the model is facilitated.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a feature attention weighting module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a full convolution module according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a process of the unified attention fusion module according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Also, in the description of the present invention, the terms "first," "second," and the like are used merely to distinguish one from another, and are not to be construed as indicating or implying a relative importance or implying any actual such relationship or order between such entities or operations. In addition, the terms "connected," "coupled," and the like may be used to denote a direct connection between elements, or an indirect connection via other elements.
Example 1:
The invention is realized by the following technical scheme. As shown in fig. 1, the protection-target-oriented unmanned aerial vehicle spare landing area identification method comprises the following steps:
In order to give the unmanned aerial vehicle better generalization when executing the target recognition task, this embodiment collects a large amount of historical aerial image or video data across different scenes, time periods and areas, then screens the data and labels it pixel by pixel. The labelme image annotation tool is used for labeling, and the labeled data is converted into the VOC dataset format to form the aerial image dataset.
Most current mainstream semantic segmentation models adopt an encoder-decoder structure: the encoder performs feature extraction, enriching the feature maps with semantic information while their resolution is gradually reduced, and the decoder takes the encoded features as input and decodes the final segmentation prediction. This basic framework has several problems: the semantic segmentation task requires detail information in addition to semantic information, yet a model often loses a large amount of detail during the repeated convolution and pooling operations, and this process also tends to inflate the model's parameters.
In order to achieve real-time performance in the semantic segmentation task, this scheme adopts the STDC-BiSeNet semantic segmentation model, whose network is simple, whose parameter count is small, and which is very lightweight while offering good segmentation performance. Built on an unmanned aerial vehicle platform, it can reliably identify spare landing areas, forced landing areas and protection targets.
Referring to fig. 1, the semantic segmentation model includes 5 layers, which are a first full convolution module, a second full convolution module, a first feature attention weight module, a second feature attention weight module, and a third feature attention weight module that are sequentially connected.
The aerial image dataset, at a scale of 224×224×3, is input into the first full convolution module, which outputs a first feature map at a scale of 112×112×32 to the second full convolution module; the second full convolution module in turn outputs a second feature map at a scale of 56×56×64 to the first feature attention weight module.
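As a sketch of this resolution flow (the stride-2 downsampling at every stage is an assumption inferred from the halving of the spatial scale; the channel widths 32, 64, 256, 512 and 1024 are taken from the scales quoted in the description), the backbone shapes can be traced in a few lines:

```python
def conv_out_size(h, w, stride=2):
    """Spatial size after a stride-2 convolution with 'same' padding."""
    return h // stride, w // stride

def trace_backbone(h=224, w=224, channels=(32, 64, 256, 512, 1024)):
    """Trace (H, W, C) through the five downsampling stages described above."""
    shapes = []
    for c in channels:
        h, w = conv_out_size(h, w)
        shapes.append((h, w, c))
    return shapes
```

Running the trace reproduces the five stage scales quoted in the description: 112×112×32, 56×56×64, 28×28×256, 14×14×512 and 7×7×1024.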
After processing by the first feature attention weight module, a first global feature map F_low1 at a scale of 28×28×256 is output to the second feature attention weight module; after processing by the second feature attention weight module, a second global feature map F_low2 at a scale of 14×14×512 is output to the third feature attention weight module; after processing by the third feature attention weight module, a third global feature map F_low3 at a scale of 7×7×1024 is output to the global pooling layer.
The first, second and third feature attention weight modules of the semantic segmentation model also output the first global feature map F_low1, the second global feature map F_low2 and the third global feature map F_low3, respectively, to the unified attention fusion module.
The first full convolution module and the second full convolution module have the same structure, and referring to fig. 3, the first full convolution module and the second full convolution module each include a convolution layer, a normalization layer, and an activation layer that are sequentially connected.
The first, second and third feature attention weight modules share the same structure; referring to fig. 2, each comprises a global pooling layer, together with a first attention convolution layer, a second attention convolution layer, a third attention convolution layer, a fourth attention convolution layer and a Concat layer connected in sequence. The convolution kernel size of the first attention convolution layer is 1×1, and the convolution kernel sizes of the second, third and fourth attention convolution layers are all 3×3.
With continued reference to fig. 2, the aerial image dataset passes through the first and second full convolution modules to obtain a low-level feature map F_0. F_0 passes through the first attention convolution layer to obtain a first global feature subgraph F_1; F_1 passes through the second attention convolution layer to obtain a second global feature subgraph F_2; F_2 passes through the third attention convolution layer to obtain a third global feature subgraph F_3; and F_3 passes through the fourth attention convolution layer to obtain a fourth global feature subgraph F_4. After F_1 passes through the global pooling layer with kernel size 3×3, it is fused with the second, third and fourth global feature subgraphs F_2, F_3 and F_4 through the Concat layer into a global feature map F_low.
It is easy to understand that the first feature attention weight module outputs the first global feature map F_low1, the second outputs the second global feature map F_low2, and the third outputs the third global feature map F_low3.
With continued reference to fig. 1, the unified attention fusion module is further connected to a pyramid pooling module, and the pyramid pooling module is used to increase receptive fields when extracting the context feature map.
The pyramid pooling module processes the third global feature map F_low3 output by the third feature attention weight module to obtain a third high-level global feature map F_high3; the third global feature map F_low3 and the third high-level global feature map F_high3 are then input together into the unified attention fusion module to obtain a third context feature map F_out3.
The third context feature map F_out3, serving as the second high-level global feature map F_high2, and the second global feature map F_low2 output by the second feature attention weight module are input together into the unified attention fusion module to obtain a second context feature map F_out2.
The second context feature map F_out2, serving as the first high-level global feature map F_high1, and the first global feature map F_low1 output by the first feature attention weight module are input together into the unified attention fusion module to obtain a first context feature map F_out1.
Referring to FIG. 4, the processing of the third global feature map F_low3 and the third high-level global feature map F_high3 by the unified attention fusion module is taken as an example. The pyramid pooling module first processes F_low3 to obtain F_high3; F_low3 and F_high3 are then input together into the unified attention fusion module to obtain the third context feature map F_out3:
The third high-level global feature map F_high3 is upsampled to form F_up3:
F_up3 = Upsample(F_high3)
F_up3 and the third global feature map F_low3 are fed together into the channels of the attention mechanism, which produces the weights a and 1-a:
(a, 1-a) = Attention(F_up3, F_low3)
where a is the weight of F_up3 and 1-a is the weight of F_low3.
F_up3 and F_low3 are then multiplied by their respective weights and summed to obtain the third context feature map F_out3:
F_out3 = F_up3 * a + F_low3 * (1-a).
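A minimal pure-Python sketch of this weighted fusion step follows; the `attention_weight` stand-in is an assumption (the real attention mechanism learns the weight from both inputs, whereas here it is simply a sigmoid of their mean), and feature maps are flattened to lists for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_weight(f_up, f_low):
    """Stand-in for Attention(F_up, F_low): one scalar weight a in (0, 1)
    derived from the mean of both inputs. The real module learns this."""
    mean = sum(f_up + f_low) / (len(f_up) + len(f_low))
    return sigmoid(mean)

def unified_attention_fusion(f_up, f_low):
    """F_out = F_up * a + F_low * (1 - a), applied element-wise."""
    a = attention_weight(f_up, f_low)
    return [u * a + l * (1.0 - a) for u, l in zip(f_up, f_low)]
```

Because the two weights sum to 1, the fused output always lies between the two inputs element-wise, so the module interpolates between low-level detail and high-level semantics rather than replacing one with the other.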
It is easy to understand that the second context feature map F_out2 and the first context feature map F_out1 are obtained in the same manner, which is not repeated here.
On the other hand, in fig. 2 the low-level feature map F_0 input into the first attention convolution layer has M channels. The first global feature subgraph F_1 obtained after the first attention convolution layer has M/2 channels; the second global feature subgraph F_2 obtained after further convolution in the second attention convolution layer has M/4 channels; and the third and fourth attention convolution layers each output M/8 channels. The first, second, third and fourth global feature subgraphs F_1, F_2, F_3 and F_4 are then spliced and fused by skip connection. The feature map passed from the pyramid pooling module to the unified attention fusion module requires upsampling; since the channel count grows while the spatial extent of the features shrinks along the backbone, this design balances the computational cost.
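The channel arithmetic of that skip-connected splice can be checked with a short sketch (assuming M is divisible by 8, per the halving pattern described above): splitting into M/2 + M/4 + M/8 + M/8 channels and concatenating recovers exactly M.

```python
def faw_channel_flow(m):
    """Channel counts of the four global feature subgraphs F1..F4 for an
    M-channel input, following the M/2, M/4, M/8, M/8 halving pattern."""
    assert m % 8 == 0, "M is assumed divisible by 8"
    channels = [m // 2, m // 4, m // 8, m // 8]
    # The Concat layer stacks the subgraphs along the channel axis,
    # so the fused map F_low recovers the original M channels.
    return channels, sum(channels)
```

For example, with M = 256 the subgraphs carry 128, 64, 32 and 32 channels, and the concatenation again yields 256.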
In order to strengthen the feature extraction of the target recognition network and give it multi-scale context capability, this scheme introduces the unified attention fusion module: the global feature maps output by the first, second and third feature attention weight modules are passed into the unified attention fusion module and fused uniformly, making full use of the spatial and inter-channel relationships of the input features, which is a key factor in improving segmentation accuracy.
In summary, the semantic segmentation model, the unified attention fusion module and the pyramid pooling module are connected into the target recognition network, which reduces the computational load relative to the traditional BiSeNet model while improving the model's computational efficiency. The network adopts skip-layer connections throughout, and the introduced unified attention fusion and pyramid pooling modules enlarge the receptive field of the target recognition network and fuse the context features.
Step 3, inputting the context feature map into a semantic segmentation head and a target detection head respectively, and fusing their outputs into the identification result.
The first context feature map F_out1, as the final context feature output by the target recognition network, is input into the prediction part, which comprises two parallel branches: a semantic segmentation head and a target detection head. After the context features pass through the semantic segmentation head and the target detection head, the results are rendered onto the image and the outputs are fused into the identification result.
The loss functions widely used in most current semantic segmentation methods are the Dice Loss function and the cross entropy function. The Dice Loss derives from the Dice coefficient, a metric of set similarity generally used to compute the similarity between two samples. For a single pixel, the Dice Loss function is:
L_dice = 1 - (2·p·y + ε) / (p + y + ε)
where p is the pixel's ground-truth value, taking the value 0 or 1; y is the pixel's predicted value after sigmoid or softmax, taking a value in (0, 1); and ε is a smoothing coefficient whose role is to prevent the denominator from being 0, and which also smooths the loss and its gradient. Here ε = 1.
For multiple pixels, the Dice Loss function is:
L_dice = 1 - (2·Σ_i p_i y_i + ε) / (Σ_i p_i + Σ_i y_i + ε)
However, this scheme finds that the excess of negative samples during model training leads to inconsistent results between training and testing, and to relatively poor convergence. The scheme therefore, guided by experiments, improves the denominator of the Dice Loss function into a sum-of-squares form, which converges better. The improved Dice Loss function is:
L_dice' = 1 - (2·Σ_i p_i y_i + ε) / (Σ_i p_i² + Σ_i y_i² + ε)
however, in the training task of the semantic segmentation model, there is a phenomenon that the number of simple negative samples is too large, and the model cannot distinguish between the positive samples and the difficult negative samples due to the too large number of simple samples. To solve this problem, the present solution continuously adjusts each during the training of the modelThe weight of the sample was determined using (1-y i ) As a weight for each sample. For a simple sample, because the model can easily fit y i Pushing to 1, so the weight of the training device becomes smaller gradually in the training process, and the modified Dice Loss function is finally as follows:
where n represents the total number of samples in the aerial image dataset and i indexes its i-th sample; p_i represents the ground-truth pixel value of the i-th sample, taking the value 0 or 1; y_i represents the predicted pixel value of the i-th sample, taking a value in (0, 1); and ε represents the smoothing coefficient.
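A minimal pure-Python sketch of this weighted Dice Loss follows; the exact placement of the (1 - y_i) weights inside the fraction is an assumption read from the description (the patent's own formula images are not reproduced in this text):

```python
def weighted_dice_loss(p, y, eps=1.0):
    """Improved Dice Loss: sum-of-squares denominator, each term weighted
    by (1 - y_i) so easy samples fade out as y_i approaches 1.
    p: ground-truth values in {0, 1}; y: predictions in (0, 1)."""
    num = sum((1 - yi) * 2 * pi * yi for pi, yi in zip(p, y))
    den = sum((1 - yi) * (pi ** 2 + yi ** 2) for pi, yi in zip(p, y))
    return 1.0 - (num + eps) / (den + eps)
```

With eps = 1 the loss is exactly 0 when predictions match the 0/1 ground truth, and grows toward 1 as the overlap between prediction and ground truth shrinks.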
The cross entropy function mainly measures the difference between the predicted distribution Y and the true distribution P of the same random variable X. The cross entropy function is:
L_ce = -Σ_x P(x) log Y(x)
to solve the problem of class imbalance, the same usage (1-y i ) As the weight of each sample, the improved cross entropy function formula is:
Because the invention performs small-target semantic segmentation and target recognition of the ground from high altitude, gradient saturation may occur under extreme conditions during model training. The improved Dice Loss function and cross entropy function are therefore combined, and the combined total loss function is:
L = L_dice' + L_ce'
in summary, the semantic segmentation and target detection technology is implemented in the semantic segmentation model, the backbone network used is STDC-BiSeNet, the parameter number of the total model is greatly optimized, and the model has the advantage of light weight by using the same backbone network, so that the rapid deployment of the model is facilitated. The model is deployed into TX2 for testing, semantic segmentation MPA (average pixel precision) reaches 90%, target detection MAP (average precision) reaches 96.8%, and fps reaches 59, which shows that the model has high-efficiency segmentation performance and real-time performance in the unmanned aerial vehicle aerial data set established in the invention.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (7)
1. A protection-target-oriented unmanned aerial vehicle spare landing area identification method, characterized by comprising the following steps:
step 1, collecting historical aerial image data of the unmanned aerial vehicle, screening the data and labeling it pixel by pixel to form an aerial image dataset;
step 2, inputting the aerial image dataset into a target recognition network to obtain context features; the target recognition network comprises a multi-layer semantic segmentation model and a unified attention fusion module connected to it, and after the aerial image dataset is input into the semantic segmentation model, the global feature maps obtained from some of its layers are fed into the unified attention fusion module to produce a context feature map;
and 3, respectively inputting the context feature map into a semantic segmentation head and a target detection head, and fusing the output results of the semantic segmentation head and the target detection head into identification results.
2. The protection-target-oriented unmanned aerial vehicle spare landing area identification method according to claim 1, characterized in that: each layer of the semantic segmentation model comprises a first full convolution module, a second full convolution module, a first feature attention weight module, a second feature attention weight module and a third feature attention weight module which are connected in sequence;
the aerial image dataset, at a scale of 224×224×3, is input into the first full convolution module, which outputs a first feature map at a scale of 112×112×32 to the second full convolution module; the second full convolution module outputs a second feature map at a scale of 56×56×64 to the first feature attention weight module;
after processing by the first feature attention weight module, a first global feature map F_low1 at a scale of 28×28×256 is output to the second feature attention weight module; after processing by the second feature attention weight module, a second global feature map F_low2 at a scale of 14×14×512 is output to the third feature attention weight module; after processing by the third feature attention weight module, a third global feature map F_low3 at a scale of 7×7×1024 is output to the global pooling layer;
the first, second and third feature attention weight modules of the semantic segmentation model output the first global feature map F_low1, the second global feature map F_low2 and the third global feature map F_low3, respectively, to the unified attention fusion module.
3. The protection-target-oriented unmanned aerial vehicle spare landing area identification method according to claim 2, characterized in that: the first full convolution module and the second full convolution module each comprise a convolution layer, a normalization layer and an activation layer which are connected in sequence.
4. The protection-target-oriented unmanned aerial vehicle spare area identification method according to claim 2, wherein the first, second and third feature attention weight modules each comprise a global pooling layer, and a first attention convolution layer, a second attention convolution layer, a third attention convolution layer, a fourth attention convolution layer and a Concat layer which are connected in sequence;
the convolution kernel size of the first attention convolution layer is 1×1, and the convolution kernel sizes of the second, third and fourth attention convolution layers are all 3×3;
the input to the first attention convolution layer is a low-level feature map F_0; F_0 passes through the first attention convolution layer to give a first global feature subgraph F_1; F_1 passes through the second attention convolution layer to give a second global feature subgraph F_2; F_2 passes through the third attention convolution layer to give a third global feature subgraph F_3; F_3 passes through the fourth attention convolution layer to give a fourth global feature subgraph F_4;
the first global feature subgraph F_1, after passing through the global pooling layer with kernel size 3×3, is fused through the Concat layer with the second global feature subgraph F_2, the third global feature subgraph F_3 and the fourth global feature subgraph F_4 into a global feature map F_low.
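A feature attention weight module of this shape can be sketched as follows. The kernel sizes and the data flow F_0 → F_1 → F_2 → F_3 → F_4 come from claim 4; the stride-2 first convolution (to realize the spatial halving of claim 2), the choice of average pooling for the 3×3 "global pooling layer", and the per-branch channel count are assumptions.

```python
import torch
import torch.nn as nn

class FeatureAttentionWeight(nn.Module):
    """Sketch of the feature attention weight module of claim 4: four chained
    attention convolutions (1x1, 3x3, 3x3, 3x3); the pooled F_1 is concatenated
    with F_2, F_3 and F_4 into F_low."""
    def __init__(self, cin, c):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, c, 1, stride=2)        # 1x1 -> F_1 (assumed stride 2)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)         # 3x3 -> F_2
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)         # 3x3 -> F_3
        self.conv4 = nn.Conv2d(c, c, 3, padding=1)         # 3x3 -> F_4
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)   # 3x3 pooling on F_1

    def forward(self, f0):
        f1 = self.conv1(f0)
        f2 = self.conv2(f1)
        f3 = self.conv3(f2)
        f4 = self.conv4(f3)
        # Concat-layer fusion of pooled F_1 with F_2, F_3 and F_4
        return torch.cat([self.pool(f1), f2, f3, f4], dim=1)
```

With cin=64 and c=64 this reproduces the 56×56×64 → 28×28×256 transition of claim 2, since the four concatenated branches contribute 4c channels.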
5. The protection-target-oriented unmanned aerial vehicle spare area identification method according to claim 2, wherein the unified attention fusion module is further connected with a pyramid pooling module, which enlarges the receptive field when the context feature maps are extracted;
the pyramid pooling module processes the third global feature map F_low3 output by the third feature attention weight module to obtain a third high-level global feature map F_high3; the third global feature map F_low3 and the third high-level global feature map F_high3 are input together into the unified attention fusion module to obtain a third context feature map F_out3;
the third context feature map F_out3, serving as the second high-level global feature map F_high2, is input into the unified attention fusion module together with the second global feature map F_low2 output by the second feature attention weight module to obtain a second context feature map F_out2;
the second context feature map F_out2, serving as the first high-level global feature map F_high1, is input into the unified attention fusion module together with the first global feature map F_low1 output by the first feature attention weight module to obtain a first context feature map F_out1.
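A receptive-field-enlarging pyramid pooling module of this kind can be sketched as follows. The claim fixes only the module's purpose; the PSPNet-style bin sizes (1, 2, 3, 6), the 1×1 projections, and the bilinear upsampling are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a pyramid pooling module: pool the input at several bin sizes,
    project, upsample back, and concatenate with the input to widen the
    effective receptive field."""
    def __init__(self, cin, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(cin, cin // len(bins), 1))
            for b in bins
        )
        # input (cin) + len(bins) branches of cin/len(bins) each = 2*cin channels
        self.project = nn.Conv2d(cin * 2, cin, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return self.project(torch.cat(feats, dim=1))
```

Applied to F_low3 (7×7×1024), the output keeps the same scale and so can serve directly as F_high3 in the cascade of claim 5.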
6. The protection-target-oriented unmanned aerial vehicle spare area identification method according to claim 5, wherein inputting the global feature map F_low and the high-level global feature map F_high together into the unified attention fusion module to obtain the context feature map F_out comprises the following steps:
upsampling the high-level global feature map F_high to obtain F_up:

F_up = Upsample(F_high)

inputting F_up and the global feature map F_low together into the channel attention mechanism to produce the weights a and 1-a:

(a, 1-a) = Attention(F_up, F_low)

wherein a is the weight of F_up and 1-a is the weight of F_low;

multiplying F_up and F_low by their respective weights and summing to obtain the context feature map F_out:

F_out = F_up*a + F_low*(1-a).
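The three steps above can be sketched as one module. The upsample-weight-blend structure follows claim 6; the internals of Attention(·) are not fixed by the claim, so the channel attention here (global pooling over the concatenated inputs, a 1×1 convolution and a sigmoid) and the assumption that F_low and F_high have equal channel counts are both illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedAttentionFusion(nn.Module):
    """Sketch of the unified attention fusion module of claim 6."""
    def __init__(self, c):
        super().__init__()
        # Assumed channel attention producing the per-channel weight a in (0, 1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, c, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_low, f_high):
        # F_up = Upsample(F_high)
        f_up = F.interpolate(f_high, size=f_low.shape[2:], mode="bilinear",
                             align_corners=False)
        # (a, 1-a) = Attention(F_up, F_low)
        a = self.attn(torch.cat([f_up, f_low], dim=1))
        # F_out = F_up*a + F_low*(1-a)
        return f_up * a + f_low * (1 - a)
```

In the cascade of claim 5 the module is applied three times, each time taking the previous context feature map as the new F_high; a real implementation would also project F_high to the channel count of F_low where they differ.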
7. The protection-target-oriented unmanned aerial vehicle spare area identification method according to claim 1, wherein the loss function of the target recognition network is as follows:

the Dice Loss function:

L_dice = 1 - (2·Σ_{i=1..n} p_i·y_i + ε) / (Σ_{i=1..n} p_i + Σ_{i=1..n} y_i + ε)

the cross entropy function:

L_ce = -(1/n)·Σ_{i=1..n} [p_i·log(y_i) + (1-p_i)·log(1-y_i)]

wherein n represents the total number of samples in the aerial photographing data set, and i represents the i-th sample of the aerial photographing data set; p_i represents the pixel point true value of the i-th sample, taking the value 0 or 1; y_i represents the pixel point predicted value of the i-th sample, taking a value in (0, 1); ε represents the smoothing coefficient;

the overall loss function of the target recognition network is:

L = L_dice + L_ce.
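The two loss terms and their sum can be written directly from the definitions of n, p_i, y_i and ε. This is a sketch assuming the standard smoothed Dice formulation in which ε appears in both numerator and denominator; the claim names the smoothing coefficient without fixing its exact placement.

```python
import numpy as np

def dice_loss(p, y, eps=1.0):
    """Dice loss with smoothing coefficient eps; p holds the 0/1 ground-truth
    pixel values, y holds the predicted values in (0, 1)."""
    p = np.asarray(p, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return 1.0 - (2.0 * np.sum(p * y) + eps) / (np.sum(p) + np.sum(y) + eps)

def cross_entropy_loss(p, y):
    """Binary cross entropy averaged over the n samples."""
    p = np.asarray(p, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float(-np.mean(p * np.log(y) + (1.0 - p) * np.log(1.0 - y)))

def total_loss(p, y, eps=1.0):
    """Overall loss of claim 7: L = L_dice + L_ce."""
    return dice_loss(p, y, eps) + cross_entropy_loss(p, y)
```

For a perfect prediction the Dice term vanishes, while the cross entropy term penalizes under-confident predictions even when the Dice overlap is already high, which is why the two are summed.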
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310139757.8A CN116416534A (en) | 2023-02-21 | 2023-02-21 | Unmanned aerial vehicle spare area identification method facing protection target |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116416534A true CN116416534A (en) | 2023-07-11 |
Family
ID=87057277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310139757.8A Pending CN116416534A (en) | 2023-02-21 | 2023-02-21 | Unmanned aerial vehicle spare area identification method facing protection target |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116416534A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117593716A (en) * | 2023-12-07 | 2024-02-23 | 山东大学 | Lane line identification method and system based on unmanned aerial vehicle inspection image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||