Transmission-distortion-resistant method and device for extracting semantic information from unmanned aerial vehicle oblique images
Technical Field
The invention belongs to the field of computer applications, and mainly relates to a transmission-distortion-resistant method for extracting semantic information from unmanned aerial vehicle oblique images.
Background
Oblique images acquired by unmanned aerial vehicles are among the most important data sources for urban scene observation. Unmanned aerial vehicle oblique images capture both the top and side surfaces of urban objects, combining the advantages of ground vehicle-mounted images and remote sensing images. Semantic information extraction from unmanned aerial vehicle oblique images plays a vital role in many urban applications, including urban planning, dynamic monitoring, and urban three-dimensional semantic modeling. Pixel-level semantic extraction, i.e., semantic segmentation, aims to assign a unique semantic label to each pixel in an image. To date, there has been extensive research on semantic segmentation of ground vehicle-mounted images and remote sensing images, but only limited research on semantic segmentation of unmanned aerial vehicle oblique images.
In recent years, with the rapid development of convolutional neural networks (CNNs), semantic segmentation has achieved tremendous success; many CNN models show strong performance on ground vehicle-mounted images and remote sensing images in particular. However, compared with these two image types, unmanned aerial vehicle oblique images face a more complex scale problem, which makes existing semantic segmentation models difficult to apply to them. Specifically, for the same urban scene, the scale problem in semantic segmentation of unmanned aerial vehicle oblique images arises from two aspects. 1) More complex scale variation. Scale variation in remote sensing images stems from differences in the physical size of different objects. In unmanned aerial vehicle oblique images, in addition to this inter-object size difference, transmission distortion induces scale variation along the depth direction, so that objects of the same type appear at different scales depending on their distance from the camera. 2) A larger scene range. Scale variation between objects in ground vehicle-mounted images is also affected by transmission distortion, but it is limited by the more local scene range of such images. In contrast, unmanned aerial vehicle oblique images have larger swaths and contain more, and more densely packed, objects, making their pixel-level semantic segmentation more challenging.
To address these difficulties, current mainstream methods perform semantic segmentation of unmanned aerial vehicle oblique images by fusing multi-scale features encoded from images of different resolutions, or by fusing multi-scale semantic segmentation predictions. Although these methods achieve a certain segmentation effect, their performance on unmanned aerial vehicle oblique images is limited by insufficient use of the multi-scale features.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a transmission-distortion-resistant method for extracting semantic information from unmanned aerial vehicle oblique images, which achieves high-precision semantic segmentation of such images.
The invention discloses a transmission distortion resistant unmanned aerial vehicle inclined image semantic information extraction method, which is characterized by comprising the following steps of:
step 1, multi-scale feature encoding is carried out based on a dual-branch feature extractor: with a dual-resolution image as input, multi-scale features are extracted, comprising multi-scale features of different levels and of the same level, wherein the multi-scale features of different levels come from different coding layers of the encoder, and the same-level multi-scale features come from the same coding layer fed with inputs of different resolutions;
step 2, dense feature extraction is carried out based on a cross-scale context selector: the cross-scale context selector is constructed to extract context information from the dual-scale features of the same coding layer, yielding a feature weight map corresponding to the dual-scale features; the dual-scale feature maps are then adaptively fused according to the learned feature weight map;
the cross-scale context selector first upsamples the multi-scale features encoded from the lower-resolution image to the same size as those encoded from the higher-resolution image and concatenates them along the channel dimension to obtain a feature map fusing the dual-scale features; it then learns the context relations between cross-scale pixels in this dual-scale feature map using a cross-scale context coding structure, assigning an importance weight to each pixel so that the dual-scale features are fused selectively;
step 3, feature decoding is performed based on a multi-scale feature aggregator: the low-level geometric features of the encoder are first embedded layer by layer into the high-level semantic features of the decoder in a top-down feature transfer manner; a multi-scale feature aggregator is then built to fuse semantic features of different scales from several decoding layers, enhancing the robustness of the decoded feature map in expressing the semantics of multi-scale objects in unmanned aerial vehicle images;
the feature aggregator applies a convolution operation to each semantic feature map, merges the maps along the channel direction, obtains a global feature description vector, models the context relations among the features, and finally obtains an importance weight vector for the features of each channel.
Further, in step 1, Wide-ResNet38 is adopted to extract the multi-scale features.
Further, the cross-scale context selector consists of a 1×1 convolution layer, two consecutive 3×3 convolution layers, and a sigmoid activation function, wherein the 1×1 convolution layer mixes the dual-scale features and reduces their dimensionality, the 3×3 convolution layers extract local context information from the mixed dual-scale features, and the sigmoid activation function normalizes the encoded mixed dual-scale features to obtain the feature weight map corresponding to the dual-scale features.
Further, according to the learned feature weight map, the dual-scale feature maps are adaptively fused by the following formula:

F_i = Conv_{1×1}(W_i ⊗ X_i ⊕ (1 − W_i) ⊗ Up(Y_i))   (1)

In formula 1, F_i denotes the fused feature map obtained from the dual-scale feature maps by context selection, and Conv_{1×1} denotes a convolution with a 1×1 kernel used for dimension reduction; the symbol ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, and Up denotes the upsampling operation on a feature map; the multi-scale features encoded from the dual-resolution input images are denoted X_i and Y_i respectively, and both are fused according to the feature weight map W_i.
Further, the decoder embeds the low-level geometric features of the encoder layer by layer into the high-level features of the decoder in a top-down feature transfer manner, specifically using the following formula:

D_j = Conv_{3×3}(F_j ⊕ Up(D_{j+1}))

where F_j is the feature map fusing the dual-scale features, Conv_{3×3} denotes a convolution with a 3×3 kernel used to fuse the different features, the symbol ⊕ denotes matrix addition, and Up upsamples the feature map D_{j+1} to the same size as F_j for feature fusion.
Still further, the multi-scale feature aggregator first applies a 3×3 convolution to each semantic feature map, then upsamples the feature maps of the third and fourth decoding layers to the same size as that of the second decoding layer and merges them along the channel direction; next, the aggregator obtains a global feature description vector by a global average pooling operation along the channel direction and models the context relations among the features in this vector using two fully connected layers; finally, the importance weight vector of each channel's features is obtained by normalization with a sigmoid function.
Based on the same inventive concept, the present scheme further provides a system for implementing the above transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method, characterized in that it comprises an encoding module, a feature extraction module, and a decoding module;
the encoding module performs multi-scale feature encoding based on a dual-branch feature extractor: with a dual-resolution image as input, it extracts multi-scale features of different levels and of the same level, wherein the multi-scale features of different levels come from different coding layers of the encoder, and the same-level multi-scale features come from the same coding layer fed with inputs of different resolutions;

the feature extraction module performs dense feature extraction based on a cross-scale context selector: it constructs the cross-scale context selector to extract context information from the dual-scale features of the same coding layer, obtains a feature weight map corresponding to the dual-scale features, and adaptively fuses the dual-scale feature maps according to the learned feature weight map;

the decoding module performs feature decoding based on the multi-scale feature aggregator: it first embeds the low-level geometric features of the encoder layer by layer into the high-level features of the decoder in a top-down feature transfer manner, obtaining semantic feature maps with detail information of different granularities; it then constructs a multi-scale context aggregator that applies a convolution operation to each semantic feature map, merges the maps along the channel direction, obtains a global feature description vector by global average pooling along the channel direction, models the context relations among the features, and finally obtains the importance weight vector of each channel's features for adaptive aggregation of the multi-layer semantic feature maps.
Based on the same inventive concept, the present scheme further provides a network training method for the above system, characterized in that:

network training is performed under joint supervision of the encoder and decoder: semantic supervision is applied to the final decoded features, and additional semantic supervision is applied to the highest-level features of the encoder to guide the encoded features through more effective gradient back-propagation, thereby further optimizing the semantic prediction results;
the total loss equation for this joint supervision model can be expressed as follows:
in the method, in the process of the invention,representing semantic truth value->And->Representing the final prediction map from the decoder and the prediction map from the highest layer features of the encoder, respectively; />Representing the number of different resolution images of the network input,/-for>Andrespectively represent for supervision->And->The loss equation of (2), the loss is exploited->And->As a weight to balance->And->Importance in the overall loss function.
Based on the same inventive concept, the scheme also designs electronic equipment, which comprises:
one or more processors;
a storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the above transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method.
Based on the same inventive concept, the present scheme also provides a computer readable medium on which a computer program is stored, characterized in that: when executed by a processor, the program implements the above transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method.
The innovations of the invention are:
1) A novel deep neural network for dense context learning is designed for semantic segmentation of unmanned aerial vehicle oblique images; it can effectively aggregate multi-scale context information from multi-resolution image encoding to enhance the distortion resistance of the features during semantic segmentation of unmanned aerial vehicle oblique images.
2) A cross-scale context selector embedded in multiple coding layers is constructed to densely and selectively fuse context information from multi-level dual-scale feature maps, enhancing the expressive capability of the encoded features.
3) A multi-scale feature aggregator is introduced to effectively aggregate long-distance context information from multiple coding layers, and finally an accurate unmanned aerial vehicle inclined image semantic prediction graph is obtained.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a cross-scale context selector according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a multi-scale context aggregator of an embodiment of the invention.
Detailed Description
To facilitate understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.
The technical scheme adopted by the invention is a transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method, the flow of which is shown in fig. 1. The specific implementation steps are explained as follows:
step 1, multi-scale feature coding based on a dual-branch feature extractor. With the dual-resolution image as input, in this embodiment, the oblique image and the scaled 0.5-fold image of the unmanned aerial vehicle with the original resolution are adopted, and the wide-resnet38 is adopted to extract the multi-scale features. The multi-scale features include multi-scale features of different levels and the same level. Wherein the multi-scale features of different levels come from different coding layers of the encoder, and the multi-scale features of the same level come from the same coding layer of different resolution inputs.
Step 2, dense feature extraction based on a cross-scale context selector. To fuse multi-scale features of the same level, the invention constructs a cross-scale context selector, as shown in FIG. 2. In the dual-branch feature extractor, the multi-scale features encoded from the original-resolution image are denoted X_i, and those encoded from the 0.5×-resolution image are denoted Y_i. To extract useful context information from the corresponding dual-scale features X_i and Y_i of the same coding layer, Y_i is first upsampled to the same size as X_i, and the two are concatenated along the channel dimension to obtain a feature map Z_i fusing the dual-scale features. A cross-scale context coding structure then learns the context relations between cross-scale pixels in Z_i and assigns an importance weight to each pixel of Z_i for fusing the dual-scale features. The cross-scale context coding structure consists of a 1×1 convolution layer, two consecutive 3×3 convolution layers, and a sigmoid activation function. The 1×1 convolution layer mixes the dual-scale features and reduces their dimensionality, the 3×3 convolution layers extract local context information from the mixed dual-scale features, and the sigmoid activation function normalizes the encoded mixed dual-scale features to obtain the feature weight map W_i corresponding to the dual-scale features. The feature weight map represents the importance weight of each pixel of the corresponding feature map within the dual-scale features and models the long-distance context relations in the dual-scale feature map. According to the feature weight map, the cross-scale context selector can effectively retain the useful context information in Z_i while suppressing redundant features. Finally, according to the learned feature weight map, the dual-scale feature maps can be adaptively fused by the following formula:

F_i = Conv_{1×1}(W_i ⊗ X_i ⊕ (1 − W_i) ⊗ Up(Y_i))   (1)

In formula 1, F_i denotes the fused feature map obtained from the dual-scale feature maps by context selection, and Conv_{1×1} denotes a convolution with a 1×1 kernel used for dimension reduction. The symbol ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, and Up denotes the upsampling operation on a feature map.
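The selection-and-fusion step described above can be sketched in NumPy under loudly stated simplifications: the 1×1/3×3 convolution stack is collapsed into a single random linear projection followed by a sigmoid, nearest-neighbour upsampling stands in for the real upsampling, and the weight map has a single channel. The names x, y, and the weight map mirror the dual-branch features described in the text; none of this is the patent's exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(y):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return y.repeat(2, axis=0).repeat(2, axis=1)

def cross_scale_select(x, y, w_mix):
    """Selectively fuse same-layer features from the two branches.

    x:     (H, W, C)     features encoded from the full-resolution image
    y:     (H/2, W/2, C) features encoded from the 0.5x image
    w_mix: (2C, 1)       stand-in for the conv stack that produces the weight map
    """
    y_up = upsample2x(y)                    # match the spatial size of x
    z = np.concatenate([x, y_up], axis=-1)  # dual-scale map (channel concat)
    w = sigmoid(z @ w_mix)                  # per-pixel weight map in (0, 1)
    return w * x + (1.0 - w) * y_up         # adaptive fusion of the two branches

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
y = rng.standard_normal((2, 2, 8))
f = cross_scale_select(x, y, rng.standard_normal((16, 1)))
```

Because the weight lies in (0, 1), each fused pixel is a convex combination of the two branch features at that location.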
Step 3, feature decoding based on a multi-scale feature aggregator. Because the convolution and downsampling operations used during encoding inevitably discard some detail features, and these details are critical for pixel-level semantic prediction, a decoder must be designed to restore the spatial resolution for pixel-level prediction. Considering that lower-level encoded features contain more detail information while higher-level encoded features carry more explicit semantic information, the decoder first embeds the low-level features layer by layer into the high-level features in the top-down feature transfer manner shown in formula 2, obtaining semantic feature maps with detail information of different granularities.
D_j = Conv_{3×3}(F_j ⊕ Up(D_{j+1}))   (2)

In formula 2, F_j is the feature map fusing the dual-scale features obtained in step 2, and Conv_{3×3} denotes a convolution with a 3×3 kernel used to fuse the different features. The symbol ⊕ denotes matrix addition, and Up upsamples the feature map D_{j+1} to the same size as F_j for feature fusion.
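A minimal sketch of this top-down transfer, with the 3×3 fusion convolution replaced by a per-pixel channel-mixing stub and nearest-neighbour upsampling assumed (both are simplifications for illustration, not the patent's layers):

```python
import numpy as np

def upsample2x(d):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return d.repeat(2, axis=0).repeat(2, axis=1)

def conv3x3_stub(f, w):
    """Stand-in for the 3x3 fusion convolution: per-pixel channel mixing."""
    return f @ w

def top_down_decode(fused_feats, mix_weights):
    """Build decoder maps top-down: D_j = conv(F_j + Up(D_{j+1})).

    fused_feats: [F_1, ..., F_L], each F_j half the height/width of F_{j-1}
    mix_weights: L-1 channel-mixing matrices, one per fusion step
    """
    d = fused_feats[-1]                         # start from the highest layer
    decoded = [d]
    for f, w in zip(reversed(fused_feats[:-1]), mix_weights):
        d = conv3x3_stub(f + upsample2x(d), w)  # embed low-level detail
        decoded.append(d)
    return decoded[::-1]                        # fine-to-coarse order

rng = np.random.default_rng(1)
feats = [rng.standard_normal((16 // 2**j, 16 // 2**j, 8)) for j in range(4)]
outs = top_down_decode(feats, [rng.standard_normal((8, 8)) for _ in range(3)])
```

Each decoded map keeps the spatial size of its corresponding fused encoder feature, so the finest map matches the first coding layer.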
To further merge these semantic feature maps D_2, D_3, and D_4, which carry detail information of different granularities, the invention introduces a multi-scale feature aggregator. To enhance the local context information of the features, the aggregator first applies a 3×3 convolution to each semantic feature map, then upsamples D_3 and D_4 to the same size as D_2 and merges the three along the channel direction into a feature map M. Next, to extract from M the features most effective for the final semantic prediction, the aggregator obtains a global feature description vector using a global average pooling operation along the channel direction, and further models the context relations among the features in this vector using two fully connected layers. Finally, the importance weight vector V of each channel's features is obtained by normalization with a sigmoid function. Letting O denote the aggregated output, the feature aggregator aggregates the semantic feature maps of different granularity detail information by formula 3:

O^k = V^k ⊙ M^k   (3)

In formula 3, M^k and O^k denote the k-th channel feature maps of M and O, respectively, and the symbol ⊙ denotes element-by-element multiplication.
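The aggregation can be sketched as a squeeze-and-reweight step. Assumptions in this sketch: the per-map 3×3 convolutions and upsampling are taken as already applied, the two fully connected layers are plain matrices with a ReLU between them (the text does not specify the intermediate activation), and all shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate(maps, fc1, fc2):
    """Squeeze-and-reweight aggregation of decoder feature maps.

    maps: decoder maps already convolved and upsampled to a common (H, W, C)
    fc1:  (R, K) first fully connected layer, K = total merged channels
    fc2:  (K, R) second fully connected layer
    """
    m = np.concatenate(maps, axis=-1)   # merged map M, shape (H, W, K)
    g = m.mean(axis=(0, 1))             # global feature descriptor, (K,)
    hidden = np.maximum(0.0, fc1 @ g)   # ReLU between the FC layers (assumed)
    v = sigmoid(fc2 @ hidden)           # channel importance weights V, (K,)
    return m * v                        # O^k = V^k * M^k per channel

rng = np.random.default_rng(2)
maps = [rng.standard_normal((4, 4, 8)) for _ in range(3)]  # e.g. D_2, D_3, D_4
out = aggregate(maps, rng.standard_normal((6, 24)), rng.standard_normal((24, 6)))
```

Since each channel weight lies in (0, 1), the aggregator attenuates less useful channels rather than amplifying any channel of the merged map.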
Based on the same inventive concept, the present scheme further provides a network model structure for the above method, comprising an encoding module, a feature extraction module, and a decoding module;
the encoding module performs multi-scale feature encoding based on a dual-branch feature extractor: with a dual-resolution image as input, it extracts multi-scale features of different levels and of the same level, wherein the multi-scale features of different levels come from different coding layers of the encoder, and the same-level multi-scale features come from the same coding layer fed with inputs of different resolutions;

the feature extraction module performs dense feature extraction based on a cross-scale context selector: it constructs the cross-scale context selector to extract context information from the dual-scale features of the same coding layer, obtains a feature weight map corresponding to the dual-scale features, and adaptively fuses the dual-scale feature maps according to the learned feature weight map;

the decoding module performs feature decoding based on the multi-scale feature aggregator: it first embeds the low-level geometric features of the encoder layer by layer into the high-level features of the decoder in a top-down feature transfer manner, obtaining semantic feature maps with detail information of different granularities; it then constructs a multi-scale context aggregator that applies a convolution operation to each semantic feature map, merges the maps along the channel direction, obtains a global feature description vector by global average pooling along the channel direction, models the context relations among the features, and finally obtains the importance weight vector of each channel's features for adaptive aggregation of the multi-layer semantic feature maps.
Refined semantic prediction based on encoder-decoder joint supervision. To enhance the effectiveness of feature expression during dense context feature extraction, the invention trains the network model in an encoder-decoder joint supervision manner: semantic supervision is applied to the final decoded features, and additional semantic supervision is applied to the highest-level features of the encoder to guide the encoded features through more effective gradient back-propagation, thereby further optimizing the semantic prediction results. The total loss of this joint supervision model can be expressed as follows:
L = λ_d · L_d(P_d, G) + λ_e · Σ_{n=1}^{N} L_e(P_e^n, G)   (4)

In formula 4, G denotes the semantic ground truth, and P_d and P_e^n denote the final prediction map from the decoder and the prediction map from the highest-level features of the encoder, respectively. N denotes the number of different-resolution images input to the network and is set to 2 in the dense context learning network. L_d and L_e denote the loss equations used to supervise P_d and P_e^n, respectively, and are computed using the region mutual information (RMI) loss. Furthermore, the weights λ_d and λ_e balance the importance of L_d and L_e in the overall loss function.
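Because the original equation image is not reproduced here, the exact combination of terms is an assumption; its overall shape (a weighted decoder term plus weighted encoder terms, one per input resolution) can be sketched as:

```python
def joint_loss(loss_decoder, losses_encoder, lam_d=1.0, lam_e=0.4):
    """Encoder-decoder joint supervision total loss (shape of the formula).

    loss_decoder:   scalar loss on the decoder's final prediction
                    (the RMI loss in this scheme)
    losses_encoder: per-resolution scalar losses on the encoder's
                    highest-level predictions (N entries, N = 2 here)
    lam_d, lam_e:   balancing weights (the default values are assumptions)
    """
    return lam_d * loss_decoder + lam_e * sum(losses_encoder)

total = joint_loss(1.0, [0.5, 0.3])
```

In training, the per-term losses would come from the RMI criterion evaluated against the semantic ground truth.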
Based on the same inventive concept, the scheme also designs electronic equipment, which comprises:
one or more processors;
a storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the above transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method.
Based on the same inventive concept, the present scheme also provides a computer readable medium on which a computer program is stored, characterized in that: when executed by a processor, the program implements the above transmission-distortion-resistant unmanned aerial vehicle oblique image semantic information extraction method.
It should be understood that the foregoing description of the preferred embodiments is relatively detailed and should not therefore be construed as limiting the scope of the invention; those of ordinary skill in the art may make substitutions or modifications under the teaching of the invention without departing from the scope of protection of the appended claims.