CN112966546A - Embedded attitude estimation method based on unmanned aerial vehicle scout image - Google Patents
Embedded attitude estimation method based on unmanned aerial vehicle scout image
- Publication number
- CN112966546A CN112966546A CN202110004413.7A CN202110004413A CN112966546A CN 112966546 A CN112966546 A CN 112966546A CN 202110004413 A CN202110004413 A CN 202110004413A CN 112966546 A CN112966546 A CN 112966546A
- Authority
- CN
- China
- Prior art keywords
- aerial vehicle
- unmanned aerial
- scout image
- network
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an embedded attitude estimation method based on unmanned aerial vehicle scout images, belonging to the field of image processing and machine vision. The method specifically comprises the following steps: acquiring an original unmanned aerial vehicle scout image data set and performing data enhancement processing on it; labeling the data set to obtain a labeled training data set; constructing a lightweight multi-stage hourglass network and training it with the training data set; and inputting an unmanned aerial vehicle scout image to be processed, preprocessing it, feeding the preprocessed image into the trained lightweight attitude estimation network to obtain a portrait feature map, and estimating the portrait attitude from that feature map. The technical scheme balances algorithm performance against deployment adaptability and solves several problems of attitude estimation in unmanned aerial vehicle video processing systems.
Description
Technical Field
The invention relates to the field of image processing and machine vision, and in particular to an embedded hourglass network for estimating the posture of small ground targets in unmanned aerial vehicle aerial video.
Background
In recent years, the unmanned aerial vehicle, as a new combat force, has played an irreplaceable role under intelligent combat conditions; unmanned aerial vehicle equipment technology is being vigorously developed and is of great strategic significance for improving the combat capability of troops. Attitude estimation is one of the key technologies for unmanned aerial vehicles performing reconnaissance and strike tasks, providing strong support for quickly and accurately identifying a target's intention, route of advance, and so on. An efficient and accurate attitude estimation algorithm can effectively reduce the burden on ground operators and improve reconnaissance capability and rapid-response combat efficiency.
Traditional algorithms for estimating the posture of small ground targets in unmanned aerial vehicle reconnaissance mainly obtain the coordinates of human body key points through image processing, from which a human skeleton model or contour model is derived that expresses human posture and behavior intuitively. Before 2015, body pose estimation methods generally aimed at regressing the exact coordinates of the body's key points. However, these methods generalize very poorly because of the flexibility of human motion.
The advantage of algorithms based on human body posture estimation is that the human body is converted into a posture skeleton diagram or contour diagram, which is concise and intuitive and suppresses background interference to a great extent. The disadvantage is that pose estimation is itself a relatively complex problem; when used as the front-end input of a posture detector, the detector's results are strongly affected by the quality of the pose estimation.
In an unmanned aerial vehicle video image processing system, the attitude estimation technology for a ground small target currently faces the following problems:
1) Complex human body images force the model to learn a highly nonlinear mapping, and learning this mapping is extremely difficult. The main reasons are: first, human body images are shot in different scenes, with different shooting angles and illumination conditions; second, interactions between people and objects, and between people, cause random occlusion; finally, different clothing and body types further increase the complexity of the mapping. Although human body posture estimation methods based on hand-crafted features can accurately locate unoccluded joints under fixed scenes, fixed viewpoints, and stable illumination, such ideal conditions do not exist in real scenes. Therefore, how to extract more robust features and learn complex mappings through representation learning is a problem that must be studied in human body posture estimation.
2) A highly nonlinear mapping requires a more complex model to learn, and a more complex model incurs a large computational overhead. How to guarantee model accuracy while accelerating the running speed of the human body posture estimation model is the key problem in making such methods practical.
Disclosure of Invention
In order to solve these problems, the technical scheme of the invention provides an embedded attitude estimation algorithm based on unmanned aerial vehicle scout images, designed around the characteristics of unmanned aerial vehicle video and the shortcomings of the domestic prior art in estimating the attitude of small ground targets. It balances algorithm performance against deployment adaptability and addresses several problems of attitude estimation in unmanned aerial vehicle video processing systems, mainly: 1) traditional attitude estimation is greatly influenced by the foreground; 2) traditional deep learning models are large and difficult to deploy on embedded equipment; 3) feature extraction is inefficient and feature fusion is poor; 4) the detection process must run in real time.
According to a first aspect of the present invention, an embedded pose estimation method based on a scout image of an unmanned aerial vehicle is provided, where the method specifically includes:
step 1, acquiring an original unmanned aerial vehicle scout image data set, and performing data enhancement processing on the original unmanned aerial vehicle scout image data set;
step 2, performing labeling processing on the original unmanned aerial vehicle reconnaissance image data set obtained in step 1 to obtain a labeled training data set;
step 3, constructing a lightweight multi-stage hourglass network, and training the lightweight multi-stage hourglass network by using the training data set;
step 4, inputting an unmanned aerial vehicle scout image to be processed, preprocessing it, inputting the preprocessed image into the trained lightweight attitude estimation network to obtain a portrait feature map, and estimating the portrait attitude according to the portrait feature map.
Further, in step 1, the data enhancement processing includes dilation, erosion, and bilateral filtering operations.
Further, in the step 2, the labeling processing of adding the Label is realized by an image labeling tool Label Img.
Further, the image annotation tool is Label Img.
Further, in step 3, the lightweight posture estimation network includes a convolutional layer, a pooling layer, a channel separation Module, a multilevel hourglass network formed by a plurality of Pyramid Residual Modules (PRMs), and a channel mixing Module.
Furthermore, the multi-stage hourglass network is a two-stage hourglass network and is composed of two pyramid residual modules.
Further, the convolutional layers are depth separable convolutional layers, including depth convolution processing and point convolution processing.
Further, the depth separable convolutional layer is specifically operative to:
for the common convolution with convolution kernel K, input channel number M and output channel number O, the method is divided into deep convolution processing and point convolution processing,
deep convolution processing: performing a K convolution operation on each input channel;
and (3) point convolution processing: performing linear fusion on the M characteristics, wherein the number of point convolutions is O,
wherein K, M, O are all positive integers.
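As a hedged illustration of the depthwise/pointwise split described above, the following pure-Python sketch runs a 1-D analogue of the operation (all names, shapes, and values are illustrative, not taken from the patent):

```python
def depthwise_separable_1d(x, depth_filters, point_weights):
    """Depthwise step: each of the M input channels is convolved with its own
    K-tap filter ('valid' 1-D convolution). Pointwise step: the M channel
    outputs are linearly fused into O output channels."""
    # depthwise: channel m convolved with filter m only (no cross-channel mixing)
    depth_out = []
    for xm, fm in zip(x, depth_filters):
        K = len(fm)
        depth_out.append([sum(fm[k] * xm[t + k] for k in range(K))
                          for t in range(len(xm) - K + 1)])
    # pointwise: output channel o is a weighted sum over the M channels
    T = len(depth_out[0])
    return [[sum(w[m] * depth_out[m][t] for m in range(len(depth_out)))
             for t in range(T)] for w in point_weights]

x = [[1, 2, 3], [4, 5, 6]]   # M = 2 input channels of length 3
depth = [[1, 1], [1, 0]]     # one K = 2 tap filter per input channel
point = [[1, 1], [0, 1]]     # O = 2 pointwise weight vectors
print(depthwise_separable_1d(x, depth, point))  # [[7, 10], [4, 5]]
```

The point of the split is visible in the structure: the depthwise loop never mixes channels, and all cross-channel fusion happens in the cheap pointwise sum.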
Further, the channel separation module includes a plurality of feature channels.
Further, the multi-stage hourglass network is an eight-stage hourglass network.
Further, each stage of the hourglass network is composed of lightweight pyramid residual modules of identical construction.
Further, the step 4 specifically includes:
step 41: inputting an unmanned aerial vehicle scout image to be processed, and intercepting the unmanned aerial vehicle scout image to obtain a reduced-size unmanned aerial vehicle scout image;
step 42: inputting the unmanned aerial vehicle scout image obtained in the step 41 into the lightweight attitude estimation network, and obtaining a first characteristic diagram after pooling and convolution;
step 43: inputting the first characteristic diagram into a multi-stage hourglass network through a channel separation module, and outputting a plurality of second characteristic diagrams;
step 44: and inputting the plurality of second feature maps into a channel mixing module, performing feature fusion on the plurality of second feature maps, and outputting a portrait feature map, thereby estimating the portrait posture according to the portrait feature map.
Further, the second feature map is a low resolution feature map.
According to a second aspect of the invention, there is provided a computer readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any of the above aspects.
According to a third aspect of the present invention there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any aspect are implemented when the program is executed by the processor.
Compared with the prior art, the invention has the following advantages:
1) The invention has high operating efficiency: with only a GTX 1050 graphics card, it can process a 1920 × 1080 video image in real time within 20 ms.
2) By replacing ordinary convolution with depthwise separable convolution, the invention further lightens the network while preserving estimation quality.
3) In application, the invention can transmit channels downward in groups, extract features separately, and reorder the channels when the features are finally fused. This reduces the number of channels during transmission while still propagating the image features of all parts, improves the correlation of image features, and further improves the posture estimation effect.
4) The invention fuses features using concatenation, which strengthens fusion between features so that each group of output channels includes all input features, enhancing information correlation and improving the attitude estimation efficiency for small ground targets.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in their description are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embedded attitude estimation method based on an unmanned aerial vehicle scout image according to the technical scheme of the invention;
FIG. 2 is a schematic diagram of a network model built up from a plurality of hourglass networks according to an aspect of the present invention;
fig. 3 is a schematic view of an hourglass network of light-weight PRMs of identical construction according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a depth separable convolution according to aspects of the present invention;
FIG. 5 is a schematic diagram of channel separation and recombination according to the present invention;
FIG. 6 is a diagram illustrating an original pyramid residual block according to an embodiment of the present invention;
fig. 7a and 7b are schematic views of a light-weight PRM according to an embodiment of the present invention;
fig. 8 is a diagram illustrating the detection results of the lightweight network for 16 human key points on the MPII dataset.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The term "a plurality" means two or more.
The term "and/or" as used in this disclosure merely describes an association between objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone.
The technical scheme of the invention provides an embedded attitude estimation method based on unmanned aerial vehicle scout images, built mainly on a pyramid residual module with a lightweight network design. Depthwise separable convolution replaces ordinary convolution to reduce the number of training parameters, and a channel separation module and a channel mixing module are added to change the channel dimension of the feature map and strengthen feature fusion. To ensure that the network can still extract features completely, only the identity mapping branch undergoes channel separation, and a channel mixing module is added at the final feature fusion. By adding depthwise separable convolution to the pyramid residual network and combining the channel separation and channel mixing modules, the network effectively reduces computation and storage while maintaining accuracy.
Multi-scale features are added on the basis of the pyramid residual module: features are extracted by convolution and then upsampled to the previous resolution for feature fusion.
Specifically, as shown in fig. 1, the following steps are included.
101, acquiring an original unmanned aerial vehicle scout image data set, and performing data enhancement processing on the original unmanned aerial vehicle scout image data set;
102, performing labeling processing on the original unmanned aerial vehicle reconnaissance image data set obtained in the step 1 to obtain a training data set with a label;
103, constructing a lightweight multi-stage hourglass network, and training the lightweight multi-stage hourglass network by using the training data set;
104, inputting an unmanned aerial vehicle scout image to be processed, preprocessing it, inputting the preprocessed image into the trained lightweight attitude estimation network to obtain a portrait feature map, and estimating the portrait attitude according to the portrait feature map.
The following describes key technologies related to the technical solutions of the present invention in detail with reference to the drawings.
Pyramid residual network
The hourglass network detects human body posture well, and stacking multiple hourglass networks continuously refines the detection result. Each hourglass network combines features at multiple resolutions; it is a modular network that applies a residual module for feature extraction several times at each stage. In the pyramid residual network based on the hourglass network, as shown in fig. 2, the image first passes through a 7 × 7 convolutional layer, a pooling layer, and a PRM, reducing the resolution to 64 × 64; it then passes through each hourglass network in turn, each followed by relay (intermediate) supervision to prevent vanishing gradients. The structure of each hourglass network is shown in fig. 3: pooling layers reduce the resolution step by step down to 4 × 4, after which features are extracted by pyramid residual modules, and multi-resolution features are continuously combined for effective attitude estimation. Because every module in the network is a pyramid residual module, the whole network can be improved by a lightweight redesign of this one module, changing its number of channels and convolution mode.
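The stacking-with-relay-supervision idea can be sketched in a few lines of Python. This is a toy stand-in, not the actual network: the lambda "hourglasses" are placeholders for real hourglass stages.

```python
def stacked_hourglass(x, hourglasses):
    """Pass the input through each hourglass stage in turn. After every stage
    an intermediate output is recorded; during training, relay (intermediate)
    supervision would attach a loss to each of these to fight vanishing
    gradients."""
    intermediate = []
    for hg in hourglasses:
        x = hg(x)                # each stage refines the previous estimate
        intermediate.append(x)   # a loss head attaches here during training
    return x, intermediate

# toy stand-in: each "hourglass" just increments a scalar feature
out, heads = stacked_hourglass(0, [lambda v: v + 1] * 8)
print(out, len(heads))  # final output after 8 stages, and 8 supervision heads
```

The design point is that gradients reach the early stages directly through the intermediate heads rather than only through the full stack.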
Designing a lightweight network:
depth separable convolution
Depthwise separable convolution is divided into two parts, depthwise convolution and pointwise convolution, as shown in fig. 4. With a K × K convolution kernel, M input channels, and O output channels, the depthwise step performs a K × K convolution on each channel separately; the pointwise step then linearly fuses the M features, with the number of pointwise convolutions equal to the number of output channels.
For an input image of size Y × Z × M, the amount of computation through a common convolution is:
Y×Z×M×O×K×K (1)
the amount of computation through the depth separable convolution is:
Y×Z×M×O+Y×Z×M×K×K (2)
by comparison, when the convolution kernel is 3 × 3, the computation of each convolution is reduced to roughly 1/9, and this convolution mode still combines the features of every channel effectively.
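The two cost formulas (1) and (2) can be checked numerically; the sketch below simply evaluates them (the tensor sizes chosen are illustrative):

```python
def conv_cost(Y, Z, M, O, K):
    """Multiply counts for a standard convolution (Eq. 1) versus a
    depthwise-separable convolution (Eq. 2) on a Y x Z x M input with
    O output channels and a K x K kernel."""
    standard = Y * Z * M * O * K * K   # Eq. (1): every filter sees all channels
    depthwise = Y * Z * M * K * K      # one K x K filter per input channel
    pointwise = Y * Z * M * O          # 1 x 1 convolutions fusing the M channels
    return standard, depthwise + pointwise

std, sep = conv_cost(Y=64, Z=64, M=128, O=128, K=3)
# the ratio is 1/(K*K) + 1/O, i.e. close to the text's "about 1/9" for K = 3
print(sep / std)
```

For K = 3 and a reasonably large O the ratio is just over 1/9, matching the approximation in the text.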
Channel separation recombination
Channel separation and recombination are shown in fig. 5: in application, channels can be transmitted in groups, features extracted separately, and the channels reordered at the final feature fusion. This reduces the number of channels during transmission while still propagating the image features of all parts to later layers and improving their correlation.
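The separation-then-reorder step can be illustrated with a minimal channel shuffle over channel indices (a ShuffleNet-style sketch; the group count and channel count are illustrative):

```python
def channel_shuffle(channels, groups):
    """View the flat channel list as a (groups, per_group) grid, transpose it,
    and flatten, so that features from different groups end up interleaved
    when the groups are fused."""
    n = len(channels)
    assert n % groups == 0, "channel count must divide evenly into groups"
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

print(channel_shuffle(list(range(6)), groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, any subsequent grouped operation sees channels drawn from every original group, which is exactly the "each output group includes all input features" property the text describes.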
Lightweight PRM
Every module in the hourglass network is a pyramid residual module. As shown in fig. 6, multi-scale features are added on the basis of the residual module, with a customizable number of scales; after features are extracted by convolution, they are upsampled to the previous resolution for feature fusion.
Based on this analysis, the invention designs a lightweight pyramid residual module. As shown in fig. 7a and 7b, ordinary convolution is replaced with depthwise separable convolution. Experiments showed that although depthwise separable convolution reduces the parameter count and computation, its effect on actual computation speed is poor, so the invention replaces only the convolution of the original-resolution branch. A channel separation module is added at the beginning of the module; to let the network extract more features, the number of channels in the feature extraction branch is not reduced. Instead, half of the channels are selected on the identity mapping branch and the features are fused by concatenation. Since direct fusion would leave half of the channels carrying little extracted-feature information, a channel recombination module is added afterward to reorder the channels. This strengthens fusion between features, so each group of output channels includes all input features, and information correlation is enhanced.
Examples
The proposed network was trained on the MPII human pose estimation dataset, with results shown in fig. 8. The dataset contains about 25000 images and 40000 labeled samples (annotated with Label Img), of which 28000 are used for training and 11000 for testing. Training ran on Ubuntu for 250 iterations with a batch size of 6, using the Torch7 framework and two NVIDIA 1080Ti GPUs. Using the Percentage of Correct Keypoints (PCK) as the accuracy metric, the evaluation results were good.
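The PCK metric used above counts a key point as correct when its distance to the ground truth falls under a fraction of a reference scale. A minimal sketch follows (the threshold, scale, and sample points are illustrative, not from the patent's evaluation):

```python
def pck(pred, gt, threshold, scale):
    """Percentage of Correct Keypoints: a prediction counts as correct when
    its Euclidean distance to the ground truth is at most threshold * scale
    (the scale is typically a head or torso size)."""
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= threshold * scale
    )
    return correct / len(gt)

gt = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred = [(11, 10), (20, 25), (30, 30), (80, 80)]
print(pck(pred, gt, threshold=0.5, scale=10))  # 3 of 4 joints within 5 px -> 0.75
```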
First, a 1080 × 1920 drone scout image is acquired, cut to size 227 × 227 by windowing, and data-enhanced by dilation, erosion, and bilateral filtering.
Here, dilation (dilate) is the operation of finding a local maximum: it expands the boundary of an object, with the exact result depending on the image and the structuring element. Erosion (erode) is the opposite operation, finding a local minimum: it causes the highlighted (bright) areas in the image to gradually shrink.
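Grayscale dilation and erosion as described here are just local max / min filters. The pure-Python sketch below shows the idea on a tiny image; in practice one would use an image processing library, and the 3 × 3 window is illustrative:

```python
def _window_op(img, k, op):
    """Apply op (max for dilation, min for erosion) over each k x k window,
    clipping the window at the image border."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[op(img[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)] for i in range(h)]

def dilate(img, k=3):
    """Local maximum: expands bright regions (object boundaries grow)."""
    return _window_op(img, k, max)

def erode(img, k=3):
    """Local minimum: shrinks bright regions (highlights gradually decrease)."""
    return _window_op(img, k, min)

img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(dilate(img))  # the single bright pixel spreads over its 3x3 neighbourhood
print(erode(img))   # the single bright pixel is removed
```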
Second, a first feature map is output after three rounds of pooling (Max Pool) and convolution; it is then fed through the channel separation module (Split) into the multi-stage hourglass network, which outputs a plurality of low-resolution second feature maps.
Wherein the present embodiment uses eight hourglass network stacks as the overall network framework.
Finally, the plurality of second feature maps are subjected to feature fusion through a channel mixing module (Merge), and a portrait feature map is output, so that the portrait posture is estimated according to the portrait feature map.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (10)
1. An embedded attitude estimation method based on unmanned aerial vehicle scout images is characterized by specifically comprising the following steps:
step 1: acquiring an original unmanned aerial vehicle scout image data set, and performing data enhancement processing on the original unmanned aerial vehicle scout image data set;
step 2: performing labeling processing on the original unmanned aerial vehicle reconnaissance image data set obtained in the step 1 to obtain a training data set with a label;
and step 3: constructing a lightweight attitude estimation network, and training the lightweight attitude estimation network by using the training data set;
and 4, step 4: the method comprises the steps of obtaining an unmanned aerial vehicle scout image to be processed, preprocessing the unmanned aerial vehicle scout image, inputting the preprocessed unmanned aerial vehicle scout image into a trained lightweight attitude estimation network to obtain a portrait feature map, and estimating the portrait attitude according to the portrait feature map.
2. The embedded pose estimation method according to claim 1, wherein in step 1, the data enhancement process comprises dilation, erosion and bilateral filtering operations.
3. The embedded pose estimation method according to claim 1, wherein in the step 2, labeling processing for adding labels is realized by an image labeling tool.
4. The embedded pose estimation method of claim 1, wherein in step 3, the lightweight pose estimation network comprises a convolutional layer, a pooling layer, a channel separation module, a multi-stage hourglass network composed of a plurality of pyramid residual modules, and a channel mixing module.
5. The embedded pose estimation method of claim 4, wherein the convolutional layers are depthwise separable convolutional layers, each comprising a depthwise convolution and a pointwise convolution.
6. The embedded pose estimation method of claim 5, wherein the depthwise separable convolutional layer operates as follows:
a standard convolution with a K×K kernel, M input channels, and O output channels is decomposed into a depthwise convolution and a pointwise convolution,
depthwise convolution: applying one K×K convolution to each input channel;
pointwise convolution: linearly fusing the resulting M feature maps, the number of pointwise convolutions being O,
wherein K, M, and O are all positive integers.
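The benefit of the decomposition in claim 6 is a large reduction in parameters: a standard convolution needs K·K·M·O weights, while the depthwise part needs only K·K·M and the pointwise part M·O. A small sketch of the two counts (function names are illustrative):

```python
def standard_conv_params(K, M, O):
    # one K x K x M kernel per output channel: K*K*M weights for each of O outputs
    return K * K * M * O

def depthwise_separable_params(K, M, O):
    # depthwise: one K x K kernel per input channel (M kernels)
    # pointwise: O kernels of size 1 x 1 x M linearly fusing the M feature maps
    return K * K * M + M * O

# e.g. K=3, M=64, O=128: 73728 vs 8768 weights, roughly an 8x reduction
```

The ratio approaches 1/O + 1/K² as the channel counts grow, which is why depthwise separable layers dominate lightweight architectures of this kind.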
7. The embedded pose estimation method of claim 4, wherein the channel separation module comprises a plurality of independent feature channels.
8. The embedded pose estimation method according to claim 4, wherein step 4 specifically comprises:
Step 41: inputting an unmanned aerial vehicle scout image to be processed and cropping it to obtain a reduced-size scout image;
Step 42: inputting the scout image obtained in step 41 into the trained lightweight pose estimation network and performing pooling and convolution to obtain a first feature map;
Step 43: passing the first feature map through the channel separation module into the multi-stage hourglass network, which outputs a plurality of second feature maps;
Step 44: inputting the plurality of second feature maps into the channel mixing module, performing feature fusion on them, and outputting a portrait feature map from which the portrait pose is estimated.
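The channel separation of step 43 and the channel mixing of step 44 can be sketched as a ShuffleNet-style split followed by a channel shuffle that re-fuses the branch outputs. This is one plausible reading of the claims, not the patent's definitive implementation; all names are illustrative:

```python
import numpy as np

def channel_split(x, branches):
    # split the C channels of a (C, H, W) feature map into independent groups,
    # one per hourglass branch (the channel separation module)
    return np.split(x, branches, axis=0)

def channel_mix(feature_maps):
    # concatenate the branch outputs and interleave their channels so that
    # information flows between the formerly separate groups (channel mixing)
    x = np.concatenate(feature_maps, axis=0)
    g = len(feature_maps)
    c, h, w = x.shape
    return x.reshape(g, c // g, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)
```

The shuffle is a pure reindexing (no learned weights), which is what keeps this fusion step cheap enough for an embedded deployment.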
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the program is executed by the processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110004413.7A CN112966546A (en) | 2021-01-04 | 2021-01-04 | Embedded attitude estimation method based on unmanned aerial vehicle scout image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112966546A true CN112966546A (en) | 2021-06-15 |
Family
ID=76271221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110004413.7A Pending CN112966546A (en) | 2021-01-04 | 2021-01-04 | Embedded attitude estimation method based on unmanned aerial vehicle scout image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966546A (en) |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239728A (en) * | 2017-01-04 | 2017-10-10 | 北京深鉴智能科技有限公司 | Unmanned plane interactive device and method based on deep learning Attitude estimation |
US20180182109A1 (en) * | 2016-12-22 | 2018-06-28 | TCL Research America Inc. | System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles |
CN108960211A (en) * | 2018-08-10 | 2018-12-07 | 罗普特(厦门)科技集团有限公司 | A kind of multiple target human body attitude detection method and system |
WO2019000325A1 (en) * | 2017-06-29 | 2019-01-03 | 深圳市大疆创新科技有限公司 | Augmented reality method for aerial photography of unmanned aerial vehicle, processor, and unmanned aerial vehicle |
CN109766887A (en) * | 2019-01-16 | 2019-05-17 | 中国科学院光电技术研究所 | A kind of multi-target detection method based on cascade hourglass neural network |
CN110175524A (en) * | 2019-04-26 | 2019-08-27 | 南京航空航天大学 | A kind of quick vehicle checking method of accurately taking photo by plane based on lightweight depth convolutional network |
CN110781765A (en) * | 2019-09-30 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Human body posture recognition method, device, equipment and storage medium |
CN111079556A (en) * | 2019-11-25 | 2020-04-28 | 航天时代飞鸿技术有限公司 | Multi-temporal unmanned aerial vehicle video image change area detection and classification method |
CN111160085A (en) * | 2019-11-19 | 2020-05-15 | 天津中科智能识别产业技术研究院有限公司 | Human body image key point posture estimation method |
CN111192267A (en) * | 2019-12-31 | 2020-05-22 | 航天时代飞鸿技术有限公司 | Multisource perception fusion remote sensing image segmentation method based on UNET network and application |
CN111461008A (en) * | 2020-03-31 | 2020-07-28 | 华南理工大学 | Unmanned aerial vehicle aerial shooting target detection method combining scene perspective information |
WO2020164270A1 (en) * | 2019-02-15 | 2020-08-20 | 平安科技(深圳)有限公司 | Deep-learning-based pedestrian detection method, system and apparatus, and storage medium |
CN111680655A (en) * | 2020-06-15 | 2020-09-18 | 深延科技(北京)有限公司 | Video target detection method for aerial images of unmanned aerial vehicle |
CN111696033A (en) * | 2020-05-07 | 2020-09-22 | 中山大学 | Real image super-resolution model and method for learning cascaded hourglass network structure based on angular point guide |
CN111815577A (en) * | 2020-06-23 | 2020-10-23 | 深圳供电局有限公司 | Method, device, equipment and storage medium for processing safety helmet wearing detection model |
CN111860175A (en) * | 2020-06-22 | 2020-10-30 | 中国科学院空天信息创新研究院 | Unmanned aerial vehicle image vehicle detection method and device based on lightweight network |
CN112101259A (en) * | 2020-09-21 | 2020-12-18 | 中国农业大学 | Single pig body posture recognition system and method based on stacked hourglass network |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116434127A (en) * | 2023-06-14 | 2023-07-14 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium |
CN116434127B (en) * | 2023-06-14 | 2023-11-07 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647585B (en) | Traffic identifier detection method based on multi-scale circulation attention network | |
CN110246181B (en) | Anchor point-based attitude estimation model training method, attitude estimation method and system | |
CN112528976B (en) | Text detection model generation method and text detection method | |
CN111862126A (en) | Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm | |
CN107329962B (en) | Image retrieval database generation method, and method and device for enhancing reality | |
CN107564009B (en) | Outdoor scene multi-target segmentation method based on deep convolutional neural network | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN110619638A (en) | Multi-mode fusion significance detection method based on convolution block attention module | |
CN112365511B (en) | Point cloud segmentation method based on overlapped region retrieval and alignment | |
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN109461177B (en) | Monocular image depth prediction method based on neural network | |
CN111462140B (en) | Real-time image instance segmentation method based on block stitching | |
CN110348531B (en) | Deep convolution neural network construction method with resolution adaptability and application | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN112163498A (en) | Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof | |
CN112163447B (en) | Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet | |
JP2021096850A (en) | Parallax estimation system and method, electronic apparatus, and computer readable storage medium | |
CN110807379A (en) | Semantic recognition method and device and computer storage medium | |
CN113850136A (en) | Yolov5 and BCNN-based vehicle orientation identification method and system | |
CN112528858A (en) | Training method, device, equipment, medium and product of human body posture estimation model | |
CN111914596B (en) | Lane line detection method, device, system and storage medium | |
CN112669452B (en) | Object positioning method based on convolutional neural network multi-branch structure | |
CN112966546A (en) | Embedded attitude estimation method based on unmanned aerial vehicle scout image | |
CN113298922A (en) | Human body posture estimation method and device and terminal equipment | |
CN111931793B (en) | Method and system for extracting saliency target |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||