CN116778528A - Unmanned aerial vehicle image-based human body analysis network training method and device - Google Patents


Info

Publication number
CN116778528A
CN116778528A (application number CN202310748727.7A)
Authority
CN
China
Prior art keywords
human body
feature
layer
processing
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310748727.7A
Other languages
Chinese (zh)
Inventor
孙凯
贺振中
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huasairuifei Intelligent Technology Co ltd
Original Assignee
Shenzhen Huasairuifei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huasairuifei Intelligent Technology Co ltd filed Critical Shenzhen Huasairuifei Intelligent Technology Co ltd
Priority to CN202310748727.7A priority Critical patent/CN116778528A/en
Publication of CN116778528A publication Critical patent/CN116778528A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A human body analysis network training method, device and medium based on unmanned aerial vehicle images. Data enhancement and image preprocessing are performed on a plurality of unmanned aerial vehicle images to obtain a training image set. An encoder extracts features from the training images to obtain a plurality of semantic feature maps, and the semantic feature maps meeting the screening requirement are taken as target feature maps. A detail extractor performs boundary supervision on the target feature maps to obtain a boundary two-classification result. An improved spatial pyramid pooling module, together with boundary residuals, performs multi-scale pooling on a preset-layer feature map among the semantic feature maps to obtain a plurality of human body analysis results. The human body analysis network is then trained with a final loss function obtained by summing the cross entropy loss functions corresponding to the plurality of human body analysis results and the two-classification loss function corresponding to the boundary two-classification result, and is optimized and adjusted to obtain a standard human body analysis network. The standard human body analysis network can improve the parsing accuracy of human body analysis while reducing the model parameter count.

Description

Unmanned aerial vehicle image-based human body analysis network training method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a human body analysis network training method, device and medium based on unmanned aerial vehicle images.
Background
Unmanned aerial vehicles are widely used in security protection and in tracking or identifying suspected persons. Because of the particular constraints of the unmanned aerial vehicle platform, however, the algorithms designed for it must run in real time with high accuracy. Human body recognition based on an unmanned aerial vehicle means performing fine-grained analysis of human bodies in unmanned aerial vehicle images, that is, segmenting a body into parts such as the head, clothes and legs. Human body analysis therefore belongs to fine-grained classification tasks.
The difficulty of human body analysis based on unmanned aerial vehicle images is that the images are captured from the air or at long range, so a human body occupies only a small part of each image; the task is therefore small-target semantic segmentation. Traditional machine-learning approaches to human body analysis on unmanned aerial vehicle images achieve low accuracy and only segment the body coarsely. Existing deep-learning algorithms can achieve high accuracy, but their large parameter counts prevent fast inference for human body recognition on unmanned aerial vehicle images.
Disclosure of Invention
The invention mainly addresses the technical problem of improving the parsing accuracy of a human body analysis network on unmanned aerial vehicle images while reducing the model parameter count.
According to a first aspect, in one embodiment, a human body analysis network training method based on unmanned aerial vehicle images is provided, including:
acquiring a plurality of unmanned aerial vehicle images, performing data enhancement processing on the unmanned aerial vehicle images, and performing image preprocessing on the data-enhanced images to obtain a training image set;
performing boundary residual calculation on the training images in the training image set to obtain boundary residual;
extracting features of training images in the training image set by using a preset encoder to obtain a plurality of semantic feature images, and selecting semantic feature images meeting screening requirements from the plurality of semantic feature images as target feature images;
performing boundary supervision processing on the target feature map by using a preset detail extractor to obtain a boundary classification result;
carrying out multi-scale pooling on a preset layer number feature map in the plurality of semantic feature maps by utilizing an improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results;
constructing corresponding cross entropy loss functions based on the multiple human body analysis results, constructing a corresponding two-classification loss function according to the boundary two-classification result, performing network training on a human body analysis network by using a final loss function obtained by summing the cross entropy loss functions and the two-classification loss function, and optimizing and adjusting the trained human body analysis network to obtain a standard human body analysis network, wherein the human body analysis network is constructed and generated from a preset encoder, an improved spatial pyramid pooling module and a preset detail extractor.
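As a rough illustration only (not the patent's implementation), the summed final loss described above can be sketched in NumPy. The array shapes and the function names `cross_entropy`, `binary_cross_entropy` and `final_loss` are assumptions for this sketch:

```python
import numpy as np

def cross_entropy(logits, labels):
    # logits: (C, H, W) parsing logits; labels: (H, W) integer class map
    z = logits - logits.max(axis=0, keepdims=True)          # stabilise softmax
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    return -log_probs[labels, np.arange(h)[:, None], np.arange(w)].mean()

def binary_cross_entropy(logits, targets):
    # logits, targets: (H, W); targets is the 0/1 boundary mask
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    return -(targets * np.log(p + eps)
             + (1 - targets) * np.log(1 - p + eps)).mean()

def final_loss(parsing_logits_list, labels, boundary_logits, boundary_mask):
    # One cross-entropy term per human body analysis result,
    # plus the boundary two-classification term, summed.
    ce = sum(cross_entropy(l, labels) for l in parsing_logits_list)
    return ce + binary_cross_entropy(boundary_logits, boundary_mask)
```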
In an embodiment, the calculating the boundary residual error of the training image in the training image set to obtain the boundary residual error includes:
performing downsampling processing on training images in the training image set to obtain a downsampled image set;
selecting images which are in the downsampled image set and accord with the preset layer number as the images of the layer, and selecting images which have preset intervals with the images of the layer as the images of the lower layer;
and carrying out up-sampling processing on the lower layer image to obtain an up-sampling image, and calculating the difference between the current layer image and the up-sampling image to obtain a boundary residual error.
In an embodiment, the performing multi-scale pooling on the preset layer number feature map in the plurality of semantic feature maps by using the improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results includes:
inputting a preset layer number feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooling features;
performing feature stitching processing on the plurality of pooled features to obtain stitching features, and performing dimension reduction processing on the stitching features by using a convolution layer to obtain dimension-reduced stitching features;
performing feature selection processing on the reduced-dimension spliced features by using a preset feature refining module to obtain final output features;
and carrying out human body analysis processing according to the final output characteristics and the boundary residual error to obtain a plurality of human body analysis results.
In an embodiment, the inputting the preset layer number feature map into the improved spatial pyramid pooling module to perform feature convolution processing to obtain multiple pooling features includes:
performing convolution processing on the preset layer number feature map to obtain a first layer pooling feature of the improved spatial pyramid pooling module;
Carrying out cavity convolution processing on the preset layer number feature map by using cavity convolution layers with different cavity rates respectively to obtain a plurality of cavity pooling features;
and summarizing the first layer of pooling features and the plurality of cavity pooling features to obtain a plurality of pooling features.
In an embodiment, the performing feature selection processing on the reduced-dimension stitching feature by using a preset feature refining module to obtain a final output feature includes:
carrying out average pooling treatment on the spliced features subjected to dimension reduction by using a global average pooling layer to obtain channel weight features;
carrying out convolution processing, normalization processing and activation processing on the channel weight characteristics to obtain final weight characteristics;
and multiplying the channel weight characteristic and the final weight characteristic to obtain a final output characteristic.
In an embodiment, the performing human body analysis processing according to the final output feature and the boundary residual error to obtain a plurality of human body analysis results includes:
inputting the final output characteristics into a multi-layer convolution layer for convolution processing to obtain a first analysis result;
respectively carrying out up-sampling treatment on the final output characteristic and the first analysis result to obtain a first up-sampling characteristic and a first up-sampling result, carrying out characteristic splicing treatment on the first up-sampling characteristic, the first up-sampling result and the boundary residual error to obtain an initial splicing characteristic, and carrying out convolution treatment and characteristic refining treatment on the initial splicing characteristic to obtain a second analysis result;
Re-executing up-sampling processing, feature splicing processing, convolution processing and feature refining processing on the second analysis result and the boundary residual error until a third analysis result and a fourth analysis result are obtained;
and summarizing the first analysis result, the second analysis result, the third analysis result and the fourth analysis result to obtain a plurality of human body analysis results.
In an embodiment, the performing data enhancement processing on the multiple unmanned aerial vehicle images includes:
the data enhancement processing includes image panning, image flipping, random brightness transformation, and median filtering.
According to a second aspect, in one embodiment, there is provided a human body recognition method based on an image of an unmanned aerial vehicle, including:
acquiring a human body image to be analyzed;
inputting the human body image to be analyzed into a standard human body analysis network to perform human body analysis processing to obtain a human body analysis result, wherein the standard human body analysis network is trained by the method of any one of claims 1 to 7.
According to a third aspect, an embodiment of the present invention provides an apparatus comprising:
a memory for storing a program;
a processor for implementing the unmanned aerial vehicle image-based human body analysis network training method according to any one of the above by executing the program stored in the memory.
According to the human body analysis network training method, device and medium based on the unmanned aerial vehicle image described above, a human body analysis network is constructed and generated from a preset encoder, an improved spatial pyramid pooling module and a preset detail extractor, and is then trained, optimized and adjusted to obtain a standard human body analysis network. The improved spatial pyramid pooling module enables the standard human body analysis network to extract features at different scales, improving the recognition of small targets. The preset detail extractor performs boundary supervision on the image, increasing the accuracy of boundary recognition. At the same time, the boundary residuals obtained by boundary residual calculation are incorporated into the training process, which alleviates the boundary segmentation problem in human body analysis. The standard human body analysis network obtained after training can therefore improve the accuracy of human body analysis in unmanned aerial vehicle images.
Drawings
Fig. 1 is a human body analysis network training flow chart based on an unmanned aerial vehicle image according to an embodiment of the application;
FIG. 2 is a schematic diagram of a boundary residual calculation flow according to an embodiment;
FIG. 3 is a schematic diagram of a multi-scale pooling process according to one embodiment;
FIG. 4 is a schematic diagram of a multi-scale pooling process according to another embodiment;
FIG. 5 is a schematic diagram of a multi-scale pooling process according to another embodiment;
FIG. 6 is a schematic diagram of a multi-scale pooling process according to another embodiment;
fig. 7 is a schematic diagram of a human body recognition flow based on an image of an unmanned aerial vehicle according to another embodiment;
fig. 8 is a block diagram of a human body analysis network training device based on an unmanned aerial vehicle image according to an embodiment of the application.
Detailed Description
The application will be described in further detail below with reference to the drawings by means of specific embodiments, in which like elements in different embodiments bear like reference numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials, or methods, in different situations. In some instances, operations related to the application are not shown or described in the specification in order to avoid obscuring its core portions; for persons skilled in the art, a detailed description of such operations is unnecessary, as they can be fully understood from the description together with general knowledge in the art.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning. The term "coupled" as used herein includes both direct and indirect coupling (coupling), unless otherwise indicated.
In the embodiments of the application, a human body analysis network is constructed and generated from a preset encoder, an improved spatial pyramid pooling module and a preset detail extractor, a boundary residual calculation process is added, and the human body analysis network is trained, optimized and adjusted using the calculated boundary residuals to obtain a standard human body analysis network. The accuracy of human body analysis in unmanned aerial vehicle images using the standard human body analysis network can thereby be improved.
Referring to fig. 1, some embodiments of the present invention provide a human body analysis network training method based on unmanned aerial vehicle images, which includes steps S10 to S60, and is specifically described below.
Step S10: acquiring a plurality of unmanned aerial vehicle images, performing data enhancement processing on the plurality of unmanned aerial vehicle images, and performing image preprocessing on the data-enhanced images to obtain a training image set.
In some embodiments, performing data enhancement processing on a plurality of unmanned aerial vehicle images includes:
the data enhancement processes include image panning, image flipping, random brightness transformation, and median filtering.
In some embodiments, the data enhancement processing is performed on the multiple unmanned aerial vehicle images, so that the generalization capability of the model in the process of model training can be enhanced.
In some embodiments, image preprocessing is performed on the data-enhanced image to obtain a training image set, wherein the image preprocessing comprises image cropping and image normalization.
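A minimal sketch, under assumed image sizes and augmentation magnitudes, of the enhancement and preprocessing steps named above (panning, flipping, random brightness, median filtering, then cropping and normalization). The shift range, brightness range, crop size and ImageNet normalization statistics are assumptions for illustration:

```python
import numpy as np

def augment(img, rng):
    # img: (H, W, 3) uint8 drone frame; each step applied with probability 0.5
    if rng.random() < 0.5:                       # image flipping
        img = img[:, ::-1]
    if rng.random() < 0.5:                       # image panning (horizontal shift)
        img = np.roll(img, int(rng.integers(-8, 9)), axis=1)
    if rng.random() < 0.5:                       # random brightness transformation
        delta = int(rng.integers(-30, 31))
        img = np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)
    if rng.random() < 0.5:                       # 3x3 median filtering, per channel
        p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
        shifts = [p[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)]
        img = np.median(np.stack(shifts), axis=0).astype(np.uint8)
    return img

def preprocess(img, crop=256,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    # centre-crop, then normalise to assumed ImageNet statistics
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop].astype(np.float32) / 255.0
    return (patch - mean) / std
```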
Step S20: and carrying out boundary residual calculation on the training images in the training image set to obtain boundary residual.
Referring to fig. 2, in some embodiments, step S20 performs boundary residual calculation on the training images in the training image set, and obtaining the boundary residual includes steps S21 to S23, which are described in detail below.
Step S21: and carrying out downsampling treatment on the training images in the training image set to obtain a downsampled image set.
Step S22: and selecting images which are in the downsampled image set and accord with the preset layer number as the images of the layer, and selecting images which have preset intervals with the images of the layer as the images of the lower layer.
In some embodiments, the downsampled image set includes a layer-1 image, a layer-2 image, ..., a layer-k image and a layer-(k+1) image. The layer-k image is selected as the present-layer image, and the layer-(k+1) image is selected as the lower-layer image, where k is the layer index of the Laplacian pyramid.
Step S23: and performing up-sampling processing on the lower layer image to obtain an up-sampling image, and calculating the difference between the current layer image and the up-sampling image to obtain a boundary residual error.
In some embodiments, the lower-layer image L_{k+1} is upsampled to obtain the upsampled image up(L_{k+1}), and the boundary residual is the difference between the present-layer image and the upsampled image:

R_k = L_k - up(L_{k+1})

where R_k is the boundary residual, L_k is the present-layer image, up(L_{k+1}) is the upsampled image, and k is the layer index of the Laplacian pyramid.
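The boundary residual computation of steps S21 to S23 can be illustrated with a short NumPy sketch. The 2x average-pool reduce and nearest-neighbour expand below are stand-ins for whatever pyramid filters an actual implementation uses:

```python
import numpy as np

def downsample(img):
    # 2x average-pool downsampling (stand-in for the pyramid reduce step);
    # assumes even height and width
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w]
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def upsample(img):
    # nearest-neighbour 2x expansion, the up(.) in R_k = L_k - up(L_{k+1})
    return img.repeat(2, axis=0).repeat(2, axis=1)

def boundary_residual(l_k):
    # The lower-layer image L_{k+1} is the next (coarser) pyramid level;
    # the residual keeps the high-frequency boundary detail.
    l_k1 = downsample(l_k)
    return l_k - upsample(l_k1)
```

A flat image yields a zero residual, while intensity edges survive into the residual, which is why it helps boundary segmentation.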
Step S30: and extracting features of the training images in the training image set by using a preset encoder to obtain a plurality of semantic feature images, and selecting the semantic feature images meeting the screening requirements from the plurality of semantic feature images as target feature images.
In some embodiments, the preset encoder may be a ResNet18 encoder. The ResNet18 encoder has a four-layer encoding structure, so feature extraction performed on the training image by the ResNet18 encoder yields four layers of feature maps, denoted I_1, I_2, I_3 and I_4.
In some embodiments, the network formula for the ResNet18 encoder is as follows:

F(·) = (F_4 ∘ F_3 ∘ F_2 ∘ F_1)(·)

where F_k(·) denotes the k-th layer encoder: F_4 the fourth-layer encoder, F_3 the third-layer encoder, F_2 the second-layer encoder, and F_1 the first-layer encoder.
In some embodiments, the filtering requirement is a predefined requirement, and the filtering requirement is related to filtering of feature maps corresponding to different layer coding structures in a preset encoder, and the feature map corresponding to the second layer coding structure is selected as the target feature map.
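For illustration, the four-layer composition and the selection of the second-layer feature map can be mimicked with placeholder stages. The real encoder uses ResNet18 convolutional blocks; the halve-spatial-size, double-channel rule below is only an assumption about the usual ResNet layout:

```python
import numpy as np

def stage(x):
    # Stand-in for one ResNet18 encoding stage: halve the spatial size
    # and double the channel count (the real stages use convolutions).
    c, h, w = x.shape
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return np.concatenate([pooled, pooled], axis=0)

def encoder(x):
    # F = F4 o F3 o F2 o F1: apply four stages and keep every
    # intermediate semantic feature map I1..I4.
    feats = []
    for _ in range(4):
        x = stage(x)
        feats.append(x)
    return feats

# The screening requirement in the text selects the second-layer map:
# target_feature = encoder(img)[1]
```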
Step S40: and carrying out boundary supervision processing on the target feature map by using a preset detail extractor to obtain a boundary classification result.
In some embodiments, the target feature map is input to a preset detail extractor to obtain a boundary classification result, wherein the boundary classification result represents the boundary information of the input human body, and the recognition capability of the model on the human body boundary is improved.
In some embodiments, the boundary two-classification result is expressed as Seg_detail ∈ R^(2×H×W), where H is the height of the target feature map and W is the width of the target feature map.
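A hypothetical sketch of such a detail extractor head: a 1x1 convolution mapping a (C, H, W) target feature map to two-channel boundary logits of shape (2, H, W), followed by a per-pixel boundary/non-boundary decision. The weight shapes are illustrative, not the patent's:

```python
import numpy as np

def detail_extractor(feat, weight, bias):
    # feat: (C, H, W) target feature map; weight: (2, C); bias: (2,).
    # A 1x1 convolution is a per-pixel channel mix, so it reduces to
    # a tensor contraction over the channel axis.
    logits = np.tensordot(weight, feat, axes=([1], [0])) + bias[:, None, None]
    boundary = logits.argmax(axis=0)   # 0 = non-boundary, 1 = boundary
    return logits, boundary
```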
Step S50: and carrying out multi-scale pooling on a preset layer number feature map in the plurality of semantic feature maps by utilizing the improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results.
Referring to fig. 3, in an embodiment, step S50 performs multi-scale pooling on a preset layer number feature map in the multiple semantic feature maps by using the improved spatial pyramid pooling module and the boundary residual, and the obtaining multiple human body analysis results includes steps S51 to S54, which are specifically described below.
Step S51: and inputting the preset layer number feature map into an improved spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooling features.
Referring to fig. 4, in an embodiment, step S51, in which the preset layer number feature map is input to the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooling features, includes steps S511 to S513, which are described in detail below.
Step S511: and carrying out convolution processing on the preset layer number feature map to obtain the first layer pooling feature of the improved spatial pyramid pooling module.
Step S512: and carrying out cavity convolution processing on the preset layer number characteristic map by using the cavity convolution layers with different cavity rates respectively to obtain a plurality of cavity pooling characteristics.
Step S513: and summarizing the first layer of pooling features and the plurality of cavity pooling features to obtain a plurality of pooling features.
In some embodiments, the preset layer number feature map is the feature I_4 output by the fourth-layer encoder of the preset encoder. First, I_4 passes through a 1×1 convolution layer to obtain the first-layer pooling feature of the improved spatial pyramid pooling module. Cavity convolution layers with different cavity rates are then applied to I_4 to obtain a plurality of cavity pooling features: I_4 passes through a 3×3 cavity convolution layer with cavity rate 6 to obtain the second-layer pooling feature of the improved spatial pyramid pooling module, then through a 3×3 cavity convolution layer with cavity rate 12 to obtain the third-layer pooling feature, and finally through a 3×3 cavity convolution layer with cavity rate 18 to obtain the fourth-layer pooling feature.
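The branch structure can be sketched for a single-channel map; the rates 6, 12 and 18 follow the text, while the single-channel simplification and the explicit padding scheme are assumptions:

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    # 'same'-padded 3x3 cavity (dilated) convolution on a (H, W) map:
    # pad by the rate so the output keeps the input size
    pad = rate
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(3):
        for j in range(3):
            di, dj = i * rate, j * rate
            out += w[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def aspp(x, w1x1, w6, w12, w18):
    # Improved spatial pyramid pooling branches: a 1x1 convolution plus
    # 3x3 cavity convolutions with rates 6, 12, 18, stacked on channels.
    branches = [w1x1 * x,
                dilated_conv3x3(x, w6, 6),
                dilated_conv3x3(x, w12, 12),
                dilated_conv3x3(x, w18, 18)]
    return np.stack(branches)          # (4, H, W) pooled features
```

Larger rates enlarge the receptive field without adding parameters, which is what lets the module see body parts of very different sizes.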
Step S52: and performing feature splicing treatment on the plurality of pooled features to obtain spliced features, and performing dimension reduction treatment on the spliced features by using a convolution layer to obtain dimension-reduced spliced features.
In some embodiments, the plurality of pooled features are spliced along the channel dimension to obtain the spliced feature, and the spliced feature is input to a 1×1 convolution layer for channel-dimension reduction, yielding the reduced-dimension spliced feature.
Step S53: and performing feature selection processing on the spliced features after the dimension reduction by using a preset feature refining module to obtain final output features.
Referring to fig. 5, in an embodiment, step S53 performs feature selection processing on the reduced-dimension stitching features by using a preset feature refining module, and obtaining final output features includes steps S531 to S533, which are described in detail below.
Step S531: and carrying out average pooling treatment on the spliced features after the dimension reduction by using a global average pooling layer to obtain channel weight features.
Step S532: and carrying out convolution processing, normalization processing and activation processing on the channel weight characteristics to obtain final weight characteristics.
Step S533: and multiplying the channel weight characteristic and the final weight characteristic to obtain a final output characteristic.
In some embodiments, the global average pooling layer performs average pooling on the reduced-dimension spliced feature to obtain the channel weight feature. The channel weight feature then passes through a 1×1 convolution, a BatchNorm layer and a sigmoid activation function to obtain the final weight feature, where the BatchNorm layer performs the normalization processing and the sigmoid activation function performs the activation processing. The channel weight feature and the final weight feature are multiplied to obtain the final output feature.
In some embodiments, the processing of steps S531 to S533 can be expressed as follows:

F_cat = Concat(Filter(I_4)_{1×1}, Filter(I_4)_{3×3,rate=6}, Filter(I_4)_{3×3,rate=12}, Filter(I_4)_{3×3,rate=18})
s = GAP(F_cat)
F_out = F_cat × sigmoid(BN(Filter(s)_{1×1}))

where F_cat is the spliced feature, Concat denotes concatenation along the channel dimension, Filter(·)_{n×n,rate=k} denotes a convolution with kernel size n and cavity rate k, GAP denotes the global average pooling operation, s is the channel weight feature, F_out is the final output feature, I_4 is the feature map output by the fourth-layer encoder, BN denotes the BatchNorm normalization, and sigmoid denotes the activation function.
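An SE-style reading of the feature refining module (GAP, 1x1 convolution, normalization, sigmoid, then reweighting) can be sketched as follows. Treating the final multiplication as gating the reduced spliced feature by per-channel weights is a common interpretation and an assumption here, not a statement of the patent's exact wiring; the per-vector normalization also only stands in for BatchNorm:

```python
import numpy as np

def feature_refine(f, w, b):
    # f: (C, H, W) reduced-dimension spliced feature; w: (C, C); b: (C,)
    s = f.mean(axis=(1, 2))                    # global average pooling -> (C,)
    z = w @ s + b                              # 1x1 convolution as a matrix product
    z = (z - z.mean()) / (z.std() + 1e-5)      # stand-in for BatchNorm
    gate = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
    return f * gate[:, None, None]             # reweight each channel
```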
Step S54: and carrying out human body analysis processing according to the final output characteristics and the boundary residual error to obtain a plurality of human body analysis results.
Referring to fig. 6, in an embodiment, step S54 performs a human body analysis process according to the final output feature and the boundary residual, and the obtaining of a plurality of human body analysis results includes steps S541 to S544, which are described in detail below.
Step S541: and inputting the final output characteristics into the multi-layer convolution layer for convolution processing to obtain a first analysis result.
Step S542: and respectively carrying out up-sampling treatment on the final output characteristics and the first analysis result to obtain first up-sampling characteristics and first up-sampling results, carrying out characteristic splicing treatment on the first up-sampling characteristics, the first up-sampling results and the boundary residual errors to obtain initial splicing characteristics, and carrying out convolution treatment and characteristic refining treatment on the initial splicing characteristics to obtain second analysis results.
Step S543: and re-executing up-sampling processing, feature stitching processing, convolution processing and feature refining processing on the second analysis result and the boundary residual until a third analysis result and a fourth analysis result are obtained.
Step S544: and summarizing the first analysis result, the second analysis result, the third analysis result and the fourth analysis result to obtain a plurality of human body analysis results.
In some embodiments, the final output feature is passed through four convolution layers to obtain the first analysis result D_4. The first upsampling result obtained by applying one layer of upsampling to D_4, the first upsampled feature obtained by applying one layer of upsampling to the final output feature, and the boundary residual R_3 are spliced to obtain the initial splice feature. The initial splice feature is passed through the feature refining module and four convolution layers to obtain the second analysis result D_3. Upsampling processing, feature splicing processing, convolution processing and feature refining processing are then re-executed on the second analysis result and the boundary residual until the third analysis result and the fourth analysis result are obtained.
In some embodiments, the formulas of the process of steps S541 to S544 are expressed as follows:
D_i = conv(ARM(Concat(up(D_{i+1}), up(F_{i+1}), R_i)))

wherein conv represents a four-layer convolution operation, ARM is the feature refining module, Concat represents feature splicing in the channel dimension, up represents an upsampling operation, D_i represents the analysis result of the i-th layer, D_{i+1} represents the analysis result of the (i+1)-th layer, R_i represents the boundary residual of the i-th layer, and F_{i+1} represents the semantic features of the (i+1)-th layer.
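The decoder recursion above can be sketched with shape-level stand-ins; the class count, channel widths, the ×2 upsampling factor, and the mean-based stand-in for ARM and the four convolution layers are assumptions for illustration only:

```python
import numpy as np

def up2(x):
    # Nearest-neighbour 2x upsampling stands in for up(.).
    return np.repeat(np.repeat(x, 2, axis=2), 2, axis=3)

def decoder_step(d_next, f_next, r_i, n_classes=20):
    # D_i = conv(ARM(Concat(up(D_{i+1}), up(F_{i+1}), R_i)))
    x = np.concatenate([up2(d_next), up2(f_next), r_i], axis=1)
    # ARM plus the four convolution layers are stand-ins here: a channel
    # mean broadcast to n_classes keeps the shapes honest without weights.
    return np.repeat(x.mean(axis=1, keepdims=True), n_classes, axis=1)

d4 = np.zeros((1, 20, 8, 8))    # coarsest parsing result (20 classes assumed)
f4 = np.zeros((1, 64, 8, 8))    # semantic features at the same scale
r3 = np.zeros((1, 3, 16, 16))   # boundary residual at the next finer scale
d3 = decoder_step(d4, f4, r3)   # next, finer parsing result
```

Each application of the step doubles the spatial resolution of the parsing result while folding in the boundary residual at that scale.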
Because the sizes of the various parts of the human body differ, it is difficult to identify large and small objects simultaneously with a single convolution kernel size. To this end, an improved spatial pyramid pooling module is added after the last-layer encoder. Compared with traditional spatial pyramid pooling, a feature refining module based on an attention mechanism is appended, which allows the network to adaptively select the output features.
Step S60: constructing a corresponding cross entropy loss function based on the plurality of human body analysis results, constructing a corresponding binary classification loss function according to the boundary classification result, performing network training on the human body analysis network by using a final loss function obtained by summing the cross entropy loss function and the binary classification loss function, and optimizing and adjusting the trained human body analysis network to obtain a standard human body analysis network, wherein the human body analysis network is constructed from the preset encoder, the improved spatial pyramid pooling module and the preset detail extractor.
In some embodiments, constructing a corresponding cross entropy loss function based on a plurality of human body parsing results includes:
The cross entropy loss function is:

ξ_seg-loss = Σ_{i=1}^{4} ξ_ce(D_i, L_i^gt)

wherein ξ_seg-loss represents the cross entropy loss function, L_i^gt is the ground-truth human body analysis label of the i-th Laplacian layer, D_i represents the i-th analysis result, and ξ_ce represents the cross entropy function.
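A minimal sketch of the summed multi-level loss, assuming ξ_ce is the standard pixel-wise cross entropy over softmax probabilities with one ground-truth label map per pyramid level:

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-12):
    # pred: (C, H, W) class probabilities; target: (H, W) integer labels.
    h, w = target.shape
    p = pred[target, np.arange(h)[:, None], np.arange(w)]
    return -np.mean(np.log(p + eps))

def seg_loss(preds, targets):
    # One cross entropy term per Laplacian-pyramid level, summed over
    # the four parsing results D_1..D_4.
    return sum(cross_entropy(p, t) for p, t in zip(preds, targets))

# Toy check: perfect predictions at every level give (near-)zero loss.
target = np.zeros((2, 2), dtype=int)
perfect = np.zeros((3, 2, 2))
perfect[0] = 1.0
loss = seg_loss([perfect] * 4, [target] * 4)
```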
In some embodiments, constructing a corresponding classification loss function from the boundary classification result includes:
The binary classification loss function is:

ξ_detail-loss = ξ_dice(Seg_detail, Seg_gt)

wherein ξ_detail-loss represents the binary classification loss function, Seg_gt is the ground-truth human body boundary label, Seg_detail is the boundary classification result, and ξ_dice represents the similarity measure (Dice) function.
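Assuming ξ_dice is the standard Dice similarity measure, the boundary loss can be sketched as:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Dice-style loss: 1 - 2|P.G| / (|P| + |G|), computed between the
    # boundary probability map and the binary boundary label.
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

mask = np.array([[1.0, 0.0], [0.0, 1.0]])
```

A perfect boundary prediction gives a loss near 0, and a fully disjoint one gives a loss near 1.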
In some embodiments, the trained human body analysis network is optimized and adjusted to obtain the standard human body analysis network. The network can be optimized with an SGD optimizer, and the network learning rate is dynamically adjusted with a poly learning rate adjustment strategy. The encoder backbone network ResNet18 is pre-trained on the ImageNet dataset. The batch size and epoch size are set to 4 and 300, respectively. With the poly learning strategy, after each epoch is completed the initial learning rate is multiplied by a decay factor of the form (1 − epoch/max_epoch)^power. The network is trained using mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. Finally, the standard human body analysis network is obtained after training is completed.
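The poly schedule can be sketched as below; the base learning rate of 0.01 and power of 0.9 are assumptions, since the text leaves the initial rate and decay exponent unstated:

```python
def poly_lr(base_lr, epoch, max_epoch, power=0.9):
    # Poly decay: the learning rate shrinks smoothly to zero at
    # max_epoch; power=0.9 is the common choice but is assumed here.
    return base_lr * (1.0 - epoch / max_epoch) ** power

# Schedule over the 300 epochs stated above, with an assumed base lr.
schedule = [poly_lr(0.01, e, 300) for e in (0, 150, 299)]
```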
Referring to fig. 7, in some embodiments, a human body recognition method based on an image of an unmanned aerial vehicle is provided, including steps 1 and 2:
step 1: and acquiring a human body image to be analyzed.
Step 2: inputting the human body image to be analyzed into a standard human body analysis network to perform human body analysis processing to obtain a human body analysis result; the standard human body analysis network is obtained by training the method from step S10 to step S60.
Step S10: acquiring a plurality of unmanned aerial vehicle images, performing data enhancement processing on the plurality of unmanned aerial vehicle images, and performing image preprocessing on the data-enhanced images to obtain a training image set.
In some embodiments, performing data enhancement processing on a plurality of unmanned aerial vehicle images includes:
the data enhancement processes include image panning, image flipping, random brightness transformation, and median filtering.
In some embodiments, performing data enhancement processing on the plurality of unmanned aerial vehicle images enhances the generalization capability of the model during training.
In some embodiments, image preprocessing is performed on the data-enhanced image to obtain a training image set, wherein the image preprocessing comprises image cropping and image normalization.
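A minimal sketch of the enhancement and preprocessing steps above, using a horizontal flip, a random brightness scale in an assumed range of [0.8, 1.2] (the patent does not specify one), and [0, 1] normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # Image flipping with probability 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Random brightness transformation: scale pixel values, then clip
    # back into the valid 8-bit range.
    img = np.clip(img * rng.uniform(0.8, 1.2), 0, 255)
    return img

def normalize(img):
    # Image normalization used in preprocessing: scale to [0, 1].
    return img.astype(np.float64) / 255.0

out = normalize(augment(np.full((4, 4), 128.0)))
```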
Step S20: and carrying out boundary residual calculation on the training images in the training image set to obtain boundary residual.
Referring to fig. 2, in some embodiments, step S20 performs boundary residual calculation on the training images in the training image set, and obtaining the boundary residual includes steps S21 to S23, which are described in detail below.
Step S21: and carrying out downsampling treatment on the training images in the training image set to obtain a downsampled image set.
Step S22: and selecting images which are in the downsampled image set and accord with the preset layer number as the images of the layer, and selecting images which have preset intervals with the images of the layer as the images of the lower layer.
In some embodiments, the downsampled image set includes a layer 1 image, a layer 2 image, …, a layer k image, and a layer k+1 image. The layer k image is selected as the current layer image, and the layer k+1 image is selected as the lower layer image, where k represents the layer index of the Laplacian pyramid.
Step S23: and performing up-sampling processing on the lower layer image to obtain an up-sampling image, and calculating the difference between the current layer image and the up-sampling image to obtain a boundary residual error.
In some embodiments, the lower layer image is upsampled to obtain the upsampled image up(L_{k+1}), and the difference between the current layer image and the upsampled image is calculated. The formula for the boundary residual is expressed as follows:

R_k = L_k − up(L_{k+1})

wherein R_k is the boundary residual, L_k is the current layer image, up(L_{k+1}) is the upsampled image, and k represents the layer index of the Laplacian pyramid.
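The residual computation can be sketched as follows; nearest-neighbour sampling stands in for the smoothed down/up operations of a true Laplacian pyramid, so that here the reconstruction L_k = R_k + up(L_{k+1}) holds exactly:

```python
import numpy as np

def downsample(img):
    # Take every second pixel: a simple stand-in for the pyramid's
    # smooth-and-subsample step.
    return img[::2, ::2]

def upsample(img):
    # Nearest-neighbour 2x upsampling, the up(.) in the formula.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def boundary_residual(level_k):
    # R_k = L_k - up(L_{k+1}): the residual keeps the high-frequency
    # detail (boundaries) that downsampling discards.
    level_k1 = downsample(level_k)
    return level_k - upsample(level_k1), level_k1

L0 = np.arange(64, dtype=np.float64).reshape(8, 8)
R0, L1 = boundary_residual(L0)
```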
Step S30: and extracting features of the training images in the training image set by using a preset encoder to obtain a plurality of semantic feature images, and selecting the semantic feature images meeting the screening requirements from the plurality of semantic feature images as target feature images.
In some embodiments, the preset encoder may be a ResNet18 encoder. The ResNet18 encoder has a four-layer encoding structure, so feature extraction on the training image by the ResNet18 encoder yields four layers of feature maps, denoted I_1, I_2, I_3 and I_4 respectively.
In some embodiments, the network formula for the ResNet18 encoder is as follows:
F(·) = F_4 ∘ F_3 ∘ F_2 ∘ F_1

wherein F_k(·) represents the k-th layer encoder: F_4 the fourth-layer encoder, F_3 the third-layer encoder, F_2 the second-layer encoder, and F_1 the first-layer encoder; ∘ denotes function composition.
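The composed encoder can be sketched at the shape level; the channel widths 64/128/256/512 and the stride-2 stages follow the usual ResNet18 layout and are assumptions about the concrete backbone:

```python
import numpy as np

def make_stage(out_channels):
    # Stand-in for one encoder stage F_k: halves H and W and sets the
    # channel count, mimicking ResNet18's stride-2 stages.
    def stage(x):
        n, c, h, w = x.shape
        return np.zeros((n, out_channels, h // 2, w // 2))
    return stage

# F(.) = F_4 o F_3 o F_2 o F_1, applied from F_1 outward.
stages = [make_stage(c) for c in (64, 128, 256, 512)]

def encoder(x):
    feats = []          # keep I_1..I_4 for the decoder / detail extractor
    for stage in stages:
        x = stage(x)
        feats.append(x)
    return feats

feats = encoder(np.zeros((1, 3, 256, 256)))
```

Keeping every intermediate map matters here, because I_2 feeds the detail extractor and I_4 feeds the pyramid pooling module.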
In some embodiments, the screening requirement is predefined and concerns which layer's feature map is selected from the preset encoder; here, the feature map corresponding to the second-layer coding structure is selected as the target feature map.
Step S40: and carrying out boundary supervision processing on the target feature map by using a preset detail extractor to obtain a boundary classification result.
In some embodiments, the target feature map is input to a preset detail extractor to obtain a boundary classification result, wherein the boundary classification result represents the boundary information of the input human body, and the recognition capability of the model on the human body boundary is improved.
In some embodiments, the boundary classification result is expressed as Seg_detail ∈ R^{2×H×W}, where H is the height of the target feature map and W is the width of the target feature map.
Step S50: and carrying out multi-scale pooling on a preset layer number feature map in the plurality of semantic feature maps by utilizing the improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results.
Referring to fig. 3, in an embodiment, step S50 performs multi-scale pooling on a preset layer number feature map in the multiple semantic feature maps by using the improved spatial pyramid pooling module and the boundary residual, and the obtaining multiple human body analysis results includes steps S51 to S54, which are specifically described below.
Step S51: and inputting the preset layer number feature map into an improved spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooling features.
Referring to fig. 4, in an embodiment, step S51 inputs the preset layer number feature map to the improved spatial pyramid pooling module for feature convolution processing, and the step S511 to step S513 are included to obtain multiple pooled features, which will be described in detail below.
Step S511: and carrying out convolution processing on the preset layer number feature map to obtain the first layer pooling feature of the improved spatial pyramid pooling module.
Step S512: and carrying out cavity convolution processing on the preset layer number characteristic map by using the cavity convolution layers with different cavity rates respectively to obtain a plurality of cavity pooling characteristics.
Step S513: and summarizing the first layer of pooling features and the plurality of cavity pooling features to obtain a plurality of pooling features.
In some embodiments, the preset layer-number feature map is the feature map I_4 output by the fourth-layer encoder of the preset encoder. First, it is passed through a 1×1 convolution layer to obtain the first-layer pooling feature of the improved spatial pyramid pooling module. Dilated convolution layers with different dilation rates are then applied to obtain a plurality of dilated pooling features: the feature I_4 is passed through a 3×3 dilated convolution layer with dilation rate 6 to obtain the second-layer pooling feature, through a 3×3 dilated convolution layer with dilation rate 12 to obtain the third-layer pooling feature, and through a 3×3 dilated convolution layer with dilation rate 18 to obtain the fourth-layer pooling feature of the improved spatial pyramid pooling module.
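The point of the three dilation rates is receptive-field coverage: a 3×3 kernel with dilation rate r spans 3 + 2(r − 1) pixels per side without adding parameters, which the following sketch computes:

```python
def effective_kernel(k, rate):
    # A k x k kernel with dilation `rate` covers
    # k + (k - 1) * (rate - 1) pixels per side.
    return k + (k - 1) * (rate - 1)

# Spans for the plain 3x3 case and the three rates used above.
spans = {rate: effective_kernel(3, rate) for rate in (1, 6, 12, 18)}
```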
Step S52: and performing feature splicing treatment on the plurality of pooled features to obtain spliced features, and performing dimension reduction treatment on the spliced features by using a convolution layer to obtain dimension-reduced spliced features.
In some embodiments, the plurality of pooled features are spliced in the channel dimension to obtain the splice feature, and the splice feature is input to a 1×1 convolution layer for channel-number dimension reduction, obtaining the reduced-dimension splice feature.
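Channel splicing and 1×1 reduction can be sketched with a per-pixel matrix multiply; the branch width of 256 channels is an assumption:

```python
import numpy as np

channels = 256                      # assumed width of each pooling branch
pooled = [np.ones((1, channels, 16, 16)) for _ in range(4)]

# Channel-dimension splice: four 256-channel branches -> 1024 channels.
f_cat = np.concatenate(pooled, axis=1)

# A 1x1 convolution is a per-pixel linear map over channels; modelled
# here as a matrix multiply that reduces 1024 channels back to 256.
w = np.full((channels, 4 * channels), 1.0 / (4 * channels))
n, c, h, wd = f_cat.shape
reduced = (w @ f_cat.reshape(n, c, h * wd)).reshape(n, channels, h, wd)
```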
Step S53: and performing feature selection processing on the spliced features after the dimension reduction by using a preset feature refining module to obtain final output features.
Referring to fig. 5, in an embodiment, step S53 performs feature selection processing on the reduced-dimension stitching features by using a preset feature refining module, and obtaining final output features includes steps S531 to S533, which are described in detail below.
Step S531: and carrying out average pooling treatment on the spliced features after the dimension reduction by using a global average pooling layer to obtain channel weight features.
Step S532: and carrying out convolution processing, normalization processing and activation processing on the channel weight characteristics to obtain final weight characteristics.
Step S533: and multiplying the channel weight characteristic and the final weight characteristic to obtain a final output characteristic.
In some embodiments, the global average pooling layer performs average pooling on the reduced-dimension splice feature to obtain the channel weight feature. The channel weight feature is passed through a 1×1 convolution, a BatchNorm layer and a Sigmoid activation function to obtain the final weight feature, where the BatchNorm layer performs normalization and the Sigmoid activation function performs activation. The channel weight feature and the final weight feature are multiplied to obtain the final output feature.
In some embodiments, the formulas for the processes of steps S531-S533 are expressed as follows:
F_cat = Concat(Filter(I_4)_{1×1}, Filter(I_4)_{3×3,rate=6}, Filter(I_4)_{3×3,rate=12}, Filter(I_4)_{3×3,rate=18})

F_gap = GAP(F_cat)

F_out = F_gap × Sigmoid(BN(Conv_{1×1}(F_gap)))

wherein F_cat is the splice feature, Concat represents splicing in the channel dimension, Filter(·)_{n×n,rate=k} represents a convolution kernel of size n×n with dilation rate k, GAP represents the global average pooling operation, F_gap is the channel weight feature, F_out is the final output feature, I_4 represents the feature map output by the fourth-layer encoder, BN represents batch normalization, and Sigmoid represents the activation function.
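The feature refining module can be sketched as SE-style channel gating; multiplying the gate into the input feature map is one common reading of the attention mechanism described, and BatchNorm plus real convolution weights are simplified away here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(f, w):
    # Global average pooling over H x W gives one value per channel.
    gap = f.mean(axis=(2, 3), keepdims=True)        # (N, C, 1, 1)
    # 1x1 conv stand-in: channel-mixing matrix, then sigmoid gating.
    mixed = np.einsum('oc,nchw->nohw', w, gap)
    gate = sigmoid(mixed)                           # weights in (0, 1)
    # Broadcast-multiply: each channel is re-weighted adaptively.
    return f * gate

f = np.random.randn(1, 8, 4, 4)
out = refine(f, np.eye(8))
```

Because the gate is strictly between 0 and 1, refinement can only attenuate channels, never amplify them.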
Step S54: and carrying out human body analysis processing according to the final output characteristics and the boundary residual error to obtain a plurality of human body analysis results.
Referring to fig. 6, in an embodiment, step S54 performs a human body analysis process according to the final output feature and the boundary residual, and the obtaining of a plurality of human body analysis results includes steps S541 to S544, which are described in detail below.
Step S541: and inputting the final output characteristics into the multi-layer convolution layer for convolution processing to obtain a first analysis result.
Step S542: and respectively carrying out up-sampling treatment on the final output characteristics and the first analysis result to obtain first up-sampling characteristics and first up-sampling results, carrying out characteristic splicing treatment on the first up-sampling characteristics, the first up-sampling results and the boundary residual errors to obtain initial splicing characteristics, and carrying out convolution treatment and characteristic refining treatment on the initial splicing characteristics to obtain second analysis results.
Step S543: and re-executing up-sampling processing, feature stitching processing, convolution processing and feature refining processing on the second analysis result and the boundary residual until a third analysis result and a fourth analysis result are obtained.
Step S544: and summarizing the first analysis result, the second analysis result, the third analysis result and the fourth analysis result to obtain a plurality of human body analysis results.
In some embodiments, the final output feature is passed through four convolution layers to obtain the first analysis result D_4. The first upsampling result obtained by applying one layer of upsampling to D_4, the first upsampled feature obtained by applying one layer of upsampling to the final output feature, and the boundary residual R_3 are spliced to obtain the initial splice feature. The initial splice feature is passed through the feature refining module and four convolution layers to obtain the second analysis result D_3. Upsampling processing, feature splicing processing, convolution processing and feature refining processing are then re-executed on the second analysis result and the boundary residual until the third analysis result and the fourth analysis result are obtained.
In some embodiments, the formulas of the process of steps S541 to S544 are expressed as follows:
D_i = conv(ARM(Concat(up(D_{i+1}), up(F_{i+1}), R_i)))

wherein conv represents a four-layer convolution operation, ARM is the feature refining module, Concat represents feature splicing in the channel dimension, up represents an upsampling operation, D_i represents the analysis result of the i-th layer, D_{i+1} represents the analysis result of the (i+1)-th layer, R_i represents the boundary residual of the i-th layer, and F_{i+1} represents the semantic features of the (i+1)-th layer.
Because the sizes of the various parts of the human body differ, it is difficult to identify large and small objects simultaneously with a single convolution kernel size. To this end, an improved spatial pyramid pooling module is added after the last-layer encoder. Compared with traditional spatial pyramid pooling, a feature refining module based on an attention mechanism is appended, which allows the network to adaptively select the output features.
Step S60: constructing a corresponding cross entropy loss function based on the plurality of human body analysis results, constructing a corresponding binary classification loss function according to the boundary classification result, performing network training on the human body analysis network by using a final loss function obtained by summing the cross entropy loss function and the binary classification loss function, and optimizing and adjusting the trained human body analysis network to obtain a standard human body analysis network, wherein the human body analysis network is constructed from the preset encoder, the improved spatial pyramid pooling module and the preset detail extractor.
In some embodiments, constructing a corresponding cross entropy loss function based on a plurality of human body parsing results includes:
The cross entropy loss function is:

ξ_seg-loss = Σ_{i=1}^{4} ξ_ce(D_i, L_i^gt)

wherein ξ_seg-loss represents the cross entropy loss function, L_i^gt is the ground-truth human body analysis label of the i-th Laplacian layer, D_i represents the i-th analysis result, and ξ_ce represents the cross entropy function.
In some embodiments, constructing a corresponding classification loss function from the boundary classification result includes:
The binary classification loss function is:

ξ_detail-loss = ξ_dice(Seg_detail, Seg_gt)

wherein ξ_detail-loss represents the binary classification loss function, Seg_gt is the ground-truth human body boundary label, Seg_detail is the boundary classification result, and ξ_dice represents the similarity measure (Dice) function.
In some embodiments, the trained human body analysis network is optimized and adjusted to obtain the standard human body analysis network. The network can be optimized with an SGD optimizer, and the network learning rate is dynamically adjusted with a poly learning rate adjustment strategy. The encoder backbone network ResNet18 is pre-trained on the ImageNet dataset. The batch size and epoch size are set to 4 and 300, respectively. With the poly learning strategy, after each epoch is completed the initial learning rate is multiplied by a decay factor of the form (1 − epoch/max_epoch)^power. The network is trained using mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. Finally, the standard human body analysis network is obtained after training is completed.
Referring to fig. 8, in some embodiments, a human body analysis network training device based on unmanned aerial vehicle images is provided, which includes a residual calculation module 10, a boundary supervision module 20, a multi-scale pooling module 30, and a network training module 40:
the residual calculation module 10 is configured to obtain a plurality of unmanned aerial vehicle images, perform data enhancement processing on the plurality of unmanned aerial vehicle images, perform image preprocessing on the data-enhanced images to obtain a training image set, and perform boundary residual calculation on training images in the training image set to obtain boundary residual.
In some embodiments, the residual calculation module 10 performs data enhancement processing on a plurality of unmanned aerial vehicle images, including:
the data enhancement processes include image panning, image flipping, random brightness transformation, and median filtering.
In some embodiments, performing data enhancement processing on the plurality of unmanned aerial vehicle images enhances the generalization capability of the model during training.
In some embodiments, image preprocessing is performed on the data-enhanced image to obtain a training image set, wherein the image preprocessing comprises image cropping and image normalization.
In some embodiments, the residual calculation module 10 performs boundary residual calculation on the training images in the training image set, and the obtaining of the boundary residual may be implemented as follows:
The residual calculation module 10 performs downsampling processing on training images in the training image set to obtain a downsampled image set, selects images which conform to a preset layer number in the downsampled image set as a layer image, and selects images with a preset interval with the layer image as a lower layer image.
In some embodiments, the downsampled image set includes a layer 1 image, a layer 2 image, …, a layer k image, and a layer k+1 image. The layer k image is selected as the current layer image, and the layer k+1 image is selected as the lower layer image, where k represents the layer index of the Laplacian pyramid.
The residual calculation module 10 performs upsampling processing on the lower layer image to obtain an upsampled image, and calculates a difference between the current layer image and the upsampled image to obtain a boundary residual.
In some embodiments, the lower layer image is upsampled to obtain the upsampled image up(L_{k+1}), and the difference between the current layer image and the upsampled image is calculated. The formula for the boundary residual is expressed as follows:

R_k = L_k − up(L_{k+1})

wherein R_k is the boundary residual, L_k is the current layer image, and up(L_{k+1}) is the upsampled image.
The boundary supervision module 20 is configured to perform feature extraction on a training image in the training image set by using a preset encoder to obtain a plurality of semantic feature graphs, select a semantic feature graph meeting a screening requirement from the plurality of semantic feature graphs as a target feature graph, and perform boundary supervision processing on the target feature graph by using a preset detail extractor to obtain a boundary classification result.
In some embodiments, the preset encoder may be a ResNet18 encoder. The ResNet18 encoder has a four-layer encoding structure, so feature extraction on the training image by the ResNet18 encoder yields four layers of feature maps, denoted I_1, I_2, I_3 and I_4 respectively.
In some embodiments, the network formula for the ResNet18 encoder is as follows:
F(·) = F_4 ∘ F_3 ∘ F_2 ∘ F_1

wherein F_k(·) represents the k-th layer encoder and ∘ denotes function composition.
In some embodiments, the screening requirement is predefined and concerns which layer's feature map is selected from the preset encoder; here, the feature map corresponding to the second-layer coding structure is selected as the target feature map.
In some embodiments, the boundary monitor module 20 performs boundary monitor processing on the target feature map by using a preset detail extractor to obtain a boundary classification result.
In some embodiments, the boundary monitor module 20 inputs the target feature map to the preset detail extractor to obtain a boundary classification result, where the boundary classification result represents the boundary information of the input human body, and improves the recognition capability of the model on the boundary of the human body.
In some embodiments, the boundary classification result is expressed as Seg_detail ∈ R^{2×H×W}, where H is the height of the target feature map and W is the width of the target feature map.
The multi-scale pooling module 30 is configured to perform multi-scale pooling on a preset layer number feature map in the multiple semantic feature maps by using the improved spatial pyramid pooling module and the boundary residual error, so as to obtain multiple human body analysis results.
In some embodiments, the multi-scale pooling module 30 inputs the preset layer-number feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooled features. In some embodiments,
the multi-scale pooling module 30 performs convolution processing on the preset layer-number feature map to obtain the first-layer pooling feature of the improved spatial pyramid pooling module, performs dilated convolution processing on the preset layer-number feature map using dilated convolution layers with different dilation rates to obtain a plurality of dilated pooling features, and summarizes the first-layer pooling feature and the plurality of dilated pooling features to obtain a plurality of pooling features.
In some embodiments, the preset layer-number feature map is the feature map I_4 output by the fourth-layer encoder of the preset encoder. First, it is passed through a 1×1 convolution layer to obtain the first-layer pooling feature of the improved spatial pyramid pooling module. Dilated convolution layers with different dilation rates are then applied to obtain a plurality of dilated pooling features: the feature I_4 is passed through a 3×3 dilated convolution layer with dilation rate 6 to obtain the second-layer pooling feature, through a 3×3 dilated convolution layer with dilation rate 12 to obtain the third-layer pooling feature, and through a 3×3 dilated convolution layer with dilation rate 18 to obtain the fourth-layer pooling feature of the improved spatial pyramid pooling module.
The multi-scale pooling module 30 performs feature stitching on the pooled features to obtain stitching features, and performs dimension reduction on the stitching features by using a convolution layer to obtain dimension-reduced stitching features.
In some embodiments, the multi-scale pooling module 30 splices the plurality of pooled features in the channel dimension to obtain the splice feature, and inputs the splice feature to a 1×1 convolution layer for channel-number dimension reduction, obtaining the reduced-dimension splice feature.
The multi-scale pooling module 30 performs feature selection processing on the reduced-dimension splicing features by using a preset feature refining module to obtain final output features.
In one embodiment, the multi-scale pooling module 30 performs feature selection processing on the reduced-dimension stitching features by using a preset feature refining module to obtain final output features, which may be implemented as follows: the multi-scale pooling module 30 performs average pooling treatment on the spliced features subjected to dimension reduction by using a global average pooling layer to obtain channel weight features;
the multi-scale pooling module 30 performs convolution processing, normalization processing and activation processing on the channel weight characteristics to obtain final weight characteristics, and multiplies the channel weight characteristics and the final weight characteristics to obtain final output characteristics.
In some embodiments, the global average pooling layer performs average pooling on the reduced-dimension splice feature to obtain the channel weight feature. The channel weight feature is passed through a 1×1 convolution, a BatchNorm layer and a Sigmoid activation function to obtain the final weight feature, where the BatchNorm layer performs normalization and the Sigmoid activation function performs activation. The channel weight feature and the final weight feature are multiplied to obtain the final output feature.
In some embodiments, the formulas for the processes of steps S531-S533 are expressed as follows:
F_cat = Concat(Filter(I_4)_{1×1}, Filter(I_4)_{3×3,rate=6}, Filter(I_4)_{3×3,rate=12}, Filter(I_4)_{3×3,rate=18})

F_gap = GAP(F_cat)

F_out = F_gap × Sigmoid(BN(Conv_{1×1}(F_gap)))

wherein F_cat is the splice feature, Concat represents splicing in the channel dimension, Filter(·)_{n×n,rate=k} represents a convolution kernel of size n×n with dilation rate k, GAP represents the global average pooling operation, F_gap is the channel weight feature, F_out is the final output feature, I_4 represents the feature map output by the fourth-layer encoder, BN represents batch normalization, and Sigmoid represents the activation function.
The multi-scale pooling module 30 performs human body analysis processing according to the final output characteristics and the boundary residual error to obtain a plurality of human body analysis results.
In one embodiment, the multi-scale pooling module 30 performs the human body analysis according to the final output feature and the boundary residual error to obtain a plurality of human body analysis results, which may be implemented as follows:
the multi-scale pooling module 30 inputs the final output feature into the multi-layer convolution layers for convolution processing to obtain a first analysis result; performs upsampling processing on the final output feature and the first analysis result to obtain a first upsampled feature and a first upsampling result; performs feature splicing processing on the first upsampled feature, the first upsampling result and the boundary residual to obtain an initial splice feature; performs convolution processing and feature refining processing on the initial splice feature to obtain a second analysis result; re-executes upsampling processing, feature splicing processing, convolution processing and feature refining processing on the second analysis result and the boundary residual until a third analysis result and a fourth analysis result are obtained; and summarizes the first, second, third and fourth analysis results to obtain a plurality of human body analysis results.
In some embodiments, the final output feature F̂_4 passes through four convolution layers to obtain the first analysis result D_4. The first analysis result D_4 after one layer of upsampling (the first upsampling result), the final output feature F̂_4 after one layer of upsampling (the first upsampling feature), and the boundary residual R_3 are spliced to obtain the initial splice feature. The initial splice feature passes through the feature refining module and four convolution layers to obtain the second analysis result D_3, and the upsampling processing, feature splicing processing, convolution processing and feature refining processing are re-executed on the second analysis result and the boundary residual until the third analysis result and the fourth analysis result are obtained.
In some embodiments, the above process is formulated as follows:

D_{i+1} = conv(F̂_{i+1})

F̂_i = ARM(Concat(up(F̂_{i+1}), up(D_{i+1}), R_i))

wherein conv represents a four-layer convolution operation, ARM is the feature refining module, Concat represents feature splicing in the channel dimension, up represents the upsampling operation, D_{i+1} represents the analysis result of the (i+1)-th layer, R_i represents the boundary residual of the i-th layer, and F̂_{i+1} represents the semantic feature of the (i+1)-th layer.
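One decoder stage of this recursion can be sketched functionally as follows; the four-layer convolution stack and the ARM module are passed in as plain callables, and the channel counts (64 semantic channels, 20 parsing classes, 3 residual channels) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def decode_step(feat, result, residual, conv, refine):
    """One decoder stage: upsample the semantic feature and the previous
    analysis result, splice them with the boundary residual along the
    channel dimension, refine, then predict the next analysis result."""
    up_feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
    up_res = F.interpolate(result, scale_factor=2, mode="bilinear", align_corners=False)
    spliced = torch.cat([up_feat, up_res, residual], dim=1)  # Concat(...)
    refined = refine(spliced)                                # ARM(...)
    return refined, conv(refined)                            # F_i, D_i

# Toy stand-ins: identity refine module, 1x1 conv head to 20 parsing classes.
head = torch.nn.Conv2d(64 + 20 + 3, 20, kernel_size=1)
feat = torch.randn(1, 64, 8, 8)       # semantic feature of layer i+1
prev = torch.randn(1, 20, 8, 8)       # analysis result D_{i+1}
residual = torch.randn(1, 3, 16, 16)  # boundary residual R_i
new_feat, new_result = decode_step(feat, prev, residual, head, lambda t: t)
```

Repeating `decode_step` at each scale yields the cascade of analysis results D_4 … D_1 described above.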
Because the parts of the human body differ in size, it is difficult to identify large and small objects simultaneously with a single convolution kernel size. To this end, an improved spatial pyramid pooling is added after the last encoder layer. Compared with traditional spatial pyramid pooling, an attention-based feature refining module is appended, which enables the network to adaptively select output features.
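A sketch of such an attention-based feature refining module, following the GAP → convolution → normalization → activation → channel-wise multiplication description in this disclosure; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class FeatureRefine(nn.Module):
    """Attention-based feature refining module sketch: global average
    pooling produces a channel descriptor, a 1x1 convolution plus
    normalization and sigmoid activation turn it into per-channel
    weights in (0, 1), which rescale the input feature."""

    def __init__(self, ch=128):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.conv = nn.Conv2d(ch, ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.Sigmoid()

    def forward(self, x):
        w = self.act(self.bn(self.conv(self.gap(x))))  # channel weights
        return x * w                                   # adaptive selection
```

Multiplying by weights in (0, 1) lets the network suppress uninformative channels of the spliced pyramid feature while keeping the tensor shape unchanged.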
The network training module 40 is configured to construct a corresponding cross entropy loss function based on the plurality of human body analysis results, construct a corresponding two-class loss function according to the boundary two-class result, perform network training on the human body analysis network using the final loss function obtained by summing the cross entropy loss function and the two-class loss function, and optimize and adjust the trained human body analysis network to obtain a standard human body analysis network. The human body analysis network is constructed from the preset encoder, the improved spatial pyramid pooling and the preset detail extractor.
In some embodiments, constructing a corresponding cross entropy loss function based on a plurality of human body parsing results includes:
the cross entropy loss function is:
wherein, xi seg-loss Representing a cross-entropy loss function,human body analysis tag with true Laplace characteristics for ith layer, D i The i-th analysis result is shown.
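A hedged sketch of summing the per-layer cross-entropy terms, assuming each decoder scale produces class logits and each layer has a class-index label map:

```python
import torch
import torch.nn.functional as F

def multi_scale_parsing_loss(results, labels):
    """Sum the cross-entropy losses over the analysis results of all
    decoder layers.
    results: list of logit tensors, each (N, C, H, W).
    labels:  list of class-index maps, each (N, H, W)."""
    return sum(F.cross_entropy(d, y) for d, y in zip(results, labels))
```

Each term supervises one scale of the decoder, so coarse layers receive a gradient signal directly rather than only through the final output.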
In some embodiments, constructing a corresponding classification loss function from the boundary classification result includes:
the classification loss function is:
wherein, xi detail-loss Representing a two-class loss function,seg is a true human body boundary label detail The boundary is classified into two results.
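Assuming the two-class loss is the usual binary cross-entropy on boundary logits (the disclosure does not spell out the exact form), it can be sketched as:

```python
import torch
import torch.nn.functional as F

def detail_loss(boundary_logits, boundary_gt):
    """Binary cross-entropy between the predicted boundary map (raw
    logits) and the ground-truth boundary label in {0, 1}. The final
    training loss is then this term plus the cross-entropy term."""
    return F.binary_cross_entropy_with_logits(boundary_logits, boundary_gt)
```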
In some embodiments, the trained human body analysis network is optimized and adjusted to obtain the standard human body analysis network. The network can be optimized with an SGD optimizer, and the network learning rate can be dynamically adjusted with a poly learning rate adjustment strategy. The encoder backbone network ResNet18 is pre-trained on the ImageNet dataset. The batch size and the number of epochs are set to 4 and 300, respectively. Under the poly learning strategy, after each epoch is completed the initial learning rate is multiplied by the decay factor (1 − epoch/max_epoch)^power. The network is trained with a mini-batch stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0001. Finally, the trained standard human body analysis network is obtained.
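The poly learning rate schedule above can be sketched as follows; the exponent power=0.9 is a common choice and an assumption here, since the disclosure does not state it:

```python
def poly_lr(initial_lr, epoch, max_epoch, power=0.9):
    """Poly learning-rate adjustment: the initial rate is multiplied by
    (1 - epoch / max_epoch) ** power after each completed epoch, decaying
    smoothly from the full rate to zero over training."""
    return initial_lr * (1 - epoch / max_epoch) ** power
```

For example, with initial_lr=0.01 and max_epoch=300, the rate stays at 0.01 at epoch 0 and reaches 0 at epoch 300.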
In this embodiment, a human body analysis network is constructed from a preset encoder, improved spatial pyramid pooling and a preset detail extractor; a boundary residual calculation process is added, and the calculated boundary residual is used to train, optimize and adjust the human body analysis network to obtain a standard human body analysis network. This can improve the accuracy of human body analysis in unmanned aerial vehicle images using the standard human body analysis network.
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by a computer program. When implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, and the like; the functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the functions are realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and downloaded or copied into the memory of a local device, or used to update the system version of the local device, so that the functions are realized when the program in the memory is executed by a processor.
The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the invention pertains, based on the idea of the invention.

Claims (10)

1. The human body analysis network training method based on the unmanned aerial vehicle image is characterized by comprising the following steps of:
acquiring a plurality of unmanned aerial vehicle images, performing data enhancement processing on the unmanned aerial vehicle images, and performing image preprocessing on the data-enhanced images to obtain a training image set;
performing boundary residual calculation on the training images in the training image set to obtain boundary residual;
extracting features of training images in the training image set by using a preset encoder to obtain a plurality of semantic feature images, and selecting semantic feature images meeting screening requirements from the plurality of semantic feature images as target feature images;
performing boundary supervision processing on the target feature map by using a preset detail extractor to obtain a boundary classification result;
carrying out multi-scale pooling on a preset layer number feature map in the plurality of semantic feature maps by utilizing an improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results;
constructing corresponding cross entropy loss functions based on the multiple human body analysis results, constructing corresponding two-class loss functions according to the boundary two-class results, performing network training on a human body analysis network by using a final loss function obtained by summing the cross entropy loss functions and the two-class loss functions, and optimizing and adjusting the trained human body analysis network to obtain a standard human body analysis network, wherein the human body analysis network is constructed and generated by a preset encoder, an improved space pyramid pooling and a preset detail extractor.
2. The method of claim 1, wherein performing boundary residual calculation on the training images in the training image set to obtain boundary residuals comprises:
performing downsampling processing on training images in the training image set to obtain a downsampled image set;
selecting the image in the downsampled image set that corresponds to the preset layer number as the current-layer image, and selecting the image at a preset interval from the current-layer image as the lower-layer image;
and carrying out up-sampling processing on the lower layer image to obtain an up-sampling image, and calculating the difference between the current layer image and the up-sampling image to obtain a boundary residual error.
3. The method of claim 1, wherein the performing multi-scale pooling on the preset layer number feature map in the plurality of semantic feature maps by using the improved spatial pyramid pooling module and the boundary residual error to obtain a plurality of human body analysis results comprises:
inputting a preset layer number feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooling features;
performing feature stitching processing on the plurality of pooled features to obtain stitching features, and performing dimension reduction processing on the stitching features by using a convolution layer to obtain dimension-reduced stitching features;
performing feature selection processing on the reduced-dimension spliced features by using a preset feature refining module to obtain final output features;
and carrying out human body analysis processing according to the final output characteristics and the boundary residual error to obtain a plurality of human body analysis results.
4. The method of claim 3, wherein inputting the preset layer number feature map into the modified spatial pyramid pooling module for feature convolution processing to obtain a plurality of pooled features comprises:
performing convolution processing on the preset layer number feature map to obtain a first layer pooling feature of the improved spatial pyramid pooling module;
carrying out cavity convolution processing on the preset layer number feature map by using cavity convolution layers with different cavity rates respectively to obtain a plurality of cavity pooling features;
and summarizing the first layer of pooling features and the plurality of cavity pooling features to obtain a plurality of pooling features.
5. The method of claim 3, wherein the performing feature selection processing on the reduced-dimension stitching feature by using a preset feature refining module to obtain a final output feature includes:
carrying out average pooling treatment on the spliced features subjected to dimension reduction by using a global average pooling layer to obtain channel weight features;
carrying out convolution processing, normalization processing and activation processing on the channel weight characteristics to obtain final weight characteristics;
and multiplying the channel weight characteristic and the final weight characteristic to obtain a final output characteristic.
6. The method of claim 3, wherein the performing a human body parsing process according to the final output feature and the boundary residual error to obtain a plurality of human body parsing results comprises:
inputting the final output characteristics into a multi-layer convolution layer for convolution processing to obtain a first analysis result;
Respectively carrying out up-sampling treatment on the final output characteristic and the first analysis result to obtain a first up-sampling characteristic and a first up-sampling result, carrying out characteristic splicing treatment on the first up-sampling characteristic, the first up-sampling result and the boundary residual error to obtain an initial splicing characteristic, and carrying out convolution treatment and characteristic refining treatment on the initial splicing characteristic to obtain a second analysis result;
re-executing up-sampling processing, feature splicing processing, convolution processing and feature refining processing on the second analysis result and the boundary residual error until a third analysis result and a fourth analysis result are obtained;
and summarizing the first analysis result, the second analysis result, the third analysis result and the fourth analysis result to obtain a plurality of human body analysis results.
7. The method of claim 1, wherein the performing data enhancement processing on the plurality of drone images comprises:
the data enhancement processing includes image translation, image inversion, random brightness transformation and median filtering.
8. The human body identification method based on the unmanned aerial vehicle image is characterized by comprising the following steps of:
acquiring a human body image to be analyzed;
inputting the human body image to be analyzed into a standard human body analysis network to perform human body analysis processing to obtain a human body analysis result; wherein the standard human body analysis network is trained by the method of any one of claims 1 to 7.
9. Human body analysis network training equipment based on unmanned aerial vehicle image, characterized by comprising:
a memory for storing a program;
a processor for implementing the method of any of claims 1-7 by executing a program stored in the memory.
10. A computer readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the method of any of claims 1-7.
CN202310748727.7A 2023-06-21 2023-06-21 Unmanned aerial vehicle image-based human body analysis network training method and device Pending CN116778528A (en)

Publications (1)

Publication Number Publication Date
CN116778528A true CN116778528A (en) 2023-09-19

Family

ID=87987498


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235672A (en) * 2023-11-14 2023-12-15 北京市科学技术研究院 Comprehensive pipe gallery fault diagnosis method and device based on multi-source data
CN117235672B (en) * 2023-11-14 2024-03-08 北京市科学技术研究院 Comprehensive pipe gallery fault diagnosis method and device based on multi-source data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination