CN112668536A - Lightweight rotating target detection and identification method based on airborne photoelectric video

Info

Publication number
CN112668536A
Authority
CN
China
Prior art keywords
convolution
feature
detection
loss
splitting
Prior art date
Legal status
Granted
Application number
CN202110010819.6A
Other languages
Chinese (zh)
Other versions
CN112668536B (en)
Inventor
李伟 (Li Wei)
黄展超 (Huang Zhanchao)
陶然 (Tao Ran)
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202110010819.6A
Publication of CN112668536A
Application granted
Publication of CN112668536B
Legal status: Active


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight rotating target detection and identification method based on airborne photoelectric video. The method includes constructing a lightweight rotating target detection and identification model in which features are extracted from the photoelectric video image by a feature extraction network improved with a channel splitting-aggregation structure. Using a lightweight deep neural network model, the method can rapidly detect and identify multi-type, multi-scale and multi-direction rotating targets in airborne photoelectric video images, with high detection and identification accuracy and stability and low computational complexity.

Description

Lightweight rotating target detection and identification method based on airborne photoelectric video
Technical Field
The invention relates to the technical field of target detection and identification of airborne photoelectric radars.
Background
At present, methods exist for detecting rotating targets in aerial video images through deep learning. These methods mainly adopt target detection frameworks such as the lightweight mobile neural network (MobileNet). Although the model is made lighter through a one-stage detection framework and depthwise separable convolution, its efficiency is still significantly limited by the large number of stacked depthwise separable convolutions, especially the 1 × 1 pointwise convolutions.
Some prior art, such as Tiny-YOLO, improves calculation efficiency and speed by using a network with fewer convolution layers, but the feature learning ability of the network weakens as the number of layers is reduced, which results in lower detection and identification accuracy. Therefore, a deep neural network model that maintains high target detection and identification accuracy while processing photoelectric video targets in real time on an onboard embedded platform is one of the important requirements in this field.
On the other hand, for detection and identification of rotating targets in video images, conventional methods represent and predict the rotating target in aerial photoelectric video images by rotation-angle regression (the five-parameter method), four-vertex regression (the eight-parameter method) and the like, but these methods still suffer from low positioning accuracy for rotating targets. For example, the angle-regression-based five-parameter method can only predict a rectangle rotated by a specific angle, is difficult to position accurately onto targets with more diversified shapes, and its angle regression has a periodicity problem. Although the vertex-regression-based eight-parameter method can represent a quadrilateral bounding box of arbitrary shape, the degrees of freedom of the bounding-box shape are so high that a deep neural network has difficulty predicting it stably, and the ambiguity of vertex ordering further restricts the improvement of target detection and positioning accuracy.
Summarizing the above prior art, it can be seen that, whether angle-based or vertex-based, these methods have difficulty learning the shape of an object stably and accurately when detecting rotating targets, while detection and identification efficiency still requires great improvement, particularly with respect to the computational complexity of stacked depthwise separable convolutions.
Disclosure of Invention
The invention aims to provide a lightweight rotating target detection and identification method based on airborne photoelectric video which maintains high target detection and identification accuracy, can process photoelectric video targets in real time on an airborne embedded platform, and offers higher identification accuracy, stability and efficiency with lower computational complexity.
The technical scheme of the invention is as follows:
a lightweight rotating target detection and identification method based on an airborne photoelectric video comprises the following steps:
s1: constructing a lightweight rotating target detection and identification model;
s2: transplanting the lightweight rotating target detection and identification model on an embedded platform, and training the lightweight rotating target detection and identification model through an airborne photoelectric video image;
s3: detecting and identifying the rotating target through a detection and identification model obtained after training;
the lightweight rotating target detection and identification model carries out feature extraction on the photoelectric video image through a feature extraction network with an improved channel splitting-aggregation structure to obtain an extracted feature map; the feature extraction network with the improved channel splitting-aggregation structure comprises a plurality of groups of channel splitting-aggregation convolutional layers, each group of channel splitting-aggregation convolutional layers comprises a splitting-aggregation convolutional subnet, and each splitting-aggregation convolutional subnet comprises at least 2 splitting convolutional layers and at least 1 convolutional layer for aggregating the output of the splitting convolutional layers.
In a specific implementation, let the input feature of the channel split-aggregate structure be X_{W×H×C}, where W × H × C represents the dimensions of the input feature and W, H, C represent the width, height and number of channels of the input feature, respectively. The split-convolution-aggregate process of each channel split-aggregate convolutional layer can then be expressed as:
(X^(1)_{W×H×C/2}, X^(2)_{W×H×C/2}) = f_RandSeg(X_{W×H×C})    (1)
wherein f_RandSeg(·) represents the channel random splitting operation, which randomly divides the feature into two groups along the channel direction as the inputs of two parallel convolution branches, and X^(1)_{W×H×C/2}, X^(2)_{W×H×C/2} denote the two groups of features after random grouping, each with dimensions W × H × C/2. The two groups of features are separately convolved as follows:
conv1_{W×H×C/2} = w^(1) * X^(1)_{W×H×C/2} + bias^(1), conv2_{W×H×C/2} = w^(2) * X^(2)_{W×H×C/2} + bias^(2)    (2)
wherein w^(1), bias^(1) represent the convolution kernel and bias term of the first branch convolution, w^(2), bias^(2) represent the convolution kernel and bias term of the second branch convolution, and conv1_{W×H×C/2}, conv2_{W×H×C/2} represent the outputs of the two split parallel convolutions. The target feature extraction efficiency of the two convolution channels is higher, and the structure is specially designed for the subsequent neighborhood dynamic expansion convolution, so that the ordinary convolution can conveniently be further upgraded to the neighborhood dynamic expansion convolution and multi-scale target structural features can be extracted dynamically. After the convolution operation, the two convolution outputs are aggregated and rearranged along the channel direction:
Z_{W×H×C} = g_RandCom(conv1_{W×H×C/2}, conv2_{W×H×C/2})    (3)
wherein Z_{W×H×C} is the output feature of the channel split-aggregate subnet, whose dimensions are consistent with those of the input feature, and g_RandCom(·) is the feature random aggregation function. Connecting channel split-aggregate convolutions with the same feature dimensions yields 1 group of channel split-aggregate convolutional layers.
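For illustration only, a minimal PyTorch sketch of one such split-aggregate convolution subnet, corresponding to equations (1)-(3), is given below; the module name, the use of fixed random permutations for the channel splitting and reordering, and the choice of a grouped 3 × 3 convolution followed by a 1 × 1 convolution in each branch are assumptions made for this sketch rather than requirements of the described structure.

```python
import torch
import torch.nn as nn

class SplitAggregateConv(nn.Module):
    """Sketch of one channel split-aggregate convolution subnet:
    random channel split (eq. 1), two parallel convolution branches (eq. 2),
    aggregation and random channel reordering (eq. 3)."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        # f_RandSeg: a random (but fixed per layer) channel partition
        self.register_buffer("split_perm", torch.randperm(channels))
        # g_RandCom: random reordering of the aggregated channels
        self.register_buffer("merge_perm", torch.randperm(channels))
        # two parallel split-convolution branches (grouped 3x3 + 1x1, assumed)
        self.branch1 = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.Conv2d(half, half, 1),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.Conv2d(half, half, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x[:, self.split_perm]                    # eq. (1): random split
        x1, x2 = torch.chunk(x, 2, dim=1)
        y1, y2 = self.branch1(x1), self.branch2(x2)  # eq. (2): parallel convolutions
        z = torch.cat([y1, y2], dim=1)               # eq. (3): aggregate
        return z[:, self.merge_perm]                 #          and reorder

feat = torch.randn(1, 512, 34, 34)                   # dimensions are preserved
print(SplitAggregateConv(512)(feat).shape)           # torch.Size([1, 512, 34, 34])
```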
According to some embodiments of the invention, the feature extraction network comprises 5 sets of channel split-aggregate convolutional layers.
Preferably, in the channel split-aggregate convolutional layers, the length and width of the feature map obtained after each group of convolutions become half of those of the input to that group.
Preferably, the 3 × 3 convolution in the channel split-aggregate convolutional layer uses grouped convolution to reduce complexity, wherein the number of groups is equal to the number of channels C of the input feature map.
Preferably, the first and second of the 5 groups of channel split-aggregate convolutional layers each contain 4 split-aggregate convolution subnets, and the third, fourth and fifth groups each contain 8 split-aggregate convolution subnets.
Preferably, the feature size of the output of the third group of channel split-aggregate convolutional layers is 68 × 68 × 32, the feature size of the output of the fourth group of channel split-aggregate convolutional layers is 34 × 34 × 96, and the feature size of the output of the fifth group of channel split-aggregate convolutional layers is 17 × 17 × 1280.
According to some embodiments of the invention, the lightweight rotating target detection and identification model performs up-sampling, down-sampling equalization and anti-aliasing processing on the features extracted by the feature extraction network through a feature balancing unit to obtain a feature map after balancing.
Preferably, the feature balancing unit performs up-sampling, down-sampling equalization and anti-aliasing processing on the feature map output by the third, fourth and fifth sets of channel split-aggregate convolutional layers.
Preferably, the feature balancing unit down-samples the feature map output by the third group of channel split-aggregate convolutional layers, up-samples the feature map output by the fifth group of channel split-aggregate convolutional layers after a 1 × 1 convolution, performs a 1 × 1 independent convolution on the feature map output by the fourth group of channel split-aggregate convolutional layers, sums the up-sampled, down-sampled and independently convolved feature maps, and performs anti-aliasing processing on the sum through upper and lower triangular filtering.
Preferably, the down-sampling is performed by a deformable convolution. More preferably, the step size of the deformable convolution is 2.
Preferably, the upsampling is performed by neighbor interpolation. More preferably, the neighbor interpolation is a 2-fold neighbor interpolation.
Preferably, the size of the output feature map of the feature balance unit is 34 × 34 × 512.
According to some embodiments of the present invention, the lightweight rotating target detection and identification model performs feature fusion on the extracted feature map and/or the balanced feature map through a channel splitting-aggregating and neighborhood dynamic expansion convolution unit to obtain a fused feature map.
According to some embodiments of the invention, the channel split-aggregate and neighborhood dynamic expansion convolution units comprise a combination of split-aggregate convolution sub-networks and neighborhood dynamic expansion convolution sub-networks.
According to some embodiments of the invention, the combination is as follows: the neighborhood dynamic expansion convolution sub-network is embedded into the split-aggregate convolution sub-network and replaces the common convolution or the deep separable convolution in the split-aggregate convolution sub-network, obtaining the channel split-aggregate and neighborhood dynamic expansion convolution unit.
Preferably, the split-aggregate convolutional subnetwork is as described above, comprising at least 2 split convolutional layers, and at least 1 convolutional layer that aggregates outputs of the split convolutional layers.
Preferably, the neighborhood dynamic expansion convolution subnet comprises a convolution layer of a dynamic convolution kernel and a dynamic interpolation mechanism.
For example, in a specific implementation, the features are randomly divided into two groups by the splitting structure of the split-aggregate convolution subnet, the two groups of features are respectively input into 2 parallel neighborhood dynamic expansion convolution branches for separate operation, and the two groups of features are then fused using the aggregation design of the split-aggregate convolution subnet. That is, formula (2) becomes:
conv1_{W×H×C/2} = (ξ_1·Dw_1^(1) + ξ_2·Dw_2^(1)) * X^(1)_{W×H×C/2} + bias^(1), conv2_{W×H×C/2} = (ξ_3·Dw_1^(2) + ξ_4·Dw_2^(2)) * X^(2)_{W×H×C/2} + bias^(2)    (4)
wherein Dw represents a convolution kernel of the neighborhood dynamic expansion convolution, and ξ_1, ξ_2, ξ_3, ξ_4 are the dynamic weighting coefficients of the convolution kernels, whose values are dynamically adjusted in accordance with the inference process of the neural network model.
Preferably, the channel splitting-aggregating and neighborhood dynamic expansion convolution unit randomly groups the input 34 × 34 × 512 feature maps into 2 groups of 34 × 34 × 256 feature maps through the splitting-aggregating convolution sub-network, then inputs the 2 groups of feature maps into the neighborhood dynamic expansion convolution sub-network, re-aggregates the output features of the neighborhood dynamic expansion convolution into the 34 × 34 × 512 feature maps, and randomly orders the channel directions to increase feature interaction; wherein, the two groups of neighborhood dynamic expansion convolutions respectively comprise 1 convolution layer with 3 × 3 dynamic convolution kernel and dynamic interpolation mechanism and 1 convolution layer with 1 × 1 × 256.
More preferably, the hyper-parameters of the 3 × 3 neighborhood dynamic expansion convolution, such as the dynamic weighting coefficients ξ_1, ξ_2, ξ_3, ξ_4 of the convolution kernel and the neighborhood range R_m, are obtained through network model learning.
The neighborhood range of the above neighborhood dynamic expansion convolution is:
R_m = R_(m−1) + er·(kernel_m − 1)·∏_{i=1}^{m−1} stride_i    (5)
wherein R_m represents the neighborhood range of the m-th layer convolution, kernel_m denotes the size of the convolution kernel, er denotes the neighborhood dynamic expansion factor, and stride_i indicates the step size of the i-th convolution operation.
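The following sketch illustrates how one neighborhood dynamic expansion convolution branch of this kind might be realised; approximating the dynamically learned expansion range by a softly weighted combination over the candidate expansion factors {2, 4, 6, 8}, and computing the weighting coefficients with a global-pooling plus fully-connected branch, are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class NeighborhoodDynamicExpansionConv(nn.Module):
    """Sketch of one neighborhood dynamic expansion convolution branch:
    a 3x3 convolution whose kernel weighting and expansion (dilation) range
    are adjusted dynamically from the input, followed by a 1x1 convolution.
    Replacing the plain branch convolution of the split-aggregate subnet by
    this module corresponds to the change from equation (2) to equation (4)."""

    def __init__(self, channels: int, expansions=(2, 4, 6, 8)):
        super().__init__()
        # one 3x3 kernel per candidate expansion factor, dynamically weighted
        self.convs = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=e, dilation=e, bias=False)
            for e in expansions
        ])
        self.pointwise = nn.Conv2d(channels, channels, 1)
        # dynamic coefficient branch: global pooling + fully connected layer
        self.coeff = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(expansions)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xi = torch.softmax(self.coeff(x), dim=1)     # dynamic coefficients xi_k
        y = sum(xi[:, k, None, None, None] * conv(x)
                for k, conv in enumerate(self.convs))
        return self.pointwise(y)

branch = NeighborhoodDynamicExpansionConv(256)
print(branch(torch.randn(1, 256, 34, 34)).shape)     # torch.Size([1, 256, 34, 34])
```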
According to some embodiments of the invention, the lightweight rotation target detection and recognition model performs feature learning on the fused feature map through a bounding box shape learning unit to obtain a shape and position feature vector of the target.
Preferably, the learning unit performs bounding box shape learning by diagonal support constraint.
According to some embodiments of the invention, the shape learning process comprises:
performing feature dimensionality reduction on the fused feature map to obtain a feature vector subjected to dimensionality reduction;
and inputting the feature vectors subjected to dimensionality reduction into three parallel convolution feature decoding subnets, and respectively outputting the circumscribed rectangle parameter vector, the inscribed arbitrary quadrilateral parameter vector and diagonal support constraint vectors related to the circumscribed rectangle and the inscribed arbitrary quadrilateral of the target.
More preferably, the shape learning process includes:
firstly, performing feature dimensionality reduction on the fused feature map through a 3 × 3 convolution kernel and a 1 × 1 convolution, and converting features into feature vectors with dimensions of 34 × 34 × 11;
decomposing the obtained feature vector into three parallel convolution feature decoding subnets, the three decoding subnets respectively outputting a 34 × 34 × 5-dimensional circumscribed rectangle parameter vector, a 34 × 34 × 4-dimensional inscribed arbitrary quadrilateral parameter vector and a 34 × 34 × 2-dimensional diagonal support constraint vector of the target, to form the shape and position feature vector of the target.
More preferably, the circumscribed rectangle parameter vector includes: the horizontal and vertical coordinate values of the central point coordinate of the circumscribed rectangle, the width w and the height h of the circumscribed rectangle, and the confidence coefficient c.
More preferably, the parameter vector of the inscribed quadrangle includes: the distances s_1, s_2, s_3, s_4 from the four vertices of the inscribed quadrangle to the four vertices of the circumscribed rectangle.
More preferably, the diagonal support constraint vector includes: the lengths S_13, S_24 of the two diagonal lines, calculated by the following formula:
S_13 = √((v_1x − v_3x)^2 + (v_1y − v_3y)^2), S_24 = √((v_2x − v_4x)^2 + (v_2y − v_4y)^2)    (6)
wherein p and v are respectively the vertex coordinates of the circumscribed rectangle and the inscribed quadrilateral, and the subscript denotes the vertex serial number.
More preferably, the parameters are transformed as follows to obtain a transformed feature vector:
α_1 = s_1/w, α_2 = s_2/h, α_3 = s_3/w, α_4 = s_4/h; β_1 = S_13/h, β_2 = S_24/w    (7)
wherein α_1, α_2, α_3, α_4 represent the ratios of the distances from the four vertices of the inscribed quadrangle to the four vertices of the circumscribed rectangle to the width or height of the circumscribed rectangle, and β_1, β_2 represent the ratios of the diagonal lengths of the arbitrary quadrilateral to the width or height of its circumscribed rectangle.
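As a small worked illustration of the quantities entering equations (6)-(7), the sketch below computes the diagonal supports and vertex distances for a concrete inscribed quadrilateral; the correspondence between each s_i and a particular rectangle vertex, and the choice of width or height used for normalization, are assumptions of this example.

```python
import math

def diagonal_supports(quad):
    """quad: vertices v1..v4 of the inscribed quadrilateral.
    Returns the diagonal support lengths S13 and S24 of equation (6)."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = quad
    return math.hypot(x1 - x3, y1 - y3), math.hypot(x2 - x4, y2 - y4)

def vertex_distances(rect, quad):
    """Distances s1..s4 from the circumscribed-rectangle vertices p1..p4 to the
    corresponding inscribed-quadrilateral vertices v1..v4 (pairing assumed)."""
    return [math.hypot(px - vx, py - vy)
            for (px, py), (vx, vy) in zip(rect, quad)]

# circumscribed rectangle 20 wide x 10 high; inscribed quadrilateral at the
# edge midpoints, so the diagonals equal the rectangle height and width
rect = [(0, 0), (20, 0), (20, 10), (0, 10)]
quad = [(10, 0), (20, 5), (10, 10), (0, 5)]
print(diagonal_supports(quad))        # (10.0, 20.0)
print(vertex_distances(rect, quad))   # [10.0, 5.0, 10.0, 5.0]
# dividing these quantities by the rectangle width or height, as in eq. (7),
# yields the dimensionless intermediate variables alpha_i and beta_j
```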
According to some embodiments of the present invention, the lightweight rotation target detection recognition model performs constraint calculation on the feature vector and/or the transformed feature vector through a quadratic shape constraint unit, and outputs a position parameter and a shape parameter of a target.
Preferably, the quadratic shape constraint unit adopts the following M-Sigmoid function as a constraint function:
M-Sigmoid(q) = min(max(2/(1 + e^(−q)) − 1, 0), 1)    (8)
wherein e represents a natural constant, min and max represent minimum and maximum operations, respectively, and q represents a characteristic parameter of the bounding box described in equation (7).
More preferably, q represents the abscissas and ordinates of the 4 vertices of the bounding box, 8 parameters in total.
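A minimal sketch of such a constraint function is given below; it follows the form assumed for equation (8) (a hard lower bound combined with a soft, asymptotic upper bound), which is a reconstruction rather than a verbatim quotation of the original formula.

```python
import math

def m_sigmoid(q: float) -> float:
    """Quadratic-constraint function with mixed soft and hard boundary
    conditions, per the form assumed for equation (8): value range [0, 1)."""
    soft = 2.0 / (1.0 + math.exp(-q)) - 1.0   # soft branch, approaches 1 from below
    return min(max(soft, 0.0), 1.0)           # hard clipping to the feasible range

print([round(m_sigmoid(q), 3) for q in (-2.0, 0.0, 0.5, 3.0)])
# [0.0, 0.0, 0.245, 0.905]
```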
According to some embodiments of the present invention, the lightweight rotating target detection recognition model obtains accurate position, shape and category information of the target through a post-processing and decoding unit.
According to some embodiments of the invention, the loss function of the lightweight rotating target detection recognition model is:
Loss = Loss_conf + Loss_HBB + Loss_OBB + Loss_cls    (9)
wherein Loss represents the total loss, Loss_conf represents the loss of the confidence c, Loss_HBB represents the regression loss of the circumscribed rectangle, Loss_OBB represents the regression loss of the rotated target, and Loss_cls represents the classification loss, wherein:
Loss_OBB = Σ_{i=1}^{4} smooth-L1(α_i − α_i^gt) + Σ_{i=1}^{2} smooth-L1(β_i − β_i^gt)    (10)
wherein α_i^gt and β_i^gt respectively represent the true values of the training samples, smooth-L1(·) represents the smoothed L1 norm, and α_i and β_i are the intermediate variables of the arbitrary-quadrilateral vertex constraints and diagonal constraints, i.e. α_1, α_2, α_3, α_4 and β_1, β_2 as described hereinbefore.
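The following sketch shows how the rotated-bounding-box regression term and the total loss of equations (9)-(10) might be assembled; equal weighting of the four terms and sum reduction of the smoothed-L1 penalties are assumptions made here.

```python
import torch
import torch.nn.functional as F

def obb_regression_loss(alpha_pred, beta_pred, alpha_gt, beta_gt):
    """Loss_OBB of eq. (10): smoothed-L1 penalties on the vertex-constraint
    variables alpha_1..alpha_4 and the diagonal-constraint variables beta_1,
    beta_2 against their ground-truth values."""
    return (F.smooth_l1_loss(alpha_pred, alpha_gt, reduction="sum")
            + F.smooth_l1_loss(beta_pred, beta_gt, reduction="sum"))

def total_loss(loss_conf, loss_hbb, loss_obb, loss_cls):
    """Eq. (9): confidence, circumscribed-rectangle, rotated-target and
    classification losses are summed."""
    return loss_conf + loss_hbb + loss_obb + loss_cls
```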
The invention has the following beneficial effects:
the invention fully considers the characteristics of various multi-type and multi-direction rotating targets in an airborne photoelectric video image and the requirement of airborne platform computing resource limitation on a light weight neural network, and provides a light weight rotating target detection and identification method based on an airborne photoelectric video, which has high detection and identification efficiency, high accuracy and low computation complexity.
Compared with common convolution, depthwise separable convolution, dilated convolution, dynamic convolution and the like, the convolution with the dynamic sensing range mechanism not only dynamically adjusts the convolution kernel but also dynamically adjusts the sensing expansion range, with the adjustment accomplished through autonomous learning of the convolutional neural network on different input video images, so that the feature extraction process can be adapted more effectively to the target characteristics, improving the feature extraction quality and further improving the target detection and identification accuracy. The convolution structure is organically combined with channel splitting-aggregation and keeps the features aligned: the convolution sits between the splitting and aggregation stages, the convolution outputs of all branches are fused by addition and the channels are then expanded through an identity mapping, so that the computational efficiency of the structure is optimal.
The mechanism unit with the channel splitting aggregation structure and the dynamic sensing range is a lightweight rotating target feature extraction and fusion network model, wherein the channel splitting aggregation structure improves the inference efficiency of a neural network by parallel grouping of random channels, avoids a large amount of 1 multiplied by 1 convolution operations in deep separable convolution, and strengthens the interaction between feature cross channels; the convolution with the dynamic sensing range mechanism not only has the convolution kernel dynamically adjusted and the sensing expansion range dynamically adjusted, can more effectively self-adaptively adjust the characteristic extraction process according to the target characteristics and improve the characteristic extraction quality, but also has the convolution outputs of all branches fused in an addition mode and then subjected to channel expansion through congruent mapping, so that the calculation efficiency is optimal.
Compared with the angle-based rotating target detection method in the prior art, the shape learning unit has higher efficiency without setting a large number of anchoring frames with different angles for assisting regression, avoids the periodic problem of angle regression due to no angle prediction, and can predict quadrangles with any shapes without being limited to rectangles with rotating angles; compared with a regression method based on sliding vertexes or vertexes, the method enables shape prediction to be more stable through a quadrilateral diagonal constraint mode, avoids the problem that target shape regression is difficult to converge and unstable due to too much freedom degree of a direct vertex prediction method, and further obtains higher positioning accuracy.
The constraint unit of the invention uses an M-Sigmoid function with mixed soft and hard boundary conditions, and can carry out secondary constraint on the shape of the bounding box, thereby not only avoiding the problem that the soft boundary function is difficult to converge to a boundary value, but also avoiding the problem that ambiguity exists in the rotating target bounding box solution caused by the hard boundary function having two boundary values, and ensuring that the deep neural network model has more accurate positioning and regression of the shape to the target; the finally constructed model can run quickly on the embedded equipment, and the rotating target in the airborne photoelectric video image can be detected more accurately than the prior art.
Drawings
FIG. 1 is a flow chart of a lightweight rotating target detection and identification method based on an airborne photoelectric video according to the present invention;
FIG. 2 is a diagram of a stacked separable convolution combination in a conventional lightweight neural network model;
FIG. 3 is a structural diagram of a lightweight rotating target detection and identification model constructed in the invention;
FIG. 4 is a diagram of a channel splitting-aggregating and neighborhood dynamic expansion convolution unit structure designed by the present invention;
FIG. 5 is a bounding box shape learning unit based on diagonal support constraints as designed by the present invention;
FIG. 6 is a stacked depth separable convolution unit in the prior art;
fig. 7 is a flowchart illustrating the target detection and identification method for detecting and identifying a lightweight rotating target on an onboard electro-optic video image data set according to an embodiment of the present invention.
Fig. 8 is a diagram of the recognition effect in the embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the following embodiments and the attached drawings, but it should be understood that the embodiments and the attached drawings are only used for the illustrative description of the present invention and do not limit the protection scope of the present invention in any way. All reasonable variations and combinations that fall within the spirit of the invention are intended to be within the scope of the invention.
According to the technical scheme of the invention, a specific implementation manner comprises a detection identification process shown in the attached figure 1, which specifically comprises the following steps:
s1: and constructing a lightweight rotating target detection and identification model.
More specifically, the lightweight rotating target detection and identification model is composed of an image input layer, a channel splitting-aggregation structure improved feature extraction network, a feature balance unit, a channel splitting-aggregation and neighborhood dynamic expansion convolution unit, a bounding box shape learning unit based on diagonal support constraint, a bounding box shape secondary constraint unit and a post-processing unit.
In one embodiment, as shown in fig. 3, the detection recognition model includes:
an input layer for image input;
a channel splitting-aggregation structure improved feature extraction network for performing multi-scale contextual feature extraction on an input image; in more specific implementations, the feature extraction network includes 5 groups of depthwise separable convolutional layers improved by channel split-aggregate convolution (the grouped convolutions use a number of groups equal to the number of input feature channels), where the number of convolutional layers in groups 1 and 2 is 4, the number of convolutional layers in groups 3, 4 and 5 is 8, the length and width of the feature map after each group of convolutional layers become half of those of the previous group's output, the output of group 3 is a 68 × 68 × 32 feature map, the output of group 4 is a 34 × 34 × 96 feature map, and the output of group 5 is a 17 × 17 × 1280 feature map;
the characteristic balance unit is used for performing up-sampling and down-sampling balance and anti-aliasing processing on the output characteristics of groups 3, 4 and 5 of the characteristic extraction network; in more specific implementations, the feature balancing unit is composed of three parts: deformable convolution down-sampling with a step size of 2, 2-fold neighbor interpolation up-sampling, and an upper and lower triangular filtering anti-aliasing layer; the size of the feature map output by the feature balancing module is 34 × 34 × 512;
a channel splitting-aggregating and neighborhood dynamic expansion convolution unit for deep extraction, fusion and dimensionality reduction of the balanced features and for optimizing the complexity of the model; in more specific embodiments, as shown in fig. 4, the unit mainly comprises two parts, a channel split-aggregate structure and neighborhood dynamic expansion convolutions, where the input of the channel split-aggregate structure is a 34 × 34 × 512 feature map, which is randomly grouped into 2 groups of 34 × 34 × 256 feature maps; the 2 groups of feature maps are input into the neighborhood dynamic expansion convolutions, and finally the output features of the neighborhood dynamic expansion convolutions are reassembled into a 34 × 34 × 512 feature map whose channel direction is randomly ordered to increase feature interaction; the hyper-parameters of the 3 × 3 neighborhood dynamic expansion convolution, including 4 convolution kernel coefficients and 2 expansion scale factors, can be learned through the network model and adaptively adjusted according to the input;
a bounding box shape learning unit based on diagonal support constraint for positioning and bounding-box shape prediction of a rotating target in any direction; in some more specific implementations, as shown in fig. 5, the unit includes the following three parts: a 34 × 34 × 5-dimensional non-rotated circumscribed rectangle shape vector, in which the dimensions respectively represent the confidence of the rectangular bounding box centered at each sampling point, the center point coordinates (x, y) and the width and height of the rectangle; a 34 × 34 × 4-dimensional inscribed rotated quadrilateral shape vector, in which the dimensions respectively represent the 4 distances from the four vertices of the inscribed quadrilateral to the four vertices of the circumscribed rectangle; and a 34 × 34 × 2-dimensional rotated bounding-box constraint based on the diagonal support constraint, in which the dimensions respectively represent the solution of the inscribed-quadrilateral diagonal support constraint equation; the final output is a 34 × 34 × 8-dimensional decoded quadrilateral feature vector;
a quadratic constraint unit for performing quadratic constraint on the shape of the target rotated in any direction represented by the decoded quadrilateral eigenvector; in some more specific implementations, the constraint function uses a hybrid Sigmoid function;
the target detection, identification and post-processing module is used for performing non-maximum suppression, filtering, screening, representation and output on the predicted target bounding boxes; in some more specific implementations, post-processing includes non-maximum suppression, confidence threshold filtering, and detected-target representation and visualization.
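Viewed end to end, the detection and identification model can be assembled as sketched below; the class and argument names are placeholders chosen for illustration, and only the feature-map sizes stated above are taken from the description.

```python
import torch.nn as nn

class LightweightRotatingDetector(nn.Module):
    """Schematic assembly of the units described above."""

    def __init__(self, backbone, balance, fusion, shape_head):
        super().__init__()
        self.backbone = backbone      # 5 groups of channel split-aggregate layers
        self.balance = balance        # up/down-sampling balance + anti-aliasing
        self.fusion = fusion          # split-aggregate + neighborhood dynamic
                                      # expansion convolution units
        self.shape_head = shape_head  # diagonal-support bounding-box learning

    def forward(self, image):
        c3, c4, c5 = self.backbone(image)    # 68x68x32, 34x34x96, 17x17x1280
        balanced = self.balance(c3, c4, c5)  # 34x34x512
        fused = self.fusion(balanced)        # 34x34x512
        return self.shape_head(fused)        # 34x34x(5+4+2) shape/position vectors
```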
The detection and identification process of the target by the detection and identification model is as follows:
and S11, inputting the airborne photoelectric platform aviation video image through the input layer.
Wherein, the aerial video image can be collected through an airborne photoelectric sensor and the like.
S12, the input video image passes through the different layers of the feature extraction network improved with 5 groups of channel splitting-aggregation structures, which realize random grouping of features, splitting along the channel direction, feature extraction of multi-type and/or multi-direction rotating targets, feature aggregation and random reordering of features;
according to some of the above more specific embodiments, where the number of convolutional layers of the 1 st and 2 nd groups is 4, the number of convolutional layers of the 3 rd, 4 th and 5 th groups is 8, the length and width of the input image after each group of convolutional layers becomes half of the output of the previous group, the output of the 3 rd group is 68 × 68 × 32 feature map, the output of the 4 th group is 34 × 34 × 96 feature map, and the output of the 5 th group is 17 × 17 × 1280 feature map; finally, the characteristics output by the three layers are input into the characteristic balancing unit of the next step for processing.
S13, carrying out up-down sampling balance processing on the three groups of characteristics output by the characteristic extraction network;
according to some of the above more specific embodiments, the 68 × 68 × 32 features output from group 3 are down-sampled by a deformable convolution with a step size of 2 to obtain 34 × 34 × 32 features, and channel expansion is then performed with a 1 × 1 convolution to obtain 34 × 34 × 512 features; the 34 × 34 × 96 features output from group 4 are directly expanded to 34 × 34 × 512 dimensions by a 1 × 1 convolution; the 17 × 17 × 1280 feature map of group 5 is up-sampled, first performing feature dimensionality reduction with a 1 × 1 convolution to obtain 17 × 17 × 512 features and then expanding the dimensions to 34 × 34 × 512 by neighbor interpolation; finally, the three groups of balanced features are superposed and fused, and jagged artifacts are removed (anti-aliasing) through upper and lower triangular filtering, the output feature dimension still being 34 × 34 × 512.
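A possible realisation of this balancing step is sketched below; a plain stride-2 convolution stands in for the deformable convolution, and a fixed 3 × 3 triangular (binomial) blur stands in for the upper and lower triangular anti-aliasing filtering, both of which are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBalance(nn.Module):
    """Sketch of the feature balancing unit of step S13 (group-3 down-sampling,
    group-4 channel expansion, group-5 reduction + up-sampling, fusion and
    anti-aliasing). Channel counts follow the sizes stated in the text."""

    def __init__(self, c3=32, c4=96, c5=1280, out=512):
        super().__init__()
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)  # 68x68 -> 34x34
        self.expand3 = nn.Conv2d(c3, out, 1)
        self.expand4 = nn.Conv2d(c4, out, 1)
        self.reduce5 = nn.Conv2d(c5, out, 1)
        t = torch.tensor([1.0, 2.0, 1.0])
        kernel = (t[:, None] * t[None, :]) / 16.0                # triangular blur
        self.register_buffer("blur", kernel.expand(out, 1, 3, 3).clone())

    def forward(self, f3, f4, f5):
        b3 = self.expand3(self.down3(f3))                        # 34x34x512
        b4 = self.expand4(f4)                                    # 34x34x512
        b5 = F.interpolate(self.reduce5(f5), scale_factor=2,
                           mode="nearest")                       # 34x34x512
        fused = b3 + b4 + b5                                     # superpose and fuse
        return F.conv2d(fused, self.blur, padding=1,
                        groups=self.blur.shape[0])               # anti-aliasing

fb = FeatureBalance()
out = fb(torch.randn(1, 32, 68, 68), torch.randn(1, 96, 34, 34),
         torch.randn(1, 1280, 17, 17))
print(out.shape)  # torch.Size([1, 512, 34, 34])
```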
S14, inputting the balanced characteristics into 5 groups of designed channel splitting-aggregating and neighborhood dynamic expansion convolution units;
according to some embodiments described above in more detail, as shown in fig. 4, the 34 × 34 × 512 feature maps input to the channel split-aggregate structure are first grouped randomly into 2 groups of 34 × 34 × 256 feature maps and input into the neighborhood dynamic expansion convolution units. Each neighborhood dynamic expansion convolution is composed of one 3 × 3 neighborhood dynamic expansion convolution and one 1 × 1 convolution, whose convolution kernel and neighborhood dynamic expansion parameters (4 kernel coefficients and 2 expansion scale factors) can be obtained through network model learning and adaptively adjusted according to the input to match different target conditions, thereby improving the feature extraction capability. Finally, the output features of the neighborhood dynamic expansion convolution units are aggregated into 34 × 34 × 512 features again, and the channel direction is randomly ordered to increase feature interaction.
(1) According to the model structure, the computational complexity of the channel splitting-aggregation and neighborhood dynamic expansion convolution unit is:
FLOPs_CSA = 2 × (W·H·(C/2)·3·3 + W·H·(C/2)·(C/2)) + FLOPs_random    (11)
wherein W, H, C represent the width, height, and number of channels, respectively, of the input feature map, FLOPs_CSA represents the computational complexity of the proposed channel splitting-aggregation structure, and FLOPs_random represents the computational complexity of the random channel reordering and aggregation; as in the following examples, W = 34, H = 34, C = 512.
The structure is used for replacing a stacking depth separable convolution unit shown in fig. 6 in the prior lightweight network technology, and the replaced parts are located at two positions of a feature extraction network improved by a channel splitting-aggregation structure and a channel splitting-aggregation and neighborhood dynamic expansion convolution unit in fig. 3, so that the improvement of the prior network is realized.
The complexity of the stacked depthwise separable convolution replaced by this improvement, for the same input feature of the same size (W = 34, H = 34, C = 512), is:
FLOPs_DSC = W·H·C·3·3 + 2 × W·H·C·C    (12)
it can be seen by comparison that the computational complexity of the proposed channel split-aggregate and neighborhood dynamic expansion convolution unit is much lower than the prior art stack depth separable convolution unit. The proposed channel split-aggregate and neighborhood dynamic expansion convolution unit differs from the deep separable convolution and has the advantages that: random channel parallel grouping is carried out on the characteristic diagram, the inference efficiency of the neural network is improved by parallel calculation, the complexity is lower, a large amount of 1 multiplied by 1 convolution operation in deep separable convolution is avoided, and the design that 1 multiplied by 1 convolution is connected before and after 3 multiplied by 3 convolution in the channel shuffling network is not needed, so that the efficiency is improved; meanwhile, the random arrangement and aggregation of the channels enhance the interaction of the features among the channels, which is beneficial to improving the feature extraction capability of the network and structurally provides feature alignment guarantee.
(2) The neighborhood range of the neighborhood dynamic expansion convolution in this unit can be expressed as:
R_m = R_(m−1) + er·(kernel_m − 1)·∏_{i=1}^{m−1} stride_i    (13)
In the formula, R_m denotes the perceptual range of the m-th layer convolution, with m = 4 + 4 + 8 + 8 = 24 in the following embodiments; kernel_m represents the size of the convolution kernel, taken as kernel_m = 3 in the following example; er denotes the neighborhood dynamic expansion factor, which can be dynamically adjusted through model learning; and stride_i represents the step size of the i-th convolution operation, taken as stride_i = 1 in the following embodiment.
As shown in fig. 4, in the constructed channel splitting-aggregating and neighborhood dynamic expansion convolution unit, in addition to the channel splitting-aggregating structure and two convolution branches, there is a dynamic expansion factor learning branch, which is composed of 1 layer of global pooling layer and 1 layer of full-link layer, and the convolution expansion range size is learned through a neural network, for example, as embodied by a series of combination parameters er, whose range is 2, 4, 6, 8 in the following embodiments, and can be dynamically adjusted according to different input video images.
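As a worked evaluation of the neighborhood range, the snippet below applies the recurrence assumed for equations (5)/(13) to the configuration stated above (m = 24 layers, 3 × 3 kernels, stride 1); the recurrence form itself is a reconstruction and not quoted from the original.

```python
def neighborhood_range(num_layers=24, kernel=3, stride=1, er=2):
    """R_m = R_{m-1} + er * (kernel_m - 1) * prod(stride_i), assumed form."""
    r, stride_prod = 1, 1
    for _ in range(num_layers):
        r += er * (kernel - 1) * stride_prod
        stride_prod *= stride
    return r

print(neighborhood_range())            # 97 pixels with er = 2
print(neighborhood_range(er=8))        # 385 pixels with er = 8
```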
S15, positioning the characteristics output by the channel splitting-aggregating and neighborhood dynamic expansion convolution unit and predicting the shape of the bounding box;
according to some embodiments described above, as shown in fig. 3, the feature dimension output after the 4 groups of channel split-aggregate and neighborhood dynamic expansion convolution units is still 34 × 34 × 512. This feature is input into the bounding box shape learning unit based on diagonal support constraints, and a quadrilateral in an arbitrary direction is predicted for delineating the detected target. The unit first performs feature dimensionality reduction through a 3 × 3 convolution kernel and a 1 × 1 convolution, transforming the features into feature vectors of 34 × 34 × 11 dimensions, which are decomposed into three parallel convolution feature decoding branches, as shown in fig. 5. The three branches respectively output circumscribed rectangle parameter vectors of 34 × 34 × 5 dimensions (i.e., the feature map compressed by the preceding series of feature extraction steps is 34 × 34 pixels, and 5 parameters in total represent each circumscribed rectangle), inscribed arbitrary quadrilateral parameter vectors of 34 × 34 × 4 dimensions (i.e., 4 parameters in total represent each inscribed quadrilateral), and diagonal support constraint vectors of 34 × 34 × 2 dimensions (i.e., 2 parameters in total represent the diagonal support constraint of each inscribed quadrilateral). In particular, the parameters in these vectors are as follows. The 5 parameters of each circumscribed rectangle are the center point coordinates (x, y), the width w and height h of the rectangle, and the confidence c. The 4 parameters of each inscribed quadrilateral are the distances s_1, s_2, s_3, s_4 from the four vertices of the quadrilateral to the four vertices of the circumscribed rectangle; geometric calculations can therefore be performed based on these parameters to uniquely define the quadrilateral shape of the bounding box, as shown in fig. 5. In order to make the shape prediction more stable, the diagonal support constraint is additionally predicted, i.e. the two diagonal lengths S_13, S_24 of the inscribed arbitrary quadrilateral calculated from the four vertex coordinates. The calculation process is as follows:
S_13 = √((v_1x − v_3x)^2 + (v_1y − v_3y)^2), S_24 = √((v_2x − v_4x)^2 + (v_2y − v_4y)^2)    (14)
In the formula, p and v are respectively the vertex coordinates of the circumscribed rectangle and the inscribed quadrilateral.
In order to make the learning process more stable and more easily and accurately predict the position of the measured target, the predicted value is transformed as follows, so that the range of the value range is reduced:
α_1 = s_1/w, α_2 = s_2/h, α_3 = s_3/w, α_4 = s_4/h; β_1 = S_13/h, β_2 = S_24/w    (15)
In the formula, α_1, α_2, α_3, α_4 represent the ratios of the distances from the four vertices of the inscribed quadrangle to the four vertices of the circumscribed rectangle to the width or height of the circumscribed rectangle, with a specific schematic shown in FIG. 5; β_1, β_2 represent the ratios of the diagonal lengths of the arbitrary quadrilateral to the width or height of its circumscribed rectangle.
Therefore, the model of the present invention does not perform direct shape parameter prediction; instead it predicts the above-mentioned intermediate variables α_1, α_2, α_3, α_4 and β_1, β_2, whose range is limited to (0, 1), which eliminates the influence of dimension and reduces the range of the regression feasible region, so that the regression is more stable. Finally, the representation parameters of the quadrilateral bounding box, i.e. the coordinates of the four vertices (8 parameters in total), are calculated from these parameters and the above formula.
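A small sketch of the geometric decoding described in this step is given below; the gliding-vertex-style convention that each inscribed-quadrilateral vertex lies on one side of the circumscribed rectangle, offset by α times that side's length, is an assumption of this illustration.

```python
def decode_quadrilateral(cx, cy, w, h, alphas):
    """Decode the inscribed quadrilateral vertices from the circumscribed
    rectangle (cx, cy, w, h) and the intermediate variables alpha_1..alpha_4
    (vertex-placement convention assumed)."""
    a1, a2, a3, a4 = alphas
    x0, y0 = cx - w / 2, cy - h / 2          # top-left rectangle corner
    x1, y1 = cx + w / 2, cy + h / 2          # bottom-right rectangle corner
    v1 = (x0 + a1 * w, y0)                   # on the top edge
    v2 = (x1, y0 + a2 * h)                   # on the right edge
    v3 = (x1 - a3 * w, y1)                   # on the bottom edge
    v4 = (x0, y1 - a4 * h)                   # on the left edge
    return [v1, v2, v3, v4]

# alphas of 0.5 recover the diamond inscribed in a 20 x 10 rectangle
print(decode_quadrilateral(0, 0, 20, 10, (0.5, 0.5, 0.5, 0.5)))
# [(0.0, -5.0), (10.0, 0.0), (0.0, 5.0), (-10.0, 0.0)]
```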
And S16, constraining the output predicted value through a constraint unit to obtain the positioned target position parameter and the positioned shape parameter.
Wherein the constraint function provided by the constraint unit is selected as the M-Sigmoid function, which can be used to apply a secondary constraint to the predicted results α_1, α_2, α_3, α_4 and β_1, β_2. The M-Sigmoid function may be specifically set as:
M-Sigmoid(q) = min(max(2/(1 + e^(−q)) − 1, 0), 1)    (16)
wherein e is the mathematical natural constant, with a value of about 2.71828; min and max respectively denote the operations of taking the minimum and maximum value; and q is a representation parameter of the quadrilateral bounding box in step S15, defining the coordinates of the four vertices of the target bounding box, 8 parameters in total. The 8 parameters are subjected to the calculation of formula (16), which applies a secondary constraint to the parameter values and overcomes the boundary-value and ordering-ambiguity problems of conventional vertex-regression-based rotating target detection methods. The position parameters and shape parameters of the target are finally output.
And S17, processing the positioned target position parameters and shape parameters through the target detection recognition and post-processing module, and outputting an effective recognition result.
The post-processing can include non-maximum suppression and threshold screening of the confidence c, so as to filter erroneous and redundant results and output valid target detection and identification results. First, the confidence scores of all predicted bounding boxes are sorted, and the highest score and its corresponding bounding box are selected; then, the remaining boxes are traversed, and if the overlap ratio of a box with the current highest-scoring box is larger than a threshold value (set to 0.5), that box is deleted; afterwards, the unprocessed box with the highest score is again selected and the process is repeated until the iteration ends, and the remaining boxes are the finally output bounding boxes for delimiting the targets.
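The post-processing described above can be sketched as follows; the quadrilateral overlap function iou_fn (e.g. a polygon intersection-over-union) is left abstract here, and the greedy loop structure is an assumption consistent with the steps listed in the text.

```python
def post_process(boxes, scores, iou_fn, conf_threshold=0.5, iou_threshold=0.5):
    """Confidence-threshold filtering followed by greedy non-maximum
    suppression over predicted quadrilateral bounding boxes (step S17)."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= conf_threshold]
    candidates.sort(key=lambda item: item[0], reverse=True)  # sort by confidence
    kept = []
    while candidates:
        score, best = candidates.pop(0)                      # highest remaining score
        kept.append((score, best))
        candidates = [(s, b) for s, b in candidates
                      if iou_fn(best, b) <= iou_threshold]   # drop heavy overlaps
    return kept
```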
S2: transplanting the detection recognition model on an embedded platform, and training the detection recognition model through the aerial video image.
The embedded platform can be TX2 or AGX Xavier, and the aerial video image can be collected by an onboard photoelectric sensor.
The specific training process can be shown in fig. 7, which includes: firstly, carrying out data annotation and preprocessing on an acquired aerial video image and constructing a training data set; secondly, loading a test image, processing the input image by adopting the specific model structures such as the splitting-aggregating structure improved feature extraction network, the feature balance unit, the channel splitting-aggregating and neighborhood dynamic expansion convolution unit and the like in the steps S12-S14, and extracting the image features; then, the features are input into the bounding box shape learning unit and the object shape constraint unit based on diagonal support constraint described in steps S15-S16, the feature vectors are decoded and expressed to obtain bounding box position and shape parameters of the bounding object, these parameters are compared with the labeled a priori labels when constructing the data set, and their errors (i.e., losses) are calculated. Wherein the loss function is further configured to:
Loss = Loss_conf + Loss_HBB + Loss_OBB + Loss_cls    (17)
wherein Loss represents the total loss, Loss_conf represents the loss of the confidence c, Loss_HBB represents the loss of the circumscribed rectangle (i.e., the error between the circumscribed rectangle center point (x, y), width w and height h described above and the true values), Loss_OBB represents the regression loss of the rotated target (i.e., the error between the inscribed quadrilateral shape parameters described above and the true values), and Loss_cls represents the target recognition classification loss, and:
Loss_OBB = Σ_{i=1}^{4} smooth-L1(α_i − α_i^gt) + Σ_{i=1}^{2} smooth-L1(β_i − β_i^gt)    (18)
wherein α_i^gt and β_i^gt respectively represent the true values of the training samples, smooth-L1(·) represents the smoothed L1 norm, and α_i and β_i are the intermediate variables of the arbitrary-quadrilateral vertex constraints and diagonal constraints, respectively representing the ratios of the distances from the four vertices of the inscribed quadrangle to the four vertices of the circumscribed rectangle to the width or height of the circumscribed rectangle, and the ratios of the diagonal lengths of the arbitrary quadrilateral to the width or height of the circumscribed rectangle.
And continuously iterating the training until the loss function converges.
In the following examples, the training parameters are set as follows: the maximum number of training rounds is 100, the initial learning rate is 1 × 10^-4 and the final learning rate is 1 × 10^-6; during training, the learning rate is adjusted using the Adam iterative algorithm with cosine annealing, the threshold of the overlap ratio between the target bounding box and the true value is set to 0.5, and the non-maximum suppression threshold is set to 0.5.
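The training schedule stated above can be configured as in the following sketch; the model, data pipeline and composite loss of equation (17) are assumed to be defined elsewhere, and the per-iteration scheduler granularity is an assumption.

```python
import torch

def build_optimizer_and_scheduler(model, epochs=100, iters_per_epoch=500):
    """Adam with cosine-annealed learning rate from 1e-4 down to 1e-6."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # initial rate
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * iters_per_epoch, eta_min=1e-6)   # final rate
    return optimizer, scheduler
```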
S3: and detecting and identifying the rotating target through the trained detection and identification model.
Example 1
15 types of rotating targets in airborne visible light images are detected and identified on an NVIDIA Xavier embedded development board, the rotating targets comprising playgrounds, roundabouts, oil tanks, ships, airplanes, bridges, ports, swimming pools, airports, tennis courts, basketball courts, automobiles, helicopters, railway stations and football fields; the implementation process is shown in the attached figure 7 and comprises the following steps:
step 1: acquiring an aerial photography target image, performing data annotation and data preprocessing, and constructing a training data set; the data is marked by adopting a four-vertex method, the data is marked in a clockwise sequence, and the data is preprocessed by adopting enhancement methods such as image scale scaling, denoising, deblurring, random color dithering and the like, so that the robustness of the model to the environmental noise interference is improved, and the generalization capability of the model is improved;
step 2: according to the embodiment, the neural network structure is provided with a channel splitting-aggregation and neighborhood dynamic expansion convolution unit; establishing a lightweight rotating target feature extraction and fusion network model, and extracting multi-type and multi-direction rotating target features in a video image; for example, the dimension of an input feature map of the structure is 34 × 34 × 512, two groups of feature maps with 34 × 34 × 256 dimensions are obtained through channel splitting processing, features dynamically matched with a target are obtained through 3 × 3 neighborhood dynamic expansion convolution and 1 × 1 convolution respectively, when the target is large, the expansion ratio of the dynamic expansion convolution is automatically increased, and when the target is small, the expansion ratio is correspondingly reduced, so as to match features with different shapes and sizes of different types of targets, the features are aggregated and randomly ordered in the channel direction, so that feature interaction is increased, and the output feature is still 34 × 34 × 512;
and step 3: learning more stably the target position predicted by the unit, the shape of the bounding box bounding the target, from the multi-type, multi-directional rotating target features extracted by the deep neural network according to the embodiments, by the bounding box shape based on diagonal support constraints; for example, when the target is a bridge, the shape of the target approaches a strip shape and may have different directions, and when the vertex of the strip target deviates a little distance, a relatively large deformation may be caused, but with the proposed bounding box shape learning unit based on the diagonal support constraint, after obtaining the coordinates of the vertex of the bounding box, the bounding box is not directly located, the diagonal support constraint of the bridge is continuously predicted, the error caused by the single predicted vertex is compensated according to the length of the diagonal to limit the degree of deformation, and finally the bounding box limited by the constraint is output for bounding the target, which improves the stability of prediction.
And 4, step 4: according to the M-Sigmoid function of the mixed soft and hard boundary condition of the specific embodiment, the prediction result output by the bounding box boundary positioning and shape learning module is restricted, the position and shape parameters of the target output by the neural network model are corrected, and the target is identified according to the positioned target feature; because the neural network prediction is difficult to converge to the boundary values 0 or 1, the neural network prediction can more easily converge to the boundary values after secondary constraint, namely, the target rotation angle is 0 degrees, and the bounding box of the target is a positive rectangle; furthermore, considering the case where the four vertices are 1234 and 2341 in order, the represented rectangles are the same rectangle, but the boundary values of one case are 0 and the other is 1, which creates ambiguity in understanding, and the proposed quadratic constraint M-Sigmoid has only one-sided boundary value because the range of the value range is [0,1 ], so that this can be avoided.
And step 5: according to the specific implementation mode, a new loss function is constructed comprising four parts, namely target confidence loss, circumscribed rectangle regression loss, rotating target boundary regression loss and target classification loss. The data set is loaded to train the target detection and identification model, and iteration continues in a loop until the loss function converges, obtaining the lightweight rotating target detection and identification model. The maximum number of training rounds is 100 (e.g., when the training samples are 6000 images and the batch size is 12, 500 iterations are required to complete one training round), the initial learning rate is 1 × 10^-4 and the final learning rate is 1 × 10^-6; during training, the learning rate is adjusted using the Adam iterative algorithm with cosine annealing, the threshold of the overlap ratio between the target bounding box and the true value is set to 0.5, and the non-maximum suppression threshold is set to 0.5.
Step 6: the lightweight rotating target detection and identification model based on the airborne photoelectric video is transplanted onto the embedded platform AGX Xavier, the lightweight model is trained using the aerial video images acquired by the airborne photoelectric sensor, and the model is loaded to detect and identify the rotating targets.
The target detection and identification results are shown in fig. 8 for typical airborne photoelectric video images, including airport scenes, ocean scenes, parking lot scenes and the like, containing targets such as airplanes, automobiles and ships that need to be detected and identified by an airborne photoelectric radar. The targets framed by directional arbitrary-quadrilateral bounding boxes in the figure are the targets detected and recognized by the lightweight rotating target detection and identification method based on airborne photoelectric video. As can be seen from the figure, the rotated quadrilateral bounding boxes in arbitrary directions predicted by the model can accurately locate and frame the detected targets, with high detection accuracy, and the method is suitable for multi-type, multi-direction and multi-scale target detection and identification tasks under various scenes and conditions.
The detection and identification effects of the method of the invention and the MobileNet + YOLO method in the prior art on the data set of the embodiment are compared, and the comparison of the detection precision is shown in the following table 1:
TABLE 1 Experimental results of different identification methods on airborne aerial image data sets
Therefore, the detection and identification method has higher detection precision.
The above examples are merely preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples. All technical schemes belonging to the idea of the invention belong to the protection scope of the invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention, and such modifications and embellishments should also be considered as within the scope of the invention.

Claims (10)

1. A lightweight rotating target detection and identification method based on an airborne photoelectric video, characterized in that the method comprises the following steps:
s1: constructing a lightweight rotating target detection and identification model;
s2: transplanting the lightweight rotating target detection and identification model on an embedded platform, and training the lightweight rotating target detection and identification model through an airborne photoelectric video image;
s3: detecting and identifying the rotating target through a detection and identification model obtained after training;
the lightweight rotating target detection and identification model carries out feature extraction on the photoelectric video image through a feature extraction network with an improved channel splitting-aggregation structure to obtain an extracted feature map; the feature extraction network with the improved channel splitting-aggregation structure comprises a plurality of groups of channel splitting-aggregation convolutional layers, each group of channel splitting-aggregation convolutional layers comprises a splitting-aggregation convolutional subnet, and each splitting-aggregation convolutional subnet comprises at least 2 splitting convolutional layers and at least 1 convolutional layer for aggregating the output of the splitting convolutional layers.
2. The detection and identification method according to claim 1, wherein: the feature extraction network comprises 5 groups of channel split-aggregate convolution layers, wherein the first group and the second group each comprise 4 split-aggregate convolution subnets, and the third, fourth and fifth groups each comprise 8 split-aggregate convolution subnets.
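A minimal PyTorch sketch of one split-aggregate convolution subnet as worded in claims 1 and 2 is given below: the input channels are split into two branches, each branch is convolved separately, and one further convolution aggregates the concatenated branch outputs. The kernel sizes, channel widths and the 4/4/8/8/8 stage arrangement shown are illustrative assumptions rather than the patented configuration.

import torch
import torch.nn as nn

class SplitAggregateBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # two "split" convolution branches, one per channel half
        self.branch_a = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.branch_b = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        # one convolution that aggregates the outputs of the split branches
        self.aggregate = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        xa, xb = torch.chunk(x, 2, dim=1)        # channel split
        y = torch.cat([self.branch_a(xa), self.branch_b(xb)], dim=1)
        return self.aggregate(y)                 # channel aggregation

# illustrative backbone with five groups of 4, 4, 8, 8 and 8 subnets
backbone = nn.Sequential(*[
    nn.Sequential(*[SplitAggregateBlock(64) for _ in range(n)])
    for n in (4, 4, 8, 8, 8)
])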
3. The detection and identification method according to claim 1, wherein: the lightweight rotating target detection and identification model performs up-sampling balancing, down-sampling balancing and anti-aliasing processing on the features extracted by the feature extraction network through a feature balancing unit to obtain a balanced feature map;
preferably, the feature balancing unit down-samples the feature map output by the third group of channel split-aggregate convolutional layers, up-samples the feature map output by the fifth group of channel split-aggregate convolutional layers after a 1 × 1 convolution, applies an independent 1 × 1 convolution to the feature map output by the fourth group of channel split-aggregate convolutional layers, sums the up-sampled, down-sampled and independently convolved feature maps, and performs anti-aliasing on the sum through upper and lower triangular filtering.
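The sketch below illustrates one possible reading of the feature balancing unit of claim 3: the third-group feature map is down-sampled, the fifth-group map is 1 × 1 convolved and up-sampled, the fourth-group map receives an independent 1 × 1 convolution, the three are summed, and the sum is smoothed. The "upper and lower triangular filtering" is approximated here by a fixed 3 × 3 triangular (tent) blur kernel purely for illustration, and all channel counts are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBalance(nn.Module):
    def __init__(self, c3=128, c4=256, c5=512, out=256):
        super().__init__()
        self.down3 = nn.Conv2d(c3, out, 3, stride=2, padding=1)   # down-sample group-3 features
        self.lat4 = nn.Conv2d(c4, out, 1)                         # independent 1x1 conv on group 4
        self.lat5 = nn.Conv2d(c5, out, 1)                         # 1x1 conv before up-sampling group 5
        tent = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        self.register_buffer("blur", tent.expand(out, 1, 3, 3).clone())
        self.out = out

    def forward(self, f3, f4, f5):
        p4 = self.lat4(f4)
        p3 = self.down3(f3)
        p5 = F.interpolate(self.lat5(f5), size=p4.shape[-2:], mode="nearest")
        fused = p3 + p4 + p5                                       # sum the three balanced maps
        # stand-in anti-aliasing: depthwise blur with the triangular kernel
        return F.conv2d(fused, self.blur, padding=1, groups=self.out)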
4. The detection and identification method according to claim 3, wherein: the lightweight rotating target detection and identification model performs feature fusion on the extracted feature map and/or the balanced feature map through a channel split-aggregate and neighborhood dynamic dilation convolution unit to obtain a fused feature map, wherein the channel split-aggregate and neighborhood dynamic dilation convolution unit comprises a split-aggregate convolution subnet and a neighborhood dynamic dilation convolution subnet; the unit is obtained by embedding the neighborhood dynamic dilation convolution subnet into the split-aggregate convolution subnet to replace the ordinary convolution or the depthwise separable convolution therein; and the neighborhood dynamic dilation convolution subnet comprises a convolution layer with a dynamic convolution kernel and a dynamic interpolation mechanism.
5. The detection and identification method according to claim 4, wherein: the channel split-aggregate and neighborhood dynamic dilation convolution unit comprises 2 groups of neighborhood dynamic dilation convolution subnets, each of which comprises 1 convolution layer with a 3 × 3 dynamic convolution kernel and a dynamic interpolation mechanism and 1 convolution layer of 1 × 1 × 256, and the hyper-parameters of the 3 × 3 dynamic convolution kernel are learned by the network model.
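The dynamic-kernel convolution with a dynamic interpolation mechanism in claims 4 and 5 is not specified beyond this wording; as a hedged stand-in, the sketch below predicts per-position sampling offsets from the input and applies torchvision's deformable convolution, which bilinearly interpolates features at the offset positions, followed by a 1 × 1, 256-channel aggregation convolution. This is only an approximation of the claimed unit, and the channel counts are illustrative.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DynamicDilationConv(nn.Module):
    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        # 2 offsets (x, y) for each of the 9 taps of the 3 x 3 kernel = 18 channels
        self.offset_pred = nn.Conv2d(in_ch, 18, 3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, 3, padding=1)
        self.project = nn.Conv2d(out_ch, 256, 1)    # 1 x 1 aggregation with 256 channels

    def forward(self, x):
        offsets = self.offset_pred(x)               # input-dependent sampling neighbourhood
        return self.project(self.deform(x, offsets))

y = DynamicDilationConv()(torch.randn(1, 256, 32, 32))   # shape sanity check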
6. The detection and identification method according to claim 4, wherein: the lightweight rotating target detection and identification model performs feature learning on the fused feature map through a bounding box shape learning unit to obtain shape and position feature vectors of the target, wherein the bounding box is a quadrilateral bounding box and the shape learning process comprises:
performing feature dimensionality reduction on the fused feature map to obtain a feature vector subjected to dimensionality reduction;
and inputting the feature vectors subjected to dimensionality reduction into three parallel convolution feature decoding subnets, and respectively outputting the circumscribed rectangle parameter vector, the inscribed arbitrary quadrilateral parameter vector and diagonal support constraint vectors related to the circumscribed rectangle and the inscribed arbitrary quadrilateral of the target.
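A minimal sketch of the bounding box shape learning heads described in claims 6 and 7 follows: after a dimensionality-reducing convolution, three parallel convolution decoders predict, per location, the circumscribed-rectangle vector (centre, width, height, confidence), the four vertex distances of the inscribed quadrilateral, and the two diagonal support lengths. The 1 × 1 decoders and channel widths are assumptions made for brevity.

import torch
import torch.nn as nn

class ShapeLearningHeads(nn.Module):
    def __init__(self, in_ch=256, mid=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid, 1)   # feature dimensionality reduction
        self.rect_head = nn.Conv2d(mid, 5, 1)    # cx, cy, w, h, confidence c
        self.quad_head = nn.Conv2d(mid, 4, 1)    # s1, s2, s3, s4
        self.diag_head = nn.Conv2d(mid, 2, 1)    # S13, S24

    def forward(self, x):
        f = torch.relu(self.reduce(x))
        return self.rect_head(f), self.quad_head(f), self.diag_head(f)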
7. The detection and identification method according to claim 6, wherein: the circumscribed rectangle parameter vector comprises: the horizontal and vertical coordinates of the center point of the circumscribed rectangle, the width w and the height h of the circumscribed rectangle, and the confidence c; and/or the inscribed quadrilateral parameter vector comprises: the distances s1, s2, s3, s4 from the four vertices of the inscribed quadrilateral to the four vertices of the circumscribed rectangle; and/or the diagonal support constraint vector comprises: the lengths S13, S24 of the two diagonals, calculated by the following formula:
Figure FDA0002885013010000021
wherein p and v denote the vertex coordinates of the circumscribed rectangle and the inscribed quadrilateral, respectively, and the subscripts denote the vertex indices.
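Since the formula itself appears only as an image in the filing, the snippet below merely illustrates the usual geometric reading of S13 and S24 as the Euclidean lengths of the diagonals joining opposite vertices v1-v3 and v2-v4 of the inscribed quadrilateral; treat it as an assumption rather than the patented formula.

import math

def diagonal_lengths(v):
    """v: four (x, y) vertex coordinates of the inscribed quadrilateral, in order."""
    s13 = math.dist(v[0], v[2])   # diagonal between vertices 1 and 3
    s24 = math.dist(v[1], v[3])   # diagonal between vertices 2 and 4
    return s13, s24

print(diagonal_lengths([(0, 0), (4, 0), (4, 3), (0, 3)]))   # (5.0, 5.0)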
8. The detection and identification method according to claim 7, wherein: the shape learning process further comprises transforming the parameters to obtain transformed feature vectors:
Figure FDA0002885013010000022
wherein α1, α2, α3, α4 denote the ratios of the distances from the four vertices of the inscribed quadrilateral to the four vertices of the circumscribed rectangle to the width or height of the circumscribed rectangle, and β1, β2 denote the ratios of the lengths of the diagonals of the arbitrary quadrilateral to the width or height of its circumscribed rectangle.
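The transformation formula is likewise reproduced only as an image; the snippet below sketches the kind of normalisation claim 8 describes, dividing each vertex distance and diagonal length by the width or height of the circumscribed rectangle. The particular assignment of width versus height to each index is an assumption made purely for illustration.

def normalised_shape(s, diag, w, h):
    """s: (s1, s2, s3, s4); diag: (S13, S24); w, h: circumscribed rectangle size."""
    alphas = (s[0] / w, s[1] / h, s[2] / w, s[3] / h)   # assumed width/height assignment
    betas = (diag[0] / w, diag[1] / h)
    return alphas, betas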
9. The detection and identification method according to any one of claims 6 to 8, wherein: the lightweight rotating target detection and identification model performs constraint calculation on the feature vectors and/or the transformed feature vectors through a quadratic shape constraint unit, and outputs position parameters and shape parameters of a target, wherein the quadratic shape constraint unit adopts the following constraint functions:
Figure FDA0002885013010000031
wherein e represents a natural constant, min and max represent minimum and maximum operations, respectively, and q represents a characteristic parameter of the bounding box.
10. The detection and identification method according to any one of claims 7 to 9, wherein: the loss function of the lightweight rotating target detection and identification model is as follows:
Loss = Loss_conf + Loss_HBB + Loss_OBB + Loss_cls, (8)
wherein Loss denotes the total loss, Loss_conf denotes the loss of the confidence c, Loss_HBB denotes the regression loss of the circumscribed rectangle, Loss_OBB denotes the regression loss of the target's oriented bounding box, and Loss_cls denotes the classification loss, wherein:
Figure FDA0002885013010000032
wherein,
Figure FDA0002885013010000033
respectively denote the ground-truth values of the training samples, smooth-L1(·) denotes the smoothed L1 norm, and αi and βi are intermediate variables between the arbitrary-quadrilateral vertex constraints and the diagonal constraints.
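A minimal sketch of how the four terms of the total loss in claim 10 could be combined is given below; the individual terms are placeholders built from standard losses of the kinds the claim names (confidence, smooth-L1 regression, classification), and any weighting factors or exact regression targets of the patented loss are omitted because they are not reproduced in the text.

import torch.nn.functional as F

def total_loss(pred, target):
    loss_conf = F.binary_cross_entropy_with_logits(pred["conf"], target["conf"])
    loss_hbb = F.smooth_l1_loss(pred["rect"], target["rect"])   # circumscribed rectangle
    loss_obb = F.smooth_l1_loss(pred["quad"], target["quad"])   # oriented quadrilateral
    loss_cls = F.cross_entropy(pred["cls"], target["cls"])
    return loss_conf + loss_hbb + loss_obb + loss_cls           # Loss = sum of the four terms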
CN202110010819.6A 2021-01-06 2021-01-06 Lightweight rotary target detection and identification method based on airborne photoelectric video Active CN112668536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110010819.6A CN112668536B (en) 2021-01-06 2021-01-06 Lightweight rotary target detection and identification method based on airborne photoelectric video

Publications (2)

Publication Number Publication Date
CN112668536A true CN112668536A (en) 2021-04-16
CN112668536B CN112668536B (en) 2023-08-25

Family

ID=75413082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110010819.6A Active CN112668536B (en) 2021-01-06 2021-01-06 Lightweight rotary target detection and identification method based on airborne photoelectric video

Country Status (1)

Country Link
CN (1) CN112668536B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108140131A (en) * 2015-10-04 2018-06-08 Atomwise Inc. System and method for applying a convolutional network to spatial data
WO2020181685A1 (en) * 2019-03-12 2020-09-17 Nanjing University of Posts and Telecommunications Vehicle-mounted video target detection method based on deep learning
CN110751214A (en) * 2019-10-21 2020-02-04 Shandong University Target detection method and system based on lightweight deformable convolution
CN111797676A (en) * 2020-04-30 2020-10-20 Nanjing University of Science and Technology High-resolution remote sensing image target on-orbit lightweight rapid detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoxiaojiang (小小将): "ShuffleNetV2: the crown of lightweight CNN networks", Zhihu, https://zhuanlan.zhihu.com/p/48261931 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333799A (en) * 2023-10-07 2024-01-02 Huaiyin Institute of Technology Middle and primary school classroom behavior detection method and device based on deformable anchor frame

Also Published As

Publication number Publication date
CN112668536B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN110298298B (en) Target detection and target detection network training method, device and equipment
CN111640125B (en) Aerial photography graph building detection and segmentation method and device based on Mask R-CNN
CN113657388B (en) Image semantic segmentation method for super-resolution reconstruction of fused image
CN111524150B (en) Image processing method and device
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN111914795A (en) Method for detecting rotating target in aerial image
CN107564009B (en) Outdoor scene multi-target segmentation method based on deep convolutional neural network
CN110135438B (en) Improved SURF algorithm based on gradient amplitude precomputation
CN110570440A (en) Image automatic segmentation method and device based on deep learning edge detection
CN111783523A (en) Remote sensing image rotating target detection method
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN113191296A (en) Method for detecting five parameters of target in any orientation based on YOLOV5
CN114119610B (en) Defect detection method based on rotating target detection
CN112381062A (en) Target detection method and device based on convolutional neural network
CN115049619B (en) Efficient flaw detection method for complex scene
CN117094999B (en) Cross-scale defect detection method
CN115937552A (en) Image matching method based on fusion of manual features and depth features
CN112163990A (en) Significance prediction method and system for 360-degree image
CN110472640B (en) Target detection model prediction frame processing method and device
CN112668536B (en) Lightweight rotary target detection and identification method based on airborne photoelectric video
CN113326734B (en) Rotational target detection method based on YOLOv5
CN114742864A (en) Belt deviation detection method and device
CN113962889A (en) Thin cloud removing method, device, equipment and medium for remote sensing image
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant