CN115471670A - Space target detection method based on improved YOLOX network model - Google Patents

Space target detection method based on improved YOLOX network model

Info

Publication number
CN115471670A
Authority
CN
China
Prior art keywords
module
layer
yolox
conv
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210874032.9A
Other languages
Chinese (zh)
Inventor
张海峰
艾汗
董森
任龙
冯佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XiAn Institute of Optics and Precision Mechanics of CAS
Original Assignee
XiAn Institute of Optics and Precision Mechanics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XiAn Institute of Optics and Precision Mechanics of CAS filed Critical XiAn Institute of Optics and Precision Mechanics of CAS
Priority to CN202210874032.9A priority Critical patent/CN115471670A/en
Publication of CN115471670A publication Critical patent/CN115471670A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image detection method, in particular to a space target detection method based on an improved YOLOX network model, and solves the technical problems of high algorithm complexity, low detection rate and poor generalization capability of traditional space target detection methods in extremely complex space environments. The space target detection method based on the improved YOLOX network model comprises the following steps: step S1: acquiring a labelled and annotated space target detection data set; step S2: constructing a YOLOX network model; step S3: inputting the training set and the verification set obtained in step S1 into the YOLOX network model constructed in step S2 for training and verification to obtain a space target detection model and its prediction weights, and continuously iteratively optimizing the prediction weights through forward propagation and backward propagation to obtain a trained YOLOX network model; step S4: inputting the space target images in the test set into the trained YOLOX network model for space target detection. High-precision detection of space targets is thereby realized.

Description

Space target detection method based on improved YOLOX network model
Technical Field
The invention relates to an image detection method, in particular to a space target detection method based on an improved YOLOX network model.
Background
With the continuous development of, and competition in, space-related technologies, space target detection intersects with many other important fields; it has become an important foundation of aerospace technology and is of great research significance.
For space target detection, traditional detection algorithms mainly extract features such as straight lines, polygons and ellipses from a selected area and then judge the target type from these features. However, traditional methods suffer from high algorithm complexity, a low detection rate and poor generalization capability in extremely complex space environments. Research on fast, real-time algorithms with high accuracy and high reliability has therefore become a hotspot.
In recent years, with the rapid development of computer technology and image processing technology, convolutional neural networks have made great progress in the field of target detection; compared with traditional identification methods, they have a stronger feature expression capability for targets. Two-stage detection algorithms such as Faster R-CNN, Mask R-CNN and R-FCN first generate candidate regions and then classify the selected regions; their detection precision is high, but the detection speed is still unsatisfactory. One-stage detection algorithms such as SSD, YOLOv3, YOLOv4 and YOLOv5 directly localize the target and output its category information. However, much target detection work still faces the following problems: 1. targets differ greatly in size and cannot be detected and identified effectively; 2. the background is complicated and easily causes misjudgment. With the continuous development of artificial intelligence technology, deep learning methods have begun to penetrate various fields, so intelligent supervision of the space environment is urgently required.
Disclosure of Invention
The invention aims to provide a space target detection method based on an improved YOLOX network model, addressing the technical problems of the low detection rate and poor generalization capability of traditional space target detection methods in extremely complex space environments, thereby improving the detection precision of space targets; experiments show that the model detects targets more accurately and more quickly.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a space target detection method based on an improved YOLOX network model is characterized by comprising the following steps:
step S1: acquiring a spatial target detection data set with labels and tags, and dividing the spatial target detection data set into a training set, a verification set and a test set;
step S2: constructing a YOLOX network model, wherein the YOLOX network model comprises a Backbone feature extraction module Backbone network, an enhanced feature extraction module Dilated Encoder network and a decoupling output module YoloHead network;
s3, inputting the training set and the verification set obtained in the step S1 into the YOLOX network model constructed in the step S2, training and verifying to obtain a spatial target detection model and prediction weights thereof, and continuously performing iterative optimization on the prediction weights through forward propagation and backward propagation to obtain a trained YOLOX network model;
and S4, inputting the space target image in the test set into a trained YOLOX network model for space target detection.
Further, step S1 specifically includes:
S11, obtaining an image containing a space target, and performing Copy-Reduce-Paste data enhancement on the image to obtain an enhanced image;
S12, labeling the enhanced image obtained in step S11 to obtain an XML annotation file recording the position and type of the space target corresponding to the enhanced image; establishing a space target detection data set from the enhanced image and the corresponding XML annotation file;
S13, randomly dividing the space target detection data set obtained in step S12 into a training set, a verification set and a test set at a ratio of 8:1:1.
Further, step S2 specifically includes:
S21: constructing a Backbone feature extraction module Backbone network;
S22: constructing an enhanced feature extraction module Dilated Encoder network;
S23: constructing a decoupling output module YoloHead network, completing the construction of the YOLOX network model.
Further, the Backbone feature extraction module Backbone network in step S21 comprises a Focus module, depth separable convolutional layers, residual modules and an SPPBottleneck module;
the depth separable convolutional layers comprise a first depth separable convolutional layer, a second depth separable convolutional layer, a third depth separable convolutional layer, a fourth depth separable convolutional layer and a fifth depth separable convolutional layer;
the residual modules comprise a first CspLayer module, a second CspLayer module, a third CspLayer module and a fourth CspLayer module;
the Focus module, the first depth separable convolutional layer, the second depth separable convolutional layer, the first CspLayer module, the third depth separable convolutional layer, the second CspLayer module, the fourth depth separable convolutional layer, the third CspLayer module, the fifth depth separable convolutional layer, the SPPBottleneck module and the fourth CspLayer module are connected in sequence;
the second CspLayer module generates a first feature layer, the third CspLayer module generates a second feature layer, and the fourth CspLayer module generates a third feature layer.
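By way of illustration only, a minimal PyTorch-style sketch of the connection order described above is given below; the channel widths, strides, the stem convolution applied to the Focus slicing output, and the identity placeholders standing in for the CspLayer and SPPBottleneck modules are all assumptions:

```python
import torch
import torch.nn as nn

def dwconv(c_in, c_out, stride):
    """Depth separable convolution block: depthwise 3x3 followed by pointwise 1x1 (Conv + BN + SiLU)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Backbone(nn.Module):
    """Wiring only: Focus stem -> depth separable convs and CspLayers -> SPPBottleneck,
    emitting the first, second and third feature layers. CspLayer/SPPBottleneck are assumed
    to be provided elsewhere; identity placeholders keep the sketch runnable."""
    def __init__(self, csp=None, spp=None):
        super().__init__()
        csp = csp or (lambda c: nn.Identity())
        spp = spp or (lambda c: nn.Identity())
        self.stem = nn.Conv2d(12, 64, 3, 1, 1)                 # consumes the 12-channel Focus slicing output
        self.dw1, self.dw2 = dwconv(64, 128, 2), dwconv(128, 128, 1)
        self.csp1 = csp(128)
        self.dw3, self.csp2 = dwconv(128, 256, 2), csp(256)    # csp2 -> first feature layer
        self.dw4, self.csp3 = dwconv(256, 512, 2), csp(512)    # csp3 -> second feature layer
        self.dw5 = dwconv(512, 1024, 2)
        self.spp, self.csp4 = spp(1024), csp(1024)             # csp4 -> third feature layer

    def forward(self, x_sliced):
        x = self.csp1(self.dw2(self.dw1(self.stem(x_sliced))))
        f1 = self.csp2(self.dw3(x))
        f2 = self.csp3(self.dw4(f1))
        f3 = self.csp4(self.spp(self.dw5(f2)))
        return f1, f2, f3

feats = Backbone()(torch.randn(1, 12, 320, 320))               # a 640x640 RGB picture after Focus slicing
print([tuple(f.shape) for f in feats])                         # 80x80, 40x40 and 20x20 feature layers
```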
Further, in step S22, the enhanced feature extraction module Dilated Encoder network comprises 1 initial convolutional layer module, Z dilated residual blocks and Z-1 attention mechanism feedback modules CBAM, where Z is a positive integer;
step S22 specifically includes:
s221: constructing an initial convolutional layer module, and adding the first characteristic layer m obtained in the step S21 1 Using 1 × 1 convolutional layer as input of initial convolutional layer module to reduce channel dimension, adding 3 × 3 convolutional layer to refine semantic context, and obtaining output x of initial convolutional layer module 1
x 1 =conv 2 (conv 1 (m 1 ))
In the formula, conv 1 Is 1 × 1 convolutional layer, conv 2 Is a 3 × 3 convolutional layer;
s222: building an extended residual block, and outputting x of the initial convolutional layer module obtained in step S221 1 Performing convolution layer operation to obtain output X of the residual error block i
X i =x i +conv 5 (conv 4 (conv 3 (x i )))
In the formula, x i For the input of the i-th extended residual block, conv 3 、conv 5 Are all 1X 1 convolutional layers, conv 4 Is a 3X 3 convolution layer, X i For the output of the ith residual block for expansion,
Figure BDA0003756143960000031
s223: constructing an attention mechanism feedback module CBAM, and expanding the output X of the residual error block in the step S222 i Inputting an attention mechanism feedback module CBAM to obtain a channel attention output characteristic diagram
Figure BDA0003756143960000032
And spatial attention output profile
Figure BDA0003756143960000033
And outputting the spatial attention as a feature map
Figure BDA0003756143960000034
Output characteristic diagram Y as attention mechanism feedback module CBAM i
S224: establishing the recursive enhanced feature extraction module Dilated Encoder network:
x_{i+1} = Y_i
Through steps S222 to S223, the output feature map Y_{Z-1} of the (Z-1)-th attention mechanism feedback module CBAM gives the input x_Z of the Z-th dilated residual block; substituting x_Z into the dilated residual block output X_i = x_i + conv_5(conv_4(conv_3(x_i))) of step S222 yields the enhanced feature layer of the enhanced feature extraction module Dilated Encoder network corresponding to the first feature layer;
s225: and repeating the steps S221 to S224 to obtain a second feature layer and a third feature layer corresponding to the enhanced feature layer output by the enhanced feature extraction module related Encoder network, and completing the construction of the enhanced feature extraction module related Encoder network.
Further, in step S23, the decoupling output module YoloHead network includes a dynamic convolution layer, a layer attention mechanism and a prediction parameter layer;
step S23 specifically includes:
s231: calculating task interaction characteristics of a decoupling output module YoloHead network to obtain a dynamic convolution layer
Figure BDA0003756143960000041
Comprises the following steps:
X∈R×H×W×C
Figure BDA0003756143960000042
wherein, X is one of the enhanced feature layers obtained in step S225, R, H, W and C respectively represent the number of images batchSize, the image height, the image width and the number of channels of the YOLOX network model input each time, δ denotes a relu activation function, conv k Refers to the k-th convolutional layer,
Figure BDA0003756143960000043
s232: using the dynamic convolution layer obtained in step S231 using a layer attention mechanism
Figure BDA0003756143960000044
Feature layer for computational classification and regression tasks
Figure BDA0003756143960000045
w=σ(fc 2 (δ(fc 1 (x inter ))))
Figure BDA0003756143960000046
Wherein x is inter Is a spliced dynamic convolution layer
Figure BDA0003756143960000047
Characteristic map, fc, obtained thereafter 1 Is the first fully-connected layer, fc 2 Is a second fully-connected layer, w is x inter The k-dimensional weight variable calculated by the layer attention mechanism can capture the dependency relationship, w, between k convolutional layers k Is the kth element of w, σ is the sigmoid function;
s233: according to the characteristic layer in S232
Figure BDA0003756143960000048
Obtaining a prediction parameter Z for classification or regression obtained by the enhanced feature layer through a decoupling output module YoloHead network task
Z task =conv 12 (δ(conv 11 (X task )))
Wherein, X task Is a characteristic layer
Figure BDA0003756143960000051
Splicing characteristic map of (1), conv 11 1 × 1 convolution layer for adjusting the number of channels, conv 12 Convolutional layer for generating prediction parameter Z task
S234: and repeating the steps S231 to S233 to obtain prediction parameters of all the reinforced feature layers obtained through a decoupling output module YoloHead network, and completing the construction of a Yolox network model.
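By way of illustration only, a minimal PyTorch-style sketch of steps S231 to S233 for one enhanced feature layer is given below; the 3×3 interaction convolutions, the global average pooling used to form x_inter before fc_1, the channel widths and the separate Cls/Reg/Obj prediction convolutions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDecoupledHead(nn.Module):
    """Stacked task-interaction convolutions, a layer attention over the k interaction features,
    and 1x1 prediction convolutions producing Cls, Reg and Obj for one enhanced feature layer."""
    def __init__(self, channels, num_convs=4, num_classes=3):
        super().__init__()
        self.inter_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_convs))
        self.fc1 = nn.Linear(channels * num_convs, channels)          # layer attention: fc_1
        self.fc2 = nn.Linear(channels, num_convs)                     # layer attention: fc_2
        self.conv11 = nn.Conv2d(channels * num_convs, channels, 1)    # adjust the channel number
        self.cls_pred = nn.Conv2d(channels, num_classes, 1)           # category prediction parameters Cls
        self.reg_pred = nn.Conv2d(channels, 4, 1)                     # target frame parameters Reg
        self.obj_pred = nn.Conv2d(channels, 1, 1)                     # foreground/background parameters Obj

    def forward(self, x):
        inter = []
        for conv in self.inter_convs:                                 # X_k^inter = relu(conv_k(.))
            x = F.relu(conv(x))
            inter.append(x)
        x_inter = torch.cat(inter, dim=1)
        pooled = F.adaptive_avg_pool2d(x_inter, 1).flatten(1)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(pooled))))         # w = sigma(fc2(relu(fc1(x_inter))))
        x_task = torch.cat([w[:, k, None, None, None] * inter[k]      # X_k^task = w_k * X_k^inter
                            for k in range(len(inter))], dim=1)
        feat = F.relu(self.conv11(x_task))                            # Z_task = conv12(relu(conv11(X_task)))
        return self.cls_pred(feat), self.reg_pred(feat), self.obj_pred(feat)

cls, reg, obj = TaskDecoupledHead(channels=256)(torch.randn(1, 256, 80, 80))
```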
Further, step S3 specifically includes:
s31: inputting the RGB pictures of the training set and the verification set in the step S13 in the YOLOX network model, and carrying out slicing operation on the RGB pictures by using the Focus module in the step S21;
s32: inputting the RGB picture processed in the step S31 into a Backbone feature extraction module Backbone network, and obtaining an effective feature layer through the residual error module and the depth separable convolution layer in the step S21;
s33: respectively inputting the effective characteristic layers obtained in the step S32 into a reinforced characteristic extraction module scaled Encoder network to obtain effective reinforced characteristic layers;
s34: inputting the effective reinforced characteristic layer obtained in the step S33 into a YoloHead network of a decoupling output module to obtain a prediction parameter of the effective reinforced characteristic layer; the prediction parameters comprise category prediction parameters Cls, target frame parameters Reg and foreground/background parameters Obj;
s35: stacking the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj in the step S34 to obtain a prediction characteristic layer;
s36: and calculating the prediction parameters of the prediction feature layer in the step S35 and the category prediction parameters Cls and the target frame parameters Reg and the cross entropy loss of the foreground/background parameters Obj in the training set in the enhanced image in the step S12 and the corresponding XML annotation file, and continuously performing iterative optimization according to the model prediction weight of the cross entropy loss until a space target detection model is obtained.
Further, step S223 specifically includes:
the channel attention output feature map is calculated as:
X_i^c = σ(conv_7(δ(conv_6(AvgPool(X_i)))) + conv_9(δ(conv_8(MaxPool(X_i))))) ⊗ X_i
the spatial attention output feature map is calculated from the channel attention output feature map:
X_i^s = σ(conv_10(cat(AvgPool(X_i^c), MaxPool(X_i^c)))) ⊗ X_i^c
the spatial attention output feature map is taken as the output feature map of the attention mechanism feedback module CBAM:
Y_i = X_i^s
where AvgPool is the average pooling operation, MaxPool is the maximum pooling operation, conv_6, conv_7, conv_8 and conv_9 are 1×1 convolutional layers, conv_10 is a 7×7 convolutional layer, cat is the concatenation operation along one dimension, δ is the relu activation function, σ is the sigmoid function, and ⊗ denotes element-wise multiplication.
Further, the method also comprises the step S5:
and (3) inputting the space target image of the test set into the YOLOX network model constructed in the step (S2), and evaluating the overall detection performance of the YOLOX network model.
Further, in step S5, the method for evaluating the overall detection performance of the YOLOX network model is specifically: the overall detection performance of the YOLOX network model is evaluated by the average detection precision AP of each class and by the mean of the APs of all classes, namely the mean average precision mAP;
P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫_0^1 P(R) dR
where P denotes the precision Precision, used to evaluate how accurate the predictions are; R denotes the recall Recall, used to evaluate how many of the correct samples are predicted; TP denotes positive samples predicted by the model as the positive class, FP denotes negative samples predicted by the model as the positive class, and FN denotes positive samples predicted by the model as the negative class.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
(1) A YOLOX network model is constructed that comprises a Backbone feature extraction module Backbone network, an enhanced feature extraction module Dilated Encoder network and a decoupling output module YoloHead network; its detection precision and speed on natural images reach the current state of the art. On this basis, the network structure of the YOLOX network model is improved, and a labelled and annotated space target detection data set is used for training and testing to obtain the space target detection model. In practical applications, the improved YOLOX algorithm achieves a better accuracy-speed trade-off than the YOLOv3, YOLOv4 and YOLOv5 networks.
(2) The method can detect space targets in real time; the prediction weights of the YOLOX network model are continuously iteratively optimized through forward propagation and backward propagation, and every model evaluation index of the YOLOX network model reaches a good level, so that the YOLOX network model can effectively detect and identify targets of specific types.
Drawings
FIG. 1 is a flow chart of the method for detecting a spatial target based on improved YOLOX.
FIG. 2 is a schematic diagram of a YOLOX network in accordance with an embodiment of the invention.
FIG. 3 is a schematic diagram of the SPPBottleneck module in the embodiment of the present invention.
FIG. 4 is a schematic diagram of the Dilated Encoder network in YOLOX in the embodiment of the present invention.
FIG. 5 is a schematic diagram of the attention mechanism feedback module CBAM in the Dilated Encoder network in YOLOX in the embodiment of the present invention.
FIG. 6 is a schematic diagram of the YoloHead network in YOLOX in the embodiment of the present invention.
Fig. 7 is a schematic diagram illustrating a spatial target detection effect according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art without creative efforts based on the technical solutions of the present invention belong to the protection scope of the present invention.
The invention discloses a space target detection method based on an improved YOLOX network model, which comprises the following steps as shown in figure 1:
step S1: acquiring a spatial target detection data set with labels and labels, and dividing the spatial target detection data set into a training set, a verification set and a test set;
s11, obtaining an image with a space target, and performing Copy-Reduce-Paste data enhancement on the image to obtain an enhanced image;
s12, labeling the enhanced image obtained in the step S11, and acquiring a space target position corresponding to the enhanced image and an XML labeling file of the type of the space target position; establishing a space target detection data set by the enhanced image and the corresponding XML markup file;
s13, the space target detection data set obtained in the step S12 is processed according to the following steps of 8:1:1 is randomly divided into a training set, a validation set and a test set.
Step S2: constructing a YOLOX network model, wherein the YOLOX network model comprises a Backbone feature extraction module Backbone network, an enhanced feature extraction module Dilated Encoder network and a decoupling output module YoloHead network;
s21: constructing a Backbone feature extraction module Backbone network;
the Backbone feature extraction module Backbone network comprises a Focus module, a depth separable convolution layer, a residual module and an SPPBottlenck module;
the depth-separable convolutional layers include a first depth-separable convolutional layer, a second depth-separable convolutional layer, a third depth-separable convolutional layer, a fourth depth-separable convolutional layer, and a fifth depth-separable convolutional layer;
the residual module comprises a first CspLayer module, a second CspLayer module, a third CspLayer module and a fourth CspLayer module;
the device comprises a Focus module, a first depth separable convolution layer, a second depth separable convolution layer, a first CspLayer module, a third depth separable convolution layer, a second CspLayer module, a fourth depth separable convolution layer, a third CspLayer module, a fifth depth separable convolution layer, an SPPBottleeck module and a fourth CspLayer module which are sequentially arranged;
the second CspLayer module generates a first feature layer; a third csplyer module produces a second feature layer and a fourth csplyer module produces a third feature layer.
S22: constructing an enhanced feature extraction module Dilated Encoder network;
in step S22, the enhanced feature extraction module Dilated Encoder network comprises 1 initial convolutional layer module, Z dilated residual blocks and Z-1 attention mechanism feedback modules CBAM, where Z is a positive integer;
s221: constructing an initial convolutional layer module, and adding the first characteristic layer m obtained in the step S21 1 Using 1 × 1 convolutional layer as input of initial convolutional layer module to reduce channel dimension, adding 3 × 3 convolutional layer to refine semantic context, and obtaining output x of initial convolutional layer module 1
x 1 =conv 2 (conv 1 (m 1 ))
In the formula, conv 1 Is 1 × 1 convolutional layer, conv 2 Is a 3 × 3 convolutional layer;
s222: building an extended residual block, and outputting x of the initial convolutional layer module obtained in step S221 1 Performing convolution layer operation to obtain output X of the residual block i
X i =x i +conv 5 (conv 4 (conv 3 (x i )))
In the formula, x i For the input of the i-th extended residual block, conv 3 、conv 5 All are 1 × 1 convolutional layers, conv 4 Is a 3X 3 convolution layer, X i For the output of the ith residual block for expansion,
Figure BDA0003756143960000081
s223: constructing an attention mechanism feedback module CBAM, and expanding the output X of the residual error block in the step S222 i Inputting an attention mechanism feedback module CBAM to obtain a channel attention output characteristic diagram
Figure BDA0003756143960000082
And spatial attention output profile
Figure BDA0003756143960000083
And outputting the spatial attention as a feature map
Figure BDA0003756143960000084
Output characteristic diagram Y as attention mechanism feedback module CBAM i
The channel attention output characteristic graph calculation formula is as follows:
Figure BDA0003756143960000091
calculating a spatial attention output feature map from the channel attention output feature map:
Figure BDA0003756143960000092
taking the spatial attention output characteristic map as an output characteristic map of an attention mechanism feedback module CBAM:
Figure BDA0003756143960000093
in the formula, avgPool is average pooling, maxpool is maximum pooling operation, conv 6 、conv 7 、conv 8 、conv 9 Are all 1X 1 convolutional layers, conv 10 Is a 7 x 7 convolutional layer, cat is based on a one-dimensional stitching operation.
S224: establishing the recursive enhanced feature extraction module Dilated Encoder network:
x_{i+1} = Y_i
Through steps S222 to S223, the output feature map Y_{Z-1} of the (Z-1)-th attention mechanism feedback module CBAM gives the input x_Z of the Z-th dilated residual block; substituting x_Z into the dilated residual block output X_i = x_i + conv_5(conv_4(conv_3(x_i))) of step S222 yields the enhanced feature layer of the enhanced feature extraction module Dilated Encoder network corresponding to the first feature layer;
s225: and repeating the steps S221 to S224 to obtain a second feature layer and a third feature layer corresponding to the enhanced feature layer output by the enhanced feature extraction module related Encoder network, and completing the construction of the enhanced feature extraction module related Encoder network.
S23: constructing a decoupling output module YoloHead network, completing the construction of the YOLOX network model.
In the step S23, the decoupling output module YoloHead network comprises a dynamic convolution layer, a layer attention mechanism and a prediction parameter layer;
s231: calculating task interaction characteristics of a decoupling output module YoloHead network to obtain a dynamic convolution layer
Figure BDA0003756143960000094
Comprises the following steps:
X∈R×H×W×C
Figure BDA0003756143960000095
wherein, X is one of the enhancement feature layers obtained in step S225, R, H, W and C respectively represent the number of images batchSize, image height, image width and channel number of the YOLOX network model input each time, δ indicates the relu activation function, conv k Refers to the k-th convolutional layer,
Figure BDA0003756143960000096
S232: using a layer attention mechanism on the dynamic convolutional layer features X_k^inter obtained in step S231 to compute the feature layers X_k^task for the classification and regression tasks:
w = σ(fc_2(δ(fc_1(x_inter))))
X_k^task = w_k · X_k^inter
where x_inter is the feature map obtained by concatenating the dynamic convolutional layer features X_k^inter; fc_1 is the first fully connected layer and fc_2 is the second fully connected layer; w is the k-dimensional weight variable computed from x_inter by the layer attention mechanism, which captures the dependencies among the k convolutional layers; w_k is the k-th element of w; and σ is the sigmoid function;
s233: according to the characteristic layer in S232
Figure BDA0003756143960000105
Obtaining a prediction parameter Z for classification or regression obtained by the enhanced feature layer through a decoupling output module YoloHead network task
Z task =conv 12 (δ(conv 11 (X task )))
Wherein, X task Is a characteristic layer
Figure BDA0003756143960000106
Splicing characteristic map of (1), conv 11 Is 1 × 1 convolution layer for adjusting the number of channels, conv 12 Is 1 × 1 convolutional layer, and is used for generating prediction parameter Z task
S234: and repeating the steps S231 to S233 to obtain prediction parameters of all the reinforced feature layers obtained through a decoupling output module YoloHead network, and completing the construction of a Yolox network model.
S3, inputting the training set and the verification set obtained in the step S1 into the YOLOX network model constructed in the step S2, training and verifying to obtain a spatial target detection model and a prediction weight thereof, and continuously performing iterative optimization on the prediction weight through forward propagation and backward propagation to obtain a trained YOLOX network model;
s31: inputting the RGB pictures of the training set and the verification set in the step S13 in the YOLOX network model, and slicing the RGB pictures by using the Focus module in the step S21;
s32: inputting the RGB picture processed in the step S31 into a Backbone feature extraction module Backbone network, and obtaining an effective feature layer through the residual error module and the depth separable convolution layer in the step S21;
s33: respectively inputting the effective characteristic layers obtained in the step S32 into a reinforced characteristic extraction module scaled Encoder network to obtain effective reinforced characteristic layers;
s34: inputting the effective reinforced characteristic layer obtained in the step S33 into a YoloHead network of a decoupling output module to obtain a prediction parameter of the effective reinforced characteristic layer; the prediction parameters comprise category prediction parameters Cls, target frame parameters Reg and foreground/background parameters Obj;
s35: stacking the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj in the step S34 to obtain a prediction characteristic layer;
s36: and (4) calculating the prediction parameters of the prediction feature layer in the step (S35) and the cross entropy losses of the category prediction parameters Cls and the target frame parameters Reg and the foreground/background parameters Obj in the enhanced image in the step (S12) and the corresponding XML annotation file, and continuously performing iterative optimization according to the model prediction weight of the cross entropy losses until a space target detection model is obtained.
And S4, inputting the space target image in the test set into a trained YOLOX network model for space target detection.
S41: loading the weights of the space target detection model trained in step S36 into the YOLOX network model constructed in step S2;
S42: inputting the space target images of the test set obtained in step S13 into the YOLOX network model loaded in step S41, and evaluating the overall detection performance of the space target detection model.
The overall detection performance of the space target detection model is evaluated by the average detection precision AP of each class and by the mean average precision mAP, where mAP is the mean of the average detection precisions AP of all classes.
Step S5: and (3) inputting the space target image of the test set into the YOLOX network model constructed in the step (S2), and evaluating the overall detection performance of the YOLOX network model.
The present invention will be described in detail with reference to specific examples.
S1: acquiring a space target detection data set with labels and tags;
s11, obtaining an image with a space target, performing Copy-Reduce-Paste data enhancement on the image, wherein the Copy-Reduce-Paste data enhancement refers to the steps of reducing/amplifying the space target of the image, pasting the image to an original image, increasing the number of space targets with different sizes, obtaining an enhanced image with a plurality of space targets with different sizes, and improving the extraction effect of a YOLOX network model on image features.
S12, labeling the enhanced images obtained in the step S11 by using a deep learning image labeling tool LabelImg, and acquiring an XML labeling file which comprises a space target position and a type thereof and corresponds to each enhanced image; the annotation category comprises three types of targets, namely a Satellite body (Satellite), a Satellite Cabin body (Cabin) and a solar sailboard (Windsurfing), and the space target image and the corresponding XML annotation file are used for establishing a space target detection data set.
S13, randomly dividing the space target detection data set obtained in step S12 into a training set, a verification set and a test set at a ratio of 8:1:1, for training and testing the space target detection model.
S2: a YOLOX network model, a schematic diagram of the YOLOX network, was constructed, as shown in fig. 2.
The Backbone feature extraction module Backbone network comprises a Focus module, depth separable convolutional layers (Conv2D_BN_SiLU), residual modules (CspLayer) and an SPPBottleneck module; the depth separable convolutional layers comprise a first depth separable convolutional layer, a second depth separable convolutional layer, a third depth separable convolutional layer, a fourth depth separable convolutional layer and a fifth depth separable convolutional layer; the residual modules comprise a first residual module, a second residual module, a third residual module and a fourth residual module.
The Backbone feature extraction module Backbone network is formed by connecting, in sequence, the Focus module, the first depth separable convolutional layer, the second depth separable convolutional layer, the first CspLayer module, the third depth separable convolutional layer, the second CspLayer module, the fourth depth separable convolutional layer, the third CspLayer module, the fifth depth separable convolutional layer, the SPPBottleneck module and the fourth CspLayer module.
The Focus module takes the value of every other pixel of the enhanced picture to obtain four independent feature layers and then stacks them, expanding the input channels fourfold, so that the spliced feature layer has twelve channels instead of the original three. The depth separable convolutional layers change the way the convolution is performed to reduce the number of convolution operations. Each residual module has two branches: one branch applies convolution, normalization and activation-function operations to the input feature layer; the other branch is processed by the activation function and then by n residual blocks; finally the two branches are connected. As shown in FIG. 3, the SPPBottleneck module performs feature extraction through maximum pooling with different pooling kernel sizes, enlarging the receptive field of the Backbone network. Finally, the second CspLayer module generates the first feature layer, the third CspLayer module generates the second feature layer, and the fourth CspLayer module generates the third feature layer.
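By way of illustration only, a minimal PyTorch sketch of the Focus slicing described above (every other pixel taken in each direction, the four sub-maps stacked on the channel axis, 3 channels becoming 12) is given below:

```python
import torch

def focus_slice(x):
    """Stack the four pixel-interleaved sub-maps along the channel axis: (B, 3, H, W) -> (B, 12, H/2, W/2)."""
    return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                      x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

print(focus_slice(torch.randn(1, 3, 640, 640)).shape)   # torch.Size([1, 12, 320, 320])
```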
S22: constructing the enhanced feature extraction module Dilated Encoder network.
As shown in FIG. 4, the enhanced feature extraction module Dilated Encoder network includes an initial convolutional layer module, dilated residual blocks and an attention mechanism feedback module CBAM (Convolutional Block Attention Module). The dilated residual blocks use dilated convolutional layers to extract context information of targets at various scales from a single feature layer of the Backbone feature extraction module. As shown in FIG. 5, the attention mechanism feedback module CBAM compensates for the performance gap between the single-in single-out architecture and the multiple-in single-out architecture. As shown in FIG. 6, in step S23, the decoupling output module YoloHead network performs a series of convolutional layer operations with activation functions and branch decoupling operations on the feature layers extracted by the enhanced feature extraction module Dilated Encoder network to obtain the final prediction parameters, which include the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj.
The enhanced feature extraction module Dilated Encoder network comprises three main components: the initial convolutional layer module, the dilated residual blocks and the attention mechanism feedback module CBAM. First, the initial convolutional layer module is constructed, using a 1×1 convolutional layer to reduce the channel dimension; then a 3×3 convolutional layer is added to refine the semantic context. Next, four consecutive dilated residual blocks are stacked, with different dilation rates in their 3×3 convolutional layers, outputting features with multiple receptive fields and covering the scales of all objects. Finally, the attention mechanism feedback module CBAM is established to make up the performance difference between the single-input single-output structure and the multi-input single-output structure.
S221: constructing the initial convolutional layer module: taking the first feature layer m_1 obtained in step S21 as the input of the initial convolutional layer module, using a 1×1 convolutional layer to reduce the channel dimension and adding a 3×3 convolutional layer to refine the semantic context, obtaining the output x_1 of the initial convolutional layer module, as shown in equation (1):
x_1 = conv_2(conv_1(m_1))    (1)
where x_1 is the output of the initial convolutional layer module, conv_1 is a 1×1 convolutional layer and conv_2 is a 3×3 convolutional layer;
s222: four expansion residual blocks, bottleneck1, bottleneck2, bottleneck3, and Bottleneck4, are stacked as shown in equation (2):
Figure BDA0003756143960000131
wherein x is i For the input of the i-th extended residual block, conv 3 、conv 5 All are 1 × 1 convolutional layers, conv 4 Is a 3 × 3 convolution layer, expands the number of residual blocks
Figure BDA0003756143960000137
Z is an integer;
in the present embodiment, the first and second electrodes are,
Figure BDA0003756143960000138
the expansion ratios of the four void convolution layers are 2,4,6,8, respectively.
S223: establishing the attention mechanism feedback module CBAM, which combines spatial and channel attention mechanisms to make up the performance gap between the single-in single-out structure and the multi-in single-out structure:
X_i^c = σ(conv_7(δ(conv_6(AvgPool(X_i)))) + conv_9(δ(conv_8(MaxPool(X_i))))) ⊗ X_i    (3)
X_i^s = σ(conv_10(cat(AvgPool(X_i^c), MaxPool(X_i^c)))) ⊗ X_i^c    (4)
Y_i = X_i^s    (5)
where X_i^c is the channel attention output, X_i^s is the spatial attention output, Y_i is the output of the attention mechanism feedback module CBAM, X_i is the input feature map, AvgPool and MaxPool are the average pooling and maximum pooling operations respectively, conv_6, conv_7, conv_8 and conv_9 are 1×1 convolutional layers, conv_10 is a 7×7 convolutional layer, δ denotes the relu activation function, σ is the sigmoid function, cat is the concatenation operation along one dimension, and ⊗ denotes element-wise multiplication.
As shown in equation (3), X_i is subjected to width- and height-based AvgPool average pooling and MaxPool maximum pooling respectively, each result is passed through convolutional layers, the feature maps output by the convolutional layers are summed, and the final channel attention output feature map is generated by the sigmoid operation.
As shown in equation (4), X_i^c is subjected to AvgPool average pooling and MaxPool maximum pooling respectively, the results are concatenated, and a convolutional layer operation is applied to obtain the final spatial attention output feature map.
As shown in equation (5), the final spatial attention output feature map is taken as the output feature map of the attention mechanism feedback module CBAM.
S224: establishing the recursive Dilated Encoder network:
x_{i+1} = Y_i,  i = 1, 2, ..., Z-1    (6)
S23: constructing a YoloHead network of a decoupling output module;
s231: in order to enhance the interaction between classification and positioning, a stack of task interaction features is learned from a plurality of convolutional layers by using a feature extractor, and the design not only facilitates the task interaction, but also provides multi-level features with multi-scale effective receptive fields. Formally, let X ∈ R × H × W × C denote a single feature layer in step S21, where R, H, W, and C denote the number of images batchSize, image height, image width, and channel number of the YOLOX network model input each time, respectively, and 4 consecutive convolution layers with activation functions are used to calculate task interaction features, and the obtained convolution layers are used to calculate task interaction features
Figure BDA0003756143960000142
Dynamics, as shown in equation (7):
Figure BDA0003756143960000143
wherein, conv k Refers to the kth convolutional layer.
S232: using the layer attention mechanism, task-specific features are computed from the dynamic convolutional layer features X_k^inter obtained in step S231 to perform the decomposition of the tasks. The calculation for each task is shown in equation (8):
X_k^task = w_k · X_k^inter    (8)
w = σ(fc_2(δ(fc_1(x_inter))))    (9)
where w_k is the k-th element of w obtained from the layer attention calculation; as shown in equation (9), w is computed from the cross-layer task interaction features and captures the dependencies between layers; fc_1 and fc_2 are two fully connected layers; and x_inter is the feature map obtained by concatenating the X_k^inter.
S233: the result for classification or localization is obtained as shown in equation (10):
Z^task = conv_12(δ(conv_11(X^task)))    (10)
where X^task is the feature map obtained by concatenating the X_k^task, conv_11 is a 1×1 convolutional layer used to adjust the number of channels, δ is the relu activation function, and conv_12 is a 1×1 convolutional layer used to generate the prediction parameters Z^task, namely the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj of target detection.
S3: and respectively inputting the training set and the verification set into an improved YOLOX network model for training and verification to obtain a space target detection model.
S31: inputting 3-channel RGB pictures of arbitrary size and applying normalization, cropping, random up-down and left-right flipping, scaling, random colour change, Mosaic and CutMix processing to the images; the images are scaled to 640 × 640 as the input of the YOLOX network model, and the Focus structure in the Backbone feature extraction module Backbone network slices the input images;
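By way of illustration only, a simplified sketch of part of the preprocessing in step S31 (normalization, scaling to 640 × 640 and random flipping) is given below; Mosaic, CutMix, cropping and colour change are omitted, the 0-255 input range is an assumption, and the corresponding flipping of the annotation boxes is left out:

```python
import torch
import torch.nn.functional as F

def preprocess(image, size=640):
    """image: uint8 tensor of shape (3, H, W). Returns a normalized 640x640 float tensor."""
    x = image.float() / 255.0
    x = F.interpolate(x.unsqueeze(0), size=(size, size), mode="bilinear", align_corners=False)
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[-1])    # random left-right flip (boxes must be flipped accordingly)
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[-2])    # random up-down flip
    return x.squeeze(0)
```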
s32: and (3) inputting the image processed in the step (S31) into a Backbone feature extraction module Backbone network, stacking a series of residual modules comprising a plurality of residual blocks and a depth separable convolution layer to deepen the network to realize the initial extraction of features and reduce a large number of training parameters, so as to obtain three effective feature layers of 20 × 20, 40 × 40 and 80 × 80.
S33: inputting the three effective feature layers of sizes 20 × 20, 40 × 40 and 80 × 80 obtained in step S32 into the enhanced feature extraction module Dilated Encoder network respectively to obtain the effective enhanced feature layers.
S34: inputting the effective enhanced feature layers obtained in step S33 into the decoupling output module YoloHead network; three prediction results are obtained for each effective enhanced feature layer: Reg(h, w, 4) is used to judge the regression parameters of each feature point, from which a prediction box is obtained after adjustment; Obj(h, w, 1) is used to judge whether each feature point contains an object; and Cls(h, w, num_classes) is used to judge the type of object contained at each feature point. The three prediction results are stacked, and the result obtained for each feature layer is Out(h, w, 4 + 1 + num_classes): the first four parameters are the regression parameters of each feature point, from which the prediction box is obtained after adjustment; the fifth parameter judges whether each feature point contains an object; and the last num_classes parameters judge the type of object contained at each feature point.
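By way of illustration only, a minimal sketch of stacking the three prediction branches into Out(h, w, 4 + 1 + num_classes) is given below; the channel-first input layout is an assumption:

```python
import torch

def stack_head_outputs(reg_out, obj_out, cls_out):
    """reg_out: (B, 4, H, W), obj_out: (B, 1, H, W), cls_out: (B, num_classes, H, W)
    -> Out of shape (B, H, W, 4 + 1 + num_classes)."""
    return torch.cat([reg_out.permute(0, 2, 3, 1),
                      obj_out.permute(0, 2, 3, 1),
                      cls_out.permute(0, 2, 3, 1)], dim=-1)
```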
S35: stacking the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj in the step S34 to obtain a prediction characteristic layer;
s36: and calculating the prediction parameters of the prediction feature layer in the step S35 and the category prediction parameters Cls and the target frame parameters Reg and the cross entropy loss of the foreground/background parameters Obj in the training set in the enhanced image in the step S12 and the corresponding XML annotation file, and continuously performing iterative optimization according to the model prediction weight of the cross entropy loss until a space target detection model is obtained.
S4: and inputting the space target images in the test set into a trained YOLOX network model for space target detection.
S5: evaluating the space target detection algorithm of the improved YOLOX network model using the space target detection results of step S4.
The trained weights of the space target detection model are loaded into the YOLOX network model, undetected space target images are input into the YOLOX network model, and the overall detection performance of the space target detection model is evaluated.
The mean average precision mAP (mean Average Precision) is used as the evaluation index of detection accuracy, and the number of frames processed per second (FPS) is used as the evaluation index of detection speed. The mAP is defined as the mean of the average precisions (AP) of all classes. The average precision is:
P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫_0^1 P(R) dR
where P denotes the precision Precision, used to evaluate how accurate the predictions are; R denotes the recall Recall, used to evaluate how many of the correct samples are predicted; TP denotes positive samples predicted by the model as the positive class, FP denotes negative samples predicted by the model as the positive class, and FN denotes positive samples predicted by the model as the negative class.
Fig. 7 shows the detection results obtained by the present invention on a space target image. It can be seen from Fig. 7 that the Satellite body (Satellite), the Satellite Cabin (Cabin) and the solar sailboard (Windsurfing) are all detected; the value on each box is the confidence, used to judge whether the object in the bounding box is a positive sample or a negative sample: a value above the confidence threshold indicates a positive sample, and a value below it indicates a negative sample, i.e. background. In the present invention, the confidence threshold is set to 0.60. The detection method of the invention can accurately detect the type and number of space targets. As shown in Table 1, the AP, mAP and FPS results of the present invention and the YOLOX network model are compared while keeping the training and test images consistent. Compared with YOLOX, the average precision of the method is improved by about 4 points, and the inference speed meets the requirement of real-time detection.
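By way of illustration only, a minimal sketch of the confidence-threshold judgement described above (threshold 0.60) is given below; the detection record layout is an assumption:

```python
def filter_by_confidence(detections, threshold=0.60):
    """Keep predicted boxes whose confidence exceeds the threshold (positive samples);
    boxes at or below the threshold are treated as background."""
    return [det for det in detections if det["confidence"] > threshold]
```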
TABLE 1 Comparison of detection results of different detection methods

Algorithm         mAP      FPS    Satellite   Windsurfing   Cabin
YOLOX             0.9117   58.1   0.95        0.90          0.89
Improved YOLOX    0.9528   59.2   0.98        0.96          0.91
In conclusion, the method can detect space targets in real time; the prediction weights of the YOLOX network model are continuously iteratively optimized through forward propagation and backward propagation, and every model evaluation index reaches a good level, so that the YOLOX network model can effectively detect and identify targets of specific types.

Claims (10)

1. A space target detection method based on an improved YOLOX network model is characterized by comprising the following steps:
step S1: acquiring a spatial target detection data set with labels and labels, and dividing the spatial target detection data set into a training set, a verification set and a test set;
step S2: constructing a YOLOX network model, wherein the YOLOX network model comprises a Backbone feature extraction module Backbone network, an enhanced feature extraction module Dilated Encoder network and a decoupling output module YoloHead network;
s3, inputting the training set and the verification set obtained in the step S1 into the YOLOX network model constructed in the step S2, training and verifying to obtain a spatial target detection model and prediction weights thereof, and continuously performing iterative optimization on the prediction weights through forward propagation and backward propagation to obtain a trained YOLOX network model;
and S4, inputting the space target image in the test set into a trained YOLOX network model for space target detection.
2. The method for detecting the spatial target based on the improved YOLOX network model as claimed in claim 1, wherein the step S1 is specifically:
s11, obtaining an image with a space target, and performing Copy-Reduce-Paste data enhancement on the image to obtain an enhanced image;
s12, labeling the enhanced image obtained in the step S11, and obtaining a space target position corresponding to the enhanced image and an XML labeling file of the type of the space target position; establishing a space target detection data set by the enhanced image and the corresponding XML markup file;
s13, the space target detection data set obtained in the step S12 is processed according to the following steps of 8:1:1 are randomly divided into a training set, a validation set and a test set.
3. The method for detecting the spatial target based on the improved YOLOX network model as claimed in claim 1 or 2, wherein the step S2 is specifically:
s21: constructing a Backbone feature extraction module Backbone network;
s22: constructing a reinforced feature extraction module related Encoder network;
s23: and constructing a decoupling output module Yolohead network to complete the construction of a Yolox network model.
4. The method for detecting the spatial target based on the improved YOLOX network model as claimed in claim 3, wherein the Backbone feature extraction module Backbone network in step S21 comprises a Focus module, depth separable convolutional layers, residual modules and an SPPBottleneck module;
the depth separable convolutional layers comprise a first depth separable convolutional layer, a second depth separable convolutional layer, a third depth separable convolutional layer, a fourth depth separable convolutional layer and a fifth depth separable convolutional layer;
the residual modules comprise a first CspLayer module, a second CspLayer module, a third CspLayer module and a fourth CspLayer module;
the Focus module, the first depth separable convolutional layer, the second depth separable convolutional layer, the first CspLayer module, the third depth separable convolutional layer, the second CspLayer module, the fourth depth separable convolutional layer, the third CspLayer module, the fifth depth separable convolutional layer, the SPPBottleneck module and the fourth CspLayer module are connected in sequence;
the second CspLayer module generates a first feature layer, the third CspLayer module generates a second feature layer, and the fourth CspLayer module generates a third feature layer.
5. The method as claimed in claim 4, wherein the enhanced feature extraction module Dilated Encoder network in step S22 includes 1 initial convolutional layer module, Z dilated residual blocks and Z-1 attention mechanism feedback modules CBAM, where Z is a positive integer;
step S22 specifically includes:
s221: constructing an initial convolutional layer module, and adding the first characteristic layer m obtained in the step S21 1 Using 1 × 1 convolutional layer as input of initial convolutional layer module to reduce channel dimension, adding 3 × 3 convolutional layer to refine semantic context, and obtaining output x of initial convolutional layer module 1
x 1 =conv 2 (conv 1 (m 1 ))
In the formula, conv 1 Is 1 × 1 convolutional layer, conv 2 Is a 3 × 3 convolutional layer;
s222: building an extended residual block, and outputting x of the initial convolutional layer module obtained in step S221 1 Performing convolution layer operation to obtain output X of the residual block i
X i =x i +conv 5 (conv 4 (conv 3 (x i )))
In the formula, x i Is a firsti input of the extended residual block, conv 3 、conv 5 Are all 1X 1 convolutional layers, conv 4 Is a 3X 3 convolutional layer, X i For the output of the ith residual block for expansion,
Figure FDA0003756143950000021
s223: constructing an attention mechanism feedback module CBAM, and expanding the output X of the residual error block in the step S222 i Inputting an attention mechanism feedback module CBAM to obtain a channel attention output characteristic diagram
Figure FDA0003756143950000031
And spatial attention output feature map
Figure FDA0003756143950000032
And outputting the spatial attention as a feature map
Figure FDA0003756143950000033
Output characteristic diagram Y as attention mechanism feedback module CBAM i
S224: establishing the recursive enhanced feature extraction module Dilated Encoder network:

x_{i+1} = Y_i, i = 1, 2, …, Z-1
through steps S222 to S223, the output feature map Y_{Z-1} of the attention mechanism feedback module CBAM at the (Z-1)-th recursion is obtained, which gives the input x_Z of the Z-th dilated residual block; substituting x_Z into the dilated residual block output X_i = x_i + conv_5(conv_4(conv_3(x_i))) of step S222 yields the enhanced feature layer of the enhanced feature extraction module Dilated Encoder network corresponding to the first feature layer;
S225: repeating steps S221 to S224 for the second feature layer and the third feature layer to obtain their corresponding enhanced feature layers output by the enhanced feature extraction module Dilated Encoder network, thereby completing the construction of the enhanced feature extraction module Dilated Encoder network.
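A minimal sketch of the recursive construction in steps S221 to S224, assuming a dilation rate of 2^i for the i-th dilated residual block and treating the CBAM module as a pluggable attention block; both choices are illustrative assumptions rather than part of the claim.

```python
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """X_i = x_i + conv_5(conv_4(conv_3(x_i))): 1x1 reduce, 3x3 (here dilated), 1x1 restore."""
    def __init__(self, channels, dilation=1, mid=None):
        super().__init__()
        mid = mid or max(channels // 4, 1)
        self.conv3 = nn.Conv2d(channels, mid, 1)
        self.conv4 = nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation)
        self.conv5 = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        return x + self.conv5(self.conv4(self.conv3(x)))

class DilatedEncoderSketch(nn.Module):
    """Recursive construction of claim 5: initial 1x1 + 3x3 convolutions, Z dilated
    residual blocks, Z-1 attention feedback modules (CBAM), with x_{i+1} = Y_i."""
    def __init__(self, in_channels, channels, Z=4, cbam_factory=None):
        super().__init__()
        self.init_conv = nn.Sequential(
            nn.Conv2d(in_channels, channels, 1),           # conv_1: reduce channel dimension
            nn.Conv2d(channels, channels, 3, padding=1))   # conv_2: refine semantic context
        self.blocks = nn.ModuleList(
            [DilatedResidualBlock(channels, dilation=2 ** i) for i in range(Z)])
        cbam_factory = cbam_factory or (lambda c: nn.Identity())
        self.cbams = nn.ModuleList([cbam_factory(channels) for _ in range(Z - 1)])

    def forward(self, m):
        x = self.init_conv(m)                              # x_1 = conv_2(conv_1(m_1))
        for i, block in enumerate(self.blocks):
            X = block(x)                                   # X_i = x_i + conv_5(conv_4(conv_3(x_i)))
            x = self.cbams[i](X) if i < len(self.cbams) else X   # x_{i+1} = Y_i
        return x                                           # enhanced feature layer
```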
6. The method as claimed in claim 5, wherein the decoupling output module YoloHead network in step S23 includes a dynamic convolution layer, a layer attention mechanism and a prediction parameter layer;
step S23 specifically includes:
S231: calculating the task interaction features of the decoupling output module YoloHead network to obtain the dynamic convolution layers X_k^inter:

X ∈ R × H × W × C

X_k^inter = δ(conv_k(X)), k = 1, 2, …

where X is one of the enhanced feature layers obtained in step S225, R, H, W and C respectively denote the number of images batchSize input to the YOLOX network model each time, the image height, the image width and the number of channels, δ denotes the ReLU activation function, and conv_k denotes the k-th convolutional layer;
S232: using a layer attention mechanism on the dynamic convolution layers X_k^inter obtained in step S231 to calculate the feature layers X_k^task for the classification and regression tasks:

w = σ(fc_2(δ(fc_1(x_inter))))

X_k^task = w_k · X_k^inter

where x_inter is the feature map obtained by splicing the dynamic convolution layers X_k^inter, fc_1 is the first fully connected layer, fc_2 is the second fully connected layer, w is the k-dimensional weight variable calculated from x_inter by the layer attention mechanism, capable of capturing the dependency relationship among the k convolutional layers, w_k is the k-th element of w, and σ is the sigmoid function;
S233: according to the feature layers X_k^task obtained in step S232, obtaining the prediction parameter Z_task for classification or regression produced by the enhanced feature layer through the decoupling output module YoloHead network:

Z_task = conv_12(δ(conv_11(X_task)))

where X_task is the feature map obtained by splicing the feature layers X_k^task, conv_11 is a 1 × 1 convolutional layer used to adjust the number of channels, and conv_12 is the convolutional layer that generates the prediction parameter Z_task;
S234: repeating steps S231 to S233 to obtain all the prediction parameters of the enhanced feature layers through the decoupling output module YoloHead network, thereby completing the construction of the YOLOX network model.
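A hedged sketch of one branch of the decoupled head described in steps S231 to S233: dynamic convolution layers, layer attention weights w computed with two fully connected layers, re-weighting by w_k, and the final conv_11/conv_12 stage. The number of dynamic convolution layers K, the global average pooling before fc_1 and the direct (non-stacked) application of each conv_k are assumptions, not stated in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveHeadBranch(nn.Module):
    """One classification-or-regression branch of the claim-6 YoloHead sketch:
    K dynamic convolution layers, layer attention weights w from two fully connected
    layers, re-weighting by w_k, then Z_task = conv_12(relu(conv_11(X_task)))."""
    def __init__(self, channels, K=6, out_channels=1):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(K)])
        self.fc1 = nn.Linear(K * channels, channels)
        self.fc2 = nn.Linear(channels, K)
        self.conv11 = nn.Conv2d(K * channels, channels, 1)     # 1x1: adjust channel count
        self.conv12 = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, x):
        inters = [F.relu(conv(x)) for conv in self.convs]      # dynamic convolution layers
        x_inter = torch.cat(inters, dim=1)                     # spliced dynamic conv layers
        # Layer attention: w = sigmoid(fc_2(relu(fc_1(GAP(x_inter)))))
        pooled = F.adaptive_avg_pool2d(x_inter, 1).flatten(1)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(pooled))))
        # Re-weight each dynamic convolution layer by its attention weight w_k, then splice.
        tasks = [inters[k] * w[:, k].view(-1, 1, 1, 1) for k in range(len(inters))]
        x_task = torch.cat(tasks, dim=1)
        return self.conv12(F.relu(self.conv11(x_task)))        # prediction parameters Z_task
```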
7. The method for detecting the space target based on the improved YOLOX network model as claimed in claim 6, wherein step S3 specifically comprises:
S31: inputting the RGB pictures of the training set and the validation set of step S13 into the YOLOX network model, and performing the slicing operation on the RGB pictures with the Focus module of step S21;
S32: inputting the RGB pictures processed in step S31 into the Backbone feature extraction module Backbone network, and obtaining effective feature layers through the residual module and the depth separable convolutional layers of step S21;
S33: respectively inputting the effective feature layers obtained in step S32 into the enhanced feature extraction module Dilated Encoder network to obtain effective enhanced feature layers;
S34: inputting the effective enhanced feature layers obtained in step S33 into the decoupling output module YoloHead network to obtain the prediction parameters of the effective enhanced feature layers, the prediction parameters comprising category prediction parameters Cls, target frame parameters Reg and foreground/background parameters Obj;
S35: stacking the category prediction parameters Cls, the target frame parameters Reg and the foreground/background parameters Obj of step S34 to obtain a prediction feature layer;
S36: calculating the cross-entropy losses between the prediction parameters of the prediction feature layer of step S35 and the category prediction parameters Cls, target frame parameters Reg and foreground/background parameters Obj given by the enhanced images of step S12 and the XML annotation files corresponding to the training set, and iteratively optimizing the model weights according to the cross-entropy losses until the space target detection model is obtained.
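A schematic training loop for step S3, for illustration only. The model, data loader, target encoding and loss weighting are placeholders; the sketch follows the claim's wording in applying cross-entropy-style losses to Cls, Reg and Obj, although in practice a box-regression loss (for example an IoU loss) would typically replace the Reg term.

```python
import torch
import torch.nn as nn

def train_space_target_detector(model, train_loader, epochs=100, lr=1e-3, device="cuda"):
    """Schematic loop for step S3: predict Cls/Reg/Obj, compare against the annotations,
    and iteratively update the model weights with the resulting losses."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for images, targets in train_loader:   # targets: dict with 'cls', 'reg', 'obj' tensors
            images = images.to(device)
            cls_pred, reg_pred, obj_pred = model(images)
            loss = (bce(cls_pred, targets["cls"].to(device))
                    + bce(reg_pred, targets["reg"].to(device))
                    + bce(obj_pred, targets["obj"].to(device)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```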
8. The method for detecting the space target based on the improved YOLOX network model as claimed in claim 7, wherein step S223 specifically comprises:
the channel attention output feature map is calculated as:

X_i^c = σ(conv_7(conv_6(AvgPool(X_i))) + conv_9(conv_8(MaxPool(X_i)))) ⊗ X_i

the spatial attention output feature map is calculated from the channel attention output feature map:

X_i^s = σ(conv_10(cat(AvgPool(X_i^c), MaxPool(X_i^c)))) ⊗ X_i^c

the spatial attention output feature map is taken as the output feature map of the attention mechanism feedback module CBAM:

Y_i = X_i^s

where AvgPool is the average pooling operation, MaxPool is the maximum pooling operation, conv_6, conv_7, conv_8 and conv_9 are 1 × 1 convolutional layers, conv_10 is a 7 × 7 convolutional layer, cat is the concatenation operation along the channel dimension, σ is the sigmoid function, and ⊗ denotes element-wise multiplication.
9. The method for detecting the space target based on the improved YOLOX network model as claimed in claim 8, further comprising step S5:
S5: inputting the space target images of the test set into the YOLOX network model constructed in step S2, and evaluating the overall detection performance of the YOLOX network model.
10. The method for detecting the space target based on the improved YOLOX network model as claimed in claim 9, wherein in step S5, the overall detection performance of the YOLOX network model is evaluated by the average detection precision AP of each category and the mean of the AP values over all categories, namely the mean average precision mAP;
P = TP / (TP + FP)

R = TP / (TP + FN)

AP = ∫_0^1 P(R) dR,  mAP = (1/N) Σ_{i=1}^{N} AP_i
where P denotes the precision (Precision), used to evaluate the correctness of the predictions; R denotes the recall (Recall), used to evaluate the proportion of positive samples that are correctly predicted; TP is the number of positive samples predicted as positive by the model, FP is the number of negative samples predicted as positive by the model, and FN is the number of positive samples predicted as negative by the model; AP_i is the average detection precision of the i-th category and N is the number of target categories.
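For illustration, the evaluation quantities of claim 10 can be computed as in the sketch below; the rectangle-summation approximation of AP and the example numbers are assumptions, not prescribed by the claim.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(precisions, recalls):
    """Approximate AP as the area under the precision-recall curve (rectangle summation)."""
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(per_class_ap):
    """mAP: mean of the AP values over all target categories."""
    return sum(per_class_ap) / len(per_class_ap)

# Hypothetical example with illustrative precision-recall points for two categories.
ap_a = average_precision([1.0, 0.8, 0.6], [0.2, 0.5, 0.9])
ap_b = average_precision([0.9, 0.7], [0.3, 0.8])
print(mean_average_precision([ap_a, ap_b]))
```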
CN202210874032.9A 2022-07-20 2022-07-20 Space target detection method based on improved YOLOX network model Pending CN115471670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210874032.9A CN115471670A (en) 2022-07-20 2022-07-20 Space target detection method based on improved YOLOX network model

Publications (1)

Publication Number Publication Date
CN115471670A (en) 2022-12-13

Family

ID=84366005

Country Status (1)

Country Link
CN (1) CN115471670A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631402A (en) * 2022-12-22 2023-01-20 联通(四川)产业互联网有限公司 AI algorithm service platform construction method suitable for intelligent breeding
CN115631402B (en) * 2022-12-22 2023-05-23 联通(四川)产业互联网有限公司 AI algorithm service platform construction method suitable for intelligent cultivation
CN116469014A (en) * 2023-01-10 2023-07-21 南京航空航天大学 Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN
CN116469014B (en) * 2023-01-10 2024-04-30 南京航空航天大学 Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN
CN117237741A (en) * 2023-11-08 2023-12-15 烟台持久钟表有限公司 Campus dangerous behavior detection method, system, device and storage medium
CN117237741B (en) * 2023-11-08 2024-02-13 烟台持久钟表有限公司 Campus dangerous behavior detection method, system, device and storage medium
CN117668669A (en) * 2024-02-01 2024-03-08 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7
CN117668669B (en) * 2024-04-19 Pipeline safety monitoring method and system based on improved YOLOv7


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination