CN115546586A - Method and device for detecting infrared dim target, computing equipment and storage medium

Info

Publication number
CN115546586A
CN115546586A (application CN202211318412.0A)
Authority
CN
China
Prior art keywords
feature map
target
detection
infrared
detection head
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211318412.0A
Other languages
Chinese (zh)
Inventor
程宇航
张樯
李斌
姚裔仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Environmental Features
Original Assignee
Beijing Institute of Environmental Features
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Environmental Features filed Critical Beijing Institute of Environmental Features
Priority to CN202211318412.0A
Publication of CN115546586A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The embodiment of the invention relates to the technical field of image processing, and in particular to a method and device for detecting infrared dim small targets, a computing device and a storage medium. The method comprises: acquiring an infrared dim small target image to be detected; inputting the image into a detection model generated by pre-training, the detection model being trained on a preset neural network that comprises a backbone network formed by four residual networks connected in series, a context feature extraction network composed of several dilated convolution layers, a feature fusion network, and a detection head module; and obtaining a detection result for the image from the output of the detection head module. This scheme improves the detection model's ability to detect infrared dim small targets.

Description

Method and device for detecting infrared dim target, computing equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of image processing, and in particular to a method and device for detecting infrared dim small targets, a computing device and a storage medium.
Background
In the field of computer vision, the detection of infrared dim small targets has long been a popular and challenging task. In recent years, many application fields have demanded higher detection rates and lower false-alarm rates for infrared dim small target detection.
Existing infrared dim small target detection algorithms are almost all adapted from single-stage object detectors such as the Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO): a general-purpose detection network is extended to the infrared dim small target detection task by modifying the network structure and the sizes of the preset anchor boxes. However, such detection methods that rely on preset anchor-box sizes are sensitive to the sizes of small targets, so network optimization easily develops a large bias, which degrades the detection accuracy for infrared dim small targets.
Disclosure of Invention
To improve on the detection accuracy of existing infrared dim small target detection algorithms, embodiments of the invention provide a method and device for detecting infrared dim small targets, a computing device and a storage medium.
In a first aspect, an embodiment of the present invention provides a method for detecting infrared dim small targets, comprising:
acquiring an infrared dim small target image to be detected;
inputting the infrared dim small target image into a detection model generated by pre-training, the detection model being trained on a preset neural network that comprises a backbone network formed by four residual networks connected in series, a context feature extraction network composed of several dilated convolution layers, a feature fusion network, and a detection head module;
and obtaining a detection result for the infrared dim small target image from the output of the detection head module.
Preferably, the detection head module comprises a first detection head and a second detection head;
the training mode of the detection model comprises the following steps:
obtaining a plurality of training samples marked with labels;
inputting the training samples into the backbone network, where the four residual networks perform first-stage, second-stage, third-stage and fourth-stage feature extraction on each training sample, yielding a feature map for each stage;
inputting the fourth-stage feature map into each dilated convolution layer of the context feature extraction network to obtain an expanded feature map corresponding to each training sample;
using the feature fusion network to obtain a first fused feature map corresponding to each training sample from the third-stage feature map and the expanded feature map, and then a second fused feature map corresponding to each training sample from the first fused feature map and the second-stage feature map;
inputting the first fused feature map and the second fused feature map into the first detection head and the second detection head respectively to obtain an output result for each training sample;
and adjusting the network parameters of the neural network according to each training sample's output result and label until a detection model meeting expectations is obtained.
Preferably, the first detection head and the second detection head share the same set of network parameters, and both train their parameters using group normalization.
Preferably, the first detection head and the second detection head each comprise a classification branch, a regression branch and a centrality (center-ness) branch;
the inputting of the first fused feature map and the second fused feature map into the first detection head and the second detection head respectively to obtain an output result for each training sample comprises:
inputting the first fused feature map and the second fused feature map into the classification branches of the first and second detection heads respectively, obtaining a classification result for each pixel of the two fused feature maps;
and inputting the first fused feature map into the regression and centrality branches of the first detection head, and the second fused feature map into the regression and centrality branches of the second detection head, obtaining the regression results of the first and second fused feature maps.
Preferably, the regression branch and the centrality branch share the same set of convolutional layers.
Preferably, the output of the detection head module comprises the classification and regression results of a first target feature map and of a second target feature map; the first target feature map is produced by passing the infrared dim small target image through the feature fusion network of the detection model and is processed by the first detection head, and the second target feature map is likewise produced by the feature fusion network and is processed by the second detection head;
the obtaining of the detection result for the infrared dim small target image from the output of the detection head module comprises:
mapping each pixel of the first and second target feature maps back into the infrared dim small target image, and determining initial targets from the classification results of the two maps;
generating a candidate box for each initial target from the regression results of the two maps;
calculating the distances from each initial target's central pixel in the image to its candidate box, and screening the initial targets by these distances to obtain the infrared dim small targets;
and calculating the normalized distance from each infrared dim small target's central pixel to its candidate box, and screening the candidate boxes by this normalized distance to obtain the detection boxes of the infrared dim small targets.
Preferably, the distances are calculated by the following formulas:

$$l^{*} = x - x_{0}, \qquad t^{*} = y - y_{0}, \qquad r^{*} = x_{1} - x, \qquad b^{*} = y_{1} - y$$

where the distance is the quadruple $(l^{*}, t^{*}, r^{*}, b^{*})$: $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ are the distances from the central pixel $(x, y)$ of the initial target in the infrared dim small target image to the left, top, right and bottom sides of the corresponding candidate box, $(x_{0}, y_{0})$ are the coordinates of the candidate box's top-left corner, and $(x_{1}, y_{1})$ are the coordinates of its bottom-right corner;

the normalized distance is calculated by the following formula:

$$\mathrm{Centerness} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$$

where Centerness is the normalized distance, and $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ are the distances from the infrared dim small target's central pixel to the left, top, right and bottom sides of its candidate box.
In a second aspect, an embodiment of the present invention further provides a device for detecting infrared dim small targets, comprising:
an acquisition unit for acquiring an infrared dim small target image to be detected;
a detection unit for inputting the infrared dim small target image into a detection model generated by pre-training, the detection model being trained on a preset neural network that comprises a backbone network formed by four residual networks connected in series, a context feature extraction network composed of several dilated convolution layers, a feature fusion network and a detection head module;
and a regression unit for obtaining the detection result of the infrared dim small target image from the output of the detection head module.
In a third aspect, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the method described in any embodiment of this specification.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to execute the method described in any embodiment of the present specification.
The embodiments of the invention provide a method, a device, a computing device and a storage medium for detecting infrared dim small targets. The image to be detected is input into a detection model generated by pre-training, where the model is trained on a preset neural network comprising a backbone network formed by four serially connected residual networks, a context feature extraction network composed of several dilated convolution layers, a feature fusion network and a detection head module. The backbone network extracts features from the infrared dim small target image; the context feature extraction network introduces contextual information around the target; the feature fusion network fuses the features extracted by the backbone and context networks; and the detection head module, which uses no preset anchor boxes, performs the detection. Together these improve the model's ability to detect infrared dim small targets.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting infrared dim small targets according to an embodiment of the present invention;
fig. 2 is a hardware architecture diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a structural diagram of a device for detecting infrared dim small targets according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments are described below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art from them without creative effort fall within the scope of the invention.
As described above, existing infrared dim small target detection algorithms are almost all adapted from single-stage object detectors such as the Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO), extending a general-purpose detection network to the infrared dim small target detection task by modifying the network structure and the sizes of the preset anchor boxes. However, such detection methods that rely on preset anchor-box sizes are sensitive to the sizes of small targets, so network optimization easily develops a large bias, degrading detection accuracy for infrared dim small targets.
To solve this technical problem, the inventors considered detecting the infrared dim small target image with an algorithm that uses no preset anchor boxes; accordingly, the detection head module of the detection model in this application is anchor-free, which improves the detection accuracy for infrared dim small targets.
Specific implementations of the above concepts are described below.
Referring to fig. 1, an embodiment of the present invention provides a method for detecting infrared dim small targets, comprising:
step 100: acquiring an infrared dim small target image to be detected;
step 102: inputting the infrared dim small target image into a detection model generated by pre-training, the detection model being trained on a preset neural network that comprises a backbone network formed by four residual networks connected in series, a context feature extraction network composed of several dilated convolution layers, a feature fusion network and a detection head module;
step 104: obtaining a detection result for the infrared dim small target image from the output of the detection head module.
In the embodiment of the invention, the infrared dim small target image to be detected is input into a detection model generated by pre-training, where the model is trained on a preset neural network comprising a backbone network of four serially connected residual networks, a context feature extraction network of several dilated convolution layers, a feature fusion network and a detection head module. The backbone network extracts features from the image; the context feature extraction network introduces contextual information around the infrared dim small target; the feature fusion network fuses the features extracted by the two; and finally the anchor-free detection head module performs the detection, improving the model's detection capability for infrared dim small targets.
The manner in which the various steps shown in fig. 1 are performed is described below.
With respect to step 100:
it should be noted that the image to be detected in this step may contain infrared dim small targets of any type, or none at all; no specific limitation is made here. The image to be detected carries no label.
With respect to step 102:
in some embodiments, the test head module includes a first test head and a second test head.
Then, the construction process of the detection model will be explained next.
In some embodiments, the training mode of the detection model may include the following steps H1 to H4:
step H1, obtaining a plurality of training samples marked with labels;
in step H1, several images of the infrared small and weak targets captured in the actual environment are acquired, and the categories and labeling frames of the infrared small and weak targets in the images are manually labeled to serve as training samples.
And H2, inputting a plurality of training samples into the backbone network, and respectively performing first-stage feature extraction, second-stage feature extraction, third-stage feature extraction and fourth-stage feature extraction on each training sample by using four residual error networks in the backbone network to obtain a feature map after feature extraction at each stage.
In step H2, the backbone network includes four stages of residual networks, and each stage of residual network includes 4 residual blocks. Firstly, inputting each training sample into a first-stage residual error network for first-stage feature extraction; then, inputting the feature graph after the first-stage feature extraction into a second-stage residual error network for second-stage feature extraction; secondly, inputting the feature graph after the second-stage feature extraction into a third-stage residual error network for third-stage feature extraction; and finally, inputting the feature map after the feature extraction of the third stage into a residual error network of a fourth stage to carry out the feature extraction of the fourth stage, thereby obtaining the feature map after the feature extraction of the fourth stage.
From the first stage to the fourth stage, the scale of the input feature map is gradually reduced, and the number of feature layers is gradually increased. Because the features contained in the feature maps with different scales are different, the feature map extracted in the former stage in the embodiment contains rich detail features, the feature map extracted in the later stage contains rich semantic features, and the four residual error networks connected in series by the backbone network are used for performing multi-scale feature extraction on the training sample, so that the multi-scale features of the infrared weak and small target can be obtained, and the detection rate of the detection model can be improved.
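For illustration only, the following PyTorch-style sketch shows one plausible realization of such a four-stage backbone. The patent does not give layer widths or strides, so every channel count, stride and class name here (ResidualBlock, Backbone, and so on) is an assumption, not the patented structure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection when the shortcut must change resolution or width
        self.down = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                   nn.BatchNorm2d(out_ch))
                     if stride != 1 or in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.down(x))

def make_stage(in_ch, out_ch, blocks=4, stride=2):
    """One backbone stage: 4 residual blocks, downsampling in the first."""
    layers = [ResidualBlock(in_ch, out_ch, stride)]
    layers += [ResidualBlock(out_ch, out_ch) for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    """Four serial residual stages; each halves resolution and widens channels."""
    def __init__(self):
        super().__init__()
        self.stage1 = make_stage(1, 64)     # infrared input assumed single-channel
        self.stage2 = make_stage(64, 128)
        self.stage3 = make_stage(128, 256)
        self.stage4 = make_stage(256, 512)

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)   # used later by the second fusion module
        c3 = self.stage3(c2)   # used later by the first fusion module
        c4 = self.stage4(c3)   # fed to the context feature extraction network
        return c2, c3, c4
```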
And step H3: inputting the fourth-stage feature map into each dilated convolution layer of the context feature extraction network to obtain an expanded feature map corresponding to each training sample.
In step H3, context features are extracted from the fourth-stage feature map with several dilated convolution layers, enlarging the receptive field without reducing feature resolution and without significantly increasing the model size.
In some embodiments, the context feature extraction network comprises a first, second, third, fourth and fifth dilated convolution layer.
Then, step H3 may comprise:
inputting each training sample's fourth-stage feature map into the first dilated convolution layer to expand it, obtaining a first feature map;
concatenating the first feature map with the fourth-stage feature map and inputting the result into the second dilated convolution layer to obtain a second feature map;
concatenating the second feature map, the fourth-stage feature map and the first feature map, and inputting the result into the third dilated convolution layer to obtain a third feature map;
concatenating the third feature map, the fourth-stage feature map, the first feature map and the second feature map, and inputting the result into the fourth dilated convolution layer to obtain a fourth feature map;
concatenating the fourth feature map, the fourth-stage feature map, the first, second and third feature maps, and inputting the result into the fifth dilated convolution layer to obtain a fifth feature map;
and concatenating the fifth feature map, the fourth-stage feature map and the first through fourth feature maps to obtain the expanded feature map corresponding to each training sample.
In this embodiment, the dilation rates of the first through fifth dilated convolution layers are 1, 3, 6, 12 and 18 respectively, and the layers are densely connected, which raises the feature resolution available to each layer. The dense connections and the chosen dilation rates cover a sufficiently large receptive field to form sufficiently dense features, effectively strengthening the perception of multi-scale infrared dim small targets and their context.
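Continuing the sketch, such a densely connected dilated convolution stack can be written as below; only the dense-concatenation pattern and the dilation rates 1, 3, 6, 12, 18 come from the text, while the channel widths are assumptions.

```python
import torch
import torch.nn as nn

class DenseDilatedContext(nn.Module):
    """Context feature extraction: five 3x3 dilated conv layers with rates
    1, 3, 6, 12, 18. Dense links: each layer takes the concatenation of the
    stage-4 feature map and all earlier layers' outputs."""
    def __init__(self, in_ch=512, branch_ch=128):
        super().__init__()
        rates = [1, 3, 6, 12, 18]
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch + i * branch_ch, branch_ch, 3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for i, r in enumerate(rates))

    def forward(self, c4):
        feats = [c4]                      # stage-4 feature map
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)    # the expanded feature map
```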
And step H4: using the feature fusion network to obtain a first fused feature map corresponding to each training sample from the third-stage feature map and the expanded feature map, and then a second fused feature map corresponding to each training sample from the first fused feature map and the second-stage feature map.
In step H4, the feature fusion network comprises a first feature fusion module and a second feature fusion module. First, the first fusion module fuses the expanded feature map from the context feature extraction network with the third-stage feature map from the backbone, producing a first fused feature map for each training sample. The intent is to fuse features at different scales: the third-stage feature map has a relatively large scale and retains local detail such as image edges and contours, which helps localize the target, while the expanded feature map has the smallest scale and contains more abstract semantic information but perceives detail poorly. Their fusion therefore contains both rich semantics and target detail, benefiting the detection of infrared dim small targets. The second fusion module then fuses the first fused feature map with the second-stage feature map from the backbone, constructing finer-grained features with richer semantics; the resulting second fused feature map is more descriptive, raising the detection rate for infrared dim small targets and lowering the false-alarm rate.
In some embodiments, the first feature fusion module includes a spatial attention branch and a channel attention branch.
Then, the step of obtaining the first fused feature map corresponding to each training sample from the third-stage feature map and the expanded feature map using the feature fusion network may comprise:
interpolating the expanded feature map, reducing its channel count to match the third-stage feature map, and adding the two to obtain a first fusion map;
inputting the first fusion map into the spatial attention branch and the channel attention branch respectively, and adding their output feature maps to obtain a second fusion map;
and adding the product of the second fusion map with the interpolated, channel-reduced expanded feature map to the product of the second fusion map with the third-stage feature map, obtaining the first fused feature map.
In this embodiment, within the first feature fusion module, the expanded feature map obtained in step H3 is small in size and wide in channels. To fuse it well with the third-stage feature map, it is first interpolated to the same size as the third-stage feature map, its channel count is then reduced to match using a 1 × 1 convolution, and the result is added to the third-stage feature map to obtain the first fusion map. To highlight the infrared dim small target, retaining valuable features and discarding worthless ones, the first fusion map is input into the spatial attention branch and the channel attention branch, and their outputs are added to obtain the second fusion map. Finally, the product of the second fusion map with the interpolated, channel-reduced expanded feature map is added to its product with the third-stage feature map, yielding a first fused feature map rich in semantic information and target detail.
In some embodiments, the spatial attention branch comprises two convolution layers, each with a 1 × 1 kernel; the channel attention branch comprises an adaptive pooling layer and a one-dimensional convolution layer.
In this embodiment, the first fusion map is input into the spatial attention branch and the channel attention branch respectively. The spatial attention branch uses the two 1 × 1 convolutions to output weights over the first fusion map; the channel attention branch first uses adaptive pooling to reduce the input to size C × 1 × 1, then uses a one-dimensional convolution over 5 neighboring channels to capture cross-channel interactions and generate per-channel weights. The two attention branches thus highlight the infrared dim small target in the first fusion map, retaining valuable features and discarding worthless ones, giving a more descriptive first fused feature map, raising the detection rate and lowering the false-alarm rate.
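A minimal sketch of this fusion module follows. The interpolate/reduce/add step, the dual attention branches and the re-weighted sum mirror the text; the class names, attention-branch widths and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Adaptive pooling to C x 1 x 1, then a 1-D convolution over 5
    neighboring channels to produce per-channel weights."""
    def __init__(self, k=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, 1, c)                    # (B, 1, C)
        w = torch.sigmoid(self.conv(w)).view(b, c, 1, 1)  # per-channel weights
        return x * w

class SpatialAttention(nn.Module):
    """Two 1x1 convolutions producing a per-pixel weight map."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.conv(x)

class FeatureFusion(nn.Module):
    """Fuses a deep (small, wide) map with a shallow (large, narrow) map."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, 1)  # match channel counts
        self.sa = SpatialAttention(shallow_ch)
        self.ca = ChannelAttention()

    def forward(self, deep, shallow):
        # interpolate the deep map to the shallow map's size, reduce, add
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode='bilinear', align_corners=False)
        deep = self.reduce(deep)
        fused = deep + shallow                       # first fusion map
        attn = self.sa(fused) + self.ca(fused)       # second fusion map
        return attn * deep + attn * shallow          # re-weighted sum
```

The second feature fusion module would reuse the same FeatureFusion class, fed with the first fused feature map and the second-stage feature map.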
In some embodiments, the second feature fusion module is identical in network structure to the first feature fusion module.
In this embodiment, to construct finer-grained features with richer semantics, the second feature fusion module may share the network structure of the first; the step of obtaining the second fused feature map corresponding to each training sample from the first fused feature map and the second-stage feature map may then comprise:
interpolating the first fused feature map, reducing its channel count to match the second-stage feature map, and adding the two to obtain a third fusion map;
inputting the third fusion map into the spatial attention branch and the channel attention branch respectively, and adding their output feature maps to obtain a fourth fusion map;
and adding the product of the fourth fusion map with the interpolated, channel-reduced first fused feature map to the product of the fourth fusion map with the second-stage feature map, obtaining the second fused feature map.
In this embodiment, the third fusion map is input into the spatial attention branch and the channel attention branch respectively; the spatial attention branch uses the two 1 × 1 convolutions to output weights over the third fusion map, and the channel attention branch first reduces the input to size C × 1 × 1 with adaptive pooling and then captures cross-channel interactions over 5 neighboring channels with a one-dimensional convolution to generate per-channel weights. The two branches highlight the infrared dim small target in the third fusion map, retaining valuable features and discarding worthless ones, giving a more descriptive second fused feature map, raising the detection rate and lowering the false-alarm rate.
And step H5: inputting the first fused feature map and the second fused feature map into the first detection head and the second detection head respectively to obtain an output result for each training sample.
In step H5, because the first and second fused feature maps differ in scale and carry different semantic and detail features, they are detected by the first and second detection heads respectively, improving the model's multi-scale detection capability.
And step H6: adjusting the network parameters of the neural network according to each training sample's output result and label until a detection model meeting expectations is obtained.
In step H6, after each batch of training samples is fed into the preset neural network, the outputs of each layer are computed and stored; the deviation between the network's output and the labels is computed with a loss function; the error gradient is derived from that deviation and used to update the network parameters before the next batch, until a detection model meeting expectations is obtained.
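As a generic illustration of this per-batch scheme (not patent-specific code; the loss function and optimizer are unspecified in the text and left abstract here):

```python
def train_epoch(model, loader, criterion, optimizer):
    """One epoch of the per-batch scheme described above: forward pass,
    loss against the labels, backpropagated gradients, parameter update."""
    model.train()
    for images, targets in loader:
        outputs = model(images)             # per-layer outputs computed and stored
        loss = criterion(outputs, targets)  # deviation from the labels
        optimizer.zero_grad()
        loss.backward()                     # error gradients from the deviation
        optimizer.step()                    # parameters updated for the next batch
```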
In some embodiments, the first detection head and the second detection head share the same set of network parameters, and both train those parameters using group normalization.
In this embodiment, sharing one set of parameters between the two detection heads significantly reduces the parameter count and addresses the scale imbalance of input targets during training. In addition, both heads use group normalization (Group Norm). Because the heads share parameters, it is undesirable for each head to store an independent set of normalization parameters during training; with batch normalization (Batch Norm), each head would update the batch-normalization layer's parameters separately, causing parameter confusion and severely harming inference. Group normalization's parameters are independent of the batch size and can be computed directly at inference, avoiding the parameter-confusion problem.
In some embodiments, the first detection head and the second detection head each comprise a classification branch, a regression branch, and a centrality branch.
Then, step H5 may include:
inputting the first fused feature map and the second fused feature map into the classification branches of the first and second detection heads respectively, obtaining a classification result for each pixel of the two fused feature maps;
and inputting the first fused feature map into the regression and centrality branches of the first detection head, and the second fused feature map into the regression and centrality branches of the second detection head, obtaining the regression results of the two fused feature maps.
In this embodiment, the classification branch transforms the input feature map through 4 convolution layers and 1 output layer into shape (B × H × W × C), where B is the batch size, H and W are the input feature map's height and width, and C is the number of classes; its result classifies the category of each point of the input feature map. The regression branch regresses, through 4 convolution layers and 1 output layer, each point's offsets to the corresponding box boundaries, with output shape (B × H × W × 4). The centrality branch regresses the centrality of each point through 4 convolution layers and 1 output layer.
In some embodiments, the regression branch and the centrality branch share the same set of convolutional layers.
In this embodiment, since the regression branch and the centrality branch both need each point's position, they can share the same set of convolution layers to reduce network parameters: the centrality branch shares the regression branch's first 4 convolution layers, and its output shape is (B × H × W × 1).
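Putting the head together, one plausible sketch is below. GroupNorm replaces BatchNorm per the earlier discussion, and the regression and centrality branches share one 4-layer convolution tower; channel and group counts are assumptions. Note that PyTorch lays tensors out as (B × C × H × W) rather than the (B × H × W × C) shapes quoted above.

```python
import torch.nn as nn

class SharedDetectionHead(nn.Module):
    """One set of parameters applied to both fusion maps. GroupNorm is used
    so normalization does not depend on the batch, keeping the two heads
    consistent at inference time."""
    def __init__(self, ch=128, num_classes=1, groups=32):
        super().__init__()
        def tower():
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(ch, ch, 3, padding=1),
                           nn.GroupNorm(groups, ch), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower = tower()   # 4 conv layers for classification
        self.reg_tower = tower()   # 4 conv layers shared by regression
                                   # and the centrality branch
        self.cls_out = nn.Conv2d(ch, num_classes, 3, padding=1)  # B x C x H x W
        self.reg_out = nn.Conv2d(ch, 4, 3, padding=1)            # B x 4 x H x W
        self.ctr_out = nn.Conv2d(ch, 1, 3, padding=1)            # B x 1 x H x W

    def forward(self, x):
        cls_feat = self.cls_tower(x)
        reg_feat = self.reg_tower(x)   # shared by regression and centrality
        return (self.cls_out(cls_feat),
                self.reg_out(reg_feat),
                self.ctr_out(reg_feat))
```

The same module instance would be called once on each fused feature map, which is what "sharing the same set of network parameters" amounts to in practice.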
In summary, the process of constructing the test model is described.
With respect to step 104:
in some embodiments, after the infrared dim small target image to be detected is input into the detection model trained as above, the output of the detection head module comprises the classification and regression results of a first target feature map and of a second target feature map. The first target feature map is obtained by passing the image through the detection model's feature fusion network and is processed by the first detection head; the second target feature map is likewise obtained from the feature fusion network and is processed by the second detection head.
Then, step 104 may include:
mapping each pixel of the first and second target feature maps back into the infrared dim small target image, and determining initial targets from the classification results of the two maps;
generating a candidate box for each initial target from the regression results of the two maps;
calculating the distances from each initial target's central pixel in the image to its candidate box, and screening the initial targets by these distances to obtain the infrared dim small targets;
and calculating the normalized distance from each infrared dim small target's central pixel to its candidate box, and screening the candidate boxes by this normalized distance to obtain the detection boxes of the infrared dim small targets.
In this embodiment, the detection model's output for the image is a pixel-level classification and localization result similar to semantic segmentation; to perform target detection, this output must be converted back into targets.
Taking the first target feature map as an example, each of its pixels is mapped back to its position in the original image, i.e. in the infrared dim small target image to be detected. With stride s, a pixel's position on the original image may fall anywhere within an s × s region; that is, each point of the first target feature map maps back to a grid area of size s². Initial targets, covering infrared dim small targets of all categories, are then determined from the first target feature map's classification result, and a candidate box is generated centered on each initial target from the map's regression result. The initial targets and their candidate boxes for the second target feature map are determined in the same way.
Next, the distances from each initial target's central pixel in the infrared dim small target image to its candidate box are calculated.
In the present embodiment, the distances are calculated by the following formulas:

$$l^{*} = x - x_{0}, \qquad t^{*} = y - y_{0}, \qquad r^{*} = x_{1} - x, \qquad b^{*} = y_{1} - y$$

where the distance is the quadruple $(l^{*}, t^{*}, r^{*}, b^{*})$: $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ are the distances from the central pixel $(x, y)$ of the initial target in the infrared dim small target image to the left, top, right and bottom sides of the corresponding candidate box, $(x_{0}, y_{0})$ are the coordinates of the candidate box's top-left corner, and $(x_{1}, y_{1})$ are the coordinates of its bottom-right corner.
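In code form, the four distances are a direct translation of these formulas (the function name and box representation are illustrative):

```python
def box_distances(cx, cy, box):
    """Distances from a target's central pixel (cx, cy) to the four sides of
    its candidate box, given as ((x0, y0) top-left, (x1, y1) bottom-right)."""
    x0, y0, x1, y1 = box
    l = cx - x0   # left-side distance
    t = cy - y0   # top-side distance
    r = x1 - cx   # right-side distance
    b = y1 - cy   # bottom-side distance
    return l, t, r, b
```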
It should be noted that because the first and second target feature maps have different scales, i.e. different strides, an initial target on each maps back to original-image areas of different sizes, so each map suits targets of different sizes. In this embodiment the first target feature map is smaller than the second: they are 1/16 and 1/8 of the infrared dim small target image's size respectively (strides of 16 and 8). The larger the stride, the larger the mapped-back area and the larger the targets it suits; hence the first target feature map suits large targets and the second suits small ones.
Therefore, upper and lower limits on the detectable target size may be predefined for the first and second detection heads respectively, bounding the size of the targets each head detects.
For example, since the first detection head processes the first target feature map, which suits larger targets, its detectable-size interval is set to (16, 32]; similarly, the second detection head's interval is set to (0, 16]. When the distances from an initial target's central pixel to its candidate box fall outside the interval of the head that produced it, i.e. when

$$\max(l^{*}, t^{*}, r^{*}, b^{*}) \notin (16, 32]$$

for the first detection head, that initial target is recorded as a negative sample. Bounding the target sizes each head can learn in this way not only lets overlapping targets be detected without being merged into one target, but also reduces the probability that the two heads detect the same target twice. Each initial target can thus be screened by these distances to obtain the infrared dim small targets.
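A sketch of this screening rule, using the example intervals above (the interval table and function name are illustrative assumptions):

```python
# Hypothetical per-head size intervals taken from the example above.
HEAD_RANGES = {"first": (16, 32), "second": (0, 16)}

def keep_for_head(l, t, r, b, head):
    """An initial target stays a positive sample for a head only if the
    largest side distance falls inside that head's (lower, upper] interval."""
    lo, hi = HEAD_RANGES[head]
    m = max(l, t, r, b)
    return lo < m <= hi   # otherwise recorded as a negative sample
```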
Finally, the candidate box of an infrared dim small target may still deviate badly from the center of the true box; in this anchor-free method, off-center boxes are suppressed through the centrality value. Specifically, the normalized distance from each infrared dim small target's central pixel in the image to its candidate box is calculated.
In the embodiment of the invention, the normalized distance is calculated by the following formula:

$$\mathrm{Centerness} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$$

where Centerness is the normalized distance, and $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ are the distances from the infrared dim small target's central pixel to the left, top, right and bottom sides of its candidate box.
By this formula, Centerness takes values in (0, 1]; the larger the value, the closer the point is to the center of the true box, and boxes centered on such points receive larger Centerness weights. Ranking the boxes by confidence and Centerness and applying non-maximum suppression screens out the low-quality boxes, yielding the detection boxes of the infrared dim small targets.
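The Centerness formula and the confidence-times-Centerness ranking can be sketched as follows (the candidate-list layout is an assumed structure for illustration; a standard NMS would then run over the sorted scores):

```python
import math

def centerness(l, t, r, b):
    """Normalized distance in (0, 1]; equals 1 at the exact box center."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def score_boxes(candidates):
    """Rank candidate boxes by classification confidence weighted by
    Centerness, so off-center, low-quality boxes sink to the bottom."""
    scored = [(conf * centerness(l, t, r, b), box)
              for conf, box, (l, t, r, b) in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```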
The detection result for the infrared dim small target image is therefore the category and detection box of each infrared dim small target.
To examine the detection performance of the algorithm proposed in this embodiment on infrared dim small targets, it was tested on a test set, giving the accuracy and speed figures in Table 1 below.
TABLE 1

| Method        | Average precision (AP50) | Processing rate |
|---------------|--------------------------|-----------------|
| The invention | 92.40%                   | 105 FPS         |
As shown in fig. 2 and fig. 3, an embodiment of the present invention provides a device for detecting infrared dim small targets. The device embodiments may be implemented by software, by hardware, or by a combination of the two. In hardware terms, fig. 2 shows the hardware architecture of the computing device hosting the detection device provided by this embodiment: besides the processor, memory, network interface and non-volatile storage shown, the computing device may also include other hardware, such as a forwarding chip responsible for packet processing. Taking a software implementation as an example, as shown in fig. 3, the device is a logical apparatus formed by the CPU of its computing device reading the corresponding computer program from non-volatile storage into memory and running it.
As shown in fig. 3, the present embodiment provides a device for detecting infrared dim small targets, comprising:
an acquisition unit 301 for acquiring an infrared dim small target image to be detected;
a detection unit 302 for inputting the infrared dim small target image into a detection model generated by pre-training, the detection model being trained on a preset neural network that comprises a backbone network formed by four residual networks connected in series, a context feature extraction network composed of several dilated convolution layers, a feature fusion network and a detection head module;
and a regression unit 303 for obtaining the detection result of the infrared dim small target image from the output of the detection head module.
In one embodiment of the present invention, in the detection unit 302, the detection head module includes a first detection head and a second detection head;
the training mode of the detection model comprises the following steps:
obtaining a plurality of training samples marked with labels;
inputting the training samples into the backbone network, where the four residual networks perform first-stage, second-stage, third-stage and fourth-stage feature extraction on each training sample, yielding a feature map for each stage;
inputting the fourth-stage feature map into each dilated convolution layer of the context feature extraction network to obtain an expanded feature map corresponding to each training sample;
using the feature fusion network to obtain a first fused feature map corresponding to each training sample from the third-stage feature map and the expanded feature map, and then a second fused feature map corresponding to each training sample from the first fused feature map and the second-stage feature map;
inputting the first fusion characteristic diagram and the second fusion characteristic diagram into a first detection head and a second detection head respectively to obtain an output result of each training sample;
and adjusting the network parameters of the neural network according to the output result of each training sample and the label corresponding to each training sample until a detection model meeting expectations is obtained.
In an embodiment of the present invention, in the detection unit 302, the first detection head and the second detection head share the same set of network parameters and train those parameters using group normalization.
In one embodiment of the present invention, in the detection unit 302, the first detection head and the second detection head each include a classification branch, a regression branch, and a centrality branch;
and, when inputting the first fused feature map and the second fused feature map into the first and second detection heads respectively to obtain each training sample's output result, is configured to perform:
inputting the first fused feature map and the second fused feature map into the classification branches of the first and second detection heads respectively, obtaining a classification result for each pixel of the two fused feature maps;
and inputting the first fused feature map into the regression and centrality branches of the first detection head, and the second fused feature map into the regression and centrality branches of the second detection head, obtaining the regression results of the two fused feature maps.
In one embodiment of the present invention, the regression branch and the centrality branch share the same set of convolutional layers in the detection unit 302.
In one embodiment of the present invention, in the detection unit 302, the output of the detection head module comprises the classification and regression results of a first target feature map and of a second target feature map; the first target feature map is obtained by passing the infrared dim small target image through the detection model's feature fusion network and is processed by the first detection head, and the second target feature map is likewise obtained from the feature fusion network and is processed by the second detection head;
the regression unit 303, when obtaining the detection result of the infrared dim small target image from the output of the detection head module, is configured to perform:
mapping each pixel of the first and second target feature maps back into the infrared dim small target image, and determining initial targets from the classification results of the two maps;
generating a candidate box for each initial target from the regression results of the two maps;
calculating the distances from each initial target's central pixel in the image to its candidate box, and screening the initial targets by these distances to obtain the infrared dim small targets;
and calculating the normalized distance from each infrared dim small target's central pixel to its candidate box, and screening the candidate boxes by this normalized distance to obtain the detection boxes of the infrared dim small targets.
In one embodiment of the present invention, in the regression unit 303, the distance is calculated by the following formula:
l^* = x - x_0,\quad t^* = y - y_0,\quad r^* = x_1 - x,\quad b^* = y_1 - y
wherein the distance is (l^*, t^*, r^*, b^*); l^*, t^*, r^* and b^* are respectively the left, top, right and bottom distances from the center pixel (x, y) of the initial target in the infrared dim and small target image to the corresponding candidate box; (x_0, y_0) are the coordinates of the top-left corner of the candidate box; and (x_1, y_1) are the coordinates of the bottom-right corner of the candidate box;
the normalized distance is calculated by the following formula:
\text{centerness} = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}
wherein centerness is the normalized distance, and l^*, t^*, r^* and b^* are respectively the left, top, right and bottom distances from the center pixel of the infrared dim and small target in the infrared dim and small target image to the corresponding candidate box.
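These distance and centerness definitions match the anchor-free FCOS formulation, consistent with the definitions above. For illustration only, a small Python sketch of the screening step; the candidate boxes and the 0.5 threshold are invented for the example:

```python
import math

def distances(cx, cy, box):
    """Left/top/right/bottom distances from a center pixel (cx, cy)
    to a candidate box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return cx - x0, cy - y0, x1 - cx, y1 - cy

def centerness(l, t, r, b):
    """Normalized distance: close to 1 when the pixel sits near the box
    center, close to 0 when it sits near an edge."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# Hypothetical initial targets: (center_x, center_y, candidate_box).
initial_targets = [
    (50.0, 40.0, (44.0, 34.0, 56.0, 46.0)),  # centered box  -> kept
    (50.0, 40.0, (49.0, 39.0, 70.0, 60.0)),  # off-center box -> screened out
]

kept = []
for cx, cy, box in initial_targets:
    l, t, r, b = distances(cx, cy, box)
    if min(l, t, r, b) <= 0:           # the center pixel must lie inside the box
        continue
    if centerness(l, t, r, b) > 0.5:   # threshold is an assumption
        kept.append(box)
print(kept)  # [(44.0, 34.0, 56.0, 46.0)]
```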
It should be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation on the apparatus for detecting infrared dim and small targets. In other embodiments of the present invention, the apparatus may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the information interaction and execution processes among the modules of the apparatus are based on the same concept as the method embodiments of the present invention, refer to the description of the method embodiments for specific details, which are not repeated here.
An embodiment of the present invention further provides a computing device, including a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method for detecting an infrared dim and small target in any embodiment of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the method for detecting an infrared dim and small target in any embodiment of the present invention.
Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the embodiments described above are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion module connected to the computer, and then a CPU or the like mounted on the expansion board or the expansion module is caused to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the embodiments described above.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by program instructions running on related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting an infrared dim and small target, characterized by comprising:
acquiring an infrared dim and small target image to be detected;
inputting the infrared dim and small target image into a detection model generated by pre-training; wherein the detection model is obtained by training a preset neural network, and the preset neural network comprises a backbone network formed by connecting four residual networks in series, a context feature extraction network formed by a plurality of dilated convolution layers, a feature fusion network, and a detection head module;
and obtaining a detection result of the infrared dim and small target image according to an output result of the detection head module.
2. The method of claim 1, wherein the detection head module comprises a first detection head and a second detection head;
the training of the detection model comprises:
obtaining a plurality of training samples marked with labels;
inputting the plurality of training samples into the backbone network, and performing first-stage, second-stage, third-stage and fourth-stage feature extraction on each training sample by using the four residual networks in the backbone network, to obtain a feature map after each stage of feature extraction;
inputting the feature map after the fourth-stage feature extraction into each dilated convolution layer in the context feature extraction network, to obtain an expansion feature map corresponding to each training sample;
obtaining, by using the feature fusion network, a first fused feature map corresponding to each training sample based on the feature map after the third-stage feature extraction and the expansion feature map, and then obtaining a second fused feature map corresponding to each training sample based on the first fused feature map and the feature map after the second-stage feature extraction;
inputting the first fused feature map and the second fused feature map to the first detection head and the second detection head respectively to obtain an output result of each training sample;
and adjusting the network parameters of the neural network according to the output result of each training sample and the label corresponding to each training sample until a detection model meeting expectations is obtained.
3. The method of claim 2, wherein the first detection head and the second detection head share the same set of network parameters; and the first detection head and the second detection head train their network parameters using group normalization.
4. The method of claim 2 or 3, wherein the first detection head and the second detection head each comprise a classification branch, a regression branch, and a centerness branch;
wherein inputting the first fused feature map and the second fused feature map into the first detection head and the second detection head respectively to obtain an output result of each training sample comprises:
inputting the first fused feature map and the second fused feature map into the classification branch of the first detection head and the classification branch of the second detection head respectively, to obtain a classification result of each pixel in the first fused feature map and the second fused feature map;
and inputting the first fused feature map into the regression branch and the centerness branch of the first detection head respectively, and inputting the second fused feature map into the regression branch and the centerness branch of the second detection head respectively, to obtain a regression result of the first fused feature map and a regression result of the second fused feature map.
5. The method of claim 4, wherein the regression branch and the centerness branch share the same set of convolutional layers.
6. The method according to claim 4, wherein the output results of the detection head module comprise a classification result and a regression result of a first target feature map, and a classification result and a regression result of a second target feature map; the classification result and the regression result of the first target feature map are the outputs of the first detection head of the detection model for the first target feature map, which is obtained by passing the infrared dim and small target image through the feature fusion network of the detection model; and the classification result and the regression result of the second target feature map are the outputs of the second detection head of the detection model for the second target feature map, which is obtained by passing the infrared dim and small target image through the feature fusion network of the detection model;
wherein obtaining the detection result of the infrared dim and small target image according to the output result of the detection head module comprises:
mapping each pixel in the first target feature map and the second target feature map back to the infrared dim and small target image, so as to determine initial targets according to the classification results corresponding to the first target feature map and the second target feature map respectively;
generating a candidate box for each initial target according to the regression results corresponding to the first target feature map and the second target feature map;
calculating distances from the center pixel of each initial target in the infrared dim and small target image to the corresponding candidate box, and screening the initial targets according to the distances to obtain infrared dim and small targets;
and calculating a normalized distance from the center pixel of each infrared dim and small target in the infrared dim and small target image to the corresponding candidate box, and screening the candidate boxes corresponding to the infrared dim and small targets according to the normalized distance to obtain detection boxes of the infrared dim and small targets.
7. The method of claim 6,
the distance is calculated by the following formula:
l^* = x - x_0,\quad t^* = y - y_0,\quad r^* = x_1 - x,\quad b^* = y_1 - y
wherein the distance is (l^*, t^*, r^*, b^*); l^*, t^*, r^* and b^* are respectively the left, top, right and bottom distances from the center pixel (x, y) of the initial target in the infrared dim and small target image to the corresponding candidate box; (x_0, y_0) are the coordinates of the top-left corner of the candidate box; and (x_1, y_1) are the coordinates of the bottom-right corner of the candidate box;
the normalized distance is calculated by the following formula:
\text{centerness} = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}
wherein centerness is the normalized distance, and l^*, t^*, r^* and b^* are respectively the left, top, right and bottom distances from the center pixel of the infrared dim and small target in the infrared dim and small target image to the corresponding candidate box.
8. An apparatus for detecting an infrared dim and small target, characterized by comprising:
an acquisition unit, configured to acquire an infrared dim and small target image to be detected;
a detection unit, configured to input the infrared dim and small target image into a detection model generated by pre-training; wherein the detection model is obtained by training a preset neural network, and the preset neural network comprises a backbone network formed by connecting four residual networks in series, a context feature extraction network formed by a plurality of dilated convolution layers, a feature fusion network, and a detection head module;
and a regression unit, configured to obtain a detection result of the infrared dim and small target image according to an output result of the detection head module.
9. A computing device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
CN202211318412.0A 2022-10-26 2022-10-26 Method and device for detecting infrared dim target, computing equipment and storage medium Pending CN115546586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211318412.0A CN115546586A (en) 2022-10-26 2022-10-26 Method and device for detecting infrared dim target, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211318412.0A CN115546586A (en) 2022-10-26 2022-10-26 Method and device for detecting infrared dim target, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115546586A true CN115546586A (en) 2022-12-30

Family

ID=84718806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211318412.0A Pending CN115546586A (en) 2022-10-26 2022-10-26 Method and device for detecting infrared dim target, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115546586A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863298A (en) * 2023-06-29 2023-10-10 深圳市快瞳科技有限公司 Training and early warning sending method, system, device, equipment and medium


Similar Documents

Publication Publication Date Title
CA3066029A1 (en) Image feature acquisition
CN109086811A (en) Multi-tag image classification method, device and electronic equipment
CN111274981B (en) Target detection network construction method and device and target detection method
JP6897749B2 (en) Learning methods, learning systems, and learning programs
CN110717881A (en) Wafer defect identification method and device, storage medium and terminal equipment
CN111783754B (en) Human body attribute image classification method, system and device based on part context
CN111860537B (en) Deep learning-based green citrus identification method, equipment and device
CN111738174B (en) Human body example analysis method and system based on depth decoupling
CN112381763A (en) Surface defect detection method
CN111767962A (en) One-stage target detection method, system and device based on generation countermeasure network
CN111091101A (en) High-precision pedestrian detection method, system and device based on one-step method
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN112036249A (en) Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification
CN115546586A (en) Method and device for detecting infrared dim target, computing equipment and storage medium
CN115661573A (en) Method and device for detecting infrared dim target, computing equipment and storage medium
CN114494823A (en) Commodity identification, detection and counting method and system in retail scene
CN114299358A (en) Image quality evaluation method and device, electronic equipment and machine-readable storage medium
CN112418256A (en) Classification, model training and information searching method, system and equipment
CN113191235A (en) Sundry detection method, device, equipment and storage medium
CN111353577B (en) Multi-task-based cascade combination model optimization method and device and terminal equipment
CN115620083B (en) Model training method, face image quality evaluation method, equipment and medium
CN115294405B (en) Method, device, equipment and medium for constructing crop disease classification model
CN111760292A (en) Method and device for detecting sampling data and electronic equipment
CN114332084B (en) PCB surface defect detection method based on deep learning
CN110428012A (en) Brain method for establishing network model, brain image classification method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination