CN113158943A

CN113158943A - Cross-domain infrared target detection method

Info

Publication number: CN113158943A
Application number: CN202110474134.7A
Authority: CN
Inventors: 颜成钢; 路统宇; 戴振宇; 孙垚棋; 张继勇; 李宗鹏; 张勇东
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-23

Abstract

The invention discloses a cross-domain infrared target detection method, which comprises the following steps of firstly obtaining source domain data with labels and target domain data with a small number of labels and without labels: and then training a Mask R-CNN-1 network by using source domain data containing tags: carrying out domain adaptive training on a new Mask R-CNN-2 by utilizing the source domain data and the target domain data: and finally, inputting the infrared image to be subjected to target detection into the trained Mask R-CNN-2 network, and finally realizing the target detection function. The method of the invention utilizes the domain adaptation technology, solves the problem of poor network training effect under the condition of insufficient target domain data labels, and improves the accuracy of target detection; the defect that the domain adaptation task is only carried out on the characteristic level is overcome, a Mask R-CNN network is used, and pixel-level target detection is further realized on the basis of target detection.

Description

Cross-domain infrared target detection method

Technical Field

The invention belongs to the technical field of computer vision and pattern recognition, relates to a cross-domain infrared target detection method, and particularly relates to a semi-supervised cross-domain infrared target detection method based on a feature map and a CAM (computer-aided manufacturing) map layer, which can align a source domain and a target domain on the feature map layer and the CAM map layer to solve the problem of few labels of a target domain training set, namely, the network training task which takes the target domain as a target is assisted by source domain data, and the network training and the realization of infrared image target detection are realized under the condition of few labels of the target domain data.

Background

Since the human eye can perceive light in the wavelength range of 0.43 μm to 0.79 μm, light outside this range needs to be detected by a specialized detector. The infrared ray is an electromagnetic wave with the wavelength between 750nm and 1mm, the infrared imaging is realized by receiving the infrared radiation generated by an object and converting the infrared radiation into an infrared image through a sensor, and the infrared imaging has the outstanding characteristics of long detection distance, high concealment, capability of working day and night and the like, not only extends the human visual system, but also can greatly make up the defect that the visible light imaging is limited by the environmental influence, has very important status in the field of target detection, and is widely applied to civil and military scenes. For example, in public gathering places such as railway stations, airports, markets and the like and battlefield environments, targets are generally required to be identified and detected accurately in time, and compared with visible light imaging, infrared imaging can provide a good supplementary effect for target detection in a complicated field of view.

The traditional infrared target detection algorithm is to manually design a target template, perform template matching on a target, and determine the position of the target according to the compared similarity, and the main detection methods include a gradient histogram algorithm, a scale invariant feature detection algorithm, a haar feature algorithm and the like. With the development of deep neural networks in recent years, many scholars propose different target detection algorithms, the basic principle of which is to abstract and extract the features of an image through convolution operation, such as RCNN algorithm, Fast R-CNN algorithm, Mask R-CNN algorithm and the like.

However, the existing target detection methods such as Fast R-CNN and Mask R-CNN are all fully supervised target detection algorithms, a large amount of labeled training data is needed, and under the condition of less labeled training data, an overfitting phenomenon often occurs, so that the final performance of the network is reduced; meanwhile, compared with the visible light image, the labeling cost of the infrared image is higher, the infrared image data set with the label is less, the contrast of the infrared image is lower, the detail information is lacked, the difference between the infrared image data set and the visible light image is larger, and if the infrared image is directly subjected to target detection by a network trained by the visible light image, a good effect cannot be achieved. The domain adaptation technology is used as a method for reducing inter-domain differences, source domain data with rich labels can be used as assistance to constrain target domain features extracted by a network, so that the feature distribution of the target domain features is close to that of the source domain data, and the purpose of improving the detection precision is achieved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a cross-domain infrared target detection method.

The method is an infrared target detection method based on a characteristic diagram and a CAM (class Activation map) diagram level semi-supervised cross-domain.

A cross-domain infrared target detection method comprises the following steps:

step 1, acquiring source domain data with labels and target domain data with a small number of labels and without labels:

the source domain data and the target domain data are respectively visible light images and infrared images, the related scenes of the source domain data and the target domain data are similar, and the targets are the same. The labeled source domain is denoted as S ═ { X_s，Y_sIn which x_s∈X_sAnd y_s∈Y_sRespectively representing source domain data and corresponding tags; the target domain is composed of two parts, the target domain T containing a label₁＝{X_t1，Y_t1In which x_t1∈X_t1And y_t1∈Y_t1Respectively representing target domain data and corresponding labels, and a target domain T without labels₂＝{X_t2In which x_t2∈X_t2Representing the target domain data.

Step 2, training a Mask R-CNN-1 network by using source domain data containing labels:

inputting labeled source domain data into a Mask R-CNN-1 network for full supervision training, wherein a loss function of the network is divided into two parts, the first part is RPN foreground/background classification loss and RPN target frame regression loss, and the two parts can be expressed as L_RPNThe second part is Mask R-CNN head loss, including classification loss, regression loss and pixel segmentation loss, which can be expressed as L_HEADI.e. the total loss function of the Mask R-CNN network is expressed as L_supervised＝L_RPN+L_HEADThe network minimizes the total loss function by adopting a random gradient descent algorithm, and returns the parameters to the convolution network, the RPN network and the head structure of the Mask R-CNN-1, and the trained Mask R-CNN-1 network can be obtained after the training is finished.

Step 3, performing domain adaptive training on a new Mask R-CNN-2 by using the source domain data and the target domain data:

fixing the convolutional network backbone-1 in the Mask R-CNN-1 network trained in the step 2, and respectively inputting the source domain data and the target domain data into the backbone-1 and the Mask R-CNN-2 to obtain a characteristic diagram F of the source domain data_sAnd feature map F of target domain data_tAnd respectively carrying out CAM (class Activation mapping) operation on the feature graphs of the two-domain data to obtain an attention graph A of the two-domain data_sAnd A_tWherein

F_jIs the jth channel, ω, of the feature map F_jIs F_jWeight value obtained after GAP (Global Average Pooling) operation. In the obtained F_sAnd F_tAnd A_sAnd A_tRespectively adding a feature discriminator D_FAnd attention discriminator D_AThe structure is used for countertraining, so that the feature distribution and the attention distribution of the target domain data are close to the feature distribution and the attention distribution of the source domain data to reduce the inter-domain gap, wherein the discriminator is a multilayer convolution structure, and the loss functions are respectively as follows:

and 4, carrying out target detection on the infrared image:

inputting the infrared image to be subjected to target detection into the Mask R-CNN-2 network trained in the step 3, obtaining a characteristic diagram of the infrared image through forward propagation, and sending the characteristic diagram into an RPN network to obtain the position, the category and the Mask information of the target, thereby finally realizing the target detection function.

The invention has the following beneficial effects:

1. by using a domain adaptation technology, the problem of poor network training effect under the condition of insufficient target domain data labels is solved, and the target detection accuracy is improved to a certain extent;

2. the method has the advantages that the region which has decisive significance on the target detection task is focused by using the attention mechanism while the region adaptation is carried out on the characteristic layer, the region adaptation task is carried out on the CAM layer surface, and the defect that the region adaptation task is carried out only on the characteristic layer is overcome.

3. The Mask R-CNN network is used, Mask (Mask) generation branches are added on the basis of the fast R-CNN network, pixel-level target detection is further realized on the basis of target detection, namely, each pixel point of a target can be identified to perform more accurate image target segmentation, and realization possibility is provided for downstream tasks such as human body key point detection, unmanned driving and other target identification tasks needing more accuracy.

Drawings

FIG. 1 is a flow chart of a target detection algorithm proposed by the present invention;

FIG. 2 is a schematic diagram of a pre-training Mask R-CNN-1 network structure;

FIG. 3 is a schematic diagram of Mask R-CNN-2 network domain adaptive training structure;

FIG. 4 is a schematic diagram of target detection based on a trained Mask R-CNN-2 network.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

At present, most target detection tasks can detect the position of a target, but foreground pixels and background pixels at the position of the target cannot be distinguished, and good effects cannot be achieved in some tasks needing more accurate target detection, a Mask R-CNN Network is adopted to realize pixel-level target detection, the target detection is evolved from fast R-CNN, a Mask generation branch is added on the basis of detecting the position and the type of the target, the structure of the target detection branch mainly comprises a Convolutional Network (Convolutional Back), a Region recommendation Network (Region Proposal Network), namely an RPN Network, an FPN Network (Feature Pyramid Network), a RoI Align structure and a multi-task head structure, wherein the Convolutional Network is a series of Convolutional layers for extracting image Feature maps, and the common structures comprise VGG19, Res 50 and the like; the area recommendation network is used to help the network recommend areas of interest, which consist of two convolutional layers. The process of detecting the target by the Mask R-CNN network is mainly divided into two stages, wherein the first stage is to divide an image alternative area, and the second stage is to detect the target in the alternative area; meanwhile, the cost for acquiring the infrared image is high, the infrared image data set with the label is less, and a large amount of manpower and financial resources are consumed for labeling the infrared image, so that an over-fitting phenomenon is often caused if a traditional full-supervision type network training method is performed, a good training effect cannot be achieved, and the method is difficult to be applied in practice; in addition, although the domain adaptation is well applied to the small-sample semi-supervised target detection task, the most domain adaptation is to set constraints between the feature maps of the source domain and the target domain, so that the feature distribution of the target domain data with few labels is close to the feature distribution of the source domain data with a large number of labels, and the region which has decisive significance on the target detection task in the feature maps is not considered.

Aiming at the problems, the invention provides a cross-domain indoor infrared human body detection algorithm, a flow chart of which is shown in figure 1, and the steps are as follows:

step (1), acquiring source domain data with labels and target domain data with a small number of labels and without labels:

the source domain data and the target domain data are respectively visible light images and infrared images, the scenes related to the source domain data and the target domain data are both indoor key areas, and the target is a human body. The labeled source domain is denoted as S ═ { X_s，Y_sIn which x_s∈X_sAnd y_s∈Y_sRespectively representing a source domain image and a corresponding human body target label; the target domain is composed of two parts, the target domain T containing a label₁＝{X_t1，Y_t1In which x_t1∈X_t1And y_t1∈Y_t1Respectively representing an image of the target field and a corresponding label of the target of the human body, and a target field T without label₂＝{X_t2In which x_t2∈X_t2Representing the target domain image. Wherein the number of images in the source domain data is more than 2 times the number of images in the target domain.

Step (2), training a Mask R-CNN-1 network by using source domain data containing labels: referring to fig. 2, the labeled source domain data is input into a Mask R-CNN-1 network for full supervision training, wherein a convolutional network in the Mask R-CNN-1 network is composed of a ResNet50 network and a FPN network and generates a feature map, and then the convolutional network is obtainedThe feature map input RPN network of (1) including two convolutional layers to generate a region Proposal (RoI), and then inputting the feature map and the region Proposal into a RoI Align structure, the feature obtained by each RoI is better aligned to the RoI region on the original map by bilinear interpolation. Parameters of the ResNet50 network are initialized using the pre-trained model, and parameters of FPN, PRN, and RoI Align are initialized randomly. The loss function of the network can be divided into two parts, the first part is RPN foreground/background classification loss and RPN target frame regression loss, which can be expressed as L_RPNThe second part is Mask R-CNN head loss, including classification loss, regression loss and pixel segmentation loss, which can be expressed as LHEAD, i.e. the total loss function of the Mask R-CNN network is expressed as L_supervised＝L_RPN+L_HEADThe network adopts a random gradient descent algorithm to minimize a total loss function, and transmits parameters back to a Mask R-CNN convolution network, an RPN network and a RoI Align structure, the learning rate of the random gradient descent algorithm is set to be 0.0001, and the iteration number is set to be 30. And obtaining the trained Mask R-CNN-1 network after the training is finished.

And (3) performing domain adaptive training on a new Mask R-CNN network Mask R-CNN-2 by using the source domain data and the target domain data: as shown in FIG. 3, the structure and initialization parameter setting of each part in the Mask R-CNN-2 network are the same as those of the Mask R-CNN-1 network. Fixing the convolutional network backbone-1 in the Mask R-CNN-1 network trained in the step 2, and respectively inputting the source domain data and the target domain data into the backbone-1 and the Mask R-CNN-2 to obtain a characteristic diagram F of the source domain data_sAnd feature map F of target domain data_tAnd respectively carrying out CAM (class Activation mapping) operation on the feature graphs of the two-domain data to obtain an attention graph A of the two-domain data_sAnd A_tWherein

and:

the arbiter network adopts a random gradient descent algorithm to minimize a total loss function, and transmits parameters back to a Mask R-CNN-2 convolution network, an RPN network and a RoI Align structure, the learning rate of the random gradient descent algorithm is set to be 0.0001, and the iteration number is set to be 30. The overall loss function of the domain adaptation part is:

where α and β are the feature domain adaptation loss weight and the attention domain adaptation weight, 0.4 and 0.6, respectively.

And (4) carrying out target detection on the infrared image: as shown in fig. 4, the infrared image to be subjected to human body detection is input to the Mask R-CNN-2 network trained in step 3, the characteristic diagram of the infrared image is obtained by forward propagation to the convolutional network, and is sent to the RPN network to obtain the position, category and Mask information of the human body, so that the function of detecting the human body in the indoor key area scene is finally realized.

Claims

1. A cross-domain infrared target detection method is characterized by comprising the following steps:

source domain data and target domain numberAccording to the visible light image and the infrared image, the related scenes of the visible light image and the infrared image are similar, and the targets are the same; the labeled source domain is denoted as S ═ { X_s,Y_sIn which x_s∈X_sAnd y_s∈Y_sRespectively representing source domain data and corresponding tags; the target domain is composed of two parts, the target domain T containing a label₁＝{X_t1,Y_t1In which x_t1∈X_t1And y_t1∈Y_t1Respectively representing target domain data and corresponding labels, and a target domain T without labels₂＝{X_t2In which x_t2∈X_t2Representing target domain data;

step 2, training a MaskR-CNN-1 network by using source domain data containing labels:

inputting the labeled source domain data into a MaskR-CNN-1 network for full supervision training, wherein a loss function of the network is divided into two parts, the first part is RPN foreground/background classification loss and RPN target frame regression loss, and the two parts can be expressed as L_RPNThe second part is Mask R-CNN head loss, including classification loss, regression loss and pixel segmentation loss, which can be expressed as L_HEADI.e. the overall loss function of the MaskR-CNN network is expressed as L_supervised＝L_RPN+L_HEADThe network minimizes the total loss function by adopting a random gradient descent algorithm, and returns parameters to the convolution network, the RPN network and the head structure of the MaskR-CNN-1, and the trained MaskR-CNN-1 network can be obtained after the training is finished;

step 3, performing domain adaptive training on a new MaskR-CNN-2 by using the source domain data and the target domain data:

fixing the convolutional network backbone-1 in the MaskR-CNN-1 network trained in the step 2, respectively inputting the source domain data and the target domain data to the backbone-1 and the Mask R-CNN-2 to obtain a characteristic diagram F of the source domain data_sAnd feature map F of target domain data_tAnd respectively carrying out CAM (class Activation mapping) operation on the feature graphs of the two-domain data to obtain an attention graph A of the two-domain data_sAnd A_tWherein

F_jIs the jth channel, ω, of the feature map F_jIs F_jWeight value obtained after GAP (Global Average Power) operation; in the obtained F_sAnd F_tAnd A_sAnd A_tRespectively adding a feature discriminator D_FAnd attention discriminator D_AThe structure is used for countertraining, so that the feature distribution and the attention distribution of the target domain data are close to the feature distribution and the attention distribution of the source domain data to reduce the inter-domain gap, wherein the discriminator is a multilayer convolution structure, and the loss functions are respectively as follows:

and 4, carrying out target detection on the infrared image:

2. The method of claim 1, wherein the number of images in the source domain data is more than 2 times the number of images in the target domain.

3. The method for detecting the cross-domain infrared target according to claim 1 or 2, characterized in that the specific method in the step (2) is as follows:

inputting the labeled source domain data into a Mask R-CNN-1 network for full supervision training, wherein a convolution network in the Mask R-CNN-1 network consists of a ResNet50 network and an FPN network and generates a feature map, and then convolvingInputting a feature map obtained by a network into an RPN network, wherein the feature map comprises two convolution layers to generate a region Proposal (RoI Proposal), then inputting the feature map and the region Proposal into a RoI Align structure, and enabling the feature obtained by each RoI to be better aligned with the RoI region on an original map through bilinear interpolation; initializing parameters of a ResNet50 network by using a pre-training model, and randomly initializing parameters of FPN, PRN and RoI Align; the loss function of the network can be divided into two parts, the first part is RPN foreground/background classification loss and RPN target frame regression loss, which can be expressed as L_RPNThe second part is Mask R-CNN head loss, including classification loss, regression loss and pixel segmentation loss, which can be expressed as L_HEADI.e. the total loss function of the Mask R-CNN network is expressed as L_supervised＝L_RPN+L_HEADThe network adopts a random gradient descent algorithm to minimize a total loss function, and transmits parameters back to a Mask R-CNN convolution network, an RPN network and a RoI Align structure, the learning rate of the random gradient descent algorithm is set to be 0.0001, and the iteration times are set to be 30; and obtaining the trained Mask R-CNN-1 network after the training is finished.

4. The method for detecting the cross-domain infrared target according to claim 3, wherein the specific method in the step (3) is as follows:

the structure and the initialization parameter setting of each part in the Mask R-CNN-2 network are the same as those of the Mask R-CNN-1 network; fixing the convolutional network backbone-1 in the Mask R-CNN-1 network trained in the step 2, and respectively inputting the source domain data and the target domain data into the backbone-1 and the Mask R-CNN-2 to obtain a characteristic diagram F of the source domain data_sAnd feature map F of target domain data_tAnd respectively carrying out CAM (class Activation mapping) operation on the feature graphs of the two-domain data to obtain an attention graph A of the two-domain data_sAnd A_tWherein

and:

the arbiter network adopts a random gradient descent algorithm to minimize a total loss function, and transmits parameters back to a Mask R-CNN-2 convolution network, an RPN network and a RoI Align structure, the learning rate of the random gradient descent algorithm is set to be 0.0001, and the iteration times are set to be 30; the overall loss function of the domain adaptation part is: