CN110544269A - twin network infrared target tracking method based on characteristic pyramid - Google Patents


Info

Publication number
CN110544269A
CN110544269A (application CN201910720012.4A)
Authority
CN
China
Prior art keywords
scale
frame
feature
classification
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910720012.4A
Other languages
Chinese (zh)
Inventor
周慧鑫
刘国均
周腾飞
宋江鲁奇
李欢
于跃
张嘉嘉
杜娟
吴娜娜
成宽洪
秦翰林
王炳健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Electronic Science and Technology
Original Assignee
Xian University of Electronic Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Electronic Science and Technology filed Critical Xian University of Electronic Science and Technology
Priority to CN201910720012.4A
Publication of CN110544269A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248 - Analysis of motion using feature-based methods involving reference images or patches
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; image sequence
    • G06T2207/10048 - Infrared image
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; learning
    • G06T2207/20084 - Artificial neural networks [ANN]

Abstract

The invention discloses a twin network infrared target tracking method based on a feature pyramid, which comprises: respectively carrying out a bottom-up full convolution operation on a template frame and a detection frame, then performing top-down operations and lateral connections on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers; performing convolution operations on the corresponding scale feature layers of the detection frame using the classification weight and regression weight of each scale feature layer of the template frame, to determine the scale proposal corresponding to each scale feature layer of the detection frame; and carrying out non-maximum suppression on the scale proposals of all scale feature layers of the detection frame, retaining the highest-scoring proposal as the output tracking result. The method discriminates target deformation better and is better suited to complex scenes and deforming targets.

Description

Twin network infrared target tracking method based on characteristic pyramid
Technical Field
The invention belongs to the field of infrared target tracking, and particularly relates to a twin network infrared target tracking method based on a feature pyramid.
Background
In recent years, moving-target detection and tracking based on visible-light computer vision has developed rapidly and been widely applied in fields such as human-computer interaction, intelligent video surveillance and precision guidance. However, visible-light systems are ineffective in severe weather and at night. Infrared imaging systems, which acquire information by sensing the infrared radiation emitted by objects, have a relative advantage here: they work around the clock, offer good concealment, penetrate smoke well and resist interference strongly.
Infrared target tracking is the process of finding a target in an infrared image sequence and tracking it effectively. As a key technology of infrared imaging systems, it plays an important role in tasks such as defense, warning and countermeasures, so research on infrared target tracking has high application value.
At present, many scholars at home and abroad have studied infrared target detection and tracking intensively. Nevertheless, owing to the inherent limitations of infrared systems, many problems remain: infrared images lack color information, have a low signal-to-noise ratio and low resolution, and suffer from serious background clutter, so similar objects are hard to distinguish and the infrared target is easily submerged in the background. In addition, if objects similar to the target exist in the background, an infrared tracker easily loses the target, and re-detection after occlusion is difficult. The twin network tracking algorithm (SiamRPN), for example, cannot output an accurate target size and is highly error-prone when an object similar to the target appears in the background. Research on infrared target tracking under complex backgrounds is therefore a challenging subject of real significance.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a twin network infrared target tracking method based on a feature pyramid.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a twin network infrared target tracking method based on a characteristic pyramid, which comprises the following steps:
Taking a current frame of an original infrared image sequence as a detection frame of a detection branch in the feature pyramid-based twin network, and taking a previous frame as a template frame of a template branch in the feature pyramid-based twin network;
Respectively carrying out a bottom-up full convolution operation on the template frame and the detection frame, then performing top-down operations and lateral connections on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers;
Performing 2k' and 4k' channel expansion on the P2, P3, P4 and P5 scale feature layers of the template frame to generate classification weights and regression weights respectively;
Performing convolution operations on the corresponding scale feature layers of the detection frame using the classification weight and regression weight of each scale feature layer of the template frame, to determine the scale proposal corresponding to each scale feature layer of the detection frame;
and carrying out non-maximum suppression on the scale proposals corresponding to all scale feature layers of the detection frame, retaining the highest-scoring proposal, and taking it as the output tracking result.
In the above scheme, the taking the current frame of the original infrared image sequence as the detection frame of the detection branch in the feature pyramid-based twin network and the taking the previous frame as the template frame of the template branch in the feature pyramid-based twin network specifically includes: the current frame of the original infrared image sequence is used as a detection frame of a detection branch in the feature extraction sub-network, and the previous frame is used as a template frame of a template branch in the feature extraction sub-network.
In the above scheme, the bottom-up full convolution operation is performed on the template frame and the detection frame respectively, and top-down operations and lateral connections are then performed on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers, specifically: the network of each branch of the feature extraction sub-network adopts three routes, namely bottom-up, top-down and lateral connection, and the bottom-up route adopts the same five-layer network as the region selection network, with the scale reduced layer by layer; upsampling on the top-down route is performed by deconvolution; the lateral connections fuse each upsampling result with the bottom-up feature map of the same size to generate the feature maps P2 to P5 in one-to-one correspondence with the bottom-up maps C2 to C5; and target prediction is performed on the P2, P3, P4 and P5 layers, which fuse the features of each level.
In the above scheme, the 2k' and 4k' channel expansion is performed on the P2, P3, P4 and P5 scale feature layers of the template frame to generate the classification weights and regression weights respectively, specifically: the region selection sub-network is divided into two branches, one for target-background classification and the other for regression of the target region; assuming that the region selection sub-network sets k' anchor points, 2k' channels need to be output for classification and 4k' channels for regression.
In the above scheme, the 2k' and 4k' channel expansion on the P2, P3, P4 and P5 scale feature layers of the template frame, generating the classification weights and regression weights respectively, is implemented by the following steps:
(1) The template feature map φp(z) is fed into two branches [φp(z)]cls and [φp(z)]reg, whose channel counts are expanded to 2k' and 4k' times through two convolutional layers respectively;
(2) The detection frame feature map φp(x) is likewise divided into two branches [φp(x)]cls and [φp(x)]reg by two convolutional layers, but the number of channels remains unchanged;
(3) The template and detection features on the classification branch and the regression branch are correlated by convolution, i.e.
A_cls = [φp(x)]cls ⋆ [φp(z)]cls, A_reg = [φp(x)]reg ⋆ [φp(z)]reg,
wherein [φp(z)]cls, [φp(z)]reg, [φp(x)]cls and [φp(x)]reg are the classification branch of the template frame, the regression branch of the template frame, the classification branch of the detection frame and the regression branch of the detection frame, and A_cls and A_reg are the classification-branch and regression-branch correlations (⋆ denotes the correlation operation).
In the above scheme, when multiple anchors are used to train the network, a regularized smooth L1 loss function is adopted, with the normalized distances
δ0 = (Tx − Ax)/Aw, δ1 = (Ty − Ay)/Ah, δ2 = ln(Tw/Aw), δ3 = ln(Th/Ah),
wherein Ax, Ay, Aw, Ah denote the center point, width and height of the anchor, Tx, Ty, Tw, Th denote the center-point coordinates, width and height of the real target frame, and δ0, δ1, δ2 and δ3 are the regularized distances of the abscissa, ordinate, width and height respectively;
the smooth L1 loss is
smoothL1(x, σ) = 0.5σ²x² if |x| < 1/σ², and |x| − 1/(2σ²) otherwise;
the regression loss is Lreg = Σi smoothL1(δi, σ), and the loss function of the network as a whole is
L = Lcls + λLreg,
where λ is a hyperparameter, Lreg is the regression loss function, and Lcls is the classification loss function, expressed using the cross-entropy loss function.
In the above scheme, the respectively performing convolution operation on the scale feature layer corresponding to the detection frame according to the classification weight and the regression weight of each scale feature layer of the template frame to respectively determine the scale proposal corresponding to each scale feature layer of the detection frame specifically includes: and dividing each layer of scale feature layer of the detection frame into a classification branch and a regression branch, respectively combining the classification weight and the regression weight determined by the template frame corresponding to the scale feature layer to obtain a classification confidence map and a regression confidence map of each anchor, and determining a scale proposal corresponding to the scale feature layer of the detection frame according to the correlation between the classification confidence map and the regression confidence map of each anchor.
In the above scheme, the dividing each layer of scale feature layer of the detection frame into a classification branch and a regression branch, obtaining a classification confidence map and a regression confidence map of each anchor by respectively combining the classification weight and the regression weight determined by the template frame corresponding to the scale feature layer, and determining the scale proposal corresponding to the layer of scale feature layer of the detection frame according to the correlation between the classification confidence map and the regression confidence map of each anchor specifically includes:
The classification and regression output feature maps are represented as point sets
CLSp = {(xi, yj, cl)}, wherein i ∈ [0, w), j ∈ [0, h), l ∈ [0, 2k') and p ∈ {2, 3, 4, 5};
REGp = {(xi, yj, dxm, dym, dwm, dhm)}, wherein i ∈ [0, w), j ∈ [0, h) and m ∈ [0, k').
Letting the variables i and j index the position of an anchor and l its index number, the anchor set ANC = {(x_an_i, y_an_j, w_an_l, h_an_l)} is obtained; the top-K classification responses are selected, and the regression outputs are applied to the corresponding anchors in ANC to obtain the refined coordinates
x_pro = x_an + dx · w_an, y_pro = y_an + dy · h_an, w_pro = w_an · e^dw, h_pro = h_an · e^dh.
Compared with the prior art, the method discriminates target deformation better and is better suited to complex scenes; it makes full use of the target's detailed features, giving good adaptability to infrared images with few such details; classification and scale regression are computed in parallel and the target's aspect ratio is predicted, making the tracking more accurate and real-time; and the method perceives infrared targets more intelligently, adapting better to occluded targets and to backgrounds similar to the target.
Drawings
FIG. 1 is a flowchart of the feature pyramid-based twin network infrared target tracking method according to an embodiment of the present invention;
FIG. 2 is a structure diagram of the twin network of the method according to an embodiment of the present invention;
FIG. 3 illustrates selecting, in the classification feature map, target boxes whose distance from the center does not exceed 7, according to an embodiment of the present invention;
FIG. 4 shows the tracking results on frames 1, 243, 700 and 944 of the Boat2 sequence;
FIG. 5 shows the accuracy plots and success-rate plots of six algorithms: the algorithm of the present invention (Gif-siamfn), the pyramid-based twin network algorithm (siamfn), the context-aware scale estimation algorithm (Gif-SECA), the fully-convolutional twin network algorithm (SiamFC), the discriminative scale space tracking algorithm (DSST), and the correlation filter network algorithm (CFNet).
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a twin network infrared target tracking method based on a characteristic pyramid, which comprises the following steps of:
Step 1: taking a current frame of an original infrared image sequence as a detection frame of a detection branch in the feature pyramid-based twin network, and taking a previous frame as a template frame of a template branch in the feature pyramid-based twin network;
In particular, the feature extraction sub-network is a fully convolutional network without fully connected layers.
The feature extraction sub-network is divided into two branches: a template branch, which takes the target image block z in the first frame as input, and a detection branch, which takes the target candidate region x in the current frame as input; the two branches share parameters in the convolutional layers.
Single detection is treated as a discriminative task whose aim is to find the parameters W that minimize the average loss L of the prediction function ψ(xi; W), computed over n samples xi and corresponding labels yi:
min_W (1/n) Σi L(ψ(xi; W), yi).
The purpose of one-shot learning is to learn W from the target template z. Letting z denote the template frame, x the detection frame, φp the feature map of a given layer, and ζ the prediction function of the region selection sub-network, the one-shot detection task can be expressed as
min_W (1/n) Σi L(ζ(φp(xi; W); φp(zi; W)), yi).
Step 2: respectively carrying out a bottom-up full convolution operation on the template frame and the detection frame, then performing top-down operations and lateral connections on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers;
Specifically, the network of each branch of the feature extraction sub-network adopts three routes, namely bottom-up (red arrows), top-down (green arrows) and lateral connections (blue arrows); the bottom-up route adopts the same five-layer network as the region selection network (C1 to C5 in FIG. 2), with the scale reduced layer by layer; upsampling on the top-down route is performed by deconvolution; the lateral connections fuse each upsampling result with the bottom-up feature map of the same size, generating the feature maps P2 to P5 shown in FIG. 2 in one-to-one correspondence with the bottom-up maps C2 to C5; and target prediction is performed on the P2, P3, P4 and P5 layers, which fuse the features of each level.
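The top-down fusion described above can be sketched with toy 1-D "feature maps" (a hypothetical miniature, not the patent's actual network): nearest-neighbour upsampling stands in for deconvolution, and the lateral connection is an element-wise sum.

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling, a stand-in for deconvolution."""
    out = []
    for v in feat:
        out.extend([v, v])
    return out

def build_pyramid(c2, c3, c4, c5):
    """Fuse bottom-up maps C2..C5 into output maps P2..P5.

    P5 is taken from C5 directly; each lower P-level is the upsampled
    level above plus the bottom-up map of the same resolution.
    """
    p5 = list(c5)
    p4 = [a + b for a, b in zip(upsample2x(p5), c4)]
    p3 = [a + b for a, b in zip(upsample2x(p4), c3)]
    p2 = [a + b for a, b in zip(upsample2x(p3), c2)]
    return p2, p3, p4, p5

# Scales halve from C2 to C5: lengths 8, 4, 2, 1.
c5 = [1.0]
c4 = [0.5, 0.5]
c3 = [0.25] * 4
c2 = [0.125] * 8
p2, p3, p4, p5 = build_pyramid(c2, c3, c4, c5)
```

The example shows how each P-level carries both coarse semantic information (propagated downwards) and the finer bottom-up detail of its own resolution.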
For convenience, φp(z) and φp(x) denote the feature maps output by a given layer for the template frame and the detection frame respectively, where p ∈ {2, 3, 4, 5}.
Step 3: performing 2k' and 4k' channel expansion on the P2, P3, P4 and P5 scale feature layers of the template frame to generate classification weights and regression weights respectively;
Specifically, the region selection sub-network is divided into two branches, one for target-background classification and the other for regression of the target region. Assuming that k' anchor points are set, the network needs to output 2k' channels for classification and 4k' channels for regression.
The step is implemented as follows:
(1) The template feature map φp(z) is fed into two branches [φp(z)]cls and [φp(z)]reg, whose channel counts are expanded to 2k' and 4k' times through two convolutional layers respectively.
(2) The detection frame feature map φp(x) is likewise divided into two branches [φp(x)]cls and [φp(x)]reg by two convolutional layers, but the number of channels remains unchanged.
(3) The template and detection features on the classification branch and the regression branch are correlated by convolution, i.e.
A_cls = [φp(x)]cls ⋆ [φp(z)]cls, A_reg = [φp(x)]reg ⋆ [φp(z)]reg.
As shown in FIG. 2, the output of the classification branch contains 2k' channels, representing the positive and negative activations (scores) of each anchor at the corresponding position of the output map. Similarly, the output of the regression branch contains 4k' channels, representing the distances dx, dy, dw, dh between each anchor and the corresponding real target box.
Step 4: performing convolution operations on the corresponding scale feature layers of the detection frame using the classification weight and regression weight of each scale feature layer of the template frame, to determine the scale proposal corresponding to each scale feature layer of the detection frame;
Specifically, the output of the template branch is treated as the convolution kernel of a local detector and correlated with the output of the detection branch, yielding the classification and regression outputs from which proposals are derived.
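The correlation step, in which the template-branch output acts as a convolution kernel sliding over the detection-branch output, can be illustrated with a minimal single-channel "valid" cross-correlation (a sketch; the real network performs this per channel group on deep feature maps):

```python
def xcorr2d_valid(search, kernel):
    """'Valid' 2-D cross-correlation: slide the template-branch output
    (kernel) over the detection-branch output (search) and sum the
    element-wise products at each offset. Toy single-channel version."""
    sh, sw = len(search), len(search[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(sh - kh + 1):
        row = []
        for j in range(sw - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += search[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out
```

For example, a 4 × 4 search map of ones correlated with a 2 × 2 kernel of ones yields a 3 × 3 response map whose entries are all 4.0; high responses mark positions where the detection features match the template.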
The classification and regression output feature maps are represented as point sets
CLSp = {(xi, yj, cl)}, where i ∈ [0, w), j ∈ [0, h), l ∈ [0, 2k') and p ∈ {2, 3, 4, 5};
REGp = {(xi, yj, dxm, dym, dwm, dhm)}, where i ∈ [0, w), j ∈ [0, h) and m ∈ [0, k').
Letting the variables i and j index the position of an anchor and l its index number, the anchor set ANC = {(x_an_i, y_an_j, w_an_l, h_an_l)} is obtained; the top-K classification responses are selected, and the regression outputs are applied to the corresponding anchors in ANC to obtain the refined coordinates
x_pro = x_an + dx · w_an, y_pro = y_an + dy · h_an, w_pro = w_an · e^dw, h_pro = h_an · e^dh.
The scale proposals for the P2 to P5 scale feature layers of the detection frame are determined in the same manner.
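Refining an anchor with the regression output can be sketched as follows; the exponential parameterisation of width and height is the usual RPN convention and an assumption here:

```python
import math

def decode(anchor, delta):
    """Refine an anchor (cx, cy, w, h) with a regression output
    (dx, dy, dw, dh): the centre moves by a fraction of the anchor
    size, and width/height are rescaled exponentially (the standard
    RPN parameterisation, assumed here)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = delta
    return (ax + dx * aw,
            ay + dy * ah,
            aw * math.exp(dw),
            ah * math.exp(dh))
```

A zero delta returns the anchor unchanged, so a well-trained regressor only needs small corrections around each anchor.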
Step 5: carrying out non-maximum suppression on the scale proposals corresponding to all scale feature layers of the detection frame, retaining the highest-scoring proposal as the output tracking result.
To adapt the one-shot detection method to the tracking task, the scale proposals are selected in two steps:
① Because the target does not move far between consecutive frames in the tracking problem, grid cells far from the center of the output classification response map are discarded: only a g' × g' sub-region is kept, yielding g' × g' × k' proposal boxes (g' = 7, k' = 5), which removes outliers. FIG. 3 illustrates selecting target boxes in the classification feature map whose distance from the center does not exceed 7.
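Keeping only the central g' × g' sub-region of the response map might look like this (a sketch for a square single-channel map):

```python
def central_crop(resp, g):
    """Keep only the g x g sub-region around the centre of a square
    classification response map, discarding proposals far from the
    previous target position."""
    n = len(resp)
    start = (n - g) // 2
    return [row[start:start + g] for row in resp[start:start + g]]
```

On a 17 × 17 response map with g' = 7, the crop keeps rows and columns 5 through 11, i.e. the cells within distance 7 of the border of the centre region.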
② The proposal scores are re-ranked using a cosine window and a scale-change penalty to obtain the best score. After discarding outliers, a cosine window is applied to suppress large displacements, and a penalty Pe is then applied to suppress large changes in size and aspect ratio:
Pe = exp(−k · (max(r/r', r'/r) · max(s/s', s'/s) − 1)),
where k is a hyperparameter, r is the height-to-width ratio of the proposal box, r' is the aspect ratio of the current frame target box, and s and s' are the overall scales of the proposal box and the current frame target, computed from
(w + p) × (h + p) = s²,
where w and h are the width and height of the target and p = (w + h)/2.
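A sketch of the scale-change penalty in the form given above, which equals 1 when shape and scale are unchanged and decays as they change; the value k = 0.055 is an assumed hyperparameter, not stated in the text:

```python
import math

def overall_scale(w, h):
    """s from (w + p) * (h + p) = s**2 with padding p = (w + h) / 2."""
    p = (w + h) / 2.0
    return math.sqrt((w + p) * (h + p))

def penalty(w, h, w_prev, h_prev, k=0.055):
    """Penalise large changes in aspect ratio r = h / w and overall
    scale s relative to the previous frame; k = 0.055 is a
    hypothetical hyperparameter value."""
    r, r_prev = h / w, h_prev / w_prev
    s, s_prev = overall_scale(w, h), overall_scale(w_prev, h_prev)
    return math.exp(-k * (max(r / r_prev, r_prev / r)
                          * max(s / s_prev, s_prev / s) - 1.0))
```

Each proposal's classification score is multiplied by this factor, so a proposal whose shape matches the previous target (penalty 1.0) beats an equally-scored proposal whose shape changed abruptly.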
After these operations, the proposals are re-ranked by the classification score multiplied by the penalty, and non-maximum suppression is then performed to obtain the final tracking bounding box; after the final bounding box is selected, the target size is updated by linear interpolation.
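Greedy non-maximum suppression over the re-ranked proposals can be sketched as follows (boxes as corner coordinates; the 0.5 overlap threshold is an assumed value):

```python
def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes overlapping it by more than thresh, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

For tracking, only the first surviving index (the highest-scoring proposal) is used as the output bounding box.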
Setting network models and parameters:
1. When the network is trained with multiple anchors, a regularized smooth L1 loss function is adopted, with the normalized distances
δ0 = (Tx − Ax)/Aw, δ1 = (Ty − Ay)/Ah, δ2 = ln(Tw/Aw), δ3 = ln(Th/Ah),
where Ax, Ay, Aw, Ah denote the center point, width and height of the anchor, Tx, Ty, Tw, Th denote the center-point coordinates, width and height of the real target frame, and δ0, δ1, δ2 and δ3 are the regularized distances of the abscissa, ordinate, width and height respectively. The smooth L1 loss is
smoothL1(x, σ) = 0.5σ²x² if |x| < 1/σ², and |x| − 1/(2σ²) otherwise,
and the regression loss is Lreg = Σi smoothL1(δi, σ). The loss function of the network as a whole is
L = Lcls + λLreg,
where λ is a hyperparameter, Lreg is the regression loss function, and Lcls is the classification loss function, expressed using the cross-entropy loss.
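The regression loss above can be sketched in plain Python; the normalisation of the four distances follows the RPN convention stated above and is an assumption about the exact form used:

```python
import math

def smooth_l1(x, sigma=1.0):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    s2 = sigma * sigma
    if abs(x) < 1.0 / s2:
        return 0.5 * s2 * x * x
    return abs(x) - 0.5 / s2

def reg_loss(anchor, target, sigma=1.0):
    """Regression loss over the normalised distances delta0..delta3
    between an anchor (Ax, Ay, Aw, Ah) and the real target frame
    (Tx, Ty, Tw, Th)."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    deltas = [(tx - ax) / aw, (ty - ay) / ah,
              math.log(tw / aw), math.log(th / ah)]
    return sum(smooth_l1(d, sigma) for d in deltas)
```

The total loss is then formed as `L = l_cls + lam * reg_loss(...)` for a chosen hyperparameter `lam`, with `l_cls` computed by cross-entropy over the classification channels.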
2. The intersection over union (IoU) of the predicted box and the true target box, together with two thresholds thhi and thlo, is used as the metric in the training phase: positive samples are anchors with IoU > thhi, and negative samples are anchors with IoU < thlo.
The threshold thlo is set to 0.3 and thhi is set to 0.6.
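Anchor labelling with the two IoU thresholds can be sketched as follows (corner-coordinate boxes; the +1/−1/0 encoding for positive/negative/ignored is an illustrative choice):

```python
def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def label_anchor(anchor, gt, th_hi=0.6, th_lo=0.3):
    """Assign +1 (positive), -1 (negative) or 0 (ignored) to an anchor
    based on its IoU with the ground-truth box, using the thresholds
    thhi = 0.6 and thlo = 0.3 from the text."""
    v = iou(anchor, gt)
    if v > th_hi:
        return 1
    if v < th_lo:
        return -1
    return 0
```

Anchors falling between the two thresholds contribute to neither loss term, which keeps ambiguous examples out of training.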
3. For each scale feature layer, anchors with a base size of 8 × 8 pixels are set, each with 5 aspect ratios {0.33, 0.5, 1, 2, 3}, i.e. k' = 5; the entire feature pyramid therefore uses 20 anchors.
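One way to realise 5 anchors of base size 8 with the listed aspect ratios is an equal-area construction (an assumption; the text does not specify how the shapes are derived from the ratios):

```python
import math

def anchor_shapes(base=8, ratios=(0.33, 0.5, 1, 2, 3)):
    """Generate k' = 5 anchor (w, h) pairs, all of area base * base,
    whose height-to-width ratio h / w equals each listed ratio. The
    equal-area construction is an assumption."""
    shapes = []
    for r in ratios:
        w = base / math.sqrt(r)
        shapes.append((w, w * r))
    return shapes
```

With 5 shapes per location on each of the 4 pyramid levels (P2 to P5), the whole pyramid uses 20 anchor configurations.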
4. The side length g' of the sub-region used to choose the best proposal is 7. During training, the loss function is optimized by stochastic gradient descent (SGD), with the initial learning rate set to 0.01.
5. The bottom-up route of the fully convolutional network uses the 5 convolutional layers of AlexNet, with maximum pooling after the first two convolutional layers. Except for conv5, each convolutional layer employs a ReLU nonlinear activation function. The convolution kernel size, number of channels and stride of each layer, and the sizes of the detection frame and template frame, are listed in Table 1.
TABLE 1 convolutional layer parameters
6. FIG. 4 shows frames 1, 243, 700 and 944 of the Boat2 sequence, in which the target is a ship on the sea surface, initially about 8 × 20 pixels. The target first undergoes an out-of-plane flip, its appearance changing from FIG. 4(a) to FIG. 4(b), and all algorithms track it effectively; the target then flips in-plane again, so that its appearance in frame 700 matches the initial frame, but the GIF-SECA algorithm drifts owing to camera shake; around frame 944, the target moves rapidly and its scale grows quickly, and the algorithm of the invention remains the most accurate.
7. FIGS. 5(a) and 5(b) are the accuracy plot and the success-rate plot of six tracking algorithms over 16 infrared sequences: the algorithm of the present invention (Gif-siamfn), the pyramid-based twin network algorithm (siamfn), the context-aware scale estimation algorithm (Gif-SECA), the fully-convolutional twin network algorithm (SiamFC), the discriminative scale space tracking algorithm (DSST), and the correlation filter network algorithm (CFNet). At a position error threshold of 20 pixels, the accuracy of the proposed algorithm reaches 0.914; when the success-rate curves are ranked by area under the curve, the proposed method also achieves the highest success rate. The algorithm therefore surpasses the classical tracking methods of recent years in both tracking accuracy and overlap rate.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (8)

1. A twin network infrared target tracking method based on a characteristic pyramid is characterized by comprising the following steps:
taking a current frame of an original infrared image sequence as a detection frame of a detection branch in the feature pyramid-based twin network, and taking a previous frame as a template frame of a template branch in the feature pyramid-based twin network;
Respectively carrying out a bottom-up full convolution operation on the template frame and the detection frame, then performing top-down operations and lateral connections on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers;
Performing 2k' and 4k' channel expansion on the P2, P3, P4 and P5 scale feature layers of the template frame to generate classification weights and regression weights respectively;
Performing convolution operations on the corresponding scale feature layers of the detection frame using the classification weight and regression weight of each scale feature layer of the template frame, to determine the scale proposal corresponding to each scale feature layer of the detection frame;
and carrying out non-maximum suppression on the scale proposals corresponding to all scale feature layers of the detection frame, retaining the highest-scoring proposal, and taking it as the output tracking result.
2. The method for tracking the infrared target of the twin network based on the feature pyramid as claimed in claim 1, wherein the step of taking the current frame of the original infrared image sequence as the detection frame of the detection branch in the twin network based on the feature pyramid and taking the previous frame as the template frame of the template branch in the twin network based on the feature pyramid specifically comprises: the current frame of the original infrared image sequence is used as a detection frame of a detection branch in the feature extraction sub-network, and the previous frame is used as a template frame of a template branch in the feature extraction sub-network.
3. The feature pyramid-based twin network infrared target tracking method according to claim 1 or 2, wherein the bottom-up full convolution operation is performed on the template frame and the detection frame respectively, and top-down operations and lateral connections are then performed on the convolutional layers C5, C4, C3 and C2 to correspondingly generate the P5, P4, P3 and P2 scale feature layers, specifically: the network of each branch of the feature extraction sub-network adopts three routes, namely bottom-up, top-down and lateral connection, and the bottom-up route adopts the same five-layer network as the region selection network, with the scale reduced layer by layer; upsampling on the top-down route is performed by deconvolution; the lateral connections fuse each upsampling result with the bottom-up feature map of the same size to generate the feature maps P2 to P5 in one-to-one correspondence with the bottom-up maps C2 to C5; and target prediction is performed on the P2, P3, P4 and P5 layers, which fuse the features of each level.
4. The feature pyramid-based twin network infrared target tracking method according to claim 3, wherein the 2k' and 4k' channel expansion is performed on the P2, P3, P4 and P5 scale feature layers of the template frame to generate the classification weights and regression weights respectively, specifically: the region selection sub-network is divided into two branches, one for target-background classification and the other for regression of the target region; assuming that the region selection sub-network sets k' anchor points, 2k' channels need to be output for classification and 4k' channels for regression.
5. The feature pyramid-based twin network infrared target tracking method according to claim 4, wherein the channel expansion to 2k' and 4k' performed on the P2, P3, P4 and P5 scale feature layers of the template frame, generating classification weights and regression weights respectively, is implemented by the following steps:
(1) the template features φp(z) are split into two branches [φp(z)]cls and [φp(z)]reg, whose channel counts are expanded to 2k' and 4k' times through two convolutional layers, respectively;
(2) the detection-frame features φp(x) are likewise split into two branches [φp(x)]cls and [φp(x)]reg by two convolutional layers, but the number of channels remains unchanged;
(3) [φp(z)] and [φp(x)] on the classification branch and on the regression branch are correlated by a convolution operation, i.e.
Acls = [φp(x)]cls ⋆ [φp(z)]cls, Areg = [φp(x)]reg ⋆ [φp(z)]reg,
wherein [φp(z)]cls, [φp(z)]reg, [φp(x)]cls and [φp(x)]reg are the classification branch of the template frame, the regression branch of the template frame, the classification branch of the detection frame and the regression branch of the detection frame, ⋆ denotes the correlation operation, and Acls and Areg are the classification-branch and regression-branch correlations.
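A minimal numpy sketch of this correlation step: each group of expanded template channels acts as a convolution kernel slid over the detection-frame features, producing one response map per group. The shapes, k' = 3 anchors, and random features are illustrative assumptions, not values from the patent.

```python
import numpy as np

def xcorr(search, kernel):
    """Valid cross-correlation of one template kernel [C, hk, wk] over
    the detection-frame features [C, H, W] -> [H-hk+1, W-wk+1] map."""
    _, H, W = search.shape
    _, hk, wk = kernel.shape
    out = np.zeros((H - hk + 1, W - wk + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[:, i:i + hk, j:j + wk] * kernel)
    return out

def branch_correlation(phi_x, phi_z_groups):
    """Template features expanded to 2k' (or 4k') kernel groups are
    correlated with the detection features, whose channel count is
    unchanged; one response map per group is stacked."""
    return np.stack([xcorr(phi_x, k) for k in phi_z_groups])

rng = np.random.default_rng(0)
k_prime, C = 3, 4
phi_x_cls = rng.standard_normal((C, 8, 8))               # detection branch
phi_z_cls = rng.standard_normal((2 * k_prime, C, 4, 4))  # 2k' template kernels
a_cls = branch_correlation(phi_x_cls, phi_z_cls)
print(a_cls.shape)  # (6, 5, 5): one response map per classification channel
```

The same routine applied with 4k' kernel groups yields the regression-branch correlation Areg.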
6. The feature pyramid-based twin network infrared target tracking method of claim 5, wherein, when multiple anchors are used to train the network, a normalized smooth L1 loss function is adopted, with the normalized distances
δ[0] = (Tx − Ax)/Aw, δ[1] = (Ty − Ay)/Ah, δ[2] = ln(Tw/Aw), δ[3] = ln(Th/Ah),
wherein Ax, Ay, Aw, Ah denote the center point, width and height of the anchor, Tx, Ty, Tw, Th denote the center-point coordinates, width and height of the real target frame, and δ[0], δ[1], δ[2] and δ[3] are the normalized distances of the abscissa, ordinate, width and height, respectively;
the smooth L1 loss is
smooth_L1(x, σ) = 0.5σ²x² if |x| < 1/σ², and |x| − 1/(2σ²) otherwise,
so that the regression loss is Lreg = Σi smooth_L1(δ[i], σ);
the loss function of the network as a whole is
L = Lcls + λLreg,
where λ is a hyperparameter, Lreg is the regression loss function, and Lcls is the classification loss function, expressed using the cross-entropy loss.
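The normalized distances, the smooth L1 loss and the combined loss can be sketched as follows; σ, λ, the example classification-loss value and the box coordinates are illustrative assumptions.

```python
import numpy as np

def normalized_deltas(anchor, target):
    """delta[0..3]: offsets normalized by the anchor size and log ratios
    for width and height, matching the claim's normalized distances."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return np.array([(tx - ax) / aw, (ty - ay) / ah,
                     np.log(tw / aw), np.log(th / ah)])

def smooth_l1(x, sigma=1.0):
    # Quadratic near zero, linear elsewhere (the usual smooth L1 form).
    x = np.abs(x)
    s2 = sigma ** 2
    return np.where(x < 1.0 / s2, 0.5 * s2 * x * x, x - 0.5 / s2)

def total_loss(l_cls, anchor, target, lam=1.0, sigma=1.0):
    """L = Lcls + lambda * Lreg, with Lreg summed over the four deltas."""
    l_reg = smooth_l1(normalized_deltas(anchor, target), sigma).sum()
    return l_cls + lam * l_reg

anchor = (50.0, 50.0, 20.0, 20.0)   # (Ax, Ay, Aw, Ah)
target = (52.0, 49.0, 22.0, 18.0)   # (Tx, Ty, Tw, Th)
print(round(total_loss(0.3, anchor, target), 4))  # 0.3163
```

All four deltas here fall in the quadratic region of smooth L1, so the regression term stays small when the anchor is close to the ground-truth box.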
7. The feature pyramid-based twin network infrared target tracking method according to claim 6, wherein determining the scale proposal corresponding to each scale feature layer of the detection frame by performing a convolution operation on the corresponding scale feature layer of the detection frame with the classification weight and the regression weight of each scale feature layer of the template frame specifically comprises: dividing each scale feature layer of the detection frame into a classification branch and a regression branch; combining them with the classification weight and the regression weight determined by the template frame for the corresponding scale feature layer to obtain a classification confidence map and a regression confidence map for each anchor; and determining the scale proposal corresponding to the scale feature layer of the detection frame according to the correlation between the classification confidence map and the regression confidence map of each anchor.
8. The feature pyramid-based twin network infrared target tracking method according to claim 7, wherein dividing each scale feature layer of the detection frame into a classification branch and a regression branch, combining them with the classification weight and the regression weight determined by the template frame for the corresponding scale feature layer to obtain a classification confidence map and a regression confidence map for each anchor, and determining the scale proposal corresponding to the scale feature layer of the detection frame according to the correlation between the classification confidence map and the regression confidence map of each anchor, specifically comprises:
representing the classification and regression output feature maps as point sets
CLS^p = {(x_i, y_j, c_l)}, wherein i ∈ [0, w), j ∈ [0, h), l ∈ [0, 2k') and p ∈ {2,3,4,5};
REG^p = {(x_i, y_j, dx_m, dy_m, dw_m, dh_m)}, wherein i ∈ [0, w), j ∈ [0, h) and m ∈ [0, k').
Letting the variables i and j denote the position of an anchor and l denote its index, the anchor set ANC is derived with the same resolution as the output maps; applying the regression outputs in REG to the anchors in ANC then yields the corresponding refined coordinates.
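Refining an anchor with its regression output can be sketched as below; the anchor values, scores and deltas are made up for illustration, and the parameterization is simply the inverse of the normalized distances used in the regression loss.

```python
import numpy as np

def refine(anchor, deltas):
    """Invert the regression parameterization: apply predicted
    (dx, dy, dw, dh) to an anchor (x, y, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw, ay + dy * ah, aw * np.exp(dw), ah * np.exp(dh))

def best_proposal(anchors, cls_scores, reg_deltas):
    """Pick the anchor with the highest classification confidence and
    refine it with its regression output."""
    idx = int(np.argmax(cls_scores))
    return refine(anchors[idx], reg_deltas[idx])

anchors = [(50.0, 50.0, 20.0, 20.0), (60.0, 40.0, 30.0, 15.0)]
cls_scores = np.array([0.2, 0.9])
reg_deltas = [(0.0, 0.0, 0.0, 0.0), (0.1, -0.2, np.log(1.2), 0.0)]
x, y, w, h = best_proposal(anchors, cls_scores, reg_deltas)
print(round(x, 4), round(y, 4), round(w, 4), round(h, 4))  # 63.0 37.0 36.0 15.0
```

The second anchor wins on classification confidence, shifts by (dx·Aw, dy·Ah) and rescales by exp(dw), exp(dh).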
CN201910720012.4A 2019-08-06 2019-08-06 twin network infrared target tracking method based on characteristic pyramid Pending CN110544269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910720012.4A CN110544269A (en) 2019-08-06 2019-08-06 twin network infrared target tracking method based on characteristic pyramid

Publications (1)

Publication Number Publication Date
CN110544269A true CN110544269A (en) 2019-12-06

Family

ID=68710234

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129934A1 (en) * 2016-11-07 2018-05-10 Qualcomm Incorporated Enhanced siamese trackers
CN109242019A (en) * 2018-09-01 2019-01-18 哈尔滨工程大学 A kind of water surface optics Small object quickly detects and tracking

Non-Patent Citations (2)

Title
Bo Li et al., "High performance visual tracking with Siamese region proposal network", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
Tsung-Yi Lin et al., "Feature pyramid networks for object detection", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *

Cited By (13)

Publication number Priority date Publication date Assignee Title
CN111724409A (en) * 2020-05-18 2020-09-29 浙江工业大学 Target tracking method based on densely connected twin neural network
CN111640136B (en) * 2020-05-23 2022-02-25 西北工业大学 Depth target tracking method in complex environment
CN111640136A (en) * 2020-05-23 2020-09-08 西北工业大学 Depth target tracking method in complex environment
CN111696137A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111696137B (en) * 2020-06-09 2022-08-02 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111797716B (en) * 2020-06-16 2022-05-03 电子科技大学 Single target tracking method based on Siamese network
CN111915644A (en) * 2020-07-09 2020-11-10 苏州科技大学 Real-time target tracking method of twin guiding anchor frame RPN network
CN111915644B (en) * 2020-07-09 2023-07-04 苏州科技大学 Real-time target tracking method of twin guide anchor frame RPN network
CN112308013A (en) * 2020-11-16 2021-02-02 电子科技大学 Football player tracking method based on deep learning
CN113920159A (en) * 2021-09-15 2022-01-11 河南科技大学 Infrared aerial small target tracking method based on full convolution twin network
CN114862904A (en) * 2022-03-21 2022-08-05 哈尔滨工程大学 Twin network target continuous tracking method of underwater robot
CN114862904B (en) * 2022-03-21 2023-12-12 哈尔滨工程大学 Twin network target continuous tracking method of underwater robot

Similar Documents

Publication Publication Date Title
CN110544269A (en) twin network infrared target tracking method based on characteristic pyramid
WO2019101221A1 (en) Ship detection method and system based on multidimensional scene characteristics
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN103077539A (en) Moving object tracking method under complicated background and sheltering condition
CN112257569B (en) Target detection and identification method based on real-time video stream
CN111161309B (en) Searching and positioning method for vehicle-mounted video dynamic target
CN110414439A (en) Anti- based on multi-peak detection blocks pedestrian tracting method
CN113763427B (en) Multi-target tracking method based on coarse-to-fine shielding processing
Garg et al. Look no deeper: Recognizing places from opposing viewpoints under varying scene appearance using single-view depth estimation
CN113807188A (en) Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network
CN108256567A (en) A kind of target identification method and system based on deep learning
Tashlinskii et al. Pixel-by-pixel estimation of scene motion in video
CN112329764A (en) Infrared dim target detection method based on TV-L1 model
Liu et al. Self-correction ship tracking and counting with variable time window based on YOLOv3
CN116363694A (en) Multi-target tracking method of unmanned system crossing cameras matched with multiple pieces of information
CN115223056A (en) Multi-scale feature enhancement-based optical remote sensing image ship target detection method
CN113205494B (en) Infrared small target detection method and system based on adaptive scale image block weighting difference measurement
Liu et al. Target detection and tracking algorithm based on improved Mask RCNN and LMB
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN116563343A (en) RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought
CN112184767A (en) Method, device, equipment and storage medium for tracking moving object track
CN116777956A (en) Moving target screening method based on multi-scale track management
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
Duan Deep learning-based multitarget motion shadow rejection and accurate tracking for sports video
Xu et al. Moving target detection and tracking in FLIR image sequences based on thermal target modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
