CN115719367A - Twin target tracking method based on space-channel cross correlation and centrality guidance - Google Patents


Info

Publication number
CN115719367A
CN202211459889.0A (application) · CN115719367A (publication)
Authority
CN
China
Prior art keywords
centrality
classification
regression
correlation
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211459889.0A
Other languages
Chinese (zh)
Inventor
张建明
何宇凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202211459889.0A priority Critical patent/CN115719367A/en
Publication of CN115719367A publication Critical patent/CN115719367A/en
Pending legal-status Critical Current

Abstract

The invention discloses a twin target tracking method based on space-channel cross correlation and centrality guidance, comprising the following steps: acquiring a template and a search region from an image; feeding the template and the search region into a deep feature-extraction network to obtain the template feature and the search-region feature; feeding the template feature and the search-region feature into two space-channel cross-correlation modules to obtain a feature map R1 suited to the classification sub-network and a feature map R2 suited to the regression sub-network; feeding R1 into the classification sub-network to obtain a classification map, and feeding R2 into the regression sub-network to obtain a centrality map and a regression map; and, after optimization through the loss functions, obtaining the target bounding box from the optimized classification map, centrality map and regression map. The invention effectively improves the accuracy and robustness of target tracking.

Description

Twin target tracking method based on space-channel cross correlation and centrality guidance
Technical Field
The invention relates to the technical field of computer vision target tracking, in particular to a twin target tracking method based on space-channel cross correlation and centrality guidance.
Background
Target tracking is a fundamental task of computer vision that integrates knowledge from machine learning, optimization, image processing and other fields, and is widely applied in automatic surveillance, vehicle navigation, robot perception, human-computer interaction and augmented reality. Although target tracking has made considerable progress, complex and changeable real-world scenes are affected by many factors such as illumination change, occlusion, rapid motion, deformation, scale change and interference from similar objects, so robust visual tracking remains a great challenge.
In recent years, depth trackers based on twin (Siamese) networks, such as SiamFC, SiamRPN, DaSiamRPN, SiamRPN++ and SiamCAR, have achieved a good balance between accuracy and speed, but they still have a number of shortcomings. SiamFC uses a simple local cross-correlation that is essentially a convolution: the template feature is used directly as a convolution kernel on the search feature, so the resulting feature map has only one channel. SiamRPN++ uses depth-wise cross-correlation, which performs the cross-correlation independently on each channel of the input, so the resulting feature map has the same number of channels as the search feature, but the feature information of different channels at the same spatial position is not exploited. Some trackers have begun to use a pixel-wise cross-correlation in the twin network which, in contrast to depth-wise cross-correlation, does not exploit the feature information of different spatial positions at the same channel. SiamCAR introduces a new centrality branch into the classification sub-network to suppress prediction boxes far from the center, but when the centrality branch shares the same feature map with the classification branch, its prediction tends to resemble the classification prediction, so the two branches cannot be well distinguished; the centrality branch is therefore not well placed in the classification sub-network, as shown in (a) of fig. 1. It is thus necessary to fuse the search-region feature and the template feature more effectively and make better use of the characteristics of the centrality branch, so as to improve the accuracy and robustness of target tracking.
Disclosure of Invention
Technical problem to be solved
Based on the above problems, the invention provides a twin target tracking method based on space-channel cross correlation and centrality guidance, which fuses the search-region feature and the template feature more effectively and makes better use of the characteristics of the centrality branch, addressing the need to improve the accuracy and robustness of target tracking.
(II) technical scheme
Based on the technical problem, the invention provides a twin target tracking method based on space-channel cross correlation and centrality guidance, which comprises the following steps:
s1, obtaining a template and a search area in an image;
S2, feeding the template and the search region into a deep feature-extraction network to obtain the template feature (denoted Z) and the search-region feature (denoted X), whose width × length × number of channels are Hz × Wz × C and Hx × Wx × C respectively;
s3, respectively sending the template features and the search region features into two space-channel cross-correlation modules to obtain a feature map R1 suitable for a classification subnetwork and a feature map R2 suitable for a regression subnetwork;
s4, sending the feature graph R1 into a classification sub-network to obtain a classification graph, and sending the feature graph R2 into a regression sub-network to obtain a centrality graph and a regression graph;
S5, optimizing steps S3-S4 through the loss functions, and obtaining the predicted target bounding box from the optimized classification map, centrality map and regression map;
the step S3 includes:
S31, reshaping the template feature Z along the spatial dimension into a spatial kernel K1 comprising Hz × Wz small kernels, each of size 1 × 1 × C; reshaping the template feature Z along the channel dimension into a channel kernel K2 comprising C small kernels, each of size 1 × 1 × HzWz;
S32, performing a pixel-wise cross-correlation between the search-region feature X and the spatial kernel K1 to obtain feature map F1: F1 = X ⋆ K1, where ⋆ denotes pixel-wise cross-correlation;
S33, performing a pixel-wise cross-correlation between feature map F1 and the channel kernel K2 to obtain feature map F2: F2 = F1 ⋆ K2;
S34, concatenating feature map F1 and feature map F2, reducing the dimensionality with a 1 × 1 convolution, and feeding the result into a plug-and-play SE-block module;
S35, performing steps S31-S34 twice to obtain, respectively, a feature map R1 suited to the classification sub-network and a feature map R2 suited to the regression sub-network, both of size Hz × Wz × C.
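The kernel construction and the two correlations in steps S31-S33 can be sketched with plain array operations. This is a minimal NumPy sketch under the assumption that both intermediate maps live on the search grid (pixel-wise correlation preserves the spatial size of its input, even though the patent states F2 as Hz × Wz × C); the function and variable names are illustrative, not taken from the patent:

```python
import numpy as np

def space_channel_xcorr(Z, X):
    """Sketch of one space-channel cross-correlation pass (steps S31-S33).

    Z: template feature, shape (Hz, Wz, C)
    X: search-region feature, shape (Hx, Wx, C)
    """
    Hz, Wz, C = Z.shape
    Hx, Wx, _ = X.shape
    # S31: spatial kernel K1 -- Hz*Wz small kernels of size 1x1xC.
    K1 = Z.reshape(Hz * Wz, C)
    # S32: pixel-wise cross-correlation of X with K1 -> F1, (Hx, Wx, HzWz).
    F1 = (X.reshape(Hx * Wx, C) @ K1.T).reshape(Hx, Wx, Hz * Wz)
    # S31 (channel view): channel kernel K2 -- C small kernels of size
    # 1x1xHzWz, i.e. the template read along the channel dimension.
    K2 = K1  # column c of K2 is the length-HzWz kernel for channel c
    # S33: pixel-wise cross-correlation of F1 with K2 -> F2, (Hx, Wx, C).
    F2 = (F1.reshape(Hx * Wx, Hz * Wz) @ K2).reshape(Hx, Wx, C)
    return F1, F2
```

Reshaping the template into HzWz kernels of size 1 × 1 × C turns the spatial correlation into a single matrix product, which is what makes this fusion lightweight enough to run twice.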
Further, in step S1, the template is an image of a specified pixel size cropped, centered on the target, from the first frame of a dataset or a camera capture; the search region is an image of a specified size cropped, during tracking, from the (i+1)-th frame centered on the target position of the i-th frame.
Further, the step S4 includes:
S41, feeding feature map R1 into the classification sub-network, a CNN with a single classification branch, which outputs a classification map; each point of the classification map predicts the probability that the location is foreground or background.
S42, feeding feature map R2 into the regression sub-network, a CNN comprising a centrality branch and a bounding-box regression branch, which output a centrality map and a regression map respectively; each point of the centrality map predicts the probability that the location is the target center, and each point of the regression map predicts the distances from that point to the top, bottom, left and right sides of the bounding box.
Further, in step S5, the optimization through the loss functions comprises: the centrality branch is trained with a centrality loss, the bounding-box regression branch with a centrality-weighted regression loss, and the classification branch with a classification loss.
Further, the classification loss is the cross entropy

L_cls = −(1/N) Σ_{i,j} [ y*_{i,j} log p_{i,j} + (1 − y*_{i,j}) log(1 − p_{i,j}) ],

the centrality loss is the binary cross entropy

L_cen = −(1/N_pos) Σ_{(i,j)∈P} [ c*_{i,j} log c_{i,j} + (1 − c*_{i,j}) log(1 − c_{i,j}) ],

and the centrality-weighted regression loss is

L_reg = (1/N_pos) Σ_{(i,j)∈P} c*_{i,j} (1 − IoU(B_{i,j}, B*_{i,j})),

where i, j denote the coordinate position on the corresponding map, y*_{i,j} is the ground-truth classification label, c*_{i,j} the ground-truth centrality, p_{i,j} and c_{i,j} the network predictions, N the total number of samples, N_pos the number of positive samples, P the set of positive samples, IoU the intersection-over-union of the two boxes in parentheses, and B_{i,j} and B*_{i,j} the predicted and ground-truth bounding boxes respectively.
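The three losses can be sketched on toy maps with NumPy. This assumes the standard cross-entropy, binary-cross-entropy and centrality-weighted IoU forms implied by the text; since the patent's own equations appear only as images, the exact normalizations here are assumptions:

```python
import numpy as np

def eps_log(p):
    """Numerically safe log for probabilities."""
    return np.log(np.clip(p, 1e-8, 1.0))

def classification_loss(p_fg, y):
    """Cross entropy over all positions; p_fg is the predicted foreground
    probability and y the 0/1 ground-truth label (both H x W)."""
    return float(-np.mean(y * eps_log(p_fg) + (1 - y) * eps_log(1 - p_fg)))

def centrality_loss(c_pred, c_true, pos):
    """Binary cross entropy averaged over the positive-sample mask `pos`."""
    n_pos = max(int(pos.sum()), 1)
    bce = -(c_true * eps_log(c_pred) + (1 - c_true) * eps_log(1 - c_pred))
    return float((bce * pos).sum() / n_pos)

def iou_ltrb(a, b):
    """IoU of two boxes given as (l, t, r, b) distances from a shared point."""
    inter = (min(a[0], b[0]) + min(a[2], b[2])) * (min(a[1], b[1]) + min(a[3], b[3]))
    union = (a[0] + a[2]) * (a[1] + a[3]) + (b[0] + b[2]) * (b[1] + b[3]) - inter
    return inter / union

def weighted_regression_loss(reg_pred, reg_true, c_true, pos):
    """Centrality-weighted IoU loss over positive positions: boxes far from
    the center get small ground-truth centrality and hence a small weight."""
    n_pos = max(int(pos.sum()), 1)
    total = 0.0
    for i, j in zip(*np.nonzero(pos)):
        total += c_true[i, j] * (1.0 - iou_ltrb(reg_pred[i, j], reg_true[i, j]))
    return float(total / n_pos)
```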
Further, in step S5, obtaining the predicted target bounding box from the optimized classification map, centrality map and regression map comprises:
multiplying the foreground part of the classification map by the centrality map, the point of maximum response being the predicted target center; the regression map then gives the distances from that point to the four sides of the predicted bounding box, which, combined with the predicted target center, yield the predicted target bounding box.
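The decoding step can be sketched as follows; the stride that maps feature-map cells back to search-image pixels is an assumed parameter, not specified at this point in the patent:

```python
import numpy as np

def decode_bbox(cls_fg, cen, reg, stride=8):
    """Pick the location with the highest cls*cen response and decode its
    (l, t, r, b) regression into an (x1, y1, x2, y2) box in search-image
    pixels.  `stride` maps feature cells to pixels (value assumed)."""
    score = cls_fg * cen                     # foreground prob x centrality
    i, j = np.unravel_index(np.argmax(score), score.shape)
    l, t, r, b = reg[i, j]
    cx, cy = j * stride, i * stride          # feature cell -> pixel coords
    return (cx - l, cy - t, cx + r, cy + b)
```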
Further, the specified size of the template crop is 127 × 127 pixels, and the specified size of the search-region crop is 287 × 287 pixels.
Further, the width × length × number of channels of the template feature is 13 × 13 × 256, and the width × length × number of channels of the search region feature is 25 × 25 × 256.
The invention also discloses a twin target tracking system based on space-channel cross-correlation and centrality guidance, which comprises at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor calls the program instructions to execute the twin target tracking method based on space-channel cross-correlation and centrality guidance, and the method comprises the following functional modules:
a data acquisition module for executing the step S1;
a feature extraction module for executing the step S2;
a space-channel cross-correlation module for performing said step S3;
a classification regression module for executing the step S4;
a prediction module, configured to perform the step S5.
A non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the twin target tracking method based on space-channel cross-correlation and centrality guidance is also disclosed.
(III) advantageous effects
The technical scheme of the invention has the following advantages:
(1) Feeding the search-region feature and the template feature into the space-channel cross-correlation module fuses the features in both the spatial and the channel dimensions, exploiting the feature information of different channels at the same spatial position as well as that of different spatial positions at the same channel; the search-region and template features are thus fused more effectively, improving the accuracy and robustness of target tracking while reducing interference from similar objects and effectively reducing the computational cost;
(2) The regression sub-network comprises a centrality branch and a regression branch; the centrality branch guides the whole regression sub-network, the centrality target guides the centrality branch, and the centrality target is also multiplied as a weight onto the IoU loss, suppressing the weight of low-quality predicted target bounding boxes far from the center point and improving the accuracy of the predicted target bounding box;
(3) Because the feature fusion performed by the space-channel cross-correlation module is lightweight, the operation can be performed twice to obtain two feature maps of different applicability, used respectively as the inputs of the classification sub-network and the regression sub-network to handle the different subtasks; the two tasks are thus better distinguished while the accuracy and robustness of target tracking are improved.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are schematic and are not to be understood as limiting the invention in any way, and in which:
FIG. 1 is a schematic diagram comparing SiamCAR and an embodiment of the method of the invention;
FIG. 2 is a general schematic diagram of a twin target tracking method based on space-channel cross-correlation and centrality guidance according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the spatial-channel cross-correlation module portion of step S3 according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the part of the regression subnetwork with centrality guidance in step S42 according to the embodiment of the present invention;
FIG. 5 is a comparison of the performance of the method of an embodiment of the present invention with other methods on OTB100;
FIG. 6 is a comparison of the performance of the method of the present invention with other methods on UAV123.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
An embodiment of the present invention is a twin target tracking method based on space-channel cross correlation and centrality guidance, as shown in fig. 1 (b) and fig. 2, including the following steps:
s1, obtaining a template and a search area which are cut into a specified size in an image:
an image of a specified pixel size is cropped, centered on the target, from the first frame of a dataset or a camera capture to serve as the template, and during tracking an image of a set size is cropped from the (i+1)-th frame, centered on the target position of the i-th frame, to serve as the search region; the specified crop sizes for the template and the search region are 127 × 127 and 287 × 287 pixels, respectively.
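The cropping in step S1 can be sketched as below. The mean-padding policy and the function signature are assumptions for illustration (the patent only specifies the crop sizes):

```python
import numpy as np

def center_crop(img, center, size):
    """Crop a size x size patch centered on `center` = (x, y), padding with
    the image mean when the window leaves the frame (padding policy assumed)."""
    h, w = img.shape[:2]
    half = size // 2
    x, y = center
    # Start from a mean-filled canvas, then paste the in-frame part.
    out = np.full((size, size) + img.shape[2:], img.mean(), dtype=img.dtype)
    x0, y0 = max(0, x - half), max(0, y - half)
    x1, y1 = min(w, x - half + size), min(h, y - half + size)
    ox, oy = x0 - (x - half), y0 - (y - half)   # offsets inside the canvas
    out[oy:oy + (y1 - y0), ox:ox + (x1 - x0)] = img[y0:y1, x0:x1]
    return out
```

In this scheme the template would be `center_crop(frame0, target_center, 127)` and the search region `center_crop(frame_i1, prev_center, 287)`.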
S2, feeding the template and the search region into the deep feature-extraction network to obtain the template feature Z and the search-region feature X, of sizes Hz × Wz × C and Hx × Wx × C respectively (width × length × number of channels); in this embodiment the sizes are 13 × 13 × 256 and 25 × 25 × 256.
S3, feeding the template feature and the search-region feature into two space-channel cross-correlation modules (SC3M) to obtain a feature map R1 suited to the classification sub-network and a feature map R2 suited to the regression sub-network, as shown in FIG. 3, comprising the following steps:
S31, reshaping the template feature Z along the spatial dimension into a spatial kernel K1 comprising Hz × Wz small kernels, each of size 1 × 1 × C; reshaping the template feature Z along the channel dimension into a channel kernel K2 comprising C small kernels, each of size 1 × 1 × HzWz;
S32, performing a pixel-wise cross-correlation between the search-region feature X and the spatial kernel K1 to obtain feature map F1 of size Hx × Wx × HzWz: F1 = X ⋆ K1, where ⋆ denotes pixel-wise cross-correlation; this step fully fuses the search-region feature X and the template feature Z in the spatial dimension;
S33, performing a pixel-wise cross-correlation between feature map F1 and the channel kernel K2 to obtain feature map F2 of size Hz × Wz × C: F2 = F1 ⋆ K2; this step fully fuses the search-region feature and the template feature in the channel dimension;
S34, concatenating feature map F1 and feature map F2, reducing the dimensionality with a 1 × 1 convolution, and feeding the result into a plug-and-play SE-block module;
through optimization, the 1 × 1 convolution and the SE-block in this step yield feature maps suited to different task biases; the SE-block does not change the size of the input features, and serves to capture global information and the dependencies between channels;
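The SE-block (squeeze-and-excitation) mentioned here can be sketched as follows. The bottleneck weights W1 and W2 are assumed to come from training, and the reduction ratio is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(F, W1, W2):
    """Squeeze-and-Excitation sketch: global average pool per channel,
    a two-layer bottleneck (weights W1: (C/r, C) and W2: (C, C/r)),
    then a sigmoid gate that rescales each channel.  The output keeps
    the input size, matching the text above."""
    s = F.mean(axis=(0, 1))          # squeeze: (C,) channel descriptor
    z = np.maximum(W1 @ s, 0.0)      # excitation bottleneck with ReLU
    g = sigmoid(W2 @ z)              # per-channel gate in (0, 1)
    return F * g                     # reweight channels, shape unchanged
```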
S35, performing steps S31-S34 twice; because the two passes are optimized with different emphases during training, they yield, respectively, a feature map R1 suited to the classification sub-network and a feature map R2 suited to the regression sub-network, both of size Hz × Wz × C, the same as the template feature Z;
owing to the lightweight nature of the feature fusion in steps S31-S34, the operation can be performed twice; in SiamCAR, by contrast, the feature fusion is computationally expensive, so it is performed only once and the single resulting feature map is used for all subtasks.
S4, sending the feature graph R1 into a classification sub-network to obtain a classification graph, and sending the feature graph R2 into a regression sub-network to obtain a centrality graph and a regression graph;
S41, feeding feature map R1 into the classification sub-network, a CNN with a single classification branch, which outputs a classification map; each point of the classification map predicts the probability that the location is foreground or background.
S42, sending the feature graph R2 into a regression sub-network, wherein the regression sub-network is a CNN network and comprises a centrality branch and a boundary box regression branch, and respectively outputting the centrality graph
Figure BDA0003954953370000093
And regression graph
Figure BDA0003954953370000094
Each point of the center map predicts the likelihood that the location is the center of the target, and each point of the regression map predicts the distance of the point from the bounding box, up, down, left, right, as shown in fig. 4.
The classification branch, the centrality branch and the bounding-box regression branch each consist of three fully convolutional layers and differ only in the number of output channels.
S5, optimizing steps S3-S4 through the loss functions, and obtaining the predicted target bounding box from the optimized classification map, centrality map and regression map;
S51, training the centrality branch with the centrality loss, the bounding-box regression branch with the centrality-weighted regression loss, and the classification branch with the classification loss; the classification loss uses the CE (Cross Entropy) loss function

L_cls = −(1/N) Σ_{i,j} [ y*_{i,j} log p_{i,j} + (1 − y*_{i,j}) log(1 − p_{i,j}) ],

the centrality loss uses the BCE (Binary Cross Entropy) loss function

L_cen = −(1/N_pos) Σ_{(i,j)∈P} [ c*_{i,j} log c_{i,j} + (1 − c*_{i,j}) log(1 − c_{i,j}) ],

and the centrality-weighted regression loss is

L_reg = (1/N_pos) Σ_{(i,j)∈P} c*_{i,j} (1 − IoU(B_{i,j}, B*_{i,j})),
where i, j denote the coordinate position on the corresponding map, y*_{i,j} is the ground-truth classification label, c*_{i,j} the ground-truth centrality, p_{i,j} and c_{i,j} the network predictions, N the total number of samples, N_pos the number of positive samples, P the set of positive samples, IoU the intersection-over-union of the two boxes in parentheses, and B_{i,j} and B*_{i,j} the predicted and ground-truth bounding boxes respectively. Weighting the ground-truth centrality c*_{i,j} onto the IoU regression loss suppresses the contribution of low-quality predicted target bounding boxes far from the center point.
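The centrality ground truth c*_{i,j} is not defined explicitly in the text; SiamCAR (and FCOS, from which it derives) compute it from the ground-truth regression distances (l, t, r, b), which is assumed here:

```python
import math

def centrality_target(l, t, r, b):
    """FCOS-style centrality target: 1 at the box center, decaying toward 0
    at the box edges.  (l, t, r, b) are the ground-truth distances from a
    position to the left, top, right and bottom sides of the box."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

Because this value shrinks for positions near the box border, using it as the weight on the IoU loss automatically down-weights off-center, low-quality predictions.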
S52, obtaining a predicted target boundary box according to the optimized classification diagram, the optimized centrality diagram and the optimized regression diagram:
each point of the classification graph predicts the possibility that the point position is the foreground and the background, each point of the central graph predicts the possibility that the point position is the target center, and each point of the regression graph predicts the distance between the point and the upper, lower, left and right sides of the boundary box; therefore, the foreground part in the classification map is multiplied by the centrality map to obtain a point with the maximum response, namely a predicted target central point; and then obtaining the distances from the points to four edges of the predicted boundary frame according to the regression graph, and obtaining the predicted target boundary frame by combining the predicted target central point.
To verify the technical effect of this embodiment, the method was evaluated on several authoritative datasets, including VOT2018, OTB100, UAV123 and GOT-10k. FIG. 5 compares the method of the present invention with other methods on OTB100, where (a) is the precision plot and (b) the success plot on the OTB2015 dataset; FIG. 6 compares the methods on the UAV123 dataset, where (a) is the precision plot and (b) the success plot. In a precision plot, the abscissa is a distance threshold and the ordinate is the percentage of video frames in which the distance between the center of the bounding box estimated by the tracker and the center of the manually annotated ground truth is below that threshold; in a success plot, the abscissa is an overlap threshold and the ordinate is the percentage of frames whose overlap score (OS) exceeds it. The method of the invention performs well on both the precision and the success plots. Table 1 compares performance on the VOT2018 dataset, where a larger accuracy value indicates higher accuracy, a larger robustness value indicates worse stability, and EAO (Expected Average Overlap) denotes the no-reset overlap expectation, larger being better; the values in the table show that the accuracy of the method is second only to SiamRPN++, its EAO is the best, and its robustness is also good, so the overall performance is excellent. Table 2 compares performance on the GOT-10k dataset, where AO denotes the average overlap between all estimated bounding boxes and the ground-truth boxes, SR0.5 the rate of successfully tracked frames with overlap above 0.5, SR0.75 the rate of frames with overlap above 0.75, and FPS the number of frames processed per second; the AO of the method is second only to SiamGAT++, its SR0.5 second only to SiamGAT++ and RBO, its SR0.75 the best, and its FPS high, indicating good overall tracking performance.
TABLE 1 — performance comparison on the VOT2018 dataset (presented as an image in the original)

TABLE 2 — performance comparison on the GOT-10k dataset (presented as an image in the original)
The embodiment of the invention also provides a twin target tracking system based on space-channel cross correlation and centrality guidance, which can realize the twin target tracking method based on space-channel cross correlation and centrality guidance, and comprises a processor and a storage medium, wherein the storage medium is used for storing instructions; the processor executes the twin target tracking method based on the space-channel cross correlation and the centrality guidance, and comprises the following functional modules: a data acquisition module for executing the step S1; a feature extraction module for executing the step S2; a space-channel cross-correlation module for performing said step S3; a classification regression module for executing the step S4; a prediction module, configured to perform the step S5.
The twin target tracking method described above may be converted into software program instructions and implemented either by a twin target tracking system comprising a processor and a memory, or by computer instructions stored in a non-transitory computer-readable storage medium. An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium: the software functional unit includes several instructions that enable a computer device (which may be a personal computer, a server or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In summary, the twin target tracking method based on space-channel cross correlation and centrality guidance provides the beneficial effects set out in section (III) above.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A twin target tracking method based on space-channel cross correlation and centrality guidance is characterized by comprising the following steps of:
s1, obtaining a template and a search area in an image;
S2, feeding the template and the search region into a deep feature-extraction network to obtain the template feature (denoted Z) and the search-region feature (denoted X), whose width × length × number of channels are Hz × Wz × C and Hx × Wx × C respectively;
s3, respectively sending the template features and the search region features into two space-channel cross-correlation modules to obtain a feature map R1 suitable for a classification subnetwork and a feature map R2 suitable for a regression subnetwork;
s4, sending the feature graph R1 into a classification sub-network to obtain a classification graph, and sending the feature graph R2 into a regression sub-network to obtain a centrality graph and a regression graph;
s5, optimizing the steps S3-S4 through a loss function, and obtaining a predicted target boundary frame according to the classification chart, the centrality chart and the regression chart after optimization;
the step S3 includes:
s31, characterizing the template
Figure FDA0003954953360000013
The method comprises the following steps of (1) dividing a space dimension into space kernels K1 comprising Hz multiplied by Wz small kernels, wherein the size of each small kernel is 1 multiplied by C; characterizing the template
Figure FDA0003954953360000014
Dividing the channel dimension into channel cores K2 comprising C small cores, wherein the size of each small core is 1 multiplied by HzWz;
s32, the search area characteristics are set
Figure FDA0003954953360000015
And performing pixel-by-pixel cross-correlation operation with the space kernel K1 to obtain a characteristic diagram F1:
Figure FDA0003954953360000016
:, pixel-by-pixel cross-correlation;
s33, performing pixel-by-pixel cross-correlation operation on the feature map F1 and the channel kernel K2 to obtain a feature map F2: f2= F1 ═ K2;
s34, splicing the characteristic diagram F1 and the characteristic diagram F2, performing 1 x 1 convolution dimensionality reduction, and sending the obtained object into a hot plug module SE-block;
s35, repeating the steps S31-S34 twice to respectively obtain a feature map R1 suitable for the classification sub-network and a feature map R2 suitable for the regression sub-network, wherein the feature maps are Hz multiplied by Wz multiplied by C.
2. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 1, wherein in step S1 the template is an image of a specified pixel size cropped, centered on the target, from the first frame of a dataset sequence or of a camera capture; the search area is an image of a specified size cropped from the (i+1)-th frame during tracking, centered on the target position of the i-th frame.
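The cropping of claim 2 can be sketched as a center crop whose window is clamped to the image border; the clamping strategy is an assumption (implementations often pad with the channel mean instead):

```python
import numpy as np

def center_crop(image, center_xy, size):
    """Crop a size x size window centered at center_xy, clamping to the image border."""
    H, W = image.shape[:2]
    cx, cy = center_xy
    x1 = int(np.clip(cx - size // 2, 0, max(W - size, 0)))
    y1 = int(np.clip(cy - size // 2, 0, max(H - size, 0)))
    return image[y1:y1 + size, x1:x1 + size]
```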
3. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 1, wherein the step S4 comprises:
S41, sending the feature map R1 into the classification sub-network, a CNN with only one classification branch, and outputting a classification map A_cls, each point of which predicts the likelihood that the corresponding location is foreground or background;
S42, sending the feature map R2 into the regression sub-network, a CNN comprising a centrality branch and a bounding-box regression branch, and outputting a centrality map A_cen and a regression map A_reg respectively; each point of the centrality map predicts the probability that the location is the target center, and each point of the regression map predicts the distances from that point to the top, bottom, left and right sides of the bounding box.
4. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 1, wherein in step S5 the optimization through loss functions comprises: the centrality branch is trained with a centrality loss, the bounding-box regression branch is trained with a centrality-weighted regression loss, and the classification branch is trained with a classification loss.
5. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 4, wherein the classification loss is:
L_cls = -(1/N) Σ_{i,j} [ c*_{i,j} · log p_{i,j} + (1 - c*_{i,j}) · log(1 - p_{i,j}) ]
the centrality loss is:
L_cen = -(1/N_pos) Σ_{(i,j)∈Ω_pos} [ cen*_{i,j} · log cen_{i,j} + (1 - cen*_{i,j}) · log(1 - cen_{i,j}) ]
and the centrality-weighted regression loss is:
L_reg = (1 / Σ_{(i,j)∈Ω_pos} cen*_{i,j}) · Σ_{(i,j)∈Ω_pos} cen*_{i,j} · L_IoU(B_{i,j}, B*_{i,j})
wherein (i, j) denotes the coordinate position in the corresponding map, c*_{i,j} denotes the true value of the classification label, cen*_{i,j} denotes the true value of the centrality, p_{i,j} and cen_{i,j} denote the corresponding network predictions, N denotes the total number of points, N_pos denotes the number of positive samples, Ω_pos denotes the set of positive samples, L_IoU denotes the IoU loss, IoU denotes the intersection ratio of the two arguments in parentheses, and B_{i,j} and B*_{i,j} denote the predicted bounding box and the real bounding box respectively.
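A minimal sketch of the centrality-weighted regression loss follows, assuming axis-aligned boxes given as (x1, y1, x2, y2) and taking L_IoU = 1 - IoU; the exact IoU-loss variant used by the patent is not shown in its figures, so this choice is an assumption:

```python
import numpy as np

def iou(a, b):
    """IoU of axis-aligned boxes, each array of shape (N, 4) as (x1, y1, x2, y2)."""
    ix1 = np.maximum(a[:, 0], b[:, 0]); iy1 = np.maximum(a[:, 1], b[:, 1])
    ix2 = np.minimum(a[:, 2], b[:, 2]); iy2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = lambda x: (x[:, 2] - x[:, 0]) * (x[:, 3] - x[:, 1])
    return inter / np.clip(area(a) + area(b) - inter, 1e-8, None)

def centrality_weighted_iou_loss(pred, gt, cen_target):
    """Weight the per-location IoU loss by the centrality target, which
    suppresses low-quality predicted boxes far from the center point."""
    w = cen_target
    return np.sum(w * (1.0 - iou(pred, gt))) / np.clip(np.sum(w), 1e-8, None)
```

Locations with a centrality target near zero (far from the center) contribute almost nothing to the loss, which is the suppression effect described in the advantages above.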
6. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 1, wherein in step S5 obtaining the predicted target bounding box from the optimized classification map, centrality map and regression map comprises:
multiplying the foreground part of the classification map by the centrality map and taking the point with the maximum response as the predicted target center point; then obtaining from the regression map the distances from that point to the four sides of the predicted bounding box, and combining them with the predicted target center point to obtain the predicted target bounding box.
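The box decoding of claim 6 can be sketched as follows; the mapping from map coordinates back to image coordinates (here a plain stride multiplication) is an assumption:

```python
import numpy as np

def decode_box(cls_fg, cen, reg, stride=8):
    """cls_fg, cen: (H, W) foreground-classification and centrality maps;
    reg: (H, W, 4) distances (l, t, r, b) from each point to the box sides.
    Returns the predicted box (x1, y1, x2, y2) in image coordinates."""
    score = cls_fg * cen                     # multiply foreground part by centrality
    i, j = np.unravel_index(np.argmax(score), score.shape)
    l, t, r, b = reg[i, j]
    cx, cy = j * stride, i * stride          # assumed map -> image coordinate mapping
    return (cx - l, cy - t, cx + r, cy + b)
```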
7. The twin target tracking method based on space-channel cross-correlation and centrality guidance of claim 2, wherein the specified size of the template crop is 127 x 127 pixels and the specified size of the search-region crop is 287 x 287 pixels.
8. The twin target tracking method based on space-channel cross-correlation and centrality guidance according to claim 1, wherein the width x length x number of channels size of the template feature is 13 x 256, and the width x length x number of channels size of the search area feature is 25 x 256.
9. A twin target tracking system based on space-channel cross-correlation and centrality guidance, comprising at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the twin target tracking method based on space-channel cross-correlation and centrality guidance according to any one of claims 1 to 8, comprising the following functional modules:
a data acquisition module for executing the step S1;
a feature extraction module for executing the step S2;
a space-channel cross-correlation module for performing said step S3;
a classification regression module for executing the step S4;
a prediction module, configured to perform the step S5.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the twin target tracking method based on space-channel cross-correlation and centrality guidance according to any one of claims 1 to 8.
CN202211459889.0A 2022-11-17 2022-11-17 Twin target tracking method based on space-channel cross correlation and centrality guidance Pending CN115719367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211459889.0A CN115719367A (en) 2022-11-17 2022-11-17 Twin target tracking method based on space-channel cross correlation and centrality guidance

Publications (1)

Publication Number Publication Date
CN115719367A true CN115719367A (en) 2023-02-28

Family

ID=85255866


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination