CN113592906B - Long video target tracking method and system based on annotation frame feature fusion

Long video target tracking method and system based on annotation frame feature fusion

Info

Publication number
CN113592906B
CN113592906B
Authority
CN
China
Prior art keywords
frame
feature
target
fusion
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110787587.5A
Other languages
Chinese (zh)
Other versions
CN113592906A (en)
Inventor
胡若澜
张涵
陈纪刚
姜军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202110787587.5A priority Critical patent/CN113592906B/en
Publication of CN113592906A publication Critical patent/CN113592906A/en
Application granted granted Critical
Publication of CN113592906B publication Critical patent/CN113592906B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a long video target tracking method and system based on annotation frame feature fusion, belonging to the field of computer vision. The method comprises the following steps: sequentially performing feature fusion and target feature enhancement processing on the guide vector of the annotation frame image and the feature map of the search frame image to obtain an enhanced fusion result; inputting the enhanced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, performing feature fusion between the region features of each candidate frame and multi-scale annotation-frame region features of preset sizes, and inputting the result into a classification regression head network to obtain target frames to be selected and their confidences; mapping the target frames to be selected with high confidence into dense feature vectors, and generating one or more track segments according to the distribution of the dense feature vectors; and determining the target frame to be selected corresponding to the track segment with the highest track rationality score as the target tracking result in the search frame image. The method improves the tracking performance of long video target tracking.

Description

Long video target tracking method and system based on annotation frame feature fusion
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a long video target tracking method and system based on annotation frame feature fusion.
Background
The purpose of visual target tracking is to enable a computer to continuously focus on a moving target in its field of view, as a human does; it is a study that simulates biological visual ability. With the increasingly urgent demands of applications such as robot vision, intelligent video surveillance, and intelligent human-computer interaction, long video target tracking methods have received growing attention.
The target tracking method needs to complete the following tasks: a certain object is given or detected in the first frame of a sequence of video frames and object position and scale information is given in the form of bounding boxes in the subsequent sequence of frames. The initial condition given by the target tracking task is limited, which is essentially a single sample learning problem, and the position and scale of the target in the next frame are calculated through the initialization information and the accumulated information in the tracking process. There are various movements and changes of the target between successive frames of the video sequence, and the target may disappear and reappear repeatedly over a longer period of time, which requires the long video tracking algorithm to have the ability to re-capture the tracked target when the target reappears after disappearing, and the ability to determine whether the target is present in the current frame. It is challenging to implement a stable long video object tracking algorithm.
For the target tracking task with shorter time span, the solutions of correlation filtering, online density estimation and the like have better performance. For a long video target tracking task, as the long video target varies in a long video sequence, the direct use of methods such as correlation filtering, online density estimation and the like for online model parameter updating can cause error accumulation, so that the target tracking task fails. How to achieve stable target tracking for a long time remains a problem to be solved.
Disclosure of Invention
Aiming at the defects and improvement demands of the prior art, the invention provides a long video target tracking method and a long video target tracking system based on label frame feature fusion, and aims to solve the technical problem that the tracking performance of the existing target tracking method is poor in long video target tracking.
In order to achieve the above object, according to one aspect of the present invention, there is provided a long video object tracking method based on annotation frame feature fusion, including: s1, acquiring a marked frame image and a current search frame image in a long video to be tracked, and sequentially carrying out feature fusion and target feature enhancement treatment on a guide vector of the marked frame image and a feature image of the search frame image to obtain an enhanced fusion result; s2, inputting the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1, respectively carrying out feature fusion on the regional features of each candidate frame and the multi-scale marked frame regional features with preset sizes in the marked frame image, and inputting the multi-scale marked frame regional features into a classification regression head network to obtain a target frame to be selected and a confidence level; s3, mapping the target frame to be selected with high confidence coefficient into a dense feature vector, and generating one or more track segments according to the distribution of the dense feature vector; s4, calculating the track rationality score of each track segment, and determining a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
Still further, the step S1 further includes: extracting features of a feature map level from the labeling frame image to obtain the guide vector, wherein the feature map level l is:
where h_tag is the height of the annotation frame, w_tag is the width of the annotation frame, and find_size is the set comparison size.
Further, in S1, the Hadamard product operation and the convolution processing are sequentially performed on the guide vector and the feature map, so as to perform feature fusion on the guide vector and the feature map.
Further, the target feature enhancement processing in S1 includes: performing, in parallel, a channel-attention operation and a spatial-attention operation on the feature map obtained after feature fusion in S1, to obtain the corresponding channel weight coefficient y_up and weight coefficient y_down:

y_up = sigmoid(W_fc(pool(x))) ⊙ x

y_down = sigmoid(W_d(reduce(x))) ⊙ x

where x is the feature map after feature fusion in S1, pool(·) is the pooling layer, W_fc(·) is the first convolution layer, ⊙ denotes the Hadamard product, sigmoid(·) is the sigmoid function, reduce(·) is the compression processing, and W_d(·) is the second convolution layer;

the fusion result before enhancement is multiplied by the channel weight coefficient y_up and by the weight coefficient y_down respectively, and the sum of the two products is convolved to obtain the target-feature-enhanced fusion result.
Still further, the candidate box set is:
where P_N is the candidate frame set, v_tag is the guide vector, f_search is the feature map of the search frame image, ⊛ denotes the convolution operation, W_1×1 are the weight parameters of the 1×1 convolutional network, δ(·) denotes the feature enhancement processing, and F_RPN(·) is the RPN network.
Still further, the feature fusion and input to the classification regression head network in S2 includes: performing a cross-correlation operation between each candidate frame region feature and the multi-scale annotation-frame region features; and concatenating and aggregating the cross-correlation results of each candidate frame's region features along the channel dimension, and inputting the aggregated results into the classification regression head network to obtain the target frame to be selected and the confidence.
Still further, generating one or more track segments according to the distribution of the dense feature vectors in S3 includes: when there is a single dense feature vector, adding the target frame to be selected corresponding to that dense feature vector to the target track segment to generate a track segment; when there are a plurality of dense feature vectors, judging whether a first dense feature vector exists such that the similarity between the first dense feature vector and the target track segment is higher than a first threshold and the similarity between every other dense feature vector and the target track segment is not higher than a second threshold, the second threshold being smaller than the first threshold; if the first dense feature vector exists, adding the target frame to be selected corresponding to the first dense feature vector to the target track segment to generate a track segment; and if the first dense feature vector does not exist, initializing a plurality of new track segments corresponding to the respective target frames to be selected.
Still further, calculating the track rationality score in S4 includes: when there are multiple track segments, calculating the track rationality score of each track segment,

where F_new is the newly added track segment, F_old is a non-inactivated track segment, the spliced-track rationality score of F_old is the score accumulated during tracking before the current moment, s_sim is the similarity score between the newly added target frame to be selected of F_new and the annotated target, w is a weighting coefficient, and link(F_new, F_old) is the maximum difference between the top-left and bottom-right corner coordinates of the target frames at the splice of the newly added track segment and the non-inactivated track segment.
Still further, S1 and S2 are implemented based on a two-stage target tracking network model, and before S1 the method further includes: training the two-stage target tracking network model with a two-stage joint loss function L, where:

L = L_1 + L_2

L_1 = L_cls + L_reg

L_2 = 2(L_cls + L_reg)

where L_1 is the candidate-regression-stage loss function, L_2 is the classification-regression-stage loss function, L_cls is the classification loss function, L_reg is the regression loss function, N_cls is the batch size, p_i and p̂_i are, respectively, the label value and the confidence computed for the target frame to be selected, L(p_i, p̂_i) is the confidence loss of the target frame to be selected, N_prop is the number of candidate frames, t_i are the label-frame coordinates, t̂_i are the regressed result-frame coordinates, and L(t_i, t̂_i) is the regression loss.
According to another aspect of the present invention, there is provided a long video object tracking system based on annotation frame feature fusion, comprising: the fusion and enhancement module is used for acquiring the marked frame image and the current search frame image in the long video to be tracked, and sequentially carrying out feature fusion and target feature enhancement treatment on the guide vector of the marked frame image and the feature image of the search frame image to obtain an enhanced fusion result; the second fusion module is used for inputting the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1, and the region features of each candidate frame are respectively subjected to feature fusion with the multi-scale marked frame region features with preset sizes in the marked frame image and then are input into a classification regression head network to obtain a target frame to be selected and the confidence level; the track generation module is used for mapping the target frame to be selected with high confidence into a dense feature vector and generating one or more track segments according to the distribution of the dense feature vector; and the scoring module is used for calculating the track rationality score of each track segment and determining a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) Designing a two-stage target tracking algorithm based on label frame feature fusion, fusing the extracted features of the label frames and the search frame features in a candidate regression stage, and performing enhancement processing on the fused target features; in the classification regression stage, merging the multi-scale features extracted by the labeling frame with candidate target frame features in the search frame; the fusion and enhancement processing of the two stages improves the re-capturing capability and the target tracking precision of the target after the target disappears in the long video target tracking process;
(2) The distance measurement algorithm is designed, image features of the corresponding areas of the high-confidence-level to-be-selected target frames obtained by the two-stage target tracking algorithm are mapped into dense features, and trace rationality scoring is carried out by combining time domain information of the to-be-selected targets, so that the influence of co-semantic interferents is inhibited, the problem of target tracking errors caused by the co-semantic interferents is avoided, and the stability of long video target tracking is improved;
(3) The feature map levels are flexibly selected, feature maps of multiple levels are extracted, adaptability to labeling target dimensions is enhanced, and feature maps with larger dimensions can be selected under the condition that labeling target dimensions are small, so that sufficient regional feature information is ensured to be extracted, and the performance of long video target tracking is further improved.
Drawings
FIG. 1 is a flowchart of a long video target tracking method based on annotation frame feature fusion provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a network structure of a target tracking network in a candidate regression stage according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of a target tracking network in a classification regression stage according to an embodiment of the present invention;
FIG. 4 is a network implementation block diagram of a long video target tracking method based on label frame feature fusion provided by an embodiment of the present invention;
fig. 5 is a block diagram of a long video target tracking system based on annotation frame feature fusion according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In the present invention, the terms "first," "second," and the like in the description and in the drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Fig. 1 is a flowchart of a long video target tracking method based on feature fusion of a labeling frame according to an embodiment of the present invention. Referring to fig. 1, in conjunction with fig. 2 to fig. 4, a method for tracking a long video object based on feature fusion of a labeling frame in this embodiment is described in detail, and the method includes operations S1 to S4.
In the embodiment of the invention, a two-stage target tracking network model and a distance measurement network model are constructed and trained to realize operation S1-operation S4, wherein operation S1 and operation S2 are realized by using the two-stage target tracking network model, and operation S3 and operation S4 are realized by using the distance measurement network model. The two-stage target tracking network model comprises two stages, namely candidate regression and classification regression.
Before performing operation S1, the constructed two-stage target tracking network model and distance measurement network model need to be trained separately. For the two-stage target tracking network model, the first frame of each video sequence in the LaSOT training set is used as the annotation frame image and the video sequence images are used as search frame images; these are input into the two-stage target tracking network model, with the target frame and its confidence as the model output. The two-stage target tracking network model is trained with a two-stage joint loss function L, where:

L = L_1 + L_2

L_1 = L_cls + L_reg

L_2 = 2(L_cls + L_reg)

where L_1 is the candidate-regression-stage loss function; L_2 is the classification-regression-stage loss function; L_cls is the classification loss function; L_reg is the regression loss function; N_cls is the batch size; p_i and p̂_i are, respectively, the label value (0 or 1) and the confidence computed for the target frame to be selected; L(p_i, p̂_i) is the confidence loss of the target frame to be selected, computed with a cross-entropy loss function; N_prop is the number of candidate frames; t_i are the label-frame coordinates (x*, y*, w*, h*); t̂_i are the regressed result-frame coordinates (x_a, y_a, w_a, h_a); and L(t_i, t̂_i) is the regression loss.
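As a point of reference (not part of the original disclosure), a minimal Python sketch of how the two-stage joint objective combines the per-stage terms; the function and argument names are illustrative assumptions.

```python
def joint_loss(l_cls_rpn, l_reg_rpn, l_cls_head, l_reg_head):
    """Two-stage joint loss L = L1 + L2, with L1 = L_cls + L_reg for the
    candidate regression stage and L2 = 2 * (L_cls + L_reg) for the
    classification regression stage."""
    l1 = l_cls_rpn + l_reg_rpn            # candidate regression stage loss
    l2 = 2.0 * (l_cls_head + l_reg_head)  # classification regression stage loss, weighted by 2
    return l1 + l2
```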
For the distance measurement network, a distance measurement network model with BN-Inception as the backbone is trained, using images from the image recognition databases Stanford-Online-Products, CUB-200-2011, In-Shop-Clothes-Retrieval, and Car-196 as inputs and MS-Loss as the loss function.
In this embodiment, for example, the first frame of a video sequence from the LaSOT test set or the TLP data set is used as the annotation frame image and the video sequence images are used as search frame images, which are input into the trained two-stage target tracking network model; the region images of the high-confidence target frames to be selected output by the two-stage target tracking network model are input into the trained distance measurement network and mapped into dense feature vectors, and track rationality scoring is performed in combination with temporal information, as shown in fig. 4.
S1, acquiring a labeling frame image and a current search frame image in a long video to be tracked, and sequentially carrying out feature fusion and target feature enhancement processing on a guide vector of the labeling frame image and a feature image of the search frame image to obtain an enhanced fusion result.
According to an embodiment of the present invention, operation S1 includes sub-operation S11-sub-operation S14.
In a sub-operation S11, an annotation frame image and a current search frame image in a long video to be tracked are acquired. Specifically, the annotation frame image and the current search frame image in the long video to be tracked are input into a two-stage target tracking network model. The annotation frame image is, for example, the first frame of a video sequence of the test set and the TLP set in the LaSOT dataset.
In sub-operation S12, feature extraction at feature map level l is performed on the annotation frame image to obtain the guide vector, and feature extraction is performed on the search frame image to obtain a feature map of the whole image.
In this embodiment, ResNet-50 with a Feature Pyramid Network (FPN) is selected as the feature extraction network to extract the image features of the target region in the annotation frame image and the image features of the search frame, respectively.
Region feature values are computed for the target region in the annotation frame image using ROIAlign; the extracted region feature size is, for example, 7×7×256. Further, the feature map level to extract from is calculated by comparing the relative sizes of the annotation frame and the whole image, and the extracted feature map level l is:

where h_tag is the height of the annotation frame, w_tag is the width of the annotation frame, and find_size is the set comparison size.
Through flexibly selecting the feature map levels, the FPN is fully utilized to extract the feature maps of a plurality of levels, the adaptability to the labeling target scale is enhanced, and the feature map with a larger size can be selected under the condition that the labeling target scale is small, so that the sufficient regional feature information is ensured to be extracted. For the search frame image, four-level feature pyramid outputs P2 to P5 are extracted, with scales of, for example, 256×25×32, 256×50×64, 256×100×128, and 256×200×256, respectively.
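By way of illustration (not part of the original disclosure), a minimal PyTorch/torchvision sketch of pooling a 7×7×256 region feature for the annotation frame from a chosen FPN level with ROIAlign; the layout of fpn_feats, the strides mapping, and all names are assumptions made for this sketch.

```python
import torch
from torchvision.ops import roi_align

def extract_region_feature(fpn_feats, box_xyxy, level, strides):
    """Pool a 256x7x7 region feature for the annotation frame from the selected FPN level.

    fpn_feats: dict {level: tensor [1, 256, H, W]} from a ResNet-50 + FPN backbone
    box_xyxy:  annotation box (x1, y1, x2, y2) in input-image coordinates
    level:     pyramid level l selected from the annotation-box size
    strides:   dict {level: stride of that level w.r.t. the input image}, e.g. {2: 4, 3: 8, 4: 16, 5: 32}
    """
    feat = fpn_feats[level]
    boxes = torch.tensor([[0.0, *box_xyxy]], dtype=feat.dtype, device=feat.device)
    return roi_align(feat, boxes, output_size=(7, 7),
                     spatial_scale=1.0 / strides[level], aligned=True)  # [1, 256, 7, 7]
```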
In sub-operation S13, feature fusion is performed on the guide vector of the annotation frame image and the feature map of the search frame image to obtain a fusion result F_s.
In this embodiment, a feature fusion module and a target enhancement module are added in the candidate regression stage, as shown in fig. 2. Referring to fig. 2, in the feature fusion module, the feature extraction network extracts the region features of the annotation frame image, which are convolved to output the guide vector v_tag representing the annotation frame image; at the same time, the search frame image is convolved to obtain a feature map of the whole image. The guide vector v_tag is combined with the feature map of the search frame image by a Hadamard product operation followed by a 1×1 convolution to obtain the fusion result F_s.
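A minimal PyTorch sketch (illustrative only; the class and argument names are assumptions) of the feature fusion module just described: the guide vector is broadcast over the search-frame feature map with a Hadamard product and the result passes through a 1×1 convolution.

```python
import torch.nn as nn

class GuideFusion(nn.Module):
    """Fuse the annotation-frame guide vector with the search-frame feature map:
    Hadamard product followed by a 1x1 convolution, producing F_s."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, guide_vec, search_feat):
        # guide_vec:   [B, C, 1, 1] vector distilled from the annotation-frame region feature
        # search_feat: [B, C, H, W] feature map of the whole search frame
        fused = guide_vec * search_feat   # Hadamard product, broadcast over H and W
        return self.conv1x1(fused)        # fusion result F_s
```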
In sub-operation S14, target feature enhancement processing is performed on the fusion result F_s to obtain an enhanced fusion result.
Referring to fig. 2, an attention mechanism is introduced into the target enhancement module: a channel-attention operation and a spatial-attention operation are performed in parallel on the fusion result F_s output by the feature fusion module, yielding the channel weight coefficient y_up and the weight coefficient y_down, respectively.

Specifically, the upper-path attention coefficient is computed by first applying pooling and two 1×1 convolution layers W_fc, and then mapping the output with a sigmoid function so that the elements of the output weight vector mask1 lie in the range [0, 1]. The upper path outputs a 256×1 channel weight coefficient y_up:

y_up = sigmoid(W_fc(pool(x))) ⊙ x

where x is the feature map after feature fusion in sub-operation S13, pool(·) is the pooling layer, W_fc(·) is the first convolution layer, and ⊙ denotes the Hadamard product.

The lower-path attention coefficient is computed by first compressing the features of F_s output by the feature fusion module to a 2 × w_f × h_f tensor, applying a 5×5 convolution, and then a sigmoid activation function, which outputs a 1 × w_f × h_f weight coefficient y_down:

y_down = sigmoid(W_d(reduce(x))) ⊙ x

where reduce(·) is the compression processing and W_d(·) is the second convolution layer; w_f and h_f are the width and height of F_s, determined by the convolution stride of the feature extraction network and the input image size.

Further, the fusion result F_s before enhancement is multiplied by the channel weight coefficient y_up to obtain F_s1 and by the weight coefficient y_down to obtain F_s2; a single 3×3 convolution is then applied to the sum of F_s1 and F_s2 to obtain the target-feature-enhanced fusion result.
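The following PyTorch sketch mirrors the two attention branches and the final 3×3 convolution described above; it is not part of the original disclosure, and the pooling type, the channel-reduction ratio inside W_fc, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class TargetEnhance(nn.Module):
    """Channel attention (upper path) and spatial attention (lower path) over the
    fused feature F_s, followed by a 3x3 convolution over the sum of the two
    reweighted maps."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        # Upper path: global pooling + two 1x1 convolutions W_fc -> channel weights y_up
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1))
        # Lower path: compress to 2 channels, then a 5x5 convolution W_d -> spatial weights y_down
        self.reduce = nn.Conv2d(channels, 2, kernel_size=1)
        self.conv5x5 = nn.Conv2d(2, 1, kernel_size=5, padding=2)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_s):
        y_up = torch.sigmoid(self.fc(self.pool(f_s)))           # [B, C, 1, 1] channel weights
        y_down = torch.sigmoid(self.conv5x5(self.reduce(f_s)))  # [B, 1, H, W] spatial weights
        f_s1 = f_s * y_up      # channel-reweighted features
        f_s2 = f_s * y_down    # spatially reweighted features
        return self.out_conv(f_s1 + f_s2)  # target-feature-enhanced fusion result
```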
And S2, inputting the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1, respectively carrying out feature fusion on the regional features of each candidate frame and the multi-scale marked frame regional features with preset sizes in the marked frame image, and then inputting the regional features into a classification regression head network to obtain the target frame to be selected and the confidence coefficient.
According to an embodiment of the present invention, operation S2 includes sub-operation S21-sub-operation S23.
In a sub-operation S21, the reinforced fusion result is input into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1.
In the candidate regression stage of the two-stage target tracking network model, the target-feature-enhanced fusion result obtained in sub-operation S14 is input into a candidate frame regression network (Region Proposal Network, RPN) to obtain a candidate frame set P_N containing N candidate frames,

where v_tag is the guide vector, f_search is the feature map of the search frame image, ⊛ denotes the convolution operation, W_1×1 are the weight parameters of the 1×1 convolutional network, δ(·) denotes the feature enhancement processing, and F_RPN(·) is the RPN network.
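One way to read the candidate regression stage described above is as the composition of the fusion module, the enhancement module, and the RPN; the sketch below (assumed PyTorch, with the RPN supplied by the caller) only illustrates that composition.

```python
import torch.nn as nn

class CandidateRegressionStage(nn.Module):
    """Candidate regression stage: the guide vector and search feature map are
    fused, the fused map is enhanced, and the RPN produces the candidate set P_N."""
    def __init__(self, fuse, enhance, rpn):
        super().__init__()
        self.fuse, self.enhance, self.rpn = fuse, enhance, rpn

    def forward(self, guide_vec, search_feat):
        f_s = self.fuse(guide_vec, search_feat)  # Hadamard product + 1x1 convolution
        f_enh = self.enhance(f_s)                # channel + spatial attention, 3x3 convolution
        return self.rpn(f_enh)                   # candidate frame set P_N
```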
In sub-operation S22, the features of the search image region corresponding to each candidate frame are extracted to obtain a candidate frame region feature set.

In the classification regression stage of the two-stage target tracking network model, features are extracted from the candidate frame set P_N obtained in the candidate regression stage to obtain a candidate frame region feature set containing the region features of each candidate frame.
In sub-operation S23, the region features of each candidate frame are respectively fused with the multi-scale labeling frame region features with preset sizes in the labeling frame image, and then input into the classification regression header network, so as to obtain the target frame to be selected and the confidence level.
Referring to fig. 3, first, a cross-correlation operation is performed between each candidate frame region feature and the multi-scale annotation-frame region features of preset sizes in the annotation frame image. The multi-scale annotation-frame region features of preset sizes include, for example, annotation-frame region features at the two scales 7×7×256 and 3×3×256. During fusion, the cross-correlation operation is carried out between each of the candidate frame region features and the multi-scale annotation-frame region features, and the calculation formula is:

S(i, j) = Σ_{u=1..m} Σ_{v=1..n} X(u, v) · Z(i + u, j + v)

where i and j are the row and column coordinates in the candidate-frame feature, u and v are the row and column coordinates in the annotation-frame feature, X(u, v) is the feature value at position (u, v) of the annotation-frame feature, Z(i + u, j + v) is the feature value at position (i + u, j + v) of the candidate-frame feature, S(i, j) is the similarity at position (i, j) of the candidate frame, m is the annotation-frame height, and n is the annotation-frame width.
And then, splicing and aggregating the cross-correlation operation results of the region features of each candidate frame according to the channel dimension, and inputting the aggregation results into a classification regression head network to obtain the target frame to be selected and the confidence coefficient. Specifically, for example, a 3×3 convolutional network is used to aggregate the feature maps obtained by the concatenation.
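A hedged PyTorch sketch of this fusion step: a depthwise (per-channel) cross-correlation is one common realization of the correlation formula above and is assumed here, with 'same' padding so that the responses against the 7×7 and 3×3 annotation-frame features keep the candidate's spatial size, can be concatenated along the channel dimension, and are aggregated by a 3×3 convolution (agg_conv). All function names are illustrative.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(candidate_feat, template_feat):
    """Correlate each channel of the candidate region feature with the matching
    channel of the annotation-frame region feature."""
    b, c, h, w = candidate_feat.shape
    kh, kw = template_feat.shape[2:]
    x = candidate_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, kh, kw)
    # "same" padding keeps the output at the candidate's spatial size
    out = F.conv2d(x, kernel, groups=b * c, padding=(kh // 2, kw // 2))
    return out.reshape(b, c, h, w)

def fuse_candidate(candidate_feat, template_7x7, template_3x3, agg_conv):
    """Correlate with both template scales, concatenate along channels, and
    aggregate with a 3x3 convolution before the classification regression head."""
    s7 = xcorr_depthwise(candidate_feat, template_7x7)
    s3 = xcorr_depthwise(candidate_feat, template_3x3)
    return agg_conv(torch.cat([s7, s3], dim=1))
```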
And S3, mapping the target frame to be selected with high confidence into dense feature vectors, and generating one or more track segments according to the distribution of the dense feature vectors.
In operation S3, the target frame to be selected with high confidence and the confidence thereof output by the two-stage target tracking network model are input into the distance measurement network model, the distance measurement network model maps the target frame to be selected with high confidence into dense feature vectors, and one or more track segments are generated according to the distribution of the dense feature vectors.
Specifically, a first track segment is first created using the annotation information of the first frame and treated as a meta-segment. Starting from the second frame, two cases are divided:
(1) If no interfering object exists in the search frame image, the two-stage target tracking network model outputs only one high-confidence target frame to be selected; this target frame to be selected is output as the target tracking result, and the target frame to be selected corresponding to the dense feature vector is added to the target track segment to generate a track segment.
(2) If an interfering object appears in the search frame image, the two-stage target tracking network model outputs a plurality of high-confidence target frames to be selected, and accordingly there are also a plurality of dense feature vectors; the new target frames to be selected are then matched against the track segments according to a threshold condition. It is judged whether there exists a first dense feature vector d_1 among the plurality of dense feature vectors such that the similarity between d_1 and the target track segment F_1 is higher than a first threshold, while the similarity between every other dense feature vector and the target track segment is not higher than a second threshold, the second threshold being smaller than the first threshold. If such a first dense feature vector exists, the target frame to be selected corresponding to it is added to the target track segment to generate a track segment; if it does not exist, each candidate is regarded as a new interfering object, and a plurality of new track segments corresponding to the respective target frames to be selected are initialized. The first threshold is, for example, 0.5, and the second threshold is, for example, 0.4.
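The threshold rule above can be sketched as follows (assumed PyTorch; cosine similarity and all names are illustrative, since only the similarity to the track segment and the two thresholds are specified).

```python
import torch
import torch.nn.functional as F

FIRST_THRESHOLD = 0.5   # similarity needed to extend the target track segment
SECOND_THRESHOLD = 0.4  # every other candidate must stay at or below this value

def match_candidates(track_embedding, candidate_embeddings):
    """Return the index of the candidate that extends the target track segment,
    or None when the match is ambiguous and new track segments must be initialized.

    track_embedding:      [D] dense feature vector of the target track segment
    candidate_embeddings: [N, D] dense feature vectors of the high-confidence candidates
    """
    sims = F.cosine_similarity(candidate_embeddings, track_embedding.unsqueeze(0), dim=1)
    best = int(torch.argmax(sims))
    others_ok = all(s <= SECOND_THRESHOLD
                    for i, s in enumerate(sims.tolist()) if i != best)
    if sims[best] > FIRST_THRESHOLD and others_ok:
        return best
    return None
```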
And S4, calculating the track rationality scores of the track segments, and determining a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
When operation S3 yields a single track segment, the target frame to be selected in that track segment is determined as the target tracking result in the search frame image. When operation S3 yields multiple track segments, the track rationality score of each track segment is calculated, and the target frame to be selected corresponding to the track segment with the highest track rationality score is determined as the target tracking result in the search frame image. The track rationality score is computed as follows:

where F_new is the newly added track segment, F_old is a non-inactivated track segment, the spliced-track rationality score of F_old is the score accumulated during tracking before the current moment, s_sim is the similarity score between the newly added target frame to be selected of F_new and the annotated target, w is a weighting coefficient (for example, 3.0), and link(F_new, F_old) is the maximum difference between the top-left and bottom-right corner coordinates of the target frames at the splice of the newly added track segment and the non-inactivated track segment.
In order to verify the tracking result of the long video target tracking method based on the labeling frame feature fusion in the embodiment of the invention on the long video target, a dataset for target tracking is constructed based on two long video tracking datasets, namely a LaSOT dataset and a TLP dataset, and the detailed contents of the datasets are shown in table 1.
Table 1 dataset details table
In Table 1, LaSOT includes a training set and a test set, while TLP includes only a test set. This embodiment evaluates the performance of the long video target tracking method on the test sets.
After the data set preparation is completed, model training and testing are required. The two-stage target tracking network model based on annotation frame feature fusion is trained as follows: the first frame image of each video sequence in the training set is used as the annotation frame image, the video sequence images are used as search frame images, and they are input into the target tracking network model to obtain the target tracking results of the sequence frames. The training process adopts a two-step fine-tuning strategy: the candidate-regression-stage network is trained first, and then the whole target tracking network model is trained jointly. During training, each input image is normalized and randomly flipped. The training optimizer of this example is stochastic gradient descent (SGD) with a batch size of 16. The model is trained for 12 epochs with an initial learning rate of 0.0015 and a weight-decay regularization coefficient of 0.001; the learning rate is reduced to one tenth at the 8th and 11th epochs.
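For concreteness, a PyTorch sketch of this training schedule; the momentum value, the model and train_loader arguments, and the way the per-stage losses are unpacked are assumptions, and joint_loss refers to the sketch given earlier.

```python
import torch

def train(model, train_loader, joint_loss):
    """Training schedule from the description: SGD, batch size 16 (set in the loader),
    12 epochs, initial lr 0.0015, weight decay 1e-3, lr divided by 10 at epochs 8 and 11."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0015,
                                momentum=0.9, weight_decay=0.001)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)
    for epoch in range(12):
        for annot_img, search_img, target in train_loader:  # images normalized and randomly flipped
            losses = model(annot_img, search_img, target)   # per-stage cls/reg losses (assumed output)
            loss = joint_loss(*losses)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```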
The trained network is used to evaluate the LaSOT and TLP test sets, the same test sets are also evaluated with other deep-learning target tracking methods, the accuracy and success rate are calculated, and the experimental results are shown in Table 2.
Table 2 comparison of accuracy and success rate obtained with various target tracking methods
As can be seen from the experimental results in Table 2, on LaSOT, ATOM achieves a success rate of 0.501 and an accuracy of 0.506, SiamRPN++ achieves a success rate of 0.494 and an accuracy of 0.491, MDNet achieves a success rate of 0.397 and an accuracy of 0.373, and SPLT achieves a success rate of 0.426 and an accuracy of 0.396, whereas this embodiment achieves the highest success rate of 0.531 and the highest accuracy of 0.523, which are 0.030 and 0.017 higher than the suboptimal ATOM, respectively. On TLP, ATOM achieves a success rate of 0.430 and an accuracy of 0.421, SiamRPN++ achieves a success rate of 0.411 and an accuracy of 0.405, MDNet achieves a success rate of 0.372 and an accuracy of 0.381, and SPLT achieves a success rate of 0.464 and an accuracy of 0.471, whereas this embodiment reaches the highest success rate of 0.522 and the highest accuracy of 0.541, which are 0.058 and 0.070 higher than the suboptimal SPLT, respectively.
From the above analysis, it is known that, in the long video tracking tasks of LaSOT and TLP, the method in the embodiment of the present invention achieves the optimal target tracking performance. In theory, the video length increase does not significantly affect the performance of the method of the embodiment, and the longer the video time span corresponding to the target tracking task, the more obvious the advantages of the embodiment compared with other methods.
Fig. 5 is a block diagram of a long video target tracking system based on annotation frame feature fusion according to an embodiment of the present invention. Referring to fig. 5, the long video object tracking system 500 based on the annotation frame feature fusion includes a fusion and enhancement module 510, a second fusion module 520, a track generation module 530, and a scoring module 540.
The fusion and enhancement module 510 is configured to perform an operation S1, for example, to obtain a labeling frame image and a current search frame image in a long video to be tracked, and perform feature fusion and target feature enhancement processing on the guiding vector of the labeling frame image and the feature map of the search frame image in sequence, so as to obtain an enhanced fusion result.
The second fusion module 520 performs, for example, operation S2, and is configured to input the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set including N candidate frames, where N is greater than or equal to 1, and input the region features of each candidate frame into a classification regression header network after feature fusion is performed on the region features of each candidate frame and the multi-scale labeled frame region features with preset sizes in the labeled frame image, so as to obtain the target frame to be selected and the confidence level.
The track generation module 530, for example, performs operation S3 for mapping the high confidence candidate object boxes into dense feature vectors and generating one or more track segments according to the distribution of the dense feature vectors.
The scoring module 540, for example, performs operation S4, is configured to calculate a track rationality score of each track segment, and determine a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
The long video object tracking system 500 based on the annotation frame feature fusion is used to perform the long video object tracking method based on the annotation frame feature fusion in the embodiments shown in fig. 1-4. For details not yet in this embodiment, please refer to the long video object tracking method based on the feature fusion of the labeling frame in the embodiments shown in fig. 1-4, which is not described herein.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A long video target tracking method based on label frame feature fusion is characterized by comprising the following steps:
s1, acquiring a marked frame image and a current search frame image in a long video to be tracked, and sequentially carrying out feature fusion and target feature enhancement treatment on a guide vector of the marked frame image and a feature image of the search frame image to obtain an enhanced fusion result; extracting features of a feature map level from the labeling frame image to obtain the guide vector, wherein the feature map level l is:
where h_tag is the height of the annotation frame, w_tag is the width of the annotation frame, and find_size is a set comparison size;
s2, inputting the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1, respectively carrying out feature fusion on the regional features of each candidate frame and the multi-scale marked frame regional features with preset sizes in the marked frame image, and inputting the multi-scale marked frame regional features into a classification regression head network to obtain a target frame to be selected and a confidence level;
s3, mapping the target frame to be selected with high confidence coefficient into a dense feature vector, and generating one or more track segments according to the distribution of the dense feature vector;
s4, calculating the track rationality score of each track segment, and determining a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
2. The long video object tracking method based on labeling frame feature fusion according to claim 1, wherein in S1, the Hadamard product operation and convolution processing are sequentially performed on the guide vector and the feature map to perform feature fusion on the guide vector and the feature map.
3. The long video object tracking method based on the labeling frame feature fusion according to claim 1, wherein the object feature enhancement processing in S1 comprises:
the feature graphs after feature fusion in the S1 are parallelManipulation and->The operation is carried out to respectively obtain the corresponding channel weight coefficient y up And weight coefficient y down
y up =sigmoid(W fc (pool(x))⊙x
y down =sigmoid(W d (reduce(x))⊙x
Wherein x is the feature map after feature fusion in S1, pool is pooling layer processing, W fc Is a first convolution layer process, wherein, the operation of the Hadamard product is that the sigmoid is a sigmoid function, the reduction is a compression process, and the method comprises the following steps of d The second convolution layer processing;
respectively combining the fusion result before strengthening with the channel weight coefficient y up And weight coefficient y down And multiplying, and carrying out convolution processing on the sum of the two multiplication results to obtain a fusion result with enhanced target characteristics.
4. The long video object tracking method based on labeling frame feature fusion according to claim 1, wherein the candidate frame set is:
where P_N is the candidate frame set, v_tag is the guide vector, f_search is the feature map of the search frame image, ⊛ denotes the convolution operation, W_1×1 are the weight parameters of the 1×1 convolutional network, δ(·) denotes the feature enhancement processing, and F_RPN(·) is the RPN network.
5. The long video object tracking method based on feature fusion of annotation frames according to claim 1, wherein the feature fusion and input to the classification regression head network in S2 comprises:
performing cross-correlation operation on each candidate frame region feature and the multi-scale labeling frame region feature respectively;
and respectively splicing and aggregating cross-correlation operation results of the region features of each candidate frame according to the channel dimension, and inputting the aggregation results into a classification regression head network to obtain the target frame to be selected and the confidence coefficient.
6. The long video object tracking method based on labeled frame feature fusion according to claim 1, wherein generating one or more track segments according to the distribution of the dense feature vectors in S3 comprises:
when the number of the dense feature vectors is one, adding a target frame to be selected corresponding to the dense feature vectors into a target track segment to generate a track segment;
when the number of the dense feature vectors is a plurality, judging whether a first dense feature vector exists such that the similarity between the first dense feature vector and the target track segment is higher than a first threshold and the similarity between every other dense feature vector except the first dense feature vector and the target track segment is not higher than a second threshold, wherein the second threshold is smaller than the first threshold; if the first dense feature vector exists, adding the target frame to be selected corresponding to the first dense feature vector into the target track segment to generate a track segment; and if the first dense feature vector does not exist, initializing a plurality of new track segments corresponding to the respective target frames to be selected.
7. The long video object tracking method based on the labeling frame feature fusion according to claim 6, wherein calculating the track rationality score in S4 comprises: when there are multiple track segments, calculating the track rationality score of each track segment,

where F_new is the newly added track segment, F_old is a non-inactivated track segment, the spliced-track rationality score of F_old is the score accumulated during tracking before the current moment, s_sim is the similarity score between the newly added target frame to be selected of F_new and the annotated target, w is a weighting coefficient, and link(F_new, F_old) is the maximum difference between the top-left and bottom-right corner coordinates of the target frames at the splice of the newly added track segment and the non-inactivated track segment.
8. The long video object tracking method based on annotation frame feature fusion according to any one of claims 1-7, wherein S1 and S2 are implemented based on a two-stage object tracking network model, and the step S1 is preceded by: training the two-stage target tracking network model by adopting a two-stage joint loss function L, wherein L is:
L = L_1 + L_2

L_1 = L_cls + L_reg

L_2 = 2(L_cls + L_reg)

where L_1 is the candidate-regression-stage loss function, L_2 is the classification-regression-stage loss function, L_cls is the classification loss function, L_reg is the regression loss function, N_cls is the batch size, p_i and p̂_i are, respectively, the label value and the confidence computed for the target frame to be selected, L(p_i, p̂_i) is the confidence loss of the target frame to be selected, N_prop is the number of candidate frames, t_i are the label-frame coordinates, t̂_i are the regressed result-frame coordinates, and L(t_i, t̂_i) is the regression loss.
9. A long video target tracking system based on annotation frame feature fusion, comprising:
the fusion and enhancement module is used for acquiring the marked frame image and the current search frame image in the long video to be tracked, and sequentially carrying out feature fusion and target feature enhancement treatment on the guide vector of the marked frame image and the feature image of the search frame image to obtain an enhanced fusion result;
the fusion and enhancement module is further configured to perform feature extraction of a feature map level on the labeling frame image to obtain the guiding vector, where the feature map level l is:
where h_tag is the height of the annotation frame, w_tag is the width of the annotation frame, and find_size is a set comparison size;
the second fusion module is used for inputting the reinforced fusion result into a candidate frame regression network to obtain a candidate frame set containing N candidate frames, wherein N is more than or equal to 1, and the region features of each candidate frame are respectively subjected to feature fusion with the multi-scale marked frame region features with preset sizes in the marked frame image and then are input into a classification regression head network to obtain a target frame to be selected and the confidence level;
the track generation module is used for mapping the target frame to be selected with high confidence into a dense feature vector and generating one or more track segments according to the distribution of the dense feature vector;
and the scoring module is used for calculating the track rationality score of each track segment and determining a target frame to be selected corresponding to the track segment with the highest track rationality score as a target tracking result in the search frame image.
CN202110787587.5A 2021-07-12 2021-07-12 Long video target tracking method and system based on annotation frame feature fusion Active CN113592906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110787587.5A CN113592906B (en) 2021-07-12 2021-07-12 Long video target tracking method and system based on annotation frame feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110787587.5A CN113592906B (en) 2021-07-12 2021-07-12 Long video target tracking method and system based on annotation frame feature fusion

Publications (2)

Publication Number Publication Date
CN113592906A CN113592906A (en) 2021-11-02
CN113592906B true CN113592906B (en) 2024-02-13

Family

ID=78247111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110787587.5A Active CN113592906B (en) 2021-07-12 2021-07-12 Long video target tracking method and system based on annotation frame feature fusion

Country Status (1)

Country Link
CN (1) CN113592906B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677633B (en) * 2022-05-26 2022-12-02 之江实验室 Multi-component feature fusion-based pedestrian detection multi-target tracking system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN112215080A (en) * 2020-09-16 2021-01-12 电子科技大学 Target tracking method using time sequence information
CN112418108A (en) * 2020-11-25 2021-02-26 西北工业大学深圳研究院 Remote sensing image multi-class target detection method based on sample reweighing
CN112541468A (en) * 2020-12-22 2021-03-23 中国人民解放军国防科技大学 Target tracking method based on dual-template response fusion
CN112541508A (en) * 2020-12-21 2021-03-23 山东师范大学 Fruit segmentation and recognition method and system and fruit picking robot

Also Published As

Publication number Publication date
CN113592906A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant