CN116168216B - Single-target tracking method based on scene prompt - Google Patents

Single-target tracking method based on scene prompt

Info

Publication number
CN116168216B
CN116168216B CN202310430642.4A
Authority
CN
China
Prior art keywords
target
prompt
scene
scene prompt
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310430642.4A
Other languages
Chinese (zh)
Other versions
CN116168216A (en)
Inventor
张天柱
马银超
尉前进
何建峰
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310430642.4A priority Critical patent/CN116168216B/en
Publication of CN116168216A publication Critical patent/CN116168216A/en
Application granted granted Critical
Publication of CN116168216B publication Critical patent/CN116168216B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure provides a single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising: determining a target template image containing the target and a search area image, and dividing each into blocks; linearly mapping the divided target template image and search area image to obtain the corresponding target template image features and search area image features; inputting the target template image features and search area image features into a scene-prompt visual Transformer, and carrying out feature interaction and enhancement under the guidance of dynamically acquired scene prompts; regressing the target box from the search area features enhanced by the scene-prompt visual Transformer, and estimating the quality of the target box; and storing, by the tracker, the features of tracked frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts using the features stored in the memory.

Description

Single-target tracking method based on scene prompt
Technical Field
The disclosure relates to the technical fields of computer vision, artificial intelligence and image processing, and in particular to a single-target tracking method based on scene prompts.
Background
Visual single-target tracking is a fundamental research topic in the field of computer vision. Given the target box of the first frame, the goal is to automatically locate the target in subsequent frames. Single-target tracking has wide applications in autonomous driving, intelligent surveillance and human-computer interaction.
A single-target tracking method locates the target using a template cropped from the first frame and a search area cropped from the current frame according to the result of the previous frame. Single-target trackers can be classified into dual-stream and single-stream trackers. Most existing methods adopt a dual-stream tracking strategy that extracts template and search-region features independently, so the extracted features cannot interact to perceive the target, which limits the performance of the model. Recently, single-stream methods have been proposed to jointly extract template and search-region features; these methods generally realize the interaction between the template and the search region through an attention mechanism, so that the features of the target can be enhanced. However, the attention mechanism establishes relationships between pixels indiscriminately, so some complex background may be falsely enhanced, affecting the accuracy of the tracker.
Disclosure of Invention
Based on the above problems, the present disclosure provides a single-target tracking method based on scene prompts, so as to alleviate the above technical problems in the prior art.
(I) Technical solution
The present disclosure provides a single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising: determining a target template image containing the target and a search area image, and dividing each into blocks; linearly mapping the divided target template image and search area image to obtain the corresponding target template image features and search area image features; inputting the target template image features and search area image features into a scene-prompt visual Transformer, and carrying out feature interaction and enhancement under the guidance of dynamically acquired scene prompts; regressing the target box from the search area features enhanced by the encoder, and estimating the quality of the target box; and storing, by the tracker, the features of tracked frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts using the features stored in the memory.
According to embodiments of the present disclosure, the scene prompts are dynamically acquired from the video spatiotemporal context during tracking by a scene prompt generator. The scene prompts include target prompts and background prompts.
According to an embodiment of the present disclosure, the target box is regressed by the target estimation head using the search area features enhanced by the scene-prompt visual Transformer, and the quality of the target box is estimated using an intersection-over-union (IoU) regression head.
According to an embodiment of the present disclosure, the scene-prompt visual Transformer includes a 12-layer scene prompt encoder.
According to an embodiment of the present disclosure, each layer of the scene prompt encoder includes: a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
According to an embodiment of the present disclosure, the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, using scene knowledge to suppress complex backgrounds.
According to an embodiment of the present disclosure, the scene prompt generator divides the target region features into target features and background features according to the target box, and introduces a plurality of target prototypes and background prototypes that interact with the target features and background features, respectively, through a cross-attention mechanism.
According to an embodiment of the present disclosure, prompt learning is guided through a diversity loss, and diversity is ensured by increasing the cosine distance between prompts.
According to an embodiment of the present disclosure, the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map, respectively; the labels of the classification score map are generated by a Gaussian kernel, learning of the classification score map is constrained by a weighted focal loss function, and learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; the IoU regression head is used to estimate the IoU between the predicted box and the real box, and learning of the IoU score is constrained by a mean squared error loss function.
(II) Advantageous effects
From the above technical solution, it can be seen that the single-target tracking method based on scene prompts of the present disclosure has at least one or some of the following advantages:
(1) Prompts for the tracking scene can be dynamically acquired from the spatiotemporal context during tracking, and a diversity loss guides the model to learn diverse and comprehensive scene knowledge;
(2) The scene prompt modulator embeds the scene prompts into the attention mechanism to guide scene-aware feature learning, enhancing the discrimination of the features and effectively improving target tracking accuracy in complex background scenes.
Drawings
Fig. 1 is a schematic diagram of the overall architecture of the single-target tracking method based on scene prompts according to an embodiment of the disclosure.
Fig. 2 is a schematic diagram of the architecture of the scene prompt generator according to an embodiment of the disclosure.
Fig. 3 is a flowchart of the single-target tracking method based on scene prompts according to an embodiment of the disclosure.
Detailed Description
The present disclosure provides a single-target tracking method based on scene prompts, which can fully mine the spatiotemporal information of the tracking process and generate prompts containing knowledge of the tracking scene. Based on the scene prompts, the method designs a new attention mechanism that guides the interaction among pixels through the scene prompts so as to suppress the influence of complex backgrounds on tracking, effectively improving tracking accuracy in complex scenes. Thus, the scene prompt-based single-target tracking method can use the scene prompts adaptively generated during tracking to guide the attention mechanism for interaction between pixels, suppress complex backgrounds, and realize robust tracking.
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
In an embodiment of the present disclosure, a single-target tracking method based on scene prompts is provided, which dynamically tracks a target in a video image; as shown in Figs. 1 to 3, the method includes:
S1, determining a target template image containing the target and a search area image, and dividing each into blocks;
S2, linearly mapping the divided target template image and search area image to obtain the corresponding target template image features and search area image features;
S3, inputting the target template image features and search area image features into the scene-prompt visual Transformer, and carrying out feature interaction and enhancement under the guidance of dynamically acquired scene prompts;
S4, regressing the target box from the search area features enhanced by the scene-prompt visual Transformer, and estimating the quality of the target box; and
S5, storing, by the tracker, the features of tracked frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts using the features stored in the memory.
According to embodiments of the present disclosure, the scene prompts are dynamically acquired from the video spatiotemporal context during tracking by the scene prompt generator. The scene prompts include target prompts and background prompts.
According to an embodiment of the present disclosure, the target box is regressed by the target estimation head using the search area features enhanced by the scene-prompt visual Transformer, and the quality of the target box is estimated using an intersection-over-union (IoU) regression head.
According to embodiments of the present disclosure, the scene-prompt visual Transformer includes a 12-layer scene prompt encoder (each layer is also referred to as an encoding layer). Each layer of the scene prompt encoder includes: a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
According to an embodiment of the present disclosure, the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, using scene knowledge to suppress complex backgrounds.
According to an embodiment of the present disclosure, the scene prompt generator divides the target region features into target features and background features according to the target box, and introduces a plurality of target prototypes and background prototypes that interact with the target features and background features, respectively, through a cross-attention mechanism.
According to an embodiment of the present disclosure, prompt learning is guided through a diversity loss, and diversity is ensured by increasing the cosine distance between prompts.
According to an embodiment of the present disclosure, the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map, respectively; the labels of the classification score map are generated by a Gaussian kernel, learning of the classification score map is constrained by a weighted focal loss function, and learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; the IoU regression head is used to estimate the IoU between the predicted box and the real box, and learning of the IoU score is constrained by a mean squared error loss function.
In an embodiment of the present disclosure, a single-target tracking method based on scene prompts is provided. The method mainly comprises a visual Transformer based on scene prompts and a target estimation head. The visual Transformer based on scene prompts contains two innovative modules: (1) a scene prompt generator; and (2) a scene prompt modulator. The scene prompt generator adaptively generates prompts related to the current tracking scene using scene information gathered during tracking. The template and search area images are divided into blocks and mapped to obtain the corresponding image features, which are input into the visual Transformer based on scene prompts for feature interaction and enhancement. Here, the Transformer consists of 12 encoding layers, each mainly including a scene prompt modulator, an attention mechanism, and a multi-layer perceptron. The scene prompt modulator uses the scene prompts to guide the attention mechanism for interaction between pixels in the encoder, so that scene knowledge can be used to suppress complex backgrounds. The target estimation head uses the search region features enhanced by the scene-prompt visual Transformer to regress the target box and estimate its quality. The tracker stores the features of tracked frames with good target box quality in the memory, and when a given prompt update interval is reached, the scene prompt generator generates new scene prompts using the features stored in the memory.
In an embodiment of the present disclosure, as shown in Fig. 1, the present disclosure takes as input to the tracker a pair of images comprising a target template image $Z$ and a search area image $X$. The two images are respectively divided into image blocks and flattened, yielding the flattened two-dimensional image blocks $Z_p$ and $X_p$, which are spliced together; here $P \times P$ is the image block resolution, and $N_z$ and $N_x$ are the numbers of target template image blocks and search area image blocks, respectively. Thereafter, the present disclosure maps the flattened two-dimensional image blocks to dimension $C$ by linear mapping. Learnable position encodings $E_z$ and $E_x$ are added to the corresponding target template image blocks and search area image blocks to obtain the target template features $F_z$ and search area features $F_x$; for example, the numbers 1-9 in Fig. 1 represent different position encodings. $F_z$ and $F_x$ are spliced together and input into the scene prompt encoder to obtain the updated template features $\hat{F}_z$ and search area features $\hat{F}_x$. Further, the present disclosure inputs the updated search area features $\hat{F}_x$ into the target box regression head and the IoU (intersection-over-union) score head to estimate the target position and evaluate the quality of the box. If the IoU score of the box is greater than 0.8, the present disclosure saves the features output by each of the $N$ coding layers for the current search area image in the memory, where $N$ is the number of coding layers. When the update interval is reached, the scene prompt generator generates scene prompts (including target prompts and background prompts) using the features stored in memory. Further, the scene prompt modulator in each coding layer can use the scene prompts to direct the attention mechanism to suppress complex backgrounds, so that the tracker can realize robust tracking.
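As an illustrative aid, the block-division, linear-mapping and position-encoding front end described above can be sketched in PyTorch as follows; the patch size, feature dimension, image sizes and module names here are assumptions for illustration, not values fixed by the present disclosure:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Splits an image into P x P blocks, flattens them, and linearly maps them
    # to C dimensions; a stride-P convolution is equivalent to per-block
    # flattening followed by a shared linear projection.
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, img):                      # img: (B, 3, H, W)
        x = self.proj(img)                       # (B, C, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, C) flattened blocks

embed = PatchEmbed()
z = torch.randn(1, 3, 128, 128)                  # target template image
x = torch.randn(1, 3, 256, 256)                  # search area image
f_z, f_x = embed(z), embed(x)                    # N_z = 64, N_x = 256 blocks
pos_z = nn.Parameter(torch.zeros(1, f_z.shape[1], 768))  # learnable position encodings
pos_x = nn.Parameter(torch.zeros(1, f_x.shape[1], 768))
tokens = torch.cat([f_z + pos_z, f_x + pos_x], dim=1)    # joint encoder input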
In an embodiment of the present disclosure, as shown in Fig. 1, the present disclosure dynamically acquires scene prompts from the video spatiotemporal context during tracking by means of the scene prompt generator. Specifically, for each layer of the encoder, the scene prompt generator outputs scene prompts comprising target prompts $P_t \in \mathbb{R}^{N_t \times C}$ and background prompts $P_b \in \mathbb{R}^{N_b \times C}$. Here, $N_t$ represents the number of target prompts, $N_b$ represents the number of background prompts, and $C$ represents the dimension of the prompt features. As shown in Fig. 2, given the features $F^i$ of the $i$-th layer encoder for the historical frames in memory, with $M$ representing the number of features stored in the memory, the features are divided according to the corresponding predicted boxes into target features (also called foreground features) $F^i_t$ and background features $F^i_b$. The present disclosure introduces a plurality of target prototypes $Q_t$ and background prototypes $Q_b$, which interact with the target features $F^i_t$ and background features $F^i_b$, respectively, through a cross-attention mechanism, yielding the target prompts $P^i_t$ and background prompts $P^i_b$:

$$P^i_t = \mathrm{softmax}\!\left(\frac{Q_t (F^i_t)^{\top}}{\sqrt{d}}\right) F^i_t, \qquad P^i_b = \mathrm{softmax}\!\left(\frac{Q_b (F^i_b)^{\top}}{\sqrt{d}}\right) F^i_b$$

where $\sqrt{d}$ is a scaling factor and $(\cdot)^{\top}$ is the matrix transpose operation. In addition, the present disclosure uses a diversity loss to guide prompt learning in order to obtain diverse and comprehensive scene knowledge. Given the target prompts $P^i_t$ and background prompts $P^i_b$ of the $i$-th layer, diversity is ensured by increasing the cosine distance between the prompts:

$$\mathcal{L}^i_{div} = \lambda_1 \sum_{k \neq m} \Big[ \cos\big(P^i_{t,k}, P^i_{t,m}\big) + \cos\big(P^i_{b,k}, P^i_{b,m}\big) \Big] + \lambda_2 \sum_{k,m} \cos\big(P^i_{t,k}, P^i_{b,m}\big)$$

where $\mathcal{L}^i_{div}$ is the diversity loss, $\cos(\cdot,\cdot)$ denotes cosine similarity, $\lambda_1$ represents the intra-class diversity loss weight, $\lambda_2$ represents the inter-class diversity loss weight, $i$ indicates the layer of the corresponding scene prompt encoder, $k$ and $m$ are indicator indices, $P^i_{t,k}$ is the $k$-th target prompt of the $i$-th layer encoder, $P^i_{t,m}$ is the $m$-th target prompt of the $i$-th layer encoder, $P^i_{b,k}$ is the $k$-th background prompt of the $i$-th layer encoder, and $P^i_{b,m}$ is the $m$-th background prompt of the $i$-th layer encoder.
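To make the diversity objective concrete, the following minimal PyTorch sketch treats the loss as a weighted sum of pairwise cosine similarities (minimizing similarity increases the cosine distance); the weights and prompt counts are illustrative assumptions:

import torch
import torch.nn.functional as F

def diversity_loss(p_t, p_b, lam_intra=1.0, lam_inter=1.0):
    # p_t: (N_t, C) target prompts; p_b: (N_b, C) background prompts.
    p_t = F.normalize(p_t, dim=-1)
    p_b = F.normalize(p_b, dim=-1)

    def intra(p):
        sim = p @ p.t()                              # pairwise cosine similarity
        off_diag = sim - torch.eye(len(p))           # drop self-similarity terms
        return off_diag.sum() / (len(p) * (len(p) - 1))

    inter = (p_t @ p_b.t()).mean()                   # target-vs-background terms
    # Lower similarity = larger cosine distance = more diverse prompts.
    return lam_intra * (intra(p_t) + intra(p_b)) + lam_inter * inter

loss = diversity_loss(torch.randn(4, 768), torch.randn(4, 768))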
Regarding the scene prompt modulator: given the input $F^{i-1} = \big[F^{i-1}_z; F^{i-1}_x\big]$ to the $i$-th layer encoder, where $F^{i-1}_z$ is the target template feature output by the $(i-1)$-th layer encoder and $F^{i-1}_x$ is the search area feature output by the $(i-1)$-th layer encoder, the basic keys $K$, queries $Q$ and values $V$ can be obtained through linear mapping:

$$K = \mathrm{LN}\big(F^{i-1}\big) W_k, \qquad Q = \mathrm{LN}\big(F^{i-1}\big) W_q, \qquad V = \mathrm{LN}\big(F^{i-1}\big) W_v$$

where $\mathrm{LN}(\cdot)$ represents layer normalization, $W_k$ are the key mapping parameters, $W_q$ are the query mapping parameters, and $W_v$ are the value mapping parameters. In a naive Transformer, the keys $K$, queries $Q$ and values $V$ are taken as the input to multi-head attention, which builds dependencies between all pixels indiscriminately. Such an operation may cause a complex background to be falsely enhanced, thereby limiting the discrimination of the features. For this purpose, the present disclosure uses the dynamically acquired scene prompts $P_t$ and $P_b$ to generate, for each pixel, a corresponding adaptive prompt that contains scene-specific knowledge to discriminate against the background. Given $G = [G_z; G_x] \in \mathbb{R}^{(N_z + N_x) \times C}$, where $N_z$ is the number of target template image blocks, the adaptive prompts comprise the template adaptive prompts $G_z$ and the search area adaptive prompts $G_x$: the template adaptive prompts assign the target prompts to template pixels inside the target area $B_z$ and the background prompts to template pixels outside it, while each search area pixel obtains its adaptive prompt by attending over the concatenated scene prompts $[P_t; P_b]$, where $B_z$ represents the target area in the template image.
Thereafter, the adaptive prompts $G$ are mapped to modulate the keys and queries as follows:

$$\hat{K} = K + G W^g_k, \qquad \hat{Q} = Q + G W^g_q$$

where $\hat{K}$ is the key modulated with the prompts, $\hat{Q}$ is the query modulated with the prompts, $W^g_k, W^g_q \in \mathbb{R}^{C \times d}$ are the linear mapping parameters, $C$ is the feature dimension of the adaptive prompts, $d$ is the feature dimension of the keys and queries, and $\mathbb{R}$ represents the real number domain. In this way, the scene prompts can instruct the attention mechanism to learn scene-aware features to discriminate the background. Finally, the output $F^i$ of the $i$-th layer encoder can be expressed as:

$$F' = F^{i-1} + \mathrm{Attn}\big(\hat{Q}, \hat{K}, V\big), \qquad F^i = F' + \mathrm{MLP}\big(\mathrm{LN}(F')\big)$$

where $F'$ is the intermediate feature and $V$ is the value.
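A condensed PyTorch sketch of one scene prompt encoder layer follows. It assumes additive modulation of the keys and queries by the adaptive prompts, derives each pixel's adaptive prompt by attending over the scene prompts, and uses single-head attention for brevity; these are illustrative assumptions rather than the exact configuration of the disclosure:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScenePromptLayer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.mod_q = nn.Linear(dim, dim)       # maps adaptive prompts onto queries
        self.mod_k = nn.Linear(dim, dim)       # maps adaptive prompts onto keys
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.scale = dim ** -0.5

    def forward(self, f, prompts):
        # f: (B, N, C) template+search tokens; prompts: (B, N_t + N_b, C).
        q, k, v = self.to_qkv(self.norm1(f)).chunk(3, dim=-1)
        # Per-pixel adaptive prompt: each token attends over the scene prompts.
        g = F.softmax(q @ prompts.transpose(1, 2) * self.scale, dim=-1) @ prompts
        q_hat = q + self.mod_q(g)              # prompt-modulated query
        k_hat = k + self.mod_k(g)              # prompt-modulated key
        attn = F.softmax(q_hat @ k_hat.transpose(1, 2) * self.scale, dim=-1)
        f = f + attn @ v                       # scene-guided attention, residual
        return f + self.mlp(self.norm2(f))     # feed-forward with residual

layer = ScenePromptLayer()
out = layer(torch.randn(1, 320, 768), torch.randn(1, 8, 768))  # 64+256 tokens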
With respect to the target estimation head, as shown in Fig. 1, the updated search region features $\hat{F}_x$ are input into the target box regression head and the IoU regression head (or score head) to perform target state estimation and box quality estimation.
The target box regression head (or target box prediction head) consists of a three-branch fully convolutional network and outputs a classification score map $S \in \mathbb{R}^{\frac{H_x}{p} \times \frac{W_x}{p}}$, an offset map $O \in \mathbb{R}^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$ and a normalized size map $W \in \mathbb{R}^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$, where $H_x \times W_x$ is the search area image size and $p$ represents the block size. Given the position of the maximum classification score $(x_c, y_c) = \arg\max_{(x, y)} S(x, y)$, where $(x, y)$ represents coordinates, the bounding box $b = (x, y, w, h)$ of the target can be expressed as:

$$b = \big(x_c + O_x(x_c, y_c),\; y_c + O_y(x_c, y_c),\; W_w(x_c, y_c),\; W_h(x_c, y_c)\big)$$

where $(x, y)$ are the coordinates of the top-left corner point of the target and $(w, h)$ are the width and height of the target, respectively.
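For clarity, decoding a box from the three output maps can be sketched as follows; the helper name and the peak-plus-offset decoding shown mirror the description above, under the stated map shapes:

import torch

def decode_box(score, offset, size, p=16):
    # score: (Hs, Ws); offset: (2, Hs, Ws); size: (2, Hs, Ws), normalized.
    hs, ws = score.shape
    idx = int(score.flatten().argmax())
    yc, xc = divmod(idx, ws)                    # grid cell with the peak score
    x = (xc + float(offset[0, yc, xc])) * p     # refine with the sub-cell offset
    y = (yc + float(offset[1, yc, xc])) * p
    w = float(size[0, yc, xc]) * ws * p         # un-normalize by search-area size
    h = float(size[1, yc, xc]) * hs * p
    return x, y, w, h

box = decode_box(torch.rand(16, 16), torch.rand(2, 16, 16), torch.rand(2, 16, 16))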
The labels of the classification score map are generated by a Gaussian kernel:

$$S^{gt}(x, y) = \exp\!\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right)$$

where $S^{gt}$ is the classification score map label, $(x_0, y_0)$ is the true center of the target, and $\sigma$ is a standard deviation adaptive to the target size. A weighted focal loss $\mathcal{L}_{cls}$ is used to constrain the classification map, and a mean absolute error loss ($L_1$ loss) $\mathcal{L}_1$ and a generalized IoU loss $\mathcal{L}_{GIoU}$ are used to constrain the target bounding box, as follows:

$$\mathcal{L}_{box} = \lambda_1\, \mathcal{L}_1\big(b^{gt}, b\big) + \lambda_{GIoU}\, \mathcal{L}_{GIoU}\big(b^{gt}, b\big)$$

where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{box}$ is the box loss, $\lambda_1$ is the weight of the mean absolute error loss function, $\lambda_{GIoU}$ is the weight of the generalized IoU loss, $S^{gt}$ and $S$ are the classification score map label and the predicted classification score map, respectively, and $b^{gt}$ and $b$ are the real target box and the predicted target box, respectively.
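The Gaussian label generation admits a direct sketch; the particular size-adaptive choice of the standard deviation below is an illustrative assumption:

import torch

def gaussian_label(hs, ws, cx, cy, w, h):
    # Gaussian kernel centred on the true target centre (cx, cy), in grid units.
    sigma = 0.1 * (w * h) ** 0.5                 # size-adaptive std (illustrative)
    ys = torch.arange(hs).float().view(-1, 1)
    xs = torch.arange(ws).float().view(1, -1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

label = gaussian_label(16, 16, cx=8.0, cy=8.0, w=4.0, h=6.0)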
The IoU regression head is used to estimate the IoU between the predicted box and the real box. Specifically, the present disclosure samples boxes $b_s$ with different IoUs by jittering the real box. Then, according to the sampled box $b_s$, the target features $F_b$ are cropped from the features after interaction:

$$F_b = \mathrm{PrPool}\big(\hat{F}_x, b_s\big)$$

where $\mathrm{PrPool}(\cdot)$ is the Precise RoI Pooling operation, used to crop the target features from the search area. The present disclosure introduces an IoU score token $t_{IoU}$, which interacts with the cropped target features $F_b$ through an attention mechanism to obtain the updated IoU score token $\hat{t}_{IoU}$. Subsequently, $\hat{t}_{IoU}$ is input into a multi-layer perceptron to obtain the estimated IoU score $s$ between the predicted box and the real box:

$$s = \mathrm{MLP}\big(\hat{t}_{IoU}\big)$$
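A sketch of the IoU score head follows; torchvision's roi_align is used here as a stand-in for the Precise RoI Pooling operation named above, and the crop size, head count and token dimensions are illustrative assumptions:

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class IoUScoreHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable IoU score token
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, f_x, boxes):
        # f_x: (B, C, Hs, Ws) search features; boxes: (B, 4) as (x1, y1, x2, y2).
        batch_idx = torch.arange(len(boxes), dtype=f_x.dtype).unsqueeze(1)
        rois = torch.cat([batch_idx, boxes], dim=1)        # (B, 5) with batch index
        f_b = roi_align(f_x, rois, output_size=4)          # crop target features
        f_b = f_b.flatten(2).transpose(1, 2)               # (B, 16, C)
        tok = self.token.expand(len(f_b), -1, -1)
        tok, _ = self.attn(tok, f_b, f_b)                  # token attends to the crop
        return self.mlp(tok.squeeze(1))                    # estimated IoU score

head = IoUScoreHead()
score = head(torch.randn(2, 768, 16, 16), torch.tensor([[2., 2., 10., 10.]] * 2))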
further, the present disclosure utilizes mean square loss to constrain the cross-ratio score,
wherein, the liquid crystal display device comprises a liquid crystal display device,a score loss of IoU. Loss of training overall->The expression is as follows:
wherein, the liquid crystal display device comprises a liquid crystal display device,is the firstiThe diversity of the layer scene prompt is lost.
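For completeness, the overall objective can be assembled as in the following sketch; the argument names and example values are illustrative only:

def total_loss(l_cls, l_box, l_iou, l_div_per_layer):
    # Classification + box regression + IoU-score regression
    # + the per-layer scene prompt diversity terms.
    return l_cls + l_box + l_iou + sum(l_div_per_layer)

example = total_loss(0.52, 0.31, 0.08, [0.01, 0.02, 0.015])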
Thus, embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. It should be noted that implementations not shown or described in the drawings or in the text of the specification are all forms known to those of ordinary skill in the art and are not described in detail. Furthermore, the above definitions of the elements and methods are not limited to the specific structures, shapes or modes mentioned in the embodiments, which may be simply modified or replaced by those of ordinary skill in the art.
From the above description, those skilled in the art should have a clear understanding of the single-target tracking method based on scene prompts of the present disclosure.
In summary, the present disclosure provides a single-target tracking method based on scene prompts, in which the scene prompt generator fully mines spatiotemporal scene information during tracking and generates prompts containing knowledge of the tracking scene. Based on the scene prompts, the scene prompt modulator is designed to guide the attention-based interaction among pixels with scene knowledge, thereby enhancing the discrimination of the features, suppressing the influence of complex backgrounds on tracking, and effectively improving tracking accuracy in complex scenes. The method can be widely applied in scenarios such as autonomous driving, intelligent surveillance and human-computer interaction; it can run on embedded and mobile devices to provide real-time tracking results, and can be deployed on large-scale computing servers to provide real-time, accurate target localization and tracking services for a large number of users.
It should also be noted that the foregoing describes various embodiments of the present disclosure. These examples are provided to illustrate the technical content of the present disclosure, and are not intended to limit the scope of the claims of the present disclosure. A feature of one embodiment may be applied to other embodiments by suitable modifications, substitutions, combinations, and separations.
It should be noted that in this document, having "an" element is not limited to having a single element, but may have one or more elements unless specifically indicated.
In this context, the expression feature A "or" (or) "and/or" (and/or) feature B, unless specifically indicated, means that A is present alone, B is present alone, or both A and B are present; the expression feature A "and" (and) feature B means that A and B coexist; and the terms "comprising", "including", "having" and "containing" are to be understood as open-ended rather than limiting.
Furthermore, unless specifically stated or unless the steps must necessarily occur in a certain sequence, the order of the above steps is not limited to that listed above and may be changed or rearranged according to the desired design. In addition, the above embodiments may be combined with one another or with other embodiments based on design and reliability considerations, i.e., the technical features of different embodiments may be freely combined to form further embodiments.
While the foregoing embodiments have been described in some detail for purposes of clarity of understanding, it will be understood that the foregoing embodiments are merely illustrative of the invention and are not intended to limit the invention, and that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising:
determining a target template image containing the target and a search area image, and dividing each into blocks;
linearly mapping the divided target template image and search area image to obtain corresponding target template image features and search area image features;
inputting the target template image features and search area image features into a scene-prompt visual Transformer, and carrying out feature interaction and enhancement under the guidance of dynamically acquired scene prompts;
regressing the target box from the search area features enhanced by the scene-prompt visual Transformer, and estimating the quality of the target box; and
storing, by the tracker, the features of tracked frames with good target box quality in a memory, wherein when a given prompt update interval is reached, a scene prompt generator generates new scene prompts using the features stored in the memory.
2. The single-target tracking method based on scene prompts according to claim 1, wherein the scene prompts are dynamically acquired from the video spatiotemporal context during tracking by the scene prompt generator.
3. The single-target tracking method based on scene prompts according to claim 2, wherein the scene prompts include target prompts and background prompts.
4. The single-target tracking method based on scene prompts according to claim 1, wherein the target box is regressed by a target estimation head using the search area features enhanced by the scene-prompt visual Transformer, and the quality of the target box is estimated using an intersection-over-union (IoU) regression head.
5. The single-target tracking method based on scene prompts according to claim 1, wherein the scene-prompt visual Transformer includes a 12-layer scene prompt encoder.
6. The single-target tracking method based on scene prompts according to claim 5, wherein each layer of the scene prompt encoder includes: a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
7. The single-target tracking method based on scene prompts according to claim 6, wherein the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction between pixels in the encoder, using scene knowledge to suppress complex backgrounds.
8. The single-target tracking method based on scene prompts according to claim 1, wherein the scene prompt generator divides the target region features into target features and background features according to the target box, and introduces a plurality of target prototypes and background prototypes that interact with the target features and background features, respectively, through a cross-attention mechanism.
9. The single-target tracking method based on scene prompts according to claim 8, wherein prompt learning is guided through a diversity loss, and diversity is ensured by increasing the cosine distance between prompts.
10. The single-target tracking method based on scene prompts according to claim 4, wherein the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map, respectively; labels of the classification score map are generated by a Gaussian kernel, learning of the classification score map is constrained by a weighted focal loss function, and learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; and the IoU regression head is used to estimate the IoU between the predicted box and the real box, and learning of the IoU score is constrained by a mean squared error loss function.
CN202310430642.4A 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt Active CN116168216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310430642.4A CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310430642.4A CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Publications (2)

Publication Number Publication Date
CN116168216A (en) 2023-05-26
CN116168216B (en) 2023-07-18

Family

ID=86422196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310430642.4A Active CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Country Status (1)

Country Link
CN (1) CN116168216B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992404A (en) * 2019-12-23 2020-04-10 驭势科技(南京)有限公司 Target tracking method, device and system and storage medium
CN114255514A (en) * 2021-12-27 2022-03-29 厦门美图之家科技有限公司 Human body tracking system and method based on Transformer and camera device
CN114821390A (en) * 2022-03-17 2022-07-29 齐鲁工业大学 Twin network target tracking method and system based on attention and relationship detection
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7657059B2 (en) * 2003-08-08 2010-02-02 Lockheed Martin Corporation Method and apparatus for tracking an object
CN110033478A (en) * 2019-04-12 2019-07-19 北京影谱科技股份有限公司 Visual target tracking method and device based on depth dual training
CN114821622B (en) * 2022-03-10 2023-07-21 北京百度网讯科技有限公司 Text extraction method, text extraction model training method, device and equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992404A (en) * 2019-12-23 2020-04-10 驭势科技(南京)有限公司 Target tracking method, device and system and storage medium
CN114255514A (en) * 2021-12-27 2022-03-29 厦门美图之家科技有限公司 Human body tracking system and method based on Transformer and camera device
CN114821390A (en) * 2022-03-17 2022-07-29 齐鲁工业大学 Twin network target tracking method and system based on attention and relationship detection
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking";Ning Wang.et al;《arXiv:2103.11681v2》;全文 *
"Visual tracking via adaptive structural local sparse appearance model";Xu Jia.et al;《IEEE》;全文 *
"基于深度学习的单目标跟踪算法综述";王红涛等;《计算机系统应用》;第5卷(第31期);全文 *

Also Published As

Publication number Publication date
CN116168216A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
RU2691214C1 (en) Text recognition using artificial intelligence
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
CN113822951B (en) Image processing method, device, electronic equipment and storage medium
CN112016569A (en) Target detection method, network, device and storage medium based on attention mechanism
KR20230045098A (en) Video detection method, device, electronic device and storage medium
CN114821326A (en) Method for detecting and identifying dense weak and small targets in wide remote sensing image
CN109977963A (en) Image processing method, unit and computer-readable medium
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN109697236A (en) A kind of multi-medium data match information processing method
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN114565035A (en) Tongue picture analysis method, terminal equipment and storage medium
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN116168216B (en) Single-target tracking method based on scene prompt
Ke et al. Dense small face detection based on regional cascade multi‐scale method
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN116229584A (en) Text segmentation recognition method, system, equipment and medium in artificial intelligence field
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN114913339A (en) Training method and device of feature map extraction model
Shi et al. AdaFI-FCN: an adaptive feature integration fully convolutional network for predicting driver’s visual attention
Mangla et al. A novel key-frame selection-based sign language recognition framework for the video data
CN114332884B (en) Document element identification method, device, equipment and storage medium
KR20150073409A (en) Apparatus and method for near duplicate video clip detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant