CN116168216A - Single-target tracking method based on scene prompt - Google Patents

Single-target tracking method based on scene prompt

Info

Publication number
CN116168216A
CN116168216A (application number CN202310430642.4A)
Authority
CN
China
Prior art keywords
target
prompt
scene
scene prompt
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310430642.4A
Other languages
Chinese (zh)
Other versions
CN116168216B (en)
Inventor
张天柱
马银超
尉前进
何建峰
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310430642.4A priority Critical patent/CN116168216B/en
Publication of CN116168216A publication Critical patent/CN116168216A/en
Application granted granted Critical
Publication of CN116168216B publication Critical patent/CN116168216B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present disclosure provides a single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising: determining a target template image containing the target and a search area image, and partitioning both into blocks; linearly mapping the partitioned target template image and search area image to obtain the corresponding target template image features and search area image features; inputting the target template image features and the search area image features into a scene-prompted vision transformer, and carrying out feature interaction and enhancement under the action of dynamically acquired scene prompts; regressing the target box from the search area features enhanced by the scene-prompted vision transformer, and estimating the quality of the target box; and storing, by the tracker, the features of tracking frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts from the features stored in the memory.

Description

Single-target tracking method based on scene prompt
Technical Field
The disclosure relates to the technical fields of computer vision, artificial intelligence and image processing, in particular to a single-target tracking method based on scene prompt.
Background
Visual single-target tracking is a fundamental research topic in the field of computer vision. Given the target box in the first frame, the goal is to automatically locate the target in subsequent frames. Single-target tracking has wide application in autonomous driving, intelligent surveillance and human-computer interaction.
A single-target tracking method locates the target using a template cropped from the first frame and a search area cropped from the current frame according to the result of the previous frame. Single-target trackers can be classified into dual-stream and single-stream trackers. Most existing methods adopt a dual-stream strategy that extracts template and search area features independently, so the extracted features cannot interact to perceive the target, which limits the performance of the model. Recently, single-stream methods have been proposed to jointly extract template and search area features; these methods generally realize the interaction between the template and the search area through an attention mechanism, so that the features of the target can be enhanced. However, the attention mechanism establishes relationships between pixels indiscriminately, so some complex backgrounds may be falsely enhanced, affecting the accuracy of the tracker.
Disclosure of Invention
Based on the above problems, the present disclosure provides a single-target tracking method based on scene prompts, so as to alleviate the above technical problems in the prior art.
(I) Technical scheme
The present disclosure provides a single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising: determining a target template image containing the target and a search area image, and partitioning both into blocks; linearly mapping the partitioned target template image and search area image to obtain the corresponding target template image features and search area image features; inputting the target template image features and the search area image features into a scene-prompted vision transformer, and carrying out feature interaction and enhancement under the action of dynamically acquired scene prompts; regressing the target box from the search area features enhanced by the encoder, and estimating the quality of the target box; and storing, by the tracker, the features of tracking frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts from the features stored in the memory.
According to embodiments of the present disclosure, scene prompts are dynamically acquired from the video spatio-temporal context during tracking by a scene prompt generator. The scene prompts include target prompts and background prompts.
According to embodiments of the present disclosure, the target box is regressed by the target estimation head using the search area features enhanced by the scene-prompted vision transformer, and the quality of the target box is estimated by an IoU (intersection-over-union) regression head.
The scene-prompted vision transformer includes a 12-layer scene prompt encoder.
According to embodiments of the present disclosure, each scene prompt encoder layer includes a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
According to embodiments of the present disclosure, the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, using scene knowledge to suppress complex backgrounds.
According to embodiments of the present disclosure, the scene prompt generator divides the search region features into target features and background features according to the target box, and introduces several target prototypes and background prototypes that interact with the target features and the background features, respectively, through a mutual attention mechanism.
According to embodiments of the present disclosure, prompt learning is guided by a diversity loss, which ensures diversity by increasing the cosine distance between prompts.
According to embodiments of the present disclosure, the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map; the labels of the classification score map are generated by a Gaussian kernel, learning of the classification score map is constrained by a weighted focal loss, and learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; the IoU regression head is used to estimate the IoU between the predicted box and the real box, and learning of the IoU score is constrained by a mean square loss.
(II) Advantageous effects
From the above technical solution, it can be seen that the scene-prompt-based single-target tracking method of the present disclosure has at least some of the following advantages:
(1) Prompts for the tracking scene can be dynamically obtained from the spatio-temporal context during tracking, and a diversity loss guides the model to learn diverse and comprehensive scene knowledge;
(2) The scene prompt modulator embeds the scene prompts into the attention mechanism to guide scene-aware feature learning, which enhances the discrimination of the features and effectively improves target tracking accuracy in complex background scenes.
Drawings
Fig. 1 is a schematic diagram of an overall architecture of a single target tracking method based on scene prompt according to an embodiment of the disclosure.
Fig. 2 is a schematic diagram of the architecture of a scene prompt generator according to an embodiment of the disclosure.
Fig. 3 is a flowchart of a single-target tracking method based on scene prompt according to an embodiment of the disclosure.
Detailed Description
The present disclosure provides a single-target tracking method based on scene prompts, which can fully mine the spatio-temporal information of the tracking process and generate prompts containing knowledge of the tracking scene. Based on the scene prompts, the method designs a new attention mechanism that guides the interaction among pixels through the scene prompts so as to suppress the influence of complex backgrounds on tracking, effectively improving tracking accuracy in complex scenes. The scene-prompt-based single-target tracking method can thus use the scene prompts adaptively generated during tracking to guide the attention mechanism governing interaction among pixels, suppress the complex background, and achieve robust tracking.
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
In an embodiment of the present disclosure, a single-target tracking method based on scene prompts is provided, which dynamically tracks a target in a video image. As shown in figs. 1 to 3, the method comprises:
S1, determining a target template image containing the target and a search area image, and partitioning both into blocks;
S2, obtaining the corresponding target template image features and search area image features by linearly mapping the partitioned target template image and search area image;
S3, inputting the target template image features and the search area image features into the scene-prompted vision transformer, and carrying out feature interaction and enhancement under the action of dynamically acquired scene prompts;
S4, regressing the target box from the search area features enhanced by the scene-prompted vision transformer, and estimating the quality of the target box; and
S5, storing, by the tracker, the features of tracking frames with good target box quality in a memory, so that when a given prompt update interval is reached, the scene prompt generator generates new scene prompts from the features stored in the memory.
According to embodiments of the present disclosure, scene prompts are dynamically acquired from the video spatio-temporal context during tracking by the scene prompt generator. The scene prompts include target prompts and background prompts.
According to embodiments of the present disclosure, the target box is regressed by the target estimation head using the search area features enhanced by the scene-prompted vision transformer, and the quality of the target box is estimated by an IoU regression head.
According to embodiments of the present disclosure, the scene-prompted vision transformer includes a 12-layer scene prompt encoder (the layers are also referred to as coding layers). Each scene prompt encoder layer includes a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
According to embodiments of the present disclosure, the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, using scene knowledge to suppress complex backgrounds.
According to embodiments of the present disclosure, the scene prompt generator divides the search region features into target features and background features according to the target box, and introduces several target prototypes and background prototypes that interact with the target features and the background features, respectively, through a mutual attention mechanism.
According to embodiments of the present disclosure, prompt learning is guided by a diversity loss, which ensures diversity by increasing the cosine distance between prompts.
According to embodiments of the present disclosure, the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map; the labels of the classification score map are generated by a Gaussian kernel, learning of the classification score map is constrained by a weighted focal loss, and learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; the IoU regression head is used to estimate the IoU between the predicted box and the real box, and learning of the IoU score is constrained by a mean square loss.
In an embodiment of the present disclosure, a single-target tracking method based on scene prompts is provided. The method mainly comprises a scene-prompt-based vision transformer and a target estimation head. The vision transformer contains two innovative modules: (1) a scene prompt generator; (2) a scene prompt modulator. The scene prompt generator adaptively generates prompts related to the current tracking scene using scene information gathered during tracking. The template and search area images are partitioned into blocks and mapped to the corresponding image features, which are input into the scene-prompt-based vision transformer for feature interaction and enhancement. Here, the transformer consists of 12 coding layers, each mainly comprising a scene prompt modulator, an attention mechanism, and a multi-layer perceptron. The scene prompt modulator uses the scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, so that scene knowledge can be used to suppress complex backgrounds. The target estimation head regresses the target box from the search area features enhanced by the scene-prompted vision transformer and estimates the quality of the target box. The tracker stores the features of tracking frames with good target box quality in a memory, and when a given prompt update interval is reached, the scene prompt generator generates new scene prompts from the features stored in the memory.
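To make the overall flow concrete, the following is a minimal Python (PyTorch-style) sketch of this tracking loop. It is an illustrative assumption rather than the patent's implementation: the callables model, box_head, iou_head and prompt_gen stand in for the components named above, and the 0.8 quality threshold and update interval are passed as arguments.

import torch

def track(template, searches, model, box_head, iou_head, prompt_gen,
          update_interval=50, iou_threshold=0.8):
    # template: cropped exemplar tensor; searches: iterable of per-frame
    # search-region tensors (cropping around the previous box is assumed
    # to happen upstream). Returns the predicted box for each frame.
    memory, boxes = [], []
    prompts = None  # before the first update the encoder runs without prompts
    for t, search in enumerate(searches, start=1):
        feats = model(template, search, prompts)  # scene-prompted ViT forward
        box = box_head(feats)                     # regress the target box
        score = iou_head(feats, box)              # predicted IoU = box quality
        if score.item() > iou_threshold:          # store reliable frames only
            memory.append(feats.detach())
        if t % update_interval == 0 and memory:
            prompts = prompt_gen(torch.stack(memory), box)  # refresh prompts
        boxes.append(box)
    return boxes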
In an embodiment of the present disclosure, as shown in fig. 1, the tracker takes as input a pair of images: a target template image $Z$ and a search area image $X$. The two images are respectively divided into $P \times P$ image blocks and flattened to obtain two-dimensional block sequences $Z_p$ and $X_p$, which are spliced together, where $P$ is the image block resolution and $N_z$ and $N_x$ are the numbers of target template image blocks and search area image blocks, respectively. Thereafter, the present disclosure maps the flattened two-dimensional image blocks to dimension $C$ by linear mapping. Learnable position encodings $E_z$ and $E_x$ are added to the corresponding target template image blocks and search area image blocks to obtain the target template features $F_z$ and the search area features $F_x$ (for example, the numbers 1-9 in fig. 1 represent different position encodings). $F_z$ and $F_x$ are spliced together and input into the scene prompt encoder to obtain updated template features $\hat{F}_z$ and search area features $\hat{F}_x$. Further, the present disclosure inputs the updated search area features $\hat{F}_x$ into the target box regression head and the IoU (intersection-over-union) score head to estimate the target position and evaluate the quality of the box. If the IoU score of the box is greater than 0.8, the present disclosure saves the features output by each coding layer for the current search area image, $\{F_x^i\}_{i=1}^{N}$, in the memory, where $N$ is the number of coding layers. When the update interval is reached, the scene prompt generator generates scene prompts (including target prompts and background prompts) using the features stored in the memory. Further, the scene prompt modulator in each coding layer can use the scene prompts to direct the attention mechanism to suppress complex backgrounds, so that the tracker can achieve robust tracking.
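As an illustration of the partition-and-map step just described, the following Python (PyTorch) sketch shows a standard patch-embedding module; the class name PatchEmbed, the 128x128 / 256x256 crop sizes, the 16x16 block size and C = 768 are assumptions for the example, not values fixed by the disclosure.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Partition + flatten + linear mapping, implemented (as is standard for
    # vision transformers) with a patch-stride convolution, plus learnable
    # position encodings for template and search-area tokens.
    def __init__(self, patch=16, in_ch=3, dim=768, n_z=64, n_x=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos_z = nn.Parameter(torch.zeros(1, n_z, dim))  # template pos. enc.
        self.pos_x = nn.Parameter(torch.zeros(1, n_x, dim))  # search pos. enc.

    def forward(self, z, x):
        fz = self.proj(z).flatten(2).transpose(1, 2) + self.pos_z  # (B, Nz, C)
        fx = self.proj(x).flatten(2).transpose(1, 2) + self.pos_x  # (B, Nx, C)
        return torch.cat([fz, fx], dim=1)  # spliced tokens for the encoder

embed = PatchEmbed()
tokens = embed(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 320, 768]): 64 template + 256 search tokens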
In an embodiment of the present disclosure, as shown in fig. 1, scene prompts are dynamically acquired from the video spatio-temporal context during tracking by the scene prompt generator. Specifically, for each encoder layer the scene prompt generator outputs a target prompt $P_t^i \in \mathbb{R}^{N_t \times C}$ and a background prompt $P_b^i \in \mathbb{R}^{N_b \times C}$, where $N_t$ is the number of target prompts, $N_b$ is the number of background prompts, and $C$ is the dimension of the prompt features. As shown in fig. 2, given the features $F^i \in \mathbb{R}^{M \times C}$ of the $i$-th layer encoder for the historical frames in the memory, where $M$ is the number of features stored in the memory, the features are divided according to the corresponding predicted boxes into target features (also called foreground features) $F_t^i$ and background features $F_b^i$. The present disclosure introduces several target prototypes $Q_t$ and background prototypes $Q_b$, which interact with the target features $F_t^i$ and the background features $F_b^i$, respectively, through a mutual attention mechanism to obtain the target prompt $P_t^i$ and the background prompt $P_b^i$:

$$P_t^i = \mathrm{softmax}\!\left(\frac{Q_t (F_t^i)^{\top}}{\sqrt{d}}\right) F_t^i, \qquad P_b^i = \mathrm{softmax}\!\left(\frac{Q_b (F_b^i)^{\top}}{\sqrt{d}}\right) F_b^i,$$

where $\sqrt{d}$ is a scaling factor and $\top$ is the matrix transpose operation. In addition, the present disclosure uses a diversity loss to guide prompt learning in order to obtain diverse and comprehensive scene knowledge. Given the target prompt $P_t^i$ and background prompt $P_b^i$ of the $i$-th layer, diversity is ensured by increasing the cosine distance between the prompts:

$$\mathcal{L}_{div}^i = \lambda_{intra}\sum_{k \neq m}\left[\cos\!\big(p_{t,k}^i, p_{t,m}^i\big) + \cos\!\big(p_{b,k}^i, p_{b,m}^i\big)\right] + \lambda_{inter}\sum_{k,m}\cos\!\big(p_{t,k}^i, p_{b,m}^i\big),$$

where $\mathcal{L}_{div}^i$ is the diversity loss, $\lambda_{intra}$ is the intra-class diversity loss weight, $\lambda_{inter}$ is the inter-class diversity loss weight, $i$ indicates the layer of the corresponding scene prompt encoder, $k, m$ are indices, $p_{t,k}^i$ and $p_{t,m}^i$ are the $k$-th and $m$-th target prompts of the $i$-th layer encoder, and $p_{b,k}^i$ and $p_{b,m}^i$ are the $k$-th and $m$-th background prompts of the $i$-th layer encoder.
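A minimal Python (PyTorch) sketch of the generator and the diversity loss follows, under stated assumptions: the prompt counts, the single-head form of the cross-attention, and the exact combination of the intra-class and inter-class cosine terms are illustrative choices consistent with the description above, not the patent's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScenePromptGenerator(nn.Module):
    # Learnable prototypes attend over the stored foreground/background tokens.
    def __init__(self, dim=768, n_target=4, n_bg=4):
        super().__init__()
        self.q_t = nn.Parameter(torch.randn(n_target, dim))  # target prototypes
        self.q_b = nn.Parameter(torch.randn(n_bg, dim))      # background prototypes

    def cross_attend(self, q, feats):
        # P = softmax(Q F^T / sqrt(C)) F  -- prototypes gather scene knowledge
        attn = torch.softmax(q @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)
        return attn @ feats

    def forward(self, target_feats, bg_feats):
        # target_feats / bg_feats: (M, C) tokens split by the predicted boxes
        return (self.cross_attend(self.q_t, target_feats),
                self.cross_attend(self.q_b, bg_feats))

def diversity_loss(p_t, p_b, w_intra=1.0, w_inter=1.0):
    # Penalise cosine similarity within and across the two prompt sets
    # (minimising similarity = increasing cosine distance).
    def mean_cos(a, b, exclude_diag):
        sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
        if exclude_diag:
            sim = sim - torch.diag_embed(torch.diagonal(sim))
        return sim.mean()
    return (w_intra * (mean_cos(p_t, p_t, True) + mean_cos(p_b, p_b, True))
            + w_inter * mean_cos(p_t, p_b, False))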
Regarding the scene prompt modulator: given the input $F^{i-1} = [F_z^{i-1}; F_x^{i-1}]$ to the $i$-th layer encoder, where $F_z^{i-1}$ are the target template features output by the $(i-1)$-th layer encoder and $F_x^{i-1}$ are the search area features output by the $(i-1)$-th layer encoder, the basic key $K$, query $Q$ and value $V$ are obtained by linear mapping:

$$K = \mathrm{LN}(F^{i-1})\,W_k, \qquad Q = \mathrm{LN}(F^{i-1})\,W_q, \qquad V = \mathrm{LN}(F^{i-1})\,W_v,$$

where $\mathrm{LN}(\cdot)$ denotes layer normalization and $W_k$, $W_q$, $W_v$ are the key, query and value mapping parameters. In a naive transformer, the key $K$, query $Q$ and value $V$ are taken as the input of multi-head attention, and dependencies between all pixels are built indiscriminately. Such an operation may cause a complex background to be falsely enhanced, thereby limiting the discrimination of the features. For this purpose, the present disclosure uses the dynamically acquired scene prompts $P_t^i$ and $P_b^i$ to generate a corresponding adaptive prompt for each pixel that contains scene-specific knowledge to discriminate against the background. Given $Q = [Q_z; Q_x]$, where the first $N_z$ rows correspond to the target template image blocks and $N_z$ is the number of target template image blocks, the adaptive prompt comprises a template adaptive prompt $E_z$ and a search area adaptive prompt $E_x$, expressed as:

$$E_z = M_z\,\mathrm{softmax}\!\left(\frac{Q_z (P_t^i)^{\top}}{\sqrt{d}}\right) P_t^i + (1 - M_z)\,\mathrm{softmax}\!\left(\frac{Q_z (P_b^i)^{\top}}{\sqrt{d}}\right) P_b^i;$$

$$E_x = \mathrm{softmax}\!\left(\frac{Q_x \big([P_t^i; P_b^i]\big)^{\top}}{\sqrt{d}}\right) [P_t^i; P_b^i],$$

where $M_z$ represents the target area in the template image.

Thereafter, the adaptive prompt $E = [E_z; E_x]$ is mapped for modulating the keys and queries as follows:

$$\hat{K} = K + E\,W_e^{K}, \qquad \hat{Q} = Q + E\,W_e^{Q},$$

where $\hat{K}$ is the key modulated with the prompt, $\hat{Q}$ is the prompt-modulated query, $W_e^{K}, W_e^{Q} \in \mathbb{R}^{C \times d}$ are linear mapping parameters, $C$ is the feature dimension of the adaptive prompt, $d$ is the feature dimension of the key and query, and $\mathbb{R}$ represents the real number domain. In this way, the scene prompts may instruct the attention mechanism to learn scene-aware features to discriminate the background. Finally, the output $F^i$ of the $i$-th layer encoder can be expressed as:

$$\tilde{F} = F^{i-1} + \mathrm{MHA}(\hat{Q}, \hat{K}, V), \qquad F^i = \tilde{F} + \mathrm{MLP}\big(\mathrm{LN}(\tilde{F})\big),$$

where $\tilde{F}$ is the intermediate feature and $V$ is the value.
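The following Python (PyTorch) sketch illustrates one layer of prompt-modulated attention. It assumes single-head attention, a shared linear mapping for the adaptive prompt, and a single attention step over the concatenated prompts (the template/search split with the target-area mask, and the MLP sub-layer, are omitted for brevity); the class name and signatures are illustrative.

import torch
import torch.nn as nn

class ScenePromptModulator(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_e = nn.Linear(dim, dim, bias=False)  # maps the adaptive prompt E

    def forward(self, tokens, prompts):
        # tokens: (B, N, C) template+search tokens; prompts: (B, Np, C)
        h = self.norm(tokens)
        q, k, v = self.to_q(h), self.to_k(h), self.to_v(h)
        scale = q.shape[-1] ** 0.5
        # Each pixel gathers a scene-specific adaptive prompt by attending
        # over the target/background prompts.
        attn = torch.softmax(q @ prompts.transpose(1, 2) / scale, dim=-1)
        e = self.to_e(attn @ prompts)
        q_mod, k_mod = q + e, k + e  # prompt-modulated query and key
        out = torch.softmax(q_mod @ k_mod.transpose(1, 2) / scale, dim=-1) @ v
        return tokens + out          # residual connection; MLP sub-layer omitted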
With respect to the target estimation head, as shown in fig. 1, the updated search area features $\hat{F}_x$ are input into the target box regression head and the IoU regression head (or score head) to perform target state estimation and box quality estimation.

The target box regression head (or target box prediction head) is composed of a three-branch fully convolutional network and outputs a classification score map $P \in \mathbb{R}^{\frac{H_x}{p} \times \frac{W_x}{p}}$, an offset map $O \in \mathbb{R}^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$ and a normalized size map $S \in \mathbb{R}^{2 \times \frac{H_x}{p} \times \frac{W_x}{p}}$, where $H_x \times W_x$ is the search area image size and $p$ is the block size. Given the position of the maximum classification score $(x_d, y_d) = \arg\max_{(x,y)} P(x, y)$, where $(x, y)$ denotes coordinates, the bounding box $b$ of the object can be expressed as:

$$(w, h) = \big(S(0, x_d, y_d),\; S(1, x_d, y_d)\big);$$

$$b = \Big(x_d + O(0, x_d, y_d) - \tfrac{w}{2},\; y_d + O(1, x_d, y_d) - \tfrac{h}{2},\; w,\; h\Big),$$

where the four components of $b$ are, respectively, the coordinates of the upper-left corner point of the target and the width and height of the target.
The labels of the classification score map are generated by a Gaussian kernel:

$$\hat{P}(x, y) = \exp\!\left(-\frac{(x - c_x)^2 + (y - c_y)^2}{2\sigma^2}\right),$$

where $\hat{P}$ is the classification score map label, $(c_x, c_y)$ is the true center of the target, and $\sigma$ is a standard deviation adaptive to the target size. A weighted focal loss $\mathcal{L}_{focal}$ is used to constrain the classification map. The mean absolute error ($L_1$) loss $\mathcal{L}_{1}$ and the generalized IoU loss $\mathcal{L}_{giou}$ are used to constrain the target bounding box, as follows:

$$\mathcal{L}_{cls} = \mathcal{L}_{focal}\big(\hat{P}, P\big); \qquad \mathcal{L}_{box} = \lambda_{1}\,\mathcal{L}_{1}\big(b_{gt}, b\big) + \lambda_{giou}\,\mathcal{L}_{giou}\big(b_{gt}, b\big),$$

where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{box}$ is the box loss, $\lambda_{1}$ is the weight of the mean absolute error loss, $\lambda_{giou}$ is the weight of the generalized IoU loss, $\hat{P}$ and $P$ are the classification score map label and the predicted classification score map respectively, and $b_{gt}$ and $b$ denote the real target box and the predicted target box respectively.
The IoU head is used to estimate the IoU between the predicted box and the real box. Specifically, the present disclosure samples boxes $\{b_j\}$ with different IoU values $\{u_j\}$ by jittering the real box. Then, according to each sampled box $b_j$, the target features after feature interaction are cropped:

$$F_{roi,j} = \mathrm{PrPool}\big(\hat{F}_x, b_j\big),$$

where $\mathrm{PrPool}$ is the Precise RoI Pooling operation, used for cropping target features in the search area. The present disclosure introduces an IoU score token $t_{iou}$, which interacts with $F_{roi,j}$ through an attention mechanism to obtain an updated IoU score token $\hat{t}_{iou}$. Subsequently, $\hat{t}_{iou}$ is input into a multi-layer perceptron to get the estimated IoU score $\hat{u}_j$ between the predicted box and the real box:

$$\hat{u}_j = \mathrm{MLP}\big(\hat{t}_{iou}\big).$$

Further, the present disclosure uses a mean square loss to constrain the IoU score:

$$\mathcal{L}_{iou} = \frac{1}{n}\sum_{j}\big(\hat{u}_j - u_j\big)^2,$$

where $\mathcal{L}_{iou}$ is the IoU score loss. The overall training loss $\mathcal{L}$ is expressed as follows:

$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{box} + \mathcal{L}_{iou} + \sum_{i}\mathcal{L}_{div}^i,$$

where $\mathcal{L}_{div}^i$ is the diversity loss of the $i$-th layer scene prompts.
Thus, embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. It should be noted that, in the drawings or the text of the specification, implementations not shown or described are all forms known to those of ordinary skill in the art, and not described in detail. Furthermore, the above definitions of the elements and methods are not limited to the specific structures, shapes or modes mentioned in the embodiments, and may be simply modified or replaced by those of ordinary skill in the art.
From the above description, those skilled in the art should have a clear understanding of the scene-prompt-based single-target tracking method of the present disclosure.
In summary, the present disclosure provides a single-target tracking method based on scene prompts, in which the scene prompt generator can fully mine the spatio-temporal scene information of the tracking process and generate prompts containing knowledge of the tracking scene. Based on the scene prompts, the scene prompt modulator is designed to guide the attention-based interaction among pixels with scene knowledge, thereby enhancing the discrimination of the features, suppressing the influence of complex backgrounds on tracking, and effectively improving tracking accuracy in complex scenes. The method can be widely applied to scenarios such as autonomous driving, intelligent surveillance and human-computer interaction; it can run on embedded and mobile devices to provide real-time tracking results, and can be deployed in large-scale computing servers to provide real-time, accurate target positioning and tracking services for a large number of users.
It should also be noted that the foregoing describes various embodiments of the present disclosure. These examples are provided to illustrate the technical content of the present disclosure, and are not intended to limit the scope of the claims of the present disclosure. A feature of one embodiment may be applied to other embodiments by suitable modifications, substitutions, combinations, and separations.
It should be noted that in this document, having "an" element is not limited to having a single element, but may have one or more elements unless specifically indicated.
In this context, the expression feature A "or" (or "and/or") feature B, unless specifically indicated, means the presence of A alone, of B alone, or of both A and B; the expression feature A "and" feature B means that A and B coexist; and the terms "comprising", "including", "having" and "containing" are open-ended and are not limited to the items listed.
Furthermore, unless specifically described or steps must occur in sequence, the order of the above steps is not limited to the list above and may be changed or rearranged according to the desired design. In addition, the above embodiments may be mixed with each other or other embodiments based on design and reliability, i.e. the technical features of the different embodiments may be freely combined to form more embodiments.
While the foregoing embodiments have been described in some detail for purposes of clarity of understanding, it will be understood that the foregoing embodiments are merely illustrative of the invention and are not intended to limit the invention, and that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A single-target tracking method based on scene prompts, which dynamically tracks a target in a video image, comprising:
determining a target template image containing the target and a search area image, and partitioning both into blocks;
linearly mapping the partitioned target template image and search area image to obtain the corresponding target template image features and search area image features;
inputting the target template image features and the search area image features into a scene-prompted vision transformer, and carrying out feature interaction and enhancement under the action of dynamically acquired scene prompts;
regressing the target box from the search area features enhanced by the scene-prompted vision transformer, and estimating the quality of the target box; and
storing, by the tracker, the features of tracking frames with good target box quality in a memory, wherein when a given prompt update interval is reached, a scene prompt generator generates new scene prompts from the features stored in the memory.
2. The scene-prompt-based single-target tracking method of claim 1, wherein the scene prompts are dynamically acquired from the video spatio-temporal context during tracking by the scene prompt generator.
3. The scene-prompt-based single-target tracking method according to claim 2, wherein the scene prompts include target prompts and background prompts.
4. The scene-prompt-based single-target tracking method according to claim 1, wherein the target box is regressed by a target estimation head using the search area features enhanced by the scene-prompted vision transformer, and the quality of the target box is estimated by an IoU regression head.
5. The scene-prompt-based single-target tracking method of claim 1, wherein the scene-prompted vision transformer includes a 12-layer scene prompt encoder.
6. The scene-prompt-based single-target tracking method according to claim 5, wherein each scene prompt encoder layer includes a scene prompt modulator, an attention mechanism, and a multi-layer perceptron.
7. The scene-prompt-based single-target tracking method of claim 6, wherein the scene prompt modulator uses the dynamically acquired scene prompts to guide the attention mechanism governing interaction among pixels in the encoder, using scene knowledge to suppress complex backgrounds.
8. The scene-prompt-based single-target tracking method according to claim 1, wherein the scene prompt generator divides the search region features into target features and background features according to the target box, and introduces several target prototypes and background prototypes that interact with the target features and the background features, respectively, through a mutual attention mechanism.
9. The scene-prompt-based single-target tracking method according to claim 8, wherein prompt learning is guided by a diversity loss, which ensures diversity by increasing the cosine distance between prompts.
10. The scene-prompt-based single-target tracking method according to claim 4, wherein the target box regression head comprises a three-branch fully convolutional network that outputs a classification score map, an offset map and a normalized size map; labels of the classification score map are generated by a Gaussian kernel; learning of the classification score map is constrained by a weighted focal loss; learning of the target box is constrained by a generalized IoU loss and a mean absolute error loss; and the IoU regression head is used to estimate the IoU between the predicted box and the real box, with learning of the IoU score constrained by a mean square loss.
CN202310430642.4A 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt Active CN116168216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310430642.4A CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310430642.4A CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Publications (2)

Publication Number Publication Date
CN116168216A true CN116168216A (en) 2023-05-26
CN116168216B CN116168216B (en) 2023-07-18

Family

ID=86422196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310430642.4A Active CN116168216B (en) 2023-04-21 2023-04-21 Single-target tracking method based on scene prompt

Country Status (1)

Country Link
CN (1) CN116168216B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050031165A1 (en) * 2003-08-08 2005-02-10 Lockheed Martin Corporation. Method and apparatus for tracking an object
CN110992404A (en) * 2019-12-23 2020-04-10 驭势科技(南京)有限公司 Target tracking method, device and system and storage medium
US20200327680A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deep adversarial training
CN114255514A (en) * 2021-12-27 2022-03-29 厦门美图之家科技有限公司 Human body tracking system and method based on Transformer and camera device
CN114821390A (en) * 2022-03-17 2022-07-29 齐鲁工业大学 Twin network target tracking method and system based on attention and relationship detection
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network
US20230106873A1 (en) * 2022-03-10 2023-04-06 Beijing Baidu Netcom Science Technology Co., Ltd. Text extraction method, text extraction model training method, electronic device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050031165A1 (en) * 2003-08-08 2005-02-10 Lockheed Martin Corporation. Method and apparatus for tracking an object
US20200327680A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deep adversarial training
CN110992404A (en) * 2019-12-23 2020-04-10 驭势科技(南京)有限公司 Target tracking method, device and system and storage medium
CN114255514A (en) * 2021-12-27 2022-03-29 厦门美图之家科技有限公司 Human body tracking system and method based on Transformer and camera device
US20230106873A1 (en) * 2022-03-10 2023-04-06 Beijing Baidu Netcom Science Technology Co., Ltd. Text extraction method, text extraction model training method, electronic device and storage medium
CN114821390A (en) * 2022-03-17 2022-07-29 齐鲁工业大学 Twin network target tracking method and system based on attention and relationship detection
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NING WANG et al.: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", arXiv:2103.11681v2 *
XU JIA et al.: "Visual tracking via adaptive structural local sparse appearance model", IEEE *
WANG Hongtao et al.: "A Survey of Single-Object Tracking Algorithms Based on Deep Learning", Computer Systems & Applications, vol. 5, no. 31 *

Also Published As

Publication number Publication date
CN116168216B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
RU2691214C1 (en) Text recognition using artificial intelligence
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN106960206A (en) Character identifying method and character recognition system
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN113822951B (en) Image processing method, device, electronic equipment and storage medium
CN107818575A (en) A kind of visual object tracking based on layering convolution
CN112016569A (en) Target detection method, network, device and storage medium based on attention mechanism
CN114821326A (en) Method for detecting and identifying dense weak and small targets in wide remote sensing image
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN109977963A (en) Image processing method, unit and computer-readable medium
CN109697236A (en) A kind of multi-medium data match information processing method
CN112101344A (en) Video text tracking method and device
CN114565035A (en) Tongue picture analysis method, terminal equipment and storage medium
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN116168216B (en) Single-target tracking method based on scene prompt
Ke et al. Dense small face detection based on regional cascade multi‐scale method
CN116229584A (en) Text segmentation recognition method, system, equipment and medium in artificial intelligence field
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN114913339A (en) Training method and device of feature map extraction model
CN114332884B (en) Document element identification method, device, equipment and storage medium
Mangla et al. A novel key-frame selection-based sign language recognition framework for the video data
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN110555433B (en) Image processing method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant