CN113705325A - Deformable single-target tracking method and device based on dynamic compact memory embedding

Publication number
CN113705325A
Authority
CN
China
Prior art keywords
target
memory
feature
existing
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110736925.2A
Other languages
Chinese (zh)
Other versions
CN113705325B (en
Inventor
于洪涛
朱鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110736925.2A
Publication of CN113705325A
Application granted
Publication of CN113705325B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a deformable single-target tracking method and device based on dynamic compact memory embedding. The method comprises: performing target correlation matching based on compact memory embedding to obtain the target foreground and background similarities and the target posterior probability; selecting, through a dynamic adjustment mechanism for the compact memory embedding, the high-quality part of the current target features according to feature correlation and integrating it into the memory; capturing the target deformation state in the current target features through a global association from each pixel to the reference features, thereby extracting deformable features; concatenating four features, including the target foreground similarity and the deformable features, along the channel dimension and feeding them into a decoder to obtain a refined target segmentation mask; and obtaining a rectangular bounding box of the target from the target segmentation mask to achieve visual tracking and localization. The device comprises a processor and a memory. The invention addresses challenging problems in the tracking process such as occlusion, target deformation, target appearance change, and interference from similar targets.

Description

Deformable single-target tracking method and device based on dynamic compact memory embedding
Technical Field
The invention relates to the field of single-target visual tracking, in particular to a deformable single-target tracking method and device based on dynamic compact memory embedding.
Background
Visual object tracking is a fundamental and challenging task in computer vision. It has many practical applications, such as traffic monitoring, human-computer interaction, autonomous robots, and autonomous driving. Although existing tracking methods have significantly improved in both accuracy and robustness, several challenges remain, such as occlusion, deformation, and background clutter.
The Siamese tracker is a simple and efficient tracking algorithm. During tracking, the position of maximum correlation between the search region and the template is taken as the target location. To generalize better, Siamese trackers are typically trained on large amounts of labeled data. SINT (a visual tracking method based on Siamese instance search) and SiamFC (a visual tracking method based on a fully convolutional Siamese network) were milestones in the development of Siamese tracking: they were the first attempts to train a Siamese network end-to-end for visual tracking. SiamRPN++ (a tracking model based on a large backbone network and a Siamese region proposal network) and SiamDW (a tracking model based on wider and deeper Siamese networks) adapt the structure of ResNet (a network based on residual connections) and apply it successfully to the Siamese tracking model, significantly improving tracking performance. SiamRPN (a tracking model based on a Siamese region proposal network) applies a Region Proposal Network (RPN) to the Siamese tracking network. This two-branch network has a classification head for foreground-background separation of anchors and a regression head for refining the proposal boxes. Compared with anchor-based methods, the anchor-free tracking methods (SiamFC++ (a fully convolutional Siamese tracking model based on anchor-free regression), SiamCAR (a fully convolutional Siamese classification and regression model for visual tracking), SiamBAN (a tracking model based on adaptive bounding box regression), and Ocean (an anchor-free tracking method based on target awareness)) avoid a large number of anchor presets and thus significantly reduce the model hyper-parameters. These methods allow more flexible regression of the target bounding box. Although Siamese tracking methods are simple and effective, their fixed templates have difficulty representing changes in the appearance and scale of the target.
MOSSE (a tracking method based on adaptive correlation filtering) pioneered the Discriminative Correlation Filter (DCF) family of tracking methods. As an online tracking method, the correlation filter adapts well to changes in target appearance and scale. Subsequent improvements, such as continuous convolution, dynamic updating of the training set, spatial regularization, and temporal smoothness regularization, further improved the performance of DCF-based trackers. By combining the online update of DCF with the target localization refinement of IoU-Net (a regression network model that maximizes the target overlap rate), ATOM (accurate target tracking by overlap maximization) and DiMP (a visual tracking method based on discriminative model prediction) achieved the best tracking performance of their time among template-matching trackers.
CSR-DCF (a tracking method combining correlation filtering and color segmentation) constructs a target mask from the color histograms of the foreground and background and then applies this mask to the correlation filter, which suppresses boundary effects well. SiamMask (a tracking method combined with target segmentation) adds a segmentation branch to the Siamese tracking model and strengthens the representational power of the model through the segmentation loss. Compared with common Video Object Segmentation (VOS) methods, SiamMask achieves a higher tracking speed by adopting a lightweight segmentation network. Unlike "tracking by detection", D3S (a discriminative segmentation tracking model) replaces the target regression branch of the tracking model with a segmentation network; by combining the precise localization of DCF with the robustness of the segmentation model to target deformation, D3S achieves advanced tracking performance. DMB (a segmentation tracking model based on dual memory banks) provides a rich reference for segmenting the current target by storing the historical appearance and spatial localization information of the target.
Rich memory embedding provides ample reference information for video analysis tasks. MemTrack (a tracking model with a dynamic memory network) uses a dynamic memory network to overcome the tracking drift caused by fixed templates. STM (a video segmentation method based on space-time memory) stores dense historical frame features and masks for spatio-temporal matching at the pixel level in the current frame. This dense reference information allows it to handle appearance changes and occlusions in VOS well. To avoid redundancy when too many memories are stored, AFB-URR (a video segmentation model combining adaptive feature memory and uncertain-region refinement) proposes an adaptive feature bank that dynamically organizes historical information. It merges similar memories using weighted averaging and follows a cache replacement strategy that evicts the least frequently queried memories.
Fixed matching templates make it difficult to represent a changing target, especially for non-rigid objects. To this end, SiamAttn (a tracking method based on Siamese attention) proposes a deformable Siamese attention mechanism that computes deformable self-attention and cross-attention between template features and search features. Self-attention uses spatial attention to learn context information, while cross-attention aggregates the rich contextual interdependencies between the target template and the search region, thereby updating the target template implicitly. Deformable DETR (a target detection model with deformable multi-head attention) replaces the attention of the Transformer (multi-head attention mechanism) with a multi-scale deformable attention module to process the feature maps. Owing to the flexibility of its weights, the method performs very well in object detection.
Although DCF-based trackers alleviate, to some extent, the difficulty that fixed templates have in adapting to scene changes, template-matching approaches still have two limitations:
first, the template of a single initial frame cannot provide enough target information for target matching; second, existing matching is performed only between corresponding pixels, so the matching result is too coarse to capture sufficiently fine target deformation information.
Disclosure of Invention
The invention provides a deformable single-target tracking method and device based on dynamic compact memory embedding, which address challenging problems in the tracking process such as occlusion, target deformation, target appearance change, interference from similar targets, and complex backgrounds, as described in detail below:
a deformable single-target tracking method based on dynamic compact memory embedding, the method comprising:
generating, during target similarity matching, a feature affinity matrix that represents the correlation between the current target feature and the existing target feature memory; selecting the top-K values of the feature affinity matrix row by row and averaging them to obtain the target foreground similarity and the target posterior probability;
according to the obtained feature correlation, merging the highly correlated part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low-correlation part, thereby realizing dynamic adaptive adjustment of the memory embedding and obtaining a compact memory embedding;
capturing the target deformation state in the current target feature by associating each query-feature pixel globally with the reference features and aggregating the weighted correlations between the query pixels and the reference features, and establishing correspondences between similar target parts to extract deformable features;
concatenating the target foreground similarity, the target posterior probability, the deformable features, and the target spatial localization obtained from the online discriminative correlation filter along the channel dimension and feeding them into a decoder to obtain a refined target segmentation mask; and obtaining a rectangular bounding box of the target from the target segmentation mask to achieve visual tracking and localization.
Further, obtaining the correlation between the current target feature and the existing memory specifically comprises:
after a new frame I_{t-1} has been segmented by the model, integrating the current target query feature F_{t-1} and the resulting masks with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; the current target feature and the existing target feature memory are first dimension-transformed and then multiplied as matrices to obtain their affinity matrix, which expresses the correlation between the two.
Merging the highly correlated part between the existing target feature memory and the current target feature, expanding the part with medium correlation with the existing target feature memory into the memory, and directly discarding the low-correlation part according to the obtained feature correlation, thereby realizing dynamic adaptive adjustment of the memory embedding, specifically comprises:
initializing the memory embedding with the target information of the first frame of the sequence and making it the main part of the memory bank; comparing the target query with the existing target feature memory to find the parts where the two are similar;
for each element of the target query, searching the affinity matrix to find the part of the target feature memory M_k ∈ R^{Thw} that is most similar to it;
if the maximum correlation between the two exceeds a certain upper limit, then, following the single-mapping principle of hashing, whereby records with the same key are inserted into the same storage slot, a weighted fusion is used to update the part of the current feature that is highly correlated with the existing memory into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are expanded directly into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
Further, updating the part of the current feature that is highly correlated with the existing memory into the existing memory embedding specifically comprises:
merging the highly correlated part between the existing memory and the current target feature:
M_k(j') = β F_{t-1}(i) + (1 − β) M_k(j),
M_vf(j') = β Y^f_{t-1}(i) + (1 − β) M_vf(j),
M_vb(j') = β Y^b_{t-1}(i) + (1 − β) M_vb(j),
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories respectively, j' is the index after merging, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground mask, Y^b_{t-1} is the background mask, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the index of the feature point in the current feature that is similar to the existing memory, and j is the index of the part of the existing memory that is similar to the current feature;
current target features whose correlation with the existing memory is higher than the average are expanded directly into the existing memory:
M_k = Union(M_k, F_{t-1}(i)), M_vf = Union(M_vf, Y^f_{t-1}(i)), M_vb = Union(M_vb, Y^b_{t-1}(i)),
where Union(·) denotes the union of the current feature and the corresponding memory, Union(M_k, F_{t-1}(i)) is the union of the existing key memory and the target feature at the previous moment, Union(M_vf, Y^f_{t-1}(i)) is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and Union(M_vb, Y^b_{t-1}(i)) is the union of the existing target background value memory and the target background mask at the previous moment.
Capturing the target deformation state in the current target feature by aggregating the weighted correlations between the query pixels and the reference features, and establishing correspondences between similar target parts, specifically comprises:
constructing the association between each pixel f_i of the query feature F and the whole template, and applying an unshared attention mechanism to compute the similarity e_ij between each pixel r_j of the key and f_i;
normalizing e_ij with the softmax function over all pixels of the template R, and using the normalized similarity weights to aggregate the pixel-wise features of the template, generating the target deformation feature corresponding to f_i;
connecting the query feature with the target deformation feature through a residual connection to obtain a feature containing deformation information, and using the target foreground probability to enhance the reliability of the target deformation feature.
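The pixel-to-template aggregation described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the patented implementation: the function name `deformable_feature`, the projection matrices `W_q` and `W_k` (one reading of the "unshared" attention), the dot-product form of e_ij, and the way the foreground probability scales the deformation feature are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def deformable_feature(F, R, W_q, W_k, fg_prob=None):
    """Sketch of the global pixel-to-template association.

    F: (hw, c) query pixels f_i; R: (n, c) reference pixels r_j.
    W_q, W_k: (c, d) separate projections for query and key (assumed).
    fg_prob: optional (hw,) foreground probability used as a gate (assumed).
    """
    e = (F @ W_q) @ (R @ W_k).T      # similarity e_ij, shape (hw, n)
    w = softmax(e, axis=1)           # normalize over all template pixels j
    deform = w @ R                   # weighted aggregation of template features
    if fg_prob is not None:
        deform = deform * fg_prob[:, None]
    return F + deform                # residual connection with the query
```

Each row of `w` sums to one, so every query pixel receives a convex combination of template pixels; this is what lets a query pixel pick up the matching target part wherever it sits in the template.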
In a second aspect, a deformable single-target tracking device based on dynamic compact memory embedding, the device comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any one of the first aspect.
The technical scheme provided by the invention has the beneficial effects that:
1. a dynamically adjusted compact target memory bank is introduced into target similarity matching, providing effective query references under complex tracking conditions such as occlusion and background clutter; in addition, the hash-inspired dynamic memory adjustment mechanism keeps the memory compact and of high quality and avoids memory redundancy and unnecessary memory retrieval;
2. the deformable feature learning module provided by the invention obtains the deformation information of the current target by establishing a global correspondence between each pixel of the query features and the features of the whole reference template, and thereby addresses the problem of target deformation during tracking;
3. extensive simulation experiments on five challenging tracking benchmarks (VOT2016, VOT2018, VOT2019, GOT-10K and TrackingNet) show that the method has clear advantages over other recent trackers; in particular, it obtains an EAO score of 0.508 on VOT2018, which currently ranks first.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the model of the present invention;
the model mainly comprises a tracker based on DCF, a deformation feature extraction module, a similarity matching module based on compact memory embedding, and an up-sampling segmentation module.
FIG. 2 is a schematic diagram of a target affinity match and compact memory embedding architecture of the present invention;
To obtain the target in the current frame, similarity matching is performed between the query feature F and the target feature memory M_k. Specifically, after dimension transformation, F and M_k are multiplied as matrices to construct the affinity matrix A ∈ R^{hw×Thw} between the current feature and the memory. A then retrieves the value memories M_vf and M_vb, and Top-K averaging is applied to the retrieved affinity features along the column dimension to obtain the foreground and background target similarities S_f and S_b. After target segmentation and tracking are completed, the query feature F and the resulting target foreground and background segmentation masks Y_f and Y_b are updated into the corresponding memories according to the compact memory adjustment mechanism proposed by the invention.
FIG. 3 is a diagram illustrating a comparison of tracking effects of different memory storage methods;
FIG. 4 is a visualization diagram of the deformation characteristics for improving the model tracking effect;
the embedding of compact memory effectively improves the discrimination capability of the model on similar interference. However, the model still cannot completely segment the edge details of the target, and the deformation feature learning module effectively solves the problem and realizes accurate segmentation of the target.
FIG. 5 is a schematic diagram showing the result of a segmentation visualization experiment of the data set DAVIS2017 according to the method;
FIG. 6 is a schematic diagram showing the result of the visual experiment of the method on VOT2016, VOT2018 and VOT 2019;
FIG. 7 is a schematic structural diagram of a deformable single-target tracking device based on dynamic compact memory embedding.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
In an embodiment of the present invention, a compact memory embedding for deformable visual tracking is proposed to solve the problems described in the background. A single initial reference feature contains only a small amount of target information, and in video sequences in particular the target undergoes significant changes in appearance and shape. The invention accordingly provides a dynamic memory embedding adjustment mechanism: the embodiment obtains the correlation between the query features and the existing memory by retrieving the feature affinity matrix generated during similarity matching.
Embodiments of the present invention then merge the highly correlated part between the existing memory and the current target feature. In addition, parts with moderate similarity to the memory are expanded into the memory, while irrelevant parts are discarded directly. A high-quality memory embedding provides complete information about how the target changed over the historical frames, which effectively addresses problems such as target occlusion and similar distractors. Moreover, unlike existing matching methods based on the correlation operation, the deformable feature extraction method of the embodiment associates each pixel with the global features and captures the target deformation state in the query features by aggregating the weighted correlations between the query pixels and the reference features.
Fig. 1 shows an overall flowchart of a tracking method in an embodiment of the present invention. In order to meet the challenge in the tracking process, the model of the method mainly comprises three key components, namely a target similarity matching module based on compact memory embedding, a deformable feature learning module and a tracker based on discriminant correlation filtering.
To track and segment the video frame I_t at the current time t, the video frame I_t and the initial frame I_1 are first encoded by the backbone network (i.e., ResNet50) into query features and reference features. To improve computational efficiency, the extracted query-frame and reference-frame features are reduced to 64 channels, and the two features are denoted F_t and R respectively.
The target similarity matching module used in the embodiment of the present invention follows the self-attention mechanism of the Transformer (multi-head attention mechanism). The dynamic memory adjustment mechanism expands the current target feature F_t and the target segmentation masks into the compact memory stores M_k and M_v (here M_k denotes the memory store of target features, and M_v denotes the memory store of the target foreground and background segmentation masks), which effectively overcomes occlusion and appearance variation during tracking. The target deformable feature learning module described in the embodiment performs a global, pixel-by-pixel comparison of the query feature and the reference feature and establishes the correspondence DF(F_t, R) between similar target parts. This global target correspondence can fully capture the deformation information of the target. The discriminative tracker is used to extract the target localization information L_t. Following the recent deep correlation filter-based tracker ATOM, the method first uses a 1 × 1 convolutional layer to reduce the backbone features to 64 channels and then processes them with a 4 × 4 convolutional layer and the continuously differentiable activation function PELU (parametric exponential linear unit). The maximum of the activated feature is taken as the target spatial localization. The tracking module is trained online by an efficient back-propagation method. Finally, the target similarity matching results, the deformable features, and the target localization information are concatenated along the channel dimension, and the combined features are refined and segmented by a three-stage upsampling segmentation module. A contour transformation of the segmented target mask then yields the target bounding box.
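The final step of the pipeline, going from a segmentation mask to a bounding box, can be illustrated with a small sketch. The source refers to a contour transformation without further detail; the code below substitutes the simplest stand-in, the tight axis-aligned box around all pixels above a threshold. The function name `mask_to_bbox` and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np

def mask_to_bbox(mask, threshold=0.5):
    """Return (x, y, width, height) of the tight axis-aligned box around
    all mask pixels above `threshold`, or None if no pixel qualifies.

    Simplified stand-in for the contour-based step; not the patent's
    exact procedure.
    """
    ys, xs = np.nonzero(np.asarray(mask) > threshold)
    if ys.size == 0:
        return None
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```

A contour-based variant could instead fit a rotated rectangle to the largest connected component, which matters for benchmarks that score rotated boxes.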
Object similarity matching based on compact memory embedding
1. Object similarity matching
Accurately separating the target from a complex background requires usable reference information. Matching-based VOS methods can make full use of the target annotation in the initial frame to accurately match the current target. In this embodiment, following the conventional attention mechanism, the target similarity matching module comprises three parts: query, key and value. Fig. 2 shows the structure of the target similarity matching module. The query feature F_t ∈ R^{h×w×c} is the target representation of the current frame. Correspondingly, the key R ∈ R^{h×w×c} in the model is the target feature of the initial frame, and the values (Y^f_1 and Y^b_1) are the foreground and background segmentation masks of the first frame of the video, where h is the pixel height of the target feature map, w is the pixel width of the target feature map, c is the number of channels of the target feature map, R^{h×w×c} denotes the space of multi-dimensional matrices, Y^f_1 is the foreground segmentation mask of the first frame of the video, Y^b_1 is the background segmentation mask of the first frame of the video, f denotes the target foreground, and b denotes the target background.
In video sequences, targets typically undergo significant changes in appearance and structure. Similarity matching that uses only the fixed initial information of the target cannot guarantee the quality of target matching. When a new frame, e.g. I_{t-1}, has been segmented by the model, the current target query feature F_{t-1} and the resulting masks (Y^f_{t-1} and Y^b_{t-1}) are merged with the historical information. By inserting historical information into the key and value memories, the memory embedding (M_k ∈ R^{T×h×w×c}, M_vf ∈ R^{T×h×w×1} and M_vb ∈ R^{T×h×w×1}) comes to contain rich target appearance information, where M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, and T is the number of accumulated video frames.
To establish a pixel-level association of the target between the key memory and the query feature, the method first generates an affinity matrix A. Specifically, for better matching, the key memory M_k and the query F_t are first normalized pixel by pixel along each channel. M_k and F_t are then reshaped to Thw × c and hw × c, respectively:
A = F_t * (M_k)^T, (1)
where * denotes matrix multiplication, (·)^T denotes matrix transposition, and A ∈ R^{hw×Thw}.
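Equation (1) can be made concrete with a short NumPy sketch. The source does not specify the normalization, so the code assumes L2 normalization of each pixel's channel vector, which turns every entry of A into a cosine similarity; the function name `affinity_matrix` and the `eps` guard are likewise illustrative assumptions.

```python
import numpy as np

def affinity_matrix(query, memory_keys, eps=1e-8):
    """Affinity A = F_t (M_k)^T between query pixels and memory pixels.

    query: (h, w, c) feature F_t; memory_keys: (T, h, w, c) memory M_k.
    Returns A with shape (hw, Thw).
    """
    c = query.shape[-1]
    q = query.reshape(-1, c)          # hw x c
    m = memory_keys.reshape(-1, c)    # Thw x c
    # Assumed normalization: unit L2 norm per pixel along the channels.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    m = m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)
    return q @ m.T
```

Row i of the result is the affinity vector a_i of query pixel i against every stored memory pixel.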
The affinity matrix A measures the similarity between every pixel of the query feature F_t and every pixel of the key memory M_k. To obtain an accurate match, the value memories must then be retrieved. The foreground and background value memory maps M_vf and M_vb are first reshaped to Thw × 1 (the 1 here is the number one). For each i ∈ {1, ..., hw}, the affinity vector a_i ∈ R^{Thw} retrieves the value memory vectors M_vf ∈ R^{Thw×1} and M_vb ∈ R^{Thw×1} by element-wise product:
V^f_i = a_i ⊙ M_vf, (2)
V^b_i = a_i ⊙ M_vb, (3)
where V^f_i is the result of retrieving the target foreground value memory vector with the affinity vector, V^b_i is the result of retrieving the target background value memory vector with the affinity vector, and i is the index of the i-th feature vector along the hw dimension of the target feature.
Matching scores with high confidence ensure the accuracy of target matching. Embodiments of the present invention therefore apply a top-K averaging function to extract the target scores from the retrieved vectors V^f_i and V^b_i:
S_f(i) = (1/K) Σ_{j ∈ Top(V^f_i, K)} V^f_i(j), (4)
where the set Top(V^f_i, K) denotes the K largest matching scores in V^f_i, and j is the index of the j-th of these top-K matching scores. In this method K is set to 3, and the background matching is computed in the same way. The final foreground and background matching results S_f ∈ R^{hw} and S_b ∈ R^{hw} are used to generate the target posterior probability P, i.e. the probability of the target foreground relative to the background.
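The value retrieval and top-K averaging can be sketched as follows, with K = 3 as in the text. The element-wise retrieval mirrors the description above; the function names are hypothetical, and the two-way softmax used for the posterior P is purely an assumption, since the source does not state how P is derived from S_f and S_b.

```python
import numpy as np

def topk_similarity(A, value_memory, k=3):
    """Top-K averaged similarity per query pixel.

    A: (hw, Thw) affinity matrix; value_memory: (Thw,) mask values
    (M_vf for foreground, M_vb for background). Returns (hw,).
    """
    retrieved = A * value_memory[None, :]             # row i is a_i * M_v
    # np.partition moves the k largest entries of each row to the end.
    topk = np.partition(retrieved, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

def posterior(s_f, s_b):
    """Assumed two-way softmax over foreground/background similarity."""
    e_f, e_b = np.exp(s_f), np.exp(s_b)
    return e_f / (e_f + e_b)
```

Averaging only the K best matches keeps a pixel's score from being diluted by the many memory locations that belong to the other class or to unrelated parts of the target.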
2. Compact memory embedding
Rich memory embedding can effectively improve the precision of target similarity matching. However, due to the limited storage capacity of computing devices, it is not possible to store all historical frame information in the memory bank, especially for long videos with more than 1K frames. In addition, since targets in adjacent frames may be too similar, or the target may be occluded, storing such target information leads to redundant memory storage and unnecessary matching queries.
Existing methods use only a few historical frames in the neighborhood or select historical frames at equal intervals, and thus lose some useful reference information. AFB-URR (a video segmentation model combining adaptive feature memory and uncertain-region refinement) merges similar parts of the memory and deletes the least frequently queried part. It may nevertheless introduce irrelevant historical information into the memory, which easily leads to accumulating model error until the model drifts. Inspired by the hash algorithm, a dynamic compact memory adjustment mechanism is developed for the model to form a more effective target reference bank. FIG. 2 illustrates the structure of the dynamic compact memory embedding. Based on the affinity matrix A, embodiments of the present invention merge the highly similar part (above an upper threshold) between the current feature and the existing memory. To avoid mismatches caused by low-quality memory, target features with low correlation are discarded directly.
Matching-based VOS methods mostly use the target information of the initial frame as the reference template, because the initial features with ground-truth labels give an accurate and complete description of the target. The method therefore uses the target information of the first frame in the sequence to initialize the memory embedding and makes it the main part of the memory bank. The target in the most recent video frame is the most similar to the current target, but its query preference is reduced during target matching to account for the model's computational error. For example, during model inference, once video frame I_{t-1} has been segmented, the target query F_{t-1} and the segmentation masks of the foreground Y^f_{t-1} and background Y^b_{t-1} are obtained. To extract useful reference information, the target query F_{t-1} is first compared with the existing target feature memory M_k to find the parts in which the two are similar. The affinity matrix A ∈ R^{hw×Thw} generated during target similarity matching measures the correlation between the query feature F_{t-1} and the key memory M_k. Embodiments of the present invention therefore use the affinity matrix A directly to manage the memory embedding dynamically. For each element F_{t-1}(i) ∈ R^{hw} in F_{t-1}, the affinity matrix A is searched to find the part of M_k ∈ R^{Thw} that is most similar to it:
Re(F_{t-1}(i)) = max_j A(i, j), (5)
where Re(·) is the maximum value in each row of the affinity matrix A, and A(i, j) is an element of the affinity matrix A.
If the maximum correlation A(i, j) between the two is greater than an upper limit ζ, the two are considered sufficiently similar. In a hash algorithm, records in the hash map that have the same key are inserted into the same storage slot. Accordingly, embodiments of the present invention save only one of several similar features to memory. To preserve the diversity of the memory, similar features and the corresponding memory entries are merged by weighting, which avoids unnecessary retrieval and memory redundancy. As analyzed above, the initial reference information is the most accurate. Embodiments of the invention therefore use a small fusion weight β when updating the current features into the existing memory embedding, so as to avoid interference from model errors. In all experiments, ζ is set to 0.95 and β is set to 0.001. The online update formulas for the memory embedding are as follows:
M_k(j') = β F_{t-1}(i) + (1 - β) M_k(j), (6)
M_vf(j') = β Y^f_{t-1}(i) + (1 - β) M_vf(j), (7)
M_vb(j') = β Y^b_{t-1}(i) + (1 - β) M_vb(j), (8)
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, respectively, and j' is the index at which the merged entry is stored in memory.
For features whose maximum correlation satisfies Re(F_{t-1}(i)) < ζ, embodiments of the invention select those whose correlation value is higher than the average value
Figure BDA0003140298980000101
and append them to the existing memory to ensure the diversity of the memory embedding. At the same time, queries against irrelevant memory are avoided. The selected features are merged into an efficient compact memory store by:
Figure BDA0003140298980000102
where Union(·) denotes the union operation of the current feature and the corresponding memory,
Figure BDA0003140298980000103
is the union of the existing key memory and the target feature at the previous moment,
Figure BDA0003140298980000104
is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and
Figure BDA0003140298980000105
is the union of the existing target background value memory and the target background mask at the previous moment.
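The merge, append and discard rules of this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `update_memory` and the array layouts are hypothetical, "the average value" is assumed to be the mean of the whole affinity matrix, the value memories are assumed to be fused with the same weight β as the keys, and the affinities are assumed to be comparable to ζ = 0.95 (for example cosine similarities).

```python
import numpy as np

def update_memory(A, feat, mem_k, y_f, y_b, mem_vf, mem_vb, zeta=0.95, beta=0.001):
    """A: (hw, N) affinity; feat: (hw, C) query features; mem_k: (N, C) key memory;
    y_f, y_b: (hw,) foreground/background masks; mem_vf, mem_vb: (N,) value memories."""
    mem_k, mem_vf, mem_vb = mem_k.copy(), mem_vf.copy(), mem_vb.copy()
    best = A.max(axis=1)        # row-wise maximum correlation Re(F(i))
    best_j = A.argmax(axis=1)   # index j of the most similar memory entry
    mean_corr = A.mean()        # assumed meaning of "the average value"

    # High correlation: weighted fusion into the matched memory slot, as in Eq. (6).
    for i in np.flatnonzero(best > zeta):
        j = best_j[i]
        mem_k[j] = beta * feat[i] + (1 - beta) * mem_k[j]
        mem_vf[j] = beta * y_f[i] + (1 - beta) * mem_vf[j]
        mem_vb[j] = beta * y_b[i] + (1 - beta) * mem_vb[j]

    # Medium correlation: append to the memory; low correlation is discarded.
    mid = (best <= zeta) & (best > mean_corr)
    mem_k = np.concatenate([mem_k, feat[mid]], axis=0)
    mem_vf = np.concatenate([mem_vf, y_f[mid]])
    mem_vb = np.concatenate([mem_vb, y_b[mid]])
    return mem_k, mem_vf, mem_vb

# Toy example: pixel 0 is merged, pixel 1 is appended, pixel 2 is discarded.
A = np.array([[0.99, 0.10], [0.60, 0.20], [0.05, 0.10]])
feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_k, new_vf, new_vb = update_memory(
    A, feat, np.zeros((2, 2)),
    np.array([1.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0]),
    np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

With β = 0.001 a merged slot barely moves, which matches the stated intent of keeping the accurate initial reference dominant while the appended entries supply diversity.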
FIG. 3 compares the compact memory embedding of an embodiment of the present invention with two other related methods. As shown in the first row, storing all historical memory and the Adaptive Feature Bank (AFB) improve the discriminative ability of target similarity matching to some extent. However, under complex background clutter, redundant memory can lead to false target matches. The memory embedding method of the embodiment of the invention fully exploits the diversity and compactness of the memory features and achieves better target matching performance.
Second, target deformable feature learning
As shown in FIG. 4, with the help of compact memory embedding, target similarity matching has good discriminative ability and copes well with tracking problems such as target occlusion and background interference. It also has some advantage in handling target deformation. However, it cannot effectively handle severe target deformation or spatial detail problems. Inspired by the attention mechanism, embodiments of the present invention propose target deformable feature learning to further alleviate these problems. An embodiment of the invention constructs an association between each pixel in the query feature F and the whole template R to obtain complete target deformation information. Since the query and the key contain different target representations, for each pixel f_i in the query, an embodiment of the present invention applies a non-shared attention mechanism to compute the correlation between each pixel r_j in the key and f_i, giving a learnable similarity function:
e_ij = (W_F f_i)^T (W_R r_j). (12)
where W_F and W_R denote learnable linear transformations that map f_i and r_j into a higher-dimensional representation, and e_ij is the correlation between pixel f_i in the query feature and pixel r_j in the key feature.
To make the similarities between f_i and the different parts of the template R comparable, e_ij is normalized over all pixels of the template R by the softmax function to obtain the normalized similarity weight:
d_ij = exp(e_ij) / Σ_k exp(e_ik), (13)
The normalized similarity weight d_ij is used to aggregate the pixel-wise features in the template R and generate the deformation feature v_i corresponding to f_i:
v_i = Σ_j d_ij φ_v(r_j), (14)
where φ_v(·) denotes ReLU(W_v * (·)), and W_v * (·) denotes a linear transformation of the input features. The query features and the obtained target deformation features are then combined through a residual connection to obtain the features containing deformation information
Figure BDA0003140298980000111
Figure BDA0003140298980000112
where φ_c(·) denotes ReLU(W_c * (·)), which is intended to reduce the dimensionality of the features, and Concat(·) denotes the feature concatenation operation. Background mismatches are inevitable in the matched target deformation features. Therefore, the target foreground probability P (i.e., the target segmentation mask of the initial frame) is used to enhance the confidence of the deformable features.
Specifically, the obtained target deformation feature
Figure BDA0003140298980000113
is used to retrieve, through a dot-product operation, the foreground probability in the initial frame
Figure BDA0003140298980000114
to generate the final deformable features:
Figure BDA0003140298980000115
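The attention-based aggregation of Eq. (12) through the concatenation step can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function name `deformable_features`, the matrix shapes, and the random weights are hypothetical, the combination step is assumed to be ReLU(W_c Concat(f_i, v_i)), and the final foreground-probability weighting is not sketched because its exact form is given only in the figure.

```python
import numpy as np

def deformable_features(F, R, W_F, W_R, W_v, W_c):
    """F: (n, C) query pixels f_i; R: (m, C) template pixels r_j.
    W_F, W_R: (D, C); W_v: (C, C); W_c: (C, 2C)."""
    e = (F @ W_F.T) @ (R @ W_R.T).T           # e_ij = (W_F f_i)^T (W_R r_j), Eq. (12)
    e = e - e.max(axis=1, keepdims=True)      # stabilise the softmax numerically
    d = np.exp(e)
    d = d / d.sum(axis=1, keepdims=True)      # softmax over the template pixels j
    v = d @ np.maximum(R @ W_v.T, 0.0)        # v_i = sum_j d_ij ReLU(W_v r_j)
    fused = np.concatenate([F, v], axis=1) @ W_c.T
    return np.maximum(fused, 0.0), d          # assumed ReLU(W_c Concat(f_i, v_i))

rng = np.random.default_rng(0)
n, m, C, D = 6, 4, 8, 16
feat_def, d = deformable_features(
    rng.normal(size=(n, C)), rng.normal(size=(m, C)),
    rng.normal(size=(D, C)), rng.normal(size=(D, C)),
    rng.normal(size=(C, C)), rng.normal(size=(C, 2 * C)))
```

Using separate projections W_F and W_R is what makes the attention "non-shared": the query and the template are mapped by different learned transformations before their dot product is taken.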
Third, single-target visual tracking
Four feature maps are concatenated along the channel dimension: the target foreground similarity and the target posterior probability obtained during similarity matching, the target deformation feature obtained by the deformable feature extraction module, and the target spatial localization obtained by the online discriminative correlation filter. The result is fed into a lightweight decoder (an upsampling segmentation module) to obtain the final refined target segmentation mask. A contour detection function in OpenCV (the Open Source Computer Vision Library) is then used to obtain the rectangular bounding box of the target from the final segmentation mask; the center of the bounding box gives the target position, and its size gives the target scale.
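The conversion from a segmentation mask to a target position and scale can be sketched as follows. Note the substitution: the text uses an OpenCV contour detection function, whereas this sketch uses plain NumPy and takes the tight axis-aligned box around all foreground pixels. The function name `mask_to_box` and the 0.5 threshold are hypothetical.

```python
import numpy as np

def mask_to_box(mask: np.ndarray, thresh: float = 0.5):
    """Return (cx, cy, w, h) of the tight axis-aligned box around mask > thresh."""
    ys, xs = np.nonzero(mask > thresh)
    if ys.size == 0:
        return None  # no foreground pixel: the target is treated as lost
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    w, h = x1 - x0 + 1, y1 - y0 + 1
    # The box center is the target position; (w, h) is the target scale.
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0, w, h)

mask = np.zeros((10, 12))
mask[2:6, 3:9] = 1.0
box = mask_to_box(mask)
```

A contour-based routine would additionally let the tracker keep only the largest connected region, which this simplified sketch does not do.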
Fourth, specific implementation steps and simulation experiments
In an embodiment of the invention, the first four stages of a ResNet50 pre-trained on ImageNet are used as the backbone network for feature extraction. Target segmentation is a pixel-level classification task that requires features with high-confidence semantics. The method therefore takes the last layer of the backbone network for target similarity matching and deformable feature extraction. The backbone features are then reduced to 64 channels by a 1 × 1 convolutional layer, followed by a 3 × 3 convolutional layer and ReLU activation. During upsampling segmentation, the first three stages of the backbone are used to supplement the spatial detail of the target. The method sets K = 3 for top-K selection, and the upper threshold ζ for merging similar memories is set to 0.95. Preliminary experiments showed that the parameters of the memory embedding module and of top-K have no obvious influence on tracking performance, so the above settings are fixed in all related experiments.
Model training: the compact memory embedding module and the DCF-based tracking module are updated online and need no pre-training. The method therefore pre-trains only the similarity matching, deformable feature extraction, and upsampling segmentation modules on the YouTube-VOS dataset. Similar to the sampling strategy used in Siamese-network-based tracking models, a pair of images with masks is sampled from a video sequence to construct each training sample. The method minimizes the cross-entropy loss with the Adam optimizer; the learning rate is 8 × 10^-4 and decays by 0.2 every 15 epochs. The whole training process runs on a single Nvidia Titan XP GPU and takes 60 epochs.
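The learning-rate schedule can be sketched as a step decay. This is an interpretation, not a confirmed detail: "decay by 0.2" is assumed here to mean multiplying the rate by 0.2 every 15 epochs, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 8e-4,
                  gamma: float = 0.2, step: int = 15) -> float:
    """Step decay: multiply base_lr by gamma once every `step` epochs (assumed)."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the first epoch of each of the four 15-epoch stages.
schedule = [learning_rate(e) for e in (0, 15, 30, 45)]
```

Under this reading, the 60 training epochs pass through four stages, ending at 8 × 10^-4 × 0.2^3 = 6.4 × 10^-6.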
Model inference: in the VOT task, a video sequence comes with bounding-box annotations. The method first generates a pseudo mask from the ground-truth box in the initial frame and uses it as a pseudo label to initialize the model. During tracking, the segmentation model processes the current frame to obtain the target segmentation mask. The compact memory bank is updated online with the query features and the generated segmentation mask. Finally, the resulting segmentation mask is converted into a rotated target bounding box, which is the tracking result.
Simulation experiment results: Tables 1-3 show the evaluation results of the tracking model of the embodiment of the invention on the public single-target tracking benchmarks VOT2016, VOT2018 and VOT2019. Comparison with recent tracking models (such as SiamMask, SiamRPN++, ATOM, D3S and Ocean) demonstrates the advanced performance of the tracking model and verifies the effectiveness of the method. Table 4 shows the evaluation results of the model constructed in the embodiment of the present invention on the video object segmentation benchmark DAVIS2017, which likewise verify the validity and advanced performance of the model. FIGS. 5 and 6 show visual results of the method on the VOT series and DAVIS2017 benchmarks.
The comparison methods referred to in the embodiments of the invention are as follows: SiamMask (tracking method combined with target segmentation); D3S (discriminative segmentation tracking model); CCOT (tracking method based on continuous convolution); CSR-DCF (correlation filter tracking method based on channel and spatial reliability); ASRCF (correlation filter tracking method based on adaptive spatial regularization); SiamDW (tracking model based on a deeper and wider Siamese network); SiamRPN++ (tracking model based on a large backbone network and a Siamese region proposal network); SiamRPN (tracking model based on a Siamese region proposal network); SiamBAN (tracking model based on adaptive bounding-box regression); Ocean-on/off (anchor-free tracking method based on object awareness); Update-Net (Siamese tracking method based on online template update); SPM (real-time visual object tracking method based on series-parallel matching); ATOM (accurate tracking by overlap maximization); DiMP (visual tracking method based on discriminative model prediction); C-RPN (tracking method based on a Siamese cascaded region proposal network); SiamGraph (tracking method based on a Siamese graph attention model); LADCF (discriminative tracking method based on temporal consistency constraints); OSMN (efficient video object segmentation method based on network modulation); STM (video segmentation method based on space-time memory); VM (video segmentation method based on object matching); FAVOS (video segmentation method based on part tracking); OnAVOS (segmentation method based on an online adaptive convolutional network).
TABLE 1 Performance comparison of multiple tracking methods on the VOT2016 benchmark
Figure BDA0003140298980000121
Figure BDA0003140298980000131
TABLE 2 Performance comparison of multiple tracking methods on the VOT2018 benchmark
Figure BDA0003140298980000132
TABLE 3 Performance comparison of multiple tracking methods on the VOT2019 benchmark
Figure BDA0003140298980000133
TABLE 4 Performance comparison of multiple tracking and VOS methods on the DAVIS2017 benchmark
Figure BDA0003140298980000134
Figure BDA0003140298980000141
Based on the same inventive concept, an embodiment of the present invention further provides a deformable single-target tracking apparatus based on dynamic compact memory embedding, referring to fig. 7, the apparatus includes: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in an embodiment:
the feature affinity matrix generated in the target similarity matching process is used to express the correlation between the current target feature and the existing target feature memory; the top K values in each row of the feature affinity matrix are then selected and averaged to obtain the target foreground and background similarity and the target posterior probability;
according to the acquired feature correlation, combining the high correlation part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low correlation part, thereby realizing the dynamic self-adaptive adjustment of memory embedding;
acquiring a target deformation state in the current target feature by adopting an association mode from pixel points to global features one by one and aggregating the weighted correlation between query pixels and reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;
the similarity of the target foreground, the posterior probability of the target, the deformable characteristics and the target space positioning obtained in the online discriminant correlation filter are cascaded along the channel and input into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.
The method for obtaining the correlation between the current target characteristics and the existing memory specifically comprises the following steps:
when a new frame I_{t-1} has been segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; for the current target feature and the existing target feature memory, the two are first reshaped and then matrix-multiplied to obtain their affinity matrix, which expresses the correlation between them.
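The reshape-then-multiply computation of the affinity matrix can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `affinity_matrix` and the tensor layouts are hypothetical, and the L2 normalization is an added assumption so that the entries behave like cosine similarities that can be compared with a threshold such as ζ = 0.95.

```python
import numpy as np

def affinity_matrix(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """query: (C, h, w) feature map; memory: (C, T, h, w) key memory.
    Returns the affinity matrix A of shape (h*w, T*h*w)."""
    c = query.shape[0]
    q = query.reshape(c, -1).T   # dimension transformation to (hw, C)
    m = memory.reshape(c, -1)    # dimension transformation to (C, Thw)
    # Assumption: L2-normalise so that the entries are cosine similarities.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-12)
    m = m / (np.linalg.norm(m, axis=0, keepdims=True) + 1e-12)
    return q @ m                 # matrix multiplication gives A

rng = np.random.default_rng(0)
A = affinity_matrix(rng.normal(size=(8, 4, 5)), rng.normal(size=(8, 2, 4, 5)))
```

Each row of A then holds the correlation of one query pixel with every stored memory location, which is the quantity the memory update rules operate on.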
Further, according to the obtained feature correlation, combining a high correlation part between the existing target feature memory and the current target feature, expanding a part with medium correlation with the existing target feature memory into the memory, and discarding an irrelevant part, so that the dynamic adaptive adjustment of memory embedding is realized as follows:
initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query with the existing target characteristic memory, and finding out a similar part of the target query and the existing target characteristic memory;
for each element in the target query, the affinity matrix is searched to find the part of M_k ∈ R^{Thw} that is most similar to it;
if the maximum correlation between the two is greater than an upper limit, then, following the single-mapping principle of hashing in which records with the same key are inserted into the same storage slot, a weighted fusion is used to update the part of the current features that is highly correlated with the existing memory into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are appended directly to the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
Wherein, updating the part of the current characteristics which has high correlation with the existing memory into the existing memory embedding specifically comprises:
merging the high correlation part between the existing memory and the current target feature:
M_k(j') = β F_{t-1}(i) + (1 - β) M_k(j),
Figure BDA0003140298980000155
Figure BDA0003140298980000156
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, respectively, j' is the index at which the merged entry is stored, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground, Y^b_{t-1} is the background, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the index of a feature point in the current feature that has medium similarity to the existing memory, and j is the index of the part of the existing memory that has medium similarity to the current feature;
the current target characteristics with the correlation value higher than the average value with the existing memory are directly expanded into the existing memory:
Figure BDA0003140298980000151
where Union(·) denotes the union operation of the current feature and the corresponding memory,
Figure BDA0003140298980000152
is the union of the existing key memory and the target feature at the previous moment,
Figure BDA0003140298980000153
is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and
Figure BDA0003140298980000154
is the union of the existing target background value memory and the target background mask at the previous moment.
Further, by aggregating the weighted correlation between the query pixel and the reference feature, the target deformation state in the current target feature is captured, and the correspondence relationship established between similar target portions is specifically:
constructing the association between each pixel in the query feature F and the whole template, and applying a non-shared attention mechanism to compute the correlation between each pixel r_j in the key and f_i, thereby obtaining a similarity function;
normalizing e_ij over all pixels of the template R by the softmax function to obtain normalized similarity weights, which are used to aggregate the pixel-wise features in the template and generate the target deformation feature corresponding to f_i;
and combining the query feature with the target deformation feature through a residual connection to obtain the feature containing deformation information, and enhancing the credibility of the target deformation feature with the target foreground probability.
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The execution main bodies of the processor 1 and the memory 2 may be devices having a calculation function, such as a computer, a single chip, a microcontroller, and the like, and in the specific implementation, the execution main bodies are not limited in the embodiment of the present invention, and are selected according to the needs in the practical application.
The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the description of the embodiments of the present invention is not repeated here.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are generated in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, data center, etc., that comprises an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A deformable single-target tracking method based on dynamic compact memory embedding is characterized by comprising the following steps:
a feature affinity matrix generated in the process of matching the target similarity represents the correlation between the current target feature and the existing target feature memory; screening the first K values of the characteristic affinity matrix line by line and averaging the first K values to obtain the target foreground similarity and the target posterior probability;
according to the acquired feature correlation, combining the high correlation part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low correlation part, thereby realizing the dynamic self-adaptive adjustment of memory embedding;
acquiring a target deformation state in the current target feature by adopting an association mode from pixel points to global features one by one and by aggregating the weighted correlation between query pixels and reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;
the similarity of the target foreground, the posterior probability of the target, the deformable characteristics and the target space positioning obtained in the online discriminant correlation filter are cascaded along the channel and input into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.
2. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the obtaining of the correlation between the current target feature and the existing memory is specifically:
constructing a target similarity matching model, wherein keys in the model are target characteristics of an initial frame, and values are foreground and background segmentation masks of a first frame of a video;
when a new frame I_{t-1} has been segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; for the current target feature and the existing target feature memory, the two are first reshaped and then matrix-multiplied to obtain their affinity matrix, which expresses the correlation between them.
3. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the high correlation part between the existing target feature memory and the current target feature is merged according to the obtained feature correlation, the part with medium correlation with the existing target feature memory is expanded into the memory, and the part with low correlation is discarded, so that the dynamic adaptive adjustment of memory embedding is realized by:
initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query characteristic with the existing target characteristic memory, and finding out a similar part of the target query characteristic and the existing target characteristic memory;
for each element in the target query, the affinity matrix is searched to find the part of the target feature memory M_k ∈ R^{Thw} that is most similar to it;
if the maximum correlation between the two is greater than an upper limit, then, following the single-mapping principle of hashing in which records with the same key are inserted into the same storage slot, a weighted fusion is used to update the part of the current features that is highly correlated with the existing memory into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are appended directly to the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
4. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the updating of the part of the current feature having high correlation with the existing memory into the existing memory embedding is specifically:
merging the high correlation part between the existing memory and the current target feature:
M_k(j') = β F_{t-1}(i) + (1 - β) M_k(j),
Figure FDA0003140298970000021
Figure FDA0003140298970000022
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, respectively, j' is the index at which the merged entry is stored, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground, Y^b_{t-1} is the background, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the index of a feature point in the current feature that has medium similarity to the existing memory, and j is the index of the part of the existing memory that has medium similarity to the current feature;
the current target characteristics with the correlation value higher than the average value with the existing memory are directly expanded into the existing memory:
Figure FDA0003140298970000023
where Union(·) denotes the union operation of the current feature and the corresponding memory,
Figure FDA0003140298970000024
is the union of the existing key memory and the target feature at the previous moment,
Figure FDA0003140298970000025
is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and
Figure FDA0003140298970000026
is the union of the existing target background value memory and the target background mask at the previous moment.
5. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the target deformation state in the current target feature is captured by aggregating weighted correlation between pixel points in the query feature and the whole reference feature, and the correspondence relationship established between similar target portions is specifically:
constructing an association between each pixel in the query feature F and the entire template, and applying a non-shared attention mechanism to compute the correlation between each pixel r_j in the key and f_i, thereby obtaining a similarity function;
normalizing e_ij over all pixels of the template R by the softmax function to obtain normalized similarity weights, which are used to aggregate the pixel-wise features in the template and generate the target deformation feature corresponding to f_i;
and combining the query feature with the target deformation feature through a residual connection to obtain the feature containing deformation information, and enhancing the reliability of the target deformation feature with the target foreground probability.
6. A deformable single-target tracking device based on dynamic compact memory embedding, characterized in that the device comprises: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-5.
CN202110736925.2A 2021-06-30 2021-06-30 Deformable single-target tracking method and device based on dynamic compact memory embedding Active CN113705325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110736925.2A CN113705325B (en) 2021-06-30 2021-06-30 Deformable single-target tracking method and device based on dynamic compact memory embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110736925.2A CN113705325B (en) 2021-06-30 2021-06-30 Deformable single-target tracking method and device based on dynamic compact memory embedding

Publications (2)

Publication Number Publication Date
CN113705325A (en) 2021-11-26
CN113705325B (en) 2022-12-13

Family

ID=78648216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110736925.2A Active CN113705325B (en) 2021-06-30 2021-06-30 Deformable single-target tracking method and device based on dynamic compact memory embedding

Country Status (1)

Country Link
CN (1) CN113705325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082430A (en) * 2022-07-20 2022-09-20 Institute of Automation, Chinese Academy of Sciences Image analysis method and device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
US20190147602A1 (en) * 2017-11-13 2019-05-16 Qualcomm Technologies, Inc. Hybrid and self-aware long-term object tracking
CN111612817A (en) * 2020-05-07 2020-09-01 Guilin University of Electronic Technology Target tracking method based on depth feature adaptive fusion and context information
CN111833378A (en) * 2020-06-09 2020-10-27 Tianjin University Multi-unmanned aerial vehicle single-target tracking method and device based on proxy sharing network
CN111951297A (en) * 2020-08-31 2020-11-17 Zhengzhou University of Light Industry Target tracking method based on structured pixel-by-pixel target attention mechanism

Non-Patent Citations (2)

Title
TING ZHANG et al.: "Design and Implementation of Dairy Food Tracking System Based on RFID", 《2020 INTERNATIONAL WIRELESS COMMUNICATIONS AND MOBILE COMPUTING》 *
Tang Yiming et al.: "A Survey of Visual Single-Target Tracking Algorithms", 《Measurement & Control Technology》 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN115082430A (en) * 2022-07-20 2022-09-20 Institute of Automation, Chinese Academy of Sciences Image analysis method and device and electronic equipment
CN115082430B (en) * 2022-07-20 2022-12-06 Institute of Automation, Chinese Academy of Sciences Image analysis method and device and electronic equipment

Also Published As

Publication number Publication date
CN113705325B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
Bai et al. Sequence searching with CNN features for robust and fast visual place recognition
Lu et al. Learning transform-aware attentive network for object tracking
CN113408492A (en) Pedestrian re-identification method based on global-local feature dynamic alignment
Huang et al. End-to-end multitask siamese network with residual hierarchical attention for real-time object tracking
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
Song et al. Context-interactive CNN for person re-identification
Pan et al. Correlation filter tracker with siamese: a robust and real-time object tracking framework
CN113705325B (en) Deformable single-target tracking method and device based on dynamic compact memory embedding
Li et al. Self-supervised monocular depth estimation with frequency-based recurrent refinement
Yu et al. Learning dynamic compact memory embedding for deformable visual object tracking
Elayaperumal et al. Learning spatial variance-key surrounding-aware tracking via multi-expert deep feature fusion
Cores et al. Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos
Liao et al. Multi-scale saliency features fusion model for person re-identification
Xu et al. Learning the distribution-based temporal knowledge with low rank response reasoning for uav visual tracking
Qu et al. Source-free Style-diversity Adversarial Domain Adaptation with Privacy-preservation for person re-identification
Li et al. Dynamic feature-memory transformer network for RGBT tracking
Yang et al. IASA: An IoU-aware tracker with adaptive sample assignment
CN116543250A (en) Model compression method based on class attention transmission
Liu et al. Pseudo-label growth dictionary pair learning for crowd counting
Wang et al. Online convolution network tracking via spatio-temporal context
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
Luo et al. Selective relation-aware representations for person re-identification
Zhou et al. A target response adaptive correlation filter tracker with spatial attention
He et al. Multiple camera styles learning for unsupervised person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant