CN113705325A - Deformable single-target tracking method and device based on dynamic compact memory embedding
- Publication number
- CN113705325A (application CN202110736925.2A)
- Authority
- CN
- China
- Prior art keywords
- target
- memory
- feature
- existing
- correlation
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a deformable single-target tracking method and device based on dynamic compact memory embedding. The method comprises the following steps: performing target correlation matching based on compact memory embedding to obtain the target foreground/background similarity and the target posterior probability; using a dynamic adjustment mechanism for the compact memory embedding that selects, according to feature correlation, the high-quality parts of the current target features and integrates them into the memory; capturing the target deformation state in the current target features through a pixel-by-pixel global association with the reference features, thereby extracting deformable features; concatenating the four feature maps (target foreground similarity, deformation features, and so on) along the channel dimension and feeding them into a decoder to obtain a refined target segmentation mask; and obtaining a rectangular bounding box of the target from the segmentation mask to realize visual tracking and localization. The device comprises a processor and a memory. The invention addresses challenging problems in the tracking process such as occlusion, target deformation, target appearance change, and interference from similar targets.
Description
Technical Field
The invention relates to the field of single-target visual tracking, in particular to a deformable single-target tracking method and device based on dynamic compact memory embedding.
Background
Visual object tracking is a fundamental and challenging task in the field of computer vision. It has many practical applications, such as traffic monitoring, human-computer interaction, autonomous robots, and autonomous driving. Although existing tracking methods have significantly improved in both accuracy and robustness, some challenges remain to be solved, such as occlusion, deformation, and background clutter.
The Siamese tracker is a simple and efficient tracking algorithm. During tracking, the location of maximum correlation between the search area and the template is regarded as the location of the target. To achieve better model generalization, Siamese trackers are typically trained on large amounts of labeled data. SINT (visual tracking by Siamese instance search) and SiameseFC (visual tracking based on a Siamese fully convolutional network) had a milestone impact on the development of Siamese tracking; they were the first attempts to train a Siamese network end-to-end for visual tracking. SiamRPN++ (a tracking model based on a large backbone network and a Siamese region proposal network) and SiamDW (a tracking model based on wider and deeper Siamese networks) adapted the structure of the ResNet (residual-connection) model and successfully applied it to Siamese tracking, significantly improving tracking performance. SiamRPN (a Siamese region proposal network based tracking model) applies a Region Proposal Network (RPN) to the Siamese tracking network; the two-branch network has a classification head for the background-foreground separation of anchors and a regression head for proposal box refinement. Compared with anchor-based methods, anchor-free tracking methods (SiamFC++ (a Siamese fully convolutional tracking model with anchor-free regression), SiamCAR (a Siamese fully convolutional classification and regression model for visual tracking), SiamBAN (a tracking model based on adaptive bounding box regression), and Ocean (an anchor-free tracking method based on target awareness)) avoid a large number of anchor presets, thereby significantly reducing the model hyper-parameters; these methods allow more flexible regression of the target bounding box. Although Siamese tracking methods are simple and effective, fixed templates have difficulty expressing changes in the appearance and scale of the target. MOSSE (a tracking method based on adaptive correlation filtering) is the pioneering work of Discriminative Correlation Filter (DCF) tracking. As an online tracking method, the correlation filter adapts better to changes in target appearance and scale. Subsequent improvements, such as continuous convolution, dynamic updating of the training set, spatial regularization, and temporal smoothness regularization, further improved the performance of DCF-based trackers. By combining the online update of DCF with the target localization refinement of IoU-Net (a regression network that maximizes the target overlap), ATOM (accurate tracking by overlap maximization) and DiMP (visual tracking by discriminative model prediction) achieved the best tracking performance among template-matching trackers at the time.
CSR-DCF (a tracking method combining correlation filtering and color segmentation) constructs a target mask from the color histograms of the foreground and background and then applies this mask to the correlation filter, suppressing boundary effects very well. SiamMask (a tracking method combined with target segmentation) adds a segmentation branch to the Siamese tracking model and enhances the expressiveness of the model through the segmentation loss. Compared with common Video Object Segmentation (VOS) methods, SiamMask achieves a higher tracking speed by adopting a lightweight segmentation network. Unlike "tracking by detection", D3S (a discriminative segmentation tracking model) innovatively replaces the target regression branch of the tracking model with a segmentation network; by combining the precise localization of DCF and the robustness of the segmentation model to target deformation, D3S achieves state-of-the-art tracking performance. DMB (a segmentation tracking model based on dual memory banks) provides a rich reference for segmenting the current object by storing the historical appearance and spatial localization information of objects.
Rich memory embedding provides sufficient reference information for video analysis tasks. MemTrack (a tracking model with a dynamic memory network) uses a dynamic memory network to overcome the tracking drift caused by fixed templates, while STM (a spatio-temporal memory based video segmentation method) stores dense historical frame features and masks for spatio-temporal matching at the current pixel level. The dense reference information makes it possible to handle appearance changes and occlusions during VOS very well. To avoid redundancy from excessive memory storage, AFB-URR (a video segmentation model combining an adaptive feature bank and uncertain-region refinement) proposes an adaptive feature bank to organize historical information dynamically; it merges similar memories using weighted averages and follows a cache replacement strategy to eliminate the memories with the lowest query frequency.
Fixed matching templates make it difficult to capture changing target representations, especially for non-rigid objects. To this end, SiamAttn (a Siamese attention based tracking method) proposes a deformable Siamese attention mechanism that computes deformable self-attention and cross-attention between template features and search features. Self-attention uses spatial attention to learn context information, and cross-attention aggregates the rich contextual interdependencies between the target template and the search region, thereby updating the target template implicitly. Deformable-DETR (a deformable multi-head attention target detection model) applies a multi-scale deformable attention module instead of the Transformer (multi-head attention) mechanism to process the feature map; thanks to the flexibility of its weights, the method performs excellently on target detection.
To some extent, while DCF-based trackers alleviate the difficulty of fixed templates adapting to scene changes, template-matching-based approaches have two limitations:
first, the template from a single initial frame cannot provide enough target information for target matching; second, existing matching is performed only between corresponding pixels, so the matching result is too coarse to capture sufficiently fine-grained target deformation information.
Disclosure of Invention
The invention provides a deformable single-target tracking method and device based on dynamic compact memory embedding, which solve challenging problems in the tracking process such as occlusion, target deformation, target appearance change, interference from similar targets, and complex backgrounds, as described in detail below:
a deformable single-target tracking method based on dynamic compact memory embedding, the method comprising:
a feature affinity matrix generated in the target similarity matching process represents the correlation between the current target features and the existing target feature memory; the top K values of the feature affinity matrix are selected row by row and averaged to obtain the target foreground similarity and the target posterior probability;
according to the acquired feature correlation, combining a high correlation part between the existing target feature memory and the current target feature, expanding a part with medium similarity to the existing target feature memory into the memory, and discarding a low correlation part, thereby realizing dynamic self-adaptive adjustment of memory embedding and obtaining compact memory embedding;
acquiring the target deformation state in the current target features by adopting a pixel-by-pixel global association between the query features and the reference features and by aggregating the weighted correlations between query pixels and reference features, establishing correspondences between similar target parts, and thereby extracting deformable features;
the similarity of the target foreground, the posterior probability of the target, the deformable characteristics and the target space positioning obtained in the online discriminant correlation filter are cascaded along the channel and input into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.
Further, the obtaining of the correlation between the current target feature and the existing memory specifically includes:
after a new frame I_{t-1} is segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; for the current target features and the existing target feature memory, dimension transformation is first applied to both, and matrix multiplication then yields their affinity matrix, which expresses their correlation.
According to the acquired feature correlation, the parts of the current target features that are highly correlated with the existing target feature memory are merged into it, the parts with medium correlation to the existing memory are extended into the memory, and the low-correlation parts are directly discarded, thereby realizing the dynamic adaptive adjustment of the memory embedding. The dynamic adjustment process specifically comprises:
initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query with the existing target characteristic memory, and finding out a similar part of the target query and the existing target characteristic memory;
for each element in the target query, the affinity matrix is searched to obtain the part of the target feature memory M_k ∈ R^{Thw} most similar to it;
if the maximum correlation between the two is larger than an upper threshold, then, following the hash single-mapping principle that records with the same key are inserted into the same storage space, a weighted fusion is adopted and the parts of the current features that are highly correlated with the existing memory are updated into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are directly extended into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
Further, the updating of the part of the current feature that has high correlation with the existing memory into the existing memory embedding specifically includes:
merging the high correlation part between the existing memory and the current target feature:
M_k(j') = β·F_{t-1}(i) + (1-β)·M_k(j),
M_vf(j') = β·Y^f_{t-1}(i) + (1-β)·M_vf(j),  M_vb(j') = β·Y^b_{t-1}(i) + (1-β)·M_vb(j),
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, j' is the subscript index of the merged entry in the memory, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground, Y^b_{t-1} is the background, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the subscript index of the similar feature point in the current features, and j is the subscript index of the corresponding similar part in the existing memory;
the current target features whose correlation with the existing memory is higher than the average are directly extended into the existing memory:
M_k = Union(M_k, F_{t-1}(i)),  M_vf = Union(M_vf, Y^f_{t-1}(i)),  M_vb = Union(M_vb, Y^b_{t-1}(i)),
where Union(·,·) denotes the union operation between the current feature and the corresponding memory; the results are, respectively, the union of the existing key memory and the target feature at the previous time, the union of the existing target foreground value memory and the previous target foreground mask, and the union of the existing target background value memory and the previous target background mask.
Acquiring a target deformation state in the current target feature by aggregating the weighted correlation between the query pixel and the reference feature, wherein the establishment of the corresponding relationship between similar target parts specifically comprises:
constructing the association between each pixel in the query feature F and the whole template, and applying a non-shared attention mechanism to compute the correlation between each pixel r_j in the key and f_i, obtaining a similarity function;
normalizing e_ij over all pixels of the template R with the softmax function to obtain normalized similarity weights, which are used to aggregate the pixel-wise features of the template and generate the target deformation feature corresponding to f_i;
and connecting the query feature with the target deformation feature by using the residual error to obtain the feature containing deformation information, and enhancing the credibility of the target deformation feature by adopting the target foreground probability.
In a second aspect, a deformable single-target tracking device based on dynamic compact memory embedding, the device comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any one of the first aspect.
The technical scheme provided by the invention has the beneficial effects that:
1. in the process of matching the similarity of the targets, a dynamically adjusted compact target memory bank is introduced, so that effective query references can be provided for matching the similarity of the targets under the complex tracking conditions of shielding, background clutter and the like; in addition, a dynamic memory adjustment mechanism based on a Hash algorithm can effectively ensure the compactness and high quality of memory, and avoid memory redundancy and unnecessary memory retrieval;
2. the deformable feature learning module provided by the invention effectively obtains the deformation information of the current target by establishing the global corresponding relation between each pixel in the query features and the features of the whole reference template, and further solves the problem of deformation of the target in the tracking process;
3. Extensive simulation experiments on five challenging tracking benchmarks (VOT2016, VOT2018, VOT2019, GOT-10k, and TrackingNet) show that the method has clear advantages over other recent trackers; in particular, it obtains an EAO score of 0.508 on VOT2018, currently ranked first.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the model of the present invention;
the model mainly comprises a tracker based on DCF, a deformation feature extraction module, a similarity matching module based on compact memory embedding, and an up-sampling segmentation module.
FIG. 2 is a schematic diagram of a target affinity match and compact memory embedding architecture of the present invention;
To locate the target in the current frame, similarity matching is performed between the query feature F and the target feature memory M_k. Specifically, after dimension transformation, F and M_k are multiplied to construct an affinity matrix A ∈ R^{hw×Thw} between the current feature and the memory. A then retrieves the value memories M_vf and M_vb, and Top-K averaging of the retrieved affinity features along the column dimension yields the foreground and background target similarities S_f and S_b. After target segmentation and tracking are completed, the query feature F and the obtained target foreground and background segmentation masks Y_f and Y_b are updated into the corresponding memories according to the compact memory adjustment mechanism proposed by the invention.
FIG. 3 is a diagram illustrating a comparison of tracking effects of different memory storage methods;
FIG. 4 is a visualization diagram of the deformation characteristics for improving the model tracking effect;
the embedding of compact memory effectively improves the discrimination capability of the model on similar interference. However, the model still cannot completely segment the edge details of the target, and the deformation feature learning module effectively solves the problem and realizes accurate segmentation of the target.
FIG. 5 is a schematic diagram showing the result of a segmentation visualization experiment of the data set DAVIS2017 according to the method;
FIG. 6 is a schematic diagram showing the result of the visual experiment of the method on VOT2016, VOT2018 and VOT 2019;
FIG. 7 is a schematic structural diagram of a deformable single-target tracking device based on dynamic compact memory embedding.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
In an embodiment of the present invention, to solve the problems of the background art, a compact memory embedding for deformable visual tracking is proposed. A single initial reference feature contains only a limited amount of target information, and in video sequences the target typically undergoes significant appearance and shape changes. The invention therefore provides a dynamic memory embedding adjustment mechanism: the correlation between the query features and the existing memory is obtained by retrieving the feature affinity matrix generated during similarity matching.
Embodiments of the present invention then merge the highly correlated parts of the existing memory and the current target features. In addition, parts with moderate similarity to the memory are extended into the memory, while irrelevant parts are directly discarded. High-quality memory embedding provides complete target change information from the historical frames, effectively handling problems such as target occlusion and similar distractors. Moreover, compared with existing matching methods based on correlation operations, the deformable feature extraction method proposed in the embodiment of the invention adopts a pixel-by-pixel global association and effectively captures the target deformation state in the query features by aggregating the weighted correlations between query pixels and reference features.
Fig. 1 shows an overall flowchart of a tracking method in an embodiment of the present invention. In order to meet the challenge in the tracking process, the model of the method mainly comprises three key components, namely a target similarity matching module based on compact memory embedding, a deformable feature learning module and a tracker based on discriminant correlation filtering.
To track and segment the video frame I_t at the current time t, the frame I_t and the initial frame I_1 are first encoded by the backbone network (ResNet50) into query features and reference features. To improve computational efficiency, the method reduces the extracted query-frame and reference-frame features to 64 channels; the two features are denoted F_t and R, respectively.
The target similarity matching module used in the embodiment of the present invention follows the self-attention mechanism of the Transformer (multi-head attention). The dynamic memory adjustment mechanism expands the current target feature F_t and the target segmentation mask into compact memory stores M_k and M_v (where M_k denotes the memory of target features and M_v the memory of the target foreground and background segmentation masks), which effectively overcomes occlusion and appearance variation during tracking. The target deformable feature learning module performs a pixel-by-pixel global comparison between the query feature and the reference feature and establishes a correspondence DF(F_t, R) between similar target parts; this global target correspondence can therefore fully capture the deformation information of the target. The discriminative tracker is used to extract the target localization information L_t. Following the recent deep correlation-filter-based tracker ATOM, the method first uses a 1×1 convolutional layer to reduce the backbone features to 64 channels, and then processes them with a 4×4 convolutional layer and the continuously differentiable activation function PELU (parameterized exponential linear unit). The maximum of the activated feature map is taken as the target spatial localization. The tracking module is trained online by an efficient back-propagation method. Finally, the target similarity matching result, the deformable features, and the target localization information are concatenated along the channel dimension, and the combined features are refined and segmented by a three-stage upsampling segmentation module. A contour transformation is applied to the segmented target mask to obtain the target bounding box.
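For orientation only, the following is a minimal PyTorch-style sketch of how these components could be orchestrated in one tracking step; the module names and interfaces (backbone, matcher, deform, dcf, decoder, memory) are illustrative assumptions, not the actual implementation.

```python
import torch

def track_frame(frame, backbone, matcher, deform, dcf, decoder, memory):
    """One tracking step following the pipeline of Fig. 1 (hypothetical interfaces)."""
    F_t = backbone(frame)                        # query features, reduced to 64 channels
    S_f, S_b, P = matcher(F_t, memory)           # foreground/background similarity + posterior
    DF = deform(F_t, memory.reference, P)        # deformable features w.r.t. the reference R
    L_t = dcf(F_t)                               # target spatial localization from the online DCF
    fused = torch.cat([S_f, P, DF, L_t], dim=1)  # concatenate the four maps along the channel dim
    mask = decoder(fused)                        # refined target segmentation mask
    memory.update(F_t, mask)                     # dynamic compact memory adjustment
    return mask
```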
I. Object similarity matching based on compact memory embedding
1. Object similarity matching
Accurately separating the target from a complex background requires reliable reference information. Matching-based VOS methods make full use of the target annotation in the initial frame to match the current target accurately. Following the conventional attention mechanism, the target similarity matching module in this embodiment comprises three parts: query, key, and value. Fig. 2 shows the structure of the target similarity matching module. The query feature F_t ∈ R^{h×w×c} is the target representation of the current frame. In contrast, the key R ∈ R^{h×w×c} is the target feature of the initial frame, and the values (Y_1^f and Y_1^b) are the foreground and background segmentation masks of the first video frame, where h is the pixel height of the target feature map, w the pixel width, c the number of channels, R^{...} denotes a multi-dimensional matrix, Y_1^f is the foreground segmentation mask of the first frame, Y_1^b the background segmentation mask, and the superscripts f and b denote the target foreground and background, respectively.
In video sequences, the target typically undergoes significant appearance and structural changes. Similarity matching using only the fixed initial target information cannot guarantee the quality of target matching. When a new frame, e.g., I_{t-1}, is segmented by the model, the current target query feature F_{t-1} and the obtained masks (Y^f_{t-1} and Y^b_{t-1}) are merged with the historical information. By inserting historical information into the key and value memories, the memory embedding (M_k ∈ R^{T×h×w×c}, M_vf ∈ R^{T×h×w×1}, and M_vb ∈ R^{T×h×w×1}) comes to contain rich target appearance information, where M_k is the target feature memory, M_vf the target foreground value memory, M_vb the target background value memory, and T the number of accumulated video frames.
To establish a pixel-level association of the target between the key memory and the query features, the method first generates an affinity matrix A. Specifically, for better matching, the key memory M_k and the query F_t are first normalized pixel by pixel along each channel. M_k and F_t are then reshaped to Thw × c and hw × c, respectively.
A = F_t * (M_k)^T,    (1)
where * denotes matrix multiplication, (·)^T denotes matrix transposition, and A ∈ R^{hw×Thw}.
The affinity matrix A measures the similarity between each pixel of the query feature F_t and the key memory M_k. To obtain an accurately matched target, the value memories must be further retrieved. The foreground and background value memories M_vf and M_vb are therefore reshaped to Thw × 1. For each index i of the hw query positions, the affinity vector a_i ∈ R^{Thw} retrieves the value memory vectors M_vf ∈ R^{Thw×1} and M_vb:
s_i^f = a_i ⊙ M_vf,    s_i^b = a_i ⊙ M_vb,
where s_i^f and s_i^b are the results of the affinity vector retrieving the target foreground and background value memory vectors, and i is the index of the i-th feature vector along the target feature dimension hw.
A matching score with high confidence ensures the accuracy of target matching. Thus, embodiments of the present invention apply a top-K averaging function to extract the target scores from the retrieved vectors s_i^f and s_i^b:
S_i^f = (1/K) Σ_{j ∈ Top-K} s_i^f(j),
where the sum aggregates the top K matching scores in the matching score matrix and j indexes the j-th of those K scores. In this method, K is set to 3, and the background matching is computed in the same way. The final foreground and background matching results S_f ∈ R^{hw} and S_b ∈ R^{hw} are used to generate the target posterior probability P, i.e., the probability of the target foreground relative to the background.
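As an illustration of equation (1) and the top-K averaging above, the following is a minimal PyTorch sketch; the per-vector normalization and the softmax used to turn S_f and S_b into the posterior P are assumptions.

```python
import torch
import torch.nn.functional as F_

def similarity_match(F_t, M_k, M_vf, M_vb, K=3):
    """F_t: (hw, c) query; M_k: (Thw, c) key memory;
    M_vf, M_vb: (Thw, 1) foreground/background value memories."""
    # pixel-wise normalization before matching (assumed L2 normalization)
    F_t = F_.normalize(F_t, dim=1)
    M_k = F_.normalize(M_k, dim=1)
    A = F_t @ M_k.t()                              # affinity matrix, (hw, Thw), Eq. (1)
    s_f = A * M_vf.t()                             # retrieved foreground scores, (hw, Thw)
    s_b = A * M_vb.t()                             # retrieved background scores
    S_f = s_f.topk(K, dim=1).values.mean(dim=1)    # top-K averaging along the column dimension
    S_b = s_b.topk(K, dim=1).values.mean(dim=1)
    # posterior probability of foreground vs. background (assumed softmax over the two scores)
    P = torch.softmax(torch.stack([S_f, S_b], dim=0), dim=0)[0]
    return A, S_f, S_b, P
```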
2. Compact memory embedding
The rich memory embedding can effectively improve the precision of the target similarity matching. However, due to the limited storage capacity of the computing device, it is not possible to store all historical frame information in the memory base, especially for long videos containing more than 1K frames. In addition, storing such target information may result in redundancy of memory-embedded storage, and unnecessary matching queries, since targets in adjacent frames may be too similar, or target occlusion may occur.
Existing methods only use a few neighboring historical frames, or select a subset of historical frames at equal intervals; these methods lose some valid reference information. AFB-URR (a video segmentation model combining an adaptive feature bank and uncertain-region refinement) merges similar parts of the memory and deletes the parts with the lowest query frequency, but it may still introduce irrelevant historical information into the memory, which easily leads to accumulation of model errors until the model drifts. Inspired by the hash algorithm, a dynamic compact memory adjustment mechanism is developed for the model to form a more effective base of target reference information. FIG. 2 illustrates the dynamic compact memory embedding architecture. Based on the affinity matrix A, embodiments of the present invention merge the parts of the current features whose similarity to the existing memory is above an upper threshold. To avoid mismatches due to low-quality memory, target features with low correlation are discarded directly.
Matching-based VOS methods mostly use the target information of the initial frame as the reference template, because the initial features with ground-truth labels give an accurate and complete target description. Thus, the method uses the target information of the first frame in the sequence to initialize the memory embedding and makes it the main part of the memory bank. The target in the most recent video frame is most similar to the current target, but considering the model's estimation error, its query preference is reduced during target matching. For example, during model inference, the video frame I_{t-1} has been segmented, yielding the target query F_{t-1} and the foreground and background segmentation masks Y^f_{t-1} and Y^b_{t-1}. To extract useful reference information, the target query F_{t-1} is first compared with the existing target feature memory M_k to find the parts similar between them. The affinity matrix A ∈ R^{hw×Thw} generated during target similarity matching measures the correlation between the query feature F_{t-1} and the key memory M_k. Thus, embodiments of the present invention manage the memory embedding dynamically and directly using the affinity matrix A. For each element F_{t-1}(i) of F_{t-1}, the affinity matrix A is searched to find the part of M_k ∈ R^{Thw} most similar to it:
Re(F_{t-1}(i)) = max_j A(i, j),
where Re(·) is the maximum value in each row of the affinity matrix A and A(i, j) is an element of A.
If the maximum correlation A(i, j) between the two is greater than an upper threshold ζ, they are considered sufficiently similar. In a hash algorithm, records in the hash map sharing the same key are inserted into the same storage space. Thus, embodiments of the present invention keep only one of several similar features in memory. Considering the diversity of the memory, a weighted combination of the similar feature and the corresponding memory is used, which avoids unnecessary retrieval and memory redundancy. From the above analysis, the initial reference information is the most accurate. Therefore, a small fusion weight β is adopted when updating the current features into the existing memory embedding, so as to avoid interference from model errors. In all experiments, ζ is set to 0.95 and β to 0.001. The online update of the memory embedding is:
M_k(j') = β·F_{t-1}(i) + (1-β)·M_k(j),    (6)
M_vf(j') = β·Y^f_{t-1}(i) + (1-β)·M_vf(j),  M_vb(j') = β·Y^b_{t-1}(i) + (1-β)·M_vb(j),
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, and j' is the subscript index of the merged entry in the memory.
For Re(F_{t-1}(i)) < ζ, embodiments of the invention select the features whose correlation with the existing memory is higher than the average correlation and extend them into the existing memory to preserve the diversity of the memory embedding. At the same time, querying irrelevant memories is avoided by merging into an efficient compact memory store as follows:
M_k = Union(M_k, F_{t-1}(i)),  M_vf = Union(M_vf, Y^f_{t-1}(i)),  M_vb = Union(M_vb, Y^b_{t-1}(i)),
where Union(·,·) denotes the union operation between the current feature and the corresponding memory; the results are, respectively, the union of the existing key memory and the target feature at the previous time, the union of the existing target foreground value memory and the previous target foreground mask, and the union of the existing target background value memory and the previous target background mask.
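The dynamic adjustment rule (weighted merge above ζ, extension above the average, discard otherwise) could be sketched as follows; the tensor layout and the in-place bookkeeping are simplifying assumptions.

```python
import torch

def update_memory(A, F_prev, Yf_prev, Yb_prev, M_k, M_vf, M_vb,
                  zeta=0.95, beta=0.001):
    """A: (hw, Thw) affinity; F_prev: (hw, c); Yf/Yb_prev: (hw, 1);
    M_k: (Thw, c); M_vf, M_vb: (Thw, 1). Returns updated memories."""
    max_sim, j_idx = A.max(dim=1)                 # Re(F_{t-1}(i)) and the matched memory index
    mean_sim = max_sim.mean()
    merge = max_sim > zeta                        # highly correlated: weighted merge, Eq. (6)
    extend = (~merge) & (max_sim > mean_sim)      # above-average correlation: append to memory
    # merge: weighted fusion into the matched memory slot
    j = j_idx[merge]
    M_k[j] = beta * F_prev[merge] + (1 - beta) * M_k[j]
    M_vf[j] = beta * Yf_prev[merge] + (1 - beta) * M_vf[j]
    M_vb[j] = beta * Yb_prev[merge] + (1 - beta) * M_vb[j]
    # extend: union with the existing memory; low-correlation parts are simply discarded
    M_k = torch.cat([M_k, F_prev[extend]], dim=0)
    M_vf = torch.cat([M_vf, Yf_prev[extend]], dim=0)
    M_vb = torch.cat([M_vb, Yb_prev[extend]], dim=0)
    return M_k, M_vf, M_vb
```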
FIG. 3 compares the compact memory embedding of an embodiment of the present invention with two other related methods. As shown in the first row, storing all historical memory or using an adaptive feature bank (AFB) improves the discriminative ability of target similarity matching to some extent. However, under complex background clutter, redundant memory can lead to false target matches. The memory embedding method of the embodiment of the invention fully exploits the diversity and compactness of the memory features and obtains better target matching performance.
II. Target deformable feature learning
As shown in fig. 4, with the help of compact memory embedding, target similarity matching has good discriminative ability and handles tracking problems such as target occlusion and background interference well. It also has some advantage in handling target deformation, but it cannot effectively address severe target deformation or fine spatial detail. Inspired by the attention mechanism, embodiments of the present invention propose target deformable feature learning to further alleviate this dilemma. Each pixel in the query feature F is associated with the whole template R to obtain complete target deformation information. Since the query and the key contain different target representations, for each pixel f_i in the query a non-shared attention mechanism is applied to compute the correlation between each pixel r_j in the key and f_i, giving a learnable similarity function:
e_ij = (W_F f_i)^T (W_R r_j),    (12)
where W_F and W_R denote learnable linear transformations mapping f_i and r_j into a higher-dimensional representation, and e_ij is the correlation between each pixel f_i in the query feature and each pixel r_j in the key feature.
To compare the similarity between f_i and different parts of the template R, e_ij is normalized over all pixels of the template R by the softmax function, giving the normalized similarity weight:
d_ij = exp(e_ij) / Σ_j exp(e_ij).
The normalized similarity weight d_ij is then used to aggregate the pixel-wise features in the template R and generate the deformation feature v_i corresponding to f_i:
v_i = Σ_j d_ij · φ_v(r_j),
where φ_v(·) denotes ReLU(W_v*(·)) and W_v*(·) denotes a linear transformation of the input features. Then, the query features and the obtained target deformation features are connected through a residual connection to obtain the features containing the deformation information:
D = F + φ_c(Concat(F, V)),
where φ_c(·) denotes ReLU(W_c*(·)) and is intended to reduce the dimensionality of the features, and Concat(·) denotes the feature concatenation operation. Background mismatches inevitably exist in the matched target deformation features. Therefore, the target foreground probability P (i.e., the target segmentation mask of the initial frame) is used to enhance the confidence of the deformation features.
Specifically, the obtained target deformation features D retrieve the foreground probability of the initial frame through a dot-product operation to generate the final deformable features:
DF = D ⊙ P.
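A compact sketch of this non-shared attention, operating on flattened hw×c features, is given below; the layer sizes and the exact residual form are assumptions.

```python
import torch
import torch.nn as nn

class DeformableFeature(nn.Module):
    """Pixel-to-global association between query F and template R (a sketch)."""
    def __init__(self, c=64, d=128):
        super().__init__()
        self.w_f = nn.Linear(c, d, bias=False)   # W_F, query projection
        self.w_r = nn.Linear(c, d, bias=False)   # W_R, key projection
        self.phi_v = nn.Sequential(nn.Linear(c, c, bias=False), nn.ReLU())      # phi_v
        self.phi_c = nn.Sequential(nn.Linear(2 * c, c, bias=False), nn.ReLU())  # phi_c

    def forward(self, F, R, P):
        """F: (hw, c) query; R: (hw, c) template; P: (hw, 1) foreground probability."""
        e = self.w_f(F) @ self.w_r(R).t()              # e_ij, Eq. (12)
        d = torch.softmax(e, dim=1)                    # d_ij, normalized over template pixels
        V = d @ self.phi_v(R)                          # deformation features v_i
        D = F + self.phi_c(torch.cat([F, V], dim=1))   # residual connection (assumed form)
        return D * P                                   # enhance with the foreground probability
```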
three, single target visual tracking
Four feature maps, namely the target foreground similarity and target posterior probability obtained from similarity matching, the target deformation features obtained by the deformation feature extraction module, and the target spatial localization obtained from the online discriminative correlation filter, are concatenated along the channel dimension and fed into a lightweight decoder (the upsampling segmentation module) to obtain the final refined target segmentation mask. A rectangular bounding box of the target is then obtained from the final segmentation mask using the contour detection function in OpenCV; the center of the bounding box gives the target position and its size gives the target scale.
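As an illustration of this last step, the sketch below converts a segmentation mask into a rotated bounding box with OpenCV contour functions; the threshold and the largest-contour selection are assumptions.

```python
import cv2
import numpy as np

def mask_to_box(mask, thresh=0.5):
    """mask: (H, W) float segmentation mask in [0, 1]. Returns a rotated box (4x2 corners)."""
    binary = (mask > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)   # keep the largest connected region
    rect = cv2.minAreaRect(largest)                # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)                     # 4 corner points of the rotated box
```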
IV. Concrete implementation steps and simulation experiments
In an embodiment of the invention, the first four stages of a ResNet50 pre-trained on ImageNet are used as the backbone network to extract features. Target segmentation is a pixel-level classification task that requires features with high-confidence semantics; therefore, the method uses the last backbone layer for target similarity matching and deformation feature extraction. The backbone features are then reduced to 64 channels by a 1×1 convolutional layer, followed by a 3×3 convolutional layer and ReLU activation. In the upsampling segmentation process, the first three stages of the backbone are used to supplement the spatial detail of the target. The method sets top-K to K = 3, and the upper threshold ζ for merging similar memories is set to 0.95. In preliminary experiments, the method found that the parameters of the memory embedding module and top-K have no obvious influence on tracking performance, so the above settings were fixed in all related experiments.
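For illustration, the channel-reduction head described above could be written as the following sketch; the 1024 input channels (the output of the fourth ResNet50 stage) are an assumption.

```python
import torch.nn as nn

# Hypothetical channel-reduction head: 1x1 conv to 64 channels, then 3x3 conv + ReLU
reduce_head = nn.Sequential(
    nn.Conv2d(1024, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)
```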
Model training: the compact memory embedding and the DCF-based tracking module are updated online and need no pre-training. Therefore, the method only pre-trains the similarity matching, deformable feature extraction, and upsampling segmentation modules on the YouTube-VOS dataset. Similar to the sampling strategy used in Siamese-network-based tracking models, a pair of masked images is sampled from a video sequence to construct a training sample. The method minimizes the cross-entropy loss with the Adam optimizer; the learning rate is 8×10^-4, decayed by a factor of 0.2 every 15 epochs. The whole training takes 60 epochs on a single Nvidia Titan XP GPU.
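The training schedule described above corresponds to a setup like the following sketch; model and train_loader are assumed placeholders, and the exact loss formulation may differ.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=8e-4)                        # Adam, lr = 8x10^-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)  # decay 0.2 every 15 epochs
criterion = torch.nn.CrossEntropyLoss()                                           # cross-entropy segmentation loss

for epoch in range(60):                                            # 60 epochs in total
    for query_img, ref_img, ref_mask, gt_mask in train_loader:     # paired masked images
        optimizer.zero_grad()
        pred = model(query_img, ref_img, ref_mask)
        loss = criterion(pred, gt_mask)
        loss.backward()
        optimizer.step()
    scheduler.step()
```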
Model inference: in the VOT task, the video sequence provides a given bounding box label. The method first generates a pseudo mask as a pseudo label from the ground-truth box of the initial frame, and then initializes the model with the generated pseudo label. During tracking, the current frame is processed by the segmentation model to obtain the target segmentation mask. The compact memory bank is updated online using the query features and the generated segmentation mask. Finally, the resulting segmentation mask is converted into a rotated target bounding box as the tracking result.
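The pseudo-mask initialization from the first-frame box could be as simple as the following sketch; the actual generation in the patent may be more elaborate.

```python
import numpy as np

def bbox_to_pseudo_mask(h, w, box):
    """box: (x, y, bw, bh) axis-aligned ground-truth box of the initial frame."""
    x, y, bw, bh = [int(round(v)) for v in box]
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[max(y, 0):min(y + bh, h), max(x, 0):min(x + bw, w)] = 1  # fill the box region as foreground
    return mask
```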
Simulation results: Tables 1-3 show the evaluation results of the tracking model of the embodiment of the invention on the public single-target tracking benchmarks VOT2016, VOT2018, and VOT2019. Comparing the method with several recent tracking models (such as SiamMask, SiamRPN++, ATOM, D3S, and Ocean), the experimental results demonstrate its superiority and verify its effectiveness. Table 4 shows the evaluation results of the model on the video object segmentation benchmark DAVIS2017, which also fully verify the validity and advancement of the model. Figs. 5 and 6 show the visual experimental results of the method on the VOT series and DAVIS2017 benchmarks.
The comparison tracking methods referred to in the embodiment of the invention are as follows: SiamMask (tracking method combined with target segmentation); D3S (discriminative segmentation tracking model); CCOT (continuous-convolution based tracking method); CSR-DCF (correlation filter tracking method based on channel and spatial reliability); ASRCF (adaptive spatial regularization based correlation filter tracking method); SiamDW (tracking model based on wider and deeper Siamese networks); SiamRPN++ (tracking model based on a large backbone network and a Siamese region proposal network); SiamRPN (Siamese region proposal network based tracking model); SiamBAN (adaptive bounding box regression based tracking model); Ocean-on/off (anchor-free tracking method based on target awareness); UpdateNet (Siamese tracking method based on online template update); SPM (real-time visual object tracking method based on series-parallel matching); ATOM (accurate tracking by target overlap maximization); DiMP (discriminative model prediction based visual tracking method); C-RPN (Siamese cascaded region proposal network tracking method); SiamGraph (Siamese attention-graph model based tracking method); LADCF (discriminative tracking method based on temporal consistency constraints); OSMN (network modulation based efficient video object segmentation method); STM (spatio-temporal memory based video segmentation method); VM (video segmentation method based on object matching); FAVOS (tracking-parts based video segmentation method); OnAVOS (segmentation method based on online adaptive convolutional networks).
Table 1. Performance comparison of multiple tracking methods on the VOT2016 benchmark
Table 2. Performance comparison of multiple tracking methods on the VOT2018 benchmark
Table 3. Performance comparison of multiple tracking methods on the VOT2019 benchmark
Table 4. Performance comparison of multiple tracking and VOS methods on the DAVIS2017 benchmark
Based on the same inventive concept, an embodiment of the present invention further provides a deformable single-target tracking apparatus based on dynamic compact memory embedding, referring to fig. 7, the apparatus includes: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in an embodiment:
the feature affinity matrix generated in the target similarity matching process represents the correlation between the current target features and the existing target feature memory; the top K values of the feature affinity matrix are then selected row by row and averaged to obtain the target foreground and background similarity and the target posterior probability;
according to the acquired feature correlation, combining the high correlation part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low correlation part, thereby realizing the dynamic self-adaptive adjustment of memory embedding;
acquiring a target deformation state in the current target feature by adopting an association mode from pixel points to global features one by one and aggregating the weighted correlation between query pixels and reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;
the similarity of the target foreground, the posterior probability of the target, the deformable characteristics and the target space positioning obtained in the online discriminant correlation filter are cascaded along the channel and input into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.
The method for obtaining the correlation between the current target characteristics and the existing memory specifically comprises the following steps:
after a new frame I_{t-1} is segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; for the current target features and the existing target feature memory, dimension transformation is first applied to both, and matrix multiplication then yields their affinity matrix, which expresses their correlation.
Further, according to the obtained feature correlation, combining a high correlation part between the existing target feature memory and the current target feature, expanding a part with medium correlation with the existing target feature memory into the memory, and discarding an irrelevant part, so that the dynamic adaptive adjustment of memory embedding is realized as follows:
initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query with the existing target characteristic memory, and finding out a similar part of the target query and the existing target characteristic memory;
for each element in the target query, the affinity matrix is searched to obtain the part of the target feature memory M_k ∈ R^{Thw} most similar to it;
if the maximum correlation between the two is larger than an upper threshold, then, following the hash single-mapping principle that records with the same key are inserted into the same storage space, a weighted fusion is adopted and the parts of the current features that are highly correlated with the existing memory are updated into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are directly extended into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
Wherein, updating the part of the current characteristics which has high correlation with the existing memory into the existing memory embedding specifically comprises:
merging the high correlation part between the existing memory and the current target feature:
M_k(j') = β·F_{t-1}(i) + (1-β)·M_k(j),
M_vf(j') = β·Y^f_{t-1}(i) + (1-β)·M_vf(j),  M_vb(j') = β·Y^b_{t-1}(i) + (1-β)·M_vb(j),
where M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories, j' is the subscript index of the merged entry in the memory, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground, Y^b_{t-1} is the background, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the subscript index of the similar feature point in the current features, and j is the subscript index of the corresponding similar part in the existing memory;
the current target features whose correlation with the existing memory is higher than the average are directly extended into the existing memory:
M_k = Union(M_k, F_{t-1}(i)),  M_vf = Union(M_vf, Y^f_{t-1}(i)),  M_vb = Union(M_vb, Y^b_{t-1}(i)),
where Union(·,·) denotes the union operation between the current feature and the corresponding memory; the results are, respectively, the union of the existing key memory and the target feature at the previous time, the union of the existing target foreground value memory and the previous target foreground mask, and the union of the existing target background value memory and the previous target background mask.
Further, by aggregating the weighted correlation between the query pixel and the reference feature, the target deformation state in the current target feature is captured, and the correspondence relationship established between similar target portions is specifically:
constructing the association between each pixel in the query feature F and the whole template, and applying a non-shared attention mechanism to compute the correlation between each pixel r_j in the key and f_i, obtaining a similarity function;
normalizing e_ij over all pixels of the template R with the softmax function to obtain normalized similarity weights, which are used to aggregate the pixel-wise features of the template and generate the target deformation feature corresponding to f_i;
and connecting the query feature with the target deformation feature by using the residual error to obtain the feature containing deformation information, and enhancing the credibility of the target deformation feature by adopting the target foreground probability.
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The execution main bodies of the processor 1 and the memory 2 may be devices having a calculation function, such as a computer, a single chip, a microcontroller, and the like, and in the specific implementation, the execution main bodies are not limited in the embodiment of the present invention, and are selected according to the needs in the practical application.
The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the description of the embodiments of the present invention is not repeated here.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are generated in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, data center, etc., that comprises an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
In the embodiments of the present invention, unless the model of a device is specifically described, the models of the devices are not limited, as long as the devices can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the above-described embodiments of the present invention are provided for description only and do not imply any ranking of their merits.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A deformable single-target tracking method based on dynamic compact memory embedding is characterized by comprising the following steps:
a feature affinity matrix generated in the target similarity matching process represents the correlation between the current target feature and the existing target feature memory; the top K values of the feature affinity matrix are selected row by row and averaged to obtain the target foreground similarity and the target posterior probability;
according to the acquired feature correlation, merging the high-correlation part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low-correlation part, thereby realizing dynamic adaptive adjustment of the memory embedding;
acquiring the target deformation state in the current target feature by adopting a pixel-by-pixel-to-global-feature association and aggregating the weighted correlation between query pixels and the reference feature, and establishing correspondences between similar target parts to realize extraction of deformable features;
concatenating the target foreground similarity, the target posterior probability, the deformable features and the target spatial localization obtained from the online discriminative correlation filter along the channel dimension and inputting them into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning (a minimal illustrative sketch of the top-K averaging and mask-to-box steps is given after the claims).
2. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the obtaining of the correlation between the current target feature and the existing memory is specifically:
constructing a target similarity matching model, wherein keys in the model are target characteristics of an initial frame, and values are foreground and background segmentation masks of a first frame of a video;
after a new frame I_{t-1} is segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the formed memory embedding contains rich target appearance information; for the current target feature and the existing target feature memory, dimension transformation is first applied to both, and matrix multiplication is then carried out to obtain their affinity matrix, which expresses the correlation between the two.
3. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the high correlation part between the existing target feature memory and the current target feature is merged according to the obtained feature correlation, the part with medium correlation with the existing target feature memory is expanded into the memory, and the part with low correlation is discarded, so that the dynamic adaptive adjustment of memory embedding is realized by:
initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query characteristic with the existing target characteristic memory, and finding out a similar part of the target query characteristic and the existing target characteristic memory;
for each element in the target query, the affinity matrix is searched to obtain its most similar part in the target feature memory M_k ∈ R^{Thw};
if the maximum correlation between the two is larger than a certain upper limit value, then, following the principle of hash single mapping in which records with the same key are inserted into the same storage space, a weighted fusion is adopted and the part of the current feature with high correlation to the existing memory is updated into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are directly expanded into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.
4. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the updating of the part of the current feature having high correlation with the existing memory into the existing memory embedding is specifically:
merging the high correlation part between the existing memory and the current target feature:
M_k(j′) = βF_{t-1}(i) + (1-β)M_k(j), M_vf(j′) = βY^f_{t-1}(i) + (1-β)M_vf(j), M_vb(j′) = βY^b_{t-1}(i) + (1-β)M_vb(j),
wherein M_k(j′) is the merged target feature memory, M_vf(j′) and M_vb(j′) are the merged target foreground and background value memories respectively, j′ is the subscript index in the merged memory, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground mask, Y^b_{t-1} is the background mask, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the subscript index of the feature point in the current feature that matches the existing memory, and j is the subscript index of the matching part in the existing memory;
the current target features whose correlation with the existing memory is higher than the overall average are directly expanded into the existing memory:

M̃_k = Union(M_k, F_{t-1}), M̃_vf = Union(M_vf, Y^f_{t-1}), M̃_vb = Union(M_vb, Y^b_{t-1}),

wherein Union(·) represents the union operation of the current feature and the corresponding memory, M̃_k is the union of the existing key memory and the target feature at the previous moment, M̃_vf is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and M̃_vb is the union of the existing target background value memory and the target background mask at the previous moment.
5. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the target deformation state in the current target feature is captured by aggregating weighted correlation between pixel points in the query feature and the whole reference feature, and the correspondence relationship established between similar target portions is specifically:
constructing an association between each pixel in the query feature F and the entire template, and applying a non-shared attention mechanism to compute the similarity e_ij between each pixel r_j in the key and each query pixel f_i;

e_ij is normalized over all pixels of the template R by the softmax function, and the normalized similarity weights are used to aggregate the pixel-wise features in the template, generating the target deformation feature corresponding to f_i;
and connecting the query feature with the target deformation feature through a residual connection to obtain a feature containing deformation information, and enhancing the reliability of the target deformation feature by adopting the target foreground probability.
6. A deformable single-target tracking device based on dynamic compact memory embedding, characterized in that the device comprises: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-5.
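As a complement to claim 1, the following is a minimal NumPy sketch of the row-wise top-K averaging of the affinity matrix and of deriving a rectangular bounding box from the segmentation mask; the value of K, the binarization threshold and the helper names are illustrative assumptions rather than claimed parameters.

```python
import numpy as np

def topk_row_average(affinity, k=5):
    """Average the top-K values of each row of the feature affinity matrix.

    affinity : (N, M) correlation between current target features and memory.
    Returns an (N,) score interpreted here as a per-pixel target foreground
    similarity / posterior score (K=5 is an assumed, illustrative value).
    """
    k = min(k, affinity.shape[1])
    # Partition so the K largest values of each row sit at the end, then average.
    topk = np.partition(affinity, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

def mask_to_bbox(mask, thr=0.5):
    """Derive a rectangular bounding box (x0, y0, x1, y1) from a segmentation mask."""
    ys, xs = np.nonzero(mask > thr)
    if len(xs) == 0:
        return None  # no target pixels found
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The per-row average is used here as a rough foreground-similarity score; in practice such score maps would keep their spatial layout before being concatenated with the other cues and passed to the decoder.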
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110736925.2A CN113705325B (en) | 2021-06-30 | 2021-06-30 | Deformable single-target tracking method and device based on dynamic compact memory embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113705325A true CN113705325A (en) | 2021-11-26 |
CN113705325B CN113705325B (en) | 2022-12-13 |
Family
ID=78648216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110736925.2A Active CN113705325B (en) | 2021-06-30 | 2021-06-30 | Deformable single-target tracking method and device based on dynamic compact memory embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113705325B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190147602A1 (en) * | 2017-11-13 | 2019-05-16 | Qualcomm Technologies, Inc. | Hybrid and self-aware long-term object tracking |
CN111612817A (en) * | 2020-05-07 | 2020-09-01 | 桂林电子科技大学 | Target tracking method based on depth feature adaptive fusion and context information |
CN111833378A (en) * | 2020-06-09 | 2020-10-27 | 天津大学 | Multi-unmanned aerial vehicle single-target tracking method and device based on proxy sharing network |
CN111951297A (en) * | 2020-08-31 | 2020-11-17 | 郑州轻工业大学 | Target tracking method based on structured pixel-by-pixel target attention mechanism |
Non-Patent Citations (2)
Title |
---|
TING ZHANG et al.: "Design and Implementation of Dairy Food Tracking System Based on RFID", 2020 International Wireless Communications and Mobile Computing *
TANG Yiming et al.: "A Survey of Visual Single-Target Tracking Algorithms", Measurement & Control Technology (测控技术) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115082430A (en) * | 2022-07-20 | 2022-09-20 | 中国科学院自动化研究所 | Image analysis method and device and electronic equipment |
CN115082430B (en) * | 2022-07-20 | 2022-12-06 | 中国科学院自动化研究所 | Image analysis method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113705325B (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Learning adaptive attribute-driven representation for real-time RGB-T tracking | |
CN113408492A (en) | Pedestrian re-identification method based on global-local feature dynamic alignment | |
CN110992401A (en) | Target tracking method and device, computer equipment and storage medium | |
Saribas et al. | TRAT: Tracking by attention using spatio-temporal features | |
Huang et al. | End-to-end multitask siamese network with residual hierarchical attention for real-time object tracking | |
CN113705325B (en) | Deformable single-target tracking method and device based on dynamic compact memory embedding | |
Yu et al. | Learning dynamic compact memory embedding for deformable visual object tracking | |
Li et al. | Dynamic feature-memory transformer network for RGBT tracking | |
Li et al. | Self-supervised monocular depth estimation with frequency-based recurrent refinement | |
Qu et al. | Source-free Style-diversity Adversarial Domain Adaptation with Privacy-preservation for person re-identification | |
Elayaperumal et al. | Learning spatial variance-key surrounding-aware tracking via multi-expert deep feature fusion | |
Xu et al. | Learning the distribution-based temporal knowledge with low rank response reasoning for uav visual tracking | |
Cores et al. | Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos | |
Liao et al. | Multi-scale saliency features fusion model for person re-identification | |
Feng et al. | Exploring the potential of Siamese network for RGBT object tracking | |
Zhang et al. | Two-stage domain adaptation for infrared ship target segmentation | |
Wang et al. | EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention | |
WO2024082602A1 (en) | End-to-end visual odometry method and apparatus | |
Liu et al. | Pseudo-label growth dictionary pair learning for crowd counting | |
Yang et al. | IASA: An IoU-aware tracker with adaptive sample assignment | |
Luo et al. | Selective relation-aware representations for person re-identification | |
Chen et al. | Robust and efficient memory network for video object segmentation | |
Wang et al. | Transformer-Based Band Regrouping With Feature Refinement for Hyperspectral Object Tracking | |
Zhou et al. | A target response adaptive correlation filter tracker with spatial attention | |
Liu et al. | Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |