CN113705325A

CN113705325A - Deformable single-target tracking method and device based on dynamic compact memory embedding

Info

Publication number: CN113705325A
Application number: CN202110736925.2A
Authority: CN
Inventors: 于洪涛; 朱鹏飞
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-11-26
Anticipated expiration: 2041-06-30
Also published as: CN113705325B

Abstract

The invention discloses a deformable single-target tracking method and device based on dynamic compact memory embedding. The method comprises the following steps: performing target correlation matching based on compact memory embedding to obtain the target foreground and background similarity and the target posterior probability; dynamically adjusting the compact memory embedding so that, according to the feature correlation, the high-quality part of the current target features is integrated into the memory; capturing the target deformation state in the current target features by globally associating each pixel, one by one, with the reference features, thereby extracting deformable features; concatenating four features along the channel dimension, including the target foreground similarity and the deformable features, and inputting them into a decoder to obtain a refined target segmentation mask; and obtaining a rectangular bounding box of the target from the target segmentation mask to achieve visual tracking and localization. The device comprises a processor and a memory. The invention addresses challenging problems in the tracking process such as occlusion, target deformation, target appearance change, and interference from similar targets.

Description

Deformable single-target tracking method and device based on dynamic compact memory embedding

Technical Field

The invention relates to the field of single-target visual tracking, in particular to a deformable single-target tracking method and device based on dynamic compact memory embedding.

Background

Visual object tracking is a fundamental and challenging task in computer vision. It has many practical applications, such as traffic monitoring, human-computer interaction, autonomous robots, and autonomous driving. Although existing tracking methods have significantly improved both accuracy and robustness, some challenges remain to be solved, such as occlusion, deformation, and background clutter.

The Siamese (twin) tracker is a simple and efficient tracking algorithm. During tracking, the position of maximum correlation between the search area and the template is taken as the location of the target. To achieve better generalization, Siamese trackers are typically trained on large amounts of labeled data. SINT (a visual tracking method based on Siamese instance search) and SiameseFC (a visual tracking method based on a fully convolutional Siamese network) were milestones in the development of Siamese tracking: they were the first attempts to train a Siamese network end-to-end for visual tracking. SiamRPN++ (a tracking model based on a large backbone network and a Siamese region proposal network) and SiamDW (a tracking model based on wider and deeper Siamese networks) adapt the structure of the ResNet (a network based on residual connections) model and successfully apply it to Siamese tracking, thereby significantly improving tracking performance. SiamRPN (a tracking model based on a Siamese region proposal network) applies a Region Proposal Network (RPN) to a Siamese tracking network. The resulting two-branch network has a classification head for separating anchors into background and foreground, and a regression head for refining the proposal boxes. Compared with anchor-based methods, anchor-free tracking methods (SiamFC++ (a fully convolutional Siamese tracking model based on anchor-free regression), SiamCAR (a fully convolutional Siamese classification and regression model for visual tracking), SiamBAN (a tracking model based on adaptive bounding box regression), and Ocean (an anchor-free tracking method based on target awareness)) avoid a large number of anchor presets, thereby significantly reducing the number of model hyper-parameters. These methods allow more flexible regression of the target bounding box. Although Siamese tracking methods are simple and effective, a fixed template has difficulty expressing changes in the appearance and scale of the target.

MOSSE (a tracking method based on adaptive correlation filtering) pioneered the Discriminative Correlation Filter (DCF) family of tracking methods. As an online tracking method, the correlation filter adapts better to changes in target appearance and scale. Subsequent improvement strategies, such as continuous convolution, dynamic updating of the training set, spatial regularization, and temporal smoothness regularization, further improved the performance of DCF-based trackers. By combining the online update of DCF with target localization refinement by IoU-Net (a regression network model that maximizes the target overlap rate), ATOM (accurate target tracking by overlap maximization) and DiMP (a visual tracking method based on discriminative model prediction) obtained the best tracking performance of their time among template-matching trackers.

CSR-DCF (a tracking method combining correlation filtering with color segmentation) constructs a target mask from the color histograms of the foreground and background and then applies this mask to the correlation filter, which suppresses boundary effects very well. SiamMask (a tracking method combined with target segmentation) adds a segmentation branch to the Siamese tracking model and enhances the representational power of the model by means of a segmentation loss. Compared with common Video Object Segmentation (VOS) methods, SiamMask achieves a higher tracking speed by adopting a lightweight segmentation network. Unlike "tracking by detection", D3S (a discriminative segmentation tracking model) replaces the target regression branch in the tracking model with a segmentation network; by combining the precise localization of DCF with the robustness of the segmentation model to target deformation, D3S achieves advanced tracking performance. DMB (a segmentation tracking model based on dual memory banks) provides rich references for segmenting the current target by storing historical appearance and spatial localization information of the target.

Rich memory embedding provides sufficient reference information for video analysis tasks. MemTrack (a tracking model incorporating a dynamic memory network) uses a dynamic memory network to overcome the tracking drift caused by fixed templates. STM (a video segmentation method based on space-time memory) stores dense historical frame features and masks for pixel-level spatio-temporal matching against the current frame. This dense reference information allows it to handle appearance changes and occlusions in VOS very well. To avoid redundancy when too much memory is stored, AFB-URR (a video segmentation model combining adaptive feature memory and uncertain region refinement) proposes an adaptive feature bank that organizes historical information dynamically. It merges similar memory entries using weighted averaging and follows a cache replacement strategy that evicts the least frequently queried memory.

Fixed matching templates make it difficult to obtain varying target representations, especially for non-rigid objects. To this end, SiamAttn (a tracking method based on Siamese attention) proposes a deformable Siamese attention mechanism that computes deformable self-attention and cross-attention between the template features and the search features. Self-attention uses spatial attention to learn context information, and cross-attention aggregates rich contextual interdependencies between the target template and the search region, thereby updating the target template implicitly. Deformable-DETR (a deformable multi-head attention object detection model) applies a multi-scale deformable attention module in place of the Transformer (multi-head attention mechanism) attention mechanism to process the feature maps. Owing to the flexibility of its weights, the method performs very well in object detection.

Although DCF-based trackers alleviate, to some extent, the difficulty that fixed templates have in adapting to scene changes, template-matching-based approaches have two limitations:

first, the template from a single initial frame cannot provide enough target information for target matching; second, existing matching methods operate only between corresponding pixels, so the matching result is too coarse and sufficiently fine target deformation information is difficult to capture.

Disclosure of Invention

The invention provides a deformable single-target tracking method and device based on dynamic compact memory embedding, which address challenging problems in the tracking process such as occlusion, target deformation, target appearance change, interference from similar targets, and complex backgrounds, as described in detail below:

a deformable single-target tracking method based on dynamic compact memory embedding, the method comprising:

the feature affinity matrix generated during target similarity matching represents the correlation between the current target feature and the existing target feature memory; the top K values of the feature affinity matrix are selected row by row and averaged to obtain the target foreground similarity and the target posterior probability;

according to the acquired feature correlation, combining a high correlation part between the existing target feature memory and the current target feature, expanding a part with medium similarity to the existing target feature memory into the memory, and discarding a low correlation part, thereby realizing dynamic self-adaptive adjustment of memory embedding and obtaining compact memory embedding;

acquiring a target deformation state in the current target feature by adopting an association mode from query features to reference feature global points one by one and by aggregating the weighted correlation between the query pixels and the reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;

the similarity of the target foreground, the posterior probability of the target, the deformable characteristics and the target space positioning obtained in the online discriminant correlation filter are cascaded along the channel and input into a decoder to obtain a refined target segmentation mask; and acquiring a rectangular bounding box of the target based on the target segmentation mask to realize visual tracking and positioning.

Further, the obtaining of the correlation between the current target feature and the existing memory specifically includes:

after a new frame I_{t-1} has been segmented by the model, the current target query feature F_{t-1} and the obtained mask are integrated with the historical information; by continuously integrating new target information into the key and value memories, the resulting memory embedding contains rich target appearance information; the current target feature and the existing target feature memory are first reshaped and then multiplied as matrices to obtain their affinity matrix, which expresses the correlation between the two.

Further, according to the acquired feature correlation, the highly correlated part between the existing target feature memory and the current target feature is merged, the part moderately correlated with the existing target feature memory is expanded into the memory, and the low-correlation part is directly discarded, thereby realizing dynamic adaptive adjustment of the memory embedding; the dynamic adjustment process specifically comprises:

initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query with the existing target characteristic memory, and finding out a similar part of the target query and the existing target characteristic memory;

for each element in the target query, the affinity matrix is searched to find the part of the target feature memory M_k ∈ R^{Thw} that is most similar to it;

if the maximum correlation between the two is greater than a certain upper limit, then, following the hash-mapping principle that records with the same key are inserted into the same storage slot, a weighted fusion is adopted and the part of the current feature that is highly correlated with the existing memory is updated into the existing memory embedding; current target features whose correlation with the existing memory is higher than the overall average are directly expanded into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.

Further, the updating of the part of the current feature that has high correlation with the existing memory into the existing memory embedding specifically includes:

merging the high correlation part between the existing memory and the current target feature:

M_k(j') = β F_{t-1}(i) + (1 - β) M_k(j),

M_vf(j') = β Y^f_{t-1}(i) + (1 - β) M_vf(j),

M_vb(j') = β Y^b_{t-1}(i) + (1 - β) M_vb(j),

wherein M_k(j') is the merged target feature memory, M_vf(j') and M_vb(j') are the merged target foreground and background value memories respectively, j' is the index at which the merged entry is stored, β is the fusion weight, F_{t-1} is the target query, Y^f_{t-1} is the foreground mask, Y^b_{t-1} is the background mask, M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, i is the index of the feature point in the current feature that is highly similar to the existing memory, and j is the index of the entry in the existing memory that is highly similar to the current feature;

current target features whose correlation with the existing memory is higher than the average are directly expanded into the existing memory by a union operation, wherein Union(·) denotes the union of the current feature and the corresponding memory: Union(M_k, F_{t-1}) is the union of the existing key memory and the target feature at the previous moment, Union(M_vf, Y^f_{t-1}) is the union of the existing target foreground value memory and the target foreground mask at the previous moment, and Union(M_vb, Y^b_{t-1}) is the union of the existing target background value memory and the target background mask at the previous moment.

Acquiring a target deformation state in the current target feature by aggregating the weighted correlation between the query pixel and the reference feature, wherein the establishment of the corresponding relationship between similar target parts specifically comprises:

constructing the association between each pixel in the query feature F and the whole template R, and applying a non-shared attention mechanism to calculate the similarity between each pixel r^j in the key and f^i, obtaining the similarity function:

e_ij = (W_F f^i)^T (W_R r^j);

normalizing e_ij over all pixels of the template R with the softmax function, using the normalized similarity weights to aggregate the pixel-wise features of the template, and generating the target deformation feature corresponding to f^i;

and connecting the query feature with the target deformation feature through a residual connection to obtain the feature containing deformation information, the credibility of the target deformation feature being enhanced by the target foreground probability.
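The deformable feature extraction described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the projection matrices `W_F` and `W_R` are random stand-ins for learned weights, the aggregation is assumed to use the raw template features, and the foreground probability is assumed to scale the deformation feature before the residual addition. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def deformable_feature(F, R, W_F, W_R, p_fg=None):
    """Sketch of pixel-to-global attention between query and template.

    F:    (n, c) query pixels f^i
    R:    (m, c) template (reference) pixels r^j
    W_F:  (d, c) projection for the query (not shared with W_R)
    W_R:  (d, c) projection for the template
    p_fg: (n,) optional foreground probability per query pixel (assumed usage)
    """
    # e_ij = (W_F f^i)^T (W_R r^j) for every query/template pixel pair
    scores = (F @ W_F.T) @ (R @ W_R.T).T          # (n, m)
    # normalize over all template pixels j
    weights = softmax(scores, axis=1)             # each row sums to 1
    # aggregate template features with the normalized weights
    deform = weights @ R                          # (n, c)
    if p_fg is not None:
        # assumption: the foreground probability re-weights the deformation feature
        deform = deform * p_fg[:, None]
    # residual connection with the query feature
    return F + deform

rng = np.random.default_rng(0)
n, m, c, d = 6, 8, 4, 3
F = rng.standard_normal((n, c))
R = rng.standard_normal((m, c))
W_F = rng.standard_normal((d, c))
W_R = rng.standard_normal((d, c))
out = deformable_feature(F, R, W_F, W_R, p_fg=np.full(n, 0.5))
print(out.shape)  # (6, 4)
```

Because every query pixel attends to all template pixels, a target part that has moved or changed shape can still find its counterpart in the template, which is the property the text relies on.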

In a second aspect, a deformable single-target tracking device based on dynamic compact memory embedding, the device comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.

In a third aspect, a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any one of the first aspect.

The technical scheme provided by the invention has the beneficial effects that:

1. in the process of target similarity matching, a dynamically adjusted compact target memory bank is introduced, which provides effective query references for target similarity matching under complex tracking conditions such as occlusion and background clutter; in addition, the dynamic memory adjustment mechanism based on a hash algorithm effectively keeps the memory compact and of high quality, avoiding memory redundancy and unnecessary memory retrieval;

2. the deformable feature learning module provided by the invention effectively obtains the deformation information of the current target by establishing the global corresponding relation between each pixel in the query features and the features of the whole reference template, and further solves the problem of deformation of the target in the tracking process;

3. extensive simulation experiments on five challenging tracking benchmarks (VOT2016, VOT2018, VOT2019, GOT-10K and TrackingNet) show that the method has clear advantages over other recent trackers; in particular, it obtains an EAO score of 0.508 on VOT2018, which currently ranks first.

Drawings

FIG. 1 is a schematic diagram of the overall structure of the model of the present invention;

the model mainly comprises a tracker based on DCF, a deformation feature extraction module, a similarity matching module based on compact memory embedding, and an up-sampling segmentation module.

FIG. 2 is a schematic diagram of a target affinity match and compact memory embedding architecture of the present invention;

in order to obtain the target in the current frame, similarity matching is performed between the query feature F and the target feature memory M_k. Specifically, after dimension transformation, F and M_k are multiplied as matrices to construct the affinity matrix A ∈ R^{hw×Thw} between the current feature and the memory. A is then used to retrieve the value memories M_vf and M_vb, and Top-K averaging is applied to the retrieved affinity features along the column dimension to obtain the foreground and background target similarities S_f and S_b. After target segmentation and tracking are completed, the query feature F and the obtained target foreground and background segmentation masks Y^f and Y^b are updated into the corresponding memories according to the compact memory adjustment mechanism proposed by the present invention.

FIG. 3 is a diagram illustrating a comparison of tracking effects of different memory storage methods;

FIG. 4 is a visualization diagram of the deformation characteristics for improving the model tracking effect;

the embedding of compact memory effectively improves the discrimination capability of the model on similar interference. However, the model still cannot completely segment the edge details of the target, and the deformation feature learning module effectively solves the problem and realizes accurate segmentation of the target.

FIG. 5 is a schematic diagram showing the result of a segmentation visualization experiment of the data set DAVIS2017 according to the method;

FIG. 6 is a schematic diagram showing the result of the visual experiment of the method on VOT2016, VOT2018 and VOT 2019;

FIG. 7 is a schematic structural diagram of a deformable single-target tracking device based on dynamic compact memory embedding.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

In an embodiment of the present invention, in order to solve the problems described in the background art, a compact memory embedding for deformable visual tracking is proposed. A single initial reference feature contains only a small amount of target information, whereas in video sequences the target undergoes significant changes in appearance and shape. The invention accordingly provides a dynamic memory embedding adjustment mechanism: the embodiment of the invention obtains the correlation between the query features and the existing memory by retrieving the feature affinity matrix generated in the similarity matching process.

Embodiments of the present invention then merge the highly correlated part between the existing memory and the current target feature. In addition, parts moderately similar to the memory are expanded into the memory, while irrelevant parts are directly discarded. High-quality memory embedding provides complete information about target changes in historical frames, thereby effectively handling problems such as target occlusion and similar distractor objects. Furthermore, compared with existing matching methods based on correlation operations, the deformable feature extraction method provided by the embodiment of the invention associates each pixel, one by one, with the global features, and effectively captures the target deformation state in the query features by aggregating the weighted correlation between the query pixels and the reference features.

Fig. 1 shows an overall flowchart of a tracking method in an embodiment of the present invention. In order to meet the challenge in the tracking process, the model of the method mainly comprises three key components, namely a target similarity matching module based on compact memory embedding, a deformable feature learning module and a tracker based on discriminant correlation filtering.

To track and segment the video frame I_t at the current time t, the video frame I_t and the initial frame I_1 are first encoded by the backbone network (i.e., ResNet50) into query features and reference features. To improve computational efficiency, the extracted query-frame and reference-frame features are reduced to 64 channels, and the two features are denoted F_t and R, respectively.

The target similarity matching module used in the embodiment of the present invention draws on the self-attention mechanism in the Transformer (multi-head attention mechanism). The dynamic memory adjustment mechanism expands the current target feature F_t and the target segmentation mask into the compact memory stores M_k and M_v (here M_k denotes the memory store of target features, and M_v denotes the memory store of the target foreground and background segmentation masks), which effectively overcomes occlusion and appearance variation during tracking. The target deformable feature learning module described in the embodiment of the present invention globally compares the query feature with the reference feature pixel by pixel and establishes the correspondence df(F_t, R) between similar target parts. This global target correspondence can therefore fully capture the deformation information of the target. The discriminative tracker is used to extract the target localization information L_t. Following the recent deep correlation-filter-based tracker ATOM, the method first uses a 1 × 1 convolutional layer to reduce the backbone features to 64 channels, and then processes them with a 4 × 4 convolutional layer and the continuously differentiable activation function PELU (parametric exponential linear unit). The maximum of the activated feature is taken as the spatial localization of the target. The tracking module is trained online by an efficient back-propagation method. Finally, the target similarity matching result, the deformable feature and the target localization information are concatenated along the channel dimension, and the combined feature is refined and segmented by a three-stage upsampling segmentation module. A contour transformation is applied to the segmented target mask to obtain the target bounding box.
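The last step, going from a segmentation mask to a bounding box, can be illustrated with a short sketch. The text only says that a contour transformation is applied; the version below makes the simplifying assumption that the box is the axis-aligned rectangle enclosing all pixels above a threshold. The function name and the threshold of 0.5 are hypothetical.

```python
import numpy as np

def mask_to_bbox(mask, threshold=0.5):
    """Return (x_min, y_min, width, height) of the thresholded mask, or None if empty.

    Simplifying assumption: an axis-aligned box around all foreground pixels,
    standing in for the contour transformation mentioned in the text.
    """
    ys, xs = np.nonzero(mask > threshold)
    if ys.size == 0:
        return None  # no foreground pixel: target not visible in this mask
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)

mask = np.zeros((6, 8))
mask[2:5, 3:7] = 1.0          # rows 2..4, columns 3..6 are foreground
print(mask_to_bbox(mask))     # (3, 2, 4, 3)
```

Returning `None` for an empty mask leaves it to the caller to decide how to handle a lost target, for example by keeping the previous box.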

Object similarity matching based on compact memory embedding

1. Object similarity matching

Accurately separating the target from a complex background requires available reference information. Matching-based VOS methods can make full use of the target annotation in the initial frame to accurately match the current target. In the present embodiment, following the conventional attention mechanism, the target similarity matching module comprises three parts: query, key and value. Fig. 2 shows the structure of the target similarity matching module. The query feature F_t ∈ R^{h×w×c} is the target representation of the current frame. In contrast, the key R ∈ R^{h×w×c} in the model is the target feature of the initial frame, and the values are the foreground and background segmentation masks of the first frame of the video, where h is the pixel height of the target feature map, w is the pixel width of the target feature map, c is the number of channels of the target feature map, R^{...} denotes a multi-dimensional matrix of the given shape, f denotes the target foreground, and b denotes the target background.

In video sequences, targets typically undergo significant changes in appearance and structure. Similarity matching that uses only the fixed initial information of the target cannot guarantee the quality of target matching. When a new frame, e.g. I_{t-1}, has been segmented by the model, the current target query feature F_{t-1} and the obtained masks (Y^f_{t-1} and Y^b_{t-1}) are merged with the historical information. By inserting historical information into the key and value memories, the memory embedding (M_k ∈ R^{T×h×w×c}, M_vf ∈ R^{T×h×w×1} and M_vb ∈ R^{T×h×w×1}) comes to contain rich target appearance information, where M_k is the target feature memory, M_vf is the target foreground value memory, M_vb is the target background value memory, and T is the number of accumulated video frames.

To establish a pixel-level association of the target between the key memory and the query features, the method first generates an affinity matrix A. Specifically, for better matching, the key memory M_k and the query F_t are first normalized pixel by pixel along each channel. M_k and F_t are then reshaped to Thw × c and hw × c, respectively.

A = F_t * (M_k)^T, (1)

where * denotes matrix multiplication, (·)^T denotes matrix transposition, and A ∈ R^{hw×Thw}.
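Equation (1) can be sketched in NumPy as follows. The text does not say which normalization is used; this sketch assumes that each pixel's c-dimensional feature vector is scaled to unit L2 norm, so that every entry of A is a cosine similarity in [-1, 1] (which would be consistent with a threshold such as ζ = 0.95 used later). The function names and array sizes are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row (one pixel's feature vector) to unit length (assumed normalization)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def affinity_matrix(query, key_memory):
    """Compute A = F_t * (M_k)^T.

    query:      (h, w, c) query feature F_t
    key_memory: (T, h, w, c) key memory M_k
    returns:    (hw, Thw) affinity matrix A
    """
    c = query.shape[-1]
    f = l2_normalize(query.reshape(-1, c))       # (hw, c)
    m = l2_normalize(key_memory.reshape(-1, c))  # (Thw, c)
    return f @ m.T

rng = np.random.default_rng(0)
F_t = rng.standard_normal((4, 5, 64))      # h=4, w=5, c=64
M_k = rng.standard_normal((3, 4, 5, 64))   # T=3 stored frames
A = affinity_matrix(F_t, M_k)
print(A.shape)  # (20, 60)
```

Row i of A holds the similarity of query pixel i to every memory entry, which is the vector a^i used for value retrieval in the next step.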

The affinity matrix A measures the similarity between each pixel of the query feature F_t and each pixel of the key memory M_k. To obtain an accurately matched target, the value memories must further be retrieved. To this end, the foreground and background value memory maps M_vf and M_vb are reshaped to Thw × 1 (1 here is a number). For each i ∈ {1, ..., hw}, the affinity vector a^i ∈ R^{Thw} retrieves the value memory vectors M_vf ∈ R^{Thw×1} and M_vb ∈ R^{Thw×1} by dot-product calculation, yielding, respectively, the result of retrieving the target foreground value memory vector with the affinity vector and the result of retrieving the target background value memory vector with the affinity vector; i is the index of the i-th feature vector along the hw dimension of the target feature.

Matching scores with high confidence ensure the accuracy of target matching. Embodiments of the present invention therefore apply a top-K averaging function to the retrieved foreground and background vectors to extract the target score: for each query position, the K largest matching scores in the matching score matrix are collected into a set and averaged, where j is the index of the j-th of the top K matching scores. In this method, K is set to 3, and background matching is performed in the same way. The final foreground and background matching results S_f ∈ R^{hw} and S_b ∈ R^{hw} are used to generate the target posterior probability P, i.e., the probability of the target foreground relative to the background.
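The retrieval and top-K averaging steps can be sketched as follows. Two points are assumptions rather than statements from the text: the retrieval is taken to be an element-wise product of each affinity row with the value memory (a top-K average needs a vector of scores per query pixel), and the posterior P is taken to be a two-way softmax over the foreground and background scores. Function names are hypothetical.

```python
import numpy as np

def topk_similarity(A, value_memory, k=3):
    """Average of the K largest retrieved scores for each query pixel.

    A:            (hw, Thw) affinity matrix
    value_memory: (Thw,) flattened mask values (M_vf or M_vb)
    returns:      (hw,) similarity scores (S_f or S_b)
    """
    retrieved = A * value_memory[None, :]            # assumed element-wise retrieval
    k = min(k, retrieved.shape[1])
    # np.partition moves the k largest entries of each row into the last k slots
    topk = np.partition(retrieved, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

def posterior(S_f, S_b):
    """Assumed form of P: softmax over (foreground, background), foreground part."""
    z = np.stack([S_f, S_b], axis=0)
    z = z - z.max(axis=0, keepdims=True)             # numerical stability
    e = np.exp(z)
    return e[0] / e.sum(axis=0)

A = np.array([[0.9, 0.1, 0.5, 0.7]])                 # one query pixel, four memory entries
M_vf = np.array([1.0, 1.0, 0.0, 1.0])                # foreground mask values
M_vb = 1.0 - M_vf                                    # background mask values
S_f = topk_similarity(A, M_vf, k=3)                  # mean of 0.9, 0.7, 0.1
S_b = topk_similarity(A, M_vb, k=3)                  # mean of 0.5, 0.0, 0.0
P = posterior(S_f, S_b)
```

In this toy case the query pixel matches foreground memory entries more strongly than background ones, so P comes out above 0.5.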

2. Compact memory embedding

The rich memory embedding can effectively improve the precision of the target similarity matching. However, due to the limited storage capacity of the computing device, it is not possible to store all historical frame information in the memory base, especially for long videos containing more than 1K frames. In addition, storing such target information may result in redundancy of memory-embedded storage, and unnecessary matching queries, since targets in adjacent frames may be too similar, or target occlusion may occur.

The existing method only applies several history frames in the neighborhood, or selects a part of the history frames at equal intervals. These methods will lose some valid reference information. AFB-URR (video segmentation model combining adaptive feature memory and uncertain region refinement) combines similar parts in memory and deletes the part with the lowest query frequency. It may still introduce irrelevant history information into the memory, which easily leads to accumulation of model errors until the model drifts. Inspired by a Hash algorithm, a dynamic compact memory adjustment mechanism is developed for the model, so that a more effective target reference information base is formed. FIG. 2 illustrates a dynamic compact memory embedded architecture. Based on the affinity matrix a, embodiments of the present invention incorporate a high similarity (above an upper threshold) portion between the current feature and the existing memory. To avoid mismatches due to low quality memory, target features with low correlation are discarded directly.

Matching-based VOS methods mostly use the target information of the initial frame as the reference template, because the initial features with ground-truth labels provide an accurate and complete description of the target. The method therefore uses the target information of the first frame in the sequence to initialize the memory embedding and makes it the main part of the memory bank. The target in the most recent video frame is the most similar to the current target, but, in view of the model's computational error, its query priority is reduced in the target matching process. For example, during model inference, video frame I_{t-1} completes target segmentation, yielding the target query F_{t-1} and the segmentation masks of the foreground Y^f_{t-1} and the background Y^b_{t-1}. To extract useful reference information, the target query F_{t-1} is first compared with the existing target feature memory M_k to find the similar parts of the two. The affinity matrix A ∈ R^{hw×Thw} generated during target similarity matching measures the correlation between the query feature F_{t-1} and the key memory M_k. Embodiments of the present invention therefore manage the memory embedding dynamically using the affinity matrix A directly. For each element F_{t-1}(i) of F_{t-1}, i ∈ {1, ..., hw}, the affinity matrix A is searched to find the part of M_k ∈ R^{Thw} that is most similar to it:

Re(F_{t-1}(i)) = max_j A(i, j),

where Re(·) is the maximum value in each row of the affinity matrix A and A(i, j) is an element of the affinity matrix A.

If the maximum correlation between the two, A (ij), is greater than some upper limit ζ, the two are considered sufficiently similar. In the hash algorithm, records in the hash map having the same key will be inserted into the same storage space. Thus, embodiments of the present invention save only one of the plurality of similar features to memory. And in consideration of the diversity of the memories, the weight is adopted to combine the similar features and the corresponding memories, so that unnecessary retrieval and memory redundancy are avoided. From the above analysis, the initial reference information is most accurate. Therefore, the embodiment of the invention adopts a smaller fusion weight beta to update the current characteristics into the existing memory embedding so as to avoid the interference of model errors. In all experiments, ζ was set to 0.95 and β was set to 0.001. The memory-embedded online update formula is as follows: m_k(j')＝βF_t-1(i)+(1-β)M_k(j),(6)

where $M_k(j')$ is the merged target feature memory, $M_{vf}(j')$ and $M_{vb}(j')$ are the merged target foreground and background value memories, respectively, and $j'$ is the index at which the merged entry is stored in memory.

For features whose maximum correlation satisfies $Re(F_{t-1}(i)) < \zeta$, embodiments of the invention select those whose correlation value is higher than the average value and expand them into the existing memory, which ensures the diversity of the memory embedding. At the same time, queries that are irrelevant to the memory are kept out, so that the result is an efficient, compact memory store.
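The merge, expand and discard rule described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `update_memory` is hypothetical, the merged entry is written back to the same slot (that is, $j' = j$), "the average value" is taken to be the mean of the row-wise maxima $Re(\cdot)$, and only the foreground value memory is shown (the background value memory would be handled the same way).

```python
def update_memory(query, query_fg, memory_keys, memory_fg, affinity,
                  zeta=0.95, beta=0.001):
    """Merge, expand, or discard each query feature by its best affinity.

    query       : list of feature vectors, one per query pixel (F_{t-1})
    query_fg    : list of foreground values, one per query pixel (Y^f_{t-1})
    memory_keys : list of feature vectors (M_k), updated in place
    memory_fg   : list of foreground values (M_vf), updated in place
    affinity    : one row per query pixel, one column per memory entry (A)
    """
    n_old = len(memory_keys)
    # Index of the most similar memory entry for every query pixel.
    best = [max(range(n_old), key=lambda j: row[j]) for row in affinity]
    # Re(F(i)): the maximum of row i of the affinity matrix.
    re = [affinity[i][j] for i, j in enumerate(best)]
    # ASSUMPTION: "the average value" is the mean of the row-wise maxima.
    mean_re = sum(re) / len(re)

    for i, j in enumerate(best):
        if re[i] > zeta:
            # Highly similar: weighted fusion, Eq. (6), written back to slot j.
            memory_keys[j] = [beta * f + (1 - beta) * m
                              for f, m in zip(query[i], memory_keys[j])]
            memory_fg[j] = beta * query_fg[i] + (1 - beta) * memory_fg[j]
        elif re[i] > mean_re:
            # Moderately similar: append as a new memory entry.
            memory_keys.append(list(query[i]))
            memory_fg.append(query_fg[i])
        # Otherwise the query pixel is unrelated to the memory and is dropped.
    return memory_keys, memory_fg


# Toy example: one memory entry, three query pixels.
memory_keys = [[1.0, 0.0]]
memory_fg = [1.0]
query = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_fg = [1.0, 0.0, 1.0]
affinity = [[0.99], [0.10], [0.60]]
update_memory(query, query_fg, memory_keys, memory_fg, affinity)
```

In the toy example the first pixel is merged, the second is discarded, and the third is appended, so the memory grows from one entry to two.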

FIG. 3 compares the compact memory embedding of an embodiment of the present invention with two other related methods. As shown in the first row, storing all historical memory and using an Adaptive Feature Bank (AFB) both improve the discrimination ability of target similarity matching to some extent. Under complex background clutter, however, redundant memory can lead to false target matches. The memory embedding method of the embodiment of the invention fully exploits the diversity and compactness of the memory features and achieves better target matching performance.

Second, target deformable feature learning

As shown in fig. 4, with the help of compact memory embedding, target similarity matching has good discrimination capability and handles tracking problems such as target occlusion and background interference well. It also has some advantages in handling target deformation. However, it cannot effectively handle severe target deformation or the loss of spatial detail. Inspired by the attention mechanism, embodiments of the present invention propose target deformable feature learning to further alleviate this problem. Embodiments of the invention construct an association between each pixel in the query feature $F$ and the entire template $R$ to obtain complete target deformation information. Because the query and the key contain different target representations, for each pixel $f^i$ in the query a non-shared attention mechanism is applied to compute the correlation between $f^i$ and each pixel $r^j$ in the key, which gives a learnable similarity function:

$$e_{ij} = (W_F f^i)^T (W_R r^j), \quad (12)$$

where $W_F$ and $W_R$ denote learnable linear transformations that map $f^i$ and $r^j$ into a higher-dimensional representation, and $e_{ij}$ is the correlation between each pixel $f^i$ in the query feature and each pixel $r^j$ in the key feature.

To make the similarity between $f^i$ and the different parts of the template $R$ easy to compare, $e_{ij}$ is normalized over all pixels of the template $R$ by the softmax function, which gives the normalized similarity weight:

$$d_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}.$$

The normalized similarity weight $d_{ij}$ is then used to aggregate the pixel-by-pixel features in the template $R$, generating the deformation feature $v_i$ corresponding to $f^i$:

$$v_i = \sum_{j} d_{ij}\, \phi_v(r^j),$$

where $\phi_v(\cdot)$ denotes $\mathrm{ReLU}(W_v * (\cdot))$ and $W_v * (\cdot)$ is a linear transformation of the input features. The query feature and the obtained target deformation feature are then joined through a residual connection to obtain a feature that contains the deformation information.

where $\phi_c(\cdot)$ denotes $\mathrm{ReLU}(W_c(\cdot))$, which is intended to reduce the dimensionality of the features, and $\mathrm{Concat}(\cdot)$ denotes the feature concatenation operation. Background mismatches are unavoidable in the matched target deformation features. The target foreground probability $P$ (i.e. the target segmentation mask of the initial frame) is therefore used to increase the confidence of the deformation features.
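The attention steps above, up to and including the concatenation, can be illustrated with a small pure-Python sketch. Several details are assumptions rather than statements of the patented method: the helper names are hypothetical, the deformation feature is taken to be the $d_{ij}$-weighted sum of $\phi_v(r^j)$, and the residual join is taken to be $\phi_c(\mathrm{Concat}(f^i, v_i))$. The final foreground-probability weighting is not included.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def deformable_features(F, R, W_F, W_R, W_v, W_c):
    """Pixel-to-global attention from query pixels F to template pixels R."""
    keys = [matvec(W_R, r) for r in R]            # W_R r^j
    values = [relu(matvec(W_v, r)) for r in R]    # phi_v(r^j)
    out = []
    for f in F:
        q = matvec(W_F, f)                        # W_F f^i
        e = [sum(a * b for a, b in zip(q, k)) for k in keys]   # e_ij, Eq. (12)
        m = max(e)                                # subtract max for stability
        w = [math.exp(x - m) for x in e]
        total = sum(w)
        d = [x / total for x in w]                # softmax weights d_ij
        # ASSUMPTION: v_i is the d_ij-weighted sum of phi_v(r^j).
        v = [sum(d[j] * values[j][c] for j in range(len(R)))
             for c in range(len(values[0]))]
        # ASSUMPTION: the residual join is phi_c(Concat(f^i, v_i)).
        out.append(relu(matvec(W_c, f + v)))
    return out


# Toy example: one query pixel, two template pixels, identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
W_c = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]   # maps the 4-dim concatenation back to 2 dims
features = deformable_features([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                               I2, I2, I2, W_c)
```

With identity projections the query pixel attends more strongly to the template pixel it matches, so the first output channel receives most of the aggregated weight.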

Specifically, the obtained target deformation feature retrieves the foreground probability of the initial frame through a dot-product operation to generate the final deformable feature.

Third, single-target visual tracking

Four feature maps are concatenated along the channel dimension: the target foreground similarity and the target posterior probability obtained during similarity matching, the target deformation feature obtained by the deformable feature extraction module, and the target spatial localization obtained by the online discriminative correlation filter. The concatenated features are fed into a lightweight decoder (an up-sampling segmentation module) to obtain the final refined target segmentation mask. A rectangular bounding box of the target is then obtained from the final segmentation mask using the contour detection function in OpenCV (Open Source Computer Vision Library); the center of the bounding box gives the target position and its size gives the target scale.
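As a rough illustration of the last step, the sketch below derives a center and size from a binary mask. It is a simplification and not the OpenCV contour detection named in the text: it scans the foreground pixels directly, returns one axis-aligned box, and ignores the case of several disconnected regions. The function name `mask_to_box` and the pixel-center coordinate convention are assumptions.

```python
def mask_to_box(mask):
    """Return (cx, cy, w, h) of the tight box around nonzero pixels, or None.

    mask is a list of rows of 0/1 values; coordinates are pixel indices.
    """
    if not mask:
        return None
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None                      # no foreground pixel at all
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    x0, x1 = cols[0], cols[-1]
    y0, y1 = rows[0], rows[-1]
    width, height = x1 - x0 + 1, y1 - y0 + 1
    # Center expressed in pixel-index coordinates.
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0, width, height)


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = mask_to_box(mask)   # (2.5, 2.0, 4, 3)
```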

Fourth, specific implementation steps and simulation experiments

In an embodiment of the invention, the first four stages of a ResNet50 pre-trained on ImageNet are used as the backbone network for feature extraction. Target segmentation is a pixel-level classification task that requires features with high-confidence semantics. The method therefore takes the last layer of the backbone network for target similarity matching and deformable feature extraction. The backbone feature is reduced to 64 channels by a 1 × 1 convolutional layer, followed by a 3 × 3 convolutional layer and ReLU activation. During up-sampling segmentation, the first three stages of the backbone are used to supplement the spatial detail of the target. The method sets K = 3 for top-K selection, and the upper threshold $\zeta$ for merging similar memories is set to 0.95. Preliminary experiments showed that the parameters of the memory embedding module and of top-K have no obvious influence on tracking performance, so these settings were fixed in all related experiments.
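The top-K setting refers to the row-wise screening of the affinity matrix, in which the K largest values of each row are selected and averaged. A minimal sketch with K = 3 follows; the function name `topk_row_mean` is hypothetical, every row is assumed to be non-empty, and how the averaged scores are split into foreground and background similarity is not shown.

```python
def topk_row_mean(affinity, k=3):
    """Average the k largest values in each row of the affinity matrix."""
    result = []
    for row in affinity:
        top = sorted(row, reverse=True)[:k]   # k largest entries of this row
        result.append(sum(top) / len(top))
    return result


affinity = [
    [0.9, 0.1, 0.8, 0.7, 0.2],
    [0.3, 0.2, 0.1, 0.0, 0.4],
]
scores = topk_row_mean(affinity, k=3)   # roughly [0.8, 0.3]
```

Averaging a few of the best matches instead of taking only the single maximum makes the score less sensitive to one spurious high-affinity entry.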

Model training: the compact memory embedding module and the DCF-based tracking module are updated online and need no pre-training. The method therefore pre-trains only the similarity matching, deformable feature extraction, and up-sampling segmentation modules, on the YouTube-VOS dataset. Similar to the sampling strategy used in Siamese-network-based tracking models, a pair of images with masks is sampled from a video sequence to construct each training sample. The method minimizes the cross-entropy loss with the Adam optimizer; the learning rate is $8 \times 10^{-4}$ and decays by 0.2 every 15 epochs. The whole training process runs on a single Nvidia Titan XP GPU and takes 60 epochs.
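The learning-rate schedule can be worked out explicitly. The sketch below assumes that "decays by 0.2" means the rate is multiplied by 0.2 at every 15th epoch (a step schedule); the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=8e-4, gamma=0.2, step=15):
    """Step decay: multiply base_lr by gamma once per completed `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# One value per epoch for the 60 training epochs.
schedule = [learning_rate(epoch) for epoch in range(60)]
```

Under this assumption the rate is 8e-4 for epochs 0 to 14, 1.6e-4 for epochs 15 to 29, 3.2e-5 for epochs 30 to 44, and 6.4e-6 for epochs 45 to 59.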

Model inference: in the VOT task, a video sequence comes with given bounding box labels. The method first generates a pseudo mask from the ground-truth box in the initial frame and uses it as a pseudo label to initialize the model. During tracking, the segmentation model processes the current frame to obtain the target segmentation mask. The compact memory bank is updated online using the query features and the generated segmentation mask. Finally, the resulting segmentation mask is converted into a rotated target bounding box, which is the tracking result.
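The text does not say how the pseudo mask is derived from the ground-truth box. The sketch below shows only the simplest possible reading, in which every pixel inside the box is marked as foreground; this is an assumption for illustration, and the function name `box_to_pseudo_mask` and the (x, y, w, h) box layout are hypothetical.

```python
def box_to_pseudo_mask(box, height, width):
    """Fill an axis-aligned box with ones on a height x width grid of zeros.

    box is (x, y, w, h): top-left corner and size, in pixels.
    ASSUMPTION: the pseudo mask is simply the filled box.
    """
    x, y, w, h = box
    return [
        [1 if x <= col < x + w and y <= row < y + h else 0
         for col in range(width)]
        for row in range(height)
    ]


pseudo_mask = box_to_pseudo_mask((1, 1, 3, 2), height=4, width=5)
# [[0, 0, 0, 0, 0],
#  [0, 1, 1, 1, 0],
#  [0, 1, 1, 1, 0],
#  [0, 0, 0, 0, 0]]
```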

Simulation experiment results: Tables 1-3 show the evaluation results of the tracking model of the embodiment of the invention on the public single-target tracking benchmarks VOT2016, VOT2018 and VOT2019. Comparison with recent tracking models (such as SiamMask, SiamRPN++, ATOM, D3S and Ocean) demonstrates the advanced performance of the tracking model and thereby verifies the effectiveness of the method. Table 4 shows the evaluation results of the model constructed in the embodiment of the present invention on the video object segmentation benchmark DAVIS2017, which also verify the effectiveness and advanced performance of the model. Figs. 5 and 6 show visual experimental results of the method on the VOT series and DAVIS2017 benchmarks.

The comparison methods referred to in the embodiment of the invention are as follows: SiamMask (tracking method combined with target segmentation); D3S (discriminative segmentation tracking model); CCOT (tracking method based on continuous convolution); CSR-DCF (correlation filter tracking method based on channel and spatial reliability); ASRCF (correlation filter tracking method based on adaptive spatial regularization); SiamDW (tracking model based on wider and deeper Siamese networks); SiamRPN++ (tracking model based on a large backbone network and a Siamese region proposal network); SiamRPN (tracking model based on a Siamese region proposal network); SiamBAN (tracking model based on adaptive bounding box regression); Ocean-on/off (anchor-free tracking method based on target awareness); Update-Net (Siamese tracking method based on online template update); SPM (real-time visual object tracking method based on series-parallel matching); ATOM (accurate target tracking through overlap maximization); DiMP (visual tracking method based on discriminative model prediction); C-RPN (tracking method based on Siamese cascaded region proposal networks); SiamGraph (tracking method based on a Siamese attention graph model); LADCF (discriminative tracking method based on temporal consistency constraints); OSMN (efficient video object segmentation method based on network modulation); STM (video segmentation method based on space-time memory); VM (video segmentation method based on target matching); FAVOS (video segmentation method based on part tracking); OnAVOS (segmentation method based on an online adaptive convolutional network).

Table 1. Performance comparison of multiple tracking methods on the VOT2016 benchmark

Table 2. Performance comparison of multiple tracking methods on the VOT2018 benchmark

Table 3. Performance comparison of multiple tracking methods on the VOT2019 benchmark

Table 4. Performance comparison of multiple tracking and VOS methods on the DAVIS2017 benchmark

Based on the same inventive concept, an embodiment of the present invention further provides a deformable single-target tracking apparatus based on dynamic compact memory embedding. Referring to fig. 7, the apparatus includes a processor 1 and a memory 2; the memory 2 stores program instructions, and the processor 1 calls the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps of the embodiments:

a feature affinity matrix generated during target similarity matching is used to express the correlation between the current target feature and the existing target feature memory, and the top-K values of each row of the feature affinity matrix are selected and averaged to obtain the target foreground and background similarity and the target posterior probability;

according to the acquired feature correlation, combining the high correlation part between the existing target feature memory and the current target feature, expanding the part with medium similarity to the existing target feature memory into the memory, and discarding the low correlation part, thereby realizing the dynamic self-adaptive adjustment of memory embedding;

acquiring a target deformation state in the current target feature by adopting an association mode from pixel points to global features one by one and aggregating the weighted correlation between query pixels and reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;

The method for obtaining the correlation between the current target characteristics and the existing memory specifically comprises the following steps:

Further, according to the obtained feature correlation, combining a high correlation part between the existing target feature memory and the current target feature, expanding a part with medium correlation with the existing target feature memory into the memory, and discarding an irrelevant part, so that the dynamic adaptive adjustment of memory embedding is realized as follows:

for each element in the target query, the affinity matrix is searched to find the most similar part of $M_k \in \mathbb{R}^{Thw}$;

Wherein, updating the part of the current characteristics which has high correlation with the existing memory into the existing memory embedding specifically comprises:

$$M_k(j') = \beta F_{t-1}(i) + (1 - \beta) M_k(j),$$

where $M_k(j')$ is the merged target feature memory, $M_{vf}(j')$ and $M_{vb}(j')$ are the merged target foreground and background value memories, respectively, $j'$ is the index at which the merged entry is stored, $\beta$ is the fusion weight, $F_{t-1}$ is the target query, $Y^f_{t-1}$ is the foreground, $Y^b_{t-1}$ is the background, $M_k$ is the target feature memory, $M_{vf}$ is the target foreground value memory, $M_{vb}$ is the target background value memory, $i$ is the index of a feature point in the current feature that has medium similarity to the existing memory, and $j$ is the index of the part of the existing memory that has medium similarity to the current feature;

Further, by aggregating the weighted correlation between the query pixel and the reference feature, the target deformation state in the current target feature is captured, and the correspondence relationship established between similar target portions is specifically:

normalizing $e_{ij}$ over all pixels of the template $R$ by the softmax function, and using the resulting normalized similarity weights to aggregate the pixel-by-pixel features in the template, to generate the target deformation feature corresponding to $f^i$;

It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.

The processor 1 and the memory 2 may be implemented by any device with computing capability, such as a computer, a single-chip microcomputer, or a microcontroller. The embodiment of the present invention does not limit the specific implementation, which is selected according to the needs of the practical application.

The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.

The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.

It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the description of the embodiments of the present invention is not repeated here.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are generated in whole or in part.

The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, data center, etc., that comprises an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A deformable single-target tracking method based on dynamic compact memory embedding is characterized by comprising the following steps:

a feature affinity matrix generated during target similarity matching represents the correlation between the current target feature and the existing target feature memory; the top-K values of each row of the feature affinity matrix are selected and averaged to obtain the target foreground similarity and the target posterior probability;

acquiring a target deformation state in the current target feature by adopting an association mode from pixel points to global features one by one and by aggregating the weighted correlation between query pixels and reference features, and establishing a corresponding relation between similar target parts to realize extraction of deformable features;

2. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the obtaining of the correlation between the current target feature and the existing memory is specifically:

constructing a target similarity matching model, wherein keys in the model are target characteristics of an initial frame, and values are foreground and background segmentation masks of a first frame of a video;

3. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 1, wherein the high correlation part between the existing target feature memory and the current target feature is merged according to the obtained feature correlation, the part with medium correlation with the existing target feature memory is expanded into the memory, and the part with low correlation is discarded, so that the dynamic adaptive adjustment of memory embedding is realized by:

initializing memory embedding by utilizing target information of a first frame in a sequence, taking the target information as a main part of a memory library, comparing a target query characteristic with the existing target characteristic memory, and finding out a similar part of the target query characteristic and the existing target characteristic memory;

if the maximum correlation between the two is greater than an upper threshold, then, following the hash mapping principle that records with the same key are inserted into the same storage slot, the part of the current features that is highly correlated with the existing memory is updated into the existing memory embedding by weighted fusion; current target features whose correlation with the existing memory is higher than the overall average are expanded directly into the existing memory; the same is done for the corresponding parts of the target foreground and background masks.

4. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the updating of the part of the current feature having high correlation with the existing memory into the existing memory embedding is specifically:

$$M_k(j') = \beta F_{t-1}(i) + (1 - \beta) M_k(j),$$

for the union of the existing key memory and the target feature at the previous moment,

the union of the existing target background value memory and the target background mask at the previous moment is memorized.

5. The deformable single-target tracking method based on dynamic compact memory embedding as claimed in claim 3, wherein the target deformation state in the current target feature is captured by aggregating weighted correlation between pixel points in the query feature and the whole reference feature, and the correspondence relationship established between similar target portions is specifically:

constructing an association between each pixel in the query feature $F$ and the entire template, and applying a non-shared attention mechanism to compute the correlation between each pixel $r^j$ in the key and $f^i$, to obtain a similarity function:

normalizing $e_{ij}$ over all pixels of the template $R$ by the softmax function, and using the resulting normalized similarity weights to aggregate the pixel-by-pixel features in the template, to generate the target deformation feature corresponding to $f^i$;

and connecting the query feature with the target deformation feature through a residual connection to obtain a feature containing the deformation information, and using the target foreground probability to enhance the reliability of the target deformation feature.

6. A deformable single-target tracking device based on dynamic compact memory embedding, characterized in that the device comprises: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-5.