CN114372997A - Target tracking method based on quality and similarity evaluation online template updating - Google Patents

Target tracking method based on quality and similarity evaluation online template updating

Info

Publication number
CN114372997A
CN114372997A (application number CN202111476809.8A)
Authority
CN
China
Prior art keywords
template
frame
pool
target tracking
templates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111476809.8A
Other languages
Chinese (zh)
Inventor
李雅倩
赵明
肖存军
李海滨
张文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202111476809.8A
Publication of CN114372997A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 - Analysis of motion using feature-based methods involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method with online template updating based on quality and similarity evaluation, comprising the following steps: S1, generating N augmented templates $\{T_n^{t_0}\}_{n=1}^{N}$ from the initial template frame and establishing a template pool of size M; S2, extracting features of the template frame and the search frame with the target tracking module, obtaining a response score map by convolutional response, and evaluating the quality of the current new template $T^{t_c}$ according to a quality evaluation index; S3, for a template that passes the quality evaluation, measuring the cosine similarity between $T^{t_c}$ and the templates in the template pool to decide whether the new template needs to be added to the pool; S4, fusing the templates in the template pool with template-specific weights to obtain the final template at time $t_i$; and S5, convolving the feature maps extracted from the template frame and the search frame with dilated convolution layers of different dilation rates and aspect ratios, then performing feature fusion.

Description

Target tracking method based on quality and similarity evaluation online template updating
Technical Field
The invention relates to video image processing technology, and in particular to a target tracking method with online template updating based on quality and similarity evaluation.
Background
Artificial intelligence technology is being applied ever more widely and has attracted broad attention across industries. Target tracking is one of its most important branches and has developed very rapidly. The main task of target tracking is to detect the initial position and size of the target and to keep the target locked so that it is not lost from the field of view. With the continuous development of computer vision, higher requirements are placed on video processing technology; target tracking has therefore received great attention and has broad application prospects.
Target tracking means that, given the position and size of the target in the first frame, the target's position and size are located in subsequent frames. With continuous improvement of algorithms, tracking performance has improved greatly. However, target tracking is still challenged by drastic changes in target appearance, motion blur, interference from similar objects, occlusion, and so on. These challenges make the tracked target prone to drift, resulting in tracking failures.
Tracking results are mainly evaluated by accuracy, robustness, average overlap rate, and speed. Accuracy measures the overlap between the predicted target box and the ground-truth box: the higher the overlap, the better the tracker performs, and vice versa. Robustness measures the ability to recover after a tracking failure, that is, to re-identify the target in unfamiliar image sequences; the lower the robustness value, the better the tracker's adaptability and performance. Real-time performance means the tracker must at least keep up with video playback without stalling, which requires a speed of at least 24 FPS.
Object tracking is essentially continuous object detection over an image sequence; the result of the previous frame affects the result of the next frame. A tracker essentially consists of three steps: feature extraction, target feature matching, and determination of the target's position and size. Feature extraction preprocesses the target: the original image cannot be matched directly without preprocessing, so it must be cropped and modeled to obtain features of the target, and these features are the basis for identifying it. Target feature matching compares the features of the current frame with the extracted target features to find a region that conforms to them; the region whose features differ least from the target features is generally regarded as the tracked target being sought. Determining the target's position and size yields the matched target region, including its size and location, which is the output of the tracker. Current tracking algorithms fall mainly into two categories: traditional target tracking algorithms and deep-learning-based tracking algorithms. Traditional algorithms generally use a generative model, typically filtering-based trackers such as Kalman filtering and particle filtering; deep-learning-based trackers are mainly discriminative, with high accuracy and strong robustness.
Traditional target tracking algorithms run fast, require little data, and suit simple application scenes, but their accuracy and robustness are poor and they cannot handle complex scenes. Deep-learning-based tracking algorithms have high accuracy and adaptability and can handle tracking tasks in complex scenes, but they need more data, have larger models, and run more slowly. In current deep-learning-based Siamese network tracking, the initial template alone cannot cope with severe deformation, occlusion, interference from similar objects, and other problems that arise as tracking continues over time. The target template therefore needs to be updated so that the tracker can adapt to deformation, occlusion, and similar challenges.
The processing flow of a prior-art target tracking system based on a Siamese network with template updating is shown in FIG. 1; the specific steps are:
1. Read images frame by frame and preprocess them; given the position and size of the target in the first frame, define the processed and cropped image as the template frame;
2. Extract features from the template frame and the search frame and perform a cross-correlation operation to obtain a score map and a regression map;
3. Select the position with the maximum score as the new position and size of the target;
4. Take the current frame as a new template, store it in a fixed-size template pool, assign each template a weight via a convolution layer, and then concatenate them to form a new template;
5. Display the tracking result of the current frame;
6. Repeat steps 2-5 until the whole video sequence has been processed.
the target tracking system scheme based on the twin network in the prior art has the following disadvantages:
1. Deep learning models are already large with many parameters, and the above scheme updates the template every frame, which adds a huge amount of computation and inevitably reduces tracking speed. Moreover, after long accumulation the weight of the initial template is continuously diluted; once the target is lost, subsequent frames find it hard to re-acquire the target, even though the initial template frame is known to be reliable;
2. The above update scheme alleviates model degradation to some extent, but it does not consider the reliability of the templates added to the pool. When unreliable templates enter the template pool, the pool is polluted, making it harder to locate the target in subsequent frames; the resulting new templates are then unreliable, errors accumulate continuously, and the target is lost more easily.
A prior-art target tracking process based on Siamese network feature fusion is shown in FIG. 2; the receptive field is enlarged mainly through two layers of dilated convolution so that more multi-scale information can be acquired.
The prior-art target tracking process based on Siamese network feature fusion has the following defects:
1. Obtaining a larger receptive field through two layers of dilated convolution loses part of the information, which persists in the form of holes in the sampling grid; information loss in this kind of feature fusion is therefore unavoidable;
2. The above process uses dilated convolutions with equal aspect ratio (1:1), so the hole positions always coincide; no matter how many dilated convolution layers are stacked, the lost part of the information cannot be recovered.
Disclosure of Invention
The invention provides a target tracking method with online template updating based on quality and similarity evaluation, which solves the problems of existing template updating and feature fusion.
In order to solve the above technical problems, the technical scheme adopted by the invention is as follows: a target tracking method with online template updating based on quality and similarity evaluation, comprising the following steps:

S1, generating N augmented templates $\{T_n^{t_0}\}_{n=1}^{N}$ from the initial template frame and establishing a template pool of size M;

S2, extracting features of the template frame and the search frame with the target tracking module, obtaining a response score map by convolutional response, and evaluating the quality of the current new template $T^{t_c}$ according to the quality evaluation index;

S3, for a template that passes the quality evaluation, measuring the cosine similarity between $T^{t_c}$ and the templates in the template pool to decide whether the new template needs to be added to the pool;

S4, fusing the templates in the template pool with template-specific weights to obtain the final template at time $t_i$;

and S5, convolving the feature maps extracted from the template frame and the search frame with dilated convolution layers of different dilation rates and aspect ratios, then performing feature fusion.
The technical scheme of the invention is further improved as follows: in step S1, the target tracking module performs data augmentation on the video picture at the initial time $t_0$; according to the target tracking task, rotation, translation, scale transformation, and flipping operations are applied to the given target to obtain templates of the target in different poses, which are stored in a template pool of size M.
The technical scheme of the invention is further improved as follows: in step S2, features are extracted from the template frame and the search frame by the backbone network of the target tracking module according to the provided image data; convolution operations are applied to the obtained feature maps to obtain a classification map and a regression map, and quality evaluation is performed on the classification map.
The technical scheme of the invention is further improved as follows: the quality evaluation index is computed as:

$$A = \alpha_1 \frac{F_{\max}}{\operatorname{mean}(F_{\max})} + \alpha_2 \frac{APCE}{\operatorname{mean}(APCE)}$$

$$APCE = \frac{\left|F_{\max} - F_{\min}\right|^{2}}{\operatorname{mean}\!\left(\sum_{i}\left(F_{i} - F_{\min}\right)^{2}\right)}$$

where A denotes the quality assessment value, $\alpha_1$ is the weight parameter for the degree of fluctuation of the maximum score, $\alpha_2$ is the weight parameter for the degree of fluctuation of the multi-peak detection value, $F_{\max}$ is the maximum of the current classification score, $F_{\max}/\operatorname{mean}(F_{\max})$ expresses the degree of score fluctuation, $\operatorname{mean}(F_{\max})$ is the mean of the maximum classification scores of the historical frames, $\operatorname{mean}(APCE)$ is the mean peak energy of the historical frames, APCE is the current average peak-to-correlation energy, $F_{\min}$ is the minimum of the current frame's classification score, and $F_i$ is each value of the classification score.
The technical scheme of the invention is further improved as follows: in step S3, the current template $T^{t_c}$ is compared with each template in the template pool by cosine similarity, computed as:

$$S = \left\{\operatorname{COS}\!\left(T^{t_c},\, T_i\right)\right\}_{i=1}^{M}, \qquad \operatorname{COS}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where $T^{t_c}$ denotes the new template of the current frame, $T_i$ denotes the templates in the template pool, S denotes the set of cosine similarity metrics, COS denotes the cosine similarity, and i is the template index within the pool.
The technical scheme of the invention is further improved as follows: in step S4, all templates in the template pool have different weights; apart from the initial template, the other templates are assigned weights according to their distance from the current frame template, with the specific weight assignment formula:

$$Tgt_n = \begin{cases} \beta, & n = 1 \\[4pt] (1-\beta)\,\dfrac{n}{\sum_{k=2}^{N} k}, & 2 \le n \le N \end{cases}$$

where $Tgt_n$ denotes the weight of each template in the pool, N denotes the number of templates stored in the pool, β denotes the weight of the initial template, and $n / \sum_{k=2}^{N} k$ normalizes the subsequent template weights;

the final output template is:

$$T_{new} = \sum_{n=1}^{N} Tgt_n \, T^{t_n}$$

where $T_{new}$ denotes the resulting final matching template, $n \in [1, N]$, and $T^{t_n}$ denotes the historical frame templates.
The technical scheme of the invention is further improved as follows: in step S5, the dilated convolution layer consists of four 3 × 3 convolutions with dilation rates $(m, n) \in \{(1,1), (1,2), (2,1), (2,2)\}$, and the feature fusion process can be expressed as:

$$f_{out} = \sum_{(m,n)} \varphi_{m,n}\!\left(f_T\right) \star \varphi_{m,n}\!\left(f_S\right)$$

where $f_T$ denotes the template frame features, $f_S$ denotes the search frame features, $\varphi_{m,n}$ denotes a single dilated convolution, and $\star$ denotes the cross-correlation operation.
Due to the adoption of the above technical scheme, the invention achieves the following technical progress:
1. The invention performs data augmentation from the initial template frame and establishes a fixed-size template pool to store updated templates. The template frame and the search frame are each convolved by dilated convolutions with different dilation rates and aspect ratios and then fused by correlation to obtain a classification score map and a regression map; multi-peak detection is computed from the classification scores to judge the reliability of a new template, and the new template is then compared for similarity with the templates in the pool to judge the necessity of updating. Deciding through these two indices whether a new template should enter the pool avoids updating every frame, which speeds up tracking, while guaranteeing the reliability of the templates in the pool. The invention can be applied to real-time tracking of particular targets under natural conditions, such as video surveillance and autonomous driving;
2. The template updating method provided by the invention considers the reliability of a new template, reducing template pool pollution and improving model robustness; it also considers the necessity of template updating, reducing information redundancy and improving model accuracy. In addition, a dilated-convolution feature fusion scheme with different dilation rates and aspect ratios is proposed, which obtains a larger receptive field without reducing information and better performs multi-scale estimation of the target.
Drawings
FIG. 1 is a schematic diagram of a prior-art target tracking system based on a Siamese network with template updating;
FIG. 2 is a schematic diagram of a prior-art target tracking process based on Siamese network feature fusion;
FIG. 3 is a flowchart of the target tracking implementation process with online template updating based on quality and similarity evaluation according to an embodiment of the invention;
FIG. 4 is a schematic diagram of dilated convolution provided in an embodiment of the invention;
FIG. 5 is a schematic diagram of the processing procedure of the template update strategy according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples:
As shown in FIG. 3, the invention provides a target tracking method with online template updating based on quality and similarity evaluation, comprising the following steps:

S1, generating N augmented templates $\{T_n^{t_0}\}_{n=1}^{N}$ from the initial template frame and establishing a template pool of size M. The specific steps are as follows:

A GPU (graphics processing unit) can be used for accelerated computation; the invention performs image processing on computer hardware such as a GPU. The target tracking module performs data augmentation on the video picture at the initial time $t_0$: according to the target tracking task, rotation, translation, scale transformation, and flipping operations are applied to the given target to obtain templates of the target in different poses, which are then stored in a template pool of size M.
S2, extracting features of the template frame and the search frame with the target tracking module, obtaining a response score map by convolutional response, and evaluating the quality of the current new template $T^{t_c}$ according to the quality evaluation index. The specific steps are as follows:

Given the image data, the template frame and the search frame are input into a Siamese network. Because the Siamese network shares weights between its two branches, many parameters are saved and computation is accelerated. Features are extracted from the template frame and the search frame by the backbone network of the target tracking module, convolution operations are applied to the resulting feature maps to obtain a classification map and a regression map, and quality evaluation is performed on the classification score map.
The invention is illustrated by the following specific examples:
s21, image feature extraction:
at the initial time, namely, at the time when T is 0, in the tracking task, one of the template frames at the initial time gives the position information and the scale information of the target, N incremental templates have been generated through step S1, at this time, the information of the incremental templates is obtained through the initial template frame, and there is no information of the subsequent frames, so that the new template T obtained by fusion in the template pool can be obtainednewViewed as an initial template frame
Figure BDA0003393738140000072
At an initial time, that is, at a time when t is 0, an initial template frame (with a size of 127 × 127 × 3) and a search frame (with a size of 303 × 303 × 3), where 3 denotes that they both have 3 color channels, and feature matrices of 7 × 7 × 256 and 31 × 31 × 256 are obtained through extraction of a backbone network, respectively, where the backbone network is composed of five convolutional layers and two pooling layers, where the convolutional core size of the first layer is 11 × 11 and the step size is 2; the second layer is a pooling layer with the size of 3 multiplied by 3 and the step length of 2; the convolution kernel size of the third layer is 5 multiplied by 5, and the step length is 1; the fourth layer is a pooling layer with the size of 3 multiplied by 3 and the step length of 2; the fifth layer is a convolution layer, the size of a convolution kernel is 3 multiplied by 3, and the step length is 1; the sixth layer is a convolutional layer with a convolutional kernel size of 3 × 3 with a step size of 1, the seventh layer is a convolutional layer with a convolutional kernel size of 3 × 3 with a step size of 1.
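A sketch of the described backbone follows; the channel widths (96/256/384/384/256) are assumptions in the AlexNet style, since the description fixes only kernel sizes and strides, and exact output sizes depend on padding conventions:

```python
import torch
import torch.nn as nn

# Backbone per the description: 5 conv layers + 2 pooling layers.
# Channel widths are assumed; the text fixes only kernels and strides.
backbone = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=2),   # layer 1: 11x11 conv, stride 2
    nn.MaxPool2d(kernel_size=3, stride=2),        # layer 2: 3x3 pool, stride 2
    nn.Conv2d(96, 256, kernel_size=5, stride=1),  # layer 3: 5x5 conv, stride 1
    nn.MaxPool2d(kernel_size=3, stride=2),        # layer 4: 3x3 pool, stride 2
    nn.Conv2d(256, 384, kernel_size=3, stride=1), # layer 5: 3x3 conv, stride 1
    nn.Conv2d(384, 384, kernel_size=3, stride=1), # layer 6: 3x3 conv, stride 1
    nn.Conv2d(384, 256, kernel_size=3, stride=1), # layer 7: 3x3 conv, stride 1
)

z = backbone(torch.rand(1, 3, 127, 127))  # template features, on the order of 7 px wide
x = backbone(torch.rand(1, 3, 303, 303))  # search features, on the order of 31 px wide
```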
S22, feature fusion:

After the initial frame template is processed by the backbone network, a 7 × 7 × 256 feature matrix is obtained; similarly, the search frame yields a 31 × 31 × 256 feature matrix. Both then pass through four dilated convolution layers, as follows. The 7 × 7 × 256 template feature matrix is input to the first dilated convolution layer, whose kernel is 3 × 3 with dilation rate (1,1), equivalent to an ordinary 3 × 3 convolution; the 31 × 31 × 256 search feature matrix is processed by a layer with the same parameters, and the two resulting feature maps are cross-correlated to obtain a feature map of size 25 × 25 × 256. The template feature matrix is likewise input to a second dilated convolution layer with a 3 × 3 kernel and dilation rate (1,2), the search feature matrix is processed the same way, and cross-correlation again yields a 25 × 25 × 256 feature map. The third layer uses a 3 × 3 kernel with dilation rate (2,1) and the fourth layer a 3 × 3 kernel with dilation rate (2,2), each producing a 25 × 25 × 256 feature map in the same manner. Thus the template frame and the search frame each undergo four dilated convolutions, and finally the four 25 × 25 × 256 feature maps obtained from the four correlation operations are fused with equal weights.
S23, obtaining the classification branch and the regression branch:

The obtained 25 × 25 × 256 feature map is passed through three convolutions with channel compression to obtain a classification branch, a quality evaluation branch, and a regression branch, respectively. The classification branch feature map is 19 × 19 × 1, the quality evaluation branch feature map is 19 × 19 × 1, and the regression branch feature map is 19 × 19 × 4. The classification branch scores each sample location, the quality evaluation branch assigns higher weight to the important part of the classification score, and the regression branch estimates the distances t = (l, t, r, b) from the target center to the left, top, right, and bottom edges of the target bounding box.
S24, quality evaluation: according to the template $T^{t_c}$ and the classification branch feature map provided by the tracking result, quality evaluation is performed on the classification score map. The quality evaluation ensures the reliability of the template; the quality evaluation index is computed as:

$$A = \alpha_1 \frac{F_{\max}}{\operatorname{mean}(F_{\max})} + \alpha_2 \frac{APCE}{\operatorname{mean}(APCE)}$$

$$APCE = \frac{\left|F_{\max} - F_{\min}\right|^{2}}{\operatorname{mean}\!\left(\sum_{i}\left(F_{i} - F_{\min}\right)^{2}\right)}$$

where A denotes the quality assessment value, $\alpha_1$ is the weight parameter for the degree of fluctuation of the maximum score, $\alpha_2$ is the weight parameter for the degree of fluctuation of the multi-peak detection value, $F_{\max}$ is the maximum of the current classification score, $F_{\max}/\operatorname{mean}(F_{\max})$ expresses the degree of score fluctuation, $\operatorname{mean}(F_{\max})$ is the mean of the maximum classification scores of the historical frames, $\operatorname{mean}(APCE)$ is the mean peak energy of the historical frames, APCE is the current average peak-to-correlation energy, $F_{\min}$ is the minimum of the current frame's classification score, and $F_i$ is each value of the classification score.

After numerous experiments, this text takes $\alpha_1 = 1$ and $\alpha_2 = 2$ and sets the threshold on the quality value A to 1.8: when A is below 1.8, the quality of the current template is considered poor, and it is not added to the template pool for tracking of the next frame.
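A minimal sketch of this quality gate, using the index as reconstructed above ($\alpha_1 = 1$, $\alpha_2 = 2$, and the 1.8 threshold come from the text; the running history means are supplied by the caller, and the function names are illustrative):

```python
import torch

def apce(score_map: torch.Tensor) -> float:
    """Average peak-to-correlation energy of a classification score map."""
    f_max, f_min = score_map.max(), score_map.min()
    return ((f_max - f_min).abs() ** 2 / ((score_map - f_min) ** 2).mean()).item()

def quality_value(score_map: torch.Tensor, hist_fmax_mean: float,
                  hist_apce_mean: float, a1: float = 1.0, a2: float = 2.0) -> float:
    """Quality index A; a template with A below the threshold is not pooled."""
    return (a1 * score_map.max().item() / hist_fmax_mean
            + a2 * apce(score_map) / hist_apce_mean)

score_map = torch.rand(19, 19)                 # classification score map from S23
A = quality_value(score_map, hist_fmax_mean=0.9, hist_apce_mean=8.0)
keep = A >= 1.8                                # quality gate from the text
```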
S3, for a template that passes the quality evaluation, the cosine similarity between $T^{t_c}$ and the templates in the template pool is measured; this similarity evaluation detects the necessity of a template update and decides whether the new template needs to be added to the pool, as shown in FIG. 5. The current template $T^{t_c}$ is compared with each template in the pool by cosine similarity, computed as:

$$S = \left\{\operatorname{COS}\!\left(T^{t_c},\, T_i\right)\right\}_{i=1}^{M}, \qquad \operatorname{COS}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where $T^{t_c}$ denotes the new template of the current frame, $T_i$ denotes the templates in the template pool, S denotes the set of cosine similarity metrics, COS denotes the cosine similarity, and i is the template index within the pool. After several experiments, the threshold on S was set to 0.15.
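A minimal sketch of this similarity check; templates are compared as flattened feature tensors, and the direction of the 0.15 threshold test (update only when the new template is sufficiently dissimilar from every pooled template) is an assumption, since the text fixes only the value:

```python
import torch
import torch.nn.functional as F

def pool_similarities(t_new: torch.Tensor, pool: list) -> list:
    """Cosine similarity between the new template and every template in the pool."""
    v = t_new.flatten().unsqueeze(0)
    return [F.cosine_similarity(v, t.flatten().unsqueeze(0)).item() for t in pool]

def needs_update(t_new: torch.Tensor, pool: list, s_thr: float = 0.15) -> bool:
    # Assumed rule: add the new template only when its dissimilarity to even the
    # most similar pooled template exceeds the threshold (no redundant updates).
    return min(1.0 - s for s in pool_similarities(t_new, pool)) > s_thr

pool = [torch.rand(256, 7, 7) for _ in range(5)]
print(needs_update(torch.rand(256, 7, 7), pool, s_thr=0.15))
```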
S4, fusing the templates in the template pool with template-specific weights to obtain the final template at time $t_i$. All templates in the pool have different weights: apart from the initial template, the other templates are assigned weights according to their distance from the current frame template. This weight assignment preserves the characteristics of the initial template to the greatest extent while incorporating the latest information about the target. The specific weight assignment formula is:

$$Tgt_n = \begin{cases} \beta, & n = 1 \\[4pt] (1-\beta)\,\dfrac{n}{\sum_{k=2}^{N} k}, & 2 \le n \le N \end{cases}$$

where $Tgt_n$ denotes the weight of each template in the pool, N denotes the number of templates stored in the pool, β denotes the weight of the initial template, and $n / \sum_{k=2}^{N} k$ normalizes the subsequent template weights.

The final output template is:

$$T_{new} = \sum_{n=1}^{N} Tgt_n \, T^{t_n}$$

where $T_{new}$ denotes the resulting final matching template, $n \in [1, N]$, and $T^{t_n}$ denotes the historical frame templates. After several experiments, the value of β was set to 0.5.
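A minimal sketch of the weighted fusion, with β = 0.5 from the text; the linear-in-index weighting of the non-initial templates follows the reconstruction given above, not a formula stated verbatim:

```python
import torch

def fuse_pool(pool: list, beta: float = 0.5) -> torch.Tensor:
    """Weighted fusion: the initial template keeps weight beta; later templates
    (closer to the current frame) share the remaining 1 - beta, growing with n."""
    n_tpl = len(pool)
    if n_tpl == 1:
        return pool[0]
    denom = sum(range(2, n_tpl + 1))                 # normalizer for indices 2..N
    weights = [beta] + [(1.0 - beta) * k / denom for k in range(2, n_tpl + 1)]
    return sum(w * t for w, t in zip(weights, pool))

pool = [torch.rand(256, 7, 7) for _ in range(5)]
t_new = fuse_pool(pool)        # final matching template T_new
print(t_new.shape)             # torch.Size([256, 7, 7])
```

Note that the weights sum to 1 by construction, so the fused template stays on the same scale as the pooled templates.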
And S5, convolving the feature maps extracted from the template frame and the search frame with dilated convolution layers of different dilation rates and aspect ratios, then performing feature fusion.

A schematic diagram of the dilated convolution provided in the embodiment of the invention is shown in FIG. 4. Information is easily lost when extracting features from the template frame and the search frame; therefore, before the template frame is correlated with the search frame, both first pass through the dilated convolution layers. Dilated convolutions with different dilation rates and aspect ratios enlarge the receptive field, so more scale information is obtained without losing other information. The dilated convolution layer consists of four 3 × 3 convolutions with dilation rates $(m, n) \in \{(1,1), (1,2), (2,1), (2,2)\}$, and the feature fusion process can be expressed as:

$$f_{out} = \sum_{(m,n)} \varphi_{m,n}\!\left(f_T\right) \star \varphi_{m,n}\!\left(f_S\right)$$

where $f_T$ denotes the template frame features, $f_S$ denotes the search frame features, $\varphi_{m,n}$ denotes a single dilated convolution, and $\star$ denotes the cross-correlation operation.
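A minimal sketch of the four-branch fusion; the "same" padding (so the 7 × 7 and 31 × 31 feature maps keep their sizes before correlation), the depthwise form of the cross-correlation, and the equal-weight summation are assumptions consistent with the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

RATES = [(1, 1), (1, 2), (2, 1), (2, 2)]

class DilatedFusion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # One 3x3 dilated conv per branch; padding = dilation keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=(m, n), dilation=(m, n))
            for m, n in RATES
        ])

    def forward(self, f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        out = 0
        for conv in self.branches:
            zt, xs = conv(f_t), conv(f_s)        # shared conv per branch (Siamese)
            # Depthwise cross-correlation: template features act as the kernel.
            out = out + F.conv2d(xs, zt.squeeze(0).unsqueeze(1), groups=zt.shape[1])
        return out                                # equal-weight fusion (sum)

fuse = DilatedFusion()
f_t = torch.rand(1, 256, 7, 7)      # template features
f_s = torch.rand(1, 256, 31, 31)    # search features
print(fuse(f_t, f_s).shape)         # torch.Size([1, 256, 25, 25])
```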
The invention performs data augmentation from the initial template frame and establishes a fixed-size template pool to store updated templates. The template frame and the search frame are each convolved by dilated convolutions with different dilation rates and aspect ratios and then fused by correlation to obtain a classification score map and a regression map; multi-peak detection is computed from the classification scores to judge the reliability of a new template, and the new template is then compared for similarity with the templates in the pool to judge the necessity of updating. Deciding through these two indices whether a new template should enter the pool avoids updating every frame, which speeds up tracking, while also guaranteeing the reliability of the templates in the pool.

The template updating method provided by the invention considers the reliability of a new template, reducing template pool pollution and improving model robustness; it also considers the necessity of template updating, reducing information redundancy and improving the accuracy of the template. In addition, a dilated-convolution feature fusion scheme with different dilation rates and aspect ratios is proposed, which obtains a larger receptive field without reducing information and better performs multi-scale estimation of the target.

Claims (7)

1. A target tracking method with online template updating based on quality and similarity evaluation, characterized in that the method comprises the following steps:

S1, generating N augmented templates $\{T_n^{t_0}\}_{n=1}^{N}$ from the initial template frame and establishing a template pool of size M;

S2, extracting features of the template frame and the search frame with the target tracking module, obtaining a response score map by convolutional response, and evaluating the quality of the current new template $T^{t_c}$ according to the quality evaluation index;

S3, for a template that passes the quality evaluation, measuring the cosine similarity between $T^{t_c}$ and the templates in the template pool to decide whether the new template needs to be added to the pool;

S4, fusing the templates in the template pool with template-specific weights to obtain the final template at time $t_i$;

and S5, convolving the feature maps extracted from the template frame and the search frame with dilated convolution layers of different dilation rates and aspect ratios, then performing feature fusion.
2. The method of claim 1, characterized in that: in step S1, the target tracking module performs data augmentation on the video picture at the initial time $t_0$; according to the target tracking task, rotation, translation, scale transformation, and flipping operations are applied to the given target to obtain templates of the target in different poses, which are stored in a template pool of size M.
3. The method of claim 2, characterized in that: in step S2, features are extracted from the template frame and the search frame by the backbone network of the target tracking module according to the provided image data; convolution operations are applied to the obtained feature maps to obtain a classification map and a regression map, and quality evaluation is performed on the classification map.
4. The method of claim 3, characterized in that the quality evaluation index is computed as:

$$A = \alpha_1 \frac{F_{\max}}{\operatorname{mean}(F_{\max})} + \alpha_2 \frac{APCE}{\operatorname{mean}(APCE)}$$

$$APCE = \frac{\left|F_{\max} - F_{\min}\right|^{2}}{\operatorname{mean}\!\left(\sum_{i}\left(F_{i} - F_{\min}\right)^{2}\right)}$$

where A denotes the quality assessment value, $\alpha_1$ is the weight parameter for the degree of fluctuation of the maximum score, $\alpha_2$ is the weight parameter for the degree of fluctuation of the multi-peak detection value, $F_{\max}$ is the maximum of the current classification score, $F_{\max}/\operatorname{mean}(F_{\max})$ expresses the degree of score fluctuation, $\operatorname{mean}(F_{\max})$ is the mean of the maximum classification scores of the historical frames, $\operatorname{mean}(APCE)$ is the mean peak energy of the historical frames, APCE is the current average peak-to-correlation energy, $F_{\min}$ is the minimum of the current frame's classification score, and $F_i$ is each value of the classification score.
5. The method of claim 4, characterized in that: in step S3, the current template $T^{t_c}$ is compared with each template in the template pool by cosine similarity, computed as:

$$S = \left\{\operatorname{COS}\!\left(T^{t_c},\, T_i\right)\right\}_{i=1}^{M}, \qquad \operatorname{COS}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where $T^{t_c}$ denotes the new template of the current frame, $T_i$ denotes the templates in the template pool, S denotes the set of cosine similarity metrics, COS denotes the cosine similarity, and i is the template index within the pool.
6. The method of claim 5, characterized in that: in step S4, all templates in the template pool have different weights; apart from the initial template, the other templates are assigned weights according to their distance from the current frame template, with the specific weight assignment formula:

$$Tgt_n = \begin{cases} \beta, & n = 1 \\[4pt] (1-\beta)\,\dfrac{n}{\sum_{k=2}^{N} k}, & 2 \le n \le N \end{cases}$$

where $Tgt_n$ denotes the weight of each template in the pool, N denotes the number of templates stored in the pool, β denotes the weight of the initial template, and $n / \sum_{k=2}^{N} k$ normalizes the subsequent template weights;

the final output template is:

$$T_{new} = \sum_{n=1}^{N} Tgt_n \, T^{t_n}$$

where $T_{new}$ denotes the resulting final matching template, $n \in [1, N]$, and $T^{t_n}$ denotes the historical frame templates.
7. The method of claim 6, characterized in that: in step S5, the dilated convolution layer consists of four 3 × 3 convolutions with dilation rates $(m, n) \in \{(1,1), (1,2), (2,1), (2,2)\}$, and the feature fusion process can be expressed as:

$$f_{out} = \sum_{(m,n)} \varphi_{m,n}\!\left(f_T\right) \star \varphi_{m,n}\!\left(f_S\right)$$

where $f_T$ denotes the template frame features, $f_S$ denotes the search frame features, $\varphi_{m,n}$ denotes a single dilated convolution, and $\star$ denotes the cross-correlation operation.
CN202111476809.8A 2021-12-06 2021-12-06 Target tracking method based on quality and similarity evaluation online template updating Pending CN114372997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111476809.8A CN114372997A (en) 2021-12-06 2021-12-06 Target tracking method based on quality and similarity evaluation online template updating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111476809.8A CN114372997A (en) 2021-12-06 2021-12-06 Target tracking method based on quality and similarity evaluation online template updating

Publications (1)

Publication Number Publication Date
CN114372997A 2022-04-19

Family

ID=81140514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111476809.8A Pending CN114372997A (en) 2021-12-06 2021-12-06 Target tracking method based on quality and similarity evaluation online template updating

Country Status (1)

Country Link
CN (1) CN114372997A (en)

Similar Documents

Publication Publication Date Title
CN111797716B (en) Single target tracking method based on Siamese network
CN110335319B (en) Semantic-driven camera positioning and map reconstruction method and system
CN109146921B (en) Pedestrian target tracking method based on deep learning
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110473231B (en) Target tracking method of twin full convolution network with prejudging type learning updating strategy
CN110781262B (en) Semantic map construction method based on visual SLAM
CN113674328A (en) Multi-target vehicle tracking method
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
WO2023065395A1 (en) Work vehicle detection and tracking method and system
CN108564598B (en) Improved online Boosting target tracking method
CN109472191A (en) A kind of pedestrian based on space-time context identifies again and method for tracing
CN107688830B (en) Generation method of vision information correlation layer for case serial-parallel
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN113298014B (en) Closed loop detection method, storage medium and equipment based on reverse index key frame selection strategy
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN113033454A (en) Method for detecting building change in urban video camera
CN115131760A (en) Lightweight vehicle tracking method based on improved feature matching strategy
CN114882351B (en) Multi-target detection and tracking method based on improved YOLO-V5s
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN113628246A (en) Twin network target tracking method based on 3D convolution template updating
CN114372997A (en) Target tracking method based on quality and similarity evaluation online template updating
CN112200831B (en) Dynamic template-based dense connection twin neural network target tracking method
CN112613472B (en) Pedestrian detection method and system based on deep search matching
CN114067240A (en) Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics
CN113888603A (en) Loop detection and visual SLAM method based on optical flow tracking and feature matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination