CN110796680B - Target tracking method and device based on similar template updating - Google Patents
- Publication number
- CN110796680B CN110796680B CN201910734740.0A CN201910734740A CN110796680B CN 110796680 B CN110796680 B CN 110796680B CN 201910734740 A CN201910734740 A CN 201910734740A CN 110796680 B CN110796680 B CN 110796680B
- Authority
- CN
- China
- Prior art keywords
- model
- video frame
- time
- moment
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/251—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a target tracking method and device based on similar-template updating. The method comprises the following steps: image features are extracted from the video frame picture at the initial moment by a target tracking module to obtain an initial model T_init and an initial incremental update model T_incre(t1); image features are extracted from the video frame picture at the current moment t to obtain a new model T_new(t) at moment t; an incremental update model T_incre(t) at moment t is calculated from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1; the similarity δ_init between T_new(t) and T_init and the similarity δ_incre between T_new(t) and T_incre(t) are calculated; and according to δ_init and δ_incre, a model update strategy selects T_new(t) or T_incre(t) as the final model at moment t. The invention calculates the similarities between T_new(t) and T_init and between T_new(t) and T_incre(t) by means of the convolution response, and selects the final model at moment t according to these similarities, so that the reliability of the new model can be quickly detected.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a target tracking method and device based on similar-template updating.
Background
Artificial intelligence is an important driving force of the new technological and industrial revolution, and target tracking is one of the important research directions of artificial-intelligence technology in computer vision. Its main task is to detect the accurate position of one or more known targets in a video. As computer vision tasks focus more and more on video analysis, target tracking algorithms are receiving increasing attention.
A target tracking system can be roughly divided into three modules: video frame input, target tracking, and result display. The video frame input module reads video data and sends it frame by frame to the target tracking module. The target tracking module is the core functional module of the system; it searches each input picture frame for the target determined by the initial frame and acquires the target's specific position and size. The result display module combines the target's position and size obtained by the target tracking module with the picture frame, synthesizes a video frame picture with a marker box, and outputs it to the user.
A target tracking system is evaluated mainly on two aspects: accuracy and real-time performance. The main indicators for evaluating accuracy include expected average overlap, precision, and robustness. Precision mainly evaluates the pixel difference between the tracking result and the target's actual position, and the area difference between the tracking result and the target's actual size. Robustness mainly evaluates the ability to recover correct tracking after a tracking failure. The accuracy of a target tracking system is affected by many factors. Since the only target information given is the appearance and position in the first frame, deformation, rotation, and scaling of the target itself affect the performance of the target tracking module. In addition, factors in the target's environment such as illumination change and occlusion also affect its performance, and blurring and changes of shooting angle during video capture can likewise cause inaccurate tracking. Besides accuracy, real-time performance is also a very important indicator: since the lowest real-time playback speed of video is 24 FPS, the tracker must produce results at 24 FPS or above. In practice, however, target tracking algorithms often fail to run in real time because of complex modeling and heavy image-processing computation.
The target tracking module is essentially an image object detector that must detect the specific position and size of a specified target in an input image region. It mainly comprises three submodules: feature extraction, target positioning, and target model updating. The feature extraction submodule models the target: a raw target picture cannot be used directly for tracking, so the picture must first be processed into a feature vector, from which a target model is constructed. Image feature extraction methods mainly comprise traditional methods and deep-learning-based methods. Traditional feature extraction is fast, but its accuracy is much lower than that of deep-learning features. Deep-learning-based feature extraction often cannot meet real-time requirements because of the large number of images required, complex models, and large parameter counts. The target positioning submodule processes the extracted image features and identifies which pixel regions belong to the target and which do not, thereby determining the target's specific position and size. Currently common target positioning models include convolutional layers and correlation filters. Convolutional layers involve heavy computation and are time-consuming; correlation filters have a speed advantage but suffer from model degradation in practical applications. The target model updating submodule updates the target's model: as tracking proceeds, the target's appearance changes and the initial target model can no longer guarantee tracking accuracy, so the model must be updated.
In general, a target tracking system updates the target model every frame according to each frame's prediction, and such updating consumes much computation time. Furthermore, the updated template itself is unreliable: the update process may introduce background information so that the model is built incorrectly, which can drive the model further and further from the correct model as tracking proceeds, leading to tracking drift. Current target tracking systems do not check the new model during tracking, so many model updates are invalid. In fact, the target model is stable in most frames, where updating is redundant; updating the target model is only useful when the target's appearance changes. Meanwhile, detecting whether the target's appearance has changed also consumes a large amount of computing resources and time, increasing the time the whole system needs to process the tracking task.
A processing flow of a target tracking system scheme based on a frame-by-frame incremental update model in the prior art is shown in fig. 1, and the specific steps are as follows:
1. Video data is read frame by frame and simple data preprocessing is performed. For each frame, a positioning algorithm determines the target's position in the current frame, using model n to predict that position.
2. A feature extraction algorithm builds a new model n from the target in the current (nth) frame, and the new model n is fused with the historical model n to obtain an updated model n.
3. The tracked video frame is displayed.
4. The updated model is taken as the model for determining the target position in the next frame.
5. Steps 1) to 4) are repeated until all video frames have been processed.
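The prior-art frame-by-frame loop above can be sketched as follows (an illustrative Python sketch only; the feature extractor and the model representation are placeholder stand-ins, not the actual algorithm):

```python
import numpy as np

def extract_features(frame):
    # Placeholder feature extractor: in a real system this would be a
    # learned or hand-crafted feature model of the target patch.
    return frame.astype(np.float64)

def incremental_update(model, new_model, alpha=0.01):
    # Fuse the new per-frame model into the history with learning rate alpha.
    return (1.0 - alpha) * model + alpha * new_model

# Frame-by-frame scheme: every frame updates the model, reliable or not.
frames = [np.full((4, 4), v, dtype=np.float64) for v in range(5)]
model = extract_features(frames[0])
for frame in frames[1:]:
    new_model = extract_features(frame)           # model n from the current frame
    model = incremental_update(model, new_model)  # fused model used for the next frame
```

Note how no step checks whether `new_model` is trustworthy before fusing it in, which is exactly the drawback the patent discusses.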
The target tracking system scheme based on the frame-by-frame incremental update model in the prior art has the following disadvantages:
It cannot judge whether the new model is reliable. In the incremental update method, the new model generated in every frame participates in the model update, but it is never checked, so its validity cannot be judged. Once tracking goes wrong, the model clearly introduces invalid background information, making the existing model increasingly unreliable. When too much background information is introduced, the tracking result drifts, making the newly generated models even less reliable, and a vicious circle results.
Model updating is inefficient. In a target tracking algorithm that fuses a new model with the historical model at every frame, a large amount of computation is needed whenever a complex modeling method is adopted. A great deal of time is thus consumed in updating the model, reducing the tracking speed.
A processing flow chart of a design scheme of a conventional correlation filter target tracking system in the prior art is shown in fig. 2, and the specific steps include:
1. The first frame is taken as the template, and traditional image features are extracted.
2. For the video image frame newly entering the system, traditional image features are extracted.
3. The image features of the template and of the new image frame are used together to calculate a correlation response.
4. The position with the maximum response is selected as the target's new coordinates.
5. Steps 2) to 4) are repeated until all video frames have been processed.
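The correlation step at the heart of this prior-art scheme can be illustrated with a frequency-domain cross-correlation (a generic NumPy sketch, not the patent's implementation):

```python
import numpy as np

def correlation_response(template_feat, search_feat):
    # Circular cross-correlation via the frequency domain: the response
    # peaks where the search features best match the template features.
    F_t = np.fft.fft2(template_feat)
    F_s = np.fft.fft2(search_feat)
    return np.real(np.fft.ifft2(F_s * np.conj(F_t)))

# Toy example: a 2x2 bright blob shifted by (3, 5) in the search image.
template = np.zeros((32, 32)); template[0:2, 0:2] = 1.0
search = np.zeros((32, 32)); search[3:5, 5:7] = 1.0
resp = correlation_response(template, search)
dy, dx = np.unravel_index(np.argmax(resp), resp.shape)
# (dy, dx) is the estimated shift of the target relative to the template
```

The argmax of the response map plays the role of step 4 above: the position of maximum response becomes the target's new coordinates.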
The above-mentioned conventional correlation filter target tracking system design scheme in the prior art has the following disadvantages:
1. Traditional features are not accurate enough for target tracking. Under current hardware conditions, traditional features fully satisfy the real-time requirement. Compared with deep-learning features, however, their accuracy is insufficient: because their descriptive power is not strong enough to express the semantic information of the target image, the target is easily lost when traditional features are used to determine its position.
2. The correlation filter itself suffers from model degradation. In this type of design, a correlation filter is used as the positioning method. The performance of the correlation filter as a single-frame target detector is not in question; its accuracy can even exceed that of a neural-network convolutional layer. However, as target tracking proceeds, the correlation filter becomes increasingly prone to model degradation, which gradually misaligns the model and causes the tracked target to be lost.
Disclosure of Invention
The embodiments of the invention provide a target tracking method and device based on similar-template updating, which aim to overcome the above problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
According to one aspect of the invention, a target tracking method based on similar template updating is provided, which comprises the following steps:
transcoding and framing the video data to obtain the video frame pictures, containing the target, that correspond to each moment;
extracting image features from the video frame picture at the initial moment through a target tracking module to obtain an initial model T_init and an initial incremental update model T_incre(t1);
extracting image features from the video frame picture at the current moment t through the target tracking module to obtain a new model T_new(t) at moment t, and calculating an incremental update model T_incre(t) at moment t from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1;
calculating, by a convolution response method, a similarity value δ_init between the new model T_new(t) and the initial model T_init and a similarity value δ_incre between T_new(t) and the current incremental update model T_incre(t), and, according to the similarity values δ_init and δ_incre, selecting T_new(t) or T_incre(t) through a model update strategy as the final model at moment t;
wherein extracting image features from the video frame picture at moment t to obtain the new model T_new(t) and calculating the incremental update model T_incre(t) comprises:
extracting image features from the input video frame picture at moment t by using the convolution layers through the target tracking module, obtaining the specific position of the target in the video frame picture at moment t, and converting that position into the new model T_new(t) at moment t;
obtaining, through the target tracking algorithm, the incremental update model at moment t from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1 as
T_incre(t) = (1 - α) · T_incre(t-1) + α · T_new(t-1),
wherein α is a set learning rate;
and wherein calculating the similarity values δ_init and δ_incre and selecting the final model comprises:
calculating the convolution response between the new model T_new(t) and the initial model T_init and converting it into the similarity value δ_init, and calculating the convolution response between T_new(t) and the current incremental update model T_incre(t) and converting it into the similarity value δ_incre;
setting two similarity thresholds θ_init and θ_incre, and judging whether δ_init ≥ θ_init and δ_incre ≤ θ_incre; if yes, taking the new model T_new(t) as the final model T(t) at moment t; otherwise, taking T_incre(t) as the final model T(t) at moment t.
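The threshold-based selection described above can be sketched in Python as follows. This is an illustrative sketch only: the patent specifies "convolution response" similarities and two thresholds, so the normalized circular-correlation form of `similarity` and the names `theta_init`, `theta_incre` below are assumptions:

```python
import numpy as np

def similarity(a, b):
    # Peak of the circular correlation response between two feature
    # models, normalized to at most 1 (Cauchy-Schwarz bound); an assumed
    # concrete form of the patent's "convolution response" similarity.
    resp = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))
    return resp.max() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def select_final_model(T_new, T_init, T_incre, theta_init=0.6, theta_incre=0.9):
    # Accept the new model only if it is still similar enough to the
    # initial model (reliable) and dissimilar enough from the current
    # incremental model (it actually captures an appearance change).
    d_init = similarity(T_new, T_init)
    d_incre = similarity(T_new, T_incre)
    if d_init >= theta_init and d_incre <= theta_incre:
        return T_new
    return T_incre
```

The two threshold values here are arbitrary placeholders; in practice they would be tuned on tracking data.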
Preferably, transcoding and framing the video data to obtain the video frame pictures containing the target at each moment comprises:
a data reading thread reads in the video data and performs transcoding and framing to obtain a video frame picture sequence containing the target, the sequence comprising the video frame pictures corresponding to each moment;
a preprocessing operation is performed on each video frame picture in the sequence, and the preprocessed pictures are transmitted to the target tracking module, wherein the preprocessing comprises histogram equalization and picture resizing.
Preferably, extracting image features from the video frame picture at the initial moment through the target tracking module to obtain the initial model T_init and the initial incremental update model T_incre(t1) comprises:
extracting image features, through the target tracking module, from the video frame picture at the initial moment t1; according to the target position information in the video frame picture at moment t1 given by the target tracking task, passing the image block of the t1 frame through two convolution layers with 3×3 kernels to obtain a 125×125×32 convolution feature matrix, and taking this convolution feature matrix as the initial model T_init; the incremental update model T_incre(t1) at moment t1 is numerically equal to T_init.
According to another aspect of the present invention, there is provided an object tracking apparatus updated based on similar templates, including:
a video data preprocessing module, configured to transcode and frame the video data to obtain the video frame pictures containing the target at each moment;
a video frame primary filtering processing module, configured to extract image features from the video frame picture at the initial moment through the target tracking module to obtain an initial model T_init and an initial incremental update model T_incre(t1);
a current video frame filtering processing module, configured to extract image features from the video frame picture at the current moment t through the target tracking module to obtain a new model T_new(t) at moment t, and to calculate an incremental update model T_incre(t) at moment t from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1;
a current video frame model determining module, configured to calculate, by the convolution response method, a similarity value δ_init between the new model T_new(t) and the initial model T_init and a similarity value δ_incre between T_new(t) and the current incremental update model T_incre(t), and to select, according to the similarity values δ_init and δ_incre, T_new(t) or T_incre(t) through the model update strategy as the final model at moment t.
The current-moment video frame filtering processing module is specifically configured to extract image features, through the target tracking module and using the convolution layers, from the input video frame picture at moment t, obtain the specific position of the target in the video frame picture at moment t, and convert that position into the new model T_new(t) at moment t;
and to obtain, through the target tracking algorithm, the incremental update model at moment t from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1 as
T_incre(t) = (1 - α) · T_incre(t-1) + α · T_new(t-1),
wherein α is a set learning rate.
The current-moment video frame model determining module is specifically configured to calculate the convolution response between the new model T_new(t) and the initial model T_init and convert it into the similarity value δ_init, and to calculate the convolution response between T_new(t) and the current incremental update model T_incre(t) and convert it into the similarity value δ_incre;
and to set two similarity thresholds θ_init and θ_incre and judge whether δ_init ≥ θ_init and δ_incre ≤ θ_incre; if yes, the new model T_new(t) is taken as the final model T(t) at moment t; otherwise, T_incre(t) is taken as the final model T(t) at moment t.
Preferably, the video data preprocessing module is specifically configured such that a data reading thread reads in the video data and performs transcoding and framing to obtain a video frame picture sequence containing the target, the sequence comprising the video frame pictures corresponding to each moment;
each video frame picture in the sequence is preprocessed and transmitted to the target tracking module, wherein the preprocessing comprises histogram equalization and picture resizing.
Preferably, the video frame primary filtering processing module is specifically configured to extract image features, through the target tracking module, from the video frame picture at the initial moment t1; according to the target position information in the video frame picture at moment t1 given by the target tracking task, the image block of the t1 frame is passed through two convolution layers with 3×3 kernels to obtain a 125×125×32 convolution feature matrix, which is taken as the initial model T_init; the incremental update model T_incre(t1) at moment t1 is numerically equal to T_init.
It can be seen from the technical solutions provided by the embodiments of the present invention that the embodiments use a convolutional neural network to obtain the similarity between the new model T_new(t) at the current moment and both the initial model T_init and the current incremental update model T_incre(t), and, according to these similarities, select T_new(t) or T_incre(t) through the model update strategy as the final model at moment t. No additional algorithm is needed for the similarity calculation, so the reliability of the new model is detected quickly.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart illustrating a prior art approach to a target tracking system based on a frame-by-frame incremental update model;
FIG. 2 is a process flow diagram of a conventional correlation filter target tracking system design in the prior art;
FIG. 3 is a flowchart of a process for implementing target tracking based on similar template updating according to an embodiment of the present invention;
FIG. 4 is a graph of a convolution response provided by an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing procedure of a model update policy according to an embodiment of the present invention;
fig. 6 is a block diagram of a target tracking apparatus based on similar template update according to an embodiment of the present invention, in which a video data preprocessing module 61, a video frame primary filtering processing module 62, a current video frame filtering processing module 63, and a current video frame model determining module 64 are included.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present invention and are not construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wireless connection or coupling. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Example one
For a current target tracking system, a computer containing a GPU (Graphics Processing Unit) can be adopted, offering high image-computation speed and strong processing capability. The embodiment of the invention is designed for such a GPU-equipped computer: video data is input into the system frame by frame, and picture features are extracted by a Siamese convolutional neural network with two convolution layers. The method accelerates image feature extraction with the GPU and positions the target with the convolution layers, improving the overall tracking speed of the system; it updates the model with a similarity-based template update strategy, improving the reliability of template updating.
The method can be applied to real-time tracking of specific targets under natural conditions, such as target positioning for autonomous vehicles, human gesture tracking in virtual reality, intelligent traffic monitoring, and video behavior recognition. The system is easy to build, simple to install, and low in cost.
The processing flow for realizing target tracking based on similar template updating provided by the embodiment of the invention is shown in fig. 3, and comprises the following processing steps:
step S31, the data reading thread finishes reading video data first, transcodes and frames the video data to obtain a video frame picture sequence including a target, where the video frame picture sequence includes video frame pictures corresponding to each time. And then, preprocessing each video frame picture in the sequence, and transmitting the preprocessed video frame picture to a target tracking module, wherein the preprocessing comprises histogram equalization, picture size adjustment and the like.
Step S32: image features are extracted from the video frame picture at the initial moment (namely the first video frame picture) through the target tracking module to obtain the target position information in the video frame picture at the initial moment t1. The target position information in that picture is then used to build the initial model T_init.
At the initial moment, i.e., when t equals 1, the target information is given by the first, known image frame of the target tracking task. According to the target position information in the video frame picture at the initial moment t1 given by the target tracking task, the image block of the t1 frame is passed through two convolution layers with 3×3 kernels to obtain a 125×125×32 convolution feature matrix, which is used as the initial model T_init. At this moment the target's incremental update model has not yet been updated (updating begins when the second frame is input), so the incremental update model T_incre(t1) is numerically equal to T_init.
At the initial moment, i.e., when t is 1, the target tracking task gives the initial coordinate information Z_0(x_0, y_0, h_0, w_0) of the target in the first frame image, denoting the coordinates (x_0, y_0) of the upper-left corner of the target image block in the first frame and the height and width (h_0, w_0) of the target image block.
According to the initial coordinate information, the system extracts the initial target image block I_0 of the target in the first frame and resizes the image of I_0 to 125×125. If I_0 is a color image, it contains 3 color channels and is essentially a 125×125×3 data matrix. If I_0 is a black-and-white image, it contains 1 color channel and is essentially a 125×125×1 data matrix, which the system converts into a 125×125×3 data matrix by channel replication.
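The channel-replication step for a black-and-white block can be done directly in NumPy (illustrative):

```python
import numpy as np

# A black-and-white target image block of shape (125, 125, 1) is
# expanded to three channels by replication so that it matches the
# 125 x 125 x 3 format of colour input.
patch_gray = np.arange(125 * 125, dtype=np.float64).reshape(125, 125, 1)
patch_rgb = np.repeat(patch_gray, 3, axis=2)
```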
The initial target image block I_0 taken out of the first frame is essentially a 125×125×3 data matrix, which the system converts into the initial model T_init through a four-layer network. The four layers are, in order: convolution layer 1, a ReLU layer, convolution layer 2, and a local response normalization layer. Convolution layer 1 has a kernel of size 3×3×3×32, a padded edge of 1 pixel, and a stride of [1, 1]. The ReLU layer applies a nonlinearity to the output of convolution layer 1, mitigating network overfitting. Convolution layer 2 has a kernel of size 3×3×32×32, a padded edge of 1 pixel, and a stride of [1, 1].
After passing through convolution layer 1, the ReLU layer, and convolution layer 2, the initial target image block I_0 yields a data matrix of size 125×125×32; the local response normalization layer then normalizes the distribution of this matrix to produce the initial model T_init, which is itself a 125×125×32 data matrix.
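The two-layer convolutional pipeline just described can be sketched as follows. This is an illustrative NumPy sketch with random weights: the network's learned weights and the exact local-response-normalization formula are not given here, so the `lrn` stand-in below (per-position channel-energy normalization) is an assumption:

```python
import numpy as np

def conv2d_same(x, w):
    # 'Same' 3x3 convolution, stride 1, 1-pixel zero padding.
    # x: (H, W, C_in), w: (3, 3, C_in, C_out) -> (H, W, C_out)
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    win = np.lib.stride_tricks.sliding_window_view(xp, (3, 3), axis=(0, 1))
    return np.einsum('hwcij,ijco->hwo', win, w, optimize=True)

def lrn(x, eps=1e-5):
    # Crude stand-in for local response normalization: scale each
    # position by its energy across channels (assumed form).
    return x / np.sqrt((x ** 2).sum(axis=2, keepdims=True) + eps)

rng = np.random.default_rng(0)
patch = rng.standard_normal((125, 125, 3))     # resized target image block I_0
w1 = rng.standard_normal((3, 3, 3, 32)) * 0.1  # convolution layer 1 weights
w2 = rng.standard_normal((3, 3, 32, 32)) * 0.1 # convolution layer 2 weights

feat = conv2d_same(patch, w1)
feat = np.maximum(feat, 0.0)                   # ReLU layer
feat = conv2d_same(feat, w2)
model = lrn(feat)                              # initial model, 125 x 125 x 32
```

With 1-pixel padding and stride 1, each 3×3 convolution preserves the 125×125 spatial size, so the final feature matrix has the 125×125×32 shape stated in the text.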
Step S33: the target tracking module extracts image features from the video frame picture at moment t to obtain the new model T_new(t) at moment t, and obtains the incremental update model T_incre(t) at moment t from the new model T_new(t-1) and the incremental update model T_incre(t-1) at moment t-1.
After the video frame picture at the current moment t is input into the target tracking module, the module extracts image features from it, obtains the specific position of the target in the picture according to the convolution response, and converts that position into the new model T_new(t) at moment t.
At each moment t, a new image frame of the video is input into the system, and the target tracking module obtains the positioning result Z_t(x_t, y_t, h_t, w_t) for the target from the new frame, denoting the coordinates (x_t, y_t) of the upper-left corner of the target image block in the t-th frame and the height and width (h_t, w_t) of the block. According to the tracking result Z_t(x_t, y_t, h_t, w_t) of each frame, the system extracts the target image block I_t of the target in the t-th frame and resizes the image of I_t to 125×125. If I_t is a color image, it contains 3 color channels and is essentially a 125×125×3 data matrix. If I_t is a black-and-white image, it contains 1 color channel and is essentially a 125×125×1 data matrix, which the system converts into a 125×125×3 data matrix by channel replication.
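The per-frame cropping implied by Z_t can be sketched as follows (illustrative; it assumes (x, y) index the column and row of the frame array, respectively):

```python
import numpy as np

def crop_target(frame, z):
    # z = (x, y, h, w): upper-left corner (x, y) plus height and width
    # of the target image block predicted for this frame.
    x, y, h, w = z
    return frame[y:y + h, x:x + w]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[100:150, 200:280] = 255  # pretend the target occupies this box
block = crop_target(frame, (200, 100, 50, 80))
```

The cropped block would then be resized to 125×125 and fed through the feature network as described in the text.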
At this time, the target image block I_t taken from the t-th frame is essentially a 125 × 125 × 3 data matrix, which the system converts through the four-layer network into the new model T_t^new for the t-th frame, itself a 125 × 125 × 32 data matrix.
The target tracking algorithm also obtains the incremental update model T_t^incre at time t from the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1:

T_t^incre = (1 − α) · T_{t-1}^incre + α · T_{t-1}^new

where α is the learning rate, i.e., the weight of the new model generated from the new video frame within the updated target model. To keep the target model stable, the learning rate α is generally about 0.01. Although this update is highly stable, the information from the new model makes up only a very small fraction of the overall model, so the updated model has difficulty describing the change in the target's appearance at time t.
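The incremental update with learning rate α can be sketched as a simple exponential moving average; the linear-interpolation form below follows from the description of α as the weight of the new model in the updated model, and is a sketch rather than the patent's exact formula.

```python
import numpy as np

def incremental_update(t_incre_prev: np.ndarray,
                       t_new_prev: np.ndarray,
                       alpha: float = 0.01) -> np.ndarray:
    """Exponential moving-average update sketched from the description:
    the new model enters the incremental model with weight alpha
    (alpha ~ 0.01 for stability)."""
    return (1.0 - alpha) * t_incre_prev + alpha * t_new_prev

t_incre = np.zeros((125, 125, 32))   # previous incremental model
t_new = np.ones((125, 125, 32))      # new model from the previous frame
t_incre = incremental_update(t_incre, t_new)
print(float(t_incre.max()))  # 0.01
```

With α ≈ 0.01 the new frame contributes only 1% of the updated model, which illustrates the stability/adaptability trade-off discussed above.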
Step S34: calculate the convolution response between the new model T_t^new and the initial model T_init, and convert the convolution response into a similarity value δ_init between the new model T_t^new and the initial model T_init; calculate the convolution response between the new model T_t^new and the current incremental update model T_t^incre, and convert the convolution response into a similarity value δ_incre between them; based on the similarity value δ_init and the similarity value δ_incre, determine the model T_t^new or T_t^incre as the final model at time t through a model update strategy.
Fig. 5 shows a schematic diagram of the processing procedure of the model update strategy according to an embodiment of the present invention, which is as follows: at time t, whether the updated model is the new model T_t^new or the incremental update model T_t^incre depends on the reliability and the full descriptiveness of the new model T_t^new. Reliability refers to the similarity δ_init between the new model T_t^new and the initial model T_init; under accurate tracking, this similarity should be as large as possible, to ensure that the new model and the initial model describe the same target. Full descriptiveness refers to the similarity δ_incre between the new model T_t^new and the current incremental update model T_t^incre; it should be small enough that the new model adequately describes the change in the target's appearance at time t.
According to this principle, the invention sets two similarity thresholds θ_init and θ_incre; suitable threshold values are obtained by trial and error. When δ_init ≥ θ_init, the reliability of the new model T_t^new is ensured; when δ_incre ≤ θ_incre, the full descriptiveness of the new model also meets the requirements of the tracker. When both similarities meet their threshold requirements, it is judged that the change in the target's appearance is sufficient to replace T_t^incre with the new model T_t^new, and at the same time that the new model T_t^new describes the specified target object O sufficiently reliably. The new model T_t^new is then taken as the final model at time t; otherwise, T_t^incre is selected as the final model at time t.
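The threshold-based update strategy can be sketched as the following decision function; the threshold values and symbol names are placeholders, since the patent obtains suitable thresholds by trial and error.

```python
def select_final_model(delta_init: float, delta_incre: float,
                       theta_init: float = 0.6,
                       theta_incre: float = 0.4) -> str:
    """Model update strategy sketched from the description: keep the
    new model only when it is both reliable (similar enough to the
    initial model) and fully descriptive (dissimilar enough from the
    incremental model).  Threshold values here are placeholders."""
    reliable = delta_init >= theta_init
    descriptive = delta_incre <= theta_incre
    return "new" if (reliable and descriptive) else "incremental"

print(select_final_model(0.9, 0.2))  # new
print(select_final_model(0.9, 0.8))  # incremental
```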
In the invention, the similarity between two models is calculated by a convolution response method. Taking the similarity between the new model T_t^new and the incremental update model T_t^incre as an example: the convolutional neural network convolves T_t^new with T_t^incre in two dimensions to compute the convolution response M_t, generating a convolution response map as shown in Fig. 4. The light areas in Fig. 4 indicate large response values M_t, meaning that in those areas the correlation between the new model T_t^new and the incremental update model T_t^incre is high, so they are likely to be the center position of the target. The darker the color, the smaller the response value M_t, i.e., the lower the correlation, and the less likely the area is part of the target. The maximum point of the convolution response map is the new position of the target.
In principle, the response values reflect the degree of correlation between the new model T_t^new and the incremental update model T_t^incre. The magnitudes of the response values are mapped to [0, 1] by normalization, and the maximum value δ of the normalized responses is then taken, so that δ reflects, as a percentage-style similarity, the degree of correlation between the new model T_t^new and the incremental update model T_t^incre.
Each model T is itself a 125 × 125 × 32 matrix, where N = 32 is the number of channels; each channel can therefore be regarded as a 125 × 125 two-dimensional image matrix. For two models T_1 and T_2, the convolution response matrix Δ is calculated from the convolution of the two in each channel. Let the models of T_1 and T_2 at the i-th channel be T_1^i and T_2^i respectively; the convolution response Δ_i of the corresponding channel can be expressed as:

Δ_i(s, t) = Σ_x Σ_y T_1^i(x, y) · T_2^i(x + s, y + t)

where Δ(s, t) denotes the value in row s and column t of the matrix Δ, and T(x, y) denotes the value in row x and column y of the matrix T. The similarity δ_i of channel i can then be expressed in terms of max(Δ_i(s, t)), the largest value in the matrix Δ_i. Each Δ_i is a 125 × 125 matrix; combining the similarity matrices Δ_i obtained from each channel yields a 125 × 125 × 32 similarity matrix between the models T_1 and T_2, and the maximum value of this similarity matrix gives the similarity value δ between T_1 and T_2.
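The per-channel convolution-response similarity can be sketched as follows. The exact normalization that maps responses to [0, 1] is not specified, so a cosine-style normalization by the channel norms is assumed here.

```python
import numpy as np
from scipy.signal import fftconvolve

def channel_similarity(c1: np.ndarray, c2: np.ndarray) -> float:
    """Peak of the 2-D cross-correlation of two channel maps, divided
    by the product of the channel norms so the value lies in [0, 1]
    (the patent's exact normalization is assumed, not specified)."""
    # flipping one input turns convolution into cross-correlation
    corr = fftconvolve(c1, c2[::-1, ::-1], mode="full")
    denom = np.linalg.norm(c1) * np.linalg.norm(c2) + 1e-12
    return float(corr.max() / denom)

def convolution_similarity(t1: np.ndarray, t2: np.ndarray) -> float:
    """Similarity delta between two H x W x N models: the largest
    normalized per-channel response peak."""
    deltas = [channel_similarity(t1[:, :, i], t2[:, :, i])
              for i in range(t1.shape[2])]
    return max(deltas)

rng = np.random.default_rng(0)
model = rng.random((125, 125, 32))
delta_same = convolution_similarity(model, model)
print(round(delta_same, 3))  # 1.0: a model is maximally similar to itself
```

By the Cauchy-Schwarz inequality the zero-shift response of a channel with itself is its squared norm, so the normalized similarity of a model with itself is 1, matching the intuition that identical models are maximally correlated.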
Thus, in our tracker, there are three models at each time t: the initial model T_init, the new model T_t^new generated from the video frame at time t, and the target model T_{t-1}^incre generated by incremental updating at the previous time t-1. Three pipelines store these three models. The final model at time t is obtained, the current position of the target in the model is converted into a coordinate frame on the video frame, and the coordinate frame is displayed on the user interface.
Example two
The embodiment provides an object tracking device based on similar template updating, the structure of the device is shown in fig. 6, and the device comprises the following modules:
the video data preprocessing module 61 is configured to transcode and frame-divide the video data to obtain video frame pictures corresponding to each time point including the target;
a video frame primary filtering processing module 62, configured to perform image feature extraction processing on the video frame picture at the initial time through the target tracking module to obtain the initial model T_init and the initial incremental update model T_1^incre;
a current video frame filtering processing module 63, configured to perform image feature extraction processing on the video frame picture at the current time t through the target tracking module to obtain the new model T_t^new at time t, and to calculate the incremental update model T_t^incre at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1;
a current video frame model determining module 64, configured to calculate, by a convolution response method, the similarity value δ_init between the new model T_t^new and the initial model T_init and the similarity value δ_incre between the new model T_t^new and the current incremental update model T_t^incre, and to select the new model T_t^new or the incremental update model T_t^incre as the final model at time t through a model update strategy according to the similarity value δ_init and the similarity value δ_incre.
Preferably, the video data preprocessing module 61 is specifically configured to complete video data reading in by a data reading thread, perform transcoding and framing processing on the video data to obtain a video frame picture sequence including a target, where the video frame picture sequence includes video frame pictures corresponding to each time;
and carrying out preprocessing operation on each video frame picture in the video frame picture sequence, and transmitting the preprocessed video frame picture to a target tracking module, wherein the preprocessing comprises histogram equalization and picture size adjustment.
Preferably, the video frame primary filtering processing module 62 is specifically configured to extract, through the target tracking module, image features of the video frame picture at the initial time t_1, and to set target position information in the video frame picture at the initial time t_1 according to the target tracking task; after the image block of the frame at time t_1 passes through two convolutional layers with 3 × 3 kernels, a 125 × 125 × 32 convolution feature matrix is obtained, and this convolution feature matrix serves as the initial model T_init; the incremental update model T_1^incre at time t_1 is equal in value to T_init.
Preferably, the current-time video frame filtering processing module 63 is specifically configured to perform image feature extraction on the input video frame picture at time t by using the convolutional layers through the target tracking module, obtain the specific position of the target in the video frame picture at time t, and convert the specific position of the target in the video frame picture at time t into the new model T_t^new at time t;
the target tracking algorithm then obtains the incremental update model at time t from the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1:

T_t^incre = (1 − α) · T_{t-1}^incre + α · T_{t-1}^new

where α is a set learning rate.
Preferably, the current-time video frame model determining module 64 is specifically configured to calculate the convolution response between the new model T_t^new and the initial model T_init and convert it into the similarity value δ_init, and to calculate the convolution response between the new model T_t^new and the current incremental update model T_t^incre and convert it into the similarity value δ_incre;
two similarity thresholds θ_init and θ_incre are set; if δ_init ≥ θ_init and δ_incre ≤ θ_incre, the new model T_t^new is taken as the final model at time t; otherwise, T_t^incre is selected as the final model at time t.
The specific process of performing similar template update-based target tracking by using the apparatus of the embodiment of the present invention is similar to that of the foregoing method embodiment, and is not described herein again.
In summary, the embodiment of the present invention obtains, by means of a convolutional neural network, the similarities between the new model T_t^new at the current time and both the initial model T_init and the current incremental update model T_t^incre, and selects, according to these similarities and through a model update strategy, T_t^new or T_t^incre as the final model at time t. No additional algorithm is needed for the similarity calculation, so the reliability of the new model is detected rapidly.
The template updating strategy designed by the invention comprises three pipeline models: a conventional (incrementally updated) model, a new model, and an initial model. This three-pipeline mechanism ensures that when the reliability of the new model is not high, the original, highly reliable target tracking at normal scale is still maintained; when the reliability of the new model is high, the model generated by combining the new model and the conventional model is used for target tracking.
The invention sets two similarity thresholds θ_init and θ_incre, and the final target model is decided by comparing the actual similarities with these two thresholds. The similarity δ_init ensures the relevance between the updated model and the original target, thereby ensuring the stability of the target model. The similarity δ_incre judges whether the updated model contains information about the change of the target's appearance in the video, which describes the variability of the target. The tracking stability of the target tracking system can thus be ensured.
The convolution response matrix M_t generated in the present invention can be applied in related video analysis systems. For example, in a target occlusion judging system: for a target whose appearance changes consistently as a whole, the response matrix M_t has a peak where the target exists, but when the values in a local area are small, it can be judged that the local area is occluded. Similarly, it can be applied in a system for analyzing the degree of change of the target's appearance: the response matrix M_t changes with the video frame input at each time t, and the change in the matrix values reflects the degree of change of the target's appearance.
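The occlusion-judging use of the response matrix M_t can be sketched as follows; the threshold fraction used to decide that a local area's values are "small" is an assumption for illustration.

```python
import numpy as np

def occluded_regions(m_t: np.ndarray, frac: float = 0.2) -> np.ndarray:
    """Illustrative sketch of the occlusion-judging idea above: flag
    positions of the response matrix M_t whose values fall below a
    fraction of the global peak.  The fraction is an assumption, not
    taken from the patent."""
    return m_t < frac * m_t.max()

m = np.ones((125, 125))
m[40:60, 40:60] = 0.05          # a weak local area inside a strong response
mask = occluded_regions(m)
print(mask.sum())               # 400 flagged positions (the 20x20 block)
```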
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the descriptions in the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A target tracking method based on similar template updating is characterized by comprising the following steps:
transcoding and framing the video data to obtain video frame pictures corresponding to all moments including the target;
performing image feature extraction processing on the video frame picture at the initial time through a target tracking module to obtain an initial model T_init and an initial incremental update model T_1^incre;
performing image feature extraction processing on the video frame picture at the current time t through the target tracking module to obtain a new model T_t^new at time t, and calculating an incremental update model T_t^incre at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1;
calculating, by a convolution response method, a similarity value δ_init between the new model T_t^new and the initial model T_init and a similarity value δ_incre between the new model T_t^new and the current incremental update model T_t^incre, and selecting the new model T_t^new or the incremental update model T_t^incre as the final model at time t through a model update strategy according to the similarity value δ_init and the similarity value δ_incre;
wherein extracting the image features of the video frame picture at the current time t through the target tracking module to obtain the new model T_t^new at time t, and calculating the incremental update model T_t^incre at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1, comprises:
performing image feature extraction on the input video frame picture at time t by using convolutional layers through the target tracking module to obtain the specific position of the target in the video frame picture at time t, and converting the specific position of the target in the video frame picture at time t into the new model T_t^new at time t;
obtaining, through the target tracking algorithm, the incremental update model at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1: T_t^incre = (1 − α) · T_{t-1}^incre + α · T_{t-1}^new;
wherein α is a set learning rate;
and wherein calculating, by the convolution response method, the similarity value δ_init and the similarity value δ_incre, and selecting the new model T_t^new or the incremental update model T_t^incre as the final model at time t, comprises:
calculating the convolution response between the new model T_t^new and the initial model T_init and converting it into the similarity value δ_init between T_t^new and T_init, and calculating the convolution response between the new model T_t^new and the current incremental update model T_t^incre and converting it into the similarity value δ_incre;
2. The method of claim 1, wherein the transcoding and framing the video data to obtain the video frame pictures corresponding to the respective moments including the target comprises:
the data reading thread finishes reading in the video data, transcoding and framing the video data to obtain a video frame picture sequence containing a target, wherein the video frame picture sequence comprises video frame pictures corresponding to all moments;
and carrying out preprocessing operation on each video frame picture in the video frame picture sequence, and transmitting the preprocessed video frame picture to a target tracking module, wherein the preprocessing comprises histogram equalization and picture size adjustment.
3. The method according to claim 1, wherein performing image feature extraction on the video frame picture at the initial time through the target tracking module to obtain the initial model T_init and the initial incremental update model T_1^incre comprises:
extracting, through the target tracking module, image features of the video frame picture at the initial time t_1, and setting target position information in the video frame picture at the initial time t_1 according to the target tracking task; after the image block of the frame at time t_1 passes through two convolutional layers with 3 × 3 kernels, a 125 × 125 × 32 convolution feature matrix is obtained, and this convolution feature matrix serves as the initial model T_init; the incremental update model T_1^incre at time t_1 is equal in value to T_init.
4. An object tracking device based on similar template updating, comprising:
the video data preprocessing module is used for transcoding and framing the video data to obtain video frame pictures corresponding to all moments including targets;
a video frame primary filtering processing module, configured to perform image feature extraction on the video frame picture at the initial time through the target tracking module to obtain an initial model T_init and an initial incremental update model T_1^incre;
a current video frame filtering processing module, configured to perform image feature extraction on the video frame picture at the current time t through the target tracking module to obtain a new model T_t^new at time t, and to calculate an incremental update model T_t^incre at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1;
a current video frame model determining module, configured to calculate, by a convolution response method, a similarity value δ_init between the new model T_t^new and the initial model T_init and a similarity value δ_incre between the new model T_t^new and the current incremental update model T_t^incre, and to select the new model T_t^new or the incremental update model T_t^incre as the final model at time t through a model update strategy according to the similarity value δ_init and the similarity value δ_incre;
wherein the current video frame filtering processing module is specifically configured to perform image feature extraction on the input video frame picture at time t by using convolutional layers through the target tracking module, obtain the specific position of the target in the video frame picture at time t, and convert the specific position of the target in the video frame picture at time t into the new model T_t^new at time t;
and to obtain, through the target tracking algorithm, the incremental update model at time t according to the new model T_{t-1}^new at time t-1 and the incremental update model T_{t-1}^incre at time t-1: T_t^incre = (1 − α) · T_{t-1}^incre + α · T_{t-1}^new;
wherein α is a set learning rate;
and wherein the current video frame model determining module is specifically configured to calculate the convolution response between the new model T_t^new and the initial model T_init and convert it into the similarity value δ_init, and to calculate the convolution response between the new model T_t^new and the current incremental update model T_t^incre and convert it into the similarity value δ_incre;
5. The apparatus of claim 4, wherein:
the video data preprocessing module is specifically used for completing video data reading in by a data reading thread, transcoding and framing the video data to obtain a video frame picture sequence containing a target, wherein the video frame picture sequence comprises video frame pictures corresponding to all moments;
and carrying out preprocessing operation on each video frame picture in the video frame picture sequence, and transmitting the preprocessed video frame picture to a target tracking module, wherein the preprocessing comprises histogram equalization and picture size adjustment.
6. The apparatus of claim 5, wherein:
the video frame primary filtering processing module is specifically configured to extract, through the target tracking module, image features of the video frame picture at the initial time t_1, and to set target position information in the video frame picture at the initial time t_1 according to the target tracking task; after the image block of the frame at time t_1 passes through two convolutional layers with 3 × 3 kernels, a 125 × 125 × 32 convolution feature matrix is obtained, and this convolution feature matrix serves as the initial model T_init; the incremental update model T_1^incre at time t_1 is equal in value to T_init.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910734740.0A CN110796680B (en) | 2019-08-09 | 2019-08-09 | Target tracking method and device based on similar template updating |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910734740.0A CN110796680B (en) | 2019-08-09 | 2019-08-09 | Target tracking method and device based on similar template updating |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110796680A CN110796680A (en) | 2020-02-14 |
CN110796680B true CN110796680B (en) | 2022-07-29 |
Family
ID=69427419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910734740.0A Active CN110796680B (en) | 2019-08-09 | 2019-08-09 | Target tracking method and device based on similar template updating |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110796680B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899284B (en) * | 2020-08-14 | 2024-04-09 | 北京交通大学 | Planar target tracking method based on parameterized ESM network |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410247A (en) * | 2018-10-16 | 2019-03-01 | 中国石油大学(华东) | A kind of video tracking algorithm of multi-template and adaptive features select |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10657625B2 (en) * | 2016-03-29 | 2020-05-19 | Nec Corporation | Image processing device, an image processing method, and computer-readable recording medium |
-
2019
- 2019-08-09 CN CN201910734740.0A patent/CN110796680B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410247A (en) * | 2018-10-16 | 2019-03-01 | 中国石油大学(华东) | A kind of video tracking algorithm of multi-template and adaptive features select |
Non-Patent Citations (3)
Title |
---|
UCT: Learning Unified Convolutional Networks for Real-time Visual Tracking;zheng zhu etc.;《2017 IEEE International Conference on Computer Vision Workshops (ICCVW)》;20171029;pp.1973-1982 * |
基于孪生网络的目标跟踪算法及其在舰船场景中的应用研究;王永;《中国优秀博硕士学位论文全文数据库(硕士)工程科技Ⅱ辑》;20190715(第07期);第C032-9页 * |
视频序列中动目标检测与跟踪算法的研究;焦安霞;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20100615(第06期);第I138-443页 * |
Also Published As
Publication number | Publication date |
---|---|
CN110796680A (en) | 2020-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |