CN113379806B - Target tracking method and system based on learnable sparse conversion attention mechanism - Google Patents


Info

Publication number
CN113379806B
CN113379806B (application CN202110929160.4A)
Authority
CN
China
Prior art keywords
target
image
frame
search area
learnable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110929160.4A
Other languages
Chinese (zh)
Other versions
CN113379806A (en)
Inventor
王军 (Wang Jun)
章利民 (Zhang Limin)
王员云 (Wang Yuanyun)
孟晨晨 (Meng Chenchen)
张珮芸 (Zhang Peiyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202110929160.4A priority Critical patent/CN113379806B/en
Publication of CN113379806A publication Critical patent/CN113379806A/en
Application granted granted Critical
Publication of CN113379806B publication Critical patent/CN113379806B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/60 Analysis of geometric attributes
    • G06T7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method and system based on a learnable sparse conversion attention mechanism, comprising the following steps: initializing the image in a given first-frame target frame to generate a target template image; in each subsequent frame, taking the target center of the image in the previous frame's target frame as the center point and obtaining several search-area images through a multi-scale strategy; inputting the target template image and the search-area images into a weight-sharing convolutional neural network model and extracting their features separately; performing spatial conversion and channel conversion on the extracted features based on the learnable sparse model; and, taking the target template depth feature as a convolution kernel, performing a sliding-window operation over the search-area images to obtain several score maps, then inferring the relative displacement and scale change of the target from the position of the maximum score value to realize target tracking. The method has good robustness and real-time performance and achieves a good target-tracking effect.

Description

Target tracking method and system based on learnable sparse conversion attention mechanism
Technical Field
The invention relates to the technical field of computer vision and digital image processing, in particular to a target tracking method and system based on a learnable sparse conversion attention mechanism.
Background
In recent years, visual tracking has been a research hotspot in computer vision: it estimates the target's position in subsequent video frames from the target's initial state in the first frame. The rapid development of deep learning has driven significant progress in the field. However, robust and accurate target tracking in complex scenes remains highly challenging under occlusion, motion blur, scale change, and illumination change.
In general, visual tracking algorithms fall into two categories: discriminative and generative. Specifically, (1) algorithms based on a discriminative model treat tracking as a binary classification problem: target and background information are extracted simultaneously to train a classifier that distinguishes the target from the background of the current frame, yielding the current-frame target position; (2) algorithms based on a generative model build a motion model through online learning and then search for the candidate region with the minimum reconstruction error to realize tracking. Meanwhile, in recent years, methods based on deep learning have exploited the strong representational power of depth features, greatly improving the robustness and accuracy of tracking algorithms, and have gradually become the mainstream.
Specifically, tracking algorithms based on deep learning mainly exploit the strong feature-extraction and representation capacity of convolutional neural networks to extract target features and separate foreground from background, thereby identifying the tracked target. Video tracking algorithms based on a Siamese (twin) network convert the tracking problem into a matching problem and allow offline end-to-end training on large-scale data sets, greatly improving both speed and accuracy.
However, in the prior art, the robustness and accuracy of the appearance models of some visual tracking algorithms are not ideal, and they cannot handle well the influence of appearance changes such as motion blur, illumination change, complex background, and occlusion.
Disclosure of Invention
In view of the above, it is necessary to address the prior-art problem that the appearance models of some visual tracking algorithms lack robustness and accuracy and cannot handle well the influence of appearance changes such as motion blur, illumination change, complex background, and occlusion.
The embodiment of the invention provides a target tracking method based on a learnable sparse conversion attention mechanism, wherein the method comprises the following steps:
the method comprises the following steps: initializing an image in a given first frame target frame to generate a target template image;
step two: in a second frame and a subsequent frame, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size;
step three: inputting the target template image and the search area image into a convolutional neural network model sharing weight values, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
step four: performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
step five: taking the depth features of the target template processed by the learnable sparse model as convolution kernels, and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
step six: according to the position with the maximum score value among the score maps, estimating the relative displacement of the target center of the previous-frame image in the current frame, and obtaining the scale change of the target tracking image through the multi-scale strategy, so as to realize tracking of the target.
The invention provides a target tracking method based on a learnable sparse conversion attention mechanism that combines a convolutional neural network model with a learnable sparse conversion model to obtain sparser and more robust target template and search-area image features. In addition, the similarity between the target template features and the search-area features is computed by cross-correlation, and a multi-scale strategy adapts to target scale changes. The proposed method has good robustness and real-time performance, handles appearance changes including occlusion, illumination change, and motion blur well, and finally achieves a good target-tracking effect.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein in step one, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively, and a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$.
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the step one, correlation coefficients are used
Figure 375433DEST_PATH_IMAGE004
Obtaining side lengths of target template images
Figure 48991DEST_PATH_IMAGE006
The corresponding expression is:
Figure 779049DEST_PATH_IMAGE007
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the second step, the side length of the image of the search area is searched
Figure 182349DEST_PATH_IMAGE008
By correlation coefficient
Figure 831505DEST_PATH_IMAGE004
Height from the image in the previous frame target frame
Figure 116993DEST_PATH_IMAGE009
And width
Figure 260529DEST_PATH_IMAGE010
And calculating to obtain the following concrete expression:
Figure 315073DEST_PATH_IMAGE011
wherein, when the previous frame is the first frame, the height and width of the image are respectively
Figure 666289DEST_PATH_IMAGE012
And
Figure 439073DEST_PATH_IMAGE013
the target tracking method based on the learnable sparse conversion attention mechanism is characterized in that in the second step, the side length of the image of the search area is obtained
Figure 120721DEST_PATH_IMAGE014
After the step of (a), the method further comprises:
center of object of image in previous frame object frame
Figure 295350DEST_PATH_IMAGE015
As a central point, by respectively
Figure 286309DEST_PATH_IMAGE017
As different side lengths, to obtain different search area images, wherein,
Figure 749651DEST_PATH_IMAGE018
wherein a plurality of the search area images are all adjusted to
Figure 359624DEST_PATH_IMAGE019
The size of (a).
In the third step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein in step four, performing the spatial conversion comprises: decomposing a local region of the input image into different frequency bands through successive row and column transforms, and initializing the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
The target tracking method based on the learnable sparse conversion attention mechanism, wherein step six specifically comprises: finding the position $(x_{\max}, y_{\max})$ of the maximum score value in the three score maps; calculating the relative displacement between this position and the target center of the image in the previous-frame target frame; and updating the position of the target center of the current-frame target tracking image according to the relative displacement, so as to locate the target.
The target tracking method based on the learnable sparse conversion attention mechanism, further comprising: updating the scale of the current-frame target tracking image according to the scale at which the maximum score value among the three score maps lies, wherein the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
The invention also provides a target tracking system based on the learnable sparse conversion attention mechanism, wherein the system comprises:
the first processing module is used for initializing the image in the given first frame target frame to generate a target template image;
the second processing module is used for obtaining a plurality of search area images by taking the target center of the image in the target frame of the previous frame as a central point through a multi-scale strategy in the second frame and the subsequent frames and adjusting the plurality of search area images to be the same in size;
the first learning module is used for inputting the target template image and the search area image into a convolutional neural network model sharing weight values and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
the second learning module is used for performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module is used for taking the depth features of the target template processed by the learnable sparse model as convolution kernels and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
and the positioning tracking module is used for estimating the relative displacement of the target center of the image in the target frame of the previous frame in the current frame according to the position with the maximum score value in the score maps, and acquiring the scale change of the target tracking image through a multi-scale strategy so as to realize the tracking of the target.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic diagram of a target tracking method based on a learnable sparse conversion attention mechanism proposed by the present invention;
fig. 2 is a structural diagram of a target tracking system based on a learnable sparse conversion attention mechanism proposed in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
In the prior art, the robustness and accuracy of the appearance models of some visual tracking algorithms are not ideal, and the influence of appearance changes such as motion blur, illumination change, and occlusion cannot be handled well.
In order to solve the technical problem, the present invention provides a target tracking method based on a learnable sparse transformation attention mechanism, please refer to fig. 1, the method includes the following steps:
s101, initializing the image in the given first frame target frame to generate a target template image.
In this step, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, and the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively. In addition, a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$, and the side length $z$ of the target template image is obtained from $p$. That is, an image block with center $(x_0, y_0)$ and side length $z$ is cropped out and resized to a fixed template size.
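As a concrete illustration of the cropping rule, here is a minimal Python sketch. The patent publishes its expressions for the correlation coefficient $p$ and the template side length $z$ only as equation images, so the SiamFC-style context margin $p = (h + w)/4$ and $z = \sqrt{(h + 2p)(w + 2p)}$ below are assumptions, not the patent's confirmed formulas.

```python
import math

def template_side(h: float, w: float) -> float:
    """Side length of the square template crop around the target.

    Assumes the SiamFC-style context margin p = (h + w) / 4; the patent's
    own expression for the "correlation coefficient" p is not recoverable
    from the published text, so this is an assumption.
    """
    p = (h + w) / 4.0  # assumed context margin
    return math.sqrt((h + 2 * p) * (w + 2 * p))
```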
And S102, in the second frame and the subsequent frames, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size.
Here, the sampling in step S102 is the same as in step S101, except that a multi-scale strategy is used to generate the search-region images, which are then resized to the same size.
Specifically, in this step, the side length $x$ of the search-area image is calculated from the correlation coefficient $p$ and the height $h_{t-1}$ and width $w_{t-1}$ of the image in the previous-frame target frame; when the previous frame is the first frame, the height and width are $h$ and $w$ respectively. After the side length $x$ is obtained, the target center $(x_{t-1}, y_{t-1})$ of the previous-frame target frame is taken as the center point, and three scaled side lengths $\{a^{-1}x,\ x,\ ax\}$ (where $a$ is a fixed scale step) are used to obtain different search-area images (pixels falling outside the range of the current frame are filled with the mean value). Finally, the three search-area images are all resized to the same fixed size.
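A sketch of the multi-scale cropping just described, with mean-value filling for out-of-frame pixels as the text specifies; the scale step 1.04, the 255-pixel output size, and the use of OpenCV for resizing are assumptions:

```python
import numpy as np
import cv2  # assumed available for resizing

def crop_search_regions(frame, cx, cy, x_side,
                        scales=(1 / 1.04, 1.0, 1.04), out_size=255):
    """Crop three scaled square search regions centered on (cx, cy).

    Out-of-frame pixels are filled with the per-channel mean, as the patent
    describes; the scale step and output size are illustrative assumptions.
    """
    mean = frame.mean(axis=(0, 1))
    crops = []
    for s in scales:
        half = int(round(x_side * s)) // 2
        # mean-padded canvas large enough for any crop position
        canvas = np.full((frame.shape[0] + 2 * half,
                          frame.shape[1] + 2 * half, 3), mean, dtype=frame.dtype)
        canvas[half:half + frame.shape[0], half:half + frame.shape[1]] = frame
        px, py = int(cx) + half, int(cy) + half  # center in padded coordinates
        patch = canvas[py - half:py + half, px - half:px + half]
        crops.append(cv2.resize(patch, (out_size, out_size)))
    return crops
```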
s103, inputting the target template image and the search area image into a convolutional neural network model sharing a weight, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network.
It is noted that the target template image and the search-area image use the same convolutional neural network with shared weights, and a fully convolutional network is used in the feature-extraction stage. When training the network parameters, the weights are randomly initialized and then optimized by backpropagation of a cross-entropy loss between the ground truth and the predicted value; the loss between the two is monitored to find a set of parameters that fits the training data well.
In addition, the feature-extraction backbone is AlexNet, of which the first four convolutional layers are used and the fully connected layers are removed; through large-scale offline end-to-end training, the AlexNet parameters can cope well with complex changes in target appearance.
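A minimal PyTorch sketch of such a shared-weight backbone. The patent states only that the first four convolutional layers of AlexNet are used with the fully connected layers removed; the exact channel widths, kernel sizes, and strides below follow the classic AlexNet stack and should be read as assumptions:

```python
import torch.nn as nn

class Backbone(nn.Module):
    """First four AlexNet-style conv layers, no fully connected layers.

    Channel widths and strides are assumptions; the patent only states that
    the first four convolutional layers of AlexNet are used.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3),
        )

    def forward(self, z, x):
        # one network, shared weights: template and search pass through the same layers
        return self.features(z), self.features(x)
```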
In this step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is expressed as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel.
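The formula above is the usual im2col view of convolution: each output value $y_{n,m}$ is the dot product of one flattened sliding window with one flattened kernel. A quick PyTorch check of that equivalence, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

X = torch.randn(1, 4, 8, 8)         # c = 4 input channels
W = torch.randn(6, 4, 3, 3)         # m = 6 kernels, k = 3
cols = F.unfold(X, kernel_size=3)   # each column is one sliding window x_n
Y = (W.view(6, -1) @ cols).view(1, 6, 6, 6)  # y_{n,m} = sum_i w_{m,i} * x_{n,i}
assert torch.allclose(Y, F.conv2d(X, W), atol=1e-5)
```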
And S104, performing spatial conversion and channel conversion on the target template depth features and the search-area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy.
The main purpose of the spatial conversion is to reduce the redundancy of spatial features. Specifically, a local region of the input image is decomposed into different frequency bands through successive row and column transforms, and the corresponding column and row transform weights are initialized; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
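The Kronecker construction can be checked numerically. In the sketch below, the initial column and row weights $W_c$ and $W_r$ are set to an orthonormal DCT-II basis, which decomposes a patch into frequency bands as the text describes; the DCT choice is an assumption, since the patent only says the weights are initialized and then learned:

```python
import numpy as np

k = 4
# assumed initialization: orthonormal DCT-II basis for columns and rows
n = np.arange(k)
W_c = np.sqrt(2.0 / k) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))
W_c[0, :] = 1.0 / np.sqrt(k)
W_r = W_c.copy()

W_s = np.kron(W_c, W_r)  # spatial-transform weights, shape (k*k, k*k)

# applying W_s to a row-major-flattened k-by-k patch equals W_c @ patch @ W_r.T
patch = np.random.randn(k, k)
lhs = (W_s @ patch.reshape(-1)).reshape(k, k)
rhs = W_c @ patch @ W_r.T
assert np.allclose(lhs, rhs)
```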
The channel conversion, in turn, mainly reduces the redundancy among channels: the correlations among channels are used to remap the input features, changing the number of channels. A residual structure is adopted, which on the one hand preserves the important information of the input features and on the other hand highlights the regions of interest in the input image.
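A sketch of the channel-conversion idea as described, i.e. a learned cross-channel mapping wrapped in a residual connection; the 1×1 convolution and the sigmoid gate are assumed details, chosen only to make the sketch concrete:

```python
import torch
import torch.nn as nn

class ChannelTransform(nn.Module):
    """Learned cross-channel mapping with a residual connection.

    The 1x1 convolution and sigmoid gate are assumptions; the patent only
    states that channel correlations remap the features and that a residual
    structure preserves the input information.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat):
        gate = torch.sigmoid(self.mix(feat))  # highlight regions of interest
        return feat + gate * feat             # residual: keep the input, add emphasis
```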
And S105, taking the depth feature of the target template processed by the learnable sparse model as a convolution kernel, and performing sliding window operation on the image of the search area to obtain a plurality of score maps.
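Step S105 is the standard Siamese cross-correlation: the template's depth feature acts as the convolution kernel slid over each search-area feature map. A minimal PyTorch sketch (the shape conventions are assumptions; the patent does not publish tensor sizes):

```python
import torch.nn.functional as F

def score_maps(template_feat, search_feats):
    """Cross-correlate the template feature over each search-area feature.

    template_feat: (c, hz, wz); search_feats: list of (c, hx, wx) tensors,
    one per scale. Returns one 2-D score map per scale.
    """
    kernel = template_feat.unsqueeze(0)                 # (1, c, hz, wz)
    return [F.conv2d(f.unsqueeze(0), kernel).squeeze()  # (hx-hz+1, wx-wz+1)
            for f in search_feats]
```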
And S106, according to the position with the maximum score value among the multiple score maps, estimating the relative displacement of the target center of the previous-frame image in the current frame, and obtaining the scale change of the target tracking image through the multi-scale strategy, so as to realize target tracking.
In this step, the position $(x_{\max}, y_{\max})$ of the maximum score value is found in the three score maps, and the relative displacement between this position and the target center of the image in the previous-frame target frame is calculated. The position of the target center of the current-frame target tracking image is then updated according to this relative displacement, so as to locate the target. Meanwhile, the scale of the current-frame target tracking image is updated according to the scale at which the maximum score value among the three score maps lies; the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
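As a sketch of how the peak position and scale are read out in step S106: the displacement of the peak from the score-map center, multiplied by the total stride of the backbone, gives the displacement in search-crop pixels. The stride value 8 is an assumption, and mapping back to frame coordinates additionally requires the crop-to-frame resize ratio:

```python
import numpy as np

def locate(score_maps, scales, stride=8):
    """Pick the best scale and the peak displacement from the map center.

    Returns (scale, dy, dx) with the displacement in search-crop pixels;
    stride is the assumed total stride of the backbone.
    """
    best = int(np.argmax([m.max() for m in score_maps]))
    m = score_maps[best]
    r, c = np.unravel_index(int(np.argmax(m)), m.shape)
    dy = (r - (m.shape[0] - 1) / 2) * stride
    dx = (c - (m.shape[1] - 1) / 2) * stride
    return scales[best], dy, dx
```

The returned displacement, rescaled by the crop-to-frame ratio, updates the target center; the returned scale, optionally damped to avoid jitter, updates the target size.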
The invention provides a target tracking method based on a learnable sparse conversion attention mechanism that combines a convolutional neural network model with a learnable sparse conversion model to obtain sparser and more robust target template and search-area image features. In addition, the similarity between the target template features and the search-area features is computed by cross-correlation, and a multi-scale strategy adapts to target scale changes. The proposed method has good robustness and real-time performance, handles appearance changes including occlusion, illumination change, and motion blur well, and finally achieves a good target-tracking effect.
Referring to fig. 2, the present invention further provides a target tracking system based on a learnable sparse conversion attention mechanism, wherein the system includes a first processing module 11, a second processing module 12, a first learning module 13, a second learning module 14, a sliding window processing module 15, and a positioning tracking module 16;
a first processing module 11, configured to initialize an image in a given first frame target frame to generate a target template image;
a second processing module 12, configured to, in the second frame and each subsequent frame, take the target center of the image in the previous-frame target frame as the center point, obtain a plurality of search-area images through a multi-scale strategy, and resize them to the same size;
the first learning module 13 is configured to input the target template image and the search area image into a convolutional neural network model sharing a weight, and extract a target template depth feature and a search area depth feature through a convolutional neural network respectively;
the second learning module 14 is configured to perform spatial transformation and channel transformation on the depth feature of the target template and the depth feature of the search area based on a learnable sparse model to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module 15 is configured to perform a sliding window operation on the search area image by using the depth feature of the target template processed by the learnable sparse model as a convolution kernel to obtain a plurality of score maps;
and the positioning and tracking module 16 is configured to estimate, according to the position with the largest score value in the multiple score maps, a relative displacement of a target center of the image in the target frame of the previous frame in the current frame, and obtain, through a multi-scale strategy, a scale change of the target tracking image, so as to implement tracking of the target.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A target tracking method based on a learnable sparse conversion attention mechanism is characterized by comprising the following steps:
the method comprises the following steps: initializing an image in a given first frame target frame to generate a target template image;
step two: in a second frame and a subsequent frame, taking the target center of the image in the target frame of the previous frame as a central point, obtaining a plurality of search area images through a multi-scale strategy, and adjusting the plurality of search area images to be the same in size;
step three: inputting the target template image and the search area image into a convolutional neural network model sharing weight values, and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
step four: performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model to reduce spatial feature redundancy and inter-channel redundancy;
step five: taking the depth features of the target template processed by the learnable sparse model as convolution kernels, and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
step six: according to the position with the maximum score value in the score maps, the relative displacement of the target center of the image in the target frame of the previous frame in the current frame is estimated, and the scale change of the target tracking image is obtained through a multi-scale strategy so as to realize the tracking of the target;
in the third step, in the step of extracting the depth features by the convolutional neural network, the corresponding convolution operation is represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel;

and in the fourth step, performing the spatial conversion comprises: decomposing a local region of the input image into different frequency bands through successive row and column transforms, and initializing the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
2. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 1, wherein in step one, the coordinates of the center of the target to be tracked in the first-frame target frame are $(x_0, y_0)$, the height and width of the target to be tracked in the first-frame target frame are $h$ and $w$ respectively, and a correlation coefficient $p$ is set accordingly as a function of $h$ and $w$.
3. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein in step one, the side length $z$ of the target template image is obtained from the correlation coefficient $p$.
4. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein in step two, the side length $x$ of the search-area image is calculated from the correlation coefficient $p$ and the height $h_{t-1}$ and width $w_{t-1}$ of the image in the previous-frame target frame; when the previous frame is the first frame, the height and width are $h$ and $w$ respectively.
5. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 4, wherein in step two, after the step of obtaining the side length $x$ of the search-area image, the method further comprises: taking the target center $(x_{t-1}, y_{t-1})$ of the image in the previous-frame target frame as the center point, and using three scaled side lengths $\{a^{-1}x,\ x,\ ax\}$ as different side lengths to obtain different search-area images, wherein the search-area images are all resized to the same fixed size.
6. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 2, wherein step six specifically comprises: finding the position $(x_{\max}, y_{\max})$ of the maximum score value in the three score maps, and calculating the relative displacement between this position and the target center of the image in the previous-frame target frame; and updating the position of the target center of the current-frame target tracking image according to the relative displacement, so as to locate the target.
7. The target tracking method based on the learnable sparse conversion attention mechanism according to claim 6, further comprising: updating the scale of the current-frame target tracking image according to the scale at which the maximum score value among the three score maps lies, wherein the corresponding scale change $\Delta s$ is determined by the scale $s^{*}$ at which that maximum lies.
8. A target tracking system based on a learnable sparse conversion attention mechanism, the system comprising:
the first processing module is used for initializing the image in the given first frame target frame to generate a target template image;
the second processing module is used for obtaining a plurality of search area images by taking the target center of the image in the target frame of the previous frame as a central point through a multi-scale strategy in the second frame and the subsequent frames and adjusting the plurality of search area images to be the same in size;
the first learning module is used for inputting the target template image and the search area image into a convolutional neural network model sharing weight values and respectively extracting a target template depth feature and a search area depth feature through a convolutional neural network;
the second learning module is used for performing spatial conversion and channel conversion on the target template depth features and the search area depth features based on a learnable sparse model, so as to reduce spatial feature redundancy and inter-channel redundancy;
the sliding window processing module is used for taking the depth features of the target template processed by the learnable sparse model as convolution kernels and performing sliding window operation on the image of the search area to obtain a plurality of score maps;
the positioning and tracking module is used for estimating the relative displacement of the target center of the image in the target frame of the previous frame in the current frame according to the position with the maximum score value in the score maps, and obtaining the scale change of the target tracking image through a multi-scale strategy so as to realize the tracking of the target;
wherein the first learning module is configured to extract the depth features through the convolutional neural network, the corresponding convolution operation being represented as:

$$y_{n,m} = \sum_{i=1}^{k \times k \times c} w_{m,i}\, x_{n,i}$$

where $X$ denotes the input image features, $Y$ the output features after the convolution operation, $k$ the convolution kernel size, $c$ the number of channels of the input image, $n$ the index of the sliding window, $x_{n,i}$ the $i$-th pixel of the tensor taken from the input features $X$ by sliding window $n$, and $w_{m,i}$ the $i$-th pixel of the $m$-th convolution kernel;

and wherein the second learning module is configured, when performing the spatial conversion, to decompose a local region of the input image into different frequency bands through successive row and column transforms and to initialize the corresponding column and row transform weights; concretely:

$$W_s = W_c \otimes W_r$$

where $W_s$ denotes the weights of the spatial transform, $\otimes$ denotes the Kronecker product, and $W_c$ and $W_r$ denote the initial transform weights for the columns and rows, respectively.
CN202110929160.4A 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism Active CN113379806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110929160.4A CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110929160.4A CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Publications (2)

Publication Number Publication Date
CN113379806A CN113379806A (en) 2021-09-10
CN113379806B true CN113379806B (en) 2021-11-09

Family

ID=77577066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110929160.4A Active CN113379806B (en) 2021-08-13 2021-08-13 Target tracking method and system based on learnable sparse conversion attention mechanism

Country Status (1)

Country Link
CN (1) CN113379806B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119669A (en) * 2021-11-30 2022-03-01 南昌工程学院 Image matching target tracking method and system based on Shuffle attention
CN115063445B (en) * 2022-08-18 2022-11-08 南昌工程学院 Target tracking method and system based on multi-scale hierarchical feature representation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108492313A (en) * 2018-02-05 2018-09-04 绍兴文理学院 A kind of dimension self-adaption visual target tracking method based on middle intelligence similarity measure
CN109427055A (en) * 2017-09-04 2019-03-05 长春长光精密仪器集团有限公司 The remote sensing images surface vessel detection method of view-based access control model attention mechanism and comentropy
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060274A (en) * 2019-04-12 2019-07-26 北京影谱科技股份有限公司 The visual target tracking method and device of neural network based on the dense connection of depth
CN111126132A (en) * 2019-10-25 2020-05-08 宁波必创网络科技有限公司 Learning target tracking algorithm based on twin network
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN111291679B (en) * 2020-02-06 2022-05-27 厦门大学 Target specific response attention target tracking method based on twin network
CN112991385B (en) * 2021-02-08 2023-04-28 西安理工大学 Twin network target tracking method based on different measurement criteria

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427055A (en) * 2017-09-04 2019-03-05 长春长光精密仪器集团有限公司 The remote sensing images surface vessel detection method of view-based access control model attention mechanism and comentropy
CN108492313A (en) * 2018-02-05 2018-09-04 绍兴文理学院 A kind of dimension self-adaption visual target tracking method based on middle intelligence similarity measure
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Also Published As

Publication number Publication date
CN113379806A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
US11551333B2 (en) Image reconstruction method and device
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN110570458B (en) Target tracking method based on internal cutting and multi-layer characteristic information fusion
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN113379806B (en) Target tracking method and system based on learnable sparse conversion attention mechanism
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN113313810B (en) 6D attitude parameter calculation method for transparent object
CN113989301A (en) Colorectal polyp segmentation method fusing neural networks of multiple attention mechanisms
CN111310768B (en) Saliency target detection method based on robustness background prior and global information
CN115393584A (en) Establishment method based on multi-task ultrasonic thyroid nodule segmentation and classification model, segmentation and classification method and computer equipment
EP3872761A2 (en) Analysing objects in a set of frames
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN110809126A (en) Video frame interpolation method and system based on adaptive deformable convolution
CN111860823A (en) Neural network training method, neural network training device, neural network image processing method, neural network image processing device, neural network image processing equipment and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN112270366A (en) Micro target detection method based on self-adaptive multi-feature fusion
CN114119669A (en) Image matching target tracking method and system based on Shuffle attention
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113538221A (en) Three-dimensional face processing method, training method, generating method, device and equipment
CN110503093B (en) Region-of-interest extraction method based on disparity map DBSCAN clustering
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN114782455B (en) Cotton row center line image extraction method for agricultural machine embedded equipment
CN106485686A (en) One kind is based on gravitational spectral clustering image segmentation algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant