CN111899283B

CN111899283B - Video target tracking method

Info

Publication number: CN111899283B
Application number: CN202010753190.XA
Authority: CN
Inventors: 孟宇; 邓在旭; 沈伾伾; 焦志宝; 许焱
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2023-10-17
Anticipated expiration: 2040-07-30
Also published as: CN111899283A

Abstract

The application provides a video target tracking method, belonging to the field of computer vision. The method comprises the following steps: inputting the target image and the search image into a hierarchy correlation twin network at the same time to perform feature extraction to obtain convolution features extracted by different convolution layers, performing correlation measurement on the convolution features of the target image and the search image extracted by the same convolution layer, and generating hierarchy correlation by hierarchy splicing of the correlation of each layer; taking the position of the search image with the highest tracking response in the hierarchical correlation as the center position of the tracking target in the search image; and determining the position of the tracking target in the search image according to the central position of the tracking target in the search image and the independent scale factors. By adopting the method and the device, any target can be accurately tracked.

Description

Video target tracking method

Technical Field

The application relates to the field of computer vision, in particular to a video target tracking method.

Background

In recent years, the number of automobiles has been rapidly increased with the improvement of the living standard of people and the great change of the automobile manufacturing industry, but the available road resources are smaller and smaller, and the human self-reaction capability and the perception capability are limited, so that the traffic accident rate has been continuously increased in recent years due to the fact that the information fed back from the outside is wrongly judged. According to incomplete statistics, the number of traffic accident deaths caused by driving automobiles in the world is over 3000 ten thousand, and more than the number of world deaths caused by large combat. With the opportunity of revolutionary changes brought to the automobile manufacturing industry by the internet technology, unmanned vehicles represent a rapid development potential in the current society, and the main purpose of the unmanned vehicles is to separate people from complex driving operations and improve the safety of vehicles running on roads.

However, the unmanned vehicle has a certain difficulty in actually putting unmanned vehicle into practice, and the most critical problem is that the unmanned vehicle cannot accurately judge complex road conditions and obstacle conditions according to the prior experience like a human brain. The video target tracking is used as a key ring in the unmanned vehicle, the target in front of the vehicle is tracked in real time, the dynamic state of the target in front of the vehicle can be mastered, and a basis is provided for the unmanned vehicle to make a correct decision in the current environment, so that various necessary basic operations such as vehicle distance maintenance, lane changing, vehicle speed adjustment and the like can be ensured in the driving process, the performance of the unmanned vehicle is greatly improved, unnecessary accidents are reduced, and the driving safety is improved.

However, the existing video target tracking method has the problems of low tracking accuracy and the like.

Disclosure of Invention

The embodiment of the application provides a video target tracking method, which can improve the accuracy of target tracking. The technical scheme is as follows:

in one aspect, a video object tracking method is provided, the method being applied to an electronic device, the method comprising:

inputting the target image and the search image into a hierarchy correlation twin network at the same time to perform feature extraction to obtain convolution features extracted by different convolution layers, performing correlation measurement on the convolution features of the target image and the search image extracted by the same convolution layer, and generating hierarchy correlation by splicing the correlation of each layer, wherein the target image comprises: tracking a target;

taking the position of the search image with the highest tracking response in the hierarchical correlation as the center position of the tracking target in the search image;

and determining the position of the tracking target in the search image according to the central position of the tracking target in the search image and the independent scale factors.

Further, the step of simultaneously inputting the target image and the search image into the hierarchy correlation twin network to perform feature extraction, and the step of obtaining convolution features extracted by different convolution layers includes:

simultaneously inputting the target image and the search image into two branches of the hierarchical correlation twin network for feature extraction to perform convolution calculation, so as to obtain convolution features extracted by different convolution layers;

each branch structure for extracting the characteristics in the hierarchical correlation twin network is as follows: (conv1+relu+overlay+max POOL) — (conv2+relu+overlay+max POOL) — (conv3+relu) — (conv4+relu) — (conv5+relu);

where conv represents the convolutional layer, reLU represents the nonlinear activation function, overmapping represents the local response normalization layer, and Max POOL represents the maximum pooling layer.

Further, the formula for performing correlation measurement on the convolution characteristics of the target image and the search image extracted by the same convolution layer is as follows:

wherein F (z, x) _i Representing a correlation between the target image extracted by the convolution layer i and the convolution characteristics of the search image; z and x represent the target image and the search image, respectively;a convolution characteristic representing the output of the convolution layer i; beta represents the deviation.

Further, the step of using the position of the search image with the highest tracking response in the hierarchical relevance as the tracking target at the center position of the search image comprises the following steps:

inputting the maximum correlation in the hierarchical correlation into a correlation attention module; the structure of the correlation attention module is as follows: full tie layer 1-full tie layer 2-full tie layer 3-full tie layer 4-softmax layer;

the correlation among the convolution characteristics of different layers is learned through four full-connection layers, and corresponding weights are distributed to each convolution layer through a softmax layer;

and determining the tracking response of each layer of convolution layer according to the correlation of each layer of convolution layer and the weight of the corresponding convolution layer obtained by allocation, and taking the position of the search image with the highest tracking response as the center position of the tracking target in the search image.

Further, the highest tracking response is expressed as:

wherein Y (z, x) represents the highest tracking response; z and x represent the target image and the search image, respectively;a convolution characteristic representing the output of the convolution layer i; alpha _i A weight assigned to convolution layer i; beta represents the deviation.

Further, the scale factors independent in the width direction are expressed as:

s _w (w+p)＝A _w

the scale factors independent in the height direction are expressed as:

s _h (h+p)＝A _h

wherein s is _w Sum s _h Representing scale factors of the target in the width direction and the height direction, respectively; w and h represent the width and height of the target, respectively; p represents a filled region; a is that _w And A is a _h The size of the object in the width direction and the height direction are indicated, respectively.

Further, the filled region p is expressed as:

p＝(w+h)/2。

in one aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the video object tracking method described above.

In one aspect, a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the video object tracking method described above is provided.

The technical scheme provided by the embodiment of the application has the beneficial effects that at least:

1) The hierarchical correlation twin network is provided on the basis of the twin network to track the target, and can comprehensively utilize the characteristic information of a plurality of convolution layers, so that the selection of tracking target positions is increased, and the tracking accuracy of the video target tracking method is improved;

2) The video target tracking method can be adaptively adjusted when tracking different targets through the correlation attention module, different weights can be distributed to the correlation of each layer, the selection of the positions of the tracked targets is further enhanced, and the tracking accuracy is improved;

3) Independent scale factors are used in the width and height directions of the tracking target to output frames (i.e.: the size of the tracking target) can reduce the deformation of the output frame and increase the tracking accuracy;

4) Is more robust to the conditions of complex background and large scale change of the tracking target.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a video target tracking method according to an embodiment of the present application;

fig. 2 is a schematic structural diagram of a hierarchical correlation twin network according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a correlation attention module according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, an embodiment of the present application provides a video object tracking method, which may be implemented by an electronic device, where the electronic device may be a terminal or a server, and the method includes:

s101, inputting a target image and a search image into a hierarchy correlation twin network at the same time to perform feature extraction to obtain convolution features extracted by different convolution layers, performing correlation measurement on the convolution features of the target image and the search image extracted by the same convolution layer, and generating hierarchy correlation by hierarchical stitching of the correlation of each layer, wherein the target image comprises: tracking a target;

s102, taking the position of the search image with highest tracking response in the hierarchical correlation as the center position of the tracking target in the search image;

s103, determining the position of the tracking target in the search image according to the central position of the tracking target in the search image and the independent scale factors.

According to the video target tracking method, a hierarchical correlation twin network is provided on the basis of the twin network to track the target, and the hierarchical correlation twin network can comprehensively utilize the characteristic information of a plurality of convolution layers, so that the selection of tracking target positions is increased, and the tracking accuracy of the video target tracking method is improved; because independent scale factors are used, the influence on deformation of a tracking target caused by zooming pictures can be reduced, and the accuracy of target tracking is further improved.

In a specific embodiment of the foregoing video object tracking method, further, each branch structure for performing feature extraction in the hierarchical correlation twin network is: (conv1+relu+overlay+max POOL) — (conv2+relu+overlay+max POOL) — (conv3+relu) — (conv4+relu) — (conv5+relu);

In this embodiment, the structure for extracting features in the hierarchical correlation twin network includes: five convolutional layers (conv); to prevent the gradient vanishing problem, a ReLU nonlinear activation function is added after each convolution layer.

In this embodiment, after the ReLU nonlinear activation functions of conv1 and conv2, a local response normalization layer is connected to accelerate convergence of the hierarchical correlation twin network, and at the same time, a maximum pooling layer is connected to reduce the size of the feature map after the local response normalization layer.

In this embodiment, since the target tracking task is different from the target detection task, it is not necessary to output the category of the tracking target, and thus, it is not necessary to use a full connection layer when extracting the features.

In this embodiment, as shown in fig. 2, for example, a target image with a size of 127×127 and a search image with a size of 255×255 may be input into a hierarchical correlation twin network at the same time to perform convolution calculation by two branches for feature extraction, so as to obtain convolution features extracted by different convolution layers; then, correlation measurement is carried out on the convolution characteristics of the target image and the search image extracted by the same layer of convolution layers, the correlation of each layer (common five layers) is spliced through the layers to generate the layer correlation, and the layer correlation of 5×17×17 is obtained, wherein 17×17 in the 5×17×17 is the size of the feature map output by each layer of convolution layers, and 5 is the number of convolution layers.

In this embodiment, the hierarchical concatenation may be understood as superposition, like RGB, where different R, G, B values are superimposed to show different colors.

In a specific embodiment of the foregoing video object tracking method, further, a formula for performing a correlation measurement on a convolution feature of a target image and a search image extracted by a same convolution layer is:

In this embodiment, the correlation metric can measure the difference between the convolution characteristics of the target image and the search image extracted by the same layer of convolution layers located in different branches.

In a specific embodiment of the foregoing video object tracking method, further, the step of using a position of the search image having the highest tracking response in the hierarchical relevance as the tracking object at a center position of the search image includes:

inputting the maximum correlation in the hierarchical correlation into a correlation attention module; the structure of the correlation attention module is as follows: full tie layer 1-full tie layer 2-full tie layer 3-full tie layer 4-softmax layer, as shown in fig. 3;

According to the method, the video target tracking method can be adaptively adjusted when tracking different targets through the correlation attention module, different weights can be distributed to the correlation of each layer, the selection of the positions of the tracked targets is further enhanced, and the tracking accuracy is improved.

In a specific embodiment of the foregoing video object tracking method, further, the highest tracking response is expressed as:

In a specific embodiment of the foregoing video object tracking method, further, the scale factors independent in the width direction are expressed as:

s _w (w+p)＝A _w

the scale factors independent in the height direction are expressed as:

s _h (h+p)＝A _h

wherein s is _w Sum s _h Respectively representing scale factors of the tracking target in the width direction and the height direction; w and h represent the width and height of the tracking target on the target image, respectively; p represents a filled region, p= (w+h)/2; a is that _w And A is a _h Respectively represent widthThe width and height of the target image input in the direction and height direction.

In this embodiment, the width w and the height h of the tracking target on the target image and the width A of the input target image can be used _w And height A _h Calculating scale factors s of the tracking target in the width direction and the height direction _w Sum s _h Further according to s _w Sum s _h The width and the height of the tracked object on the search image are calculated to be the final size of the output frame, so that the independent scale factors are used for adjusting the output frame (namely the size of the tracked object) in the width and the height directions of the tracked object, the deformation of the output frame can be reduced, and the tracking accuracy is increased.

In this embodiment, by using independent scale factors, the transformation of one dimension in the width and height directions will not affect the other dimension, and the tracking target will not be deformed basically, so that the influence on the deformation of the tracking target caused by scaling the target to a uniform size can be reduced.

According to the video target tracking method provided by the embodiment of the application, a hierarchical correlation twin network is provided on the basis of the twin network to track the target, and a correlation attention module and independent scale factors are designed for the hierarchical correlation twin network, so that compared with the video target tracking method based on a pure twin network, the video target tracking method has higher tracking accuracy and is more robust to the conditions of complex background and larger scale change of the tracked target.

Next, the validity of the video target tracking method provided by the embodiment of the application is verified, and the method is specifically:

the framework was deep learned using Python assembly language and TensorFlow.

And using the ILSRVC2015-VID data set as a training data set, randomly selecting two frames of images from one video segment, cutting and scaling to a fixed size, and then inputting the video segment into a network, wherein the interval between the two frames of images is not more than 100 frames.

In order to accelerate the convergence speed of the tracking model (comprising a hierarchical correlation twin network and a correlation attention module) in a training stage, an optimization method is a momentum gradient descent method, an exponentially weighted average gradient is used for replacing an original gradient to update parameters, and the momentum is set to be 0.9.

The iteration batch is set to 8 image pairs.

The initial learning rate was 0.01.

The attenuation coefficient was 0.86.

The training round number is 60, and each round contains 53200 image pairs.

After training, the trained tracking model provided by the application is tested on OTB50, OTB100, VOT2015 and VOT2016 data sets, and the test proves that the tracking model provided by the embodiment improves the tracking accuracy by 6.5% compared with the video target tracking model based on a pure twin network (for example, a full convolution twin network), and can improve the tracking performance of an algorithm under the condition of not reducing the speed.

Fig. 4 is a schematic structural diagram of an electronic device 600 according to an embodiment of the present application, where the electronic device 600 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 601 and one or more memories 602, where at least one instruction is stored in the memories 602, and the at least one instruction is loaded and executed by the processors 601 to implement the video object tracking method described above.

In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the video object tracking method described above, is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims

1. A video object tracking method, comprising:

determining the position of the tracking target in the search image according to the central position of the tracking target in the search image and the independent scale factors;

the step of taking the position of the search image with the highest tracking response in the hierarchical relevance as a tracking target at the center position of the search image comprises the following steps:

determining the tracking response of each layer of convolution layer according to the correlation of each layer of convolution layer and the weight of the corresponding convolution layer obtained by distribution, and taking the position of the search image with the highest tracking response as the center position of the tracking target in the search image;

wherein the highest tracking response is expressed as:

wherein Y (z, x) represents the highest tracking response; z and x represent the target image and the search image, respectively;a convolution characteristic representing the output of the convolution layer i; alpha _i A weight assigned to convolution layer i; beta represents a deviation;

wherein the scale factors independent in the width direction are expressed as:

s _w (w+p)＝A _w

the scale factors independent in the height direction are expressed as:

s _h (h+p)＝A _h

wherein s is _w Sum s _h Respectively representing scale factors of the tracking target in the width direction and the height direction; w and h represent the width and height of the tracking target on the target image, respectively; p represents a filled region, p= (w+h)/2; a is that _w And A is a _h Representing the width and height of the target image input in the width direction and the height direction, respectively;

wherein the filled region p is denoted as:

p＝(w+h)/2。

2. the method of claim 1, wherein the step of simultaneously inputting the target image and the search image into the hierarchical correlation twin network to perform feature extraction, and obtaining convolution features extracted by different convolution layers comprises:

3. The video object tracking method according to claim 1, wherein the formula for performing the correlation measurement on the convolution characteristics of the object image and the search image extracted by the same convolution layer is: