CN114155273B - Video image single-target tracking method combining historical track information - Google Patents

Video image single-target tracking method combining historical track information

Info

Publication number
CN114155273B
CN114155273B (application CN202111221441.0A)
Authority
CN
China
Prior art keywords
image
target
feature map
layer
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111221441.0A
Other languages
Chinese (zh)
Other versions
CN114155273A (en)
Inventor
杨兆龙
庞惠民
夏永清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dali Technology Co ltd
Original Assignee
Zhejiang Dali Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dali Technology Co ltd filed Critical Zhejiang Dali Technology Co ltd
Priority to CN202111221441.0A priority Critical patent/CN114155273B/en
Publication of CN114155273A publication Critical patent/CN114155273A/en
Application granted granted Critical
Publication of CN114155273B publication Critical patent/CN114155273B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video image single-target tracking method combining historical track information, which comprises the following steps: the template image and the current frame search image are each fed into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map; the template image feature map and the search image feature map are then fed in turn into a trained convolutional neural network classification layer and regression layer to obtain a classification feature map and a regression feature map for the template image and for the search image; cross-correlation is performed on the classification feature maps and on the regression feature maps of the template image and the search image to obtain a classification-layer response map and a regression-layer response map; a maximum pooling operation is applied to the classification-layer response map; and the predicted coordinate value closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images is selected as the final predicted coordinate value of the target in the current frame search image.

Description

Video image single-target tracking method combining historical track information
Technical Field
The invention relates to a single-target tracking method combining historical track information, specifically a single-target tracking method based on a twin (Siamese) neural network and historical track information, and belongs to the fields of image processing and computer vision.
Background
Computer vision is the discipline that studies how to make a computer "see" like a human: cameras and computers take the place of human eyes, so that a machine can perform, on a target, the functions of extraction, identification and tracking that the human brain performs on what the eyes see.
Target tracking analyzes a sequence of video frames, matches the detected candidate target regions and locates the coordinates of the targets in the video sequence; in short, it localizes a target throughout the image sequence. Research on target tracking algorithms is a hotspot in computer vision and has important research and application value in scenarios such as virtual reality, human-computer interaction, intelligent surveillance, augmented reality and machine perception.
Target tracking in a single scene mainly studies the continuous tracking of a single target, i.e. tracking only one specific target in a video sequence captured by a single camera. Research in this area revolves around two basic problems. The first is target appearance modeling, also known as target matching: a corresponding appearance model is built from the target's appearance feature data, and this is the most important module of the algorithm. The quality of the appearance features directly affects tracking accuracy and robustness; commonly used features include contours, colors and textures. The second is the tracking strategy. Directly matching all content in the scene to search for the optimal position inevitably introduces a large amount of redundant information and leads to drawbacks such as heavy computation and low speed. Narrowing the search range with prior knowledge is an effective remedy; typical methods include hidden Markov models, Kalman filtering, the mean-shift algorithm and particle filtering.
Target tracking algorithms can be divided into two categories: discriminative tracking and generative tracking. A generative tracking algorithm models the target directly without considering background information: a model is learned to represent the target and is then matched directly against candidates to achieve tracking. A discriminative method models tracking as a binary classification problem, seeking a decision boundary that separates the target object from the background and maximally distinguishes target regions from non-target regions. In recent years deep learning has rapidly become a research hotspot and has achieved good results in the field of computer vision. Deep learning based on twin (Siamese) neural networks plays a significant role in single-target tracking; SiamFC is a typical application of a twin network to single-target tracking. Specifically, the structure has two inputs: a template that serves as the reference and a candidate sample to be evaluated. In the single-target tracking task the reference template is the object to be tracked, usually the target object selected in the first frame of the video sequence; the candidate sample is the image search region in each subsequent frame, and the twin network must find, in each subsequent frame, the candidate region most similar to the first-frame template, i.e. the target in that frame, thereby tracking a single target. Deep learning methods have markedly improved tracker speed and accuracy. SiamFC and other twin-network-based methods can meet real-time requirements on high-performance hardware, but they do not consider the target's historical track information during tracking; when an object identical to the target appears nearby in the scene, the target is easily lost and the accuracy of the tracking algorithm drops.
Disclosure of Invention
The invention solves the technical problems that: the method for tracking the single target by combining the historical track information is provided to solve the problem that the tracking target is lost when the same or similar targets appear in the scene.
The technical scheme for solving the technical problems is as follows: a video image single-target tracking method combined with historical track information comprises the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
s5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image;
S6, take the top N feature points, ordered from high to low response value, from the pooled classification-layer response map, compute the regression-layer output corresponding to these N feature points, and obtain N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video, record the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or later, proceed to step S8;
S8, from the N predicted coordinate values, find the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images, and take it as the final predicted coordinate value of the target in the current frame search image, where M and N are both greater than or equal to 2.
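To make the flow of steps S1 to S8 concrete, the following outline sketches one possible per-frame tracking loop in Python (PyTorch style). All helper names (backbone, cls_head, reg_head, xcorr, decode_candidate, select_by_history) are placeholders for the layers and operations described in this document, not names taken from the patent, and the defaults M=5 and N=4 follow the embodiment given later; this is a sketch of one reading, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def track_frame(template_img, search_img, backbone, cls_head, reg_head,
                xcorr, decode_candidate, select_by_history, history,
                M=5, N=4):
    """One possible reading of steps S1-S8; every helper passed in is assumed."""
    # S2: shared-weight feature extraction for the template and search branches
    z_feat = backbone(template_img)
    x_feat = backbone(search_img)

    # S3: classification and regression feature maps for both branches
    z_cls, x_cls = cls_head(z_feat), cls_head(x_feat)
    z_reg, x_reg = reg_head(z_feat), reg_head(x_feat)

    # S4: cross-correlation -> classification and regression response maps
    cls_resp = xcorr(z_cls, x_cls)
    reg_resp = xcorr(z_reg, x_reg)

    # S5: 3x3 max pooling that keeps the response-map size (stride/padding assumed)
    cls_resp = F.max_pool2d(cls_resp, kernel_size=3, stride=1, padding=1)

    # S6: top-N response points and their regression outputs -> N candidate coordinates
    top_idx = torch.topk(cls_resp.flatten(), N).indices
    candidates = [decode_candidate(reg_resp, idx) for idx in top_idx]

    # S7 / S8: first M frames use the peak response, later frames use the history
    if len(history) < M:
        pred = candidates[0]          # top-1 corresponds to the maximum response
    else:
        pred = select_by_history(candidates, history)
    history.append(pred)
    return pred
```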
Preferably, the cross-correlation operation in step S4 is as follows:
F(z,x)=z*x+b
where b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
Preferably, the trained convolutional neural network feature extraction layer is an Alexnet network.
Preferably, the dimensions of the feature map before and after the pooling operation in step S5 are the same.
Preferably, the specific steps of the step S8 are as follows:
S8.1, acquire the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtain the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, calculating the similarity between each predicted coordinate value and the target historical track;
S8.6, selecting the predicted coordinate point corresponding to the smallest S_j as the final output.
Preferably, the similarity between the j-th predicted coordinate value and the target historical track in step S8.5 is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
Preferably, the classification layer adopts a binary cross entropy function as a loss function during training.
Preferably, the regression layer uses the smooth L1 loss as its loss function during training.
Compared with the prior art, the invention has the beneficial effects that:
(1) By taking into account both the historical track information and the distance of the current prediction, the invention can better detect and locate the target when similar objects appear in the picture, improving target tracking accuracy.
(2) The method has a certain robustness when the tracked target is occluded.
Drawings
FIG. 1 is a diagram of a network architecture of an embodiment of the present invention;
FIG. 2 is a flow chart of the single target tracking of the present invention in combination with historical track information.
Detailed Description
The single-target tracking method combining the historical track information provided by the invention is further described below with reference to the accompanying drawings and the detailed description. Advantages and features of the invention will become more apparent from the following description and from the claims.
As shown in fig. 1 and 2, the present invention provides a video image single-target tracking method in combination with historical track information, which includes the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
The trained convolutional neural network feature extraction layer is an Alexnet network.
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
The cross-correlation operation is as follows:
F(z,x)=z*x+b
where b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
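The cross-correlation F(z, x) = z * x + b can be realized by treating the template feature map as a convolution kernel that slides over the search feature map. Below is a minimal PyTorch sketch of this reading; the scalar bias and the example tensor sizes (a 6×6 template map over a 22×22 search map, which yields a 17×17 response) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def xcorr(z, x, b=0.0):
    """Cross-correlation F(z, x) = z * x + b.

    z: template feature map, shape (1, C, Hz, Wz), used as the kernel.
    x: search feature map,   shape (1, C, Hx, Wx).
    Returns a response map of shape (1, 1, Hx - Hz + 1, Wx - Wz + 1).
    """
    return F.conv2d(x, z, stride=1, padding=0) + b

# Example sizes (assumed): a 6x6x128 template map over a 22x22x128 search map
z = torch.randn(1, 128, 6, 6)
x = torch.randn(1, 128, 22, 22)
print(xcorr(z, x).shape)  # torch.Size([1, 1, 17, 17])
```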
S5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image; the dimension of the feature graphs is consistent before and after the pooling operation;
S6, take the top N feature points by response value from the pooled classification-layer response map, compute the regression-layer output corresponding to these N feature points, and obtain N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video, record the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or later, proceed to step S8;
S8, from the N predicted coordinate values, find the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous M frame search images, and take it as the final predicted coordinate value of the target in the current frame search image, where M and N are both greater than or equal to 2.
The specific steps of step S8 are as follows:
S8.1, acquire the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtain the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, compute the similarity between each predicted coordinate value and the target's historical track; the similarity between the j-th predicted coordinate value and the target historical track is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
S8.6, select the predicted coordinate point corresponding to the smallest S_j as the final output.
Examples:
In a specific embodiment of the invention, Alexnet, a network widely used in image classification, is taken as the backbone to build a Siamese convolutional neural network comprising a feature extraction layer, a classification layer and a regression layer. The Siamese convolutional neural network model is trained using the public single-target tracking dataset ILSVRC together with 800 self-captured and annotated videos as training data. The key points of the model training process are as follows:
Key point 1: perform size normalization and data augmentation on the images in the video.
A target box (x_min, y_min, w, h) is obtained from the first frame of the video, where x_min and y_min are the coordinates of the upper-left corner of the ground-truth box and w and h are its width and height. Then, for each frame, a 127×127 patch centered on the center of the target box is cropped as the template image and a 255×255 patch is cropped as the search image. If the template or search patch extends beyond the original image, the missing part is filled with the per-channel mean of the RGB channels.
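As a hedged illustration of this cropping rule, the sketch below crops a square patch centered on the target box and fills any out-of-image region with the per-channel mean; the patent does not spell out a context margin, so the patch is taken directly at the requested output size, and the helper name is an assumption.

```python
import numpy as np

def crop_with_mean_pad(image, center_xy, out_size):
    """Crop an out_size x out_size patch centered at center_xy from an HxWx3 image;
    pixels outside the image are filled with the per-channel mean."""
    h, w, _ = image.shape
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    half = out_size // 2
    x0, y0 = cx - half, cy - half
    x1, y1 = x0 + out_size, y0 + out_size

    # start from a patch filled with the channel means, then paste the overlap
    patch = np.tile(image.reshape(-1, 3).mean(axis=0), (out_size, out_size, 1))
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, w), min(y1, h)
    if sx1 > sx0 and sy1 > sy0:
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
    return patch.astype(image.dtype)

# center = (x_min + w_box / 2, y_min + h_box / 2) from the first-frame target box
# template = crop_with_mean_pad(frame, center, 127)   # 127x127 template image
# search   = crop_with_mean_pad(frame, center, 255)   # 255x255 search image
```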
Data augmentation operations applied to the cropped images include rotation, added noise, color jitter and the like.
Key point 2: building the network model.
Referring to fig. 2, the network structure used in the present invention includes a feature extraction layer, a classification layer, and a regression layer.
The single-target tracking network has two identical feature extraction layers that share parameters, i.e. the network is divided into a search branch and a template branch. The template branch takes the template image as input, e.g. a 127×127×3 template image, where 127×127 is the input resolution and 3 is the number of channels (typically an RGB image). The search branch takes the search image as input, e.g. a 255×255×3 image.
The two branch networks of the feature extraction layer are both Alexnet-based convolutional neural networks with identical structures and parameters, comprising, connected in sequence, a first convolutional layer Conv1, a first pooling layer Pool1, a second pooling layer Pool2, a third convolutional layer Conv3, a fourth convolutional layer Conv4 and a fifth convolutional layer Conv5. The specific parameters are as follows: Conv1 has an 11×11 convolution kernel, stride 2 and 96 output channels; Pool1 has a 3×3 kernel, stride 2 and 96 output channels; Pool2 has a 3×3 kernel, stride 2 and 256 output channels; Conv3 and Conv4 both have 3×3 convolution kernels, stride 1 and 192 output channels; Conv5 has a 3×3 convolution kernel, stride 1 and 128 output channels.
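The layer list above does not name the convolution that normally sits between Pool1 and Pool2 in an Alexnet-style backbone (the 256-channel stage implied by Pool2's output channels), so the sketch below inserts an assumed 5×5 convolution there; the remaining kernel sizes, strides and channel counts follow the parameters listed above. It is a reading of the description, not the patent's exact network.

```python
import torch.nn as nn

class AlexnetBackbone(nn.Module):
    """Feature-extraction branch following the parameters listed above.
    The Conv2 stage (5x5, 256 channels) between Pool1 and Pool2 is an assumption."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),    # Conv1: 11x11, stride 2, 96 ch
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # Pool1: 3x3, stride 2
            nn.Conv2d(96, 256, kernel_size=5, stride=1),    # assumed Conv2: 5x5, 256 ch
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # Pool2: 3x3, stride 2
            nn.Conv2d(256, 192, kernel_size=3, stride=1),   # Conv3: 3x3, stride 1, 192 ch
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, kernel_size=3, stride=1),   # Conv4: 3x3, stride 1, 192 ch
            nn.ReLU(inplace=True),
            nn.Conv2d(192, 128, kernel_size=3, stride=1),   # Conv5: 3x3, stride 1, 128 ch
        )

    def forward(self, x):
        # a 127x127x3 template input yields a 6x6x128 feature map with these settings
        return self.features(x)
```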
In the classification layer, a convolution with a 3×3 kernel and 256 output channels is applied first, followed by a convolution with a 1×1 kernel and 128 output channels.
In the regression layer, a convolution with a 3×3 kernel and 256 output channels is likewise applied first, followed by a convolution with a 1×1 kernel and 128 output channels.
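A possible reading of these two heads, applied with the same structure to both the template branch and the search branch, is sketched below; the intermediate ReLU and the padding value (chosen so the head preserves the spatial size of the backbone output) are assumptions.

```python
import torch.nn as nn

def make_head(in_channels=128):
    """3x3 convolution (256 channels) followed by a 1x1 convolution (128 channels)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),  # padding=1 is assumed
        nn.ReLU(inplace=True),
        nn.Conv2d(256, 128, kernel_size=1),
    )

cls_head = make_head()  # classification layer, applied to both branches
reg_head = make_head()  # regression layer, same structure with separate weights
```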
The correlation operation of the classification branch is as follows: taking a 127×127×3 template input and a 255×255×3 search input as an example, a 6×6×128 template classification feature map and a 23×23×128 search classification feature map are obtained; the 6×6×128 map is then used as the convolution kernel and the 23×23×128 map as the input feature map, and a convolution with stride s=1 and pad=0 outputs a 17×17×1 classification-layer response map.
The correlation operation of the regression branch is as follows: with the same 127×127×3 template and 255×255×3 search inputs, a 6×6×128 template regression feature map and a 23×23×128 search regression feature map are obtained; the 6×6×128 map is used as the convolution kernel and the 23×23×128 map as the input feature map, and a convolution with stride s=1 and pad=0 outputs a 17×17×1 feature map. Finally a 1×1 convolution with 4 output channels produces a 17×17×4 regression-layer response map.
Key point 3: the loss function.
In the classification layer, the invention uses the binary cross-entropy function as the loss function. When assigning positive and negative samples, the sample points that fall inside the ground-truth target box when the classification map is mapped back to the original image are set as positive samples, and the others are set as negative samples.
The regression layer outputs a 17×17×4 feature map in which the regression values of each sample represent its distances to the target box. The regression loss uses the smooth L1 loss function.
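The text states only that the four regression channels represent distances to the target box. One common encoding, and the assumption used in the sketch below, is the distance from the sample point to the left, top, right and bottom sides of the box; the decoding helper is illustrative only.

```python
def decode_box(point_xy, reg_values):
    """Decode one 4-channel regression value into a box, assuming the channels are
    the distances (l, t, r, b) from the sample point to the four box sides."""
    px, py = point_xy
    l, t, r, b = reg_values
    x_min, y_min = px - l, py - t
    x_max, y_max = px + r, py + b
    center = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)  # predicted coordinate
    return (x_min, y_min, x_max, y_max), center
```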
The final loss is as follows:
loss = φ_cls + λ_2 φ_reg
where loss is the sum of the classification loss and the regression loss, and λ_2 is a hyper-parameter, set to 0.5, that controls the weight of the regression loss.
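A minimal PyTorch sketch of this combined loss follows, with λ_2 = 0.5 as stated above; restricting the smooth L1 term to positive samples is an assumption, since the text does not spell out the masking.

```python
import torch.nn.functional as F

def tracking_loss(cls_pred, cls_target, reg_pred, reg_target, pos_mask, lambda2=0.5):
    """loss = phi_cls + lambda2 * phi_reg.

    cls_pred:   raw classification scores, shape (B, 1, H, W)
    cls_target: 0/1 labels (1 = point falls inside the ground-truth box)
    reg_pred, reg_target: 4-channel regression maps, shape (B, 4, H, W)
    pos_mask:   boolean map of positive samples, shape (B, 1, H, W) (assumed)
    """
    phi_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_target.float())
    pos = pos_mask.expand_as(reg_pred)
    phi_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
    return phi_cls + lambda2 * phi_reg
```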
In this embodiment, after the feature extraction layer, the classification layer and the regression layer are established, step S5 of the video image single-target tracking method provided by the invention uses a 3×3 max-pooling layer.
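For step S5 the response map must keep its size through the 3×3 max pooling; with stride 1 and padding 1 this holds, as the short check below illustrates (the stride and padding values are assumptions consistent with that requirement).

```python
import torch
import torch.nn.functional as F

resp = torch.randn(1, 1, 17, 17)   # classification-layer response map
pooled = F.max_pool2d(resp, kernel_size=3, stride=1, padding=1)
print(pooled.shape)                # torch.Size([1, 1, 17, 17]), size unchanged
```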
If the currently processed frame is among the first 5 frames, the target position is computed from the maximum response point of the classification layer and the current predicted target position is recorded. When the processed frame index is greater than 5, the new target position is predicted in combination with the historical track information, as follows:
The 4 largest response points of the classification layer are taken and the regression-layer outputs corresponding to these four values are computed, giving four different predicted coordinates. From these four predicted coordinates, the one closest to the previous-frame tracking result and the historical track is computed and used as the final output.
In step S6, the top 4 feature points by response value are taken from the pooled classification-layer response map, and the coordinates in the regression-layer response map corresponding to these 4 feature points are computed, giving 4 predicted coordinate values of the target in the current frame search image;
in step S7, if the current frame is among the first 5 frames of the video, the predicted coordinate value corresponding to the maximum response value in the classification-layer response map is recorded as the final predicted coordinate value of the target in the current frame search image; if the current frame is the 5th frame or later, proceed to step S8;
in step S8, from the 4 predicted coordinate values, the one closest to the target's predicted coordinate in the previous frame search image and to the target's historical track over the previous 5 frame search images is found and used as the final predicted coordinate value of the target in the current frame search image.
The specific steps of step S8 are as follows:
S8.1, acquire the historical track coordinates of the target in the previous M frame search images,
{(x_5, y_5), (x_4, y_4), (x_3, y_3), (x_2, y_2), (x_1, y_1)}, where (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, compute the historical track direction information of the target, comprising the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame;
taking M equal to 5 as an example, specifically:
o_4 = (x_4 - x_5, y_4 - y_5)
o_3 = (x_3 - x_4, y_3 - y_4)
o_2 = (x_2 - x_3, y_2 - y_3)
o_1 = (x_1 - x_2, y_1 - y_2)
S8.3, the 4 predicted coordinate values (a_j, b_j), j = 1~4, are obtained;
S8.4, compute the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~4;
S8.5, compute the similarity between each predicted coordinate value and the target's historical track; the similarity between the j-th predicted coordinate value and the target historical track is computed as:
S_j = s_{j,1} + s_{j,2}
where s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter, typically set to 1.
S8.6, select the predicted coordinate point corresponding to the smallest S_j as the final output.
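The formulas defining s_{j,1} and s_{j,2} appear as images in the original publication and are not reproduced in this text, so the sketch below uses one plausible reading: s_{j,1} penalizes the magnitude of the deviation d_j from the previous position, and s_{j,2}, weighted by λ, penalizes the angular disagreement between d_j and the averaged historical direction. Only d_j, o_i, S_j = s_{j,1} + s_{j,2} and the minimum-S_j selection come from the text; everything else is an assumption.

```python
import math

def select_by_history(candidates, history, lam=1.0):
    """Pick the candidate most consistent with the previous position and the
    historical track (steps S8.1-S8.6); s_j,1 and s_j,2 below are assumed."""
    # S8.1: history = [(x_1, y_1), ..., (x_M, y_M)], most recent frame first
    x1, y1 = history[0]

    # S8.2: direction vectors o_i from the (i+1)-th to the i-th previous position
    dirs = [(history[i][0] - history[i + 1][0], history[i][1] - history[i + 1][1])
            for i in range(len(history) - 1)]
    mean_dir = (sum(d[0] for d in dirs) / len(dirs),
                sum(d[1] for d in dirs) / len(dirs)) if dirs else (0.0, 0.0)

    best, best_score = None, float("inf")
    for aj, bj in candidates:                    # S8.3: the N candidate coordinates
        dj = (aj - x1, bj - y1)                  # S8.4: deviation from the previous frame
        s1 = math.hypot(dj[0], dj[1])            # assumed: distance term
        s2 = _angle_between(dj, mean_dir)        # assumed: direction disagreement
        score = s1 + lam * s2                    # S8.5: combined score S_j
        if score < best_score:                   # S8.6: smallest S_j wins
            best, best_score = (aj, bj), score
    return best

def _angle_between(u, v):
    """Angle in radians between two 2-D vectors; 0 if either has zero length."""
    nu, nv = math.hypot(u[0], u[1]), math.hypot(v[0], v[1])
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cosang = max(-1.0, min(1.0, (u[0] * v[0] + u[1] * v[1]) / (nu * nv)))
    return math.acos(cosang)
```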
Although the invention has been described in terms of the preferred embodiments, it is not limited to them. Any person skilled in the art can make possible variations and modifications to the technical solution of the invention using the methods and technical content disclosed above without departing from the spirit and scope of the invention; therefore, any simple modifications, equivalent variations and modifications made to the above embodiments according to the technical substance of the invention fall within the protection scope of the technical solution of the invention.

Claims (6)

1. A video image single-target tracking method combining historical track information is characterized by comprising the following steps:
s1, acquiring a template image and a current frame search image;
S2, respectively sending the template image and the current frame search image into a trained convolutional neural network feature extraction layer to obtain a template image feature map and a search image feature map;
S3, sequentially sending the template image feature map and the search image feature map into a trained convolutional neural network classification layer and a trained regression layer to obtain a classification feature map and a regression feature map of the template image and a classification feature map and a regression feature map of the search image;
s4, performing cross-correlation operation on the classification characteristic image of the template image and the classification characteristic image of the search image to obtain a classification layer response image of the template image and the search image; performing cross-correlation operation on the regression feature map of the template image and the regression feature map of the search image to obtain a regression layer response map of the template image and the search image;
s5, carrying out maximum pooling operation on the classifying layer response graphs of the template image and the search image;
S6, taking out the top N feature points, ordered from high to low response value, from the pooled classification-layer response map, computing the regression-layer output corresponding to these N feature points, and obtaining N predicted coordinate values of the target in the current frame search image from the regression-layer output;
S7, if the current frame is among the first M frames of the video image, recording the predicted coordinate value corresponding to the maximum response value in the classification-layer response map as the final predicted coordinate value of the target in the current frame search image; if the current frame is the M-th frame or a later frame, proceeding to step S8;
S8, finding, among the N predicted coordinate values, the predicted coordinate value closest to the target's predicted coordinate in the previous frame search image and to the target's historical track in the previous M frame search images, and taking it as the final predicted coordinate value of the target in the current frame search image, wherein M and N are both greater than or equal to 2;
the specific steps of the step S8 are as follows:
S8.1, acquiring the historical track coordinates {(x_i, y_i), i = 1~M} of the target in the previous M frame search images, wherein (x_i, y_i) denotes the predicted coordinate value of the target in the i-th frame search image before the current frame;
S8.2, calculating the historical track direction information of the target, which comprises the direction information o_i from the (i+1)-th-frame target position to the i-th-frame target position before the current frame, i = 1~M;
S8.3, obtaining the N predicted coordinate values (a_j, b_j), j = 1~N;
S8.4, calculating the deviation between each predicted coordinate value and the predicted coordinate of the target in the previous frame search image:
d_j = (a_j - x_1, b_j - y_1), j = 1~N;
S8.5, calculating the similarity between each predicted coordinate value and the target historical track;
the similarity between the j-th predicted coordinate value and the target historical track is calculated as:
S_j = s_{j,1} + s_{j,2}
wherein s_{j,1} is the first component of s_j and s_{j,2} is the second component of s_j; λ is a weight parameter and is set to 1;
S8.6, selecting the predicted coordinate point corresponding to the smallest S_j as the final output.
2. The single-object tracking method in combination with historical track information according to claim 1, wherein the cross-correlation operation in step S4 is as follows:
F(z,x)=z*x+b
wherein b is a bias term, z is the classification-layer feature map or regression-layer feature map of the template image, x is the classification-layer feature map or regression-layer feature map of the search image, and F is the corresponding classification-layer or regression-layer response map of the template image and the search image.
3. The method for single-target tracking in combination with historical track information according to claim 1, wherein the trained convolutional neural network feature extraction layer is an Alexnet network.
4. The single-object tracking method according to claim 1, wherein the step S5 is performed with consistent dimensions of feature maps before and after the pooling operation.
5. The method for single-object tracking in combination with historical track information according to claim 1, wherein the classification layer uses a binary cross entropy function as a loss function during training.
6. The method of claim 1, wherein the regression layer uses the smooth L1 loss as a loss function during training.
CN202111221441.0A 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information Active CN114155273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111221441.0A CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111221441.0A CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Publications (2)

Publication Number Publication Date
CN114155273A CN114155273A (en) 2022-03-08
CN114155273B (en) 2024-06-04

Family

ID=80462833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111221441.0A Active CN114155273B (en) 2021-10-20 2021-10-20 Video image single-target tracking method combining historical track information

Country Status (1)

Country Link
CN (1) CN114155273B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187969B (en) * 2022-09-14 2022-12-09 河南工学院 Lead-acid battery recovery system and method based on visual identification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110675432A (en) * 2019-10-11 2020-01-10 智慧视通(杭州)科技发展有限公司 Multi-dimensional feature fusion-based video multi-target tracking method
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111860352A (en) * 2020-07-23 2020-10-30 上海高重信息科技有限公司 Multi-lens vehicle track full-tracking system and method
CN113506317A (en) * 2021-06-07 2021-10-15 北京百卓网络技术有限公司 Multi-target tracking method based on Mask R-CNN and apparent feature fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110675432A (en) * 2019-10-11 2020-01-10 智慧视通(杭州)科技发展有限公司 Multi-dimensional feature fusion-based video multi-target tracking method
CN111797716A (en) * 2020-06-16 2020-10-20 电子科技大学 Single target tracking method based on Siamese network
CN111860352A (en) * 2020-07-23 2020-10-30 上海高重信息科技有限公司 Multi-lens vehicle track full-tracking system and method
CN113506317A (en) * 2021-06-07 2021-10-15 北京百卓网络技术有限公司 Multi-target tracking method based on Mask R-CNN and apparent feature fusion

Also Published As

Publication number Publication date
CN114155273A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN108197587B (en) Method for performing multi-mode face recognition through face depth prediction
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN111161317A (en) Single-target tracking method based on multiple networks
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN110110755B (en) Pedestrian re-identification detection method and device based on PTGAN region difference and multiple branches
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN108470178B (en) Depth map significance detection method combined with depth credibility evaluation factor
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN110006444B (en) Anti-interference visual odometer construction method based on optimized Gaussian mixture model
CN112927264B (en) Unmanned aerial vehicle tracking shooting system and RGBD tracking method thereof
CN110992378B (en) Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN111882581B (en) Multi-target tracking method for depth feature association
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
CN114155273B (en) Video image single-target tracking method combining historical track information
CN111814705A (en) Pedestrian re-identification method based on batch blocking shielding network
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN113255549B (en) Intelligent recognition method and system for behavior state of wolf-swarm hunting
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112001954B (en) Underwater PCA-SIFT image matching method based on polar curve constraint
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
CN117133032A (en) Personnel identification and positioning method based on RGB-D image under face shielding condition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant