CN112883928A - Multi-target tracking algorithm based on deep neural network - Google Patents

Multi-target tracking algorithm based on deep neural network

Info

Publication number
CN112883928A
CN112883928A (application CN202110325552.XA)
Authority
CN
China
Prior art keywords
target
network
image
tracking
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110325552.XA
Other languages
Chinese (zh)
Inventor
邵叶秦
吕昌
唐宇亮
蒋雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University
Priority to CN202110325552.XA
Publication of CN112883928A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking algorithm based on a deep neural network. First, a single-target tracker built on a Siamese (twin) network is designed; the network measures the similarity of each identified target within the current frame image, yielding a predicted position in this frame for every target from the previous frame. Then, target detection is performed on the frame image with a deep convolutional neural network. Finally, the tracking predictions are associated with the detected pedestrian images, using cosine similarity and area overlap to improve tracking accuracy. The invention introduces a single-target tracker designed around a fully convolutional Siamese network, using the similarity measure of the Siamese network to find the most similar candidate for each target within the region to be searched. A matching algorithm then pairs each detected target with a tracking candidate. Finally, experiments demonstrate the feasibility of the approach and the experimental results are presented.

Description

Multi-target tracking algorithm based on deep neural network
Technical Field
The invention particularly relates to a multi-target tracking algorithm based on a deep neural network.
Background
To identify pedestrians reliably, complete pedestrian features must be extracted, so a tracking module is added to acquire a video sequence of each pedestrian. With the continuous development of computer vision, multi-target tracking algorithms are being applied in more and more scenarios. Multi-target tracking can be broadly divided into online tracking and offline tracking. Online tracking processes the video sequentially, frame by frame, whereas offline tracking first estimates the state of every target and then enforces consistency constraints on the overall state. Offline tracking can be summarized as: obtain the detection results of each frame image and associate them with the existing tracks to produce the multi-target trajectories.
The tracking problem is to identify an object in the first frame of a video and then track it in subsequent frames. Since targets are chosen arbitrarily, a tracker cannot be trained in advance for a specific target; a typical object tracker therefore learns an appearance model of the object online, as in TLD, Struck, KCF, MIL and MOSSE. Such tracking algorithms generally do not take target detection into account.
To address these problems, the invention provides a single-target tracker based on a fully convolutional Siamese network together with a target detector based on a deep convolutional neural network, with the aim of matching the detected targets against the tracking-prediction candidates by cosine similarity and related measures, thereby fusing target detection into the tracking process.
Disclosure of Invention
Purpose of the invention: to overcome the shortcomings of the prior art, the invention provides a multi-target tracking algorithm based on a deep neural network.
Technical scheme: a multi-target tracking algorithm based on a deep neural network, comprising the following operations. First, a single-target tracker based on a Siamese network is designed; the network measures the similarity of each identified target in the next frame image, yielding a predicted position in this frame for every target from the previous frame image. Then, target detection is performed on the frame image with a deep convolutional neural network. Finally, the tracking predictions are associated with the detected pedestrian images, with cosine similarity and area overlap used to improve tracking accuracy.
As an optimization: the Siamese network is a "conjoined" neural network whose two branches share weights, and it is used to measure the similarity of two inputs. The Siamese network maps the two inputs into a new space, forming a representation of each input there, and the similarity of the two inputs is evaluated through the loss computation.
Siamese-network-based single-target tracking finds the regions most similar to the template image z among the search regions A, B of the current frame image x (these search regions lie near the target's position in the previous frame image). The template image and the search regions are mapped into feature space by an embedding function φ, a feature mapping implemented with a neural network. After mapping, the input target image yields a 6 × 6 × 128 feature, and likewise the lower branch of the Siamese network yields a 22 × 22 × 128 feature. To localize within the search area, the 6 × 6 × 128 feature obtained from the upper branch is used as a convolution kernel over the 22 × 22 × 128 feature of the lower branch. The result is a 17 × 17 × 1 score map whose entries are the similarity scores between each position and the template. The algorithm compares the similarity between the search area and the target; the approach is similar to the correlation filtering idea, in that the Siamese network uses features as a convolution kernel and finds the position of maximum similarity in the convolution result.
the full convolution network is independent of the size of the candidate image, it will calculate the similarity of all the transformation sub-windows x and z, the algorithm uses convolution embedding function
Figure BDA0002994515510000023
And cross-correlation to obtain the result, the formula is as follows:
Figure BDA0002994515510000024
in the formula (1), b represents a value at each position;
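As an illustration of formula (1), the cross-correlation head can be sketched in a few lines of PyTorch. This is a minimal sketch, not the definitive implementation: the embedding network phi is assumed given, and the feature shapes in the comments are the 6 × 6 × 128 and 22 × 22 × 128 sizes quoted above.

    import torch
    import torch.nn.functional as F

    def score_map(phi, z, x, b=0.0):
        # Formula (1): f(z, x) = phi(z) * phi(x) + b, with * denoting
        # cross-correlation over all translated sub-windows.
        kz = phi(z)                  # template features, e.g. (1, 128, 6, 6)
        fx = phi(x)                  # search features,  e.g. (1, 128, 22, 22)
        # The template features act as a convolution kernel over the search
        # features; each output cell is a similarity score at that position.
        return F.conv2d(fx, kz) + b  # score map, e.g. (1, 1, 17, 17)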
The input image is resized to 127 × 127 and the candidate image to 255 × 255; with context margin p and a scale factor s chosen to keep the padded area constant, the image size satisfies:

s(w + 2p) × s(h + 2p) = A (2)

In formula (2), A = 127², p = (w + h)/4, and w, h are the width and height of the candidate box.
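As a worked example of formula (2), the scale factor s follows directly from the box size. The helper below is a sketch; the function name and the default A = 127² are illustrative assumptions.

    import math

    def exemplar_scale(w, h, A=127 ** 2):
        # Formula (2): choose s so that s(w + 2p) x s(h + 2p) = A,
        # with context margin p = (w + h) / 4 around the target box.
        p = (w + h) / 4.0
        s = math.sqrt(A / ((w + 2 * p) * (h + 2 * p)))
        return s, p

    # e.g. a 60 x 120 pedestrian box: s is the scale applied before
    # cropping the 127 x 127 exemplar image
    s, p = exemplar_scale(60, 120)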
When the network is trained, l(y, v) is the loss applied to the positive and negative samples:

l(y, v) = log(1 + exp(−yv)) (3)

In formula (3), v is the real-valued similarity score of a single template-candidate pair, and y takes only the values +1 or −1.
The label y at each score-map position u is defined as follows:

y[u] = +1 if k‖u − c‖ ≤ R, and −1 otherwise (4)

In formula (4), k is the stride of the network, c is the centre of the score map, and R is the radius around the centre within which positions are labelled positive.
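The label map of formula (4) can be precomputed once per score-map size. In the sketch below the default stride and radius are assumptions in the spirit of SiamFC, not values taken from the text.

    import torch

    def label_map(size=17, stride=8, radius=16):
        # Formula (4): y[u] = +1 if k * ||u - c|| <= R, else -1, where k is
        # the network stride and c the centre of the score map.
        c = (size - 1) / 2.0
        coords = torch.arange(size, dtype=torch.float32)
        ys, xs = coords.view(-1, 1), coords.view(1, -1)   # broadcast grid
        dist = stride * torch.sqrt((xs - c) ** 2 + (ys - c) ** 2)
        labels = torch.full((size, size), -1.0)
        labels[dist <= radius] = 1.0
        return labels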
During training, the network convolution is applied to pairs consisting of an example image and a larger candidate image, and the loss of the score map is defined as the mean of the individual losses:

L(y, v) = (1/|D|) Σu∈D l(y[u], v[u]) (5)

In formula (5), D ⊂ ℤ² is the finite grid of score-map positions.
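Formulas (3) and (5) transcribe directly; in this sketch v is the raw score map produced by the network and y is the label map above.

    import torch.nn.functional as F

    def score_loss(v, y):
        # Formula (3) per position: l(y, v) = log(1 + exp(-y v)), written as
        # softplus(-y v) for numerical stability; formula (5) then takes the
        # mean over the finite grid D of score-map positions.
        return F.softplus(-y * v).mean()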
When the fully convolutional Siamese network is trained, each input pair produces a similarity score map. The maximum score on the map is found, and the corresponding position on the original image is taken as the final predicted position. Using linear interpolation, the 17 × 17 score map is expanded to 272 × 272, and the maximally responding point of the original score map is mapped onto the 272 × 272 map as the target position:

target displacement = (maximum-score position − score-map centre) × grid block size (6)
Next, over the positions of the score map, the convolution parameters are obtained by stochastic gradient descent (SGD):

arg minθ E L(y, f(z, x; θ)) (7)

In formula (7), θ denotes the network parameters.
The detection network yields m detected targets {1, 2, …, m} and the tracking network yields n tracking-prediction candidates {1, 2, …, n}; a matching algorithm pairs the detected targets with the tracking candidates. First, an area-overlap method compares a detection box d(x1, y1, x2, y2) with a tracking box s(x1′, y1′, x2′, y2′):

xmin = max(x1, x1′), xmax = min(x2, x2′) (8)
ymin = max(y1, y1′), ymax = min(y2, y2′) (9)
w = xmax − xmin (10)
h = ymax − ymin (11)
s = w × h (12)

In formula (12), w is the width of the overlapping region, h is its height, and s is its area.
If the area overlap ratio is inconclusive, the cosine matching method is applied for verification:

cos θ = Σi xi·yi / (√(Σi xi²) × √(Σi yi²)) (13)

In formula (13), xi and yi (i = 1, 2, …, n) are the components of the two feature vectors being compared.
Beneficial effects: the invention introduces a single-target tracker designed around a fully convolutional Siamese network, which mainly uses the similarity measure of the Siamese network to find the most similar target within the region to be tracked. A matching algorithm then pairs the detected targets with the tracking candidates. Finally, experiments demonstrate the feasibility of the approach and the experimental results are presented.
Drawings
FIG. 1 is a framework diagram of the Siamese-concept-based multi-target tracking of the present invention;
FIG. 2 is a schematic diagram of the fully-convolutional-Siamese-network-based tracker structure of the present invention;
FIG. 3 is a sample schematic of the algorithm data of the present invention;
FIG. 4 is a score map visualization image of the present invention;
FIG. 5 is a schematic representation of the results of the single person tracking of the present invention;
FIG. 6 is a diagram illustrating the result of multi-person tracking according to the present invention;
FIG. 7 is a schematic diagram of the accuracy of the algorithm of the present invention;
FIG. 8 is a graphical representation of the efficiency of the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below so that those skilled in the art can better understand the advantages and features of the present invention, and thus the scope of the present invention will be more clearly defined. The embodiments described herein are only a few embodiments of the present invention, rather than all embodiments, and all other embodiments that can be derived by one of ordinary skill in the art without inventive faculty based on the embodiments described herein are intended to fall within the scope of the present invention.
Examples
1. Problem definition and system framework
Following this tracking idea, the invention designs a deep convolutional neural network to track targets. First, a single-target tracker based on a Siamese network is designed; the network measures the similarity of each identified target in the next frame image, yielding a predicted position in this frame for every target from the previous frame image. Then, target detection is performed on the frame image with a deep convolutional neural network. Finally, the tracking predictions are associated with the detected pedestrian images, with cosine similarity and area overlap used to improve tracking accuracy. FIG. 1 is a framework diagram of the multi-target tracking.
As the figure shows, the system consists of a Siamese-network-based tracker and a deep-convolutional-neural-network-based detector; their outputs are matched by a matching algorithm, thereby tracking the motion trajectories of pedestrians in a natural walking state.
2. Tracking algorithm based on a fully convolutional Siamese network
The Siamese network (twin network) is a "conjoined" neural network whose branches share weights. The Siamese network measures the similarity of two inputs: it maps the two inputs into new spaces, forming a representation of each input there, and the similarity of the two inputs is evaluated through the loss computation. FIG. 2 shows the tracking structure based on the fully convolutional Siamese network.
Siamese-network-based single-target tracking finds the regions most similar to the template image z among the search regions A, B of the current frame image x (these search regions lie near the target's position in the previous frame image). The template image and the search regions are mapped into feature space by an embedding function φ, a feature mapping implemented with a neural network. After mapping, the input target image yields a 6 × 6 × 128 feature, and likewise the lower branch of the Siamese network yields a 22 × 22 × 128 feature. To localize within the search area, the 6 × 6 × 128 feature obtained from the upper branch is used as a convolution kernel over the 22 × 22 × 128 feature of the lower branch. The result is a 17 × 17 × 1 score map whose entries are the similarity scores between each position and the template. The algorithm compares the similarity between the search area and the target; the approach is similar to the correlation filtering idea, in that the Siamese network uses features as a convolution kernel and finds the position of maximum similarity in the convolution result.
the full convolution network is independent of the size of the candidate image and it will compute the similarity of all the transformation sub-windows x and z. The algorithm uses a convolution embedding function
Figure BDA0002994515510000063
And cross-correlation are combined to obtain the result, and the formula is as follows.
Figure BDA0002994515510000064
In the formula (1), b represents a value at each position.
The input image is resized to 127 × 127 and the candidate image to 255 × 255; with context margin p and a scale factor s chosen to keep the padded area constant, the image size satisfies:

s(w + 2p) × s(h + 2p) = A (2)

In formula (2), A = 127², p = (w + h)/4, and w, h are the width and height of the candidate box.
When the network is trained, l(y, v) is the loss applied to the positive and negative samples:

l(y, v) = log(1 + exp(−yv)) (3)

In formula (3), v is the real-valued similarity score of a single template-candidate pair, and y takes only the values +1 or −1.
The label y at each score-map position u is defined as follows:

y[u] = +1 if k‖u − c‖ ≤ R, and −1 otherwise (4)

In formula (4), k is the stride of the network, c is the centre of the score map, and R is the radius around the centre within which positions are labelled positive.
During training, the network convolution is applied to pairs consisting of an example image and a larger candidate image, and the loss of the score map is defined as the mean of the individual losses:

L(y, v) = (1/|D|) Σu∈D l(y[u], v[u]) (5)

In formula (5), D ⊂ ℤ² is the finite grid of score-map positions.
When the fully convolutional Siamese network is trained, each input pair produces a similarity score map. The maximum score on the map is found, and the corresponding position on the original image is taken as the final predicted position. Using linear interpolation, the 17 × 17 score map is expanded to 272 × 272, and the maximally responding point of the original score map is mapped onto the 272 × 272 map as the target position:

target displacement = (maximum-score position − score-map centre) × grid block size (6)
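The localization rule of formula (6) amounts to upsampling the score map and reading off the argmax displacement. The sketch below assumes bilinear interpolation and an upscale factor of 16 (272 = 17 × 16); the stride default is an assumption.

    import torch
    import torch.nn.functional as F

    def locate_target(score, upscale=16, stride=8):
        # score: (1, 1, 17, 17) map, upsampled to (1, 1, 272, 272).
        up = F.interpolate(score, scale_factor=upscale, mode="bilinear",
                           align_corners=False)
        idx = int(torch.argmax(up.flatten()))
        row, col = divmod(idx, up.shape[-1])
        centre = (up.shape[-1] - 1) / 2.0
        # Formula (6): displacement from the score-map centre times the
        # grid-block size (stride / upscale input pixels per upsampled cell).
        cell = stride / upscale
        return (row - centre) * cell, (col - centre) * cell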
Next, over the positions of the score map, the convolution parameters are obtained by stochastic gradient descent (SGD):

arg minθ E L(y, f(z, x; θ)) (7)

In formula (7), θ denotes the network parameters.
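Formula (7) is the standard SGD objective. Below is a minimal training step, assuming a model(z, x) that returns the score map and reusing the hypothetical score_loss and label_map sketches given earlier:

    import torch

    def train_step(model, optimizer, z, x, y):
        # Formula (7): minimize E[L(y, f(z, x; theta))] over theta by SGD.
        v = model(z, x)              # predicted score map f(z, x; theta)
        loss = score_loss(v, y)      # mean logistic loss, formula (5)
        optimizer.zero_grad()
        loss.backward()              # stochastic gradient of the expected loss
        optimizer.step()
        return loss.item()

    # e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)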
The detection network yields m detected targets {1, 2, …, m} and the tracking network yields n tracking-prediction candidates {1, 2, …, n}; a matching algorithm pairs the detected targets with the tracking candidates. First, an area-overlap method compares a detection box d(x1, y1, x2, y2) with a tracking box s(x1′, y1′, x2′, y2′):

xmin = max(x1, x1′), xmax = min(x2, x2′) (8)
ymin = max(y1, y1′), ymax = min(y2, y2′) (9)
w = xmax − xmin (10)
h = ymax − ymin (11)
s = w × h (12)

In formula (12), w is the width of the overlapping region, h is its height, and s is its area.
If the area overlap ratio is inconclusive, the cosine matching method is applied for verification:

cos θ = Σi xi·yi / (√(Σi xi²) × √(Σi yi²)) (13)

In formula (13), xi and yi (i = 1, 2, …, n) are the components of the two feature vectors being compared.
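The matching stage of formulas (8) to (13) reduces to an intersection area plus a cosine check. The sketch below is self-contained; thresholding and the assignment of detections to tracks are left to the caller.

    import math

    def overlap_area(d, s):
        # Formulas (8)-(12): intersection of detection box d = (x1, y1, x2, y2)
        # and tracking box s = (x1', y1', x2', y2').
        xmin, xmax = max(d[0], s[0]), min(d[2], s[2])
        ymin, ymax = max(d[1], s[1]), min(d[3], s[3])
        w, h = max(0.0, xmax - xmin), max(0.0, ymax - ymin)
        return w * h

    def cosine_similarity(x, y):
        # Formula (13): cosine of the angle between two feature vectors.
        num = sum(a * b for a, b in zip(x, y))
        den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return num / den if den else 0.0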
3. Results and analysis of the experiments
3.1 Experimental data
In this experiment, the model was trained on data in the ILSVRC2015 dataset format. ILSVRC contains three main folders, ImageSets, Data and Annotations: ImageSets holds the descriptions of the dataset splits, Data stores all the data, including pictures and video clips, and the annotations in Annotations correspond to the pictures in Data. FIG. 3 shows training samples for the algorithm model.
3.2 Experimental results
The input image is resized to 127 × 127 and the candidate image to 255 × 255 as inputs to the feature-extraction network. The first layer is a convolutional layer: passing the template image and the search-area image through 96 convolution kernels of size 11 × 11 × 3 yields 59 × 59 and 123 × 123 feature maps. A pooling layer follows, downsampling the first-layer feature maps with 3 × 3 max-pooling to obtain 29 × 29 and 61 × 61 feature maps. The third layer convolves with 256 kernels of size 5 × 5 × 48, giving 25 × 25 and 57 × 57 feature maps, which a 3 × 3 pooling layer reduces to 12 × 12 and 28 × 28. After the fourth, fifth and sixth convolutional layers, a 6 × 6 template feature map and a 22 × 22 search-area feature map are finally obtained. To reduce the risk of overfitting, a ReLU nonlinear activation layer follows each convolutional layer. The 6 × 6 template feature map is then used as a convolution kernel over the 22 × 22 search-area feature map to obtain the score matrix; FIG. 4 visualizes this score matrix, where the bright spot marks the tracking prediction, and the large and small images are the original image and the target image respectively. Finally, the score map is upsampled from 17 × 17 to 272 × 272 for target localization.
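The layer-by-layer sizes above correspond to an AlexNet-style backbone. The sketch below reproduces those sizes; the grouped 5 × 5 × 48 convolution and the 128-channel output follow the text, while the intermediate channel widths are assumptions.

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        # Embedding phi; the comments give exemplar / search feature sizes.
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, 11, stride=2), nn.ReLU(),   # 127->59, 255->123
                nn.MaxPool2d(3, stride=2),                   # 59->29,  123->61
                nn.Conv2d(96, 256, 5, groups=2), nn.ReLU(),  # 29->25,  61->57
                nn.MaxPool2d(3, stride=2),                   # 25->12,  57->28
                nn.Conv2d(256, 384, 3), nn.ReLU(),           # 12->10,  28->26
                nn.Conv2d(384, 384, 3), nn.ReLU(),           # 10->8,   26->24
                nn.Conv2d(384, 128, 3),                      # 8->6,    24->22
            )

        def forward(self, x):
            return self.features(x)

    # One shared network embeds both inputs (the weight sharing that makes
    # the network "Siamese"):
    net = Backbone()
    z_feat = net(torch.zeros(1, 3, 127, 127))   # (1, 128, 6, 6)
    x_feat = net(torch.zeros(1, 3, 255, 255))   # (1, 128, 22, 22)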
In this experiment, single-person and multi-person scenarios were tested. The processing speed on the GPU server is 86 frames per second; FIGS. 5 and 6 show the tracking results.
Experiments show that the per-frame processing speed meets the real-time requirement and that accuracy reaches more than 95%. To further highlight the superiority of the algorithm, its accuracy and efficiency are compared with the BOOSTING, MIL, KCF, TLD and MOSSE algorithms in FIG. 7 and FIG. 8.
The fully convolutional Siamese network mainly addresses the similarity problem, which is solved by training the Siamese network to perform similarity evaluation. Experiments show that the network not only performs well on the ILSVRC2015 video dataset but also achieves real-time tracking in practical applications.

Claims (3)

1. A multi-target tracking algorithm based on a deep neural network, characterized in that it comprises the following operations: first, a single-target tracker based on a Siamese network is designed, and the network measures the similarity of each identified target in the next frame image, yielding a predicted position in this frame for every target from the previous frame image; then, target detection is performed on the frame image with a deep convolutional neural network; finally, the obtained tracking predictions are associated with the detected pedestrian images, with cosine similarity and area overlap used to improve the tracking accuracy.
2. The deep-neural-network-based multi-target tracking algorithm of claim 1, characterized in that: a Siamese network predicts, from each tracked target in the previous frame image, the position of that target in the current frame image, and these predictions are compared with the multiple target-detection results of the current frame to find the most similar target for each tracked target, thereby realizing multi-target tracking.
3. The deep-neural-network-based multi-target tracking algorithm of claim 1, characterized in that: the Siamese network is a "conjoined" neural network whose two branches share weights, and it is used to measure the similarity of two inputs; the Siamese network maps the two inputs into a new space to obtain a representation of each input there, and the similarity of the two inputs is evaluated through the loss computation;
Siamese-network-based single-target tracking finds the regions most similar to the template image z among the search regions A, B of the current frame image x (these search regions lie near the target's position in the previous frame image); the template image and the search regions are mapped into feature space by an embedding function φ, a feature mapping implemented with a neural network; after mapping, the input target image yields a 6 × 6 × 128 feature, and likewise the lower branch of the Siamese network yields a 22 × 22 × 128 feature; to localize within the search area, the 6 × 6 × 128 feature obtained from the upper branch is used as a convolution kernel over the 22 × 22 × 128 feature of the lower branch; the result is a 17 × 17 × 1 score map whose entries are the similarity scores between each position and the template; the algorithm compares the similarity between the search area and the target; this class of methods is similar to the correlation filtering idea, in that the Siamese network uses features as a convolution kernel and finds the position of maximum similarity in the convolution result;
the fully convolutional network is independent of the candidate image size and computes the similarity of all search windows of x against z; the algorithm combines the convolutional embedding function φ with cross-correlation:

f(z, x) = φ(z) * φ(x) + b·1 (1)

in formula (1), b·1 denotes a bias term taking the value b at every position;
the input image is resized to 127 × 127 and the candidate image to 255 × 255; with context margin p and a scale factor s keeping the padded area constant, the image size satisfies:

s(w + 2p) × s(h + 2p) = A (2)

in formula (2), A = 127², p = (w + h)/4, and w, h are the width and height of the candidate box;
when the network is trained, l(y, v) is the loss applied to the positive and negative samples:

l(y, v) = log(1 + exp(−yv)) (3)

in formula (3), v is the real-valued similarity score of a single template-candidate pair, and y takes only the values +1 or −1;
the label y at each score-map position u is defined as follows:

y[u] = +1 if k‖u − c‖ ≤ R, and −1 otherwise (4)

in formula (4), k is the stride of the network, c is the centre of the score map, and R is the radius around the centre within which positions are labelled positive;
during training, the network convolution is applied to pairs consisting of an example image and a larger candidate image, and the loss of the score map is defined as the mean of the individual losses:

L(y, v) = (1/|D|) Σu∈D l(y[u], v[u]) (5)

in formula (5), D ⊂ ℤ² is the finite grid of score-map positions;
when the fully convolutional Siamese network is trained, each input pair produces a similarity score map; the maximum score on the map is found, and the corresponding position on the original image is taken as the final predicted position; using linear interpolation, the 17 × 17 score map is expanded to 272 × 272, and the maximally responding point of the original score map is mapped onto the 272 × 272 map as the target position:

target displacement = (maximum-score position − score-map centre) × grid block size (6)
next, over the positions of the score map, the convolution parameters are obtained by stochastic gradient descent (SGD):

arg minθ E L(y, f(z, x; θ)) (7)

in formula (7), θ denotes the network parameters;
m detected targets {1, 2, …, m} are obtained through the detection network, n tracking-prediction candidates {1, 2, …, n} are obtained through the tracking network, and a matching algorithm pairs the detected targets with the tracking candidates; first, an area-overlap method compares a detection box d(x1, y1, x2, y2) with a tracking box s(x1′, y1′, x2′, y2′):

xmin = max(x1, x1′), xmax = min(x2, x2′) (8)
ymin = max(y1, y1′), ymax = min(y2, y2′) (9)
w = xmax − xmin (10)
h = ymax − ymin (11)
s = w × h (12)

in formula (12), w is the width of the overlapping region, h is its height, and s is its area;
if the area overlap ratio is inconclusive, the cosine matching method is applied for verification:

cos θ = Σi xi·yi / (√(Σi xi²) × √(Σi yi²)) (13)

in formula (13), xi and yi (i = 1, 2, …, n) are the components of the two feature vectors being compared.
CN202110325552.XA 2021-03-26 2021-03-26 Multi-target tracking algorithm based on deep neural network Pending CN112883928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110325552.XA CN112883928A (en) 2021-03-26 2021-03-26 Multi-target tracking algorithm based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110325552.XA CN112883928A (en) 2021-03-26 2021-03-26 Multi-target tracking algorithm based on deep neural network

Publications (1)

Publication Number Publication Date
CN112883928A (en) 2021-06-01

Family

ID=76042435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110325552.XA Pending CN112883928A (en) 2021-03-26 2021-03-26 Multi-target tracking algorithm based on deep neural network

Country Status (1)

Country Link
CN (1) CN112883928A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240996A (en) * 2021-11-16 2022-03-25 灵译脑科技(上海)有限公司 Multi-target tracking method based on target motion prediction
CN114862914A (en) * 2022-05-26 2022-08-05 淮阴工学院 Pedestrian tracking method based on detection and tracking integration
CN116188804A (en) * 2023-04-25 2023-05-30 山东大学 Twin network target search system based on transformer
CN118397068A (en) * 2024-07-01 2024-07-26 杭州师范大学 Monocular depth estimation method based on evolutionary neural network architecture search

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014013975A (en) * 2012-07-03 2014-01-23 Sharp Corp Image decoding device, data structure of encoded data, and image encoding device
CN111105436A (en) * 2018-10-26 2020-05-05 曜科智能科技(上海)有限公司 Target tracking method, computer device, and storage medium
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN112184752A (en) * 2020-09-08 2021-01-05 北京工业大学 Video target tracking method based on pyramid convolution

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014013975A (en) * 2012-07-03 2014-01-23 Sharp Corp Image decoding device, data structure of encoded data, and image encoding device
CN111105436A (en) * 2018-10-26 2020-05-05 曜科智能科技(上海)有限公司 Target tracking method, computer device, and storage medium
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN112184752A (en) * 2020-09-08 2021-01-05 北京工业大学 Video target tracking method based on pyramid convolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴凌燕: "Pedestrian Detection and Pedestrian Recognition System Based on Deep Convolutional Neural Networks", China Master's Theses Full-text Database, Information Science and Technology, pages 23-26 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240996A (en) * 2021-11-16 2022-03-25 灵译脑科技(上海)有限公司 Multi-target tracking method based on target motion prediction
CN114240996B (en) * 2021-11-16 2024-05-07 灵译脑科技(上海)有限公司 Multi-target tracking method based on target motion prediction
CN114862914A (en) * 2022-05-26 2022-08-05 淮阴工学院 Pedestrian tracking method based on detection and tracking integration
CN116188804A (en) * 2023-04-25 2023-05-30 山东大学 Twin network target search system based on transformer
CN118397068A (en) * 2024-07-01 2024-07-26 杭州师范大学 Monocular depth estimation method based on evolutionary neural network architecture search

Similar Documents

Publication Publication Date Title
CN112883928A (en) Multi-target tracking algorithm based on deep neural network
Luo et al. Thermal infrared and visible sequences fusion tracking based on a hybrid tracking framework with adaptive weighting scheme
CN111476817A (en) Multi-target pedestrian detection tracking method based on yolov3
CN111161317A (en) Single-target tracking method based on multiple networks
CN111292355A (en) Nuclear correlation filtering multi-target tracking method fusing motion information
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN113240716B (en) Twin network target tracking method and system with multi-feature fusion
Xu et al. Hierarchical convolution fusion-based adaptive Siamese network for infrared target tracking
Li et al. Model-based temporal object verification using video
CN116402858B (en) Transformer-based space-time information fusion infrared target tracking method
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN113129345A (en) Target tracking method based on multi-feature map fusion and multi-scale expansion convolution
Kuai et al. Masked and dynamic Siamese network for robust visual tracking
CN113298850A (en) Target tracking method and system based on attention mechanism and feature fusion
CN116381672A (en) X-band multi-expansion target self-adaptive tracking method based on twin network radar
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN114707604A (en) Twin network tracking system and method based on space-time attention mechanism
Gao et al. Occluded person re-identification based on feature fusion and sparse reconstruction
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
Gong et al. Research on an improved KCF target tracking algorithm based on CNN feature extraction
CN116309715A (en) Multi-target detection and tracking method, device, computer equipment and storage medium
Tan et al. Online visual tracking via background-aware Siamese networks
Dai et al. Long-term object tracking based on siamese network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination