CN111275736A

CN111275736A - Unmanned aerial vehicle video multi-target tracking method based on target scene consistency

Info

Publication number: CN111275736A
Application number: CN202010015437.8A
Authority: CN
Inventors: 李国荣; 黄庆明; 苏荔; 于洪洋
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-06-12

Abstract

The invention relates to the technical field of computer vision auxiliary devices, in particular to an unmanned aerial vehicle video multi-target tracking method based on target scene consistency, which can measure the similarity between targets by utilizing the consistency of scene and target appearance and improve the tracking performance of an unmanned aerial vehicle video; the method comprises the following steps: (1) calculating object-to-scene consistency using a ResNet-based twin Network (Siamese Network); (2) calculating the similarity of the appearance of the object and the object by using a ResNet-based twin Network (Simese Network); (3) constructing a branch network, wherein the network comprises a convolution layer and two full-connection layers, the input of the network is the output characteristic of the fifth convolution layer of the two twin networks, and the estimation of the offset of the detection result is output; the three networks are fused by a multi-task learning method so as to be mutually promoted.

Description

Unmanned aerial vehicle video multi-target tracking method based on target scene consistency

Technical Field

The invention relates to the technical field of computer vision auxiliary devices, in particular to an unmanned aerial vehicle video multi-target tracking method based on target scene consistency.

Background

Multi-target tracking (MOT) is a key step of many video analysis tasks, such as video event separation, behavioral understanding. MOT aims to track objects appearing in a video, giving the position of each object in each frame. The existing MOT method can be divided into two types according to the mode of utilizing the target detection result: off-line tracking and on-line tracking. The target detection result on the whole video is considered when the detection result is associated by offline tracking; and the online tracking considers the detection result on the current frame and the obtained motion trail of each object.

Existing methods generally use multiple cues (e.g., appearance, motion) to track objects to comprehensively measure the similarity between objects in adjacent frames. However, in the video of the unmanned aerial vehicle, the target size is small, so that the discrimination of self information, especially apparent information, is not strong, and the tracking performance of the existing multi-target tracking method on the video of the unmanned aerial vehicle is poor.

Disclosure of Invention

In order to solve the technical problems, the invention provides the unmanned aerial vehicle video multi-target tracking method based on the consistency of the target scene, which can comprehensively measure the probability that two targets are the same object by utilizing the consistency of the target and the scene and the apparent similarity of the target and the target, and improve the multi-target tracking precision in the unmanned aerial vehicle.

The invention discloses an unmanned aerial vehicle video multi-target tracking method based on target scene consistency, which comprises the following steps:

(1) calculating object-to-scene consistency using a ResNet-based twin Network (Siamese Network); the first five layers of the network are convolution layers, and then a full connection layer and soft max are connected, so that the confidence coefficient of the consistency of an object and a scene is output;

(2) calculating the similarity of the appearance of the object and the object by using a ResNet-based twin Network (Simese Network); the first five layers of the network are convolution layers, the full connection layer and soft max are connected behind the convolution layers, the output is the apparent similarity of the objects, and the larger the similarity is, the larger the probability that the two objects are the same object is;

(3) constructing a branch network, wherein the network comprises a convolution layer and two full-connection layers, the input of the network is the output characteristic of the fifth convolution layer of the two twin networks, and the estimation of the offset of the detection result is output;

the three networks are fused by a multi-task learning method so as to be mutually promoted.

The invention relates to an unmanned aerial vehicle video multi-target tracking method based on target scene consistency, which further comprises the step of initializing parameters of the first five convolutional layers of a twin network by using the parameters of the first five convolutional layers of RetNet50 trained on ImageNet, wherein the parameters of the fully-connected layers and the parameters of the rest convolutional layers are initialized in a random mode.

The invention discloses an unmanned aerial vehicle video multi-target tracking method based on target scene consistency, which further comprises the step of obtaining an object position with deviation as a detection result by adding disturbance to the real position of the object by utilizing a marked video sequence. And constructing a training data set by using the detection results, and training the whole network.

Compared with the prior art, the invention has the beneficial effects that: during multi-target tracking, the similarity between targets is utilized, and the consistency between scenes and the targets is also utilized, so that the method can deal with the situation that the targets in the video of the unmanned aerial vehicle are small and the apparent distinguishability is weak, thereby realizing more accurate target association and improving the tracking precision; and a multi-task learning framework is designed, so that a plurality of related tasks are mutually promoted, the target detection precision is improved, and the tracking accuracy is further improved.

Drawings

FIG. 1 is a schematic structural view of the present invention;

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Example (b):

the main network of each branch network of the twin is based on the first five convolutional layers of RestNet50, wherein the characteristic dimension of the output of the first four convolutional layers is 14 × 1024, and 2048-dimensional characteristics are obtained after the fifth convolutional layer. By f₁,f₂,f₃,f₄Features of the outputs of the first five convolutional layers of the four-branch network are shown. The first two networks are taken into two 2048-dimensional features (f)₁,f₂) Stitched together, input fully-connected layers to get confidence of apparent-scene consistency

The loss function here is defined as:

wherein

A true value representing the object-scene consistency relationship,

indicating that the target is consistent with the scene;

indicating that the object is not consistent with the scene.

The characteristics f obtained for the last two networks₃,f₄Inputting the full connection layer and the soft-max layer to obtain the apparent similarity measurement between the target and the target

The loss function here is defined as:

wherein

A true value representing the object-scene consistency relationship,

indicating that the appearance of two objects is similar, i.e. the same object;

indicating that the objects being compared are not the same object.

Four 2048 dimensional features (f) from four networks₁,f₂,f₃,f₄) Are arranged side by sideCombining them together to form 2X 2048D features, inputting convolution layer and two full-connection layers to obtain the estimation of offset of detection result

The loss function here is defined as:

here, the offset △ p between the detection result and the true position of the object is (△ x, △ y),

during training, the parameters of the first five convolutional layers are initialized by using the parameters of RestNet50 pre-trained on ImageNET, and the rest parameters are all initialized randomly;

constructing a training data set: adding disturbance on the basis of the real position of the target to obtain a biased result, using the biased result as a detection result of the target, and constructing a training data set; by minimizing the following loss function, the optimal network parameters are obtained.

Wherein ω is all parameters of the network, | ω | | non-calculation₂Representing the L2 norm of ω.

Designing a target-scene consistency measurement network based on a twin network, wherein the network comprises two networks based on RestNet, and the two networks share the parameters of a convolutional layer and are used for extracting and calculating the consistency of the scene of the object of the t frame and the object of the t +1 frame;

constructing twin networks for calculating the apparent similarity of the target by using two RestNet-based networks, calculating the apparent similarity of the object of the t frame and the object of the t +1 frame, wherein the two networks also share the convolutional layer parameters;

meanwhile, another detection result offset estimation network is introduced for adjusting the target detection result. The network is composed of a convolutional layer and two fully-connected layers, and the input of the network is a feature matrix extracted by the two twin networks. By introducing the branch network, partial parameters of the three task networks of consistency estimation of a target scene, apparent similarity estimation of a target and target detection result offset estimation can be shared, and mutual promotion can be realized.

According to the unmanned aerial vehicle video multi-target tracking method based on the target scene consistency, the installation mode, the connection mode or the setting mode of all the components are common mechanical modes, and the specific structures, models and coefficient indexes of all the components are self-contained technologies, so that the beneficial effects can be achieved, and further description is omitted.

The invention relates to a method for unmanned aerial vehicle video multi-target tracking based on target scene consistency, wherein orientation words such as 'up, down, left, right, front, back, inside, outside, vertical and horizontal' contained in a term only represent the orientation of the term in a conventional use state or are common names understood by a person skilled in the art without being limited to the contrary, meanwhile, the numerical terms such as 'first', 'second' and 'third' do not represent specific numbers and sequences and are only used for name differentiation, and the terms 'comprise', 'contain' or any other variants thereof are meant to cover non-exclusive inclusions, so that a process, method, article or equipment comprising a series of elements not only comprises the elements, but also comprises other elements which are not explicitly listed, or also comprises the elements which are not listed, the process, the method, the article or the equipment, Method, article, or apparatus.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. An unmanned aerial vehicle video multi-target tracking method based on target scene consistency is characterized by comprising the following steps:

2. The unmanned aerial vehicle video multi-target tracking method based on target scene consistency of claim 1, further comprising initializing parameters of the first five convolutional layers of the twin network with parameters of the first five convolutional layers of RetNet50 trained on ImageNet, wherein the parameters of the fully-connected layers and the remaining convolutional layers are initialized using a random manner.

3. The unmanned aerial vehicle video multi-target tracking method based on target scene consistency of claim 2, further comprising obtaining a position of an object with a deviation as a detection result by adding disturbance to a real position of the object by using a labeled video sequence. And constructing a training data set by using the detection results, and training the whole network.