CN114581485A - Target tracking method based on language modeling pattern twin network


Info

Publication number
CN114581485A
CN114581485A CN202210199168.4A CN202210199168A CN114581485A CN 114581485 A CN114581485 A CN 114581485A CN 202210199168 A CN202210199168 A CN 202210199168A CN 114581485 A CN114581485 A CN 114581485A
Authority
CN
China
Prior art keywords
target
network
frame
tracked
twin
Prior art date
Legal status: Pending
Application number
CN202210199168.4A
Other languages
Chinese (zh)
Inventor
傅衡成 (Fu Hengcheng)
何为 (He Wei)
李凤荣 (Li Fengrong)
胡育昱 (Hu Yuyu)
魏智 (Wei Zhi)
纪立 (Ji Li)
Current Assignee
Shanghai Hansuo Information Technology Co ltd
Original Assignee
Shanghai Hansuo Information Technology Co ltd
Priority date: 2022-03-02
Filing date: 2022-03-02
Publication date: 2022-06-03
Application filed by Shanghai Hansuo Information Technology Co ltd
Priority to CN202210199168.4A
Publication of CN114581485A

Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a target tracking method based on a language modeling twin network, which comprises the following steps: step S1, acquiring a video containing continuous motion of a target, and making a training data set from the video; step S2, training a twin neural network on the training data set; step S3, keeping the parameters of the twin neural network fixed, and training a target position extraction network; step S4, jointly training the trained twin neural network and the trained target position extraction network to obtain the language modeling twin network; and step S5, acquiring real-time images of the target to be tracked, and tracking the target in real time with the language modeling twin network. The invention does not require incorporating expert experience into the algorithm, which makes it simpler to implement. It is also highly extensible and can be connected to more general intelligent systems.

Description

Target tracking method based on language modeling pattern twin network
Technical Field
The invention relates to the technical field of video target tracking, in particular to a target tracking method based on a language modeling twin network.
Background
Target tracking technology has wide application value in the civil and national-defense fields and is important to the development of robots, aircraft, autonomous driving, security, and related areas. In the security field, for example, a camera tracks pedestrians in its field of view, and a series of downstream intelligent algorithms analyze the results so that the monitoring system can better perceive and understand human postures, actions, and behavioral intentions, achieving intelligent, timely, and efficient monitoring. Automatic following means selecting a target in an initial frame, tracking it thereafter, and adjusting the follower's pose and its distance to the target according to the target's position, so that the target remains well imaged.
Current tracking algorithms fall into two broad categories. The first is based on traditional machine learning, such as correlation filtering and support vector machines; these methods mainly rely on classifiers trained online to distinguish the target from the background, then use the classifier to locate the target among candidate regions. The second is based on deep learning, such as convolutional neural networks and twin (Siamese) neural networks; these methods are first trained offline on large-scale data sets and then used to track the target. Judging by performance on test data sets, deep learning trackers draw on much stronger feature representations, and their tracking accuracy far exceeds that of traditional algorithms.
Since target tracking is a basic sub-field of computer vision, it gains further application value when combined with other vision processing algorithms, such as human pose estimation, pedestrian re-identification, and action recognition. Current tracking algorithms, however, are limited to providing a rectangular box to downstream algorithms; they extend poorly and can only be connected to certain specific intelligent systems. Moreover, current tracking algorithms require a large amount of carefully designed, highly task-specific expert experience to be built into the algorithm, which complicates implementation.
Disclosure of Invention
In order to solve these problems in the prior art, the invention provides a target tracking method based on a language modeling twin network, which can be connected to more general intelligent systems, does not require incorporating expert experience into the algorithm, and is simple to implement.
The invention provides a target tracking method based on a language modeling twin network, which comprises the following steps:
step S1, acquiring a video containing continuous movement of a target, and making a training data set according to the video;
step S2, training a twin neural network according to the training data set;
step S3, keeping the parameters of the twin neural network unchanged, and training a target position extraction network;
step S4, jointly training the trained twin neural network and the trained target position extraction network to obtain a language modeling twin network;
and step S5, acquiring a real-time image of the target to be tracked, and tracking the target to be tracked in real time by using the language modeling twin network.
Further, making the training data set in step S1 includes:
In step S11, the video containing continuous motion of the target is decomposed frame by frame into an image sequence, and the bounding box of the target in each image is marked.
Further, training the twin neural network in step S2 includes:
step S21, randomly selecting two frames from the image sequence; obtaining a template picture from one frame and inputting it into the template branch of the twin neural network, which outputs a template feature map; obtaining a candidate region picture from the other frame and inputting it into the candidate branch of the twin neural network, which outputs a candidate feature map;
step S22, performing a convolution operation on the template feature map and the candidate feature map, and outputting an encoding result map;
step S23: determining a loss function from the encoding result map, and training the twin neural network with the loss function.
Further, training the target position extraction network in step S3 includes:
step S31, expanding the convolution result obtained in step S22 into vector form, and inputting the vector into the feature dimension compression sub-network of the target position extraction network to obtain a compression result vector;
and step S32, inputting the compression result vector into the Transformer decoder of the target position extraction network to obtain the predicted coordinates of the target to be tracked, calculating the loss from the predicted coordinates and the actual coordinates of the target to be tracked, and training the target position extraction network with a gradient backpropagation algorithm.
Further, the step S5 of tracking the target to be tracked in real time includes:
step S51, initializing i = 2;
step S52, acquiring the bounding box of the target to be tracked in the (i-1)th frame image, and extracting the template picture of the target to be tracked in the (i-1)th frame image; inputting this template picture into the template branch of the twin neural network in the language modeling twin network to obtain the (i-1)th frame template feature map;
step S53, taking the position of the target to be tracked in the (i-1)th frame image as the center, cutting a picture of 255 × 255 pixels from the ith frame image as the ith frame candidate region picture;
step S54, inputting the ith frame candidate region picture into the candidate branch of the twin neural network in the language modeling twin network to obtain the ith frame candidate feature map;
step S55, performing a convolution operation on the ith frame candidate feature map and the (i-1)th frame template feature map, expanding the resulting feature map into a vector, and feeding the vector into the target position extraction network in the language modeling twin network to obtain the target frame tracked in the ith frame;
step S56, extracting the ith frame target picture according to the target frame tracked in the ith frame, and obtaining the ith frame predicted target feature map from the ith frame target picture;
step S57, obtaining the ith frame template feature map from the (i-1)th frame template feature map and the ith frame predicted target feature map;
step S58, judging from the real-time images of the target to be tracked whether the ith frame is the last frame; if so, ending the process; if not, letting i = i + 1 and repeating steps S53 to S57.
The invention models the target tracking problem into a language modeling problem based on pixel input, and effectively fuses an image and language sequence method to track a specific target. Compared with the traditional pure vision-based target tracking method, the method does not need to incorporate expert experience knowledge into the algorithm, and the implementation process is simpler and more convenient. In addition, the invention has strong expansibility, and the discrete language sequence output can be used as a language interface to be accessed into a more general intelligent system.
Drawings
Fig. 1 is a flowchart of a target tracking method based on a language modeling pattern twin network according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Because humans rely heavily on natural language for deep and wide-ranging perception and analysis of their environment, the invention applies natural-language processing methods to the tracking problem in computer vision and provides a target tracking method based on a language modeling twin network. As shown in fig. 1, the method comprises the following steps:
step S1, acquiring a video containing continuous motion of the target, and creating a training data set according to the video containing continuous motion of the target.
Specifically, the method for making the training data set comprises the following steps:
Step S11, the video containing continuous motion of the target is decomposed frame by frame into an image sequence, and the bounding box of the target to be tracked is marked in each image. From the marked bounding box, the center point, length, and width of the box can be obtained.
And step S12, acquiring a template picture and a candidate region picture according to the bounding box of the target to be tracked in each image. The template picture is the picture of the region inside the bounding box. The candidate region picture is acquired as follows: the center point of the bounding box is randomly shifted and taken as the target center, and a square region of 255 × 255 pixels is expanded around this center; the square region is the candidate region, and the picture inside it is the candidate region picture.
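By way of illustration, the cropping of a candidate region can be sketched in Python as follows; the jitter range max_shift and the zero-padding of areas outside the frame are assumptions, since the embodiment only states that the box center is randomly shifted before a 255 × 255 region is cropped.

```python
import numpy as np

def make_candidate_region(image, box, out_size=255, max_shift=32):
    # Crop a square candidate region around a randomly shifted box center.
    # box = (x1, y1, x2, y2); max_shift is an assumed jitter range in pixels.
    cx = (box[0] + box[2]) / 2.0 + np.random.uniform(-max_shift, max_shift)
    cy = (box[1] + box[3]) / 2.0 + np.random.uniform(-max_shift, max_shift)
    half = out_size // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half
    h, w = image.shape[:2]
    patch = np.zeros((out_size, out_size, image.shape[2]), dtype=image.dtype)
    sx0, sy0 = max(x0, 0), max(y0, 0)                  # clamp to image bounds
    sx1, sy1 = min(x0 + out_size, w), min(y0 + out_size, h)
    patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
    return patch, (x0, y0)  # offset maps candidate-region coords to image coords
```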
And step S13, constructing a vocabulary according to the candidate region picture; the vocabulary stores the actual coordinates of the target to be tracked. The size of the vocabulary is determined by the size of the candidate region and is therefore 255 × 255; that is, the vocabulary stores 65025 coordinates, covering the positions of the upper-left and lower-right corner points of the bounding box. During the subsequent supervised training, the prediction of the target position extraction network and the ground truth expressed in vocabulary words are fed into a loss function to compute the loss, and the network parameters are updated by a gradient backpropagation algorithm; during subsequent real-time tracking, the prediction is likewise represented by words of the vocabulary. The invention determines the position of the target in the candidate region picture at pixel-level precision, ensuring both image precision and positioning precision.
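As a minimal sketch, the vocabulary can be realized as a bijection between pixel positions in the 255 × 255 candidate region and word ids, so that a bounding box becomes a two-word sequence (upper-left corner, lower-right corner); the function names are illustrative.

```python
REGION = 255                    # side length of the candidate region in pixels
VOCAB_SIZE = REGION * REGION    # 65025 words, one per pixel position

def coord_to_word(x, y):
    # Encode a pixel coordinate inside the candidate region as a word id.
    assert 0 <= x < REGION and 0 <= y < REGION
    return y * REGION + x

def word_to_coord(word):
    # Decode a word id back to an (x, y) pixel coordinate.
    return word % REGION, word // REGION

# A bounding box is the two-word "sentence" (upper-left, lower-right).
box_sentence = [coord_to_word(40, 60), coord_to_word(120, 180)]
```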
And step S2, training the twin neural network on the prepared training data set.
Specifically, training the twin neural network comprises the following steps:
Step S21, randomly select two frames from the image sequence of step S11; obtain a template picture from one frame and input it into the template branch of the twin neural network, which outputs a template feature map; obtain a candidate region picture from the other frame and input it into the candidate branch of the twin neural network, which outputs a candidate feature map.
And step S22, perform a convolution operation on the template feature map and the candidate feature map to obtain a convolution result, and output an encoding result map. The encoding result map corresponds to a feature map of the target to be tracked; its values reflect the likelihood that the target lies at each position, and the larger the value, the more likely the target is at that position.
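For illustration, the convolution of the template feature map over the candidate feature map can be sketched in PyTorch as follows; the grouped-convolution formulation is a common implementation device assumed here, not something the embodiment prescribes.

```python
import torch
import torch.nn.functional as F

def cross_correlate(template_feat, candidate_feat):
    # template_feat:  (B, C, Ht, Wt), output of the template branch
    # candidate_feat: (B, C, Hc, Wc), output of the candidate branch
    # Each sample's template acts as a convolution kernel over its own
    # candidate map, yielding a (B, 1, Hc-Ht+1, Wc-Wt+1) encoding result map.
    b, c, _, _ = template_feat.shape
    out = F.conv2d(candidate_feat.reshape(1, b * c, *candidate_feat.shape[2:]),
                   template_feat, groups=b)
    return out.reshape(b, 1, *out.shape[2:])
```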
Step S23: compute the loss from the encoding result map and the annotation, and train the twin neural network with a gradient backpropagation algorithm.
It should be noted that when the twin neural network is trained, a Gaussian label whose mean is at the center of the target to be tracked is used as the ground-truth label. The loss function for training the twin neural network is as follows:
$$\mathrm{loss}_1=-\frac{1}{N}\sum_{u\in D}\big[\,y_u\log_2 v_u+(1-y_u)\log_2(1-v_u)\,\big]\qquad(1)$$
In the formula, N represents the number of elements of the encoding result map D, u represents an element position in D, y_u ∈ {1,0} represents the ground-truth label at u, v_u represents the actual value of the encoding result map at u, and log_2 represents the logarithm with base 2.
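A minimal sketch of this loss, assuming the raw encoding result map is squashed to (0, 1) with a sigmoid (the embodiment does not specify the squashing function) and that the Gaussian label supplies y_u per position:

```python
import torch

def twin_loss(response, label, eps=1e-7):
    # Cross-entropy of Eq. (1) between the encoding result map and the
    # Gaussian label centered on the target; base-2 logs as in the text.
    v = torch.sigmoid(response)                      # assumed squashing to (0, 1)
    loss = -(label * torch.log2(v + eps) + (1 - label) * torch.log2(1 - v + eps))
    return loss.mean()
```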
And step S3, keeping the parameters of the twin neural network unchanged, and training the target position extraction network.
The parameters of the twin neural network include filter weights and bias terms; both are set to a non-trainable state while the target position extraction network is trained. The target position extraction network comprises a feature dimension compression sub-network and a Transformer decoder, and training it comprises the following steps:
and step S31, expanding the convolution characteristic diagram obtained in the step S22 into vectors, and inputting the vectors into the characteristic dimension compression sub-network to obtain a compression result vector. The characteristic dimension compression sub-network consists of two fully-connected layers and is used for compressing the dimension of a convolution result and reducing the calculation amount of a subsequent network.
Step S32, input the compression result vector into the Transformer decoder, which outputs the final prediction result containing the predicted coordinates of the target to be tracked (the upper-left and lower-right corner points). Then compute the loss from the predicted coordinates and the actual coordinates of the target to be tracked, and train the target position extraction network with a gradient backpropagation algorithm.
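The following PyTorch sketch illustrates one possible shape of the target position extraction network; the layer widths, decoder depth, and the use of two learned query tokens in place of word-by-word autoregressive decoding are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PositionExtractor(nn.Module):
    # Two fully connected compression layers followed by a Transformer
    # decoder that emits two vocabulary words (upper-left and lower-right
    # corners). All dimensions here are illustrative assumptions.
    def __init__(self, in_dim, d_model=256, vocab_size=255 * 255):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(2, d_model))  # two corner tokens
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, corr_vec):                       # corr_vec: (B, in_dim)
        memory = self.compress(corr_vec).unsqueeze(1)  # (B, 1, d_model)
        tgt = self.queries.unsqueeze(0).expand(corr_vec.size(0), -1, -1)
        out = self.decoder(tgt, memory)                # (B, 2, d_model)
        return self.head(out)                          # (B, 2, vocab_size)
```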
The loss function for training the target location extraction network is shown as follows:
$$\mathrm{loss}_2=-\frac{1}{L}\sum_{j=1}^{L}\omega_j\log P\big(\hat{y}_j=y_j\mid x\big)\qquad(2)$$
In the formula, ω_j represents the weight of the jth coordinate in the vocabulary, L represents the total number of coordinates contained in the image sequence of the target, x represents the encoding result vector (i.e., the vector into which the convolution feature map is expanded), ŷ_j denotes the jth predicted coordinate, and y_j represents the jth coordinate of the target in the image sequence. In the present embodiment, the weights of all coordinates are equal. In other embodiments, the weights may also be set according to the positions of the coordinates.
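A sketch of this weighted cross-entropy over the coordinate vocabulary; with weights=None it reduces to plain cross-entropy, matching the equal weights of the present embodiment.

```python
import torch.nn.functional as F

def position_loss(logits, target_words, weights=None):
    # Eq. (2): logits (B, L, VOCAB_SIZE), target_words (B, L) word ids.
    per_token = F.cross_entropy(logits.flatten(0, 1), target_words.flatten(),
                                reduction='none')
    if weights is not None:              # optional per-coordinate weights ω_j
        per_token = per_token * weights.flatten()
    return per_token.mean()
```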
And step S4, jointly train the trained twin neural network and the trained target position extraction network to obtain the language modeling twin network.
Specifically, the entire network is jointly trained in a multi-task manner: the convolution output of the twin neural network is supervised in a relay fashion, with labels and loss function consistent with those of step S2.
The loss function for training the language modeling twin network is shown as follows:
$$\mathrm{loss}=\lambda_1\,\mathrm{loss}_1+\lambda_2\,\mathrm{loss}_2\qquad(3)$$
In the formula, λ_1 and λ_2 respectively represent the proportions of the two loss functions in the total loss function.
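One multi-task training step might then look as follows, reusing the twin_loss and position_loss sketches above; the batch layout, network call signatures, and λ values are assumptions.

```python
def joint_step(twin_net, pos_net, optimizer, batch, lam1=1.0, lam2=1.0):
    # Relay supervision on the correlation map (Eq. 1) plus the vocabulary
    # loss on the decoder output (Eq. 2), combined as in Eq. (3).
    template, candidate, gauss_label, target_words = batch
    response = twin_net(template, candidate)        # encoding result map
    logits = pos_net(response.flatten(1))           # map expanded to a vector
    loss = lam1 * twin_loss(response, gauss_label) \
         + lam2 * position_loss(logits, target_words)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```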
And step S5, acquire real-time images of the target to be tracked, and track the target to be tracked in real time with the language modeling twin network.
Specifically, the real-time tracking of the target to be tracked comprises the following steps:
in step S51, the initialization i is 2.
Step S52, acquire the bounding box of the target to be tracked in the (i-1)th frame image, and extract the template picture of the target to be tracked in the (i-1)th frame image; input this template picture into the template branch of the twin neural network in the language modeling twin network to obtain the (i-1)th frame template feature map.
Step S53, taking the position of the target to be tracked in the (i-1)th frame image as the center, cut a picture of 255 × 255 pixels from the ith frame image as the ith frame candidate region picture.
And step S54, input the ith frame candidate region picture into the candidate branch of the twin neural network in the language modeling twin network to obtain the ith frame candidate feature map.
And step S55, perform a convolution operation on the ith frame candidate feature map and the (i-1)th frame template feature map, expand the resulting feature map into a vector, and input the vector into the target position extraction network in the language modeling twin network to obtain the target frame tracked in the ith frame. From the target frame, its center position, length, and width can be obtained.
And step S56, extract the ith frame target picture according to the target frame tracked in the ith frame, and input the ith frame target picture into the template branch of the twin network to obtain the ith frame predicted target feature map.
And step S57, obtain the ith frame template feature map from the (i-1)th frame template feature map and the ith frame predicted target feature map.
The template feature map for the ith frame may be obtained by the following expression:
$$F_i=\omega F_{i-1}+(1-\omega)f_i$$
In the formula, F_i represents the template feature map of the ith frame, f_i represents the predicted target feature map of the ith frame, and ω ∈ [0, 1] represents the proportion of the (i-1)th frame template feature map within the ith frame template feature map.
Step S58, judge from the real-time images of the target to be tracked whether the ith frame is the last frame; if so, end the process; if not, let i = i + 1 and repeat steps S53 to S57.
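Putting steps S51 to S58 together, the tracking loop can be sketched as follows; crop_template, crop_candidate, and words_to_box are assumed helpers (cropping as in steps S52/S53 and decoding the two vocabulary words into a box), the template_branch/candidate_branch accessors are assumptions, and ω = 0.9 is an assumed update rate.

```python
def track(frames, init_box, twin_net, pos_net, omega=0.9):
    # Real-time tracking loop of step S5 (a sketch under assumed helpers).
    feat_t = twin_net.template_branch(crop_template(frames[0], init_box))
    box, boxes = init_box, [init_box]
    for frame in frames[1:]:                              # i = 2, 3, ...
        candidate = crop_candidate(frame, box, size=255)  # centered on last position
        feat_c = twin_net.candidate_branch(candidate)
        response = cross_correlate(feat_t, feat_c)        # see the step S22 sketch
        words = pos_net(response.flatten(1)).argmax(-1)   # two vocabulary words
        box = words_to_box(words)                         # corners -> bounding box
        feat_i = twin_net.template_branch(crop_template(frame, box))
        feat_t = omega * feat_t + (1 - omega) * feat_i    # template update, step S57
        boxes.append(box)
    return boxes
```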
The method models the target tracking problem as a language modeling problem over pixel inputs and effectively fuses image and language-sequence methods to track a specific target. Compared with traditional purely vision-based target tracking methods, it does not require incorporating expert experience into the algorithm and is simpler to implement. In addition, the method is highly extensible: its discrete language-sequence output can serve as a language interface to more general intelligent systems, so that downstream models can more easily analyze the tracked target with language tools.
The above embodiments are merely preferred embodiments of the present invention and are not intended to limit its scope; various changes may be made to them. All simple, equivalent changes and modifications made according to the claims and the content of the specification of the present application fall within the scope of the claims of the present patent application. What is not described in detail herein is conventional technical content.

Claims (5)

1. A target tracking method based on a language modeling pattern twin network is characterized by comprising the following steps:
step S1, acquiring a video containing continuous movement of a target, and making a training data set according to the video;
step S2, training a twin neural network according to the training data set;
step S3, keeping the parameters of the twin neural network unchanged, and training a target position extraction network;
step S4, jointly training the trained twin neural network and the trained target position extraction network to obtain a language modeling twin network;
and step S5, acquiring a real-time image of the target to be tracked, and tracking the target to be tracked in real time by using the language modeling twin network.
2. The target tracking method based on the language modeling twin network as claimed in claim 1, wherein the step S1 of creating the training data set includes:
in step S11, the video containing continuous motion of the target is decomposed frame by frame into an image sequence, and the bounding box of the target in each image is marked.
3. The target tracking method based on the language modeling twin network as claimed in claim 2, wherein the training of the twin neural network in step S2 includes:
step S21, randomly selecting two frames from the image sequence; obtaining a template picture from one frame and inputting it into the template branch of the twin neural network, which outputs a template feature map; obtaining a candidate region picture from the other frame and inputting it into the candidate branch of the twin neural network, which outputs a candidate feature map;
step S22, performing a convolution operation on the template feature map and the candidate feature map, and outputting an encoding result map;
step S23: determining a loss function from the encoding result map, and training the twin neural network with the loss function.
4. The target tracking method based on the language modeling twin network as claimed in claim 3, wherein the training of the target position extracting network in step S3 includes:
step S31, expanding the convolution result obtained in step S22 into vector form, and inputting the vector into the feature dimension compression sub-network of the target position extraction network to obtain a compression result vector;
and step S32, inputting the compression result vector into the Transformer decoder of the target position extraction network to obtain the predicted coordinates of the target to be tracked, calculating the loss from the predicted coordinates and the actual coordinates of the target to be tracked, and training the target position extraction network with a gradient backpropagation algorithm.
5. The target tracking method based on the language modeling twin network as claimed in claim 1, wherein the step S5 of tracking the target to be tracked in real time comprises:
step S51, initializing i = 2;
step S52, acquiring the bounding box of the target to be tracked in the (i-1)th frame image, and extracting the template picture of the target to be tracked in the (i-1)th frame image; inputting this template picture into the template branch of the twin neural network in the language modeling twin network to obtain the (i-1)th frame template feature map;
step S53, taking the position of the target to be tracked in the (i-1)th frame image as the center, cutting a picture of 255 × 255 pixels from the ith frame image as the ith frame candidate region picture;
step S54, inputting the ith frame candidate region picture into the candidate branch of the twin neural network in the language modeling twin network to obtain the ith frame candidate feature map;
step S55, performing a convolution operation on the ith frame candidate feature map and the (i-1)th frame template feature map, expanding the resulting feature map into a vector, and feeding the vector into the target position extraction network in the language modeling twin network to obtain the target frame tracked in the ith frame;
step S56, extracting the ith frame target picture according to the target frame tracked in the ith frame, and obtaining the ith frame predicted target feature map from the ith frame target picture;
step S57, obtaining the ith frame template feature map from the (i-1)th frame template feature map and the ith frame predicted target feature map;
step S58, judging from the real-time images of the target to be tracked whether the ith frame is the last frame; if so, ending the process; if not, letting i = i + 1 and repeating steps S53 to S57.
CN202210199168.4A 2022-03-02 2022-03-02 Target tracking method based on language modeling pattern twin network Pending

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210199168.4A 2022-03-02 2022-03-02 Target tracking method based on language modeling pattern twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210199168.4A 2022-03-02 2022-03-02 Target tracking method based on language modeling pattern twin network

Publications (1)

Publication Number Publication Date
CN114581485A 2022-06-03

Family

ID=81777281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210199168.4A Target tracking method based on language modeling pattern twin network 2022-03-02 2022-03-02 Pending

Country Status (1)

Country Link
CN (1) CN114581485A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620150A (en) * 2022-12-05 2023-01-17 海豚乐智科技(成都)有限责任公司 Multi-modal image ground building identification method and device based on twin transform
CN115620150B (en) * 2022-12-05 2023-08-04 海豚乐智科技(成都)有限责任公司 Multi-mode image ground building identification method and device based on twin transformers


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination