CN112116626A - Single-target tracking method based on flexible convolution - Google Patents
- Publication number: CN112116626A (application CN202010773674.0A)
- Authority
- CN
- China
- Prior art keywords
- flexible
- layer
- target
- convolution
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/20 — Image analysis; analysis of motion
- G06F18/2431 — Pattern recognition; classification techniques, multiple classes
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Neural networks; probabilistic or stochastic networks
- G06N3/08 — Neural networks; learning methods
- G06T7/11 — Image analysis; region-based segmentation
- G06T2207/10016 — Image acquisition modality: video; image sequence
- G06T2207/20132 — Special algorithmic details: image cropping
- G06T2207/30204 — Subject of image: marker
Abstract
An embodiment of the invention provides a single-target tracking method based on flexible convolution. A flexible convolutional network model comprising a shared layer and a domain-specific layer is constructed and trained on a data set, and the method comprises the following steps: S1, acquiring an original video sequence and preprocessing it; S2, inputting the preprocessed video sequence into the flexible convolutional network model, where the shared layer extracts shared features of the target through convolution operations, the shared features are passed to the domain-specific layer for binary classification of target versus background, flexible RoI pooling is then performed to select a candidate target region, and a loss function is used to improve the precision of the candidate target region, thereby realizing single-target tracking. The method effectively alleviates the problem that objects readily deform during single-target tracking, while flexible RoI pooling improves the accuracy of the candidate target region.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a single-target tracking method based on flexible convolution.
Background
Single-target tracking refers to identifying and locating an arbitrary given target, specified in an initial frame, throughout a video sequence. It has long been a research hotspot in the field of computer vision and is widely applied in video surveillance, autonomous driving, human-computer interaction, and other fields.
Because an object readily deforms during motion (e.g., scale change, rotation, posture change), the single-target tracking methods of the prior art handle this poorly, so the tracking effect suffers. For example, conventional deep learning extracts features with standard convolution, which is regular and of fixed geometric size (e.g., 3 × 3 or 5 × 5), so the sampled region also has a fixed geometric shape. Conventional single-target tracking algorithms extract features with such standard convolution operations and then perform tracking with a corresponding tracking model; the MDNet single-target tracking algorithm, for instance, realizes tracking with the conventional convolution method. Because a conventional convolutional neural network applies the same convolution to different feature maps and the sampled pixel positions are fixed, the sampled information includes many background features and cannot adapt to the shape of the object.
Disclosure of Invention
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which is used for overcoming the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A single target tracking method based on flexible convolution constructs a flexible convolution network model, wherein the flexible convolution network model comprises a sharing layer and a specific domain layer, and is trained by utilizing a data set, and the method comprises the following steps:
s1, acquiring an original video sequence and preprocessing the original video sequence;
s2, inputting the preprocessed video sequence into a flexible convolution network model, enabling the sharing layer to obtain the sharing characteristics of the target through convolution operation, inputting the sharing characteristics into a specific domain layer to perform two classifications of the target and the background, then performing flexible RoI pooling to select a candidate target area, and improving the precision of the candidate target area by using a loss function, thereby realizing single-target tracking.
Preferably, the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1–3 and 2 fully connected layers fc4–5, each fully connected layer has 512 output units, and ReLU and pooling layers lie between adjacent convolutional layers and between the two fully connected layers;

the domain-specific layer consists of the fully connected layers fc6^1–fc6^K of the flexible convolutional network model, corresponding to K domains; each domain contains a binary classification layer with a softmax cross-entropy function, responsible for distinguishing target from background within that domain.
Preferably, the softmax function inside the cross-entropy is

S(x_{i1}) = e^{x_{i1}} / Σ_{j1} e^{x_{j1}},

where i1 indexes the input, j1 runs over the number of inputs, and e is the base of the natural logarithm (≈ 2.718).
Preferably, the shared layer obtains shared features of the target through a convolution operation, including:

a flexible convolution operation samples the input feature map x on a regular grid R and adds a position offset ΔP_n, with {ΔP_n | n = 1, …, N}, N = |R|. For each position P_0, the feature points at all positions of the regular grid R are weighted by the corresponding convolution-kernel weights and summed to give the point P_0 on the new feature map. Because a two-dimensional offset ΔP_n along the x and y axes is added to the original regular-grid positions, the offset ΔP_n is a floating-point value, and the sampled value is obtained by bilinear interpolation of the 4 surrounding real values;

after the flexible convolution operation, a new feature map with the same height and width as the original and 2N channels is obtained; each feature point P_0 on the new feature map carries 2N values — 2 for the offsets along the corresponding x and y axes, and N for the corresponding N ΔP_n values.
Preferably, performing flexible RoI pooling to select the candidate target region comprises:

the flexible RoI pooling layer divides a w × h RoI into k × k bins and outputs a k × k feature map y. For the (i_0, j_0)-th bin, with 0 ≤ i_0, j_0 < k, where i_0 is the i_0-th row and j_0 the j_0-th column of the candidate target region,

y(i_0, j_0) = Σ_{P ∈ bin(i_0, j_0)} x(P_0 + P + ΔP_{i_0 j_0}) / n_{i_0 j_0},

where x is the input, P ranges over the positions in the bin, and n_{i_0 j_0} is the number of pixels in the bin; y(i_0, j_0) is the output after flexible RoI pooling.
Preferably, the loss function in S2 is the binary cross-entropy

L = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where ŷ_i denotes the probability that the i-th region is predicted as a positive example and y_i denotes the true label of the i-th region. The network outputs probability scores for positive and negative samples; a threshold is set, and a sample scoring above the threshold is positive, otherwise negative.
According to the technical solution provided by the embodiment of the invention, flexible convolution and flexible RoI pooling make the extracted features and the candidate target region more accurate and effective, alleviating the deformation of the tracked object during target tracking and overcoming the defects of the prior art.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a single-target tracking method based on flexible convolution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flexible convolutional network model training process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a testing process of the flexible convolutional network model according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides a single-target tracking method based on flexible convolution, which specifically comprises the following steps as shown in figures 1-3:
and S1, acquiring the original video sequence, preprocessing the original video sequence, and cutting the picture into 107 × 107.
S2, inputting the preprocessed video sequence into the flexible convolutional network model, where the shared layer obtains shared features of the target through convolution operations, the shared features are input to the domain-specific layer for binary classification of target versus background, flexible RoI pooling is then performed to select a candidate target region, and a loss function is used to improve the precision of the candidate target region, thereby realizing single-target tracking.
Firstly, a flexible convolution network model is constructed, wherein the flexible convolution network model comprises a sharing layer and a specific domain layer, and the method specifically comprises the following steps:
based on an MDNet (Multi-Domain volumetric Neural Networks) single-target tracking network, a traditional convolution method is modified into flexible convolution, a sharing layer of a flexible convolution network model comprises five hidden layers which are respectively 3 convolution layers (conv1-3) and 2 full-connection layers (fc4-5), each full-connection layer is provided with 512 output units, and relu layers and posing layers are respectively arranged between every two adjacent convolution layers and between the two full-connection layers.
The domain-specific layer learns information for a specific domain and consists of the last fully connected layers of the flexible convolutional network model (fc6^1–fc6^K), corresponding to K domains. Each of the K domains contains a binary classification layer with a softmax cross-entropy function, responsible for distinguishing target from background within that domain, so that target features are learned more accurately. The softmax inside the cross-entropy is

S(x_{i1}) = e^{x_{i1}} / Σ_{j1} e^{x_{j1}},

where i1 indexes the input, j1 runs over the number of inputs, and e is the base of the natural logarithm (≈ 2.718).
Feature extraction adopts the flexible convolution operation. The input feature map x is sampled on a regular grid R; a conventional convolution over a 3 × 3 grid R = {(−1, −1), (−1, 0), …, (0, −1), (1, 1)} computes

y(P_0) = Σ_{P_n ∈ R} w(P_n) · x(P_0 + P_n),

where P_n is a position in the convolution kernel and w is the kernel weight. The flexible convolution adds a position offset for each position P_0:

y(P_0) = Σ_{P_n ∈ R} w(P_n) · x(P_0 + P_n + ΔP_n),

with {ΔP_n | n = 1, …, N}, N = |R|, i.e., for each position P_0 the feature points at all positions of the regular grid R are weighted by the corresponding kernel weights and summed to give the point P_0 on the new feature map. Because a two-dimensional offset ΔP_n (along the x and y axes) is added to the original regular-grid positions, the sampling position is a floating-point value, and its value must be obtained by bilinear interpolation of the 4 surrounding real values, i.e.

x(P) = Σ_q g(q_x, P_x) · g(q_y, P_y) · x(q),

where g(a, b) = max(0, 1 − |a − b|) and q enumerates integer positions on the feature map (only the 4 nearest neighbours of P contribute). After the flexible convolution operation, a new feature map with the same height and width as the original and 2N channels is obtained; each feature point P_0 on the new feature map carries 2N values — 2 for the offsets along the corresponding x and y axes, and N for the corresponding N ΔP_n values.
After the flexible convolution operation, the extracted features are fed into the fully connected layers for target/background binary classification, and finally RoI pooling selects the candidate target region to complete target tracking.
To select the candidate target region by RoI pooling, standard RoI pooling is first applied to the input feature map x; a fully connected layer then outputs normalized k × k offsets ΔP̂_{ij}, from which

ΔP_{ij} = γ · ΔP̂_{ij} ∘ (w, h)

is computed, where γ adjusts the magnitude of the offset and is empirically set to 0.1, and w and h are the width and height of the RoI. The offset ΔP_{ij} is still a floating-point value and must be obtained by bilinear interpolation of the 4 surrounding real values. Specifically:

the flexible RoI pooling layer divides the w × h RoI into k × k regions (bins) and outputs a k × k feature map y. For the (i_0, j_0)-th bin (0 ≤ i_0, j_0 < k),

y(i_0, j_0) = Σ_{P ∈ bin(i_0, j_0)} x(P_0 + P + ΔP_{i_0 j_0}) / n_{i_0 j_0},

where i_0 is the i_0-th row and j_0 the j_0-th column of the candidate target region; x is the input, P ranges over the positions in the bin, n_{i_0 j_0} is the number of pixels in the bin, and ΔP_{i_0 j_0} is the offset ({ΔP_{ij} | 0 ≤ i, j < k}); y(i_0, j_0) is the output after flexible RoI pooling.
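The bin-wise averaging with per-bin offsets can be sketched as follows (a simplified illustration: the offsets here are integers for clarity, whereas in the method they are fractional and resolved by bilinear interpolation; all names are illustrative):

```python
def flexible_roi_pool(feature, roi, k, offsets):
    # roi = (x, y, w, h); divide into k x k bins; each bin (i0, j0) is
    # average-pooled at positions shifted by its offset offsets[i0][j0].
    x, y, w, h = roi
    out = [[0.0] * k for _ in range(k)]
    for i0 in range(k):
        for j0 in range(k):
            dx, dy = offsets[i0][j0]
            total, n = 0.0, 0
            for p in range(i0 * h // k, (i0 + 1) * h // k):
                for q in range(j0 * w // k, (j0 + 1) * w // k):
                    total += feature[y + p + dy][x + q + dx]
                    n += 1
            out[i0][j0] = total / n   # divide by pixel count n of the bin
    return out

# 4x4 feature map with value = row*4 + col; whole map as the RoI, k = 2
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
zero = [[(0, 0)] * 2 for _ in range(2)]   # no learned offsets
pooled = flexible_roi_pool(feature, (0, 0, 4, 4), 2, zero)
```

With zero offsets this reduces to ordinary average RoI pooling; nonzero offsets shift each bin's sampling region toward the deforming object.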
The feature extraction of each video sequence specifically comprises:
the video sequence is RGB pictures, and each picture is expressed as after the characteristic of each picture is extracted
x=[b,H,W,C]
Wherein, b is the current blocksize of the picture, C is the number of channels of the picture, the value is 3, the number is three channels of RGB, and H and W are picture pixel values.
As shown in fig. 2, offline learning is performed before tracking starts; its purpose is to train the parameters on the current training data. The conv1, conv2 and conv3 layer parameters are not updated during online tracking, while the fc4 and fc5 layer parameters are updated during online tracking. The original training data are consecutive video frames cut from data containing multiple videos, and each frame carries a manually labeled ground-truth box (hereinafter abbreviated gt-box) indicating the position of the tracking target in the image. A box is represented by a vector (x, y, w, h), where (x, y) are the coordinates of the box center within the image, w is the box width, and h is the box height. In each frame of each video sequence, 50 boxes whose IoU with the gt-box is ≥ 0.7 are generated by uniform random sampling as positive samples, and 200 boxes with IoU ≤ 0.5 as negative samples. Training the flexible convolutional network model with this data set specifically comprises the following steps:
the first step is as follows: and (5) initializing. The parameter { w1, w 2.. w5} is the result of the MDNet model pre-training, and w6 is the result of the random initialization.
The second step: bounding-box regression training. According to the gt-box positions, 1000 boxes with IoU ≥ 0.7 relative to the gt-box are generated by uniform random sampling; the images within these 1000 boxes are scaled to obtain 1000 training samples at 107 × 107 resolution as input, and the fc_box parameters are obtained with a linear regression algorithm. After the bounding-box regression is completed, fc_box is not updated until the next tracking run.
The third step: network training. According to the gt-box positions, 500 boxes with IoU ≥ 0.7 relative to the gt-box are generated by Gaussian random sampling as positive samples, and 5000 boxes with IoU ≤ 0.3 as negative samples by uniform random sampling. The learning rate of the fc4 and fc5 layers is set to 0.0001 and that of the fc6 layer to 0.001, with 30 iterations of SGD training. Each iteration uses a mini-batch of 128: 32 randomly selected positive samples and 96 hard negative samples chosen from 1024 randomly drawn negatives; the fc4–6 layer parameters are updated after training.
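Hard negative mining, as used to assemble each mini-batch, simply keeps the negatives the network most confidently misclassifies as positive. A minimal sketch (function name and scores are illustrative):

```python
def hard_negatives(neg_scores, k=96):
    # Keep the indices of the k negative samples with the highest positive
    # scores, i.e. the negatives the network gets most wrong.
    order = sorted(range(len(neg_scores)),
                   key=lambda i: neg_scores[i], reverse=True)
    return order[:k]

# Positive scores assigned by the network to five negative samples
scores = [0.1, 0.9, 0.3, 0.7, 0.2]
picked = hard_negatives(scores, k=2)   # the two most confusing negatives
```

Training on these confusing negatives sharpens the target/background boundary faster than training on random negatives.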
After this training, back-propagation adjusts the parameters using the loss function

L = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where ŷ_i denotes the probability that the i-th region is predicted as a positive example and y_i denotes the true label of the i-th region. The network outputs probability scores for positive and negative samples; a threshold is set, and a sample scoring above the threshold is positive, otherwise negative. The threshold lies between 0.5 and 1.
In the invention, the threshold is set to 0.5. For each manually annotated image sequence, the annotations are the corresponding target positions; if the overlap ratio between the predicted target region and the manual annotation is ≥ 0.5, tracking is considered successful, otherwise it is considered failed.
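The binary cross-entropy loss and the threshold rule can be sketched in a few lines (an illustrative sketch with made-up scores, not the trained network's output):

```python
import math

def bce_loss(p_hat, y):
    # L = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p_hat, y)) / n

def classify(p_hat, threshold=0.5):
    # Scores above the threshold are taken as positive samples
    return [1 if pi > threshold else 0 for pi in p_hat]

p_hat = [0.9, 0.2, 0.7]   # predicted probability of being the target
y     = [1,   0,   1]     # ground-truth labels
loss = bce_loss(p_hat, y)
labels = classify(p_hat)
```

The loss shrinks as the predicted probabilities move toward the true labels, which is exactly the direction back-propagation pushes the fc-layer parameters.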
The tracking of the flexible convolution network model by using the data set specifically comprises the following steps:
the fourth step: and (4) tracking on line. And generating 256 candidate boxes in a Gaussian distribution random mode according to the output of the box in the previous frame, respectively obtaining the positive score of the candidate boxes through network calculation, and selecting the box with the largest numerical value. If the positive score is greater than 0.5, the tracing is considered to be successful, and the following operations are executed: (1) adjusting the box by using the frame regression parameters to obtain a tracking result (namely generating a reference box of 256 candidate boxes in the next frame); (2) based on box after frame regression, 50 positive samples IoU of which are more than or equal to 0.7 and 200 negative samples IoU of which are more than or equal to 0.3 are generated in a Gaussian distribution random mode. If its positive score is less than 0.5, tracking is considered to have failed.
The data set of the invention is divided into a training set, a validation set, and a test set in a 60% : 20% : 20% ratio.
As shown in fig. 3, the OTB100 data set is used in an embodiment of the present invention. The evaluation indexes on the OTB100 data set are precision and success rate.
The horizontal axis of the precision plot is the location error threshold; the location error is the Euclidean distance between the predicted target center and the center of the manually labeled ground-truth box during target tracking. The threshold typically ranges over [0, 50], i.e., 51 thresholds at 1-pixel intervals. The vertical axis gives, for each threshold, the number of frames across all test video sequences whose center-location error is below that threshold, expressed as a percentage of the total number of frames in each sequence; the average of these percentages over all video sequences is the precision value. Different location-error thresholds yield different percentages, producing the precision curve; in the invention the threshold is set to 20 pixels.
The horizontal axis of the success-rate plot is the overlap threshold; the overlap ratio is the intersection-over-union of the target box predicted by the algorithm and the manually labeled ground-truth box. The threshold typically ranges over [0, 1], i.e., 21 IoU values at intervals of 0.05. The vertical axis gives, for each threshold, the number of frames across all test video sequences whose predicted-vs-ground-truth IoU exceeds that threshold, expressed as a percentage of the total number of frames in each sequence; the average over all video sequences is the success-rate value. In the invention the threshold is set to 0.5.
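Both metrics reduce to counting frames that pass a threshold. A minimal sketch over made-up per-frame values (names illustrative):

```python
def precision(center_errors, threshold=20.0):
    # Fraction of frames whose predicted-vs-ground-truth centre distance
    # falls below the location-error threshold (20 px in the invention)
    return sum(1 for d in center_errors if d < threshold) / len(center_errors)

def success(ious, threshold=0.5):
    # Fraction of frames whose predicted/ground-truth IoU exceeds the threshold
    return sum(1 for v in ious if v > threshold) / len(ious)

errs = [5.0, 12.0, 30.0, 8.0]   # per-frame centre errors in pixels
ious = [0.8, 0.4, 0.6, 0.9]     # per-frame IoU values
p = precision(errs)   # 3 of 4 frames within 20 px -> 0.75
s = success(ious)     # 3 of 4 frames above IoU 0.5 -> 0.75
```

Sweeping the threshold instead of fixing it yields the precision and success curves described above.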
The overlap ratio is computed as

IoU = (A_G ∩ A_T) / (A_G ∪ A_T)

where A_T is the target bounding box produced by the tracking algorithm and A_G is the manually labeled ground-truth region.
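For axis-aligned boxes the intersection-over-union reduces to simple coordinate arithmetic. A self-contained sketch (box layout (x, y, w, h) is an assumption of this example):

```python
def iou(a, b):
    # Boxes as (x, y, w, h); IoU = area(A ∩ B) / area(A ∪ B)
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
pred = (5, 0, 10, 10)
v = iou(gt, pred)   # intersection 50, union 150 -> 1/3
```

This is the quantity thresholded at 0.5 to decide per-frame tracking success.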
In training, the added convolutional-layer and fully-connected-layer weights used for offset learning are initialized to zero. Their learning rate is set to β times that of the existing layers (β defaults to 1), and they are trained through the bilinear interpolation operation by back-propagation.
In summary, the embodiment of the present invention provides a single-target tracking method based on flexible convolution, which extracts features in a flexible manner and selects the candidate target region by RoI pooling, thereby addressing the poor tracking performance caused by object deformation during single-target tracking.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, since apparatus or system embodiments are substantially similar to method embodiments, their description is relatively simple, and reference may be made to the corresponding parts of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A single-target tracking method based on flexible convolution, characterized in that a flexible convolutional network model comprising a shared layer and a domain-specific layer is constructed and trained on a data set, the method comprising the following steps:
S1, acquiring an original video sequence and preprocessing it;
S2, inputting the preprocessed video sequence into the flexible convolutional network model, so that the shared layer obtains shared features of the target through convolution operations; inputting the shared features into the domain-specific layer for binary classification of target versus background; then performing flexible RoI pooling to select candidate target regions; and refining the candidate target regions with a loss function, thereby realizing single-target tracking.
2. The method of claim 1, wherein the flexible convolutional network model comprises a shared layer and a domain-specific layer, wherein the shared layer comprises 3 convolutional layers conv1-conv3 and 2 fully connected layers fc4-fc5, each fully connected layer having 512 output units, with ReLU and pooling layers between each two adjacent convolutional layers and between the two fully connected layers, respectively;
the domain-specific layer consists of the fully connected layers fc6^1-fc6^K of the flexible convolutional network model; the fully connected layers fc6^1-fc6^K correspond to K domains, each domain containing a binary classification layer with a softmax cross-entropy function, and the binary classification layer is responsible for distinguishing the target from the background in that domain.
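For illustration only (this sketch is not part of the claimed subject matter), the K domain-specific binary classification heads described above can be modeled as K independent 512-to-2 fully connected layers followed by a softmax; all names, the random initialization, and the choice of K here are illustrative assumptions, not details taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DomainSpecificHeads:
    """K parallel binary-classification layers fc6^1..fc6^K.

    Each domain k owns its own 512 -> 2 fully connected layer; during
    training only the head for the current domain would be active.
    """
    def __init__(self, k_domains, in_units=512):
        self.weights = [rng.standard_normal((in_units, 2)) * 0.01
                        for _ in range(k_domains)]
        self.biases = [np.zeros(2) for _ in range(k_domains)]

    def forward(self, shared_feat, domain):
        # shared_feat: 512-d output of the shared layer fc5.
        logits = shared_feat @ self.weights[domain] + self.biases[domain]
        return softmax(logits)  # [P(background), P(target)]

heads = DomainSpecificHeads(k_domains=3)
feat = rng.standard_normal(512)       # stand-in for a shared fc5 feature
probs = heads.forward(feat, domain=1)
```

The per-domain separation mirrors the claim: the shared layers learn generic features while each fc6^k is responsible only for target-versus-background discrimination within its own domain.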
4. The method of claim 1, wherein the shared layer obtains the shared features of the target by a convolution operation, comprising:
sampling the input feature map x over a regular grid R using the flexible convolution operation, and adding position offsets ΔPn ({ΔPn | n = 1, ..., N}, where N = |R|); for each position P0, the feature points at all positions of the regular grid R are weighted by the corresponding convolution kernel values and summed to obtain the corresponding point P0 on the new feature map; because a two-dimensional offset ΔPn along the x and y axes is added on the basis of the original regular grid R, the offset value ΔPn is a floating-point value, and the feature at the offset position is obtained by bilinear interpolation of the 4 surrounding real values;
after the flexible convolution operation, a new feature map with the same length and width as the original feature map and 2N channels is obtained; each feature point P0 on the new feature map corresponds to 2N values: 2 for the offsets along the x and y axes, and N for the N corresponding ΔPn values.
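As an illustrative sketch (not part of the claims), the flexible convolution above reduces, per output point, to a weighted sum of bilinearly interpolated samples at offset grid positions, i.e. y(P0) = Σ_n w(Pn)·x(P0 + Pn + ΔPn). The 3×3 grid, the helper names, and the absence of border padding are simplifying assumptions:

```python
import numpy as np

def bilinear(x, py, px):
    """Sample feature map x at a fractional position (py, px) by bilinear
    interpolation of the 4 surrounding integer positions. Positions are
    assumed to lie inside the map (a real implementation would zero-pad)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = py - y0, px - x0
    return ((1 - dy) * (1 - dx) * x[y0, x0] + (1 - dy) * dx * x[y0, x1]
            + dy * (1 - dx) * x[y1, x0] + dy * dx * x[y1, x1])

def flexible_conv_at(x, w, p0, offsets):
    """One output point of a flexible (deformable) convolution over a
    3x3 regular grid R: y(P0) = sum_n w(Pn) * x(P0 + Pn + dPn)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    out = 0.0
    for n, (gy, gx) in enumerate(grid):
        oy, ox = offsets[n]            # learned fractional offset dPn
        out += w[n] * bilinear(x, p0[0] + gy + oy, p0[1] + gx + ox)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # toy feature map: x[r, c] = 5r + c
w = np.full(9, 1.0 / 9.0)                     # uniform 3x3 kernel weights
offsets = [(0.0, 0.0)] * 9                    # zero offsets -> ordinary convolution
y_center = flexible_conv_at(x, w, (2, 2), offsets)
```

With all offsets zero the operation collapses to a standard convolution; nonzero fractional offsets shift each of the N sampling points independently, which is where the 2N offset channels come from.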
5. The method of claim 1, wherein performing flexible RoI pooling to select candidate target regions comprises:
the flexible RoI pooling layer divides an RoI of size w × h into k × k bins and outputs a k × k feature map y; for the (i0, j0)-th bin, 0 ≤ i0, j0 < k, where i0 denotes the i0-th row and j0 the j0-th column of the candidate target region,
y(i0, j0) = ( Σ_{P ∈ bin(i0, j0)} x(P0 + P + ΔP(i0, j0)) ) / n(i0, j0)
where x is the input, P is each position in the regular grid, and n(i0, j0) is the number of pixels in the bin; the candidate target region is output after the flexible RoI pooling.
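For illustration (not part of the claims), a minimal sketch of the bin-wise averaging above; integer per-bin offsets are a simplifying assumption here, since flexible RoI pooling as claimed would use fractional offsets resolved by bilinear interpolation:

```python
import numpy as np

def flexible_roi_pool(x, roi, k, offsets):
    """Flexible RoI pooling sketch: split a w x h RoI into k x k bins,
    shift each bin by its own offset dP(i0, j0), then average-pool:
    y(i0, j0) = sum_{P in bin} x(P0 + P + dP_{i0,j0}) / n_{i0,j0}.
    Offsets must keep each shifted bin inside the feature map."""
    y0, x0, h, w = roi                  # top-left corner and RoI size
    bin_h, bin_w = h // k, w // k
    out = np.zeros((k, k))
    for i0 in range(k):
        for j0 in range(k):
            oy, ox = offsets[i0][j0]    # per-bin integer offset (simplification)
            r0 = y0 + i0 * bin_h + oy
            c0 = x0 + j0 * bin_w + ox
            patch = x[r0:r0 + bin_h, c0:c0 + bin_w]
            out[i0, j0] = patch.mean()  # divide by n_{i0,j0} pixels
    return out

x = np.arange(64, dtype=float).reshape(8, 8)  # toy feature map: x[r, c] = 8r + c
offsets = [[(0, 0)] * 2 for _ in range(2)]    # zero offsets -> ordinary RoI pooling
y = flexible_roi_pool(x, (0, 0, 8, 8), 2, offsets)
```

With zero offsets this is plain average RoI pooling; the learned per-bin offsets let each bin slide toward the deformable part of the target it covers.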
6. The method of claim 1, wherein the loss function in S2 is:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]
where pi denotes the probability that the i-th region is predicted as a positive example and yi denotes the true label of the i-th region; the output of the network is the probability score of positive and negative samples; a threshold is set, and a region whose score is greater than the threshold is taken as a positive sample, otherwise as a negative sample.
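For illustration (not part of the claims), the loss above read as a binary cross-entropy over candidate regions, together with the threshold rule for labeling positive and negative samples; the example scores and the 0.5 threshold are invented for the sketch:

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy over N candidate regions:
    L = -(1/N) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ].
    eps clipping guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def classify(p, threshold=0.5):
    """A score above the threshold marks a positive sample (target),
    otherwise a negative sample (background)."""
    return (p > threshold).astype(int)

p = np.array([0.9, 0.2, 0.7, 0.1])  # predicted positive-class scores (invented)
y = np.array([1, 0, 1, 0])          # ground-truth labels (invented)
loss = bce_loss(p, y)
labels = classify(p)
```

The loss shrinks as the predicted scores move toward the true labels, which is what drives the refinement of the candidate target regions in step S2.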
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010773674.0A CN112116626B (en) | 2020-08-04 | Single-target tracking method based on flexible convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112116626A true CN112116626A (en) | 2020-12-22 |
CN112116626B CN112116626B (en) | 2024-04-26 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106846364A (en) * | 2016-12-30 | 2017-06-13 | Mingjian (Xiamen) Technology Co., Ltd. | Target tracking method and device based on convolutional neural networks
CN108564025A (en) * | 2018-04-10 | 2018-09-21 | Guangdong Power Grid Co., Ltd. | Infrared image object recognition method based on deformable convolutional neural networks
CN110097577A (en) * | 2019-05-06 | 2019-08-06 | Jiangnan University | Semi-offline deep target tracking method based on deep learning
US20200065976A1 (en) * | 2018-08-23 | 2020-02-27 | Seoul National University R&Db Foundation | Method and system for real-time target tracking based on deep learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113379788A (en) * | 2021-06-29 | 2021-09-10 | Xi'an University of Technology | Target tracking stability method based on triplet network
CN113379788B (en) * | 2021-06-29 | 2024-03-29 | Xi'an University of Technology | Target tracking stability method based on triplet network
Similar Documents
Publication | Title |
---|---|
CN110210551B (en) | Visual target tracking method based on adaptive subject sensitivity |
CN109635744B (en) | Lane line detection method based on deep segmentation network |
CN108681752B (en) | Image scene labeling method based on deep learning |
CN111583263B (en) | Point cloud segmentation method based on joint dynamic graph convolution |
CN111126488B (en) | Dual-attention-based image recognition method |
CN112330719B (en) | Deep learning target tracking method based on feature map segmentation and adaptive fusion |
Zhou et al. | Scale adaptive image cropping for UAV object detection |
CN113052873B (en) | Single-target tracking method with online self-supervised learning scene adaptation |
CN109753897B (en) | Behavior recognition method based on memory cell reinforcement and temporal dynamic learning |
CN111612008A (en) | Image segmentation method based on convolutional networks |
CN112651998B (en) | Human body tracking algorithm based on attention mechanism and dual-stream multi-domain convolutional neural network |
CN114782694B (en) | Unsupervised anomaly detection method, system, device and storage medium |
CN112115967B (en) | Image incremental learning method based on data protection |
Ma et al. | Multi-level knowledge distillation for low-resolution object detection and facial expression recognition |
CN115731441A (en) | Target detection and pose estimation method based on data cross-modal transfer learning |
CN112183675B (en) | Tracking method for low-resolution targets based on a twin (Siamese) network |
CN112488128A (en) | Bezier-curve-based detection method for arbitrarily distorted image line segments |
CN114692732A (en) | Method, system, device and storage medium for online label updating |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion |
CN110070023B (en) | Self-supervised learning method and device based on motion sequential regression |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
Li et al. | A motion-blur QR code identification algorithm based on feature extraction and improved adaptive thresholding |
CN109871790B (en) | Video decoloring method based on a hybrid neural network model |
CN113807214B (en) | Small-target face recognition method based on DeiT auxiliary-network knowledge distillation |
CN114693923A (en) | Three-dimensional point cloud semantic segmentation method based on context and attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |