CN112084952B - Video point location tracking method based on self-supervision training - Google Patents

Video point location tracking method based on self-supervision training

Info

Publication number
CN112084952B
CN112084952B (application CN202010946922.7A)
Authority
CN
China
Prior art keywords
point
video
feature
target
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010946922.7A
Other languages
Chinese (zh)
Other versions
CN112084952A (en)
Inventor
李智勇
王赛舟
肖德贵
李仁发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202010946922.7A priority Critical patent/CN112084952B/en
Publication of CN112084952A publication Critical patent/CN112084952A/en
Application granted granted Critical
Publication of CN112084952B publication Critical patent/CN112084952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video point location tracking method based on self-supervision training. The method comprises: acquiring a training data set; constructing a neural network model that extracts feature point positions and descriptors and computes key point confidence scores; performing self-supervision training on the neural network model to obtain a trained video network model; extracting feature descriptors of a target video; matching and screening the feature descriptors of successive frames of the target video and constructing a homography matrix; and obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix, thereby completing video point location tracking. By constructing and training a neural network to obtain the video network model and using that model to locate and track target points in the target video, the method realizes video point tracking of the target video without distortion or deformation during tracking, and offers high reliability, good real-time performance and good results.

Description

Video point location tracking method based on self-supervision training
Technical Field
The invention belongs to the field of image processing, and particularly relates to a video point location tracking method based on self-supervision training.
Background
With the development of technology and the economy and the improvement of living standards, watching video has become an indispensable part of people's work and daily life.
In a video, the scene often stays the same over a period of time. How to dynamically add icons (such as advertisements, warning text, etc.) at specific positions within the same scene, with a realistic and harmonious result, has long been one of the focuses of researchers.
The addition of dynamic icons faces several challenges: 1. accurately estimating the camera motion; 2. handling the influence of illumination; 3. properly handling depth-of-field deformation; 4. mutual occlusion of objects; and so on. For these challenges, the current solution is to extract feature points with the SIFT method and match the feature points of consecutive image frames to obtain a transformation matrix: from the target position given in the first frame, the position of the target in the subsequent frames can be obtained.
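For reference, this conventional pipeline can be sketched with OpenCV as follows; the detector settings, the ratio test and the RANSAC threshold are generic illustrative choices, not parameters taken from this patent:

```python
import cv2
import numpy as np


def sift_track(frame_a, frame_b, target_xy):
    """Prior-art style tracking: SIFT keypoints + descriptor matching + homography."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)

    # Brute-force matching with a ratio test to discard ambiguous correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Transformation matrix between the two frames (requires at least 4 good matches).
    H, _mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Position of the target given in the first frame, mapped into the next frame.
    point = np.float32([[target_xy]])
    return cv2.perspectiveTransform(point, H)[0, 0]
```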
However, the existing methods have drawbacks: because they rely on local detection to extract feature points, the extracted feature points tend to have inaccurate positions; inaccurate feature point positions, reflected in the video, make the added label drift and distort; in addition, the current methods involve a large amount of computation, suffer serious latency, and have poor real-time performance.
Disclosure of Invention
The invention aims to provide a video point location tracking method based on self-supervision training, which has high reliability, good real-time performance and good effect.
The video point location tracking method based on self-supervision training provided by the invention comprises the following steps:
S1, acquiring a training data set;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores;
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model;
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video;
S5, matching and screening the feature descriptors of successive frames of the target video;
S6, constructing a homography matrix from the matching and screening results of step S5;
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking.
The training data set described in step S1 is the "Malanshan Cup" video specific-point tracking data set.
In step S3, the self-supervision training of the neural network model constructed in step S2 is performed, specifically, after new pictures have been formed with an image enhancement algorithm.
Forming new pictures with the image enhancement algorithm specifically comprises randomly rotating the image several times and scaling the image.
In step S4, the target video to be point-location tracked is input into the video network model obtained in step S3 so as to extract its feature descriptors; specifically, the video network model obtained in step S3 is used to extract the feature points of the target video, thereby obtaining the key point positions and the feature descriptors.
Step S5 matches and screens the feature descriptors of successive frames of the target video, specifically as follows:
for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores.
The matching and screening of the feature descriptors is defined as follows. Let po ∈ {A, B} index the point sets extracted from two pictures A and B, where picture B is the transformed picture obtained by applying a homography operator T (scaling, translation, rotation, perspective) to picture A, and the positions of the points of picture A are likewise transformed by T for alignment. If a feature point extracted from the transformed picture and the correspondingly transformed feature point extracted from the picture before transformation are sufficiently close after alignment, the point pair can be used to compute the loss function. The network outputs three tensors C(po), P(po) and F(po) for each picture; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements

$$g_{ij}=\lVert \hat{a}_i-b_j\rVert_2,$$

where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance. If the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q. Indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is

$$d_k=\lVert \hat{a}_k-b_k\rVert_2.$$
In step S6, the homography matrix is constructed from the matching and screening results of step S5; specifically, the homography matrix H is constructed from the paired point sets obtained in step S5.
In step S7, the transformed target position is obtained from the target point given in the first frame of the target video and the homography matrix obtained in step S6; specifically, the transformed target position is calculated as

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
the step S2 of constructing the neural network model is specifically to adopt the following model as the neural network model:
A. constructing a three-layer neural network model, including a backup, a Feature map, a Confidence branch, a Position branch and a distrubutor branch; the input end of the backbond is the input end of the neural network model, the output end of the backbond is connected with the input end of the Feature map, and the output end of the Feature map is simultaneously connected with the input end of the Confidence branch, the input end of the Position branch and the input end of the distriptor branch;
B. the Backbone of the neural network model adopts a characteristic pyramid structure;
C. in the Feature map of the neural network model, a pixel point corresponds to an N multiplied by N pixel block in the original image, and is connected through a concat, so that the multi-layer Feature is fused; n is a natural number;
the confidence branch is used for carrying out regression on each point on the feature map to obtain the confidence score of each point, and then a plurality of key points with highest scores are screened through the confidence scores;
the position branch is used for predicting the position of the key point;
the descriptor branch is used to output descriptors of feature points.
The Backbone of the neural network model adopts a feature pyramid structure; specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result are added element-wise; at the same time, a residual attention mechanism is added in the convolution.
Up-sampling inserts new elements by interpolation on the basis of the original feature map, so that the feature size becomes several times the original.
The size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, where H is the height of the original image and W is its width.
The Confidence branch specifically consists of two convolution layers, with a Sigmoid activation function.
The Position branch specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function.
The offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the x coordinate in the original image, $P^{y}_{map}(x,y)$ is the y coordinate in the original image, $\Delta x$ is the offset in the x-axis direction, $\Delta y$ is the offset in the y-axis direction, and $f_{down}$ is a set value (the down-sampling factor).
The Descriptor branch specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch.
The neural network model specifically adopts the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss; $w_1\sim w_4$ are weight coefficients; $L_{ut}$ is the unsupervised point loss; $L_{up}$ is the point distribution loss, used to reduce the unreliability of edge points; $L_{des}$ is the descriptor loss, used to optimize the descriptors; and $L_{dec}$ is the decorrelated feature descriptor loss, used to reduce overfitting.
The unsupervised point loss $L_{ut}$ combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points.
The position loss $l_k^{position}$ is specifically the position distance $d_k$ of the k-th joint point pair.
The confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair; log denotes the logarithm to an arbitrary base.
The related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair.
The point distribution loss $L_{up}$ is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large, and $\tilde{y}_i$ are the y values sorted from small to large.
The descriptor loss $L_{des}$ is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$ indicates whether the two cells correspond, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin.
The decorrelated feature descriptor loss $L_{dec}$ is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value.
In the video point location tracking method based on self-supervision training provided by the invention, a neural network is constructed and trained to obtain a video network model, and this model is used to locate and track the target point of the target video; video point tracking of the target video is thus realized without distortion or deformation during tracking, and the method has high reliability, good real-time performance and a good effect.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a neural network model of the method of the present invention.
Fig. 3 is a schematic structural diagram of the Backbone of the neural network model of the method of the present invention.
Fig. 4 is a schematic diagram of the self-supervision training process of the method of the present invention.
FIG. 5 is a schematic diagram showing the effect of an embodiment of the method of the present invention.
Detailed Description
A schematic process flow diagram of the method of the present invention is shown in fig. 1. The video point location tracking method based on self-supervision training provided by the invention comprises the following steps:
S1, acquiring a training data set;
in a specific implementation, the "Malanshan Cup" video specific-point tracking data set may be employed;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores;
in a specific implementation, the following model is used as the neural network model (as shown in fig. 2):
A. a three-layer neural network model is constructed, comprising a Backbone, a Feature map, a Confidence branch, a Position branch and a Descriptor branch; the input of the Backbone is the input of the neural network model, the output of the Backbone is connected to the input of the Feature map, and the output of the Feature map is connected simultaneously to the inputs of the Confidence branch, the Position branch and the Descriptor branch;
B. the Backbone of the neural network model adopts a feature pyramid structure (as shown in fig. 3); specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result (obtained by inserting new elements through interpolation, the feature size becoming several times the original) are added element-wise; meanwhile, a residual attention mechanism is added in the convolution, which effectively enhances the feature map;
C. in the Feature map of the neural network model, one pixel corresponds to an N×N pixel block in the original image (for example, an 8×8 pixel block), and the layers are connected by concat so that multi-layer features are fused; N is a natural number;
the size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, where H is the height of the original image and W is its width;
the Confidence branch regresses a confidence score between 0 and 1 for each point on the Feature map, and the key points with the highest scores (preferably N of them) are then selected according to these confidence scores; it specifically consists of two convolution layers, with a Sigmoid activation function;
the Position branch predicts the positions of the key points; it specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function;
assuming an input of size 24×24, after 8× down-sampling this branch predicts the positions of 3×3 = 9 key points;
in a specific implementation, the offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the original-image x coordinate, $P^{y}_{map}(x,y)$ is the original-image y coordinate, $\Delta x$ is the x-axis offset, $\Delta y$ is the y-axis offset, and $f_{down}$ is a set value (the down-sampling factor); calculating the offset in this way distributes the interest points more uniformly, which improves the perspective transformation effect, and the calculation is differentiable, which facilitates the subsequent self-supervision training;
the Descriptor branch outputs the descriptors of the feature points; it specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch;
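Before turning to the loss function, the model structure described above can be illustrated with a minimal PyTorch-style sketch. It follows the stated design (feature-pyramid fusion with 1×1 convolution and 2× up-sampling, a two-convolution Confidence branch with Sigmoid output, a two-convolution Position branch with channels 256 and 2 regressing offsets, a two-convolution Descriptor branch with 256 channels interpolated at the predicted coordinates), but the kernel sizes, the inner ReLU activations, the 1×1 channel reduction on the high-level feature, the omission of the residual attention block and the exact position formula are assumptions made for illustration, not details fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Feature-pyramid fusion: 2x up-sample the higher-level feature, change the
    channel number of the lower-level feature with a 1x1 convolution, then add."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.reduce = nn.Conv2d(high_ch, out_ch, kernel_size=1)  # channel match (illustrative)

    def forward(self, low, high):
        high = F.interpolate(self.reduce(high), scale_factor=2,
                             mode="bilinear", align_corners=False)
        return self.lateral(low) + high


class PointHeads(nn.Module):
    """Confidence, Position and Descriptor branches over a shared Feature map.
    f_down = N: one feature-map cell covers an N x N block of the original image."""
    def __init__(self, feat_ch=256, f_down=8, desc_dim=256):
        super().__init__()
        self.f_down = f_down
        self.conf = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(256, 1, 3, padding=1), nn.Sigmoid())
        self.pos = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(256, 2, 3, padding=1), nn.Sigmoid())
        self.desc = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(256, desc_dim, 3, padding=1))

    def forward(self, feat):
        conf = self.conf(feat)              # B x 1 x H/N x W/N, scores in (0, 1)
        offsets = self.pos(feat)            # B x 2 x H/N x W/N, relative offsets in (0, 1)
        coarse_desc = self.desc(feat)       # B x D x H/N x W/N, coarse descriptors

        _, _, hc, wc = feat.shape
        ys, xs = torch.meshgrid(torch.arange(hc, device=feat.device),
                                torch.arange(wc, device=feat.device), indexing="ij")
        cell = torch.stack([xs, ys], dim=0).float() * self.f_down   # cell corners in pixels
        # Predicted position = original-image cell coordinate + offset scaled by f_down.
        positions = cell.unsqueeze(0) + offsets * self.f_down
        return conf, positions, coarse_desc


def sample_descriptors(coarse_desc, positions, image_hw):
    """Bilinearly interpolate the coarse descriptor map at the predicted coordinates."""
    h, w = image_hw
    norm_x = positions[:, 0] / (w - 1) * 2 - 1      # normalise to [-1, 1] for grid_sample
    norm_y = positions[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([norm_x, norm_y], dim=-1)    # B x H/N x W/N x 2 (x, y)
    desc = F.grid_sample(coarse_desc, grid, mode="bilinear", align_corners=True)
    return F.normalize(desc, dim=1)                 # B x D x H/N x W/N
```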
the neural network model uses the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss, $w_1\sim w_4$ are weight coefficients, $L_{ut}$ is the unsupervised point loss, $L_{up}$ is a regularization term, the point distribution loss, $L_{des}$ is a regularization term used to optimize the descriptors, and $L_{dec}$ is the decorrelated feature descriptor loss.
For $L_{ut}$, the unsupervised point loss combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points.
Wherein:
the position loss $l_k^{position}$ is the position distance $d_k$ of the joint point pair and is used to improve the stability of the point positions under different viewing angles: a position in picture A, after the T transformation, should coincide with the corresponding position in picture B;
the confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair and is used to reconcile the confidence scores of points under different viewing angles; log denotes the logarithm to an arbitrary base;
the related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair, and is used to make the confidence predicted by the network reflect the probability of reproducibility of the key point coordinates;
for $L_{up}$, the point distribution loss is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large, and $\tilde{y}_i$ are the y values sorted from small to large;
for $L_{des}$, the descriptor loss is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$ indicates whether the two cells correspond, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin;
for $L_{dec}$, the decorrelated feature descriptor loss is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value;
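The following PyTorch sketch illustrates the three regularization terms whose structure is stated above. It is a minimal illustration under assumptions: the uniform-distribution form of $L_{up}$, the hinge form of $L_{des}$ and the off-diagonal correlation form of $L_{dec}$ follow the symbol definitions given in the text, but the reductions, margins and normalizations are illustrative choices rather than values fixed by the patent:

```python
import torch


def point_distribution_loss(xs, ys):
    """L_up: push sorted, normalised point coordinates towards a uniform spacing."""
    def axis_term(v):
        v_sorted, _ = torch.sort(v)                          # values from small to large
        m = v_sorted.numel()
        target = torch.linspace(0.0, 1.0, m, device=v.device)
        return ((v_sorted - target) ** 2).sum()
    return axis_term(xs) + axis_term(ys)


def descriptor_loss(desc_a_warped, desc_b, correspond, alpha_d=1.0, m_p=1.0, m_n=0.2):
    """L_des: hinge loss over cell pairs; `correspond` is a {0,1} matrix marking which
    cells of picture A (warped by T) and picture B correspond."""
    sim = desc_a_warped @ desc_b.t()                         # all pairwise dot products
    pos = torch.clamp(m_p - sim, min=0.0)
    neg = torch.clamp(sim - m_n, min=0.0)
    return (alpha_d * correspond * pos + (1.0 - correspond) * neg).mean()


def decorrelation_loss(desc_a, desc_b):
    """L_dec: penalise off-diagonal entries of the F x F descriptor correlation matrices."""
    def off_diag_corr(d):                                    # d: num_points x F
        d = d - d.mean(dim=0, keepdim=True)
        d = d / (d.norm(dim=0, keepdim=True) + 1e-8)
        r = d.t() @ d                                        # F x F correlation matrix R
        f = r.shape[0]
        return (r.sum() - r.diagonal().sum()) / (f * (f - 1))
    return off_diag_corr(desc_a) + off_diag_corr(desc_b)
```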
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model; specifically, an image enhancement algorithm (including randomly rotating the image several times, scaling the image, and so on) is used to form new pictures, and the neural network model constructed in step S2 is then trained in a self-supervised manner (as shown in fig. 4);
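A minimal sketch of this picture-forming step, assuming OpenCV; the angle and scale ranges are illustrative values, not ones specified by the text. The returned 3×3 matrix plays the role of the homography operator T used to align the points of the original picture during training:

```python
import cv2
import numpy as np


def random_warp(image, max_angle=30.0, scale_range=(0.8, 1.2)):
    """Form a new picture by random rotation and scaling, and return the 3x3
    transform T so the points of the original picture can be aligned to it."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(*scale_range)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)   # 2x3 matrix
    warped = cv2.warpAffine(image, m, (w, h))
    T = np.vstack([m, [0.0, 0.0, 1.0]]).astype(np.float32)          # lift to 3x3
    return warped, T
```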
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video; specifically, the video network model obtained in step S3 is used to extract the feature points of the target video, thereby obtaining the key point positions and the feature descriptors;
S5, matching and screening the feature descriptors of successive frames of the target video; the matching and screening are performed as follows:
for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores;
in the implementation, let po ∈ {A, B} index the point sets extracted from two pictures A and B, where picture B is the transformed picture obtained by applying a homography operator T (scaling, translation, rotation, perspective) to picture A, and the positions of the points of picture A are likewise transformed by T for alignment. The network outputs three tensors C(po), P(po) and F(po) for each picture; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements

$$g_{ij}=\lVert \hat{a}_i-b_j\rVert_2,$$

where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance; if the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q; indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is

$$d_k=\lVert \hat{a}_k-b_k\rVert_2;$$
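A sketch of the descriptor matching and screening step (S5) in NumPy; the mutual nearest-neighbour check is an added implementation detail, not something stated in the text, and the threshold value is illustrative:

```python
import numpy as np


def match_descriptors(desc1, desc2, t=0.7):
    """Match the descriptors of point sets Q1 and Q2 by L2 distance, delete pairs
    whose distance exceeds the threshold t, and rank the pairs by distance score."""
    # Pairwise L2 distance matrix between the two descriptor sets.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    pairs = []
    for i in range(d.shape[0]):
        j = int(np.argmin(d[i]))                  # best candidate in the other frame
        if d[i, j] > t:
            continue                              # screen out distant matches
        if int(np.argmin(d[:, j])) == i:          # keep mutually best pairs only
            pairs.append((i, j, float(d[i, j])))
    pairs.sort(key=lambda p: p[2])                # pair according to the distance score
    return pairs
```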
S6, constructing a homography matrix from the matching and screening results of step S5; specifically, the homography matrix H is constructed from the paired point sets obtained in step S5; the homography matrix describes different viewing angles of the same object, and since the camera does not move very far between frames of the same scene, using a homography matrix to solve for the position transformation is appropriate;
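A sketch of step S6 with OpenCV, assuming the paired point sets produced by the previous step; the use of RANSAC to suppress remaining mismatches is an implementation choice, not something mandated by the text:

```python
import cv2
import numpy as np


def build_homography(points1, points2, pairs):
    """Construct the homography matrix H from the paired point sets of two frames."""
    src = np.float32([points1[i] for i, j, _ in pairs]).reshape(-1, 1, 2)
    dst = np.float32([points2[j] for i, j, _ in pairs]).reshape(-1, 1, 2)
    H, _mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```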
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking; specifically, the transformed target position is calculated with the following formula:

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
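Step S7 then amounts to applying the formula above to the target point given in the first frame; a direct NumPy transcription:

```python
import numpy as np


def transform_point(H, x2, y2):
    """Map the target point (x2, y2) of the first frame through H to obtain the
    transformed target position (x1, y1)."""
    denom = H[2, 0] * x2 + H[2, 1] * y2 + H[2, 2]
    x1 = (H[0, 0] * x2 + H[0, 1] * y2 + H[0, 2]) / denom
    y1 = (H[1, 0] * x2 + H[1, 1] * y2 + H[1, 2]) / denom
    return x1, y1
```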
Fig. 5 is a schematic diagram of the effect of the method of the present invention in practice. As can be seen from fig. 5, the region with the X-shaped mark always remains clear and is neither distorted nor deformed, which shows that the method of the invention has high reliability and a good effect.

Claims (6)

1. A video point location tracking method based on self-supervision training comprises the following steps:
S1, acquiring a training data set;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores; the following model is specifically adopted as the neural network model:
A. a three-layer neural network model is constructed, comprising a Backbone, a Feature map, a Confidence branch, a Position branch and a Descriptor branch; the input of the Backbone is the input of the neural network model, the output of the Backbone is connected to the input of the Feature map, and the output of the Feature map is connected simultaneously to the inputs of the Confidence branch, the Position branch and the Descriptor branch;
B. the Backbone of the neural network model adopts a feature pyramid structure; specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result are added element-wise; meanwhile, a residual attention mechanism is added in the convolution;
C. in the Feature map of the neural network model, one pixel corresponds to an N×N pixel block in the original image, and the layers are connected by concat so that multi-layer features are fused; N is a natural number;
the Confidence branch regresses a confidence score for each point on the Feature map, and the key points with the highest scores are then selected according to these confidence scores;
the Position branch predicts the positions of the key points;
the Descriptor branch outputs the descriptors of the feature points;
wherein the size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, H being the height of the original image and W its width; the Confidence branch specifically consists of two convolution layers, with a Sigmoid activation function; the Position branch specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function; the offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the original-image x coordinate, $P^{y}_{map}(x,y)$ is the original-image y coordinate, $\Delta x$ is the offset in the x-axis direction, $\Delta y$ is the offset in the y-axis direction, and $f_{down}$ is a set value; the Descriptor branch specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch;
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model;
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video;
S5, matching and screening the feature descriptors of successive frames of the target video;
S6, constructing a homography matrix from the matching and screening results of step S5;
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking.
2. The video point location tracking method based on self-supervision training according to claim 1, wherein the matching and screening of the feature descriptors of successive frames of the target video in step S5 are specifically performed as follows: for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores.
3. The video point location tracking method based on self-supervision training according to claim 2, wherein the matching and screening of the feature descriptors of successive frames of the target video are specifically defined as follows: the point sets extracted from two pictures A and B are indexed by po ∈ {A, B}; the network outputs three tensors C(po), P(po) and F(po) each time; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements $g_{ij}=\lVert\hat{a}_i-b_j\rVert_2$, where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance; if the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q; indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is $d_k=\lVert\hat{a}_k-b_k\rVert_2$.
4. The video point location tracking method based on self-supervision training according to claim 3, wherein in step S7 the transformed target position is obtained from the target point given in the first frame of the target video and the homography matrix obtained in step S6, specifically by calculating

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
5. The video point location tracking method based on self-supervision training according to claim 4, wherein the neural network model specifically uses the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss; $w_1\sim w_4$ are weight coefficients; $L_{ut}$ is the unsupervised point loss; $L_{up}$ is the point distribution loss; $L_{des}$ is the descriptor loss; and $L_{dec}$ is the decorrelated feature descriptor loss.
6. The video point location tracking method based on self-supervision training according to claim 5, wherein the unsupervised point loss $L_{ut}$ combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points;
the position loss $l_k^{position}$ is specifically the position distance $d_k$ of the k-th joint point pair;
the confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair; log denotes the logarithm to an arbitrary base;
the related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair;
the point distribution loss $L_{up}$ is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large and $\tilde{y}_i$ are the y values sorted from small to large;
the descriptor loss $L_{des}$ is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin;
the decorrelated feature descriptor loss $L_{dec}$ is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value.
CN202010946922.7A 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training Active CN112084952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946922.7A CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010946922.7A CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Publications (2)

Publication Number Publication Date
CN112084952A CN112084952A (en) 2020-12-15
CN112084952B true CN112084952B (en) 2023-08-15

Family

ID=73733156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946922.7A Active CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Country Status (1)

Country Link
CN (1) CN112084952B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988B (en) * 2021-03-04 2022-03-11 西南科技大学 Feature matching method based on attention-seeking neural network
CN113657528B (en) * 2021-08-24 2024-02-13 湖南国科微电子股份有限公司 Image feature point extraction method and device, computer terminal and storage medium
CN115661724B (en) * 2022-12-12 2023-03-28 内江师范学院 Network model and training method suitable for homography transformation of continuous frame sequence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210808A (en) * 2016-08-08 2016-12-07 腾讯科技(深圳)有限公司 Media information put-on method, terminal, server and system
CN108446634A (en) * 2018-03-20 2018-08-24 北京天睿空间科技股份有限公司 The aircraft combined based on video analysis and location information continues tracking
CN109872346A (en) * 2019-03-11 2019-06-11 南京邮电大学 A kind of method for tracking target for supporting Recognition with Recurrent Neural Network confrontation study
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701210B (en) * 2016-02-02 2021-08-17 北京市商汤科技开发有限公司 Method and system for CNN network adaptation and object online tracking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210808A (en) * 2016-08-08 2016-12-07 腾讯科技(深圳)有限公司 Media information put-on method, terminal, server and system
CN108446634A (en) * 2018-03-20 2018-08-24 北京天睿空间科技股份有限公司 The aircraft combined based on video analysis and location information continues tracking
CN109872346A (en) * 2019-03-11 2019-06-11 南京邮电大学 A kind of method for tracking target for supporting Recognition with Recurrent Neural Network confrontation study
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ling; Wang Hui; Wang Peng; Li Yanfang. Video target tracking algorithm based on FasterMDNet. Computer Engineering and Applications, 2019, (14), full text. *

Also Published As

Publication number Publication date
CN112084952A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN113196289B (en) Human body action recognition method, human body action recognition system and equipment
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN107529650B (en) Closed loop detection method and device and computer equipment
CN108960211B (en) Multi-target human body posture detection method and system
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111931686B (en) Video satellite target tracking method based on background knowledge enhancement
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN114140623A (en) Image feature point extraction method and system
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
CN112329771A (en) Building material sample identification method based on deep learning
CN116977674A (en) Image matching method, related device, storage medium and program product
CN111260687A (en) Aerial video target tracking method based on semantic perception network and related filtering
CN114882537A (en) Finger new visual angle image generation method based on nerve radiation field
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113763417A (en) Target tracking method based on twin network and residual error structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant