CN112084952B - Video point location tracking method based on self-supervision training - Google Patents

Video point location tracking method based on self-supervision training

Info

Publication number
CN112084952B
CN112084952B (application CN202010946922.7A)
Authority
CN
China
Prior art keywords
point
video
feature
target
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010946922.7A
Other languages
Chinese (zh)
Other versions
CN112084952A (en)
Inventor
李智勇
王赛舟
肖德贵
李仁发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202010946922.7A priority Critical patent/CN112084952B/en
Publication of CN112084952A publication Critical patent/CN112084952A/en
Application granted granted Critical
Publication of CN112084952B publication Critical patent/CN112084952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video point location tracking method based on self-supervision training. The method comprises: acquiring a training data set; constructing a neural network model that extracts feature point positions and descriptors and computes key point confidence scores; performing self-supervision training on the neural network model to obtain a trained video network model; extracting feature descriptors of a target video; matching and screening the feature descriptors of successive frames of the target video and constructing a homography matrix; and obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix, thereby completing video point location tracking. By constructing and training a neural network to obtain the video network model and using that model to locate and track target points in the target video, the method realizes video point tracking of the target video without distortion or deformation during tracking, and offers high reliability, good real-time performance and good results.

Description

Video point location tracking method based on self-supervision training
Technical Field
The invention belongs to the field of image processing, and particularly relates to a video point location tracking method based on self-supervision training.
Background
With the development of technology and the economy and the improvement of living standards, watching video has become an indispensable part of people's work and daily life.
In a video, the scene often stays the same over a period of time. How to dynamically add icons (such as advertisements, warning text, etc.) at specific positions within the same scene, with a realistic and harmonious result, has long been one of the focuses of researchers.
The addition of dynamic icons faces several challenges: 1. accurately estimating the camera motion; 2. handling the influence of illumination; 3. properly handling depth-of-field deformation; 4. mutual occlusion of objects; and so on. For these challenges, the current solution is to extract feature points with the SIFT method and match the feature points of consecutive image frames to obtain a transformation matrix: from the target position given in the first frame, the position of the target in the subsequent frames can be obtained.
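For reference, this conventional pipeline can be sketched with OpenCV as follows; the detector settings, the ratio test and the RANSAC threshold are generic illustrative choices, not parameters taken from this patent:

```python
import cv2
import numpy as np


def sift_track(frame_a, frame_b, target_xy):
    """Prior-art style tracking: SIFT keypoints + descriptor matching + homography."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)

    # Brute-force matching with a ratio test to discard ambiguous correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Transformation matrix between the two frames (requires at least 4 good matches).
    H, _mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Position of the target given in the first frame, mapped into the next frame.
    point = np.float32([[target_xy]])
    return cv2.perspectiveTransform(point, H)[0, 0]
```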
However, the existing methods have drawbacks: because they rely on local detection to extract feature points, the extracted feature points tend to have inaccurate positions; inaccurate feature point positions, reflected in the video, make the added label drift and distort; in addition, the current methods involve a large amount of computation, suffer serious latency, and have poor real-time performance.
Disclosure of Invention
The invention aims to provide a video point location tracking method based on self-supervision training, which has high reliability, good real-time performance and good effect.
The video point location tracking method based on self-supervision training provided by the invention comprises the following steps:
S1, acquiring a training data set;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores;
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model;
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video;
S5, matching and screening the feature descriptors of successive frames of the target video;
S6, constructing a homography matrix from the matching and screening results of step S5;
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking.
The training data set described in step S1 is the "Malanshan Cup" video specific-point tracking data set.
In step S3, the self-supervision training of the neural network model constructed in step S2 is performed, specifically, after new pictures have been formed with an image enhancement algorithm.
Forming new pictures with the image enhancement algorithm specifically comprises randomly rotating the image several times and scaling the image.
In step S4, the target video to be point-location tracked is input into the video network model obtained in step S3 so as to extract its feature descriptors; specifically, the video network model obtained in step S3 is used to extract the feature points of the target video, thereby obtaining the key point positions and the feature descriptors.
Step S5 matches and screens the feature descriptors of successive frames of the target video, specifically as follows:
for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores.
The matching and screening of the feature descriptors is defined as follows. Let po ∈ {A, B} index the point sets extracted from two pictures A and B, where picture B is the transformed picture obtained by applying a homography operator T (scaling, translation, rotation, perspective) to picture A, and the positions of the points of picture A are likewise transformed by T for alignment. If a feature point extracted from the transformed picture and the correspondingly transformed feature point extracted from the picture before transformation are sufficiently close after alignment, the point pair can be used to compute the loss function. The network outputs three tensors C(po), P(po) and F(po) for each picture; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements

$$g_{ij}=\lVert \hat{a}_i-b_j\rVert_2,$$

where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance. If the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q. Indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is

$$d_k=\lVert \hat{a}_k-b_k\rVert_2.$$
In step S6, the homography matrix is constructed from the matching and screening results of step S5; specifically, the homography matrix H is constructed from the paired point sets obtained in step S5.
In step S7, the transformed target position is obtained from the target point given in the first frame of the target video and the homography matrix obtained in step S6; specifically, the transformed target position is calculated as

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
the step S2 of constructing the neural network model is specifically to adopt the following model as the neural network model:
A. constructing a three-layer neural network model, including a backup, a Feature map, a Confidence branch, a Position branch and a distrubutor branch; the input end of the backbond is the input end of the neural network model, the output end of the backbond is connected with the input end of the Feature map, and the output end of the Feature map is simultaneously connected with the input end of the Confidence branch, the input end of the Position branch and the input end of the distriptor branch;
B. the Backbone of the neural network model adopts a characteristic pyramid structure;
C. in the Feature map of the neural network model, a pixel point corresponds to an N multiplied by N pixel block in the original image, and is connected through a concat, so that the multi-layer Feature is fused; n is a natural number;
the confidence branch is used for carrying out regression on each point on the feature map to obtain the confidence score of each point, and then a plurality of key points with highest scores are screened through the confidence scores;
the position branch is used for predicting the position of the key point;
the descriptor branch is used to output descriptors of feature points.
The Backbone of the neural network model adopts a feature pyramid structure; specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result are added element-wise; at the same time, a residual attention mechanism is added in the convolution.
Up-sampling inserts new elements by interpolation on the basis of the original feature map, so that the feature size becomes several times the original.
The size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, where H is the height of the original image and W is its width.
The Confidence branch specifically consists of two convolution layers, with a Sigmoid activation function.
The Position branch specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function.
The offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the x coordinate in the original image, $P^{y}_{map}(x,y)$ is the y coordinate in the original image, $\Delta x$ is the offset in the x-axis direction, $\Delta y$ is the offset in the y-axis direction, and $f_{down}$ is a set value (the down-sampling factor).
The Descriptor branch specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch.
The neural network model specifically adopts the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss; $w_1\sim w_4$ are weight coefficients; $L_{ut}$ is the unsupervised point loss; $L_{up}$ is the point distribution loss, used to reduce the unreliability of edge points; $L_{des}$ is the descriptor loss, used to optimize the descriptors; and $L_{dec}$ is the decorrelated feature descriptor loss, used to reduce overfitting.
The unsupervised point loss $L_{ut}$ combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points.
The position loss $l_k^{position}$ is specifically the position distance $d_k$ of the k-th joint point pair.
The confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair; log denotes the logarithm to an arbitrary base.
The related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair.
The point distribution loss $L_{up}$ is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large, and $\tilde{y}_i$ are the y values sorted from small to large.
The descriptor loss $L_{des}$ is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$ indicates whether the two cells correspond, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin.
The decorrelated feature descriptor loss $L_{dec}$ is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value.
In the video point location tracking method based on self-supervision training provided by the invention, a neural network is constructed and trained to obtain a video network model, and this model is used to locate and track the target point of the target video; video point tracking of the target video is thus realized without distortion or deformation during tracking, and the method has high reliability, good real-time performance and a good effect.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a neural network model of the method of the present invention.
Fig. 3 is a schematic structural diagram of the Backbone of the neural network model of the method of the present invention.
Fig. 4 is a schematic diagram of the self-supervision training process of the method of the present invention.
FIG. 5 is a schematic diagram showing the effect of an embodiment of the method of the present invention.
Detailed Description
A schematic process flow diagram of the method of the present invention is shown in fig. 1. The video point location tracking method based on self-supervision training provided by the invention comprises the following steps:
S1, acquiring a training data set;
in a specific implementation, the "Malanshan Cup" video specific-point tracking data set may be employed;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores;
in a specific implementation, the following model is used as the neural network model (as shown in fig. 2):
A. a three-layer neural network model is constructed, comprising a Backbone, a Feature map, a Confidence branch, a Position branch and a Descriptor branch; the input of the Backbone is the input of the neural network model, the output of the Backbone is connected to the input of the Feature map, and the output of the Feature map is connected simultaneously to the inputs of the Confidence branch, the Position branch and the Descriptor branch;
B. the Backbone of the neural network model adopts a feature pyramid structure (as shown in fig. 3); specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result (obtained by inserting new elements through interpolation, the feature size becoming several times the original) are added element-wise; meanwhile, a residual attention mechanism is added in the convolution, which effectively enhances the feature map;
C. in the Feature map of the neural network model, one pixel corresponds to an N×N pixel block in the original image (for example, an 8×8 pixel block), and the layers are connected by concat so that multi-layer features are fused; N is a natural number;
the size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, where H is the height of the original image and W is its width;
the Confidence branch regresses a confidence score between 0 and 1 for each point on the Feature map, and the key points with the highest scores (preferably N of them) are then selected according to these confidence scores; it specifically consists of two convolution layers, with a Sigmoid activation function;
the Position branch predicts the positions of the key points; it specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function;
assuming an input of size 24×24, after 8× down-sampling this branch predicts the positions of 3×3 = 9 key points;
in a specific implementation, the offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the original-image x coordinate, $P^{y}_{map}(x,y)$ is the original-image y coordinate, $\Delta x$ is the x-axis offset, $\Delta y$ is the y-axis offset, and $f_{down}$ is a set value (the down-sampling factor); calculating the offset in this way distributes the interest points more uniformly, which improves the perspective transformation effect, and the calculation is differentiable, which facilitates the subsequent self-supervision training;
the Descriptor branch outputs the descriptors of the feature points; it specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch;
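Before turning to the loss function, the model structure described above can be illustrated with a minimal PyTorch-style sketch. It follows the stated design (feature-pyramid fusion with 1×1 convolution and 2× up-sampling, a two-convolution Confidence branch with Sigmoid output, a two-convolution Position branch with channels 256 and 2 regressing offsets, a two-convolution Descriptor branch with 256 channels interpolated at the predicted coordinates), but the kernel sizes, the inner ReLU activations, the 1×1 channel reduction on the high-level feature, the omission of the residual attention block and the exact position formula are assumptions made for illustration, not details fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Feature-pyramid fusion: 2x up-sample the higher-level feature, change the
    channel number of the lower-level feature with a 1x1 convolution, then add."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.reduce = nn.Conv2d(high_ch, out_ch, kernel_size=1)  # channel match (illustrative)

    def forward(self, low, high):
        high = F.interpolate(self.reduce(high), scale_factor=2,
                             mode="bilinear", align_corners=False)
        return self.lateral(low) + high


class PointHeads(nn.Module):
    """Confidence, Position and Descriptor branches over a shared Feature map.
    f_down = N: one feature-map cell covers an N x N block of the original image."""
    def __init__(self, feat_ch=256, f_down=8, desc_dim=256):
        super().__init__()
        self.f_down = f_down
        self.conf = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(256, 1, 3, padding=1), nn.Sigmoid())
        self.pos = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(256, 2, 3, padding=1), nn.Sigmoid())
        self.desc = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(256, desc_dim, 3, padding=1))

    def forward(self, feat):
        conf = self.conf(feat)              # B x 1 x H/N x W/N, scores in (0, 1)
        offsets = self.pos(feat)            # B x 2 x H/N x W/N, relative offsets in (0, 1)
        coarse_desc = self.desc(feat)       # B x D x H/N x W/N, coarse descriptors

        _, _, hc, wc = feat.shape
        ys, xs = torch.meshgrid(torch.arange(hc, device=feat.device),
                                torch.arange(wc, device=feat.device), indexing="ij")
        cell = torch.stack([xs, ys], dim=0).float() * self.f_down   # cell corners in pixels
        # Predicted position = original-image cell coordinate + offset scaled by f_down.
        positions = cell.unsqueeze(0) + offsets * self.f_down
        return conf, positions, coarse_desc


def sample_descriptors(coarse_desc, positions, image_hw):
    """Bilinearly interpolate the coarse descriptor map at the predicted coordinates."""
    h, w = image_hw
    norm_x = positions[:, 0] / (w - 1) * 2 - 1      # normalise to [-1, 1] for grid_sample
    norm_y = positions[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([norm_x, norm_y], dim=-1)    # B x H/N x W/N x 2 (x, y)
    desc = F.grid_sample(coarse_desc, grid, mode="bilinear", align_corners=True)
    return F.normalize(desc, dim=1)                 # B x D x H/N x W/N
```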
the neural network model uses the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss, $w_1\sim w_4$ are weight coefficients, $L_{ut}$ is the unsupervised point loss, $L_{up}$ is a regularization term, the point distribution loss, $L_{des}$ is a regularization term used to optimize the descriptors, and $L_{dec}$ is the decorrelated feature descriptor loss.
For $L_{ut}$, the unsupervised point loss combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points.
Wherein:
the position loss $l_k^{position}$ is the position distance $d_k$ of the joint point pair and is used to improve the stability of the point positions under different viewing angles: a position in picture A, after the T transformation, should coincide with the corresponding position in picture B;
the confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair and is used to reconcile the confidence scores of points under different viewing angles; log denotes the logarithm to an arbitrary base;
the related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair, and is used to make the confidence predicted by the network reflect the probability of reproducibility of the key point coordinates;
for $L_{up}$, the point distribution loss is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large, and $\tilde{y}_i$ are the y values sorted from small to large;
for $L_{des}$, the descriptor loss is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$ indicates whether the two cells correspond, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin;
for $L_{dec}$, the decorrelated feature descriptor loss is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value;
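The following PyTorch sketch illustrates the three regularization terms whose structure is stated above. It is a minimal illustration under assumptions: the uniform-distribution form of $L_{up}$, the hinge form of $L_{des}$ and the off-diagonal correlation form of $L_{dec}$ follow the symbol definitions given in the text, but the reductions, margins and normalizations are illustrative choices rather than values fixed by the patent:

```python
import torch


def point_distribution_loss(xs, ys):
    """L_up: push sorted, normalised point coordinates towards a uniform spacing."""
    def axis_term(v):
        v_sorted, _ = torch.sort(v)                          # values from small to large
        m = v_sorted.numel()
        target = torch.linspace(0.0, 1.0, m, device=v.device)
        return ((v_sorted - target) ** 2).sum()
    return axis_term(xs) + axis_term(ys)


def descriptor_loss(desc_a_warped, desc_b, correspond, alpha_d=1.0, m_p=1.0, m_n=0.2):
    """L_des: hinge loss over cell pairs; `correspond` is a {0,1} matrix marking which
    cells of picture A (warped by T) and picture B correspond."""
    sim = desc_a_warped @ desc_b.t()                         # all pairwise dot products
    pos = torch.clamp(m_p - sim, min=0.0)
    neg = torch.clamp(sim - m_n, min=0.0)
    return (alpha_d * correspond * pos + (1.0 - correspond) * neg).mean()


def decorrelation_loss(desc_a, desc_b):
    """L_dec: penalise off-diagonal entries of the F x F descriptor correlation matrices."""
    def off_diag_corr(d):                                    # d: num_points x F
        d = d - d.mean(dim=0, keepdim=True)
        d = d / (d.norm(dim=0, keepdim=True) + 1e-8)
        r = d.t() @ d                                        # F x F correlation matrix R
        f = r.shape[0]
        return (r.sum() - r.diagonal().sum()) / (f * (f - 1))
    return off_diag_corr(desc_a) + off_diag_corr(desc_b)
```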
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model; specifically, an image enhancement algorithm (including randomly rotating the image several times, scaling the image, and so on) is used to form new pictures, and the neural network model constructed in step S2 is then trained in a self-supervised manner (as shown in fig. 4);
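A minimal sketch of this picture-forming step, assuming OpenCV; the angle and scale ranges are illustrative values, not ones specified by the text. The returned 3×3 matrix plays the role of the homography operator T used to align the points of the original picture during training:

```python
import cv2
import numpy as np


def random_warp(image, max_angle=30.0, scale_range=(0.8, 1.2)):
    """Form a new picture by random rotation and scaling, and return the 3x3
    transform T so the points of the original picture can be aligned to it."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(*scale_range)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)   # 2x3 matrix
    warped = cv2.warpAffine(image, m, (w, h))
    T = np.vstack([m, [0.0, 0.0, 1.0]]).astype(np.float32)          # lift to 3x3
    return warped, T
```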
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video; specifically, the video network model obtained in step S3 is used to extract the feature points of the target video, thereby obtaining the key point positions and the feature descriptors;
S5, matching and screening the feature descriptors of successive frames of the target video; the matching and screening are performed as follows:
for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores;
in the implementation, let po ∈ {A, B} index the point sets extracted from two pictures A and B, where picture B is the transformed picture obtained by applying a homography operator T (scaling, translation, rotation, perspective) to picture A, and the positions of the points of picture A are likewise transformed by T for alignment. The network outputs three tensors C(po), P(po) and F(po) for each picture; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements

$$g_{ij}=\lVert \hat{a}_i-b_j\rVert_2,$$

where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance; if the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q; indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is

$$d_k=\lVert \hat{a}_k-b_k\rVert_2;$$
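A sketch of the descriptor matching and screening step (S5) in NumPy; the mutual nearest-neighbour check is an added implementation detail, not something stated in the text, and the threshold value is illustrative:

```python
import numpy as np


def match_descriptors(desc1, desc2, t=0.7):
    """Match the descriptors of point sets Q1 and Q2 by L2 distance, delete pairs
    whose distance exceeds the threshold t, and rank the pairs by distance score."""
    # Pairwise L2 distance matrix between the two descriptor sets.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    pairs = []
    for i in range(d.shape[0]):
        j = int(np.argmin(d[i]))                  # best candidate in the other frame
        if d[i, j] > t:
            continue                              # screen out distant matches
        if int(np.argmin(d[:, j])) == i:          # keep mutually best pairs only
            pairs.append((i, j, float(d[i, j])))
    pairs.sort(key=lambda p: p[2])                # pair according to the distance score
    return pairs
```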
S6, constructing a homography matrix from the matching and screening results of step S5; specifically, the homography matrix H is constructed from the paired point sets obtained in step S5; the homography matrix describes different viewing angles of the same object, and since the camera does not move very far between frames of the same scene, using a homography matrix to solve for the position transformation is appropriate;
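A sketch of step S6 with OpenCV, assuming the paired point sets produced by the previous step; the use of RANSAC to suppress remaining mismatches is an implementation choice, not something mandated by the text:

```python
import cv2
import numpy as np


def build_homography(points1, points2, pairs):
    """Construct the homography matrix H from the paired point sets of two frames."""
    src = np.float32([points1[i] for i, j, _ in pairs]).reshape(-1, 1, 2)
    dst = np.float32([points2[j] for i, j, _ in pairs]).reshape(-1, 1, 2)
    H, _mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```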
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking; specifically, the transformed target position is calculated with the following formula:

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
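Step S7 then amounts to applying the formula above to the target point given in the first frame; a direct NumPy transcription:

```python
import numpy as np


def transform_point(H, x2, y2):
    """Map the target point (x2, y2) of the first frame through H to obtain the
    transformed target position (x1, y1)."""
    denom = H[2, 0] * x2 + H[2, 1] * y2 + H[2, 2]
    x1 = (H[0, 0] * x2 + H[0, 1] * y2 + H[0, 2]) / denom
    y1 = (H[1, 0] * x2 + H[1, 1] * y2 + H[1, 2]) / denom
    return x1, y1
```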
Fig. 5 is a schematic diagram of the effect of the method of the present invention in practice. As can be seen from fig. 5, the region with the X-shaped mark always remains clear and is neither distorted nor deformed, which shows that the method of the invention has high reliability and a good effect.

Claims (6)

1. A video point location tracking method based on self-supervision training comprises the following steps:
S1, acquiring a training data set;
S2, constructing a neural network model for extracting feature point positions, acquiring descriptors and calculating key point confidence scores; the following model is specifically adopted as the neural network model:
A. a three-layer neural network model is constructed, comprising a Backbone, a Feature map, a Confidence branch, a Position branch and a Descriptor branch; the input of the Backbone is the input of the neural network model, the output of the Backbone is connected to the input of the Feature map, and the output of the Feature map is connected simultaneously to the inputs of the Confidence branch, the Position branch and the Descriptor branch;
B. the Backbone of the neural network model adopts a feature pyramid structure; specifically, the Backbone up-samples the high-level features of the two feature layers by a factor of 2; the low-level features have their channel number changed by a 1×1 convolution; the convolved result and the up-sampled result are added element-wise; meanwhile, a residual attention mechanism is added in the convolution;
C. in the Feature map of the neural network model, one pixel corresponds to an N×N pixel block in the original image, and the layers are connected by concat so that multi-layer features are fused; N is a natural number;
the Confidence branch regresses a confidence score for each point on the Feature map, and the key points with the highest scores are then selected according to these confidence scores;
the Position branch predicts the positions of the key points;
the Descriptor branch outputs the descriptors of the feature points;
wherein the size of the Feature map is $\frac{H}{N}\times\frac{W}{N}$, H being the height of the original image and W its width; the Confidence branch specifically consists of two convolution layers, with a Sigmoid activation function; the Position branch specifically consists of two convolution layers, with channel numbers conv-256 and conv-2 respectively; whereas SuperPoint predicts whether a location is a key point by classification, here the offset of the coordinates within the N×N block is predicted by regression, with a Sigmoid activation function; the offset is calculated with the following formula:

$$P^{x}_{prediction}(\Delta x,\Delta y)=P^{x}_{map}(x,y)+f_{down}\,\Delta x,\qquad P^{y}_{prediction}(\Delta x,\Delta y)=P^{y}_{map}(x,y)+f_{down}\,\Delta y,$$

where $P^{x}_{prediction}(\Delta x,\Delta y)$ is the predicted x-axis offset, $P^{y}_{prediction}(\Delta x,\Delta y)$ is the predicted y-axis offset, $P^{x}_{map}(x,y)$ is the original-image x coordinate, $P^{y}_{map}(x,y)$ is the original-image y coordinate, $\Delta x$ is the offset in the x-axis direction, $\Delta y$ is the offset in the y-axis direction, and $f_{down}$ is a set value; the Descriptor branch specifically consists of two convolution layers, both with channel number conv-256; the convolution layers take the 256-dimensional vector directly as a coarse descriptor of the interest point, the descriptor map is interpolated, and the descriptor is then obtained at the specific interest point coordinates produced by the Position branch;
S3, performing self-supervision training on the neural network model constructed in step S2 using the training data set acquired in step S1, so as to obtain a trained video network model;
S4, inputting the target video to be point-location tracked into the video network model obtained in step S3, so as to extract feature descriptors of the target video;
S5, matching and screening the feature descriptors of successive frames of the target video;
S6, constructing a homography matrix from the matching and screening results of step S5;
S7, obtaining the transformed target position from the target point given in the first frame of the target video and the homography matrix obtained in step S6, thereby completing video point location tracking.
2. The video point location tracking method based on self-supervision training according to claim 1, wherein the matching and screening of the feature descriptors of successive frames of the target video in step S5 are specifically performed as follows: for the target video, given the feature point sets Q1 and Q2 extracted from any two frames, the L2 distance d between the feature descriptors of the point sets is computed, points whose distance is greater than a threshold t are deleted, and the remaining points are finally paired according to their distance scores.
3. The video point location tracking method based on self-supervision training according to claim 2, wherein the matching and screening of the feature descriptors of successive frames of the target video are specifically defined as follows: the point sets extracted from two pictures A and B are indexed by po ∈ {A, B}; the network outputs three tensors C(po), P(po) and F(po) each time; writing the network outputs of the two pictures as Z(A) and Z(B), the distance matrix is G = Z(A) × Z(B), with elements $g_{ij}=\lVert\hat{a}_i-b_j\rVert_2$, where G is the distance matrix, $g_{ij}$ is an element of G, $\hat{a}_i=Ta_i$ is a point of picture A transformed by T, $b_j$ is a point of picture B, T is the homography transformation matrix, and $\lVert\cdot\rVert_2$ is the Euclidean distance; if the distance $g_{ij}$ between point i of A and point j of B is less than the threshold t, point i of A and point j of B are jointly added to the set Q; indexing the points of Q by k, the confidence of the k-th point of Q is $C_k(po)$, its position is $P_k(po)$, its descriptor is $F_k(po)$, and its position distance is $d_k=\lVert\hat{a}_k-b_k\rVert_2$.
4. The video point location tracking method based on self-supervision training according to claim 3, wherein in step S7 the transformed target position is obtained from the target point given in the first frame of the target video and the homography matrix obtained in step S6, specifically by calculating

$$x_1=\frac{h_{00}x_2+h_{01}y_2+h_{02}}{h_{20}x_2+h_{21}y_2+h_{22}},\qquad y_1=\frac{h_{10}x_2+h_{11}y_2+h_{12}}{h_{20}x_2+h_{21}y_2+h_{22}},$$

where $x_1$ is the abscissa of the transformed target position, $y_1$ is its ordinate, $x_2$ is the abscissa of the target point given in the first frame of the target video, $y_2$ is its ordinate, and $h_{00}\sim h_{22}$ are the elements of the homography matrix H obtained in step S6, with

$$H=\begin{bmatrix} h_{00} & h_{01} & h_{02}\\ h_{10} & h_{11} & h_{12}\\ h_{20} & h_{21} & h_{22}\end{bmatrix}.$$
5. The video point location tracking method based on self-supervision training according to claim 4, wherein the neural network model specifically uses the following formula as the total loss function:

$$L_{total}=w_1 L_{ut}+w_2 L_{up}+w_3 L_{des}+w_4 L_{dec},$$

where $L_{total}$ is the total loss; $w_1\sim w_4$ are weight coefficients; $L_{ut}$ is the unsupervised point loss; $L_{up}$ is the point distribution loss; $L_{des}$ is the descriptor loss; and $L_{dec}$ is the decorrelated feature descriptor loss.
6. The video point location tracking method based on self-supervision training according to claim 5, wherein the unsupervised point loss $L_{ut}$ combines, over the K joint point pairs, a position loss $l_k^{position}$ weighted by $w_{position}$, a confidence loss $l_k^{confidence}$ weighted by $w_{confidence}$, and a related point pair loss $l_k^{mp}$ weighted by $w_{mp}$; K is the number of points;
the position loss $l_k^{position}$ is specifically the position distance $d_k$ of the k-th joint point pair;
the confidence loss $l_k^{confidence}$ is computed from the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair; log denotes the logarithm to an arbitrary base;
the related point pair loss $l_k^{mp}$ is computed from the distance $d_k$ of the joint point pair and the confidences $\hat{C}_k^{A}$ and $C_k^{B}$ of the joint point pair;
the point distribution loss $L_{up}$ is specifically calculated as

$$L_{up}=L_{up}(x)+L_{up}(y),\qquad L_{up}(x)=\sum_{i=1}^{M}\left(\tilde{x}_i-\frac{i-1}{M-1}\right)^{2},\qquad L_{up}(y)=\sum_{i=1}^{M}\left(\tilde{y}_i-\frac{i-1}{M-1}\right)^{2},$$

where $L_{up}(x)$ is the x-axis component of $L_{up}$, $L_{up}(y)$ is its y-axis component, M is the number of points, $\tilde{x}_i$ are the x values sorted from small to large and $\tilde{y}_i$ are the y values sorted from small to large;
the descriptor loss $L_{des}$ is specifically calculated as

$$L_{des}=\frac{1}{(H_C W_C)^2}\sum_{i}\sum_{j}\Big[\alpha_d\,C\,\max\big(0,\ m_p-\hat{f}_i^{A\top}f_j^{B}\big)+(1-C)\,\max\big(0,\ \hat{f}_i^{A\top}f_j^{B}-m_n\big)\Big],$$

where $H_C, W_C$ are the cell sizes, $\alpha_d$ is a weight factor, $C\in\{0,1\}$, $m_p$ is the positive sample margin, $\hat{f}_i^{A}$ is a feature descriptor of A after the T transformation, $f_j^{B}$ is a feature descriptor extracted by branch B, and $m_n$ is the negative sample margin;
the decorrelated feature descriptor loss $L_{dec}$ is specifically calculated as

$$L_{dec}=\sum_{i\neq j}\big(r_{ij}^{A}+r_{ij}^{B}\big),$$

where $r_{ij}^{A}$ is an element of the matrix $R^{A}$, $r_{ij}^{B}$ is an element of the matrix $R^{B}$, and F is the number of channels; the matrix $R^{p}$ (p being A or B) is of size F×F, and its elements are

$$r_{ij}^{p}=\frac{(v_i^{p}-\bar{v}_i^{p})^{\top}(v_j^{p}-\bar{v}_j^{p})}{\lVert v_i^{p}-\bar{v}_i^{p}\rVert_2\,\lVert v_j^{p}-\bar{v}_j^{p}\rVert_2},$$

where $v^{p}$ is a column vector and $\bar{v}^{p}$ is its average value.
CN202010946922.7A 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training Active CN112084952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946922.7A CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010946922.7A CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Publications (2)

Publication Number Publication Date
CN112084952A CN112084952A (en) 2020-12-15
CN112084952B true CN112084952B (en) 2023-08-15

Family

ID=73733156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946922.7A Active CN112084952B (en) 2020-09-10 2020-09-10 Video point location tracking method based on self-supervision training

Country Status (1)

Country Link
CN (1) CN112084952B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988B (en) * 2021-03-04 2022-03-11 西南科技大学 Feature matching method based on attention-seeking neural network
CN113657528B (en) * 2021-08-24 2024-02-13 湖南国科微电子股份有限公司 Image feature point extraction method and device, computer terminal and storage medium
CN115661724B (en) * 2022-12-12 2023-03-28 内江师范学院 Network model and training method suitable for homography transformation of continuous frame sequence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210808A (en) * 2016-08-08 2016-12-07 腾讯科技(深圳)有限公司 Media information put-on method, terminal, server and system
CN108446634A (en) * 2018-03-20 2018-08-24 北京天睿空间科技股份有限公司 The aircraft combined based on video analysis and location information continues tracking
CN109872346A (en) * 2019-03-11 2019-06-11 南京邮电大学 A kind of method for tracking target for supporting Recognition with Recurrent Neural Network confrontation study
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701210B (en) * 2016-02-02 2021-08-17 北京市商汤科技开发有限公司 Method and system for CNN network adaptation and object online tracking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210808A (en) * 2016-08-08 2016-12-07 腾讯科技(深圳)有限公司 Media information put-on method, terminal, server and system
CN108446634A (en) * 2018-03-20 2018-08-24 北京天睿空间科技股份有限公司 The aircraft combined based on video analysis and location information continues tracking
CN109872346A (en) * 2019-03-11 2019-06-11 南京邮电大学 A kind of method for tracking target for supporting Recognition with Recurrent Neural Network confrontation study
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ling; Wang Hui; Wang Peng; Li Yanfang. Video target tracking algorithm based on FasterMDNet. Computer Engineering and Applications, 2019, (14), full text. *

Also Published As

Publication number Publication date
CN112084952A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN113196289B (en) Human body action recognition method, human body action recognition system and equipment
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN107529650B (en) Closed loop detection method and device and computer equipment
CN108960211B (en) Multi-target human body posture detection method and system
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111931686B (en) Video satellite target tracking method based on background knowledge enhancement
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN114140623A (en) Image feature point extraction method and system
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
CN112329771A (en) Building material sample identification method based on deep learning
CN116977674A (en) Image matching method, related device, storage medium and program product
CN111260687A (en) Aerial video target tracking method based on semantic perception network and related filtering
CN114882537A (en) Finger new visual angle image generation method based on nerve radiation field
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113763417A (en) Target tracking method based on twin network and residual error structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant