CN111899284B - Planar target tracking method based on parameterized ESM network - Google Patents

Planar target tracking method based on parameterized ESM network

Info

Publication number
CN111899284B
CN111899284B · CN202010816457.5A
Authority
CN
China
Prior art keywords
target
feature
data
template
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010816457.5A
Other languages
Chinese (zh)
Other versions
CN111899284A (en)
Inventor
王涛
刘贺
李浥东
郎丛妍
冯松鹤
金一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202010816457.5A priority Critical patent/CN111899284B/en
Publication of CN111899284A publication Critical patent/CN111899284A/en
Application granted granted Critical
Publication of CN111899284B publication Critical patent/CN111899284B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, comprising the following steps. S1, acquire a target template T, the input image of the t-th frame, and the initial motion parameters for the t-th frame; determine the target region I_t of the input image from the initial motion parameters; preprocess the target template T and the target region I_t, including image scaling and normalization; and extract features from the preprocessed target template T and the target region I_t of the t-th frame input image using the feature extraction network, obtaining the feature maps F_T and F_t^I. S2, compute the difference between the two feature maps F_T and F_t^I using the similarity metric module. S3, determine and remove the occluded part of the target in the current frame through the occlusion detection mechanism, and solve for the motion parameters of the target by minimizing the difference over the non-occluded part of the current frame. The method is better suited to target tracking tasks and greatly improves tracking accuracy.

Description

Planar target tracking method based on parameterized ESM network
Technical Field
The invention relates to the field of machine vision and pattern recognition, in particular to a planar target tracking method based on a parameterized ESM network.
Background
Planar object tracking refers to the task in which, given a sequence of video frames and a planar object of interest specified in the first frame, the algorithm computes the change in pose of the planar object in the subsequent frames. Planar object tracking is a core problem in computer vision with applications in many fields, such as augmented reality, robotics, and unmanned aerial vehicles.
Patent document No. 201510147895.6 discloses a moving-object tracking method based on bit planes. That invention computes a brightness bit plane and a smoothed local binary pattern bit plane for the tracking target and the search area; it then searches the two appearance planes of the search area for the region closest to the two appearance models of the tracking target and takes it as the tracking result. After tracking is completed, the appearance model is updated from the established appearance model and the tracking result of the current frame at a preset update rate. The method has clear advantages in tracking precision and robustness, and effectively addresses the difficulty of tracking moving targets in video under changes in illumination, target pose, and appearance.
The patent document with application number 201910297980.9 discloses a moving-target tracking method based on template matching and a deep classification network, mainly addressing the low detection speed of prior methods and their inaccuracy when the target is deformed or occluded. The scheme extracts a template network and a detection network from a double-residual deep classification network; extracts template features and detection features from the template and the detection area with the corresponding networks; performs template matching of the template features on the detection features to obtain a template matching map; determines the target position from the template matching map; and tracks the target position while updating the template features. The method offers high tracking speed and accuracy and is used for tracking video targets under severe deformation and illumination change.
The scheme of patent document 201510147895.6 alleviates, to some extent, the difficulty of target tracking under illumination and appearance changes in video. However, although that invention carefully models the brightness and texture of the target, such hand-designed modeling cannot accurately reflect the appearance characteristics of the target. In patent document 201910297980.9, although a deep network is adopted as the feature extractor, the extractor is not embedded in an end-to-end framework built for the video tracking task for training and validation; instead, it is trained on a classification task, and a simple sliding-window convolution is used to compute the feature response map. In fact, sliding-window convolution is not necessarily applicable to deep feature maps. Furthermore, neither invention considers the cases in which part of the target is occluded or part of the target leaves the field of view.
Disclosure of Invention
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, which overcomes the defects of the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In the planar target tracking method based on a parameterized ESM network, a deep planar object tracking model is constructed, comprising a feature extraction network, a similarity metric module, and an occlusion detection mechanism, and a data set is constructed to train the deep planar object tracking model. The planar object tracking method comprises the following steps:
S1, acquiring a target template T, the input image of the t-th frame, and the initial motion parameters for the t-th frame; determining the target region I_t of the input image from the initial motion parameters; preprocessing the target template T and the target region I_t, including image scaling and normalization; and extracting features from the preprocessed target template T and the target region I_t of the t-th frame input image using the feature extraction network to obtain the feature maps F_T and F_t^I, the dimensions of the preprocessed template and target region being h × l × 3, where h, l, and 3 are the width, length, and number of channels of the image, respectively;
S2, computing the difference between the two feature maps F_T and F_t^I using the similarity metric module;
S3, determining and removing the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameters of the target by minimizing the difference over the non-occluded part of the current frame.
Preferably, the tracking of each frame in the video is divided into two stages, specifically:
the motion parameters of the first-stage tracking result serve as the initial motion parameters of the second stage; in the next iteration, the motion parameters of the previous second-stage tracking result serve as the initial motion parameters of the first stage of the current iteration.
Preferably, the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the first 6 convolutional layers have 64 kernels each and the last convolutional layer has 8; in the k-th stage, the first 4-k of the 7 convolutional layers have stride 2 and the remaining layers have stride 1, with k being 1 or 2.
Preferably, the similarity metric module is an encoder-decoder network based on the U-Net architecture, whose input is the concatenation of the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and whose output is the difference tensor of the two feature maps F_T and F_t^I.
Preferably, S3 comprises:
denoting the target region I_t of the t-th frame input image simply as I, and its feature map F_t^I simply as F_I; given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, their feature dimensions being h' × l' × d, where h' and l' are the width and length of the extracted feature maps, h' = h/2^(4-k) and l' = l/2^(4-k) with k being 1 or 2, and d is the feature dimension;
first, taking each feature as a unit, unfolding F_T and F_I along the h' direction into m × d matrices, where m = h' × l', denoted F̂_T and F̂_I; F̂_T is the unfolded feature map of the template T and F̂_I is the unfolded feature map of the target region; then computing a correlation map R, of dimension m × m, to record the similarity of each pair of features:
$$R_{i,j} = \hat{F}_T^{i}\, Z\, \big(\hat{F}_I^{j}\big)^{\top}$$
where i and j index the features in the feature maps of the target template T and of the target region, respectively; R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map; and Z is a trainable parameter matrix of dimension d × d. The confidence vector Ĉ is formed by taking the maximum of each row of R:
$$\hat{C}_{i} = \max_{j} R_{i,j}$$
then, the elements of Ĉ are normalized to the interval [0, 1] to obtain the final confidence vector C̃;
finally, the confidence vector C̃ is arranged into h' rows of length l', i.e., an h' × l' map, denoted C, and the motion parameters of the target are solved by minimizing the difference over the non-occluded part, as in the following formula:
$$\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x;p))\big) \qquad (3)$$
where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of occluded features being 0 and that of non-occluded features being 1; M(·,·) measures the difference between each pair of features in the template and the target region; and W(x;p) denotes the coordinate transformation;
solving the formula (3) by adopting an ESM method, wherein the method is as follows:
order the
The increment of the motion parameter is obtained by:
wherein the method comprises the steps ofRepresenting pseudo-inverse of matrix, J T Calculated at U unit transformJacobian matrix, J E (p) represents the jacobian matrix of E (x; p) at p:
the motion parameters are updated in combination with the delta deltap of the motion parameters:
wherein,representing a binary operation.
Preferably, constructing the data set to train the deep planar object tracking model comprises:
constructing two labeled data sets, GEN-DATA and OCC-DATA, wherein GEN-DATA covers illumination, deformation, and noise factors, and OCC-DATA builds on GEN-DATA and adds cases in which part of the target is occluded or part of the target leaves the field of view; each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameters, and p_gt is the ground-truth motion parameters of the target;
the GEN-DATA construction process comprises geometric transformations and optical perturbations;
the geometric transformation comprises:
given the target template T and the ground-truth motion parameters p_gt of the target, mapping the pixels of the target template into the input image Q through the perspective transformation formula:
$$s\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
where the 3 × 3 matrix (a_{ij}) is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are its coordinates after the perspective transformation;
shifting each corner point of the target in the input image Q by d pixels in an arbitrary direction, d being an integer from 0 to 20, and computing the corresponding transformation matrix, i.e., the initial motion parameters p_0, from the shifted corner coordinates;
The optical perturbation includes:
1) Adding motion blur or gaussian blur to the input image;
2) Adding gaussian noise to the input image;
3) Applying brightness variations of different degrees to all pixels of the input image along a certain direction;
the DATA set OCC-DATA construction process includes:
for each sample in the GEN-DATA, a point is selected on each side of the object in the input image, forming a size N P Randomly selecting N (N is more than or equal to 0 and less than or equal to N) P ) The points are sequentially connected to divide a target area in a video frame into a plurality of parts, and a part of the target area is randomly selected to be filled with a pattern of another picture so as to simulate the shielding condition;
the DATA sets GEN-DATA and OCC-DATA are each at 8:2 is divided into a training set and a verification set for training the performance of the model and the verification model;
during training, firstly, an occlusion detection mechanism is not added, a GEN-DATA is used for training a feature extraction network and a similarity measurement module, parameters of the feature extraction network and the similarity measurement module are fixed after training is completed, an OCC-DATA is used for training the occlusion detection mechanism, and parameters of the feature extraction network and the similarity measurement module are finely adjusted at the same time;
the loss function formula adopted in the training process is as follows:
wherein,target motion parameters predicted for model, p gt Is the object ofIs a real motion parameter of the robot; n is the number of target corner points, r q Coordinates of corner points; />A formula representing the coordinate transformation.
According to the technical scheme provided by the embodiment of the invention, the planar target tracking method based on a parameterized ESM network uses a trainable feature extraction module and a sufficient training set, so that the feature extraction module learns a relatively robust feature representation, alleviating to some extent the problem that traditional hand-designed features cannot accurately reflect the appearance characteristics of the target during tracking. The trained feature extraction module and similarity metric module also resolve the incompatibility between deep features and traditional similarity measures. An occlusion detection mechanism assists the model solving, making the model more robust to partial occlusion. Meanwhile, the logarithmic loss function prevents the training process from being dominated by samples with large loss. Because the invention constructs its data sets from Microsoft's COCO data set, the training samples are plentiful and the network is built end to end, so the model's planar object tracking accuracy far exceeds that of traditional methods and of existing deep-network-based methods.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
FIG. 2 is a GEN-DATA generation effect diagram of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of OCC-DATA generation effect of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
FIG. 4 is a flowchart of a similarity measurement module of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 5 is a schematic diagram of selecting a tracking target in a first frame according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings; these embodiments in no way limit the invention.
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network. As shown in FIG. 1, a deep planar object tracking model is constructed, comprising a feature extraction network, a similarity metric module (ML Layer, metric learning layer), and an occlusion detection mechanism (CMG, confidence map generator), and a data set is constructed to train the deep planar object tracking model.
To train the model, two labeled data sets, GEN-DATA and OCC-DATA, were constructed from the MS-COCO data set to train the deep planar object tracking model. GEN-DATA mainly covers factors such as illumination, deformation, and noise; OCC-DATA builds on GEN-DATA and adds cases in which part of the target is occluded or leaves the field of view. Each sample in the two data sets is a quadruple (T, Q, p_0, p_gt): the template image, the current input image, the initial motion parameters, and the ground-truth motion parameters of the target, respectively. The template images come from a template pool constructed from MS-COCO, i.e., pictures in MS-COCO scaled so that their length and width are 80 to 160 pixels.
The GEN-DATA construction process mainly involves geometric transformations and optical perturbations, as shown in FIG. 2.
The geometric transformation is performed as follows:
1) Given the target template T and the ground-truth motion parameters p_gt of the target, the pixels of the target template are mapped into the input image Q through the perspective transformation formula:
$$s\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
where the 3 × 3 matrix (a_{ij}) is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are its coordinates after the perspective transformation.
2) The corner points of the target in the input image Q are each shifted by d pixels in an arbitrary direction, d being an integer from 0 to 20. The corresponding transformation matrix, i.e., the initial motion parameters p_0, is computed from the shifted corner coordinates.
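This corner perturbation and the recovery of p_0 can be sketched with OpenCV as follows; the helper name perturb_corners and the example corner values are assumptions for illustration, not from the patent:

```python
# Sketch of the geometric-transformation step of GEN-DATA generation.
import numpy as np
import cv2

def perturb_corners(corners, d_max=20, rng=np.random):
    """Shift each target corner by up to d_max pixels in a random direction."""
    d = rng.randint(0, d_max + 1, size=(4, 1)).astype(np.float32)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(4, 1))
    offsets = d * np.concatenate([np.cos(theta), np.sin(theta)], axis=1)
    return corners + offsets.astype(np.float32)

# corners_gt: the target's true corners in the input image Q (example values)
corners_gt = np.array([[50, 40], [210, 40], [210, 160], [50, 160]], np.float32)
corners_init = perturb_corners(corners_gt)

# Template corners in the template's own coordinate system (width l, height h)
l, h = 160, 120
tpl_corners = np.array([[0, 0], [l, 0], [l, h], [0, h]], np.float32)

# p_gt and p_0 as 3x3 perspective (homography) matrices with a33 = 1
p_gt = cv2.getPerspectiveTransform(tpl_corners, corners_gt)
p_0 = cv2.getPerspectiveTransform(tpl_corners, corners_init)
```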
The implementation of the optical disturbance is specifically as follows:
1) Adding motion blur or gaussian blur to the input image;
2) Adding gaussian noise to the input image;
3) Varying degrees of brightness are applied to all pixels on an input image in a certain direction (e.g., top-to-bottom, or left-to-right).
The generation method of the OCC-DATA data set is as follows:
For each sample in GEN-DATA, a point is selected on each side of the target in the input image, forming a point set of size N_P. Then N points (0 ≤ N ≤ N_P) are randomly selected and connected in order, dividing the target region in the video frame into several parts. One part is then randomly selected and filled with the pattern of another picture to simulate occlusion, as shown in FIG. 3.
The data sets GEN-DATA and OCC-DATA are both divided at a ratio of 8:2 into a training set and a validation set, used for training the model and validating its performance.
Features learned from a large amount of data better reflect the appearance characteristics of the target. During training, the occlusion detection mechanism is not added at first, and GEN-DATA is used to train the feature extraction network and the similarity metric module. After this training is completed, we fix the parameters of these two modules, train the occlusion detection module with OCC-DATA, and fine-tune the parameters of the feature extraction network and the similarity metric module.
The loss function used during training is:
$$L = \ln\!\Big(1 + \sum_{q=1}^{N} \big\| W(r_q;\hat{p}) - W(r_q;p_{gt}) \big\|_2 \Big)$$
where p̂ is the target motion parameters predicted by the model and p_gt is the ground-truth motion parameters of the target; N is the number of target corner points and r_q is the coordinates of corner point q; W(·;·) denotes the coordinate transformation.
The sum of corner-point distances is embedded in a logarithmic function so that samples with large loss do not dominate the whole training process.
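Assuming the motion parameters are 3 × 3 homographies, this loss can be sketched in PyTorch as follows (warp_points and corner_log_loss are illustrative names, not from the patent):

```python
# Minimal PyTorch sketch of the corner-distance log loss.
import torch

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) tensor of points."""
    ones = torch.ones(pts.shape[0], 1, dtype=pts.dtype, device=pts.device)
    homog = torch.cat([pts, ones], dim=1) @ H.t()
    return homog[:, :2] / homog[:, 2:3]   # perspective division

def corner_log_loss(H_pred, H_gt, corners):
    """log(1 + sum of L2 distances between predicted and true warped corners)."""
    d = torch.norm(warp_points(H_pred, corners) - warp_points(H_gt, corners), dim=1)
    return torch.log1p(d.sum())
```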
After the training of the depth plane object tracking model is completed, the target tracking process is as follows:
the tracking of each frame is divided into two stages, specifically:
the motion parameter of the tracking result of the first stage is used as the initial motion parameter of the second stage, and in the next iteration process, the motion parameter of the tracking result of the last second stage is used as the initial motion parameter of the first stage in the current iteration.
Taking the first stage as an example:
s1, firstly, acquiring a target template T, an input image of a T frame and initial motion parameters in the T frame, and determining a target area I of the input image according to the initial motion parameters t For target template T and target region I t Preprocessing, including operations such as zooming and normalizing of pictures; use specialTarget region I of input image of feature extraction network to target template T and T frame t Extracting features to obtain feature map F T And F t I The dimensions of the template and the target area after pretreatment are h multiplied by l multiplied by 3, and h, l and 3 are the width, length and channel number of the image respectively.
The feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer. The first 6 convolutional layers have 64 kernels each, and the last convolutional layer has 8. In the k-th stage, the first 4-k of the 7 convolutional layers have stride 2, and the remaining layers have stride 1, with k being 1 or 2. Taking the first stage as an example, k = 1.
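The described architecture can be sketched in PyTorch as follows; the 3 × 3 kernel size and padding are assumptions, since only the layer counts, kernel numbers, and strides are specified:

```python
# PyTorch sketch of the stage-k feature extraction network.
import torch.nn as nn

def make_feature_net(k: int) -> nn.Sequential:
    layers, in_ch = [], 3
    for i in range(7):
        out_ch = 8 if i == 6 else 64          # last layer has 8 kernels
        stride = 2 if i < 4 - k else 1        # first 4-k layers downsample
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
        in_ch = out_ch
    return nn.Sequential(*layers)

feature_net_stage1 = make_feature_net(k=1)  # 3 stride-2 layers: 1/8 resolution
feature_net_stage2 = make_feature_net(k=2)  # 2 stride-2 layers: 1/4 resolution
```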
S2: The similarity metric module is used to compute the difference between the two feature maps F_T and F_t^I. The similarity metric module is an encoder-decoder network based on the U-Net architecture; its input is the concatenation of the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and its output is the difference tensor of the two feature maps F_T and F_t^I, as shown in FIG. 4.
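A minimal U-Net-style sketch of such a metric module follows; the encoder/decoder depth and channel widths are assumptions (the 16 input channels follow from concatenating two 8-channel feature maps), and spatial dimensions are assumed even so the skip connection aligns:

```python
# U-Net-style encoder-decoder sketch for the similarity metric module.
import torch
import torch.nn as nn

class MetricModule(nn.Module):
    def __init__(self, in_ch=16, base=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Conv2d(base * 2, 1, 3, padding=1)   # difference tensor

    def forward(self, f_template, f_target):
        x = torch.cat([f_template, f_target], dim=1)      # concatenated input
        e = self.enc(x)
        d = self.up(self.down(e))
        return self.dec(torch.cat([d, e], dim=1))         # U-Net skip connection
```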
S3: The occluded part of the target in the current frame is determined and removed by the occlusion detection mechanism, and the motion parameters of the target are solved by minimizing the difference over the non-occluded part of the current frame.
The detection process of the occlusion detection mechanism is specifically as follows:
to describe this process more clearly, the target area I of the input image of the t-th frame will be t Simplified representation as I, its feature map F t I Is simply denoted as F I . L2 normalized feature map F for target region of input image given template and t-th frame T And F I (the characteristic dimensions thereof are h '. Times.l'. Times.d, h '. Times.l' correspond to the width and length of the extracted characteristic image respectively,k is 1 or 2D represents the dimension of the feature).
First, in units of each feature, F T And F I Expanded in the h ' direction into an m×d matrix (where m=h ' ×l '), denoted asAnd->Feature map representing the template T being expanded, +.>Representing the feature map of the expanded target region, then computing a correlation map R (dimension m×m) to record the similarity of each pair of features, with the formula:
wherein i, j respectively represent indexes of features in feature mapping of the target template T and the target region, R i,j The similarity between the ith feature in the template feature map and the jth feature in the target region feature map is shown, and Z (dimension d x d) is a trainable parameter matrix. Confidence vector is then constructed by selecting the maximum value of each row in RThe formula is as follows:
then, willElement of (C) is normalized to [0,1 ]]Within the interval as final confidence vector +.>
Finally, the confidence vector C̃ is arranged into h' rows of length l', i.e., an h' × l' map, denoted C. The motion parameters of the target are solved by minimizing the differences over the non-occluded parts, per the following formula:
$$\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x;p))\big) \qquad (3)$$
where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection (in theory, the contribution of occluded features is 0 and that of non-occluded features is 1); M(·,·) measures the difference between each pair of features in the template and the target region; and W(x;p) denotes the coordinate transformation.
Formula (3) is solved by the ESM method, specifically as follows:
Let
$$E(x;p) = C(x)\, M\big(F_T(x),\, F_I(W(x;p))\big)$$
The increment of the motion parameters can then be obtained by:
$$\Delta p = -2\,\big(J_T + J_E(p)\big)^{+}\, E(p)$$
where (·)^+ denotes the matrix pseudo-inverse, J_T is the Jacobian matrix of E(x;p) computed at the identity transformation U, and J_E(p) is the Jacobian matrix of E(x;p) at p:
$$J_T = \frac{\partial E(x;q)}{\partial q}\bigg|_{q=U}, \qquad J_E(p) = \frac{\partial E(x;q)}{\partial q}\bigg|_{q=p}$$
The motion parameters are updated with the increment Δp:
$$p \leftarrow p \circ \Delta p \qquad (7)$$
where ∘ denotes a binary (composition) operation.
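A numerical sketch of one such ESM step is given below; it uses autograd Jacobians in place of the analytic ones, and esm_step and the residual function E are illustrative assumptions:

```python
# Hedged sketch of one ESM update step.
import torch

def esm_step(E, p, p_identity):
    """E maps motion parameters to the flattened weighted-residual vector."""
    J_T = torch.autograd.functional.jacobian(E, p_identity)  # Jacobian at U
    J_E = torch.autograd.functional.jacobian(E, p)           # Jacobian at p
    # Increment per Eq. (5): pseudo-inverse of the summed Jacobians.
    delta_p = -2.0 * torch.linalg.pinv(J_T + J_E) @ E(p)
    return delta_p   # the caller composes p with delta_p per Eq. (7)
```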
The second stage is similar to the first stage, and will not be described again.
The planar target tracking process based on the parameterized ESM network provided by the embodiment of the invention comprises the following steps:
(1) In the first frame, the tracked target area is determined by calibrating the corner points of the target. As shown in fig. 5, the inside of the rectangular frame is the target template.
Taking FIG. 5 as an example, when the target is calibrated, the ground-truth motion parameters p_gt^1 of the target in the first frame can be computed from the coordinates of the four corner points of the target.
The acquisition process of p_gt^1 is as follows:
Let the width and height of the template be l and h. In FIG. 5, with a template coordinate system set up with point 1 as the origin, the coordinates of points 1 to 4 are (0, 0), (0, l), (h, l), (h, 0), respectively. In the image coordinate system with the upper-left corner of the frame image as the origin, the coordinates of points 1 to 4 are (x1, y1), (x2, y2), (x3, y3), (x4, y4). With a_33 set to 1, the remaining entries of the transformation matrix are solved by inverting the perspective transformation formula above for these four point correspondences.
(2) Starting from the second frame, the ground-truth motion parameters p_gt^1 of the first frame serve as the initial motion parameters p of the second frame. The input image Q is warped through p to obtain an image block (patch) of the same size as the template, i.e., the target region. The template and the target region are preprocessed, features are extracted, similarity is measured, occlusion is detected, and p is updated through several passes of the ESM solving process. The finally updated p_1 is taken as the tracking result for this frame.
(3) Subsequent frames are processed in the same way as (2); a minimal sketch of the first-frame initialization and patch extraction follows.
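The sketch below shows the first-frame initialization and the extraction of the target patch under the current motion parameters, assuming OpenCV and the usual (x, y) corner ordering; the helper names are illustrative:

```python
# Illustrative first-frame initialization and target-patch extraction.
import numpy as np
import cv2

def init_from_corners(image_corners, l, h):
    """Solve the homography p_gt^1 from the four calibrated corners (a33 = 1)."""
    tpl = np.array([[0, 0], [l, 0], [l, h], [0, h]], np.float32)
    return cv2.getPerspectiveTransform(tpl, image_corners.astype(np.float32))

def extract_patch(frame, p, l, h):
    """Warp the region selected by motion parameters p back to template size."""
    return cv2.warpPerspective(frame, np.linalg.inv(p), (l, h))
```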
In summary, the embodiment of the invention provides a planar target tracking method based on a parameterized ESM network that computes the differences between deep features with a trainable metric module and assists the optimization process with a trainable occlusion detection mechanism. In addition, the invention generates a large number of labeled samples to simulate real tracking scenes and supervises the training of the model with the designed loss function, so that the generated target tracking samples train the feature extraction network, the similarity metric module, and the occlusion detection mechanism in an end-to-end manner. Compared with a feature extractor trained on an image classification task and the traditional sliding-window convolution method, this training method makes the model better suited to the target tracking task, and the learned metric is more compatible with deep convolutional features than traditional sliding-window convolution, which greatly improves tracking accuracy.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. A planar target tracking method based on a parameterized ESM network, characterized in that a deep planar object tracking model is constructed, the deep planar object tracking model comprising: a feature extraction network, a similarity metric module, and an occlusion detection mechanism; and a data set is constructed to train the deep planar object tracking model; the planar object tracking method comprises the following steps:
S1, acquiring a target template T, the input image of the t-th frame, and the initial motion parameters for the t-th frame; determining the target region I_t of the input image from the initial motion parameters; preprocessing the target template T and the target region I_t, including image scaling and normalization; and extracting features from the preprocessed target template T and the target region I_t of the t-th frame input image using the feature extraction network to obtain feature maps F_T and F_t^I, the dimensions of the preprocessed template and target region being h × l × 3, where h, l, and 3 are the width, length, and number of channels of the image, respectively;
S2, computing the difference between the two feature maps F_T and F_t^I using the similarity metric module;
S3, determining and removing the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameters of the target by minimizing the difference over the non-occluded part of the current frame; which specifically comprises:
denoting the target region I_t of the t-th frame input image simply as I and its feature map F_t^I simply as F_I; given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, their feature dimensions being h' × l' × d, where h' and l' are the width and length of the extracted feature maps, h' = h/2^(4-k) and l' = l/2^(4-k) with k being 1 or 2, and d is the feature dimension;
first, taking each feature as a unit, unfolding F_T and F_I along the h' direction into m × d matrices, where m = h' × l', denoted F̂_T and F̂_I, F̂_T being the unfolded feature map of the template T and F̂_I being the unfolded feature map of the target region; then computing a correlation map R, of dimension m × m, to record the similarity of each pair of features:
$$R_{i,j} = \hat{F}_T^{i}\, Z\, \big(\hat{F}_I^{j}\big)^{\top}$$
where i and j index the features in the feature maps of the target template T and of the target region, respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z is a trainable parameter matrix of dimension d × d; forming the confidence vector Ĉ by taking the maximum of each row of R:
$$\hat{C}_{i} = \max_{j} R_{i,j}$$
then normalizing the elements of Ĉ to the interval [0, 1] to obtain the final confidence vector C̃;
finally, arranging the confidence vector C̃ into h' rows of length l', i.e., an h' × l' map, denoted C, and solving for the motion parameters of the target by minimizing the difference over the non-occluded part, as in the following formula:
$$\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x;p))\big) \qquad (3)$$
where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of occluded features being 0 and that of non-occluded features being 1; M(·,·) measures the difference between each pair of features in the template and the target region; and W(x;p) denotes the coordinate transformation;
solving the formula (3) by adopting an ESM method, wherein the method is as follows:
order the
The increment of the motion parameter is obtained by:
wherein the method comprises the steps ofRepresenting pseudo-inverse of matrix, J T Calculated at U unit transform +.>Jacobian matrix, J E (p) represents the jacobian matrix of E (x; p) at p:
the motion parameters are updated in combination with the delta deltap of the motion parameters:
p≡p° Δp … … (7) where ° represents a binary operation.
2. The method according to claim 1, characterized in that the tracking of each frame in the video is divided into two stages, specifically:
the motion parameters of the first-stage tracking result serve as the initial motion parameters of the second stage; in the next iteration, the motion parameters of the previous second-stage tracking result serve as the initial motion parameters of the first stage of the current iteration.
3. The method of claim 2, wherein the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the first 6 convolutional layers have 64 kernels each and the last convolutional layer has 8; in the k-th stage, the first 4-k of the 7 convolutional layers have stride 2 and the remaining layers have stride 1, with k being 1 or 2.
4. The method according to claim 1, wherein the similarity metric module is an encoder-decoder network based on the U-Net architecture, whose input is the concatenation of the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and whose output is the difference tensor of the two feature maps F_T and F_t^I.
5. The method of claim 1, wherein constructing the data set to train the deep planar object tracking model comprises:
constructing two labeled data sets, GEN-DATA and OCC-DATA, wherein GEN-DATA covers illumination, deformation, and noise factors, and OCC-DATA builds on GEN-DATA and adds cases in which part of the target is occluded or part of the target leaves the field of view; each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameters, and p_gt is the ground-truth motion parameters of the target;
the GEN-DATA construction process comprises geometric transformations and optical perturbations;
the geometric transformation comprises:
given the target template T and the ground-truth motion parameters p_gt of the target, mapping the pixels of the target template into the input image Q through the perspective transformation formula:
$$s\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
where the 3 × 3 matrix (a_{ij}) is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are its coordinates after the perspective transformation;
shifting each corner point of the target in the input image Q by d pixels in an arbitrary direction, d being an integer from 0 to 20, and computing the corresponding transformation matrix, i.e., the initial motion parameters p_0, from the shifted corner coordinates;
The optical perturbation includes:
1) Adding motion blur or gaussian blur to the input image;
2) Adding gaussian noise to the input image;
3) Applying brightness variations of different degrees to all pixels of the input image along a certain direction;
the DATA set OCC-DATA construction process includes:
for each sample in the GEN-DATA, a point is selected on each side of the object in the input image, forming a size N P Randomly selecting N (N is more than or equal to 0 and less than or equal to N) P ) The points are sequentially connected to divide a target area in a video frame into a plurality of parts, and a part of the target area is randomly selected to be filled with a pattern of another picture so as to simulate the shielding condition;
the DATA sets GEN-DATA and OCC-DATA are each at 8:2 is divided into a training set and a verification set for training the performance of the model and the verification model;
during training, firstly, an occlusion detection mechanism is not added, a GEN-DATA is used for training a feature extraction network and a similarity measurement module, parameters of the feature extraction network and the similarity measurement module are fixed after training is completed, an OCC-DATA is used for training the occlusion detection mechanism, and parameters of the feature extraction network and the similarity measurement module are finely adjusted at the same time;
the loss function formula adopted in the training process is as follows:
wherein,target motion parameters predicted for model, p gt Is the real motion parameter of the target; n is the number of target corner points, r q Coordinates of corner points; />A formula representing the coordinate transformation.
CN202010816457.5A 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network Active CN111899284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Publications (2)

Publication Number Publication Date
CN111899284A CN111899284A (en) 2020-11-06
CN111899284B true CN111899284B (en) 2024-04-09

Family

ID=73229031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010816457.5A Active CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Country Status (1)

Country Link
CN (1) CN111899284B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (en) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 Method and device for detecting similarity of media contents

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wu Zijian, "Netted Radar Tracking with Multiple Simultaneous Transmissions against Combined PDS Interception", Journal of Sensors (full text) *
Wang Tao et al., "Object detection method with tracking compensation based on spatio-temporal background difference" (基于时空背景差的带跟踪补偿目标检测方法), Journal of Computer Applications (计算机应用) (full text) *
Wang Tao, "Research on long-term object tracking algorithms based on kernelized correlation filtering" (基于核相关滤波的长时间目标跟踪算法研究), China Masters' Theses Full-text Database, Information Science and Technology series (full text) *

Also Published As

Publication number Publication date
CN111899284A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN110533722B (en) Robot rapid repositioning method and system based on visual dictionary
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
US20160379375A1 (en) Camera Tracking Method and Apparatus
Zhao et al. Deep lucas-kanade homography for multimodal image alignment
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN109544603B (en) Target tracking method based on deep migration learning
CN111767960A (en) Image matching method and system applied to image three-dimensional reconstruction
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN104751493A (en) Sparse tracking method on basis of gradient texture features
Sui et al. Exploiting the anisotropy of correlation filter learning for visual tracking
CN108335328B (en) Camera attitude estimation method and camera attitude estimation device
CN111899284B (en) Planar target tracking method based on parameterized ESM network
CN114463397A (en) Multi-modal image registration method based on progressive filtering
CN113763274A (en) Multi-source image matching method combining local phase sharpness orientation description
CN108009272B (en) Low-altitude moving target searching method based on directed weighted graph
CN116363205A (en) Space target pose resolving method based on deep learning and computer program product
Yao et al. Matching wide-baseline stereo images with weak texture using the perspective invariant local feature transformer
CN113160271B (en) High-precision infrared target tracking method integrating correlation filtering and particle filtering
CN112348847B (en) Target scale self-adaptive tracking method
CN115345902A (en) Infrared image dim target detection tracking method and system based on machine learning
CN111160300B (en) Deep learning hyperspectral image saliency detection algorithm combined with global prior
CN111126198B (en) Pedestrian re-identification method based on deep representation learning and dynamic matching
CN113705731A (en) End-to-end image template matching method based on twin network
CN110660079A (en) Single target tracking method based on space-time context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant