CN111899284A - Plane target tracking method based on parameterized ESM network - Google Patents

Plane target tracking method based on parameterized ESM network

Info

Publication number
CN111899284A
Authority
CN
China
Prior art keywords
target
feature
data
template
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010816457.5A
Other languages
Chinese (zh)
Other versions
CN111899284B (en)
Inventor
王涛
刘贺
李浥东
郎丛妍
冯松鹤
金一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202010816457.5A
Publication of CN111899284A
Application granted
Publication of CN111899284B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, which comprises the following steps. S1: obtain the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame; determine the target region I_t of the input image from the initial motion parameter; preprocess the target template T and the target region I_t, including image scaling and normalization; and use a feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining the feature maps F_T and F_t^I. S2: use the similarity measurement module to compute the difference between the two feature maps F_T and F_t^I. S3: determine and eliminate the occluded part of the target in the current frame through an occlusion detection mechanism, and solve for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame. The method is suitable for target tracking tasks and greatly improves tracking accuracy.

Description

Planar target tracking method based on parameterized ESM network
Technical Field
The invention relates to the field of machine vision and pattern recognition, in particular to a planar target tracking method based on a parameterized ESM network.
Background
In planar object tracking, a video sequence is given and a planar object of interest is specified in the first frame; the goal of the tracking algorithm is to calculate the pose change of that planar object in the subsequent video frames. As a core problem in computer vision, planar object tracking has applications in many fields, such as augmented reality, robot control and unmanned aerial vehicle technology.
Patent document No. 201510147895.6 discloses a moving object tracking method based on bit planes. The method obtains a smoothed brightness bit plane and a local binary pattern bit plane from the tracking target and the search area; it then searches, on the two bit planes of the search area, for the regions closest to the two appearance models of the tracking target and takes them as the tracking result; after tracking is finished, the appearance models are updated according to the established appearance models, the tracking result in the current frame and a preset update rate. The method has clear advantages in tracking precision and robustness, and effectively alleviates the difficulty of tracking a moving target in a video under illumination changes, target pose changes, obvious appearance changes and similar conditions.
Patent document No. 201910297980.9 discloses a moving target tracking method based on template matching and a deep classification network, which mainly addresses the slow target detection and inaccurate tracking under deformation and occlusion in the prior art. The scheme derives a template network and a detection network from a double-residual deep classification network; template features and detection features are extracted from the template and the detection area with the corresponding networks; the template features are matched against the detection features to obtain a template matching map; the target position is determined from the template matching map; and the template features are updated while tracking the target position. The method is fast and accurate, and is used to track video targets with severe deformation and illumination change.
Patent document No. 201510147895.6 solves, to a certain extent, the difficulty of tracking a target under conditions such as illumination change and appearance change in a video. However, although it models the brightness and texture of the target in an elaborately designed way, such a manually designed modeling method cannot accurately reflect the appearance characteristics of the target. Patent document No. 201910297980.9 adopts a deep network as a feature extractor, but the feature extractor is trained on a classification task rather than being embedded in the video tracking task within an end-to-end framework for training and validation, and a simple sliding-window convolution is used to compute the feature response map. In practice, sliding-window convolution is not necessarily applicable to deep feature maps. In addition, neither of these inventions considers the situation in which part of the target is occluded or outside the field of view.
Disclosure of Invention
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, which overcomes the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A planar target tracking method based on a parameterized ESM network constructs a deep planar object tracking model. The deep planar object tracking model comprises a feature extraction network, a similarity measurement module and an occlusion detection mechanism, and a data set is constructed to train the deep planar object tracking model. The planar target tracking method comprises the following steps:
S1, obtaining the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, determining the target region I_t of the input image according to the initial motion parameter, preprocessing the target template T and the target region I_t, including image scaling and normalization, and using the feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining feature maps F_T and F_t^I; the dimensions of the preprocessed template and target region are h × l × 3, where h, l and 3 are respectively the width of the image, the length of the image and the number of channels;

S2, calculating the difference between the two feature maps F_T and F_t^I using the similarity measurement module;

S3, determining and eliminating the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame.
Preferably, the tracking of each frame in the video is divided into two stages; the motion parameter obtained as the final tracking result of the second stage is used, in the next iteration, as the initial motion parameter of the first stage.
Preferably, the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the number of convolution kernels of the first 6 convolutional layers is 64 and that of the last convolutional layer is 8; in the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, with k equal to 1 or 2.
Preferably, the similarity measurement module is an encoder-decoder network based on the u-net framework; its inputs are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and its output is the difference tensor between the two feature maps F_T and F_t^I.
Preferably, S3 includes:

The target region I_t of the t-th frame input image is written simply as I, and its feature map F_t^I as F_I. Given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features:

First, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix, where m = h' × l', denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region. A correlation map R of dimension m × m is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z is a trainable parameter matrix of dimension d × d. A confidence vector $\tilde{c}$ is formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector.

Finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C. The motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of an occluded feature being 0 and that of an unoccluded feature being 1; M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation.
Formula (3) is solved with the ESM method, as follows:

Let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$. The increment of the motion parameter is obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

The motion parameter is updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
Preferably, constructing the data set to train the deep planar object tracking model includes:

Two labeled data sets, GEN-DATA and OCC-DATA, are constructed. GEN-DATA covers illumination, deformation and noise factors; OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view. Each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameter and p_gt is the real motion parameter of the target.

The GEN-DATA construction process comprises geometric transformations and optical perturbations.

The geometric transformation includes:

Given a target template T and the real motion parameter p_gt of the target, the pixel points of the target template are mapped into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation.

The corner points of the target in the input image Q are each moved by d pixels in an arbitrary direction, with d an integer from 0 to 20, and the corresponding transformation matrix, i.e. the initial motion parameter p_0, is computed from the coordinates of the moved corner points.

The optical perturbation comprises:

1) adding motion blur or Gaussian blur to the input image;
2) adding Gaussian noise to the input image;
3) applying brightness variations of different degrees along a certain direction to all pixels of the input image.

The OCC-DATA construction process includes:

For each sample in GEN-DATA, points are selected on each edge of the target in the input image to form a point set of size N_P; N points (0 ≤ N ≤ N_P) are selected at random and connected in sequence, dividing the target region in the video frame into several parts; one part is selected at random and filled with the pattern of another picture to simulate occlusion.

The data sets GEN-DATA and OCC-DATA are each divided into a training set and a validation set in the ratio 8:2, used to train the model and to verify its performance.

During training, the feature extraction network and the similarity measurement module are first trained with GEN-DATA without adding the occlusion detection mechanism; after this training is finished, the parameters of the feature extraction network and the similarity measurement module are fixed, the occlusion detection mechanism is trained with OCC-DATA, and the parameters of the feature extraction network and the similarity measurement module are fine-tuned at the same time.
The loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation.
According to the technical scheme provided by the embodiment of the invention, a planar target tracking method based on a parameterized ESM network is provided. Through the trainable feature extraction module and a sufficient training set, the feature extraction module learns a robust feature representation, which to a certain extent effectively solves the problem that traditional hand-designed features cannot accurately reflect the appearance characteristics of the target during tracking. The jointly trained feature extraction module and similarity measurement module resolve the incompatibility between deep features and traditional similarity measures. The occlusion detection mechanism assists the model solution, making the model more robust to partial occlusion. Meanwhile, the loss function in logarithmic form prevents the training process of the model from being dominated by samples with a large loss. Because the invention uses Microsoft's COCO data set as raw material to construct the data set, with sufficient training samples and an end-to-end network construction, the planar object tracking accuracy of the model is much higher than that of traditional methods and existing deep-network-based methods.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 2 is a GEN-DATA generation effect diagram of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 3 is a diagram illustrating an OCC-DATA generation effect of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 4 is a flowchart of a similarity measurement module of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 5 is a schematic diagram of selecting a tracking target in a first frame according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network. As shown in FIG. 1, a deep planar object tracking model is constructed, consisting of a feature extraction network, a metric learning layer (ML Layer, i.e. the similarity measurement module) and an occlusion detection mechanism (CMG), and a data set is constructed to train the deep planar object tracking model.
To train the model, two labeled data sets, GEN-DATA and OCC-DATA, are constructed from the MS-COCO data set. GEN-DATA mainly covers factors such as illumination, deformation and noise; OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view. Each sample in the two data sets is a quadruple (T, Q, p_0, p_gt), whose elements are respectively the template image, the current input image, the initial motion parameter and the real motion parameter of the target. The template image comes from a template pool constructed from MS-COCO, i.e. a picture in MS-COCO is scaled to a picture whose length and width are between 80 and 160 pixels.
The construction process of the data set GEN-DATA mainly involves geometric transformation and optical perturbation, as shown in fig. 2.
The geometric transformation is performed as follows:

1) Given a target template T and the real motion parameter p_gt of the target, the pixel points of the target template are mapped into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation.

2) The corner points of the target in the input image Q are each shifted by d pixels in an arbitrary direction, with d an integer from 0 to 20. The corresponding transformation matrix, i.e. the initial motion parameter p_0, is computed from the coordinates of the shifted corner points.
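As an illustration of these two steps, the following is a minimal sketch using OpenCV and NumPy; the corner coordinates, the helper name perturb_corners and the use of cv2.getPerspectiveTransform to recover the matrices are assumptions made for the example, not implementation details taken from the patent.

```python
import cv2
import numpy as np

def perturb_corners(corners, max_shift=20, rng=None):
    """Shift each corner by up to max_shift pixels in a random direction."""
    if rng is None:
        rng = np.random.default_rng()
    shifts = rng.integers(-max_shift, max_shift + 1, size=corners.shape)
    return (corners + shifts).astype(np.float32)

# Template corners for a template of width l and height h (example values).
h, l = 120, 160
tmpl_corners = np.float32([[0, 0], [l, 0], [l, h], [0, h]])

# Ground-truth corners of the target inside the input image Q (example values).
gt_corners = np.float32([[210, 95], [380, 110], [370, 240], [200, 230]])

# p_gt: transformation matrix mapping template pixels into the input image.
p_gt = cv2.getPerspectiveTransform(tmpl_corners, gt_corners)

# p_0: initial motion parameter recomputed from the perturbed corners.
p_0 = cv2.getPerspectiveTransform(tmpl_corners, perturb_corners(gt_corners))
```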
The optical perturbation is implemented specifically as follows:
1) adding motion blur or gaussian blur on the input image;
2) adding Gaussian noise to an input image;
3) different degrees of brightness variation are implemented in a certain direction (e.g., from top to bottom, or from left to right) for all pixels on the input image.
The data set OCC-DATA is generated as follows:
For each sample in GEN-DATA, points are selected on each edge of the target in the input image to form a point set of size N_P. N points (0 ≤ N ≤ N_P) are then selected at random and connected in sequence, so that the target region in the video frame is divided into several parts. One part is then selected at random and filled with the pattern of another picture to simulate occlusion, as shown in fig. 3.
The two data sets GEN-DATA and OCC-DATA are each divided into a training set and a validation set in the ratio 8:2, used respectively to train the model and to validate its performance.
The features learned from a large amount of data can reflect the appearance characteristics of the target. In the training process, the occlusion detection mechanism is first left out, and the feature extraction network and the similarity measurement module are trained with GEN-DATA. After this training is completed, the parameters of these two modules are fixed, the occlusion detection mechanism is trained with OCC-DATA, and the parameters of the feature extraction network and the similarity measurement module are fine-tuned.
The loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation. The sum of the corner-point distances is embedded in a logarithmic function so that samples with a large loss do not dominate the whole training process.
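A minimal PyTorch sketch of such a corner-distance loss is shown below; it assumes the motion parameters are stored as 3 × 3 homography tensors, and averaging over the corners inside the logarithm is one plausible reading of the formula above, so names and details are illustrative.

```python
import torch

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) tensor of corner coordinates."""
    ones = torch.ones(pts.shape[0], 1, dtype=pts.dtype, device=pts.device)
    homog = torch.cat([pts, ones], dim=1) @ H.T          # (N, 3) homogeneous coordinates
    return homog[:, :2] / homog[:, 2:3]

def corner_log_loss(H_pred, H_gt, corners):
    """log(1 + mean corner distance): large-error samples do not dominate training."""
    d = torch.norm(warp_points(H_pred, corners) - warp_points(H_gt, corners), dim=1)
    return torch.log1p(d.mean())
```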
After the training of the deep planar object tracking model is completed, the target tracking process is as follows:
the tracking of each frame is divided into two stages, specifically:
and in the next iteration process, the motion parameter of the last tracking result of the second stage is used as the initial motion parameter of the first stage in the current iteration.
Taking the first stage as an example:
S1: First, obtain the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, and determine the target region I_t of the input image from the initial motion parameter. Preprocess the target template T and the target region I_t, including image scaling and normalization. Then use the feature extraction network to extract features from the target template T and the target region I_t of the t-th frame input image, obtaining the feature maps F_T and F_t^I. The dimensions of the preprocessed template and target region are h × l × 3, where h, l and 3 are respectively the width of the image, the length of the image and the number of channels.
The feature extraction network of each stage consists of 7 convolutional layers, and each convolutional layer is followed by a BatchNorm layer and a ReLU activation layer. The first 6 convolutional layers have 64 convolution kernels each, and the last convolutional layer has 8. In the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, with k equal to 1 or 2; in the first stage, for example, k = 1. A possible realization is sketched below.
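In the sketch, the 3 × 3 kernel size, padding of 1 and 3 input channels are assumptions; the layer count, kernel numbers and stride pattern follow the description above.

```python
import torch.nn as nn

def make_feature_extractor(stage: int) -> nn.Sequential:
    """7 conv layers, each with BatchNorm + ReLU; the first (4 - stage) layers use stride 2."""
    layers, in_ch = [], 3
    for i in range(7):
        out_ch = 64 if i < 6 else 8                      # 64 kernels for layers 1-6, 8 for layer 7
        stride = 2 if i < 4 - stage else 1               # stage 1: 3 strided layers, stage 2: 2
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

stage1_net = make_feature_extractor(stage=1)             # used in the first tracking stage
stage2_net = make_feature_extractor(stage=2)             # used in the second tracking stage
```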
S2: Use the similarity measurement module to compute the difference between the two feature maps F_T and F_t^I. The similarity measurement module is an encoder-decoder network based on the u-net framework; its inputs are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and its output is the difference tensor between the two feature maps, as shown in fig. 4.
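The following is a heavily simplified PyTorch sketch of such an encoder-decoder similarity module; the channel widths, the single downsampling step and the single skip connection are assumptions made only to illustrate the input/output behaviour (two feature maps in, one difference tensor out), not the patent's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityModule(nn.Module):
    def __init__(self, feat_ch: int = 8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(2 * feat_ch, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(64, feat_ch, 3, padding=1)   # 64 = 32 (decoder) + 32 (skip)

    def forward(self, f_template, f_target):
        x1 = self.enc1(torch.cat([f_template, f_target], dim=1))   # encode the concatenated maps
        x2 = self.enc2(x1)                                          # downsample once
        up = F.interpolate(self.dec(x2), size=x1.shape[-2:],
                           mode="bilinear", align_corners=False)    # decode back to input size
        return self.out(torch.cat([up, x1], dim=1))                 # per-position difference tensor
```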
S3, determining and eliminating the occluded part of the target in the current frame by using an occlusion detection mechanism, and solving the motion parameter of the target by minimizing the difference of the unoccluded part in the current frame.
The detection process of the occlusion detection mechanism is as follows.

To describe the process more clearly, the target region I_t of the t-th frame input image is written simply as I and its feature map F_t^I as F_I. Given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image (both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features):

First, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix (where m = h' × l'), denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region. A correlation map R (of dimension m × m) is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z (of dimension d × d) is a trainable parameter matrix. A confidence vector $\tilde{c}$ is then formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector.

Finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C. The motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection (in theory, the contribution of an occluded feature is 0 and that of an unoccluded feature is 1); M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation.
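For illustration, the following is a rough PyTorch sketch of this occlusion-detection step: a bilinear correlation map, a row-wise maximum, normalization to [0, 1] and reshaping to h' × l'. The use of min-max normalization, the row ordering of the reshape and the class and variable names are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class OcclusionDetector(nn.Module):
    def __init__(self, d: int = 8):
        super().__init__()
        self.Z = nn.Parameter(torch.eye(d))               # trainable d x d matrix Z

    def forward(self, f_template, f_target):
        # f_template, f_target: (d, h', l') L2-normalized feature maps
        d, h, l = f_template.shape
        ft = f_template.reshape(d, -1).T                  # unrolled template features: (m, d)
        fi = f_target.reshape(d, -1).T                    # unrolled target-region features: (m, d)
        R = ft @ self.Z @ fi.T                            # (m, m) correlation map, as in formula (1)
        c = R.max(dim=1).values                           # row-wise maximum, as in formula (2)
        c = (c - c.min()) / (c.max() - c.min() + 1e-8)    # normalize to [0, 1] (min-max assumed)
        return c.reshape(h, l)                            # confidence map C of size h' x l'
```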
The ESM method is adopted to solve formula (3), as follows:

Let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$. The increment of the motion parameter can be obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

The motion parameters are updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
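As a purely illustrative sketch of one such update, the snippet below stacks the per-pixel residuals and the two Jacobians and takes a pseudo-inverse step; the additive 8-parameter increment with a33 fixed to 1 and the composition by matrix product are conventions assumed here for the example, not details confirmed by the patent.

```python
import numpy as np

def esm_step(p, residual, J_T, J_E):
    """One ESM update; residual stacks E(x; p), J_T and J_E are (m, 8) Jacobians."""
    J = J_T + J_E                                        # combined Jacobian, as in the update formula
    delta = -np.linalg.pinv(J) @ residual                # pseudo-inverse solution for delta p
    dp = np.eye(3)
    dp.flat[:8] += delta                                 # additive 8-DoF increment, a33 kept at 1
    return p @ dp                                        # compose the increment with the current p
```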
The specific process of the second stage is similar to the method of the first stage, and is not described herein again.
The embodiment of the invention provides a planar target tracking process based on a parameterized ESM network, which comprises the following steps:
(1) in the first frame, the tracked target area is determined by calibrating the corner points of the target. As shown in fig. 5, the target template is inside the rectangular frame.
Taking fig. 5 as an example, when the target is calibrated, the real motion parameter p_1^gt of the target for the first frame can be computed from the coordinates of its four corner points. The computation of p_1^gt is as follows:

Assume the width and height of the template are l and h. In fig. 5, the coordinate system of the template is established with point 1 as the origin, so the coordinates of points 1 to 4 are (0, 0), (0, l), (h, l) and (h, 0), respectively. The coordinate system of the image is established with the upper-left corner of the frame image as the origin, and the coordinates of points 1 to 4 in the image are (x1, y1), (x2, y2), (x3, y3) and (x4, y4). Setting a33 = 1, the transformation matrix p_1^gt is obtained by solving the perspective transformation formula given above in inverse for these four pairs of corresponding points.
(2) Starting from the second frame, the real motion parameter p_1^gt of the first frame is taken as the initial motion parameter p of the second frame, and the input image Q is warped through W(·; p) to obtain an image block (patch) of the same size as the template, i.e. the target region. The template and the target region are then preprocessed, features are extracted, similarity is measured and occlusion is detected, and p is updated through the repeatedly iterated ESM solving process. The finally updated p_1 is taken as the tracking result of this frame.
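A small sketch of this patch-extraction step is given below, assuming OpenCV and a 3 × 3 homography p that maps template coordinates into the frame; the function name is illustrative.

```python
import cv2

def extract_target_patch(frame, p, template_size):
    """Warp the frame with the inverse of p so the target lands on a template-sized patch."""
    h, l = template_size                                  # template height and width
    # p maps template coordinates to frame coordinates, so warp with the inverse mapping
    return cv2.warpPerspective(frame, p, (l, h), flags=cv2.WARP_INVERSE_MAP)
```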
(3) In the subsequent frame, the process is similar to (2).
In summary, embodiments of the present invention provide a parameterized ESM network-based planar target tracking method, which uses a trainable metric module to calculate the difference between depth features and uses a trainable occlusion detection mechanism to assist the optimization process. In addition, a large number of samples with labels are generated to simulate a real tracking scene, and the training process of the model is supervised through a designed loss function, so that the generated target tracking samples are used for training a feature extraction network, a similarity measurement module and an occlusion detection mechanism in an end-to-end mode. Compared with a feature extractor trained on an image classification task and a traditional sliding window convolution method, the training method enables the model to be more suitable for a target tracking task, and the learned measurement method is more compatible with deep convolution features than the traditional sliding window convolution method, so that the tracking accuracy is greatly improved.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; for the identical or similar parts among the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments, and for the relevant parts reference may be made to the partial descriptions of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A planar target tracking method based on a parameterized ESM network, characterized in that a deep planar object tracking model is constructed, the deep planar object tracking model comprising a feature extraction network, a similarity measurement module and an occlusion detection mechanism, and a data set is constructed to train the deep planar object tracking model; the planar target tracking method comprises the following steps:

S1, obtaining the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, determining the target region I_t of the input image according to the initial motion parameter, preprocessing the target template T and the target region I_t, including image scaling and normalization, and using the feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining feature maps F_T and F_t^I, wherein the dimensions of the preprocessed template and target region are h × l × 3, and h, l and 3 are respectively the width of the image, the length of the image and the number of channels;

S2, calculating the difference between the two feature maps F_T and F_t^I using the similarity measurement module;

S3, determining and eliminating the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame.
2. The method according to claim 1, wherein the tracking of each frame in the video is divided into two stages, and the motion parameter obtained as the final tracking result of the second stage is used, in the next iteration, as the initial motion parameter of the first stage.
3. The method according to claim 2, wherein the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the number of convolution kernels of the first 6 convolutional layers is 64 and that of the last convolutional layer is 8; in the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, k being 1 or 2.
4. The method according to claim 1, wherein the similarity measurement module is an encoder-decoder network based on the u-net framework, the inputs of which are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and the output of which is the difference tensor between the two feature maps F_T and F_t^I.
5. The method according to claim 1, wherein S3 includes:

the target region I_t of the t-th frame input image is written simply as I, and its feature map F_t^I as F_I; given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features;

first, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix, where m = h' × l', denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region; a correlation map R of dimension m × m is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z is a trainable parameter matrix of dimension d × d; a confidence vector $\tilde{c}$ is formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector;

finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C, and the motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of an occluded feature being 0 and that of an unoccluded feature being 1; M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation;
formula (3) is solved with the ESM method, as follows:

let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$; the increment of the motion parameter is obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

the motion parameter is updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
6. The method according to claim 1, wherein constructing the data set to train the deep planar object tracking model comprises:

constructing two labeled data sets GEN-DATA and OCC-DATA, wherein GEN-DATA covers illumination, deformation and noise factors, OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view, and each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameter and p_gt is the real motion parameter of the target;

the GEN-DATA construction process comprises geometric transformations and optical perturbations;

the geometric transformation includes:

given a target template T and the real motion parameter p_gt of the target, mapping the pixel points of the target template into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation;

moving the corner points of the target in the input image Q by d pixels each in an arbitrary direction, with d an integer from 0 to 20, and computing the corresponding transformation matrix, i.e. the initial motion parameter p_0, from the coordinates of the moved corner points;

the optical perturbation comprises:

1) adding motion blur or Gaussian blur to the input image;
2) adding Gaussian noise to the input image;
3) applying brightness variations of different degrees along a certain direction to all pixels of the input image;

the OCC-DATA construction process includes:

for each sample in GEN-DATA, selecting points on each edge of the target in the input image to form a point set of size N_P, randomly selecting N points (0 ≤ N ≤ N_P) and connecting them in sequence so as to divide the target region in the video frame into several parts, and randomly selecting one part and filling it with the pattern of another picture to simulate occlusion;

dividing each of the data sets GEN-DATA and OCC-DATA into a training set and a validation set in the ratio 8:2, for training the model and verifying its performance;

during training, first training the feature extraction network and the similarity measurement module with GEN-DATA without adding the occlusion detection mechanism; after this training is finished, fixing the parameters of the feature extraction network and the similarity measurement module, training the occlusion detection mechanism with OCC-DATA, and fine-tuning the parameters of the feature extraction network and the similarity measurement module at the same time;
the loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation.
CN202010816457.5A 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network Active CN111899284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Publications (2)

Publication Number Publication Date
CN111899284A true CN111899284A (en) 2020-11-06
CN111899284B CN111899284B (en) 2024-04-09

Family

ID=73229031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010816457.5A Active CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Country Status (1)

Country Link
CN (1) CN111899284B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (en) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 Method and device for detecting similarity of media contents

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU ZIJIAN: "《Netted Radar Tracking with Multiple Simultaneous Transmissions against Combined PDS Interception》", 《 JOURNAL OF SENSORS》 *
王涛: "《基于核相关滤波的长时间目标跟踪算法研究》", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
王涛等: "《基于时空背景差的带跟踪补偿目标检测方法》", 《计算机应用》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (en) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 Method and device for detecting similarity of media contents

Also Published As

Publication number Publication date
CN111899284B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN110738697B (en) Monocular depth estimation method based on deep learning
CN113269237B (en) Assembly change detection method, device and medium based on attention mechanism
US11361456B2 (en) Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN114202672A (en) Small target detection method based on attention mechanism
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN108021889A (en) A kind of binary channels infrared behavior recognition methods based on posture shape and movable information
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN112348849A (en) Twin network video target tracking method and device
CN110084201B (en) Human body action recognition method based on convolutional neural network of specific target tracking in monitoring scene
US20220044072A1 (en) Systems and methods for aligning vectors to an image
CN114399533B (en) Single-target tracking method based on multi-level attention mechanism
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113516693B (en) Rapid and universal image registration method
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN114140623A (en) Image feature point extraction method and system
CN116402851A (en) Infrared dim target tracking method under complex background
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114943870A (en) Training method and device of line feature extraction model and point cloud matching method and device
CN112669452B (en) Object positioning method based on convolutional neural network multi-branch structure
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant