CN111899284A - Plane target tracking method based on parameterized ESM network - Google Patents

Plane target tracking method based on parameterized ESM network

Info

Publication number
CN111899284A
Authority
CN
China
Prior art keywords
target
feature
data
template
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010816457.5A
Other languages
Chinese (zh)
Other versions
CN111899284B (en)
Inventor
王涛
刘贺
李浥东
郎丛妍
冯松鹤
金一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202010816457.5A
Publication of CN111899284A
Application granted
Publication of CN111899284B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, which comprises the following steps. S1: obtain the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame; determine the target region I_t of the input image from the initial motion parameter; preprocess the target template T and the target region I_t, including image scaling and normalization; and use a feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining the feature maps F_T and F_t^I. S2: use the similarity measurement module to compute the difference between the two feature maps F_T and F_t^I. S3: determine and eliminate the occluded part of the target in the current frame through an occlusion detection mechanism, and solve for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame. The method is suitable for target tracking tasks and greatly improves tracking accuracy.

Description

Planar target tracking method based on parameterized ESM network
Technical Field
The invention relates to the field of machine vision and pattern recognition, in particular to a planar target tracking method based on a parameterized ESM network.
Background
In planar object tracking, a video sequence is given and a planar object of interest is specified in the first frame; the goal of the tracking algorithm is to calculate the pose change of that planar object in the subsequent video frames. As a core problem in computer vision, planar object tracking has applications in many fields, such as augmented reality, robot control and unmanned aerial vehicle technology.
Patent document No. 201510147895.6 discloses a moving object tracking method based on bit planes. The method obtains a smoothed brightness bit plane and a local binary pattern bit plane from the tracking target and the search area; it then searches, on the two bit planes of the search area, for the regions closest to the two appearance models of the tracking target and takes them as the tracking result; after tracking is finished, the appearance models are updated according to the established appearance models, the tracking result in the current frame and a preset update rate. The method has clear advantages in tracking precision and robustness, and effectively alleviates the difficulty of tracking a moving target in a video under illumination changes, target pose changes, obvious appearance changes and similar conditions.
Patent document No. 201910297980.9 discloses a moving target tracking method based on template matching and a deep classification network, which mainly addresses the slow target detection and inaccurate tracking under deformation and occlusion in the prior art. The scheme derives a template network and a detection network from a double-residual deep classification network; template features and detection features are extracted from the template and the detection area with the corresponding networks; the template features are matched against the detection features to obtain a template matching map; the target position is determined from the template matching map; and the template features are updated while tracking the target position. The method is fast and accurate, and is used to track video targets with severe deformation and illumination change.
Patent document No. 201510147895.6 solves, to a certain extent, the difficulty of tracking a target under conditions such as illumination change and appearance change in a video. However, although it models the brightness and texture of the target in an elaborately designed way, such a manually designed modeling method cannot accurately reflect the appearance characteristics of the target. Patent document No. 201910297980.9 adopts a deep network as a feature extractor, but the feature extractor is trained on a classification task rather than being embedded in the video tracking task within an end-to-end framework for training and validation, and a simple sliding-window convolution is used to compute the feature response map. In practice, sliding-window convolution is not necessarily applicable to deep feature maps. In addition, neither of these inventions considers the situation in which part of the target is occluded or outside the field of view.
Disclosure of Invention
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network, which overcomes the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A planar target tracking method based on a parameterized ESM network constructs a deep planar object tracking model. The deep planar object tracking model comprises a feature extraction network, a similarity measurement module and an occlusion detection mechanism, and a data set is constructed to train the deep planar object tracking model. The planar target tracking method comprises the following steps:
S1, obtaining the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, determining the target region I_t of the input image according to the initial motion parameter, preprocessing the target template T and the target region I_t, including image scaling and normalization, and using the feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining feature maps F_T and F_t^I; the dimensions of the preprocessed template and target region are h × l × 3, where h, l and 3 are respectively the width of the image, the length of the image and the number of channels;

S2, calculating the difference between the two feature maps F_T and F_t^I using the similarity measurement module;

S3, determining and eliminating the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame.
Preferably, the tracking of each frame in the video is divided into two stages; the motion parameter obtained as the final tracking result of the second stage is used, in the next iteration, as the initial motion parameter of the first stage.
Preferably, the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the number of convolution kernels of the first 6 convolutional layers is 64 and that of the last convolutional layer is 8; in the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, with k equal to 1 or 2.
Preferably, the similarity measurement module is an encoder-decoder network based on the u-net framework; its inputs are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and its output is the difference tensor between the two feature maps F_T and F_t^I.
Preferably, S3 includes:

The target region I_t of the t-th frame input image is written simply as I, and its feature map F_t^I as F_I. Given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features:

First, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix, where m = h' × l', denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region. A correlation map R of dimension m × m is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z is a trainable parameter matrix of dimension d × d. A confidence vector $\tilde{c}$ is formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector.

Finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C. The motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of an occluded feature being 0 and that of an unoccluded feature being 1; M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation.
Formula (3) is solved with the ESM method, as follows:

Let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$. The increment of the motion parameter is obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

The motion parameter is updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
Preferably, constructing the data set to train the deep planar object tracking model includes:

Two labeled data sets, GEN-DATA and OCC-DATA, are constructed. GEN-DATA covers illumination, deformation and noise factors; OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view. Each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameter and p_gt is the real motion parameter of the target.

The GEN-DATA construction process comprises geometric transformations and optical perturbations.

The geometric transformation includes:

Given a target template T and the real motion parameter p_gt of the target, the pixel points of the target template are mapped into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation.

The corner points of the target in the input image Q are each moved by d pixels in an arbitrary direction, with d an integer from 0 to 20, and the corresponding transformation matrix, i.e. the initial motion parameter p_0, is computed from the coordinates of the moved corner points.

The optical perturbation comprises:

1) adding motion blur or Gaussian blur to the input image;
2) adding Gaussian noise to the input image;
3) applying brightness variations of different degrees along a certain direction to all pixels of the input image.

The OCC-DATA construction process includes:

For each sample in GEN-DATA, points are selected on each edge of the target in the input image to form a point set of size N_P; N points (0 ≤ N ≤ N_P) are selected at random and connected in sequence, dividing the target region in the video frame into several parts; one part is selected at random and filled with the pattern of another picture to simulate occlusion.

The data sets GEN-DATA and OCC-DATA are each divided into a training set and a validation set in the ratio 8:2, used to train the model and to verify its performance.

During training, the feature extraction network and the similarity measurement module are first trained with GEN-DATA without adding the occlusion detection mechanism; after this training is finished, the parameters of the feature extraction network and the similarity measurement module are fixed, the occlusion detection mechanism is trained with OCC-DATA, and the parameters of the feature extraction network and the similarity measurement module are fine-tuned at the same time.
The loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation.
According to the technical scheme provided by the embodiment of the invention, a planar target tracking method based on a parameterized ESM network is provided. Through the trainable feature extraction module and a sufficient training set, the feature extraction module learns a robust feature representation, which to a certain extent effectively solves the problem that traditional hand-designed features cannot accurately reflect the appearance characteristics of the target during tracking. The jointly trained feature extraction module and similarity measurement module resolve the incompatibility between deep features and traditional similarity measures. The occlusion detection mechanism assists the model solution, making the model more robust to partial occlusion. Meanwhile, the loss function in logarithmic form prevents the training process of the model from being dominated by samples with a large loss. Because the invention uses Microsoft's COCO data set as raw material to construct the data set, with sufficient training samples and an end-to-end network construction, the planar object tracking accuracy of the model is much higher than that of traditional methods and existing deep-network-based methods.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 2 is a GEN-DATA generation effect diagram of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 3 is a diagram illustrating an OCC-DATA generation effect of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 4 is a flowchart of a similarity measurement module of a planar target tracking method based on a parameterized ESM network according to an embodiment of the present invention;
fig. 5 is a schematic diagram of selecting a tracking target in a first frame according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides a planar target tracking method based on a parameterized ESM network. As shown in FIG. 1, a deep planar object tracking model is constructed, consisting of a feature extraction network, a metric learning layer (ML Layer, i.e. the similarity measurement module) and an occlusion detection mechanism (CMG), and a data set is constructed to train the deep planar object tracking model.
To train the model, two labeled data sets, GEN-DATA and OCC-DATA, are constructed from the MS-COCO data set. GEN-DATA mainly covers factors such as illumination, deformation and noise; OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view. Each sample in the two data sets is a quadruple (T, Q, p_0, p_gt), whose elements are respectively the template image, the current input image, the initial motion parameter and the real motion parameter of the target. The template image comes from a template pool constructed from MS-COCO, i.e. a picture in MS-COCO is scaled to a picture whose length and width are between 80 and 160 pixels.
The construction process of the data set GEN-DATA mainly involves geometric transformation and optical perturbation, as shown in fig. 2.
The geometric transformation is performed as follows:

1) Given a target template T and the real motion parameter p_gt of the target, the pixel points of the target template are mapped into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation.

2) The corner points of the target in the input image Q are each shifted by d pixels in an arbitrary direction, with d an integer from 0 to 20. The corresponding transformation matrix, i.e. the initial motion parameter p_0, is computed from the coordinates of the shifted corner points.
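As an illustration of these two steps, the following is a minimal sketch using OpenCV and NumPy; the corner coordinates, the helper name perturb_corners and the use of cv2.getPerspectiveTransform to recover the matrices are assumptions made for the example, not implementation details taken from the patent.

```python
import cv2
import numpy as np

def perturb_corners(corners, max_shift=20, rng=None):
    """Shift each corner by up to max_shift pixels in a random direction."""
    if rng is None:
        rng = np.random.default_rng()
    shifts = rng.integers(-max_shift, max_shift + 1, size=corners.shape)
    return (corners + shifts).astype(np.float32)

# Template corners for a template of width l and height h (example values).
h, l = 120, 160
tmpl_corners = np.float32([[0, 0], [l, 0], [l, h], [0, h]])

# Ground-truth corners of the target inside the input image Q (example values).
gt_corners = np.float32([[210, 95], [380, 110], [370, 240], [200, 230]])

# p_gt: transformation matrix mapping template pixels into the input image.
p_gt = cv2.getPerspectiveTransform(tmpl_corners, gt_corners)

# p_0: initial motion parameter recomputed from the perturbed corners.
p_0 = cv2.getPerspectiveTransform(tmpl_corners, perturb_corners(gt_corners))
```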
The optical perturbation is implemented specifically as follows:
1) adding motion blur or gaussian blur on the input image;
2) adding Gaussian noise to an input image;
3) different degrees of brightness variation are implemented in a certain direction (e.g., from top to bottom, or from left to right) for all pixels on the input image.
The data set OCC-DATA is generated as follows:
For each sample in GEN-DATA, points are selected on each edge of the target in the input image to form a point set of size N_P. N points (0 ≤ N ≤ N_P) are then selected at random and connected in sequence, so that the target region in the video frame is divided into several parts. One part is then selected at random and filled with the pattern of another picture to simulate occlusion, as shown in fig. 3.
The two data sets GEN-DATA and OCC-DATA are each divided into a training set and a validation set in the ratio 8:2, used respectively to train the model and to validate its performance.
The features learned from a large amount of data can reflect the appearance characteristics of the target. In the training process, the occlusion detection mechanism is first left out, and the feature extraction network and the similarity measurement module are trained with GEN-DATA. After this training is completed, the parameters of these two modules are fixed, the occlusion detection mechanism is trained with OCC-DATA, and the parameters of the feature extraction network and the similarity measurement module are fine-tuned.
The loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation. The sum of the corner-point distances is embedded in a logarithmic function so that samples with a large loss do not dominate the whole training process.
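A minimal PyTorch sketch of such a corner-distance loss is shown below; it assumes the motion parameters are stored as 3 × 3 homography tensors, and averaging over the corners inside the logarithm is one plausible reading of the formula above, so names and details are illustrative.

```python
import torch

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) tensor of corner coordinates."""
    ones = torch.ones(pts.shape[0], 1, dtype=pts.dtype, device=pts.device)
    homog = torch.cat([pts, ones], dim=1) @ H.T          # (N, 3) homogeneous coordinates
    return homog[:, :2] / homog[:, 2:3]

def corner_log_loss(H_pred, H_gt, corners):
    """log(1 + mean corner distance): large-error samples do not dominate training."""
    d = torch.norm(warp_points(H_pred, corners) - warp_points(H_gt, corners), dim=1)
    return torch.log1p(d.mean())
```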
After the training of the deep planar object tracking model is completed, the target tracking process is as follows:
the tracking of each frame is divided into two stages, specifically:
and in the next iteration process, the motion parameter of the last tracking result of the second stage is used as the initial motion parameter of the first stage in the current iteration.
Taking the first stage as an example:
S1: First, obtain the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, and determine the target region I_t of the input image from the initial motion parameter. Preprocess the target template T and the target region I_t, including image scaling and normalization. Then use the feature extraction network to extract features from the target template T and the target region I_t of the t-th frame input image, obtaining the feature maps F_T and F_t^I. The dimensions of the preprocessed template and target region are h × l × 3, where h, l and 3 are respectively the width of the image, the length of the image and the number of channels.
The feature extraction network of each stage consists of 7 convolutional layers, and each convolutional layer is followed by a BatchNorm layer and a ReLU activation layer. The first 6 convolutional layers have 64 convolution kernels each, and the last convolutional layer has 8. In the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, with k equal to 1 or 2; in the first stage, for example, k = 1. A possible realization is sketched below.
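In the sketch, the 3 × 3 kernel size, padding of 1 and 3 input channels are assumptions; the layer count, kernel numbers and stride pattern follow the description above.

```python
import torch.nn as nn

def make_feature_extractor(stage: int) -> nn.Sequential:
    """7 conv layers, each with BatchNorm + ReLU; the first (4 - stage) layers use stride 2."""
    layers, in_ch = [], 3
    for i in range(7):
        out_ch = 64 if i < 6 else 8                      # 64 kernels for layers 1-6, 8 for layer 7
        stride = 2 if i < 4 - stage else 1               # stage 1: 3 strided layers, stage 2: 2
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

stage1_net = make_feature_extractor(stage=1)             # used in the first tracking stage
stage2_net = make_feature_extractor(stage=2)             # used in the second tracking stage
```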
S2: Use the similarity measurement module to compute the difference between the two feature maps F_T and F_t^I. The similarity measurement module is an encoder-decoder network based on the u-net framework; its inputs are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and its output is the difference tensor between the two feature maps, as shown in fig. 4.
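The following is a heavily simplified PyTorch sketch of such an encoder-decoder similarity module; the channel widths, the single downsampling step and the single skip connection are assumptions made only to illustrate the input/output behaviour (two feature maps in, one difference tensor out), not the patent's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityModule(nn.Module):
    def __init__(self, feat_ch: int = 8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(2 * feat_ch, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(64, feat_ch, 3, padding=1)   # 64 = 32 (decoder) + 32 (skip)

    def forward(self, f_template, f_target):
        x1 = self.enc1(torch.cat([f_template, f_target], dim=1))   # encode the concatenated maps
        x2 = self.enc2(x1)                                          # downsample once
        up = F.interpolate(self.dec(x2), size=x1.shape[-2:],
                           mode="bilinear", align_corners=False)    # decode back to input size
        return self.out(torch.cat([up, x1], dim=1))                 # per-position difference tensor
```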
S3, determining and eliminating the occluded part of the target in the current frame by using an occlusion detection mechanism, and solving the motion parameter of the target by minimizing the difference of the unoccluded part in the current frame.
The detection process of the occlusion detection mechanism is as follows.

To describe the process more clearly, the target region I_t of the t-th frame input image is written simply as I and its feature map F_t^I as F_I. Given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image (both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features):

First, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix (where m = h' × l'), denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region. A correlation map R (of dimension m × m) is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z (of dimension d × d) is a trainable parameter matrix. A confidence vector $\tilde{c}$ is then formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector.

Finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C. The motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection (in theory, the contribution of an occluded feature is 0 and that of an unoccluded feature is 1); M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation.
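For illustration, the following is a rough PyTorch sketch of this occlusion-detection step: a bilinear correlation map, a row-wise maximum, normalization to [0, 1] and reshaping to h' × l'. The use of min-max normalization, the row ordering of the reshape and the class and variable names are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class OcclusionDetector(nn.Module):
    def __init__(self, d: int = 8):
        super().__init__()
        self.Z = nn.Parameter(torch.eye(d))               # trainable d x d matrix Z

    def forward(self, f_template, f_target):
        # f_template, f_target: (d, h', l') L2-normalized feature maps
        d, h, l = f_template.shape
        ft = f_template.reshape(d, -1).T                  # unrolled template features: (m, d)
        fi = f_target.reshape(d, -1).T                    # unrolled target-region features: (m, d)
        R = ft @ self.Z @ fi.T                            # (m, m) correlation map, as in formula (1)
        c = R.max(dim=1).values                           # row-wise maximum, as in formula (2)
        c = (c - c.min()) / (c.max() - c.min() + 1e-8)    # normalize to [0, 1] (min-max assumed)
        return c.reshape(h, l)                            # confidence map C of size h' x l'
```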
The ESM method is adopted to solve formula (3), as follows:

Let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$. The increment of the motion parameter can be obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

The motion parameters are updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
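As a purely illustrative sketch of one such update, the snippet below stacks the per-pixel residuals and the two Jacobians and takes a pseudo-inverse step; the additive 8-parameter increment with a33 fixed to 1 and the composition by matrix product are conventions assumed here for the example, not details confirmed by the patent.

```python
import numpy as np

def esm_step(p, residual, J_T, J_E):
    """One ESM update; residual stacks E(x; p), J_T and J_E are (m, 8) Jacobians."""
    J = J_T + J_E                                        # combined Jacobian, as in the update formula
    delta = -np.linalg.pinv(J) @ residual                # pseudo-inverse solution for delta p
    dp = np.eye(3)
    dp.flat[:8] += delta                                 # additive 8-DoF increment, a33 kept at 1
    return p @ dp                                        # compose the increment with the current p
```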
The specific process of the second stage is similar to the method of the first stage, and is not described herein again.
The embodiment of the invention provides a planar target tracking process based on a parameterized ESM network, which comprises the following steps:
(1) in the first frame, the tracked target area is determined by calibrating the corner points of the target. As shown in fig. 5, the target template is inside the rectangular frame.
Taking fig. 5 as an example, when the target is calibrated, the real motion parameter p_1^gt of the target for the first frame can be computed from the coordinates of its four corner points. The computation of p_1^gt is as follows:

Assume the width and height of the template are l and h. In fig. 5, the coordinate system of the template is established with point 1 as the origin, so the coordinates of points 1 to 4 are (0, 0), (0, l), (h, l) and (h, 0), respectively. The coordinate system of the image is established with the upper-left corner of the frame image as the origin, and the coordinates of points 1 to 4 in the image are (x1, y1), (x2, y2), (x3, y3) and (x4, y4). Setting a33 = 1, the transformation matrix p_1^gt is obtained by solving the perspective transformation formula given above in inverse for these four pairs of corresponding points.
(2) Starting from the second frame, the real motion parameter p_1^gt of the first frame is taken as the initial motion parameter p of the second frame, and the input image Q is warped through W(·; p) to obtain an image block (patch) of the same size as the template, i.e. the target region. The template and the target region are then preprocessed, features are extracted, similarity is measured and occlusion is detected, and p is updated through the repeatedly iterated ESM solving process. The finally updated p_1 is taken as the tracking result of this frame.
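A small sketch of this patch-extraction step is given below, assuming OpenCV and a 3 × 3 homography p that maps template coordinates into the frame; the function name is illustrative.

```python
import cv2

def extract_target_patch(frame, p, template_size):
    """Warp the frame with the inverse of p so the target lands on a template-sized patch."""
    h, l = template_size                                  # template height and width
    # p maps template coordinates to frame coordinates, so warp with the inverse mapping
    return cv2.warpPerspective(frame, p, (l, h), flags=cv2.WARP_INVERSE_MAP)
```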
(3) In the subsequent frame, the process is similar to (2).
In summary, embodiments of the present invention provide a parameterized ESM network-based planar target tracking method, which uses a trainable metric module to calculate the difference between depth features and uses a trainable occlusion detection mechanism to assist the optimization process. In addition, a large number of samples with labels are generated to simulate a real tracking scene, and the training process of the model is supervised through a designed loss function, so that the generated target tracking samples are used for training a feature extraction network, a similarity measurement module and an occlusion detection mechanism in an end-to-end mode. Compared with a feature extractor trained on an image classification task and a traditional sliding window convolution method, the training method enables the model to be more suitable for a target tracking task, and the learned measurement method is more compatible with deep convolution features than the traditional sliding window convolution method, so that the tracking accuracy is greatly improved.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; for the identical or similar parts among the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments, and for the relevant parts reference may be made to the partial descriptions of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A planar target tracking method based on a parameterized ESM network, characterized in that a deep planar object tracking model is constructed, the deep planar object tracking model comprising a feature extraction network, a similarity measurement module and an occlusion detection mechanism, and a data set is constructed to train the deep planar object tracking model; the planar target tracking method comprises the following steps:

S1, obtaining the target template T, the input image of the t-th frame and the initial motion parameter for the t-th frame, determining the target region I_t of the input image according to the initial motion parameter, preprocessing the target template T and the target region I_t, including image scaling and normalization, and using the feature extraction network to extract features from the preprocessed target template T and the target region I_t of the t-th frame input image, obtaining feature maps F_T and F_t^I, wherein the dimensions of the preprocessed template and target region are h × l × 3, and h, l and 3 are respectively the width of the image, the length of the image and the number of channels;

S2, calculating the difference between the two feature maps F_T and F_t^I using the similarity measurement module;

S3, determining and eliminating the occluded part of the target in the current frame through the occlusion detection mechanism, and solving for the motion parameter of the target by minimizing the difference over the unoccluded part in the current frame.
2. The method according to claim 1, wherein the tracking of each frame in the video is divided into two stages, and the motion parameter obtained as the final tracking result of the second stage is used, in the next iteration, as the initial motion parameter of the first stage.
3. The method according to claim 2, wherein the feature extraction network of each stage consists of 7 convolutional layers, each followed by a BatchNorm layer and a ReLU activation layer; the number of convolution kernels of the first 6 convolutional layers is 64 and that of the last convolutional layer is 8; in the k-th stage, the first 4-k of the 7 convolutional layers have a stride of 2 and the remaining convolutional layers have a stride of 1, k being 1 or 2.
4. The method according to claim 1, wherein the similarity measurement module is an encoder-decoder network based on the u-net framework, the inputs of which are the feature map F_T of the target template T and the feature map F_t^I of the target region I_t of the t-th frame input image, and the output of which is the difference tensor between the two feature maps F_T and F_t^I.
5. The method according to claim 1, wherein S3 includes:

the target region I_t of the t-th frame input image is written simply as I, and its feature map F_t^I as F_I; given the L2-normalized feature maps F_T and F_I of the template and of the target region of the t-th frame input image, both of dimension h' × l' × d, where h' and l' correspond to the width and length of the extracted feature maps, $h' = h / 2^{4-k}$, $l' = l / 2^{4-k}$, k is 1 or 2, and d is the dimension of the features;

first, taking each feature as a unit, F_T and F_I are each unrolled along the h' direction into an m × d matrix, where m = h' × l', denoted $\tilde{F}_T$ and $\tilde{F}_I$; $\tilde{F}_T$ is the unrolled feature map of the template T and $\tilde{F}_I$ is the unrolled feature map of the target region; a correlation map R of dimension m × m is then computed to record the similarity of each pair of features:

$R_{i,j} = \tilde{F}_T^{(i)}\, Z\, \big(\tilde{F}_I^{(j)}\big)^{\top} \qquad (1)$

where i and j index the features in the feature maps of the target template T and of the target region respectively, R_{i,j} is the similarity between the i-th feature of the template feature map and the j-th feature of the target-region feature map, and Z is a trainable parameter matrix of dimension d × d; a confidence vector $\tilde{c}$ is formed by selecting the maximum value of each row of R:

$\tilde{c}_i = \max_{j} R_{i,j} \qquad (2)$

$\tilde{c}$ is then normalized to the interval [0, 1] and used as the final confidence vector;

finally, the confidence vector $\tilde{c}$ is rearranged row by row into a map of size h' × l', denoted C, and the motion parameter of the target is solved by minimizing the difference over the unoccluded part, as in the following formula:

$\hat{p} = \arg\min_{p} \sum_{x} C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big) \qquad (3)$

where p is the currently predicted target motion parameter; x is the two-dimensional index of a feature in the feature map; C(·) is the contribution of the feature at that position to the optimization after occlusion detection, the contribution of an occluded feature being 0 and that of an unoccluded feature being 1; M(·,·) measures the difference of each pair of features in the template and the target region; and W(x; p) denotes the coordinate transformation;
formula (3) is solved with the ESM method, as follows:

let $E(x; p) = C(x)\, M\big(F_T(x),\, F_I(W(x; p))\big)$; the increment of the motion parameter is obtained from:

$\Delta p = -\big(J_T + J_E(p)\big)^{+}\, E(p)$

where $(\cdot)^{+}$ denotes the pseudo-inverse of a matrix, J_T is the Jacobian of E(x; p) computed at the unit transformation U, and J_E(p) is the Jacobian of E(x; p) at p:

$J_E(p) = \frac{\partial E(x; p)}{\partial p}$

the motion parameter is updated with the increment Δp:

$p \leftarrow p \circ \Delta p$

where ∘ denotes a binary (composition) operation.
6. The method according to claim 1, wherein constructing the data set to train the deep planar object tracking model comprises:

constructing two labeled data sets GEN-DATA and OCC-DATA, wherein GEN-DATA covers illumination, deformation and noise factors, OCC-DATA is built on GEN-DATA and additionally covers the cases in which the target is partially occluded or partially outside the field of view, and each sample in the data sets GEN-DATA and OCC-DATA is a quadruple (T, Q, p_0, p_gt), where T is the template image, Q is the current input image, p_0 is the initial motion parameter and p_gt is the real motion parameter of the target;

the GEN-DATA construction process comprises geometric transformations and optical perturbations;

the geometric transformation includes:

given a target template T and the real motion parameter p_gt of the target, mapping the pixel points of the target template into the input image Q through the perspective transformation formula:

$\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad x = \frac{x'}{w}, \quad y = \frac{y'}{w}$

where the 3 × 3 matrix of elements a_11, ..., a_33 is the transformation matrix, (u, v) are the coordinates of a pixel, and (x, y) are the coordinates of the pixel after the perspective transformation;

moving the corner points of the target in the input image Q by d pixels each in an arbitrary direction, with d an integer from 0 to 20, and computing the corresponding transformation matrix, i.e. the initial motion parameter p_0, from the coordinates of the moved corner points;

the optical perturbation comprises:

1) adding motion blur or Gaussian blur to the input image;
2) adding Gaussian noise to the input image;
3) applying brightness variations of different degrees along a certain direction to all pixels of the input image;

the OCC-DATA construction process includes:

for each sample in GEN-DATA, selecting points on each edge of the target in the input image to form a point set of size N_P, randomly selecting N points (0 ≤ N ≤ N_P) and connecting them in sequence so as to divide the target region in the video frame into several parts, and randomly selecting one part and filling it with the pattern of another picture to simulate occlusion;

dividing each of the data sets GEN-DATA and OCC-DATA into a training set and a validation set in the ratio 8:2, for training the model and verifying its performance;

during training, first training the feature extraction network and the similarity measurement module with GEN-DATA without adding the occlusion detection mechanism; after this training is finished, fixing the parameters of the feature extraction network and the similarity measurement module, training the occlusion detection mechanism with OCC-DATA, and fine-tuning the parameters of the feature extraction network and the similarity measurement module at the same time;
the loss function used in the training process is:

$L(\hat{p}, p_{gt}) = \log\!\Big(1 + \frac{1}{N}\sum_{q=1}^{N}\big\| W(r_q; \hat{p}) - W(r_q; p_{gt}) \big\|_2\Big)$

where $\hat{p}$ is the target motion parameter predicted by the model, p_gt is the real motion parameter of the target, N is the number of target corner points, r_q are the coordinates of the corner points, and W(·;·) denotes the coordinate transformation.
CN202010816457.5A 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network Active CN111899284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010816457.5A CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Publications (2)

Publication Number Publication Date
CN111899284A true CN111899284A (en) 2020-11-06
CN111899284B CN111899284B (en) 2024-04-09

Family

ID=73229031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010816457.5A Active CN111899284B (en) 2020-08-14 2020-08-14 Planar target tracking method based on parameterized ESM network

Country Status (1)

Country Link
CN (1) CN111899284B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (en) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 Method and device for detecting similarity of media contents

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324956A (en) * 2008-07-10 2008-12-17 上海交通大学 Method for tracking anti-shield movement object based on average value wander
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN110796680A (en) * 2019-08-09 2020-02-14 北京邮电大学 Target tracking method and device based on similar template updating

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU ZIJIAN: "《Netted Radar Tracking with Multiple Simultaneous Transmissions against Combined PDS Interception》", 《 JOURNAL OF SENSORS》 *
王涛: "《基于核相关滤波的长时间目标跟踪算法研究》", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
王涛等: "《基于时空背景差的带跟踪补偿目标检测方法》", 《计算机应用》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (en) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 Method and device for detecting similarity of media contents

Also Published As

Publication number Publication date
CN111899284B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN110738697B (en) Monocular depth estimation method based on deep learning
CN113269237B (en) Assembly change detection method, device and medium based on attention mechanism
US11361456B2 (en) Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN114202672A (en) Small target detection method based on attention mechanism
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN108021889A (en) A kind of binary channels infrared behavior recognition methods based on posture shape and movable information
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN112348849A (en) Twin network video target tracking method and device
CN110084201B (en) Human body action recognition method based on convolutional neural network of specific target tracking in monitoring scene
US20220044072A1 (en) Systems and methods for aligning vectors to an image
CN114399533B (en) Single-target tracking method based on multi-level attention mechanism
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113516693B (en) Rapid and universal image registration method
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN114140623A (en) Image feature point extraction method and system
CN116402851A (en) Infrared dim target tracking method under complex background
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114943870A (en) Training method and device of line feature extraction model and point cloud matching method and device
CN112669452B (en) Object positioning method based on convolutional neural network multi-branch structure
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant