CN112541944B - Probability twin target tracking method and system based on conditional variational encoder - Google Patents


Info

Publication number
CN112541944B
CN112541944B (application CN202011434400.5A)
Authority
CN
China
Prior art keywords
encoder
prior
search image
hidden space
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011434400.5A
Other languages
Chinese (zh)
Other versions
CN112541944A (en)
Inventor
郭胤辰
黄文慧
李丰泽
何伟
薛婧一
胡意廷
侯宛辰
周媛媛
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202011434400.5A
Publication of CN112541944A
Application granted
Publication of CN112541944B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a probabilistic twin target tracking method and system based on a conditional variational encoder, belonging to the technical field of target tracking and recognition for intelligent robots. A template image is input into a trained shared convolutional neural network to obtain template image features; the current frame, serving as a search image, is input into the shared convolutional neural network and a prior encoder to obtain, respectively, search image features and the prior-encoder hidden space. A cross-correlation operation is performed between the template image features and the search image features, and the result is concatenated with the mean sample of the prior-encoder hidden space. The concatenated result is input into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target. The invention introduces uncertainty learning into target tracking and obtains a complete probability distribution; at low computational cost it can generate consistent target states, and by adding noise to the neurons it introduces regularization into the deep neural network to prevent overfitting and improve robustness.

Description

Probability twin target tracking method and system based on conditional variational encoder
Technical Field
The invention relates to the technical field of target tracking and recognition for intelligent robots, in particular to a probabilistic twin target tracking method and system based on a conditional variational encoder.
Background
To interact with its environment, a human-machine-environment co-fusion robot needs the basic capability of estimating the state of that environment. Specifically, the robot must first localize a target before it can interact with it and pursue further operational objectives. Tracking a moving target enables a mobile robot to accomplish a variety of real-world tasks.
Deep convolutional neural networks have increasingly become the dominant approach in the field of target tracking owing to their significantly improved performance compared with conventional methods. However, in some very complex and difficult tracking scenarios, even leading target tracking algorithms are prone to target loss or loss of accuracy, and their predictions often carry poorly calibrated confidence.
These algorithms produce deterministic features or regression maps through embedded deep learning models such as convolutional neural networks. An inaccurate deterministic regression map produced with low confidence can lead to catastrophic consequences and fails to provide a basis for further operations or timely human intervention. In addition, during regression, the maximum value of the predicted regression map in the final output is often mistakenly taken to reflect the uncertainty of the model. However, the model may still output a high regression response even when it lacks confidence.
Target tracking typically requires a large amount of training data to train a mature network model. Although large datasets aim to provide clean labels for training and testing, they carry some inherent label uncertainty due to the limitations of labeling methods and the preferences of annotators. Using ambiguous data, or the noise inherent in the observations (also referred to as data uncertainty or aleatoric uncertainty), results in inherent uncertainty at test time. For example, if uncertainty in the training data is present in boundary regions, it will lead to higher prediction uncertainty in those regions.
Disclosure of Invention
The invention aims to provide a probabilistic twin target tracking method and system based on a conditional variational encoder, which introduce uncertainty learning into target tracking, acquire a complete probability distribution, generate consistent target states, prevent overfitting, and improve robustness, so as to solve at least one technical problem in the background art.
In order to achieve the purpose, the invention adopts the following technical scheme:
in one aspect, the present invention provides a probability twin target tracking method based on a conditional variational encoder, including:
extracting a frame of a video sequence as the first frame for target tracking, calibrating the target object to be tracked in the first frame, and initializing the target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image features;
inputting the current frame, as a search image, into the shared convolutional neural network and a prior encoder to obtain, respectively, search image features and the prior-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target.
Preferably, the training of the shared convolutional neural network comprises:
concatenating the search image with the ground-truth calibration mask and inputting the result into a recognition encoder to obtain the recognition-encoder hidden space;
calculating the KL loss, used as the bounding-box regression loss, between the prior-encoder hidden space and the recognition-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with a random sample from the prior-encoder hidden space;
inputting the concatenated result into a mask decoder to obtain a binary mask;
calculating the cross-entropy loss between the binary mask and the ground-truth calibration mask;
weighting the KL loss and the cross-entropy loss to obtain the loss value of the network;
and optimizing the network according to the loss value by stochastic gradient descent, training iteratively until the evidence lower bound is minimized, to obtain the trained shared convolutional neural network.
Preferably, minimizing the evidence lower bound comprises:
establishing, with the recognition encoder, a mapping between the target segmentation and its position with uncertainty;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from that distribution to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross-entropy loss;
and using the KL divergence to penalize the distance between the recognition encoder and the prior encoder, combining the cross-entropy loss and the KL divergence to obtain the evidence lower bound of the shared convolutional neural network.
Preferably, obtaining the prior-encoder hidden space comprises:
applying multiple prior encoders to the same search image; supervising the probabilistic output of the prior encoder with the ground-truth label through the bounding-box regression loss, and generating a complete probability distribution that encodes all possible features into a hidden space Ω; the prior encoder has parameters φ and estimates the feature variants of the original search image X; its output distribution is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
Preferably, randomly sampling from the hidden space comprises:
converting the process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) into randomly sampling noise o′ that follows a normal distribution, with the sampling of Z expressed as: Z = μ + o′·σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the standard deviation of the hidden space, and N(0, I) denotes the standard normal distribution.
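The reparameterized sampling above can be sketched in NumPy (a stand-in for the patent's network code; the function name and the Monte-Carlo check are illustrative):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample Z = mu + o' * sigma with o' ~ N(0, I): the randomness is
    isolated in o', so gradients can flow through mu and sigma."""
    o = rng.standard_normal(mu.shape)   # noise o' ~ N(0, I)
    return mu + o * sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])              # hidden-space mean
sigma = np.array([0.5, 0.1])            # hidden-space standard deviation
# drawing many samples lets us check that Z indeed follows N(mu, sigma)
zs = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
```

The empirical mean and standard deviation of `zs` approach `mu` and `sigma`, confirming that shifting and scaling standard-normal noise reproduces the target Gaussian while keeping the sampling step differentiable in μ and σ.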
Preferably, concatenating the mean sample of the prior-encoder hidden space comprises:
in the i-th iteration, i ∈ {1, 2, …, m}, where m is a positive integer, randomly sampling Z_i from the probability distribution of the prior encoder: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that it has the same dimensions as the segmentation mask, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
f_i = g_comb(concat(g_Siamese, Z_i); τ)
where the function g_comb consists of three successive 1 × 1 convolutional layers, Σ denotes the parameters of the twin network g_Siamese, and τ denotes the parameters of the convolutional layers in g_comb.
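The broadcast-concatenate-combine step can be sketched as follows. All shapes are hypothetical, and the ReLU activations between the 1 × 1 convolutions are an assumption (the patent specifies only three successive 1 × 1 convolutional layers):

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels:
    x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(w.shape[0], h, wd)

def combine(g_siamese, z, ws):
    """Broadcast the latent sample z over the spatial map, concatenate it
    with the cross-correlation response, and apply three 1x1 conv layers
    (the function g_comb with parameters tau in the text)."""
    _, h, w = g_siamese.shape
    z_map = np.broadcast_to(z[:, None, None], (z.size, h, w))
    x = np.concatenate([g_siamese, z_map], axis=0)
    for w_i in ws:
        x = np.maximum(conv1x1(x, w_i), 0.0)   # ReLU between layers (assumed)
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((32, 17, 17))          # hypothetical correlation map
z = rng.standard_normal(6)                     # latent sample Z_i
ws = [rng.standard_normal((32, 38)) * 0.1,     # 38 = 32 map + 6 latent channels
      rng.standard_normal((32, 32)) * 0.1,
      rng.standard_normal((32, 32)) * 0.1]
out = combine(g, z, ws)
```

Broadcasting tiles the D-dimensional sample to every spatial location, so the same latent hypothesis conditions the entire segmentation prediction.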
Preferably, obtaining the binary mask comprises:
inputting the concatenated features f_i into a mask decoder g_decoder to generate a segmentation mask:
M_i = g_decoder(f_i; θ)
where f_i denotes the concatenated features and θ denotes the mask decoder parameters.
In a second aspect, the present invention provides a probabilistic twin target tracking system based on a conditional variational encoder, comprising:
the image acquisition module is used for extracting a certain frame of image in the video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation between the template image features and the search image features and concatenating the result with the mean sample of the prior-encoder hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and the bounding box of the target, so as to realize the positioning of the target.
In a third aspect, the invention provides a computer apparatus comprising a memory and a processor, the processor and the memory being in communication with each other, the memory storing program instructions executable by the processor, the processor invoking the program instructions to perform the method as described above.
In a fourth aspect, the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method as described above.
The invention has the beneficial effects that:
aiming at the problem that a trained network has inherent uncertainty and still outputs a deterministic result, a probability twin target tracking method and system based on a variational encoder are provided, uncertainty learning is introduced into target tracking, and a target tracking network model comprising a Bayesian network is provided to generate complete probability distribution.
When there are frames with ambiguity that require multiple reasonable assumptions, the proposed method of the present disclosure can produce multiple consistent target states with only a low amount of computation.
Aiming at the problem that abundant real value calibration information in a large data set is not fully utilized, the real calibration value and corresponding training data are combined to be used as condition information in the training process and input into an encoder of a condition variation encoder based on a supervision model.
To prevent the over-fitting problem, noise insertion prediction is derived from low-dimensional implicit spatial sampling. By adding noise to the neurons, regularization is introduced into the deep neural network to prevent an overfitting phenomenon, and robustness is further improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a probabilistic twin target tracking method based on a conditional variational encoder according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the inherent uncertainty in regression prediction according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating inherent uncertainty in the calibration of a large dataset according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the generation of an unlimited number of segmentations and bounding boxes according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by way of the drawings are illustrative only and are not to be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
For the purpose of facilitating an understanding of the present invention, the present invention will be further explained by way of specific embodiments with reference to the accompanying drawings, which are not intended to limit the present invention.
It should be understood by those skilled in the art that the drawings are merely schematic representations of embodiments and that the elements shown in the drawings are not necessarily required to practice the invention.
Example 1
The embodiment 1 of the present invention provides a probability twin target tracking system based on a conditional variational encoder, including:
the image acquisition module is used for extracting a certain frame of image in the video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation between the template image features and the search image features and concatenating the result with the mean sample of the prior-encoder hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and the bounding box of the target, so as to realize the positioning of the target.
The probability twin target tracking method based on the conditional variational encoder is realized by utilizing the probability twin target tracking system based on the conditional variational encoder, and the method specifically comprises the following steps:
extracting a certain frame of image in a video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target.
The training of the shared convolutional neural network comprises:
concatenating the search image with the ground-truth calibration mask and inputting the result into a recognition encoder to obtain the recognition-encoder hidden space;
calculating the KL loss, used as the bounding-box regression loss, between the prior-encoder hidden space and the recognition-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with a random sample from the prior-encoder hidden space;
inputting the concatenated result into a mask decoder to obtain a binary mask;
calculating the cross-entropy loss between the binary mask and the ground-truth calibration mask;
weighting the KL loss and the cross-entropy loss to obtain the loss value of the network;
and optimizing the network according to the loss value by stochastic gradient descent, training iteratively until the evidence lower bound is minimized, to obtain the trained shared convolutional neural network.
Minimizing the evidence lower bound comprises:
establishing, with the recognition encoder, a mapping between the target segmentation and its position with uncertainty;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from that distribution to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross-entropy loss;
and using the KL divergence to penalize the distance between the recognition encoder and the prior encoder, combining the cross-entropy loss and the KL divergence to obtain the evidence lower bound of the shared convolutional neural network.
Obtaining the prior-encoder hidden space includes:
applying multiple prior encoders to the same search image; supervising the probabilistic output of the prior encoder with the ground-truth label through the bounding-box regression loss, and generating a complete probability distribution that encodes all possible features into a hidden space Ω; the prior encoder has parameters φ and estimates the feature variants of the original search image X; its output distribution is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
Randomly sampling from the hidden space includes:
converting the process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) into randomly sampling noise o′ that follows a normal distribution, with the sampling of Z expressed as: Z = μ + o′·σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the standard deviation of the hidden space, and N(0, I) denotes the standard normal distribution.
Concatenating the mean sample of the prior-encoder hidden space comprises the following steps:
in the i-th iteration, i ∈ {1, 2, …, m}, where m is a positive integer, randomly sampling Z_i from the probability distribution of the prior encoder: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that it has the same dimensions as the segmentation mask, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
f_i = g_comb(concat(g_Siamese, Z_i); τ)
where the function g_comb consists of three successive 1 × 1 convolutional layers, Σ denotes the parameters of the twin network g_Siamese, and τ denotes the parameters of the convolutional layers in g_comb.
Obtaining the binary mask includes: inputting the concatenated features f_i into a mask decoder g_decoder to generate a segmentation mask:
M_i = g_decoder(f_i; θ)
where f_i denotes the concatenated features and θ denotes the mask decoder parameters.
Example 2
The embodiment 2 of the invention provides a probability twin target tracking method based on a variational encoder, which comprises the following steps:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is then obtained by the min-max method, realizing the positioning of the target.
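The min-max extraction of a bounding box from the binary mask can be sketched as follows (NumPy; the helper name is hypothetical):

```python
import numpy as np

def bbox_from_mask(mask):
    """Axis-aligned bounding box of a binary mask via the min-max method:
    take the min/max of the foreground pixel coordinates."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                       # no foreground: target not found
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1                        # target occupies rows 2..4, cols 3..6
box = bbox_from_mask(mask)                # (x_min, y_min, x_max, y_max)
```

Returning `None` for an empty mask is one reasonable convention; a tracker might instead fall back to the previous frame's box in that case.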
In this embodiment 2, the training process of the shared convolutional neural network specifically includes:
before target position prediction, i.e. target tracking, is performed, the whole neural network needs to be trained. In the training process, besides obtaining template image features, search image features and an encoder hidden space, a search image and a real calibration mask are required to be connected in series and then input into an identification encoder to obtain an identification encoder hidden space, and then KL loss (boundary box regression loss) between a priori encoder hidden space and the identification encoder hidden space is calculated: dkl(Q||P)=Ez~Q[logQ-logP](ii) a Where P denotes the computational prior encoder implicit space, Q denotes the recognition encoder implicit space, EZ~QIndicating a desire.
Then, a cross-correlation operation is performed between the template image features and the search image features, and the result is concatenated with a random sample Z from the prior-encoder hidden space; the concatenated result is input into a mask decoder to obtain a binary mask; and the cross-entropy loss between the binary mask and the ground-truth calibration mask is calculated:
CrossEntropy = E_{Z~Q(·|M,X)}[−log(P_c(M | R(X, Z)))]; where P_c denotes the category distribution, X denotes the original search image, M denotes the ground-truth label of the original search image, and R denotes the segmentation prediction of the sample Z.
Finally, the loss value of the whole network is obtained by weighting the KL loss and the cross-entropy loss:
Loss(M, X) = CrossEntropy + β·D_KL(Q‖P); where β is a weighting parameter;
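The weighted loss above can be written out using the closed-form KL divergence between two axis-aligned Gaussians (a standard identity, not spelled out in the patent; the NumPy helper names are illustrative):

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form D_KL(Q || P) for axis-aligned Gaussians Q and P."""
    var_q, var_p = sig_q**2, sig_p**2
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

def binary_cross_entropy(mask_prob, mask_true, eps=1e-7):
    """Pixel-wise cross entropy between predicted and ground-truth masks."""
    p = np.clip(mask_prob, eps, 1 - eps)
    return -np.mean(mask_true * np.log(p) + (1 - mask_true) * np.log(1 - p))

def total_loss(mask_prob, mask_true, mu_q, sig_q, mu_p, sig_p, beta=1.0):
    """Loss(M, X) = CrossEntropy + beta * D_KL(Q || P)."""
    return (binary_cross_entropy(mask_prob, mask_true)
            + beta * kl_diag_gauss(mu_q, sig_q, mu_p, sig_p))

# with identical Q and P the KL term vanishes, leaving pure cross entropy
loss = total_loss(np.full((4, 4), 0.5), np.ones((4, 4)),
                  np.zeros(2), np.ones(2), np.zeros(2), np.ones(2))
# loss equals log 2 (~0.6931) here
```

As training drives D_KL(Q‖P) toward zero, the prior encoder learns to reproduce the recognition encoder's distribution without needing the ground-truth mask, which is exactly what permits test-time sampling from the prior alone.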
The network is optimized according to the loss value by stochastic gradient descent, with 50 epochs of iterative training over the complete dataset.
In embodiment 2 of the present invention, the obtaining process of searching the hidden space of the image recognition encoder specifically includes: the goal of training the proposed network is to minimize the evidentiary downline, Loss (M, X), the training process follows the standard training process of a conditional variational encoder. Unlike the twin model determined by training, the network model proposed in this embodiment 2 needs to further establish an effective hidden space for encoding the feature variants. Thus, this example 2 employs a recognition encoder to establish the segmentation of the object (given the original search image X and its true value tag M) and with uncertainty σrecogPosition mu of (X, M; omega) E omegarecog(X, M; omega) epsilon omega represents that the main component is a hidden space with low dimension. Identifying the probability distribution of the encoder output as Q, from which samples Z can be expressed as: z to Q (· | X, M) ═ N (μ)recog(X,M;ω),σrecog(X, M; omega)); where ω denotes identifying encoder parameters.
By combining Q with the result of the cross-correlation operation between the template image features and the search image features in the twin network g_Siamese, the sample Z generates a segmentation prediction R, and the distance between the segmentation prediction R and the ground-truth label M is measured by the cross entropy loss. Further, the KL divergence is employed to penalize the distance between the recognition encoder distribution Q and the prior encoder distribution P. Combining the cross entropy loss and the KL divergence yields the final evidence lower bound objective.
In the actual training process, as the KL divergence gradually decreases, the feature variants encoded by the recognition encoder tend toward consistency with the prior distribution.
In this embodiment 2, the parameter of the prior encoder is φ, which estimates the feature variants of the original search image X. The output distribution of the prior encoder P is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
For a general prediction, only the mean of the prior distribution is concatenated with the cross-correlation output as the sample, yielding a single segmentation and tracking bounding box. To predict an unlimited number of reasonable hypotheses, the prior encoder is sampled m times on the same search image. Through the KL loss, the ground-truth calibration label supervises the output of the prior encoder.
Therefore, the network model proposed in this embodiment 2 can generate a complete probability distribution that encodes all possible features into a hidden space. By sampling in the learned hidden space, a variety of reasonable tracking results can be obtained.
In this embodiment 2, the process of sampling from the hidden space specifically includes:
In a conditional variational encoder, the process of sampling from the learned hidden space satisfying a Gaussian distribution is not differentiable; therefore, the model cannot be trained effectively by stochastic gradient descent. This embodiment uses the reparameterization trick so that the conditional variational encoder can obtain gradients normally and errors can be back-propagated. The process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling a noise term o' that follows a normal distribution, and sampling Z is expressed as: Z = μ + o'·σ, o' ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
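The reparameterized sampling step described above can be sketched as follows; this is an illustrative numpy fragment, with the function name chosen here for clarity rather than taken from the patent:

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: Z = mu + o' * sigma with o' ~ N(0, I).
    The randomness is isolated in o', so gradients can flow through mu and sigma."""
    noise = rng.standard_normal(np.shape(mu))  # o' ~ N(0, I)
    return mu + noise * sigma
```

Because mu and sigma enter only through deterministic arithmetic, a differentiable framework can back-propagate through them while the sampled noise stays fixed.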
In the prediction phase, only the mean of the prior distribution is combined with the cross-correlation output as the extracted deterministic feature. Because the KL loss aligns the output of the prior encoder with that of the recognition encoder, which is supervised by the calibration label, the deterministic feature of this embodiment 2 built from the prior distribution mean contains the predicted calibration label information.
During training, features with inserted sampled noise are drawn from the Gaussian hidden space learned by the recognition encoder. In contrast to a traditional deterministic twin network, the method proposed in the present disclosure samples from a probability distribution with a mean and a variance, the variance entering the neural network as inserted noise. Feeding noise into the neurons regularizes the deep neural network, preventing overfitting and improving robustness.
In this embodiment 2, the process of concatenating with the mean of the prior encoder hidden space specifically includes: in the i-th iteration, i ∈ {1, 2, …, m}, where m denotes the number of extracted features, Z_i is randomly sampled from the prior distribution P: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)).
The sample Z_i is broadcast to N channels with the same dimensions as the segmentation mask, and then concatenated with the cross-correlation result in the twin network g_Siamese to obtain the concatenated feature:

F_i = g_comb(g_Siamese(·; σ), Z_i; τ)

where the function g_comb consists of three successive 1×1 convolutional layers, σ denotes the twin network g_Siamese parameters, and τ denotes the convolutional layer parameters in g_comb.
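A minimal numpy sketch of this combine step follows. It assumes the cross-correlation output is a (C, H, W) feature map and implements each 1×1 convolution as a per-pixel linear map; activation functions and exact channel counts are omitted as assumptions:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution over a (C_in, H, W) map: a per-pixel linear map to (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def g_comb(corr_feat, z, weights):
    """Broadcast the latent sample z over the spatial grid of the cross-correlation
    map, concatenate along the channel axis, then apply three 1x1 conv layers."""
    _, h, w_sp = corr_feat.shape
    z_map = np.broadcast_to(z[:, None, None], (z.shape[0], h, w_sp))
    x = np.concatenate([corr_feat, z_map], axis=0)
    for w in weights:  # three successive 1x1 convolutions (non-linearities omitted)
        x = conv1x1(x, w)
    return x
```

Broadcasting the sample before concatenation is what lets a single low-dimensional latent vector condition every spatial position of the correlation map.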
In this embodiment 2, the process of obtaining the binary mask specifically includes:
The concatenated feature F_i is input to the mask decoder g_decoder to generate the segmentation mask:

R_i = g_decoder(F_i; θ)

where θ denotes the mask decoder parameters.
The prior encoder module of the conditional variational encoder encodes the segmentation variants and is trained by using the ground-truth calibrated segmentation, establishing a probability distribution containing a mean and a variance; thus, given a search image, an unlimited number of credible feature maps can be generated. Realizing an unlimited number of credible feature maps requires no significant amount of computation, since only a small portion of the whole network needs to be evaluated repeatedly in each iteration. When extracting m features from the hidden space, the output of the prior network and the cross-correlation result in the twin network can be reused; only F_i and R_i require repeated computation.
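The reuse argument above can be sketched as follows: the backbone features and the cross-correlation map are computed once and cached, while only a light per-sample head runs m times. This is an illustrative numpy fragment; `decode` is a hypothetical stand-in for the g_comb plus g_decoder head:

```python
import numpy as np

def multiple_hypotheses(corr_feat, mu_prior, sigma_prior, decode, m=5, seed=0):
    """Decode m latent samples Z_i ~ N(mu_prior, sigma_prior) into m predictions,
    reusing the cached cross-correlation map corr_feat: the backbone and the
    cross-correlation run once, and only the light decode head runs m times."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(m):
        z_i = mu_prior + rng.standard_normal(mu_prior.shape) * sigma_prior
        results.append(decode(corr_feat, z_i))
    return results
```

The per-hypothesis cost is thus only the combine/decode head, which is why an arbitrary number of hypotheses stays cheap.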
Example 3
As shown in fig. 1, in embodiment 3 of the present invention, uncertainty learning is introduced into target tracking, and a target tracking network model including a Bayesian network is proposed to generate a complete probability distribution. In the figure, Template Image denotes the template image, Search Image denotes the search image, Conv Layers denotes the convolutional layer network, Prior Encoder denotes the prior encoder, Recognition Encoder denotes the recognition encoder, Latent Space denotes the hidden space, Mask Decoder denotes the mask decoder, Mask denotes the mask, Cross Entropy denotes the cross entropy loss, Ground Truth denotes the ground-truth annotation, g_comb denotes the combining function g_comb, and Sample denotes a sample.
Firstly, a certain frame of image in a video sequence is extracted as a first frame of image for target tracking, and the video sequence can be a video shot in scenes such as video monitoring and intelligent transportation. And then, calibrating the target object needing to be tracked in the image. And initializing the parameters of the target appearance model, wherein the tracking target can be any object in the image. Then, entering the next frame, and tracking the target by adopting the method provided by the disclosure, wherein the tracking method comprises the following specific steps:
(1) inputting the template image into the trained shared convolutional neural network to obtain the characteristics of the template image;
(2) inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
(3) performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with the mean value of the hidden space of the prior encoder;
(4) inputting the concatenated result into a mask decoder to obtain a binary mask, and then obtaining the bounding box of the target by a min-max method to realize the localization of the target.
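Step (4)'s min-max method, reading the tightest axis-aligned bounding box off the binary mask, can be sketched as follows (illustrative numpy; the coordinate convention is an assumption):

```python
import numpy as np

def minmax_bbox(mask):
    """Min-max method: the tightest axis-aligned box around all foreground pixels.
    Returns (x_min, y_min, x_max, y_max) in pixel indices, or None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```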
The above steps are explained in detail below.
Before target position prediction, i.e., target tracking, is performed, the whole neural network needs to be trained. In the training process, besides obtaining the template image features, the search image features and the prior encoder hidden space, the search image and the ground-truth calibration mask must be concatenated and input into the recognition encoder to obtain the recognition encoder hidden space; the KL loss D_kl(Q||P) between the prior encoder hidden space and the recognition encoder hidden space is then calculated.
Then, a cross-correlation operation is performed on the template image features and the search image features, and the result is concatenated with a random sample from the prior encoder hidden space; the concatenated result is input into a mask decoder to obtain a binary mask; the cross entropy loss CrossEntropy between the binary mask and the ground-truth calibration mask is calculated, and finally the loss value Loss(M, X) of the whole network is obtained by weighting the KL loss and the cross entropy loss. The network is optimized according to the loss value using stochastic gradient descent, with 50 iterations of training over the complete dataset.
The goal of training the proposed network is to minimize Loss(M, X), i.e., the negative evidence lower bound; the training process follows the standard training procedure of a conditional variational encoder.
Unlike a deterministically trained twin model, the network model proposed in this embodiment 3 must further establish an effective hidden space that encodes the feature variants. Therefore, this embodiment 3 employs a recognition encoder that, given the original search image X and its ground-truth label M, maps the segmentation of the object to a position μ_recog(X, M; ω) ∈ Ω with uncertainty σ_recog(X, M; ω) ∈ Ω.
The probability distribution output by the recognition encoder is denoted Q, and a sample Z drawn from it can be expressed as:

Z ~ Q(·|X, M) = N(μ_recog(X, M; ω), σ_recog(X, M; ω))
By combining Q with the cross-correlation results in the twin network g_Siamese, the sample Z generates a segmentation prediction R, and the distance between the segmentation prediction R and the ground-truth label M is measured by the cross entropy loss.
In addition, this embodiment 3 adopts the KL divergence to penalize the distance between the recognition encoder distribution Q and the prior encoder distribution P. Combining the cross entropy loss and the KL divergence yields the final evidence lower bound objective. In the actual training process, as the KL divergence gradually decreases, the feature variants encoded by the recognition encoder tend toward consistency with the prior distribution.
The main component of the algorithm proposed in this embodiment 3 is a low-dimensional hidden space Ω. The parameter of the prior encoder is φ, which estimates the feature variants of the search image X. The output distribution of the prior encoder P is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
For a general prediction, only the mean of the prior distribution is concatenated with the cross-correlation output as the sample, yielding a single segmentation and tracking bounding box. To predict an unlimited number of reasonable hypotheses, the prior encoder is sampled m times on the same search image. Through the KL loss, the ground-truth calibration label supervises the output of the prior encoder.
Therefore, the network model proposed in this embodiment 3 can generate a complete probability distribution that encodes all possible features into a hidden space. By sampling in the learned hidden space, a variety of reasonable tracking results can be obtained.
In this embodiment 3, the process of sampling from the hidden space specifically includes:
In the conditional variational encoder, the process of sampling from the learned hidden space satisfying a Gaussian distribution is not differentiable; therefore, the model cannot be trained effectively by stochastic gradient descent. This embodiment uses the reparameterization trick so that the conditional variational encoder can obtain gradients normally and errors can be back-propagated. The process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling a noise term o' that follows a normal distribution, and sampling Z is expressed as: Z = μ + o'·σ, o' ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
In the prediction phase, only the mean of the prior distribution is combined with the cross-correlation output as the extracted deterministic feature. Because the KL loss aligns the output of the prior encoder with that of the recognition encoder, which is supervised by the calibration label, the deterministic feature of this embodiment 3 built from the prior distribution mean contains the predicted calibration label information.
During training, features with inserted sampled noise are drawn from the Gaussian hidden space learned by the recognition encoder. In contrast to a traditional deterministic twin network, the method proposed in the present disclosure samples from a probability distribution with a mean and a variance, the variance entering the neural network as inserted noise. Feeding noise into the neurons regularizes the deep neural network, preventing overfitting and improving robustness.
In this embodiment 3, the process of concatenating with the mean of the prior encoder hidden space specifically includes: in the i-th iteration, i ∈ {1, 2, …, m}, where m denotes the number of extracted features, Z_i is randomly sampled from the prior distribution P: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)).
The sample Z_i is broadcast to N channels with the same dimensions as the segmentation mask, and then concatenated with the cross-correlation result in the twin network g_Siamese to obtain the concatenated feature:

F_i = g_comb(g_Siamese(·; σ), Z_i; τ)

where the function g_comb consists of three successive 1×1 convolutional layers, σ denotes the twin network g_Siamese parameters, and τ denotes the convolutional layer parameters in g_comb.
In this embodiment 3, the process of obtaining the binary mask specifically includes:
The concatenated feature F_i is input to the mask decoder g_decoder to generate the segmentation mask:

R_i = g_decoder(F_i; θ)

where θ denotes the mask decoder parameters.
The prior encoder module of the conditional variational encoder encodes the segmentation variants and is trained by using the ground-truth calibrated segmentation, establishing a probability distribution containing a mean and a variance; thus, given a search image, an unlimited number of credible feature maps can be generated. Realizing an unlimited number of credible feature maps requires no significant amount of computation, since only a small portion of the whole network needs to be evaluated repeatedly in each iteration. When extracting m features from the hidden space, the output of the prior network and the cross-correlation result in the twin network can be reused; only F_i and R_i require repeated computation.
In example 3, three general benchmarks were used for evaluation: VOT2016, VOT2018, and TColor-128. The VOT2016 and VOT2018 datasets each contain 60 video clips annotated with rotated bounding boxes, allowing accurate localization evaluation.
The PS-CVAE proposed in this example 3 was compared with other state-of-the-art methods on these datasets using the official toolkit provided by VOT.
The Expected Average Overlap (EAO), accuracy (average overlap over successfully tracked frames) and robustness (failure rate) are employed as the VOT measures. TColor-128 is a recently proposed dataset containing 128 video clips annotated with axis-aligned bounding boxes.
In this example 3, the tracker on TColor-128 was evaluated and the area under the curve (AUC) was used as a measure.
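The overlap-based measures used here (VOT accuracy as average overlap, and the per-frame IoU values underlying success-plot AUC) can be sketched as follows; this is an illustrative fragment, not the official VOT toolkit computation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    """Accuracy in the VOT sense: mean overlap over the tracked frames."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)
```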
All experiments were performed on a PC equipped with an i5 quad-core 2.59GHz CPU, 8GB RAM and GTX 1070 GPU. The average execution speed of the tracker proposed in this embodiment 3 is 33 Frames Per Second (FPS).
The experimental results of the tracker on the VOT2016 dataset are compared with the results of 8 other recent trackers, as shown in Table 1. The EAO, accuracy and failure scores of the tracker are compared with SPM, SiamMask, ATOM, ASRCF, SiamRPN, CSRDCF, CCOT and TCNN. All compared trackers except CCOT and TCNN are real-time trackers. Like the tracker of the present disclosure, SiamMask and SiamRPN use the Siamese architecture.
In this example 3, the proposed tracker achieves an EAO of 0.443, the best among all compared trackers; it is 2.1% higher than the second-ranked SPM (0.434) algorithm and 2.3% higher than the third-ranked SiamMask (0.433).
Table 1 results on VOT2016 dataset
(Table 1 appears as an image in the original document; its numerical values are not reproduced in the text.)
In this example 3, the experimental results of the tracker on the VOT2018 dataset are compared with the results of 8 other recent trackers, as shown in Table 2. The EAO, accuracy and failure scores of the tracker are listed and compared with SiamRPN++, ATOM, TCNN, SiamRPN, SiamMask, SASiamR, SiamVGG and SASiam. All compared trackers except TCNN are real-time trackers, and all except ATOM and TCNN use a Siamese (twin) architecture. In this example 3, the proposed tracker achieves an EAO of 0.415, the best among all compared trackers; it is 0.24% higher than the second-ranked SiamRPN++ (0.414) algorithm and 3.5% higher than the third-ranked ATOM (0.401) algorithm.
TABLE 2 results on VOT2018 dataset
(Table 2 appears as an image in the original document; its numerical values are not reproduced in the text.)
The results of recently proposed real-time trackers on TColor-128 are shown in Table 3, covering the tracker proposed in this example 3 together with UDT, SiamFC, CSRDCF, SCT, CFNet, DSST and KCF. In terms of AUC, the tracker of this example 3 (0.530) ranks first among all compared trackers, 4.5% higher than UDT (0.507). In addition, the tracker of this example 3 also outperforms SiamFC (0.503) and CFNet (0.456), which use the Siamese architecture, improving on them by 5.4% and 16.2% respectively.
TABLE 3 results on TColor-128 dataset
(Table 3 appears as an image in the original document; its numerical values are not reproduced in the text.)
The essence of target tracking is to regress the state of the target over time. Define a regression mapping X → Y, where each y_i ∈ Y is inherently disturbed by observation noise n(x_i), x_i ∈ X. Noise (e.g., sensor or motion noise) causes uncertainty in learning that cannot be reduced even if more data is collected; these data uncertainties are therefore reflected in the regression process.

The noisy regression can be expressed as y_i = f(x_i) + o'·σ(x_i), where o' ~ N(0, I) and f(·) is the learned embedding function. A typical regression model learns only the estimate f(·).

However, as shown in FIG. 2, regression with data uncertainty can estimate not only f(·) but also σ(x_i), which indicates the uncertainty of the predicted value f(·).

Similar to regression uncertainty, a dataset compiled for training a visual tracking network consists of X → Y and also contains data uncertainty. In this case, X represents the image space and Y the ground-truth annotations. Although these large datasets are intended to provide clean annotations for training and testing, as shown in fig. 3, inherent ambiguity may be introduced during annotation due to limitations of the annotation process and annotator preferences. Filtering out these low-quality annotations from large-scale datasets is difficult or even impossible. However, deep learning approaches typically use an embedding Z_i in the latent space. Suppose each x_i ∈ X corresponds to a noise-free version f(x_i) that is less corrupted by ambiguous information; the embedded prediction can then be re-expressed as z_i = f(x_i) + n(x_i), where n(x_i) represents the noise.
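The noisy-embedding view z_i = f(x_i) + n(x_i) can be illustrated with a toy sketch, where f and σ are hypothetical stand-ins for learned mappings (both are assumptions made for illustration):

```python
import numpy as np

def noisy_embedding(x, f, sigma, rng):
    """Data-uncertainty view of the embedding: z = f(x) + n(x), where the noise
    n(x) ~ N(0, sigma(x)^2) models annotation ambiguity that extra data cannot
    remove. f and sigma are illustrative stand-ins for learned mappings."""
    return f(x) + rng.standard_normal() * sigma(x)
```

Averaging many such samples recovers f(x), which is why a model that also estimates σ(x) can separate the clean embedding from the irreducible annotation noise.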
In this embodiment 3, infinite partitions and bounding boxes can be generated in the prediction process, as shown in fig. 4. In a generic prediction process, only the mean of the gaussian distribution of the a priori encoder output is used to provide a priori knowledge and produce a uniform segmentation result and bounding box. In addition, since the a priori encoder in CVAE is conditioned by the calibrated true value mask, samples can be generated from the learned hidden space to obtain multiple segments and bounding boxes, which can provide multiple reasonable predictions.
Example 4
Embodiment 4 of the present invention provides a computer device, including a memory and a processor, where the processor and the memory are in communication with each other, the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute a method for probabilistic twin target tracking based on a variational encoder, where the method includes:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with mean sampling of a hidden space of a prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, and further obtaining a boundary box of the target by a minimum-maximum method to realize the positioning of the target.
Example 5
An embodiment 5 of the present invention provides a computer-readable storage medium, in which a computer program is stored, where the computer program, when executed by a processor, implements a probabilistic twin target tracking method based on a variational encoder, where the method includes:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then serially connecting the template image characteristics and the search image characteristics with the mean sampling of the hidden space of the prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, and further obtaining a boundary box of the target by a minimum-maximum method to realize the positioning of the target.
In summary, in order to solve the problem that a trained network has inherent uncertainty and still outputs a deterministic result, the method and system for probability twin target tracking based on a variational encoder according to the embodiments of the present invention introduce uncertainty learning into target tracking and provide a target tracking network model including a bayesian network to generate a complete probability distribution. Specifically, firstly, a novel probabilistic twinning target tracking method is realized by establishing the relation between a twinning network structure and a conditional variational encoder. In a conditional variational encoder, latent target state variables are implicitly spatially encoded; the randomly sampled samples will then be inserted into the twin network to produce the corresponding target state prediction. Furthermore, when frames in which ambiguity exists require a variety of reasonable assumptions, consistent target states can be generated with only a low amount of computation. In the training process, the real calibration value and corresponding training data are combined to be used as condition information and input into an encoder of a condition variation encoder, so that the abundant real value calibration information in a large data set is fully utilized based on a supervision model; noise insertion prediction is obtained by sampling from a low-dimensional hidden space, and by adding noise to neurons, regularization is introduced into a deep neural network to prevent an over-fitting phenomenon, so that robustness is further improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to the specific embodiments shown in the drawings, it is not intended to limit the scope of the present disclosure, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive faculty based on the technical solutions disclosed in the present disclosure.

Claims (10)

1. A probability twin target tracking method based on a conditional variational encoder is characterized by comprising the following steps:
extracting a certain frame of image in a video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with the mean value of the hidden space of the prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, obtain a boundary box of the target and realize the positioning of the target.
2. The probabilistic twin target tracking method based on conditional variational encoder according to claim 1, wherein the training of the shared convolutional neural network comprises:
the search image and a real calibration mask are connected in series and then input to an identification encoder to obtain an identification encoder hidden space;
calculating the regression loss of a boundary frame between a hidden space of a prior encoder and a hidden space of an identification encoder;
performing cross correlation operation on the template image characteristics and the search image characteristics, and connecting the template image characteristics and the search image characteristics in series with a random sampling result of a hidden space of a prior encoder;
inputting the result after the serial connection into a mask decoder to obtain a binary mask;
calculating the cross entropy loss between the binary mask and the real calibration mask;
weighting the regression loss and the cross entropy loss of the bounding box to obtain a loss value of the network;
adopting a stochastic gradient descent method, optimizing the network according to the loss value, and performing iterative training until the evidence lower bound objective is minimized, to obtain the trained shared convolutional neural network.
3. The conditional variational encoder based probabilistic twin target tracking method according to claim 2, wherein minimizing the evidence downline comprises:
establishing a mapping between a target segmentation and a position with uncertainty using an identification encoder;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from the probability distribution output by the recognition encoder to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross entropy loss;
and adopting KL divergence to punish the distance between the identification encoder and the prior encoder, and combining the cross entropy loss and the KL divergence to obtain a minimized evidence lower line of the shared convolutional neural network.
4. The probabilistic twin target tracking method based on conditional variational encoder according to claim 3, wherein obtaining the encoder hidden space comprises:
applying a plurality of prior encoders on the same search image; through the bounding-box regression loss, the probability output of the prior encoder is supervised in combination with the ground-truth label, and a complete probability distribution encoding all possible features is generated in the hidden space Ω; wherein the parameter of the prior encoder is φ, which estimates the feature variants of the original search image X; the probability output distribution of the prior encoder is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
5. The probabilistic twin target tracking method based on conditional variational encoder according to claim 4, wherein the random sampling from the hidden space comprises:
sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling noise o′ that follows a standard normal distribution, and the sampling of Z is expressed as: Z = μ + o′σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
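A minimal NumPy sketch of this reparameterization trick (hypothetical function name; not part of the claims) — the random sampling is moved into the noise term so the mean and variance stay differentiable:

```python
import numpy as np

def sample_latent(mu, sigma, rng=None):
    """Reparameterized sample Z = mu + o' * sigma, with noise o' ~ N(0, I)."""
    rng = np.random.default_rng(rng)
    o_prime = rng.standard_normal(np.shape(mu))  # noise from the standard normal
    return mu + o_prime * sigma
```

Setting σ = 0 recovers the mean μ exactly, which is how the tracking phase uses the hidden-space mean deterministically.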
6. The probabilistic twin target tracking method based on conditional variational encoder according to claim 5, wherein concatenating the mean of the prior encoder's hidden space comprises:
in the ith iteration, i ∈ {1, 2, ...}, randomly sampling Z_i from the prior encoder probability distribution: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that the feature map and the segmentation mask have the same dimensions, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
Figure FDA0003679289530000031
wherein the function g_comb consists of three consecutive groups of 1 × 1 convolution layers, σ denotes the twin (Siamese) network parameters, and τ denotes the convolution-layer parameters in g_comb.
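An illustrative NumPy sketch (function names hypothetical, not part of the claims) of the two operations described above: broadcasting a sampled latent vector over the search-image feature map, concatenating along the channel axis, and applying a single 1 × 1 convolution of the kind g_comb stacks three of:

```python
import numpy as np

def broadcast_and_concat(feat_map, z):
    """Broadcast latent vector z over the H x W feature map and concatenate channels."""
    c, h, w = feat_map.shape
    z_map = np.broadcast_to(z[:, None, None], (z.shape[0], h, w))  # tile z at every pixel
    return np.concatenate([feat_map, z_map], axis=0)  # shape: (c + len(z), h, w)

def conv1x1(x, weight):
    """A 1 x 1 convolution is a per-pixel linear map over channels.

    weight: (c_out, c_in), x: (c_in, h, w) -> (c_out, h, w)
    """
    return np.tensordot(weight, x, axes=([1], [0]))
```

Because a 1 × 1 convolution mixes only channels, the broadcast latent influences every spatial position identically, which is the intended effect of injecting the sample into the correlation result.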
7. The probabilistic twin target tracking method based on conditional variational encoder according to claim 6, wherein obtaining the binary mask comprises:
the concatenated features
Figure FDA0003679289530000033
are input to a mask decoder g_decoder, which generates the segmentation mask:
Figure FDA0003679289530000032
where θ represents a mask decoder parameter.
8. A probabilistic twin target tracking system based on a conditional variational encoder, comprising:
the image acquisition module is used for extracting a frame from the video sequence as the first frame for target tracking, marking the target object to be tracked in the first frame, and initializing the target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation on the template image features and the search image features, and then concatenating the result with the mean of the prior encoder's hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and derive the bounding box of the target, thereby localizing the target.
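A small NumPy sketch (hypothetical helper, not part of the claims) of the final step performed by the positioning module — deriving an axis-aligned bounding box from the binary mask produced by the mask decoder:

```python
import numpy as np

def mask_to_bbox(mask):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a binary mask."""
    ys, xs = np.nonzero(mask)  # coordinates of all foreground pixels
    if ys.size == 0:
        return None  # no target pixels found in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

In practice a tracker may also fit a rotated rectangle to the mask; the axis-aligned box above is the simplest variant consistent with the claim wording.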
9. A computer device comprising a memory and a processor, the processor and the memory in communication with each other, the memory storing program instructions executable by the processor, characterized in that: the processor calls the program instructions to perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1-7.
CN202011434400.5A 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder Active CN112541944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011434400.5A CN112541944B (en) 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder

Publications (2)

Publication Number Publication Date
CN112541944A CN112541944A (en) 2021-03-23
CN112541944B true CN112541944B (en) 2022-07-12

Family

ID=75019817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011434400.5A Active CN112541944B (en) 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder

Country Status (1)

Country Link
CN (1) CN112541944B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223055B * 2021-05-31 2022-08-05 Huazhong University of Science and Technology Image target tracking model establishing method and image target tracking method
CN113435488B * 2021-06-17 2023-11-07 Shenzhen University Image sampling probability improving method and application thereof
CN114155215B * 2021-11-24 2023-11-10 Sun Yat-sen University Cancer Center Nasopharyngeal carcinoma recognition and tumor segmentation method and system based on MR images
CN115990875B * 2022-11-10 2024-05-07 South China University of Technology Flexible cable state prediction and control system based on hidden space interpolation
CN117291952B * 2023-10-31 2024-05-17 China University of Mining and Technology (Beijing) Multi-target tracking method and device based on speed prediction and image reconstruction
CN117934979A * 2024-03-22 2024-04-26 Nanjing University Target identification method based on fractal coder-decoder

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110244715A * 2019-05-23 2019-09-17 Xi'an University of Technology High-precision cooperative tracking method for multiple mobile robots based on ultra-wideband technology
CN111968155A * 2020-07-23 2020-11-20 Tianjin University Target tracking method based on segmented target mask updating template

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11378654B2 (en) * 2018-08-02 2022-07-05 Metawave Corporation Recurrent super-resolution radar for autonomous vehicles
CN110009013B * 2019-03-21 2021-04-27 Tencent Technology (Shenzhen) Co., Ltd. Encoder training and representation information extraction method and device
US11227179B2 * 2019-09-27 2022-01-18 Intel Corporation Video tracking with deep Siamese networks and Bayesian optimization
CN111626154B * 2020-05-14 2023-04-07 Minjiang University Face tracking method based on convolutional variational encoder
CN111862159A * 2020-07-23 2020-10-30 Beijing Yisa Technology Co., Ltd. Improved target tracking and segmentation method, system and medium for twin convolutional network


Similar Documents

Publication Publication Date Title
CN112541944B (en) Probability twin target tracking method and system based on conditional variational encoder
Marchetti et al. Mantra: Memory augmented networks for multiple trajectory prediction
CN110335337B (en) Method for generating visual odometer of antagonistic network based on end-to-end semi-supervision
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
US11640714B2 (en) Video panoptic segmentation
CN111161412B (en) Three-dimensional laser mapping method and system
Chen et al. GPU-accelerated real-time stereo estimation with binary neural network
CN113241128A (en) Molecular property prediction method based on molecular space position coding attention neural network model
CN111583300B (en) Target tracking method based on enrichment target morphological change update template
CN111325766B (en) Three-dimensional edge detection method, three-dimensional edge detection device, storage medium and computer equipment
KR20200063368A (en) Unsupervised stereo matching apparatus and method using confidential correspondence consistency
Fan et al. Siamese residual network for efficient visual tracking
CN113298036A (en) Unsupervised video target segmentation method
CN113298014A (en) Closed loop detection method, storage medium and equipment based on reverse index key frame selection strategy
CN117152554A (en) ViT model-based pathological section data identification method and system
CN113516682B (en) Loop detection method of laser SLAM
Wen et al. Efficient algorithms for maximum consensus robust fitting
CN114972438A (en) Self-supervision target tracking method based on multi-period cycle consistency
CN115482252A (en) Motion constraint-based SLAM closed loop detection and pose graph optimization method
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
Tsintotas et al. The revisiting problem in simultaneous localization and mapping
CN110458867B (en) Target tracking method based on attention circulation network
CN117173607A (en) Multi-level fusion multi-target tracking method, system and computer readable storage medium
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance
CN114638953B (en) Point cloud data segmentation method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant