CN112541944B - Probability twin target tracking method and system based on conditional variational encoder - Google Patents


Info

Publication number
CN112541944B
CN112541944B (application CN202011434400.5A)
Authority
CN
China
Prior art keywords
encoder
prior
search image
hidden space
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011434400.5A
Other languages
Chinese (zh)
Other versions
CN112541944A (en)
Inventor
郭胤辰
黄文慧
李丰泽
何伟
薛婧一
胡意廷
侯宛辰
周媛媛
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202011434400.5A
Publication of CN112541944A
Application granted
Publication of CN112541944B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a probabilistic twin target tracking method and system based on a conditional variational encoder, belonging to the technical field of target tracking and recognition for intelligent robots. A template image is input into a trained shared convolutional neural network to obtain template image features; the current frame, serving as a search image, is input into the shared convolutional neural network and a prior encoder to obtain, respectively, search image features and the prior-encoder hidden space. A cross-correlation operation is performed between the template image features and the search image features, and the result is concatenated with the mean sample of the prior-encoder hidden space. The concatenated result is input into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target. The invention introduces uncertainty learning into target tracking and obtains a complete probability distribution; at low computational cost it can generate consistent target states, and by adding noise to the neurons it introduces regularization into the deep neural network to prevent overfitting and improve robustness.

Description

Probability twin target tracking method and system based on conditional variational encoder
Technical Field
The invention relates to the technical field of target tracking and recognition for intelligent robots, in particular to a probabilistic twin target tracking method and system based on a conditional variational encoder.
Background
To interact with its environment, a human-machine-environment co-fusion robot needs the basic capability of estimating the state of that environment. Specifically, the robot must first localize a target before it can interact with it and pursue further operational objectives. Tracking a moving target enables a mobile robot to accomplish a variety of real-world tasks.
Deep convolutional neural networks have increasingly become the dominant approach in the field of target tracking owing to their significantly improved performance compared with conventional methods. However, in some very complex and difficult tracking scenarios, even leading target tracking algorithms are prone to target loss or loss of accuracy, and their predictions often carry poorly calibrated confidence.
These algorithms produce deterministic features or regression maps through embedded deep learning models such as convolutional neural networks. An inaccurate deterministic regression map produced with low confidence can lead to catastrophic consequences and fails to provide a basis for further operations or timely human intervention. In addition, during regression, the maximum value of the predicted regression map in the final output is often mistakenly taken to reflect the uncertainty of the model. However, the model may still output a high regression response even when it lacks confidence.
Target tracking typically requires a large amount of training data to train a mature network model. Although large datasets aim to provide clean labels for training and testing, they carry some inherent label uncertainty due to the limitations of labeling methods and the preferences of annotators. Using ambiguous data, or the noise inherent in the observations (also referred to as data uncertainty or aleatoric uncertainty), results in inherent uncertainty at test time. For example, if uncertainty in the training data is present in boundary regions, it will lead to higher prediction uncertainty in those regions.
Disclosure of Invention
The invention aims to provide a probabilistic twin target tracking method and system based on a conditional variational encoder, which introduce uncertainty learning into target tracking, acquire a complete probability distribution, generate consistent target states, prevent overfitting, and improve robustness, so as to solve at least one technical problem in the background art.
In order to achieve the purpose, the invention adopts the following technical scheme:
in one aspect, the present invention provides a probability twin target tracking method based on a conditional variational encoder, including:
extracting a frame of a video sequence as the first frame for target tracking, calibrating the target object to be tracked in the first frame, and initializing the target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image features;
inputting the current frame, as a search image, into the shared convolutional neural network and a prior encoder to obtain, respectively, search image features and the prior-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target.
Preferably, the training of the shared convolutional neural network comprises:
concatenating the search image with the ground-truth calibration mask and inputting the result into a recognition encoder to obtain the recognition-encoder hidden space;
calculating the KL loss, used as the bounding-box regression loss, between the prior-encoder hidden space and the recognition-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with a random sample from the prior-encoder hidden space;
inputting the concatenated result into a mask decoder to obtain a binary mask;
calculating the cross-entropy loss between the binary mask and the ground-truth calibration mask;
weighting the KL loss and the cross-entropy loss to obtain the loss value of the network;
and optimizing the network according to the loss value by stochastic gradient descent, training iteratively until the evidence lower bound is minimized, to obtain the trained shared convolutional neural network.
Preferably, minimizing the evidence lower bound comprises:
establishing, with the recognition encoder, a mapping between the target segmentation and its position with uncertainty;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from that distribution to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross-entropy loss;
and using the KL divergence to penalize the distance between the recognition encoder and the prior encoder, combining the cross-entropy loss and the KL divergence to obtain the evidence lower bound of the shared convolutional neural network.
Preferably, obtaining the prior-encoder hidden space comprises:
applying multiple prior encoders to the same search image; supervising the probabilistic output of the prior encoder with the ground-truth label through the bounding-box regression loss, and generating a complete probability distribution that encodes all possible features into a hidden space Ω; the prior encoder has parameters φ and estimates the feature variants of the original search image X; its output distribution is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
Preferably, randomly sampling from the hidden space comprises:
converting the process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) into randomly sampling noise o′ that follows a normal distribution, with the sampling of Z expressed as: Z = μ + o′·σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the standard deviation of the hidden space, and N(0, I) denotes the standard normal distribution.
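The reparameterized sampling above can be sketched in NumPy (a stand-in for the patent's network code; the function name and the Monte-Carlo check are illustrative):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample Z = mu + o' * sigma with o' ~ N(0, I): the randomness is
    isolated in o', so gradients can flow through mu and sigma."""
    o = rng.standard_normal(mu.shape)   # noise o' ~ N(0, I)
    return mu + o * sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])              # hidden-space mean
sigma = np.array([0.5, 0.1])            # hidden-space standard deviation
# drawing many samples lets us check that Z indeed follows N(mu, sigma)
zs = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
```

The empirical mean and standard deviation of `zs` approach `mu` and `sigma`, confirming that shifting and scaling standard-normal noise reproduces the target Gaussian while keeping the sampling step differentiable in μ and σ.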
Preferably, concatenating the mean sample of the prior-encoder hidden space comprises:
in the i-th iteration, i ∈ {1, 2, …, m}, where m is a positive integer, randomly sampling Z_i from the probability distribution of the prior encoder: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that it has the same dimensions as the segmentation mask, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
f_i = g_comb(concat(g_Siamese, Z_i); τ)
where the function g_comb consists of three successive 1 × 1 convolutional layers, Σ denotes the parameters of the twin network g_Siamese, and τ denotes the parameters of the convolutional layers in g_comb.
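The broadcast-concatenate-combine step can be sketched as follows. All shapes are hypothetical, and the ReLU activations between the 1 × 1 convolutions are an assumption (the patent specifies only three successive 1 × 1 convolutional layers):

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels:
    x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(w.shape[0], h, wd)

def combine(g_siamese, z, ws):
    """Broadcast the latent sample z over the spatial map, concatenate it
    with the cross-correlation response, and apply three 1x1 conv layers
    (the function g_comb with parameters tau in the text)."""
    _, h, w = g_siamese.shape
    z_map = np.broadcast_to(z[:, None, None], (z.size, h, w))
    x = np.concatenate([g_siamese, z_map], axis=0)
    for w_i in ws:
        x = np.maximum(conv1x1(x, w_i), 0.0)   # ReLU between layers (assumed)
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((32, 17, 17))          # hypothetical correlation map
z = rng.standard_normal(6)                     # latent sample Z_i
ws = [rng.standard_normal((32, 38)) * 0.1,     # 38 = 32 map + 6 latent channels
      rng.standard_normal((32, 32)) * 0.1,
      rng.standard_normal((32, 32)) * 0.1]
out = combine(g, z, ws)
```

Broadcasting tiles the D-dimensional sample to every spatial location, so the same latent hypothesis conditions the entire segmentation prediction.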
Preferably, obtaining the binary mask comprises:
inputting the concatenated features f_i into a mask decoder g_decoder to generate a segmentation mask:
M_i = g_decoder(f_i; θ)
where f_i denotes the concatenated features and θ denotes the mask decoder parameters.
In a second aspect, the present invention provides a probabilistic twin target tracking system based on a conditional variational encoder, comprising:
the image acquisition module is used for extracting a certain frame of image in the video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation between the template image features and the search image features and concatenating the result with the mean sample of the prior-encoder hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and the bounding box of the target, so as to realize the positioning of the target.
In a third aspect, the invention provides a computer apparatus comprising a memory and a processor, the processor and the memory being in communication with each other, the memory storing program instructions executable by the processor, the processor invoking the program instructions to perform the method as described above.
In a fourth aspect, the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method as described above.
The invention has the beneficial effects that:
aiming at the problem that a trained network has inherent uncertainty and still outputs a deterministic result, a probability twin target tracking method and system based on a variational encoder are provided, uncertainty learning is introduced into target tracking, and a target tracking network model comprising a Bayesian network is provided to generate complete probability distribution.
When there are frames with ambiguity that require multiple reasonable assumptions, the proposed method of the present disclosure can produce multiple consistent target states with only a low amount of computation.
Aiming at the problem that abundant real value calibration information in a large data set is not fully utilized, the real calibration value and corresponding training data are combined to be used as condition information in the training process and input into an encoder of a condition variation encoder based on a supervision model.
To prevent the over-fitting problem, noise insertion prediction is derived from low-dimensional implicit spatial sampling. By adding noise to the neurons, regularization is introduced into the deep neural network to prevent an overfitting phenomenon, and robustness is further improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a probabilistic twin target tracking method based on a conditional variational encoder according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the inherent uncertainty in regression prediction according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating inherent uncertainty in the calibration of a large dataset according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the generation of an unlimited number of segmentations and bounding boxes according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by way of the drawings are illustrative only and are not to be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
For the purpose of facilitating an understanding of the present invention, the present invention will be further explained by way of specific embodiments with reference to the accompanying drawings, which are not intended to limit the present invention.
It should be understood by those skilled in the art that the drawings are merely schematic representations of embodiments and that the elements shown in the drawings are not necessarily required to practice the invention.
Example 1
The embodiment 1 of the present invention provides a probability twin target tracking system based on a conditional variational encoder, including:
the image acquisition module is used for extracting a certain frame of image in the video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation between the template image features and the search image features and concatenating the result with the mean sample of the prior-encoder hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and the bounding box of the target, so as to realize the positioning of the target.
The probability twin target tracking method based on the conditional variational encoder is realized by utilizing the probability twin target tracking system based on the conditional variational encoder, and the method specifically comprises the following steps:
extracting a certain frame of image in a video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is obtained, realizing the positioning of the target.
The training of the shared convolutional neural network comprises:
concatenating the search image with the ground-truth calibration mask and inputting the result into a recognition encoder to obtain the recognition-encoder hidden space;
calculating the KL loss, used as the bounding-box regression loss, between the prior-encoder hidden space and the recognition-encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with a random sample from the prior-encoder hidden space;
inputting the concatenated result into a mask decoder to obtain a binary mask;
calculating the cross-entropy loss between the binary mask and the ground-truth calibration mask;
weighting the KL loss and the cross-entropy loss to obtain the loss value of the network;
and optimizing the network according to the loss value by stochastic gradient descent, training iteratively until the evidence lower bound is minimized, to obtain the trained shared convolutional neural network.
Minimizing the evidence lower bound comprises:
establishing, with the recognition encoder, a mapping between the target segmentation and its position with uncertainty;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from that distribution to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross-entropy loss;
and using the KL divergence to penalize the distance between the recognition encoder and the prior encoder, combining the cross-entropy loss and the KL divergence to obtain the evidence lower bound of the shared convolutional neural network.
Obtaining the prior-encoder hidden space includes:
applying multiple prior encoders to the same search image; supervising the probabilistic output of the prior encoder with the ground-truth label through the bounding-box regression loss, and generating a complete probability distribution that encodes all possible features into a hidden space Ω; the prior encoder has parameters φ and estimates the feature variants of the original search image X; its output distribution is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
Randomly sampling from the hidden space includes:
converting the process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) into randomly sampling noise o′ that follows a normal distribution, with the sampling of Z expressed as: Z = μ + o′·σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the standard deviation of the hidden space, and N(0, I) denotes the standard normal distribution.
Concatenating the mean sample of the prior-encoder hidden space comprises the following steps:
in the i-th iteration, i ∈ {1, 2, …, m}, where m is a positive integer, randomly sampling Z_i from the probability distribution of the prior encoder: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that it has the same dimensions as the segmentation mask, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
f_i = g_comb(concat(g_Siamese, Z_i); τ)
where the function g_comb consists of three successive 1 × 1 convolutional layers, Σ denotes the parameters of the twin network g_Siamese, and τ denotes the parameters of the convolutional layers in g_comb.
Obtaining the binary mask includes: inputting the concatenated features f_i into a mask decoder g_decoder to generate a segmentation mask:
M_i = g_decoder(f_i; θ)
where f_i denotes the concatenated features and θ denotes the mask decoder parameters.
Example 2
The embodiment 2 of the invention provides a probability twin target tracking method based on a variational encoder, which comprises the following steps:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing a cross-correlation operation between the template image features and the search image features, and concatenating the result with the mean sample of the prior-encoder hidden space;
and inputting the concatenated result into a mask decoder to obtain a binary mask, from which the bounding box of the target is then obtained by the min-max method, realizing the positioning of the target.
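The min-max extraction of a bounding box from the binary mask can be sketched as follows (NumPy; the helper name is hypothetical):

```python
import numpy as np

def bbox_from_mask(mask):
    """Axis-aligned bounding box of a binary mask via the min-max method:
    take the min/max of the foreground pixel coordinates."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                       # no foreground: target not found
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1                        # target occupies rows 2..4, cols 3..6
box = bbox_from_mask(mask)                # (x_min, y_min, x_max, y_max)
```

Returning `None` for an empty mask is one reasonable convention; a tracker might instead fall back to the previous frame's box in that case.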
In this embodiment 2, the training process of the shared convolutional neural network specifically includes:
before target position prediction, i.e. target tracking, is performed, the whole neural network needs to be trained. In the training process, besides obtaining template image features, search image features and an encoder hidden space, a search image and a real calibration mask are required to be connected in series and then input into an identification encoder to obtain an identification encoder hidden space, and then KL loss (boundary box regression loss) between a priori encoder hidden space and the identification encoder hidden space is calculated: dkl(Q||P)=Ez~Q[logQ-logP](ii) a Where P denotes the computational prior encoder implicit space, Q denotes the recognition encoder implicit space, EZ~QIndicating a desire.
Then, a cross-correlation operation is performed between the template image features and the search image features, and the result is concatenated with a random sample Z from the prior-encoder hidden space; the concatenated result is input into a mask decoder to obtain a binary mask; and the cross-entropy loss between the binary mask and the ground-truth calibration mask is calculated:
CrossEntropy = E_{Z~Q(·|M,X)}[−log(P_c(M | R(X, Z)))]; where P_c denotes the category distribution, X denotes the original search image, M denotes the ground-truth label of the original search image, and R denotes the segmentation prediction of the sample Z.
Finally, the loss value of the whole network is obtained by weighting the KL loss and the cross-entropy loss:
Loss(M, X) = CrossEntropy + β·D_KL(Q‖P); where β is a weighting parameter;
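The weighted loss above can be written out using the closed-form KL divergence between two axis-aligned Gaussians (a standard identity, not spelled out in the patent; the NumPy helper names are illustrative):

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form D_KL(Q || P) for axis-aligned Gaussians Q and P."""
    var_q, var_p = sig_q**2, sig_p**2
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

def binary_cross_entropy(mask_prob, mask_true, eps=1e-7):
    """Pixel-wise cross entropy between predicted and ground-truth masks."""
    p = np.clip(mask_prob, eps, 1 - eps)
    return -np.mean(mask_true * np.log(p) + (1 - mask_true) * np.log(1 - p))

def total_loss(mask_prob, mask_true, mu_q, sig_q, mu_p, sig_p, beta=1.0):
    """Loss(M, X) = CrossEntropy + beta * D_KL(Q || P)."""
    return (binary_cross_entropy(mask_prob, mask_true)
            + beta * kl_diag_gauss(mu_q, sig_q, mu_p, sig_p))

# with identical Q and P the KL term vanishes, leaving pure cross entropy
loss = total_loss(np.full((4, 4), 0.5), np.ones((4, 4)),
                  np.zeros(2), np.ones(2), np.zeros(2), np.ones(2))
# loss equals log 2 (~0.6931) here
```

As training drives D_KL(Q‖P) toward zero, the prior encoder learns to reproduce the recognition encoder's distribution without needing the ground-truth mask, which is exactly what permits test-time sampling from the prior alone.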
The network is optimized according to the loss value by stochastic gradient descent, with 50 epochs of iterative training over the complete dataset.
In embodiment 2 of the present invention, the obtaining process of searching the hidden space of the image recognition encoder specifically includes: the goal of training the proposed network is to minimize the evidentiary downline, Loss (M, X), the training process follows the standard training process of a conditional variational encoder. Unlike the twin model determined by training, the network model proposed in this embodiment 2 needs to further establish an effective hidden space for encoding the feature variants. Thus, this example 2 employs a recognition encoder to establish the segmentation of the object (given the original search image X and its true value tag M) and with uncertainty σrecogPosition mu of (X, M; omega) E omegarecog(X, M; omega) epsilon omega represents that the main component is a hidden space with low dimension. Identifying the probability distribution of the encoder output as Q, from which samples Z can be expressed as: z to Q (· | X, M) ═ N (μ)recog(X,M;ω),σrecog(X, M; omega)); where ω denotes identifying encoder parameters.
By combining Q with the result of the cross-correlation operation between the template image features and the search image features in the twin network g_Siamese, the sample Z generates a segmentation prediction R, and the distance between the segmentation prediction R and the ground-truth label M is measured by the cross entropy loss. Further, the KL divergence is employed to penalize the distance between the recognition encoder distribution Q and the prior encoder distribution P. Combining the cross entropy loss and the KL divergence yields the final evidence lower bound objective.
In the actual training process, as the KL divergence gradually decreases, the feature variants encoded by the recognition encoder tend toward consistency with the prior distribution.
In this embodiment 2, the parameter of the prior encoder is φ, which estimates the feature variants of the original search image X. The output distribution of the prior encoder P is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
For a general prediction, only the mean of the prior distribution is concatenated with the cross-correlation output as the sample, yielding a single segmentation and tracking bounding box. To predict an unlimited number of reasonable hypotheses, the prior encoder is sampled m times on the same search image. Through the KL loss, the ground-truth calibration label supervises the output of the prior encoder.
Therefore, the network model proposed in this embodiment 2 can generate a complete probability distribution that encodes all possible features into a hidden space. By sampling in the learned hidden space, a variety of reasonable tracking results can be obtained.
In this embodiment 2, the process of sampling from the hidden space specifically includes:
In a conditional variational encoder, the process of sampling from the learned hidden space satisfying a Gaussian distribution is not differentiable; therefore, the model cannot be trained effectively by stochastic gradient descent. This embodiment uses the reparameterization trick so that the conditional variational encoder can obtain gradients normally and errors can be back-propagated. The process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling a noise term o' that follows a normal distribution, and sampling Z is expressed as: Z = μ + o'·σ, o' ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
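The reparameterized sampling step described above can be sketched as follows; this is an illustrative numpy fragment, with the function name chosen here for clarity rather than taken from the patent:

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: Z = mu + o' * sigma with o' ~ N(0, I).
    The randomness is isolated in o', so gradients can flow through mu and sigma."""
    noise = rng.standard_normal(np.shape(mu))  # o' ~ N(0, I)
    return mu + noise * sigma
```

Because mu and sigma enter only through deterministic arithmetic, a differentiable framework can back-propagate through them while the sampled noise stays fixed.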
In the prediction phase, only the mean of the prior distribution is combined with the cross-correlation output as the extracted deterministic feature. Because the KL loss aligns the output of the prior encoder with that of the recognition encoder, which is supervised by the calibration label, the deterministic feature of this embodiment 2 built from the prior distribution mean contains the predicted calibration label information.
During training, features with inserted sampled noise are drawn from the Gaussian hidden space learned by the recognition encoder. In contrast to a traditional deterministic twin network, the method proposed in the present disclosure samples from a probability distribution with a mean and a variance, the variance entering the neural network as inserted noise. Feeding noise into the neurons regularizes the deep neural network, preventing overfitting and improving robustness.
In this embodiment 2, the process of concatenating with the mean of the prior encoder hidden space specifically includes: in the i-th iteration, i ∈ {1, 2, …, m}, where m denotes the number of extracted features, Z_i is randomly sampled from the prior distribution P: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)).
The sample Z_i is broadcast to N channels with the same dimensions as the segmentation mask, and then concatenated with the cross-correlation result in the twin network g_Siamese to obtain the concatenated feature:

F_i = g_comb(g_Siamese(·; σ), Z_i; τ)

where the function g_comb consists of three successive 1×1 convolutional layers, σ denotes the twin network g_Siamese parameters, and τ denotes the convolutional layer parameters in g_comb.
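A minimal numpy sketch of this combine step follows. It assumes the cross-correlation output is a (C, H, W) feature map and implements each 1×1 convolution as a per-pixel linear map; activation functions and exact channel counts are omitted as assumptions:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution over a (C_in, H, W) map: a per-pixel linear map to (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def g_comb(corr_feat, z, weights):
    """Broadcast the latent sample z over the spatial grid of the cross-correlation
    map, concatenate along the channel axis, then apply three 1x1 conv layers."""
    _, h, w_sp = corr_feat.shape
    z_map = np.broadcast_to(z[:, None, None], (z.shape[0], h, w_sp))
    x = np.concatenate([corr_feat, z_map], axis=0)
    for w in weights:  # three successive 1x1 convolutions (non-linearities omitted)
        x = conv1x1(x, w)
    return x
```

Broadcasting the sample before concatenation is what lets a single low-dimensional latent vector condition every spatial position of the correlation map.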
In this embodiment 2, the process of obtaining the binary mask specifically includes:
The concatenated feature F_i is input to the mask decoder g_decoder to generate the segmentation mask:

R_i = g_decoder(F_i; θ)

where θ denotes the mask decoder parameters.
The prior encoder module of the conditional variational encoder encodes the segmentation variants and is trained by using the ground-truth calibrated segmentation, establishing a probability distribution containing a mean and a variance; thus, given a search image, an unlimited number of credible feature maps can be generated. Realizing an unlimited number of credible feature maps requires no significant amount of computation, since only a small portion of the whole network needs to be evaluated repeatedly in each iteration. When extracting m features from the hidden space, the output of the prior network and the cross-correlation result in the twin network can be reused; only F_i and R_i require repeated computation.
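The reuse argument above can be sketched as follows: the backbone features and the cross-correlation map are computed once and cached, while only a light per-sample head runs m times. This is an illustrative numpy fragment; `decode` is a hypothetical stand-in for the g_comb plus g_decoder head:

```python
import numpy as np

def multiple_hypotheses(corr_feat, mu_prior, sigma_prior, decode, m=5, seed=0):
    """Decode m latent samples Z_i ~ N(mu_prior, sigma_prior) into m predictions,
    reusing the cached cross-correlation map corr_feat: the backbone and the
    cross-correlation run once, and only the light decode head runs m times."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(m):
        z_i = mu_prior + rng.standard_normal(mu_prior.shape) * sigma_prior
        results.append(decode(corr_feat, z_i))
    return results
```

The per-hypothesis cost is thus only the combine/decode head, which is why an arbitrary number of hypotheses stays cheap.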
Example 3
As shown in fig. 1, in embodiment 3 of the present invention, uncertainty learning is introduced into target tracking, and a target tracking network model including a Bayesian network is proposed to generate a complete probability distribution. In the figure, Template Image denotes the template image, Search Image denotes the search image, Conv Layers denotes the convolutional layer network, Prior Encoder denotes the prior encoder, Recognition Encoder denotes the recognition encoder, Latent Space denotes the hidden space, Mask Decoder denotes the mask decoder, Mask denotes the mask, Cross Entropy denotes the cross entropy loss, Ground Truth denotes the ground-truth annotation, g_comb denotes the combining function g_comb, and Sample denotes a sample.
Firstly, a certain frame of image in a video sequence is extracted as a first frame of image for target tracking, and the video sequence can be a video shot in scenes such as video monitoring and intelligent transportation. And then, calibrating the target object needing to be tracked in the image. And initializing the parameters of the target appearance model, wherein the tracking target can be any object in the image. Then, entering the next frame, and tracking the target by adopting the method provided by the disclosure, wherein the tracking method comprises the following specific steps:
(1) inputting the template image into the trained shared convolutional neural network to obtain the characteristics of the template image;
(2) inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
(3) performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with the mean value of the hidden space of the prior encoder;
(4) inputting the concatenated result into a mask decoder to obtain a binary mask, and then obtaining the bounding box of the target by a min-max method to realize the localization of the target.
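Step (4)'s min-max method, reading the tightest axis-aligned bounding box off the binary mask, can be sketched as follows (illustrative numpy; the coordinate convention is an assumption):

```python
import numpy as np

def minmax_bbox(mask):
    """Min-max method: the tightest axis-aligned box around all foreground pixels.
    Returns (x_min, y_min, x_max, y_max) in pixel indices, or None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```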
The above steps are explained in detail below.
Before target position prediction, i.e., target tracking, is performed, the whole neural network needs to be trained. In the training process, besides obtaining the template image features, the search image features and the prior encoder hidden space, the search image and the ground-truth calibration mask must be concatenated and input into the recognition encoder to obtain the recognition encoder hidden space; the KL loss D_kl(Q||P) between the prior encoder hidden space and the recognition encoder hidden space is then calculated.
Then, a cross-correlation operation is performed on the template image features and the search image features, and the result is concatenated with a random sample from the prior encoder hidden space; the concatenated result is input into a mask decoder to obtain a binary mask; the cross entropy loss CrossEntropy between the binary mask and the ground-truth calibration mask is calculated, and finally the loss value Loss(M, X) of the whole network is obtained by weighting the KL loss and the cross entropy loss. The network is optimized according to the loss value using stochastic gradient descent, with 50 iterations of training over the complete dataset.
The goal of training the proposed network is to minimize Loss(M, X), i.e., the negative evidence lower bound; the training process follows the standard training procedure of a conditional variational encoder.
Unlike a deterministically trained twin model, the network model proposed in this embodiment 3 must further establish an effective hidden space that encodes the feature variants. Therefore, this embodiment 3 employs a recognition encoder that, given the original search image X and its ground-truth label M, maps the segmentation of the object to a position μ_recog(X, M; ω) ∈ Ω with uncertainty σ_recog(X, M; ω) ∈ Ω.
The probability distribution output by the recognition encoder is denoted Q, and a sample Z drawn from it can be expressed as:

Z ~ Q(·|X, M) = N(μ_recog(X, M; ω), σ_recog(X, M; ω))
By combining Q with the cross-correlation results in the twin network g_Siamese, the sample Z generates a segmentation prediction R, and the distance between the segmentation prediction R and the ground-truth label M is measured by the cross entropy loss.
In addition, this embodiment 3 adopts the KL divergence to penalize the distance between the recognition encoder distribution Q and the prior encoder distribution P. Combining the cross entropy loss and the KL divergence yields the final evidence lower bound objective. In the actual training process, as the KL divergence gradually decreases, the feature variants encoded by the recognition encoder tend toward consistency with the prior distribution.
The main component of the algorithm proposed in this embodiment 3 is a low-dimensional hidden space Ω. The parameter of the prior encoder is φ, which estimates the feature variants of the search image X. The output distribution of the prior encoder P is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
For a general prediction, only the mean of the prior distribution is concatenated with the cross-correlation output as the sample, yielding a single segmentation and tracking bounding box. To predict an unlimited number of reasonable hypotheses, the prior encoder is sampled m times on the same search image. Through the KL loss, the ground-truth calibration label supervises the output of the prior encoder.
Therefore, the network model proposed in this embodiment 3 can generate a complete probability distribution that encodes all possible features into a hidden space. By sampling in the learned hidden space, a variety of reasonable tracking results can be obtained.
In this embodiment 3, the process of sampling from the hidden space specifically includes:
In the conditional variational encoder, the process of sampling from the learned hidden space satisfying a Gaussian distribution is not differentiable; therefore, the model cannot be trained effectively by stochastic gradient descent. This embodiment uses the reparameterization trick so that the conditional variational encoder can obtain gradients normally and errors can be back-propagated. The process of sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling a noise term o' that follows a normal distribution, and sampling Z is expressed as: Z = μ + o'·σ, o' ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
In the prediction phase, only the mean of the prior distribution is combined with the cross-correlation output as the extracted deterministic feature. Because the KL loss aligns the output of the prior encoder with that of the recognition encoder, which is supervised by the calibration label, the deterministic feature of this embodiment 3 built from the prior distribution mean contains the predicted calibration label information.
During training, features with inserted sampled noise are drawn from the Gaussian hidden space learned by the recognition encoder. In contrast to a traditional deterministic twin network, the method proposed in the present disclosure samples from a probability distribution with a mean and a variance, the variance entering the neural network as inserted noise. Feeding noise into the neurons regularizes the deep neural network, preventing overfitting and improving robustness.
In this embodiment 3, the process of concatenating with the mean of the prior encoder hidden space specifically includes: in the i-th iteration, i ∈ {1, 2, …, m}, where m denotes the number of extracted features, Z_i is randomly sampled from the prior distribution P: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)).
The sample Z_i is broadcast to N channels with the same dimensions as the segmentation mask, and then concatenated with the cross-correlation result in the twin network g_Siamese to obtain the concatenated feature:

F_i = g_comb(g_Siamese(·; σ), Z_i; τ)

where the function g_comb consists of three successive 1×1 convolutional layers, σ denotes the twin network g_Siamese parameters, and τ denotes the convolutional layer parameters in g_comb.
In this embodiment 3, the process of obtaining the binary mask specifically includes:
The concatenated feature F_i is input to the mask decoder g_decoder to generate the segmentation mask:

R_i = g_decoder(F_i; θ)

where θ denotes the mask decoder parameters.
The prior encoder module of the conditional variational encoder encodes the segmentation variants and is trained by using the ground-truth calibrated segmentation, establishing a probability distribution containing a mean and a variance; thus, given a search image, an unlimited number of credible feature maps can be generated. Realizing an unlimited number of credible feature maps requires no significant amount of computation, since only a small portion of the whole network needs to be evaluated repeatedly in each iteration. When extracting m features from the hidden space, the output of the prior network and the cross-correlation result in the twin network can be reused; only F_i and R_i require repeated computation.
In example 3, three general benchmarks were used for evaluation: VOT2016, VOT2018, and TColor-128. The VOT2016 and VOT2018 datasets each contain 60 video clips annotated with rotated bounding boxes, allowing accurate localization evaluation.
The PS-CVAE proposed in this example 3 was compared with other state-of-the-art methods on these datasets using the official toolkit provided by VOT.
The Expected Average Overlap (EAO), accuracy (average overlap over successfully tracked frames) and robustness (failure rate) are employed as the VOT measures. TColor-128 is a recently proposed dataset containing 128 video clips annotated with axis-aligned bounding boxes.
In this example 3, the tracker on TColor-128 was evaluated and the area under the curve (AUC) was used as a measure.
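The overlap-based measures used here (VOT accuracy as average overlap, and the per-frame IoU values underlying success-plot AUC) can be sketched as follows; this is an illustrative fragment, not the official VOT toolkit computation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    """Accuracy in the VOT sense: mean overlap over the tracked frames."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)
```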
All experiments were performed on a PC equipped with an i5 quad-core 2.59GHz CPU, 8GB RAM and GTX 1070 GPU. The average execution speed of the tracker proposed in this embodiment 3 is 33 Frames Per Second (FPS).
The experimental results of the tracker on the VOT2016 dataset are compared with the results of 8 other recent trackers, as shown in Table 1. The EAO, accuracy and failure scores of the tracker are compared with SPM, SiamMask, ATOM, ASRCF, SiamRPN, CSRDCF, CCOT and TCNN. All compared trackers except CCOT and TCNN are real-time trackers. Like the tracker of the present disclosure, SiamMask and SiamRPN use the Siamese architecture.
In this example 3, the proposed tracker achieves an EAO of 0.443, the best among all compared trackers; it is 2.1% higher than the second-ranked SPM (0.434) algorithm and 2.3% higher than the third-ranked SiamMask (0.433).
Table 1 results on VOT2016 dataset
(Table 1 appears as an image in the original document; its numerical values are not reproduced in the text.)
In this example 3, the experimental results of the tracker on the VOT2018 dataset are compared with the results of 8 other recent trackers, as shown in Table 2. The EAO, accuracy and failure scores of the tracker are listed and compared with SiamRPN++, ATOM, TCNN, SiamRPN, SiamMask, SASiamR, SiamVGG and SASiam. All compared trackers except TCNN are real-time trackers, and all except ATOM and TCNN use a Siamese (twin) architecture. In this example 3, the proposed tracker achieves an EAO of 0.415, the best among all compared trackers; it is 0.24% higher than the second-ranked SiamRPN++ (0.414) algorithm and 3.5% higher than the third-ranked ATOM (0.401) algorithm.
TABLE 2 results on VOT2018 dataset
(Table 2 appears as an image in the original document; its numerical values are not reproduced in the text.)
The results of recently proposed real-time trackers on TColor-128 are shown in Table 3, covering the tracker proposed in this example 3 together with UDT, SiamFC, CSRDCF, SCT, CFNet, DSST and KCF. In terms of AUC, the tracker of this example 3 (0.530) ranks first among all compared trackers, 4.5% higher than UDT (0.507). In addition, the tracker of this example 3 also outperforms SiamFC (0.503) and CFNet (0.456), which use the Siamese architecture, improving on them by 5.4% and 16.2% respectively.
TABLE 3 results on TColor-128 dataset
(Table 3 appears as an image in the original document; its numerical values are not reproduced in the text.)
The essence of target tracking is to regress the state of the target over time. Define a regression mapping X → Y, where each y_i ∈ Y is inherently disturbed by observation noise n(x_i), x_i ∈ X. Noise (e.g., sensor or motion noise) causes uncertainty in learning that cannot be reduced even if more data is collected; these data uncertainties are therefore reflected in the regression process.

The noisy regression can be expressed as y_i = f(x_i) + o'·σ(x_i), where o' ~ N(0, I) and f(·) is the learned embedding function. A typical regression model learns only the estimate f(·).

However, as shown in FIG. 2, regression with data uncertainty can estimate not only f(·) but also σ(x_i), which indicates the uncertainty of the predicted value f(·).

Similar to regression uncertainty, a dataset compiled for training a visual tracking network consists of X → Y and also contains data uncertainty. In this case, X represents the image space and Y the ground-truth annotations. Although these large datasets are intended to provide clean annotations for training and testing, as shown in fig. 3, inherent ambiguity may be introduced during annotation due to limitations of the annotation process and annotator preferences. Filtering out these low-quality annotations from large-scale datasets is difficult or even impossible. However, deep learning approaches typically use an embedding Z_i in the latent space. Suppose each x_i ∈ X corresponds to a noise-free version f(x_i) that is less corrupted by ambiguous information; the embedded prediction can then be re-expressed as z_i = f(x_i) + n(x_i), where n(x_i) represents the noise.
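The noisy-embedding view z_i = f(x_i) + n(x_i) can be illustrated with a toy sketch, where f and σ are hypothetical stand-ins for learned mappings (both are assumptions made for illustration):

```python
import numpy as np

def noisy_embedding(x, f, sigma, rng):
    """Data-uncertainty view of the embedding: z = f(x) + n(x), where the noise
    n(x) ~ N(0, sigma(x)^2) models annotation ambiguity that extra data cannot
    remove. f and sigma are illustrative stand-ins for learned mappings."""
    return f(x) + rng.standard_normal() * sigma(x)
```

Averaging many such samples recovers f(x), which is why a model that also estimates σ(x) can separate the clean embedding from the irreducible annotation noise.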
In this embodiment 3, infinite partitions and bounding boxes can be generated in the prediction process, as shown in fig. 4. In a generic prediction process, only the mean of the gaussian distribution of the a priori encoder output is used to provide a priori knowledge and produce a uniform segmentation result and bounding box. In addition, since the a priori encoder in CVAE is conditioned by the calibrated true value mask, samples can be generated from the learned hidden space to obtain multiple segments and bounding boxes, which can provide multiple reasonable predictions.
Example 4
Embodiment 4 of the present invention provides a computer device, including a memory and a processor, where the processor and the memory are in communication with each other, the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute a method for probabilistic twin target tracking based on a variational encoder, where the method includes:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with mean sampling of a hidden space of a prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, and further obtaining a boundary box of the target by a minimum-maximum method to realize the positioning of the target.
Example 5
An embodiment 5 of the present invention provides a computer-readable storage medium, in which a computer program is stored, where the computer program, when executed by a processor, implements a probabilistic twin target tracking method based on a variational encoder, where the method includes:
inputting the template image into the trained shared convolutional neural network to obtain the template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then serially connecting the template image characteristics and the search image characteristics with the mean sampling of the hidden space of the prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, and further obtaining a boundary box of the target by a minimum-maximum method to realize the positioning of the target.
In summary, in order to solve the problem that a trained network has inherent uncertainty and still outputs a deterministic result, the method and system for probability twin target tracking based on a variational encoder according to the embodiments of the present invention introduce uncertainty learning into target tracking and provide a target tracking network model including a bayesian network to generate a complete probability distribution. Specifically, firstly, a novel probabilistic twinning target tracking method is realized by establishing the relation between a twinning network structure and a conditional variational encoder. In a conditional variational encoder, latent target state variables are implicitly spatially encoded; the randomly sampled samples will then be inserted into the twin network to produce the corresponding target state prediction. Furthermore, when frames in which ambiguity exists require a variety of reasonable assumptions, consistent target states can be generated with only a low amount of computation. In the training process, the real calibration value and corresponding training data are combined to be used as condition information and input into an encoder of a condition variation encoder, so that the abundant real value calibration information in a large data set is fully utilized based on a supervision model; noise insertion prediction is obtained by sampling from a low-dimensional hidden space, and by adding noise to neurons, regularization is introduced into a deep neural network to prevent an over-fitting phenomenon, so that robustness is further improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to the specific embodiments shown in the drawings, it is not intended to limit the scope of the present disclosure, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive faculty based on the technical solutions disclosed in the present disclosure.

Claims (10)

1. A probability twin target tracking method based on a conditional variational encoder is characterized by comprising the following steps:
extracting a certain frame of image in a video sequence as a first frame of image for target tracking, calibrating a target object needing tracking in the first frame of image, and initializing target appearance model parameters to obtain a template image;
inputting the template image into a trained shared convolutional neural network to obtain template image characteristics;
inputting the current frame serving as a search image into a shared convolutional neural network and a prior encoder to respectively obtain a search image characteristic and a prior encoder hidden space;
performing cross correlation operation on the template image characteristics and the search image characteristics, and then connecting the template image characteristics and the search image characteristics in series with the mean value of the hidden space of the prior encoder;
and inputting the result after the series connection into a mask decoder to obtain a binary mask, obtain a boundary box of the target and realize the positioning of the target.
2. The probabilistic twin target tracking method based on conditional variational encoder according to claim 1, wherein the training of the shared convolutional neural network comprises:
the search image and a real calibration mask are connected in series and then input to an identification encoder to obtain an identification encoder hidden space;
calculating the regression loss of a boundary frame between a hidden space of a prior encoder and a hidden space of an identification encoder;
performing cross correlation operation on the template image characteristics and the search image characteristics, and connecting the template image characteristics and the search image characteristics in series with a random sampling result of a hidden space of a prior encoder;
inputting the result after the serial connection into a mask decoder to obtain a binary mask;
calculating the cross entropy loss between the binary mask and the real calibration mask;
weighting the regression loss and the cross entropy loss of the bounding box to obtain a loss value of the network;
adopting a stochastic gradient descent method, optimizing the network according to the loss value, and performing iterative training until the evidence lower bound objective is minimized, to obtain the trained shared convolutional neural network.
3. The conditional variational encoder based probabilistic twin target tracking method according to claim 2, wherein minimizing the evidence downline comprises:
establishing a mapping between a target segmentation and a position with uncertainty using an identification encoder;
combining the probability distribution output by the recognition encoder with the result of the cross-correlation operation, randomly sampling from the probability distribution output by the recognition encoder to generate a segmentation prediction, and measuring the distance between the segmentation prediction and the ground-truth label of the original search image by the cross entropy loss;
and adopting KL divergence to punish the distance between the identification encoder and the prior encoder, and combining the cross entropy loss and the KL divergence to obtain a minimized evidence lower line of the shared convolutional neural network.
4. The probabilistic twin target tracking method based on conditional variational encoder according to claim 3, wherein obtaining the encoder hidden space comprises:
applying a plurality of prior encoders on the same search image; through the bounding-box regression loss, the probability output of the prior encoder is supervised in combination with the ground-truth label, and a complete probability distribution encoding all possible features is generated in the hidden space Ω; wherein the parameter of the prior encoder is φ, which estimates the feature variants of the original search image X; the probability output distribution of the prior encoder is an axis-aligned normal distribution with mean μ_prior(X; φ) ∈ Ω and variance σ_prior(X; φ) ∈ Ω.
5. The probabilistic twin target tracking method based on conditional variational encoder according to claim 4, wherein the random sampling from the hidden space comprises:
sampling Z from the Gaussian hidden space N(μ_recog, σ_recog) is converted into randomly sampling noise o′ that follows a standard normal distribution, and the sampling of Z is expressed as: Z = μ + o′σ, o′ ~ N(0, I); where μ denotes the mean of the Gaussian hidden space, σ denotes the variance of the hidden space, and N(0, I) denotes the standard normal distribution.
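A minimal NumPy sketch of this reparameterization trick (hypothetical function name; not part of the claims) — the random sampling is moved into the noise term so the mean and variance stay differentiable:

```python
import numpy as np

def sample_latent(mu, sigma, rng=None):
    """Reparameterized sample Z = mu + o' * sigma, with noise o' ~ N(0, I)."""
    rng = np.random.default_rng(rng)
    o_prime = rng.standard_normal(np.shape(mu))  # noise from the standard normal
    return mu + o_prime * sigma
```

Setting σ = 0 recovers the mean μ exactly, which is how the tracking phase uses the hidden-space mean deterministically.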
6. The probabilistic twin target tracking method based on conditional variational encoder according to claim 5, wherein concatenating the mean of the prior encoder's hidden space comprises:
in the ith iteration, i ∈ {1, 2, ...}, randomly sampling Z_i from the prior encoder probability distribution: Z_i ~ P(·|X) = N(μ_prior(X; φ), σ_prior(X; φ)); broadcasting the sample Z_i onto the N-channel search image feature map so that the feature map and the segmentation mask have the same dimensions, and then concatenating it with the result g_Siamese of the cross-correlation operation to obtain the concatenated features:
Figure FDA0003679289530000031
wherein the function g_comb consists of three consecutive groups of 1 × 1 convolution layers, σ denotes the twin (Siamese) network parameters, and τ denotes the convolution-layer parameters in g_comb.
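An illustrative NumPy sketch (function names hypothetical, not part of the claims) of the two operations described above: broadcasting a sampled latent vector over the search-image feature map, concatenating along the channel axis, and applying a single 1 × 1 convolution of the kind g_comb stacks three of:

```python
import numpy as np

def broadcast_and_concat(feat_map, z):
    """Broadcast latent vector z over the H x W feature map and concatenate channels."""
    c, h, w = feat_map.shape
    z_map = np.broadcast_to(z[:, None, None], (z.shape[0], h, w))  # tile z at every pixel
    return np.concatenate([feat_map, z_map], axis=0)  # shape: (c + len(z), h, w)

def conv1x1(x, weight):
    """A 1 x 1 convolution is a per-pixel linear map over channels.

    weight: (c_out, c_in), x: (c_in, h, w) -> (c_out, h, w)
    """
    return np.tensordot(weight, x, axes=([1], [0]))
```

Because a 1 × 1 convolution mixes only channels, the broadcast latent influences every spatial position identically, which is the intended effect of injecting the sample into the correlation result.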
7. The probabilistic twin target tracking method based on conditional variational encoder according to claim 6, wherein obtaining the binary mask comprises:
the concatenated features
Figure FDA0003679289530000033
are input to a mask decoder g_decoder, which generates the segmentation mask:
Figure FDA0003679289530000032
where θ represents a mask decoder parameter.
8. A probabilistic twin target tracking system based on a conditional variational encoder, comprising:
the image acquisition module is used for extracting a frame from the video sequence as the first frame for target tracking, marking the target object to be tracked in the first frame, and initializing the target appearance model parameters to obtain a template image;
the first extraction module is used for obtaining the image characteristics of the template image by utilizing the trained shared convolutional neural network;
the second extraction module is used for taking the current frame as a search image, and respectively obtaining the characteristics of the search image and the hidden space of the prior encoder by utilizing the trained shared convolutional neural network and the prior encoder;
the operation module is used for performing a cross-correlation operation on the template image features and the search image features, and then concatenating the result with the mean of the prior encoder's hidden space;
and the positioning module is used for inputting the concatenated result into a mask decoder to obtain a binary mask and derive the bounding box of the target, thereby localizing the target.
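A small NumPy sketch (hypothetical helper, not part of the claims) of the final step performed by the positioning module — deriving an axis-aligned bounding box from the binary mask produced by the mask decoder:

```python
import numpy as np

def mask_to_bbox(mask):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a binary mask."""
    ys, xs = np.nonzero(mask)  # coordinates of all foreground pixels
    if ys.size == 0:
        return None  # no target pixels found in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

In practice a tracker may also fit a rotated rectangle to the mask; the axis-aligned box above is the simplest variant consistent with the claim wording.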
9. A computer device comprising a memory and a processor, the processor and the memory in communication with each other, the memory storing program instructions executable by the processor, characterized in that: the processor calls the program instructions to perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1-7.
CN202011434400.5A 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder Active CN112541944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011434400.5A CN112541944B (en) 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder

Publications (2)

Publication Number Publication Date
CN112541944A CN112541944A (en) 2021-03-23
CN112541944B true CN112541944B (en) 2022-07-12

Family

ID=75019817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011434400.5A Active CN112541944B (en) 2020-12-10 2020-12-10 Probability twin target tracking method and system based on conditional variational encoder

Country Status (1)

Country Link
CN (1) CN112541944B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223055B * 2021-05-31 2022-08-05 Huazhong University of Science and Technology Image target tracking model establishing method and image target tracking method
CN113435488B * 2021-06-17 2023-11-07 Shenzhen University Image sampling probability improving method and application thereof
CN114155215B * 2021-11-24 2023-11-10 Sun Yat-sen University Cancer Center Nasopharyngeal carcinoma recognition and tumor segmentation method and system based on MR images
CN115990875B * 2022-11-10 2024-05-07 South China University of Technology Flexible cable state prediction and control system based on hidden space interpolation
CN117291952B * 2023-10-31 2024-05-17 China University of Mining and Technology (Beijing) Multi-target tracking method and device based on speed prediction and image reconstruction
CN117934979A * 2024-03-22 2024-04-26 Nanjing University Target identification method based on fractal coder-decoder

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110244715A * 2019-05-23 2019-09-17 Xi'an University of Technology High-precision cooperative tracking method for multiple mobile robots based on ultra-wideband technology
CN111968155A * 2020-07-23 2020-11-20 Tianjin University Target tracking method based on segmented target mask updating template

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11378654B2 (en) * 2018-08-02 2022-07-05 Metawave Corporation Recurrent super-resolution radar for autonomous vehicles
CN110009013B * 2019-03-21 2021-04-27 Tencent Technology (Shenzhen) Co., Ltd. Encoder training and representation information extraction method and device
US11227179B2 * 2019-09-27 2022-01-18 Intel Corporation Video tracking with deep Siamese networks and Bayesian optimization
CN111626154B * 2020-05-14 2023-04-07 Minjiang University Face tracking method based on convolutional variational encoder
CN111862159A * 2020-07-23 2020-10-30 Beijing Yisa Technology Co., Ltd. Improved target tracking and segmentation method, system and medium for twin convolutional network


Similar Documents

Publication Publication Date Title
CN112541944B (en) Probability twin target tracking method and system based on conditional variational encoder
Marchetti et al. Mantra: Memory augmented networks for multiple trajectory prediction
CN110335337B (en) Method for generating visual odometer of antagonistic network based on end-to-end semi-supervision
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
US11640714B2 (en) Video panoptic segmentation
CN111161412B (en) Three-dimensional laser mapping method and system
Chen et al. GPU-accelerated real-time stereo estimation with binary neural network
CN113241128A (en) Molecular property prediction method based on molecular space position coding attention neural network model
CN111583300B (en) Target tracking method based on enrichment target morphological change update template
CN111325766B (en) Three-dimensional edge detection method, three-dimensional edge detection device, storage medium and computer equipment
KR20200063368A (en) Unsupervised stereo matching apparatus and method using confidential correspondence consistency
Fan et al. Siamese residual network for efficient visual tracking
CN113298036A (en) Unsupervised video target segmentation method
CN113298014A (en) Closed loop detection method, storage medium and equipment based on reverse index key frame selection strategy
CN117152554A (en) ViT model-based pathological section data identification method and system
CN113516682B (en) Loop detection method of laser SLAM
Wen et al. Efficient algorithms for maximum consensus robust fitting
CN114972438A (en) Self-supervision target tracking method based on multi-period cycle consistency
CN115482252A (en) Motion constraint-based SLAM closed loop detection and pose graph optimization method
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
Tsintotas et al. The revisiting problem in simultaneous localization and mapping
CN110458867B (en) Target tracking method based on attention circulation network
CN117173607A (en) Multi-level fusion multi-target tracking method, system and computer readable storage medium
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance
CN114638953B (en) Point cloud data segmentation method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant