WO2019228654A1 - Method for training a prediction system and system for sequence prediction - Google Patents

Method for training a prediction system and system for sequence prediction Download PDF

Info

Publication number
WO2019228654A1
WO2019228654A1 (PCT/EP2018/064534)
Authority
WO
WIPO (PCT)
Prior art keywords
model
samples
prediction
variable
sequence
Prior art date
Application number
PCT/EP2018/064534
Other languages
French (fr)
Inventor
Apratim BHATTACHARYYA
Mario Fritz
Bernt Schiele
Daniel OLMEDA REINO
Original Assignee
Toyota Motor Europe
Max-Planck-Institut Für Informatik
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Motor Europe and Max-Planck-Institut Für Informatik
Priority to PCT/EP2018/064534
Publication of WO2019228654A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for training a prediction system. The prediction system comprises a hidden variable model using a hidden random variable for sequence prediction. The method comprises the steps of: * multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable, * use of the best of multiple samples to train the model, the best sample being the closest to the ground truth. The invention further relates to a system for sequence prediction.

Description

Method for training a prediction system and system for sequence prediction
FIELD OF THE DISCLOSURE
[0001] The present disclosure is related to a method for training a prediction system and to a system for sequence prediction, in particular employing a recurrent neural network (RNN).
BACKGROUND OF THE DISCLOSURE
[0002] Anticipation of future events and states of their environment is a key competence for autonomous agents to successfully operate in the real world. Predicting the future is important in many scenarios ranging from autonomous driving to precipitation forecasting. Many of these tasks can be formulated as sequence prediction problems. Given a past sequence of events, probable future outcomes are to be predicted.
[0003] Recurrent Neural Networks (RNN), especially LSTM (Long short-term memory) formulations, are state-of-the-art models for sequence prediction tasks, cf. e.g.:
A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-971, 2016, or
N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0004] These approaches predict only point estimates. However, many sequence prediction problems are only partially observed or stochastic in nature and hence the distribution of future sequences can be highly multi-modal.
[0005] Consider e.g. the task of predicting future pedestrian trajectories. In many cases, there is no information about the intentions of the pedestrians in the scene. A pedestrian, after walking over a zebra crossing, might decide to turn either left or right. A point estimate in such a situation would be highly unrealistic. Therefore, in order to incorporate uncertainty of future outcomes, structured predictions may be used. Structured prediction implies learning a one-to-many mapping of a given fixed sequence to plausible future sequences, cf. e.g.:
K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491, 2015.
This leads to more realistic predictions and enables probabilistic inference.
[0006] Recent work has proposed deep conditional generative models with Gaussian latent variables for structured sequence prediction, cf.:
N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0007] The Conditional Variational Auto-Encoder (CVAE) framework described by K. Sohn et al. is used in N. Lee et al. for learning the Gaussian latent variables.
[0008] However, there are two key limitations of this CVAE framework. First, the currently used objectives hinder learning of diverse samples due to a marginalization over multi-modal futures. Second, a mismatch in latent variable distribution between training and testing leads to errors in model fitting.
SUMMARY OF THE DISCLOSURE
[0009] Currently, it remains desirable to provide a method for training a prediction system and a system for sequence prediction which is able to generate more accurate and diverse prediction samples. In particular, it remains desirable to better capture the true variations in training data.
[0010] Therefore, according to the embodiments of the present disclosure, it is provided a method for training a prediction system. The prediction system comprises a hidden variable model using a hidden random variable (z) for sequence prediction. The method comprises the steps of:
• multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable,
• use of the best of multiple samples to train the model, the best sample being the closest to the ground truth.
[0011] In other words, the same sequence input (x) is desirably inputted into the hidden variable model multiple times, so that the model outputs in response respectively multiple distinct samples (y). The distinction (difference) between the output samples is due to (i.e. conditioned by) the random variable z. The best of the multiple samples (i.e. the "Best of Many" sample) is used to train the model (i.e. the model is trained by using / based on the best of the multiple samples).
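For illustration only, one training step may be sketched as follows. This is a minimal PyTorch-style sketch under assumptions not taken from the disclosure: model(x, y) is a hypothetical hidden variable model that internally draws a fresh latent sample z (from the recognition network, which sees the ground truth y during training) and returns one predicted sequence, the distance to the ground truth is measured as a mean squared error, and the KL regularization term of the full objective (equation (8) below) is omitted.

```python
import torch

def best_of_many_step(model, x, y, optimizer, T=10):
    """One training step: draw T samples for the same input x and back-propagate
    only the error of the sample that is closest to the ground truth y."""
    samples = [model(x, y) for _ in range(T)]             # T distinct samples, each with a fresh z
    errors = torch.stack([((s - y) ** 2).mean() for s in samples])
    best_error = errors.min()                             # error of the "Best of Many" sample
    optimizer.zero_grad()
    best_error.backward()                                 # the remaining T-1 samples are disregarded
    optimizer.step()
    return best_error.item()
```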
[0012] Accordingly, a major contribution of the present disclosure to the prior art is a "Best of Many" sample objective that leads to more accurate and more diverse predictions that better capture the true variations in real-world sequence data.
[0013] The (pre-trained) model according to the present disclosure is thus in particular suitable for predicting scenarios that induce multi-modal distributions over future sequences.
[0014] Hence, the two key limitations of this CVAE framework of the prior art, as described above, can be overcome resulting in more accurate and diverse samples.
[0015] The model is desirably trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
[0016] The model may be trained based on the best of the multiple samples in relation to the ground truth.
[0017] Accordingly, the difference between the best sample and the ground truth may serve to determine an error based on which the model is desirably trained.
[0018] The hidden random variable may be a Gaussian latent variable and/or may have a zero mean Gaussian distribution.
[0019] During training the random variable may be conditioned by the ground truth, e.g. by using a Convolutional - Long short-term memory (LSTM) recognition network.
[0020] Once the model is trained and is used (i.e. as a pre-trained model) e.g. in testing or in a real-life application, the random variable is desirably not conditioned any more by the ground truth.
[0021] The trained model desirably predicts possible sequences by outputting multiple samples (y) for the same sequence input (x) conditioned by the varying random variable (z).
[0022] The present disclosure further relates to a computer program comprising instructions for executing the steps of the method, when the program is executed by a computer.
[0023] The present disclosure further relates to a system for sequence prediction, comprising a hidden variable model using a hidden random variable (z) for sequence prediction. The hidden variable model is configured to output multiple distinct samples (y) representing predicted sequences in response to a sequence input (x), the multiple distinct samples (y) being conditioned by the random variable (z). The model is pre-trained based on the best of the multiple samples, the best sample being the closest to the ground truth.
[0024] Accordingly, the system according to the present disclosure is pre-trained by using the "Best of Many" sample objective that leads to more accurate and more diverse predictions that better capture the true variations in real-world sequence data. Therefore, the pre-trained model according to the present disclosure is in particular suitable for predicting scenarios that induce multi-modal distributions over future sequences.
[0025] The predicted sequence may comprise a single input of a data set (x), e.g. a single image, based on which the possible distinct output samples (y) (e.g. distinct images) are desirably predicted. Accordingly, the sequence may comprise only two images, i.e. one as input and one as predicted output.
[0026] The model is desirably pre-trained based on the best of multiple samples outputted for a training sequence input (x).
[0027] The model may be pre-trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
[0028] The model may be pre-trained based on the best of the multiple samples in relation to the ground truth.
[0029] The hidden random variable may be a Gaussian latent variable and/or may have a zero mean Gaussian distribution.
[0030] The model is desirably pre-trained based on the best of multiple samples. The model may be pre-trained such that the multiple samples (y) are outputted for the same training sequence input (x) conditioned by the varying random variable (z). In this regard, the random variable may be conditioned by the ground truth.
[0031] The model may be or comprise a neural network, e.g. a recurrent neural network (RNN). More particularly, the model may be or comprise an RNN encoder-decoder network, or a conditional variational auto-encoder (CVAE).
[0032] The system may further be configured to carry out the method of the present disclosure, as described above. For example, the system may comprise a data set (or be configured to receive the data set) for training the model of the present disclosure. It may additionally or alternatively comprise a sensor, e.g. a digital camera, to track image sequences, which may be used as the sequence input.
[0033] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.
[0034] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
[0035] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] Fig. 1 shows a schematic block diagram of a system according to embodiments of the present disclosure;
[0037] Fig. 2 shows a schematic representation of a deep conditional generative model according to embodiments of the present disclosure;
[0038] Fig. 3a shows a schematic representation of a model for structured trajectory prediction according to embodiments of the present disclosure; and
[0039] Fig. 3b shows a schematic representation of a model for structured image sequence prediction according to embodiments of the present disclosure.
DESCRIPTION OF THE EMBODIMENTS
[0040] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0041] Fig. 1 shows a block diagram of a system 10 according to embodiments of the present disclosure. The system may have various further functions, e.g. may be a robotic system or a camera system. It may further be integrated in a vehicle.
[0042] The system 10 may comprise an electronic circuit, a processor (shared, dedicated, or group), a combinational logic circuit, a memory that executes one or more software programs, and/or other suitable components that provide the described functionality. In other words, system 10 may be a computer device.
[0043] The system may be connected to a memory, which may store data, e.g. a computer program which when executed, carries out the method according to the present disclosure. In particular, the system or the memory may store software which may comprise a hidden variable model (e.g. implemented as a neural network) according to the present disclosure.
[0044] The system 10 has an input for receiving digital images or a sequence (or a stream) of digital images. In particular, the system 10 may be connected to an optical sensor 1, in particular a digital camera. The digital camera 1 is configured such that it can record a scene, and in particular output digital data to the system 10.
[0045] The system may be configured to identify objects in the images, e.g. by carrying out a computer vision algorithm for detecting the presence and location of objects in a sensed scene. For example, persons, vehicles and other objects may be detected. The system may track the detected objects across the images. It may for example detect the trajectory of a moving object. Based on said detected trajectory the system may determine (predict) samples of the (future) continuation of said trajectory (i.e. predict samples of possible future events).
[0046] In another example, the system may not (or not only) receive data from a camera but may receive a data set based on which it predicts possible future events. In particular the system may be configured to perform an image sequence prediction, as described in more detail in the following. For example, the system may predict possible (future) weather evolutions based on e.g. weather radar intensity images.
[0047] In the following, the operation of the hidden variable model according to the present disclosure is explained in more detail with reference to the "best of many" sample objective for Gaussian latent variable models according to the present disclosure.
[0048] In particular, an overview of deep conditional generative models with Gaussian latent variables is first given. Then, the "best-of-many" samples objective function according to the present disclosure is introduced. Thereafter, exemplary conditional generative models are described which may serve as the test bed for the objective according to the present disclosure. The model for structured trajectory prediction, which is based on the sampling module described by N. Lee et al. (see above), is further described, and extensions are considered which additionally condition on visual input and generate full image sequences.
[0049] Fig. 2 shows a schematic representation of a deep conditional generative model according to embodiments of the present disclosure. Given an input sequence x, a latent variable z (i.e. a random variable) is drawn from the conditional distribution p(z|x) (assumed Gaussian). The output sequence ŷ is then sampled from the distribution p_θ(y|x, z) of the conditional generative model according to the present disclosure, parameterized by θ. The latent variable z (i.e. a random variable) enables a one-to-many mapping and the learning of multiple modes (i.e. multiple distinct possible occurrences) of the true posterior distribution p(y|x). In practice, the simplifying assumption is made that z is independent of x and p(z|x) is N(0, I). Next, the training of such models is described.
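For illustration only, test-time sampling from such a model may be sketched as follows (PyTorch-style; the decoder interface, the latent dimension and the number of samples are assumptions, not taken from the disclosure):

```python
import torch

@torch.no_grad()
def draw_samples(decoder, x, n_samples=20, z_dim=64):
    """Return n_samples distinct predicted sequences for the same input x,
    each conditioned on a fresh latent sample z ~ N(0, I)."""
    samples = []
    for _ in range(n_samples):
        z = torch.randn(x.size(0), z_dim)   # z is independent of x at test time
        samples.append(decoder(x, z))       # one sample of p_theta(y | x, z)
    return samples
```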
Conditional Variational Autoencoder Based Training Objective
[0050] It is desirable to maximize the data log-likelihood p_θ(y|x). To estimate the data log-likelihood of the model p_θ according to the present disclosure, one possibility is to perform Monte-Carlo sampling of the latent variable z. For T samples, this leads to the following estimate, cf. equation (1):

  log p_θ(y|x) ≈ log [ (1/T) Σ_{i=1..T} p_θ(y | x, ẑ_i) ],  ẑ_i ~ p(z|x) = N(0, I)    (1)

[0051] This estimate is unbiased but has high variance. It would underestimate the log-likelihood for some samples and overestimate it for others, especially if T is small. This would in turn lead to high-variance weight updates.
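For illustration only, the estimate (1) may be computed in a numerically stable way via a log-sum-exp, as in the following sketch (PyTorch-style; log_p is a hypothetical function returning log p_θ(y | x, z) for one latent sample and is not taken from the disclosure):

```python
import math
import torch

def mc_log_likelihood(log_p, x, y, z_dim=64, T=100):
    """Estimate log[(1/T) * sum_i p_theta(y | x, z_i)] with z_i ~ N(0, I)."""
    log_probs = torch.stack(
        [log_p(y, x, torch.randn(x.size(0), z_dim)) for _ in range(T)])  # shape (T, batch)
    return torch.logsumexp(log_probs, dim=0) - math.log(T)
```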
[0052] It is possible to reduce the variance of the updates by estimating the log-likelihood through importance sampling during training. The latent variable z may be sampled from a recognition network q_φ using e.g. the re-parameterization trick. The data log-likelihood is, cf. equation (2):

  log p_θ(y|x) = log ∫ p_θ(y | x, z) [ p(z|x) / q_φ(z|x, y) ] q_φ(z|x, y) dz    (2)
[0053] The integral in (2) is computationally intractable. A variational lower bound of the data log-likelihood (2) may be derived, which can be estimated empirically using Monte-Carlo integration, cf. e.g. equation (3):
  log p_θ(y|x) ≥ (1/T) Σ_{i=1..T} log p_θ(y | x, ẑ_i) − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (3)
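For illustration only, the estimate in (3) may be sketched as follows (PyTorch-style; log_p, q_mu and q_logvar are hypothetical handles for the decoder log-likelihood and the recognition network outputs, and are not taken from the disclosure):

```python
import torch

def cvae_objective(log_p, q_mu, q_logvar, x, y, T=10):
    """(1/T) * sum_i log p_theta(y | x, z_i) - KL(q_phi(z|x,y) || N(0, I)),
    with z_i drawn from q_phi via the re-parameterization trick."""
    std = torch.exp(0.5 * q_logvar)
    recon = torch.stack(
        [log_p(y, x, q_mu + std * torch.randn_like(std)) for _ in range(T)]
    ).mean(dim=0)                                              # all T samples weighted equally
    kl = -0.5 * torch.sum(1 + q_logvar - q_mu ** 2 - q_logvar.exp(), dim=-1)
    return recon - kl
```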
[0054] The lower bound in (3) weights all samples ẑ_i ~ q_φ(z|x, y) equally, and so they must all ascribe high probability to the data point (x, y). This introduces a strong constraint on the recognition network q_φ. Therefore, the model is forced to trade off between a good estimate of the data log-likelihood and the KL divergence between the training and test latent variable distributions. One possibility to close the gap introduced between the training and test pipelines is to use a hybrid objective of the form
  α L_CVAE + (1 − α) L_MC,  α ∈ [0, 1]

(i.e. a weighted combination of the CVAE bound (3) and an objective which, as in (1), samples the latent variable from the prior).
Although such a hybrid objective has shown modest improvement in performance in certain cases, it would not provide any significant improvement over the standard CVAE objective in structured sequence prediction tasks.
[0055] Therefore, it is proposed to derive the "best-of-many-samples" objective, which on the one hand encourages sample diversity and on the other hand aims to close the gap between the training and testing pipelines.
Best of Many Samples Objective
[0056] Unlike in (3), it is proposed not to weight each sample equally. As q_φ(z|x, y) in the recognition network is normally distributed, the integral can be very well approximated on a large enough bounded interval [a, b]. Therefore, it is possible to use the First Mean Value Theorem of Integration, cf. equation (4):

  log ∫_a^b p_θ(y | x, z) [ p(z|x) / q_φ(z|x, y) ] q_φ(z|x, y) dz
    = log [ ∫_a^b p_θ(y | x, z) q_φ(z|x, y) dz ] + log [ p(z′|x) / q_φ(z′|x, y) ]  for some z′ ∈ [a, b]    (4)
[0057] It is further possible to lower bound (4) with the minimum of the term on the right, cf. equation (5):
  log p_θ(y|x) ≥ log [ ∫_a^b p_θ(y | x, z) q_φ(z|x, y) dz ] + min_z log [ p(z|x) / q_φ(z|x, y) ]    (5)
[0058] The first term on the right of (5) may be estimated using Monte-Carlo integration. The minimum in the second term on the right of (5) is difficult to estimate; therefore it may be approximated by the KL divergence over the full distribution. The KL divergence heavily penalizes q_φ(z|x, y) when it is high for low values of p(z|x) (which leads to a low value of the ratio of the distributions). This leads to the following "many-sample" objective (more details in the supplementary section), cf. equation (6):

  log p_θ(y|x) ≈ log [ (1/T) Σ_{i=1..T} p_θ(y | x, ẑ_i) ] − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (6)
[0059] The recognition network q_φ has multiple chances to draw samples with high posterior probability p_θ(y | x, z). This encourages diversity in the generated samples. Furthermore, the data log-likelihood estimate in this objective is tighter, as log[ (1/T) Σ_i p_θ(y | x, ẑ_i) ] ≥ (1/T) Σ_i log p_θ(y | x, ẑ_i) follows from Jensen's inequality. Therefore, this bound loosens the constraints on the recognition network q_φ and allows it to more closely match the latent variable distribution p(z|x). However, as the focus is desirably on regression tasks, probabilities are of the form

  p_θ(y | x, z) ∝ exp( −‖y − ŷ‖² ).

Therefore, in practice the log-average term in (6) can cause numerical instabilities due to limited machine precision in representing the probability exp( −‖y − ŷ‖² ). Therefore, a "Best of Many Samples" approximation of (6) is used. The constant 1/T term can be pulled out of the average in (6) and the sum approximated with the maximum, cf. equations (7) and (8):

  log p_θ(y|x) ≈ log(1/T) + log [ max_i p_θ(y | x, ẑ_i) ] − D_KL( q_φ(z|x, y) ‖ p(z|x) )    (7)

  log p_θ(y|x) ≈ max_i log p_θ(y | x, ẑ_i) − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (8)

[0060] Similar to (6), this objective encourages diversity and loosens the constraints on the recognition network q_φ, as only the best sample is considered. During training, p_θ initially assigns low probability to the data for all samples. The log(T) difference between (6) and (8) would then be dominated by the low data log-likelihood. Later on, as both objectives promote diversity, the log-average term in (6) would be dominated by one term in the average. Therefore, (6) would be well approximated by the maximum of the terms in the average. Furthermore, (8) avoids numerical stability issues.
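For illustration only, the "Best of Many Samples" objective (8) may be sketched as follows, mirroring the CVAE sketch above but keeping only the best reconstruction term (PyTorch-style; interfaces are hypothetical, and the negative of this quantity would be minimized as the training loss):

```python
import torch

def best_of_many_objective(log_p, q_mu, q_logvar, x, y, T=10):
    """max_i log p_theta(y | x, z_i) - KL(q_phi(z|x,y) || N(0, I)), z_i ~ q_phi."""
    std = torch.exp(0.5 * q_logvar)
    log_probs = torch.stack(
        [log_p(y, x, q_mu + std * torch.randn_like(std)) for _ in range(T)])  # (T, batch)
    best = log_probs.max(dim=0).values        # only the best sample contributes
    kl = -0.5 * torch.sum(1 + q_logvar - q_mu ** 2 - q_logvar.exp(), dim=-1)
    return best - kl
```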
Model Architectures for Structured Sequence Prediction
[0061] The model architectures according to the present disclosure are desirably based on RNN Encoder-Decoders. LSTM formulations may e.g. be used as RNNs for structured trajectory prediction tasks (Figure 3a) and Convolutional LSTM formulations (Figure 3b) for structured image sequence prediction tasks. During training, LSTM recognition networks may be used in the case of trajectory prediction (Figure 3a), and Conv-LSTM recognition networks (Figure 3b) may be used for image sequence prediction. Note that, as the simplifying assumption may be made that z is independent of x, the recognition networks are desirably conditioned only on y.
[0062] Fig. 3a shows a schematic representation of a model for structured trajectory prediction according to embodiments of the present disclosure. The model for structured trajectory prediction (see Figure 3a) according to the disclosure may be based on (or correspond to) the sampling module described in : N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0063] The input sequence x is processed using an embedding layer to extract features and the embedded sequence is read by the encoder LSTM. The encoder LSTM produces a summary vector v (providing a summary of the past information), which is its internal state after reading the input sequence x. The decoder LSTM is conditioned on the summary vector v and additionally a sample of the latent variable z. The decoder LSTM is unrolled in time and a prediction is generated by a linear transformation of its output. Therefore, the predicted sequence at a certain time-step
ŷ_{t+1} is conditioned on the output ŷ_t at the previous time-step, the summary vector v and the latent variable z. As the summary v is deterministic given x, there may be the following equation (9):

  p_θ(ŷ | x, z) = Π_t p_θ( ŷ_{t+1} | ŷ_t, v, z )    (9)
[0064] Conditioning the predicted sequence at all time-steps upon a single sample of z (i.e. z is defined for all time-steps of one sequence) enables z to capture global characteristics (e.g. speed and direction of motion) of the future sequence and the generation of temporally consistent sample sequences ŷ.
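For illustration only, such an encoder-decoder may be sketched as follows (PyTorch-style). Layer sizes, the injection of z by concatenation at every time-step, and the use of the encoder state to initialize the decoder are assumptions; the disclosure only fixes the overall structure (embedding, encoder LSTM, summary v, decoder LSTM conditioned on v and z, linear output). At training time the forward pass would be run T times with z drawn from the recognition network; at test time z is drawn from N(0, I).

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Sketch of the structured trajectory prediction model of Figure 3a."""

    def __init__(self, point_dim=2, embed_dim=32, hidden_dim=64, z_dim=64):
        super().__init__()
        self.embed = nn.Linear(point_dim, embed_dim)          # feature embedding of x
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(point_dim + z_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, point_dim)           # linear transform of the decoder output
        self.z_dim = z_dim

    def forward(self, x, pred_len, z=None):
        # x: (batch, obs_len, point_dim) past trajectory
        _, (h, c) = self.encoder(torch.relu(self.embed(x)))   # summary v = encoder state after reading x
        h, c = h.squeeze(0), c.squeeze(0)
        if z is None:
            z = torch.randn(x.size(0), self.z_dim)            # z ~ N(0, I) at test time
        y_prev, preds = x[:, -1], []
        for _ in range(pred_len):                             # unroll the decoder in time
            h, c = self.decoder(torch.cat([y_prev, z], dim=-1), (h, c))
            y_prev = self.out(h)                              # prediction at this time-step
            preds.append(y_prev)
        return torch.stack(preds, dim=1)                      # (batch, pred_len, point_dim)
```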
[0065] In the case of dynamic agents, e.g. pedestrians in traffic scenes, the future trajectory is highly dependent upon the environment, e.g. the layout of the streets. Therefore, additionally conditioning samples on sensory input (e.g. visuals of the environment) may enable more accurate sample generation. A CNN may be used to extract a summary of a visual observation of a scene. This visual summary is given as input to the decoder LSTM, ensuring that the generated samples are additionally conditioned on the visual input.
[0066] If the sequence (x, y) in question consists of images, e.g. frames of a video, the trajectory prediction model of Figure 3a is not suitable to exploit the spatial structure of the image sequence. More specifically, if a pixel y_{t+1}^{i,j} at time-step t+1 of the image sequence y is considered, the pixel value at time-step t+1 depends upon only the pixel y_t^{i,j} and a certain neighbourhood around it. Furthermore, spatially neighbouring pixels are correlated. This spatial structure can be exploited by using Convolutional LSTMs (Conv-LSTMs) as RNN encoder-decoders, cf. e.g.:
S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802-810, 2015.
[0067] Conv-LSTMs retain spatial information by considering the hidden states h and cell states c as 3D tensors: the cell and hidden states are composed of vectors c_t^{i,j}, h_t^{i,j} corresponding to each spatial position (i, j). New cell states, hidden states and outputs are computed using convolutional operations. Therefore, the new cell states c_{t+1}^{i,j} and hidden states h_{t+1}^{i,j} depend upon only a local spatial neighbourhood of c_t^{i,j}, h_t^{i,j}, thus preserving spatial information. Accordingly, conditional generative models with Conv-LSTM networks may be used for structured image sequence prediction (Figure 3b).
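For illustration only, a Conv-LSTM cell may be sketched as follows (PyTorch-style; a simplified variant of the formulation of S. Xingjian et al. cited above, without peephole connections, with kernel size and channel counts as assumptions):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Conv-LSTM cell: hidden and cell states are 3D tensors and all gates are
    computed with convolutions, so each spatial position depends only on a
    local neighbourhood of the previous states."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)  # keep spatial size

    def forward(self, x, state):
        # x: (batch, in_channels, H, W); state = (h, c), each (batch, hidden_channels, H, W)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)        # new cell state, local spatial update
        h = o * torch.tanh(c)                # new hidden state
        return h, c
```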
[0068] Fig. 3b shows a schematic representation of a model for structured image sequence prediction according to embodiments of the present disclosure. The encoder and decoder used each consist of two stacked Conv-LSTMs for feature aggregation. As before, the output is conditioned on a latent variable z to model multiple modes of the conditional distribution p(y|x). The future states of neighboring pixels are highly correlated. However, spatially distant parts of the image sequences can evolve independently. In order to take into account the spatial structure of images, latent variables z may be used which are 3D tensors.
[0069] As detailed in Figure 3b, the input image sequence x is processed using a convolutional embedding layer. The Conv-LSTM reads the embedded input sequence and produces a 3D tensor v as the summary. The 3D summary v and the latent variable z are given as input to the Conv-LSTM decoder at every time-step. The cell state, hidden state or output at a certain spatial position (i, j) is conditioned on a sub-tensor z_ij of the latent tensor z. Spatially neighbouring cell states, hidden states (and thus outputs) are therefore conditioned on spatially neighbouring sub-tensors z_ij. This, coupled with the spatial information preserving property of Conv-LSTMs detailed above, enables z to capture spatial location specific characteristics of the future image sequence and allows for modeling the correlation of future states of spatially neighboring pixels. This ensures spatial consistency of sampled output sequences ŷ. Furthermore, as in the fully connected case, conditioning the full output sequence sample ŷ on a single sample of z ensures temporal consistency.
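For illustration only, one way to condition the Conv-LSTM decoder on a 3D latent tensor is to concatenate z channel-wise with the summary v at every time-step, as in the following sketch (PyTorch-style; the channel-wise concatenation and the number of latent channels are assumptions, not taken from the disclosure):

```python
import torch

def conv_decoder_input(v, z=None, z_channels=8):
    """v: (batch, C, H, W) summary tensor. Returns the decoder input with a
    spatially structured latent tensor z appended channel-wise, so that each
    spatial position (i, j) is conditioned on its own sub-tensor z_ij."""
    if z is None:
        z = torch.randn(v.size(0), z_channels, v.size(2), v.size(3))
    return torch.cat([v, z], dim=1), z
```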
[0070] In summary, the "best of many" sample objective for Gaussian latent variable models is advantageous for learning conditional models on multi-modal distributions. The learnt latent representation is better matched between training and test time which in turn leads to more accurate samples.
[0071] Throughout the description, including the claims, the term "comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms "substantially" and/or "approximately" and/or "generally" should be understood to mean falling within such accepted tolerances.
[0072] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0073] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

1. A method for training a prediction system,
the prediction system comprises a hidden variable model using a hidden random variable for sequence prediction,
the method comprising the steps of:
multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable,
use of the best of the multiple samples (y) to train the model, the best sample being the closest to the ground truth.
2. The method according to claim 1, wherein the model is trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
3. The method according to any one of the preceding claims 1 and 2, wherein
the model is trained based on the best of the multiple samples in relation to the ground truth.
4. The method according to any one of the preceding claims, wherein the hidden random variable is a Gaussian latent variable and/or has a zero mean Gaussian distribution, wherein
during training the random variable is conditioned by the ground truth, in particular by using a Convolutional - Long short-term memory (LSTM) recognition network.
5. The method according to any one of the preceding claims, wherein the trained model predicts possible sequences by outputting multiple samples
(y) for the same sequence input (x) conditioned by the varying random variable
(z).
6. A computer program comprising instructions for executing the steps of the method according to any one of the preceding method claims, when the program is executed by a computer.
7. A system for sequence prediction, comprising:
a hidden variable model using a hidden random variable (z) for sequence prediction, wherein
the hidden variable model is configured to output multiple distinct samples (y) representing predicted sequences in response to a sequence input (x), the multiple distinct samples (y) being conditioned by the random variable (z), wherein
the model is pre-trained based on the best of the multiple samples, the best sample being the closest to the ground truth.
8. The system according to the preceding system claim, wherein the model is pre-trained based on the best of multiple samples outputted for a training sequence input (x).
9. The system according to any one of the preceding system claims, wherein
the model is pre-trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
10. The system according to any one of the preceding system claims, wherein
the model is pre-trained based on the best of the multiple samples in relation to the ground truth.
11. The system according to any one of the preceding system claims, wherein
the hidden random variable is a Gaussian latent variable and/or has a zero mean Gaussian distribution.
12. The system according to any one of the preceding system claims, wherein the model is pre-trained based on the best of multiple samples, the multiple samples (y) being outputted for the same training sequence input (x) conditioned by the varying random variable (z), the random variable being conditioned by the ground truth.
13. The system according to any one of the preceding system claims, wherein
the model comprises a neural network, in particular a RNN encoder-decoder network, or a conditional variational auto-encoder (CVAE).
PCT/EP2018/064534 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction WO2019228654A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Publications (1)

Publication Number Publication Date
WO2019228654A1 true WO2019228654A1 (en) 2019-12-05

Family

ID=62530215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Country Status (1)

Country Link
WO (1) WO2019228654A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915881A (en) * 2020-06-11 2020-11-10 西安理工大学 Small sample traffic flow prediction method based on variational automatic encoder
CN112967275A (en) * 2021-03-29 2021-06-15 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and device, terminal equipment and readable storage medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A. ALAHI; K. GOEL; V. RAMANATHAN; A. ROBICQUET; L. FEI-FEI; S. SAVARESE: "Social LSTM: Human trajectory prediction in crowded spaces", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 961-971, XP033021272, DOI: doi:10.1109/CVPR.2016.110
ALEX X LEE ET AL: "Stochastic Adversarial Video Prediction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 April 2018 (2018-04-04), XP080867640 *
K. SOHN; H. LEE; X. YAN: "Learning structured output representation using deep conditional generative models", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 3483 - 3491
N. LEE; W. CHOI; P. VERNAZA; C. B. CHOY; P. H. TORR; M. CHANDRAKER: "Desire: Distant future prediction in dynamic scenes with interacting agents", CVPR, 2017
NAMHOON LEE ET AL: "DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 April 2017 (2017-04-14), XP080763013, DOI: 10.1109/CVPR.2017.233 *
S. XINGJIAN; Z. CHEN; H. WANG; D.-Y. YEUNG; W.-K. WONG; W.-C. WOO: "Convolutional LSTM network: A machine learning approach for precipitation nowcasting", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 802-810
XINGJIAN SHI ET AL: "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting", 19 September 2015 (2015-09-19), XP055368436, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.04214.pdf> [retrieved on 20170502] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915881A (en) * 2020-06-11 2020-11-10 西安理工大学 Small sample traffic flow prediction method based on variational automatic encoder
CN112967275A (en) * 2021-03-29 2021-06-15 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and device, terminal equipment and readable storage medium
WO2022206036A1 (en) * 2021-03-29 2022-10-06 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and apparatus, terminal device, and readable storage medium

Similar Documents

Publication Publication Date Title
Mukhoti et al. Evaluating bayesian deep learning methods for semantic segmentation
EP3673417B1 (en) System and method for distributive training and weight distribution in a neural network
Bhattacharyya et al. Accurate and diverse sampling of sequences based on a “best of many” sample objective
US10275691B2 (en) Adaptive real-time detection and examination network (ARDEN)
Becker-Ehmck et al. Switching linear dynamics for variational bayes filtering
CN113264066B (en) Obstacle track prediction method and device, automatic driving vehicle and road side equipment
Breitenstein et al. Systematization of corner cases for visual perception in automated driving
JP4208898B2 (en) Object tracking device and object tracking method
Fernando et al. Going deeper: Autonomous steering with neural memory networks
Hu et al. A framework for probabilistic generic traffic scene prediction
Gao et al. Distributed mean-field-type filters for traffic networks
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
CN111931720B (en) Method, apparatus, computer device and storage medium for tracking image feature points
CN111652181B (en) Target tracking method and device and electronic equipment
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN115690153A (en) Intelligent agent track prediction method and system
CN112802076A (en) Reflection image generation model and training method of reflection removal model
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN111695627A (en) Road condition detection method and device, electronic equipment and readable storage medium
WO2019228654A1 (en) Method for training a prediction system and system for sequence prediction
Ussa et al. A hybrid neuromorphic object tracking and classification framework for real-time systems
Lange et al. Lopr: Latent occupancy prediction using generative models
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
CN116166642A (en) Spatio-temporal data filling method, system, equipment and medium based on guide information
Balasubramanian et al. ExAgt: Expert-guided augmentation for representation learning of traffic scenarios

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18729387

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18729387

Country of ref document: EP

Kind code of ref document: A1