WO2019228654A1 - Method for training a prediction system and system for sequence prediction - Google Patents

Method for training a prediction system and system for sequence prediction Download PDF

Info

Publication number
WO2019228654A1
WO2019228654A1 (PCT/EP2018/064534)
Authority
WO
WIPO (PCT)
Prior art keywords
model
samples
prediction
variable
sequence
Prior art date
Application number
PCT/EP2018/064534
Other languages
French (fr)
Inventor
Apratim BHATTACHARYYA
Mario Fritz
Bernt Schiele
Daniel OLMEDA REINO
Original Assignee
Toyota Motor Europe
Max-Planck-Institut Für Informatik
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Motor Europe and Max-Planck-Institut Für Informatik
Priority to PCT/EP2018/064534
Publication of WO2019228654A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for training a prediction system. The prediction system comprises a hidden variable model using a hidden random variable for sequence prediction. The method comprises the steps of: * multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable, * use of the best of multiple samples to train the model, the best sample being the closest to the ground truth. The invention further relates to a system for sequence prediction.

Description

Method for training a prediction system and system for sequence prediction
FIELD OF THE DISCLOSURE
[0001] The present disclosure is related to a method for training a prediction system and to a system for sequence prediction, in particular employing a recurrent neural network (RNN).
BACKGROUND OF THE DISCLOSURE
[0002] Anticipation of future events and states of their environment is a key competence for autonomous agents to successfully operate in the real world. Predicting the future is important in many scenarios ranging from autonomous driving to precipitation forecasting. Many of these tasks can be formulated as sequence prediction problems. Given a past sequence of events, probable future outcomes are to be predicted.
[0003] Recurrent Neural Networks (RNN), especially LSTM (Long short-term memory) formulations, are state-of-the-art models for sequence prediction tasks, cf. e.g.:
A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-971, 2016, or
N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0004] These approaches predict only point estimates. However, many sequence prediction problems are only partially observed or stochastic in nature and hence the distribution of future sequences can be highly multi-modal.
[0005] Consider e.g. the task of predicting future pedestrian trajectories. In many cases, there is no information about the intentions of the pedestrians in the scene. A pedestrian, after walking over a zebra crossing, might decide to turn either left or right. A point estimate in such a situation would be highly unrealistic. Therefore, in order to incorporate uncertainty of future outcomes, structured predictions may be used. Structured prediction implies learning a one-to-many mapping of a given fixed sequence to plausible future sequences, cf. e.g.:
K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491, 2015.
This leads to more realistic predictions and enables probabilistic inference.
[0006] Recent work has proposed deep conditional generative models with Gaussian latent variables for structured sequence prediction, cf.:
N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0007] The Conditional Variational Auto-Encoder (CVAE) framework described by K. Sohn et al. is used in N. Lee et al. for learning the Gaussian latent variables.
[0008] However, there are two key limitations of this CVAE framework. First, the currently used objectives hinder learning of diverse samples due to a marginalization over multi-modal futures. Second, a mismatch in latent variable distribution between training and testing leads to errors in model fitting.
SUMMARY OF THE DISCLOSURE
[0009] Currently, it remains desirable to provide a method for training a prediction system and a system for sequence prediction which is able to generate more accurate and diverse prediction samples. In particular, it remains desirable to better capture the true variations in training data.
[0010] Therefore, according to the embodiments of the present disclosure, it is provided a method for training a prediction system. The prediction system comprises a hidden variable model using a hidden random variable (z) for sequence prediction. The method comprises the steps of:
• multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable,
• use of the best of multiple samples to train the model, the best sample being the closest to the ground truth.
[0011] In other words, the same sequence input (x) is desirably inputted into the hidden variable model multiple times, so that the model outputs in response respectively multiple distinct samples (y). The distinction (difference) between the output samples is due to (i.e. conditioned by) the random variable z. The best of the multiple samples (i.e. the "Best of Many" sample) is used to train the model (i.e. the model is trained by using / based on the best of the multiple samples).
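For illustration only, one training step may be sketched as follows. This is a minimal PyTorch-style sketch under assumptions not taken from the disclosure: model(x, y) is a hypothetical hidden variable model that internally draws a fresh latent sample z (from the recognition network, which sees the ground truth y during training) and returns one predicted sequence, the distance to the ground truth is measured as a mean squared error, and the KL regularization term of the full objective (equation (8) below) is omitted.

```python
import torch

def best_of_many_step(model, x, y, optimizer, T=10):
    """One training step: draw T samples for the same input x and back-propagate
    only the error of the sample that is closest to the ground truth y."""
    samples = [model(x, y) for _ in range(T)]             # T distinct samples, each with a fresh z
    errors = torch.stack([((s - y) ** 2).mean() for s in samples])
    best_error = errors.min()                             # error of the "Best of Many" sample
    optimizer.zero_grad()
    best_error.backward()                                 # the remaining T-1 samples are disregarded
    optimizer.step()
    return best_error.item()
```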
[0012] Accordingly, a major contribution of the present disclosure to the prior art is a "Best of Many" sample objective that leads to more accurate and more diverse predictions that better capture the true variations in real-world sequence data.
[0013] The (pre-trained) model according to the present disclosure is thus in particular suitable for predicting scenarios that induce multi-modal distributions over future sequences.
[0014] Hence, the two key limitations of this CVAE framework of the prior art, as described above, can be overcome resulting in more accurate and diverse samples.
[0015] The model is desirably trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
[0016] The model may be trained based on the best of the multiple samples in relation to the ground truth.
[0017] Accordingly, the difference between the best sample and the ground truth may serve to determine an error based on which the model is desirably trained.
[0018] The hidden random variable may be a Gaussian latent variable and/or may have a zero mean Gaussian distribution.
[0019] During training the random variable may be conditioned by the ground truth, e.g. by using a Convolutional - Long short-term memory (LSTM) recognition network.
[0020] Once the model is trained and is used (i.e. as a pre-trained model) e.g. in testing or in a real-life application, the random variable is desirably not conditioned any more by the ground truth.
[0021] The trained model desirably predicts possible sequences by outputting multiple samples (y) for the same sequence input (x) conditioned by the varying random variable (z).
[0022] The present disclosure further relates to a computer program comprising instructions for executing the steps of the method, when the program is executed by a computer.
[0023] The present disclosure further relates to a system for sequence prediction, comprising a hidden variable model using a hidden random variable (z) for sequence prediction. The hidden variable model is configured to output multiple distinct samples (y) representing predicted sequences in response to a sequence input (x), the multiple distinct samples (y) being conditioned by the random variable (z). The model is pre-trained based on the best of the multiple samples, the best sample being the closest to the ground truth.
[0024] Accordingly, the system according to the present disclosure is pre-trained by using the "Best of Many" sample objective that leads to more accurate and more diverse predictions that better capture the true variations in real-world sequence data. Therefore, the pre-trained model according to the present disclosure is in particular suitable for predicting scenarios that induce multi-modal distributions over future sequences.
[0025] The predicted sequence may comprise a single input of a data set (x), e.g. a single image, based on which the possible distinct output samples (y) (e.g. distinct images) are desirably predicted. Accordingly, the sequence may comprise only two images, i.e. one as input and one as predicted output.
[0026] The model is desirably pre-trained based on the best of multiple samples outputted for a training sequence input (x).
[0027] The model may be pre-trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
[0028] The model may be pre-trained based on the best of the multiple samples in relation to the ground truth.
[0029] The hidden random variable may be a Gaussian latent variable and/or may have a zero mean Gaussian distribution.
[0030] The model is desirably pre-trained based on the best of multiple samples. The model may be pre-trained such that the multiple samples (y) are outputted for the same training sequence input (x) conditioned by the varying random variable (z). In this regard, the random variable may be conditioned by the ground truth.
[0031] The model may be or comprise a neural network, e.g. a recurrent neural network (RNN). More particularly, the model may be or comprise an RNN encoder-decoder network, or a conditional variational auto-encoder (CVAE).
[0032] The system may further be configured to carry out the method of the present disclosure, as described above. For example, the system may comprise a data set (or be configured to receive the data set) for training the model of the present disclosure. It may additionally or alternatively comprise a sensor, e.g. a digital camera, to track image sequences, which may be used as the sequence input.
[0033] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.
[0034] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
[0035] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] Fig. 1 shows a schematic block diagram of a system according to embodiments of the present disclosure;
[0037] Fig. 2 shows a schematic representation of a deep conditional generative model according to embodiments of the present disclosure;
[0038] Fig. 3a shows a schematic representation of a model for structured trajectory prediction according to embodiments of the present disclosure; and
[0039] Fig. 3b shows a schematic representation of a model for structured image sequence prediction according to embodiments of the present disclosure.
DESCRIPTION OF THE EMBODIMENTS
[0040] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0041] Fig. 1 shows a block diagram of a system 10 according to embodiments of the present disclosure. The system may have various further functions, e.g. may be a robotic system or a camera system. It may further be integrated in a vehicle.
[0042] The system 10 may comprise an electronic circuit, a processor (shared, dedicated, or group), a combinational logic circuit, a memory that executes one or more software programs, and/or other suitable components that provide the described functionality. In other words, system 10 may be a computer device.
[0043] The system may be connected to a memory, which may store data, e.g. a computer program which when executed, carries out the method according to the present disclosure. In particular, the system or the memory may store software which may comprise a hidden variable model (e.g. implemented as a neural network) according to the present disclosure.
[0044] The system 10 has an input for receiving digital images or a sequence (or a stream) of digital images. In particular, the system 10 may be connected to an optical sensor 1, in particular a digital camera. The digital camera 1 is configured such that it can record a scene, and in particular output digital data to the system 10.
[0045] The system may be configured to identify objects in the images, e.g. by carrying out a computer vision algorithm for detecting the presence and location of objects in a sensed scene. For example, persons, vehicles and other objects may be detected. The system may track the detected objects across the images. It may for example detect the trajectory of a moving object. Based on said detected trajectory the system may determine (predict) samples of the (future) continuation of said trajectory (i.e. predict samples of possible future events).
[0046] In another example, the system may not (or not only) receive data from a camera but may receive a data set based on which it predicts possible future events. In particular the system may be configured to perform an image sequence prediction, as described in more detail in the following. For example, the system may predict possible (future) weather evolutions based on e.g. weather radar intensity images.
[0047] In the following, the operation of the hidden variable model according to the present disclosure is explained in more detail with reference to the "best of many" sample objective for Gaussian latent variable models according to the present disclosure.
[0048] In particular, an overview of deep conditional generative models with Gaussian latent variables is first given. Then, the "best-of-many" samples objective function according to the present disclosure is introduced. Thereafter, exemplary conditional generative models are described which may serve as the test bed for the objective according to the present disclosure. The model for structured trajectory prediction, which is based on the sampling module described by N. Lee et al. (see above), is further described, and extensions are considered which additionally condition on visual input and generate full image sequences.
[0049] Fig. 2 shows a schematic representation of a deep conditional generative model according to embodiments of the present disclosure. Given an input sequence x, a latent variable z (i.e. a random variable) is drawn from the conditional distribution p(z|x) (assumed Gaussian). The output sequence ŷ is then sampled from the distribution p_θ(y|x, z) of the conditional generative model according to the present disclosure, parameterized by θ. The latent variable z (i.e. a random variable) enables a one-to-many mapping and the learning of multiple modes (i.e. multiple distinct possible occurrences) of the true posterior distribution p(y|x). In practice, the simplifying assumption is made that z is independent of x and p(z|x) is N(0, I). Next, the training of such models is described.
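For illustration only, test-time sampling from such a model may be sketched as follows (PyTorch-style; the decoder interface, the latent dimension and the number of samples are assumptions, not taken from the disclosure):

```python
import torch

@torch.no_grad()
def draw_samples(decoder, x, n_samples=20, z_dim=64):
    """Return n_samples distinct predicted sequences for the same input x,
    each conditioned on a fresh latent sample z ~ N(0, I)."""
    samples = []
    for _ in range(n_samples):
        z = torch.randn(x.size(0), z_dim)   # z is independent of x at test time
        samples.append(decoder(x, z))       # one sample of p_theta(y | x, z)
    return samples
```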
Conditional Variational Autoencoder Based Training Objective
[0050] It is desirable to maximize the data log-likelihood p_θ(y|x). To estimate the data log-likelihood of the model p_θ according to the present disclosure, one possibility is to perform Monte-Carlo sampling of the latent variable z. For T samples, this leads to the following estimate, cf. equation (1):

  log p_θ(y|x) ≈ log [ (1/T) Σ_{i=1..T} p_θ(y | x, ẑ_i) ],  ẑ_i ~ p(z|x) = N(0, I)    (1)

[0051] This estimate is unbiased but has high variance. It would underestimate the log-likelihood for some samples and overestimate it for others, especially if T is small. This would in turn lead to high-variance weight updates.
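For illustration only, the estimate (1) may be computed in a numerically stable way via a log-sum-exp, as in the following sketch (PyTorch-style; log_p is a hypothetical function returning log p_θ(y | x, z) for one latent sample and is not taken from the disclosure):

```python
import math
import torch

def mc_log_likelihood(log_p, x, y, z_dim=64, T=100):
    """Estimate log[(1/T) * sum_i p_theta(y | x, z_i)] with z_i ~ N(0, I)."""
    log_probs = torch.stack(
        [log_p(y, x, torch.randn(x.size(0), z_dim)) for _ in range(T)])  # shape (T, batch)
    return torch.logsumexp(log_probs, dim=0) - math.log(T)
```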
[0052] It is possible to reduce the variance of the updates by estimating the log-likelihood through importance sampling during training. The latent variable z may be sampled from a recognition network q_φ using e.g. the re-parameterization trick. The data log-likelihood is, cf. equation (2):

  log p_θ(y|x) = log ∫ p_θ(y | x, z) [ p(z|x) / q_φ(z|x, y) ] q_φ(z|x, y) dz    (2)
[0053] The integral in (2) is computationally intractable. A variational lower bound of the data log-likelihood (2) may be derived, which can be estimated empirically using Monte-Carlo integration, cf. e.g. equation (3):
  log p_θ(y|x) ≥ (1/T) Σ_{i=1..T} log p_θ(y | x, ẑ_i) − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (3)
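For illustration only, the estimate in (3) may be sketched as follows (PyTorch-style; log_p, q_mu and q_logvar are hypothetical handles for the decoder log-likelihood and the recognition network outputs, and are not taken from the disclosure):

```python
import torch

def cvae_objective(log_p, q_mu, q_logvar, x, y, T=10):
    """(1/T) * sum_i log p_theta(y | x, z_i) - KL(q_phi(z|x,y) || N(0, I)),
    with z_i drawn from q_phi via the re-parameterization trick."""
    std = torch.exp(0.5 * q_logvar)
    recon = torch.stack(
        [log_p(y, x, q_mu + std * torch.randn_like(std)) for _ in range(T)]
    ).mean(dim=0)                                              # all T samples weighted equally
    kl = -0.5 * torch.sum(1 + q_logvar - q_mu ** 2 - q_logvar.exp(), dim=-1)
    return recon - kl
```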
[0054] The lower bound in (3) weights all samples ẑ_i ~ q_φ(z|x, y) equally, and so they must all ascribe high probability to the data point (x, y). This introduces a strong constraint on the recognition network q_φ. Therefore, the model is forced to trade off between a good estimate of the data log-likelihood and the KL divergence between the training and test latent variable distributions. One possibility to close the gap introduced between the training and test pipelines is to use a hybrid objective of the form
  α L_CVAE + (1 − α) L_MC,  α ∈ [0, 1]

(i.e. a weighted combination of the CVAE bound (3) and an objective which, as in (1), samples the latent variable from the prior).
Although such a hybrid objective has shown modest improvement in performance in certain cases, it would not provide any significant improvement over the standard CVAE objective in structured sequence prediction tasks.
[0055] Therefore, it is proposed to derive the "best-of-many-samples" objective, which on the one hand encourages sample diversity and on the other hand aims to close the gap between the training and testing pipelines.
Best of Many Samples Objective
[0056] Unlike in (3), it is proposed not to weight each sample equally. As q_φ(z|x, y) in the recognition network is normally distributed, the integral can be very well approximated on a large enough bounded interval [a, b]. Therefore, it is possible to use the First Mean Value Theorem of Integration, cf. equation (4):

  log ∫_a^b p_θ(y | x, z) [ p(z|x) / q_φ(z|x, y) ] q_φ(z|x, y) dz
    = log [ ∫_a^b p_θ(y | x, z) q_φ(z|x, y) dz ] + log [ p(z′|x) / q_φ(z′|x, y) ]  for some z′ ∈ [a, b]    (4)
[0057] It is further possible to lower bound (4) with the minimum of the term on the right, cf. equation (5):
  log p_θ(y|x) ≥ log [ ∫_a^b p_θ(y | x, z) q_φ(z|x, y) dz ] + min_z log [ p(z|x) / q_φ(z|x, y) ]    (5)
[0058] The first term on the right of (5) may be estimated using Monte-Carlo integration. The minimum in the second term on the right of (5) is difficult to estimate; therefore it may be approximated by the KL divergence over the full distribution. The KL divergence heavily penalizes q_φ(z|x, y) when it is high for low values of p(z|x) (which leads to a low value of the ratio of the distributions). This leads to the following "many-sample" objective (more details in the supplementary section), cf. equation (6):

  log p_θ(y|x) ≈ log [ (1/T) Σ_{i=1..T} p_θ(y | x, ẑ_i) ] − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (6)
[0059] The recognition network q_φ has multiple chances to draw samples with high posterior probability p_θ(y | x, z). This encourages diversity in the generated samples. Furthermore, the data log-likelihood estimate in this objective is tighter, as log[ (1/T) Σ_i p_θ(y | x, ẑ_i) ] ≥ (1/T) Σ_i log p_θ(y | x, ẑ_i) follows from Jensen's inequality. Therefore, this bound loosens the constraints on the recognition network q_φ and allows it to more closely match the latent variable distribution p(z|x). However, as the focus is desirably on regression tasks, probabilities are of the form

  p_θ(y | x, z) ∝ exp( −‖y − ŷ‖² ).

Therefore, in practice the log-average term in (6) can cause numerical instabilities due to limited machine precision in representing the probability exp( −‖y − ŷ‖² ). Therefore, a "Best of Many Samples" approximation of (6) is used. The constant 1/T term can be pulled out of the average in (6) and the sum approximated with the maximum, cf. equations (7) and (8):

  log p_θ(y|x) ≈ log(1/T) + log [ max_i p_θ(y | x, ẑ_i) ] − D_KL( q_φ(z|x, y) ‖ p(z|x) )    (7)

  log p_θ(y|x) ≈ max_i log p_θ(y | x, ẑ_i) − D_KL( q_φ(z|x, y) ‖ p(z|x) ),  ẑ_i ~ q_φ(z|x, y)    (8)

[0060] Similar to (6), this objective encourages diversity and loosens the constraints on the recognition network q_φ, as only the best sample is considered. During training, p_θ initially assigns low probability to the data for all samples. The log(T) difference between (6) and (8) would then be dominated by the low data log-likelihood. Later on, as both objectives promote diversity, the log-average term in (6) would be dominated by one term in the average. Therefore, (6) would be well approximated by the maximum of the terms in the average. Furthermore, (8) avoids numerical stability issues.
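For illustration only, the "Best of Many Samples" objective (8) may be sketched as follows, mirroring the CVAE sketch above but keeping only the best reconstruction term (PyTorch-style; interfaces are hypothetical, and the negative of this quantity would be minimized as the training loss):

```python
import torch

def best_of_many_objective(log_p, q_mu, q_logvar, x, y, T=10):
    """max_i log p_theta(y | x, z_i) - KL(q_phi(z|x,y) || N(0, I)), z_i ~ q_phi."""
    std = torch.exp(0.5 * q_logvar)
    log_probs = torch.stack(
        [log_p(y, x, q_mu + std * torch.randn_like(std)) for _ in range(T)])  # (T, batch)
    best = log_probs.max(dim=0).values        # only the best sample contributes
    kl = -0.5 * torch.sum(1 + q_logvar - q_mu ** 2 - q_logvar.exp(), dim=-1)
    return best - kl
```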
Model Architectures for Structured Sequence Prediction
[0061] The model architectures according to the present disclosure are desirably based on RNN Encoder-Decoders. LSTM formulations may e.g. be used as RNNs for structured trajectory prediction tasks (Figure 3a) and Convolutional LSTM formulations (Figure 3b) for structured image sequence prediction tasks. During training, LSTM recognition networks may be used in the case of trajectory prediction (Figure 3a), and Conv-LSTM recognition networks (Figure 3b) may be used for image sequence prediction. Note that, as the simplifying assumption may be made that z is independent of x, the recognition networks are desirably conditioned only on y.
[0062] Fig. 3a shows a schematic representation of a model for structured trajectory prediction according to embodiments of the present disclosure. The model for structured trajectory prediction (see Figure 3a) according to the disclosure may be based on (or correspond to) the sampling module described in : N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. CVPR, 2017.
[0063] The input sequence x is processed using an embedding layer to extract features and the embedded sequence is read by the encoder LSTM. The encoder LSTM produces a summary vector v (providing a summary of the past information), which is its internal state after reading the input sequence x. The decoder LSTM is conditioned on the summary vector v and additionally a sample of the latent variable z. The decoder LSTM is unrolled in time and a prediction is generated by a linear transformation of its output. Therefore, the predicted sequence at a certain time-step
ŷ_{t+1} is conditioned on the output ŷ_t at the previous time-step, the summary vector v and the latent variable z. As the summary v is deterministic given x, there may be the following equation (9):

  p_θ(ŷ | x, z) = Π_t p_θ( ŷ_{t+1} | ŷ_t, v, z )    (9)
[0064] Conditioning the predicted sequence at all time-steps upon a single sample of z (i.e. z is defined for all time-steps of one sequence) enables z to capture global characteristics (e.g. speed and direction of motion) of the future sequence and the generation of temporally consistent sample sequences ŷ.
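For illustration only, such an encoder-decoder may be sketched as follows (PyTorch-style). Layer sizes, the injection of z by concatenation at every time-step, and the use of the encoder state to initialize the decoder are assumptions; the disclosure only fixes the overall structure (embedding, encoder LSTM, summary v, decoder LSTM conditioned on v and z, linear output). At training time the forward pass would be run T times with z drawn from the recognition network; at test time z is drawn from N(0, I).

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Sketch of the structured trajectory prediction model of Figure 3a."""

    def __init__(self, point_dim=2, embed_dim=32, hidden_dim=64, z_dim=64):
        super().__init__()
        self.embed = nn.Linear(point_dim, embed_dim)          # feature embedding of x
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(point_dim + z_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, point_dim)           # linear transform of the decoder output
        self.z_dim = z_dim

    def forward(self, x, pred_len, z=None):
        # x: (batch, obs_len, point_dim) past trajectory
        _, (h, c) = self.encoder(torch.relu(self.embed(x)))   # summary v = encoder state after reading x
        h, c = h.squeeze(0), c.squeeze(0)
        if z is None:
            z = torch.randn(x.size(0), self.z_dim)            # z ~ N(0, I) at test time
        y_prev, preds = x[:, -1], []
        for _ in range(pred_len):                             # unroll the decoder in time
            h, c = self.decoder(torch.cat([y_prev, z], dim=-1), (h, c))
            y_prev = self.out(h)                              # prediction at this time-step
            preds.append(y_prev)
        return torch.stack(preds, dim=1)                      # (batch, pred_len, point_dim)
```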
[0065] In the case of dynamic agents, e.g. pedestrians in traffic scenes, the future trajectory is highly dependent upon the environment, e.g. the layout of the streets. Therefore, additionally conditioning samples on sensory input (e.g. visuals of the environment) may enable more accurate sample generation. A CNN may be used to extract a summary of a visual observation of a scene. This visual summary is given as input to the decoder LSTM, ensuring that the generated samples are additionally conditioned on the visual input.
[0066] If the sequence (x, y) in question consists of images, e.g. frames of a video, the trajectory prediction model of Figure 3a is not suitable to exploit the spatial structure of the image sequence. More specifically, if a pixel y_{t+1}^{i,j} at time-step t+1 of the image sequence y is considered, the pixel value at time-step t+1 depends upon only the pixel y_t^{i,j} and a certain neighbourhood around it. Furthermore, spatially neighbouring pixels are correlated. This spatial structure can be exploited by using Convolutional LSTMs (Conv-LSTMs) as RNN encoder-decoders, cf. e.g.:
S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802-810, 2015.
[0067] Conv-LSTMs retain spatial information by considering the hidden states h and cell states c as 3D tensors: the cell and hidden states are composed of vectors c_t^{i,j}, h_t^{i,j} corresponding to each spatial position (i, j). New cell states, hidden states and outputs are computed using convolutional operations. Therefore, the new cell states c_{t+1}^{i,j} and hidden states h_{t+1}^{i,j} depend upon only a local spatial neighbourhood of c_t^{i,j}, h_t^{i,j}, thus preserving spatial information. Accordingly, conditional generative models with Conv-LSTM networks may be used for structured image sequence prediction (Figure 3b).
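For illustration only, a Conv-LSTM cell may be sketched as follows (PyTorch-style; a simplified variant of the formulation of S. Xingjian et al. cited above, without peephole connections, with kernel size and channel counts as assumptions):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Conv-LSTM cell: hidden and cell states are 3D tensors and all gates are
    computed with convolutions, so each spatial position depends only on a
    local neighbourhood of the previous states."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)  # keep spatial size

    def forward(self, x, state):
        # x: (batch, in_channels, H, W); state = (h, c), each (batch, hidden_channels, H, W)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)        # new cell state, local spatial update
        h = o * torch.tanh(c)                # new hidden state
        return h, c
```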
[0068] Fig. 3b shows a schematic representation of a model for structured image sequence prediction according to embodiments of the present disclosure. The encoder and decoder used each consist of two stacked Conv-LSTMs for feature aggregation. As before, the output is conditioned on a latent variable z to model multiple modes of the conditional distribution p(y|x). The future states of neighboring pixels are highly correlated. However, spatially distant parts of the image sequences can evolve independently. In order to take into account the spatial structure of images, latent variables z may be used which are 3D tensors.
[0069] As detailed in Figure 3b, the input image sequence x is processed using a convolutional embedding layer. The Conv-LSTM reads the embedded input sequence and produces a 3D tensor v as the summary. The 3D summary v and the latent variable z are given as input to the Conv-LSTM decoder at every time-step. The cell state, hidden state or output at a certain spatial position (i, j) is conditioned on a sub-tensor z_ij of the latent tensor z. Spatially neighbouring cell states, hidden states (and thus outputs) are therefore conditioned on spatially neighbouring sub-tensors z_ij. This, coupled with the spatial information preserving property of Conv-LSTMs detailed above, enables z to capture spatial location specific characteristics of the future image sequence and allows for modeling the correlation of future states of spatially neighboring pixels. This ensures spatial consistency of sampled output sequences ŷ. Furthermore, as in the fully connected case, conditioning the full output sequence sample ŷ on a single sample of z ensures temporal consistency.
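For illustration only, one way to condition the Conv-LSTM decoder on a 3D latent tensor is to concatenate z channel-wise with the summary v at every time-step, as in the following sketch (PyTorch-style; the channel-wise concatenation and the number of latent channels are assumptions, not taken from the disclosure):

```python
import torch

def conv_decoder_input(v, z=None, z_channels=8):
    """v: (batch, C, H, W) summary tensor. Returns the decoder input with a
    spatially structured latent tensor z appended channel-wise, so that each
    spatial position (i, j) is conditioned on its own sub-tensor z_ij."""
    if z is None:
        z = torch.randn(v.size(0), z_channels, v.size(2), v.size(3))
    return torch.cat([v, z], dim=1), z
```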
[0070] In summary, the "best of many" sample objective for Gaussian latent variable models is advantageous for learning conditional models on multi-modal distributions. The learnt latent representation is better matched between training and test time which in turn leads to more accurate samples.
[0071] Throughout the description, including the claims, the term "comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms "substantially" and/or "approximately" and/or "generally" should be understood to mean falling within such accepted tolerances.
[0072] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0073] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

1. A method for training a prediction system,
the prediction system comprises a hidden variable model using a hidden random variable for sequence prediction,
the method comprising the steps of:
multiple input of a sequence input (x) into the hidden variable model which outputs in response multiple distinct samples (y) conditioned by the random variable,
use of the best of the multiple samples (y) to train the model, the best sample being the closest to the ground truth.
2. The method according to claim 1, wherein the model is trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
3. The method according to any one of the preceding claims 1 and 2, wherein
the model is trained based on the best of the multiple samples in relation to the ground truth.
4. The method according to any one of the preceding claims, wherein the hidden random variable is a Gaussian latent variable and/or has a zero mean Gaussian distribution, wherein
during training the random variable is conditioned by the ground truth, in particular by using a Convolutional - Long short-term memory (LSTM) recognition network.
5. The method according to any one of the preceding claims, wherein the trained model predicts possible sequences by outputting multiple samples
(y) for the same sequence input (x) conditioned by the varying random variable
(z).
6. A computer program comprising instructions for executing the steps of the method according to any one of the preceding method claims, when the program is executed by a computer.
7. A system for sequence prediction, comprising:
a hidden variable model using a hidden random variable (z) for sequence prediction, wherein
the hidden variable model is configured to output multiple distinct samples (y) representing predicted sequences in response to a sequence input (x), the multiple distinct samples (y) being conditioned by the random variable (z), wherein
the model is pre-trained based on the best of the multiple samples, the best sample being the closest to the ground truth.
8. The system according to the preceding system claim, wherein the model is pre-trained based on the best of multiple samples outputted for a training sequence input (x).
9. The system according to any one of the preceding system claims, wherein
the model is pre-trained by using only the best of the multiple samples for training the model and by disregarding the further samples.
10. The system according to any one of the preceding system claims, wherein
the model is pre-trained based on the best of the multiple samples in relation to the ground truth.
11. The system according to any one of the preceding system claims, wherein
the hidden random variable is a Gaussian latent variable and/or has a zero mean Gaussian distribution.
12. The system according to any one of the preceding system claims, wherein the model is pre-trained based on the best of multiple samples, the multiple samples (y) being outputted for the same training sequence input (x) conditioned by the varying random variable (z), the random variable being conditioned by the ground truth.
13. The system according to any one of the preceding system claims, wherein
the model comprises a neural network, in particular a RNN encoder-decoder network, or a conditional variational auto-encoder (CVAE).
PCT/EP2018/064534 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction WO2019228654A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Publications (1)

Publication Number Publication Date
WO2019228654A1 true WO2019228654A1 (en) 2019-12-05

Family

ID=62530215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/064534 WO2019228654A1 (en) 2018-06-01 2018-06-01 Method for training a prediction system and system for sequence prediction

Country Status (1)

Country Link
WO (1) WO2019228654A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915881A (en) * 2020-06-11 2020-11-10 西安理工大学 Small sample traffic flow prediction method based on variational automatic encoder
CN112967275A (en) * 2021-03-29 2021-06-15 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and device, terminal equipment and readable storage medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A. ALAHI; K. GOEL; V. RAMANATHAN; A. ROBICQUET; L. FEI-FEI; S. SAVARESE: "Social LSTM: Human trajectory prediction in crowded spaces", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 961-971, XP033021272, DOI: doi:10.1109/CVPR.2016.110
ALEX X LEE ET AL: "Stochastic Adversarial Video Prediction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 April 2018 (2018-04-04), XP080867640 *
K. SOHN; H. LEE; X. YAN: "Learning structured output representation using deep conditional generative models", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 3483 - 3491
N. LEE; W. CHOI; P. VERNAZA; C. B. CHOY; P. H. TORR; M. CHANDRAKER: "Desire: Distant future prediction in dynamic scenes with interacting agents", CVPR, 2017
NAMHOON LEE ET AL: "DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 April 2017 (2017-04-14), XP080763013, DOI: 10.1109/CVPR.2017.233 *
S. XINGJIAN; Z. CHEN; H. WANG; D.-Y. YEUNG; W.-K. WONG; W.-C. WOO: "Convolutional LSTM network: A machine learning approach for precipitation nowcasting", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 802-810
XINGJIAN SHI ET AL: "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting", 19 September 2015 (2015-09-19), XP055368436, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.04214.pdf> [retrieved on 20170502] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915881A (en) * 2020-06-11 2020-11-10 西安理工大学 Small sample traffic flow prediction method based on variational automatic encoder
CN112967275A (en) * 2021-03-29 2021-06-15 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and device, terminal equipment and readable storage medium
WO2022206036A1 (en) * 2021-03-29 2022-10-06 中国科学院深圳先进技术研究院 Soft tissue motion prediction method and apparatus, terminal device, and readable storage medium

Similar Documents

Publication Publication Date Title
Mukhoti et al. Evaluating bayesian deep learning methods for semantic segmentation
EP3673417B1 (en) System and method for distributive training and weight distribution in a neural network
Bhattacharyya et al. Accurate and diverse sampling of sequences based on a “best of many” sample objective
US10275691B2 (en) Adaptive real-time detection and examination network (ARDEN)
Becker-Ehmck et al. Switching linear dynamics for variational bayes filtering
CN113264066B (en) Obstacle track prediction method and device, automatic driving vehicle and road side equipment
Breitenstein et al. Systematization of corner cases for visual perception in automated driving
JP4208898B2 (en) Object tracking device and object tracking method
Fernando et al. Going deeper: Autonomous steering with neural memory networks
Hu et al. A framework for probabilistic generic traffic scene prediction
Gao et al. Distributed mean-field-type filters for traffic networks
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
CN111931720B (en) Method, apparatus, computer device and storage medium for tracking image feature points
CN111652181B (en) Target tracking method and device and electronic equipment
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN115690153A (en) Intelligent agent track prediction method and system
CN112802076A (en) Reflection image generation model and training method of reflection removal model
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN111695627A (en) Road condition detection method and device, electronic equipment and readable storage medium
WO2019228654A1 (en) Method for training a prediction system and system for sequence prediction
Ussa et al. A hybrid neuromorphic object tracking and classification framework for real-time systems
Lange et al. Lopr: Latent occupancy prediction using generative models
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
CN116166642A (en) Spatio-temporal data filling method, system, equipment and medium based on guide information
Balasubramanian et al. ExAgt: Expert-guided augmentation for representation learning of traffic scenarios

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18729387

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18729387

Country of ref document: EP

Kind code of ref document: A1