CN110719487A - Video prediction method and device, electronic equipment and vehicle - Google Patents

Publication number
CN110719487A
CN110719487A (application CN201810770361.2A; granted as CN110719487B)
Authority
CN
China
Prior art keywords
distribution
encoder
frame
generator
priori
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810770361.2A
Other languages
Chinese (zh)
Other versions
CN110719487B (en)
Inventor
侯鹏飞 (Hou Pengfei)
范坤 (Fan Kun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Horizon Robotics Science and Technology Co Ltd
Original Assignee
Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Horizon Robotics Science and Technology Co Ltd filed Critical Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority to CN201810770361.2A
Publication of CN110719487A
Application granted
Publication of CN110719487B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/142 Detection of scene cut or scene change
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/189 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

A video prediction method, a video prediction apparatus, an electronic device, and a vehicle are disclosed. The video prediction method comprises the following steps: a training step comprising: generating an a priori distribution from a previous frame using an a priori encoder; generating an a posteriori distribution from a previous frame and a subsequent frame using an a posteriori encoder; using the a posteriori distribution as an intermediate variable of a generator to generate a first predicted frame from a previous frame; and optimizing the prior encoder and the generator with the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function; and a predicting step comprising: generating an a priori distribution from a known frame using an a priori encoder; and using the a priori distribution as an intermediate variable of the generator to generate future frames from the known frames. In this way, the generator and the a priori encoder for predicting video can be optimized by using the a priori distribution of the a priori encoder and the a posteriori distribution of the a posteriori encoder, thereby simplifying the training process of video prediction and improving the prediction effect.

Description

Video prediction method and device, electronic equipment and vehicle
Technical Field
The present application relates generally to the field of driver assistance (ADAS, Advanced Driver Assistance Systems), and more particularly, to a video prediction method, a video prediction apparatus, an electronic device, and a vehicle.
Background
In recent years, automated driving and Advanced Driver Assistance Systems (ADAS) have received extensive attention and intense research. An ADAS system needs to sense various states of the vehicle itself and its surrounding environment using various vehicle-mounted sensors, collect data, perform identification, detection and tracking of static and dynamic entities, and perform systematic calculation and analysis in combination with map data, thereby making driving policy decisions and finally realizing automatic driving functions.
In an automatic driving scene, video obtained by image acquisition devices such as cameras needs to be predicted in order to realize dynamic prediction of entities in the environment, and the prediction results are then used by subsequent modules to realize functions such as vehicle driving control.
In video prediction, a variational auto-encoder (VAE) is used to fit the posterior distribution of a video's future frames by calculating a prior distribution from its previous frames. The predicted images need to be as vivid as possible, and the predicted motion trajectories need to conform as closely as possible to the real motion of objects. During training, the posterior distribution needs to gradually approach the distribution of the data set, and the prior distribution needs to gradually approach the posterior distribution. But since the posterior distribution is random at the beginning, training of the prior is easily disturbed, and the overall effect ends up being unsatisfactory.
Accordingly, there is a need for an improved video prediction scheme.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide a video prediction method, a video prediction apparatus, an electronic device, and a vehicle, which obtain a posterior distribution by using real data instead of prediction data in a training stage, and optimize a priori encoder and a prediction generator using the posterior distribution, thereby simplifying a training process of video prediction and improving a prediction effect.
According to an aspect of the present application, there is provided a video prediction method, including: a training step comprising: generating an a priori distribution from a previous frame using an a priori encoder; generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder; using the a posteriori distribution as an intermediate variable for a generator, generating a first predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function; and, a predicting step comprising: generating an a priori distribution from a known frame using the a priori encoder; and using the a priori distribution as an intermediate variable for the generator, generating future frames from the known frames using the generator.
In the above video prediction method, the a priori encoder and the a posteriori encoder each comprise a convolutional network, and the generator comprises one of a long-short term memory network, a convolutional network, and an optical flow network.
In the above video prediction method, the previous frame and the subsequent frame for the a priori encoder and the a posteriori encoder in the training step are both real video data.
In the above video prediction method, the previous frame and the subsequent frame are video frames acquired by a driving assistance system of the vehicle.
In the above video prediction method, the predicting step further includes: generating a next future frame using the future frame as a known frame.
In the above video prediction method, generating an a priori distribution from a previous frame using an a priori encoder comprises: generating a first plurality of data pairs of means and variances using the previous frame; and generating, as the prior distribution, a first random number that follows a Gaussian distribution using the first plurality of data pairs of means and variances; and generating a posterior distribution from the previous and subsequent frames using an a posteriori encoder comprises: generating a second plurality of data pairs of means and variances using the previous and subsequent frames; and generating, as the posterior distribution, a second random number that follows a Gaussian distribution using the second plurality of data pairs of means and variances.
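The mean/variance-to-random-number step described above is the standard VAE reparameterization trick. A minimal NumPy sketch (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    # reparameterized Gaussian sample: z = mu + sigma * eps, eps ~ N(0, I);
    # keeping the randomness in eps lets gradients flow through mu and sigma
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(seed=0)
mu = np.array([0.5, -1.0, 0.0])    # per-dimension means from the encoder
sigma = np.array([1.0, 0.5, 2.0])  # per-dimension standard deviations
z = sample_latent(mu, sigma, rng)  # one latent draw with the same shape as mu
```

Sampling this way, rather than drawing directly from N(mu, sigma²), is what allows the encoder parameters to be optimized by gradient descent.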
In the above video prediction method, the training step further includes: using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the second predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function.
According to another aspect of the present application, there is provided a video prediction apparatus comprising an a priori encoder, an a posteriori encoder, a generator, a training unit, and a prediction unit, wherein the training unit is configured to: generating an a priori distribution from a previous frame using the a priori encoder; generating an a posteriori distribution from the previous and subsequent frames using the a posteriori encoder; generating, by the generator, a first predicted frame from the previous frame using the a posteriori distribution as an intermediate variable for the generator; and optimizing the prior encoder and the generator using the difference between the first predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function, and the prediction unit is configured to: generating an a priori distribution from a known frame using the a priori encoder; and generating, by the generator, future frames from the known frames using the a priori distribution as an intermediate variable of the generator.
In the above video prediction apparatus, the a priori encoder and the a posteriori encoder each include a convolutional network, and the generator includes one of a long-short term memory network, a convolutional network, and an optical flow network.
In the above video prediction apparatus, the previous frame and the subsequent frame for the a priori encoder and the a posteriori encoder in the training unit are both real video data.
In the above-described video prediction apparatus, the previous frame and the subsequent frame are video frames acquired by a driving assistance system of the vehicle.
In the above video prediction apparatus, the prediction unit is further configured to: generating a next future frame using the future frame as a known frame.
In the above video prediction apparatus, the training unit generating an a priori distribution from a previous frame using an a priori encoder comprises: generating a first plurality of data pairs of means and variances using the previous frame; and generating, as the prior distribution, a first random number that follows a Gaussian distribution using the first plurality of data pairs of means and variances; and the training unit generating a posterior distribution from the previous and subsequent frames using an a posteriori encoder comprises: generating a second plurality of data pairs of means and variances using the previous and subsequent frames; and generating, as the posterior distribution, a second random number that follows a Gaussian distribution using the second plurality of data pairs of means and variances.
In the above video prediction apparatus, the training unit is further configured to: using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and optimizing the prior encoder and the generator with the difference between the second predicted frame and the subsequent frame and the KL divergence between the prior distribution and the posterior distribution as a loss function.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the video prediction method as described above.
According to yet another aspect of the present application, there is provided a vehicle comprising an electronic device as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a video prediction method as described above.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic diagram illustrating a system architecture to which a video prediction method according to an embodiment of the present application is applied.
Fig. 2 illustrates a flow diagram of a video prediction method according to an embodiment of the present application.
Fig. 3 illustrates a schematic diagram of a training process of a video prediction method according to an embodiment of the present application.
Fig. 4 illustrates a schematic diagram of a prediction process of a video prediction method according to an embodiment of the present application.
Fig. 5 illustrates a schematic diagram of another example of a training process of a video prediction method according to an embodiment of the present application.
Fig. 6 illustrates a block diagram of a video prediction apparatus according to an embodiment of the present application.
FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, current video prediction mainly uses a variational auto-encoder to calculate a prior distribution from the first few frames of a video and to fit a posterior distribution over the following frames. The prior distribution is usually either estimated with an LSTM (Long Short-Term Memory) network or simply assumed to be a standard normal distribution, and an LSTM is likewise used to encode the posterior. However, assuming the prior is a standard normal distribution is too simple to fit real data. On the other hand, if an LSTM network is adopted, training the LSTM structure itself is relatively difficult, so the prior and posterior are hard to learn and training efficiency is low.
In view of the above technical problems, the basic idea of the present application is to provide a video prediction method, a video prediction apparatus, an electronic device, and a vehicle in which, during the training step, a prior encoder generates a prior distribution using real data, a posterior encoder generates a posterior distribution using real data from more frames, the posterior distribution is used as an intermediate variable of the prediction generator to perform prediction, and the prior encoder and the generator are trained using the KL divergence between the prior and posterior distributions together with the difference between the predicted frame and the real frame, such as the mean square error. Moreover, the prior encoder and the posterior encoder can adopt convolutional networks in place of the commonly used LSTM network, thereby greatly simplifying the training process and improving the prediction effect. In the prediction step, the prior distribution generated by the trained prior encoder can be used as an intermediate variable of the generator to perform video prediction.
Here, the video prediction method, the video prediction apparatus, the electronic device, and the vehicle according to the embodiments of the present application may be directly applied to video prediction, and may also be used for any other prediction task that can be converted into video prediction. For example, the motion prediction of objects such as vehicles and pedestrians in an automatic driving scene can be converted into predicting a sequence of occupancy-grid maps for the objects in a panoramic top view. Furthermore, the predicted image does not only refer to a natural image containing one or three color channels, but may also be any multi-channel three-dimensional data that implicitly expresses other information (e.g., velocity, acceleration).
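As a concrete illustration of the occupancy-grid conversion mentioned above, a toy rasterizer is sketched below; the grid size, extent, and the helper itself are hypothetical choices, not specified by the patent:

```python
import numpy as np

def positions_to_occupancy(positions, grid_size=16, extent=40.0):
    # rasterize ego-centered (x, y) object positions, in meters, into one
    # binary occupancy grid of a top-view map; a sequence of such grids is
    # then a "video" that the prediction model can consume
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    cell = extent / grid_size                # meters per grid cell
    for x, y in positions:
        i = int((y + extent / 2) // cell)    # row index from y
        j = int((x + extent / 2) // cell)    # column index from x
        if 0 <= i < grid_size and 0 <= j < grid_size:
            grid[i, j] = 1                   # mark the cell as occupied
    return grid
```

Objects outside the mapped extent are simply dropped; a real pipeline would also encode velocity or class in extra channels, as the text notes.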
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary System
Fig. 1 is a schematic diagram illustrating a system architecture to which a video prediction method according to an embodiment of the present application is applied.
As shown in fig. 1, the system 100 may include an a priori encoder 10, an a posteriori encoder 20, and a generator 30. The a priori encoder 10 may receive several known frames of video, including the current frame, and generate an a priori distribution based thereon. The a posteriori encoder 20 may receive more frames than the a priori encoder 10, i.e. the a posteriori encoder 20 receives several subsequent frames in addition to several known frames received by the a priori encoder 10, so that an a posteriori distribution may be generated based on a preceding known frame and a following subsequent frame. During the training phase, both the a priori distribution and the a posteriori distribution may be provided to generator 30 as intermediate variables by which generator 30 may predict future frames from known frames. The prior encoder 10 and generator 30 may be trained using KL divergence between the prior distribution and the posterior distribution and the difference between the predicted future frame and the real frame, e.g., mean square error, as a loss function.
Here, specific implementations of the a priori encoder 10, the a posteriori encoder 20, and the generator 30 will be described in further detail below.
Exemplary method
Fig. 2 illustrates a flow diagram of a video prediction method 200 according to an embodiment of the present application. As shown in fig. 2, a video prediction method 200 according to an embodiment of the present application may include a training step S210 and a prediction step S220, where a variational auto-encoder (VAE) may be trained in the training step S210, and then prediction may be performed using the trained variational auto-encoder in the prediction step S220. It will be appreciated that the a priori encoder 10 and the generator 30 constitute a variational auto-encoder.
The training step S210 and the prediction step S220 shown in fig. 2 will be described in detail below with reference to fig. 3 to 4. As shown in fig. 2, the training step S210 may include a step S211 of generating an a priori distribution from a previous frame using the a priori encoder 10, and a step S212 of generating an a posteriori distribution from the previous frame and a subsequent frame using the a posteriori encoder 20; this process is illustrated in fig. 3. As shown in fig. 3, the a priori encoder 10 may receive multiple previous frames of video, such as frames X_{t-4}:X_t, where frame X_t can be considered the current frame, and generates an a priori distribution P1. In some embodiments, generating the prior distribution P1 may include using the previous frames X_{t-4}:X_t to generate a plurality of data pairs of mean μ and variance σ, and then using these mean-variance pairs to generate random numbers that follow a Gaussian distribution as the prior distribution. The a posteriori encoder 20 receives, in addition to the previous frames X_{t-4}:X_t, the subsequent frames X_{t+1}:X_{t+4}. Here, the previous frames X_{t-4}:X_t and subsequent frames X_{t+1}:X_{t+4} used in the training step are real data rather than predicted data generated by the generator 30. For example, when applied to the driving assistance field, the previous frames X_{t-4}:X_t and subsequent frames X_{t+1}:X_{t+4} may be video frames captured by a sensor, such as a camera, of the vehicle's driving assistance system. The a posteriori encoder 20 uses the previous frames X_{t-4}:X_t and subsequent frames X_{t+1}:X_{t+4} to generate an a posteriori distribution P2.
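A toy stand-in for a convolutional encoder that maps frames to mean/variance pairs might look as follows; the kernels, global pooling, and softplus activation are illustrative assumptions, not the patent's actual architecture:

```python
import numpy as np

def conv2d_valid(img, kernel):
    # minimal 2-D valid cross-correlation (no padding, stride 1)
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def conv_prior_encoder(frames, kernels):
    # toy convolutional prior encoder: one kernel per latent dimension;
    # the global mean of each feature map gives mu, and a softplus of the
    # global standard deviation gives a strictly positive sigma
    feats = [conv2d_valid(frames[-1], k) for k in kernels]
    mu = np.array([f.mean() for f in feats])
    sigma = np.log1p(np.exp(np.array([f.std() for f in feats])))
    return mu, sigma

frames = [np.ones((6, 6))] * 5          # five identical toy "frames"
kernels = [np.ones((3, 3)), np.eye(3)]  # two kernels -> a 2-dim latent
mu, sigma = conv_prior_encoder(frames, kernels)
```

The posterior encoder would have the same shape of output but would consume both the previous and subsequent frames, as described above.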
Specifically, generating the posterior distribution P2 may include using the previous frames X_{t-4}:X_t and subsequent frames X_{t+1}:X_{t+4} to generate a plurality of data pairs of mean μ and variance σ, and then using these mean-variance pairs to generate random numbers that follow a Gaussian distribution as the posterior distribution. The KL divergence between the prior distribution P1 and the posterior distribution P2 may be used as a loss function for the training process.
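Because both distributions are diagonal Gaussians parameterized by mean-variance pairs, the KL term has a cheap closed form. A NumPy sketch (the function name is illustrative):

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed
    # over latent dimensions; per dimension the closed form is
    #   log(sigma_p/sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    )

mu = np.array([0.2, -0.3])
sigma = np.array([1.0, 0.7])
kl_same = kl_gaussians(mu, sigma, mu, sigma)         # identical -> exactly 0
kl_shift = kl_gaussians(mu + 1.0, sigma, mu, sigma)  # shifted mean -> positive
```

Minimizing this term pulls the prior P1 toward the posterior P2, which is exactly the convergence behavior the training process relies on.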
With continued reference to fig. 2, the training step S210 further comprises a step S213 of using the posterior distribution P2 as an intermediate variable of the generator 30, and using the generator 30 to generate a first predicted frame from the previous frames X_{t-4}:X_t. Fig. 3 illustrates this process. In the example shown in fig. 3, the generator 30 comprises a Long Short-Term Memory (LSTM) network comprising an input layer 31, an output layer 33, and a plurality of intermediate layers 32 located between them; fig. 3 shows three intermediate layers 32a, 32b, and 32c. The intermediate layers 32 are also referred to as hidden layers. The posterior distribution P2 is provided to the intermediate layers 32 as an intermediate variable, also known as a latent variable. The generator 30 uses the intermediate variable to generate a predicted frame X'_{t+1} from the previous frames X_{t-4}:X_t. It should be understood that the generator 30 may employ other models, such as convolutional networks, optical flow networks, etc., depending on the application scenario.
In step S214, the difference between the predicted frame X'_{t+1} and its true value (i.e., the subsequent frame X_{t+1}), for example the Mean Square Error (MSE), together with the KL divergence between the prior distribution P1 and the posterior distribution P2, may be used as the loss function, i.e., loss = MSE + KL, to train the prior encoder 10 and the generator 30. For example, the trainable network parameters of the a priori encoder 10 and the generator 30 may be trained and optimized by methods such as stochastic gradient descent (SGD) or its variants. The a posteriori encoder 20 may be trained in advance, or may be trained synchronously in the training step S210 together with the a priori encoder 10 and the generator 30.
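The combined objective and a plain SGD update can be sketched as follows; `kl_value` stands for the already-computed divergence, and the dict-of-arrays parameter layout is an illustrative assumption (a real implementation would obtain `grads` from automatic differentiation):

```python
import numpy as np

def total_loss(pred_frame, true_frame, kl_value):
    # loss = MSE + KL, as stated in the text: pixel-wise reconstruction
    # error plus the divergence between prior and posterior distributions
    mse = np.mean((pred_frame - true_frame) ** 2)
    return mse + kl_value

def sgd_step(params, grads, lr=0.01):
    # one plain stochastic-gradient-descent update over named parameters
    return {name: p - lr * grads[name] for name, p in params.items()}
```

Because the same scalar loss trains both the prior encoder and the generator, one update step touches the parameters of both networks.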
The inventors have found that when convolutional networks are used for both the a priori encoder 10 and the a posteriori encoder 20, the KL divergence between the prior distribution P1 and the posterior distribution P2 converges faster during training than with the previously used LSTM, so training efficiency can be significantly improved. Compared with the traditional LSTM network, when both the prior and posterior encoders adopt convolutional networks the training process is simple and stable: no parameter tuning or special tricks are required, the computation load is small, and training is fast.
After the training process is completed, a prediction step S220 may be performed, the prediction step S220 performing prediction using only the a priori encoder 10 and the generator 30. Specifically, in step S221, the prior distribution P1 is generated from the known frame using the prior encoder 10, and then in step S222, the future frame is generated from the known frame using the generator 30 using the prior distribution P1 as an intermediate variable of the generator 30.
Fig. 4 illustrates this process. As shown in fig. 4, the a priori encoder 10 receives known frames X_{t-4}:X_t of the video to be predicted, which include the current frame X_t, and uses the known frames X_{t-4}:X_t to generate a prior distribution P1. The prior distribution P1 is provided to an intermediate layer of the generator 30 as an intermediate variable, which the generator 30 uses to generate a future frame X'_{t+1} from the known frames X_{t-4}:X_t. In the prediction process, the predicted future frame X'_{t+1} may also be provided as a known frame to the a priori encoder 10 and the generator 30 to predict the next future frame X'_{t+2}. According to the video prediction method, after training is completed, the predicted images are clear and relatively consistent with the real laws of object motion. When the video prediction method according to the embodiment of the application is applied to the field of driving assistance, the predicted future frames can be used by a driving assistance system to decide an appropriate driving strategy.
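The feed-back loop described here, where each predicted frame becomes a known frame for the next step, can be sketched as an autoregressive rollout; the callables and the window length are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_rollout(known_frames, prior_encoder, generator, n_future, window=5):
    # autoregressive prediction: sample a latent from the prior, generate
    # the next frame, append it, and repeat using the most recent frames
    frames = list(known_frames)
    for _ in range(n_future):
        mu, sigma = prior_encoder(frames[-window:])
        z = mu + sigma * rng.standard_normal(mu.shape)  # sample the prior
        frames.append(generator(frames[-window:], z))
    return frames[len(known_frames):]

# toy stand-ins: a constant prior and a generator that brightens each frame
toy_prior = lambda fs: (np.zeros(2), np.ones(2))
toy_gen = lambda fs, z: fs[-1] + 1.0
known = [np.full((2, 2), float(i)) for i in range(5)]
future = predict_rollout(known, toy_prior, toy_gen, n_future=3)
```

Note that only the prior encoder and the generator run at prediction time; the posterior encoder is not needed once training is done, matching step S220.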
Fig. 5 illustrates a schematic diagram of another example of a training process of a video prediction method according to an embodiment of the present application. For simplicity and clarity, only the differences of the example of fig. 5 from the example of fig. 3 will be described below. As shown in fig. 5, the a priori distribution P1 produced by the a priori encoder 10 is also provided to the generator 30 as an intermediate variable during the training process. The generator 30 generates a predicted frame X'_{t+1} using the posterior-distribution intermediate variable P2, and generates a predicted frame Y'_{t+1} using the prior-distribution intermediate variable P1. On one hand, the difference between the predicted frame X'_{t+1} and its corresponding true value X_{t+1}, for example the mean square error MSE1, together with the KL divergence between the prior distribution P1 and the posterior distribution P2, may be used as a loss function for training; this may be referred to as the first training. On the other hand, the difference between the predicted frame Y'_{t+1} and its corresponding true value X_{t+1}, for example the mean square error MSE2, together with the KL divergence between the prior distribution P1 and the posterior distribution P2, may be used as a loss function for training; this may be referred to as the second training. The first training and the second training can be carried out alternately or synchronously. This training process introduces the idea of adversarial training, can achieve a better training effect, and further improves the accuracy of the prediction results. Finally, through training, the prior distribution P1 converges to agree with the posterior distribution P2.
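The two-branch scheme above, one prediction driven by the posterior latent and one by the prior latent, can be sketched as a single combined step; all networks and the frame loss below are toy stand-ins, not the patent's models:

```python
import numpy as np

rng = np.random.default_rng(2)

def dual_training_losses(prev, nxt, prior_enc, post_enc, gen, frame_loss):
    # first training: predict with the posterior latent z_q;
    # second training: predict with the prior latent z_p;
    # a full implementation would add the shared KL(P2 || P1) term to both
    mu_p, s_p = prior_enc(prev)
    mu_q, s_q = post_enc(prev, nxt)
    z_q = mu_q + s_q * rng.standard_normal(mu_q.shape)
    z_p = mu_p + s_p * rng.standard_normal(mu_p.shape)
    loss1 = frame_loss(gen(prev, z_q), nxt)  # MSE1 branch
    loss2 = frame_loss(gen(prev, z_p), nxt)  # MSE2 branch
    return loss1, loss2

# toy stand-ins to exercise the control flow
enc = lambda *args: (np.zeros(2), np.ones(2))
gen = lambda prev, z: prev[-1]
mse = lambda a, b: float(np.mean((a - b) ** 2))
l1, l2 = dual_training_losses([np.zeros((2, 2))], np.ones((2, 2)), enc, enc, gen, mse)
```

Alternating or summing the two losses forces the prior branch to produce predictions as good as the posterior branch, which is what drives P1 toward P2.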
Exemplary devices
Fig. 6 illustrates a functional block diagram of a video prediction apparatus 300 according to an embodiment of the present application. As shown in fig. 6, the video prediction apparatus 300 according to the embodiment of the present application may include an a priori encoder 310, an a posteriori encoder 320, a generator 330, a training unit 340, and a prediction unit 350.
The training unit 340 may be configured to schedule the other units to perform the training process. In particular, it may use the a priori encoder 310 to generate an a priori distribution P1 from the previous frames X_{t-4}:X_t, use the a posteriori encoder 320 to generate an a posteriori distribution P2 from the previous frames X_{t-4}:X_t and the subsequent frames X_{t+1}:X_{t+4}, and provide the posterior distribution P2 to the generator 330 as an intermediate variable. The training unit 340 may also use the generator 330 to generate a predicted frame X'_{t+1} from the previous frames X_{t-4}:X_t, and use the difference between the predicted frame X'_{t+1} and its corresponding true value, i.e., frame X_{t+1}, for example the Mean Square Error (MSE), together with the KL divergence between the prior distribution P1 and the posterior distribution P2, as the loss function to optimize the a priori encoder 310 and the generator 330. In some embodiments, the training unit 340 may also provide the prior distribution P1 to the generator 330 as an intermediate variable, use the generator 330 to generate a predicted frame Y'_{t+1} from the previous frames X_{t-4}:X_t, and use the difference between the predicted frame Y'_{t+1} and its corresponding true value, i.e., frame X_{t+1}, for example the mean square error, together with the KL divergence between the prior distribution P1 and the posterior distribution P2, as the loss function to optimize the a priori encoder 310 and the generator 330. The training unit 340 may alternately or synchronously perform the training process with the prior distribution P1 and the posterior distribution P2 as intermediate variables until the prior distribution P1 and the posterior distribution P2 converge to be consistent.
The prediction unit 350 may be configured to schedule the other units to perform the prediction process. In particular, it may use the a priori encoder 310 to generate an a priori distribution P1 from the known frames X_{t-4}:X_t, and use the prior distribution P1 as an intermediate variable of the generator 330, generating a future frame X'_{t+1} from the known frames X_{t-4}:X_t by the generator 330.
In one example, the a priori encoder 310 and the a posteriori encoder 320 may each comprise a convolutional network, and the generator 330 may comprise one of a long-short term memory network, a convolutional network, and an optical flow network.
It is to be understood that the specific functions and operations of the respective units and modules in the video prediction apparatus 300 have been described in detail in the video prediction method described above with reference to fig. 1 to 5, and thus, a repetitive description thereof will be omitted.
As described above, the video prediction apparatus 300 according to the embodiment of the present application can be implemented in various terminal devices, for example, in-vehicle devices for driving assistance. In one example, the video prediction apparatus 300 according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the apparatus 300 may be a software module in an operating system of the terminal device, or may be an application program developed for the terminal device, which runs on a CPU (central processing unit) and/or a GPU (graphics processing unit), or runs on a dedicated hardware acceleration chip, such as a dedicated chip adapted to run a deep neural network; of course, the apparatus 300 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the video prediction apparatus 300 and the terminal device may be separate devices, and the apparatus 300 may be connected to the terminal device through a wired and/or wireless network and exchange interaction information with it in an agreed data format.
Exemplary electronic device
FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 7, electronic device 400 includes one or more processors 410 and memory 420. The processor 410 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 400 to perform desired functions.
Memory 420 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 410 to implement the video prediction methods of the various embodiments of the present application described above and/or other desired functions. Various content such as previous frames, subsequent frames, predicted frames, etc. may also be stored in the computer readable storage medium.
In one example, electronic device 400 can also include an input interface 430 and an output interface 440, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input interface 430 may be connected to, for example, a video capture device such as an in-vehicle camera to receive known video frames that may be used for the training or prediction steps described above. The output interface 440 may output the prediction result to the outside, for example, may output the prediction result to a driving assistance system of the vehicle for use in determining a driving strategy.
Of course, for simplicity, only some of the components of the electronic device 400 relevant to the present application are shown in fig. 7, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 400 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the video prediction method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the video prediction method according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. However, it is noted that the advantages, effects, and the like mentioned in the present application are merely examples, are not limiting, and should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for purposes of illustration and description only, and is not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," and "having" are open-ended words that mean "including, but not limited to," and may be used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A video prediction method, comprising:
a training step comprising:
generating an a priori distribution from a previous frame using an a priori encoder;
generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder;
using the a posteriori distribution as an intermediate variable for a generator, generating a first predicted frame from the previous frame using the generator; and
optimizing the prior encoder and the generator with a difference between the first predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function; and
a prediction step comprising:
generating an a priori distribution from a known frame using the a priori encoder; and
using the prior distribution as an intermediate variable for the generator, generating future frames from the known frames using the generator.
2. The method of claim 1, wherein the a priori encoder and the a posteriori encoder each comprise a convolutional network, and the generator comprises one of a long short term memory network, a convolutional network, and an optical flow network.
3. The method of claim 1, wherein the previous and subsequent frames for the a priori encoder and the a posteriori encoder in the training step are both real video data.
4. The video prediction method of claim 3, wherein the previous frame and the subsequent frame are video frames acquired by a driver assistance system of a vehicle.
5. The method of claim 1, wherein the predicting step further comprises:
generating a next future frame using the future frame as a known frame.
6. The method of claim 1, wherein,
generating an a priori distribution from a previous frame using an a priori encoder includes:
generating a plurality of first data pairs of means and variances using the previous frame; and
generating, using the plurality of first data pairs of means and variances, a first random number subject to a Gaussian distribution as the prior distribution, and
Generating an a posteriori distribution from the previous and subsequent frames using an a posteriori encoder comprises:
generating a plurality of second data pairs of means and variances using the previous and subsequent frames; and
generating, using the plurality of second data pairs of means and variances, a second random number subject to a Gaussian distribution as the posterior distribution.
7. The method of claim 1, wherein the training step further comprises:
using the a priori distribution as an intermediate variable for the generator, generating a second predicted frame from the previous frame using the generator; and
optimizing the prior encoder and the generator with a difference between the second predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function.
8. A video prediction apparatus, comprising an a priori encoder, an a posteriori encoder, a generator, a training unit, and a prediction unit, wherein
the training unit is configured to:
generating an a priori distribution from a previous frame using the a priori encoder;
generating an a posteriori distribution from the previous and subsequent frames using the a posteriori encoder;
generating, by the generator, a first predicted frame from the previous frame using the a posteriori distribution as an intermediate variable for the generator; and
optimizing the prior encoder and the generator using a difference between the first predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function, and
the prediction unit is configured to:
generating an a priori distribution from a known frame using the a priori encoder; and
generating, by the generator, a future frame from the known frame using the prior distribution as an intermediate variable of the generator.
9. The apparatus of claim 8, wherein the training unit is further configured to:
generating, by the generator, a second predicted frame from the previous frame using the prior distribution as an intermediate variable of the generator; and
optimizing the prior encoder and the generator with a difference between the second predicted frame and the subsequent frame and a KL divergence between the prior distribution and the posterior distribution as a loss function.
10. The apparatus of claim 9, wherein said a priori encoder and said a posteriori encoder each comprise a convolutional network, and said generator comprises one of a long short term memory network, a convolutional network, and an optical flow network.
11. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the video prediction method of any one of claims 1-7.
12. A vehicle comprising the electronic device of claim 11.
13. A computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a video prediction method according to any one of claims 1-7.
CN201810770361.2A 2018-07-13 2018-07-13 Video prediction method and device, electronic equipment and vehicle Active CN110719487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810770361.2A CN110719487B (en) 2018-07-13 2018-07-13 Video prediction method and device, electronic equipment and vehicle

Publications (2)

Publication Number Publication Date
CN110719487A true CN110719487A (en) 2020-01-21
CN110719487B CN110719487B (en) 2021-11-09

Family

ID=69208557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810770361.2A Active CN110719487B (en) 2018-07-13 2018-07-13 Video prediction method and device, electronic equipment and vehicle

Country Status (1)

Country Link
CN (1) CN110719487B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111031351A (en) * 2020-03-11 2020-04-17 北京三快在线科技有限公司 Method and device for predicting target object track
CN113473124A (en) * 2021-05-28 2021-10-01 北京达佳互联信息技术有限公司 Information acquisition method and device, electronic equipment and storage medium
CN113473124B (en) * 2021-05-28 2024-02-06 北京达佳互联信息技术有限公司 Information acquisition method, device, electronic equipment and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102055977A (en) * 2009-11-06 2011-05-11 三星电子株式会社 Fast motion estimation methods using multiple reference frames
US20170091319A1 (en) * 2014-05-15 2017-03-30 Sentient Technologies (Barbados) Limited Bayesian visual interactive search
CN104866821A (en) * 2015-05-04 2015-08-26 南京大学 Video object tracking method based on machine learning
CN105205783A (en) * 2015-09-14 2015-12-30 河海大学 SAR image blind super-resolution reestablishment method in combination with priori estimation
US20180150728A1 (en) * 2016-11-28 2018-05-31 D-Wave Systems Inc. Machine learning systems and methods for training with noisy labels
CN107679556A (en) * 2017-09-18 2018-02-09 天津大学 The zero sample image sorting technique based on variation autocoder

Also Published As

Publication number Publication date
CN110719487B (en) 2021-11-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant