CN112581508A - Video prediction method, video prediction device, computer equipment and storage medium


Info

Publication number
CN112581508A
CN112581508A
Authority
CN
China
Prior art keywords
optical flow
historical
predicted
information
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011421744.2A
Other languages
Chinese (zh)
Inventor
郜杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202011421744.2A
Publication of CN112581508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a video prediction method, a video prediction apparatus, a computer device and a storage medium. The method comprises: acquiring a historical optical flow information sequence corresponding to a historical video; predicting from the historical optical flow information sequence to obtain a predicted optical flow information sequence; determining first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence; and predicting from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, which compose the predicted video. The method can improve the accuracy of video prediction.

Description

Video prediction method, video prediction device, computer equipment and storage medium
Technical Field
The present application relates to the field of image prediction technologies, and in particular, to a video prediction method, apparatus, computer device, and storage medium.
Background
Because video provides rich visual information, more and more information is presented in video form. With the development of computer technology and image processing technology, video prediction techniques have emerged. Video prediction is applied in fields such as autonomous driving and weather forecasting, and can bring great convenience to people's work and life.
In the related art, video prediction typically extrapolates historical video images with an optical flow method to obtain future video images. However, extrapolating images with the optical flow method alone tends to produce inaccurate predictions.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video prediction method, apparatus, computer device and storage medium capable of improving prediction accuracy.
A method of video prediction, the method comprising:
acquiring a historical optical flow information sequence corresponding to a historical video, where the historical video is composed of N historical images arranged in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting from the historical optical flow information sequence to obtain a predicted optical flow information sequence, where the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and the previous image;
determining first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence, where the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image; and
predicting from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
In one embodiment, determining the first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence includes:
performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image; and
inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, and composing the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
In one embodiment, performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image includes:
performing deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain the image content information corresponding to the 1st predicted image; and
performing deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain the image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
In one embodiment, predicting from the first spatiotemporal information and the historical video to obtain the plurality of predicted images includes:
inputting both the historical video and the first spatiotemporal information into a video prediction network, performing feature extraction on the historical video with the video prediction network, and predicting from the extracted second spatiotemporal information and the first spatiotemporal information to obtain the plurality of predicted images, where the second spatiotemporal information comprises the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and performing feature extraction on the historical video with the video prediction network includes:
inputting the N historical images into the feature extraction layer to obtain the temporal features and spatial features corresponding to each historical image output by the feature extraction layer, and composing the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and predicting from the extracted second spatiotemporal information and the first spatiotemporal information to obtain the plurality of predicted images includes:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information; and
predicting from the processed spatiotemporal information to obtain the plurality of predicted images.
In one embodiment, inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain the processed spatiotemporal information includes:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after splicing or replacement processing of the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, before inputting both the historical video and the first spatiotemporal information into the video prediction network, the method further comprises:
acquiring a training sample set, where the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
training a neural network based on the training sample set, and determining with a preset loss function whether the training result output by the neural network meets a preset convergence condition, where the preset loss function is a mean square error loss function; and
ending the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, predicting from the historical optical flow information sequence to obtain the predicted optical flow information sequence includes:
inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, acquiring the historical optical flow information sequence corresponding to the historical video includes:
inputting each pair of adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and composing the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
A video prediction apparatus, the apparatus comprising:
an optical flow acquisition module, configured to acquire a historical optical flow information sequence corresponding to a historical video, where the historical video is composed of N historical images arranged in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
an optical flow prediction module, configured to predict from the historical optical flow information sequence to obtain a predicted optical flow information sequence, where the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and the previous image;
a spatiotemporal information determining module, configured to determine first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence, where the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image; and
a video prediction module, configured to predict from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and compose the predicted video from the plurality of predicted images.
In one embodiment, the spatiotemporal information determining module includes:
an image content determining submodule, configured to perform deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image; and
a spatiotemporal information determining submodule, configured to input the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, and compose the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
In one embodiment, the image content determining submodule is specifically configured to perform deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain the image content information corresponding to the 1st predicted image, and to perform deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain the image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
In one embodiment, the video prediction module is specifically configured to input both the historical video and the first spatiotemporal information into a video prediction network, perform feature extraction on the historical video with the video prediction network, and predict from the extracted second spatiotemporal information and the first spatiotemporal information to obtain the plurality of predicted images, where the second spatiotemporal information comprises the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and the video prediction module is specifically configured to input the N historical images into the feature extraction layer, obtain the temporal features and spatial features corresponding to each historical image output by the feature extraction layer, and compose the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the video prediction module is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and to predict from the processed spatiotemporal information to obtain the plurality of predicted images.
In one embodiment, the video prediction module is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after splicing or replacement processing of the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, the apparatus further comprises:
a sample acquisition module, configured to acquire a training sample set, where the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
a training module, configured to train a neural network based on the training sample set and determine with a preset loss function whether the training result output by the neural network meets a preset convergence condition, where the preset loss function is a mean square error loss function; and
a network obtaining module, configured to end the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the optical flow prediction module is specifically configured to input the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the optical flow acquisition module is configured to input each pair of adjacent historical images into a preset optical flow calculation model, obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and compose the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
A computer device comprising a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the following steps:
acquiring a historical optical flow information sequence corresponding to a historical video, where the historical video is composed of N historical images arranged in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting from the historical optical flow information sequence to obtain a predicted optical flow information sequence, where the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and the previous image;
determining first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence, where the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image; and
predicting from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the following steps:
acquiring a historical optical flow information sequence corresponding to a historical video, where the historical video is composed of N historical images arranged in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting from the historical optical flow information sequence to obtain a predicted optical flow information sequence, where the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and the previous image;
determining first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence, where the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image; and
predicting from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
With the above video prediction method, apparatus, computer device and storage medium, the server acquires a historical optical flow information sequence corresponding to a historical video; predicts from it a predicted optical flow information sequence comprising a plurality of pieces of predicted optical flow information; determines first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence; and predicts from the first spatiotemporal information and the historical video a plurality of predicted images, which compose the predicted video. In the embodiments of the present disclosure, more accurate temporal and spatial features are extracted on the basis of the optical flow change features, and therefore the accuracy of video prediction can be improved.
Drawings
FIG. 1 is a diagram of an application environment of a video prediction method in one embodiment;
FIG. 2 is a flow diagram of a video prediction method in one embodiment;
FIG. 3 is a flowchart illustrating the step of determining first spatiotemporal information corresponding to a predicted video according to a sequence of predicted optical flow information in one embodiment;
FIG. 4 is a schematic flow chart illustrating training a video prediction network according to one embodiment;
FIG. 5 is a block diagram of a video prediction device in one embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The video prediction method provided by the application can be applied to the application environment shown in fig. 1. The application environment includes a video capture device 102 and a server 104, where the video capture device 102 communicates with the server 104 over a network. The video capture device 102 may be, but is not limited to, various cameras, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a video prediction method is provided. The method is described here by taking its application to the server in fig. 1 as an example, and includes the following steps:
Step 201, obtaining a historical optical flow information sequence corresponding to a historical video.
The historical video is composed of N historical images arranged in chronological order; the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece represents the optical flow change features between one historical image and the previous historical image. For example, suppose the historical video is composed of 8 historical images (historical image 1, historical image 2, ..., historical image 8) arranged in chronological order; the historical optical flow information sequence then includes 7 pieces of historical optical flow information, namely the optical flow change features between historical images 2 and 1, between historical images 3 and 2, ..., and between historical images 8 and 7. The historical video may come from fields such as autonomous driving or weather forecasting; the embodiments of the present disclosure do not limit the number of historical videos or historical images.
The server may acquire the historical video in the following ways: the video capture device sends the stored historical video to the server, and the server receives it; or the video capture device sends the captured video to the server, which stores it after receiving it and, at prediction time, retrieves the stored historical video locally. The embodiments of the present disclosure do not limit how the historical video is acquired.
An optical flow calculation model is preset in the server, and this model can calculate the optical flow change features between two images. After acquiring the historical video, the server inputs each pair of adjacent historical images in the historical video into the preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the model, and then composes the historical optical flow information sequence from them.
For example, historical images 1 and 2 are input into the optical flow calculation model, which outputs the optical flow change features between historical image 2 and historical image 1, that is, historical optical flow information 1; next, historical images 2 and 3 are input into the model, which outputs the optical flow change features between historical image 3 and historical image 2, that is, historical optical flow information 2. By analogy, the plurality of pieces of historical optical flow information are output and composed into the historical optical flow information sequence.
The optical flow calculation model may be, for example, a deep-learning-based optical flow estimation network such as FlowNet; the embodiments of the present disclosure do not limit the optical flow calculation model.
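A hedged sketch of this pairwise computation follows; since the patent does not name a concrete model, OpenCV's dense Farneback flow stands in for the learned optical flow calculation model, and the function name and parameter values are assumptions:

```python
import cv2

def historical_flow_sequence(frames):
    """Compute the N-1 flow fields between adjacent frames of the historical video.

    frames: list of N grayscale images (H, W) in chronological order.
    Returns a list of N-1 arrays of shape (H, W, 2); entry k holds the
    optical flow change features between frame k+1 and frame k.
    """
    flows = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Dense flow between one pair of adjacent historical images
        # (a stand-in for a learned model such as FlowNet).
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return flows
```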
Step 202, predicting from the historical optical flow information sequence to obtain a predicted optical flow information sequence.
The predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece represents the optical flow change features between a predicted image and the previous image. For example, the sequence may include 3 pieces of predicted optical flow information: the predicted optical flow change features between predicted image 1 and historical image N, between predicted image 2 and predicted image 1, and between predicted image 3 and predicted image 2.
The server trains an optical flow prediction network in advance; after obtaining the historical optical flow information sequence, it inputs the sequence into the pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by that network.
The optical flow prediction network may be a convolutional recurrent neural network such as ConvLSTM; the embodiments of the present disclosure do not limit the optical flow prediction network. It can be understood that, by using a convolutional recurrent neural network to learn the change features of the optical flow itself, the temporal correlation of the optical flow changes can be captured.
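The patent does not pin down a particular ConvLSTM variant. A minimal sketch of the kind of ConvLSTM cell commonly used for this step is shown below (PyTorch; all names are illustrative). Rolling such a cell over the N-1 historical flow fields and feeding each predicted flow back in as the next input would produce the predicted optical flow information sequence.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """An LSTM cell whose gates are 2-D convolutions, so the hidden state h
    and the cell state c keep the spatial layout of the flow fields."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        # A single convolution computes all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state  # caller initializes both to zeros on the first step
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```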
Step 203, determining first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence.
The first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image.
After obtaining the predicted optical flow information sequence, the server determines the temporal features and spatial features corresponding to each predicted image from the pieces of predicted optical flow information in the sequence, and composes the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
For example, the temporal and spatial features corresponding to predicted image 1 are determined from predicted optical flow information 1, and those corresponding to predicted image 2 are determined from predicted optical flow information 2.
Step 204, predicting from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
The server first obtains the temporal features and spatial features of each historical image in the historical video; it then predicts from the temporal and spatial features of the historical images together with those corresponding to each predicted image to obtain the plurality of predicted images; finally, the predicted video is composed of the plurality of predicted images.
For example, predicted image 1 is obtained by predicting from the temporal and spatial features corresponding to predicted image 1 and those of historical image N; predicted image 2 is obtained by predicting from the temporal and spatial features corresponding to predicted image 2 and those corresponding to predicted image 1. By analogy, the plurality of predicted images can be obtained.
In the above video prediction method, the server acquires a historical optical flow information sequence corresponding to a historical video; predicts from it a predicted optical flow information sequence; determines first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence; and predicts from the first spatiotemporal information and the historical video a plurality of predicted images, which compose the predicted video. In the embodiments of the present disclosure, more accurate temporal and spatial features are extracted on the basis of the optical flow change features, and therefore the accuracy of video prediction can be improved.
In one embodiment, as shown in fig. 3, the step of determining the first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence may include:
Step 301, performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image.
The server performs deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain the image content information corresponding to the 1st predicted image, and performs deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain the image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
For example, the server inputs predicted optical flow information 1 and historical image N into a pre-trained deformation network and obtains the image content information of predicted image 1 output by that network. After predicted image 1 has been obtained, predicted optical flow information 2 and predicted image 1 are input into the deformation network to obtain the image content information of predicted image 2. By analogy, the image content information corresponding to each predicted image can be obtained.
Each predicted image contains two components: one is image content information, for example, that the image depicts an apple; the other is motion information, for example, the direction in which the apple is moving.
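In essence, the deformation processing warps the previous frame's content by the predicted flow. A minimal PyTorch sketch of such flow-based backward warping follows; the patent's pre-trained deformation network is not specified, so this is an assumed illustration rather than the network itself:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp an image with a dense flow field (backward warping).

    image: (B, C, H, W) tensor; flow: (B, 2, H, W) tensor of (dx, dy)
    pixel displacements. Returns an estimate of the next frame's
    image content information.
    """
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # sampling positions
    # grid_sample expects coordinates normalized to [-1, 1] in (x, y) order.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)
```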
Step 302, inputting the image content information corresponding to each predicted image into the pre-trained encoding network, obtaining the temporal features and spatial features corresponding to each predicted image output by the encoding network, and composing the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
The server trains the encoding network in advance; after the image content information corresponding to a predicted image is obtained, that information is input into the encoding network, which outputs the temporal features and spatial features corresponding to the predicted image.
For example, the server inputs the image content information corresponding to predicted image 1 into the encoding network and obtains the temporal and spatial features corresponding to predicted image 1; after the image content information corresponding to predicted image 2 is obtained, it is likewise input into the encoding network to obtain the temporal and spatial features corresponding to predicted image 2. Finally, the first spatiotemporal information is composed of the temporal and spatial features corresponding to the plurality of predicted images.
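The structure of the encoding network is likewise left open by the patent. Purely as an assumed sketch, it could be a small convolutional encoder that maps the content information of one predicted image to a temporal feature h and a spatial feature c:

```python
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Illustrative encoding network: maps the image content information of a
    predicted image to a temporal feature h and a spatial feature c."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.to_h = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)  # temporal feature
        self.to_c = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)  # spatial feature

    def forward(self, content):
        z = self.backbone(content)
        return self.to_h(z), self.to_c(z)  # H'(h, c) for one predicted image
```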
In the step of determining the first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence, the server performs deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image, inputs that image content information into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, and composes the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images. In the embodiments of the present disclosure, the server extracts the temporal and spatial features corresponding to a predicted image from its image content information; since these features express the predicted image more accurately, the accuracy of video prediction can be improved.
In one embodiment, the step of predicting from the first spatiotemporal information and the historical video to obtain the plurality of predicted images may include: inputting both the historical video and the first spatiotemporal information into a video prediction network, performing feature extraction on the historical video with the video prediction network, and predicting from the extracted second spatiotemporal information and the first spatiotemporal information to obtain the plurality of predicted images.
The second spatiotemporal information comprises the temporal features and spatial features corresponding to each historical image.
The server pre-trains a video prediction network, which may include a feature extraction layer and a video prediction layer. In operation, the server inputs the N historical images into the feature extraction layer to obtain the temporal features and spatial features corresponding to each historical image output by that layer, and composes the second spatiotemporal information from them. The server then inputs the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and predicts from the processed spatiotemporal information to obtain the plurality of predicted images.
For example, the server inputs historical image 1, historical image 2, ..., historical image 8 into the feature extraction layer, which extracts the temporal and spatial features corresponding to each of them. The temporal feature h and spatial feature c of a historical image may be written H(h, c); the temporal feature h and spatial feature c that the encoding network extracts for a predicted image may be written H'(h, c). The first and second spatiotemporal information are input into the video prediction layer, which processes H(h, c) and H'(h, c) to obtain the processed spatiotemporal information.
In one embodiment, inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain the processed spatiotemporal information may include: inputting the first and second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after splicing or replacement processing.
For example, in splicing, H(h, c) of historical image 8 is spliced with H'(h, c) corresponding to predicted image 1 to obtain processed spatiotemporal information, and H'(h, c) corresponding to predicted image 1 is spliced with H'(h, c) corresponding to predicted image 2 to obtain processed spatiotemporal information. Alternatively, in replacement, H(h, c) of historical image 8 is replaced by H'(h, c) corresponding to predicted image 1 to obtain processed spatiotemporal information, and H'(h, c) corresponding to predicted image 1 is in turn replaced by H'(h, c) corresponding to predicted image 2. The embodiments of the present disclosure do not limit the feature processing manner.
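A hedged sketch of the two feature processing options, with each state represented as an (h, c) pair of tensors (the names, shapes and channel-wise splicing are assumptions):

```python
import torch

def fuse_states(hist_state, pred_state, mode="splice"):
    """Combine H(h, c) from a historical image with H'(h, c) derived from the
    predicted optical flow. Both are (h, c) pairs of (B, C, H, W) tensors."""
    (h1, c1), (h2, c2) = hist_state, pred_state
    if mode == "splice":   # splicing: concatenate along the channel axis
        return torch.cat([h1, h2], dim=1), torch.cat([c1, c2], dim=1)
    if mode == "replace":  # replacement: the flow-derived state supersedes
        return h2, c2
    raise ValueError(f"unknown mode: {mode}")
```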
It can be understood that, since spatiotemporal information represents an image more accurately, predicting from the first and second spatiotemporal information yields more accurate predicted images and thereby improves the accuracy of video prediction.
In one embodiment, as shown in fig. 4, before both the historical video and the first spatiotemporal information are input into the video prediction network, the video prediction network may be trained as follows:
Step 401, acquiring a training sample set.
The training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information.
A sample video is acquired; the first N sample images in the sample video are taken as the historical sample images, and the subsequent sample images are taken as the predicted sample images. Feature extraction is performed on each historical sample image to obtain its temporal and spatial features, and on each predicted sample image to obtain its temporal and spatial features; finally, the sample spatiotemporal information is obtained from the temporal and spatial features of the historical sample images and those of the predicted sample images.
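For instance, the split of one sample video into inputs and supervision targets, with per-image feature extraction, might look like the following assumed helper, where encoder is a hypothetical feature extractor returning a (temporal, spatial) feature pair:

```python
def make_training_sample(sample_frames, n_history, encoder):
    """Split one sample video into historical sample images (inputs) and
    predicted sample images (targets), and extract per-image features."""
    history, targets = sample_frames[:n_history], sample_frames[n_history:]
    sample_spatiotemporal = [encoder(img) for img in history + targets]
    return history, targets, sample_spatiotemporal
```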
Step 402, training the neural network based on the training sample set, and determining whether the training result output by the neural network meets a preset convergence condition by using a preset loss function.
The preset loss function is a mean square error loss function, for example:

MSE = (1/n) Σ_{i=1}^{n} (T_i - T_i')^2

where MSE is the mean square error, T_i is the i-th predicted sample image, and T_i' is the corresponding training result. Other functions may also be adopted as the preset loss function, which is not limited by the embodiments of the present disclosure.
The server inputs a historical sample image into the neural network, and the neural network outputs a training result. A loss value between the training result and the predicted sample image is calculated with the preset loss function. If the loss value is smaller than a preset threshold, the training result is determined to meet the preset convergence condition, and step 403 is executed. If the loss value is greater than or equal to the preset threshold, the training result is determined not to meet the preset convergence condition; the server adjusts the parameters of the neural network and inputs the next historical sample image into the adjusted network, until the training result output by the neural network meets the preset convergence condition. The embodiments of the present disclosure do not limit the preset threshold.
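A compact sketch of this training loop (PyTorch); the optimizer, learning rate and threshold are illustrative choices that the patent does not fix:

```python
import torch
import torch.nn as nn

def train(network, samples, threshold=1e-3, lr=1e-4):
    """Train until the mean square error between the training result and the
    predicted sample image falls below the preset convergence threshold."""
    loss_fn = nn.MSELoss()                       # the preset loss function
    opt = torch.optim.Adam(network.parameters(), lr=lr)
    for history, target in samples:              # (historical, predicted) pairs
        result = network(history)                # training result T_i'
        loss = loss_fn(result, target)           # compare against T_i
        if loss.item() < threshold:              # preset convergence condition
            break                                # training ends (step 403)
        opt.zero_grad()
        loss.backward()
        opt.step()                               # adjust the network parameters
    return network                               # the trained video prediction network
```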
Step 403, ending the training when the training result meets the preset convergence condition, to obtain the video prediction network.
After the training result is determined to meet the preset convergence condition, the training ends, and the trained neural network is determined to be the video prediction network.
In one embodiment, the video prediction network trained according to the embodiments of the present disclosure is compared with the existing convolutional recurrent neural network ConvLSTM, as shown in the following tables.
Structural similarity (SSIM) prediction results on satellite cloud images:

SSIM                    Step1   Step2   Step3   Step4   Step5   Step6   Step7   Step8
ConvLSTM                0.797   0.754   0.718   0.690   0.666   0.646   0.629   0.614
Disclosed embodiments   0.797   0.755   0.719   0.691   0.667   0.647   0.630   0.616
Pixel-value mean square error (MSE) prediction results on satellite cloud images:

MSE                     Step1   Step2   Step3   Step4   Step5   Step6   Step7   Step8
ConvLSTM                21.58   23.55   24.95   26.17   27.31   28.29   29.32   30.32
Disclosed embodiments   21.84   23.43   24.67   25.77   26.77   27.73   28.72   29.67
As can be seen from the above tables, the video prediction network trained according to the embodiments of the present disclosure improves on metrics such as SSIM and MSE.
In the above embodiment, a training sample set is acquired; a neural network is trained based on the training sample set, and a preset loss function is used to determine whether the training result output by the neural network meets a preset convergence condition; the training ends when the convergence condition is met, yielding the video prediction network. Compared with the prior art, the video prediction network structure adopted by the embodiments of the present disclosure achieves higher prediction accuracy.
In actual operation, the optical flow prediction network, the encoding network and the video prediction network may be trained separately or jointly; this is not limited by the embodiments of the present disclosure and may be set according to the actual situation.
It should be understood that although the steps in the flowcharts of fig. 2-4 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a video prediction apparatus including:
an optical flow acquisition module 501, configured to acquire a historical optical flow information sequence corresponding to a historical video, where the historical video is composed of N historical images arranged in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
an optical flow prediction module 502, configured to predict from the historical optical flow information sequence to obtain a predicted optical flow information sequence, where the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and the previous image;
a spatiotemporal information determining module 503, configured to determine first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence, where the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image; and
a video prediction module 504, configured to predict from the first spatiotemporal information and the historical video to obtain a plurality of predicted images, and compose the predicted video from the plurality of predicted images.
In one embodiment, the spatiotemporal information determining module 503 includes:
an image content determining submodule, configured to perform deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image; and
a spatiotemporal information determining submodule, configured to input the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, and compose the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
In one embodiment, the image content determining submodule is specifically configured to perform deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain the image content information corresponding to the 1st predicted image, and to perform deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain the image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
In one embodiment, the video prediction module 504 is specifically configured to input both the historical video and the first spatiotemporal information into a video prediction network, perform feature extraction on the historical video with the video prediction network, and predict from the extracted second spatiotemporal information and the first spatiotemporal information to obtain the plurality of predicted images, where the second spatiotemporal information comprises the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and the video prediction module 504 is specifically configured to input the N historical images into the feature extraction layer, obtain the temporal features and spatial features corresponding to each historical image output by the feature extraction layer, and compose the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the video prediction module 504 is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and to predict from the processed spatiotemporal information to obtain the plurality of predicted images.
In one embodiment, the video prediction module 504 is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after splicing or replacement processing of the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, the apparatus further comprises:
a sample acquisition module, configured to acquire a training sample set, where the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
a training module, configured to train a neural network based on the training sample set and determine with a preset loss function whether the training result output by the neural network meets a preset convergence condition, where the preset loss function is a mean square error loss function; and
a network obtaining module, configured to end the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the optical flow prediction module 502 is specifically configured to input the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the optical flow acquisition module 501 is configured to input each pair of adjacent historical images into a preset optical flow calculation model, obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and compose the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
For the specific limitations of the video prediction apparatus, reference may be made to the limitations of the video prediction method above, which are not repeated here. Each module of the video prediction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The database of the computer device is used to store video prediction data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a video prediction method.
Those skilled in the art will appreciate that the structure shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video consists of N historical images arranged in time order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change between a historical image and its preceding image;
predicting a predicted optical flow information sequence from the historical optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change between a predicted image and its preceding image;
determining first spatiotemporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and predicting a plurality of predicted images according to the first spatiotemporal information and the historical video, and composing the predicted video from the plurality of predicted images.
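To make these four steps concrete, the following is a minimal Python sketch of the pipeline. All the names below (predict_video and the five callables it receives) are hypothetical placeholders for the components described in the embodiments, not identifiers from the application.

```python
def predict_video(history, compute_flow, predict_flow_seq, warp, encode, video_net):
    """Minimal sketch of the four claimed steps.

    history: list of N frames, oldest first. The five callables are
    illustrative stand-ins for the optical flow calculation model, the
    optical flow prediction network, the warping step, the encoding
    network, and the video prediction network, respectively.
    """
    # Step 1: N-1 pieces of historical optical flow from adjacent frames.
    hist_flows = [compute_flow(history[i - 1], history[i])
                  for i in range(1, len(history))]
    # Step 2: predict the future optical flow sequence from the history.
    pred_flows = predict_flow_seq(hist_flows)
    # Step 3: warp forward from the Nth frame, then encode each warped
    # frame into the "first spatiotemporal information".
    content, prev = [], history[-1]
    for flow in pred_flows:
        prev = warp(prev, flow)
        content.append(prev)
    first_st_info = [encode(frame) for frame in content]
    # Step 4: the video prediction network fuses both inputs into frames.
    return video_net(history, first_st_info)
```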
In one embodiment, the processor, when executing the computer program, further performs the steps of:
warping (deformation processing) according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and spatial feature corresponding to each predicted image output by the encoding network, and composing the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
warping the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
warping the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
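This recursive warping can be realized with standard dense-flow resampling. The sketch below uses OpenCV's cv2.remap for the deformation step; the application does not fix a particular warping operator, so backward bilinear sampling is an illustrative assumption.

```python
import cv2
import numpy as np

def warp_with_flow(image, flow):
    """Warp `image` by a dense flow field of shape (H, W, 2), in pixels.

    Assumption: `flow` maps each target-frame pixel back into `image`
    (backward warping); the application's "deformation processing" is
    not specified at this level of detail.
    """
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Recursive use as in the embodiment: the 1st predicted frame is warped
# from the Nth historical frame, the ith from the (i-1)th predicted frame.
def rollout(last_hist_frame, pred_flows):
    frames, prev = [], last_hist_frame
    for flow in pred_flows:
        prev = warp_with_flow(prev, flow)
        frames.append(prev)
    return frames
```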
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting both the historical video and the first spatiotemporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting a plurality of predicted images from the extracted second spatiotemporal information together with the first spatiotemporal information; the second spatiotemporal information comprises the temporal features and spatial features corresponding to the respective historical images.
In one embodiment, the video prediction network includes a feature extraction layer, and the processor, when executing the computer program, further implements the following steps:
and inputting the N historical images into the feature extraction layer respectively to obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer, and composing the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the processor further implements the following steps when executing the computer program:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information;
and predicting a plurality of predicted images according to the processed spatiotemporal information.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
and inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer, which splices (concatenates) them or replaces one with the other and outputs the processed spatiotemporal information.
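A minimal sketch of this splice-or-replace step, assuming the temporal and spatial features of each frame are packed into one tensor per video of shape (T, C, H, W); the shapes and the time-axis concatenation are assumptions, since the application states only that the two pieces of spatiotemporal information are spliced or one replaces the other.

```python
import torch

def fuse_spatiotemporal(first_st, second_st, mode="splice"):
    # first_st:  features of the predicted frames,  shape (T_pred, C, H, W)
    # second_st: features of the historical frames, shape (T_hist, C, H, W)
    if mode == "splice":
        # Splicing: concatenate along the time axis, history first.
        return torch.cat([second_st, first_st], dim=0)
    # Replacement: the predicted-frame features stand in for the
    # historical ones in the subsequent prediction.
    return first_st
```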
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
training a neural network on the training sample set, and determining with a preset loss function whether the training result output by the neural network satisfies a preset convergence condition; the preset loss function is a mean square error loss function;
and ending the training when the training result satisfies the preset convergence condition to obtain the video prediction network.
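A hedged sketch of this training procedure in PyTorch. The application fixes only the mean square error loss and the existence of a preset convergence condition; the optimizer, learning rate, loss threshold, and data layout below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_video_prediction_network(net, loader, max_epochs=100, tol=1e-4):
    """Train until the preset convergence condition is met.

    `loader` is assumed to yield (historical_frames, sample_st_info,
    target_frames) triples drawn from the training sample set.
    """
    criterion = nn.MSELoss()  # the preset mean square error loss function
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
    for epoch in range(max_epochs):
        total = 0.0
        for hist, st_info, target in loader:
            optimizer.zero_grad()
            loss = criterion(net(hist, st_info), target)
            loss.backward()
            optimizer.step()
            total += loss.item()
        # One possible convergence condition: mean epoch loss below tol.
        if total / max(len(loader), 1) < tol:
            break
    return net
```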
In one embodiment, the processor, when executing the computer program, further performs the steps of:
and inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain a predicted optical flow information sequence output by the optical flow prediction network.
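The architecture of the optical flow prediction network is not disclosed, so the following is purely an illustrative sketch: a small CNN that stacks the historical flow fields channel-wise and regresses the next flow field, applied autoregressively to produce the predicted sequence.

```python
import torch
import torch.nn as nn

class FlowPredictor(nn.Module):
    """Illustrative stand-in for the optical flow prediction network."""

    def __init__(self, n_hist_flows):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * n_hist_flows, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),  # next flow: (dx, dy) channels
        )

    def forward(self, flows, steps):
        # flows: (B, n_hist_flows, 2, H, W); returns `steps` future flows.
        out = []
        for _ in range(steps):
            nxt = self.net(flows.flatten(1, 2))        # (B, 2, H, W)
            out.append(nxt)
            # Slide the window: drop the oldest flow, append the new one.
            flows = torch.cat([flows[:, 1:], nxt.unsqueeze(1)], dim=1)
        return torch.stack(out, dim=1)                 # (B, steps, 2, H, W)
```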
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting each pair of adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and composing the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
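As one concrete choice of "preset optical flow calculation model", the sketch below uses OpenCV's dense Farneback flow; the application does not name a specific model, so any dense flow estimator (including a learned one) would fit the same slot.

```python
import cv2

def historical_flow_sequence(frames):
    """Compute N-1 flow fields from N frames (BGR uint8, oldest first)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Each pair of adjacent frames yields one (H, W, 2) flow field.
        flows.append(cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0))
    return flows
```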
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video consists of N historical images arranged in time order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change between a historical image and its preceding image;
predicting a predicted optical flow information sequence from the historical optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change between a predicted image and its preceding image;
determining first spatiotemporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and predicting a plurality of predicted images according to the first spatiotemporal information and the historical video, and composing the predicted video from the plurality of predicted images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
warping (deformation processing) according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and spatial feature corresponding to each predicted image output by the encoding network, and composing the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
warping the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
warping the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting both the historical video and the first spatiotemporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting a plurality of predicted images from the extracted second spatiotemporal information together with the first spatiotemporal information; the second spatiotemporal information comprises the temporal features and spatial features corresponding to the respective historical images.
In one embodiment, the video prediction network comprises a feature extraction layer, and the computer program when executed by the processor further performs the steps of:
and inputting the N historical images into the feature extraction layer respectively to obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer, and composing the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network comprises a video prediction layer, and the computer program when executed by the processor further performs the steps of:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information;
and predicting a plurality of predicted images according to the processed spatiotemporal information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer, which splices (concatenates) them or replaces one with the other and outputs the processed spatiotemporal information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
training a neural network on the training sample set, and determining with a preset loss function whether the training result output by the neural network satisfies a preset convergence condition; the preset loss function is a mean square error loss function;
and ending the training when the training result satisfies the preset convergence condition to obtain the video prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain a predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting each pair of adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and composing the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include Random Access Memory (RAM) or an external cache. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any combination of them that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is comparatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for video prediction, the method comprising:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video consists of N historical images arranged in time order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change between a historical image and its preceding image;
predicting a predicted optical flow information sequence from the historical optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change between a predicted image and its preceding image;
determining first spatiotemporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and predicting a plurality of predicted images according to the first spatiotemporal information and the historical video, and composing the predicted video from the plurality of predicted images.
2. The method of claim 1, wherein the determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence comprises:
warping (deformation processing) according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and spatial feature corresponding to each predicted image output by the encoding network, and composing the first spatiotemporal information from the temporal features and spatial features corresponding to the plurality of predicted images.
3. The method according to claim 2, wherein the warping according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image comprises:
warping the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
and warping the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
4. The method according to claim 1, wherein the predicting a plurality of predicted images according to the first spatiotemporal information and the historical video comprises:
inputting both the historical video and the first spatiotemporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting the plurality of predicted images from the extracted second spatiotemporal information together with the first spatiotemporal information; the second spatiotemporal information comprises the temporal features and spatial features corresponding to the respective historical images.
5. The method of claim 4, wherein before the inputting both the historical video and the first spatiotemporal information into the video prediction network, the method further comprises:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images, and sample spatiotemporal information;
training a neural network on the training sample set, and determining with a preset loss function whether the training result output by the neural network satisfies a preset convergence condition; the preset loss function is a mean square error loss function;
and ending the training when the training result satisfies the preset convergence condition to obtain the video prediction network.
6. The method of claim 1, wherein the predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence comprises:
and inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
7. The method according to claim 1, wherein the acquiring a historical optical flow information sequence corresponding to a historical video comprises:
inputting each pair of adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and composing the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
8. An apparatus for video prediction, the apparatus comprising:
the optical flow acquisition module is configured to acquire a historical optical flow information sequence corresponding to a historical video; the historical video consists of N historical images arranged in time order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change between a historical image and its preceding image;
the optical flow prediction module is configured to predict a predicted optical flow information sequence from the historical optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change between a predicted image and its preceding image;
the spatiotemporal information determination module is configured to determine first spatiotemporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and the video prediction module is configured to predict a plurality of predicted images according to the first spatiotemporal information and the historical video and to compose the predicted video from the plurality of predicted images.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011421744.2A 2020-12-08 2020-12-08 Video prediction method, video prediction device, computer equipment and storage medium Pending CN112581508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011421744.2A CN112581508A (en) 2020-12-08 2020-12-08 Video prediction method, video prediction device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011421744.2A CN112581508A (en) 2020-12-08 2020-12-08 Video prediction method, video prediction device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112581508A true CN112581508A (en) 2021-03-30

Family

ID=75127903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011421744.2A Pending CN112581508A (en) 2020-12-08 2020-12-08 Video prediction method, video prediction device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112581508A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898573A (en) * 2020-08-05 2020-11-06 上海眼控科技股份有限公司 Image prediction method, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Sen; XU Hongke: "Video Frame Prediction Model Based on Spatio-temporal Modeling", Internet of Things Technologies, no. 2, 20 February 2020 (2020-02-20), pages 66-69 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610900A (en) * 2021-10-11 2021-11-05 深圳佑驾创新科技有限公司 Method and device for predicting scale change of vehicle tail sequence and computer equipment
CN113610900B (en) * 2021-10-11 2022-02-15 深圳佑驾创新科技有限公司 Method and device for predicting scale change of vehicle tail sequence and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination