CN112581508A - Video prediction method, video prediction device, computer equipment and storage medium - Google Patents
Video prediction method, video prediction device, computer equipment and storage medium
- Publication number
- CN112581508A (application number CN202011421744.2A)
- Authority
- CN
- China
- Prior art keywords
- optical flow
- historical
- predicted
- information
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/269—Analysis of motion using gradient-based methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The application relates to a video prediction method, a video prediction apparatus, a computer device and a storage medium. The method comprises the following steps: acquiring a historical optical flow information sequence corresponding to a historical video; predicting a predicted optical flow information sequence from the historical optical flow information sequence; determining first spatiotemporal information corresponding to a predicted video from the predicted optical flow information sequence; and predicting a plurality of predicted images from the first spatiotemporal information and the historical video, the plurality of predicted images forming the predicted video. By adopting the method, the accuracy of video prediction can be improved.
Description
Technical Field
The present application relates to the field of image prediction technologies, and in particular, to a video prediction method, apparatus, computer device, and storage medium.
Background
Since video can provide rich visual information, more and more information is presented in video form. With the development of computer technology and image processing technology, video prediction technology has emerged. Video prediction is applied in fields such as automatic driving and weather prediction, and can provide great convenience for people's work and life.
In the related art, video prediction typically extrapolates historical video images by an optical flow method to obtain future video images. However, such optical-flow-based extrapolation suffers from inaccurate prediction.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video prediction method, apparatus, computer device and storage medium capable of improving prediction accuracy.
A method of video prediction, the method comprising:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming a predicted video.
In one embodiment, the determining the first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence includes:
performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, the temporal features and spatial features corresponding to the plurality of predicted images forming the first spatiotemporal information.
In one embodiment, the performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image includes:
performing deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
performing deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image; wherein i is a positive integer greater than 1.
In one embodiment, the predicting according to the first spatiotemporal information and the historical video to obtain the plurality of predicted images includes:
inputting both the historical video and the first spatiotemporal information into a video prediction network, performing feature extraction on the historical video by using the video prediction network, and predicting according to the extracted second spatiotemporal information and the first spatiotemporal information to obtain a plurality of predicted images; the second spatiotemporal information includes the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and the performing feature extraction on the historical video by using the video prediction network includes:
respectively inputting the N historical images into the feature extraction layer to obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer, the temporal features and spatial features corresponding to the plurality of historical images forming the second spatiotemporal information.
In one embodiment, the video prediction network includes a video prediction layer, and the predicting according to the extracted second spatiotemporal information and the first spatiotemporal information to obtain a plurality of predicted images includes:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information;
and predicting according to the processed spatiotemporal information to obtain a plurality of predicted images.
In one embodiment, the inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain the processed spatiotemporal information includes:
inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after it splices or replaces the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, before the inputting of both the historical video and the first spatiotemporal information into the video prediction network, the method further comprises:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of prediction sample images and sample spatiotemporal information;
training a neural network based on the training sample set, and determining, using a preset loss function, whether a training result output by the neural network meets a preset convergence condition; the preset loss function is a mean square error loss function;
and finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence includes:
and inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain a predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the obtaining of the historical optical flow information sequence corresponding to the historical video includes:
inputting every two adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and forming a historical optical flow information sequence by the plurality of pieces of historical optical flow information.
A video prediction apparatus, the apparatus comprising:
the optical flow acquisition module is used for acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
the optical flow prediction module is used for predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
the spatiotemporal information determining module is used for determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and the video prediction module is used for predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming the predicted video.
In one embodiment, the spatiotemporal information determining module includes:
the image content determining submodule, used for performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and the spatiotemporal information determining submodule, used for inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, the temporal features and spatial features corresponding to the plurality of predicted images forming the first spatiotemporal information.
In one embodiment, the image content determining submodule is specifically configured to perform deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image, and to perform deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image; wherein i is a positive integer greater than 1.
In one embodiment, the video prediction module is specifically configured to input both the historical video and the first spatiotemporal information into a video prediction network, perform feature extraction on the historical video by using the video prediction network, and perform prediction according to the extracted second spatiotemporal information and the first spatiotemporal information to obtain a plurality of predicted images; the second spatiotemporal information includes the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and the video prediction module is specifically configured to respectively input the N historical images into the feature extraction layer, obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer, and form the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the video prediction module is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and to perform prediction according to the processed spatiotemporal information to obtain a plurality of predicted images.
In one embodiment, the video prediction module is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after it splices or replaces the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, the apparatus further comprises:
the sample acquisition module is used for acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of prediction sample images and sample spatiotemporal information;
the training module is used for training the neural network based on the training sample set and determining, using a preset loss function, whether the training result output by the neural network meets a preset convergence condition; the preset loss function is a mean square error loss function;
and the network obtaining module is used for finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the optical flow prediction module is specifically configured to input the historical optical flow information sequence into a pre-trained optical flow prediction network, so as to obtain a predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the optical flow acquisition module is configured to input every two adjacent historical images into a preset optical flow calculation model, obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and compose a historical optical flow information sequence from the plurality of pieces of historical optical flow information.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming a predicted video.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming a predicted video.
According to the video prediction method, apparatus, computer device and storage medium, the server acquires a historical optical flow information sequence corresponding to a historical video; performs prediction according to the historical optical flow information sequence to obtain a predicted optical flow information sequence, which comprises a plurality of pieces of predicted optical flow information; determines first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; and performs prediction according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, which form the predicted video. In the embodiment of the disclosure, more accurate temporal and spatial features are extracted on the basis of the optical flow change features, so the accuracy of video prediction can be improved.
Drawings
FIG. 1 is a diagram of an application environment of a video prediction method in one embodiment;
FIG. 2 is a flow diagram of a video prediction method in one embodiment;
FIG. 3 is a flowchart illustrating the step of determining first spatiotemporal information corresponding to a predicted video according to a sequence of predicted optical flow information in one embodiment;
FIG. 4 is a schematic flow chart illustrating training a video prediction network according to one embodiment;
FIG. 5 is a block diagram of a video prediction device in one embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The video prediction method provided by the application can be applied to the application environment shown in fig. 1. The application environment includes a video capture device 102 and a server 104, where the video capture device 102 communicates with the server 104 over a network. The video capture device 102 may be, but is not limited to, various cameras, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a video prediction method is provided, which is described by taking its application to the server in fig. 1 as an example, and includes the following steps:
Step 201, acquiring a historical optical flow information sequence corresponding to a historical video.
The historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image. For example, the historical video is formed by arranging 8 historical images, historical image 1 through historical image 8, in chronological order; the historical optical flow information sequence then includes 7 pieces of historical optical flow information, that is, the optical flow change features between historical image 2 and historical image 1, between historical image 3 and historical image 2, ..., and between historical image 8 and historical image 7. The historical videos may be videos in fields such as automatic driving and weather prediction; the embodiment of the disclosure does not limit the number of historical videos or historical images.
The server can acquire the historical video in the following ways: the video capture device sends the stored historical video to the server, and the server receives it; or the video capture device sends the captured video to the server, the server stores it after receiving it, and at prediction time retrieves the stored historical video locally. The embodiment of the disclosure does not limit the acquisition mode of the historical video.
An optical flow calculation model is preset in the server; this model computes the optical flow change features between two images. After acquiring the historical video, the server inputs every two adjacent historical images in the historical video into the preset optical flow calculation model to obtain the pieces of historical optical flow information computed by the model, which then form the historical optical flow information sequence.
For example, historical image 1 and historical image 2 are input into the optical flow calculation model, which outputs the optical flow change features between historical image 2 and historical image 1, that is, historical optical flow information 1; next, historical images 2 and 3 are input, and the model outputs the optical flow change features between historical image 3 and historical image 2, that is, historical optical flow information 2. By analogy, a plurality of pieces of historical optical flow information are output and form the historical optical flow information sequence.
The optical flow calculation model may be a deep-learning-based optical flow estimation network such as FlowNet; the embodiment of the disclosure does not limit the optical flow calculation model.
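For illustration, a minimal sketch of this step is given below. It uses OpenCV's Farneback estimator as a stand-in for a learned flow network such as FlowNet; the function and variable names are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def historical_flow_sequence(frames):
    """frames: N grayscale images (H x W, uint8) in chronological order.
    Returns N-1 flow fields, each H x W x 2 (per-pixel dx, dy)."""
    flows = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Optical flow change features between curr and the previous frame.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow.astype(np.float32))
    return flows
```

For 8 input frames this yields the 7 pieces of historical optical flow information described above.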
Step 202, predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence.
The predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece represents the optical flow change features between a predicted image and its previous image. For example, the predicted optical flow information sequence includes 3 pieces of predicted optical flow information: the predicted optical flow change features between predicted image 1 and historical image N, between predicted image 2 and predicted image 1, and between predicted image 3 and predicted image 2.
The server trains the optical flow prediction network in advance, and after the historical optical flow information sequence is obtained, the server inputs the historical optical flow information sequence into the optical flow prediction network trained in advance to obtain a predicted optical flow information sequence output by the optical flow prediction network.
The optical flow prediction network may be a convolutional recurrent neural network, such as ConvLSTM; the embodiment of the disclosure does not limit the optical flow prediction network. It can be understood that, by using a convolutional recurrent neural network to learn the change features of the optical flow itself, the temporal correlation of the optical flow changes can be learned.
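As a sketch of the kind of building block such a network may stack, a minimal ConvLSTM cell in PyTorch is shown below; it is an assumption about the architecture, not the patent's exact model.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM cell: convolutional gates over spatial feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # A single convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state  # hidden (temporal) map h and cell (memory) map c
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.split(gates, self.hid_ch, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # update the memory with the new input
        h = o * torch.tanh(c)
        return h, c
```

Feeding the N-1 historical flow fields through such cells step by step, then continuing to step the cells on their own outputs, yields the predicted optical flow information sequence.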
Step 203, determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence.
The first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image.
After obtaining the predicted optical flow information sequence, the server determines the temporal feature and spatial feature corresponding to each predicted image according to the predicted optical flow information in the sequence; the temporal features and spatial features corresponding to the plurality of predicted images form the first spatiotemporal information.
For example, the temporal feature and the spatial feature corresponding to the predicted image 1 are determined from the predicted optical flow information 1, and the temporal feature and the spatial feature corresponding to the predicted image 2 are determined from the predicted optical flow information 2.
Step 204, predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming a predicted video.
The server first obtains the temporal feature and spatial feature of each historical image in the historical video; then performs prediction according to the temporal and spatial features of each historical image together with the temporal and spatial features corresponding to each predicted image, obtaining a plurality of predicted images; finally, a predicted video is composed of the plurality of predicted images.
For example, predicted image 1 is obtained by prediction based on the temporal and spatial features corresponding to predicted image 1 together with those of historical image N; predicted image 2 is obtained by prediction based on the temporal and spatial features corresponding to predicted image 2 together with those corresponding to predicted image 1; and so on, the plurality of predicted images is obtained.
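A sketch of this autoregressive rollout is shown below; `predict_next`, which stands in for one step of the video prediction network, is hypothetical.

```python
def rollout(last_hist_state, predicted_states, predict_next):
    """last_hist_state: (h, c) features of historical image N.
    predicted_states: per-step (h, c) derived from the predicted optical flow.
    predict_next: one step of the video prediction network (hypothetical)."""
    images, prev = [], last_hist_state
    for state in predicted_states:
        img, prev = predict_next(prev, state)  # condition on the previous step
        images.append(img)
    return images  # the frames composing the predicted video
```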
In this video prediction method, the server acquires a historical optical flow information sequence corresponding to a historical video; performs prediction according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; determines first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; and performs prediction according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, which form the predicted video. In the embodiment of the disclosure, more accurate temporal and spatial features are extracted on the basis of the optical flow change features, so the accuracy of video prediction can be improved.
In one embodiment, as shown in fig. 3, the step of determining the first spatiotemporal information corresponding to the predicted video according to the sequence of predicted optical flow information may include:
The server performs deformation processing (i.e., warping) on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image; and performs deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image, where i is a positive integer greater than 1.
For example, the server inputs predicted optical flow information 1 and historical image N into a pre-trained deformation network, and obtains the image content information of predicted image 1 output by the deformation network. After prediction yields predicted image 1, predicted optical flow information 2 and predicted image 1 are input into the deformation network, and the image content information of predicted image 2 output by the deformation network is obtained. By analogy, the image content information corresponding to each predicted image can be obtained.
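The deformation step can be sketched as a standard backward warp; the version below uses PyTorch's grid_sample and is an illustrative stand-in for the pre-trained deformation network mentioned above.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """img: (B, C, H, W) source frame; flow: (B, 2, H, W) displacement in pixels.
    Returns the warped frame, i.e. image content for the next predicted image."""
    b, _, h, w = img.shape
    ys = torch.arange(h, dtype=img.dtype, device=img.device)
    xs = torch.arange(w, dtype=img.dtype, device=img.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    x = gx.unsqueeze(0) + flow[:, 0]  # sampling x-coordinate per pixel
    y = gy.unsqueeze(0) + flow[:, 1]  # sampling y-coordinate per pixel
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    grid = torch.stack([2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1], dim=-1)
    return F.grid_sample(img, grid, align_corners=True)
```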
Each predicted image contains two components: one is image content information, for example, that the predicted image shows an apple; the other is motion information, for example, the apple's direction of motion.
The server trains the encoding network in advance, and after image content information corresponding to a predicted image is obtained, the image content information is input into the encoding network, and the encoding network outputs the time characteristic and the space characteristic corresponding to the predicted image.
For example, the server inputs the image content information corresponding to the predicted image 1 into the encoding network, and obtains the temporal feature and the spatial feature corresponding to the predicted image 1. After the image content information corresponding to the predicted image 2 is obtained, the image content information corresponding to the predicted image 2 is input to the encoding network, and the temporal feature and the spatial feature corresponding to the predicted image 2 are obtained. And finally, forming first time-space information by the time characteristics and the space characteristics corresponding to the plurality of predicted images.
Thus, to determine the first spatiotemporal information corresponding to the predicted video from the predicted optical flow information sequence, the server performs deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image, then inputs the image content information corresponding to each predicted image into the pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, the temporal features and spatial features corresponding to the plurality of predicted images forming the first spatiotemporal information. In the embodiment of the disclosure, the server extracts the temporal and spatial features corresponding to a predicted image from that image's content information, and these features express the predicted image more accurately, so the accuracy of video prediction can be improved.
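For concreteness, a minimal encoding network could look like the sketch below: a small CNN mapping the warped image content to a temporal feature h and a spatial feature c. The layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Maps image content information to a (temporal, spatial) feature pair."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.to_h = nn.Conv2d(feat_ch, feat_ch, 1)  # temporal feature head
        self.to_c = nn.Conv2d(feat_ch, feat_ch, 1)  # spatial feature head

    def forward(self, content):
        z = self.backbone(content)
        return self.to_h(z), self.to_c(z)  # features for one predicted image
```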
In an embodiment, the step of predicting the plurality of predicted images according to the first spatiotemporal information and the historical video may include: inputting both the historical video and the first spatiotemporal information into a video prediction network, performing feature extraction on the historical video by using the video prediction network, and predicting according to the extracted second spatiotemporal information and the first spatiotemporal information to obtain a plurality of predicted images.
The second spatiotemporal information comprises the temporal features and spatial features corresponding to each historical image.
The server pre-trains a video prediction network, which may include a feature extraction layer and a video prediction layer. In operation, the server respectively inputs the N historical images into the feature extraction layer to obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer; the temporal features and spatial features corresponding to the plurality of historical images form the second spatiotemporal information. Then, the server inputs the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and performs prediction according to the processed spatiotemporal information to obtain a plurality of predicted images.
For example, the server inputs historical image 1 through historical image 8 into the feature extraction layer, and the feature extraction layer extracts the temporal feature and spatial feature corresponding to each of them. The temporal feature h and spatial feature c of a historical image may be denoted H(h, c), and the temporal and spatial features the encoding network extracts for a predicted image may be denoted H'(h, c). The first spatiotemporal information and the second spatiotemporal information are input into the video prediction layer, which processes H(h, c) and H'(h, c) to obtain the processed spatiotemporal information.
In one embodiment, the inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain the processed spatiotemporal information may include: inputting the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after it splices or replaces the first and second spatiotemporal information.
For example, H(h, c) of historical image 8 and H'(h, c) corresponding to predicted image 1 are spliced to obtain processed spatiotemporal information, and H'(h, c) corresponding to predicted image 1 and H'(h, c) corresponding to predicted image 2 are spliced to obtain the next processed spatiotemporal information. Alternatively, H(h, c) of historical image 8 is replaced by H'(h, c) corresponding to predicted image 1 to obtain processed spatiotemporal information, and H'(h, c) corresponding to predicted image 1 is replaced by H'(h, c) corresponding to predicted image 2 to obtain the next processed spatiotemporal information. The embodiment of the disclosure does not limit the feature processing manner.
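The two feature-processing options can be sketched as follows; the 64-channel width and the 1x1 fusion convolution are assumptions.

```python
import torch
import torch.nn as nn

fuse = nn.Conv2d(2 * 64, 64, kernel_size=1)  # mixes spliced features back down

def process(feat_hist, feat_pred, mode="splice"):
    """feat_hist, feat_pred: (B, 64, H, W) feature maps H(h, c) and H'(h, c)."""
    if mode == "splice":
        return fuse(torch.cat([feat_hist, feat_pred], dim=1))  # concatenation
    return feat_pred  # "replace": the predicted features stand in directly
```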
It can be understood that, since the spatiotemporal information can represent the image more accurately, the server performs prediction according to the first spatiotemporal information and the second spatiotemporal information, and can obtain a more accurate predicted image, thereby improving the accuracy of video prediction.
In one embodiment, as shown in fig. 4, before both the historical video and the first spatiotemporal information are input into the video prediction network, a training process of the video prediction network may be included:
Step 401, acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of prediction sample images and sample spatiotemporal information.
A sample video is acquired; the first N sample images in the sample video are taken as historical sample images, and the subsequent sample images are taken as prediction sample images. Feature extraction is performed on each historical sample image to obtain its corresponding temporal and spatial features, and on each prediction sample image to obtain its corresponding temporal and spatial features; finally, the sample spatiotemporal information is obtained from the temporal and spatial features of the historical sample images and of the prediction sample images.
Step 402, training a neural network based on the training sample set, and determining, using a preset loss function, whether the training result output by the neural network meets a preset convergence condition.
The preset loss function is a mean square error loss function, e.g. MSE = (1/n) * Σ_{i=1..n} (T_i − T_i')², where MSE is the mean square error, T_i is the ith prediction sample image, and T_i' is the corresponding training result. Other functions may also be adopted as the preset loss function, which is not limited in this disclosure.
The server inputs a historical sample image into the neural network, and the neural network outputs a training result; the loss value between the training result and the prediction sample image is calculated with the preset loss function. If the loss value is smaller than a preset threshold, the training result is determined to meet the preset convergence condition, and step 403 is executed. If the loss value is greater than or equal to the preset threshold, the training result is determined not to meet the preset convergence condition; the server adjusts the parameters of the neural network and inputs the next historical sample image into the adjusted network, until the training result output by the neural network meets the preset convergence condition. The embodiment of the disclosure does not limit the preset threshold.
Step 403, finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
After the training result is determined to meet the preset convergence condition, the training is finished, and the neural network at that point is determined to be the video prediction network.
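A sketch of this training loop under the stated convergence condition is given below; the model, data iterator and threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(net, samples, lr=1e-3, threshold=1e-3, max_steps=100000):
    """samples yields (historical_images, prediction_sample_images) tensors."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    mse = nn.MSELoss()  # the preset mean square error loss function
    for _, (hist, target) in zip(range(max_steps), samples):
        loss = mse(net(hist), target)
        opt.zero_grad()
        loss.backward()
        opt.step()  # adjust network parameters, continue with the next sample
        if loss.item() < threshold:  # preset convergence condition met
            break
    return net  # the trained video prediction network (step 403)
```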
In one embodiment, the video prediction network trained by the embodiment of the present disclosure is compared with the existing convolutional recurrent neural network ConvLSTM, as shown in the following tables.
Structural similarity (SSIM) prediction results on the satellite cloud images:
SSIM | Step1 | Step2 | Step3 | Step4 | Step5 | Step6 | Step7 | Step8
ConvLSTM | 0.797 | 0.754 | 0.718 | 0.690 | 0.666 | 0.646 | 0.629 | 0.614
Embodiment of the disclosure | 0.797 | 0.755 | 0.719 | 0.691 | 0.667 | 0.647 | 0.630 | 0.616
Pixel-value mean square error (MSE) prediction results on the satellite cloud images:
MSE | Step1 | Step2 | Step3 | Step4 | Step5 | Step6 | Step7 | Step8
ConvLSTM | 21.58 | 23.55 | 24.95 | 26.17 | 27.31 | 28.29 | 29.32 | 30.32
Embodiment of the disclosure | 21.84 | 23.43 | 24.67 | 25.77 | 26.77 | 27.73 | 28.72 | 29.67
As can be seen from the above tables, indexes such as SSIM and MSE of the video prediction network trained by the embodiment of the present disclosure are improved.
In the above embodiment, a training sample set is obtained; training a neural network based on the training sample set, and determining whether a training result output by the neural network meets a preset convergence condition by using a preset loss function; and finishing the training under the condition that the training result meets the preset convergence condition to obtain the video prediction network. Compared with the prior art, the video prediction network structure adopted by the embodiment of the disclosure has higher prediction accuracy.
In actual operation, the optical flow prediction network, the encoding network and the video prediction network may be trained separately or jointly; this is not limited in the embodiments of the present disclosure and may be set according to the actual situation.
It should be understood that although the various steps in the flowcharts of figs. 2-4 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a video prediction apparatus including:
an optical flow obtaining module 501, configured to obtain a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
an optical flow prediction module 502, configured to perform prediction according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
a spatiotemporal information determining module 503, configured to determine first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and a video prediction module 504, configured to perform prediction according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming the predicted video.
In one embodiment, the spatiotemporal information determining module 503 includes:
an image content determining submodule, configured to perform deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and a spatiotemporal information determining submodule, configured to input the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal features and spatial features corresponding to each predicted image output by the encoding network, the temporal features and spatial features corresponding to the plurality of predicted images forming the first spatiotemporal information.
In one embodiment, the image content determining submodule is specifically configured to perform deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image, and to perform deformation processing on the (i-1)th predicted image according to the ith piece of predicted optical flow information to obtain image content information corresponding to the ith predicted image; wherein i is a positive integer greater than 1.
In one embodiment, the video prediction module 504 is specifically configured to input both the historical video and the first spatiotemporal information into a video prediction network, perform feature extraction on the historical video by using the video prediction network, and perform prediction according to the extracted second spatiotemporal information and the first spatiotemporal information to obtain a plurality of predicted images; the second spatiotemporal information includes the temporal features and spatial features corresponding to each historical image.
In one embodiment, the video prediction network includes a feature extraction layer, and the video prediction module 504 is specifically configured to respectively input the N historical images into the feature extraction layer, obtain the temporal feature and spatial feature corresponding to each historical image output by the feature extraction layer, and form the second spatiotemporal information from the temporal features and spatial features corresponding to the plurality of historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the video prediction module 504 is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer for feature processing to obtain processed spatiotemporal information, and to perform prediction according to the processed spatiotemporal information to obtain a plurality of predicted images.
In one embodiment, the video prediction module 504 is specifically configured to input the first spatiotemporal information and the second spatiotemporal information into the video prediction layer to obtain the processed spatiotemporal information output by the video prediction layer after it splices or replaces the first spatiotemporal information and the second spatiotemporal information.
In one embodiment, the apparatus further comprises:
the sample acquisition module is used for acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of prediction sample images and sample spatiotemporal information;
the training module is used for training the neural network based on the training sample set and determining, using a preset loss function, whether the training result output by the neural network meets a preset convergence condition; the preset loss function is a mean square error loss function;
and the network obtaining module is used for finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the optical flow prediction module 502 is specifically configured to input the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain a predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the optical flow acquisition module 501 is configured to input every two adjacent historical images into a preset optical flow calculation model, obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and compose a historical optical flow information sequence from the plurality of pieces of historical optical flow information.
For specific limitations of the video prediction apparatus, reference may be made to the above limitations of the video prediction method, which is not described herein again. The various modules in the video prediction apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing video prediction data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video prediction method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in chronological order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change features between one historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change features between a predicted image and its previous image;
determining first spatiotemporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatiotemporal information comprises the temporal features and spatial features corresponding to each predicted image;
and predicting according to the first spatiotemporal information and the historical video to obtain a plurality of predicted images, the plurality of predicted images forming a predicted video.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing deformation (warping) processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and the spatial feature that the encoding network outputs for each predicted image, and composing the first spatio-temporal information from the temporal and spatial features of the plurality of predicted images.
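The architecture of the encoding network is not disclosed; under that caveat, a small convolutional encoder of roughly the following shape could map the image content of each predicted frame to a feature map treated as its combined temporal and spatial features.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative stand-in for the pre-trained encoding network."""
    def __init__(self, in_ch: int = 1, feat_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image content of one predicted frame
        return self.net(x)  # (B, feat_ch, H/4, W/4) feature map

# First spatio-temporal information as a (B, T, feat_ch, H/4, W/4) tensor:
# first_st_info = torch.stack([encoder(f) for f in warped_frames], dim=1)
```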
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
and performing deformation processing on the (i-1)-th predicted image according to the i-th piece of predicted optical flow information to obtain image content information corresponding to the i-th predicted image, where i is a positive integer greater than 1.
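One common realization of this deformation processing — an assumption here, since the embodiments do not fix the warping operator — is a bilinear remap of each source image along the predicted flow, chained frame by frame as the two steps above describe.

```python
import cv2
import numpy as np

def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `image` by one predicted flow field via bilinear remapping."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample each output pixel from the location the flow points to
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Chained use per this embodiment (dummy shapes for illustration):
last_hist_frame = np.zeros((64, 64), np.uint8)        # the Nth historical image
pred_flows = [np.zeros((64, 64, 2), np.float32)] * 3  # predicted flow sequence
content = [warp_with_flow(last_hist_frame, pred_flows[0])]   # 1st predicted image
for flow in pred_flows[1:]:
    content.append(warp_with_flow(content[-1], flow))        # i-th from (i-1)-th
```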
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting both the historical video and the first spatio-temporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting according to the extracted second spatio-temporal information and the first spatio-temporal information to obtain the plurality of predicted images; the second spatio-temporal information comprises the temporal and spatial features corresponding to the respective historical images.
In one embodiment, the video prediction network includes a feature extraction layer, and the processor, when executing the computer program, further implements the following steps:
inputting each of the N historical images into the feature extraction layer to obtain the temporal feature and the spatial feature that the feature extraction layer outputs for each historical image, and composing the second spatio-temporal information from the temporal and spatial features of the N historical images.
In one embodiment, the video prediction network includes a video prediction layer, and the processor further implements the following steps when executing the computer program:
inputting the first spatio-temporal information and the second spatio-temporal information into the video prediction layer for feature processing to obtain processed spatio-temporal information;
and predicting according to the processed spatio-temporal information to obtain the plurality of predicted images.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting the first spatio-temporal information and the second spatio-temporal information into the video prediction layer, which splices (concatenates) the two, or replaces one with the other, and outputs the processed spatio-temporal information.
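Taken literally, "splicing" suggests concatenating the two feature sets along the time axis and "replacing" suggests substituting one for the other; the sketch below encodes that reading, with the (B, T, C, H, W) tensor layout an assumed convention.

```python
import torch

def fuse_spatiotemporal(first_st: torch.Tensor,
                        second_st: torch.Tensor,
                        mode: str = "splice") -> torch.Tensor:
    """Illustrative feature processing of the video prediction layer."""
    if mode == "splice":
        # Append predicted-frame features after the historical ones in time
        return torch.cat([second_st, first_st], dim=1)
    if mode == "replace":
        # Substitute the predicted-frame features for the historical ones
        return first_st
    raise ValueError(f"unknown mode: {mode}")

# first_st: (B, T_pred, C, H, W); second_st: (B, T_hist, C, H, W)
# processed = fuse_spatiotemporal(first_st_info, second_st_info, "splice")
```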
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images and sample spatio-temporal information;
training a neural network based on the training sample set, and determining whether the training result output by the neural network meets a preset convergence condition by using a preset loss function, where the preset loss function is a mean square error loss function;
and finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
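A minimal training sketch consistent with this embodiment follows; only the mean square error loss is specified above, so the optimizer, convergence threshold and network interface are assumptions.

```python
import torch
import torch.nn as nn

def train_video_prediction(net, loader, epochs=100, tol=1e-4, lr=1e-3):
    """Train until the MSE loss satisfies an assumed convergence threshold."""
    criterion = nn.MSELoss()  # the preset mean square error loss function
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        epoch_loss = 0.0
        for hist_frames, st_info, target_frames in loader:
            optimizer.zero_grad()
            pred = net(hist_frames, st_info)        # predicted sample images
            loss = criterion(pred, target_frames)   # compare with ground truth
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < tol:          # convergence condition met
            break                                   # training finished
    return net  # the trained video prediction network
```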
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
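The internals of the optical flow prediction network are likewise not given; as one assumed shape, a small convolutional network can map the K most recent flow fields to the next one and be applied autoregressively over the sequence.

```python
import torch
import torch.nn as nn

class FlowPredictionNet(nn.Module):
    """Assumed architecture for the pre-trained optical flow prediction
    network: predict the next (dx, dy) field from the last K fields."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Conv2d(2 * k, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),  # the next flow field
        )

    def forward(self, flows: torch.Tensor, steps: int) -> torch.Tensor:
        # flows: (B, K, 2, H, W) most recent historical flow fields
        preds, window = [], flows
        for _ in range(steps):
            nxt = self.net(window.flatten(1, 2))  # (B, 2, H, W)
            preds.append(nxt)
            # Slide the window forward autoregressively
            window = torch.cat([window[:, 1:], nxt.unsqueeze(1)], dim=1)
        # Predicted optical flow information sequence: (B, steps, 2, H, W)
        return torch.stack(preds, dim=1)
```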
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting every two adjacent historical images into a preset optical flow calculation model to obtain the pieces of historical optical flow information calculated by the model, and composing the historical optical flow information sequence from them.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in temporal order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change characteristics between a historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change characteristics between a predicted image and the image preceding it;
determining first spatio-temporal information corresponding to the predicted video according to the predicted optical flow information sequence; the first spatio-temporal information comprises the temporal feature and the spatial feature corresponding to each predicted image;
and predicting according to the first spatio-temporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing deformation (warping) processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and the spatial feature that the encoding network outputs for each predicted image, and composing the first spatio-temporal information from the temporal and spatial features of the plurality of predicted images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
and performing deformation processing on the (i-1)-th predicted image according to the i-th piece of predicted optical flow information to obtain image content information corresponding to the i-th predicted image, where i is a positive integer greater than 1.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting both the historical video and the first spatio-temporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting according to the extracted second spatio-temporal information and the first spatio-temporal information to obtain the plurality of predicted images; the second spatio-temporal information comprises the temporal and spatial features corresponding to the respective historical images.
In one embodiment, the video prediction network comprises a feature extraction layer, and the computer program when executed by the processor further performs the steps of:
inputting each of the N historical images into the feature extraction layer to obtain the temporal feature and the spatial feature that the feature extraction layer outputs for each historical image, and composing the second spatio-temporal information from the temporal and spatial features of the N historical images.
In one embodiment, the video prediction network comprises a video prediction layer, and the computer program when executed by the processor further performs the steps of:
inputting the first spatio-temporal information and the second spatio-temporal information into the video prediction layer for feature processing to obtain processed spatio-temporal information;
and predicting according to the processed spatio-temporal information to obtain the plurality of predicted images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the first spatio-temporal information and the second spatio-temporal information into the video prediction layer, which splices (concatenates) the two, or replaces one with the other, and outputs the processed spatio-temporal information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images and sample spatio-temporal information;
training a neural network based on the training sample set, and determining whether the training result output by the neural network meets a preset convergence condition by using a preset loss function, where the preset loss function is a mean square error loss function;
and finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting every two adjacent historical images into a preset optical flow calculation model to obtain the pieces of historical optical flow information calculated by the model, and composing the historical optical flow information sequence from them.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described, but any combination of these technical features should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method for video prediction, the method comprising:
acquiring a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in temporal order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change characteristics between a historical image and the previous historical image;
predicting according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change characteristics between a predicted image and the image preceding it;
determining first spatio-temporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatio-temporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and predicting according to the first spatio-temporal information and the historical video to obtain a plurality of predicted images, and composing the predicted video from the plurality of predicted images.
2. The method of claim 1, wherein said determining first spatio-temporal information corresponding to a predicted video according to said predicted optical flow information sequence comprises:
performing deformation processing according to the predicted optical flow information sequence and the Nth historical image to obtain image content information corresponding to each predicted image;
and inputting the image content information corresponding to each predicted image into a pre-trained encoding network to obtain the temporal feature and the spatial feature that the encoding network outputs for each predicted image, and composing the first spatio-temporal information from the temporal and spatial features of the plurality of predicted images.
3. The method according to claim 2, wherein said performing deformation processing according to said predicted optical flow information sequence and the Nth historical image to obtain the image content information corresponding to each predicted image comprises:
performing deformation processing on the Nth historical image according to the 1st piece of predicted optical flow information to obtain image content information corresponding to the 1st predicted image;
and performing deformation processing on the (i-1)-th predicted image according to the i-th piece of predicted optical flow information to obtain image content information corresponding to the i-th predicted image, where i is a positive integer greater than 1.
4. The method according to claim 1, wherein said predicting according to the first spatio-temporal information and the historical video to obtain the plurality of predicted images comprises:
inputting both the historical video and the first spatio-temporal information into a video prediction network, extracting features of the historical video with the video prediction network, and predicting according to the extracted second spatio-temporal information and the first spatio-temporal information to obtain the plurality of predicted images; the second spatio-temporal information comprises the temporal and spatial features corresponding to the respective historical images.
5. The method of claim 4, wherein prior to said inputting both the historical video and the first spatio-temporal information into the video prediction network, the method further comprises:
acquiring a training sample set; the training sample set comprises a plurality of historical sample images, a plurality of predicted sample images and sample spatio-temporal information;
training a neural network based on the training sample set, and determining whether a training result output by the neural network meets a preset convergence condition by using a preset loss function; the preset loss function is a mean square error loss function;
and finishing the training when the training result meets the preset convergence condition, to obtain the video prediction network.
6. The method of claim 1, wherein said predicting according to the historical optical flow information sequence to obtain the predicted optical flow information sequence comprises:
inputting the historical optical flow information sequence into a pre-trained optical flow prediction network to obtain the predicted optical flow information sequence output by the optical flow prediction network.
7. The method according to claim 1, wherein said acquiring the historical optical flow information sequence corresponding to the historical video comprises:
inputting every two adjacent historical images into a preset optical flow calculation model to obtain a plurality of pieces of historical optical flow information calculated by the optical flow calculation model, and composing the historical optical flow information sequence from the plurality of pieces of historical optical flow information.
8. An apparatus for video prediction, the apparatus comprising:
the optical flow acquisition module is configured to acquire a historical optical flow information sequence corresponding to a historical video; the historical video is formed by arranging N historical images in temporal order, the historical optical flow information sequence comprises N-1 pieces of historical optical flow information, and each piece of historical optical flow information represents the optical flow change characteristics between a historical image and the previous historical image;
the optical flow prediction module is configured to predict according to the historical optical flow information sequence to obtain a predicted optical flow information sequence; the predicted optical flow information sequence comprises a plurality of pieces of predicted optical flow information, and each piece of predicted optical flow information represents the optical flow change characteristics between a predicted image and the image preceding it;
the spatio-temporal information determination module is configured to determine first spatio-temporal information corresponding to a predicted video according to the predicted optical flow information sequence; the first spatio-temporal information comprises a temporal feature and a spatial feature corresponding to each predicted image;
and the video prediction module is configured to predict according to the first spatio-temporal information and the historical video to obtain a plurality of predicted images, the predicted video being composed of the plurality of predicted images.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011421744.2A | 2020-12-08 | 2020-12-08 | Video prediction method, video prediction device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011421744.2A | 2020-12-08 | 2020-12-08 | Video prediction method, video prediction device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112581508A (en) | 2021-03-30 |
Family
ID=75127903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011421744.2A (published as CN112581508A, status Pending) | Video prediction method, video prediction device, computer equipment and storage medium | 2020-12-08 | 2020-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112581508A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610900A (en) * | 2021-10-11 | 2021-11-05 | 深圳佑驾创新科技有限公司 | Method and device for predicting scale change of vehicle tail sequence and computer equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111898573A (en) * | 2020-08-05 | 2020-11-06 | 上海眼控科技股份有限公司 | Image prediction method, computer device, and storage medium |
- 2020-12-08: application CN202011421744.2A filed in CN; published as CN112581508A (en); current status: Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111898573A (en) * | 2020-08-05 | 2020-11-06 | 上海眼控科技股份有限公司 | Image prediction method, computer device, and storage medium |
Non-Patent Citations (1)
Title |
---|
LI Sen; XU Hongke: "Video Frame Prediction Model Based on Spatio-Temporal Modeling", Internet of Things Technologies (物联网技术), no. 2, 20 February 2020 (2020-02-20), pages 66-69 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610900A (en) * | 2021-10-11 | 2021-11-05 | 深圳佑驾创新科技有限公司 | Method and device for predicting scale change of vehicle tail sequence and computer equipment |
CN113610900B (en) * | 2021-10-11 | 2022-02-15 | 深圳佑驾创新科技有限公司 | Method and device for predicting scale change of vehicle tail sequence and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021043168A1 (en) | Person re-identification network training method and person re-identification method and apparatus | |
CN110929622B (en) | Video classification method, model training method, device, equipment and storage medium | |
CN112613515B (en) | Semantic segmentation method, semantic segmentation device, computer equipment and storage medium | |
CN110176024B (en) | Method, device, equipment and storage medium for detecting target in video | |
CN110824587B (en) | Image prediction method, image prediction device, computer equipment and storage medium | |
CN109255351B (en) | Three-dimensional convolution neural network-based bounding box regression method, system, equipment and medium | |
CN111709471B (en) | Object detection model training method and object detection method and device | |
CN112749726B (en) | Training method and device for target detection model, computer equipment and storage medium | |
CN108564102A (en) | Image clustering evaluation of result method and apparatus | |
CN111178162B (en) | Image recognition method, device, computer equipment and storage medium | |
CN110166826B (en) | Video scene recognition method and device, storage medium and computer equipment | |
WO2021227787A1 (en) | Neural network predictor training method and apparatus, and image processing method and apparatus | |
US20210365726A1 (en) | Method and apparatus for object detection in image, vehicle, and robot | |
CN114641800A (en) | Method and system for forecasting crowd dynamics | |
CN111667001A (en) | Target re-identification method and device, computer equipment and storage medium | |
CN111898735A (en) | Distillation learning method, distillation learning device, computer equipment and storage medium | |
CN112966754B (en) | Sample screening method, sample screening device and terminal equipment | |
CN111047088A (en) | Prediction image acquisition method and device, computer equipment and storage medium | |
CN113449586A (en) | Target detection method, target detection device, computer equipment and storage medium | |
CN115984307A (en) | Video object segmentation method and device, electronic equipment and storage medium | |
CN112581508A (en) | Video prediction method, video prediction device, computer equipment and storage medium | |
Sharjeel et al. | Real time drone detection by moving camera using COROLA and CNN algorithm | |
CN110162689B (en) | Information pushing method, device, computer equipment and storage medium | |
CN111898573A (en) | Image prediction method, computer device, and storage medium | |
CN110163401A (en) | Prediction technique, data predication method and the device of time series |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||