CN111988622A - Video prediction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111988622A
Authority
CN
China
Prior art keywords
sequence
feature
representation matrix
compressed representation
fusion
Prior art date
Legal status
Granted
Application number
CN202010843348.2A
Other languages
Chinese (zh)
Other versions
CN111988622B (en)
Inventor
张伟
刘光灿
赵琦
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN202010843348.2A
Publication of CN111988622A
Application granted
Publication of CN111988622B
Legal status: Active (current)


Classifications

    • H04N 19/42: Coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural network learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • H04N 19/136: Adaptive coding characterised by incoming video signal characteristics or properties
    • H04N 19/14: Coding unit complexity, e.g. amount of activity or edge presence estimation
    • H04N 19/625: Transform coding using discrete cosine transform [DCT]


Abstract

The disclosure relates to a video prediction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: compressing the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence; predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence; and restoring the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.

Description

Video prediction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a video prediction method and apparatus, an electronic device, and a storage medium.
Background
In daily life, predicting the future is both unusually difficult and very important. Since ancient times, humans have taken great interest in predicting the future, and nature exhibits many phenomena that appear to follow internal laws; predicting such changes is a very challenging task. In the field of computer vision, this class of problems is typically modeled as prediction over spatio-temporal sequence data: the input is a sequence of several consecutive frames, and the output is the next frame or frames. How to increase the speed of video prediction is a technical problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides a video prediction technical solution.
According to an aspect of the present disclosure, there is provided a video prediction method, including:
compressing the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence;
predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence;
and restoring the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
The method compresses the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence, performs prediction according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence, and restores the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence. In this way, prediction is carried out on low-dimensional compressed representations rather than on full-size video frames, which helps increase the speed of video prediction.
In a possible implementation manner, after the obtaining of the first image sequence corresponding to the second compressed representation matrix sequence, the method further includes:
predicting according to at least part of video frames in the first video frame sequence to obtain a predicted second image sequence;
and obtaining a predicted third image sequence according to the first image sequence and the second image sequence.
In this implementation, a predicted second image sequence is obtained by performing prediction based on at least a part of the video frames in the first video frame sequence, and a predicted third image sequence is obtained based on the first image sequence and the second image sequence, so that video prediction can be performed based on richer image information, and accuracy of a prediction result can be further improved.
In a possible implementation manner, the compressing the acquired first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence includes:
performing discrete cosine transform on a collected first video frame sequence to obtain a first sparse representation matrix sequence corresponding to the first video frame sequence;
and randomly projecting the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
In this implementation, performing discrete cosine transform and random projection on the first video frame sequence reduces its dimensionality, yielding a lower-dimensional first compressed representation matrix sequence for subsequent prediction processing. Compared with compression by DFT, compression by discrete cosine transform and random projection produces no complex-valued results, so the computational complexity can be reduced. Compared with compression by PCA and similar methods, it can also increase the computation speed.
In a possible implementation manner, the predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence includes:
and inputting the first compressed representation matrix sequence into a first sub-neural network, and predicting through the first sub-neural network to obtain a second compressed representation matrix sequence corresponding to the first compressed representation matrix sequence.
In this implementation, the first sub-neural network performs prediction according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence, thereby improving prediction accuracy and prediction speed.
In a possible implementation manner, the restoring the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence includes:
recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence;
and performing inverse discrete cosine transform on the second sparse representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In this implementation, the second compressed representation matrix sequence is restored to obtain a second sparse representation matrix sequence, and the second sparse representation matrix sequence is subjected to inverse discrete cosine transform, so that each compressed representation matrix in the second compressed representation matrix sequence can be restored to an image with the same size as the image in the first video frame sequence.
In a possible implementation manner, the recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence includes:
and performing iterative processing on the second compressed representation matrix sequence by adopting an activation function of a soft threshold value to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence.
By adopting the activation function of the soft threshold value to carry out iterative processing on the second compressed representation matrix sequence, a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence can be quickly obtained, and the video prediction speed can be improved.
In a possible implementation manner, the predicting according to at least a part of the video frames in the first video frame sequence to obtain a predicted second image sequence includes:
inputting at least part of video frames in the first video frame sequence into a second sub-neural network, and predicting through the second sub-neural network to obtain a second image sequence corresponding to the at least part of video frames.
Processing at least part of the video frames in the first video frame sequence through the second sub-neural network to obtain the second image sequence makes it possible to obtain the effective image information needed for subsequent video prediction. The final prediction result is then obtained based on the second image sequence and the first image sequence, which can further improve the accuracy of video prediction.
In one possible implementation, the at least part of the video frames includes M video frames newly captured in the first video frame sequence, where M is a positive integer, and the number of video frames in the first video frame sequence is greater than or equal to M.
The M video frames newly collected in the first video frame sequence are adopted for prediction, so that the accuracy of video prediction is improved.
In a possible implementation manner, the obtaining a predicted third image sequence according to the first image sequence and the second image sequence includes:
performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence;
performing feature extraction on the second image sequence to obtain second features corresponding to the second image sequence;
performing first fusion processing according to the first characteristic and the second characteristic to obtain a first fusion characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic.
By combining the first feature extracted from the first image sequence with the second feature extracted from the second image sequence, video prediction can be performed based on richer image information, so that the accuracy of video prediction can be improved.
In a possible implementation manner, the obtaining a predicted third image sequence according to the first fusion feature includes:
performing residual processing on the first fusion characteristic to obtain a residual characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic and the residual characteristic.
Residual processing is carried out on the first fusion characteristic to obtain the residual characteristic, and the predicted third image sequence is obtained according to the first fusion characteristic and the residual characteristic, so that problems such as gradient vanishing in the neural network can be alleviated to a certain extent, further improving the accuracy of video prediction. In addition, during training of the neural network, performing the residual processing can increase the convergence speed of the neural network.
In one possible implementation, the first fused feature includes a plurality of levels;
the obtaining a predicted third image sequence according to the first fusion feature and the residual feature includes:
performing second fusion processing on the first fusion feature of the last stage to obtain a second fusion feature;
and obtaining a predicted third image sequence according to the second fusion characteristic and the residual characteristic.
In this implementation, the second fusion feature is obtained by performing the second fusion processing on the first fusion feature of the last stage, and the predicted third image sequence is obtained according to the second fusion feature and the residual feature, so that the accuracy of the predicted third image sequence can be improved.
In one possible implementation, the first fused feature includes a plurality of levels;
the performing residual processing on the first fusion feature to obtain a residual feature includes:
and carrying out residual processing on the first fusion characteristic of the first stage to obtain the residual characteristic.
The residual feature is obtained by performing residual processing on the first-level first fusion feature, and the predicted third image sequence is obtained according to the first fusion feature and the residual feature, so that problems such as gradient vanishing in the neural network can be alleviated to a certain extent, further improving the accuracy of video prediction.
In a possible implementation manner, the performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence includes:
performing multi-level feature extraction on the first image sequence to obtain multi-level first features corresponding to the first image sequence;
the feature extraction of the second image sequence to obtain a second feature corresponding to the second image sequence includes:
performing multi-level feature extraction on the second image sequence to obtain multi-level second features corresponding to the second image sequence;
the performing a first fusion process according to the first feature and the second feature to obtain a first fusion feature includes:
and for any stage in the plurality of stages, performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain a first fusion feature of the stage.
Multi-scale first features corresponding to the first image sequence can be obtained by performing multi-level feature extraction on the first image sequence; and performing multi-level feature extraction on the second image sequence to obtain multi-scale second features corresponding to the second image sequence. Therefore, by performing multi-level feature extraction on the first image sequence and the second image sequence, richer image information can be obtained. And performing first fusion processing by using the multi-scale first feature and the multi-scale second feature to obtain the multi-scale first fusion feature. According to the multi-scale first fusion feature, a more accurate third image sequence can be predicted.
In a possible implementation manner, the performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain the first fused feature of the stage includes:
in response to the stage not belonging to the last stage, performing feature fusion on the first feature of the stage, the second feature of the stage and the first fusion feature of a stage subsequent to the stage to obtain a first fusion feature of the stage;
and/or,
and in response to the stage belonging to the last stage, performing feature fusion on the first feature of the stage and the second feature of the stage to obtain a first fused feature of the stage.
In this way, the features of all levels can be fully fused, and multi-level first fusion features containing rich information can be obtained.
According to an aspect of the present disclosure, there is provided a video prediction apparatus including:
the compression module is used for compressing the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence;
the first prediction module is used for predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence;
and the recovery module is used for recovering the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the apparatus further includes:
the second prediction module is used for predicting according to at least part of video frames in the first video frame sequence to obtain a predicted second image sequence;
and the determining module is used for obtaining a predicted third image sequence according to the first image sequence and the second image sequence.
In one possible implementation, the compression module is configured to:
performing discrete cosine transform on a collected first video frame sequence to obtain a first sparse representation matrix sequence corresponding to the first video frame sequence;
and randomly projecting the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
In one possible implementation, the first prediction module is configured to:
and inputting the first compressed representation matrix sequence into a first sub-neural network, and predicting through the first sub-neural network to obtain a second compressed representation matrix sequence corresponding to the first compressed representation matrix sequence.
In one possible implementation, the recovery module is configured to:
recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence;
and performing inverse discrete cosine transform on the second sparse representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the recovery module is configured to:
and performing iterative processing on the second compressed representation matrix sequence by adopting an activation function of a soft threshold value to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the second prediction module is configured to:
inputting at least part of video frames in the first video frame sequence into a second sub-neural network, and predicting through the second sub-neural network to obtain a second image sequence corresponding to the at least part of video frames.
In one possible implementation, the at least part of the video frames includes M video frames newly captured in the first video frame sequence, where M is a positive integer, and the number of video frames in the first video frame sequence is greater than or equal to M.
In one possible implementation, the determining module is configured to:
performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence;
performing feature extraction on the second image sequence to obtain second features corresponding to the second image sequence;
performing first fusion processing according to the first characteristic and the second characteristic to obtain a first fusion characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic.
In one possible implementation, the determining module is configured to:
performing residual processing on the first fusion characteristic to obtain a residual characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic and the residual characteristic.
In one possible implementation, the first fused feature includes a plurality of levels;
the determination module is to:
performing second fusion processing on the first fusion feature of the last stage to obtain a second fusion feature;
and obtaining a predicted third image sequence according to the second fusion characteristic and the residual characteristic.
In one possible implementation, the first fused feature includes a plurality of levels;
the determination module is to:
and carrying out residual processing on the first fusion characteristic of the first stage to obtain a residual characteristic.
In one possible implementation, the determining module is configured to:
performing multi-level feature extraction on the first image sequence to obtain multi-level first features corresponding to the first image sequence;
performing multi-level feature extraction on the second image sequence to obtain multi-level second features corresponding to the second image sequence;
and for any stage in the plurality of stages, performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain a first fusion feature of the stage.
In one possible implementation, the determining module is configured to:
in response to the stage not belonging to the last stage, performing feature fusion on the first feature of the stage, the second feature of the stage and the first fusion feature of a stage subsequent to the stage to obtain a first fusion feature of the stage;
and/or,
and in response to the stage belonging to the last stage, performing feature fusion on the first feature of the stage and the second feature of the stage to obtain a first fused feature of the stage.
According to an aspect of the present disclosure, there is provided an electronic device including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, a first compressed representation matrix sequence corresponding to a first video frame sequence is obtained by compressing the collected first video frame sequence, a predicted second compressed representation matrix sequence is obtained by predicting according to the first compressed representation matrix sequence, and the second compressed representation matrix sequence is restored to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a video prediction method provided by an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a neural network provided by an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a compression prediction module in a neural network provided by an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of a convolution improving module in a neural network provided by an embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating a prediction result obtained by using a video prediction method provided by an embodiment of the present disclosure.
Fig. 6 shows another schematic diagram of a prediction result obtained by using the video prediction method provided by the embodiment of the present disclosure.
Fig. 7 shows a block diagram of a video prediction apparatus provided by an embodiment of the present disclosure.
Fig. 8 illustrates a block diagram of an electronic device 800 provided by an embodiment of the disclosure.
Fig. 9 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In the embodiment of the disclosure, a first compressed representation matrix sequence corresponding to a first video frame sequence is obtained by compressing the collected first video frame sequence, a predicted second compressed representation matrix sequence is obtained by predicting according to the first compressed representation matrix sequence, and the second compressed representation matrix sequence is restored to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
Fig. 1 shows a flowchart of a video prediction method provided by an embodiment of the present disclosure. The execution subject of the video prediction method may be a video prediction apparatus. For example, the video prediction method may be performed by a terminal device or a server or other processing device. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the video prediction method may be implemented by a processor calling computer readable instructions stored in a memory. As shown in fig. 1, the video prediction method includes steps S11 through S13.
In step S11, the collected first video frame sequence is compressed to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
In the embodiment of the present disclosure, the collected video may be split by frame to obtain a first video frame sequence. For example, the first video frame sequence may be denoted as I. The first video frame sequence comprises a plurality of video frames.
In an embodiment of the disclosure, the first compressed representation matrix sequence represents a sequence of matrices resulting from compression of the first video frame sequence. For example, if the first video frame sequence includes N video frames, the first compressed representation matrix sequence may include N compressed representation matrices, where the N compressed representation matrices are in one-to-one correspondence with the N video frames in the first video frame sequence, and are respectively used for representing image information after compression processing of the corresponding video frames in the first video frame sequence, for example, may be respectively used for representing compressed images after compression processing of the corresponding video frames in the first video frame sequence, where N is greater than 1. In an embodiment of the present disclosure, a dimension of any compressed representation matrix in the first sequence of compressed representation matrices is lower than a dimension of a matrix corresponding to any video frame in the first sequence of video frames, where the matrix corresponding to any video frame in the first sequence of video frames refers to a matrix representation form of the video frame.
In a possible implementation manner, the compressing the acquired first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence includes: performing Discrete Cosine Transform (DCT) on the acquired first video frame sequence to obtain a first sparse representation matrix sequence corresponding to the first video frame sequence; and randomly projecting the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
For example, the first sparse representation matrix sequence may be denoted as Z and the first compressed representation matrix sequence may be denoted as X.
In this implementation, a random projection method such as Gaussian random projection may be used to perform random projection on the first sparse representation matrix sequence. For example, Gaussian random matrices R1 and R2 may be used to randomly project the first sparse representation matrix sequence to obtain the first compressed representation matrix sequence.
In this implementation, performing discrete cosine transform on the first video frame sequence reduces its dimensionality and yields a first sparse representation matrix sequence with a lower dimension than the first video frame sequence; randomly projecting the first sparse representation matrix sequence reduces the dimensionality further and yields a first compressed representation matrix sequence with a lower dimension than the first sparse representation matrix sequence. Therefore, the dimensionality of the first video frame sequence can be reduced by discrete cosine transform and random projection, producing a lower-dimensional first compressed representation matrix sequence for subsequent prediction processing. Compared with compression using the DFT (Discrete Fourier Transform), compression by discrete cosine transform and random projection produces no complex-valued results, so the computational complexity can be reduced. Compared with compression using PCA (Principal Component Analysis) or similar methods, it can also increase the computation speed.
Of course, in other possible implementation manners, the first video frame sequence may also be compressed by using DFT, PCA and other manners to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence, which is not limited herein.
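As one hedged illustration of this compression step, the sketch below applies a 2-D DCT to each collected frame and then projects the sparse result with two Gaussian random matrices. The function name, the projection sizes m1 and m2, and the 1/sqrt scaling are assumptions made for the example, not details fixed by the disclosure.

```python
import numpy as np
from scipy.fft import dctn

def compress_frame_sequence(frames, m1, m2, rng=None):
    """Compress a first video frame sequence (N, H, W) into a first
    compressed representation matrix sequence (N, m1, m2).

    Each frame is turned into a sparse representation matrix by a 2-D
    discrete cosine transform, then reduced in dimension by two Gaussian
    random matrices R1 (m1 x H) and R2 (W x m2).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, h, w = frames.shape
    r1 = rng.normal(size=(m1, h)) / np.sqrt(m1)  # Gaussian random matrix R1
    r2 = rng.normal(size=(w, m2)) / np.sqrt(m2)  # Gaussian random matrix R2

    # First sparse representation matrix sequence Z (one DCT per frame).
    sparse_seq = np.stack([dctn(f, norm="ortho") for f in frames])
    # First compressed representation matrix sequence X = R1 Z R2.
    compressed_seq = np.einsum("ij,njk,kl->nil", r1, sparse_seq, r2)
    return compressed_seq, r1, r2
```

Under these assumptions, an H x W frame is reduced to an m1 x m2 compressed representation matrix before any prediction is run.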
In step S12, a prediction is performed based on the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence.
For example, the second compressed representation matrix sequence may be denoted as X*.
In the embodiment of the present disclosure, by predicting the first compressed representation matrix sequence, one or more items of image information (e.g., one or more images) following the first compressed representation matrix sequence in the time dimension may be obtained, and each item of image information may be represented by one compressed representation matrix, that is, the one or more items of image information may be represented by one or more compressed representation matrices, and the one or more compressed representation matrices constitute the second compressed representation matrix sequence. That is, the second sequence of compressed representation matrices may include one or more compressed representation matrices. The dimensions of any of the second sequence of compressed representation matrices may be the same as the dimensions of any of the first sequence of compressed representation matrices. Of course, the dimensions of the compressed representation matrices in the second sequence of compressed representation matrices may also be different from the dimensions of the compressed representation matrices in the first sequence of compressed representation matrices.
In a possible implementation manner, the predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence includes: and inputting the first compressed representation matrix sequence into a first sub-neural network, and predicting through the first sub-neural network to obtain a second compressed representation matrix sequence corresponding to the first compressed representation matrix sequence.
In this implementation, the first sub-neural network may be a Recurrent Neural Network (RNN), such as a Long Short-Term Memory (LSTM) network, a Regional Convolution based LSTM (RC-LSTM) network, or a Spatiotemporal LSTM (ST-LSTM) network, or it may be another type of neural network, which is not limited herein.
In this implementation, the first sub-neural network performs prediction according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence, thereby improving prediction accuracy and prediction speed.
In other possible implementation manners, the second compressed representation matrix sequence may be obtained by predicting according to the first compressed representation matrix sequence by a recursive model or the like that does not include a neural network.
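The following is a minimal sketch of how the first sub-neural network could be realised with a plain LSTM rolled out autoregressively over flattened compressed representation matrices; the disclosure also mentions RC-LSTM and ST-LSTM variants, and the hidden size, single-layer design, and one-step rollout here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CompressedSequencePredictor(nn.Module):
    """Illustrative first sub-network: an LSTM that reads the first
    compressed representation matrix sequence and autoregressively
    emits the second (predicted) compressed representation matrices."""

    def __init__(self, m1, m2, hidden_size=512):
        super().__init__()
        self.m1, self.m2 = m1, m2
        self.lstm = nn.LSTM(m1 * m2, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, m1 * m2)

    def forward(self, compressed_seq, num_future):
        # compressed_seq: (B, N, m1, m2) -> flatten each matrix to a vector
        b, n, _, _ = compressed_seq.shape
        x = compressed_seq.reshape(b, n, -1)
        _, state = self.lstm(x)                  # encode the observed sequence
        preds = []
        step = x[:, -1:, :]                      # start from the last observed matrix
        for _ in range(num_future):
            out, state = self.lstm(step, state)  # one-step rollout
            step = self.out(out)                 # next predicted compressed matrix
            preds.append(step)
        return torch.cat(preds, dim=1).reshape(b, num_future, self.m1, self.m2)
```

A convolutional or spatiotemporal LSTM operating on the matrices directly, rather than on flattened vectors, would be an equally valid realisation of the same step.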
In step S13, the second compressed representation matrix sequence is restored to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
For example, the first image sequence may be denoted as I*.
In the embodiment of the present disclosure, by performing recovery processing on the second compressed representation matrix sequence, the second compressed representation matrix sequence with a lower dimension may be mapped to the first image sequence with a higher dimension, thereby implementing image recovery. The dimension of a matrix corresponding to any image in the first image sequence is higher than that of any compressed representation matrix in the second compressed representation matrix sequence, wherein the matrix corresponding to any image in the first image sequence refers to a matrix representation form of the image. The size of any image in the first sequence of images may be equal to the size of any video frame in the first sequence of video frames. Of course, the size of the images in the first image sequence may not be equal to the size of the video frames in the first video frame sequence.
In the embodiment of the disclosure, a first compressed representation matrix sequence corresponding to a first video frame sequence is obtained by compressing the collected first video frame sequence, a predicted second compressed representation matrix sequence is obtained by predicting according to the first compressed representation matrix sequence, and the second compressed representation matrix sequence is restored to obtain a first image sequence corresponding to the second compressed representation matrix sequence. Because the prediction is performed on low-dimensional compressed representations rather than on the full-size frames, the amount of computation is reduced; therefore, the embodiment of the present disclosure is suitable not only for video prediction of small-sized video but also for video prediction of large-sized video.
In a possible implementation manner, the restoring the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence includes: recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence; and performing inverse discrete cosine transform on the second sparse representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In this implementation, the second compressed representation matrix sequence is predicted from the first compressed representation matrix sequence, where the first compressed representation matrix sequence may be obtained by discrete cosine transforming and random projecting the first video frame sequence.
In one example, the second sparse representation matrix sequence Z* corresponding to the second compressed representation matrix sequence can be obtained according to Equation 1:

Z* = argmin_Z ||X* - R_1 Z R_2||_F^2 + α||Z||_1    (Equation 1)

where X* denotes the second compressed representation matrix sequence, Z denotes the sparse representation matrix sequence being solved for, R_1 and R_2 denote the Gaussian random matrices, α denotes a penalty term coefficient, and argmin_Z denotes taking the Z that minimizes the objective. α may be set according to an empirical value, for example 0.5. Of course, those skilled in the art can flexibly set the value of α according to the requirements of the actual application scenario, which is not limited herein.
In this implementation, the second compressed representation matrix sequence is restored to obtain a second sparse representation matrix sequence, and the second sparse representation matrix sequence is subjected to inverse discrete cosine transform, so that each compressed representation matrix in the second compressed representation matrix sequence can be restored to an image with the same size as the image in the first video frame sequence.
As an example of this implementation, the performing recovery processing on the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence includes: and performing iterative processing on the second compressed representation matrix sequence by adopting an activation function of a soft threshold value to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence.
In one example, the second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence may be obtained according to Equation 2:

Z_{n+1} = h_θ(W_1 X* W_2 + S_1 Z_n S_2)    (Equation 2)

where h_θ(·) denotes the soft-threshold activation function, h_θ(x) = sign(x)·(|x| - θ)_+, θ is a learnable threshold that may be initialized to 0, W_1, W_2, S_1, and S_2 are learnable parameter matrices, Z_{n+1} denotes the result of the (n+1)-th iteration, Z_n denotes the result of the n-th iteration, and Z_0 may be set to 0.
In this example, when the difference between Z_{n+1} and Z_n falls within a preset range, the current Z_{n+1} may be taken as Z*; alternatively, the Z_{n+1} obtained after a preset number of iterations may be taken as Z*.
In this example, the second compressed representation matrix sequence is subjected to iterative processing by using an activation function of a soft threshold, so that a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence can be obtained quickly, and the video prediction speed can be improved.
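A minimal numerical sketch of this recovery step is shown below, under the assumption that the learnable matrices and threshold of Equation 2 are simply given as arrays; it iterates the soft-threshold update a fixed number of times and then applies an inverse 2-D DCT to map each recovered sparse matrix back to image size.

```python
import numpy as np
from scipy.fft import idctn

def soft_threshold(x, theta):
    """h_theta(x) = sign(x) * max(|x| - theta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def recover_image_sequence(x_star, w1, w2, s1, s2, theta=0.0, num_iters=20):
    """Recover a first image sequence from the predicted second compressed
    representation matrix sequence x_star of shape (K, m1, m2).

    w1 (H x m1), w2 (m2 x W), s1 (H x H), and s2 (W x W) stand in for the
    learnable parameter matrices of Equation 2, and theta for the learnable
    soft threshold; in the disclosure these would be learned, not given.
    """
    k = x_star.shape[0]
    h, w = w1.shape[0], w2.shape[1]
    z = np.zeros((k, h, w))  # Z_0 = 0
    for _ in range(num_iters):
        # Z_{n+1} = h_theta(W1 X* W2 + S1 Z_n S2)
        z = soft_threshold(
            np.einsum("ij,njk,kl->nil", w1, x_star, w2)
            + np.einsum("ij,njk,kl->nil", s1, z, s2),
            theta,
        )
    # Inverse discrete cosine transform maps each sparse matrix back to an image.
    return np.stack([idctn(zi, norm="ortho") for zi in z])
```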
In other possible implementation manners, other image reconstruction methods or image restoration methods in the related art may also be adopted to perform restoration processing on the second compressed representation matrix sequence, so as to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In a possible implementation manner, after the obtaining of the first image sequence corresponding to the second compressed representation matrix sequence, the method further includes: predicting according to at least part of video frames in the first video frame sequence to obtain a predicted second image sequence; and obtaining a predicted third image sequence according to the first image sequence and the second image sequence.
Wherein the third image sequence may comprise at least one image. The size of the images in the third sequence of images may be the same as the size of the video frames in the first sequence of video frames. Of course, the size of the images in the third sequence of images may also be different from the size of the video frames in the first sequence of video frames.
In this implementation, a predicted second image sequence is obtained by performing prediction based on at least a part of the video frames in the first video frame sequence, and a predicted third image sequence is obtained based on the first image sequence and the second image sequence, so that video prediction can be performed based on richer image information, and accuracy of a prediction result can be further improved.
As an example of this implementation, the predicting from at least some video frames in the first video frame sequence to obtain a predicted second image sequence includes: inputting at least part of video frames in the first video frame sequence into a second sub-neural network, and predicting through the second sub-neural network to obtain a second image sequence corresponding to the at least part of video frames.
In this example, the second sub-neural network may be a neural network based on an attention mechanism, or may be another neural network, which is not limited herein.
In this example, at least a part of the video frames in the first video frame sequence are processed by the second sub-neural network to obtain a second image sequence, so that effective image information required for subsequent video prediction can be obtained. Therefore, the final prediction result is obtained based on the second image sequence and the first image sequence, and the accuracy of video prediction can be further improved.
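As one hedged illustration of an attention-based second sub-network, the sketch below embeds the M most recent frames (flattened to vectors), applies self-attention across them, and regresses K future frames from the last frame's attended representation. The embedding size, head count, single attention layer, and one-shot regression are assumptions for the example rather than details from the disclosure.

```python
import torch
import torch.nn as nn

class AttentionFramePredictor(nn.Module):
    """Illustrative second sub-network based on an attention mechanism."""

    def __init__(self, frame_dim, num_future, embed_dim=256, num_heads=4):
        super().__init__()
        self.frame_dim = frame_dim
        self.num_future = num_future
        self.embed = nn.Linear(frame_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_future * frame_dim)

    def forward(self, frames):
        # frames: (B, M, frame_dim), the M most recently captured frames, flattened
        tokens = self.embed(frames)
        fused, _ = self.attn(tokens, tokens, tokens)   # self-attention over the M frames
        out = self.head(fused[:, -1, :])               # predict from the last frame's representation
        return out.view(-1, self.num_future, self.frame_dim)  # second image sequence (B, K, frame_dim)
```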
In other examples, a prediction model that does not include a neural network may be used to predict from at least some of the first sequence of video frames, resulting in a predicted second sequence of images.
As an example of this implementation, the at least part of the video frames comprises M video frames that are newly captured in the first video frame sequence, where M is a positive integer, and the number of video frames in the first video frame sequence is greater than or equal to M. For example, M is greater than 1. In this example, by using the latest acquired M video frames in the first video frame sequence for prediction, the accuracy of video prediction is improved.
In other examples, the at least some video frames may further include any M video frames in the first sequence of video frames.
As an example of this implementation, the deriving a predicted third image sequence from the first image sequence and the second image sequence includes: performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence; performing feature extraction on the second image sequence to obtain second features corresponding to the second image sequence; performing first fusion processing according to the first characteristic and the second characteristic to obtain a first fusion characteristic; and obtaining a predicted third image sequence according to the first fusion characteristic.
In this example, the first fusion process may be performed by concat or addition (Add), and the like, which is not limited herein. Wherein the first fusion feature represents a fusion feature obtained by performing a first fusion process on the basis of the first feature and the second feature.
In this example, by combining the first feature extracted from the first image sequence with the second feature extracted from the second image sequence, video prediction can be performed based on richer image information, so that the accuracy of video prediction can be improved.
In one example, the extracting features of the first image sequence to obtain first features corresponding to the first image sequence includes: performing multi-level feature extraction on the first image sequence to obtain multi-level first features corresponding to the first image sequence; the feature extraction of the second image sequence to obtain a second feature corresponding to the second image sequence includes: performing multi-level feature extraction on the second image sequence to obtain multi-level second features corresponding to the second image sequence; the performing a first fusion process according to the first feature and the second feature to obtain a first fusion feature includes: and for any stage in the plurality of stages, performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain a first fusion feature of the stage.
For example, three-level feature extraction may be performed on the first image sequence to obtain a first feature of a first level, a first feature of a second level, and a first feature of a third level corresponding to the first image sequence; performing three-level feature extraction on a second image sequence to obtain a second feature of a first level, a second feature of a second level and a second feature of a third level corresponding to the second image sequence; performing feature fusion according to the first feature of the first level and the second feature of the first level to obtain a first fusion feature of the first level; performing feature fusion according to the first feature of the second level and the second feature of the second level to obtain a first fusion feature of the second level; and performing feature fusion according to the first feature of the third level and the second feature of the third level to obtain a first fusion feature of the third level.
For example, a third sub-neural network comprising a plurality of convolutional layers may be adopted to perform multi-level feature extraction on a first image sequence, so as to obtain multi-level first features corresponding to the first image sequence; a fourth sub-neural network including a plurality of convolutional layers may be adopted to perform multi-level feature extraction on the second image sequence, so as to obtain a multi-level second feature corresponding to the second image sequence. Of course, the third sub-neural network and the fourth sub-neural network may also include other types of network layers, such as an activation layer, a pooling layer, and the like, which are not limited herein.
For any one of the multiple levels, concat processing may be performed according to the first feature of the level and the second feature of the level, so as to obtain a first fused feature of the level.
In this example, multi-scale first features corresponding to the first image sequence can be obtained by performing multi-level feature extraction on the first image sequence; and performing multi-level feature extraction on the second image sequence to obtain multi-scale second features corresponding to the second image sequence. Therefore, by performing multi-level feature extraction on the first image sequence and the second image sequence, richer image information can be obtained. And performing first fusion processing by using the multi-scale first feature and the multi-scale second feature to obtain the multi-scale first fusion feature. According to the multi-scale first fusion feature, a more accurate third image sequence can be predicted.
In this example, the performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain the first fused feature of the stage may include: in response to the stage not belonging to the last stage, performing feature fusion on the first feature of the stage, the second feature of the stage and the first fusion feature of a stage subsequent to the stage to obtain a first fusion feature of the stage; and/or, in response to the stage belonging to the last stage, performing feature fusion on the first feature of the stage and the second feature of the stage to obtain a first fused feature of the stage.
For example, for a third level of the three levels, feature fusion may be performed on the first feature of the third level and the second feature of the third level to obtain a first fused feature of the third level; for the second level of the three levels, performing feature fusion on the first feature of the second level, the second feature of the second level and the first fusion feature of the third level to obtain a first fusion feature of the second level; for a first stage of the three stages, feature fusion may be performed on a first feature of the first stage, a second feature of the first stage, and a first fusion feature of the second stage to obtain a first fusion feature of the first stage.
In this way, the features of all levels can be fully fused, and multi-level first fusion features containing rich information can be obtained.
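To make the three-level example concrete, the following sketch uses two small convolutional towers (stand-ins for the third and fourth sub-networks) to extract three levels of features from the first and second image sequences, and fuses each level by concatenation, upsampling the coarser level's fused feature into the finer level. All channel counts, strides, and the choice of nearest-neighbour upsampling are illustrative assumptions; the input spatial size is assumed divisible by 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeLevelFusion(nn.Module):
    """Illustrative three-level feature extraction and first fusion.

    first_images / second_images: tensors of shape (B, C, H, W), e.g. an
    image sequence stacked along the channel dimension.
    """

    def __init__(self, in_channels, base_channels=32):
        super().__init__()
        def tower():
            return nn.ModuleList([
                nn.Conv2d(in_channels, base_channels, 3, stride=2, padding=1),
                nn.Conv2d(base_channels, base_channels, 3, stride=2, padding=1),
                nn.Conv2d(base_channels, base_channels, 3, stride=2, padding=1),
            ])
        self.tower_a = tower()   # features of the first image sequence
        self.tower_b = tower()   # features of the second image sequence

    def extract(self, tower, x):
        feats = []
        for conv in tower:
            x = F.relu(conv(x))
            feats.append(x)
        return feats             # [level 1, level 2, level 3]

    def forward(self, first_images, second_images):
        a = self.extract(self.tower_a, first_images)
        b = self.extract(self.tower_b, second_images)
        # Level 3 (last level): fuse the two feature maps directly.
        fused3 = torch.cat([a[2], b[2]], dim=1)
        # Level 2: also concatenate the upsampled level-3 fused feature.
        fused2 = torch.cat([a[1], b[1], F.interpolate(fused3, scale_factor=2)], dim=1)
        # Level 1: also concatenate the upsampled level-2 fused feature.
        fused1 = torch.cat([a[0], b[0], F.interpolate(fused2, scale_factor=2)], dim=1)
        return fused1, fused2, fused3
```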
In one example, the deriving a predicted third image sequence according to the first fusion feature includes: performing residual processing on the first fusion feature to obtain a residual feature; and obtaining a predicted third image sequence according to the first fusion feature and the residual feature.
In this example, the first fusion feature may be subjected to residual processing by a residual network or a residual function, so as to obtain a residual feature.
In this example, residual processing is performed on the first fusion feature to obtain the residual feature, and the predicted third image sequence is obtained according to the first fusion feature and the residual feature, so that problems such as gradient vanishing in the neural network can be alleviated to a certain extent, further improving the accuracy of video prediction. In addition, during training of the neural network, performing the residual processing can increase the convergence speed of the neural network.
In one example, the first fusion feature includes a plurality of levels; the performing residual processing on the first fusion feature to obtain a residual feature includes: performing residual processing on the first fusion feature of the first stage to obtain the residual feature.
For example, the first fused feature of the first stage may be input to a fifth sub-neural network, and the residual feature may be output via the fifth sub-neural network.
In this example, residual processing is performed on the first fusion feature of the first stage to obtain the residual feature, and a predicted third image sequence is obtained according to the first fusion feature and the residual feature. This can alleviate problems such as vanishing gradients in the neural network to a certain extent, thereby further improving the accuracy of video prediction.
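A minimal sketch of such a residual branch is given below (PyTorch). The layer types, strides, and channel counts are assumptions; the embodiment only requires that a residual feature be derived from the first fusion feature of the first stage, for example by the fifth sub-neural network.

```python
import torch
from torch import nn

class ResidualEncoder(nn.Module):
    """Produces a residual feature from the first-stage first fusion feature."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, first_stage_fusion):
        return self.body(first_stage_fusion)

# Example: a 224-channel first-stage fusion feature mapped to a 64-channel residual feature.
residual_encoder = ResidualEncoder(224, 64)
residual = residual_encoder(torch.randn(1, 224, 64, 64))   # -> (1, 64, 16, 16)
```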
In another example, the multi-level first fused feature may be subjected to residual processing to obtain a residual feature.
In another example, the first fused feature may include only one level.
In one example, the first fusion feature includes a plurality of levels; the obtaining a predicted third image sequence according to the first fusion feature and the residual feature includes: performing second fusion processing on the first fusion feature of the last stage to obtain a second fusion feature; and obtaining a predicted third image sequence according to the second fusion feature and the residual feature. In this example, the second fusion feature is obtained by performing the second fusion processing on the first fusion feature of the last stage, and the predicted third image sequence is obtained according to the second fusion feature and the residual feature, so that the accuracy of the predicted third image sequence can be improved.
The second fusion feature represents a fusion feature obtained by performing second fusion processing on the first fusion feature of the last stage.
In this example, the first fused feature of the last stage (e.g., the first fused feature of the third stage) may be input to a sixth sub-neural network including one or more convolutional layers, and the second fused feature may be obtained by performing a second fusion process on the first fused feature of the last stage via the sixth sub-neural network. For example, the sixth sub-neural network may include 3 convolutional layers. Of course, the sixth sub-neural network may also include other types of network layers, such as an activation layer, a pooling layer, and the like, which are not limited herein.
In this example, the second fusion feature and the residual feature may be subjected to deconvolution processing to obtain the predicted third image sequence. For example, multi-stage deconvolution processing may be performed on the second fusion feature and the residual feature to obtain the predicted third image sequence. For example, the second fusion feature and the residual feature may be input to a seventh sub-neural network, and the predicted third image sequence may be obtained via the seventh sub-neural network. The seventh sub-neural network may include a plurality of deconvolution layers, and the third image sequence may be obtained by performing deconvolution processing through the plurality of deconvolution layers. Of course, the seventh sub-neural network may also include other types of network layers, such as a pooling layer, a fully connected layer, and the like, which are not limited herein.
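The following sketch illustrates one possible arrangement of the second fusion processing (e.g., a sixth sub-neural network with three convolutional layers) and the deconvolution-based decoding (e.g., the seventh sub-neural network). The way the second fusion feature and the residual feature are combined, as well as all layer sizes, are assumptions for illustration only.

```python
import torch
from torch import nn

class FusionEncoder(nn.Module):
    """Second fusion processing on the last-stage first fusion feature (three convolutional layers)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, last_stage_fusion):
        return self.net(last_stage_fusion)

class FinalDecoder(nn.Module):
    """Multi-stage deconvolution (transposed convolution) producing the predicted frames."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, second_fusion, residual):
        # One simple way to combine the two inputs is channel-wise concatenation,
        # assuming they share the same spatial resolution.
        return self.net(torch.cat([second_fusion, residual], dim=1))

# Example: 128-channel second fusion feature + 64-channel residual feature at 16x16 -> 3-channel 64x64 frame.
fusion_encoder = FusionEncoder(128, 128)
decoder = FinalDecoder(128 + 64, 3)
frame = decoder(fusion_encoder(torch.randn(1, 128, 16, 16)), torch.randn(1, 64, 16, 16))
```

The example shapes are chosen so that two 2x upsampling steps return the decoder output to the original frame resolution.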
Fig. 2 shows a schematic diagram of a neural network provided by an embodiment of the present disclosure. The neural network may be referred to as a Deep Compressed Prediction Network (DCPNet). As shown in Fig. 2, the neural network may include two modules, namely a Compressed Prediction Module (CPM) and a Convolution Improvement Module (CRM). The input of the compressed prediction module may be the first video frame sequence and its output may be the first image sequence; the input of the convolution improvement module may be the first image sequence together with the most recently acquired M video frames of the first video frame sequence, and its output may be the third image sequence. In the embodiment of the present disclosure, the low-frequency information in the first video frame sequence can be extracted by the compressed prediction module, and the high-frequency information in the first video frame sequence can be extracted by the convolution improvement module, which improves the interpretability of the neural network. The neural network provided by the embodiment of the present disclosure has a relatively small number of parameters, which reduces computational complexity and improves computation speed.
Fig. 3 shows a schematic diagram of the compression prediction module in the neural network provided by an embodiment of the present disclosure. As shown in Fig. 3, the input of the compression prediction module may be the first video frame sequence and the output may be the first image sequence. The compression prediction module may include a compression unit (Compressing Cell), a prediction unit (Predicting Cell), and a recovery unit (Recovering Cell). The compression unit may be configured to perform Discrete Cosine Transform (DCT) on the first video frame sequence to obtain a first sparse representation matrix sequence, and to perform Random Projection on the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence. The prediction unit may be configured to predict from the first compressed representation matrix sequence by an RNN (e.g., RC-LSTM or ST-LSTM) to obtain a predicted second compressed representation matrix sequence; for example, the prediction unit may be implemented using the first sub-neural network described above. The recovery unit may be configured to perform iterative processing on the second compressed representation matrix sequence by using a D-LISTA algorithm or the like to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence, and to perform Inverse Discrete Cosine Transform (iDCT) on the second sparse representation matrix sequence to obtain the first image sequence corresponding to the second compressed representation matrix sequence.
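To make the data flow of the compression and recovery units concrete, the sketch below uses NumPy/SciPy: a 2-D DCT produces the sparse representation, a random projection matrix produces the compressed representation, and a plain ISTA loop with a soft-threshold function stands in for the D-LISTA recovery (the learned D-LISTA parameters and the RNN prediction step are omitted). Matrix sizes, the number of iterations, and the threshold are assumptions.

```python
# Illustrative sketch of the compression and recovery units (hypothetical shapes and helper names).
import numpy as np
from scipy.fft import dctn, idctn

def soft_threshold(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def compress_frame(frame, phi):
    """DCT -> sparse representation matrix, then random projection -> compressed representation matrix."""
    sparse = dctn(frame, norm='ortho')           # first sparse representation matrix
    return phi @ sparse                          # random projection along the row dimension

def recover_frame(compressed, phi, num_iters=50, lam=0.1):
    """ISTA-style recovery with a soft-threshold activation, followed by inverse DCT."""
    step = 1.0 / np.linalg.norm(phi, 2) ** 2
    sparse = np.zeros((phi.shape[1], compressed.shape[1]))
    for _ in range(num_iters):
        grad = phi.T @ (phi @ sparse - compressed)
        sparse = soft_threshold(sparse - step * grad, lam * step)
    return idctn(sparse, norm='ortho')           # recovered frame

# Example: compress a 64x64 frame to a 32x64 compressed representation and recover it.
rng = np.random.default_rng(0)
phi = rng.standard_normal((32, 64)) / np.sqrt(32)   # random projection matrix
frame = rng.standard_normal((64, 64))
y = compress_frame(frame, phi)
rec = recover_frame(y, phi)
```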
Fig. 4 shows a schematic diagram of the convolution improvement module in the neural network provided by an embodiment of the present disclosure. As shown in Fig. 4, the convolution improvement module may include an attention module (Frame Attention Module), a trajectory encoder (Trajectory Encoder), a detail encoder (Frame Encoder), a residual encoder (Residual Encoder), a fusion encoder (Combination Encoder), and a final decoder (Final Decoder). For example, the attention module may be implemented by the second sub-neural network, the trajectory encoder may be implemented by the third sub-neural network, the detail encoder may be implemented by the fourth sub-neural network, the residual encoder may be implemented by the fifth sub-neural network, the fusion encoder may be implemented by the sixth sub-neural network, and the final decoder may be implemented by the seventh sub-neural network.
The attention module may be configured to perform prediction according to the M most recently acquired video frames in the first video frame sequence to obtain the second image sequence. The trajectory encoder may be configured to perform three-level feature extraction on the first image sequence output by the Compression Prediction Module (CPM) to obtain the first feature of the first level, the first feature of the second level, and the first feature of the third level corresponding to the first image sequence. The detail encoder may be configured to perform three-level feature extraction on the second image sequence to obtain the second feature of the first level, the second feature of the second level, and the second feature of the third level corresponding to the second image sequence. Concat processing may be performed on the first feature of the third level and the second feature of the third level to obtain the first fusion feature of the third level; concat processing may be performed on the first feature of the second level, the second feature of the second level, and the first fusion feature of the third level to obtain the first fusion feature of the second level; and concat processing may be performed on the first feature of the first level, the second feature of the first level, and the first fusion feature of the second level to obtain the first fusion feature of the first level. The fusion encoder may be configured to perform second fusion processing on the first fusion feature of the third level to obtain the second fusion feature; the residual encoder may be configured to perform residual processing on the first fusion feature of the first level to obtain the residual feature. The second fusion feature and the residual feature are then input to the final decoder to obtain the predicted third image sequence.
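Under the assumption that the encoders and decoder behave like the illustrative classes sketched earlier, the data flow of the convolution improvement module can be summarized as follows; the function below is only a wiring sketch, not the actual implementation.

```python
def convolution_improvement_forward(first_images, recent_frames,
                                    attention_module, trajectory_encoder, detail_encoder,
                                    residual_encoder, fusion_encoder, final_decoder,
                                    fuse_levels):
    """first_images: output of the compression prediction module; recent_frames: the M newest video frames."""
    second_images = attention_module(recent_frames)      # second image sequence
    first_feats = trajectory_encoder(first_images)       # three levels of first features
    second_feats = detail_encoder(second_images)         # three levels of second features
    fused = fuse_levels(first_feats, second_feats)       # first fusion features per level
    second_fusion = fusion_encoder(fused[-1])            # second fusion feature (third level)
    residual = residual_encoder(fused[0])                # residual feature (first level)
    return final_decoder(second_fusion, residual)        # predicted third image sequence
```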
In the embodiment of the present disclosure, the optimizer of the neural network may adopt Adam, SGD, AdaGrad, RMSProp, or the like. The learning rate of the neural network may be set to 0.01. Of course, those skilled in the art may also flexibly set the learning rate according to the requirements of the actual application scenario. The neural network may be trained in a deep learning framework such as TensorFlow, PyTorch, or MXNet.
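For illustration, a minimal training step consistent with the settings above might look as follows; the loss function and data loader are assumptions, and only the Adam optimizer choice and the 0.01 learning rate come from the text.

```python
import torch
from torch import nn

def train_one_epoch(model: nn.Module, dataloader, lr: float = 0.01):
    """One epoch of training with the Adam optimizer and the 0.01 learning rate noted above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()   # assumed per-pixel reconstruction loss; not specified in the text
    model.train()
    for frames, target_frames in dataloader:
        optimizer.zero_grad()
        predicted = model(frames)              # predicted future frames
        loss = criterion(predicted, target_frames)
        loss.backward()
        optimizer.step()
```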
The method and apparatus can be applied to application scenarios such as autonomous driving, weather forecasting, and traffic flow monitoring and prediction. For example, in autonomous driving, the collected video may be analyzed to determine the traffic conditions ahead and the movement states of pedestrians and vehicles, and the video prediction method provided by the embodiments of the present disclosure can be used to quickly predict the movement states several seconds later and determine whether a collision will occur. For another example, in weather forecasting, the video prediction method provided by the embodiments of the present disclosure may be used to learn from weather radar maps and predict future weather conditions. For another example, in intelligent transportation, the video prediction method provided by the embodiments of the present disclosure may be used to learn from historical traffic information and traffic flow information to predict future traffic conditions, thereby helping to improve traffic conditions and optimize the allocation of public transportation resources.
Fig. 5 is a schematic diagram illustrating a prediction result obtained by using a video prediction method provided by an embodiment of the present disclosure. In fig. 5, the first line represents an input video frame sequence (e.g., a first video frame sequence), the second line represents a true value or a predicted target, and the third line represents a predicted image sequence (e.g., a third image sequence).
Fig. 6 shows another schematic diagram of a prediction result obtained by using the video prediction method provided by the embodiment of the present disclosure. In fig. 6, the first row represents an incoming video frame sequence (e.g., a first video frame sequence), the second row represents a true value or a predicted target, and the third row represents a predicted image sequence (e.g., a third image sequence).
It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from the principles and logic; due to space limitations, details are not repeated in the present disclosure.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a video prediction apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the video prediction methods provided by the present disclosure; for the corresponding technical solutions and descriptions, reference may be made to the corresponding descriptions in the method section, which are not repeated here.
Fig. 7 shows a block diagram of a video prediction apparatus provided by an embodiment of the present disclosure. As shown in Fig. 7, the video prediction apparatus includes: a compression module 71, configured to perform compression processing on a collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence; a first prediction module 72, configured to perform prediction according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence; and a recovery module 73, configured to perform recovery processing on the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the apparatus further includes: the second prediction module is used for predicting according to at least part of video frames in the first video frame sequence to obtain a predicted second image sequence; and the determining module is used for obtaining a predicted third image sequence according to the first image sequence and the second image sequence.
In one possible implementation, the compression module 71 is configured to: performing discrete cosine transform on a collected first video frame sequence to obtain a first sparse representation matrix sequence corresponding to the first video frame sequence; and randomly projecting the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
In one possible implementation, the first prediction module 72 is configured to: and inputting the first compressed representation matrix sequence into a first sub-neural network, and predicting through the first sub-neural network to obtain a second compressed representation matrix sequence corresponding to the first compressed representation matrix sequence.
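As a concrete but simplified stand-in for the first sub-neural network, the sketch below flattens each compressed representation matrix and feeds the sequence to a standard LSTM that is rolled out autoregressively. The disclosure mentions recurrent variants such as RC-LSTM or ST-LSTM, which are not reproduced here, and all dimensions are assumptions.

```python
import torch
from torch import nn

class CompressedSequencePredictor(nn.Module):
    """Predicts future compressed representation matrices from past ones (plain LSTM stand-in)."""
    def __init__(self, rows, cols, hidden=256):
        super().__init__()
        self.rows, self.cols = rows, cols
        self.lstm = nn.LSTM(rows * cols, hidden, batch_first=True)
        self.head = nn.Linear(hidden, rows * cols)

    def forward(self, seq, steps):
        # seq: (batch, T, rows, cols) -> predict `steps` future matrices autoregressively.
        b = seq.size(0)
        x = seq.flatten(2)                       # (batch, T, rows*cols)
        out, state = self.lstm(x)
        preds = [self.head(out[:, -1])]
        for _ in range(steps - 1):
            out, state = self.lstm(preds[-1].unsqueeze(1), state)
            preds.append(self.head(out[:, -1]))
        return torch.stack(preds, dim=1).view(b, steps, self.rows, self.cols)

# Example: predict 5 future 32x64 compressed representation matrices from 10 observed ones.
model = CompressedSequencePredictor(32, 64)
future = model(torch.randn(2, 10, 32, 64), steps=5)   # (2, 5, 32, 64)
```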
In one possible implementation, the recovery module 73 is configured to: recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence; and performing inverse discrete cosine transform on the second sparse representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the recovery module 73 is configured to: and performing iterative processing on the second compressed representation matrix sequence by adopting an activation function of a soft threshold value to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence.
In one possible implementation, the second prediction module is configured to: inputting at least part of video frames in the first video frame sequence into a second sub-neural network, and predicting through the second sub-neural network to obtain a second image sequence corresponding to the at least part of video frames.
In one possible implementation, the at least part of the video frames includes M video frames newly captured in the first video frame sequence, where M is a positive integer, and the number of video frames in the first video frame sequence is greater than or equal to M.
In one possible implementation, the determining module is configured to: performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence; performing feature extraction on the second image sequence to obtain a second feature corresponding to the second image sequence; performing first fusion processing according to the first feature and the second feature to obtain a first fusion feature; and obtaining a predicted third image sequence according to the first fusion feature.
In one possible implementation, the determining module is configured to: performing residual processing on the first fusion feature to obtain a residual feature; and obtaining a predicted third image sequence according to the first fusion feature and the residual feature.
In one possible implementation, the first fusion feature includes a plurality of levels; the determining module is configured to: performing second fusion processing on the first fusion feature of the last stage to obtain a second fusion feature; and obtaining a predicted third image sequence according to the second fusion feature and the residual feature.
In one possible implementation, the first fusion feature includes a plurality of levels; the determining module is configured to: performing residual processing on the first fusion feature of the first stage to obtain the residual feature.
In one possible implementation, the determining module is configured to: performing multi-level feature extraction on the first image sequence to obtain multi-level first features corresponding to the first image sequence; performing multi-level feature extraction on the second image sequence to obtain multi-level second features corresponding to the second image sequence; and for any stage in the plurality of stages, performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain a first fusion feature of the stage.
In one possible implementation, the determining module is configured to: in response to the stage not belonging to the last stage, performing feature fusion on the first feature of the stage, the second feature of the stage and the first fusion feature of a stage subsequent to the stage to obtain a first fusion feature of the stage; and/or, in response to the stage belonging to the last stage, performing feature fusion on the first feature of the stage and the second feature of the stage to obtain a first fused feature of the stage.
In the embodiment of the disclosure, a first compressed representation matrix sequence corresponding to a first video frame sequence is obtained by compressing the collected first video frame sequence, a predicted second compressed representation matrix sequence is obtained by predicting according to the first compressed representation matrix sequence, and the second compressed representation matrix sequence is restored to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-described method. The computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium.
The disclosed embodiments also provide a computer program product including computer-readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the video prediction method provided in any of the above embodiments.
The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the video prediction method provided in any of the above embodiments.
An embodiment of the present disclosure further provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 8 illustrates a block diagram of an electronic device 800 provided by an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or other such terminal.
Referring to fig. 8, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G/LTE, 5G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 9 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 9, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described methods.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows, Mac OS, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A method for video prediction, comprising:
compressing the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence;
predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence;
and restoring the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
2. The method of claim 1, wherein after said obtaining the first sequence of images corresponding to the second sequence of compressed representation matrices, the method further comprises:
predicting according to at least part of video frames in the first video frame sequence to obtain a predicted second image sequence;
and obtaining a predicted third image sequence according to the first image sequence and the second image sequence.
3. The method according to claim 1 or 2, wherein the compressing the acquired first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence comprises:
performing discrete cosine transform on a collected first video frame sequence to obtain a first sparse representation matrix sequence corresponding to the first video frame sequence;
and randomly projecting the first sparse representation matrix sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence.
4. The method according to any one of claims 1 to 3, wherein the predicting from the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence comprises:
and inputting the first compressed representation matrix sequence into a first sub-neural network, and predicting through the first sub-neural network to obtain a second compressed representation matrix sequence corresponding to the first compressed representation matrix sequence.
5. The method according to claim 3, wherein the recovering the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence comprises:
recovering the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence;
and performing inverse discrete cosine transform on the second sparse representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
6. The method according to claim 5, wherein the performing the recovery processing on the second compressed representation matrix sequence to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence comprises:
and performing iterative processing on the second compressed representation matrix sequence by adopting an activation function of a soft threshold value to obtain a second sparse representation matrix sequence corresponding to the second compressed representation matrix sequence.
7. The method of claim 2, wherein predicting from at least some of the video frames in the first sequence of video frames to obtain a predicted second sequence of images comprises:
inputting at least part of video frames in the first video frame sequence into a second sub-neural network, and predicting through the second sub-neural network to obtain a second image sequence corresponding to the at least part of video frames.
8. A method as claimed in claim 2 or 7, wherein the at least part of the video frames comprises M video frames of the first sequence of video frames, M being a positive integer, and wherein the number of video frames in the first sequence of video frames is greater than or equal to M.
9. The method according to claim 7 or 8, wherein deriving a predicted third image sequence from the first image sequence and the second image sequence comprises:
performing feature extraction on the first image sequence to obtain a first feature corresponding to the first image sequence;
performing feature extraction on the second image sequence to obtain second features corresponding to the second image sequence;
performing first fusion processing according to the first characteristic and the second characteristic to obtain a first fusion characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic.
10. The method according to claim 9, wherein deriving the predicted third image sequence based on the first fused feature comprises:
performing residual error processing on the first fusion characteristic to obtain a residual error characteristic;
and obtaining a predicted third image sequence according to the first fusion characteristic and the residual error characteristic.
11. The method of claim 10, wherein the first fused feature comprises a plurality of levels;
obtaining a predicted third image sequence according to the first fusion feature and the residual error feature, wherein the predicted third image sequence comprises:
performing second fusion processing on the first fusion feature of the last stage to obtain a second fusion feature;
and obtaining a predicted third image sequence according to the second fusion characteristic and the residual error characteristic.
12. The method of claim 10 or 11, wherein the first fused feature comprises a plurality of levels;
the residual error processing is performed on the first fusion feature to obtain a residual error feature, and the residual error feature includes:
and carrying out residual error processing on the first fusion characteristic of the first stage to obtain a residual error characteristic.
13. The method according to any one of claims 9 to 12, wherein the extracting features from the first image sequence to obtain first features corresponding to the first image sequence comprises:
performing multi-level feature extraction on the first image sequence to obtain multi-level first features corresponding to the first image sequence;
the feature extraction of the second image sequence to obtain a second feature corresponding to the second image sequence includes:
performing multi-level feature extraction on the second image sequence to obtain multi-level second features corresponding to the second image sequence;
the performing a first fusion process according to the first feature and the second feature to obtain a first fusion feature includes:
and for any stage in the plurality of stages, performing feature fusion according to the first feature of the stage and the second feature of the stage to obtain a first fusion feature of the stage.
14. The method of claim 13, wherein said performing feature fusion based on the first feature of the stage and the second feature of the stage to obtain the first fused feature of the stage comprises:
in response to the stage not belonging to the last stage, performing feature fusion on the first feature of the stage, the second feature of the stage and the first fusion feature of a stage subsequent to the stage to obtain a first fusion feature of the stage;
and/or,
in response to the stage belonging to the last stage, performing feature fusion on the first feature of the stage and the second feature of the stage to obtain a first fused feature of the stage.
15. A video prediction apparatus, comprising:
the compression module is used for compressing the collected first video frame sequence to obtain a first compressed representation matrix sequence corresponding to the first video frame sequence;
the first prediction module is used for predicting according to the first compressed representation matrix sequence to obtain a predicted second compressed representation matrix sequence;
and the recovery module is used for recovering the second compressed representation matrix sequence to obtain a first image sequence corresponding to the second compressed representation matrix sequence.
16. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the method of any one of claims 1 to 14.
17. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 14.
CN202010843348.2A 2020-08-20 2020-08-20 Video prediction method and device, electronic equipment and storage medium Active CN111988622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843348.2A CN111988622B (en) 2020-08-20 2020-08-20 Video prediction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010843348.2A CN111988622B (en) 2020-08-20 2020-08-20 Video prediction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111988622A true CN111988622A (en) 2020-11-24
CN111988622B CN111988622B (en) 2021-12-10

Family

ID=73443433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843348.2A Active CN111988622B (en) 2020-08-20 2020-08-20 Video prediction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111988622B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508013A (en) * 2020-12-02 2021-03-16 哈尔滨市科佳通用机电股份有限公司 Lock catch loss fault detection method, system and device
WO2023207872A1 (en) * 2022-04-27 2023-11-02 维沃移动通信有限公司 Video encoding and decoding method, video codec and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180160071A1 (en) * 2016-12-07 2018-06-07 Alcatel -Lucent Usa Inc. Feature Detection In Compressive Imaging
CN109309840A (en) * 2018-09-05 2019-02-05 开封大学 Compressed sensing image encoding method based on noise shaping, storage medium
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition methods and device, electronic equipment, storage medium, product
CN110322009A (en) * 2019-07-19 2019-10-11 南京梅花软件系统股份有限公司 Image prediction method based on the long Memory Neural Networks in short-term of multilayer convolution
CN110398957A (en) * 2019-06-18 2019-11-01 平安科技(深圳)有限公司 Automatic Pilot behavior prediction method, apparatus, computer equipment and storage medium
CN111031351A (en) * 2020-03-11 2020-04-17 北京三快在线科技有限公司 Method and device for predicting target object track
CN111464810A (en) * 2020-04-09 2020-07-28 上海眼控科技股份有限公司 Video prediction method, video prediction device, computer equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111988622B (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110378976B (en) Image processing method and device, electronic equipment and storage medium
CN112241673B (en) Video processing method and device, electronic equipment and storage medium
CN111783756B (en) Text recognition method and device, electronic equipment and storage medium
CN109766954B (en) Target object processing method and device, electronic equipment and storage medium
US11301726B2 (en) Anchor determination method and apparatus, electronic device, and storage medium
CN110909815B (en) Neural network training method, neural network training device, neural network processing device, neural network training device, image processing device and electronic equipment
CN109919300B (en) Neural network training method and device and image processing method and device
CN111881956A (en) Network training method and device, target detection method and device and electronic equipment
CN109615006B (en) Character recognition method and device, electronic equipment and storage medium
CN110633700B (en) Video processing method and device, electronic equipment and storage medium
CN110458218B (en) Image classification method and device and classification network training method and device
CN109165738B (en) Neural network model optimization method and device, electronic device and storage medium
CN110543849B (en) Detector configuration method and device, electronic equipment and storage medium
CN111223040A (en) Network training method and device and image generation method and device
CN110188865B (en) Information processing method and device, electronic equipment and storage medium
JP2022533065A (en) Character recognition methods and devices, electronic devices and storage media
CN111988622B (en) Video prediction method and device, electronic equipment and storage medium
CN111523599B (en) Target detection method and device, electronic equipment and storage medium
CN111523555A (en) Image processing method and device, electronic equipment and storage medium
WO2022247091A1 (en) Crowd positioning method and apparatus, electronic device, and storage medium
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN113506229B (en) Neural network training and image generating method and device
CN109635926A (en) Attention characteristic-acquisition method, device and storage medium for neural network
CN112801116B (en) Image feature extraction method and device, electronic equipment and storage medium
CN111860552A (en) Model training method and device based on nuclear self-encoder and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant