CN110913218A

CN110913218A - Video frame prediction method and device and terminal equipment

Info

Publication number: CN110913218A
Application number: CN201911199747.3A
Authority: CN
Inventors: 李东阳
Original assignee: Hefei Map Duck Mdt Infotech Ltd
Current assignee: Hefei Map Duck Mdt Infotech Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-03-24

Abstract

The invention belongs to the technical field of video compression and provides a video frame prediction method, a video frame prediction device and terminal equipment. The method comprises the following steps: calculating optical flow information between the current frame and the reference frame; inputting the optical flow information, the reference frame and the current frame into a motion compensation network to obtain motion compensation feature information, wherein an up-sampling layer of the motion compensation network comprises N convolutional layers; entropy coding and entropy decoding the motion compensation feature information and then inputting it into the motion compensation network; in the up-sampling layer, obtaining m corresponding groups of reconstructed optical flow information and separation convolution kernels on the (N+1-m)th to the Nth convolutional layers counted from the input end; obtaining m corresponding prediction frames according to the m groups of reconstructed optical flow information and separation convolution kernels; and inputting the m corresponding prediction frames into a fusion network for fusion to obtain a prediction frame of the current frame. By fusing prediction frames at different scales, the invention improves video prediction performance.

Description

Video frame prediction method and device and terminal equipment

Technical Field

The invention belongs to the technical field of video compression, and particularly relates to a video frame prediction method, a video frame prediction device and terminal equipment.

Background

In the video compression process, an efficient temporal prediction mechanism is the key to improving video compression performance. In existing video compression technology, the predicted image is mostly obtained by optical flow interpolation or separation convolution performed at the original resolution. However, none of these approaches considers the effect of temporal prediction at different scales, which results in poor video prediction performance.

Therefore, a new technical solution is needed to solve the above problems.

Disclosure of Invention

In view of this, embodiments of the present invention provide a video frame prediction method and a terminal device, so as to solve the problem of low video prediction performance in the prior art.

A first aspect of an embodiment of the present invention provides a video frame prediction method, including:

calculating optical flow information between the current frame and the reference frame;

inputting the optical flow information, the reference frame and the current frame into a motion compensation network to obtain motion compensation characteristic information, wherein an up-sampling layer of the motion compensation network comprises N convolutional layers, and N is a positive integer greater than 1;

the motion compensation characteristic information is input into the motion compensation network after being subjected to entropy coding and entropy decoding;

in the up-sampling layer, obtaining m groups of corresponding reconstructed optical flow information and a separation convolution kernel from the (N +1-m) th convolution layer to the Nth convolution layer from the input end, wherein m is a positive integer smaller than N;

obtaining m corresponding prediction frames according to the m groups of reconstructed optical flow information and the separation convolution kernels;

and inputting the m corresponding prediction frames into a fusion network for fusion to obtain a current frame prediction frame.

A second aspect of the embodiments of the present invention provides a video frame prediction apparatus, including:

the optical flow module is used for calculating optical flow information between the current frame and the reference frame;

a motion compensation module, configured to input the optical flow information, the reference frame, and the current frame into a motion compensation network to obtain motion compensation feature information, where an upsampling layer of the motion compensation network includes N convolutional layers, where N is a positive integer greater than 1;

the entropy coding and decoding module is used for inputting the motion compensation characteristic information into the motion compensation network after entropy coding and entropy decoding are carried out on the motion compensation characteristic information;

the convolution layer output module is used for obtaining m groups of corresponding reconstructed optical flow information and a separation convolution kernel from the (N +1-m) th convolution layer to the Nth convolution layer from the input end in the up-sampling layer, wherein m is a positive integer smaller than N;

a corresponding prediction frame module for obtaining m corresponding prediction frames according to the m groups of reconstructed optical flow information and the separation convolution kernel;

and the fusion module is used for inputting the m corresponding prediction frames into a fusion network for fusion to obtain the current frame prediction frame.

A third aspect of embodiments of the present invention provides a video frame prediction terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the computer program.

A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as provided in the first aspect above.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

the invention improves the performance of video prediction by fusing the prediction frames on different scales.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart illustrating an implementation of a video frame prediction method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a video frame prediction apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a video frame prediction terminal device according to an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

Example one

Fig. 1 shows a flow of implementing a video frame prediction method according to an embodiment of the present invention, where an execution subject of the method may be a terminal device, and details are as follows:

Step S101, calculating optical flow information between the current frame and the reference frame.

Optionally, a spatial position mapping relationship between pixels of the current frame image and pixels of the reference frame image is calculated to obtain optical flow information.

Specifically, optical flow uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between two adjacent frames, and thereby calculates the motion information of objects between the adjacent frames. The current frame and the reference frame are input into a preset optical flow network to obtain the optical flow information. Further, the optical flow network includes two network structures: FlowNetS (FlowNetSimple) and FlowNetC (FlowNetCorr). FlowNetS stacks the two input images directly along the channel dimension, and its network structure contains only convolutional layers; FlowNetC first extracts features from the two input images separately and then calculates the correlation of the features, that is, the features of the two images are convolved with each other in the spatial dimension.
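The difference between the two input schemes can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the networks themselves: the function names `stack_frames` and `correlation`, the `max_disp` parameter and the zero padding at the borders are hypothetical choices made for this sketch, and the learned feature extractors are omitted.

```python
import numpy as np

def stack_frames(current, reference):
    """FlowNetS-style input: concatenate two H x W x C images along the channel axis."""
    return np.concatenate([current, reference], axis=-1)

def correlation(feat_a, feat_b, max_disp=1):
    """FlowNetC-style correlation (illustrative): for every position of feat_a,
    take the dot product with the feature vector of feat_b at each displacement
    within max_disp. Returns an H x W x (2*max_disp+1)**2 array."""
    h, w, _ = feat_a.shape
    d = max_disp
    # Zero padding at the borders is an assumption of this sketch.
    padded = np.pad(feat_b, ((d, d), (d, d), (0, 0)))
    maps = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[d + dy:d + dy + h, d + dx:d + dx + w]
            maps.append(np.sum(feat_a * shifted, axis=-1))
    return np.stack(maps, axis=-1)
```

With `max_disp=1` the correlation volume has nine channels, and the middle channel (zero displacement) is simply the per-pixel dot product of the two feature maps.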

Step S102, inputting the optical flow information, the reference frame and the current frame into a motion compensation network to obtain motion compensation characteristic information, wherein an up-sampling layer of the motion compensation network comprises N convolution layers, and N is a positive integer greater than 1.

Optionally, the motion compensation network comprises an upsampling layer, a downsampling layer, an encoding network, and a decoding network, wherein the upsampling layer and the downsampling layer each comprise N convolutional layers. Further, the optical flow information, the reference frame and the current frame are input into the motion compensation network, and down-sampling and convolution operations are performed in the downsampling layer to obtain the motion compensation feature information.

Step S103, entropy coding and entropy decoding the motion compensation feature information, and inputting the motion compensation feature information into the motion compensation network.

Optionally, entropy coding is performed on the motion compensation feature information to obtain a compressed bit stream, and the compressed bit stream is stored. Further, the stored compressed bit stream is entropy decoded and then input into the motion compensation network. The entropy coding scheme may be Shannon coding, Huffman coding, arithmetic coding or the like, and is not limited herein.
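As an illustration of the entropy coding and decoding round trip, the following minimal sketch uses Huffman coding, one of the schemes named above. It assumes the motion compensation feature information has already been quantized to a sequence of integer symbols, which the text does not specify; the function names are hypothetical.

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Return a prefix code {symbol: bitstring} built from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

def entropy_encode(symbols, code):
    """Concatenate the codewords of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)

def entropy_decode(bits, code):
    """Invert entropy_encode by matching prefix-free codewords bit by bit."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Because the code is prefix-free, decoding the stored bit string recovers the symbol sequence exactly, so this stage is lossless; any loss would come from the quantization assumed to precede it.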

Step S104, in the up-sampling layer, obtaining m corresponding groups of reconstructed optical flow information and separation convolution kernels on the (N+1-m)th to the Nth convolutional layers counted from the input end, wherein m is a positive integer smaller than N.

Optionally, according to the entropy-decoded motion compensation feature information input into the motion compensation network, the reconstructed optical flow information and the separation convolution kernel corresponding to each convolutional layer are obtained on the m convolutional layers from the (N+1-m)th to the Nth convolutional layer counted from the input end of the up-sampling layer, yielding m groups of reconstructed optical flow information and separation convolution kernels, one group per convolutional layer.

For example, assume that N is 4 and m is 3; these values are used only for convenience of description and do not constitute any limitation. Specifically, after the compressed bit stream is entropy decoded and input into the motion compensation network, the data is passed to an up-sampling layer comprising 4 convolutional layers. The reconstructed optical flow information and the separation convolution kernel on the (N+1-m)th convolutional layer from the input end of the up-sampling layer, that is, on the 2nd convolutional layer (at 1/4 resolution), are obtained (that is, they are output when the data reaches the 2nd convolutional layer); further, the reconstructed optical flow information and the separation convolution kernel on the 3rd convolutional layer (at 1/2 resolution) from the input end of the up-sampling layer are obtained (that is, they are output when the data reaches the 3rd convolutional layer); further, the reconstructed optical flow information and the separation convolution kernel on the 4th convolutional layer (at full resolution, namely the outermost convolutional layer, closest to the output end) from the input end of the up-sampling layer are obtained (that is, they are output when the data reaches the 4th convolutional layer). This yields the reconstructed optical flow information and separation convolution kernels corresponding to the 2nd, 3rd and 4th convolutional layers respectively (that is, at 1/4 resolution, 1/2 resolution and full resolution).
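The layer indexing in this example can be checked with a short sketch. It assumes, consistently with the 1/4, 1/2 and full resolutions quoted above, that each convolutional layer of the up-sampling layer doubles the resolution and that the Nth layer works at full resolution; the function name `tapped_layers` is hypothetical.

```python
from fractions import Fraction

def tapped_layers(n_layers, m):
    """List (layer index, relative resolution) for layers N+1-m .. N of the
    up-sampling layer, assuming layer N is full resolution and every layer
    before it halves the scale."""
    if not (n_layers > 1 and 0 < m < n_layers):
        raise ValueError("require N > 1 and 0 < m < N")
    first = n_layers + 1 - m
    return [(k, Fraction(1, 2 ** (n_layers - k)))
            for k in range(first, n_layers + 1)]
```

For N = 4 and m = 3, `tapped_layers(4, 3)` selects layers 2, 3 and 4 at relative resolutions 1/4, 1/2 and 1, matching the example.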

Step S105, m corresponding prediction frames are obtained according to the m groups of reconstructed optical flow information and the separation convolution kernels.

Optionally, after performing warp operation on the reference frames according to the m groups of reconstructed optical flow information, performing separation convolution operation on separation convolution kernels corresponding to the reconstructed optical flow information to obtain m corresponding prediction frames.

Specifically, the reference frame image is warped, that is, transformed to the specified positions, according to each of the m groups of reconstructed optical flow information, to obtain m corresponding warp prediction frames.

Furthermore, a separation convolution operation is performed on each of the m warp prediction frames with its corresponding separation convolution kernel to obtain the m corresponding prediction frames. In the separation convolution operation, each pixel in the warp prediction frame is convolved with its corresponding separation convolution kernel.

Illustratively, the reconstructed optical flow information and the separate convolution kernels corresponding to the 2 nd, 3 rd, and 4 th convolution layers, respectively, are obtained in the example of step S104. Furthermore, performing warp operation on the reference frame according to the reconstructed optical flow information corresponding to the 2 nd convolutional layer, and performing separation convolution operation on a separation convolution kernel corresponding to the 2 nd convolutional layer to obtain a prediction frame corresponding to the 2 nd convolutional layer; performing warp operation on the reference frame according to the reconstructed optical flow information corresponding to the 3 rd convolutional layer, and performing separation convolution operation on a separation convolution kernel corresponding to the 3 rd convolutional layer to obtain a prediction frame corresponding to the 3 rd convolutional layer; performing warp operation on the reference frame according to the reconstructed optical flow information corresponding to the 4 th convolutional layer, and performing separation convolution operation on a separation convolution kernel corresponding to the 4 th convolutional layer to obtain a prediction frame corresponding to the 4 th convolutional layer; thereby obtaining predicted frames corresponding to the 2 nd convolutional layer, the 3 rd convolutional layer, and the 4 th convolutional layer, respectively.
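The warp and separation convolution at a single scale can be sketched as follows. Several details here are assumptions not stated in the text: the frame is a single-channel H x W array, the flow stores (dx, dy) per pixel, sampling is nearest-neighbour, each pixel's kernel is given as a vertical and a horizontal 1-D kernel of odd length K, and borders are edge-padded. How the reference frame is matched to the 1/4 and 1/2 resolution outputs is also not specified, so the sketch handles one scale only.

```python
import numpy as np

def warp(reference, flow):
    """Backward-warp an H x W frame: output[y, x] = reference[y + dy, x + dx],
    with nearest-neighbour sampling and coordinates clipped to the frame."""
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return reference[src_y, src_x]

def separation_convolution(frame, k_vert, k_horiz):
    """Per-pixel separable convolution: each pixel has its own vertical and
    horizontal 1-D kernels (H x W x K each); their outer product is applied
    to the K x K neighbourhood of that pixel."""
    h, w = frame.shape
    k = k_vert.shape[-1]
    r = k // 2  # assumes odd K
    padded = np.pad(frame, r, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for i in range(k):
        for j in range(k):
            out += k_vert[..., i] * k_horiz[..., j] * padded[i:i + h, j:j + w]
    return out

def predict(reference, flow, k_vert, k_horiz):
    """Warp the reference frame, then refine it with per-pixel kernels."""
    return separation_convolution(warp(reference, flow), k_vert, k_horiz)
```

Calling `predict` once per group of reconstructed optical flow information and kernels would give the m prediction frames. With zero flow and kernels that are 1 at the centre tap and 0 elsewhere, the prediction equals the reference frame.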

Step S106, inputting the m corresponding prediction frames into a fusion network for fusion to obtain a current frame prediction frame.

Optionally, the m corresponding prediction frames are input into a fusion network, where the fusion network includes m convolutional layers; furthermore, the m prediction frames corresponding to the (N+1-m)th to the Nth convolutional layers are fused, in one-to-one correspondence, with the outputs of the m convolutional layers of the fusion network taken in order from the input end to the output end, to finally obtain the current frame prediction frame.

Illustratively, after the prediction frames corresponding to the 2nd convolutional layer (at 1/4 resolution), the 3rd convolutional layer (at 1/2 resolution) and the 4th convolutional layer (at full resolution) are obtained in step S105, the 3 prediction frames are input into the fusion network in sequence, wherein the fusion network is an up-sampling network comprising m convolutional layers (that is, 3 convolutional layers). Specifically, the prediction frame corresponding to the 2nd convolutional layer (at 1/4 resolution) is input into the 1st convolutional layer of the fusion network, and the output is fused with the prediction frame corresponding to the 2nd convolutional layer (at 1/4 resolution) to obtain the corresponding fused prediction frame; the fused prediction frame corresponding to the 2nd convolutional layer (at 1/4 resolution) is input into the 2nd convolutional layer of the fusion network, and the output is fused with the prediction frame corresponding to the 3rd convolutional layer (at 1/2 resolution) to obtain the corresponding fused prediction frame; the fused prediction frame corresponding to the 3rd convolutional layer (at 1/2 resolution) is input into the 3rd convolutional layer of the fusion network, and the output is fused with the prediction frame corresponding to the 4th convolutional layer (at full resolution) to obtain the prediction frame of the current frame.
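The coarse-to-fine order of the fusion can be sketched as below. This is a structural illustration only: the learned convolutional layers are replaced by a placeholder `layer` callable (identity by default), fusion is replaced by a plain average, and the 2x nearest-neighbour up-sampling between stages is an assumption of this sketch, since the text says only that the fusion network is an up-sampling network with m convolutional layers.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of an H x W array."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fuse(pred_frames, layer=lambda x: x):
    """Fuse prediction frames ordered coarse to fine, each twice the size of
    the previous one. At stage k the running state passes through a stand-in
    for the k-th fusion layer and is merged with the k-th prediction frame."""
    state = pred_frames[0]
    last = len(pred_frames) - 1
    for k, frame in enumerate(pred_frames):
        out = layer(state)               # placeholder for a learned conv layer
        fused = 0.5 * (out + frame)      # placeholder for learned fusion
        state = fused if k == last else upsample2x(fused)
    return state
```

For three frames at 1/4, 1/2 and full resolution, the result has the size of the full-resolution frame and blends information from all three scales.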

Optionally, after the m corresponding prediction frames are input into a fusion network and fused to obtain a current frame prediction frame, the method further includes:

subtracting the predicted frame of the current frame from the current frame to obtain a residual error;

inputting the residual error into a residual error compression network to obtain a decompressed residual error; optionally, the residual compression network is a neural network comprising an upsampling layer, an encoding network, a decoding network, and a downsampling layer. The residual error is input into the residual compression network and encoded to obtain a residual bit stream, and the residual bit stream is decoded by the decoding network and down-sampled to obtain the decompressed residual error.

And adding the predicted frame of the current frame and the decompressed residual error to obtain a reconstructed frame of the current frame.
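The residual path can be illustrated with the arithmetic below, taking the residual as the current frame minus its prediction frame so that adding the decompressed residual back to the prediction approximates the current frame. The residual compression network is a learned model; here it is replaced by a uniform quantization round trip with a hypothetical `step` parameter, purely as a stand-in for a lossy codec.

```python
import numpy as np

def compress_residual(residual, step=4.0):
    """Stand-in for the residual compression network: quantize to multiples
    of step and dequantize, which loses at most step / 2 per pixel."""
    return np.round(residual / step) * step

def reconstruct(current, predicted, step=4.0):
    """Compute the residual, pass it through the lossy stand-in codec, and
    add the decompressed residual back to the prediction frame."""
    residual = current - predicted
    decoded = compress_residual(residual, step)
    return predicted + decoded
```

In this stand-in, the reconstruction error per pixel is bounded by half the quantization step, and a better prediction frame leaves a smaller residual to encode.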

In the embodiment, the performance of video prediction is improved by fusing the prediction frames on different scales.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Example two

Fig. 2 is a block diagram illustrating a structure of a video frame prediction apparatus according to an embodiment of the present invention, and only a part related to the embodiment of the present invention is shown for convenience of description. The video frame prediction apparatus 2 includes: an optical flow module 21, a motion compensation module 22, an entropy coding and decoding module 23, a convolutional layer output module 24, a corresponding prediction frame module 25, and a fusion module 26.

The optical flow module 21 is configured to calculate optical flow information between the current frame and the reference frame;

a motion compensation module 22, configured to input the optical flow information, the reference frame, and the current frame into a motion compensation network to obtain motion compensation feature information, where an upsampling layer of the motion compensation network includes N convolutional layers, where N is a positive integer greater than 1;

an entropy coding and decoding module 23, configured to perform entropy coding and entropy decoding on the motion compensation feature information and then input the motion compensation feature information to the motion compensation network;

a convolutional layer output module 24, configured to obtain, in the upsampling layer, m corresponding groups of reconstructed optical flow information and separation convolution kernels on the (N+1-m)th to the Nth convolutional layers counted from the input end, where m is a positive integer smaller than N;

a corresponding prediction frame module 25, configured to obtain m corresponding prediction frames according to the m groups of reconstructed optical flow information and the separation convolution kernel;

and the fusion module 26 is configured to input the m corresponding prediction frames into a fusion network for fusion to obtain a current frame prediction frame.

Optionally, the optical flow module 21 includes:

and the optical flow calculating unit is used for calculating the spatial position mapping relation between the pixels of the current frame image and the pixels of the reference frame image to obtain optical flow information.

Optionally, the convolutional layer output module 24 includes:

and a reconstructed optical flow information and separation convolution unit, configured to obtain, according to the entropy-decoded motion compensation feature information input into the motion compensation network, the reconstructed optical flow information and the separation convolution kernel corresponding to each convolutional layer on the m convolutional layers from the (N+1-m)th to the Nth convolutional layer, that is, m groups of reconstructed optical flow information and separation convolution kernels, one group per convolutional layer.

Optionally, the corresponding predicted frame module 25 includes:

a warp unit, configured to perform warp operations on the reference frames according to the m groups of reconstructed optical flow information;

and the separation convolution unit is used for performing separation convolution operation on the output of the warp unit and the corresponding separation convolution kernels respectively to obtain m corresponding prediction frames.

Optionally, the fusion module 26 includes:

an input unit for inputting the m corresponding prediction frames into a fusion network, wherein the fusion network includes m convolutional layers;

and the fusion unit is used for correspondingly fusing m prediction frames corresponding to the (N +1-m) th convolution layer to the Nth convolution layer with the outputs of the m convolution layers from the input end to the output end in the fusion network one by one to finally obtain the current frame prediction frame.

Optionally, the video frame prediction apparatus 2 further includes:

and the reconstructed frame module is used for subtracting the predicted frame of the current frame from the current frame to obtain a residual error, inputting the residual error into a residual error compression network to obtain a decompressed residual error, and adding the predicted frame of the current frame and the decompressed residual error to obtain a reconstructed frame of the current frame.

Example three

Fig. 3 is a schematic diagram of a video frame prediction terminal device according to an embodiment of the present invention. As shown in Fig. 3, the video frame prediction terminal device 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32, such as a video frame prediction program, stored in the memory 31 and executable on the processor 30. The processor 30, when executing the computer program 32, implements the steps of the various embodiments of the video frame prediction method described above, such as steps S101 to S106 shown in Fig. 1. Alternatively, the processor 30 implements the functions of the modules/units in the device embodiments, such as the modules 21 to 26 shown in Fig. 2, when executing the computer program 32.

Illustratively, the computer program 32 may be divided into one or more modules, which are stored in the memory 31 and executed by the processor 30 to implement the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 32 in the video frame prediction terminal device 3. For example, the computer program 32 may be divided into an optical flow module, a motion compensation module, an entropy coding and decoding module, and a predicted frame module, and each module has the following specific functions:

the entropy coding and decoding module is used for inputting the motion compensation network after entropy coding and entropy decoding are carried out on the motion compensation characteristic information;

a convolution layer output module, configured to obtain, in the upsampling layer, m corresponding groups of reconstructed optical flow information and separation convolution kernels on the (N+1-m)th to the Nth convolutional layers counted from the input end, where m is a positive integer smaller than N;

The video frame prediction terminal device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The video frame prediction terminal device may include, but is not limited to, a processor 30 and a memory 31. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the video frame prediction terminal device 3, and does not constitute a limitation of the video frame prediction terminal device 3, and may include more or less components than those shown, or combine some components, or different components, for example, the video frame prediction terminal device may further include an input-output device, a network access device, a bus, etc.

The processor 30 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.

The memory 31 may be an internal storage unit of the video frame prediction terminal device 3, such as a hard disk or a memory of the video frame prediction terminal device 3. The memory 31 may be an external storage device of the video frame prediction terminal device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), or the like provided on the video frame prediction terminal device 3. Further, the memory 31 may include both an internal storage unit and an external storage device of the video frame prediction terminal device 3. The memory 31 is used to store the computer program and other programs and data required by the video frame prediction terminal device. The above-mentioned memory 31 may also be used to temporarily store data that has been output or is to be output.

As can be seen from the above, the present embodiment improves the performance of video prediction by fusing predicted frames on different scales.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A method for video frame prediction, comprising:

2. The video frame prediction method of claim 1, wherein said calculating optical flow information between the current frame and the reference frame comprises:

and calculating the spatial position mapping relation between the pixels of the current frame image and the pixels of the reference frame image to obtain optical flow information.
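The spatial position mapping of claim 2 can be illustrated with a toy sketch. The claim does not specify how the mapping is computed, so the exhaustive search below is an assumption made purely for illustration, and the name estimate_global_flow is hypothetical. It finds a single integer displacement for the whole frame, whereas optical flow in practice is a dense field with one displacement per pixel.

```python
# Toy illustration only: exhaustive search for ONE global integer displacement.
# This is an assumed stand-in; the claim does not say how the mapping is computed.

def estimate_global_flow(current, reference, max_shift=2):
    """Return (dx, dy) such that current[y][x] best matches reference[y + dy][x + dx]."""
    height, width = len(current), len(current[0])
    best = None  # (mean absolute error, dx, dy)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            error, count = 0.0, 0
            for y in range(height):
                for x in range(width):
                    ry, rx = y + dy, x + dx
                    # Only compare pixels whose mapped position lies inside the frame.
                    if 0 <= ry < height and 0 <= rx < width:
                        error += abs(current[y][x] - reference[ry][rx])
                        count += 1
            if count == 0:
                continue
            mean_error = error / count
            if best is None or mean_error < best[0]:
                best = (mean_error, dx, dy)
    return best[1], best[2]
```

Here the returned pair plays the role of the optical flow information: it states where each pixel of the current frame is located in the reference frame.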

3. The method of claim 1, wherein said obtaining, in said up-sampling layer, m groups of corresponding reconstructed optical flow information and separable convolution kernels from the (N+1-m)th to the Nth convolutional layers from the input end comprises:

according to the entropy-decoded motion compensation characteristic information input into the motion compensation network, obtaining reconstructed optical flow information and a separable convolution kernel for each convolutional layer from the (N+1-m)th to the Nth convolutional layer, counted from the input end of the up-sampling layer, thereby obtaining m groups of reconstructed optical flow information and separable convolution kernels, one group per convolutional layer.
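As a small worked example of the index range in claim 3, the sketch below lists which of the N convolutional layers of the up-sampling layer yield a group of reconstructed optical flow information and a convolution kernel. The function name scale_layer_indices and the 1-based numbering from the input end are assumptions made for illustration.

```python
def scale_layer_indices(n_layers, m):
    """Return the 1-based indices (N+1-m) .. N of the layers that each yield one group."""
    if not 1 <= m <= n_layers:
        raise ValueError("m must satisfy 1 <= m <= N")
    return list(range(n_layers + 1 - m, n_layers + 1))
```

For instance, with N = 5 and m = 3 the groups come from the 3rd, 4th and 5th convolutional layers, that is, the last m layers before the output end.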

4. The video frame prediction method of claim 1, wherein said obtaining m corresponding prediction frames according to said m groups of reconstructed optical flow information and separable convolution kernels comprises:

performing a warp operation on the reference frames according to each of the m groups of reconstructed optical flow information, and then performing a separable convolution operation using the separable convolution kernel corresponding to that reconstructed optical flow information, to obtain m corresponding prediction frames.
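The warp-then-convolve step of claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: frames are single-channel lists of rows, the warp is a backward warp with bilinear sampling and border clamping, and each separation (separable) convolution kernel is simplified to one vertical and one horizontal 1-D kernel shared by all pixels. All function names are hypothetical.

```python
def bilinear_sample(image, x, y):
    """Sample image at a fractional position, clamping to the frame border."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(reference, flow):
    """Backward warp: output[y][x] is the reference sampled at (x + dx, y + dy)."""
    return [[bilinear_sample(reference, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(len(reference[0]))]
            for y in range(len(reference))]


def separable_conv(image, k_vertical, k_horizontal):
    """Apply a 1-D horizontal pass, then a 1-D vertical pass (borders clamped)."""
    height, width = len(image), len(image[0])
    rh, rv = len(k_horizontal) // 2, len(k_vertical) // 2
    rows = [[sum(k * image[y][min(max(x + j - rh, 0), width - 1)]
                 for j, k in enumerate(k_horizontal))
             for x in range(width)] for y in range(height)]
    return [[sum(k * rows[min(max(y + i - rv, 0), height - 1)][x]
                 for i, k in enumerate(k_vertical))
             for x in range(width)] for y in range(height)]


def predict_frames(reference, flows, kernels):
    """One prediction frame per (flow, kernel) group: warp first, then convolve."""
    return [separable_conv(warp(reference, flow), kv, kh)
            for flow, (kv, kh) in zip(flows, kernels)]
```

Passing m flow fields and m kernel pairs to predict_frames returns m prediction frames, one per group. The sketch keeps every group at the same resolution for brevity.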

5. The method of claim 1, wherein said inputting the m corresponding prediction frames into a fusion network for fusion to obtain the prediction frame of the current frame comprises:

inputting the m corresponding prediction frames into the fusion network, wherein the fusion network comprises m convolutional layers;

and placing the m prediction frames obtained from the (N+1-m)th to the Nth convolutional layers in one-to-one correspondence with the outputs of the m convolutional layers of the fusion network, ordered from its input end to its output end, and fusing the m prediction frames to obtain the prediction frame of the current frame.
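To make the idea of fusing m prediction frames concrete, the sketch below uses a normalized weighted average. This is a deliberately simple stand-in and not the fusion network of claim 5, which comprises m convolutional layers with learned parameters. The function name fuse_frames, the weights, and the assumption that all frames already share one resolution are illustrative only.

```python
def fuse_frames(frames, weights):
    """Fuse same-size frames by a normalized weighted average (stand-in for learned fusion)."""
    if not frames or len(frames) != len(weights):
        raise ValueError("need one weight per prediction frame")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(w * frame[y][x] for w, frame in zip(weights, frames)) / total
             for x in range(width)]
            for y in range(height)]
```

A learned fusion network can be thought of as replacing these fixed weights with content-dependent ones.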

6. A video frame prediction apparatus, comprising:

7. The video frame prediction device of claim 6, wherein the corresponding predicted frame module comprises:

a warp unit, configured to perform warp operation on the reference frames according to the m sets of reconstructed optical flow information;

8. The video frame prediction device of claim 6, wherein the fusion module comprises:

an input unit for inputting the m corresponding prediction frames into a fusion network, wherein the fusion network comprises m convolutional layers;

9. A video frame prediction terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method according to any one of claims 1 to 5 when executing said computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.