CN116962718A - Intermediate frame determining method, device, equipment, program product and medium - Google Patents

Intermediate frame determining method, device, equipment, program product and medium

Info

Publication number
CN116962718A
Authority
CN
China
Prior art keywords
frame
intermediate frame
image
optical flow
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211723143.6A
Other languages
Chinese (zh)
Inventor
姜博源
孔令通
罗栋豪
储文青
邰颖
汪铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211723143.6A priority Critical patent/CN116962718A/en
Publication of CN116962718A publication Critical patent/CN116962718A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234381 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping

Abstract

The invention provides a method, an apparatus, a device and a medium for determining an intermediate frame. The method comprises the following steps: encoding and decoding a first image frame and a second image frame through an intermediate frame optical flow estimation network in a dynamic frame interpolation model to obtain a first optical flow between the target intermediate frame and the first image frame, a second optical flow between the target intermediate frame and the second image frame, and a fusion weight parameter; calculating a sparse mask matched to a decoder layer of an intermediate frame image synthesis network; and adjusting the initial intermediate frame through the decoder layer of the intermediate frame image synthesis network using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector, to obtain the target intermediate frame. By using a dynamic frame interpolation model comprising an intermediate frame optical flow estimation network and an intermediate frame image synthesis network, the method and apparatus can flexibly adjust the computation spent on the target intermediate frame, achieving fast video frame interpolation while reducing hardware overhead and floating-point computation.

Description

Intermediate frame determining method, device, equipment, program product and medium
Technical Field
The present invention relates to machine learning technology, and in particular, to an intermediate frame determining method, an apparatus, an electronic device, a computer program product, and a storage medium.
Background
In the related art, with the rapid development of computer vision technology, user expectations for video frame rate keep rising, since high-frame-rate video greatly improves the viewing experience. To deliver smoother and clearer video, camera frame rates have risen from 25 FPS to 60 FPS and on to 240 FPS or even higher; however, raising the frame rate solely through camera hardware iteration is costly, which is what motivated video frame interpolation technology.
The purpose of video frame interpolation is to generate high-frame-rate video from low-frame-rate video; the typical operation is to generate the image of an intermediate frame given the images of two adjacent video frames. At present, most video frame interpolation methods model object motion to estimate the optical flow of the intermediate frame. This requires a large amount of floating-point computation, which makes interpolation slow and demands expensive hardware, and is therefore unfavorable for realizing video frame interpolation on the terminal side.
Disclosure of Invention
In view of this, embodiments of the present invention provide an intermediate frame determination method, an apparatus, an electronic device, a computer program product and a storage medium, which can flexibly adjust the computation spent on the target intermediate frame by using a dynamic frame interpolation model comprising an intermediate frame optical flow estimation network and an intermediate frame image synthesis network, thereby implementing fast video frame interpolation, reducing the overhead of hardware devices and reducing floating-point computation.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a method for determining an intermediate frame, which comprises the following steps:
acquiring a first image frame and a second image frame from a video, wherein the first image frame and the second image frame are consecutive image frames;
determining a first optical flow between the first image frame and a target intermediate frame and a second optical flow between the second image frame and the target intermediate frame through an intermediate frame optical flow estimation network, and determining fusion weight parameters of the first optical flow and the second optical flow;
determining an initial intermediate frame of the first image frame and the second image frame based on the first optical flow, the second optical flow, and the fusion weight parameter;
performing feature extraction on the first image frame and the second image frame through an encoder layer of an intermediate frame image synthesis network to obtain a first feature vector;
determining, by an encoder layer of the intermediate frame optical flow estimation network, a second feature vector using the first feature vector;
calculating a sparse mask matched to a decoder layer of the intermediate frame image synthesis network;
and adjusting the initial intermediate frame by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector through a decoder layer of the intermediate frame image synthesis network to obtain a target intermediate frame.
The embodiment of the invention also provides an intermediate frame determining device, which comprises:
an information transmission module, for acquiring a first image frame and a second image frame from a video, wherein the first image frame and the second image frame are consecutive image frames;
the information processing module is used for determining a first optical flow between the first image frame and a target intermediate frame and a second optical flow between the second image frame and the target intermediate frame through an intermediate frame optical flow estimation network, and determining a fusion weight parameter of the first optical flow and the second optical flow;
the information processing module is used for determining initial intermediate frames of the first image frame and the second image frame based on the first optical flow, the second optical flow and the fusion weight parameter;
the information processing module is used for extracting the characteristics of the first image frame and the second image frame through an encoder layer of an intermediate frame image synthesis network to obtain a first characteristic vector;
the information processing module is used for determining a second feature vector by utilizing the first feature vector through an encoder layer of the intermediate frame optical flow estimation network;
the information processing module is used for calculating a sparse mask matched with a decoder layer of the intermediate frame image synthesis network;
The information processing module is configured to adjust, by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter, and the second feature vector, the initial intermediate frame through a decoder layer of the intermediate frame image synthesis network, to obtain a target intermediate frame.
In the above-described arrangement,
the information processing module is used for calculating a first intermediate frame according to the first image frame and the first optical flow;
the information processing module is used for calculating a second intermediate frame according to the second image frame and the second optical flow;
and the information processing module is used for fusing the first intermediate frame and the second intermediate frame according to the fusion weight parameter to obtain an initial intermediate frame.
In the above-described arrangement,
the information processing module is configured to extract, by using a first encoder network of the intermediate frame image synthesis network, a feature vector of the first image frame, where an encoder layer of the intermediate frame image synthesis network includes: a first encoder network and a second encoder network, the first encoder network and the second encoder network having the same structure and different parameters;
The information processing module is used for extracting the feature vector of the second image frame through a first encoder network of the intermediate frame image synthesis network;
the information processing module is configured to combine the feature vector of the first image frame and the feature vector of the second image frame, and perform feature conversion through the first optical flow and the second optical flow to obtain the first feature vector.
In the above-described arrangement,
the information processing module is configured to perform feature extraction on the first optical flow, the second optical flow, the fusion weight parameter and the initial intermediate frame through the second encoder network, so as to obtain a third feature vector;
the information processing module is configured to perform feature fusion on the first feature vector and the third feature vector through the second encoder network to obtain the second feature vector.
In the above-described arrangement,
the information processing module is used for acquiring a low-frequency component, a high-frequency component in the horizontal direction, a high-frequency component in the vertical direction and a high-frequency component in the diagonal direction which are output by a decoder layer of the intermediate frame image synthesis network;
the information processing module is used for acquiring a sparseness threshold value of the sparse mask;
The information processing module is configured to calculate a sparse mask that matches a decoder layer of the intermediate frame image synthesis network using the sparseness threshold, the low frequency component, the high frequency component in the horizontal direction, the high frequency component in the vertical direction, and the high frequency component in the diagonal direction.
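An illustrative sketch of how such a mask could be computed follows (the thresholding rule is an assumption for illustration; this section only states that the mask is derived from the sparseness threshold and the wavelet components):

    import torch

    def sparse_mask(lh, hl, hh, sparseness_threshold):
        # Positions whose high-frequency wavelet energy exceeds the threshold
        # are kept (mask = 1) and decoded by the sparse convolution; flat
        # regions are skipped (mask = 0). How the low-frequency component
        # enters the computation is omitted in this simplified sketch.
        energy = (lh.abs() + hl.abs() + hh.abs()).sum(dim=1, keepdim=True)
        return (energy > sparseness_threshold).float()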
In the above-described arrangement,
the information processing module is used for acquiring the application scene of the intermediate frame;
the information processing module is used for dynamically adjusting the sparseness threshold according to the application scene.
In the above-described arrangement,
the information processing module is configured to decode, through a decoder layer of the intermediate frame image synthesis network, the second feature vector and the third feature vector to obtain the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction and the high-frequency component in the diagonal direction, wherein the decoder layer of the intermediate frame image synthesis network is a sparse convolution decoding network based on Haar wavelet decomposition;
the information processing module is used for carrying out wavelet inverse transformation processing on the low-frequency component, the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction and the high-frequency component in the diagonal direction to obtain a wavelet inverse transformation result;
And the information processing module is used for adjusting the initial intermediate frame by utilizing the wavelet inverse transformation result to obtain the target intermediate frame.
In the above-described arrangement,
the information processing module is used for calculating at least one intermediate frame in the video according to the first image frame and the second image frame;
the information processing module is used for inserting the at least one intermediate frame into the intermediate positions of the first image frame and the second image frame to obtain a complete target video.
In the above-described arrangement,
the information processing module is used for detecting the playing fluency of the target video when it is determined that the video coding strategy matched with the playing environment of the target video is to enhance the video code rate;
the information processing module is used for determining that the video coding strategy matched with the playing environment of the target video is to simultaneously increase the frame rate and the code rate when it is detected that playback of the target video stutters;
the information processing module is used for determining a target frame rate and a target video code rate;
the information processing module is used for determining the number of the target intermediate frames according to the target frame rate and the target video code rate.
In the above-described arrangement,
the information processing module is used for acquiring a first image frame and a second image frame in the target video when it is determined that the video coding strategy matched with the playing environment of the target video is to enhance the video code rate;
the information processing module is used for acquiring difference images of the first image frame and the second image frame;
the information processing module is used for converting the difference image into a matched gray level image and determining different pixel points included in the gray level image;
the information processing module is used for detecting the playing fluency of the target video according to the gray values of different pixel points in the gray image.
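A rough sketch of such a fluency check (OpenCV-based; the thresholds and the changed-pixel-ratio heuristic are assumptions for illustration, not the patent's exact rule):

    import cv2
    import numpy as np

    def playback_is_stuck(frame_a, frame_b, pixel_thresh=10, ratio_thresh=0.001):
        diff = cv2.absdiff(frame_a, frame_b)              # difference image
        gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)     # matched gray-level image
        changed = np.count_nonzero(gray > pixel_thresh)   # differing pixel points
        # Heuristic: almost no changed pixels between nominally consecutive
        # frames suggests a frozen (stuttering) stream.
        return changed / gray.size < ratio_thresh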
The embodiment of the invention also provides an electronic device for determining an intermediate frame, comprising:
a memory for storing executable instructions;
and a processor, for implementing the aforementioned intermediate frame determination method when executing the executable instructions stored in the memory.
The embodiment of the invention also provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the aforementioned intermediate frame determination method.
The embodiment of the invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the aforementioned intermediate frame determination method.
The embodiment of the invention has the following beneficial effects:
1) The embodiment of the invention acquires a first image frame and a second image frame from a video; encodes and decodes the first image frame and the second image frame through an intermediate frame optical flow estimation network to obtain a first optical flow between the first image frame and the target intermediate frame, a second optical flow between the second image frame and the target intermediate frame, and a fusion weight parameter; extracts a first feature vector from the first image frame and the second image frame through an encoder layer of an intermediate frame image synthesis network; converts the first feature vector through the intermediate frame optical flow estimation network to obtain a second feature vector; and calculates a sparse mask matched to the decoder layer of the intermediate frame image synthesis network. Because the sparse mask is controllable, the regions over which the target intermediate frame is computed can be flexibly controlled, so the floating-point workload of the hardware device can be flexibly adjusted.
2) The initial intermediate frame is adjusted through the decoder layer of the intermediate frame image synthesis network using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector, to obtain the target intermediate frame. The computation spent on the target intermediate frame can therefore be flexibly adjusted by the dynamic frame interpolation model comprising the intermediate frame optical flow estimation network and the intermediate frame image synthesis network. This makes fast video frame interpolation possible and keeps the computation of the dynamic interpolation process under control, so that interpolation can run on the mobile-terminal side with improved processing speed and a better user experience, or run on the service side to relieve the computational pressure on the mobile terminal.
Drawings
Fig. 1 is a schematic view of a usage environment of an intermediate frame determining method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a composition structure of an intermediate frame determining apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an alternative method for determining an intermediate frame according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a dynamic frame insertion model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a test effect of an intermediate frame determining method according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an intermediate frame calculation accuracy of an intermediate frame determining method according to an embodiment of the present invention;
fig. 7 is a schematic view of an application effect of the method for determining an intermediate frame according to an embodiment of the present invention;
fig. 8 is a schematic view of an application effect of an intermediate frame determining method according to an embodiment of the present invention;
fig. 9 is a schematic flowchart of an alternative method for determining an intermediate frame according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Before describing embodiments of the present invention in further detail, the terms involved in the embodiments of the present invention are explained as follows.
1) Video transcoding (Video Transcoding): converting a video code stream that has already been compression-coded into another video code stream, to adapt to different network bandwidths, different terminal processing capabilities and different user requirements.
2) Client: the carrier in a terminal that implements specific functions; for example, a mobile client (APP) is the carrier of specific functions in a mobile terminal, such as live streaming (video push) or online video playback.
3) In response to: used to indicate the condition or state on which a performed operation depends; when the condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay. Unless otherwise specified, there is no restriction on the order in which multiple such operations are executed.
5) Cloud gaming: the game itself runs on cloud server equipment; the game picture rendered by the cloud device is encoded and transmitted over the network to the user terminal, which decodes it and renders it to the display screen. A user can thus complete the game interaction process without installing the game locally, needing only a communication network connection to the cloud.
6) Frames per second (FPS, Frames Per Second): a term from the imaging field referring to the number of frames transmitted per second, colloquially the number of pictures of an animation or video. FPS measures the amount of information used to store and display dynamic video. The more frames per second, the smoother the displayed motion; typically, 30 FPS is the minimum needed for motion to avoid looking jerky.
7) Artificial neural network: in the fields of machine learning and cognitive science, neural networks (NN, Neural Networks) are mathematical or computational models that mimic the structure and function of biological neural networks and are used to estimate or approximate functions.
8) Video frame interpolation: video interpolation refers to the synthesis of frames at intermediate moments between two adjacent frames of video.
9) Warp: and transforming one image into another image according to a certain rule.
10) DWT: discrete wavelet transform; IDWT: inverse discrete wavelet transform; Haar: Haar wavelet.
Before introducing the intermediate frame determination method provided by the present application, the video frame interpolation methods in the related art are first introduced. To improve the experience of watching video, a video service provider usually optimizes the video using video frame interpolation technology, which noticeably improves the smoothness of the video picture. Video frame interpolation algorithms can be divided into three types according to how the intermediate frame (intermediate video frame) is synthesized: optical-flow-based methods, kernel-based methods, and image-generation-based methods:
1) The kernel-based method synthesizes an intermediate frame by convolving a local block around each output pixel to generate the image. However, it cannot handle large motion (Large Motion) beyond the kernel size, and is generally computationally expensive. The image-generation-based method can produce finer texture structures, but if an object with large motion exists in the video, problems such as ghosting can occur, degrading the look of the interpolated video.
2) Convolutional neural networks (CNN, Convolutional Neural Networks) in deep learning can learn, by predicting optical flow, how pixel values in images move over time, and most current video frame interpolation algorithms are based on optical flow (the instantaneous velocity of the pixels of a spatially moving object on the observed imaging plane). For example, the Depth-Aware Video Frame Interpolation (DAIN) algorithm comprises an optical flow estimation module, a depth estimation module, a feature extraction module and an interpolation kernel estimation module. These four modules respectively obtain optical flow maps, depth maps, feature maps and interpolation kernels from the input preceding and following frame images; the input frames, depth maps and context features are then warped using the optical flow and local interpolation kernels and fed to a target frame synthesis network to synthesize the output frame.
3) The DAIN algorithm estimates the optical flow between two adjacent frames and derives the optical flow of the intermediate frame relative to the previous frame under a linear assumption. This is suitable only when the object moves at uniform velocity; otherwise the intermediate frame optical flow estimated under the linear assumption deviates considerably from reality. To address this problem, the Quadratic Video Interpolation (QVI) algorithm proposes to estimate the acceleration of the object using three adjacent frames, and then to estimate the optical flow of the intermediate frame under a uniform-acceleration assumption.
However, with either approach, different regions differ in synthesis difficulty during intermediate frame synthesis. Solid-color regions of a video frame, such as sky or plain-colored clothing, or regions where adjacent frames show little motion, are easier to synthesize and could be handled with less computation. The ordinary convolutional neural networks adopted as synthesis networks in the related art allocate the same computation to every region, and this computational redundancy increases the device's floating-point workload.
Meanwhile, since object motion in real scenes is very complex, motion modeling with a single mode, whether uniform velocity or the more complex uniform acceleration, cannot cover all cases. For this reason, the related art usually cascades a correction network to correct the estimated intermediate frame optical flow. However, such cascading significantly increases the time and space complexity of the neural network, lengthens inference time, is unfavorable for realizing dynamic frame interpolation on the terminal side, and increases the user's waiting time.
Fig. 1 is a schematic view of a usage scenario of the intermediate frame determination method according to an embodiment of the present invention. Referring to fig. 1, terminals (including a terminal 10-1 and a terminal 10-2) are provided with clients of video processing software, through which users can obtain video content stored in a server or a cloud server. The terminals are connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, with data transmission over wireless links. The intermediate frame determination method provided by the present invention can serve as a cloud service for applications that need video frame interpolation. For example, because videos stored in the server 200 have been compressed, frames that were not saved can be computed and inserted by the intermediate frame determination method when playing videos acquired from the server 200, so that a complete target video is restored for the user to watch. For a live-video service environment, dynamically interpolating frames into the live video acquired from the server 200 can raise the video code rate and improve the fluency of live playback.
The intermediate frame determination method is implemented based on artificial intelligence. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the invention, the artificial intelligence software technologies mainly involved include speech processing and machine learning. For example, speech recognition technology (Automatic Speech Recognition, ASR) in speech technology (Speech Technology) may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, machine learning (ML, Machine Learning) may be involved. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning typically includes deep learning (Deep Learning) techniques, which include artificial neural networks, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN) and deep neural networks (Deep Neural Network, DNN).
It can be appreciated that the intermediate frame determination method provided by the present invention can be applied to an intelligent device (Intelligent Device), which may be any device with a video playing function, for example an intelligent terminal, a smart home device (such as a smart speaker or smart washing machine), a smart wearable device (such as a smart watch), an in-vehicle intelligent central control system (dynamically interpolating frames of videos played by the in-vehicle system), or an AI medical device (displaying treatment cases through video playback).
The following describes in detail the structure of the intermediate frame determining apparatus according to the embodiment of the present invention. The intermediate frame determining apparatus may be implemented in various forms, such as a cloud server with a video information processing function, or a server or server cluster provided with a video information processing function, for example the server 200 in fig. 1. Fig. 2 is a schematic diagram of the composition structure of the intermediate frame determining apparatus according to an embodiment of the present invention. It is understood that fig. 2 shows only an exemplary structure, not the entire structure, of the intermediate frame determining apparatus, and that part or all of the structure shown in fig. 2 may be implemented as needed.
The intermediate frame determining device provided by the embodiment of the invention comprises: at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The various components in the intermediate frame determining device 20 are coupled together by a bus system 205. It is understood that the bus system 205 is used to enable connected communication between these components. In addition to the data bus, the bus system 205 includes a power bus, a control bus and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operation on the terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application may comprise various applications.
In some embodiments, the intermediate frame determining apparatus provided in the embodiments of the present invention may be implemented by combining software and hardware, and by way of example, the intermediate frame determining apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to perform the intermediate frame determining method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASICs, application Specific Integrated Circuit), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable Logic Device), field programmable gate arrays (FPGAs, field-Programmable Gate Array), or other electronic components.
As an example of implementation of the intermediate frame determining apparatus provided by the embodiment of the present invention by combining software and hardware, the intermediate frame determining apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and performs the intermediate frame determining method provided by the embodiment of the present invention in combination with necessary hardware (including, for example, the processor 201 and other components connected to the bus 205).
By way of example, the processor 201 may be an integrated circuit chip having signal processing capabilities, for example a general-purpose processor (such as a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
As an example of implementation of the intermediate frame determining apparatus provided in the embodiments of the present invention by hardware, the apparatus provided in the embodiments of the present invention may be implemented directly by the processor 201 in the form of a hardware decoding processor, for example, by one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable Logic Device), field programmable gate arrays (FPGAs, field-Programmable Gate Array), or other electronic components.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the intermediate frame determining device 20. Examples of such data include: any executable instructions for operation on the intermediate frame determining device 20; a program implementing the intermediate frame determination method of the embodiment of the present invention may be included in these executable instructions.
In other embodiments, the intermediate frame determining apparatus provided in the embodiments of the present invention may be implemented in software. Fig. 2 shows the intermediate frame determining apparatus stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, and comprises a series of modules. As an example of the program stored in the memory 202, the intermediate frame determining apparatus may include the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the intermediate frame determining apparatus are read into RAM by the processor 201 and executed, the intermediate frame determination method provided by the embodiment of the present invention is implemented. The functions of each software module are described below. The information transmission module 2081 is configured to acquire a first image frame and a second image frame from a video, where the first image frame and the second image frame are consecutive image frames.
The information processing module 2082 is configured to determine, through the inter-frame optical flow estimation network, a first optical flow between the first image frame and the target inter-frame, a second optical flow between the second image frame and the target inter-frame, and determine a fusion weight parameter of the first optical flow and the second optical flow.
The information processing module 2082 is configured to determine an initial intermediate frame of the first image frame and the second image frame based on the first optical flow, the second optical flow, and the fusion weight parameter.
The information processing module 2082 is configured to perform feature extraction on the first image frame and the second image frame through an encoder layer of the intermediate frame image synthesis network, so as to obtain a first feature vector.
Information processing module 2082 is configured to determine, using the first feature vector, a second feature vector through an encoder layer of the intermediate frame optical flow estimation network.
The information processing module 2082 is configured to adjust, by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter, and the second feature vector, the initial intermediate frame through a decoder layer of the intermediate frame image synthesis network, so as to obtain a target intermediate frame.
According to the intermediate frame determining device shown in fig. 2, in one aspect of the application, the application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the different embodiments, and combinations of embodiments, provided in the various alternative implementations of the above intermediate frame determination method.
The intermediate frame determination method according to the embodiment of the present application is described below with reference to the intermediate frame determining apparatus 20 shown in fig. 2. To address the drawbacks in the related art described above, refer to fig. 3, an optional flowchart of the intermediate frame determination method provided in an embodiment of the present application. It may be understood that the steps shown in fig. 3 may be performed by various electronic devices running the intermediate frame determining apparatus, for example a server or server cluster with a video information processing function, a cloud game server, a video live broadcast server, or a terminal with a cloud gaming or video playing function; the present application is not limited in this respect. The steps shown in fig. 3 are described below.
Step 301: the intermediate frame determining device acquires a first image frame and a second image frame from a video, wherein the first image frame and the second image frame are consecutive image frames.
For a live-video environment, adding the target intermediate frame between two consecutive image frames, i.e. the first image frame and the second image frame, can effectively raise the code rate of the live video and improve the fluency experienced by viewers. For offline video processing scenarios, camera frame rates have risen from 25 FPS to 60 FPS and on to 240 FPS or even higher, but raising the frame rate purely through camera hardware iteration is costly; adding target intermediate frames between consecutive image frames can therefore effectively reduce the hardware cost of video processing. In the field of video compression, the local terminal side may store only some of a video's image frames; at playback, the frames that were not stored can be restored by adding target intermediate frames between consecutive image frames, recovering the complete video. It should be noted that the number of target intermediate frames may be adjusted according to the target frame rate or the target video code rate: for example, when the target frame rate is increased from 30 FPS to 60 FPS, the number of target intermediate frames is 30; for live video under network jitter, when the H.264-format target video code rate is increased from 120 kbps to 250 kbps, the number of target intermediate frames is 130.
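As a toy sketch of this frame-count bookkeeping (the function and the per-second accounting are illustrative assumptions, covering only the frame-rate case above):

    # Illustrative sketch: extra frames needed per second to lift a video
    # from a source frame rate to a target frame rate (assumes frames are
    # inserted uniformly between existing consecutive frame pairs).
    def intermediate_frames_per_second(source_fps: int, target_fps: int) -> int:
        if target_fps <= source_fps:
            return 0                       # nothing to insert
        return target_fps - source_fps     # e.g. 30 FPS -> 60 FPS: 30 frames/s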
Step 302: the intermediate frame determining device determines a first optical flow between the first image frame and the target intermediate frame, a second optical flow between the second image frame and the target intermediate frame through the intermediate frame optical flow estimating network, and determines a fusion weight parameter of the first optical flow and the second optical flow.
To more clearly illustrate the working processes of the intermediate frame optical flow estimation network and the intermediate frame image synthesis network in the present application, refer to fig. 4, a schematic structural diagram of the dynamic frame interpolation model in an embodiment of the present application. The dynamic frame interpolation model comprises the intermediate frame optical flow estimation network and the intermediate frame image synthesis network. The intermediate frame image synthesis network is built on a multi-scale wavelet synthesis network, which computes the wavelet coefficients of the intermediate frame image by combining the two input adjacent frame images with the intermediate frame optical flow prediction result, finally obtaining the target intermediate frame. Moreover, when computing the wavelet coefficients, the intermediate frame image synthesis network can dynamically allocate computation according to the degree of motion, assigning less computation to regions that change relatively gently, thereby reducing its computational cost. The working processes of the intermediate frame optical flow estimation network and the intermediate frame image synthesis network are described below in turn.
As shown in fig. 4, when the intermediate frame optical flow estimation network operates, a first intermediate frame may be calculated from the first image frame and the first optical flow, and a second intermediate frame from the second image frame and the second optical flow; the first intermediate frame and the second intermediate frame are then fused according to the fusion weight parameter to obtain the initial intermediate frame. The first optical flow describes how pixels of the first image frame correspond to pixels of the target intermediate frame, and the second optical flow describes how pixels of the second image frame correspond to pixels of the target intermediate frame, so the initial intermediate frame can be calculated from the two optical flows. The intermediate frame optical flow estimation network is a U-shaped 4-layer encoder-decoder network: it takes the adjacent first and second image frames of the video as input and outputs the optical flows from the target intermediate frame to the two adjacent frames, namely the first optical flow F_t→0 and the second optical flow F_t→1, together with the fusion weight parameter O_t. From these outputs the initial intermediate frame can be calculated by Equation 1:

    I_t' = O_t ⊙ Î_t→0 + (1 − O_t) ⊙ Î_t→1, with Î_t→0 = warp(I_0, F_t→0) and Î_t→1 = warp(I_1, F_t→1)    (Equation 1)

where warp is the optical flow mapping (warping) process; I_0 and I_1 denote the first and second image frames; F_t→0 and F_t→1 are the first and second optical flows; Î_t→0 and Î_t→1 are the two warped (mapped) frames; I_t' is the initial intermediate frame; and ⊙ denotes element-wise multiplication.

As Equation 1 shows, the intermediate frame optical flow estimation network computes the first optical flow F_t→0 for the first image frame I_0 and the second optical flow F_t→1 for the second image frame I_1, obtains the first intermediate frame Î_t→0 and the second intermediate frame Î_t→1 through the warping process, and fuses them to obtain the initial intermediate frame I_t'. However, the initial intermediate frame I_t' computed by the intermediate frame optical flow estimation network is relatively coarse, and at this point the floating-point cost of video frame interpolation is still high; I_t' therefore needs to be corrected by the intermediate frame image synthesis network to obtain the target intermediate frame, which is inserted between the first video frame I_0 and the second video frame I_1.
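A minimal PyTorch sketch of this warp-and-fuse step (the use of grid_sample for backward warping and all function names are assumptions for illustration, not the patent's implementation):

    import torch
    import torch.nn.functional as F

    def warp(img, flow):
        # Backward-warp img (B, C, H, W) with a dense flow field (B, 2, H, W).
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys)).float().to(img.device)   # pixel coordinates
        coords = base.unsqueeze(0) + flow                     # sampling positions
        gx = 2.0 * coords[:, 0] / (w - 1) - 1.0               # normalize to [-1, 1]
        gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                  # (B, H, W, 2)
        return F.grid_sample(img, grid, align_corners=True)

    def initial_intermediate_frame(i0, i1, f_t0, f_t1, o_t):
        # Equation 1: fuse the two warped frames with the weight map o_t.
        return o_t * warp(i0, f_t0) + (1.0 - o_t) * warp(i1, f_t1)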
Before describing the working process of the intermediate frame image synthesis network, its model structure and computation principle are explained with reference to fig. 4. The intermediate frame image synthesis network is likewise based on a U-shaped encoder-decoder structure, but unlike the intermediate frame optical flow estimation network of the present application, its decoder layer uses a sparse convolution network based on wavelet decomposition as the decoder of each level. Here the intermediate frame image synthesis network may adopt the Haar discrete wavelet transform (DWT, Discrete Wavelet Transformation); the Haar decomposition is given by Equation 2:

    (LL_l, LH_l, HL_l, HH_l) = DWT(LL_(l-1))    (Equation 2)

where LL_(l-1) is the low-frequency component of level l-1 (LL_0 being the input feature or image), l is the current pyramid level, l ∈ {1, 2, 3, 4}, and LH_l, HL_l, HH_l are the high-frequency components of the current level l. For an input feature or image, the wavelet transform decomposes it into a low-frequency component LL and three high-frequency components LH, HL and HH (L and H denote the low-pass and high-pass filtered outputs, respectively), where HL is the high-frequency component in the horizontal direction, LH the high-frequency component in the vertical direction, and HH the high-frequency component in the diagonal direction. Because the discrete wavelet transform is a reversible decomposition process, its inverse transform IDWT can be used in the present application to synthesize the image to be predicted: the intermediate frame image synthesis network calculates the four coefficients LL, LH, HL, HH and applies Equation 3:

    LL_(l-1) = IDWT(LL_l, LH_l, HL_l, HH_l)    (Equation 3)

where LL_(l-1) is the low-frequency component of level l-1, l is the current pyramid level, l ∈ {1, 2, 3, 4}, LL_l, LH_l, HL_l, HH_l are the components of the current level, and IDWT is the inverse discrete wavelet transform operation.
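For intuition, here is one level of this decomposition and its inverse sketched with the PyWavelets library (the library choice is an assumption; in the patent the transform is realized inside the synthesis network):

    import numpy as np
    import pywt

    x = np.random.rand(64, 64).astype(np.float32)    # stand-in feature map

    # One level of Haar decomposition (Equation 2): a low-frequency band
    # plus horizontal/vertical/diagonal detail bands.
    ll, (hl, lh, hh) = pywt.dwt2(x, "haar")

    # Inverse transform (Equation 3) reconstructs the input exactly,
    # since the DWT is a reversible decomposition.
    x_rec = pywt.idwt2((ll, (hl, lh, hh)), "haar")
    assert np.allclose(x, x_rec, atol=1e-5)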
The following describes the specific working procedure of the intermediate frame image synthesis network:
step 303: the intermediate frame determining device performs feature extraction on the first image frame and the second image frame through an encoder layer of the intermediate frame image synthesis network to obtain a first feature vector.
In some embodiments of the present invention, extracting the first feature vector in the first image frame and the second image frame may be achieved by:
extracting the feature vector of the first image frame through a first encoder network of the intermediate frame image synthesis network, and extracting the feature vector of the second image frame through the same first encoder network; and combining the feature vector of the first image frame with the feature vector of the second image frame to obtain the first feature vector. The Encoder network of the encoder layer of the intermediate frame image synthesis network may employ a Pyramid feature Encoder (Pyramid Encoder), and the Decoder network of the decoder layer may employ a Coarse-to-Fine Decoder. Alternatively, each encoder network may be composed of 4 convolution blocks, each convolution block including two convolution layers with 3×3 convolution kernels and strides of 2 and 1, respectively, where the channel numbers of the convolution layers of the 4 convolution blocks are 32, 48, 72 and 96. The decoder network likewise consists of 4 convolution blocks, each containing two convolution layers with 3×3 and 4×4 convolution kernels and strides of 1 and 1/2 (i.e., a 2× transposed-convolution upsampling), respectively. The number of channels of each decoder convolution block is identical to that of the encoder convolution block at the corresponding level; for example, the convolution block E1 at the first level of the encoder has the same number of channels as the convolution block D1 at the first level of the decoder.
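A sketch of one pyramid feature encoder as described above, assuming PyTorch: 4 convolution blocks with channel numbers 32, 48, 72 and 96, each block a stride-2 3×3 convolution followed by a stride-1 3×3 convolution. The ReLU activations and class/function names are assumptions; the text does not specify the nonlinearity.

```python
# Pyramid encoder sketch: 4 blocks, each halving spatial resolution once.
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

class PyramidEncoder(nn.Module):
    """Returns 4 feature levels at 1/2, 1/4, 1/8 and 1/16 resolution."""
    def __init__(self, in_ch=3, channels=(32, 48, 72, 96)):
        super().__init__()
        chs = (in_ch,) + tuple(channels)
        self.blocks = nn.ModuleList(
            conv_block(chs[i], chs[i + 1]) for i in range(4)
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats
```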
The number of convolution blocks of the encoder and the decoder, i.e. the number of levels of the encoder/decoder, may be set according to the resolution (Res) of the video frames of the input target video. Alternatively, the number of convolution blocks or the number of levels num of the encoder/decoder satisfies $2^{num} \le Res$, which is not limited here. For example, when the resolution of a video frame of the target video is 256×256, the encoder/decoder may be set to at most 8 levels.
As shown in fig. 4, when the pyramid feature encoder is used, the encoder layer of the intermediate frame image synthesis network includes a first encoder network $\varepsilon_{WS1}$ and a second encoder network $\varepsilon_{WS2}$, which have the same structure but different parameters. $\varepsilon_{WS1}$ extracts 4 levels of pyramid features from each of the input adjacent frame images $I_0$ and $I_1$.
Step 304: the intermediate frame determining means determines a second feature vector using the first feature vector.
As shown in connection with fig. 4, the second feature vector is obtained by combining the output results of the encoder network $\varepsilon_{WS1}$ and the encoder network $\varepsilon_{WS2}$. The pyramid features that $\varepsilon_{WS1}$ extracts from the input adjacent frame images $I_0$ and $I_1$ are brought to the state of the intermediate frame through optical-flow mapping processing, yielding the first feature vector. Then the second encoder network $\varepsilon_{WS2}$ performs feature extraction on the first optical flow $F_{t\to 0}$ of the first image frame $I_0$, the second optical flow $F_{t\to 1}$ of the second image frame $I_1$, the fusion weight parameter, and the initial intermediate frame, yielding a third feature vector. Finally, the third feature vector and the first feature vector are fused to obtain the second feature vector, which, as the output result of the encoder layer of the intermediate frame image synthesis network, is input to the decoder layer of the intermediate frame image synthesis network for processing.
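A sketch of assembling the second feature vector from the pieces above, reusing the `warp` helper from the Equation 1 sketch. It assumes the flows have already been resized (and rescaled) to each pyramid level and that the fusion of the first and third feature vectors is a channel-wise concatenation; both are illustrative assumptions, since the patent does not specify the fusion operator.

```python
# Warp both pyramids to time t (first feature), then fuse with the second
# encoder's output (third feature) to get the second feature, level by level.
import torch

def build_second_feature(pyr0, pyr1, flows_t0, flows_t1, third_feats):
    """All arguments are per-level lists; returns per-level fused features."""
    fused = []
    for f0, f1, ft0, ft1, f3 in zip(pyr0, pyr1, flows_t0, flows_t1, third_feats):
        first = torch.cat([warp(f0, ft0), warp(f1, ft1)], dim=1)  # first feature
        fused.append(torch.cat([first, f3], dim=1))               # second feature
    return fused
```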
Thus far, the processing of the encoder layer of the intermediate frame image synthesis network is completed through steps 302-304, and decoding then needs to be performed by the decoder layer of the intermediate frame image synthesis network. As shown in fig. 4, the decoder is likewise composed of four levels, and the decoder at each level takes the output of the previous level, the encoder features of the corresponding level, and a sparse mask as inputs. The sparse mask defines which regions' features need to participate in the computation: only the regions marked 1 in the sparse mask are computed, so the floating point computation required to determine the intermediate frame can be controlled by adjusting the sparseness threshold used to generate the sparse mask.
In some embodiments of the present application, the sparseness threshold used to generate the sparse mask can be controlled by a compression threshold classifier, which determines the threshold-ratio hyperparameter η of the intermediate frame image synthesis network. The compression threshold is mainly affected by the static scene structure; in frame interpolation, the degree to which the target frame can be compressed is affected by a variety of scene structures and motion conditions, and modeling it is relatively complex. For example, in a complex motion scene, a target frame synthesized from blurry, over-exposed or otherwise noisy input samples typically contains more unreliable high-frequency textures; these are more compressible, and the denoising property of the wavelet may even yield better image transformation results. On the other hand, a target frame with rich textures and stable motion should retain more high-frequency wavelet coefficients so as to obtain better quantitative and qualitative image results. To address this, the present application introduces a threshold classifier in the intermediate frame image synthesis network to find a suitable instance-aware threshold ratio by inferring a probability distribution over the sparseness thresholds of candidate sparse masks. The threshold classifier is a lightweight network consisting of one convolutional layer and two fully-connected layers separated by a Leaky ReLU activation, embedded at the penultimate convolutional layer of decoder D4 of the decoder layer. With $\eta_1, \eta_2, \ldots, \eta_m$ as candidates, the threshold classifier predicts $\pi = \{\pi_1, \pi_2, \ldots, \pi_m\}$. Because selecting $h \in \{0,1\}^m$ when encoding the probability output π to one-hot is non-differentiable, a Gumbel-softmax (probability distribution normalization) technique can be used to make the discrete decision differentiable during gradient backpropagation; that is, a discrete candidate threshold selection for the threshold classifier can be generated by Equation 4:
$$h = \mathrm{one\_hot}\left(\arg\max_{k}\left(g_k + \log \pi_k\right)\right) \tag{4}$$

where $g_k = -\log(-\log u_k)$, with $u_k$ drawn independently and identically from Uniform(0, 1), is a sample from the Gumbel(0, 1) distribution, $k = 1, 2, \ldots, m$, $\pi_k$ is the prediction of the threshold classifier, and m is a positive integer greater than or equal to 1. The one-hot encoding process encodes the output of the lightweight network included in the threshold classifier into one-hot format. In the relaxation of Equation 4, the position where the one-hot encoding takes its maximum corresponds to an exponent of 0 and all other positions to negative exponents; since the exponential function is differentiable, the derivative at 0 is 1 while the derivatives at the negative positions are close to 0. This yields a softmax formulation of the discrete one-hot operation and converts the non-differentiable operation into a differentiable one.
However, the candidate threshold selection generated by Equation 4 is discrete and cannot be updated using backpropagation. Therefore, embodiments of the present invention use a generalized form of softmax as a continuous and differentiable approximation of argmax; that is, the derivative of the one-hot operation above can be approximated with the continuously differentiable Gumbel-softmax function, so that the output result is normalized to [0, 1] by the softmax function.
In this transformation, Gumbel(0, 1) provides independent, identically distributed Gumbel samples, which do not affect the argmax computation over the original classification probability distribution $\pi_k$ predicted by the threshold classifier. During training, the discrete threshold $h_k$ selected by the threshold classifier candidates is computed by Equation 5:
$$h_k = \frac{\exp\left((\log \pi_k + g_k)/\tau\right)}{\sum_{j=1}^{m} \exp\left((\log \pi_j + g_j)/\tau\right)} \tag{5}$$

where $h_k$ is the value of the threshold classifier, τ is the temperature parameter, $g_k = -\log(-\log u_k)$ is an independent, identically distributed sample from Gumbel(0, 1), $k = 1, 2, \ldots, m$, $\pi_k$ is the prediction of the threshold classifier, and m is a positive integer greater than or equal to 1. As τ→∞, samples from the Gumbel-softmax distribution become uniform; conversely, as τ→0, samples from the Gumbel-softmax distribution become one-hot sparse vectors. In the test phase, an annealing algorithm is used, starting from a high temperature τ = 1.0 and finally annealing to 0.4, and the optimum value of the threshold classifier output $h_k$ can be obtained by testing.
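A sketch of the Gumbel-softmax threshold selection in Equations 4 and 5, assuming PyTorch. `logits` plays the role of log π, and the candidate η values follow the text (m = 4: 0, 0.005, 0.01, 0.015); PyTorch also ships `torch.nn.functional.gumbel_softmax` with the same semantics.

```python
# Soft (Equation 5) and hard straight-through (Equation 4) threshold choice.
import torch
import torch.nn.functional as F

CANDIDATES = torch.tensor([0.0, 0.005, 0.01, 0.015])  # candidate eta values

def select_threshold(logits, tau=1.0, hard=False):
    """logits: (B, m) classifier outputs. Returns a (B,) tensor of thresholds."""
    u = torch.rand_like(logits).clamp_(1e-9, 1.0 - 1e-9)
    g = -torch.log(-torch.log(u))              # Gumbel(0, 1) samples
    h = F.softmax((logits + g) / tau, dim=-1)  # Equation 5 (differentiable)
    if hard:
        # Equation 4: discrete one-hot with a straight-through gradient.
        idx = h.argmax(dim=-1, keepdim=True)
        h = torch.zeros_like(h).scatter_(-1, idx, 1.0) - h.detach() + h
    return (h * CANDIDATES.to(h.device)).sum(dim=-1)
```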
Step 305: the intermediate frame determining means calculates a sparse mask that matches the decoder layer of the intermediate frame image synthesizing network.
Step 306: the intermediate frame determining device adjusts the initial intermediate frame through a decoder layer of the intermediate frame image synthesis network by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector to obtain a target intermediate frame.
In some embodiments of the present invention, obtaining the target intermediate frame may be achieved as follows: decoding the second feature vector and the third feature vector through the decoder layer of the intermediate frame image synthesis network to obtain the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction and the high-frequency component in the diagonal direction, where the decoder layer of the intermediate frame image synthesis network is a sparse convolution decoding network based on Haar wavelet decomposition; performing inverse wavelet transformation on the low-frequency component, the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction and the high-frequency component in the diagonal direction to obtain an inverse wavelet transform result; and adjusting the initial intermediate frame using the inverse wavelet transform result to obtain the target intermediate frame.
The output of each level's sparse decoder is the set of three high-frequency coefficients of the current level, computed by Equation 6. In the piecewise-flat regions of high-resolution images and cartoon images, the high-frequency wavelet coefficients of the sparse convolutional decoder mostly take very small values, close to zero, and only some significant wavelet coefficient values appear near the edges in the video frame. Thus, for a high-resolution target frame, the determination of the target image frame can be accomplished by estimating the non-zero wavelet coefficients only at particular pixel locations. These particular locations are represented as a sparse mask at level $l$, $M_l \in \{0,1\}^{H_l \times W_l}$:

$$LH_l,\; HL_l,\; HH_l = \mathcal{D}_l\left(LL_l,\; \Phi_l,\; M_l\right) \tag{6}$$

where $\mathcal{D}_l$ is the level-$l$ sparse decoder whose inputs, per the decoder description above, are the output of the previous level, the encoder features $\Phi_l$ of the corresponding level, and the sparse mask $M_l$, and $LH_l$, $HL_l$, $HH_l$ are the high-frequency components of the current level. Combined with the low-frequency feature $LL_l$ reconstructed from the previous level, the features of the next (finer) level can be reconstructed using the IDWT transform. For the bottom-most decoder, i.e. $l = 4$, the low-frequency coefficient is also produced, from which the intermediate frame can be obtained.
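A minimal sketch of this coarse-to-fine decoding loop: the bottom level (l = 4) supplies the low-frequency band, each level's sparse decoder predicts the three high-frequency bands (Equation 6), and `haar_idwt` from the earlier sketch lifts the reconstruction one level at a time (Equation 3). The decoder call signature is an assumption for illustration.

```python
# Coarse-to-fine reconstruction: one IDWT per pyramid level.
def reconstruct_frame(ll, decoders, enc_feats, masks):
    """ll: coarsest low-frequency band; other arguments are keyed by level."""
    for level in (4, 3, 2, 1):  # coarse to fine
        lh, hl, hh = decoders[level](ll, enc_feats[level], masks[level])
        ll = haar_idwt(ll, lh, hl, hh)  # next finer low-frequency band
    return ll  # level-0 reconstruction: the synthesized intermediate frame
```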
In some embodiments of the present invention, computing a sparse mask that matches the decoder layer of the intermediate frame image synthesis network may be implemented by:
acquiring the low-frequency component, the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction, and the high-frequency component in the diagonal direction output by the decoder layer of the intermediate frame image synthesis network; acquiring the sparseness threshold of the sparse mask; and calculating the sparse mask matched with the decoder layer of the intermediate frame image synthesis network using the sparseness threshold, the low-frequency component, the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction, and the high-frequency component in the diagonal direction. The sparse mask $M_l$ is computed from the level-$l$ low-frequency coefficient and the level-$(l+1)$ high-frequency coefficients by Equation 7:

$$M_l = \mathrm{up2}\left(\max\left(|LH_{l+1}|,\, |HL_{l+1}|,\, |HH_{l+1}|\right) > \eta \cdot \left(\max(LL_l) - \min(LL_l)\right)\right) \tag{7}$$

where $LL_l$ is the low-frequency component at level $l$, $l$ is the current pyramid level, $l \in \{1,2,3,4\}$, $LH_{l+1}$, $HL_{l+1}$, $HH_{l+1}$ are the high-frequency components at level $l+1$, up2 denotes 2× upsampling, and $M_l$ is the sparse mask of the decoder layer.
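A sketch of Equation 7, assuming PyTorch: a level-l pixel participates in the computation only where the strongest level-(l+1) high-frequency response exceeds a fraction η of the dynamic range of the level-l low-frequency band, with up2 realized as 2× nearest-neighbor upsampling. Reducing the mask to a single channel is an assumption.

```python
# Equation 7: threshold the high-frequency magnitudes, then upsample 2x.
import torch
import torch.nn.functional as F

def sparse_mask(ll_l, lh_next, hl_next, hh_next, eta):
    """ll_l at level l; lh/hl/hh at level l+1. Returns a level-l binary mask."""
    high = torch.maximum(torch.maximum(lh_next.abs(), hl_next.abs()),
                         hh_next.abs()).amax(dim=1, keepdim=True)
    rng = (ll_l.amax(dim=(1, 2, 3), keepdim=True)
           - ll_l.amin(dim=(1, 2, 3), keepdim=True))  # max(LL_l) - min(LL_l)
    mask = (high > eta * rng).float()
    return F.interpolate(mask, scale_factor=2, mode="nearest")  # up2
```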
As shown in Equation 7, the larger the sparseness threshold η, the more regions in $M_l$ are 0, that is, the fewer regions participate in the computation, and the floating point computation for generating the target intermediate frame is reduced accordingly. Regarding the selection of η, taking a three-channel RGB image as the target intermediate frame as an example, m possible options may be defined first; in the present invention m = 4, namely 0, 0.005, 0.01 and 0.015, and the network selects, according to the output of the first stage, the value with the highest confidence as the threshold controlling the sparseness of the sparse mask.
In some embodiments of the present invention, the application scenario of the intermediate frame may be acquired, and the sparseness threshold may be dynamically adjusted according to the application scenario. For example, the sparseness threshold for live video differs from that for cloud video whose frames are interpolated (supplemented) at the client, and the threshold for game video is larger than that for live video, so that the floating point computation of frame interpolation can be adjusted flexibly.
After the calculation of one intermediate frame is completed through steps 301 to 306, for a game video requiring continuous video frame interpolation, at least one intermediate frame in the video may be calculated from the first image frame and the second image frame, and the at least one intermediate frame is inserted between the first image frame and the second image frame to obtain the complete target video.
After at least one intermediate frame is inserted between the first image frame and the second image frame and the complete target video is obtained, the complete target video may be stored in a cloud server cluster for the user to retrieve at any time; alternatively, the terminal side may insert the at least one intermediate frame between the first image frame and the second image frame and store the resulting complete target video in the game server, thereby reducing the floating point computation of the game server and the hardware consumption of the server.
In some embodiments of the present invention, in an actual video frame interpolation scenario, the computer device may select multiple pairs of adjacent video frames from the target video and insert an intermediate video frame between each pair, thereby enhancing the smoothness and sharpness of the target video.
For example, to avoid a perceptible frame skip in the target video played by the terminal, which would affect the user's viewing experience, video frame interpolation may be performed on the animation's video frame sequence $\{I_0, I_1, I_2, I_3, \ldots, I_{n+1}\}$, where n indexes the time sequence of the animation's video frames. The terminal may insert the corresponding target intermediate video frames $\{I_{0.5}, I_{1.5}, I_{2.5}, \ldots, I_{n+0.5}\}$ between each pair of adjacent video frames $\{I_0, I_1\}, \{I_1, I_2\}, \{I_2, I_3\}, \ldots, \{I_n, I_{n+1}\}$ of the sequence.
Referring to fig. 5 and 6, fig. 5 is a schematic diagram illustrating the test effect of the intermediate frame determining method according to an embodiment of the present application, comparing the intermediate frame determining method of the present application with video frame interpolation in the related art. In the training phase of the intermediate frame optical flow estimation network and the intermediate frame image synthesis network, the forward propagation phase of the intermediate frame optical flow estimation network can be summarized as the following steps: 1) given two input frames $I_0$ and $I_1$, the intermediate frame optical flow estimation network calculates the first optical flow $F_{t\to 0}$ of the first image frame $I_0$ and the second optical flow $F_{t\to 1}$ of the second image frame $I_1$; 2) the first intermediate frame $\hat I_{0\to t}$ and the second intermediate frame $\hat I_{1\to t}$ are obtained through optical-flow mapping processing and then fused to obtain the initial intermediate frame $I_t'$. Since, in the piecewise-flat regions of high-resolution images and cartoon images, the sparse convolutional decoder's wavelet coefficients mostly take very small values close to zero and only some significant coefficient values appear near the image edges, referring to Equation 8, given $I_0$, $I_1$, $F_{t\to 0}$, $F_{t\to 1}$, $O_t$, $I_t'$ and the discrete candidate thresholds $h_k$, $k \in \{1, 2, \ldots, m\}$, of the threshold classifier, the multi-scale wavelet coefficient set used by the intermediate frame image synthesis network is $W_k = \mathcal{N}_{WS}(I_0, I_1, F_{t\to 0}, F_{t\to 1}, O_t, I_t'; h_k)$. Using the inverse transform IDWT of the DWT, the intermediate frame image synthesis network synthesizes the frame $LL_0$ from the four coefficient sets LL, LH, HL and HH:

$$LL_0 = \mathrm{IDWT}\left(\mathcal{N}_{WS}(I_0, I_1, F_{t\to 0}, F_{t\to 1}, O_t, I_t';\; h_k)\right) \tag{8}$$

where $LL_0$ is the synthesized intermediate frame, IDWT is the inverse discrete wavelet transform computed level by level for the target intermediate frame, $h_k$ is the discrete threshold selected by the threshold classifier, and $\mathcal{N}_{WS}$ produces the multi-scale wavelet coefficient set $W_k$; Equation 8 can therefore be reduced to $LL_0 = \mathrm{IDWT}(W_k)$.
to achieve a mid-frame optical flow estimation network and a mid-frame image synthesis network, an initial mid-frame LL is obtained by synthesis of equation 8 0 . Referring to equation 9, the loss function of the training phase consists of three parts including the cross entropy loss L of the final initial intermediate frame and the target intermediate frame r Frequency domain reconstruction loss L of target intermediate frame f And network parameter regularization loss L c
L=L r +αL f+ βL c Equation 9
Wherein L is r Cross entropy loss, L, for target intermediate frames f Loss, L, of frequency domain reconstruction for target intermediate frames c Regularized loss α for network parameters, β being a weight parameter.
Referring to Equation 10, $L_r$ can be calculated through the bidirectional optical-flow loss between the synthesized intermediate frame $LL_0$ and the ground-truth target intermediate frame $I_t^{gt}$:

$$L_r = \rho\left(LL_0 - I_t^{gt}\right) + L_{cen} \tag{10}$$

where ρ is a penalty with hyperparameters, $\rho(x) = (x^2 + \varepsilon^2)^{\alpha}$, α = 0.5, ε = 10⁻³, and $L_{cen}$ is the loss term of the intermediate frame image synthesis network. For a high-resolution target frame, the intermediate frame image synthesis network calculates a residual image and then reconstructs the high-resolution image: convolution and transposed-convolution operations on the input image produce an amplified predicted residual image, an upsampling operation is applied to the low-resolution image, and the result is added to the predicted residual image, so that the high-resolution image generated by the intermediate frame image synthesis network is obtained as the target intermediate frame.
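A sketch of the reconstruction penalty in Equation 10, using the Charbonnier-style function ρ(x) = (x² + ε²)^α given in the text (α = 0.5, ε = 10⁻³); the term $L_{cen}$ is omitted here for brevity.

```python
# Robust per-pixel reconstruction penalty from Equation 10.
import torch

def rho(x, alpha=0.5, eps=1e-3):
    return (x ** 2 + eps ** 2) ** alpha

def reconstruction_loss(pred, target):
    """pred: synthesized frame LL_0; target: ground-truth intermediate frame."""
    return rho(pred - target).mean()
```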
To better reflect the model structure, the frequency-domain reconstruction loss of the perceptual target intermediate frame is given in Equation 11:

$$L_f = \sum_{j} \rho\left(\omega_j - \hat{\omega}_j\right) \tag{11}$$

where ρ is the penalty defined above, and $\omega_j$ and $\hat{\omega}_j$ are the corresponding wavelet coefficients of the target intermediate frame image from the prediction set Ω and the ground-truth set $\hat{\Omega}$. Finally, in order to reduce the computation and balance the selection of different compression threshold ratios, the loss function also introduces the network parameter regularization loss $L_c$, which can prevent overfitting during the training phase of the intermediate frame optical flow estimation network and the intermediate frame image synthesis network. Referring to Equation 12:

$$L_c = \frac{C\left(\mathcal{N}_{WS}(\,\cdot\,;\; h_k)\right)}{H \times W} \tag{12}$$

where C is a floating point operation counter, H and W are the height and width of the resolution of the originally input first image frame, $h_k$ is the discrete threshold selected by the threshold classifier, and $\mathcal{N}_{WS}$ is the multi-scale wavelet coefficient network.
The loss calculation of Equation 9 also involves the weight parameters α and β, where β controls the balance between accuracy and efficiency. At the start of training, the compression threshold ratio η is set to 0 and the weight parameters α, β are set to 0.01 and 0, respectively; as the compression threshold ratio η gradually increases, the weight parameters α, β may be set to 0.01 and 0.1, respectively. During training, the network parameters of the intermediate frame optical flow estimation network and the intermediate frame image synthesis network are adjusted until the loss function of Equation 9 reaches the convergence condition; the network parameters of both networks are then fixed, and the test phase begins.
The data sets used in the test phase include: the ATD-12K data set (a large-scale animation triplet data set containing 12000 richly labeled triplets, of which the training set contains 10000 animation frame triplets and the test set contains 2000 triplets), the Xiph-2K data set and the Xiph-4K data set, where the Xiph data sets are composed of 9 video segments from Xiph.org Video Test Media (derf's collection) and are used for dynamic frame interpolation of videos at different resolutions. The indices for evaluating each video frame interpolation method include: the parameter count Params (M), the floating point operation amount (FLOPs), the peak signal-to-noise ratio (PSNR, Peak Signal to Noise Ratio), and the structural similarity (SSIM, Structural Similarity). The video frame interpolation methods in the related art include: SepConv dynamic interpolation, DAIN dynamic interpolation (Depth-Aware Video Frame Interpolation), CAIN dynamic interpolation (Channel Attention Is All You Need for Video Frame Interpolation), AdaCoF+ dynamic interpolation (Adaptive Collaboration of Flows for Video Frame Interpolation), SoftSplat dynamic interpolation, BMBC dynamic interpolation (Bilateral Motion Estimation with Bilateral Cost Volume for Video Interpolation), CDFI dynamic interpolation (Compression-Driven Network Design for Frame Interpolation), ABME dynamic interpolation (Asymmetric Bilateral Motion Estimation for Video Frame Interpolation), and CAIN-SD dynamic interpolation. The parameter count of the intermediate frame determining method provided by the present application is 19.4M; the floating point operation amounts on the three data sets are 0.274, 1.480 and 1.428, respectively; the peak signal-to-noise ratios are 28.79, 36.32 and 33.61, respectively; and the structural similarities are 0.956, 0.965 and 0.945, respectively. Compared with the other video frame interpolation methods, the intermediate frame determining method provided by the present application effectively reduces the floating point computation at comparable evaluation indices, so that the method can be implemented on the terminal side.
Fig. 6 is a schematic diagram of the intermediate frame calculation accuracy of the intermediate frame determining method according to the embodiment of the present application: as the sparseness threshold increases, the floating point computation of the intermediate frame determining method of the present application decreases continuously. Specifically, as shown in fig. 6, the data sets and evaluation indices used in the test stage are: Vimeo90K PSNR, Vimeo90K SSIM, ATD12K PSNR, ATD12K SSIM, Xiph-2K PSNR, Xiph-2K SSIM, Xiph-4K PSNR and Xiph-4K SSIM, where the Vimeo90K data set comprises video triplets of three frames at 448×256 resolution. The abscissa of fig. 6 is the relative reduction in floating point computation and the ordinate is the loss in relative score of the video frame interpolation; each point in fig. 6 represents the result for a specific η value, with η increasing from left to right along each line. The larger η is, the smaller the floating point computation of the video frame interpolation and the lower the performance; for example, on Vimeo90K, when the floating point computation is reduced by 10%, the PSNR drops by about 0.7%.

On ATD12K, by contrast, the PSNR drops by only 0.4% while the computation is reduced by 40%. Therefore, while maintaining the same peak signal-to-noise ratio and structural similarity, the intermediate frame determining method provided by the present application can effectively reduce the floating point computation in the video frame interpolation process, so that terminal devices with weaker computing power can execute the intermediate frame determining method and realize dynamic video frame interpolation, without requiring a server with stronger computing power to construct the complete target video using the related-art interpolation methods and then transmit it to the terminal.
Referring to fig. 7, fig. 7 is a schematic diagram illustrating an effect of using the intermediate frame determining method according to an embodiment of the present invention. The first resolution of the game video is 720P. When the resolution selected by the user during the game video session is greater than or equal to 720P and the user terminal's network switches from broadband access to mobile-network access to the game video server, the information transmission rate fluctuates and the bandwidth becomes unstable. By dynamically adjusting the sparseness threshold, the intermediate frame determining method can raise the code rate of the video, ensuring that the user obtains a higher-resolution game video for viewing and thus a better visual experience.
Further, if playing is still not smooth after the code rate is adjusted, the frame rate can also be adjusted dynamically. Specifically, when the video coding strategy matched with the playing environment of the target video is determined to be raising the code rate, the playing fluency of the target video is detected; when a stutter in the playing of the target video is detected, the video coding strategy matched with the playing environment of the target video is determined to be raising the frame rate and the code rate of the video simultaneously. After the target frame rate and the target video code rate are determined, the number of target intermediate frames may be determined from them. For example, when the target frame rate is raised from 60 FPS to 180 FPS, the number of target intermediate frames is 120; for live video under network jitter, when the H.264-format target video code rate is raised from 120 kbps to 250 kbps, the number of target intermediate frames is 130; and when a live-viewing user, after switching terminals, needs to raise the target frame rate from 60 FPS to 180 FPS and the target video code rate from 120 kbps to 250 kbps, the number of target intermediate frames is determined to be 120 to meet the joint requirements of the target frame rate and the target video code rate.
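A small sketch of the frame-count bookkeeping in the example above: raising a 60 FPS stream to a 180 FPS target adds 180 − 60 = 120 intermediate frames per second (two per original interval). The helper and its assumption of an integer rate multiple are illustrative, not the patent's formula.

```python
# Intermediate frames needed per second to reach a target frame rate.
def intermediate_frames_per_second(source_fps: int, target_fps: int) -> int:
    assert target_fps % source_fps == 0, "sketch assumes an integer multiple"
    return target_fps - source_fps  # e.g. 180 - 60 = 120
```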
In some embodiments of the present application, when the playing fluency of the target video needs to be detected, a first image frame and a second image frame in the target video may be acquired and the difference image of the first image frame and the second image frame obtained; the difference image is converted into a matched gray image and the different pixel points included in the gray image are determined; and the playing fluency of the target video is detected according to the gray values of the different pixel points in the gray image. Specifically, the difference between the first image frame and the second image frame is positively correlated with the gray values in the gray image; that is, the larger the gray values in the gray image, the larger the difference between the first image frame and the second image frame, so it can be determined that no stutter occurs between the first image frame and the second image frame. Conversely, the smaller the gray values in the gray image, the smaller the difference between the two frames, and when the maximum gray value in the gray image is smaller than a preset threshold, it is determined that a stutter occurs between the first image frame and the second image frame. When processing a video that requires dynamic frame interpolation, the intermediate frame determining method provided by the present application detects whether the video playing process stutters directly on the terminal side according to the degree of difference between image frames, thereby improving the accuracy of stutter detection.
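A sketch of this stutter check, assuming OpenCV-style BGR frames: take the absolute difference image of two consecutive frames, convert it to grayscale, and flag a stutter when even the largest gray value stays below a preset threshold (i.e. the frames are nearly identical). The threshold value is an illustrative assumption.

```python
# Frame-difference stutter detection on the terminal side.
import cv2
import numpy as np

def is_stuttering(frame1: np.ndarray, frame2: np.ndarray, thresh: int = 10) -> bool:
    diff = cv2.absdiff(frame1, frame2)             # difference image
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)  # matched gray image
    return int(gray.max()) < thresh                # small max gray => stutter
```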
To better illustrate the working process of the intermediate frame determining method provided by the present application, take the video frame interpolation performed while raising the code rate of a live video as an example. Referring to fig. 8, fig. 8 is a schematic diagram showing an optional effect of the intermediate frame determining method provided by the embodiment of the present application. During video frame interpolation, video frame 3 (not shown) needs to be inserted between video frame 1 and video frame 2 of fig. 8. Because the video type in fig. 8 is live video, the motion amplitude of the target objects a, b and c in the image frames is small, and the sky in the background and the clothes of the target objects a, b and c are solid colors, so the floating point computation required for the calculation is small; in this case the preferred value of the sparseness threshold η of the sparse mask is 0.1015, which effectively reduces the floating point computation when the terminal calculates the target intermediate frame and increases the calculation speed of the video frame interpolation.
Referring to fig. 9, fig. 9 is an optional flowchart of an intermediate frame determining method according to an embodiment of the present application, including the following steps:
step 901: and acquiring two continuous image frames of the live video.
Step 902: and determining the number of the target intermediate frames as n according to the target frame rate and the target video code rate of the live video.
Step 903: and configuring a sparse threshold, and calculating a sparse mask matched with a decoder layer of the intermediate frame image synthesis network according to the sparse threshold.
At this time, the preferred value of the sparseness threshold η of the sparse mask is determined to be 0.1015 based on the confidence level of the sparseness threshold η.
Step 904: the optical flow parameters, the fusion weight parameters and the initial intermediate frames are calculated by the intermediate frame optical flow estimation network.
Step 905: extracting feature vectors through an intermediate frame image synthesis network, and adjusting the initial intermediate frames by using sparse masks, optical flow parameters, fusion weight parameters and the feature vectors to obtain n target intermediate frames.
Step 906: inserting n target intermediate frames between two continuous image frames of the live video to obtain a target live video, and sending the target live video to a live watching user.
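A sketch tying steps 901-906 together. `flow_net`, `synth_net` and `initial_intermediate_frame` stand for the components sketched earlier; their call signatures (including the timestamp argument t) are assumptions for illustration, not the patent's reference implementation.

```python
# End-to-end interpolation of n intermediate frames between two live frames.
def interpolate_live_pair(i0, i1, n, flow_net, synth_net, eta=0.1015):
    frames = [i0]
    for k in range(1, n + 1):
        t = k / (n + 1)                           # timestamp in (0, 1)
        f_t0, f_t1, o_t = flow_net(i0, i1, t)     # step 904
        init = initial_intermediate_frame(i0, i1, f_t0, f_t1, o_t)
        frames.append(synth_net(i0, i1, f_t0, f_t1, o_t, init, eta))  # step 905
    frames.append(i1)                             # step 906: n frames inserted
    return frames
```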
The invention has the following beneficial technical effects:
1) The embodiment of the invention obtains a first image frame and a second image frame in a video frame; encodes and decodes the first image frame and the second image frame through the intermediate frame optical flow estimation network of the dynamic frame interpolation model to obtain the first optical flow between the target intermediate frame and the first image frame, the second optical flow between the target intermediate frame and the second image frame, and the fusion weight parameter; extracts the first feature vector from the first image frame and the second image frame through the encoder layer of the intermediate frame image synthesis network; performs feature conversion on the first feature vector to obtain the second feature vector; and calculates the sparse mask matched with the decoder layer of the intermediate frame image synthesis network. Because the sparse mask is controllable, the calculation area of the target intermediate frame can be flexibly controlled, so that the floating point computation of the hardware device is flexibly adjusted.
2) The decoder layer of the intermediate frame image synthesis network adjusts the initial intermediate frame using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector to obtain the target intermediate frame. Thus, the computation of the target intermediate frame can be flexibly adjusted by the dynamic frame interpolation model comprising the intermediate frame optical flow estimation network and the intermediate frame image synthesis network, realizing fast video frame interpolation on the mobile terminal side, improving the processing speed of video frame interpolation, and giving the user a better frame-interpolation experience.
3) According to the video frame interpolation requirements in different use environments, different sparseness thresholds can be set and the floating point computation of frame interpolation adjusted flexibly, so that video frame interpolation can be completed either on the server side or on a terminal side with lower floating point computing power, reducing the floating point computation of the server side and saving video processing costs.
The above embodiments are merely examples of the present invention, and are not intended to limit the scope of the present invention, so any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (14)

1. A method of mid-frame determination, the method comprising:
acquiring a first image frame and a second image frame in a video frame, wherein the first image frame and the second image frame are continuous image frames;
determining a first optical flow between the first image frame and a target intermediate frame and a second optical flow between the second image frame and the target intermediate frame through an intermediate frame optical flow estimation network, and determining fusion weight parameters of the first optical flow and the second optical flow;
determining an initial intermediate frame of the first image frame and the second image frame based on the first optical flow, the second optical flow, and the fusion weight parameter;
performing feature extraction on the first image frame and the second image frame through an encoder layer of an intermediate frame image synthesis network to obtain a first feature vector;
determining a second feature vector using the first feature vector;
calculating a sparse mask matched to a decoder layer of the intermediate frame image synthesis network;
and adjusting the initial intermediate frame by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter and the second feature vector through a decoder layer of the intermediate frame image synthesis network to obtain a target intermediate frame.
2. The method of claim 1, wherein the determining an initial intermediate frame of the first image frame and the second image frame based on the first optical flow, the second optical flow, and the fusion weight parameter comprises:
calculating a first intermediate frame from the first image frame and the first optical flow;
calculating a second intermediate frame from the second image frame and the second optical flow;
and fusing the first intermediate frame and the second intermediate frame according to the fusion weight parameters to obtain an initial intermediate frame.
3. The method of claim 1, wherein the performing feature extraction on the first image frame and the second image frame by the encoder layer of the intermediate frame image synthesis network to obtain a first feature vector comprises:
extracting, by a first encoder network of the intermediate frame image synthesis network, feature vectors of the first image frame, wherein an encoder layer of the intermediate frame image synthesis network comprises: a first encoder network and a second encoder network, the first encoder network and the second encoder network having the same structure and different parameters;
extracting feature vectors of the second image frames through a first encoder network of the intermediate frame image synthesis network;
And combining the feature vector of the first image frame with the feature vector of the second image frame, and performing feature conversion through the first optical flow and the second optical flow to obtain the first feature vector.
4. A method according to claim 3, wherein determining a second feature vector using the first feature vector comprises:
extracting features of the first optical flow, the second optical flow, the fusion weight parameters and the initial intermediate frame through the second encoder network to obtain a third feature vector;
and carrying out feature fusion on the first feature vector and the third feature vector through the second encoder network to obtain the second feature vector.
5. The method of claim 1, wherein said calculating a sparse mask that matches a decoder layer of the intermediate frame image synthesis network comprises:
acquiring a low-frequency component, a high-frequency component in a horizontal direction, a high-frequency component in a vertical direction and a high-frequency component in a diagonal direction output by a decoder layer of the intermediate frame image synthesis network;
acquiring a sparseness threshold of the sparse mask;
And calculating a sparse mask matched with a decoder layer of the intermediate frame image synthesis network by using the sparse threshold, the low frequency component, the high frequency component in the horizontal direction, the high frequency component in the vertical direction and the high frequency component in the diagonal direction.
6. The method of claim 5, wherein the method further comprises:
acquiring an application scene of the intermediate frame;
and dynamically adjusting the sparseness threshold according to the application scene.
7. The method of claim 5, wherein the adjusting the initial intermediate frame by the decoder layer of the intermediate frame image synthesis network using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter, and the second feature vector to obtain the target intermediate frame comprises:
decoding the second feature vector through a decoder layer of the intermediate frame image synthesis network to obtain a high-frequency component in the horizontal direction, a high-frequency component in the vertical direction and a high-frequency component in the diagonal direction, wherein the decoder layer of the intermediate frame image synthesis network is a sparse convolution decoding network based on haar wavelet decomposition;
Performing wavelet inverse transformation processing on the low-frequency component, the high-frequency component in the horizontal direction, the high-frequency component in the vertical direction and the high-frequency component in the diagonal direction to obtain a wavelet inverse transformation result;
and adjusting the initial intermediate frame by using the wavelet inverse transformation result to obtain the target intermediate frame.
8. The method according to claim 1, wherein the method further comprises:
calculating at least one intermediate frame in the video from the first image frame and the second image frame;
and inserting the at least one intermediate frame into the intermediate positions of the first image frame and the second image frame to obtain a complete target video.
9. The method of claim 8, wherein the method further comprises:
when the video coding strategy matched with the playing environment of the target video is determined to be the code rate of the enhanced video, detecting the playing fluency of the target video;
when the situation that the playing of the target video is blocked is detected, determining that a video coding strategy matched with the playing environment of the target video is to simultaneously improve the frame rate and code rate coding of the video;
Determining a target frame rate and a target video code rate;
and determining the number of the target intermediate frames according to the target frame rate and the target video code rate.
10. The method of claim 9, wherein detecting the smoothness of the playing of the target video when it is determined that the video encoding strategy matching the playing environment of the target video is to promote video rate encoding, comprises:
when the video coding strategy matched with the playing environment of the target video is determined to be the code rate coding of the enhanced video, acquiring a first image frame and a second image frame in the target video;
acquiring difference images of the first image frame and the second image frame;
converting the difference image into a matched gray level image, and determining different pixel points included in the gray level image;
and detecting the playing fluency of the target video according to the gray values of different pixel points in the gray image.
11. An intermediate frame determination apparatus, the apparatus comprising:
an information transmission module for acquiring a first image frame and a second image frame in a video frame, wherein the first image frame and the second image frame are continuous image frames;
The information processing module is used for determining a first optical flow between the first image frame and a target intermediate frame and a second optical flow between the second image frame and the target intermediate frame through an intermediate frame optical flow estimation network, and determining a fusion weight parameter of the first optical flow and the second optical flow;
the information processing module is used for determining initial intermediate frames of the first image frame and the second image frame based on the first optical flow, the second optical flow and the fusion weight parameter;
the information processing module is used for extracting the characteristics of the first image frame and the second image frame through an encoder layer of an intermediate frame image synthesis network to obtain a first characteristic vector;
the information processing module is used for determining a second feature vector by utilizing the first feature vector through an encoder layer of the intermediate frame optical flow estimation network;
the information processing module is used for calculating a sparse mask matched with a decoder layer of the intermediate frame image synthesis network;
the information processing module is configured to adjust, by using the sparse mask, the first optical flow, the second optical flow, the fusion weight parameter, and the second feature vector, the initial intermediate frame through a decoder layer of the intermediate frame image synthesis network, to obtain a target intermediate frame.
12. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the intermediate frame determination method of any one of claims 1 to 10 when executing executable instructions stored in said memory.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the intermediate frame determination method of any one of claims 1 to 10.
14. A computer readable storage medium storing executable instructions which when executed by a processor implement the intermediate frame determination method of any one of claims 1 to 10.
CN202211723143.6A 2022-12-30 2022-12-30 Intermediate frame determining method, device, equipment, program product and medium Pending CN116962718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211723143.6A CN116962718A (en) 2022-12-30 2022-12-30 Intermediate frame determining method, device, equipment, program product and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211723143.6A CN116962718A (en) 2022-12-30 2022-12-30 Intermediate frame determining method, device, equipment, program product and medium

Publications (1)

Publication Number Publication Date
CN116962718A true CN116962718A (en) 2023-10-27

Family

ID=88460739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211723143.6A Pending CN116962718A (en) 2022-12-30 2022-12-30 Intermediate frame determining method, device, equipment, program product and medium

Country Status (1)

Country Link
CN (1) CN116962718A (en)

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
Kaplanyan et al. DeepFovea: Neural reconstruction for foveated rendering and video compression using learned statistics of natural videos
US10623775B1 (en) End-to-end video and image compression
US10880551B2 (en) Method and apparatus for applying deep learning techniques in video coding, restoration and video quality analysis (VQA)
Rao et al. A Survey of Video Enhancement Techniques.
US8639056B2 (en) Contrast enhancement
CN111709895A (en) Image blind deblurring method and system based on attention mechanism
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
Huang et al. A new hardware-efficient algorithm and reconfigurable architecture for image contrast enhancement
CN110852964A (en) Image bit enhancement method based on deep learning
Li et al. Example-based image super-resolution with class-specific predictors
CN112528830A (en) Lightweight CNN mask face pose classification method combined with transfer learning
CN112887728A (en) Electronic device, control method and system of electronic device
EP3298575B1 (en) Super resolution using fidelity transfer
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN116205820A (en) Image enhancement method, target identification method, device and medium
Yu et al. Luminance attentive networks for hdr image and panorama reconstruction
JP2023543520A (en) A method for handling chroma subsampling formats in picture coding based on machine learning
CN115035011A (en) Low-illumination image enhancement method for self-adaptive RetinexNet under fusion strategy
Shan et al. Animation design based on 3D visual communication technology
Moser et al. Diffusion Models, Image Super-Resolution And Everything: A Survey
CN111681192B (en) Bit depth enhancement method for generating countermeasure network based on residual image condition
WO2022000298A1 (en) Reinforcement learning based rate control
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
Hua et al. Low-light image enhancement based on joint generative adversarial network and image quality assessment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication