CN113365110B - Model training method, video frame interpolation method, device, equipment and storage medium


Info

Publication number
CN113365110B
CN113365110B
Authority
CN
China
Prior art keywords
optical flow
frame
optical
loss function
target
Prior art date
Legal status
Active
Application number
CN202110794887.6A
Other languages
Chinese (zh)
Other versions
CN113365110A
Inventor
郑贺
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110794887.6A priority Critical patent/CN113365110B/en
Publication of CN113365110A publication Critical patent/CN113365110A/en
Application granted granted Critical
Publication of CN113365110B publication Critical patent/CN113365110B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234381Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model training method, a video frame interpolation method, an apparatus, a device and a storage medium, relating to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning. The specific implementation scheme is as follows: determining, from a first reference frame and a second reference frame before and after a frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions; respectively inputting the image frames in the first image frame set and the second image frame set into an initial optical flow estimation network to obtain a first optical flow set and a second optical flow set; determining a target loss function based on the first optical flow set and the second optical flow set; and training the initial optical flow estimation network according to the target loss function. This implementation can interpolate frames for videos of different resolutions, improves the accuracy of optical flow estimation, and thereby improves the effect of video frame interpolation.

Description

Model training method, video frame interpolation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and specifically to methods, apparatuses, devices, and storage media for model training and video frame interpolation.
Background
Video frame interpolation is a classic task in video processing that aims to synthesize smoothly transitioning intermediate frames from the two frames before and after a given position in a video. Its application scenarios include: first, increasing the frame rate of video displayed on a device, so that the video appears clearer and smoother to the user; second, in video production and editing, assisting in achieving slow-motion effects, or adding intermediate frames between animation keyframes to reduce the labor cost of animation production; third, inter-frame compression of video, or providing auxiliary data for other computer vision tasks.
Disclosure of Invention
The present disclosure provides a model training method, a video frame interpolation method, an apparatus, a device, and a storage medium.
According to a first aspect, there is provided a model training method, comprising: determining, from a first reference frame and a second reference frame before and after a frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions; respectively inputting image frames in the first image frame set and the second image frame set into an initial optical flow estimation network to obtain a first optical flow set and a second optical flow set; determining a target loss function based on the first optical flow set and the second optical flow set; and training the initial optical flow estimation network according to the target loss function.
According to a second aspect, there is provided a video frame interpolation method, comprising: acquiring a target video; determining a first optical flow and a second optical flow between two adjacent video frames in the target video using the optical flow estimation network trained by the method described in the first aspect; and synthesizing an intermediate video frame of the two adjacent video frames according to the first optical flow and the second optical flow.
According to a third aspect, there is provided a model training apparatus, comprising: a resolution extension unit configured to determine, from a first reference frame and a second reference frame before and after a frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions; a first optical flow calculation unit configured to respectively input image frames in the first image frame set and the second image frame set into an initial optical flow estimation network to obtain a first optical flow set and a second optical flow set; a loss function determination unit configured to determine a target loss function based on the first optical flow set and the second optical flow set; and a model training unit configured to train the initial optical flow estimation network according to the target loss function.
According to a fourth aspect, there is provided a video frame interpolation apparatus, comprising: a video acquisition unit configured to acquire a target video; a second optical flow calculation unit configured to determine a first optical flow and a second optical flow between two adjacent video frames in the target video using the optical flow estimation network trained by the method described in the first aspect; and a video frame interpolation unit configured to synthesize an intermediate video frame of the two adjacent video frames according to the first optical flow and the second optical flow.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method as described in the first aspect or to perform the method as described in the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in the first aspect.
The technology of the present disclosure can interpolate frames for videos of different resolutions, improving the accuracy of optical flow estimation and thereby the effect of video frame interpolation.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a model training method according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a model training method according to the present disclosure;
FIG. 4 is a flow diagram of another embodiment of a model training method according to the present disclosure;
FIG. 5 is a flow diagram of one embodiment of a video frame interpolation method according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of a model training apparatus according to the present disclosure;
FIG. 7 is a block diagram of one embodiment of a video frame interpolation apparatus according to the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing the model training method and the video frame interpolation method according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the model training method, video interpolation method, model training apparatus, or video interpolation apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. Various communication client applications, such as a video playing application, a video processing application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, in-vehicle computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing support for the optical flow estimation network used on the terminal devices 101, 102, 103. The background server may train the initial optical flow estimation network using training samples of different resolutions to obtain a trained optical flow estimation network, and feed the trained optical flow estimation network back to the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the model training method provided by the embodiment of the present disclosure is generally executed by the server 105, and the video frame insertion method may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105. Accordingly, the model training device is generally disposed in the server 105, and the video frame insertion device may be disposed in the terminal devices 101, 102, and 103, or may be disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method of the embodiment comprises the following steps:
Step 201, determining, from a first reference frame and a second reference frame before and after the frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions.
In this embodiment, the execution subject may obtain a first reference frame and a second reference frame before and after the frame insertion position in the video into which frames are to be inserted. The execution subject may then perform various processing on the first reference frame and the second reference frame to determine extended frames of different resolutions corresponding to each of them, taking the first reference frame and its extended frames as the first image frame set, and the second reference frame and its extended frames as the second image frame set. Specifically, the execution subject may perform at least one downsampling operation and/or at least one upsampling operation on the first reference frame and the second reference frame to obtain the extended frames of the first reference frame and the extended frames of the second reference frame.
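As an illustration of this resolution extension step, the following is a minimal sketch in PyTorch; the library choice, the bilinear resampling mode, and the concrete scale factors are assumptions made for illustration, not details fixed by the disclosure:

```python
import torch
import torch.nn.functional as F

def build_image_frame_set(reference_frame: torch.Tensor,
                          scales=(0.5, 0.25)) -> list:
    """Return the reference frame plus extended frames at other resolutions.

    reference_frame has shape (N, C, H, W); scales < 1 downsample,
    scales > 1 upsample.
    """
    frames = [reference_frame]
    for s in scales:
        frames.append(F.interpolate(reference_frame, scale_factor=s,
                                    mode="bilinear", align_corners=False))
    return frames

# Applying this to both reference frames yields index-aligned sets: frames at
# the same index share the same resolution, as required for step 202 below.
first_set = build_image_frame_set(torch.rand(1, 3, 256, 448))
second_set = build_image_frame_set(torch.rand(1, 3, 256, 448))
```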
Step 202, inputting the image frames in the first image frame set and the second image frame set into the initial optical flow estimation network, respectively, to obtain a first optical flow set and a second optical flow set.
In this embodiment, after obtaining the first image frame set and the second image frame set, the execution subject may input each image frame into the initial optical flow estimation network to obtain the first optical flow set and the second optical flow set. Specifically, the execution subject may extract image frames of the same resolution from the first image frame set and the second image frame set as matched image frame pairs, and then input each matched image frame pair into the initial optical flow estimation network. Here, the initial optical flow estimation network may be a pre-trained optical flow estimation network obtained by various existing methods, for example a FlowNet-series or PWC-Net-series network; such a network may have good optical flow estimation performance for image frames of a particular resolution. After a matched image frame pair is input into the initial optical flow estimation network, a first optical flow from the first image frame to the second image frame and a second optical flow from the second image frame to the first image frame can be obtained, thereby yielding the first optical flow set and the second optical flow set.
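Continuing the sketch above, the pairing of same-resolution frames and the bidirectional flow computation might look as follows; `flow_net` is a placeholder for any initial optical flow estimation network (e.g. a FlowNet- or PWC-Net-style model) that maps an ordered frame pair to the optical flow from the first frame to the second:

```python
def compute_flow_sets(first_set, second_set, flow_net):
    """Compute bidirectional optical flows for each matched image frame pair."""
    first_flows, second_flows = [], []
    for frame1, frame2 in zip(first_set, second_set):  # same-resolution pairs
        first_flows.append(flow_net(frame1, frame2))   # flow: frame1 -> frame2
        second_flows.append(flow_net(frame2, frame1))  # flow: frame2 -> frame1
    return first_flows, second_flows
```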
Step 203, determining a target loss function based on the first optical flow set and the second optical flow set.
After obtaining the first optical flow set and the second optical flow set, the execution subject may perform various calculations on them to obtain the target loss function. Specifically, the execution subject may calculate the difference between any two first optical flows in the first optical flow set, denoted as a first difference, and add all the first differences to obtain a first sum. It may likewise calculate the difference between any two second optical flows in the second optical flow set, denoted as a second difference, and add all the second differences to obtain a second sum. Finally, the first sum and the second sum are weighted to obtain the target loss function. Alternatively, the execution subject may select one first optical flow from the first optical flow set as a reference and compute the differences between the other first optical flows and this reference; similarly, it may select one second optical flow from the second optical flow set as a reference and compute the differences between the other second optical flows and this reference. Finally, the two sets of differences are weighted to obtain the target loss function.
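The pairwise-difference variant can be sketched as below, continuing the code above. Flows estimated at different resolutions differ in spatial size and displacement magnitude, so this sketch rescales every flow to a common base resolution before differencing; that normalization and the L1 distance are illustrative assumptions, since the disclosure does not fix these details:

```python
def rescale_flow(flow, size):
    """Resize a flow field (N, 2, h, w) to `size`, rescaling displacements."""
    h, w = flow.shape[-2:]
    resized = F.interpolate(flow, size=size, mode="bilinear",
                            align_corners=False)
    factors = torch.tensor([size[1] / w, size[0] / h],
                           device=flow.device).view(1, 2, 1, 1)
    return resized * factors  # x by the width ratio, y by the height ratio

def pairwise_consistency_loss(flows, base_size):
    """Sum of differences between every pair of flows in one flow set."""
    flows = [rescale_flow(f, base_size) for f in flows]
    total = 0.0
    for i in range(len(flows)):
        for j in range(i + 1, len(flows)):
            total = total + (flows[i] - flows[j]).abs().mean()
    return total

def target_loss(first_flows, second_flows, base_size, w1=1.0, w2=1.0):
    # weighted combination of the first sum and the second sum
    return (w1 * pairwise_consistency_loss(first_flows, base_size)
            + w2 * pairwise_consistency_loss(second_flows, base_size))
```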
Step 204, training the initial optical flow estimation network according to the target loss function.
After determining the target loss function, the execution subject may iteratively optimize the parameters of the initial optical flow estimation network according to the value of the target loss function, thereby fine-tuning the network so that the fine-tuned optical flow estimation network can interpolate frames for videos of different resolutions.
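Put together, step 204 reduces to an ordinary fine-tuning loop over the sketches above; the Adam optimizer, the small learning rate, and `training_pairs` (a hypothetical iterable of reference-frame pairs) are illustrative assumptions:

```python
# Small learning rate because the pre-trained network is fine-tuned,
# not trained from scratch.
optimizer = torch.optim.Adam(flow_net.parameters(), lr=1e-5)

for first_ref, second_ref in training_pairs:
    first_set = build_image_frame_set(first_ref)
    second_set = build_image_frame_set(second_ref)
    first_flows, second_flows = compute_flow_sets(first_set, second_set,
                                                  flow_net)
    loss = target_loss(first_flows, second_flows,
                       base_size=first_ref.shape[-2:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```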
According to the model training method provided by the embodiments of the present disclosure, the initial optical flow estimation network can be fine-tuned using image frames of different resolutions, so that the fine-tuned optical flow estimation network adapts to images of different resolutions, which improves the accuracy of optical flow estimation and yields a better frame interpolation effect.
With continued reference to FIG. 4, a flow 300 of another embodiment of a model training method according to the present disclosure is shown. As shown in FIG. 4, the method of the present embodiment may include the following steps:
Step 301, performing at least one downsampling on a first reference frame and a second reference frame before and after the frame insertion position to obtain extended frames corresponding to the first reference frame and extended frames corresponding to the second reference frame; determining a first image frame set according to the first reference frame and its corresponding extended frames; and determining a second image frame set according to the second reference frame and its corresponding extended frames.
In this embodiment, the execution subject may downsample the first reference frame and the second reference frame at least once to obtain the extended frames corresponding to each. It should be noted that the first reference frame and the second reference frame have the same resolution, so the corresponding extended frames likewise match in resolution. The execution subject may take the first reference frame and its extended frames as the first image frame set and, similarly, the second reference frame and its extended frames as the second image frame set. In this way, the image frames in the first image frame set correspond one-to-one with those in the second image frame set, and each corresponding pair has the same resolution.
Step 302, inputting a first image frame and a second image frame with the same resolution into an initial optical flow estimation network to obtain a first optical flow between the first image frame and the second image frame and a second optical flow between the second image frame and the first image frame; obtaining a first optical flow set according to each first optical flow; and obtaining a second optical flow set according to the second optical flows.
In this embodiment, the execution subject may input a first image frame and a second image frame having the same resolution into the initial optical flow estimation network, thereby obtaining a first optical flow between the first image frame and the second image frame, and a second optical flow between the second image frame and the first image frame. The execution subject may take all the first optical flows as the first optical flow set and all the second optical flows as the second optical flow set.
Step 303, determining a target loss function based on any two first optical flows in the first set of optical flows and any two second optical flows in the second set of optical flows.
In this embodiment, the execution subject may calculate the difference between any two first optical flows in the first optical flow set, denoted as a first difference, and add all the first differences to obtain a first sum. It may likewise calculate the difference between any two second optical flows in the second optical flow set, denoted as a second difference, and add all the second differences to obtain a second sum. Finally, the first sum and the second sum are weighted to obtain the target loss function.
In some optional implementations of this embodiment, the execution subject determines the target loss function through the following steps:
step 3031, determining the target resolution from different resolutions according to the initial optical flow estimation network.
In this implementation, the execution subject may first determine the target resolution from the different resolutions according to the initial optical flow estimation network. Specifically, the execution subject may input the reference frame pairs of different resolutions into the initial optical flow estimation network to obtain a first optical flow and a second optical flow for each pair, perform a warping (warp) operation based on the first optical flow and the second optical flow, and then synthesize an intermediate frame. The intermediate frames corresponding to the different resolutions are then evaluated, and the resolution whose intermediate frame receives the best image evaluation result is taken as the target resolution.
Alternatively, the execution subject may determine the target resolution by: acquiring a training image set of an initial optical flow estimation network; determining a resolution similarity between each image frame in the first image frame set or the second image frame set and a training image set; and determining the target resolution from different resolutions according to the similarity of the resolutions.
In this implementation, the execution subject may first obtain the training image set of the initial optical flow estimation network, for example from the source where the network was published. It can be appreciated that the initial optical flow estimation network computes optical flow most accurately at the resolution of its training images. After acquiring the training image set, the execution subject may calculate the resolution similarity between each image frame in the first image frame set or the second image frame set and the training image set, and take the resolution corresponding to the maximum resolution similarity as the target resolution.
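One concrete reading of "resolution similarity" is sketched below: each candidate resolution is compared with the resolution of the network's training images, and the closest one wins. The pixel-count metric is an assumption made for illustration; the disclosure does not fix the similarity measure:

```python
def select_target_resolution(candidate_sizes, training_size):
    """Pick the candidate (H, W) closest to the training-image resolution."""
    def similarity(size):
        # negative absolute difference in pixel count: larger is more similar
        return -abs(size[0] * size[1] - training_size[0] * training_size[1])
    return max(candidate_sizes, key=similarity)

# With training images of 448x1024, the middle candidate wins:
# select_target_resolution([(1080, 1920), (540, 960), (270, 480)], (448, 1024))
# -> (540, 960)
```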
Step 3032, determining a first loss function according to the target first optical flow corresponding to the target resolution and other first optical flows except the target first optical flow in the first optical flow set.
In this implementation, the execution subject may take the first optical flow corresponding to the target resolution as the target first optical flow. It can be appreciated that the target first optical flow has the highest accuracy, so the execution subject may use it as the standard toward which the other first optical flows are corrected. Accordingly, the execution subject may calculate the differences between the target first optical flow and each of the other first optical flows in the first optical flow set, and weight the resulting differences to obtain the first loss function.
Step 3033, determining a second loss function according to the target second optical flow corresponding to the target resolution and other second optical flows except the target second optical flow in the second optical flow set.
Similarly, the execution subject may take the second optical flow corresponding to the target resolution in the second optical flow set as the target second optical flow, calculate the differences between the target second optical flow and each of the other second optical flows in the second optical flow set, and weight the resulting differences to obtain the second loss function.
Step 3034, a target loss function is determined based on the first loss function and the second loss function.
The execution subject may weight the first loss function and the second loss function to obtain a target loss function.
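Steps 3032 to 3034 can be sketched as a deviation loss that pulls each non-target flow toward the target flow, reusing `rescale_flow` from the earlier sketch; the per-term weights, the L1 distance, and the assumption that the target flow is already at the target resolution are all illustrative:

```python
def deviation_loss(flows, target_index, target_size, weights=None):
    """Weighted differences between each non-target flow and the target flow."""
    target = flows[target_index]  # assumed to be at target_size already
    others = [f for k, f in enumerate(flows) if k != target_index]
    weights = weights or [1.0] * len(others)
    return sum(w * (rescale_flow(f, target_size) - target).abs().mean()
               for w, f in zip(weights, others))

def target_loss_from_target_flow(first_flows, second_flows, target_index,
                                 target_size, alpha=1.0, beta=1.0):
    first_loss = deviation_loss(first_flows, target_index, target_size)    # 3032
    second_loss = deviation_loss(second_flows, target_index, target_size)  # 3033
    return alpha * first_loss + beta * second_loss                         # 3034
```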
Step 304, generating a synthesized frame corresponding to each resolution from the first optical flow and the second optical flow having the same resolution in the first optical flow set and the second optical flow set; determining a third loss function according to each synthesized frame and the truth-value frame corresponding to the frame insertion position; and determining a target loss function based on the third loss function.
In this embodiment, the execution subject may generate a synthesized frame for each resolution from the first optical flow and the second optical flow corresponding to that resolution, and determine the third loss function according to each synthesized frame and the truth-value frame (ground truth) corresponding to the frame insertion position. In some embodiments, the truth-value frame may be annotated by a user or generated. It should be noted that there may be multiple truth-value frames for the frame insertion position, with different truth-value frames having different resolutions. The third loss function can be determined by taking the difference between each synthesized frame and the truth-value frame of the corresponding resolution. The execution subject may then determine the target loss function based on the third loss function.
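A sketch of the third loss follows. `warp` denotes the backward-warping operation (a concrete sketch of it appears in the video frame interpolation section below), `synthesize` is a placeholder for the frame synthesis step, whose exact form is not fixed here, and halving the flows to target the temporal midpoint is a common convention adopted only for illustration:

```python
def third_loss(first_set, second_set, first_flows, second_flows,
               truth_frames, synthesize):
    """L1 distance between each synthesized frame and the truth-value frame
    of the corresponding resolution."""
    loss = 0.0
    for f1, f2, flow12, flow21, gt in zip(first_set, second_set,
                                          first_flows, second_flows,
                                          truth_frames):
        warped1 = warp(f1, 0.5 * flow12)  # first frame warped toward midpoint
        warped2 = warp(f2, 0.5 * flow21)  # second frame warped toward midpoint
        synthetic = synthesize(warped1, warped2)
        loss = loss + (synthetic - gt).abs().mean()
    return loss
```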
In some optional implementations of this embodiment, the execution subject may also generate the target loss function jointly from the first loss function, the second loss function, and the third loss function described above, adding them with different weights.
Step 305, training an initial optical flow estimation network according to the target loss function.
The model training method provided by the embodiment of the disclosure can perform frame interpolation on videos with different resolutions, improve the accuracy of optical flow estimation, and also improve the frame interpolation effect.
Referring to FIG. 5, a flow 400 of one embodiment of a video frame interpolation method according to the present disclosure is shown. As shown in FIG. 5, the video frame interpolation method of the present embodiment may include the following steps:
Step 401, acquiring a target video.
In this embodiment, the execution subject may acquire a target video, which may be a video into which frames are to be inserted.
Step 402, determining a first optical flow and a second optical flow between two adjacent video frames according to the two adjacent video frames in the target video and the optical flow estimation network.
In this embodiment, the execution subject may input two adjacent video frames in the target video as reference frames into the trained optical flow estimation network to obtain the first optical flow and the second optical flow. Here, the optical flow estimation network is trained by the method described in the embodiment of FIG. 2 or FIG. 4. It is understood that the execution subject may interpolate between every two adjacent video frames in the target video, or select multiple pairs of video frames from the target video.
Step 403, synthesizing an intermediate video frame of the two adjacent video frames according to the first optical flow and the second optical flow.
After obtaining the first optical flow and the second optical flow, the execution subject may perform a warping (warp) operation, and then input the warped video frames into a pre-trained synthesis network to obtain the intermediate video frame, thereby realizing frame interpolation on the target video.
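The warp operation itself can be sketched with PyTorch's grid_sample as follows; the flow is taken to be in pixel units, and the border padding mode is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (N, C, H, W) by `flow` (N, 2, H, W), in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device),
                            indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,  # normalize to [-1, 1]
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```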
The video frame interpolation method provided by the above embodiment of the disclosure can utilize the trained optical flow estimation network to calculate the first optical flow and the second optical flow between the video frames more accurately, so that the synthesized video frames are more accurate.
With continued reference to FIG. 3, a schematic diagram of one application scenario of the model training method and the video frame interpolation method according to the present disclosure is shown. In the application scenario of FIG. 3, the server 501 first downsamples the two video frames before and after the frame insertion position multiple times to obtain a first image frame set and a second image frame set, whose image frames correspond one-to-one with matching resolutions. The server 501 may then input the first image frame and the second image frame of the same resolution into the initial optical flow estimation network, determine the target loss function using the resulting first optical flow set and second optical flow set, and fine-tune the initial optical flow estimation network with the target loss function to obtain the trained optical flow estimation network. When a user selects high-quality video playback through the user terminal 502, the user terminal 502 sends a video playback request to the server 501. The server 501 interpolates frames into the video using the trained optical flow estimation network, compresses the interpolated video, and sends it to the user terminal 502 for playback.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a model training apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 6, the model training apparatus 600 of the present embodiment includes: a resolution extension unit 601, a first optical flow calculation unit 602, a loss function determination unit 603, and a model training unit 604.
The resolution extension unit 601 is configured to determine, from a first reference frame and a second reference frame before and after the frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions.
A first optical flow calculation unit 602 configured to input the image frames in the first image frame set and the second image frame set into the initial optical flow estimation network, respectively, resulting in a first optical flow set and a second optical flow set.
A loss function determination unit 603 configured to determine a target loss function based on the first set of optical flows and the second set of optical flows.
A model training unit 604 configured to train an initial optical flow estimation network according to the target loss function.
In some optional implementations of this embodiment, the loss function determining unit 603 may be further configured to: an objective loss function is determined based on any two first optical flows in the first set of optical flows and any two second optical flows in the second set of optical flows.
In some optional implementations of this embodiment, the loss function determining unit 603 may be further configured to: determining a target resolution from different resolutions according to the initial optical flow estimation network; determining a first loss function according to a target first optical flow corresponding to the target resolution and other first optical flows except the target first optical flow in the first optical flow set; determining a second loss function according to a target second optical flow corresponding to the target resolution and other second optical flows except the target second optical flow in the second optical flow set; based on the first loss function and the second loss function, a target loss function is determined.
In some optional implementations of this embodiment, the loss function determining unit 603 may be further configured to: acquiring a training image set of an initial optical flow estimation network; determining a resolution similarity between each image frame in the first image frame set or the second image frame set and the training image set; and determining the target resolution from different resolutions according to the similarity of the resolutions.
In some optional implementations of this embodiment, the loss function determining unit 603 may be further configured to: generating a synthesized frame corresponding to each resolution according to the first optical flow and the second optical flow with the same resolution in the first optical flow set and the second optical flow set; determining a third loss function according to the truth value frame corresponding to each composite frame and the frame insertion position; based on the third loss function, a target loss function is determined.
In some optional implementations of this embodiment, the resolution extension unit 601 may be further configured to: at least one down-sampling is carried out on the first reference frame and the second reference frame to obtain an extended frame corresponding to the first reference frame and an extended frame corresponding to the second reference frame; determining a first image frame set according to a first reference frame and a corresponding extended frame; a second set of image frames is determined from the second reference frame and the corresponding extended frame.
In some optional implementations of the present embodiment, each image frame in the first image frame set corresponds to each second image frame in the second image frame set one to one, and the resolutions of the corresponding two image frames are the same. The first optical flow calculation unit 602 may be further configured to: inputting a first image frame and a second image frame with the same resolution into an initial optical flow estimation network to obtain a first optical flow between the first image frame and the second image frame and a second optical flow between the second image frame and the first image frame; obtaining a first optical flow set according to each first optical flow; and obtaining a second optical flow set according to the second optical flows.
It should be understood that units 601 to 604 recited in the model training apparatus 600 correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the model training method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a video frame interpolation apparatus, which corresponds to the method embodiment shown in FIG. 5, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the video frame interpolation apparatus 700 of the present embodiment includes: a video acquisition unit 701, a second optical flow calculation unit 702, and a video interpolation unit 703.
A video acquisition unit 701 configured to acquire a target video.
The second optical flow calculation unit 702 is configured to determine a first optical flow and a second optical flow between two adjacent video frames in the target video according to the two adjacent video frames and an optical flow estimation network trained by the method described in the embodiment of FIG. 2 or FIG. 4.
A video interpolation unit 703 configured to synthesize an intermediate video frame of two adjacent video frames according to the first optical flow and the second optical flow.
It should be understood that the units 701 to 703 recited in the video frame interpolation apparatus 700 correspond to respective steps in the method described with reference to fig. 4. Thus, the operations and features described above for the video frame interpolation method are also applicable to the apparatus 700 and the units included therein, and are not described herein again.
In the technical solution of the present disclosure, the acquisition, storage, and application of users' personal information all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 8 illustrates a block diagram of an electronic device 800 that performs the model training method and the video frame interpolation method according to embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a processor 801 that may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a memory 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic apparatus 800 can also be stored. The processor 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An I/O interface (input/output interface) 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a memory 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The processor 801 may be various general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 801 performs the various methods and processes described above, such as the model training method and the video frame interpolation method. For example, in some embodiments, the model training method and the video frame interpolation method may be implemented as a computer software program tangibly embodied in a machine-readable storage medium, such as the memory 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the processor 801, one or more steps of the model training method and the video frame interpolation method described above may be performed. Alternatively, in other embodiments, the processor 801 may be configured to perform the model training method and the video frame interpolation method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code described above may be packaged as a computer program product. The program code or computer program product may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor 801, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable storage medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (16)

1. A model training method, comprising:
determining, by using a first reference frame and a second reference frame before and after a frame insertion position, a first image frame set and a second image frame set corresponding to different resolutions;
inputting image frames in the first image frame set and the second image frame set into an initial optical flow estimation network respectively to obtain a first optical flow set and a second optical flow set;
determining a target loss function based on the first set of optical flows and the second set of optical flows;
training the initial optical flow estimation network according to the target loss function;
wherein said determining a target loss function based on said first set of optical flows and said second set of optical flows comprises: determining a target loss function based on any two first optical flows in the first set of optical flows and any two second optical flows in the second set of optical flows;
wherein the determining a target loss function based on any two first optical flows in the first set of optical flows and any two second optical flows in the second set of optical flows comprises:
respectively calculating the difference value of any two first optical flows in the first optical flow set, recording each as a first difference value, and adding all the first difference values to obtain a first sum value;
respectively calculating the difference value of any two second optical flows in the second optical flow set, recording each as a second difference value, and adding all the second difference values to obtain a second sum value;
and weighting the first sum value and the second sum value to obtain the target loss function.
2. The method of claim 1, wherein said determining a target loss function from any two first optical flows in the first set of optical flows and any two second optical flows in the second set of optical flows comprises:
determining a target resolution from the different resolutions according to the initial optical flow estimation network;
determining a first loss function according to a target first optical flow corresponding to the target resolution and other first optical flows except the target first optical flow in the first optical flow set;
determining a second loss function according to a target second optical flow corresponding to the target resolution and other second optical flows except the target second optical flow in the second optical flow set;
determining the target loss function based on the first loss function and the second loss function.
3. The method of claim 2, wherein said determining a target resolution from said different resolutions according to said initial optical flow estimation network comprises:
acquiring a training image set of the initial optical flow estimation network;
determining a resolution similarity between each image frame of the first set of image frames or the second set of image frames and the training set of images;
and determining the target resolution from the different resolutions according to the similarity of the resolutions.
4. The method of any of claims 1-3, wherein determining the target loss function based on the first set of optical flows and the second set of optical flows comprises:
generating a synthesized frame corresponding to each resolution according to the first optical flow and the second optical flow with the same resolution in the first optical flow set and the second optical flow set;
determining a third loss function according to each synthesized frame and a truth-value frame corresponding to the frame insertion position;
determining the target loss function based on the third loss function.
5. The method of any of claims 1 to 4, wherein the determining, from the first reference frame and the second reference frame before and after the frame insertion position, the first image frame set and the second image frame set corresponding to different resolutions comprises:
performing downsampling on the first reference frame and the second reference frame at least once to obtain an extended frame corresponding to the first reference frame and an extended frame corresponding to the second reference frame;
determining a first image frame set according to the first reference frame and the corresponding extended frame;
and determining a second image frame set according to the second reference frame and the corresponding extended frame.
6. The method according to any one of claims 1 to 5, wherein each image frame in the first image frame set corresponds one-to-one with each second image frame in the second image frame set, and the two corresponding image frames have the same resolution; and
the inputting the image frames in the first image frame set and the second image frame set into an initial optical flow estimation network to obtain a first optical flow set and a second optical flow set, respectively, including:
inputting a first image frame and a second image frame with the same resolution into the initial optical flow estimation network to obtain a first optical flow between the first image frame and the second image frame and a second optical flow between the second image frame and the first image frame;
obtaining the first optical flow set according to each first optical flow;
and obtaining the second optical flow set according to the second optical flows.
7. A video frame interpolation method, comprising:
acquiring a target video;
determining a first optical flow and a second optical flow between two adjacent video frames in the target video according to the two adjacent video frames and an optical flow estimation network obtained by training using the method according to any one of claims 1 to 6;
synthesizing an intermediate video frame of the two adjacent video frames according to the first optical flow and the second optical flow.
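A sketch of the claim-7 inference loop over a decoded video, assuming frames are tensors and that `flow_net` and `synthesize` follow the interfaces sketched above; all names are illustrative.

```python
def interpolate_video(frames, flow_net, synthesize):
    out = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow_01 = flow_net(prev, nxt)  # first optical flow
        flow_10 = flow_net(nxt, prev)  # second optical flow
        out.extend([prev, synthesize(prev, nxt, flow_01, flow_10)])
    out.append(frames[-1])
    return out  # roughly doubles the frame rate
```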
8. A model training apparatus comprising:
a resolution expansion unit configured to determine that a first reference frame and a second reference frame before and after the frame insertion position correspond to a first image frame set and a second image frame set with different resolutions;
a first optical flow calculation unit configured to input image frames in the first image frame set and the second image frame set into an initial optical flow estimation network respectively, resulting in a first optical flow set and a second optical flow set;
a loss function determination unit configured to determine a target loss function based on the first set of optical flows and the second set of optical flows;
a model training unit configured to train the initial optical flow estimation network according to the target loss function;
wherein the first optical flow calculation unit is further configured to: determine a target loss function based on any two first optical flows in the first optical flow set and any two second optical flows in the second optical flow set;
wherein the determining a target loss function based on any two first optical flows in the first optical flow set and any two second optical flows in the second optical flow set comprises:
calculating the difference between every two first optical flows in the first optical flow set, recording each as a first difference, and adding all the first differences to obtain a first sum;
calculating the difference between every two second optical flows in the second optical flow set, recording each as a second difference, and adding all the second differences to obtain a second sum;
and weighting the first sum and the second sum to obtain the target loss function.
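A direct transcription of the pairwise-difference loss described above, assuming the flows in each set have already been resized to a common shape so they can be subtracted; the weights are illustrative.

```python
from itertools import combinations

def pairwise_flow_loss(first_flows, second_flows, w1=0.5, w2=0.5):
    first_sum = sum((a - b).abs().mean()
                    for a, b in combinations(first_flows, 2))
    second_sum = sum((a - b).abs().mean()
                     for a, b in combinations(second_flows, 2))
    return w1 * first_sum + w2 * second_sum
```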
9. The apparatus of claim 8, wherein the loss function determination unit is further configured to:
determining a target resolution from the different resolutions according to the initial optical flow estimation network;
determining a first loss function according to a target first optical flow corresponding to the target resolution and the other first optical flows in the first optical flow set except the target first optical flow;
determining a second loss function according to a target second optical flow corresponding to the target resolution and the other second optical flows in the second optical flow set except the target second optical flow;
determining the target loss function based on the first loss function and the second loss function.
10. The apparatus of claim 9, wherein the loss function determination unit is further configured to:
acquiring a training image set of the initial optical flow estimation network;
determining a resolution similarity between each image frame of the first image frame set or the second image frame set and the training image set;
and determining the target resolution from the different resolutions according to the resolution similarities.
11. The apparatus of any of claims 8-10, wherein the loss function determination unit is further configured to:
generating a synthesized frame corresponding to each resolution according to the first optical flow and the second optical flow with the same resolution in the first optical flow set and the second optical flow set;
determining a third loss function according to each synthesized frame and the ground-truth frame corresponding to the frame insertion position;
determining the target loss function based on the third loss function.
12. The apparatus of any of claims 9-11, wherein the resolution expansion unit is further configured to:
performing downsampling on the first reference frame and the second reference frame at least once to obtain an extended frame corresponding to the first reference frame and an extended frame corresponding to the second reference frame;
determining a first image frame set according to the first reference frame and the corresponding extended frame;
and determining a second image frame set according to the second reference frame and the corresponding extended frame.
13. The apparatus according to any one of claims 8 to 12, wherein each image frame in the first image frame set corresponds one-to-one to a second image frame in the second image frame set, and the two corresponding image frames have the same resolution; and
the first optical flow calculation unit is further configured to:
inputting a first image frame and a second image frame with the same resolution into the initial optical flow estimation network to obtain a first optical flow between the first image frame and the second image frame and a second optical flow between the second image frame and the first image frame;
obtaining the first optical flow set according to each first optical flow;
and obtaining the second optical flow set according to each second optical flow.
14. A video frame interpolation apparatus, comprising:
a video acquisition unit configured to acquire a target video;
a second optical flow calculation unit configured to determine a first optical flow and a second optical flow between two adjacent video frames in the target video according to the two adjacent video frames and an optical flow estimation network trained by the method of any one of claims 1 to 6;
a video interpolation unit configured to synthesize an intermediate video frame of the two adjacent video frames according to the first optical flow and the second optical flow.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or to perform the method of claim 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6 or to perform the method of claim 7.
CN202110794887.6A 2021-07-14 2021-07-14 Model training method, video frame interpolation method, device, equipment and storage medium Active CN113365110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794887.6A CN113365110B (en) 2021-07-14 2021-07-14 Model training method, video frame interpolation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113365110A CN113365110A (en) 2021-09-07
CN113365110B (en) 2023-01-31

Family

ID=77539399

Country Status (1)

Country Link
CN (1) CN113365110B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114007135B (en) * 2021-10-29 2023-04-18 广州华多网络科技有限公司 Video frame insertion method and device, equipment, medium and product thereof
CN114339409B (en) * 2021-12-09 2023-06-20 腾讯科技(上海)有限公司 Video processing method, device, computer equipment and storage medium
CN116684662A (en) * 2022-02-22 2023-09-01 北京字跳网络技术有限公司 Video processing method, device, equipment and medium
CN114640885B (en) * 2022-02-24 2023-12-22 影石创新科技股份有限公司 Video frame inserting method, training device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109905624B (en) * 2019-03-01 2020-10-16 北京大学深圳研究生院 Video frame interpolation method, device and equipment
CN111898701B (en) * 2020-08-13 2023-07-25 网易(杭州)网络有限公司 Model training, frame image generation and frame insertion methods, devices, equipment and media
CN112184779A (en) * 2020-09-17 2021-01-05 无锡安科迪智能技术有限公司 Method and device for processing interpolation image
CN112954454B (en) * 2021-02-08 2023-09-05 北京奇艺世纪科技有限公司 Video frame generation method and device
CN112995715B (en) * 2021-04-20 2021-09-03 腾讯科技(深圳)有限公司 Video frame insertion processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant