CN116156218A - Method and device for determining a video frame interpolation model, and video frame interpolation method and device - Google Patents

Method and device for determining a video frame interpolation model, and video frame interpolation method and device

Info

Publication number
CN116156218A
Authority
CN
China
Prior art keywords
network
video
module
model
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310149226.7A
Other languages
Chinese (zh)
Inventor
邢恩旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Boguan Information Technology Co Ltd
Original Assignee
Guangzhou Boguan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Boguan Information Technology Co Ltd filed Critical Guangzhou Boguan Information Technology Co Ltd
Priority to CN202310149226.7A priority Critical patent/CN116156218A/en
Publication of CN116156218A publication Critical patent/CN116156218A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234381 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266 Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662 Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The disclosure relates to the technical field of video processing, and provides a method and a device for determining a video frame interpolation model, a video frame interpolation method and device, a medium, and an electronic device. The method for determining the video frame interpolation model includes: acquiring a training data set, and training an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model; and determining a video frame interpolation model according to the fusion network modules in the target pyramid network model. The initial pyramid network model includes a plurality of fusion network modules and a plurality of synthesis modules; each synthesis module fuses the intermediate frame predicted by the optical flow network in the corresponding fusion network module with the intermediate frame predicted by the synthesis network, so as to obtain a target predicted intermediate frame; and the training loss function is determined according to the target predicted intermediate frame of each fusion network module and the intermediate frame label value corresponding to that fusion network module. The method and device can improve both the accuracy and the efficiency of frame interpolation.

Description

Method and device for determining a video frame interpolation model, and video frame interpolation method and device
Technical Field
The disclosure relates to the technical field of video processing, and in particular to a method for determining a video frame interpolation model, a device for determining a video frame interpolation model, a video frame interpolation method, a video frame interpolation device, a computer-readable storage medium, and an electronic device.
Background
The frame rate determines the fluency of a video and strongly influences the viewer's visual experience. Video frame interpolation technology can estimate intermediate frames from the preceding and following frames of a video, thereby increasing the frame rate of the video.
In the related art, intermediate frame estimation may be performed by a deep learning network. However, large neural networks, although accurate, are inefficient and difficult to apply to real-time scenarios such as live streaming, while small neural networks compute quickly but interpolate poorly in videos containing nonlinearly moving objects.
Therefore, a video frame interpolation model and a video frame interpolation method with both high interpolation accuracy and high interpolation efficiency are needed.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosure aims to provide a method and a device for determining a video frame interpolation model, a video frame interpolation method and device, a computer-readable storage medium, and an electronic device, so as to solve, at least to a certain extent, the problems of low accuracy and low efficiency in video frame interpolation.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a method for determining a video frame interpolation model, the video frame interpolation model including a pyramid network model that includes a plurality of fusion network modules, any fusion network module being determined according to an optical flow network and a synthesis network, and the input data of the synthesis network being determined according to the optical flow network. The method includes: acquiring a training data set, and training an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model; and determining a video frame interpolation model according to the fusion network modules in the target pyramid network model, where the video frame interpolation model is used to insert an intermediate frame between adjacent video frames of a video to be processed. The initial pyramid network model includes a plurality of synthesis modules corresponding to the plurality of fusion network modules in the initial pyramid network model; each synthesis module is used to fuse a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network, to obtain the target predicted intermediate frame of that fusion network module; and the training loss function is determined according to the target predicted intermediate frame of each fusion network module and the intermediate frame label value corresponding to that fusion network module.
According to a second aspect of the present disclosure, there is provided a video frame interpolation method, including: acquiring original adjacent video frames of a video to be processed, and adjusting the size of the original adjacent video frames to obtain target adjacent video frames; inputting the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames; and inserting the intermediate video frame between the original adjacent video frames so as to perform frame interpolation on the video to be processed; where the video frame interpolation model is determined according to the method of the first aspect.
According to a third aspect of the present disclosure, there is provided a device for determining a video frame interpolation model, the video frame interpolation model including a pyramid network model that includes a plurality of fusion network modules, any fusion network module being determined according to an optical flow network and a synthesis network, and the input data of the synthesis network being determined according to the optical flow network. The device includes: a training module configured to acquire a training data set and train an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model; and a video frame interpolation model determining module configured to determine a video frame interpolation model according to the fusion network modules in the target pyramid network model, where the video frame interpolation model is used to insert an intermediate frame between adjacent video frames of a video to be processed. The initial pyramid network model includes a plurality of synthesis modules corresponding to the plurality of fusion network modules in the initial pyramid network model; each synthesis module is used to fuse a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network, to obtain the target predicted intermediate frame of that fusion network module; and the training loss function is determined according to the target predicted intermediate frame of each fusion network module and the intermediate frame label value corresponding to that fusion network module.
According to a fourth aspect of the present disclosure, there is provided a video frame interpolation apparatus, including: a size adjustment module configured to acquire original adjacent video frames of a video to be processed and adjust the size of the original adjacent video frames to obtain target adjacent video frames; an intermediate video frame determination module configured to input the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames; and a frame interpolation module configured to insert the intermediate video frame between the original adjacent video frames so as to perform frame interpolation on the video to be processed; where the video frame interpolation model is determined according to the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first and/or second aspect of the embodiments described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of the first and/or second aspect of the embodiments described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a chip including a processor and a communication interface coupled to the processor, the processor being configured to run a program or instructions to implement the method of the first and/or second aspect.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer program product including instructions which, when run on a computer, cause the computer to perform the steps of the method of the first and/or second aspect.
As can be seen from the above technical solutions, the method and device for determining a video frame interpolation model, the video frame interpolation method and device, and the computer-readable storage medium and electronic device for implementing the methods in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
In the technical solutions provided by some embodiments of the present disclosure, during the training stage the training loss function is determined from the target predicted intermediate frame output by each synthesis module and the intermediate frame label value corresponding to that synthesis module, which improves training accuracy and therefore the accuracy of intermediate frame prediction of the resulting video frame interpolation model. In the actual inference stage, the synthesis modules are discarded and the final video frame interpolation model is determined according to the fusion network modules of the trained target pyramid network model, which improves the prediction efficiency of the video frame interpolation model. In summary, the video frame interpolation model in the disclosure can improve the efficiency of intermediate frame prediction while ensuring its accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 shows a schematic diagram of a system architecture of an application environment to which the methods and devices of the present disclosure may be applied;
FIG. 2 shows a flow diagram of a method for determining a video frame interpolation model in an exemplary embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method for obtaining a target pyramid network model in an exemplary embodiment of the present disclosure;
FIG. 4 shows a flow diagram of another method for obtaining a target pyramid network model in an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a pyramid network model in an exemplary embodiment of the present disclosure;
FIG. 6 shows a schematic block diagram of an optical flow network in an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a first re-parameterization module in an exemplary embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a second re-parameterization module in an exemplary embodiment of the present disclosure;
FIG. 9 shows a flow diagram of a video frame interpolation method in an exemplary embodiment of the present disclosure;
FIG. 10 shows a flow diagram of a method for determining an intermediate video frame between original adjacent video frames in an exemplary embodiment of the present disclosure;
FIG. 11 shows a flow diagram of another method for determining an intermediate video frame between original adjacent video frames in an exemplary embodiment of the present disclosure;
FIG. 12 shows a schematic structural diagram of a device for determining a video frame interpolation model in an exemplary embodiment of the present disclosure;
FIG. 13 shows a schematic structural diagram of a video frame interpolation apparatus in an exemplary embodiment of the present disclosure;
FIG. 14 shows a schematic structural diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. in addition to the listed elements/components/etc.; the terms "first" and "second" and the like are used merely as labels, and are not intended to limit the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The frame rate determines the fluency of a video and strongly influences the viewer's visual experience. Intelligent frame interpolation is an algorithm that estimates intermediate frames from the preceding and following frames of a video and inserts the estimated intermediate frames between them, thereby increasing the frame rate of the video.
Intelligent frame interpolation methods in the related art fall into two categories: MEMC-based and deep-learning-based. Video Frame Interpolation (VFI) based on Motion Estimation and Motion Compensation (MEMC) achieves real-time interpolation by combining the operation of a MEMC chip with a traditional interpolation algorithm. However, traditional interpolation algorithms running on MEMC chips are hard to improve further because of hardware limitations, so deep-learning-based VFI technology has been more widely researched.
Deep-learning-based VFI estimates the intermediate frame with a neural network model and, according to the type of model, can be divided into three schemes: optical flow networks, synthesis networks, and combined optical-flow-plus-synthesis networks. An optical flow network simply estimates the optical flow between the preceding frame, the following frame, and the intermediate frame with a neural network model, and then performs a warp operation (image warping, which can be understood as a mapping) to obtain the final intermediate frame. A synthesis network estimates the intermediate frame directly with a neural network model. An optical-flow-plus-synthesis network combines the two: for example, the output of the optical flow network may be used as the input of the synthesis network, or the outputs of the optical flow network and the synthesis network may be combined directly to obtain the intermediate frame estimate. Because it combines the advantages of both networks, the optical-flow-plus-synthesis approach tends to achieve better results in VFI.
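For readers unfamiliar with the warp operation mentioned above, the sketch below shows one common way to implement it. This is a minimal illustration assuming a PyTorch implementation and backward (bilinear) warping; the function name backward_warp and the pixel-unit flow convention are assumptions for illustration and are not taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp a batch of frames (N, C, H, W) with a dense optical flow field
    (N, 2, H, W) given in pixels, using bilinear sampling."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W), x first, then y
    coords = base.unsqueeze(0) + flow             # (N, 2, H, W), shifted coordinates
    # Normalise pixel coordinates to [-1, 1] as expected by grid_sample.
    norm_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    norm_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

# Example use: warp the preceding frame towards the intermediate time step.
# candidate_0 = backward_warp(frame_0, flow_from_t_to_0)
```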
However, as a generative task, deep-learning-based video frame interpolation usually requires a large model, which makes it difficult to apply to real-time scenarios such as live streaming, especially when the resolution of the model's input image is high. In addition, when nonlinearly moving objects appear in scenes such as game live streams or movies, most interpolation algorithms cannot produce a reasonable intermediate frame estimate, and the accuracy of small deep learning models is particularly low.
To solve the above problems, the present disclosure proposes a method and a device for determining a video frame interpolation model, and a video frame interpolation method and device, which can be applied to the system architecture of the exemplary application environment shown in FIG. 1.
As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, 104, a network 105, and a server 106. The network 105 serves as a medium for providing communication links between the terminal devices 101, 102, 103, 104 and the server 106. The network 105 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 101, 102, 103, 104 may be smart phones, tablet computers, notebook computers, desktop computers, etc., but are not limited thereto.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 106 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like.
The method for determining a video frame interpolation model and the video frame interpolation method provided by the embodiments of the present disclosure may be executed on the server 106, and accordingly, the device for determining a video frame interpolation model and the video frame interpolation device may be disposed in the server 106. The method for determining a video frame interpolation model and the video frame interpolation method may also be executed on a terminal device, and accordingly, the device for determining a video frame interpolation model and the video frame interpolation device may also be disposed in the terminal device. Alternatively, the method for determining a video frame interpolation model and the video frame interpolation method may be partially executed on the server 106 and partially executed on the terminal; accordingly, for each of the device for determining a video frame interpolation model and the video frame interpolation device, some modules may be disposed in the server 106 and some modules in the terminal device.
For example, different videos may be collected by the terminal device, and video triplets may then be generated from the collected videos, where a video triplet refers to three consecutive frames: the first and third frames of the triplet serve as input data in the training data set, and the second frame serves as the label data corresponding to that input, thereby generating the training data. The server 106 may then perform model training on this training data based on the method for determining a video frame interpolation model in the present disclosure, so as to obtain a video frame interpolation model. Of course, the server may also acquire the training data set directly and execute the method for determining a video frame interpolation model on it, so as to determine the video frame interpolation model.
In an exemplary application scenario, the terminal device may send the video to be processed to the server 106; the server 106 may perform frame interpolation on the video to be processed according to its original adjacent video frames based on the video frame interpolation method in the present disclosure, and send the interpolated video back to the terminal device, so as to improve the smoothness of the video played on the terminal device. The terminal device may also store the target pyramid network model itself and then, based on the video frame interpolation method in the present disclosure and the stored target pyramid network model, perform frame interpolation on the video it plays, so as to improve the fluency of that video.
It is to be understood by those skilled in the art that the above application scenario is merely for example, and the present exemplary embodiment is not limited thereto.
FIG. 2 is a flow chart of a method for determining a video frame interpolation model in an exemplary embodiment of the disclosure, where the video frame interpolation model includes a pyramid network model that includes a plurality of fusion network modules, any fusion network module is determined according to an optical flow network and a synthesis network, and the input data of the synthesis network in any fusion network module is determined according to the optical flow network of that fusion network module. Referring to FIG. 2, the method includes:
Step S210: acquiring a training data set, and training the initial pyramid network model according to the training data set and the training loss function to obtain a target pyramid network model;
Step S220: determining a video frame interpolation model according to the fusion network modules in the target pyramid network model, where the video frame interpolation model is used to insert an intermediate frame between adjacent video frames of a video to be processed;
where the initial pyramid network model includes a plurality of synthesis modules corresponding to the plurality of fusion network modules in the initial pyramid network model; each synthesis module is used to fuse a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network, to obtain the target predicted intermediate frame of that synthesis module; and the training loss function is determined according to the target predicted intermediate frame of each synthesis module and the intermediate frame label value corresponding to that synthesis module.
In the technical solution provided by the embodiment shown in FIG. 2, during the training stage the training loss function is determined from the target predicted intermediate frame output by each synthesis module and the intermediate frame label value corresponding to that synthesis module, which improves training accuracy and therefore the accuracy of intermediate frame prediction of the resulting video frame interpolation model. In the actual inference stage, the synthesis modules are discarded and the final video frame interpolation model is determined according to the fusion network modules of the trained target pyramid network model, which improves the prediction efficiency of the video frame interpolation model. In summary, the video frame interpolation model in the disclosure can improve the efficiency of intermediate frame prediction while ensuring its accuracy.
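As a rough, non-limiting sketch of this training/inference split (assuming a PyTorch implementation; the class name PyramidInterpolationModel, the dict-like state handed between levels, and the state["frame"] output convention are all assumptions for illustration, not names from this disclosure):

```python
import torch.nn as nn

class PyramidInterpolationModel(nn.Module):
    """Schematic sketch: fusion network modules are kept at inference time,
    synthesis modules are only used to produce supervised predictions during training."""

    def __init__(self, fusion_modules, synthesis_modules):
        super().__init__()
        self.fusion_modules = nn.ModuleList(fusion_modules)        # kept at inference
        self.synthesis_modules = nn.ModuleList(synthesis_modules)  # training only

    def forward(self, frame_0, frame_1):
        state = None           # flow, weights and frames handed from level to level
        supervised_frames = []
        for fusion, synthesis in zip(self.fusion_modules, self.synthesis_modules):
            state = fusion(frame_0, frame_1, state)
            if self.training:
                # Fuse the optical-flow prediction and the synthesis-network
                # prediction into a target predicted intermediate frame that is
                # compared against this level's label in the loss.
                supervised_frames.append(synthesis(state))
        if self.training:
            return supervised_frames       # one supervised prediction per pyramid level
        return state["frame"]              # inference: synthesis modules are skipped
```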
The following describes in detail the specific implementation of each step in the embodiment shown in fig. 2:
in step S210, a training data set is acquired, and the initial pyramid network model is trained according to the training data set and the training loss function, so as to obtain a target pyramid network model.
In one exemplary embodiment, the initial pyramid network model includes a plurality of synthesis modules corresponding to a plurality of fusion network modules in the initial pyramid network model; the synthesis module is used for fusing the first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with the second predicted intermediate frame determined by the synthesis network to obtain a target predicted intermediate frame of the synthesis module; and the training loss function is determined according to the target prediction intermediate frame of the synthesis module and the intermediate frame label value corresponding to the synthesis module.
In another exemplary embodiment, the synthesis module is further configured to warp the front and rear frames corresponding to the fusion network module according to the optical flow estimated by the optical flow network in that fusion network module, so as to obtain a first candidate intermediate frame and a second candidate intermediate frame, and to fuse the two candidates to obtain the first predicted intermediate frame determined by the optical flow network. The synthesis module may fuse the first predicted intermediate frame and the second predicted intermediate frame according to a first fusion weight, and may fuse the first candidate intermediate frame and the second candidate intermediate frame according to a second fusion weight.
In one exemplary embodiment, the first fused weight may be determined by a synthetic network and the second fused weight may be determined by an optical flow network.
The first fusion weight characterizes the relative importance of the first and second predicted intermediate frames to the final target predicted intermediate frame, and the second fusion weight characterizes the relative importance of the first and second candidate intermediate frames to the first predicted intermediate frame. Taking the case where the weight coefficients of the first and second predicted intermediate frames sum to 1, the first fusion weight can be understood as the weight coefficient of the second predicted intermediate frame: if the first fusion weight is 0.7, then the importance of the second predicted intermediate frame is 0.7 and that of the first predicted intermediate frame is 0.3, so the second predicted intermediate frame is weighted more heavily than the first.
In an exemplary embodiment, the first fusion weights corresponding to each synthesis module may be the same or different, and the second fusion weights corresponding to each synthesis module may be the same or different, which are determined according to the training situation of the initial pyramid network model, which is not particularly limited in this exemplary embodiment.
For example, the initial pyramid network model includes a plurality of fusion network modules and a plurality of synthesis modules, where the number of synthesis modules equals the number of fusion network modules and they correspond one to one, i.e. each fusion network module corresponds to one synthesis module. Each fusion network module may warp the front and rear frames according to the optical flow estimated by its optical flow network, so as to obtain a first candidate intermediate frame and a second candidate intermediate frame, and may also fuse the first and second candidate intermediate frames according to the second fusion weight estimated by its optical flow network, so as to obtain the first predicted intermediate frame determined by the optical flow network. After the fusion network module obtains the first predicted intermediate frame, it may pass it to the corresponding synthesis module; and after the synthesis network in the fusion network module determines the second predicted intermediate frame and the first fusion weight, these may also be passed to the corresponding synthesis module. In this way, the synthesis module corresponding to the fusion network module can fuse, according to the first fusion weight, the first predicted intermediate frame determined by the optical flow network with the second predicted intermediate frame determined by the synthesis network, so as to obtain the target predicted intermediate frame of that fusion network module.
As described above, the generation and fusion of the first and second candidate intermediate frames may also be implemented by the synthesis module corresponding to the fusion network module. For example, each fusion network module transmits the optical flow estimated by its optical flow network and the second fusion weight to the corresponding synthesis module, and may also transmit the first fusion weight and the second predicted intermediate frame predicted by its synthesis network to the corresponding synthesis module. The synthesis module may then warp the front and rear frames corresponding to the fusion network module according to the optical flow estimated by the optical flow network, so as to obtain the first and second candidate intermediate frames; fuse the two candidates according to the second fusion weight estimated by the optical flow network, so as to obtain the first predicted intermediate frame determined by the optical flow network; and finally fuse the first and second predicted intermediate frames according to the first fusion weight, so as to obtain the target predicted intermediate frame.
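The two fusion steps can be sketched in a few lines. This is a minimal illustration in which each weight is treated as the coefficient of the second operand, following the 0.7/0.3 example above; the function and variable names are assumptions for illustration.

```python
def fuse_intermediate_frames(cand_0, cand_1, w_flow, synth_frame, w_synth):
    """Two-stage fusion performed for one pyramid level.

    cand_0, cand_1 : candidate intermediate frames obtained by warping the
                     front and rear frames with the estimated optical flow
    w_flow         : second fusion weight, estimated by the optical flow network
    synth_frame    : second predicted intermediate frame from the synthesis network
    w_synth        : first fusion weight, estimated by the synthesis network
    """
    # First predicted intermediate frame: fuse the two warped candidates.
    flow_frame = (1.0 - w_flow) * cand_0 + w_flow * cand_1
    # Target predicted intermediate frame: fuse the optical-flow prediction with
    # the synthesis-network prediction.  With w_synth = 0.7, the synthesis branch
    # contributes 0.7 and the optical-flow branch 0.3, as in the example above.
    return (1.0 - w_synth) * flow_frame + w_synth * synth_frame
```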
In an exemplary embodiment, the plurality of fusion network modules in the pyramid network model are connected in series, and for any fusion network module, its input data includes one or more of: the optical flow estimated by the optical flow network in the preceding fusion network module, the second predicted intermediate frame determined by the synthesis network in the preceding fusion network module, the first fusion weight determined by the preceding fusion network module, and the second fusion weight determined by the preceding fusion network module.
In an exemplary embodiment, the pyramid network model in this disclosure may be understood as a pyramid network model corresponding to an image pyramid, where a pyramid of one image is a series of image sets of different resolutions arranged in a pyramid shape. The bottom of the pyramid is a high resolution representation of the image, i.e., the original image, while the top of the pyramid is a low resolution representation of the image. The resolution of the bottommost layer of the pyramid is the highest, namely the resolution of the original image, and the resolution gradually decreases along with the increase of the layer number of the pyramid.
In other words, the resolution of the input image is different for each layer of the pyramid network model. That is, the resolution of the input image of each converged network module in the pyramid network model is different. Therefore, when determining the input data of the current fusion network module, the resolution of the second predicted intermediate frame determined by the synthesis network in the previous fusion network module of the current fusion network module may be adjusted according to the resolution of the current fusion network module, so as to obtain a target second predicted intermediate frame, and the target second predicted intermediate frame is used as one of the input data of the current fusion network module.
That is, for any fusion network module, the input data may include one or more of: the optical flow estimated by the optical flow network in the previous fusion network module, a target second predicted intermediate frame obtained by adjusting the resolution of the second predicted intermediate frame determined by the synthesis network in the previous fusion network module to this module's input image resolution, the first fusion weight determined by the synthesis network in the previous fusion network module, and the second fusion weight determined by the optical flow network in the previous fusion network module.
For any fusion network module, its previous fusion network module is the module immediately above it, counting from the top of the pyramid network model towards the bottom. Take, for example, a pyramid network model that comprises 4 fusion network modules, where the topmost layer is fusion network module 1, the layer below fusion network module 1 is fusion network module 2, the layer below fusion network module 2 is fusion network module 3, and fusion network module 4 is the bottommost fusion network module of the pyramid network model. Then the previous fusion network module of fusion network module 2 is fusion network module 1, the previous fusion network module of fusion network module 3 is fusion network module 2, and the previous fusion network module of fusion network module 4 is fusion network module 3.
Based on this, taking fusion network module 2 as an example, its input data may include one or more of: the optical flow estimated by the optical flow network in fusion network module 1, a target second predicted intermediate frame obtained by adjusting the resolution of the second predicted intermediate frame determined by the synthesis network in fusion network module 1 to the resolution of fusion network module 2, the first fusion weight determined by the synthesis network in fusion network module 1, and the second fusion weight determined by the optical flow network in fusion network module 1.
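The hand-off between adjacent levels can be sketched as follows. This is a schematic fragment assuming a PyTorch implementation, per-pixel fusion weight maps, a fixed factor of 2 between levels, and the common practice of scaling flow magnitude when upsampling it; all names and these choices are assumptions, not taken from this disclosure.

```python
import torch.nn.functional as F

def upscale_to_next_level(prev, scale=2.0):
    """Resize one level's outputs so they can feed the next (higher-resolution)
    fusion network module.  `prev` is assumed to be a dict holding the optical
    flow, the two fusion weights and the second predicted intermediate frame."""
    out = {}
    # Flow values are pixel displacements, so their magnitude is scaled as well
    # as their spatial resolution.
    out["flow"] = F.interpolate(prev["flow"], scale_factor=scale,
                                mode="bilinear", align_corners=False) * scale
    for key in ("w_synth", "w_flow", "synth_frame"):
        out[key] = F.interpolate(prev[key], scale_factor=scale,
                                 mode="bilinear", align_corners=False)
    return out
```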
In an exemplary embodiment, the plurality of fusion network modules in the pyramid network model are connected in series, and for any fusion network module, the input data may include one or more of: the optical flow estimated by the optical flow network in a reference fusion network module, a target second predicted intermediate frame obtained by adjusting the resolution of the second predicted intermediate frame determined by the synthesis network in the reference fusion network module to the input image resolution, the first fusion weight determined by the synthesis network in the reference fusion network module, and the second fusion weight determined by the optical flow network in the reference fusion network module. For any fusion network module, the reference fusion network module is a fusion network module that precedes it.
Taking fusion network modules 1 to 4, ordered from the top to the bottom of the pyramid network model as above, as an example: for fusion network module 2, the reference fusion network module is fusion network module 1; for fusion network module 3, the reference fusion network module includes fusion network module 1 and/or fusion network module 2; and for fusion network module 4, the reference fusion network module may be any of fusion network modules 1, 2, and 3.
In an exemplary embodiment, the training data set may comprise a video triplet data set. For example, different videos may be acquired, and a video triplet may be generated from 3 consecutive frames of a video, where the 1st and 3rd frames are used to determine the adjacent video frames (i.e., the front and rear frames) input during training, and the 2nd frame (i.e., the intermediate frame) is used to determine the intermediate frame label value corresponding to the adjacent video frames formed by the 1st and 3rd frames of the triplet.
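A minimal sketch of such a triplet data set, assuming PyTorch's Dataset API; the directory layout and the file names frame1.png, frame2.png, frame3.png are assumptions for illustration.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class VideoTripletDataset(Dataset):
    """Each sample directory is assumed to hold three consecutive frames:
    frame1.png and frame3.png form the adjacent input pair, and frame2.png
    is the intermediate-frame label."""

    def __init__(self, root):
        self.samples = sorted(p for p in Path(root).iterdir() if p.is_dir())
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        d = self.samples[idx]
        frame_1 = self.to_tensor(Image.open(d / "frame1.png"))
        frame_2 = self.to_tensor(Image.open(d / "frame2.png"))   # middle-frame label
        frame_3 = self.to_tensor(Image.open(d / "frame3.png"))
        return (frame_1, frame_3), frame_2
```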
Because the resolution of each layer of the pyramid network model is different, a specific implementation of warping the front and rear frames corresponding to the fusion network module according to the optical flow estimated by the optical flow network (whether performed by the synthesis module or by the fusion network module) may include: adjusting the resolution of the front and rear frames in the training data set according to the resolution of the current fusion network module to obtain the front and rear frames corresponding to the current fusion network module, and then performing the warp operation on these frames to obtain the first candidate intermediate frame and the second candidate intermediate frame respectively.
Illustratively, FIG. 3 shows a flow diagram of a method of deriving a target pyramid network model in an exemplary embodiment of the present disclosure. Referring to fig. 3, the method may include steps S310 to S320. Wherein:
in step S310, a training data set is obtained, and the size of a video frame in the training data set is adjusted according to the input image size of the top fusion network module of the initial pyramid network model, so as to obtain a target training data set of the pyramid network model.
As previously described, in the present disclosure a pyramid network model may include a plurality of fusion network modules, one at each layer of the pyramid; the input images of the layers differ in size, and the resolution of the input images gradually increases from the top to the bottom of the pyramid network model, until the resolution of the bottom layer is the same as the resolution of the original image.
In the present disclosure, the pyramid network model is processed in order from the top-most layer to the bottom-most layer when image processing is performed. Based on the above, the resolution of the video frames in the training dataset can be adjusted according to the resolution of the top fusion network module of the initial pyramid network model, so as to obtain the target training dataset of the pyramid network model.
Adjusting the resolution of the video frames in the training data set may be understood as scaling and/or cropping them to the required resolution. The resolution of each layer of the pyramid network model may be determined as described for step S410, and is not repeated here.
In step S320, training the initial pyramid network model according to the target training data set and the training loss function, so as to obtain a target pyramid network model.
For example, the 1st and 3rd frames of each video triplet in the target training data set may be input into the topmost fusion network module of the pyramid network model to obtain the target predicted intermediate frame of each fusion network module in the pyramid network model. Meanwhile, the loss between the target predicted intermediate frame of each fusion network module and the intermediate frame label value of that fusion network module is taken as a part of the training loss function, so as to train the initial pyramid network model and obtain the target pyramid network model.
As previously described, since the resolution of each fusion network module is different, the intermediate frame label value corresponding to each fusion network module should match the resolution of that module. For the topmost fusion network module of the initial pyramid network model, the 2nd frame of each video triplet in the target training data set obtained in step S310 may be used as the corresponding intermediate frame label value. For the other fusion network modules, the 2nd frame of each video triplet in the training data set must be resolution-adjusted according to the module's own resolution, so as to obtain the intermediate frame label value corresponding to that module's input data. On this basis, for each fusion network module the loss between its target predicted intermediate frame and its intermediate frame label value is calculated, and training proceeds based on this loss.
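The per-level supervision can be sketched roughly as follows. This is a minimal example assuming PyTorch and the variant just described, in which the label is resized to each level's resolution; the choice of an L1 loss, bilinear resizing, and a simple sum over levels are assumptions for illustration, since the disclosure only requires that each level's target predicted intermediate frame be compared against a matching-resolution label.

```python
import torch.nn.functional as F

def pyramid_loss(level_predictions, middle_frame):
    """Sum a reconstruction loss over all pyramid levels.

    level_predictions : target predicted intermediate frames, one per level
    middle_frame      : full-resolution label (the 2nd frame of the triplet)
    """
    total = 0.0
    for pred in level_predictions:
        # Resize the middle-frame label to this level's resolution before comparing.
        label = F.interpolate(middle_frame, size=pred.shape[-2:],
                              mode="bilinear", align_corners=False)
        total = total + F.l1_loss(pred, label)
    return total
```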
In an exemplary embodiment, in training the initial pyramid network model based on the methods in steps S310 to S320, a first predicted intermediate frame of an optical flow network in a fusion network module at a bottommost layer of the initial pyramid network model is determined according to an optical flow estimated by the optical flow network and a video frame in the training dataset.
Taking a pyramid network model with N fusion network modules as an example, for the first N-1 fusion network modules (all modules except the bottommost one), or for their corresponding synthesis modules, the 1st and 3rd frames of each video triplet in the target training data set can be warped according to the current optical flow estimate to obtain the first and second candidate intermediate frames, which are then fused according to the second fusion weight to obtain the first predicted intermediate frame determined by the optical flow network.
However, since the resolution of the video triplets in the target training data set is the same as that of the topmost fusion network module, for any of the first N-1 fusion network modules other than the topmost one, the resolution of the second predicted intermediate frame predicted by that module's synthesis network may be adjusted to match the topmost module, and this adjusted second predicted intermediate frame is then fused with the first predicted intermediate frame of the optical flow network (obtained by warping frames of the target training data set), so as to obtain the target predicted intermediate frame corresponding to that fusion network module. When computing the loss for this module's target predicted intermediate frame, the label value can simply be the intermediate frame of the corresponding video triplet in the target training data set; that is, when the loss is calculated, the prediction and the label have the same resolution.
When the first N-1 fusion network modules, or their synthesis modules, perform the warp operation on the target training data set obtained in step S310, the whole training process only needs to adjust the resolution of the second predicted intermediate frame output by the synthesis network, apart from adjusting the training data set to the resolution of the topmost fusion network module, which saves processing time. The bottommost fusion network module still performs the warp operation on the front and rear frames of the original training data set to obtain its first predicted intermediate frame, which ensures the accuracy of the final first predicted intermediate frame.
Of course, for the first N-1 fusion network modules or their corresponding synthesis modules, the video frames in the training data set may instead be adjusted according to the resolution of each of those modules, so as to obtain the base image frames that each module or its synthesis module uses during the warp operation.
That is, when each fusion network module, or its corresponding synthesis module, warps the front and rear frames according to the optical flow estimated by the current module's optical flow network to obtain the first and second candidate intermediate frames, the 1st and 3rd frames of each video triplet in the training data set (i.e., the original images) may first be resolution-adjusted according to that module's resolution, so as to obtain the image frames on which the module performs the warp operation; the warp operation is then performed on those frames to obtain the first and second candidate intermediate frames. In this way, for each fusion network module or its synthesis module, the resolution of the first predicted intermediate frame obtained by fusing the two candidates is the same as the resolution of that module's input image. Because each module performs the warp operation on front and rear frames matched to its own resolution, the accuracy of the first predicted intermediate frame determined at each layer can be ensured, which in turn ensures the accuracy of the loss function calculation and of the intermediate frame prediction of the trained model.
When the initial pyramid network model is trained through steps S310 to S320 described above, the input data during training includes data adjusted according to the resolution of the top-most layer of the initial pyramid network model. The data obtained by adjusting the video frames in the training data set based on the resolutions of the other fusion network modules is not used as input to the pyramid network model; instead, it serves as the base image frames on which the corresponding fusion network modules or synthesis modules in the pyramid network model perform the warp operation based on the optical flows.
By way of example, fig. 4 shows a flow diagram of another method of deriving a target pyramid network model in an exemplary embodiment of the present disclosure. Referring to fig. 4, the method may include steps S410 to S430.
In step S410, a training data set is acquired, and, for any fusion network module of the initial pyramid network model, the size of the video frames in the training data set is adjusted according to the input image size of that fusion network module, so as to obtain the first input data of that fusion network module.
For example, the size of the video frames in the training dataset may be adjusted according to the input image size of each fusion network module to obtain the first input data of each fusion network module.
In an exemplary embodiment, for a pyramid network model the input image size of the bottom-most fusion network module is the same as the size of the original image, and in the present disclosure the input image size of the bottom-most fusion network module may be the same as the size of the video frames in the training data set. If the video frame size in the training data set is 320×240, the input image size of the bottom-most fusion network module is 320×240. For training data, training may be performed on data of similar sizes: if most of the data available for training has a resolution of 320×240, other available data whose size is close to 320×240 may be cropped or scaled to 320×240 and then used as training data, which increases the amount of training data and, at the same time, improves the generalization capability of the model.
For the pyramid network model, the resolution increases proportionally from the top layer to the bottom layer until the resolution of the bottom layer is the same as that of the original image. Taking as an example a pyramid network model in which the resolution doubles from one layer to the next, which comprises 4 fusion network modules, and whose original image size is 320×240, the input image size of the top-layer fusion network module may be 40×30, that of the next layer 80×60, that of the layer after that 160×120, and that of the bottom-layer fusion network module 320×240, the same as the original image size.
In other words, the input image size of the fusion network module of each layer of the pyramid network model can be determined according to the size of the original image, the number of layers of the pyramid network model and the resolution relation between the layers.
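The relationship just described can be expressed as a small helper; the following Python sketch assumes a bottom layer at the original resolution and a fixed inter-layer scale factor, matching the 320×240, 4-layer, 2× example above. The function name and signature are illustrative, not taken from the disclosure.

```python
def layer_input_sizes(original_size, num_layers=4, scale=2):
    """Derive the input image size of each fusion network module from the
    original frame size, the number of pyramid layers and the inter-layer
    resolution ratio (bottom layer = original size, each layer above is
    `scale` times smaller)."""
    w, h = original_size
    # index 0 = top-most layer, last index = bottom-most layer
    return [(w // scale ** (num_layers - 1 - i), h // scale ** (num_layers - 1 - i))
            for i in range(num_layers)]

# For a 320x240 original and a 4-layer, 2x pyramid:
# [(40, 30), (80, 60), (160, 120), (320, 240)]
print(layer_input_sizes((320, 240)))
```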
Taking a 4-layer pyramid network model, i.e. one with 4 fusion network modules, as an example, the training data set may be resized according to the input image resolutions of the 4 fusion network modules respectively, so as to obtain training data at 4 resolutions, and the resolution-adjusted training data (i.e. the resized original images) is used as the first input data of the corresponding fusion network module.
It should be noted that, in the present disclosure, the number of layers of the pyramid network model is related to the resolution of the video to be frame-interpolated: the higher the resolution, the more layers the pyramid network model has, i.e. the more fusion network modules it contains.
With continued reference to fig. 4, in step S420, the target input data of the fusion network module is determined according to the first input data of the fusion network module, the optical flow estimated by the optical flow network in the fusion network module preceding it, and the second predicted intermediate frame determined by the synthesis network in the fusion network module preceding it.
As described above, for any fusion network module, the input data may further include one or more of: a target second predicted intermediate frame obtained from the second predicted intermediate frame of the synthesis network of its preceding fusion network module, the optical flow estimated by the optical flow network in its preceding fusion network module, the first fusion weight determined by the synthesis network of the preceding fusion network module, and the second fusion weight determined by the optical flow network of the preceding fusion network module. These data, together with the first input data of the fusion network module determined in step S410, may be used to obtain the target input data of the fusion network module.
That is, for the top-most fusion network module of the initial pyramid network model, the target input data may include the video frames obtained by adjusting the resolution of the training data set according to the input image size of the top-most fusion network module. For any other fusion network module, the target input data may include the adjacent video frames obtained by adjusting adjacent video frames in the training data set according to its input image size, the optical flow and the second fusion weight output by the optical flow network of the preceding fusion network module, the first fusion weight determined by the synthesis network in the preceding fusion network module, and the target second predicted intermediate frame obtained by adjusting the resolution of the second predicted intermediate frame of the synthesis network of the preceding fusion network module.
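As a rough illustration of how the target input data of a non-top fusion network module might be assembled, the following PyTorch-style sketch concatenates the resized adjacent frames with the (resized) outputs of the preceding module along the channel dimension. The tensor layouts, the bilinear resizing and the rescaling of the optical flow magnitude are assumptions for illustration, not details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def build_target_input(resized_frames, prev_flow=None, prev_weights=None,
                       prev_synth_frame=None, size=None):
    """Assemble the target input of one fusion network module by concatenating,
    along the channel dimension, the resolution-adjusted adjacent frames with
    the (upsampled) outputs of the previous module, when they exist."""
    parts = [resized_frames]                      # e.g. (N, 6, H, W): frame 1 + frame 3
    if prev_flow is not None:
        # Rescaling the flow magnitude when changing resolution is an assumption
        parts.append(F.interpolate(prev_flow, size=size, mode="bilinear",
                                   align_corners=False) * 2.0)
    if prev_weights is not None:
        parts.append(F.interpolate(prev_weights, size=size, mode="bilinear",
                                   align_corners=False))
    if prev_synth_frame is not None:
        parts.append(F.interpolate(prev_synth_frame, size=size, mode="bilinear",
                                   align_corners=False))
    return torch.cat(parts, dim=1)
```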
In step S430, the initial pyramid network model is trained according to the target input data and the training loss function of each fusion network module to obtain a target pyramid network model.
For example, for the current fusion network module, the items of its target input data may be concatenated and the concatenated data input into the current fusion network module, so that the current input data is processed by the current fusion network module to obtain the optical flow estimated by the optical flow network in the current fusion network module, the second fusion weight, the first fusion weight, the second predicted intermediate frame determined by the synthesis network in the current fusion network module, and the target predicted intermediate frame corresponding to the current fusion network module. The value of the training loss function is then determined based on the target predicted intermediate frame of each fusion network module and the intermediate frame label value corresponding to that fusion network module. During the training of the initial pyramid network model, gradually reducing the value of the training loss function may be taken as the optimization objective, and training may be stopped when the value of the training loss function is smaller than a preset value. The trained initial pyramid network model is then tested.
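A minimal training-loop sketch along the lines of the description above is given below, assuming a PyTorch-style model that returns its per-layer predictions and a composite loss function; the optimizer choice, hyper-parameters and data format are illustrative assumptions, not details fixed by the disclosure.

```python
import torch

def train(model, loader, loss_fn, epochs=100, lr=1e-4, threshold=1e-3, device="cuda"):
    """Forward each batch of resolution-adjusted triplets through the initial
    pyramid model, compute the composite training loss from the per-layer
    target predicted intermediate frames and their label frames, and stop
    once the loss falls below a preset value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for batch in loader:
            outputs = model(batch["inputs"].to(device))          # per-layer predictions (assumed)
            labels = [l.to(device) for l in batch["labels"]]      # per-layer intermediate frame labels
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Stop when the loss of the last batch is below the preset value
        if loss.item() < threshold:
            break
    return model
```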
For example, the initial pyramid network model after training may be tested according to the test data set, and when the performance index of the test, such as the prediction accuracy, meets the requirement, the pyramid network model obtained after stopping training may be determined as the target pyramid network model.
The manner of determining the intermediate frame label value of each fusion network module has been described in the aforementioned step S320 and is not repeated here.
In steps S410 to S430, the input data of each fusion network module contains not only the output data of the preceding fusion network module but also the resolution-adjusted original images, which increases the richness of the training data. Moreover, each fusion network module in the pyramid network model only needs to learn the warp and does not need to learn the information input by the previous layer, which reduces the fitting difficulty of the model, improves its generalization capability, and thereby improves the prediction accuracy of the video frame interpolation model.
The training process of the initial pyramid network model described in fig. 4 is further explained below with reference to fig. 5, which shows a schematic diagram of a pyramid network model in an exemplary embodiment of the present disclosure. Referring to fig. 5, the pyramid network model in the present disclosure may include 4 layers, each corresponding to one fusion network module, such as fusion network Block1 (module 1), fusion network Block2 (module 2), fusion network Block3 (module 3) and fusion network Block4 (module 4) in fig. 5. During training, in addition to the fusion network module, each layer of the initial pyramid network model may include a synthesis module, such as synthesis module 1, synthesis module 2, synthesis module 3 and synthesis module 4 in fig. 5.
As can be seen from fig. 5, the input of the top-most fusion network Block1 includes front frame 1 and rear frame 1, obtained by resolution adjustment of the 1st and 3rd frames of the video triplets in the training set. The inputs of fusion network Block2 to fusion network Block4 may include the resolution-adjusted adjacent video frames from the training set together with the outputs of the preceding fusion network module. Meanwhile, the output of the synthesis module corresponding to each fusion network module is not passed to the next fusion network module but is used to obtain a target predicted intermediate frame, such as intermediate frames 1 to 4 in fig. 5. Losses are calculated between intermediate frames 1 to 4 and the intermediate frame label values of the corresponding fusion network modules, yielding 4 losses that may serve as part of the training loss function, so that the initial pyramid network model is trained based on the training loss function to obtain the target pyramid network model.
For example, during training, the video frame triplet data may be adjusted to different resolutions according to the resolutions of the different fusion network modules and used as the inputs of the corresponding fusion network modules. In each video triplet adjusted to a given resolution, the 1st and 3rd frames are the input data during training, such as front frame 1 and rear frame 1, front frame 2 and rear frame 2, front frame 3 and rear frame 3, and front frame 4 and rear frame 4 in fig. 5, i.e. training adjacent video frames at different resolutions, while the 2nd frame is the intermediate frame label value corresponding to that input data. Taking fig. 5 as an example, during training the model outputs 4 intermediate frame prediction results, such as intermediate frames 1 to 4 in fig. 5, whose resolutions differ, and training losses are calculated between intermediate frames 1 to 4 and the intermediate frame label values at their corresponding resolutions, giving 4 losses. These 4 losses form part of the training loss function and participate in the training of the model.
Of course, the training loss function may also include other losses. For example, in the present disclosure a self-supervision network is used to supervise the optical flow, so the loss function of the self-supervision network is also a part of the training loss function; the loss between the second predicted intermediate frame output by the synthesis network of each fusion network module and its corresponding intermediate frame label value may also be a part of the training loss function; and the loss between the first predicted intermediate frame determined by the optical flow network in each fusion network module and the intermediate frame label value corresponding to that fusion network module may likewise be a part of the training loss function, and so on.
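The composition of the training loss described above might be sketched as follows; the use of an L1 distance and the relative weights of the terms are assumptions for illustration, since the disclosure does not fix them here.

```python
import torch.nn.functional as F

def training_loss(outputs, labels, w_target=1.0, w_synth=0.5, w_flow=0.5, w_self=0.1):
    """Composite training loss sketch: per-layer losses between each target
    predicted intermediate frame and its label, plus the optional terms the
    text mentions (synthesis-network frames, optical-flow frames and a
    self-supervised flow term)."""
    loss = 0.0
    for layer_out, label in zip(outputs["layers"], labels):
        loss = loss + w_target * F.l1_loss(layer_out["target_frame"], label)
        loss = loss + w_synth * F.l1_loss(layer_out["synth_frame"], label)
        loss = loss + w_flow * F.l1_loss(layer_out["flow_frame"], label)
    # Self-supervised optical flow loss, if the model reports one
    loss = loss + w_self * outputs.get("self_supervised_flow_loss", 0.0)
    return loss
```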
In an exemplary embodiment, the input data of the synthesis network is determined from the optical flow network as follows: the second-layer features of the optical flow network (counted from the input) are concatenated with the penultimate-layer features of the optical flow network to obtain the input data of the synthesis network.
The penultimate-layer features are selected rather than the last-layer features because self-supervision is used in the present disclosure to supervise the optical flow, so the last-layer features of the optical flow network are fixed as optical flow features, and merely multiplexing the optical flow features has a limited effect on the synthesis network, which has also been verified in experiments. The second-layer features are selected rather than deeper features because the design intents of the optical flow network and the synthesis network differ considerably, the features they extract are different, and the deep features of the optical flow network are difficult to apply to the synthesis network.
Using the two-layer features of the optical flow network (i.e. its second-layer and penultimate-layer features) as the input data of the synthesis network allows the synthesis network to multiplex these features of the optical flow network when predicting its intermediate frame, so that the synthesis network does not need to repeat feature extraction but predicts directly from the features already extracted by the optical flow network, which improves the inference speed of the pyramid network model. Thus, while model inference accuracy is ensured, the increase in inference time brought by the use of the synthesis network is avoided and the prediction efficiency of the model is improved.
In an exemplary embodiment, features of other layers of the optical flow network may be selected and multiplexed into the synthesis network as required, so as to improve the model inference speed while ensuring the model prediction accuracy; that is, the input data of the synthesis network may be determined from the features extracted by a preset network layer of the optical flow network. For example, the third-layer features of the optical flow network (counted from the input) may be multiplexed into the synthesis network, which is not limited in this exemplary embodiment.
When the features of the optical flow network that are multiplexed into the synthesis network are changed, for example when the third-layer features of the optical flow network are multiplexed into the synthesis network, the structure of the synthesis network can be adjusted accordingly to ensure that the output feature map of the synthesis network has the same resolution as that of the optical flow network, which facilitates their fusion.
In an exemplary embodiment, the synthesis network may be understood as a convolution layer: after the two-layer features of the optical flow network are concatenated, the output of the synthesis network can be obtained through this convolution layer.
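A minimal sketch of such a synthesis head is shown below: the two reused feature maps of the optical flow network are concatenated and passed through a single convolution layer. The channel counts, the 3×3 kernel size and the assumption that both feature maps share the same spatial resolution are illustrative choices, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class SynthesisHead(nn.Module):
    """Synthesis network sketch that reuses two feature maps of the optical
    flow network (second-layer and penultimate-layer features): they are
    concatenated and passed through one convolution layer, so no separate
    feature extraction is performed."""
    def __init__(self, shallow_ch: int, deep_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(shallow_ch + deep_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, shallow_feat: torch.Tensor, deep_feat: torch.Tensor) -> torch.Tensor:
        # Both feature maps are assumed to have the same spatial resolution
        return self.conv(torch.cat((shallow_feat, deep_feat), dim=1))
```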
In an exemplary embodiment, the optical flow network of the fusion network modules of the initial pyramid network model is determined as follows: the network layers used for shallow feature extraction in the optical flow network are determined according to a first re-parameter module, where the shallow features include edge features of the video frames; and the network layers used for deep feature extraction in the optical flow network are determined according to a second re-parameter module, where the deep features include abstract features of the video frames. The first re-parameter module is replaced with a first convolution layer in the video frame interpolation model determined in step S220, and the second re-parameter module is replaced with a second convolution layer in the video frame interpolation model determined in step S220.
By way of example, fig. 6 shows a schematic diagram of an optical flow network in an exemplary embodiment of the present disclosure. Referring to fig. 6, the optical flow network in the present disclosure may be a U-shaped network, and the middle block in fig. 6 may be understood as a small attention mechanism. The network layers used for shallow feature extraction in the optical flow network may include downsampling block 1, downsampling block 2, upsampling block 1 and upsampling block 2 in fig. 6, each of which may be built from the first re-parameter module. Downsampling block 1 may include a first re-parameter module and downsampling layer 1, downsampling block 2 may include a first re-parameter module and downsampling layer 2, upsampling block 1 may include a first re-parameter module and upsampling layer 1, and upsampling block 2 may include a first re-parameter module and upsampling layer 2. The network layers used for deep feature extraction in the optical flow network may include 61 to 65 in fig. 6; in other words, each of 61 to 65 in fig. 6 may include a second re-parameter module.
Fig. 7 illustrates the structure of a first re-parameter module in an exemplary embodiment of the present disclosure. Referring to fig. 7, the first re-parameter module in the present disclosure may include 7 parallel branches: from left to right, a shortcut (first branch), a second branch corresponding to conv-3×3, a third branch corresponding to conv-1×1 followed by conv-3×3, a fourth branch corresponding to conv-1×1, a fifth branch corresponding to conv-1×1 followed by a Sobel operator, a sixth branch corresponding to conv-1×1 followed by a Laplacian operator, and a seventh branch corresponding to two conv-1×1 layers. Here the Sobel operator is an image edge detection operator, and the Laplacian operator can be understood as a filter that performs a second-order differentiation of the image brightness to detect image edges.
Experiments show that, among shallow features, edge features can improve the accuracy of frame interpolation, so image edge features are used as shallow features. Of course, other shallow features that can enhance the interpolation effect may also be used, which is not particularly limited in the present exemplary embodiment.
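The following PyTorch-style sketch illustrates a training-time shallow re-parameter block with parallel branches of the kind described for fig. 7, including fixed Sobel and Laplacian filters applied after 1×1 convolutions. The exact channel arrangement and the depthwise application of the fixed filters are assumptions for illustration; this is not the exact module of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])

class FixedFilter(nn.Module):
    """Applies a fixed 3x3 filter (e.g. Sobel or Laplacian) depthwise."""
    def __init__(self, channels: int, kernel: torch.Tensor):
        super().__init__()
        self.channels = channels
        self.register_buffer("weight", kernel.expand(channels, 1, 3, 3).clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.weight, padding=1, groups=self.channels)

class ShallowRepBlock(nn.Module):
    """Training-time shallow re-parameter block with parallel branches; at
    inference all branches could be algebraically merged into one 3x3 conv."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.branch_3x3 = nn.Conv2d(c, c, 3, padding=1)
        self.branch_1x1_3x3 = nn.Sequential(nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 3, padding=1))
        self.branch_1x1 = nn.Conv2d(c, c, 1)
        self.branch_sobel = nn.Sequential(nn.Conv2d(c, c, 1), FixedFilter(c, SOBEL_X))
        self.branch_laplacian = nn.Sequential(nn.Conv2d(c, c, 1), FixedFilter(c, LAPLACIAN))
        self.branch_1x1_1x1 = nn.Sequential(nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x  # shortcut branch
                + self.branch_3x3(x) + self.branch_1x1_3x3(x) + self.branch_1x1(x)
                + self.branch_sobel(x) + self.branch_laplacian(x) + self.branch_1x1_1x1(x))
```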
Fig. 8 illustrates the structure of a second re-parameter module in an exemplary embodiment of the present disclosure. Referring to fig. 8, the second re-parameter module includes 4 parallel branches: from left to right, a first branch corresponding to a shortcut, a second branch corresponding to conv-1×1, conv-3×3 and conv-1×1 in sequence, a third branch corresponding to conv-1×1 and conv-3×3, and a fourth branch corresponding to conv-1×1, where the third branch further includes a shortcut sub-branch and the fourth branch also includes a shortcut sub-branch.
The core idea of the re-parameterization technique is to train with a deliberately complex convolution module so as to improve the generalization capability of the model, while at inference time the complex convolution module is converted into an equivalent ordinary convolution layer, thereby improving the inference speed of the model while preserving its generalization capability.
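The inference-time equivalence described in this paragraph can be illustrated on a simplified case: merging a 3×3 convolution, a 1×1 convolution and an identity shortcut into a single 3×3 convolution. This is a generic re-parameterization sketch under the assumption of equal input and output channel counts; it is not the exact conversion of the first or second re-parameter module of the disclosure.

```python
import torch
import torch.nn as nn

def fuse_identity_1x1_3x3(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Merge a 3x3 conv, a 1x1 conv and an identity shortcut (same channel
    count) into a single equivalent 3x3 conv, as done at inference time."""
    c = conv3x3.out_channels
    fused = nn.Conv2d(c, c, 3, padding=1)
    k = conv3x3.weight.data.clone()
    b = conv3x3.bias.data.clone()
    # 1x1 kernel padded to 3x3 (placed at the centre)
    k[:, :, 1, 1] += conv1x1.weight.data[:, :, 0, 0]
    b += conv1x1.bias.data
    # identity shortcut as a 3x3 kernel with 1 at the centre of the matching channel
    for i in range(c):
        k[i, i, 1, 1] += 1.0
    fused.weight.data.copy_(k)
    fused.bias.data.copy_(b)
    return fused

# Quick numerical check of the equivalence
c = 8
conv3, conv1 = nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, c, 1)
x = torch.randn(1, c, 16, 16)
y_train = conv3(x) + conv1(x) + x
y_infer = fuse_identity_1x1_3x3(conv3, conv1)(x)
assert torch.allclose(y_train, y_infer, atol=1e-5)
```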
In the present disclosure, the re-parameterization technique is applied to video frame interpolation (VFI), and two kinds of re-parameter modules, namely the first and second re-parameter modules described above, are designed for different roles. Both kinds of re-parameter modules are equivalent to a convolution layer at inference time and bring no extra burden to model inference. The first re-parameter module may be understood as a shallow re-parameter module that focuses on the extraction of shallow image features, and therefore uses multiple parallel branches of shallow depth. The second (deep) re-parameter module focuses more on the usability of the network and on deep feature extraction, so model depth and multiple skip branches (i.e. the shortcut branches described above) are considered in its design.
In the training stage of the initial pyramid network model, the use of the re-parameter modules can effectively improve the generalization capability of the target pyramid network model obtained through training.
With continued reference to fig. 2, in step S220, a video frame interpolation model is determined according to the fusion network modules in the target pyramid network model, where the video frame interpolation model is used to insert intermediate frames between adjacent video frames of a video to be processed.
In an exemplary embodiment in which the synthesis module, in addition to fusing the first predicted intermediate frame and the second predicted intermediate frame to obtain the target predicted intermediate frame, also performs image warping on the front and rear frames corresponding to its fusion network module according to the optical flow estimated by the optical flow network in that fusion network module to obtain a first candidate intermediate frame and a second candidate intermediate frame, and fuses the first and second candidate intermediate frames according to the second fusion weight predicted by the optical flow network to obtain the first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module, one embodiment of step S220 may include: determining the video frame interpolation model according to the fusion network modules and a target synthesis module in the target pyramid network model, where the target synthesis module is the synthesis module corresponding to the bottom-most fusion network module in the target pyramid network model.
Taking the model structure shown in fig. 5 as an example, the synthesis module corresponding to each fusion network module may obtain the optical flow data, the second fusion weight, the second predicted intermediate frame and the first fusion weight output by that fusion network module. The synthesis module may first warp the front and rear frames according to the optical flow estimated by the optical flow network to obtain a first candidate intermediate frame and a second candidate intermediate frame, then fuse the two candidate intermediate frames according to the second fusion weight to obtain a first predicted intermediate frame, and finally fuse the first predicted intermediate frame and the second predicted intermediate frame according to the first fusion weight to obtain a target predicted intermediate frame (such as intermediate frames 1 to 4 in fig. 5). On this basis, the video frame interpolation model can be determined from fusion network Block1, fusion network Block2, fusion network Block3, fusion network Block4 and synthesis module 4 in the target pyramid network model. That is, synthesis modules 1 to 3 participate in training only, to improve the prediction accuracy of the model, and can be discarded at actual prediction time to improve the frame interpolation efficiency of the model.
For example, in the case where the warp operation according to the optical flow estimated by the optical flow network, and the operation of fusing the resulting first and second candidate intermediate frames according to the second fusion weight determined by the optical flow network to obtain the first predicted intermediate frame, are implemented inside the fusion network module, a specific embodiment of step S220 may include: for any fusion network module other than the bottom-most one in the target pyramid network model, determining a candidate fusion network module from the parts of that fusion network module other than the warp operation and the operation of fusing the first and second candidate intermediate frames according to the second fusion weight; and determining the video frame interpolation model from each candidate fusion network module, the bottom-most fusion network module in the target pyramid network model, and the synthesis module corresponding to the bottom-most fusion network module in the target pyramid network model.
In other words, in the present disclosure only the warp operation, the fusion according to the first fusion weight and the fusion according to the second fusion weight of the last layer (i.e. the bottom-most layer) of the target pyramid network model are retained, while these operations in the other layers of the target pyramid network model are discarded, thereby obtaining the video frame interpolation model.
In the video frame interpolation model determined in step S220, for the optical flow network of each fusion network module in the target pyramid network model, the re-parameter modules used by that module are converted into equivalent convolution layers. For example, the first re-parameter module may be equivalently replaced by a 3×3 convolution layer, such as the rightmost conv-3×3 in fig. 7, and the second re-parameter module may likewise be equivalently replaced by a 3×3 convolution layer, such as the rightmost conv-3×3 in fig. 8.
For example, in the present disclosure, the synthesis modules corresponding to the fusion network modules of all layers of the target pyramid network model other than the bottom-most layer participate only in the training of the model: their outputs are used for loss calculation against the intermediate frame label values as part of the training loss function of the initial pyramid network model, but are not used as input to the next layer, so they can be discarded at inference time, i.e. when determining the final video frame interpolation model. Synthesis modules 1 to 3 in fig. 5 can therefore be regarded as training-only modules, which assist the training of the model and improve its generalization capability and prediction accuracy. In the actual frame interpolation process, the training-only modules are discarded and only the other modules are retained for intermediate frame prediction; since the training-only modules no longer take part in inference, the intermediate frame prediction efficiency of the model is improved. This training-only-module design has been verified on data to improve the overall inference speed of the video frame interpolation model by 15%.
In other words, the training-only modules in the present disclosure improve both the model prediction accuracy and the model prediction efficiency. Meanwhile, in the present disclosure, the resolution-adjusted front and rear frames are used as additional input of each fusion network module, which effectively replaces the approach in which each fusion network module of the pyramid network model uses the target intermediate frame predicted by the layer above as input. This design does not degrade the model performance, yet allows the synthesis modules to be discarded during inference, so the inference of the model is accelerated while the accuracy of the bidirectional optical flow estimation is ensured, and the intermediate frame prediction efficiency of the model in actual frame interpolation tasks is improved.
Meanwhile, the addition of the synthesis network allows the video frame interpolation model to be used in a variety of scenes and improves the generalization capability of the model; in particular, a good interpolation effect can be achieved in scenes that do not follow the linear motion assumption. To avoid the increase in model inference time caused by the use of the synthesis network, the synthesis network multiplexes the two-layer features of the optical flow network, which reduces the inference time. In other words, having the synthesis network multiplex the two-layer features of the optical flow network not only improves the generalization capability of the model but also preserves its inference speed.
In addition, the use of the re-parameter modules also improves the inference speed and generalization capability of the model. The first re-parameter module is used to extract shallow features and effectively improves the feature extraction relevant to image frame interpolation, while the second re-parameter module is used to extract deep features, and the use of shortcut branches ensures the convergence of the model. When the frame interpolation task is actually performed, the re-parameter modules are equivalent to convolution layers: the performance of the model is unchanged but the model is smaller, so the actual prediction efficiency of the model can be improved while its performance is improved, and the model can be applied in situations with high real-time requirements.
The video frame interpolation model determined by the method for determining a video frame interpolation model in the present disclosure can reach an inference speed of 15 milliseconds on 1080P video in a running environment with a 2080 Ti GPU (Graphics Processing Unit), the libtorch inference framework, and cuda (Compute Unified Device Architecture) plus cudnn (CUDA Deep Neural Network library) as the inference backend. It effectively raises the video frame rate in real time from 30 frames to 60 frames, or even from 60 frames to 120 frames, and can be applied to various complex scenes, such as the various nonlinear motion scenes in game live streaming.
Fig. 9 shows a flow diagram of a video interpolation method in an exemplary embodiment of the present disclosure. Referring to fig. 9, the method may include steps S910 to S930. Wherein:
in step S910, an original adjacent video frame of the video to be processed is acquired, and the size of the original adjacent video frame is adjusted to obtain a target adjacent video frame.
In an exemplary embodiment, the video to be processed may include any video that needs to be subjected to frame insertion, such as a game video, a movie video, and the like, and the video to be processed may be a video that has been shot, or may be a live video in live broadcast, which is not particularly limited in this exemplary embodiment.
An embodiment of step S910 may include obtaining an original adjacent video frame of the video to be processed, and adjusting a size of the original adjacent video frame according to an input image size of a top-most fusion network module of the video plug-in model to obtain a target adjacent video frame. For example, in the case where the video plug-in model used in step S920 includes training according to the method shown in fig. 3 to obtain the video plug-in model, an original adjacent video frame of the video to be processed may be obtained, and the resolution of the original adjacent video frame may be adjusted according to the input image size of the fusion network module at the top layer of the video plug-in model to obtain the target adjacent video frame.
In an exemplary embodiment, the video frame interpolation model determined by the method in the present disclosure can perform frame interpolation on a video to be processed of any resolution, without requiring that its resolution be the same as or similar to that of the training video. However, the resolution ratio between the layers of the video frame interpolation model at actual inference time is the same as that used during training; for example, if during training of the initial pyramid network model the resolution of each layer is twice that of the layer above it, the resolution ratio between the layers of the video frame interpolation model is the same. Accordingly, the input image size of each layer's fusion network module during actual frame interpolation can be determined from the resolution of the video to be processed, the number of layers of the video frame interpolation model (i.e. the number of fusion network modules it contains), and the resolution ratio between its layers (i.e. between the fusion network modules).
By way of example, another embodiment of step S910 may include: obtaining original adjacent video frames of the video to be processed, and, for any fusion network module of the video frame interpolation model, adjusting the size of the original adjacent video frames according to the input image size of that fusion network module to obtain the target adjacent video frames corresponding to that fusion network module.
For example, in the case where the video frame interpolation model used in step S920 is trained according to the method shown in fig. 4, the input image resolution of each layer's fusion network module during actual frame interpolation may be determined from the resolution of the video to be processed, the number of layers of the video frame interpolation model, and the resolution ratio between its layers. The resolution of the original adjacent video frames is then adjusted according to the input image resolution of each fusion network module in the video frame interpolation model, so as to obtain the target adjacent video frames corresponding to each fusion network module.
In step S920, the target adjacent video frame is input into a video interpolation model to obtain an intermediate video frame between the original adjacent video frames.
In an exemplary embodiment, the video plug-in model in step S920 includes a video plug-in model obtained according to the above-described method for determining a video plug-in model, that is, includes the video plug-in model determined in step S220.
For example, in the case where the video interpolation model in step S920 is trained according to the method shown in fig. 3, the method for determining the intermediate video frame between the original adjacent video frames may be as shown in fig. 10.
Fig. 10 shows a flow diagram of a method of determining an intermediate video frame between original adjacent video frames in an exemplary embodiment of the present disclosure. Referring to fig. 10, the method may include steps S1010 to S1020. Wherein: in step S1010, inputting the target adjacent video frames into the fusion network module at the top layer of the video interpolation model; in step S1020, an intermediate video frame between the original adjacent video frames is obtained according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video interpolation model.
For example, in the method shown in fig. 10, for the video frame interpolation model the input data only includes the target adjacent video frames obtained by adjusting the resolution of the original adjacent video frames according to the input image resolution of the top-layer fusion network module of the video frame interpolation model. The target adjacent video frames are input into the top-most fusion network module, and through the data passed between the fusion network modules a first intermediate video frame predicted by the optical flow network of the bottom-most fusion network module and a second intermediate video frame predicted by the synthesis network of the bottom-most fusion network module are obtained. The first and second intermediate video frames are then fused by the synthesis module corresponding to the bottom-most fusion network module of the video frame interpolation model, so that the intermediate video frame between the original adjacent video frames can be obtained.
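A minimal inference sketch for this flow is given below, assuming the video frame interpolation model is wrapped so that it accepts the concatenated, resized adjacent frames and returns the fused output of the bottom-most synthesis module; the interface, the bilinear resizing and the tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_pair(model, frame_a: torch.Tensor, frame_b: torch.Tensor, top_size):
    """Resize the original adjacent frames (N, 3, H, W) to the input size of
    the top-most fusion network module, run the model, and return the fused
    output of the bottom-most synthesis module as the intermediate frame."""
    inputs = torch.cat(
        (F.interpolate(frame_a, size=top_size, mode="bilinear", align_corners=False),
         F.interpolate(frame_b, size=top_size, mode="bilinear", align_corners=False)),
        dim=1)
    return model(inputs)   # intermediate frame at the bottom-layer resolution
```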
For example, in the case where the video interpolation model in step S920 is trained according to the method shown in fig. 4, the method for determining the intermediate video frame between the original adjacent video frames may be as shown in fig. 11.
Fig. 11 shows a flow diagram of another method of determining an intermediate video frame between original adjacent video frames in an exemplary embodiment of the present disclosure. Referring to fig. 11, the method may include steps S1110 to S1120. Wherein: in step S1110, the target adjacent video frames corresponding to each fusion network module are respectively input into the corresponding fusion network module; in step S1120, according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video interpolation model, an intermediate video frame between the original adjacent video frames is obtained.
For example, in the method shown in fig. 11, for the video frame interpolation model each layer's fusion network module needs to be fed its corresponding target adjacent video frames, i.e. the target adjacent video frames whose resolution matches that module. Then, from the input target adjacent video frames and the data passed between the fusion network modules, a first intermediate video frame output by the optical flow network of the bottom-most fusion network module and a second intermediate video frame output by the bottom-most fusion network module are obtained. The first and second intermediate video frames are fused by the synthesis module corresponding to the bottom-most fusion network module, and the intermediate video frame between the original adjacent video frames is obtained based on the output of that synthesis module.
In step S930, the intermediate video frame is inserted between the original adjacent video frames to perform an inserting process on the video to be processed.
For example, an intermediate video frame between original adjacent video frames may be inserted between the original adjacent video frames to perform frame insertion processing on the video to be processed, so as to obtain a target video after frame insertion processing. The target video can be played in the terminal equipment, so that the fluency of the video played by the terminal equipment is improved.
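For example, doubling the frame rate of a sequence can be sketched as follows, where `interpolate_pair_fn` stands for any function (such as the inference sketch above) that predicts one intermediate frame from two adjacent frames; the names are illustrative.

```python
def double_frame_rate(frames, interpolate_pair_fn):
    """Insert one predicted intermediate frame between every pair of adjacent
    frames, e.g. turning a 30 fps sequence into a 60 fps sequence."""
    output = []
    for prev_frame, next_frame in zip(frames[:-1], frames[1:]):
        output.append(prev_frame)
        output.append(interpolate_pair_fn(prev_frame, next_frame))
    output.append(frames[-1])
    return output
```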
In the video frame interpolation model, since the training-only modules perform auxiliary training during the training stage, the accuracy of the intermediate video frames predicted by the model can be improved. In the actual frame interpolation process the training-only modules are discarded, which reduces the corresponding computation, improves the inference speed of the video frame interpolation model, and improves the efficiency of video frame interpolation, so that the video frame interpolation model and the video frame interpolation method determined by the above method can be used for video frame interpolation in situations with high real-time requirements.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as a computer program executed by a CPU. When executed by a CPU, performs the functions defined by the above-described method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic disk or an optical disk, etc.
Furthermore, it should be noted that the above-described figures are merely illustrative of the processes involved in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Fig. 12 is a schematic diagram showing the configuration of a determination apparatus of a video plug-in model in an exemplary embodiment of the present disclosure. The video frame inserting model comprises a pyramid network model, the pyramid network model comprises a plurality of fusion network modules, any fusion network module is determined according to an optical flow network and a synthetic network, and input data of the synthetic network is determined according to the optical flow network.
The video frame interpolation model determining apparatus 1200 may include a training module 1210 and a video frame interpolation model determining module 1220. The training module 1210 is configured to acquire a training data set and train the initial pyramid network model according to the training data set and the training loss function to obtain a target pyramid network model; the video frame interpolation model determining module 1220 is configured to determine a video frame interpolation model according to the fusion network modules in the target pyramid network model, where the video frame interpolation model is used to insert intermediate frames between adjacent video frames of a video to be processed.

In an exemplary implementation, based on the foregoing embodiment, the initial pyramid network model includes a plurality of synthesis modules corresponding to the plurality of fusion network modules in the initial pyramid network model; the synthesis module is configured to fuse a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network, so as to obtain the target predicted intermediate frame of that fusion network module; and the training loss function is determined according to the target predicted intermediate frame of the fusion network module and the intermediate frame label value corresponding to that fusion network module.
In an exemplary implementation manner, based on the foregoing embodiment, the synthesis module is further configured to perform image warping on front and rear frames corresponding to the fusion network module according to an optical flow estimated by an optical flow network in the fusion network module corresponding to the synthesis module, so as to obtain a first candidate intermediate frame and a second candidate intermediate frame, and fuse the first candidate intermediate frame and the second candidate intermediate frame, so as to obtain a first predicted intermediate frame determined by the optical flow network in the fusion network module corresponding to the synthesis module.
In an exemplary implementation, based on the foregoing embodiment, the synthesis module fuses the first predicted intermediate frame and the second predicted intermediate frame according to a first fusion weight, and fuses the first candidate intermediate frame and the second candidate intermediate frame according to a second fusion weight; the fusion network modules are connected in series, and for any fusion network module, its input data includes one or more of the optical flow estimated by the optical flow network in the fusion network module preceding it, the second predicted intermediate frame determined by the synthesis network in the fusion network module preceding it, the first fusion weight determined by the fusion network module preceding it, and the second fusion weight determined by the fusion network module preceding it.

In an exemplary implementation, based on the foregoing embodiments, the training module 1210 may be specifically configured to: acquire a training data set, and adjust the size of the video frames in the training data set according to the input image size of the top-most fusion network module of the initial pyramid network model, so as to obtain a target training data set of the pyramid network model; and train the initial pyramid network model according to the target training data set and the training loss function to obtain the target pyramid network model.
In an exemplary implementation, based on the foregoing embodiments, the training module 1210 may be specifically configured to: acquire a training data set and, for any fusion network module of the initial pyramid network model, adjust the size of the video frames in the training data set according to the input image size of that fusion network module, so as to obtain the first input data of that fusion network module; determine the target input data of the fusion network module according to its first input data, the optical flow estimated by the optical flow network in the fusion network module preceding it, and the second predicted intermediate frame determined by the synthesis network in the fusion network module preceding it; and train the initial pyramid network model according to the target input data of each fusion network module and the training loss function to obtain the target pyramid network model.
In an exemplary implementation, based on the foregoing embodiment, the input data of the synthesis network is determined from the optical flow network as follows: the second-layer features of the optical flow network (counted from the input) are concatenated with the penultimate-layer features of the optical flow network to obtain the input data of the synthesis network.
In an exemplary implementation, based on the foregoing embodiment, the optical flow network of the fusion network modules of the initial pyramid network model is determined as follows: the network layers used for shallow feature extraction in the optical flow network are determined according to a first re-parameter module, where the shallow features include edge features of the video frames; and the network layers used for deep feature extraction in the optical flow network are determined according to a second re-parameter module, where the deep features include abstract features of the video frames; the first re-parameter module is replaced with a first convolution layer in the video frame interpolation model, and the second re-parameter module is replaced with a second convolution layer in the video frame interpolation model.
In an exemplary implementation, based on the foregoing embodiments, the video frame interpolation model determining module 1220 may be specifically configured to: determine the video frame interpolation model according to the fusion network modules and a target synthesis module in the target pyramid network model, where the target synthesis module is the synthesis module corresponding to the bottom-most fusion network module in the target pyramid network model.
Fig. 13 illustrates a schematic structure of a video frame inserting apparatus in an exemplary embodiment of the present disclosure. Referring to fig. 13, the video inter-frame apparatus 1300 may include a resizing module 1310, an intermediate video frame determining module 1320, and an inter-frame module 1330. Wherein:
A size adjustment module 1310 configured to obtain an original adjacent video frame of the video to be processed, and adjust the size of the original adjacent video frame to obtain a target adjacent video frame;
an intermediate video frame determining module 1320 configured to input the target adjacent video frame into a video interpolation model to obtain an intermediate video frame between the original adjacent video frames, where the video interpolation model is obtained according to the above-mentioned determining method of the video interpolation model;
the frame inserting module 1330 is configured to insert the intermediate video frame between the original adjacent video frames to perform frame inserting processing on the video to be processed.
In an exemplary implementation, based on the foregoing embodiments, the resizing module 1310 may be specifically configured to: acquiring an original adjacent video frame of a video to be processed, and adjusting the size of the original adjacent video frame according to the input image size of a fusion network module at the top layer of a video plug-in frame model so as to obtain a target adjacent video frame; based on this, the intermediate video frame determination module 1320 may be specifically configured to: inputting the target adjacent video frames into a fusion network module at the topmost layer of the video interpolation model; and obtaining an intermediate video frame between the original adjacent video frames according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video plug-in frame model.
In an exemplary implementation, based on the foregoing embodiments, the resizing module 1310 may be specifically configured to: acquiring original adjacent video frames of a video to be processed, and aiming at any fusion network module of the video plug-in frame model, adjusting the size of the original adjacent video frames according to the input image size of the fusion network module to obtain target adjacent video frames corresponding to the fusion network module; based on this, the intermediate video frame determination module 1320 may be specifically configured to: respectively inputting the target adjacent video frames corresponding to each fusion network module into the corresponding fusion network module; and obtaining an intermediate video frame between the original adjacent video frames according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video plug-in frame model.
The specific details of each unit in the above apparatus have been described in detail in the corresponding method, and thus are not described here again.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer storage medium capable of implementing the above method is also provided. On which a program product is stored which enables the implementation of the method described above in the present specification. In some possible embodiments, the various aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the present disclosure described in the above description of the method of determining a video plug-in frame model and/or the video plug-in frame method section, when the program product is run on the terminal device.
Embodiments of the present disclosure may also include a program product for implementing the above method, which may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 1400 according to such an embodiment of the present disclosure is described below with reference to fig. 14. The electronic device 1400 shown in fig. 14 is merely an example and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
As shown in fig. 14, the electronic device 1400 is embodied in the form of a general-purpose computing device. Components of the electronic device 1400 may include, but are not limited to: at least one processing unit 1410, at least one memory unit 1420, a bus 1430 connecting the different system components (including the memory unit 1420 and the processing unit 1410), and a display unit 1440.
The memory unit 1420 stores program code that is executable by the processing unit 1410, such that the processing unit 1410 performs the steps according to the various exemplary embodiments of the present disclosure described above in the method of determining a video frame interpolation model and/or the video frame interpolation method of the present specification. For example, the processing unit 1410 may perform the steps shown in fig. 2.
The memory unit 1420 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 14201 and/or cache memory 14202, and may further include Read Only Memory (ROM) 14203.
The memory unit 1420 may also include a program/utility 14204 having a set (at least one) of program modules 14205, such program modules 14205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1430 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1400 may also communicate with one or more external devices 1500 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1400, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 1400 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1450. Also, the electronic device 1400 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter 1460. As shown, the network adapter 1460 communicates with other modules of the electronic device 1400 via the bus 1430. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 1400, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (15)

1. A method for determining a video frame interpolation model, wherein the video frame interpolation model includes a pyramid network model, the pyramid network model includes a plurality of fusion network modules, any one of the fusion network modules is determined according to an optical flow network and a synthesis network, and input data of the synthesis network is determined according to the optical flow network, the method comprising:
acquiring a training data set, and training an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model;
determining a video frame interpolation model according to a fusion network module in the target pyramid network model, wherein the video frame interpolation model is used for inserting an intermediate frame between adjacent video frames of a video to be processed;
the initial pyramid network model comprises a plurality of synthesis modules corresponding to a plurality of fusion network modules in the initial pyramid network model; the synthesis module is used for fusing a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network to obtain a target predicted intermediate frame of the fusion network module;
and the training loss function is determined according to the target predicted intermediate frame of the fusion network module and the intermediate-frame label value corresponding to the fusion network module.
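As a non-normative illustration of the training setup recited in claim 1, the sketch below accumulates one loss term per fusion network module of a pyramid network model. It assumes a PyTorch-style interface in which each fusion module returns the optical-flow-based and synthesis-based predictions and each synthesis module fuses them; the module interfaces, the 1/2-per-level scaling, and the L1 loss are illustrative assumptions, not details fixed by the claims.

```python
import torch.nn.functional as F

def pyramid_training_loss(fusion_modules, synthesis_modules, frame0, frame1, gt_mid):
    """Accumulate one loss term per fusion network module of the pyramid.

    Assumed interfaces (illustrative only):
      fusion(i0, i1)        -> (first_pred, second_pred)  # optical-flow / synthesis predictions
      synth(first, second)  -> target_pred                # "target predicted intermediate frame"
    """
    total = 0.0
    for level, (fusion, synth) in enumerate(zip(fusion_modules, synthesis_modules)):
        scale = 1.0 / (2 ** level)  # assumed per-level downscaling
        i0 = F.interpolate(frame0, scale_factor=scale, mode="bilinear", align_corners=False)
        i1 = F.interpolate(frame1, scale_factor=scale, mode="bilinear", align_corners=False)
        gt = F.interpolate(gt_mid, scale_factor=scale, mode="bilinear", align_corners=False)
        first_pred, second_pred = fusion(i0, i1)
        target_pred = synth(first_pred, second_pred)
        total = total + F.l1_loss(target_pred, gt)  # compared against the level's label value
    return total
```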
2. The method for determining a video frame interpolation model according to claim 1, wherein the synthesis module is further configured to perform image warping on the preceding and following frames corresponding to the fusion network module according to an optical flow estimated by the optical flow network in the fusion network module corresponding to the synthesis module, so as to obtain a first candidate intermediate frame and a second candidate intermediate frame, and to fuse the first candidate intermediate frame and the second candidate intermediate frame, so as to obtain the first predicted intermediate frame determined by the optical flow network in the fusion network module corresponding to the synthesis module.
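The warping-and-fusion step of claim 2 can be pictured with the hedged sketch below: the preceding and following frames are backward-warped with the estimated optical flows to form two candidate intermediate frames, which are then blended. The function names and the default 0.5 blend are assumptions; the claim does not prescribe a particular warping operator.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Sample img (N,C,H,W) at positions displaced by flow (N,2,H,W), i.e. backward warping."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1,2,H,W), x then y
    grid = base + flow
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1),
                         mode="bilinear", padding_mode="border", align_corners=True)

def first_predicted_intermediate(frame0, frame1, flow_t0, flow_t1, weight=0.5):
    cand0 = backward_warp(frame0, flow_t0)  # first candidate intermediate frame
    cand1 = backward_warp(frame1, flow_t1)  # second candidate intermediate frame
    # Claim 3 refers to this blend factor as the "second fusion weight"; 0.5 is only a default.
    return weight * cand0 + (1.0 - weight) * cand1
```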
3. The method for determining a video frame interpolation model according to claim 2, wherein the synthesis module fuses the first predicted intermediate frame and the second predicted intermediate frame according to a first fusion weight, and the synthesis module fuses the first candidate intermediate frame and the second candidate intermediate frame according to a second fusion weight;
the plurality of fusion network modules are connected in series, and for any fusion network module, the input data of the fusion network module comprises one or more of: an optical flow estimated by the optical flow network in the preceding fusion network module, a second predicted intermediate frame determined by the synthesis network in the preceding fusion network module, a first fusion weight determined by the preceding fusion network module, and a second fusion weight determined by the preceding fusion network module.
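The serial connection in claim 3 can be read as a coarse-to-fine loop in which each fusion network module receives, besides its own resized frame pair, the outputs carried over from the preceding module. In the sketch below, the dictionary keys, the 2x upsampling factor, and the assumption that coarser modules run first are all illustrative choices rather than requirements of the claim.

```python
import torch.nn.functional as F

def run_pyramid(fusion_modules, inputs_per_level):
    """fusion_modules and inputs_per_level are ordered coarse-to-fine (assumption).

    Each module is assumed to return a dict with keys
    "flow", "second_pred", "fusion_w1", "fusion_w2"
    and to accept the preceding module's dict (or None) as extra input.
    """
    prev = None
    for fusion, (i0, i1) in zip(fusion_modules, inputs_per_level):
        if prev is not None:
            # Upsample the preceding module's outputs to the current working resolution.
            prev = {k: F.interpolate(v, scale_factor=2, mode="bilinear", align_corners=False)
                    for k, v in prev.items()}
            prev["flow"] = prev["flow"] * 2.0  # flow magnitudes scale with spatial resolution
        prev = fusion(i0, i1, prev)
    return prev
```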
4. The method for determining a video frame interpolation model according to claim 1, wherein the acquiring a training data set and training an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model comprises:
acquiring a training data set, and adjusting the size of a video frame in the training data set according to the input image size of a fusion network module at the topmost layer of an initial pyramid network model so as to obtain a target training data set of the pyramid network model;
and training the initial pyramid network model according to the target training data set and the training loss function to obtain a target pyramid network model.
5. The method for determining a video frame interpolation model according to claim 1, wherein the acquiring a training data set and training an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model comprises:
acquiring a training data set, and, for any fusion network module of an initial pyramid network model, adjusting the size of the video frames in the training data set according to the input image size of the fusion network module so as to obtain first input data of the fusion network module;
for any fusion network module of the initial pyramid network model, determining target input data of the fusion network module according to the first input data of the fusion network module, an optical flow estimated by the optical flow network in the preceding fusion network module, and a second predicted intermediate frame determined by the synthesis network in the preceding fusion network module;
and training the initial pyramid network model according to the target input data of each fusion network module and the training loss function to obtain a target pyramid network model.
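A hedged reading of claims 4 and 5 is that each fusion network module has its own input image size, so the training frames are resized once per module before being fed in. The helper below sketches that preparation; the concrete sizes are placeholders, not values taken from the patent.

```python
import torch.nn.functional as F

def build_inputs_per_level(frame0, frame1, module_sizes=((64, 112), (128, 224), (256, 448))):
    """Resize an adjacent frame pair once per fusion network module (sizes are placeholders)."""
    inputs = []
    for h, w in module_sizes:
        i0 = F.interpolate(frame0, size=(h, w), mode="bilinear", align_corners=False)
        i1 = F.interpolate(frame1, size=(h, w), mode="bilinear", align_corners=False)
        inputs.append((i0, i1))
    return inputs
```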
6. The method for determining a video frame interpolation model according to claim 1, wherein the input data of the synthesis network being determined according to the optical flow network comprises:
splicing the features of the second layer from the front of the optical flow network with the features of the penultimate layer of the optical flow network to obtain the input data of the synthesis network.
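Claim 6 builds the synthesis network's input from two feature maps of the optical flow network: an early one (second layer from the front) and a late one (penultimate layer). The sketch below concatenates them along the channel dimension; the argument names and the resizing of the deeper map to the shallower map's resolution are assumptions.

```python
import torch
import torch.nn.functional as F

def synthesis_input(shallow_feat, deep_feat):
    """Concatenate the flow network's second-from-front and penultimate feature maps."""
    if deep_feat.shape[-2:] != shallow_feat.shape[-2:]:
        # Bring both maps to a common spatial size before splicing (assumption).
        deep_feat = F.interpolate(deep_feat, size=shallow_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return torch.cat((shallow_feat, deep_feat), dim=1)
```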
7. The method for determining a video frame interpolation model according to claim 1, wherein the manner of determining the optical flow network of a fusion network module of the initial pyramid network model includes:
determining, according to a first re-parameterization module, a network layer for shallow feature extraction in the optical flow network, wherein the shallow features comprise edge features of a video frame;
determining, according to a second re-parameterization module, a network layer for deep feature extraction in the optical flow network, wherein the deep features comprise abstract features of a video frame;
wherein the first re-parameterization module is replaced with a first convolution layer in the video frame interpolation model, and the second re-parameterization module is replaced with a second convolution layer in the video frame interpolation model.
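Claim 7's re-parameterization modules can be read in the spirit of structural re-parameterization (as in RepVGG-style blocks): a multi-branch block used during training is algebraically collapsed into a single convolution layer in the deployed video frame interpolation model. The sketch below is one such illustrative block, not the patent's exact module design.

```python
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    """Training-time block with parallel 3x3 and 1x1 convolutions (illustrative assumption)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        # Multi-branch form used while training the pyramid network model.
        return F.relu(self.conv3(x) + self.conv1(x))

    def fuse(self) -> nn.Conv2d:
        # Collapse the two branches into one equivalent 3x3 convolution for deployment.
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          kernel_size=3, padding=1)
        weight = self.conv3.weight.data.clone()
        weight += F.pad(self.conv1.weight.data, [1, 1, 1, 1])  # place the 1x1 kernel at the 3x3 centre
        fused.weight.data.copy_(weight)
        fused.bias.data.copy_(self.conv3.bias.data + self.conv1.bias.data)
        return fused
```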
8. The method for determining a video frame interpolation model according to claim 1 or 2, wherein the determining a video frame interpolation model according to the fusion network module in the target pyramid network model includes:
determining a video frame interpolation model according to a fusion network module and a target synthesis module in the target pyramid network model;
wherein the target synthesis module comprises the synthesis module corresponding to the fusion network module at the bottommost layer of the target pyramid network model.
9. A video frame interpolation method, comprising:
acquiring original adjacent video frames of a video to be processed, and adjusting the size of the original adjacent video frames to obtain target adjacent video frames;
inputting the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames;
inserting the intermediate video frame between the original adjacent video frames to perform frame interpolation processing on the video to be processed;
wherein the video frame interpolation model is determined according to the method of any one of claims 1 to 8.
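As a non-normative end-to-end sketch of the interpolation method in claim 9: each pair of original adjacent video frames is resized to the model's input size, the intermediate video frame is predicted, resized back to the original resolution, and spliced into the output sequence. The default model size and the `interp_model` interface are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_sequence(frames, interp_model, model_size=(256, 448)):
    """frames: list of (1, 3, H, W) tensors; returns the sequence with midpoints inserted."""
    out = [frames[0]]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        h, w = prev.shape[-2:]
        p = F.interpolate(prev, size=model_size, mode="bilinear", align_corners=False)
        n = F.interpolate(nxt, size=model_size, mode="bilinear", align_corners=False)
        mid = interp_model(p, n)  # predicted intermediate video frame
        mid = F.interpolate(mid, size=(h, w), mode="bilinear", align_corners=False)
        out.extend([mid, nxt])
    return out
```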
10. The video frame interpolation method according to claim 9, wherein the acquiring original adjacent video frames of a video to be processed and adjusting the size of the original adjacent video frames to obtain target adjacent video frames comprises:
acquiring original adjacent video frames of a video to be processed, and adjusting the size of the original adjacent video frames according to the input image size of the fusion network module at the topmost layer of the video frame interpolation model so as to obtain target adjacent video frames;
and the inputting the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames comprises:
inputting the target adjacent video frames into the fusion network module at the topmost layer of the video frame interpolation model;
and obtaining the intermediate video frame between the original adjacent video frames according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video frame interpolation model.
11. The video frame interpolation method according to claim 9, wherein the acquiring original adjacent video frames of a video to be processed and adjusting the size of the original adjacent video frames to obtain target adjacent video frames comprises:
acquiring original adjacent video frames of a video to be processed, and, for any fusion network module of the video frame interpolation model, adjusting the size of the original adjacent video frames according to the input image size of the fusion network module to obtain target adjacent video frames corresponding to the fusion network module;
and the inputting the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames comprises:
respectively inputting the target adjacent video frames corresponding to each fusion network module into the corresponding fusion network module;
and obtaining the intermediate video frame between the original adjacent video frames according to the output of the synthesis module corresponding to the fusion network module at the bottommost layer of the video frame interpolation model.
12. A device for determining a video frame interpolation model, wherein the video frame interpolation model includes a pyramid network model, the pyramid network model includes a plurality of fusion network modules, any one of the fusion network modules is determined according to an optical flow network and a synthesis network, and input data of the synthesis network is determined according to the optical flow network, the device comprising:
a training module, configured to acquire a training data set, and train an initial pyramid network model according to the training data set and a training loss function to obtain a target pyramid network model;
a video frame interpolation model determining module, configured to determine a video frame interpolation model according to a fusion network module in the target pyramid network model, wherein the video frame interpolation model is used for inserting an intermediate frame between adjacent video frames of a video to be processed;
wherein the initial pyramid network model comprises a plurality of synthesis modules corresponding to the plurality of fusion network modules in the initial pyramid network model; the synthesis module is used for fusing a first predicted intermediate frame determined by the optical flow network in the corresponding fusion network module with a second predicted intermediate frame determined by the synthesis network to obtain a target predicted intermediate frame of the fusion network module;
and the training loss function is determined according to the target predicted intermediate frame of the fusion network module and the intermediate-frame label value corresponding to the fusion network module.
13. A video frame interpolation apparatus, comprising:
a size adjustment module, configured to acquire original adjacent video frames of a video to be processed, and adjust the size of the original adjacent video frames to obtain target adjacent video frames;
an intermediate video frame determination module, configured to input the target adjacent video frames into a video frame interpolation model to obtain an intermediate video frame between the original adjacent video frames;
a frame interpolation module, configured to insert the intermediate video frame between the original adjacent video frames so as to perform frame interpolation processing on the video to be processed;
wherein the video frame interpolation model is determined according to the method of any one of claims 1 to 8.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 11.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1 to 11.
CN202310149226.7A 2023-02-20 2023-02-20 Method and device for determining video frame inserting model, and method and device for video frame inserting Pending CN116156218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149226.7A CN116156218A (en) 2023-02-20 2023-02-20 Method and device for determining video frame inserting model, and method and device for video frame inserting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310149226.7A CN116156218A (en) 2023-02-20 2023-02-20 Method and device for determining video frame inserting model, and method and device for video frame inserting

Publications (1)

Publication Number Publication Date
CN116156218A true CN116156218A (en) 2023-05-23

Family

ID=86361531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149226.7A Pending CN116156218A (en) 2023-02-20 2023-02-20 Method and device for determining video frame inserting model, and method and device for video frame inserting

Country Status (1)

Country Link
CN (1) CN116156218A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886961A (en) * 2023-09-06 2023-10-13 中移(杭州)信息技术有限公司 Distributed live video frame inserting method, device, system and storage medium
CN116886961B (en) * 2023-09-06 2023-12-26 中移(杭州)信息技术有限公司 Distributed live video frame inserting method, device, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination