CN115883869A - Swin Transformer-based video frame interpolation model processing method, device and equipment


Info

Publication number
CN115883869A
Authority
CN
China
Prior art keywords: video, layer, feature, video frame, frame
Prior art date
Legal status: Granted
Application number
CN202211502343.9A
Other languages
Chinese (zh)
Other versions
CN115883869B (en)
Inventor
李登实
王前瑞
陈澳雷
高雨
宋昊
薛童
朱晨倚
Current Assignee
Jianghan University
Original Assignee
Jianghan University
Priority date
Filing date
Publication date
Application filed by Jianghan University
Priority to CN202211502343.9A
Publication of CN115883869A
Application granted
Publication of CN115883869B
Legal status: Active
Anticipated expiration

Landscapes

  • Television Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application provides a Swin Transformer-based video frame interpolation model processing method, apparatus and device, which provide a novel training framework for training a video frame interpolation model.

Description

Swin Transformer-based video frame interpolation model processing method, device and equipment
Technical Field
The application relates to the field of video, and in particular to a Swin Transformer-based video frame interpolation model processing method, apparatus and device.
Background
With the development of technology, playback devices such as televisions, tablets and mobile phones can support video at higher frame rates. However, limited by factors such as network transmission, frames lost during shooting, or post-production editing, the frame rate of online video often differs considerably from the frame rate such devices can display. When a user watches video with a low actual frame rate, the playback easily feels choppy. To offset this choppiness, video frame interpolation technology is commonly used to raise the video frame rate and give the user a smoother playback experience.
Video frame interpolation, which may also be referred to as frame rate conversion, inserts one or more frames between adjacent frames of the original video and shortens the display interval between frames (for example, by increasing the number of pictures played per second), thereby improving the smoothness of the video and achieving a better visual effect.
During research into the prior art, the inventors of the present application found that existing video frame interpolation techniques can produce unstable results: although the frame rate is increased, the interpolated pictures may contain visible anomalies that feel abrupt to the user, indicating that the interpolation accuracy is still limited.
Disclosure of Invention
The application provides a Swin Transformer-based video frame interpolation model processing method, apparatus and device, which provide a novel training framework for training a video frame interpolation model, so that the trained model can interpolate frames into a video to be interpolated more accurately, noticeably reducing abrupt artifacts and yielding a smoother video playback experience.
In a first aspect, the present application provides a method for processing a Swin Transformer-based video frame interpolation model, where the method includes:
acquiring a sample set, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos correspond to the different sample audios one to one;
extracting audio features of different sample audios, wherein the audio features comprise a frequency spectrum envelope MFCC, a frequency domain feature FBANK, a fundamental frequency pitch and an unvoiced feature;
coding the audio features to obtain high-order audio features;
extracting three layers of video frame spatio-temporal features from the sample video through a three-layer Swin Transformer-based neural network, wherein each layer of the network outputs one layer of video frame spatio-temporal features;
training a neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and obtaining a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used to predict intermediate frames in a video to be interpolated, in combination with the corresponding audio, on the basis of the input video to be interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
In a second aspect, the present application provides an apparatus for processing a Swin Transformer-based video frame interpolation model, where the apparatus includes:
an acquisition unit, configured to acquire a sample set, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos correspond to the different sample audios one to one;
the extraction unit is used for extracting audio features of different sample audios, wherein the audio features comprise a frequency spectrum envelope MFCC, a frequency domain feature FBANK, a fundamental frequency pitch and an unvoiced feature;
the encoding unit is used for encoding the audio features to obtain high-order audio features;
the extraction unit is also used for extracting three layers of video frame space-time characteristics of the sample video through three layers of neural networks based on Swin Transformer, wherein each layer of neural network outputs one layer of video frame space-time characteristics;
and a training unit, configured to train the neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and to obtain a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used to predict intermediate frames in a video to be interpolated, in combination with the corresponding audio, on the basis of the input video to be interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
In a third aspect, the present application provides a processing device, including a processor and a memory, where the memory stores a computer program, and the processor executes the method provided in the first aspect of the present application or any one of the possible implementation manners of the first aspect of the present application when calling the computer program in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium storing a plurality of instructions, which are suitable for being loaded by a processor to perform the method provided by the first aspect of the present application or any one of the possible implementation manners of the first aspect of the present application.
From the above, the present application has the following advantageous effects:
To meet the requirements of video frame interpolation, during training of the video frame interpolation model the application attends not only to the picture features (video features) considered in the prior art, but also to audio features, extracting high-order audio features from the basic audio features. For the video features, three layers of Swin Transformer-based neural networks are configured to extract three layers of video frame spatio-temporal features, so that richer video features are obtained. On this basis, the high-order audio features and the three layers of video frame spatio-temporal features are fused to predict the final intermediate frame. Under this novel training framework, the trained video frame interpolation model can interpolate frames into a video to be interpolated more accurately, noticeably reducing abrupt artifacts and yielding a smoother video playback experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a processing method of a Swin Transformer-based video frame interpolation model according to the present application;
FIG. 2 is a schematic diagram of a model training architecture of the present application;
FIG. 3 is a schematic structural diagram of a processing apparatus of a Swin Transformer-based video frame interpolation model according to the present application;
fig. 4 is a schematic structural view of a processing apparatus according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
The division of the modules presented in this application is a logical division, and in practical applications, there may be another division, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed, and in addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, and the indirect coupling or communication connection between the modules may be in an electrical or other similar form, which is not limited in this application. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the present disclosure.
Before introducing the processing method of the Swin Transformer-based video frame interpolation model provided in the present application, first, the background related to the present application is introduced.
The Swin Transformer-based video frame interpolation model processing method and apparatus, and the computer-readable storage medium, provided by the present application can be applied to a processing device and are used to provide a novel training framework for training a video frame interpolation model; the trained model can interpolate frames into a video to be interpolated more accurately, noticeably reducing abrupt artifacts and yielding a smoother video playback experience.
In the processing method of the Swin Transformer-based video frame interpolation model, the execution subject may be a processing apparatus for the Swin Transformer-based video frame interpolation model, or a different type of processing device, such as a server, a physical host, or User Equipment (UE), into which the processing apparatus is integrated. The processing apparatus may be implemented in hardware or software; the UE may specifically be a terminal device such as a smartphone, tablet computer, notebook computer, desktop computer, or Personal Digital Assistant (PDA); and the processing device may also be deployed as a device cluster.
In practical applications, the processing device may be a back-end device providing technical support, so that model configuration is performed in the background and the model is supplied to relevant users or to the operator of a video application. Of course, when the processing device directly applies the model, that is, plays video based on the trained video frame interpolation model, the processing device itself may be a user-side device, such as a physical host or UE, in which case both training and application of the model are performed locally.
Next, a method for processing the Swin Transformer-based video frame interpolation model provided in the present application is described.
First, referring to fig. 1, fig. 1 shows a flowchart of a processing method of a Swin Transformer-based video frame interpolation model according to the present application, and the processing method of a Swin Transformer-based video frame interpolation model according to the present application may specifically include the following steps S101 to S105:
step S101, a sample set is obtained, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos correspond to the different sample audios one to one;
it will be appreciated that training of the video frame interpolation model begins with a sample set configured for training the model.
The sample set includes different sample videos, corresponding to the processing objects of the video frame interpolation model. It should be noted that, in the present application, the frame interpolation processing additionally takes audio into account; therefore, the sample set also includes different sample audios, which are obviously configured together with the sample videos.
The sample audio can be extracted from the audio content of the sample video, or configured separately alongside the sample video; the purpose of configuring sample audio is to provide audio-feature guidance for model training, in addition to the video features (picture features) provided by the sample video.
As an example, the acquisition process of the sample set may include the following steps (a sketch of the audio segmentation appears after this list):
adopting VoxCeleb2 as the speaker dataset and extracting the sample audio from the sample videos, where the video frame rate of the dataset is 25 fps and the resolution is 224 x 224;
splitting each 1 s of video into 25 pictures according to the video frame rate (25 fps);
splitting the audio into a matching number of audio segments according to the video frame rate: starting from the beginning of the audio, segments of about 30 milliseconds are cut with a sliding window using a frame shift of 13 milliseconds (adjacent segments overlap by 17 milliseconds), splitting the audio into 77 segments; 25 segments are then selected at intervals starting from the third segment, and each selected segment is combined with the segment before and the segment after it into one whole, corresponding to one picture;
of the resulting dataset (including video and audio), 80% is used as the training set for training the neural network model and 20% as the test set for testing it.
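As a hedged illustration of the segmentation above, the following sketch pairs each video frame with its roughly 30-millisecond audio context; the 13 ms frame shift follows the list, while the exact windowing and the function name are assumptions for illustration only.

```python
# Minimal sketch of the audio/video alignment described in the list above.
import numpy as np

def split_audio_for_frames(audio: np.ndarray, sr: int, fps: int = 25):
    shift = int(0.013 * sr)        # 13 ms frame shift -> roughly 77 segments per second
    win = int(0.030 * sr)          # about 30 ms of audio per segment
    segments = [audio[i:i + win] for i in range(0, len(audio) - win + 1, shift)]
    # Starting from the 3rd segment, pick one segment per video frame (25 per second)
    # and merge it with its previous and next segments into one audio context.
    paired = []
    step = max(1, len(segments) // fps)
    for k in range(2, len(segments) - 1, step):
        if len(paired) == fps:
            break
        paired.append(np.concatenate([segments[k - 1], segments[k], segments[k + 1]]))
    return paired                  # one audio context per video frame
```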
To improve the accuracy and precision of the model during training, data augmentation operations, including random cropping, scaling, mirroring or rotation of the video frames, can also be performed on the sample data.
Correspondingly, the following data processing needs to consider how to extract the audio features of the sample audio to help the training of the video frame interpolation model.
Step S102, extracting audio features of different sample audios, wherein the audio features comprise a frequency spectrum envelope MFCC, a frequency domain feature FBANK, a fundamental frequency pitch and an unvoiced feature;
after the sample set is obtained, the audio features of the sample audio therein can be extracted, including specific features such as the spectral envelope MFCC, the frequency domain feature FBANK, the fundamental frequency pitch, and the unvoiced sound feature.
For the spectral envelope MFCC: MFCC stands for Mel-Frequency Cepstral Coefficients, a feature extracted based on the auditory characteristics of the human ear. The Mel scale has a nonlinear correspondence with frequency in Hz, and MFCC uses this relationship to compute a spectral feature, providing a data basis for the subsequent audio feature processing.
For the frequency-domain feature FBANK (Filter Bank): the response of the human ear to the sound spectrum is nonlinear, and FBANK is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve speech recognition performance. The main steps for obtaining the FBANK features of a speech signal include pre-emphasis, framing, windowing, Short-Time Fourier Transform (STFT), Mel filtering and mean removal; applying a Discrete Cosine Transform (DCT) to the FBANK features then yields the spectral envelope MFCC.
for the fundamental pitch, the fundamental pitch is related to the tone and intonation that the user can perceive, and is therefore commonly used to describe the tone and intonation trends of tonal languages.
For the unvoiced feature: unvoiced sound is sound produced with airflow friction as the sound source, and the unvoiced feature characterizes this property of the sound itself.
It can be understood that, with specific features such as the spectral envelope MFCC, the frequency-domain feature FBANK, the fundamental frequency pitch and the unvoiced feature, the relevant audio characteristics of the sample audio can be better characterized from multiple aspects, laying a good foundation for the subsequent model training.
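As a concrete illustration of these four feature types, the sketch below extracts them with librosa; the library choice and all parameter values are assumptions, since the patent does not name a toolkit.

```python
# Sketch of the audio feature extraction in step S102: FBANK (log-Mel), MFCC,
# fundamental frequency and a voiced/unvoiced indicator for one sample audio waveform.
import numpy as np
import librosa

def extract_audio_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    # FBANK: log-Mel filter-bank energies (framing/STFT/Mel filtering handled internally).
    fbank = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    # MFCC: DCT of the log-Mel energies, i.e. the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Fundamental frequency and voiced/unvoiced decision via pYIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr)
    f0 = np.nan_to_num(f0)                       # unvoiced frames carry no f0
    unvoiced = (~voiced_flag).astype(np.float32)
    # Pool per-frame features into one vector for this audio clip.
    return np.concatenate([fbank.mean(axis=1), mfcc.mean(axis=1),
                           [f0.mean()], [unvoiced.mean()]])
```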
Step S103, coding the audio features to obtain high-order audio features;
After the audio features, including the spectral envelope MFCC, the frequency-domain feature FBANK, the fundamental frequency pitch and the unvoiced feature, are obtained, they can be encoded by a corresponding encoder to obtain high-order audio features.
Regarding the high-order audio features referred to here, it should be understood that the application aims to obtain further audio features: since the audio features are time-series signals, they can be processed further by an encoder so that subsequent data processing focuses more on the valuable parts of the audio features.
As a practical implementation manner, the encoding of the audio features to obtain the high-order audio features herein may specifically be implemented as follows:
carrying out convolution processing on different sample audios by using the 1D convolution layer to obtain a convolution result;
and coding the convolution result through a Transformer coder with 5 layers, and performing feature mapping by using a full connection layer to obtain high-order audio features.
For this implementation, the application considers that the audio features are time-series signals, so 1D convolution is used to further process them into audio features containing temporal information; the Transformer encoder includes an attention mechanism, so that during processing more informative parts can be given greater weight and irrelevant information less weight, extracting audio features that are more useful for the subsequent data processing, i.e., for inserting the intermediate frame.
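A minimal sketch of this audio encoder follows; the feature dimensions are illustrative assumptions rather than values fixed by the patent, and only the structure (1D convolution, a 5-layer Transformer encoder, a fully connected mapping) mirrors the text.

```python
# Sketch of the audio encoder: 1D convolution over the audio feature sequence,
# a 5-layer Transformer encoder, and a fully connected layer producing the
# high-order audio feature.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, in_dim=55, d_model=128, out_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=5)
        self.fc = nn.Linear(d_model, out_dim)               # feature mapping

    def forward(self, x):                                   # x: (batch, time, in_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)    # temporal 1D convolution
        x = self.encoder(x)                                 # attention weights informative frames
        return self.fc(x)                                   # high-order audio feature

# Example: 25 audio segments per second, each with an assumed 55-dim feature vector.
high_order = AudioEncoder()(torch.randn(2, 25, 55))
```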
Step S104, extracting three layers of video frame spatio-temporal features of the sample video through three layers of Swin Transformer-based neural networks, wherein each layer of the network outputs one layer of video frame spatio-temporal features;
it can be understood that, in order to predict the inter frame between two video frames more accurately, in addition to using the above processed higher-order audio features as the guidance of the model prediction process, it can also consider performing corresponding optimization on the video features (picture features) themselves, so as to obtain more accurate guidance effect from the video features themselves.
Specifically, the application processes the video features as video frame spatio-temporal features, which may involve spatial features within the video frames as well as temporal features between video frames; to fuse better with the high-order audio features later, three layers of video frame spatio-temporal features are generated, and the specific fusion is described below.
The processing of the video frame spatio-temporal features involves a three-layer neural network model designed in this application based on Swin Transformer: through the designed three-layer output architecture, features are extracted from alternate video frames (referred to below as the adjacent odd-numbered video frames), and three layers of video frame spatio-temporal features are output.
It will be appreciated that, although the layers of this architecture start from the same input data, the differently configured layers output different video frame spatio-temporal features, yielding rich spatio-temporal features that provide more detailed feature references for the subsequent fusion with the high-order audio features and the prediction of the intermediate frames.
Specifically, as another practical implementation, referring to the architecture diagram of the neural network shown in Fig. 2, the process of extracting the three layers of video frame spatio-temporal features of the sample video through the Swin Transformer-based neural network may specifically include the following:
performing convolution processing on adjacent odd-numbered video frames in the different sample videos in the Swin Transformer-based encoder to obtain picture features;
dividing the picture feature into four parts of the same size along the middle and computing attention for each part separately to obtain a first spatial feature containing local spatial information; then further dividing the four parts and exchanging their positions to obtain four new parts of the same size, and recomputing local attention to obtain a second spatial feature containing both local and global spatial information;
splicing adjacent odd-numbered video frames in the different sample videos along the time dimension, splitting the splicing result according to the number of pixel points, and computing temporal attention features between the pixel points corresponding to the first and second spatial features;
and decoding, with a three-layer Swin Transformer-based decoder, the four layers of video frame spatio-temporal features obtained by four rounds of the same processing in the Swin Transformer-based encoder (that is, there are four layers of encoders, as shown in Fig. 2), to obtain three layers of video frame spatio-temporal features, where the spatio-temporal features comprise the first spatial feature, the second spatial feature and the temporal attention feature; the first-layer features are decoded by the first-layer decoder, the second-layer features by the second-layer decoder, and the third-layer and fourth-layer features by the third-layer decoder.
For this arrangement, specifically the attention feature computation, the picture feature is first divided into four parts of the same size and attention is computed for each part separately to obtain local attention features; the picture feature is then subdivided into nine (the number of divisions in this example) parts of different sizes, whose positions are moved to reassemble four new parts, and attention is computed again for each part, so that global attention is obtained; the spatial features are obtained through the above processing.
Because the video frames are temporally related, the input odd-numbered frames are spliced in order along the time dimension, and the temporal features are obtained by computing attention over the spliced features according to the number of pixel points.
Intermediate frame obtaining process: the picture features are processed by the four-layer encoder to obtain picture spatio-temporal features. During decoding, the results of the third and fourth encoders are decoded, and the intermediate frame of the third layer is obtained through a decoder, a fusion device and a synthesizer. The decoders of the second and first layers differ from the third-layer decoder: after the third-layer intermediate frame is obtained, it is sampled and its features are extracted; the features fed into the second-layer decoder and the intermediate-frame features are spliced in temporal order along the time dimension and processed again with a Swin Transformer encoder to obtain spatio-temporal features fused with the intermediate-frame features, and the second-layer intermediate frame is then obtained through the fusion device and the synthesizer. The process for obtaining the first-layer intermediate frame is similar to that of the second layer, and the intermediate frame obtained by the first layer is the final intermediate frame.
This particular arrangement can be understood from the opposite direction: if the video frame spatio-temporal features were obtained without splitting, every pixel point would have to be computed against all remaining pixel points, which is computationally complex with high time complexity; splitting the picture features before computing attention therefore reduces the computational difficulty and the time complexity.
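To make the window-splitting idea concrete, the following is a simplified PyTorch sketch of the split-and-shift spatial attention described above; it illustrates the general Swin-style mechanism under assumed dimensions and is not the patent's exact encoder.

```python
# Illustrative window-based spatial attention: split the picture feature into
# four equal windows for local attention, then shift the partition and attend
# again so that information crosses window boundaries (approximate global context).
import torch
import torch.nn as nn

class WindowSpatialAttention(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _window_attn(self, x):                         # x: (B, H, W, C), H and W even
        B, H, W, C = x.shape
        h, w = H // 2, W // 2
        windows = x.reshape(B, 2, h, 2, w, C).permute(0, 1, 3, 2, 4, 5)
        windows = windows.reshape(B * 4, h * w, C)     # four windows per image
        out, _ = self.attn(windows, windows, windows)  # attention inside each window
        out = out.reshape(B, 2, 2, h, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

    def forward(self, x):
        local = self._window_attn(x)                   # first spatial feature (local)
        sh, sw = x.shape[1] // 4, x.shape[2] // 4
        shifted = torch.roll(local, shifts=(sh, sw), dims=(1, 2))
        mixed = self._window_attn(shifted)             # cross-window attention
        second = torch.roll(mixed, shifts=(-sh, -sw), dims=(1, 2))
        return local, second                           # second: local + global context
```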
Step S105, training a neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and obtaining a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used to predict intermediate frames in a video to be interpolated, in combination with the corresponding audio, on the basis of the input video to be interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
After the configuration of the input data for training the model, namely the sample videos, the video frame spatio-temporal features and the high-order audio features, is completed, the training of the specific video frame interpolation model can be carried out.
It can be understood that the video frame spatio-temporal features and the high-order audio features provide accurate data guidance during training and inference, helping the model conveniently and accurately predict the intermediate frames between adjacent odd-numbered video frames.
Specifically, because a speaker's facial expression and mouth movements are correlated with the audio, fusing the audio features with the picture features can effectively enhance the accuracy of intermediate-frame prediction.
It should be understood that the above processing treats corresponding audio/video data, i.e., audio and video data pointing to the same object at the same time point, as one processing unit.
Specifically, in each training step, adjacent odd-numbered video frames in a sample video are taken as the processing objects, and the goal is to predict the intermediate frame that can be inserted between them; on the basis of the video features of the adjacent odd-numbered frames, the video frame spatio-temporal features and the high-order audio features are taken as references and fused to complete the prediction of the intermediate frame.
Different video frames can be labeled sequentially with numerical indices: for example, if the first video frame is labeled "0", the third, fifth, ..., (N+1)-th video frames are labeled "2", "4", ..., "N"; if the first video frame is labeled "1", the following third, fifth, ..., (N+1)-th video frames are labeled "3", "5", ..., "N+1".
The type of neural network model specifically adopted by the video frame interpolation model can be adjusted to actual needs, for example a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Deep Belief Network (DBN) or a Generative Adversarial Network (GAN); similarly, the loss function used in training can be configured according to actual needs.
In the training process, as another practical implementation manner, in the process of predicting the intermediate frame between adjacent odd video frames by using the video frame interpolation model, the following may be specifically included:
performing feature mapping with a fully connected layer on each layer of video frame spatio-temporal features corresponding to the adjacent odd-numbered video frames to obtain a feature mapping result;
adding the feature mapping result to the high-order audio feature of the corresponding layer, performing feature mapping with another fully connected layer, and normalizing with a softmax layer (range 0 to 1) to obtain a new feature mapping result (which can be understood as a weight coefficient); multiplying this weight coefficient by the feature mapping result and adding the feature mapping result back to obtain the fusion feature, where the fusion feature of each layer serves as the intermediate frame predicted by that layer. The third-layer fusion feature is added together with the second layer's feature mapping result and high-order audio feature (that is, it is iterated into the second layer's addition of feature mapping result and high-order audio feature), the second-layer fusion feature is added together with the first layer's feature mapping result and high-order audio feature (that is, it is iterated into the first layer's addition), and the first-layer fusion feature serves as the intermediate frame that is finally output.
It can be seen from this multi-layer feature fusion framework that the application iteratively propagates the fusion features from the third layer toward the first layer, so that the fusion feature obtained at the first layer best fuses the three layers of video frame spatio-temporal features, yielding an intermediate frame with better feature detail, i.e., the intermediate frame finally output as the prediction result.
Specifically, the fusion process is as follows: the picture spatio-temporal features are mapped through a fully connected layer; to fuse the picture features and the audio features, the two mapped features are added, the result is mapped through another fully connected layer and normalized with softmax to obtain a weight coefficient; multiplying the weight coefficient by the original picture features gives greater weight to pixels that contribute more to obtaining the intermediate frame and smaller weight to pixels that contribute less. To prevent useful information from being filtered out by this operation, the original picture features are added back.
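A minimal sketch of this fusion step is given below, assuming equal feature dimensions for the video and audio branches (the actual layer sizes are not specified here).

```python
# Per-layer audio-visual fusion: map the video spatio-temporal feature, add the
# high-order audio feature, map again and softmax-normalize to obtain weights,
# re-weight the mapped video feature, and add it back as a residual.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fc_vis = nn.Linear(dim, dim)    # maps the video spatio-temporal feature
        self.fc_mix = nn.Linear(dim, dim)    # maps the (video + audio) sum
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, vis_feat, audio_feat):
        mapped = self.fc_vis(vis_feat)                              # feature mapping result
        weights = self.softmax(self.fc_mix(mapped + audio_feat))    # weights in 0..1
        return weights * mapped + mapped                            # fusion feature of this layer

fused = AudioVisualFusion()(torch.randn(2, 256), torch.randn(2, 256))
```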
In addition, regarding the loss function involved in training the model, as another practical implementation, it is easy to understand that the prediction result of the intermediate frame exists in the form of a picture; correspondingly, the loss function may include a picture loss function, specifically:
$$\mathcal{L}_{pic} = \left\lVert I_t - \hat{I}_t \right\rVert_1$$

where $I_t$ is the real intermediate frame picture and $\hat{I}_t$ is the predicted intermediate frame picture.
The picture loss function is easy to understand: it constrains the accuracy of the intermediate frame finally generated by the model, so that training continuously reduces the difference between the predicted and real intermediate frames and optimizes the prediction of the intermediate frame.
In addition, other types of existing loss functions can be adopted in the model training process, and the loss functions can be adjusted according to actual needs.
When two or more loss functions are configured, the final loss can be computed as a weighted combination with different weights, and used to back-propagate and optimize the parameters of the model, so as to improve the prediction of the final intermediate frame.
In addition, basic parameters of the model may be set in advance during training, such as the number of training iterations (a condition for completing training) and the learning rate; as an example, the number of training iterations is 100, and the learning rate is set to 0.0001 and gradually reduced to 0.000001.
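For illustration, a hedged sketch of these basic settings follows; the optimizer and decay schedule are assumptions, and only the 100 iterations and the 0.0001 to 0.000001 learning-rate range come from the text.

```python
# Sketch of the basic training configuration: 100 iterations with a learning
# rate decayed smoothly from 1e-4 toward 1e-6.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in for the video frame interpolation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=(1e-6 / 1e-4) ** (1 / 100))

for it in range(100):
    pred = model(torch.randn(4, 8))                        # placeholder forward pass
    loss = (pred - torch.randn(4, 8)).abs().mean()         # placeholder picture (L1) loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```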
The image quality of the intermediate frames predicted by the model can also be evaluated. For example, the Peak Signal-to-Noise Ratio (PSNR) can be adopted: PSNR represents the ratio between the maximum possible signal power and the power of the corrupting noise that affects its representation accuracy, usually expressed in logarithmic decibels; the larger the PSNR value, the better the quality of the predicted image. PSNR can be computed with the following formula:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{\mathrm{MSE}}\right)$$

where $MAX_I$ is the maximum possible pixel value of the image and MSE is the mean square error between the real image and the predicted image.
For another example, structural Similarity (SSIM) may be used to evaluate Structural Similarity between an image output by a training model and an original image, where SSIM may quantify the Similarity between two images, specifically, structural information may be defined as an attribute that is independent of brightness and contrast and reflects an object structure in a scene from an image composition angle, and distortion is modeled as a combination of three different factors, i.e., brightness, contrast and structure, where SSIM has a range from 0 to 1, and a larger value indicates better quality of a predicted image, and when two images are identical, a value of SSIM is equal to 1, and SSIM may specifically be calculated using the following formula:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $x$ and $y$ are the pixel values of the two images, $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are constants used to maintain stability, $L$ is the dynamic range of the pixel values, $k_1 = 0.01$, and $k_2 = 0.03$.
For another example, the Learned Perceptual Image Patch Similarity (LPIPS) can be used to evaluate the perceptual similarity between the image output by the model and the original image. LPIPS measures the difference between two images by learning the inverse mapping from the generated image to the ground truth, i.e., reconstructing the real image from the generated one, and prioritizes their perceptual similarity. A lower LPIPS value means the two images are more similar, i.e., the predicted picture quality is better; a higher value means a larger difference, i.e., worse predicted picture quality. LPIPS can be computed with the following formula:
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left(\hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw}\right) \right\rVert_2^2$$

where $d$ is the distance between $x_0$ and $x$, $x_0$ is a real image block, $x$ is a predicted image block, $\hat{y}^{l}$ and $\hat{y}^{l}_{0}$ are the features of the $l$-th layer of $x$ and $x_0$ respectively, normalized along the channel dimension, $w_l$ is a vector used to scale the activation channels, $h$ and $w$ index the height and width of the image block, and $H_l$ and $W_l$ are the image height and width of the $l$-th layer.
It is understood that, besides evaluating the picture quality of the intermediate frames predicted by the model, the above evaluation metrics can also be used as specific loss function types in the model training process.
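The three metrics can be computed with off-the-shelf implementations; the sketch below uses scikit-image and the lpips package as assumed tooling, since the patent only gives the formulas.

```python
# Compute PSNR, SSIM and LPIPS between a real and a predicted intermediate frame.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(real: np.ndarray, pred: np.ndarray) -> dict:
    # real, pred: H x W x 3 uint8 images (true and predicted intermediate frame).
    psnr = peak_signal_noise_ratio(real, pred, data_range=255)
    ssim = structural_similarity(real, pred, channel_axis=2, data_range=255)
    # LPIPS expects tensors in [-1, 1] with shape (N, 3, H, W).
    to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    dist = lpips.LPIPS(net='alex')(to_tensor(real), to_tensor(pred)).item()
    return {'PSNR': psnr, 'SSIM': ssim, 'LPIPS': dist}
```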
After preset training requirements such as the number of training iterations, the training duration or the prediction accuracy are met, training of the model is completed, and the video frame interpolation model can then be put into practical use.
Correspondingly, the method of the application can further comprise:
acquiring a video to be interpolated;
inputting the video to be interpolated into the video frame interpolation model, so that the model predicts the intermediate frames in the video to be interpolated in combination with the corresponding audio on the basis of the input video;
and acquiring the target video obtained after the intermediate frames are inserted into the video to be interpolated.
It should be understood that, in practical applications, the audio corresponding to the video to be interpolated may not be input separately; the video frame interpolation model can directly extract the corresponding audio from the video to be interpolated.
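At inference time, the trained model can be applied pairwise over consecutive frames; the following sketch is purely illustrative and its `predict` interface is hypothetical.

```python
# Hedged usage sketch: feed consecutive frame pairs and the corresponding audio
# segment to the model, and interleave the predicted intermediate frames to
# produce the target video.
def interpolate_video(frames, audio_segments, model):
    # frames: list of decoded video frames; audio_segments: aligned audio pieces.
    out = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        mid = model.predict(frames[i], frames[i + 1], audio_segments[i])  # hypothetical API
        out.append(mid)                     # insert the predicted intermediate frame
    out.append(frames[-1])
    return out                              # roughly doubles the frame rate
```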
From the above scheme it can be seen that, to meet the requirements of video frame interpolation, the training of the video frame interpolation model attends not only to the picture features (video features) considered in the prior art but also to audio features, extracting high-order audio features from the basic audio features; for the video features, three layers of Swin Transformer-based neural networks are configured to extract three layers of video frame spatio-temporal features, thereby obtaining richer video features; on this basis, the high-order audio features and the three layers of video frame spatio-temporal features are fused to predict the final intermediate frame.
The foregoing is an introduction of a processing method for a Swin Transformer-based video frame interpolation model provided in the present application, and in order to better implement the processing method for the Swin Transformer-based video frame interpolation model provided in the present application, the present application further provides a processing apparatus for a Swin Transformer-based video frame interpolation model from a functional module perspective.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a processing apparatus for Swin Transformer-based video frame interpolation model according to the present application, in which the processing apparatus 300 for Swin Transformer-based video frame interpolation model specifically includes the following structure:
an obtaining unit 301, configured to obtain a sample set, where the sample set includes different sample videos and also includes different sample audios, and the different sample videos correspond to the different sample audios one to one;
an extracting unit 302, configured to extract audio features of different sample audios, where the audio features include a spectral envelope MFCC, a frequency domain feature FBANK, a fundamental frequency pitch, and an unvoiced feature;
the encoding unit 303 is configured to encode the audio features to obtain high-order audio features;
the extracting unit 302 is further configured to extract three layers of video frame spatio-temporal features of the sample video through three layers of Swin Transformer-based neural networks, where each layer of neural network outputs one layer of video frame spatio-temporal features;
the training unit 304 is configured to train a neural network model to predict an intermediate frame between adjacent odd-numbered video frames based on adjacent odd-numbered video frames in different sample videos in combination with spatial-temporal features and high-order audio features of the video frames of the corresponding three layers, and obtain a video frame interpolation model after completing model training, where the video frame interpolation model is configured to predict an intermediate frame in a frame to be interpolated video in combination with corresponding audio on the basis of an input frame to be interpolated video, so as to achieve a video frame interpolation effect with a preset number of frames.
In an exemplary implementation manner, the encoding unit 303 is specifically configured to:
carrying out convolution processing on different sample audios by using the 1D convolution layer to obtain a convolution result;
and coding the convolution result through a 5-layer Transformer coder, and performing feature mapping by using a full-connection layer to obtain high-order audio features.
In another exemplary implementation manner, the extracting unit 302 is specifically configured to:
performing convolution processing on adjacent odd-numbered video frames in the different sample videos in the Swin Transformer-based encoder to obtain picture features;
dividing the picture feature into four parts of the same size along the middle and computing attention for each part separately to obtain a first spatial feature containing local spatial information; then further dividing the four parts and exchanging their positions to obtain four new parts of the same size, and recomputing local attention to obtain a second spatial feature containing both local and global spatial information;
splicing adjacent odd-numbered video frames in the different sample videos along the time dimension, splitting the splicing result according to the number of pixel points, and computing temporal attention features between the pixel points corresponding to the first and second spatial features;
and decoding, with a three-layer Swin Transformer-based decoder, the four layers of spatio-temporal features obtained by four rounds of the same processing in the Swin Transformer-based encoder, to obtain three layers of video frame spatio-temporal features, where the spatio-temporal features comprise the first spatial feature, the second spatial feature and the temporal attention feature; the first-layer features are decoded by the first-layer decoder, the second-layer features by the second-layer decoder, and the third-layer and fourth-layer features by the third-layer decoder.
In another exemplary implementation, the video frame interpolation model, in predicting an inter frame between adjacent odd video frames, includes:
performing feature mapping with a fully connected layer on each layer of video frame spatio-temporal features corresponding to the adjacent odd-numbered video frames to obtain a feature mapping result;
adding the feature mapping result to the high-order audio feature of the corresponding layer, performing feature mapping with another fully connected layer, and normalizing with a softmax layer to obtain a new feature mapping result; multiplying the new feature mapping result by the feature mapping result and adding the feature mapping result back to obtain the fusion feature, where the fusion feature of each layer serves as the intermediate frame predicted by that layer, the third-layer fusion feature is added together with the second layer's feature mapping result and high-order audio feature, the second-layer fusion feature is added together with the first layer's feature mapping result and high-order audio feature, and the first-layer fusion feature serves as the intermediate frame that is finally output.
In another exemplary implementation manner, the loss function adopted by the video frame interpolation model in the training process includes a picture loss function, which is specifically:
$$\mathcal{L}_{pic} = \left\lVert I_t - \hat{I}_t \right\rVert_1$$

where $I_t$ is the real intermediate frame picture and $\hat{I}_t$ is the predicted intermediate frame picture.
In another exemplary implementation, during training of the video frame interpolation model, one of, or any combination of, the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity (SSIM) and the Learned Perceptual Image Patch Similarity (LPIPS) is used to quantify the image quality of the predicted intermediate frames;
the peak signal-to-noise ratio PSNR adopts the formula:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{\mathrm{MSE}}\right)$$

where $MAX_I$ is the maximum possible pixel value of the image and MSE is the mean square error between the real image and the predicted image;
the formula adopted by the structural similarity SSIM is:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $x$ and $y$ are the pixel values of the two images, $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are constants used to maintain stability, $L$ is the dynamic range of the pixel values, $k_1 = 0.01$, and $k_2 = 0.03$;
The formula adopted for the Learned Perceptual Image Patch Similarity (LPIPS) is:
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left(\hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw}\right) \right\rVert_2^2$$

where $d$ is the distance between $x_0$ and $x$, $x_0$ is a real image block, $x$ is a predicted image block, $\hat{y}^{l}$ and $\hat{y}^{l}_{0}$ are the features of the $l$-th layer of $x$ and $x_0$ respectively, normalized along the channel dimension, $w_l$ is a vector used to scale the activation channels, $h$ and $w$ index the height and width of the image block, and $H_l$ and $W_l$ are the image height and width of the $l$-th layer.
In yet another exemplary implementation, the apparatus further includes an application unit 305 configured to:
acquiring a video to be interpolated;
inputting the video to be interpolated into the video frame interpolation model, so that the model predicts the intermediate frames in the video to be interpolated in combination with the corresponding audio on the basis of the input video;
and acquiring the target video obtained after the intermediate frames are inserted into the video to be interpolated.
The present application further provides a processing device from the perspective of its hardware structure. Referring to Fig. 4, which shows a schematic structural diagram of the processing device of the present application, the processing device may include a processor 401, a memory 402 and an input/output device 403. The processor 401 is configured, when executing the computer program stored in the memory 402, to implement the steps of the processing method of the Swin Transformer-based video frame interpolation model in the embodiment corresponding to Fig. 1; alternatively, the processor 401 is configured, when executing the computer program stored in the memory 402, to implement the functions of the units in the embodiment corresponding to Fig. 3. The memory 402 is configured to store the computer program required by the processor 401 to execute the processing method of the Swin Transformer-based video frame interpolation model in the embodiment corresponding to Fig. 1.
Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in memory 402 and executed by processor 401 to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of the computer program in the computer apparatus.
The processing devices may include, but are not limited to, a processor 401, a memory 402, and input-output devices 403. Those skilled in the art will appreciate that the illustration is merely an example of a processing device and does not constitute a limitation of the processing device and may include more or less components than those illustrated, or combine certain components, or different components, e.g., the processing device may also include a network access device, bus, etc., through which the processor 401, memory 402, input output device 403, etc., are connected.
The Processor 401 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being the control center for the processing device and the various interfaces and lines connecting the various parts of the overall device.
The memory 402 may be used to store computer programs and/or modules, and the processor 401 may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the processing device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 401, when executing the computer program stored in the memory 402, may specifically implement the following functions:
acquiring a sample set, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos correspond to the different sample audios one to one;
extracting audio features of different sample audios, wherein the audio features comprise a frequency spectrum envelope MFCC, a frequency domain feature FBANK, a fundamental frequency pitch and an unvoiced feature;
coding the audio features to obtain high-order audio features;
extracting three layers of video frame spatio-temporal features from the sample video through a three-layer Swin Transformer-based neural network, wherein each layer of the network outputs one layer of video frame spatio-temporal features;
training a neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and obtaining a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used to predict intermediate frames in a video to be interpolated, in combination with the corresponding audio, on the basis of the input video to be interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, reference may be made, for the specific working processes of the above-described Swin Transformer-based video frame interpolation model processing apparatus and processing device and their corresponding units, to the description of the Swin Transformer-based video frame interpolation model processing method in the embodiment corresponding to fig. 1, which is not repeated here.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be completed by instructions, or by related hardware controlled by instructions, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps of the Swin Transformer-based video frame interpolation model processing method in the embodiment corresponding to fig. 1 of the present application; for specific operations, reference may be made to the description of that method in the embodiment corresponding to fig. 1, which is not repeated here.
The computer-readable storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps of the Swin Transformer-based video frame interpolation model processing method in the embodiment corresponding to fig. 1 of the present application, they can achieve the beneficial effects achievable by that method; for details, see the foregoing description, which is not repeated here.
The Swin Transformer-based video frame interpolation model processing method, apparatus, processing device, and computer-readable storage medium provided in the present application are described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for processing a Swin Transformer-based video frame interpolation model, characterized by comprising the following steps:
acquiring a sample set, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos and the different sample audios are in one-to-one correspondence;
extracting audio features of the different sample audios, wherein the audio features comprise the spectral envelope feature MFCC, the frequency-domain feature FBANK, the fundamental frequency (pitch), and a voiced/unvoiced feature;
coding the audio features to obtain high-order audio features;
extracting three layers of video frame spatio-temporal features of the sample video through three layers of Swin Transformer-based neural networks, wherein each layer of the neural networks outputs one layer of video frame spatio-temporal features;
and training a neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and obtaining a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used for predicting, on the basis of an input video to be frame-interpolated and in combination with the corresponding audio, the intermediate frames in the video to be frame-interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
2. The method of claim 1, wherein the coding the audio features to obtain the high-order audio features comprises:
performing convolution processing on the different sample audios by using a 1D convolution layer to obtain a convolution result;
and encoding the convolution result through a 5-layer Transformer encoder, and performing feature mapping by using a full connection layer to obtain the high-order audio features.
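The following is a minimal PyTorch sketch of the encoding path in this claim; the input feature dimension (here 43, as if MFCC, FBANK, pitch and voicing were concatenated per frame) and the model width of 128 are assumptions of the sketch, not values recited in the claim.

```python
# Hypothetical audio-feature encoder: 1D convolution -> 5-layer Transformer encoder -> full connection mapping.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, in_dim: int = 43, width: int = 128, out_dim: int = 128):
        super().__init__()
        self.conv1d = nn.Conv1d(in_dim, width, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=5)   # 5-layer Transformer encoder, as in the claim
        self.fc = nn.Linear(width, out_dim)                         # full connection feature mapping

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, in_dim) frame-level features
        x = self.conv1d(audio.transpose(1, 2)).transpose(1, 2)      # convolution along the time axis
        x = self.encoder(x)                                         # 5 Transformer encoder layers
        return self.fc(x)                                           # high-order audio features (batch, time, out_dim)

if __name__ == "__main__":
    feats = torch.randn(2, 100, 43)
    print(AudioEncoder()(feats).shape)   # torch.Size([2, 100, 128])
```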
3. The method of claim 1, wherein the extracting the three layers of video frame spatio-temporal features of the sample video through the three layers of Swin Transformer-based neural networks comprises:
performing, in a Swin Transformer-based encoder, convolution processing on adjacent odd-numbered video frames in the different sample videos to obtain picture features;
dividing the picture features into four parts of the same size along the middle, computing attention for each of the four parts to obtain a first spatial feature containing local spatial information, further dividing the four parts and exchanging their positions to obtain four new parts of the same size, and recomputing local attention to obtain a second spatial feature containing both the local spatial information and global spatial information;
splicing the adjacent odd-numbered video frames in the different sample videos along the time dimension, splitting the splicing result according to the number of pixel points, and computing a temporal attention feature between the pixel points corresponding to the first spatial feature and the second spatial feature;
and decoding, by three layers of Swin Transformer-based decoders, the four layers of spatio-temporal features obtained from four passes of the same processing in the Swin Transformer-based encoder, to obtain the three layers of video frame spatio-temporal features, wherein the spatio-temporal features comprise the first spatial feature, the second spatial feature, and the temporal attention feature, the spatio-temporal feature of the first layer is decoded by the decoder of the first layer, the spatio-temporal feature of the second layer is decoded by the decoder of the second layer, and the spatio-temporal features of the third layer and the fourth layer are decoded by the decoder of the third layer.
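A simplified sketch of the quadrant-wise spatial attention described above; the exact partition-and-swap scheme, the window sizes and the channel width are assumptions (a roll of the quadrant order stands in for the position exchange), and the temporal attention and the encoder/decoder stack are omitted.

```python
# Hypothetical quadrant attention: self-attention inside four equal windows, then again after the windows swap.
import torch
import torch.nn as nn

class QuadrantAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _per_window(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (B, 4, N, C); attention is computed independently inside each window
        B, W, N, C = windows.shape
        x = windows.reshape(B * W, N, C)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, W, N, C)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W) picture feature, with C == dim and H, W assumed even
        B, C, H, W = feat.shape
        h, w = H // 2, W // 2
        quads = [feat[:, :, i * h:(i + 1) * h, j * w:(j + 1) * w] for i in (0, 1) for j in (0, 1)]
        tokens = torch.stack([q.flatten(2).transpose(1, 2) for q in quads], dim=1)  # (B, 4, h*w, C)
        first = self._per_window(tokens)            # first spatial feature: local information per quadrant
        swapped = tokens.roll(shifts=1, dims=1)     # exchange quadrant positions (stand-in for the re-partition)
        second = self._per_window(swapped)          # second spatial feature: local plus coarse global information
        return first, second

# Usage: QuadrantAttention(dim=64)(torch.randn(1, 64, 32, 32)) returns two tensors of shape (1, 4, 256, 64).
```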
4. The method of claim 3, wherein the video frame interpolation model predicts the intermediate frames between the adjacent odd-numbered video frames by:
performing feature mapping, by using a full connection layer, on each layer of video frame spatio-temporal features corresponding to the adjacent odd-numbered video frames to obtain a feature mapping result;
and adding the feature mapping result to the high-order audio features of the corresponding layer, performing feature mapping by using an additional full connection layer, then performing normalization by using a softmax layer to obtain a new feature mapping result, multiplying the new feature mapping result by the feature mapping result and adding the feature mapping result to obtain a fusion feature, and taking the fusion feature of each layer as the intermediate frame predicted by the corresponding layer, wherein the fusion feature of the third layer is added in together with the feature mapping result and the high-order audio features of the second layer, the fusion feature of the second layer is added in together with the feature mapping result and the high-order audio features of the first layer, and the fusion feature of the first layer is taken as the finally output intermediate frame.
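Read literally, this claim describes a gated residual fusion per layer; the sketch below follows that wording, with the layer width, the handling of mismatched resolutions between layers, and the exact point at which the coarser layer's fusion feature is added being assumptions of the sketch.

```python
# Hypothetical per-layer audio-video fusion: map the video feature, add the audio feature (and the coarser
# layer's fusion feature, if any), map again, softmax-normalise, gate the mapped feature and add it back.
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.map_video = nn.Linear(dim, dim)   # feature mapping of the video spatio-temporal feature
        self.map_mixed = nn.Linear(dim, dim)   # additional full connection layer after adding the audio feature

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor, carry: torch.Tensor = None):
        # video_feat, audio_feat: (B, N, dim); carry: fusion feature handed down from the coarser layer, or None
        mapped = self.map_video(video_feat)                   # feature mapping result of this layer
        mixed = mapped + audio_feat
        if carry is not None:
            mixed = mixed + carry                             # coarser-layer fusion feature added in (assumption)
        gate = torch.softmax(self.map_mixed(mixed), dim=-1)   # new feature mapping result, softmax-normalised
        return gate * mapped + mapped                         # fusion feature of this layer

# Chained coarse-to-fine, the finest fusion feature being the output frame representation:
#   f3 = fuse3(v3, a3); f2 = fuse2(v2, a2, carry=f3); f1 = fuse1(v1, a1, carry=f2)
```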
5. The method according to claim 1, wherein the loss function adopted by the video frame interpolation model in the training process comprises a picture loss function, specifically:
\mathcal{L}_{pic} = \lVert I_t - \hat{I}_t \rVert_1
wherein I_t is the real intermediate frame picture and \hat{I}_t is the predicted intermediate frame picture.
6. The method according to claim 1, wherein, in the training process of the video frame interpolation model, one or any combination of the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) and the learned perceptual image patch similarity (LPIPS) is used to quantify the predicted image quality of the intermediate frame;
the peak signal-to-noise ratio PSNR adopts the formula as follows:
Figure FDA0003966713440000023
wherein MSE is the mean square error between the real image and the predicted image;
the structural similarity SSIM adopts the following formula:
\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
wherein x and y are the pixel values of the two images respectively, \mu_x is the mean of x, \mu_y is the mean of y, \sigma_x^2 is the variance of x, \sigma_y^2 is the variance of y, \sigma_{xy} is the covariance of x and y, c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2 are constants used to maintain stability, L is the dynamic range of the pixel values, k_1 = 0.01, and k_2 = 0.03;
the learned perceptual image patch similarity LPIPS adopts the following formula:
d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\lVert w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\rVert_2^2
wherein d is the distance between x_0 and x, x_0 is the real image block, x is the predicted image block, \hat{y}^{l}_{0hw} and \hat{y}^{l}_{hw} are the channel-wise normalized features of the l-th layer of the real image block and the predicted image block respectively, w_l is a vector used to scale the activation channels, h and w index the spatial positions of the image block, and H_l and W_l are the feature height and width of the l-th layer respectively.
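As a hedged illustration of how two of the quality measures above can be computed for 8-bit frames (L = 255 assumed); this simplified SSIM treats the whole image as a single window, whereas practical implementations average over local Gaussian windows, and LPIPS is omitted because it requires a pretrained feature network.

```python
# Illustrative PSNR and single-window SSIM for 8-bit images; L = 255 and k1/k2 defaults as in the claim.
import numpy as np

def psnr(real: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((real.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0, k1: float = 0.01, k2: float = 0.03) -> float:
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

if __name__ == "__main__":
    real = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
    pred = np.clip(real.astype(np.int16) + np.random.randint(-5, 6, (64, 64)), 0, 255).astype(np.uint8)
    print(round(psnr(real, pred), 2), round(ssim_global(real, pred), 4))
```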
7. The method of claim 1, further comprising:
acquiring a video to be frame-interpolated;
inputting the video to be frame-interpolated into the video frame interpolation model, so that the video frame interpolation model predicts, in combination with the corresponding audio and on the basis of the input video, the intermediate frames in the video to be frame-interpolated;
and acquiring the target video obtained after the intermediate frames are inserted into the video to be frame-interpolated.
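As a usage illustration of this claim (not part of it), a frame-doubling inference loop might interleave original and predicted frames as follows; the model call signature and the audio lookup are assumptions of the sketch.

```python
# Hypothetical inference-time interleaving of predicted intermediate frames (roughly doubling the frame rate).
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")

def interpolate_video(frames: Sequence[Frame],
                      audio_features_for: Callable[[int], object],
                      model: Callable[[Frame, Frame, object], Frame]) -> List[Frame]:
    out: List[Frame] = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        # predict the frame lying between frames[i] and frames[i + 1], guided by the aligned audio
        out.append(model(frames[i], frames[i + 1], audio_features_for(i)))
    out.append(frames[-1])
    return out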
8. An apparatus for processing a Swin Transformer-based video frame interpolation model, the apparatus comprising:
an acquiring unit, configured to acquire a sample set, wherein the sample set comprises different sample videos and different sample audios, and the different sample videos and the different sample audios are in one-to-one correspondence;
an extracting unit, configured to extract audio features of the different sample audios, wherein the audio features comprise the spectral envelope feature MFCC, the frequency-domain feature FBANK, the fundamental frequency (pitch), and a voiced/unvoiced feature;
an encoding unit, configured to encode the audio features to obtain high-order audio features;
the extracting unit being further configured to extract three layers of video frame spatio-temporal features of the sample video through three layers of Swin Transformer-based neural networks, wherein each layer of the neural networks outputs one layer of video frame spatio-temporal features;
and a training unit, configured to train a neural network model, based on adjacent odd-numbered video frames in the different sample videos and in combination with the corresponding three layers of video frame spatio-temporal features and the high-order audio features, to predict intermediate frames between the adjacent odd-numbered video frames, and to obtain a video frame interpolation model after model training is completed, wherein the video frame interpolation model is used for predicting, on the basis of an input video to be frame-interpolated and in combination with the corresponding audio, the intermediate frames in the video to be frame-interpolated, so as to achieve a video frame interpolation effect with a preset number of frames.
9. A processing device comprising a processor and a memory, wherein a computer program is stored in the memory, and the processor, when invoking the computer program in the memory, performs the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of any one of claims 1 to 7.
CN202211502343.9A 2022-11-28 2022-11-28 Processing method, device and processing equipment of video frame insertion model based on Swin converter Active CN115883869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211502343.9A CN115883869B (en) 2022-11-28 2022-11-28 Processing method, device and processing equipment of video frame insertion model based on Swin converter

Publications (2)

Publication Number Publication Date
CN115883869A true CN115883869A (en) 2023-03-31
CN115883869B CN115883869B (en) 2024-04-19

Family

ID=85764289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211502343.9A Active CN115883869B (en) 2022-11-28 2022-11-28 Processing method, device and processing equipment of video frame insertion model based on Swin converter

Country Status (1)

Country Link
CN (1) CN115883869B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022033048A1 (en) * 2020-08-13 2022-02-17 北京迈格威科技有限公司 Video frame interpolation method, model training method, and corresponding device
WO2022141819A1 (en) * 2020-12-29 2022-07-07 广州华多网络科技有限公司 Video frame insertion method and apparatus, and computer device and storage medium
CN113542651A (en) * 2021-05-28 2021-10-22 北京迈格威科技有限公司 Model training method, video frame interpolation method and corresponding device
CN114339409A (en) * 2021-12-09 2022-04-12 腾讯科技(上海)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114827663A (en) * 2022-04-12 2022-07-29 咪咕文化科技有限公司 Distributed live broadcast frame insertion system and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
WEI ZHENG: "Multi-mode based intra prediction coding", 2015 8th International Congress on Image and Signal Processing (CISP), 18 February 2016 (2016-02-18) *
张培健: "Lightweight video frame interpolation algorithm based on cascaded convolutional neural networks", 微电子学与计算机 (Microelectronics & Computer), 5 March 2021 (2021-03-05)
林琦; 陈婧; 曾焕强; 朱建清; 蔡灿辉: "Video super-resolution method based on multi-scale feature residual learning convolutional neural network", 信号处理 (Journal of Signal Processing), no. 01, 25 January 2020 (2020-01-25)
董成威: "Research on video frame interpolation algorithms based on deep learning", 中国优秀硕士学位论文全文数据库 信息科技辑 (China Master's Theses Full-text Database, Information Science and Technology), 15 October 2022 (2022-10-15)
黄昆仑; 白蔚: "A survey of video frame rate up-conversion techniques", 数字通信世界 (Digital Communication World), no. 05, 1 May 2011 (2011-05-01)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376583A (en) * 2023-09-18 2024-01-09 南通大学 Traceable frame rate conversion model construction method for high-frame rate video
CN117376583B (en) * 2023-09-18 2024-06-21 南通大学 Traceable frame rate conversion model construction method for high-frame rate video

Also Published As

Publication number Publication date
CN115883869B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN111508508A (en) Super-resolution audio generation method and equipment
WO2021082823A1 (en) Audio processing method, apparatus, computer device, and storage medium
CN112954312A (en) No-reference video quality evaluation method fusing spatio-temporal characteristics
CN110751649A (en) Video quality evaluation method and device, electronic equipment and storage medium
WO2023226839A1 (en) Audio enhancement method and apparatus, and electronic device and readable storage medium
CN112767960B (en) Audio noise reduction method, system, device and medium
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
CN113470684A (en) Audio noise reduction method, device, equipment and storage medium
CN113299312A (en) Image generation method, device, equipment and storage medium
CN113421547A (en) Voice processing method and related equipment
CN115883869B (en) Processing method, device and processing equipment of video frame insertion model based on Swin converter
CN114882862A (en) Voice processing method and related equipment
CN114783459B (en) Voice separation method and device, electronic equipment and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN114333796A (en) Audio and video voice enhancement method, device, equipment, medium and smart television
CN115938385A (en) Voice separation method and device and storage medium
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN110958417B (en) Method for removing compression noise of video call video based on voice clue
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN116189678A (en) Voice processing method and device and computer equipment
CN115866332B (en) Processing method, device and processing equipment for video frame insertion model
CN113837047A (en) Video quality evaluation method, system, computer equipment and storage medium
Kalkhorani et al. Time-domain Transformer-based Audiovisual Speaker Separation
WO2024018429A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
KR102663654B1 (en) Adaptive visual speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant