WO2023103378A1

WO2023103378A1 - Video frame interpolation model training method and apparatus, and computer device and storage medium

Info

Publication number: WO2023103378A1
Application number: PCT/CN2022/105652
Authority: WO
Inventors: 周昆; 李文博; 蒋念娟; 沈小勇; 吕江波
Original assignee: 深圳思谋信息科技有限公司; 上海思谋科技有限公司
Priority date: 2021-12-06
Filing date: 2022-07-14
Publication date: 2023-06-15
Also published as: CN113891027A; CN113891027B

Abstract

The present application relates to a video frame interpolation model training method and apparatus, and a computer device, a storage medium and a computer program product. The method comprises: acquiring training image frame groups; inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting an estimated intermediate image frame corresponding to each training image frame group; and adjusting parameters in the video frame interpolation model on the basis of a first difference, a second difference and a third difference in each training image frame group until a training stop condition is met, and then ending the training. By using the method in the present application, a high-quality video frame can be effectively generated, thereby improving the frame rate of a video, and improving the fluency of a picture.

Description

Video frame interpolation model training method, device, computer equipment and storage medium

This application claims the priority of the Chinese patent application submitted to the State Intellectual Property Office of China on December 06, 2021, with the application number 202111477500.0, and the title of the invention is "Video frame interpolation model training method, device, computer equipment and storage medium". The entire contents are incorporated by reference in this application.

technical field

The present application relates to the technical field of image processing, in particular to a video frame interpolation model training method, device, computer equipment and storage medium.

Background technique

With the development of image processing technology, people's demand for high-quality video images with high refresh rates is also growing rapidly. Therefore, a video frame interpolation model training technology has emerged. The main purpose of video interpolation is to improve the frame rate by increasing the frame rate. The smoothness of the screen. Today, video frame insertion technology has been applied in various fields. For example, with the development of mobile phone hardware, the refresh rate has also been greatly improved, and the previous video content also needs to increase the frame rate to match the highest supported by the hardware. refresh rate. In animation production, a video frame interpolation method is also required, which can obtain a smoother video clip based on a small number of key image frames.

In the related art, it is difficult to accurately capture the timing correspondence for objects with large displacements, so blurred frame interpolation results are likely to be generated. Furthermore, related techniques rely on supervised learning for model training, and supervised images are only one possible solution. Therefore, there is a one-to-many mapping relationship between the input and output of the video interpolation model, and the use of pixel-level one-to-one supervision will lead to over-constraint problems, so that the output results tend to generate average content. , resulting in over-smoothed images and unclear textures in the resulting intermediate image frames.

Contents of the invention

Based on this, it is necessary to provide a video frame interpolation model training method, device, computer equipment, computer readable storage medium and computer program product for the above technical problems.

In the first aspect, the embodiment of the present application provides a video frame interpolation model training method, including:

Obtain training image frame groups, each training image frame group is composed of three consecutive image frames in the video arranged in order, and the second image frame in each training image frame group is used as each training image frame group The corresponding label intermediate image frame;

The first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the estimated intermediate image frame corresponding to each training image frame group is output;

Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second image frame in each training image frame group and each The second difference between the estimated intermediate image frames corresponding to a training image frame group, and the difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group The third difference is to adjust the parameters in the video frame interpolation model until the training stops when the training stop condition is met; wherein, the degree of correlation between the second difference and the parameter adjustment is greater than that between the first difference or the third difference and the parameter adjustment Correlation.

In some embodiments, the first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the estimated intermediate image frames corresponding to each training image frame group are output, including:

For any training image frame group, the first image frame and the third image frame in any training image frame group are respectively used as the first image frame and the third image frame, and the first image frame and the third image frame are simultaneously The same resolution is used for adjustment; among them, there are a total of n-1 adjustments and the resolution used for each adjustment is different, and n is a positive integer and not less than 2;

Feature extraction is performed on the two image frames after each adjustment, each image frame feature group is composed of the features extracted from the two image frames after each adjustment, and the image frame feature group set is composed of each image frame feature group ;

performing cross-scale alignment processing on the features corresponding to the first image frame in the image frame feature set set to the features corresponding to the third image frame in the image frame feature set set, to obtain the alignment result of the first image frame;

performing cross-scale alignment processing on the features corresponding to the third image frame in the image frame feature set set to the features corresponding to the first image frame in the image frame feature set set, to obtain the alignment result of the third image frame;

performing two-way information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result;

A reconstruction process is performed on the result of bidirectional information fusion to obtain an estimated intermediate image frame corresponding to each training image frame group.

In some embodiments, the resolutions corresponding to the image frame feature groups in the image frame feature group set are sequentially increased; The corresponding features in the frame feature group set are aligned across scales to obtain the alignment result of the first image frame, including:

For the i-th image frame feature group, if i is 1, the corresponding features of the first image frame in the i-th image frame feature group and the corresponding features of the third image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the third image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the first image frame in the i-th image frame feature group, and obtain the i-th alignment processing result , repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the alignment result of the first image frame; wherein, i is not less than 1 and not greater than n a positive integer;

Among them, for the j-th bilinear interpolation calculation result among the first i-1 bilinear interpolation calculation results, the j-th bilinear interpolation calculation result is to perform i-j bilinear interpolation consecutively on the j-th alignment processing result Calculated, j is a positive integer not less than 1 and less than i.

In some embodiments, the resolutions corresponding to the image frame feature groups in the image frame feature group set are sequentially increased; The corresponding features in the frame feature group set are aligned across scales to obtain the alignment result of the third image frame, including:

For the i-th image frame feature group, if i is 1, the corresponding features of the third image frame in the i-th image frame feature group and the corresponding features of the first image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the first image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the third image frame in the i-th image frame feature group, and obtain the i-th alignment processing result , repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the alignment result of the third image frame; wherein, i is not less than 1 and not greater than n a positive integer;

In some embodiments, two-way information fusion is performed on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result, including:

Convolving the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;

Calculate the convolution result to obtain the fusion weight;

According to the fusion weight, the alignment result of the first image frame and the alignment result of the third image frame are fused to obtain a bidirectional information fusion result.

In some embodiments, both the first difference and the third difference are similarities; the determination process of the first difference and the third difference includes:

For any training image frame group, select any t*t pixel from the estimated intermediate image frame corresponding to any training image frame group, according to the center pixel of any t*t pixel corresponding to any training image frame group The position in the estimated intermediate image frame of , determine the first target pixel of t*t in the first image frame in any training image frame group and determine t* in the third image frame in any training image frame group The third target pixel of t; wherein, t is an odd number not equal to 1;

Determine the first character set according to the first target pixel of t*t; determine the third character set according to the third target pixel of t*t; determine the second character set according to any t*t pixel;

According to the first character set, the second character set and the third character set, determine the similarity between any pixel and the first target pixel as the first difference; determine the similarity between any pixel and the third target pixel , as the third difference.

In some embodiments, the process of determining the second difference includes:

For any training image frame group, according to the RGB values of all pixels in the second image frame in any training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to any training image frame group, determine The RGB value difference between the second image frame in any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.

In some embodiments, also include:

Comparing the first difference with the third difference, determine the best matching pixel of the center pixel of any t*t pixel; according to the best matching pixel, calculate the center pixel of any t*t pixel through the texture consistency loss function Texture consistency loss; among them, the texture consistency loss is used to train the video frame interpolation model.

In some embodiments, also include:

Obtain two image frames to be processed in the video to be processed;

Input the two image frames to be processed into the trained video frame interpolation model to obtain the intermediate image frame of the two image frames to be processed.

In the second aspect, the embodiment of the present application also provides a video frame interpolation model training device, including:

The acquisition module is used to obtain training image frame groups, each training image frame group is formed by sequential arrangement of three consecutive image frames in the video, and the second image frame in each training image frame group is used as each a label intermediate image frame corresponding to the training image frame group;

The video frame interpolation module is used to input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the corresponding estimated intermediate image frame of each training image frame group;

An adjustment module, configured to be based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second difference in each training image frame group The second difference between the first image frame and the estimated intermediate image frame corresponding to each training image frame group, and the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group The third difference between the image frames adjusts the parameters in the video frame interpolation model until the training stops when the training stop condition is satisfied; wherein, the correlation between the second difference and the parameter adjustment is greater than that of the first difference or the third difference and Correlation between parameter adjustments.

In a third aspect, the embodiment of the present application further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:

Input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group;

In the fourth aspect, the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

In a fifth aspect, the embodiment of the present application further provides a computer program product. Said computer program product comprises a computer program which, when executed by a processor, implements the following steps:

Compared with the related art that only adjusts the parameters in the video frame interpolation model by comparing the difference between the second image frame and the estimated intermediate image frame, due to the addition of comparing the first image frame and the third image frame with the estimated intermediate The difference between image frames Adjusting the parameters in the video frame interpolation model can make the texture of the intermediate image frame output by the video frame interpolation model clearer and closer to the texture structure of the input image frame, avoiding the generation of blurred and unclear texture content .

Description of drawings

Fig. 1 is the application environment diagram of the video frame interpolation model training method in the embodiment of the present application;

FIG. 2 is a schematic flow diagram of a video frame interpolation model training method in an embodiment of the present application;

FIG. 3 is a schematic diagram of the reconstruction process of the video frame interpolation model training method in the embodiment of the present application;

4 is a schematic diagram of the cross-scale alignment processing of the video frame interpolation model training method in the embodiment of the present application;

5 is a schematic diagram of the matching process of the video frame interpolation model training method in the embodiment of the present application;

6 is a schematic diagram of the training process of the video frame interpolation model training method in the embodiment of the present application;

Fig. 7a is the comparative evaluation result figure of single-frame video interpolation in the embodiment of the present application;

Fig. 7b is a comparative evaluation result diagram of multi-frame video interpolation in the embodiment of the present application;

Fig. 7c is a comparative evaluation result diagram of single-frame video extrapolation in the embodiment of the present application;

Fig. 7d is a comparison diagram of visual effects after integrating the trained video frame interpolation model into a video super-resolution model in the embodiment of the present application;

FIG. 7e is a visual comparison diagram of single-frame video interpolation in the embodiment of the present application;

Figure 7f is a visual comparison diagram of multi-frame video interpolation in the embodiment of the present application;

Figure 7g is a visual comparison diagram of single-frame video extrapolation in the embodiment of the present application;

Figure 7h is a comparison diagram of the impact of single-frame video interpolation on video super-resolution in the embodiment of the present application;

Figure 7i is a single visual comparison diagram of the TCL loss function added in the embodiment of the present application;

Fig. 7j is a plurality of visual comparison diagrams in which the TCL loss function is added in the embodiment of the present application;

FIG. 8 is a structural block diagram of a video frame interpolation model training device in an embodiment of the present application;

FIG. 9 is an internal structural diagram of a computer device in an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the embodiments of the present application, and are not intended to limit the embodiments of the present application.

It can be understood that the terms "first" and "second" used in the embodiments of the present application may be used to describe various technical terms herein, but unless otherwise specified, these technical terms are not limited by these terms. These terms are only used to distinguish one term from another. For example, without departing from the scope of the embodiments of the present application, the third preset threshold and the fourth preset threshold may be the same or different.

The video frame interpolation model training method provided in the embodiment of the present application can be applied to the application environment shown in FIG. 1 . Wherein, the terminal 101 communicates with the server 102 through a network. The data storage system can store data that needs to be processed by the server 102 . The data storage system can be integrated on the server 102, or placed on the cloud or other network servers. The terminal 101 acquires the training image frame group, and the server processes the training image frame group. Of course, in the actual implementation process, the processing function of the server 102 can also be directly integrated into the terminal 101, that is, the terminal 101 acquires training image frames, and processes the training image frames to obtain a trained video frame insertion model. Wherein, the terminal 101 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, IoT devices and portable wearable devices. The server 102 can be implemented by an independent server or a server cluster composed of multiple servers.

In some embodiments, as shown in FIG. 2 , a video frame interpolation model training method is provided. The method is applied to the terminal 101 in FIG. 1 as an example for illustration, including the following steps:

201. Obtain a training image frame group, each training image frame group is composed of three consecutive image frames arranged in sequence in the video, and the second image frame in each training image frame group is used as each training image The labeled intermediate image frame corresponding to the frame group.

202. Input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group.

203. Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second image frame in each training image frame group The second difference between the estimated intermediate image frames corresponding to each training image frame group, and the difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group The third difference between them adjusts the parameters in the video frame interpolation model until the training ends when the training stop condition is satisfied; wherein, the degree of correlation between the second difference and the parameter adjustment is greater than the first difference or the third difference and the parameter adjustment correlation between.

In the above step 201, the training image frame group refers to extracting every three consecutive image frames obtained after the image frame extraction process is performed on the video as a training image frame group. Wherein, the three image frames in each training image frame group are arranged in order of their appearance time in the video. In addition, the video can be not only one video, but also multiple different videos, so the obtained training image frame group can come from one video or multiple videos.

For the second image frame in each image frame group, because it is the intermediate image frame corresponding to the first image frame and the third image frame in the image frame group, the content of the second image frame is the first image Frame and the third image frame can constitute the associated connection content, therefore, in this embodiment, the second image frame is used as the corresponding label intermediate image frame of each training image frame group, and each training image frame group The corresponding labeled intermediate image frame is used as the supervised image of each training image frame group, and the video frame interpolation model can be supervised for training.

In the above step 202, after the first image frame and the third image frame of each training image frame group are input to the video frame interpolation model, the estimated intermediate image frame corresponding to each training image frame group will be obtained. The content of the intermediate image frame is obtained by processing the contents of the first image frame and the three image frames, and the content of the estimated intermediate image frame is the same as the content of the label intermediate image frame corresponding to each training image frame group resemblance.

It is worth mentioning that the second image frame of each training image frame group is only one possible solution corresponding to the first image frame to the third image frame in the training image frame group. For example, the content of the video shooting is that a ball moves from point A to point E through points B and C. If the first image frame of a certain training image frame group shows that the ball is at point A, the third image frame shows that the ball is at point A. The ball is at point E, and the second image frame shows the ball at point B, but the actual ball also passes through point C while moving, but this position at point C is not captured because the video is It consists of still image frames. Therefore, the video cannot reflect the continuous movement of the ball in time, and the moving process of the ball captured by the video only reflects that the ball is at a certain position at a certain moment.

In the above step 203, the training stop condition refers to: the video frame interpolation model constantly adjusts the parameters of the video frame interpolation model during the training process, and when the change rate of the parameters of the video frame interpolation model does not exceed the predetermined range, the video The interpolation model satisfies the training stop condition.

Specifically, when training the video frame interpolation model according to each training image frame group, a supervisory function is added in this embodiment, so that the video frame interpolation model can adjust the parameters of the video frame interpolation model during training, so that the video frame interpolation The model is continuously optimized during the training process. Among them, the supervision function is divided into two parts, the first part is the first loss function, and the second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to each training image frame group Decide. The second part is the texture consistency loss function (Texture Consistency Loss, TCL), which consists of the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group. The difference is determined by a third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group.

In addition, in the above step 203, the correlation degree between the second difference and the parameter adjustment is greater than that, the correlation degree between the first difference or the third difference and the parameter adjustment refers to: in the supervisory function, the first The degree of correlation between parameter adjustments of the loss function is greater than that of the texture consistency loss function.

Among them, the supervisory function can be shown as formula (1):

In formula (1),

Represents the estimated intermediate image frame corresponding to each training image frame group, I ₀ represents the label intermediate image frame corresponding to each training image frame group, I- ₁ represents the first image frame in each training image frame group, I ₁ represents the third image frame in each training image frame group, α is an adjustable coefficient, L ₁ is the first loss function, and L _p is the texture consistency loss function.

In the method provided by the embodiment of the present application, since the texture consistency loss function is added on the basis of the original supervisory function, the video frame interpolation model not only considers the middle of the label corresponding to each training image frame group in the process of supervised training. The content of the image frame will also consider the content of the first image frame and the third image frame in each training image frame group, which can alleviate the over-constraint problem in the supervised training, so that the image frame output by the video frame interpolation model The texture definition, signal-to-noise ratio and structural similarity are higher, thereby improving the frame rate of the video and increasing the smoothness of the picture.

301. For any training image frame group, use the first image frame and the third image frame in any training image frame group as the first image frame and the third image frame respectively, and for the first image frame and the third image frame Frames are adjusted with the same resolution at the same time; among them, a total of n-1 adjustments are made and the resolutions used for each adjustment are different, and n is a positive integer and not less than 2.

302. Perform feature extraction on each adjusted two image frames, form each image frame feature group from the features extracted from each adjusted two image frames, and form each image frame feature group from each image frame feature group group collection.

303 . Perform cross-scale alignment processing on features corresponding to the first image frame in the image frame feature set set to features corresponding to the third image frame in the image frame feature set set, to obtain an alignment result of the first image frame.

304 . Perform cross-scale alignment processing on features corresponding to the third image frame in the image frame feature set set to features corresponding to the first image frame in the image frame feature set set, to obtain an alignment result of the third image frame.

305. Perform bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;

306. Perform reconstruction processing on the bidirectional information fusion result to obtain an estimated intermediate image frame corresponding to each training image frame group.

Specifically, for any set of training image frames, before performing feature extraction on the first image frame and the third image frame, the first image frame and the third image frame will be adjusted n-1 times in resolution , each time the resolution is adjusted, the obtained first image frame and the third image frame will have a lower resolution than the first image frame and the third image frame before the resolution adjustment.

For example, performing the third resolution adjustment on the first image frame and the third image frame is to reduce the first image frame and the third image frame on the basis of the first image frame and the third image frame obtained after the second resolution adjustment. The resolution of the third image frame. Therefore, the resolutions of the first image frame and the third image frame obtained after the third resolution adjustment are smaller than the resolutions of the first image frame and the third image frame obtained after the second resolution adjustment. In addition, the resolution adjustment times of the first image frame and the third image frame should not be less than 1 time.

The image frames after resolution adjustment are grouped, and the image frames with the same resolution are grouped into one group. Since n-1 resolution adjustments are performed, the original image frame group without resolution adjustment is added. Therefore, there are n groups of image frame groups with different resolutions. Then feature extraction is performed on n groups of image frame groups with different resolutions to obtain n groups of image feature group sets.

In addition, this embodiment does not specifically limit the method for obtaining image frame feature groups of different resolutions, including but not limited to: the implementation process of the above steps 301 and 302, and: for any training image frame group, the The first image frame and the third image frame in any training image frame group are respectively used as the first image frame and the third image frame, and the same resolution is used to extract features for the first image frame and the third image frame at the same time, and each The features extracted from the two image frames after the second adjustment form each image frame feature group, and each image frame feature group forms an image frame feature group set. Among them, the features are extracted n times in total and the resolution of each extracted feature is different, and n is a positive integer and not less than 2. Specifically, convolution may be used to simultaneously perform resolution adjustment and feature extraction on the first image frame and the third image frame, so as to obtain the image frame feature group set in step 302 above.

The order of obtaining the alignment result of the first image frame and the alignment result of the third image frame is not specifically limited in the embodiment of the present application, and the alignment result of the first image frame can be obtained first, and then the alignment result of the second image frame can be obtained result. Alternatively, the alignment result of the second image frame may be obtained first, and then the alignment result of the first image frame may be obtained. It is also possible to obtain the alignment result of the first image frame and the alignment result of the second image frame at the same time.

In addition, the process of cross-scale aligning the features corresponding to the first image frame in the image frame feature set to the features corresponding to the third image frame in the image frame feature set is the same as aligning the third image frame in the image frame feature set The process of performing cross-scale alignment processing of the corresponding features in the set to the corresponding features of the first image frame in the image frame feature group set is the same.

In the above step 306, the reconstruction process refers to regressing the estimated intermediate image frame according to the two-way information fusion result. Specifically, the two-way information fusion result is processed first, then the processing result is input for single-layer convolution, and finally the estimated intermediate image frame is output.

For example, as shown in Figure 3, the two-way information fusion result F ₀ is first input to the first layer (Layer1) for processing, and then the processing result is input to the second layer (Layer2) for single-layer convolution, and finally output Estimating intermediate image frames

In FIG. 3 , "40×RB(128)" indicates that 40 "RB(128)" are used, and RB(128) indicates a residual block with a channel dimension of 128. "Conv(128,3,3,1)" represents a single-layer convolution, where the input and output are 128, 3, the convolution kernel is 3, and the convolution step is 1.

In the method provided in the embodiment of the present application, by inputting the first image frame and the third image of each training image group into the video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image group, in order to By training the video frame interpolation model, the parameters of the video frame interpolation model can be adjusted, thereby improving the quality of the output image frames of the video frame interpolation model.

In combination with the content of the above-mentioned embodiments, in some embodiments, the resolutions corresponding to the image frame feature groups in the image frame feature group set increase sequentially; the corresponding features of the first image frame in the image frame feature set set Perform cross-scale alignment (Cross-scale Pyramid Alignment) processing to the features corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame, including:

For the i-th image frame feature group, if i is 1, the corresponding features of the first image frame in the i-th image frame feature group and the corresponding features of the third image frame in the i-th image frame feature group are used as Alignment (Alignment Block, AB) processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation (Bilinear Upsampling, BU) calculation results and the third image frame at the i-th The corresponding features in each image frame feature group are processed by cross-scale fusion (Cross-scale Fusion, CSF), and the i-th cross-scale fusion processing result is obtained, and the i-th cross-scale fusion processing result is combined with the first image frame at the i-th The corresponding features in the image frame feature groups are aligned to obtain the i-th alignment processing result; repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and the nth alignment processing result As the alignment result of the first image frame; wherein, i is a positive integer not less than 1 and not greater than n;

Specifically, the alignment in this embodiment does not specifically limit the size of the resolution corresponding to the image frame feature group in the image frame feature group set. Moreover, during the cross-scale alignment process, the number of alignment processes is the same as the number of image frame feature groups in the set of image frame feature groups in step 302 above.

For example, if there are 4 groups of image frame feature groups in the image frame feature group collection, then the corresponding feature of the first image frame in the image frame feature group collection is transferred to the corresponding feature of the third image frame in the image frame feature group collection In the process of cross-scale alignment processing, four alignment processings are required.

In addition, the number of times of alignment processing in the process of performing cross-scale alignment processing from the features corresponding to the first image frame in the image frame feature set to the features corresponding to the third image frame in the image frame feature set should not be less than 2 Second-rate.

For the process of cross-scale alignment of the features corresponding to the first image frame in the image frame feature set to the features corresponding to the third image frame in the image frame feature set, the image frame feature set contains three sets of images The image frame feature group is illustrated as an example. As shown in (a) in Figure 4, where,

is the image frame feature group with the highest resolution in the image frame feature group set, that is, the resolution of the image feature group is the same as the resolution of the image frame before the resolution adjustment. (a) in Figure 4

Be the second image frame feature group in the image frame feature group collection, the resolution of (a) in Fig. 4

is the image frame feature group with the smallest resolution in the set of image frame feature groups. in addition,

and

Indicates the 3 image frame features extracted from the first image frame after 2 resolution adjustments,

and

Indicates the features of the three image frames extracted from the third image frame after two resolution adjustments.

Indicates the alignment result of the first image frame.

Among them, the process of alignment processing is shown in (b) in Figure 4. First, the two input image frame features are concatenated (Concatenation), and then the concatenation results are sequentially input into the single-layer convolution "Conv3×3", 5 A serial residual block "Res.block×5", and another convolutional layer "Conv3×3", get the weight tensor

and

Finally, deformed convolution processing is used to obtain the result of this alignment

Where l is the number of times of resolution adjustment processing.

The method provided in the embodiment of the present application can extract effective reconstruction signals from image frames of multiple scales by aligning features with the same resolution and adding a cross-scale fusion process, thereby improving the output first The accuracy of the alignment results of image frames can comprehensively and effectively utilize multi-scale information.

In some embodiments, the features corresponding to the third image frame in the image frame feature set are subjected to cross-scale alignment processing to the features corresponding to the first image frame in the image frame feature set to obtain the alignment result of the third image frame ,include:

For the i-th image frame feature group, if i is 1, the corresponding features of the third image frame in the i-th image frame feature group and the corresponding features of the first image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the first image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the third image frame in the i-th image frame feature group, and obtain the i-th alignment processing result ; Repeat the above processing for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the alignment result of the third image frame.

It should be noted that the processing method of obtaining the alignment result of the third image frame is the same as that of obtaining the alignment result of the third image frame, which will not be described here again. For the specific processing procedure of obtaining the alignment result of the third image frame, refer to the above-mentioned processing procedure of obtaining the alignment result of the first image frame.

In the method provided in the embodiment of the present application, the third image can be obtained by performing cross-scale alignment processing on the features corresponding to the third image frame in the image frame feature set set to the features corresponding to the first image frame in the image frame feature set set The alignment result of the frame.

In combination with the content of the foregoing embodiments, in some embodiments, two-way information fusion (Attention-based Fusion) is performed on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result, including:

401. Convolute the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result.

402. Calculate the convolution result to obtain fusion weights.

403. Perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight, to obtain a two-way information fusion result.

In the above step 402, this embodiment does not specifically limit the selection of the calculation method for calculating the convolution result, including but not limited to: Sigmoid function and so on.

Specifically, the alignment result of the first image frame and the alignment result of the third image frame are subjected to single-layer convolution to obtain the convolution processing result, and then the convolution processing result is activated by a function to obtain the fusion weight, and then according to the fusion The weight is calculated and processed according to formula (2) on the alignment result of the first image frame and the result of the third image frame to obtain a two-way information fusion result.

In formula (2), M is the fusion weight,

is the alignment result of the first image frame,

is the alignment result of the third image frame, and F ₀ is the result of two-way information fusion.

The method provided in the embodiment of the present application can obtain the bidirectional information fusion result by performing bidirectional information fusion on the alignment result of the first image frame and the result of the third image frame, thereby improving the quality of the image frame output by the video frame interpolation model.

Combining the content of the above-mentioned embodiments, in some embodiments, both the first difference and the third difference are similarities; the determination process of the first difference and the third difference includes:

501. For any training image frame group, select any t*t pixel from the estimated intermediate image frame corresponding to any training image frame group, and select any training image frame according to the central pixel of any t*t pixel The position in the estimated intermediate image frame corresponding to the group is determined in the first image frame of any training image frame group and the first target pixel of t*t is determined in the third image frame in any training image frame group The third target pixel of t*t; wherein, t is an odd number not equal to 1.

502. Determine the first character set according to the first target pixel of t*t; determine the third character set according to the third target pixel of t*t; determine the second character set according to any t*t pixel.

503. According to the first character set, the second character set, and the third character set, determine the similarity between any pixel and the first target pixel as the first difference; determine the similarity between any pixel and the third target pixel Similarity, as a third difference.

Specifically, for any training image frame group, when the video frame interpolation model outputs the corresponding estimated intermediate image frame

After that, the intermediate image frame is estimated from the

Select any t*t pixel block in

Where t is an odd number other than 1, for example, 3, 5, 7, etc., and x is the two-dimensional coordinate of the central pixel of the pixel block. Then according to the pixel block

The two-dimensional coordinates x of the central pixel of , determine the first target pixel in the first image frame in any training image frame group and determine the third target pixel in the third image frame in any training image frame group.

Then according to the two-dimensional coordinate x in the first image frame in any training image frame group, determine the first target pixel of t*t and the third image frame in any training image frame group to determine the first target pixel of t*t Three target pixels. The first target pixel and the third target pixel of t*t are used as pixels to be matched.

Will

and the pixels to be matched undergo CT (Census Transform) transformation to determine the first character set, the first character set and the third character set. Finally, according to the first character set, the second character set and the third character set, determine the similarity between any t*t pixel and the first target pixel, and use it as the first difference to determine any t*t The similarity between the pixel of and the third target pixel is taken as the third difference.

Compare the first difference with the third difference, and determine the best matching pixel of the central pixel of any t*t pixel, that is, according to the first difference and the third difference, from the first target pixel of t*t and t* Select the best matching pixel with the central pixel of any t*t pixel from the third target pixel of t. Then according to the best matching pixel, the texture consistency loss of the central pixel of any t*t pixel at x is calculated by the texture consistency loss function (Texture Consistency Loss, TCL), and the video is analyzed according to the texture consistency loss The interpolation model is supervised for training.

In combination with the above content, after selecting the best matching pixel corresponding to the central pixel of any t*t pixel from the first target pixel of t*t and the third target pixel of t*t, the calculated The texture consistency loss function of the central pixel of any t*t pixel.

Wherein, the texture consistency loss is determined by comparing the RGB values of the central pixel of any t*t pixel with the best matching pixel.

to estimate intermediate image frames from

Select any 3*3 pixel in

As an example to illustrate the method provided in the embodiment of this application:

(1) For

Any 3*3 pixel in

(x represents the two-dimensional coordinates of the center point of the image block), it is necessary to find the best matching pixel from the first image frame I _-1 and the third image frame I ₁ through a matching algorithm

(2) Use the best matching pixel

to the estimated

Supervision, where t ^* ∈ {-1, 1} represents the label of the image frame, that is, the best matching pixel comes from the first or third image frame, and y* represents the two-dimensional value of the best matching pixel coordinate.

The matching process is shown in Figure 5 and is divided into four steps:

1. For any training image frame group, input the estimated intermediate image frame corresponding to any training image frame group

Any 3*3 pixel in

The first image frame I ₋₁ and the third image frame I ₁ .

2. Take any 3*3 pixels

The pixel at the position of the two-dimensional coordinate x of the center is the center pixel, and all the pixels f _y ^t to be matched are respectively obtained from the first image frame I ₋₁ and the third image frame I ₁ within a certain range d. Among them, the value of d is an odd number not less than 3, such as 3, 5, 7, etc., t∈{-1, 1}, t indicates that the pixel to be matched comes from the first image frame or the third image frame, y represents the two-dimensional coordinates of the pixel f _y ^t to be matched. The formula for determining the two-dimensional coordinate φ(x) is shown in formula (3):

φ(x)＝{y||y-x|＜d} (3)

3. will

And all the 3*3 pixels to be matched are transformed by CT (Census Transform) to obtain the second string

and a string selected from the first string and the third string

The CT transformation formula is:

In formula (4), f _x (x) is any 3*3 pixel

The RGB value of the pixel at the center position of , f _x (x+x _n ) is the RGB value of other pixels to be matched, x is the coordinate of the pixel at the center position (0, 0), and x _n is other to be matched The two-dimensional coordinates of the pixel, R is the coordinates of the other eight pixels except the pixel at the center position.

Where, R={(-1,-1), (-1,0), (-1,1), (1,-1), (1,1), (1,0), (0,1 ), (0, -1)}.

4. After each pixel undergoes CT transformation, similarity matching is performed according to formula (5), and the two-dimensional coordinate y* of the best matching pixel and the label t* of the corresponding image frame are obtained.

In formula (5), L2 is a matching function for similarity matching.

The method provided in the embodiment of the present application can alleviate the over-constraint problem caused by the motion ambiguity of the object in the image frame due to the use of the texture consistency function, so that the texture of the image frame output by the trained video frame interpolation model is more accurate. Clear, closer to the texture structure of the input image frame, avoiding blurry and texture-unclear content.

In combination with the content of the above embodiments, in some embodiments, the process of determining the second difference includes:

For any training image frame group, according to the RGB values of all pixels in the label intermediate image frame corresponding to any training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to any training image frame group, determine The RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.

Specifically, for any training image frame group, before determining the second difference, it is necessary to determine the RGB values of all pixels in the label intermediate image frame corresponding to any training image frame group and the corresponding pre-set value of any training image frame group. Estimate the RGB values of all pixels in the intermediate image frame. Then all the pixels in the intermediate image frame of the label are compared with the RGB values of the pixels with the same two-dimensional coordinates in all the pixels in the estimated intermediate image frame one by one, and it is determined that all the pixels in the intermediate image frame of the label are consistent with The estimated difference between RGB values of all pixels in the intermediate image frame. The differences of the RGB values of all pixels are summed and then averaged, and the average value can be used as the second difference.

In the method provided in the embodiment of the present application, by comparing the difference between the label intermediate image frame corresponding to any training image frame group and the corresponding estimated intermediate image frame, this difference can be used to realize the supervised training of the video frame interpolation model, thereby The accuracy of the image frame output by the video frame interpolation model can be improved, thereby improving the fluency and clarity of the video.

In combination with the content of the above-mentioned embodiments, in some embodiments, after the video frame interpolation model is trained, it includes:

601. Acquire two image frames in the video to be processed.

602. Input the two image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.

Specifically, the training process of the video frame interpolation model is shown in Figure 6. After the training of the video frame interpolation model is completed, the process of using it is to obtain the video for which video frame interpolation needs to be performed, and then perform image frame extraction on the video, Two image frames are selected from the extracted image frames. Input two image frames into the trained video frame interpolation model, and after processing by the video frame interpolation model, the intermediate image frame of the two image frames can be output.

It is worth mentioning that the video frame interpolation model trained in this embodiment can not only complete single-frame video interpolation and extrapolation, but also complete multi-frame video interpolation. That is, the trained video frame interpolation model in this embodiment can be used to generate an intermediate image frame between two image frames, or it can be used to generate a future image frame placed after the two image frames, and it can also be used for multiple images frame generates an intermediate image frame.

Compared with the frame interpolation results achieved by related technologies, the output results of the video frame interpolation model can effectively improve the performance of video super-resolution. The comparison diagrams of the obtained image frames are shown in Fig. 7a to Fig. 7j.

Among them, Fig. 7a is a comparison and evaluation result diagram of single-frame video interpolation, wherein, the video frame interpolation model has two input image frames and outputs one intermediate image frame. Fig. 7b is a diagram of comparative evaluation results of multi-frame video interpolation, where the video frame interpolation model has 4 input image frames and outputs 1 intermediate image frame. Fig. 7c is a comparative evaluation result diagram of single-frame video extrapolation, in which, the video frame interpolation model takes 2 input image frames and outputs 1 future image frame. Fig. 7d is a comparison diagram of visual effect after integrating the trained video frame interpolation model into a video super-resolution model. Fig. 7e is a visual comparison diagram of single-frame video interpolation. Fig. 7f is a visual comparison diagram of multi-frame video interpolation. Figure 7g is a visual comparison diagram of single-frame video extrapolation. Fig. 7h is a comparison diagram of the impact of single-frame video interpolation on video super-resolution. Figure 7i is a single visualization comparison diagram with TCL loss function added. Figure 7j is a comparison of multiple visualizations with TCL loss function added.

The method provided by this application implements the trained video frame interpolation model to process the video to be processed, and can output high-definition image frames, thereby effectively improving the performance of video super-resolution. Compared with related technical methods, The method provided in this embodiment achieves the highest peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR) and structural similarity (Structural Similarity, SSIM).

It should be understood that although the steps in the flow charts involved in the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed sequentially in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order restriction for the execution of these steps, and these steps can be executed in other orders. Moreover, at least some of the steps in the flow charts involved in the above-mentioned embodiments may include multiple steps or stages, and these steps or stages are not necessarily executed at the same time, but may be performed at different times For execution, the execution order of these steps or stages is not necessarily performed sequentially, but may be executed in turn or alternately with other steps or at least a part of steps or stages in other steps.

Based on the same inventive concept, an embodiment of the present application also provides a video frame interpolation model training device for implementing the above-mentioned video frame interpolation model training method. The solution to the problem provided by the device is similar to the implementation described in the above method, so the specific limitations in one or more embodiments of the video frame interpolation model training device provided below can be referred to above for video frame interpolation The limitation of the model training method will not be repeated here.

In some embodiments, as shown in FIG. 8 , a video frame insertion model training device is provided, including: an acquisition module, a video frame insertion module and an adjustment module, wherein:

The obtaining module 801 is used to obtain the training image frame group, each training image frame group is formed by sequential arrangement of three consecutive image frames in the video, and the second image frame in each training image frame group is used as Label intermediate image frames corresponding to each training image frame group;

The video frame interpolation module 802 is used to input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group;

An adjustment module 803, configured to use a label corresponding to each training image frame group based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group The second difference between the intermediate image frame and the estimated intermediate image frame corresponding to each training image frame group, and the estimated intermediate image frame corresponding to each training image frame group corresponding to the third image frame in each training image frame group The third difference between the image frames adjusts the parameters in the video frame interpolation model until the training stop condition is satisfied and the training ends; wherein, the degree of correlation between the second difference and the parameter adjustment is greater than the first difference or the third difference and Correlation between parameter adjustments.

In some embodiments, the video frame insertion module 802 includes:

The adjustment sub-module is used for any training image frame group, using the first image frame and the third image frame in any training image frame group as the first image frame and the third image frame respectively, for the first image frame Adjusting with the same resolution as the third image frame at the same time; wherein, a total of n-1 adjustments are made and the resolution used for each adjustment is different, wherein n is a positive integer and not less than 2;

The feature extraction sub-module is used to perform feature extraction on the two image frames after each adjustment, each image frame feature group is formed by the features extracted from the two image frames after each adjustment, and each image frame feature Grouping constitutes an image frame feature set;

The first alignment sub-module is used to perform cross-scale alignment processing of the features corresponding to the first image frame in the image frame feature set set to the features corresponding to the third image frame in the image frame feature set set, to obtain the first image frame alignment result;

The second alignment sub-module is used to perform cross-scale alignment processing on the features corresponding to the third image frame in the image frame feature set set to the features corresponding to the first image frame in the image frame feature set set, to obtain the third image frame alignment result;

The two-way information fusion sub-module is used to perform two-way information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result;

The reconstruction module performs reconstruction processing on the two-way information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.

In some embodiments, the first alignment submodule includes:

The first repeating unit is used for the feature group of the i-th image frame, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group and the third image frame in the i-th image frame The corresponding features in the feature group are aligned to obtain the i-th alignment result. If i is not 1, the first i-1 bilinear interpolation calculation results and the third image frame are placed in the i-th image frame feature group Perform cross-scale fusion processing on the corresponding features in to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the first image frame in the i-th image frame feature group, Obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group, until all image frame feature groups are processed, and use the n-th alignment processing result as the alignment result of the first image frame; wherein, i is A positive integer not less than 1 and not greater than n;

In some embodiments, the second alignment submodule includes:

The second repeating unit is used for, for the i-th image frame feature group, if i is 1, the feature corresponding to the third image frame in the i-th image frame feature group and the first image frame in the i-th image frame The corresponding features in the feature group are aligned to obtain the i-th alignment result; if i is not 1, the first i-1 bilinear interpolation calculation results and the first image frame in the i-th image frame feature group Perform cross-scale fusion processing on the corresponding features in to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the third image frame in the i-th image frame feature group, Obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group, until all image frame feature groups are processed, and use the n-th alignment processing result as the alignment result of the third image frame; wherein, i is A positive integer not less than 1 and not greater than n;

In some embodiments, the two-way information fusion submodule includes:

The first acquisition unit is configured to convolve the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;

The second acquisition unit is used to calculate the convolution result to obtain the fusion weight;

The first processing unit is configured to perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a two-way information fusion result.

In some embodiments, the adjustment module 803 includes:

The first determining unit is used for selecting any t*t pixel from the estimated intermediate image frame corresponding to any training image frame group for any training image frame group, according to the central pixel of any t*t pixel In the position in the estimated intermediate image frame corresponding to any training image frame group, determine the first target pixel of t*t in the first image frame in any training image frame group and the third target pixel in any training image frame group Determine the third target pixel of t*t in an image frame; Wherein, t is an odd number not equal to 1;

The second determination unit is used to determine the first character set according to the first target pixel of t*t; determine the third character set according to the third target pixel of t*t; determine the first character set according to any t*t pixel two-character set;

The third determination unit is used to determine the similarity between any pixel and the first target pixel according to the first character set, the second character set and the third character set, as the first difference, determine any pixel and the third character set The similarity between target pixels, as the third difference.

In some embodiments, the adjustment module 803 further includes:

The fourth determination unit is used for any training image frame group, according to the RGB values of all pixels in the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group The RGB values of all pixels determine the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group as the second difference.

In some embodiments, also include:

The comparison module is used to compare the first difference with the third difference, determine the best matching pixel of the central pixel of any t*t pixel, and calculate the value of any t*t through the texture consistency loss function according to the best matching pixel The texture consistency loss of the central pixel of the pixel; where the texture consistency loss is used to train the video frame interpolation model.

In some embodiments, also include:

An image frame acquisition module, configured to acquire two image frames to be processed in the video to be processed;

The input module is used to input the two image frames to be processed into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.

Each module in the above-mentioned video frame interpolation model device can be fully or partially realized by software, hardware and a combination thereof. The above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.

In some embodiments, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 9 . The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, mobile cellular network, NFC (Near Field Communication) or other technologies. When the computer program is executed by the processor, a video frame insertion model method is realized. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the casing of the computer device , and can also be an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of a part of the structure related to the embodiment of the application, and does not constitute a limitation on the computer equipment to which the embodiment of the application is applied. The computer device may include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.

In some embodiments, a computer device is provided, including a memory and a processor, a computer program is stored in the memory, and the processor implements the following steps when executing the computer program:

Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the label intermediate image frame corresponding to each training image frame group and each The second difference between the estimated intermediate image frames corresponding to the training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group Three differences, adjust the parameters in the video frame interpolation model, and end the training until the training stop condition is satisfied; wherein, the degree of correlation between the second difference and the parameter adjustment is greater than the correlation between the first difference or the third difference and the parameter adjustment Spend.

In some embodiments, the following steps are also implemented when the processor executes the computer program:

Calculate the convolution result to obtain the fusion weight;

According to the first character set, the second character set and the third character set, determine the similarity between any pixel and the first target pixel, as the first difference, determine the similarity between any pixel and the third target pixel Similarity, as a third difference.

Compare the first difference with the third difference, determine the best matching pixel of the center pixel of any t*t pixel, and calculate the center pixel of any t*t pixel through the texture consistency loss function according to the best matching pixel Texture consistency loss; among them, the texture consistency loss is used to train the video frame interpolation model.

Obtain two image frames to be processed in the video to be processed;

In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the label intermediate image frame corresponding to each training image frame group and each The second difference between the estimated intermediate image frames corresponding to the training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group The third difference is to adjust the parameters in the video frame interpolation model until the training stop condition is satisfied, and the training is ended; wherein, the degree of correlation between the second difference and the parameter adjustment is greater than the first difference or the third difference.

In some embodiments, when the computer program is executed by the processor, the following steps are also implemented:

The feature corresponding to the first image frame in the image frame feature group set is carried out to the feature corresponding to the third image frame in the image frame feature set set for cross-scale alignment processing to obtain the alignment result of the first image frame;

For the i-th image frame feature group, if i is 1, the corresponding features of the first image frame in the i-th image frame feature group and the corresponding features of the third image frame in the i-th image frame feature group are used as Alignment processing, to get the i-th alignment processing result, if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the third image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the first image frame in the i-th image frame feature group, and obtain the i-th alignment processing result , repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the alignment result of the first image frame; wherein, i is not less than 1 and not greater than n a positive integer;

For the i-th image frame feature group, if i is 1, the corresponding features of the third image frame in the i-th image frame feature group and the corresponding features of the first image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the first image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the third image frame in the i-th image frame feature group, and obtain the i-th alignment processing result ; Repeat the above process for each image frame feature group until all image frame feature groups are processed, and the nth alignment processing result is used as the alignment result of the third image frame; wherein, i is not less than 1 and not greater than n a positive integer;

Calculate the convolution result to obtain the fusion weight;

According to the first character set, the second character set and the third character set, determine the similarity between any pixel and the first target pixel, as the first difference, determine the similarity between any pixel and the third target pixel , as the third difference.

Obtain two image frames to be processed in the video to be processed;

In some embodiments, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the following steps:

Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the label intermediate image frame corresponding to each training image frame group and each The second difference between the estimated intermediate image frames corresponding to the training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group Three differences, adjust the parameters in the video frame interpolation model, and end the training until the training stop condition is met; wherein, the correlation between the second difference and the parameter adjustment is greater than that between the first difference or the third difference and the parameter adjustment degree of relevance.

For any training image frame group, the first image frame and the third image frame in any training image frame group are respectively used as the first image frame and the third image frame, and the first image frame and the third image frame are simultaneously The same resolution is used for adjustment; wherein, a total of n-1 adjustments are made and the resolution used for each adjustment is different, where n is a positive integer and not less than 2;

For the i-th image frame feature group, if i is 1, the corresponding features of the first image frame in the i-th image frame feature group and the corresponding features of the third image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result, if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the third image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the first image frame in the i-th image frame feature group, and obtain the i-th alignment processing result , repeat the above process for each image frame feature group until all image frame feature groups are processed, and take the nth alignment processing result as the alignment result of the first image frame, where i is not less than 1 and not greater than n a positive integer;

For the i-th image frame feature group, if i is 1, the corresponding features of the third image frame in the i-th image frame feature group and the corresponding features of the first image frame in the i-th image frame feature group are used as Alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the corresponding features of the first image frame in the i-th image frame feature group are cross-scale Fusion processing, to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the corresponding features of the third image frame in the i-th image frame feature group, and obtain the i-th alignment processing result ; Repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and the nth alignment processing result is used as the alignment result of the third image frame; wherein, i is not less than 1 and not greater than n a positive integer;

Calculate the convolution result to obtain the fusion weight;

Obtain two image frames to be processed in the video to be processed;

It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the embodiments of this application , are all information and data authorized by the user or fully authorized by all parties.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be completed by instructing related hardware through computer programs, and the computer programs can be stored in a non-volatile computer-readable memory In the medium, when the computer program is executed, it may include the processes of the embodiments of the above-mentioned methods. Wherein, any reference to storage, database or other media used in the various embodiments provided in the embodiments of the present application may include at least one of non-volatile and volatile storage. Non-volatile memory can include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive variable memory (ReRAM), magnetic variable memory (Magnetoresistive Random Access Memory, MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (Phase Change Memory, PCM), graphene memory, etc. The volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the various embodiments provided in the embodiments of the present application may include at least one of a relational database and a non-relational database. Non-relational databases may include blockchain-based distributed databases, etc., but are not limited thereto. The processors involved in the various embodiments provided in the embodiments of the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc. Not limited to this.

The technical features of the above embodiments can be combined arbitrarily. To make the description concise, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be It is considered to be within the range described in this specification.

The above-mentioned embodiments only express several implementation modes of the embodiments of the present application, and the descriptions thereof are relatively specific and detailed, but should not be construed as limiting the patent scope of the embodiments of the present application. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concepts of the embodiments of the present application, and these all belong to the protection scope of the embodiments of the present application. Therefore, the scope of protection of the embodiments of the present application should be determined by the appended claims.

Claims

A video frame interpolation model training method, characterized in that, comprising:

Obtain training image frame groups, each training image frame group is composed of three consecutive image frames in the video arranged in order, and the second image frame in each training image frame group is used as each training image frame group The corresponding label intermediate image frame;

Input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group;

Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the label intermediate image frame corresponding to each training image frame group and each The second difference between the estimated intermediate image frames corresponding to the training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group Three differences, adjust the parameters in the video frame interpolation model, and end the training until the training stop condition is satisfied; the correlation between the second difference and the parameter adjustment is greater than that of the first difference or the third difference and Correlation between parameter adjustments.
The method according to claim 1, wherein the first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the output corresponding to each training image frame group is Estimated intermediate image frames, including:

For any training image frame group, using the first image frame and the third image frame in any training image frame group as the first image frame and the third image frame respectively, for the first image frame and the third image frame The third image frame is adjusted with the same resolution at the same time; wherein, a total of n-1 adjustments are made and the resolutions used for each adjustment are different, and the n is a positive integer and not less than 2;

Feature extraction is performed on the two image frames after each adjustment, each image frame feature group is composed of the features extracted from the two image frames after each adjustment, and the image frame feature group set is composed of each image frame feature group ;

performing cross-scale alignment processing on features corresponding to the first image frame in the image frame feature set set to features corresponding to the third image frame in the image frame feature set set, to obtain the first image frame alignment result;

performing cross-scale alignment processing on the features corresponding to the third image frame in the image frame feature set set to the features corresponding to the first image frame in the image frame feature set set, to obtain the third image frame alignment result;

performing two-way information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result;

Reconstruction processing is performed on the two-way information fusion result to obtain an estimated intermediate image frame corresponding to each training image frame group.
The method according to claim 2, characterized in that, the resolutions corresponding to the image frame feature groups in the image frame feature group set are sequentially increased; the first image frame in the image The corresponding features in the frame feature group set perform cross-scale alignment processing to the corresponding features of the third image frame in the image frame feature group set to obtain the alignment result of the first image frame, including:

For the i-th image frame feature group, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group and the third image frame in the i-th image The corresponding features in the frame feature group are aligned to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the third image frame in the i-th The corresponding features in the image frame feature group are subjected to cross-scale fusion processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is combined with the first image frame in the i-th image Align the corresponding features in the frame feature group to obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the The alignment result of the first image frame; wherein, the i is a positive integer not less than 1 and not greater than n;

Wherein, for the jth bilinear interpolation calculation result among the first i-1 bilinear interpolation calculation results, the jth bilinear interpolation calculation result is performed i-j times consecutively on the jth alignment processing result Obtained by bilinear interpolation calculation, the j is a positive integer not less than 1 and less than i.
The method according to claim 2, wherein the resolutions corresponding to the image frame feature groups in the set of image frame feature groups are sequentially increased; The corresponding features in the frame feature group set perform cross-scale alignment processing to the corresponding features of the first image frame in the image frame feature group set, and obtain the alignment result of the third image frame, including:

For the i-th image frame feature group, if i is 1, then the feature corresponding to the third image frame in the i-th image frame feature group and the first image frame in the i-th image The corresponding features in the frame feature group are aligned to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the first image frame in the i-th The corresponding features in the image frame feature group are subjected to cross-scale fusion processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is combined with the third image frame in the i-th image Align the corresponding features in the frame feature group to obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group until all image frame feature groups are processed, and use the nth alignment processing result as the The alignment result of the third image frame; wherein, the i is a positive integer not less than 1 and not greater than n;

Wherein, for the jth bilinear interpolation calculation result among the first i-1 bilinear interpolation calculation results, the jth bilinear interpolation calculation result is performed i-j times consecutively on the jth alignment processing result obtained by bilinear interpolation calculation, the j is a positive integer not less than 1 and less than i.
The method according to claim 3 or 4, wherein the two-way information fusion is performed on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result, comprising:

Convolving the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;

Calculating the convolution result to obtain fusion weights;

Perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a two-way information fusion result.
The method according to claim 1, wherein both the first difference and the third difference are similarities; the process of determining the first difference and the third difference includes:

For any training image frame group, select any t*t pixel from the estimated intermediate image frame corresponding to any training image frame group, according to the center pixel of any t*t pixel in the described The positions in the estimated intermediate image frames corresponding to any training image frame group, respectively determine the first target pixel of t*t in the first image frame in the any training image frame group and in the any training image frame group Determine the third target pixel of t*t in the third image frame in the image frame group; wherein, t is an odd number not equal to 1;

Determine the first character set according to the first target pixel of t*t; determine the third character set according to the third target pixel of t*t; determine the second character set according to any pixel of t*t set of characters;

According to the first character set, the second character set and the third character set, determine the similarity between the any pixel and the first target pixel as the first difference; determine the The similarity between any pixel and the third target pixel is used as the third difference.
The method according to claim 1, wherein the determining process of the second difference comprises:

For any training image frame group, according to the RGB values of all the pixels in the label intermediate image frame corresponding to the any training image frame group and the estimated values of all pixels in the intermediate image frame corresponding to the any training image frame group RGB value, determining the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group, as the second difference.
The method according to claim 6, further comprising:

Comparing the first difference with the third difference, determining the best matching pixel of the central pixel of any t*t pixel; according to the best matching pixel, calculating the any A texture consistency loss of the central pixel of t*t pixels; wherein, the texture consistency loss is used to train the video frame interpolation model.
The method according to any one of claims 1-8, further comprising:

Obtain two image frames to be processed in the video to be processed;

Input the two image frames to be processed into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.
A video frame interpolation model training device, characterized in that it comprises:

The acquisition module is used to obtain training image frame groups, each training image frame group is formed by sequential arrangement of three consecutive image frames in the video, and the second image frame in each training image frame group is used as each a label intermediate image frame corresponding to the training image frame group;

The video frame interpolation module is used to input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the corresponding estimated intermediate image frame of each training image frame group;

The adjustment module is used to base on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, and the label intermediate corresponding to each training image frame group The second difference between the image frame and the estimated intermediate image frame corresponding to each training image frame group, and the estimated intermediate image corresponding to the third image frame in each training image frame group and each training image frame group The third difference between frames, adjust the parameters in the video frame interpolation model until the training stops when the training stop condition is satisfied; the degree of correlation between the second difference and parameter adjustment is greater than that of the first difference or the The degree of correlation between the third difference and parameter adjustment.
The device according to claim 10, wherein the video frame insertion module comprises:

The adjustment sub-module is used for any training image frame group, using the first image frame and the third image frame in any training image frame group as the first image frame and the third image frame respectively, and the The first image frame and the third image frame are adjusted at the same resolution at the same time; wherein, a total of n-1 adjustments are made and the resolutions used for each adjustment are different, and the n is a positive integer and not less than 2 ;

The feature extraction sub-module is used to perform feature extraction on the two image frames after each adjustment, each image frame feature group is formed by the features extracted from the two image frames after each adjustment, and each image frame feature Grouping constitutes an image frame feature set;

A first alignment submodule, configured to perform cross-scale alignment of features corresponding to the first image frame in the image frame feature set to features corresponding to the third image frame in the image frame feature set processing to obtain the alignment result of the first image frame;

The second alignment submodule is configured to perform cross-scale alignment of the features corresponding to the third image frame in the image frame feature set to the features corresponding to the first image frame in the image frame feature set processing to obtain the alignment result of the third image frame;

A two-way information fusion sub-module, configured to perform two-way information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result;

The reconstruction module is configured to perform reconstruction processing on the bidirectional information fusion result to obtain an estimated intermediate image frame corresponding to each training image frame group.
The device according to claim 11, wherein the first alignment submodule comprises:

The first repeating unit is used for, for the i-th image frame feature group, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group and the third image frame The features corresponding to the i-th image frame feature group are aligned to obtain the i-th alignment result; if i is not 1, the first i-1 bilinear interpolation calculation results and the third The features corresponding to the image frame in the i-th image frame feature group are subjected to cross-scale fusion processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is combined with the first image frame Perform alignment processing on the corresponding features in the i-th image frame feature group to obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group, until all image frame feature groups are processed, and the first n alignment processing results are used as alignment results of the first image frame; wherein, the i is a positive integer not less than 1 and not greater than n;

Wherein, for the jth bilinear interpolation calculation result among the first i-1 bilinear interpolation calculation results, the jth bilinear interpolation calculation result is performed i-j times consecutively on the jth alignment processing result Obtained by bilinear interpolation calculation, the j is a positive integer not less than 1 and less than i.
The device according to claim 11, wherein the second alignment submodule comprises:

The second repeating unit is used for, for the i-th image frame feature group, if i is 1, the feature corresponding to the third image frame in the i-th image frame feature group and the first image frame The features corresponding to the i-th image frame feature group are aligned to obtain the i-th alignment result; if i is not 1, the first i-1 bilinear interpolation calculation results and the first The features corresponding to the image frame in the i-th image frame feature group are subjected to cross-scale fusion processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is combined with the third image frame Perform alignment processing on the corresponding features in the i-th image frame feature group to obtain the i-th alignment processing result, repeat the above-mentioned processing process for each image frame feature group, until all image frame feature groups are processed, and the first n alignment processing results are used as alignment results of the third image frame; wherein, the i is a positive integer not less than 1 and not greater than n;

Wherein, for the jth bilinear interpolation calculation result among the first i-1 bilinear interpolation calculation results, the jth bilinear interpolation calculation result is performed i-j times consecutively on the jth alignment processing result Obtained by bilinear interpolation calculation, the j is a positive integer not less than 1 and less than i.
The device according to claim 12 or 13, wherein the two-way information fusion submodule comprises:

A first acquisition unit, configured to convolve the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;

a second acquisition unit, configured to calculate the convolution result to obtain fusion weights;

The first processing unit is configured to perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a two-way information fusion result.
The device according to claim 10, wherein the adjustment module further comprises:

The first determination unit is configured to, for any training image frame group, select any t*t pixel from the estimated intermediate image frame corresponding to any training image frame group, according to any t*t pixel The position of the central pixel of the pixel in the estimated intermediate image frame corresponding to any training image frame group, the first target pixel of t*t is determined in the first image frame in any training image frame group respectively And determine the third target pixel of t*t in the third image frame in any one of the training image frame groups; wherein, t is an odd number not equal to 1;

The second determination unit is used to determine the first character set according to the first target pixel of t*t; determine the third character set according to the third target pixel of t*t; The pixels of t determine the second character set;

A third determining unit, configured to determine the similarity between any pixel and the first target pixel according to the first character set, the second character set, and the third character set, as the the first difference; determining the similarity between the any pixel and the third target pixel as the third difference.
The device according to claim 10, wherein the adjustment module further comprises:

The fourth determining unit is used for any training image frame group, according to the RGB values of all pixels in the label intermediate image frame corresponding to the any training image frame group and the corresponding estimated value of the any training image frame group The RGB values of all pixels in the intermediate image frame determine the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group, as the Describe the second difference.
The device according to any one of claims 10-16, further comprising:

An image frame acquisition module, configured to acquire two image frames to be processed in the video to be processed;

The input module is configured to input the two image frames to be processed into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.
A computer device, comprising a memory and a processor, the memory stores a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 9 when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of the method according to any one of claims 1 to 9 when the computer program is executed by a processor.
A computer program product, comprising a computer program, characterized in that, when the computer program product is executed by a processor, the steps of the method according to any one of claims 1 to 9 are realized.