CN114332709A

CN114332709A - Video processing method, video processing device, storage medium and electronic equipment

Info

Publication number: CN114332709A
Application number: CN202111638616.8A
Authority: CN
Inventors: 丁予康; 周雅; 徐宁; 闻兴; 戴宇荣
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-04-12

Abstract

The disclosure provides a video processing method, a video processing device, a storage medium and an electronic device. The method comprises: dividing a video into groups each including a predetermined number of video frames, and classifying the video frames in each group into key frames and non-key frames; performing super-resolution processing on the key frames and the non-key frames of each group through a super-resolution model to obtain super-resolution frames of the key frames and the non-key frames of each group; and encoding the super-resolution frames of the key frames and the super-resolution frames of the non-key frames of each group into a super-resolution video, wherein the super-resolution model is configured to use the super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frames. With this method, the super-resolution result of the key frames guides the super-resolution processing of the non-key frames, which benefits both processing speed and subjective quality.

Description

Video processing method, video processing device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of video technologies, and in particular, to a video processing method and apparatus, a training method for a video super-resolution model, an electronic device, and a computer-readable storage medium.

Background

Super resolution (SR) is an image/video processing technology that converts a low-resolution (LR) image or video into a high-resolution (HR) image or video, thereby increasing its resolution and improving its quality.

Video super-resolution techniques mainly fall into two modes. In the first mode, the video to be processed is decoded into frames, a super-resolution model is called serially on each frame, and the resulting frames are video-encoded to obtain the final output video. This mode spends the same amount of computation on every video frame and does not consider the characteristics of the different frame types (such as I frames, P frames and B frames) in an actual video, so treating all frames equally causes a large amount of redundant computation and reduces computational efficiency.

In the second mode, after the video is decoded into video frames, the full model is applied only to I frames, while non-I frames are processed by a lightweight model or not processed at all, which increases computing speed. However, such an algorithm does not process non-I frames well: although inference is fast, the subjective quality of the processed video is unsatisfactory.

Disclosure of Invention

The present disclosure provides a video processing method and apparatus, a corresponding method for training a video super-resolution model, an electronic device, and a computer-readable storage medium, so as to solve at least the problems of low super-resolution computation efficiency and poor subjective quality of super-resolution video in the related art; the disclosure need not, however, solve any of the above problems.

According to a first aspect of the present disclosure, there is provided a video processing method, comprising: dividing a video into groups including a predetermined number of video frames, and classifying the video frames in each group into key frames and non-key frames; respectively performing super-resolution processing on the key frames and the non-key frames of each group through a super-resolution model to obtain super-resolution frames of the key frames and the non-key frames of each group; encoding the super-resolution frame of the key frame and the super-resolution frame of the non-key frame of each group as a super-resolution video, wherein the super-resolution model is a model configured to use the super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frame.

According to a first aspect of the disclosure, the super-resolution model comprises a first super-resolution model and a second super-resolution model, the super-resolution model being trained by: performing single-frame image inference on the key frame through the first super-resolution model to obtain a super-resolution frame of the key frame; fusing the super-resolution frame of the key frame and the characteristics of the non-key frame based on a second super-resolution model, and performing inference based on the fused characteristics to obtain the super-resolution frame of the non-key frame; determining a first loss function for a first super-resolution model from the key frame and the super-resolution frame of the key frame, and adjusting a parameter of the first super-resolution model based on a value of the first loss function; determining a second loss function for a second super-resolution model from the non-key frames and the super-resolution frames of the non-key frames, and adjusting parameters of the second super-resolution model based on the second loss function.

According to a first aspect of the present disclosure, classifying the video frames in each group into key frames and non-key frames comprises: classifying video frames of a predetermined type in the group as key frames and classifying the remaining video frames as non-key frames, or classifying the first frame in the group as a key frame and classifying the remaining video frames as non-key frames.

According to a first aspect of the disclosure, each of the first and second super-resolution models comprises an inference subject network comprising a plurality of residual convolutional layers and a fast upsampling layer for performing an inference operation, wherein the number of channels and the number of residual convolutional layers comprised by the first super-resolution model are higher than the number of channels and the number of residual convolutional layers comprised by the second super-resolution model.

According to a first aspect of the present disclosure, performing single-frame image inference on a key frame by a first super-resolution model to obtain a super-resolution frame of the key frame comprises: calculating the depth features of the key frame layer by layer through the residual convolutional layers; and upsampling the depth features of the key frame, through the fast upsampling layer, into a high-resolution super-resolution frame of the key frame.

According to the first aspect of the present disclosure, the second super-resolution model further includes a first feature extractor, a second feature extractor, and a splicing unit, wherein fusing the features of the super-resolution frame of the key frame and of the non-key frame based on the second super-resolution model, and performing inference based on the fused features to obtain the super-resolution frame of the non-key frame, includes: extracting features of the super-resolution frame of the key frame by the first feature extractor and features of the non-key frame by the second feature extractor; performing feature splicing and convolution on the extracted features of the super-resolution frame of the key frame and the features of the non-key frame through the splicing unit to obtain the fused features; computing on the fused features layer by layer through the plurality of residual convolutional layers of the inference subject network to extract depth features of the non-key frame; and upsampling the depth features of the non-key frame, through the fast upsampling layer, into a high-resolution super-resolution frame of the non-key frame.

According to a first aspect of the present disclosure, each residual convolution layer of the first and second super-resolution models includes a plurality of residual convolution blocks connected in series, and each residual convolution block includes a plurality of basic convolution operation units configured to perform a convolution operation on an input feature to extract a feature of a deeper layer.

According to a first aspect of the disclosure, in the first and second super-resolution models, the input features and output features of the inference subject network are connected across layers; the input features and output features of each residual convolution layer are connected across layers; and the output features of the first basic convolution operation unit and of the last basic convolution operation unit in each residual convolution block of the residual convolution layer are connected across layers.

According to a second aspect of the present disclosure, there is provided a video processing apparatus comprising: a grouping unit configured to divide a video into groups including a predetermined number of video frames and classify the video frames in each group into key frames and non-key frames; a video processing unit configured to perform super-resolution processing on the key frames and the non-key frames of each group through a super-resolution model, respectively, to obtain super-resolution frames of the key frames and the non-key frames of each group; an encoding unit configured to encode the super-resolution frame of the key frame and the super-resolution frame of the non-key frame of each group as a super-resolution video, wherein the super-resolution model is a model configured to use the super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frame.

According to a second aspect of the disclosure, the super-resolution model comprises a first super-resolution model and a second super-resolution model, the super-resolution model being trained by: performing single-frame image inference on the key frame through the first super-resolution model to obtain a super-resolution frame of the key frame; fusing the super-resolution frame of the key frame and the characteristics of the non-key frame based on a second super-resolution model, and performing inference based on the fused characteristics to obtain the super-resolution frame of the non-key frame; determining a first loss function for a first super-resolution model from the key frame and the super-resolution frame of the key frame, and adjusting a parameter of the first super-resolution model based on a value of the first loss function; determining a second loss function for a second super-resolution model from the non-key frames and the super-resolution frames of the non-key frames, and adjusting parameters of the second super-resolution model based on the second loss function.

According to a second aspect of the disclosure, the grouping unit is configured to: classifying the predetermined type of video frames in the group as key frames and classifying the remaining video frames as non-key frames, or classifying the first frame in the group as a key frame and classifying the remaining video frames as non-key frames.

According to a second aspect of the present disclosure, each of the first and second super-resolution models includes an inference subject network including a plurality of residual convolutional layers and a fast upsampling layer for performing an inference operation, wherein the number of channels and the number of residual convolutional layers included in the first super-resolution model are higher than the number of channels and the number of residual convolutional layers included in the second super-resolution model.

According to a second aspect of the disclosure, the first super-resolution model is configured to: calculate the depth features of the key frame layer by layer through the residual convolutional layers; and upsample the depth features of the key frame, through the fast upsampling layer, into a high-resolution super-resolution frame of the key frame.

According to a second aspect of the disclosure, the second super-resolution model further comprises a first feature extractor, a second feature extractor and a splicing unit, wherein the second super-resolution model is configured to: extract features of the super-resolution frame of the key frame by the first feature extractor and features of the non-key frame by the second feature extractor; perform feature splicing and convolution on the extracted features of the super-resolution frame of the key frame and the features of the non-key frame through the splicing unit to obtain the fused features; compute on the fused features layer by layer through the plurality of residual convolutional layers of the inference subject network to extract depth features of the non-key frame; and upsample the depth features of the non-key frame, through the fast upsampling layer, into a high-resolution super-resolution frame of the non-key frame.

According to a second aspect of the present disclosure, each of the residual convolution layers of the first and second super-resolution models includes a plurality of residual convolution blocks connected in series, and each of the residual convolution blocks includes a plurality of basic convolution operation units configured to perform a convolution operation on an input feature to extract a feature of a deeper layer.

According to a second aspect of the present disclosure, in the first and second super-resolution models, the input features and output features of the inference subject network are connected across layers; the input features and output features of each residual convolution layer are connected across layers; and the output features of the first basic convolution operation unit and of the last basic convolution operation unit in each residual convolution block of the residual convolution layer are connected across layers.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video processing method as described above.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method as described above.

According to a fifth aspect of the present disclosure, there is provided a computer program product in which instructions are executed by at least one processor in an electronic device to perform the video processing method as described above.

The technical solutions provided by the embodiments of the disclosure bring at least the following beneficial effects: the super-resolution result of the key frames is used to guide the super-resolution processing of the non-key frames, so that a good subjective effect can be ensured even when a lightweight processing model is used for the non-key frames, and both processing speed and subjective quality benefit.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

Fig. 1 is a flowchart illustrating a training method of a video super-resolution model according to an exemplary embodiment of the present disclosure.

Fig. 2 is a schematic diagram illustrating a process in which a video super-resolution model performs key frame inference according to an exemplary embodiment of the present disclosure.

Fig. 3 is a schematic diagram illustrating a process in which a video super-resolution model performs non-key frame inference according to an exemplary embodiment of the present disclosure.

Fig. 4 is a schematic diagram illustrating a model network structure for performing key frame inference according to an exemplary embodiment of the present disclosure.

Fig. 5 is a schematic diagram illustrating a model network structure for performing non-key frame inference according to an exemplary embodiment of the present disclosure.

Fig. 6 is a block diagram illustrating a training apparatus of a video super-resolution model according to an exemplary embodiment of the present disclosure.

Fig. 7 is a flowchart illustrating a video processing method according to an exemplary embodiment of the present disclosure.

Fig. 8 is a block diagram illustrating a video processing apparatus according to an exemplary embodiment of the present disclosure.

Fig. 9 is a block diagram illustrating an electronic device for performing a video processing method according to an exemplary embodiment of the present disclosure.

Fig. 10 is a block diagram illustrating an electronic device for performing a video processing method according to another exemplary embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Here, the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any several of the items", and "all of the items". For example, "includes at least one of A and B" covers the following three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. As another example, "at least one of step one and step two is performed" covers the following three parallel cases: (1) step one is performed; (2) step two is performed; (3) both step one and step two are performed.

As shown in Fig. 1, first, in step S110, a video is divided into groups each including a predetermined number of video frames, and the video frames in each group are classified into key frames and non-key frames. According to an exemplary embodiment of the present disclosure, the video may first be decoded into a sequence of video frames, and the video frames may then be divided, in order, into a plurality of groups. For example, each group may contain the same predetermined number of frames (e.g., 10 frames), i.e., the video is divided at 10-frame intervals.

According to an exemplary embodiment of the present disclosure, the first frame in the divided group may be classified as a key frame, and the remaining video frames may be classified as non-key frames. Alternatively, certain types of frames (e.g., I-frames) in the divided group may be classified as key frames and the remaining video frames as non-key frames. The key frame may be, for example, a video frame that has a large amount of information and is generally used as a reference frame in the encoding/decoding process of other video frames.
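The grouping and classification of step S110 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `FrameGroup` and `split_into_groups`, the string frame-type labels, and the fallback to the first frame when a group contains no frame of the requested type are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FrameGroup:
    key_indices: List[int]      # video-level indices of the key frames
    non_key_indices: List[int]  # video-level indices of the remaining frames


def split_into_groups(frame_types: Sequence[str], group_size: int = 10,
                      key_type: str = "") -> List[FrameGroup]:
    """Split a decoded frame sequence into consecutive groups (hypothetical helper).

    frame_types[i] is the coding type of frame i, e.g. "I", "P" or "B".
    With an empty key_type, the first frame of each group is the key frame;
    otherwise every frame of that type in the group is a key frame.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups = []
    for start in range(0, len(frame_types), group_size):
        indices = range(start, min(start + group_size, len(frame_types)))
        keys = [i for i in indices if key_type and frame_types[i] == key_type]
        if not keys:
            # Assumption: fall back to the first frame of the group.
            keys = [start]
        non_keys = [i for i in indices if i not in keys]
        groups.append(FrameGroup(keys, non_keys))
    return groups


# Seven frames split at 4-frame intervals, first frame of each group as key frame.
groups = split_into_groups(["I", "P", "B", "P", "I", "P", "P"], group_size=4)
```

The last group may be shorter than `group_size` when the frame count is not a multiple of it; how the disclosure handles that case is not stated, so the sketch simply keeps the shorter group.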

Next, in step S120, single-frame image inference is performed on the key frame by the first super-resolution model to obtain a super-resolution frame of the key frame. According to an exemplary embodiment of the present disclosure, the first super-resolution model may be implemented by a convolutional neural network. The first super-resolution model may perform layer-by-layer convolution on the input key frame to obtain the features of the key frame and finally a super-resolution frame of the key frame. This process is called the super-resolution inference process of the key frame and is described below with reference to Fig. 2.

As shown in Fig. 2, assuming that an I frame (reference frame I) is used as the key frame in each group, the plurality of convolutional layers included in the first super-resolution model (SR model) extract features from reference frame I layer by layer. That is, the 1st convolutional layer extracts the 1st-layer features from the key frame, the 2nd convolutional layer extracts the 2nd-layer features from the 1st-layer features, and so on; the output of the last (n-th) convolutional layer can be used as the super-resolution frame of the key frame. This output is saved for use in the subsequent super-resolution inference process of the non-key frames.

Then, in step S130, the features of the super-resolution frame of the key frame and of the non-key frames are fused by a second super-resolution model, and inference is performed based on the fused features to obtain the super-resolution frames of the non-key frames. According to an exemplary embodiment of the present disclosure, the second super-resolution model may also be implemented with a structure having a plurality of convolutional layers, similar to the first super-resolution model. As shown in Fig. 3, the second super-resolution model (SR model) may obtain the super-resolution frames of the non-key frames (the (i+1)-th frame result, ..., the (i+n)-th frame result) using the super-resolution result of the key frame (key frame SR result) obtained in step S120 and the remaining non-key frames in the video group shown in Fig. 3 (e.g., the (i+1)-th frame, the (i+2)-th frame, ..., the (i+n)-th frame, i.e., all frames other than the i-th frame).
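The data flow of steps S120 and S130 within one group can be sketched as below. The function name `super_resolve_group` and the two callables are hypothetical stand-ins for the first and second super-resolution models; the point is only that the key frame is processed once and its saved result is passed as a reference to every non-key frame of the group.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def super_resolve_group(
    key_frame: Frame,
    non_key_frames: Sequence[Frame],
    key_model: Callable[[Frame], Frame],
    ref_model: Callable[[Frame, Frame], Frame],
) -> Tuple[Frame, List[Frame]]:
    """Run both models on one group (illustrative sketch).

    key_model(frame)            -> super-resolution frame of the key frame
    ref_model(sr_key, frame)    -> super-resolution frame of a non-key frame,
                                   computed with sr_key as reference
    """
    # Step S120: single-frame inference on the key frame; the result is kept.
    sr_key = key_model(key_frame)
    # Step S130: every non-key frame reuses the saved key-frame result.
    sr_non_keys = [ref_model(sr_key, frame) for frame in non_key_frames]
    return sr_key, sr_non_keys


# Usage with string placeholders instead of real images and networks.
sr_key, sr_rest = super_resolve_group(
    "f0", ["f1", "f2"],
    key_model=lambda f: f"SR({f})",
    ref_model=lambda ref, f: f"SR({f}|{ref})",
)
```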

According to an exemplary embodiment of the present disclosure, each of the first and second super-resolution models includes an inference subject network including a plurality of residual convolution layers and a fast upsampling layer, which are cascaded, wherein the number of channels and the number of residual convolution layers included in the first super-resolution model are higher than the number of channels and the number of residual convolution layers included in the second super-resolution model.

For example, the first super-resolution model may have more channels and more convolutional layers than the second super-resolution model, so that the convolutional neural network of the first super-resolution model is wider and deeper, for example with twice the width and depth of the second super-resolution model. This ensures the quality of super-resolution processing on the key frames.
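To give a feel for what a wider and deeper network costs, the following sketch counts the parameters of a plain stack of same-width convolutions. The channel counts (32 and 64), layer counts (8 and 16) and the 3x3 kernel are hypothetical values chosen only for illustration; the disclosure does not state concrete sizes.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int = 3) -> int:
    """Weights plus biases of one 2-D convolution layer."""
    return in_ch * out_ch * kernel * kernel + out_ch


def body_params(channels: int, num_layers: int, kernel: int = 3) -> int:
    """Parameters of num_layers same-width convolutions stacked in series."""
    return num_layers * conv_params(channels, channels, kernel)


# Hypothetical sizes: the first model has twice the width and twice the depth.
light = body_params(channels=32, num_layers=8)    # 73984
heavy = body_params(channels=64, num_layers=16)   # 590848
ratio = heavy / light                             # just under 8
```

Under these assumptions, doubling both width and depth multiplies the parameter count of the convolution stack by roughly eight, which is why reserving the larger model for the few key frames matters.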

According to an exemplary embodiment of the present disclosure, the first super-resolution model may calculate the depth feature of the key frame layer by layer through the plurality of residual convolutional layers, and up-sample the depth feature of the key frame into a super-resolution frame having a high resolution through the fast up-sampling layer.
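The disclosure does not specify how the fast upsampling layer is built. One common choice in super-resolution networks is a sub-pixel (pixel shuffle, or depth-to-space) rearrangement, and the sketch below assumes that choice purely for illustration. It rearranges a feature map with `C * r * r` channels into `C` channels at `r` times the spatial resolution; the name `pixel_shuffle` and the nested-list feature layout are assumptions of the example.

```python
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][col]


def pixel_shuffle(feat: Feature, scale: int) -> Feature:
    """Depth-to-space rearrangement (assumed upsampling scheme, not from the source)."""
    c_in = len(feat)
    if scale <= 0 or c_in % (scale * scale) != 0:
        raise ValueError("channel count must be a multiple of scale * scale")
    h, w = len(feat[0]), len(feat[0][0])
    c_out = c_in // (scale * scale)
    out = [[[0.0] * (w * scale) for _ in range(h * scale)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * scale):
            for x in range(w * scale):
                # Each output pixel comes from one of the scale*scale input
                # channels that belong to output channel c.
                src = c * scale * scale + (y % scale) * scale + (x % scale)
                out[c][y][x] = feat[src][y // scale][x // scale]
    return out


# Four 1x1 channels become one 2x2 channel.
upsampled = pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], scale=2)
```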

According to an exemplary embodiment of the present disclosure, fusing the features of the super-resolution frame of the key frame and of the non-key frame through the second super-resolution model, and performing inference based on the fused features to obtain the super-resolution frame of the non-key frame, includes: extracting features of the super-resolution frame of the key frame by a first feature extractor and features of the non-key frame by a second feature extractor; performing feature splicing and convolution on the extracted features of the super-resolution frame of the key frame and the features of the non-key frame through a splicing unit to obtain the fused features; computing on the fused features layer by layer through the plurality of residual convolutional layers of the inference subject network to extract depth features of the non-key frame; and upsampling the depth features of the non-key frame, through the fast upsampling layer, into a high-resolution super-resolution frame of the non-key frame.

That is to say, the second super-resolution model may have two input branches, which receive the previously stored super-resolution frame of the key frame and the non-key frame, respectively. After the feature extractors extract depth features of the same dimensionality from the two inputs, the splicing unit fuses the two features, thereby helping the inference subject network of the second super-resolution model to process the non-key frame better. Here, each feature extractor may be a convolution module with a predetermined number of layers and dimensions.

As described above, since the second super-resolution model may be a lightweight inference model, the super-resolution result of a non-key frame can be obtained with a small amount of computation; and because the model performs inference with reference to the super-resolution result of the key frame, the super-resolution result of the non-key frame is still of good quality.
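The computational saving can be estimated with simple arithmetic. The sketch below compares the two-model scheme against running the larger model on every frame; the function name `relative_cost` and the cost ratio of 0.125 between the lightweight and the larger model are hypothetical, and the estimate ignores the extra cost of the feature extractors and the splicing unit.

```python
def relative_cost(group_size: int, key_frames_per_group: int,
                  light_to_heavy_cost: float) -> float:
    """Cost of the two-model scheme relative to using the larger model on all frames.

    The larger model's per-frame cost is taken as 1.0; light_to_heavy_cost is
    the lightweight model's per-frame cost on that scale (an assumed figure).
    """
    if not 0 < key_frames_per_group <= group_size:
        raise ValueError("need 1..group_size key frames per group")
    non_key = group_size - key_frames_per_group
    mixed = key_frames_per_group * 1.0 + non_key * light_to_heavy_cost
    return mixed / group_size


# One key frame per 10-frame group, lightweight model at 1/8 of the cost.
cost = relative_cost(group_size=10, key_frames_per_group=1,
                     light_to_heavy_cost=0.125)
```

With these illustrative numbers the group costs about 21% of processing every frame with the larger model.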

An example of a specific structure of a super-resolution model according to an exemplary embodiment of the present disclosure will be explained below with reference to fig. 4 and 5.

Fig. 4 and 5 are schematic diagrams illustrating network structures of a first super-resolution model and a second super-resolution model, respectively, according to an exemplary embodiment of the present disclosure.

As shown in fig. 4, the first super-resolution model includes a subject inference network consisting of a cascade of multiple residual convolutional layers and a fast upsampling layer. Each residual convolutional layer includes a plurality of residual convolutional blocks connected in series, and each residual convolutional block includes a plurality of basic convolution operation units configured to perform convolution operation on the input features to extract deeper features. The fast upsampling layer performs fast upsampling on the output of the last residual convolutional layer to obtain a final super-resolution result. In Fig. 4, after the key frame LR_I is input, the super-resolution frame SR_I is obtained after passing through the plurality of residual convolution layers and the fast upsampling layer.

According to the exemplary embodiment of the disclosure, cross-layer connections can be set between the residual convolution layers, between the residual convolution blocks and between the basic convolution operation units to add together features of different depths, which helps gradients propagate back effectively through the convolutional neural network and yields a better network optimization result.

For example, as shown in fig. 4, the input features and the output features of the subject inference network of the first super-resolution model are cross-layer connected (global cross-layer connection shown in fig. 4), the input features and the output features of each residual convolution layer of the subject inference network of the first super-resolution model are cross-layer connected (long cross-layer connection shown in fig. 4), and the output features of the first basic convolution operation unit and the last convolution operation unit in each residual block of the residual convolution layer are cross-layer connected (short cross-layer connection shown in fig. 4). Here, cross-layer connection of features refers to adding features of different layers.
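The three levels of cross-layer connection can be sketched on plain feature vectors as follows. This is only an illustration of where the additions happen: the basic convolution operation units are replaced by arbitrary callables, and the names `residual_block`, `residual_layer` and `inference_body` are assumptions of the example.

```python
from typing import Callable, List, Sequence

Vec = List[float]
Unit = Callable[[Vec], Vec]


def add(a: Vec, b: Vec) -> Vec:
    """Cross-layer connection: element-wise addition of two features."""
    return [x + y for x, y in zip(a, b)]


def residual_block(x: Vec, units: Sequence[Unit]) -> Vec:
    """Short connection: first unit's output is added to the last unit's output."""
    if len(units) < 2:
        raise ValueError("a block needs at least two units")
    first = units[0](x)
    out = first
    for unit in units[1:]:
        out = unit(out)
    return add(first, out)


def residual_layer(x: Vec, blocks: Sequence[Sequence[Unit]]) -> Vec:
    """Long connection: the layer's input is added to the layer's output."""
    out = x
    for units in blocks:
        out = residual_block(out, units)
    return add(x, out)


def inference_body(x: Vec, layers: Sequence[Sequence[Sequence[Unit]]]) -> Vec:
    """Global connection: the network's input is added to the network's output."""
    out = x
    for blocks in layers:
        out = residual_layer(out, blocks)
    return add(x, out)


def double(v: Vec) -> Vec:
    """Stand-in for a convolution unit."""
    return [2.0 * t for t in v]


# One layer containing one block of two units.
result = inference_body([1.0], [[[double, double]]])
```

In a real network the units would be convolutions that preserve the feature shape, so that each addition is well defined.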

The second super-resolution model shown in Fig. 5 includes two feature extractors, which extract the feature Feat_I of the super-resolution frame SR_I of the key frame and the feature Feat_p of the non-key frame LR_P, respectively. The extracted features Feat_I and Feat_p are spliced in a splicing unit (i.e., the two feature matrices are combined into one feature matrix) and then convolved to obtain the fused feature $\mathrm{Feat}_{I\text{-}p} = \mathrm{Conv}(\mathrm{concat}\{\mathrm{Feat}_I, \mathrm{Feat}_p\})$, where concat denotes the feature splicing operation and Conv denotes a convolution operation performed by a convolutional layer of the convolutional neural network. Then, super-resolution inference is performed on the fused feature through a subject inference network consisting of a cascade of multiple residual convolution layers and a fast upsampling layer to obtain the super-resolution frame of each non-key frame.
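The splicing-then-convolution fusion performed by the splicing unit can be sketched as below. The sketch assumes that the two extracted features already share the same spatial size, that splicing is done along the channel dimension, and that the convolution uses a 1x1 kernel; none of these details is fixed by the disclosure, and the function names are hypothetical.

```python
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][col]


def concat_channels(a: Feature, b: Feature) -> Feature:
    """Splice two features along the channel dimension."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("features must share the same spatial size")
    return a + b


def conv1x1(feat: Feature, weights: List[List[float]],
            bias: List[float]) -> Feature:
    """1x1 convolution; weights[o][i] maps input channel i to output channel o."""
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[bias[o] + sum(weights[o][i] * feat[i][y][x] for i in range(len(feat)))
          for x in range(w)]
         for y in range(h)]
        for o in range(len(weights))
    ]


def fuse(feat_i: Feature, feat_p: Feature,
         weights: List[List[float]], bias: List[float]) -> Feature:
    """Fused feature = Conv(concat{Feat_I, Feat_p})."""
    return conv1x1(concat_channels(feat_i, feat_p), weights, bias)


# One-channel 1x2 features, fused back to one channel by averaging.
fused = fuse([[[1.0, 2.0]]], [[[10.0, 20.0]]], weights=[[0.5, 0.5]], bias=[0.0])
```

The convolution after splicing lets the network learn how much weight to give the key-frame reference versus the non-key frame's own features, and it can also reduce the doubled channel count back to the width expected by the inference network.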

Similar to the first super-resolution model, the input features and the output features of the subject inference network of the second super-resolution model are cross-layer connected (global cross-layer connection shown in fig. 5), the input features and the output features of each residual convolution layer of the subject inference network of the second super-resolution model are cross-layer connected (long cross-layer connection shown in fig. 5), and the output features of the first basic convolution operation unit and the last convolution operation unit in each residual block of the residual convolution layer are cross-layer connected (short cross-layer connection shown in fig. 5).

It should be understood that the network structures, the cross-layer connection manner, and the like of the above first and second super-resolution models are merely illustrative, and those skilled in the art may use other network structures suitable for super-resolution image processing to obtain super-resolution frames of key frames and obtain super-resolution frames of non-key frames based on the super-resolution frames of key frames and non-key frames.

After the super-resolution frames of the key frame and the non-key frame are obtained, a first loss function for the first super-resolution model may be determined from the super-resolution frames of the key frame and parameters of the first super-resolution model may be adjusted based on a value of the first loss function at step S140, and a second loss function for the second super-resolution model may be determined from the super-resolution frames of the non-key frame and parameters of the second super-resolution model may be adjusted based on the second loss function at step S150.

For example, the first loss function of the first super-resolution model may be determined from the absolute difference between the high-resolution frame HR_I of the key frame and the super-resolution frame SR_I of the key frame, $\mathrm{Loss1} = |HR_I - SR_I|$, and the second loss function of the second super-resolution model may be determined from the absolute difference between the high-resolution frame HR_p of the non-key frame and the super-resolution frame SR_p of the non-key frame, $\mathrm{Loss2} = |HR_p - SR_p|$. The parameters of the subject inference network of the first super-resolution model are then adjusted according to the first loss function, and the parameters of the feature extractors, the splicing unit and the subject inference network of the second super-resolution model are adjusted according to the second loss function, until the first loss function and the second loss function converge to predetermined target values, thereby completing the model training process.
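An absolute-difference loss of this kind can be computed as in the sketch below. Averaging the per-pixel absolute differences over the frame is an assumption made here to obtain a single scalar; the text only states that the loss is based on the absolute difference between the high-resolution frame and the super-resolution frame. The function name `l1_loss` and the single-channel nested-list frames are likewise illustrative.

```python
from typing import List

Image = List[List[float]]  # single-channel frame indexed as [row][col]


def l1_loss(target: Image, output: Image) -> float:
    """Mean absolute difference between a ground-truth frame and a model output."""
    if len(target) != len(output) or any(
        len(t_row) != len(o_row) for t_row, o_row in zip(target, output)
    ):
        raise ValueError("frames must have the same shape")
    diffs = [abs(t - o)
             for t_row, o_row in zip(target, output)
             for t, o in zip(t_row, o_row)]
    if not diffs:
        raise ValueError("frames must not be empty")
    return sum(diffs) / len(diffs)


# The two models are scored separately against their own ground truth.
loss1 = l1_loss([[1.0, 2.0]], [[0.5, 3.0]])   # key frame: HR_I vs SR_I
loss2 = l1_loss([[4.0, 4.0]], [[4.0, 4.0]])   # non-key frame: HR_p vs SR_p
```

Each loss drives only its own model: the first updates the key-frame network, and the second updates the feature extractors, the splicing unit and the non-key-frame network.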

As described above, the video super-resolution model trained by the above method uses the characteristics of the key frames of the video, and uses the super-resolution processing result of the key frames as a reference for super-resolution processing of non-key frames, thereby reducing the amount of model calculation and ensuring the processing effect of the model.

As shown in fig. 6, the training apparatus 600 for a video super-resolution model according to an exemplary embodiment of the present disclosure may include a grouping unit 610, a first inference unit 620, a second inference unit 630, a first parameter adjustment unit 640, and a second parameter adjustment unit 650.

The grouping unit 610 is configured to divide a video into groups including a predetermined number of video frames, and classify the video frames in each group into key frames and non-key frames.

According to an exemplary embodiment of the present disclosure, the grouping unit 610 classifies predetermined types of video frames in the group as key frames and the remaining video frames as non-key frames, or classifies a first frame in the group as a key frame and the remaining video frames as non-key frames.

The first inference unit 620 is configured to perform single-frame image inference on the key frame by the first super-resolution model to obtain a super-resolution frame of the key frame. The second inference unit 630 is configured to fuse the super-resolution frames of the key frames and the features of the non-key frames based on a second super-resolution model, and perform inference to obtain the super-resolution frames of the non-key frames based on the fused features.

The first parameter adjustment unit 640 is configured to determine a first loss function for the first super-resolution model from the key frame and the super-resolution frame of the key frame, and to adjust parameters of the first super-resolution model based on a value of the first loss function. The second parameter adjustment unit 650 is configured to determine a second loss function for the second super-resolution model from the non-key frames and the super-resolution frames of the non-key frames, and to adjust parameters of the second super-resolution model based on a value of the second loss function.

According to an exemplary embodiment of the present disclosure, each of the first and second super-resolution models includes an inference subject network including a plurality of residual convolutional layers and a fast upsampling layer for performing an inference operation, wherein the number of channels and the number of residual convolutional layers included in the first super-resolution model are higher than the number of channels and the number of residual convolutional layers included in the second super-resolution model.
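The size relationship between the two models can be illustrated with a small configuration sketch. The channel and layer counts used below are hypothetical values chosen for illustration and are not taken from the disclosure; `BodyConfig`, `is_valid_pair` and `relative_conv_cost` are likewise assumed names. The cost function is only a rough proxy based on the fact that a convolution with C input and C output channels costs on the order of C x C multiplications per pixel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BodyConfig:
    """Size of an inference subject network (hypothetical parameterization)."""
    channels: int
    residual_layers: int


def is_valid_pair(first: BodyConfig, second: BodyConfig) -> bool:
    """The first (key-frame) model must be larger than the second in both respects."""
    return (
        first.channels > second.channels
        and first.residual_layers > second.residual_layers
    )


def relative_conv_cost(cfg: BodyConfig) -> int:
    """Rough per-pixel cost proxy: channels_in * channels_out per residual layer."""
    return cfg.channels * cfg.channels * cfg.residual_layers
```

With the hypothetical values `BodyConfig(64, 8)` and `BodyConfig(32, 4)`, the proxy gives 32768 versus 4096, which suggests why applying the lighter second model to the more numerous non-key frames may reduce the overall amount of computation.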

According to an exemplary embodiment of the present disclosure, the first inference unit 620 is configured to: calculate the depth features of the key frame layer by layer through the plurality of residual convolutional layers; and upsample the depth features of the key frame through the fast upsampling layer into a high-resolution super-resolution frame of the key frame.
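The disclosure does not state here which operation the fast upsampling layer uses. One common choice in super-resolution networks is sub-pixel rearrangement (often called pixel shuffle), which converts C x r x r low-resolution channels into C channels at r times the resolution without interpolation. The sketch below is given purely under that assumption; the function name `pixel_shuffle` and the nested-list feature layout are illustrative.

```python
from typing import List

Plane = List[List[float]]  # one channel: rows of values


def pixel_shuffle(feat: List[Plane], r: int) -> List[Plane]:
    """Rearrange C*r*r channels of size HxW into C channels of size (H*r)x(W*r)."""
    if r < 1 or len(feat) % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r*r")
    h, w = len(feat[0]), len(feat[0][0])
    out: List[Plane] = []
    for c in range(len(feat) // (r * r)):
        plane = [[0.0] * (w * r) for _ in range(h * r)]
        for i in range(r):          # vertical offset inside each r x r output cell
            for j in range(r):      # horizontal offset inside each r x r output cell
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        plane[y * r + i][x * r + j] = src[y][x]
        out.append(plane)
    return out
```

For example, four 1x1 channels holding 1, 2, 3 and 4 become a single 2x2 plane `[[1, 2], [3, 4]]` when `r` is 2.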

According to an exemplary embodiment of the present disclosure, the second super-resolution model further comprises a first feature extractor, a second feature extractor and a stitching unit, wherein the second inference unit 630 is configured to: extract features of the super-resolution frame of the key frame by the first feature extractor and features of the non-key frame by the second feature extractor; perform feature splicing and convolution on the extracted features of the super-resolution frame of the key frame and the features of the non-key frame through the stitching unit to obtain the fused features; compute on the fused features layer by layer through the plurality of residual convolutional layers of the inference subject network to extract depth features of the non-key frame; and upsample the depth features of the non-key frame through the fast upsampling layer into a high-resolution super-resolution frame of the non-key frame.
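The splicing-and-convolution step of the stitching unit can be sketched as a channel-wise concatenation followed by a convolution. For brevity the sketch assumes a 1x1 convolution (a per-pixel linear map), assumes that both feature maps already share the same spatial size, and stores each feature map as a flat list of per-pixel channel vectors; none of these details, nor the name `splice_and_convolve`, is specified in the disclosure.

```python
from typing import List, Sequence

Pixel = Sequence[float]          # channel vector at one spatial position
FeatureMap = Sequence[Pixel]     # flattened list of positions


def splice_and_convolve(
    key_feat: FeatureMap,
    non_key_feat: FeatureMap,
    weights: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate channels per pixel, then apply a 1x1 convolution (weights: out x in)."""
    if len(key_feat) != len(non_key_feat):
        raise ValueError("feature maps must cover the same spatial positions")
    fused: List[List[float]] = []
    for k_px, n_px in zip(key_feat, non_key_feat):
        stacked = list(k_px) + list(n_px)  # feature splicing along the channel axis
        if any(len(row) != len(stacked) for row in weights):
            raise ValueError("each weight row must match the spliced channel count")
        fused.append([sum(w * v for w, v in zip(row, stacked)) for row in weights])
    return fused
```

The fused features returned here would then be passed to the residual convolutional layers of the inference subject network.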

According to an exemplary embodiment of the present disclosure, each of the residual convolution layers of the first and second super-resolution models includes a plurality of residual blocks connected in series, and each of the residual blocks includes a plurality of basic convolution operation units.

According to an exemplary embodiment of the present disclosure, the input features and the output features of the first and second super-resolution models are connected across layers; the input characteristic and the output characteristic of each residual convolution layer of the first super-resolution model and the second super-resolution model are connected in a cross-layer mode; the output characteristics of the first basic convolution operation unit and the last convolution operation unit in each residual block of the residual convolution layer are connected in a cross-layer mode.
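The three nested levels of cross-layer connection can be sketched as follows. The sketch assumes that each connection is realized as an element-wise addition, which is the usual form of a residual connection but is not stated in the disclosure, and it represents features as flat lists of floats with each basic convolution operation unit modeled as an arbitrary callable. All function names are illustrative.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Unit = Callable[[Vector], Vector]  # stand-in for a basic convolution operation unit


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum used for every cross-layer connection (an assumption)."""
    return [x + y for x, y in zip(a, b)]


def residual_block(x: Vector, units: Sequence[Unit]) -> Vector:
    """Short connection: output of the first unit joins the output of the last unit."""
    first = units[0](x)
    y = first
    for unit in units[1:]:
        y = unit(y)
    return add(first, y)


def residual_layer(x: Vector, blocks: Sequence[Sequence[Unit]]) -> Vector:
    """Long connection: the layer input joins the layer output."""
    y = x
    for block in blocks:
        y = residual_block(y, block)
    return add(x, y)


def inference_body(x: Vector, layers: Sequence[Sequence[Sequence[Unit]]]) -> Vector:
    """Global connection: the network input joins the network output."""
    y = x
    for layer in layers:
        y = residual_layer(y, layer)
    return add(x, y)
```

With a toy unit that doubles its input and a single block of two units, an input of `[1.0]` yields `[6.0]` after the block, `[7.0]` after the layer, and `[8.0]` after the whole body, which makes the contribution of each connection level visible.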

First, in step S710, a video is divided into groups including a predetermined number of video frames, and the video frames in each group are classified into key frames and non-key frames.

As described above, according to an exemplary embodiment of the present disclosure, it is possible to sequentially divide a frame sequence of a video into a plurality of groups by a predetermined number and determine key frames and non-key frames in each group. According to an exemplary embodiment of the present disclosure, the first frame in each group may be determined as a key frame, and the remaining video frames may be determined as non-key frames. Alternatively, the predetermined type of video frames in each group may be determined as key frames and the remaining video frames as non-key frames.
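The sequential grouping with the first frame of each group taken as the key frame can be sketched as follows. The names `FrameGroup` and `group_frames` are illustrative, and the handling of a shorter final group (kept as-is, with its first frame as key frame) is an assumption, since the disclosure does not say how a trailing partial group is treated.

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence


@dataclass
class FrameGroup:
    key_frame: Any
    non_key_frames: List[Any] = field(default_factory=list)


def group_frames(frames: Sequence[Any], group_size: int) -> List[FrameGroup]:
    """Split a frame sequence into consecutive groups; the first frame of each is the key frame."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups: List[FrameGroup] = []
    for start in range(0, len(frames), group_size):
        chunk = list(frames[start:start + group_size])
        groups.append(FrameGroup(key_frame=chunk[0], non_key_frames=chunk[1:]))
    return groups
```

The alternative described above, in which a predetermined type of video frame is chosen as the key frame, would replace the `chunk[0]` selection with a test on the frame type.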

Next, in step S720, super-resolution processing is performed on the key frames and the non-key frames of each group by a super-resolution model, respectively, to obtain super-resolution frames of the key frames and the non-key frames of each group.

Then, in step S730, the super-resolution frame of the key frame and the super-resolution frame of the non-key frame of each group are encoded into a super-resolution video, wherein the super-resolution model is a model configured to use the super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frame.

According to an exemplary embodiment of the present disclosure, the super-resolution model includes a first super-resolution model and a second super-resolution model, and the super-resolution model may be trained by performing single-frame image inference on a key frame through the first super-resolution model to obtain a super-resolution frame of the key frame, fusing features of the super-resolution frame of the key frame and the non-key frame based on the second super-resolution model, and performing inference to obtain the super-resolution frame of the non-key frame based on the fused features. Here, the first and second super-resolution models are trained based on the method described above with reference to fig. 1 to 5, and have the same structure as the first and second super-resolution models described with reference to fig. 1 to 5, and thus a description about the first and second super-resolution models will not be repeated here.
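The inference flow of steps S710 to S730 can be sketched with the two models passed in as callables. The sketch assumes the first frame of each group is the key frame, and the function name `super_resolve_video` and its parameters are illustrative. The final encoding into a super-resolution video is outside the sketch, which simply returns the super-resolution frames in their original order.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # low-resolution frame type
S = TypeVar("S")  # super-resolution frame type


def super_resolve_video(
    frames: Sequence[F],
    group_size: int,
    key_model: Callable[[F], S],
    non_key_model: Callable[[F, S], S],
) -> List[S]:
    """Run the first model on each key frame and the second model on the other frames."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    output: List[S] = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        key_sr = key_model(group[0])          # single-frame inference on the key frame
        output.append(key_sr)
        for frame in group[1:]:
            # The key frame's super-resolution result serves as the reference.
            output.append(non_key_model(frame, key_sr))
    return output
```

Because every non-key frame in a group depends only on that group's key-frame result, groups could in principle be processed independently of one another.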

Fig. 8 is a block diagram illustrating a video super-resolution device according to an exemplary embodiment of the present disclosure.

As shown in fig. 8, a video super-resolution device 800 according to an exemplary embodiment of the present disclosure may include a grouping unit 810, a video processing unit 820, and an encoding unit 830.

The grouping unit 810 is configured to divide a video into groups including a predetermined number of video frames, and classify the video frames in each group into key frames and non-key frames. As described above, according to an exemplary embodiment of the present disclosure, it is possible to sequentially divide a frame sequence of a video into a plurality of groups by a predetermined number and determine key frames and non-key frames in each group. According to an exemplary embodiment of the present disclosure, the first frame in each group may be determined as a key frame, and the remaining video frames may be determined as non-key frames. Alternatively, the predetermined type of video frames in each group may be determined as key frames and the remaining video frames as non-key frames.

The video processing unit 820 is configured to perform super-resolution processing on the key frames and the non-key frames of each group by a super-resolution model, respectively, to obtain super-resolution frames of the key frames and the non-key frames of each group, wherein the super-resolution model is a model configured to use the super-resolution frames of the key frames as a reference for the super-resolution processing of the non-key frames.

The encoding unit 830 is configured to encode the super-resolution frames of the key frames and the super-resolution frames of the non-key frames of each group as a super-resolution video.

According to an exemplary embodiment of the present disclosure, the super-resolution model includes a first super-resolution model and a second super-resolution model, the super-resolution model is trained by: performing single-frame image inference on a key frame through a first super-resolution model to obtain a super-resolution frame of the key frame, fusing features of the super-resolution frame of the key frame and the non-key frame based on a second super-resolution model, and performing inference on the fused features to obtain a super-resolution frame of the non-key frame, wherein the first and second super-resolution models are trained based on the method described above with reference to fig. 1-5, and have the same structure as the first and second super-resolution models described with reference to fig. 1-5, and thus, a description about the first and second super-resolution models will not be repeated herein.

Fig. 9 is a block diagram illustrating a structure of an electronic device for video super-resolution processing and/or training of a video super-resolution model according to an exemplary embodiment of the present disclosure. The electronic device 900 may be, for example, a smart phone, a tablet computer, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer or a desktop computer. The electronic device 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.

In general, the electronic device 900 includes: a processor 901 and a memory 902.

Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the video super-resolution model training method and/or the video super-resolution method provided by the method embodiments of the present disclosure as shown in fig. 2-7.

In some embodiments, the electronic device 900 may further optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.

The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.

The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.

The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. In this case, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of the electronic device 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the electronic device 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 900. The display screen 905 may even be arranged in a non-rectangular irregular shape, i.e. a specially shaped screen. The display screen 905 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).

The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.

The positioning component 908 is used to locate a current geographic location of the electronic device 900 to implement navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 909 is used to supply power to various components in the electronic device 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When power source 909 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the electronic device 900 also includes one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.

The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the touch display 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

Pressure sensors 913 may be disposed on the side bezel of terminal 900 and/or underneath touch display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the touch display 905, the processor 901 controls the operability control on the UI according to the pressure operation of the user on the touch display 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the electronic device 900. When a physical button or vendor Logo is provided on the electronic device 900, the fingerprint sensor 914 may be integrated with the physical button or vendor Logo.

The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the touch display 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 905 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 905 is turned down. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.

The proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of the electronic device 900. The proximity sensor 916 is used to capture the distance between the user and the front of the electronic device 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front of the electronic device 900 gradually decreases, the processor 901 controls the touch display 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front of the electronic device 900 gradually increases, the processor 901 controls the touch display 905 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of the electronic device 900, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.

Fig. 10 is a block diagram of another electronic device 1000. For example, the electronic device 1000 may be provided as a server. Referring to fig. 10, the electronic device 1000 includes one or more processors 1110 and a memory 1120. The memory 1120 may include one or more programs for performing the above video super-resolution method and/or video super-resolution model training method. The electronic device 1000 may also include a power component 1130 configured to perform power management of the electronic device 1000, a wired or wireless network interface 1140 configured to connect the electronic device 1000 to a network, and an input/output (I/O) interface 1150. The electronic device 1000 may operate based on an operating system stored in the memory 1120, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a video super-resolution model training method and/or a video super-resolution method according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc memory, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or an eXtreme Digital (xD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store and provide a computer program and any associated data, data files, and data structures to a processor or computer in a non-transitory manner such that the processor or computer can execute the computer program.
The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like, and further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

According to an embodiment of the present disclosure, there may also be provided a computer program product, instructions in which are executable by a processor of a computer device to perform a video super-resolution model training method and/or a video super-resolution method.

According to the video super-resolution model training method and/or the video super-resolution method and apparatus, the electronic device, and the computer-readable storage medium of the present disclosure, the key frames of a video can be processed thoroughly, and the super-resolution result of the key frames is fully used to guide the super-resolution processing of the non-key frames. A good subjective effect can therefore be ensured even when a lightweight processing model is used for the non-key frames, yielding benefits in both processing speed and subjective effect.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A video processing method, comprising:

dividing a video into groups including a predetermined number of video frames, and classifying the video frames in each group into key frames and non-key frames;

respectively performing super-resolution processing on the key frames and the non-key frames of each group through a super-resolution model to obtain super-resolution frames of the key frames and the non-key frames of each group;

encoding the super-resolution frames of the key frames and the super-resolution frames of the non-key frames of each group into a super-resolution video,

wherein the super-resolution model is a model configured to use a super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frame.

2. The method of claim 1, wherein the super-resolution model comprises a first super-resolution model and a second super-resolution model, the super-resolution model being trained by:

performing single-frame image inference on the key frame through the first super-resolution model to obtain a super-resolution frame of the key frame;

fusing the super-resolution frame of the key frame and the characteristics of the non-key frame based on a second super-resolution model, and performing inference based on the fused characteristics to obtain the super-resolution frame of the non-key frame;

determining a first loss function for a first super-resolution model from the key frame and the super-resolution frame of the key frame, and adjusting a parameter of the first super-resolution model based on a value of the first loss function;

determining a second loss function for a second super-resolution model from the non-key frames and the super-resolution frames of the non-key frames, and adjusting parameters of the second super-resolution model based on the second loss function.

3. The method of claim 2, wherein each of the first and second super-resolution models comprises an inference subject network comprising a plurality of residual convolutional layers and a fast upsampling layer for performing an inference operation, wherein the first super-resolution model comprises a higher number of channels and a higher number of residual convolutional layers than the second super-resolution model.

4. The method of claim 3, wherein performing single-frame image inference on the key frame via the first super-resolution model to obtain a super-resolution frame of the key frame comprises:

calculating the depth characteristics of the key frame layer by layer through the residual convolutional layers;

upsampling the depth features of the key frames through the fast upsampling layer into super-resolution frames of the key frames having a high resolution.

5. The method of claim 3, wherein the second super resolution model further comprises a first feature extractor, a second feature extractor, and a stitching unit, wherein fusing the features of the super resolution frames of the key frames and the non-key frames based on the second super resolution model and performing inference to obtain the super resolution frames of the non-key frames based on the fused features comprises:

extracting features of the super-resolution frame of the key frame by the first feature extractor, and extracting features of the non-key frame by the second feature extractor;

performing feature splicing and convolution, through the splicing unit, on the extracted features of the super-resolution frame of the key frame and the extracted features of the non-key frame to obtain the fused features;

performing layer-by-layer computation on the fused features through the plurality of residual convolutional layers of the inference subject network to extract deep features of the non-key frame; and

upsampling the deep features of the non-key frame, through the fast upsampling layer, into the high-resolution super-resolution frame of the non-key frame.
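
The splicing-and-convolution step may be sketched as below, purely as an illustration. It assumes that the two feature extractors have already produced feature maps of the same spatial size, that splicing means channel-wise concatenation, and that the convolution is a 1x1 convolution; none of these details is fixed by the claims. The names `concat_channels`, `conv1x1`, and `fuse`, and the `weights` and `bias` parameters, are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed [channel][row][column]


def concat_channels(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Splice two feature maps along the channel axis."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("feature maps must share height and width")
    return a + b


def conv1x1(
    features: FeatureMap, weights: List[List[float]], bias: List[float]
) -> FeatureMap:
    """Apply a 1x1 convolution; weights are indexed [out_channel][in_channel]."""
    height, width = len(features[0]), len(features[0][0])
    channels_in = len(features)
    return [
        [
            [
                bias[o]
                + sum(weights[o][c] * features[c][h][w] for c in range(channels_in))
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(len(weights))
    ]


def fuse(
    key_sr_features: FeatureMap,
    non_key_features: FeatureMap,
    weights: List[List[float]],
    bias: List[float],
) -> FeatureMap:
    """Concatenate both feature maps, then mix channels with a 1x1 convolution."""
    return conv1x1(concat_channels(key_sr_features, non_key_features), weights, bias)
```

The fused output of `fuse` would then be passed to the residual convolutional layers and the fast upsampling layer of the second model.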

6. A video processing apparatus, comprising:

a grouping unit configured to divide a video into groups including a predetermined number of video frames and classify the video frames in each group into key frames and non-key frames;

a video processing unit configured to perform super-resolution processing on the key frames and the non-key frames of each group through a super-resolution model, respectively, to obtain super-resolution frames of the key frames and the non-key frames of each group;

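an encoding unit configured to encode the super-resolution frames of the key frames and the super-resolution frames of the non-key frames of each group as a super-resolution video,

wherein the super-resolution model is a model configured to use the super-resolution frame of the key frame as a reference for super-resolution processing of the non-key frame.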
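
The cooperation of the grouping unit and the video processing unit may be sketched as follows. This is a minimal sketch under stated assumptions: the first frame of each group is taken as the key frame, the last group may be shorter than the predetermined size, and the two models are passed in as arbitrary callables. The names `group_frames`, `super_resolve_video`, `sr_key_model`, and `sr_non_key_model` are hypothetical, and encoding of the output frames is left out.

```python
from typing import Any, Callable, List, Sequence, Tuple


def group_frames(
    frames: Sequence[Any], group_size: int
) -> List[Tuple[Any, List[Any]]]:
    """Split frames into consecutive groups of (key_frame, non_key_frames).

    Assumption: the first frame of each group is the key frame.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = []
    for start in range(0, len(frames), group_size):
        chunk = list(frames[start:start + group_size])
        groups.append((chunk[0], chunk[1:]))
    return groups


def super_resolve_video(
    frames: Sequence[Any],
    group_size: int,
    sr_key_model: Callable[[Any], Any],
    sr_non_key_model: Callable[[Any, Any], Any],
) -> List[Any]:
    """Super-resolve every frame, reusing each group's key-frame result."""
    output = []
    for key_frame, non_key_frames in group_frames(frames, group_size):
        sr_key = sr_key_model(key_frame)  # single-frame inference
        output.append(sr_key)
        for frame in non_key_frames:
            # The key frame's super-resolution result serves as the reference.
            output.append(sr_non_key_model(frame, sr_key))
    return output
```

For example, with `group_size=3` and seven frames, frames 0, 3, and 6 would be treated as key frames and the remaining four as non-key frames.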

7. The apparatus of claim 6, wherein the super-resolution model comprises a first super-resolution model and a second super-resolution model, the super-resolution model being trained by:

8. The apparatus of claim 6, wherein each of the first and second super-resolution models comprises an inference subject network for performing an inference operation, the inference subject network comprising a plurality of residual convolutional layers and a fast upsampling layer, and wherein the first super-resolution model has a larger number of channels and a larger number of residual convolutional layers than the second super-resolution model.

9. An electronic device, comprising:

at least one processor;

at least one memory storing computer-executable instructions,

wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 5.

10. A computer-readable storage medium storing instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 5.