WO2024046144A1

WO2024046144A1 - Video processing method and related device thereof

Info

Publication number: WO2024046144A1
Application number: PCT/CN2023/113745
Authority: WO
Inventors: 郭佳明; 邹学益; 刘毅; 张恒胜
Original assignee: 华为技术有限公司
Priority date: 2022-08-30
Filing date: 2023-08-18
Publication date: 2024-03-07
Also published as: CN115623242A

Abstract

According to the video processing method and the related equipment thereof, which have a good super-resolution effect on a video frame in a video stream, and can enable the whole video stream subjected to super-resolution to have good image quality, thereby improving user experience. The method of the present application comprises: obtaining a current video frame and a motion vector used in the decoding process of the current video frame; and transforming feature information of a reference video frame of the current video frame on the basis of the motion vector to obtain transformed feature information, wherein the feature information of the reference video frame is obtained in the super-resolution process of a target model on the reference video frame; and performing super-resolution on the current video frame by means of the target model on the basis of the transformed feature information to obtain the current video frame subjected to super-resolution.

Description

A video processing method and related equipment

This application claims priority to the Chinese patent application submitted to the China Patent Office on August 30, 2022, with the application number 202211049719.5 and the invention title "A video processing method and related equipment", the entire content of which is incorporated herein by reference. Applying.

Technical field

This application relates to artificial intelligence (AI) technology, and in particular to a video processing method and related equipment.

Background technique

With the rapid development of technology, video has become the most important carrier of information dissemination. In order to enhance the image quality of the video, the resolution of each video frame in the video stream can be improved through a neural network model that can achieve super-resolution (SR) reconstruction function, thereby providing high-quality, high-resolution video supply. Users watch.

Currently, in the video stream to be super-resolved, if the resolution of the current video frame needs to be improved, the current video frame and the reference video frame of the current video frame (for example, the previous video frame and/or the subsequent video frame of the current video frame can be video frames, etc.) are input to the neural network model, so that the neural network model performs super-resolution reconstruction (also called super-resolution) of the current video frame based on the reference video frame, and obtains the current video frame after super-resolution.

It can be seen that in the super-resolution process of the current video frame, the neural network model only uses the reference video frame itself as the reference benchmark, and the factors considered are relatively single. The current video frame after super-resolution output by the neural network model is not of high quality (cannot have ideal resolution), so that the image quality of the entire video stream after super-resolution is still not good enough, resulting in poor user experience.

Contents of the invention

Embodiments of the present application provide a video processing method and related equipment, which have a good super-resolution effect on video frames in the video stream, so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

A first aspect of the embodiments of the present application provides a video processing method, which method includes:

When it is necessary to perform super-resolution reconstruction on the decoded current video frame, the current video frame and the motion vector used in the decoding process of the current video frame can be obtained first.

After obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the motion vector used in the decoding process of the current video frame can be used to transform the feature information of the reference video frame to obtain the transformed reference video frame. The feature information, that is, the feature information of the reference video frame aligned to the current video frame. It should be noted that the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. Regarding the super-resolution process of the target model on the reference video frame, please refer to the subsequent super-resolution of the current video frame by the target model. The relevant description of the sub-process will not be expanded upon here.

After obtaining the transformed feature information, the transformed feature information and the current video frame can be input to the target model (for example, a trained recurrent neural network model), so that the current video frame can be processed by the target model based on the transformed feature information. Super-resolution reconstruction to obtain the current video frame after super-resolution.

It can be seen from the above method that after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed Feature information, wherein the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

In one possible implementation method, transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes: calculating the motion vector and the feature information of the reference video frame through a warping algorithm to obtain the transformed feature information. Feature information. The foregoing In the implementation, the motion vector used in the decoding process of the current video frame and the feature information of the reference video frame can be calculated through a warping algorithm (for example, bilinear difference method, bicubic difference method, etc.), so as to Accurately obtain the transformed feature information.

In one possible implementation method, the current video frame is super-resolved based on the transformed feature information through the target model, and obtaining the current video frame after super-resolution includes: performing feature extraction on the current video frame through the target model to obtain the current video The first feature of the frame; fuse the transformed feature information and the first feature through the target model to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame , the third feature is used as the current video frame after super-resolution. In the aforementioned implementation manner, after the transformed feature information and the current video frame are input to the target model, the target model can first perform feature extraction on the current video frame, thereby obtaining the first feature of the current video frame. After obtaining the first feature of the current video frame, the target model can fuse the transformed feature information and the first feature of the current video frame, thereby obtaining the second feature of the current video frame. After obtaining the second feature of the current video frame, the target model can continue to perform feature extraction on the second feature of the current video frame, thereby obtaining the third feature of the current video frame. The target model can directly use the third feature as the current super-resolved feature. video frames and output them externally.

In a possible implementation manner, the method further includes: fusing the third feature and the current video frame through the target model to obtain the super-resolved current video frame. In the foregoing implementation, after obtaining the third feature of the current video frame, the target model can fuse the third feature of the current video frame and the current video frame, thereby obtaining and outputting the super-resolved current video frame.

In one possible implementation manner, the third feature or the current video frame after super-resolution is used as the feature information of the current video frame. In the aforementioned implementation, the target model can obtain the feature information of the current video frame through the following multiple methods: After obtaining the third feature of the current video frame, the target model can directly use the third feature of the current video frame as the feature of the current video frame. information and output it to the outside for use in the super-resolution process of the next video frame; after obtaining the current video frame after super-resolution, the target model can directly use the current video frame after super-resolution as the feature information of the current video frame and output it to the outside. For use in the super-resolution process of the next video frame.

In a possible implementation manner, the method further includes: extracting features of the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame. The target model can also obtain the feature information of the current video frame in the following multiple ways: After obtaining the third feature of the current video frame, the target model can continue to perform feature extraction on the third feature of the current video frame, thereby obtaining the feature information of the current video frame. Feature information; after obtaining the current video frame after super-resolution, the target model can continue to extract features of the current video frame after super-resolution, thereby obtaining the feature information of the current video frame.

In a possible implementation manner, the current video frame contains N image blocks, and obtaining the motion vector used in the decoding process of the current video frame includes: obtaining the decoding of M image blocks in the current video frame from the compressed video stream. The motion vector used in the process, N≥2, N>M≥1; based on the motion vector used in the decoding process of M image blocks, calculate the motion vector used in the decoding process of N-M image blocks, or, The preset value is determined as the motion vector used in the decoding process of N-M image blocks. In the aforementioned implementation, if among the N image blocks contained in the current video frame, only M image blocks appear in the reference video frame of the current video frame, that is to say, the contents of the current video frame and the reference video frame are only partially the same. , there are some differences. At this time, the compressed video stream only provides the motion vectors corresponding to these M image blocks. Since the compressed video stream does not provide the motion vectors corresponding to the remaining N-M image blocks of the current video frame, the following is used. There are several ways to calculate the motion vectors corresponding to these N-M image blocks: use the preset value directly as the motion vector corresponding to these N-M image blocks; calculate the motion vectors corresponding to M image blocks to obtain the corresponding motion vectors of these N-M image blocks. motion vector. After calculating the motion vectors corresponding to the N-M image blocks, the motion vectors corresponding to the M image blocks derived from the compressed video stream can be used as the motion used in the decoding process of the M image blocks in the current video frame. vector, and use the calculated motion vectors corresponding to the N-M image blocks as the motion vectors used in the decoding process of the N-M image blocks in the current video frame, which is equivalent to obtaining the motion vectors used in the decoding process of the current video frame. The motion vector to use.

A second aspect of the embodiment of the present application provides a video processing method, which method includes:

When it is necessary to perform super-resolution reconstruction on the decoded current video frame, the current video frame and the residual information used in the decoding process of the current video frame can be obtained first.

After obtaining the current video frame and the residual information used in the decoding process of the current video frame, the characteristic information of the reference video frame can also be obtained, and the current video frame, the residual information used in the decoding process of the current video frame, and The characteristic information of the reference video frame is input to the target model, so that the target model super-resolves the current video frame based on the characteristic information of the reference video frame and the residual information used in the decoding process of the current video frame, and obtains the super-resolved The current video frame. It should be noted that the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. For the super-resolution process of the target model on the reference video frame, please refer to the subsequent target model. The relevant description of the super-resolution process of the current video frame will not be expanded here.

It can be seen from the above method: obtain the current video frame and the residual information used in the decoding process of the current video frame; use the target model to super-resolve the current video frame based on the feature information and residual information of the reference video frame, The current video frame after super-resolution is obtained, and the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

In one possible implementation method, the current video frame is super-resolved through the target model based on the feature information and residual information of the reference video frame. The current video frame obtained after super-resolution includes: super-resolving the current video frame through the target model. Feature extraction is used to obtain the first feature of the current video frame; the feature information of the reference video frame and the first feature are fused through the target model to obtain the second feature of the current video frame; the second feature is extracted through the target model to obtain The third feature of the current video frame; the target model performs feature extraction on the third feature based on the residual information to obtain the fourth feature of the current video frame, and the fourth feature is used as the current video frame after super-resolution. In the aforementioned implementation, after the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the target model, the target model can first perform feature extraction on the current video frame, thereby obtaining The first feature of the current video frame. After obtaining the first feature of the current video frame, the target model fuses the feature information of the reference video frame and the first feature of the current video frame, thereby obtaining the second feature of the current video frame. After obtaining the second feature of the current video frame, the target model can continue to perform feature extraction on the second feature of the current video frame, thereby obtaining the third feature of the current video frame. After obtaining the third feature of the current video frame, the target model can continue to perform feature extraction on the third feature of the current video frame based on the residual information used in the decoding process of the current video frame, thereby obtaining the fourth feature of the current video frame. , the target model can use the fourth feature as the current video frame after super-resolution and output it externally.

In one possible implementation, the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the target model is used to extract the third feature based on the residual information to obtain the current video The fourth feature of the frame includes: using the target model to determine P image blocks whose residual information is greater than the preset residual threshold among the N image blocks, N≥2, N>P≥1; using the target model to determine the third Among the features, features corresponding to the P image blocks are extracted to obtain the fourth feature of the current video frame. In the aforementioned implementation, it is assumed that the current video frame can be divided into N image blocks, so the residual information used in the decoding process of the current video frame includes the residual information used in the decoding process of N image blocks in the current video frame. . Among these N image blocks, the target model can sequentially compare the residual information used in the decoding process of each image block with the preset threshold, thereby determining P items whose residual information is greater than the preset residual threshold. Image block. After obtaining P image blocks whose residual information is greater than the preset residual threshold, the target model can perform feature extraction on the part of the third feature of the current video frame corresponding to the P image blocks, and the third feature is equal to The other part of the features corresponding to the remaining N-P image blocks remains unchanged, thereby obtaining the fourth feature of the current video frame.

In a possible implementation manner, the method further includes: fusing the fourth feature and the current video frame through the target model to obtain the super-resolved current video frame. In the foregoing implementation, after obtaining the fourth feature of the current video frame, the target model can fuse the fourth feature of the current video frame and the current video frame to obtain the super-resolved current video frame.

In a possible implementation manner, the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame. In the aforementioned implementation, the target model can obtain the feature information of the current video frame through the following multiple methods: After obtaining the third feature of the current video frame, the target model can directly use the third feature of the current video frame as the feature of the current video frame. information and output it to the outside for use in the super-resolution process of the next video frame; after obtaining the fourth feature of the current video frame, the target model can directly use the fourth feature of the current video frame as the feature information of the current video frame and output it to the outside. It can be used for the super-resolution process of the next video frame; after obtaining the current video frame after super-resolution, the target model can directly use the current video frame after super-resolution as the feature information of the current video frame and output it for the next video. The super-resolution process of the frame is used.

In a possible implementation manner, the method further includes: performing feature extraction on the third feature, the fourth feature or the current video frame after super-resolution through the target model to obtain the feature information of the current video frame. In the aforementioned implementation, the target model can also obtain the feature information of the current video frame through the following multiple methods: after obtaining the third feature of the current video frame, the target model can continue to extract features of the third feature of the current video frame, thereby Obtain the feature information of the current video frame; after obtaining the fourth feature of the current video frame, the target model can continue to perform feature extraction on the fourth feature of the current video frame, thereby obtaining the feature information of the current video frame; obtain the current video after super-resolution After the frame, the target The standard model can continue to extract features of the current video frame after super-resolution, thereby obtaining the feature information of the current video frame.

The third aspect of the embodiment of the present application provides a model training method, which method includes: obtaining the current video frame and the motion vector used in the decoding process of the current video frame; based on the motion vector, the reference video frame of the current video frame is Transform the feature information to obtain the transformed feature information. The feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained; the current video frame is super-resolved by the model to be trained based on the transformed feature information. score, the current video frame after super-resolution is obtained; based on the current video frame after super-resolution and the current video frame after real super-resolution, the target loss is obtained, and the target loss is used to indicate the current video frame after super-resolution and the real video frame after super-resolution The difference between the current video frames; the parameters of the model to be trained are updated based on the target loss until the model training conditions are met and the target model is obtained.

The target model trained by the above method has the ability to super-resolve video frames. Specifically, after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where , the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

In one possible implementation method, transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes: calculating the motion vector and the feature information of the reference video frame through a warping algorithm to obtain the transformed feature information. Feature information.

In one possible implementation method, the current video frame is super-resolved based on the transformed feature information through the model to be trained, and obtaining the current video frame after the super-resolution includes: performing feature extraction on the current video frame through the model to be trained, and obtaining The first feature of the current video frame; fuse the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; perform feature extraction on the second feature through the model to be trained to obtain the current video frame The third feature is the current video frame after super-resolution.

In a possible implementation manner, the method further includes: fusing the third feature and the current video frame through a model to be trained to obtain the super-resolved current video frame.

In one possible implementation manner, the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.

In a possible implementation manner, the method further includes: extracting features of the third feature or the current video frame after super-resolution through the model to be trained, to obtain feature information of the current video frame.

In a possible implementation manner, the current video frame contains N image blocks, and obtaining the motion vector used in the decoding process of the current video frame includes: obtaining the decoding of M image blocks in the current video frame from the compressed video stream. The motion vector used in the process, N≥2, N>M≥1; based on the motion vector used in the decoding process of M image blocks, calculate the motion vector used in the decoding process of N-M image blocks, or, The preset value is determined as the motion vector used in the decoding process of N-M image blocks.

The fourth aspect of the embodiment of the present application provides a model training method. The method includes: obtaining the current video frame and the residual information used in the decoding process of the current video frame; using the model to be trained based on the characteristics of the reference video frame information and residual information, perform super-resolution on the current video frame, and obtain the current video frame after super-resolution. The feature information of the reference video frame is obtained in the super-resolution processing of the reference video frame by the model to be trained; based on the current super-resolution The video frame and the current video frame after the real super-resolution are used to obtain the target loss. The target loss is used to indicate the difference between the current video frame after the super-resolution and the current video frame after the real super-resolution; the parameters of the training model are to be treated based on the target loss. Update until the model training conditions are met and the target model is obtained.

The target model trained by the above method has the ability to super-resolve video frames. Specifically, obtain the current video frame and the residual information used in the decoding process of the current video frame; use the target model to super-score the current video frame based on the feature information and residual information of the reference video frame, and obtain the super-score The current video frame, the feature information of the reference video frame is obtained in the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only considers the information of the reference video frame itself, but also considers the difference in pixel values between the reference video frame and the current video frame. The factors considered are relatively comprehensive. Therefore, the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving the user experience.

In one possible implementation method, the current video frame is super-resolved based on the feature information and residual information of the reference video frame through the model to be trained, and the current video frame obtained after super-scoring includes: Feature extraction is performed on the frame to obtain the first feature of the current video frame; the feature information of the reference video frame and the first feature are fused through the model to be trained to obtain the second feature of the current video frame; the second feature is processed through the model to be trained Feature extraction is used to obtain the third feature of the current video frame; the model to be trained performs feature extraction on the third feature based on the residual information to obtain the fourth feature of the current video frame, and the fourth feature is used as the current video frame after super-resolution.

In one possible implementation, the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the model to be trained extracts the third feature based on the residual information to obtain the current The fourth feature of the video frame includes: using the model to be trained, P image blocks whose residual information is greater than the preset residual threshold are determined among the N image blocks, N≥2, N>P≥1; Feature extraction is performed on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.

In a possible implementation manner, the method further includes: fusing the fourth feature and the current video frame through the model to be trained to obtain the super-resolved current video frame.

In a possible implementation manner, the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.

In a possible implementation manner, the method further includes: performing feature extraction on the third feature, the fourth feature or the current video frame after super-resolution through the model to be trained, to obtain the feature information of the current video frame.

The fifth aspect of the embodiment of the present application provides a video processing device. The device includes: an acquisition module, used to acquire the current video frame and the motion vector used in the decoding process of the current video frame; a transformation module, used based on The motion vector transforms the feature information of the reference video frame of the current video frame to obtain the transformed feature information. The feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model; the super-resolution module is used to pass The target model performs super-resolution on the current video frame based on the transformed feature information to obtain the current video frame after super-resolution.

It can be seen from the above device that after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed Feature information, wherein the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

In one possible implementation manner, the transformation module is used to calculate the motion vector and the feature information of the reference video frame through a warping algorithm to obtain transformed feature information.

In one possible implementation method, the super-resolution module is used to: extract features of the current video frame through the target model to obtain the first feature of the current video frame; perform feature extraction on the transformed feature information and the first feature through the target model. Through fusion, the second feature of the current video frame is obtained; the second feature is extracted through the target model to obtain the third feature of the current video frame, and the third feature is used as the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module is also used to fuse the third feature and the current video frame through the target model to obtain the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module is also used to extract features of the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.

In a possible implementation manner, the acquisition module is used to acquire the motion vectors used in the decoding process of M image blocks in the current video frame from the compressed video stream, N≥2, N>M≥1; based on The motion vector used in the decoding process of M image blocks, the motion vector used in the decoding process of NM image blocks is calculated, or the preset value is determined as the motion vector used in the decoding process of NM image blocks. .

The sixth aspect of the embodiment of the present application provides a video processing device. The device includes: an acquisition module, used to acquire the current video frame and residual information used in the decoding process of the current video frame; a super-resolution module, The target model performs super-resolution on the current video frame based on the feature information and residual information of the reference video frame, and obtains the current video frame after super-resolution. The feature information of the reference video frame is used in the super-resolution processing of the reference video frame by the target model. Get in.

It can be seen from the above device that: the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information and residual information of the reference video frame, The current video frame after super-resolution is obtained, and the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

In one possible implementation method, the super-resolution module is used to: extract features of the current video frame through the target model to obtain the first feature of the current video frame; extract feature information and the first feature of the reference video frame through the target model Perform fusion to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame; perform feature extraction on the third feature based on the residual information through the target model to obtain the current video The fourth feature of the frame is used as the current video frame after super-resolution.

In one possible implementation, the residual information includes the residual information used in the decoding process of N image blocks in the current video frame. The super-resolution module is used to: determine in the N image blocks through the target model P image blocks whose residual information is greater than the preset residual threshold, N≥2, N>P≥1; use the target model to extract features corresponding to the P image blocks in the third feature to obtain the current video frame The fourth characteristic.

In a possible implementation manner, the super-resolution module is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.

In a possible implementation manner, the super-resolution module is also used to extract features of the third feature, the fourth feature or the current video frame after super-resolution through the target model, and obtain the feature information of the current video frame.

The seventh aspect of the embodiment of the present application provides a model training device. The device includes: a first acquisition module, used to acquire the current video frame and the motion vector used in the decoding process of the current video frame; a transformation module, using The feature information of the reference video frame of the current video frame is transformed based on the motion vector to obtain the transformed feature information. The feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained; the super-resolution module , used to super-resolve the current video frame based on the transformed feature information through the model to be trained, and obtain the current video frame after super-resolution; the second acquisition module is used to perform super-resolution based on the current video frame after super-resolution and the real super-resolution of the current video frame to obtain the target loss. The target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution; the update module is used to update the parameters of the model to be trained based on the target loss. , until the model training conditions are met and the target model is obtained.

The target model trained by the above device has the ability to super-resolve video frames. Specifically, after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where , the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

In one possible implementation, the super-resolution module is used to: extract features of the current video frame through the model to be trained, and obtain the current The first feature of the previous video frame; the transformed feature information and the first feature are fused through the model to be trained to obtain the second feature of the current video frame; the second feature is extracted through the model to be trained to obtain the current video frame The third feature is the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module is also used to fuse the third feature and the current video frame through the model to be trained to obtain the current video frame after super-resolution.

In a possible implementation manner, the super-resolution module is also used to extract the third feature or the current video frame after super-resolution through the model to be trained, so as to obtain the feature information of the current video frame.

In a possible implementation manner, the acquisition module is used to acquire the motion vectors used in the decoding process of M image blocks in the current video frame from the compressed video stream, N≥2, N>M≥1; based on The motion vector used in the decoding process of M image blocks, the motion vector used in the decoding process of N-M image blocks is calculated, or the preset value is determined as the motion vector used in the decoding process of N-M image blocks. .

The eighth aspect of the embodiment of the present application provides a model training device, which includes: a first acquisition module, used to acquire the current video frame and residual information used in the decoding process of the current video frame; a super-resolution module , used to super-score the current video frame based on the feature information and residual information of the reference video frame through the model to be trained, and obtain the current video frame after the super-score. The feature information of the reference video frame is used in the model to be trained to compare the reference video frame. Obtained from the super-resolution processing; the second acquisition module is used to obtain the target loss based on the current video frame after super-resolution and the current video frame after the real super-resolution, and the target loss is used to indicate the current video frame after super-resolution and the real The difference between the current video frames after super-resolution; the update module is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.

The target model trained by the above device has the ability to super-resolve video frames. Specifically, obtain the current video frame and the residual information used in the decoding process of the current video frame; use the target model to super-score the current video frame based on the feature information and residual information of the reference video frame, and obtain the super-score The current video frame, the feature information of the reference video frame is obtained in the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

A ninth aspect of the embodiment of the present application provides a video processing device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code. When the code is executed, the video processing device executes the first step aspect, any possible implementation manner in the first aspect, the second aspect, or the method described in any possible implementation manner in the second aspect.

A tenth aspect of the embodiment of the present application provides a model training device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code. When the code is executed, the model training device executes the third step Any one of the aspect and the third aspect Possible implementation manners, the fourth aspect, or the method described in any possible implementation manner of the fourth aspect.

An eleventh aspect of the embodiments of the present application provides a circuit system. The circuit system includes a processing circuit configured to perform any of the possible implementations of the first aspect and the second aspect. , any possible implementation manner in the second aspect, the third aspect, any possible implementation manner in the third aspect, the fourth aspect, or the method described in any possible implementation manner in the fourth aspect.

A twelfth aspect of the embodiments of the present application provides a chip system. The chip system includes a processor for calling a computer program or computer instructions stored in a memory, so that the processor executes the first aspect as described in the first aspect. any possible implementation manner of the second aspect, any possible implementation manner of the second aspect, the third aspect, any possible implementation manner of the third aspect, the fourth aspect or any of the fourth aspects Any possible implementation method.

In one possible implementation, the processor is coupled to the memory through an interface.

In a possible implementation, the chip system further includes a memory, and computer programs or computer instructions are stored in the memory.

A thirteenth aspect of the embodiments of the present application provides a computer storage medium. The computer storage medium stores a computer program. When executed by a computer, the program causes the computer to implement any one of the first aspect and the first aspect. Possible implementation methods, the second aspect, any one possible implementation method of the second aspect, the third aspect, any one possible implementation method of the third aspect, the fourth aspect or any one possible implementation method of the fourth aspect The method described in the implementation.

A fourteenth aspect of the embodiments of the present application provides a computer program product. The computer program product stores instructions. When executed by a computer, the instructions make it possible for the computer to implement any one of the first aspect and the first aspect. The implementation method, the second aspect, any possible implementation method of the second aspect, the third aspect, any possible implementation method of the third aspect, the fourth aspect or any possible implementation of the fourth aspect method as described.

In the embodiment of the present application, after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed features. Information, wherein the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

Description of drawings

Figure 1 is a structural schematic diagram of the main framework of artificial intelligence;

Figure 2a is a schematic structural diagram of a video processing system provided by an embodiment of the present application;

Figure 2b is another structural schematic diagram of the video processing system provided by the embodiment of the present application;

Figure 2c is a schematic diagram of video processing related equipment provided by the embodiment of the present application;

Figure 3 is a schematic diagram of the architecture of the system 100 provided by the embodiment of the present application;

Figure 4 is a schematic flow chart of the video processing method provided by the embodiment of the present application;

Figure 5 is a schematic structural diagram of the target model provided by the embodiment of the present application;

Figure 6 is another schematic flowchart of a video processing method provided by an embodiment of the present application;

Figure 7 is another structural schematic diagram of the target model provided by the embodiment of the present application;

Figure 8 is a schematic flow chart of the model training method provided by the embodiment of the present application;

Figure 9 is another schematic flow chart of the model training method provided by the embodiment of the present application;

Figure 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present application;

Figure 11 is another structural schematic diagram of a video processing device provided by an embodiment of the present application;

Figure 12 is a schematic structural diagram of the model training device provided by the embodiment of the present application;

Figure 13 is another structural schematic diagram of the model training device provided by the embodiment of the present application;

Figure 14 is a schematic structural diagram of an execution device provided by an embodiment of the present application;

Figure 15 is a schematic structural diagram of the training equipment provided by the embodiment of the present application;

Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.

Detailed ways

The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, for example, a process, method, system, product or device that includes a series of steps or modules and need not be limited to those explicitly listed. Those steps or modules may instead include other steps or modules not expressly listed or inherent to the processes, methods, products or devices. The naming or numbering of steps in this application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering. The process steps that have been named or numbered can be implemented according to the purpose to be achieved. The order of execution can be changed for technical purposes, as long as the same or similar technical effect can be achieved. The division of units presented in this application is a logical division. In actual applications, there may be other divisions. For example, multiple units may be combined or integrated into another system, or some features may be ignored. , or not executed. In addition, the coupling or direct coupling or communication connection between the units shown or discussed may be through some interfaces, and the indirect coupling or communication connection between units may be electrical or other similar forms. There are no restrictions in the application. Furthermore, the units or subunits described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed into multiple circuit units, and some or all of them may be selected according to actual needs. unit to achieve the purpose of this application plan.

Currently, for the current video frame in the video stream to be super-resolved (which can be any video frame in the video stream to be super-resolved), if you need to improve the resolution of the current video frame, you can add the current video frame and the current video frame to The reference video frame (for example, the previous video frame and/or the subsequent video frame of the current video frame, etc.) is input to the neural network model, so that the neural network model super-resolves the current video frame based on the reference video frame, and obtains super-resolution The divided current video frame. For the remaining video frames in the video stream to be super-resolved except the current video frame, the neural network model can also be used to perform the same operations on the remaining video frames as the current video frame, so each video frame after super-resolution can be obtained , that is, the entire video stream after super-resolution.

It can be seen that in the super-resolution process of the current video frame, the neural network model only uses the reference video frame itself as the reference benchmark, and the factors considered are relatively single. The current video frame output by the model after super-resolution is not of high quality (it cannot have the ideal resolution), so that the image quality of the entire video stream after super-resolution is still not good enough (cannot have ideal quality and resolution), resulting in poor user experience.

Furthermore, in the super-resolution process for the current video frame, the neural network model needs to perform a series of processing on all image blocks contained in the entire current video frame one by one, which requires a large amount of calculation, resulting in the aforementioned neural network model-based Video processing methods are difficult to apply to small devices with limited computing power (for example, smartphones, smart watches, etc.).

In order to solve the above problem, embodiments of the present application provide a video processing method, which can be implemented in conjunction with artificial intelligence (artificial intelligence, AI) technology. AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence. AI technology obtains the best results by perceiving the environment, acquiring knowledge and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the nature of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Using artificial intelligence for data processing is a common application method of artificial intelligence.

First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1. Figure 1 is a structural schematic diagram of the main framework of artificial intelligence. The following is from the "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis) The above artificial intelligence theme framework is elaborated on in two dimensions. Among them, the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. During this process, the data It has gone through the condensation process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.

(1)Infrastructure

Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms. Communicate with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.); the basic platform includes distributed computing framework and network and other related platform guarantees and support, which can include cloud storage and Computing, interconnection networks, etc. For example, sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.

(2)Data

Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3)Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.

Among them, machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.

Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.

Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.

(4) General ability

After the data is processed as mentioned above, some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image processing. identification, etc.

(5) Intelligent products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Its application fields mainly include: intelligent terminals, intelligent transportation, Smart healthcare, autonomous driving, smart cities, etc.

Next, several application scenarios of this application will be introduced.

Figure 2a is a schematic structural diagram of a video processing system provided by an embodiment of the present application. The video processing system includes user equipment and data processing equipment. Among them, user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers. The user equipment is the initiator of video processing. As the initiator of the video processing request, the user usually initiates the request through the user equipment.

The above-mentioned data processing equipment may be a cloud server, a network server, an application server, a management server, and other equipment or servers with data processing functions. The data processing device receives the video processing request from the smart terminal through the interactive interface, and then performs information processing in the form of machine learning, deep learning, search, reasoning, decision-making, etc. through the memory that stores the data and the processor that processes the data. The memory in the data processing device can be a general term, including local storage and a database that stores historical data. The database can be on the data processing device or on other network servers.

In the video processing system shown in Figure 2a, the user equipment can receive the user's instructions. For example, the user equipment can obtain the compressed video stream input/selected by the user, and then initiate a request to the data processing equipment, so that the data processing equipment can obtain the information obtained by the user equipment. A video processing application is executed on the compressed video stream to obtain a processed video stream. For example, the user equipment can obtain the compressed video stream selected by the user and initiate a processing request for the compressed video stream to the data processing device. Next, the data processing device first obtains the compressed video stream (low-quality, low-resolution video stream) and decodes the compressed video stream to restore the video stream to be super-resolved (which can also be called a decompressed video stream. Still a low-quality, low-resolution video stream, but with a larger number of video frames). Then, the data processing device can perform super-resolution processing on the video to be super-resolved, thereby obtaining a super-resolution video stream (a high-quality, high-resolution video stream), and return the super-resolution video stream to the user device. For users to view and use.

In Figure 2a, the data processing device can execute the video processing method according to the embodiment of the present application.

Figure 2b is another structural schematic diagram of a video processing system provided by an embodiment of the present application. In Figure 2b, the user equipment directly serves as a data processing equipment. The user equipment can directly obtain input from the user and directly perform processing by the hardware of the user equipment itself. Processing, the specific process is similar to Figure 2a, please refer to the above description, and will not be repeated here.

In the video processing system shown in Figure 2b, the user equipment can receive the user's instructions. For example, the user equipment can obtain the compressed video stream selected by the user, and then the user equipment itself obtains the compressed video stream (low-quality, low-resolution video stream). ), and decodes the compressed video stream to restore the video stream to be super-resolved (which can also be called a decompressed video stream, which is still a low-quality, low-resolution video stream, but with a larger number of video frames). Then, the user equipment can perform super-resolution processing on the video to be super-resolved, thereby obtaining a post-super-resolution video stream (a high-quality, high-resolution video stream) for the user to watch and use.

In Figure 2b, the user equipment itself can execute the video processing method according to the embodiment of the present application.

Figure 2c is a schematic diagram of video processing related equipment provided by the embodiment of the present application.

The user equipment in Figure 2a and Figure 2b can be the local device 301 or the local device 302 in Figure 2c, and the data processing device in Figure 2a can be the execution device 210 in Figure 2c, where the data storage system 250 can To store the data to be processed by the execution device 210, the data storage system 250 can be integrated on the execution device 210, or can be set up on the cloud or other network servers.

The processors in Figure 2a and Figure 2b can perform data training/machine learning/deep learning through neural network models or other models (for example, support vector machine-based models), and use the data to finally train or learn the model to execute on the video Video processing applications to obtain corresponding processing results.

Figure 3 is a schematic diagram of the architecture of the system 100 provided by the embodiment of the present application. In Figure 3, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. The user Data can be input to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include: various to-be-scheduled tasks, callable resources, and other parameters.

When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing (such as implementing the function of the neural network in this application), the execution device 110 can call the data storage system 150 The data, codes, etc. in the system can be used for corresponding processing, and the data, instructions, etc. obtained by corresponding processing can also be stored in the data storage system 150 .

Finally, the I/O interface 112 returns the processing results to the client device 140, thereby providing them to the user.

It is worth mentioning that the training device 120 can generate corresponding target models/rules based on different training data for different goals or different tasks, and the corresponding target models/rules can be used to achieve the above goals or complete the above tasks. , thereby providing users with the desired results. The training data may be stored in the database 130 and come from training samples collected by the data collection device 160 .

In the case shown in FIG. 3 , the user can manually enter the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 . In another case, the client device 140 can automatically send input data to the I/O interface 112. If requiring the client device 140 to automatically send input data requires the user's authorization, the user can set corresponding permissions in the client device 140. The user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc. The client device 140 can also be used as a data collection end to collect the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as new sample data, and store them in the database 130 . Of course, it is also possible to collect without going through the client device 140. Instead, the I/O interface 112 directly uses the input data input to the I/O interface 112 and the output result of the output I/O interface 112 as a new sample as shown in the figure. The data is stored in database 130.

It is worth noting that Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application. The positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in Figure 3, the data The storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110. As shown in Figure 3, the neural network can be trained according to the training device 120.

An embodiment of the present application also provides a chip, which includes a neural network processor NPU. The chip can be disposed in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111. The chip can also be installed in the training device 120 as shown in Figure 3 to complete the training work of the training device 120 and output the target model/rules.

Neural network processor NPU, NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU allocates tasks. The core part of the NPU is the arithmetic circuit. The controller controls the arithmetic circuit to extract the data in the memory (weight memory or input memory) and perform operations.

In some implementations, the computing circuit includes multiple processing units (PE). In some implementations, the arithmetic circuit is a two-dimensional systolic array. The arithmetic circuit may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the corresponding data of matrix B from the weight memory and caches it on each PE in the arithmetic circuit. The operation circuit takes matrix A data and matrix B from the input memory to perform matrix operations, and the partial result or final result of the obtained matrix is stored in the accumulator (accumulator).

The vector calculation unit can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. For example, the vector computing unit can be used for network calculations in non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.

In some implementations, the vector computation unit can store the processed output vector into a unified buffer. For example, the vector calculation unit may apply a nonlinear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to an arithmetic circuit, such as for use in a subsequent layer in a neural network.

Unified memory is used to store input data and output data.

The weight data directly transfers the input data in the external memory to the input memory and/or the unified memory through the storage unit access controller (direct memory access controller, DMAC), stores the weight data in the external memory into the weight memory, and stores the weight data in the unified memory. The data is stored in external memory.

The bus interface unit (BIU) is used to realize the interaction between the main CPU, DMAC and instruction memory through the bus.

The instruction fetch buffer connected to the controller is used to store instructions used by the controller;

The controller is used to call instructions cached in the memory to control the working process of the computing accelerator.

Generally, the unified memory, input memory, weight memory and instruction memory are all on-chip memories, and the external memory is the memory outside the NPU. The external memory can be double data rate synchronous dynamic random access memory (double data). rate synchronous dynamic random accessmemory (DDR SDRAM), high bandwidth memory (high bandwidth memory (HBM)) or other readable and writable memory.

Since the embodiments of the present application involve the application of a large number of neural networks, in order to facilitate understanding, the relevant terms involved in the embodiments of the present application and related concepts such as neural networks are first introduced below.

(1) Neural network

The neural network can be composed of neural units. The neural unit can refer to an arithmetic unit that takes xs and intercept 1 as input. The output of the arithmetic unit can be:

Among them, s=1, 2,...n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.

The work of each layer in the neural network can be described by the mathematical expression y=a(Wx+b): From the physical level, the work of each layer in the neural network can be understood as five pairs of input spaces (input vectors) set) operations to complete the transformation from input space to output space (i.e., row space to column space of the matrix). These five operations include: 1. Dimension raising/reducing; 2. Enlarging/reducing; 3. Rotation; 4. Translation; 5. "Bend". Among them, the operations of 1, 2, and 3 are completed by Wx, the operation of 4 is completed by +b, and the operation of 5 is implemented by a(). The reason why the word "space" is used here is because the object to be classified is not a single thing, but a class of things. Space refers to all the things of this type. collection of individuals. Among them, W is a weight vector, and each value in this vector represents the weight value of a neuron in this layer of neural network. This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space. The purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.

Because you want the output of the neural network to be as close as possible to the value you really want to predict, you can compare the predicted value of the current network with the really desired target value, and then update each layer of the neural network based on the difference between the two. weight vector (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the neural network). For example, if the predicted value of the network is high, adjust the weight vector to make it predict lower Some, constant adjustments are made until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function (loss function) or objective function (objective function), which is used to measure the difference between the predicted value and the target value. Important equations. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference. Then the training of the neural network becomes a process of reducing this loss as much as possible.

(2)Back propagation algorithm

The neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.

The method provided by this application is described below from the training side of the neural network and the application side of the neural network.

The model training method provided by the embodiment of the present application involves the processing of data sequences, and can be specifically applied to methods such as data training, machine learning, and deep learning. The training data (for example, the current video in the model training method provided by the embodiment of the present application) frames, etc.) to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc., and finally obtain a trained neural network (such as the target model in the model training method provided by the embodiment of this application); and, The video processing method provided by the embodiment of the present application can use the above-trained neural network to input input data (for example, the current video frame in the video processing method provided by the embodiment of the present application, etc.) into the trained neural network. In the network, output data (such as the current video frame after super-resolution in the video processing method provided by the embodiment of the present application, etc.) is obtained. It should be noted that the model training method and video processing method provided in the embodiments of this application are inventions based on the same concept, and can also be understood as two parts of a system, or two stages of an overall process: such as model training phase and model application phase.

Figure 4 is a schematic flow chart of a video processing method provided by an embodiment of the present application. As shown in Figure 4, the method includes:

401. Obtain the current video frame and the motion vector used in the decoding process of the current video frame.

In this embodiment, after the compressed video stream specified by the user is determined, the compressed video stream can be decoded to obtain a video stream to be super-resolved. It should be noted that during the aforementioned decoding process, the compressed video stream at least contains the first video frame, the motion vector and residual information corresponding to the second video frame, the motion vector and residual information corresponding to the third video frame, ..., the motion vector and residual information corresponding to the last video frame. Then, the first video frame can be used as the reference video frame of the second video frame, motion compensation is performed on the first video frame based on the motion vector corresponding to the second video frame, and the intermediate video frame is obtained, and then the intermediate video frame is The residual information corresponding to the second video frame is superimposed to obtain the second video frame. In this way, the decoding of the second video frame is completed. Then, the second video frame can be used as the reference video frame of the third video frame, and motion compensation is performed on the second video frame based on the motion vector corresponding to the third video frame to obtain an intermediate video frame, and then in the intermediate video frame The residual information corresponding to the third video frame is superimposed to obtain the third video frame. In this way, the decoding of the third video frame is completed. By analogy, the decoding of the fourth video frame can also be completed,..., the decoding of the last video frame is equivalent to obtaining the first video frame, the second video frame, the third video frame,... , the last video frame and multiple video frames constitute the video stream to be super-resolved.

For convenience of explanation, any video frame among multiple video frames included in the video stream to be super-resolved will be schematically introduced below, and this video frame will be called the current video frame. After decoding to obtain the current video frame based on the reference video frame of the current video frame (for example, the previous video frame of the current video frame), the motion vector corresponding to the current video frame, and the residual information corresponding to the current video frame, the current video frame can also be obtained based on The motion vector corresponding to the current video frame is used to obtain the motion vector used in the decoding process of the current video frame.

Specifically, assuming that the current video frame can be divided into N image blocks (N is a positive integer greater than or equal to 2), the motion vector used in the decoding process of the current video frame can be obtained in the following way:

(1) If the N image blocks contained in the current video frame all appear in the reference video frame of the current video frame, that is to say, the contents of the current video frame and the reference video frame are basically the same. At this time, the current video frame provided by the compressed video stream The motion vector corresponding to the video frame contains the motion vector corresponding to the N image blocks. The motion vector corresponding to the i-th image block (i=1,...,N) among the N image blocks is used to indicate the i-th image. The difference between the position of the block in the reference video frame and the position of the i-th image block in the current video frame, that is, the movement and change of the position of the i-th image block from the reference video frame to the current video frame. Then, the motion quantities corresponding to the N image blocks derived from the compressed video stream can be directly used as the motion vectors used in the decoding process of the N image blocks in the current video frame, that is, in the decoding process of the current video frame The motion vector used.

(2) If among the N image blocks contained in the current video frame, only M image blocks (M is less than or equal to N, and M is a positive integer greater than or equal to 1) appear in the reference video frame of the current video frame, also That is to say, the contents of the current video frame and the reference video frame are only partly the same, and partly different. At this time, the motion vector corresponding to the current video frame provided by the compressed video stream only contains the motion vectors corresponding to these M image blocks. This The motion vector corresponding to the j-th image block (j=1,...,M) among the M image blocks is used to indicate the position of the j-th image block in the reference video frame and the position of the j-th image block in the current video frame. The difference between positions, that is, the movement and change in the position of the j-th image block from the reference video frame to the current video frame. Since the compressed video stream does not provide the motion vectors corresponding to the remaining N-M image blocks of the current video frame, the motion vectors corresponding to the N-M image blocks are calculated in the following multiple ways:

(2.1) Use the preset value (the size of the preset value can be set according to actual needs, and there is no limit here, for example, the preset value is 0, etc.) directly as the motion vector corresponding to the N-M image blocks.

(2.2) Calculate the motion vectors corresponding to the M image blocks to obtain the motion vectors corresponding to the NM image blocks. The calculation process is as shown in the following formula:

In the above formula, is the motion vector (k=1,...,NM) corresponding to the k-th image block among the NM image blocks of the current video frame, are the motion vectors corresponding to the four image blocks located on the left, right, upper and lower sides of the k-th image block in the current video frame.

After calculating the motion vectors corresponding to the N-M image blocks, the motion vectors corresponding to the M image blocks derived from the compressed video stream can be used as the motion vectors used in the decoding process of the M image blocks in the current video frame. , and use the calculated motion vectors corresponding to the N-M image blocks as the motion vectors used in the decoding process of the N-M image blocks in the current video frame, which is equivalent to obtaining the motion vector used in the decoding process of the current video frame. motion vector.

It should be understood that this embodiment is only schematically illustrated by assuming that the reference video frame of the current video frame is the previous video frame of the current video frame. In practical applications, the reference video frame can also be the next video frame of the current video frame, or , the reference video frame can also be the first two video frames of the current video frame, the reference video frame can also be the last two video frames of the current video frame, etc., and there is no limit here.

402. Transform the feature information of the reference video frame of the current video frame based on the motion vector used in the decoding process of the current video frame to obtain the transformed feature information. The feature information of the reference video frame is in the target model of the reference video frame. Obtained during the super score process.

After obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the motion vector used in the decoding process of the current video frame can be used to obtain the feature information of the reference video frame (which can also be called the hidden information of the reference video frame). Containing state (hidden state)) is transformed to obtain the transformed feature information of the reference video frame, that is, the feature information of the reference video frame aligned to the current video frame.

It should be noted that the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. That is to say, during the super-resolution process of the target model on the reference video frame, the feature information of the reference video frame is obtained. This can be either the intermediate output of the process, the final output, or the reference video frame itself. Regarding the super-resolution process of the target model on the reference video frame, please refer to the relevant description of the subsequent super-resolution process of the current video frame by the target model, which will not be discussed here.

Specifically, the feature information of the reference video frame can be transformed in the following manner to obtain the transformed feature information:

The motion vector used in the decoding process of the current video frame and the feature information of the reference video frame can be calculated through the warp algorithm to obtain the transformed feature information. The calculation process is as shown in the following formula:

In the above formula, MV _t is the motion vector used in the decoding process of the current video frame, h _t-1 is the feature information of the reference video frame, It is the transformed feature information of the reference video frame, and Warp() is the warping algorithm.

403. Use the target model to perform super-resolution on the current video frame based on the transformed feature information, and obtain the current video frame after super-resolution.

Specifically, the target model can perform super-resolution on the current video frame in the following ways, thereby obtaining the current video frame after super-resolution:

(1) After inputting the transformed feature information and the current video frame to the target model, the target model can first perform feature extraction (for example, convolution processing, etc.) on the current video frame to obtain the first feature of the current video frame. For example, as shown in Figure 5 (Figure 5 is a schematic structural diagram of the target model provided by the embodiment of the present application), assuming that the compressed video stream is decoded, multiple video frames can be obtained, where the t-th video frame LR _t is The current video frame, the t-1th video frame is the reference video frame of the tth video frame LR _t . The transformed implicit state of the t-th video frame LR _t and the t-1th video frame (obtained by transforming the hidden state h t- ₁ of the t-1th video frame using the motion vector MV _t used in the decoding process of the t-th video frame) After inputting the target model, the target model can first Preliminary feature extraction is performed on the video frame LR _t to obtain the preliminary feature f _t ¹ of the t-th video frame (ie, the aforementioned first feature).

(2) After obtaining the first feature of the current video frame, the target model can fuse the transformed feature information and the first feature of the current video frame (for example, splicing processing, etc.) to obtain the second feature of the current video frame. . Still as in the above example, after obtaining the preliminary feature f _t ¹ of the t-th video frame, the target model can combine the preliminary feature f _t ¹ of the t-th video frame and the transformed hidden state of the t-1 video frame. Perform splicing (cascade) to obtain the fusion feature f _t ² of the t-th video frame (i.e., the aforementioned second feature).

(3) After obtaining the second feature of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the second feature of the current video frame, thereby obtaining the third feature of the current video frame. Still as in the above example, after obtaining the fusion feature f _t ² of the t-th video frame, the target model can continue to perform feature extraction on the fusion feature f _t ² of the t-th video frame to obtain further features f _t of the t-th video frame. ³ .

(4) After obtaining the third feature of the current video frame, the target model can fuse the third feature of the current video frame and the current video frame (for example, addition processing, etc.), thereby obtaining and outputting the super-resolved current Video frames. Still as in the above example, after obtaining the further feature f _t ³ of the t-th video frame, the target model can add the further feature f _t ³ of the t-th video frame and the t-th video frame LR _t to obtain and output The t-th video frame SR _t after super-resolution.

Furthermore, the target model can obtain the feature information (hidden state) of the current video frame in a variety of ways:

(1) After obtaining the third feature of the current video frame, the target model can directly use the third feature of the current video frame as the feature information of the current video frame, and output it externally for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the further features f _t ³ of the t-th video frame, the target model can use it as the hidden state h _t of the t-th video frame and output the hidden state h _t of the t-th video frame. .

(2) After obtaining the current video frame after super-resolution, the target model can directly use the current video frame after super-resolution as the feature information of the current video frame, and output it to the outside for use in the super-resolution process of the next video frame. Still as in the above example, get the t-th video frame SR _t after super-resolution Afterwards, the target model can use it as the hidden state h _t of the t-th video frame, and output the hidden state h _t of the t-th video frame.

(3) After obtaining the third feature of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame, thereby obtaining feature information of the current video frame. Still as in the above example, after obtaining the further feature f _t ³ of the t-th video frame, the target model can perform feature extraction on the further feature f _t ³ of the t-th video frame, thereby obtaining and outputting the implicit feature of the t-th video frame. State h _t .

(4) After obtaining the current video frame after super-resolution, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the current video frame after super-resolution, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the t-th video frame SR _t after super-resolution, the target model can perform feature extraction on the t-th video frame SR _t after super-resolution, thereby obtaining and outputting the implicit information of the t-th video frame. State h _t .

It should be understood that after obtaining the third feature of the current video frame, the target model may not fuse the third feature with the current video frame, but directly use the third feature as the super-resolved current video frame. Still as in the above example, after obtaining the further features f _t ³ of the t-th video frame, the target model can directly use it as the t-th video frame SR _t after super-resolution, and output the t-th video frame SR after super-resolution. _t .

At this point, the super-resolution processing for the current video frame is completed. For the remaining video frames in the video stream to be super-resolved except the current video frame, the same operations as those performed on the current video frame can also be performed, so that the video stream after super-resolution can be obtained.

Figure 6 is another schematic flowchart of a video processing method provided by an embodiment of the present application. As shown in Figure 6, the method includes:

601. Obtain the current video frame and the residual information used in the decoding process of the current video frame.

Then, the residual information corresponding to the current video frame provided by the compressed video stream can be used as the residual information used in the decoding process of the current video frame.

602. Based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame through the target model, Super-resolution is performed on the current video frame to obtain the current video frame after super-resolution. The feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.

After obtaining the current video frame and the residual information used in the decoding process of the current video frame, the feature information of the reference video frame (also called the hidden state of the reference video frame) can also be obtained, and the current The video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the target model, so that the target model is based on the feature information of the reference video frame and the feature information used in the decoding process of the current video frame. Residual information, perform super-resolution on the current video frame, and obtain the current video frame after super-resolution.

(1) After inputting the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame into the target model, the target model can first perform feature extraction (for example, convolution) on the current video frame Processing, etc.) to obtain the first feature of the current video frame. For example, as shown in Figure 7 (Figure 7 is another structural schematic diagram of the target model provided by the embodiment of the present application), assuming that the compressed video stream is decoded, multiple video frames can be obtained, wherein the t-th video frame LR _t is the current video frame, and the t-1th video frame is the reference video frame of the t-th video frame LR _t . After inputting the t-th video frame LR _t , the residual information Res _t used in the decoding process of the current video frame, and the hidden state h _t-1 of the t-1th video frame into the target model, the target model can first Preliminary feature extraction is performed on the t-th video frame LR _t , and the preliminary feature f _t ¹ of the t-th video frame is obtained (ie, the aforementioned first feature).

(2) After obtaining the first feature of the current video frame, the target model fuses the feature information of the reference video frame and the first feature of the current video frame (for example, splicing processing, etc.) to obtain the second feature of the current video frame. . Still as in the above example, after obtaining the preliminary feature f _t ¹ of the t-th video frame, the target model can combine the preliminary feature f _t ¹ of the t-th video frame and the hidden state h _t-1 of the t-1th video frame Perform splicing (cascade) to obtain the fusion feature f _t ² of the t-th video frame (i.e., the aforementioned second feature).

(4) After obtaining the third feature of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame based on the residual information used in the decoding process of the current video frame. etc.), thereby obtaining the fourth feature of the current video frame. Still as in the above example, further features f _t ³ of the t-th video frame are obtained. The target model can use the residual information Res _t used in the decoding process of the current video frame to obtain further features f _t ³ of the t-th video frame. Feature extraction is performed to obtain further features f _t ⁴ of the t-th video frame.

(5) After obtaining the fourth feature of the current video frame, the target model can fuse the fourth feature of the current video frame and the current video frame (for example, addition processing, etc.) to obtain the super-resolved current video frame. Still as in the above example, after obtaining the further feature f _t ⁴ of the t-th video frame, the target model can add the further feature f _t ⁴ of the t-th video frame and the t-th video frame LR _t to obtain And output the t-th video frame SR _t after super-resolution.

More specifically, the target model can obtain the fourth feature of the current video frame in the following way:

(1) Assume that the current video frame can be divided into N image blocks (N is a positive integer greater than or equal to 2), so the residual information used in the decoding process of the current video frame includes the N image blocks in the current video frame. Residual information used in the decoding process. In these N image blocks, the target model can sequentially compare the residual information used in the decoding process of each image block with the preset threshold (the size of the threshold can be determined according to the actual (requirements are set, there are no restrictions here) are compared to determine P image blocks whose residual information is greater than the preset residual threshold (P is less than N, and P is a positive integer greater than or equal to 1).

(2) After obtaining P image blocks whose residual information is greater than the preset residual threshold, the target model can perform feature extraction on the part of the third feature of the current video frame corresponding to the P image blocks, and the third The other part of the features corresponding to the remaining N-P image blocks remains unchanged, thereby obtaining the fourth feature of the current video frame.

(2) After obtaining the fourth feature of the current video frame, the target model can directly use the fourth feature of the current video frame as the feature information of the current video frame, and output it externally for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the further features f _t ⁴ of the t-th video frame, the target model can use it as the hidden state h _t of the t-th video frame and output the hidden state h of the t-th video frame. _t .

(3) After obtaining the current video frame after super-resolution, the target model can directly use the current video frame after super-resolution as the feature information of the current video frame, and output it to the outside for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the super-resolved t-th video frame SR _t , the target model can use it as the hidden state h _t of the t-th video frame and output the hidden state h _t of the t-th video frame. .

(4) After obtaining the third feature of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame, thereby obtaining feature information of the current video frame. Still as in the above example, after obtaining the further feature f _t ³ of the t-th video frame, the target model can perform feature extraction on the further feature f _t ³ of the t-th video frame, thereby obtaining and outputting the implicit feature of the t-th video frame. State h _t .

(5) After obtaining the fourth feature of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the fourth feature of the current video frame, thereby obtaining feature information of the current video frame. Still as in the above example, after obtaining the further feature f _t ⁴ of the t-th video frame, the target model can perform feature extraction on the further feature f _t ⁴ of the t-th video frame, thereby obtaining and outputting the further feature f t 4 of the t-th video frame. Hidden state h _t .

(6) After obtaining the current video frame after super-resolution, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the current video frame after super-resolution, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the t-th video frame SR _t after super-resolution, the target model can perform feature extraction on the t-th video frame SR _t after super-resolution, thereby obtaining and outputting the implicit information of the t-th video frame. State h _t .

It should be understood that after obtaining the fourth feature of the current video frame, the target model may not fuse the fourth feature with the current video frame, but directly use the fourth feature as the super-resolved current video frame. Still as in the above example, after obtaining the further features f _t ⁴ of the t-th video frame, the target model can directly use it as the t-th video frame SR _t after super-resolution, and output the t-th video frame after super-resolution SR _t .

In the embodiment of this application, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information and residual information of the reference video frame, and we obtain The feature information of the current video frame after super-resolution and the reference video frame is obtained through the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

Furthermore, during the super-resolution process of the target model for the current video frame, the neural network model only needs to perform all processing on some image blocks contained in the current video frame, and does not need to perform all processing on the other part of the image blocks included in the current video frame. The processing can reduce the amount of calculation required, so the video processing method based on the target model can be applied to small devices with limited computing power.

It is worth noting that the embodiment shown in Fig. 4 and the embodiment shown in Fig. 6 can be superimposed and used.

The above is a detailed description of the video processing method provided by the embodiment of the present application. The model training method provided by the embodiment of the present application will be introduced below. Figure 8 is a schematic flow chart of the model training method provided by the embodiment of the present application. As shown in Figure 8, the method includes:

801. Obtain the current video frame and the motion vector used in the decoding process of the current video frame.

In this embodiment, when the model to be trained (the recurrent neural network that needs to be trained) needs to be trained, a batch of training data can be obtained first. The batch of training data includes the current video frame and the motion used in the decoding process of the current video frame. Vector. It should be noted that the current video frame after the real super-resolution (that is, the real super-resolution result of the current video frame) is known.

802. Transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain transformed feature information. The feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained.

It should be noted that the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained. That is to say, during the super-resolution process of the reference video frame by the model to be trained, the reference video frame Feature information can be either an intermediate output or a final output of the process. Regarding the super-resolution process of the model to be trained on the reference video frame, please refer to the relevant description of the subsequent super-resolution process of the model to be trained on the current video frame, which will not be discussed here.

803. Use the model to be trained to perform super-resolution on the current video frame based on the transformed feature information to obtain the current video frame after super-resolution.

After obtaining the transformed feature information, the transformed feature information and the current video frame can be input to the model to be trained, so that the model to be trained can perform super-resolution reconstruction of the current video frame based on the transformed feature information, and obtain the super-resolution the current video frame.

In a possible implementation manner, the method further includes: fusing the third feature and the current video frame through the model to be trained to obtain the super-resolved current video frame.

804. Obtain a target loss based on the current video frame after super-resolution and the current video frame after real super-resolution. The target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution.

After obtaining the current video frame after super-resolution, the current video frame after super-resolution and the current video frame after real super-resolution can be calculated through the preset loss function to obtain the target loss. The target loss is used to indicate the result after super-resolution. The difference between the current video frame and the current video frame after real super-resolution.

805. Update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.

After obtaining the target loss, the parameters of the model to be trained can be updated based on the target loss, and the next batch of training data can be used to continue training the model to be trained after the updated parameters until the model training conditions are met (for example, the target loss reaches convergence, etc. ), the target model in the embodiment shown in Figure 4 is obtained.

The target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, after obtaining the current video frame to and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where the feature information of the reference video frame is in Obtained from the super-resolution process of the target model on the reference video frame. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

Figure 9 is another schematic flowchart of a model training method provided by an embodiment of the present application. As shown in Figure 9, the method includes:

901. Obtain the current video frame and the residual information used in the decoding process of the current video frame.

In this embodiment, when the model to be trained (the recurrent neural network that needs to be trained) needs to be trained, a batch of training data can be obtained first. The batch of training data includes the current video frame and the residuals used in the decoding process of the current video frame. Poor information. It should be noted that the current video frame after the real super-resolution (that is, the real super-resolution result of the current video frame) is known.

902. Use the model to be trained to perform super-resolution on the current video frame based on the feature information and residual information of the reference video frame to obtain the current video frame after super-resolution. The feature information of the reference video frame is in the model to be trained on the reference video frame. Obtained from super score processing.

After obtaining the current video frame and the residual information used in the decoding process of the current video frame, the feature information of the reference video frame (also called the hidden state of the reference video frame) can also be obtained, and the current The video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the model to be trained, so that the model to be trained is based on the feature information of the reference video frame and the information used in the decoding process of the current video frame. Use the residual information to perform super-resolution on the current video frame and obtain the current video frame after super-resolution.

903. Based on the current video frame after super-resolution and the current video frame after real super-resolution, obtain the target loss. The target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution.

904. Update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.

After obtaining the target loss, the parameters of the model to be trained can be updated based on the target loss, and the next batch of training data can be used to continue training the model to be trained after the updated parameters until the model training conditions are met (for example, the target loss reaches convergence, etc. ), the target model in the embodiment shown in Figure 6 is obtained.

The target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, obtain the current video frame and the residual information used in the decoding process of the current video frame; use the target model to super-score the current video frame based on the feature information and residual information of the reference video frame, and obtain the super-score The current video frame, the feature information of the reference video frame is obtained in the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

The above is a detailed description of the model training method provided by the embodiment of the present application. The video processing device and the model training device provided by the embodiment of the present application will be introduced below. Figure 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present application. As shown in Figure 10, the device includes:

The acquisition module 1001 is used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;

The transformation module 1002 is used to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information. The feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model;

The super-resolution module 1003 is used to perform super-resolution on the current video frame based on the transformed feature information through the target model to obtain the super-resolved current video frame.

In one possible implementation manner, the transformation module 1002 is configured to calculate the feature information of the motion vector and the reference video frame through a warping algorithm to obtain transformed feature information.

In a possible implementation manner, the super-resolution module 1003 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; and extract the transformed feature information and the first feature through the target model. Fusion is performed to obtain the second feature of the current video frame; feature extraction is performed on the second feature through the target model to obtain the third feature of the current video frame, and the third feature is used as the current video frame after super-resolution.

In a possible implementation manner, the super-resolution module 1003 is also used to fuse the third feature and the current video frame through the target model to obtain the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module 1003 is also used to extract features of the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.

In a possible implementation manner, the acquisition module 1001 is used to acquire the motion vectors used in the decoding process of M image blocks in the current video frame from the compressed video stream, N≥2, N>M≥1; Based on the motion vectors used in the decoding process of the M image blocks, calculate the motion vectors used in the decoding process of the N-M image blocks, or determine the preset value as the motion used in the decoding process of the N-M image blocks. Vector.

Figure 11 is another structural schematic diagram of a video processing device provided by an embodiment of the present application. As shown in Figure 11, the device includes:

The acquisition module 1101 is used to acquire the current video frame and the residual information used in the decoding process of the current video frame;

The super-resolution module 1102 is used to perform super-resolution on the current video frame based on the feature information and residual information of the reference video frame through the target model to obtain the current video frame after super-resolution. The feature information of the reference video frame is compared with the reference in the target model. Obtained from super-resolution processing of video frames.

In the embodiment of this application, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information and residual information of the reference video frame, and we obtain The feature information of the current video frame after super-resolution and the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model. In the aforementioned process, the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame. It can be seen that during the super-resolution process of the current video frame by the target model , not only the information of the reference video frame itself is considered, but also the difference in pixel values between the reference video frame and the current video frame is considered. The factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is High enough quality (with a relatively ideal resolution) so that the entire video stream after super-resolution has good image quality, thus improving the user experience.

In a possible implementation manner, the super-resolution module 1102 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; and extract the feature information of the reference video frame and the first feature through the target model. The features are fused to obtain the second feature of the current video frame; the second feature is extracted through the target model to obtain the third feature of the current video frame; the third feature is extracted based on the residual information through the target model to obtain the current The fourth feature of the video frame, the fourth feature is used as the current video frame after super-resolution.

In one possible implementation manner, the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the super-resolution module 1102 is used to: use the target model in the N image blocks, Determine P image blocks whose residual information is greater than the preset residual threshold, N≥2, N>P≥1; use the target model to extract features corresponding to the P image blocks in the third feature to obtain the current video The fourth characteristic of the frame.

In a possible implementation manner, the super-resolution module 1102 is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.

In one possible implementation manner, the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.

In a possible implementation manner, the super-resolution module 1102 is also used to perform feature extraction on the third feature, the fourth feature, or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.

Figure 12 is a schematic structural diagram of a model training device provided by an embodiment of the present application. As shown in Figure 12, the device includes:

The first acquisition module 1201 is used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;

The transformation module 1202 is used to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information. The feature information of the reference video frame is used in the super-resolution process of the reference video frame by the model to be trained. get;

The super-resolution module 1203 is used to perform super-resolution on the current video frame based on the transformed feature information through the model to be trained, and obtain the current video frame after super-resolution;

The second acquisition module 1204 is used to obtain the target loss based on the current video frame after super-resolution and the current video frame after real super-resolution. The target loss is used to indicate the current video frame after super-resolution and the current video after real super-resolution. Differences between frames;

The update module 1205 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.

The target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where , the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame. In the foregoing process, the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, because the transformed feature information of the reference video frame is based on the motion vector pair used in the decoding process of the current video frame. It is obtained by transforming the feature information of the reference video frame. It can be seen that in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the image blocks between the reference video frame and the current video frame are considered. position correspondence relationship, the factors considered are relatively comprehensive, so the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving user experience.

In one possible implementation manner, the transformation module 1202 is configured to calculate the motion vector and the feature information of the reference video frame through a warping algorithm to obtain transformed feature information.

In one possible implementation, the super-resolution module 1203 is used to extract features of the current video frame through the model to be trained, and obtain to the first feature of the current video frame; fuse the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; extract the second feature through the model to be trained to obtain the current video The third feature of the frame is used as the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module 1203 is also used to fuse the third feature and the current video frame through the model to be trained to obtain the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module 1203 is also used to extract features of the third feature or the current video frame after super-resolution through the model to be trained, to obtain feature information of the current video frame.

In a possible implementation manner, the acquisition module 1201 is used to acquire the motion vectors used in the decoding process of M image blocks in the current video frame from the compressed video stream, N≥2, N>M≥1; Calculate the motion vectors used in the decoding process of N-M image blocks based on the motion vectors used in the decoding process of the M image blocks, or determine the preset value as the motion used in the decoding process of the N-M image blocks Vector.

Figure 13 is another structural schematic diagram of a model training device provided by an embodiment of the present application. As shown in Figure 13, the device includes:

The first acquisition module 1301 is used to acquire the current video frame and the residual information used in the decoding process of the current video frame;

The super-resolution module 1302 is used to perform super-resolution on the current video frame based on the feature information and residual information of the reference video frame through the model to be trained, and obtain the current video frame after super-resolution. The feature information of the reference video frame is used in the model to be trained. Obtained from super-resolution processing of reference video frames;

The second acquisition module 1303 is used to obtain the target loss based on the current video frame after super-resolution and the current video frame after real super-resolution. The target loss is used to indicate the current video frame after super-resolution and the current video after real super-resolution. Differences between frames;

The update module 1304 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.

In a possible implementation manner, the super-resolution module 1302 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; and perform feature information on the reference video frame and the first feature on the reference video frame through the target model. The features are fused to obtain the second feature of the current video frame; the second feature is extracted through the target model to obtain the third feature of the current video frame; the third feature is extracted based on the residual information through the target model to obtain the current The fourth feature of the video frame, the fourth feature is used as the current video frame after super-resolution.

In one possible implementation manner, the residual information includes the residual information used in the decoding process of N image blocks in the current video frame. The super-resolution module 1302 is used to: use the target model in the N image blocks, Determine P image blocks whose residual information is greater than the preset residual threshold, N≥2, N>P≥1; use the target model to extract features corresponding to the P image blocks in the third feature to obtain the current video The fourth characteristic of the frame.

In a possible implementation manner, the super-resolution module 1302 is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.

In one possible implementation manner, the super-resolution module 1302 is also used to perform feature extraction on the third feature, the fourth feature, or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.

It should be noted that the information interaction, execution process, etc. between the modules/units of the above-mentioned device are based on the same concept as the method embodiments of the present application, and the technical effects they bring are the same as those of the method embodiments of the present application. The specific content can be Refer to the description in the method embodiments shown above in the embodiments of the present application, which will not be described again here.

The embodiment of the present application also relates to an execution device. Figure 14 is a schematic structural diagram of the execution device provided by the embodiment of the present application. like As shown in Figure 14, the execution device 1400 can be embodied as a mobile phone, a tablet, a laptop, a smart wearable device, a server, etc., and is not limited here. The video processing device described in the corresponding embodiment of FIG. 10 or FIG. 11 may be deployed on the execution device 1400 to implement the video processing function in the corresponding embodiment of FIG. 4 or FIG. 6 . Specifically, the execution device 1400 includes: a receiver 1401, a transmitter 1402, a processor 1403 and a memory 1404 (the number of processors 1403 in the execution device 1400 can be one or more, one processor is taken as an example in Figure 14) , wherein the processor 1403 may include an application processor 14031 and a communication processor 14032. In some embodiments of the present application, the receiver 1401, the transmitter 1402, the processor 1403, and the memory 1404 may be connected by a bus or other means.

Memory 1404 may include read-only memory and random access memory and provides instructions and data to processor 1403 . A portion of memory 1404 may also include non-volatile random access memory (NVRAM). The memory 1404 stores processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.

The processor 1403 controls the execution of operations of the device. In specific applications, various components of the execution device are coupled together through a bus system. In addition to the data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc. However, for the sake of clarity, various buses are called bus systems in the figure.

The methods disclosed in the above embodiments of the present application can be applied to the processor 1403 or implemented by the processor 1403. The processor 1403 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1403 . The above-mentioned processor 1403 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application specific integrated circuit (ASIC), a field programmable Gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The processor 1403 can implement or execute each method, step and logical block diagram disclosed in the embodiment of this application. A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field. The storage medium is located in the memory 1404. The processor 1403 reads the information in the memory 1404 and completes the steps of the above method in combination with its hardware.

The receiver 1401 may be configured to receive input numeric or character information and generate signal inputs related to performing relevant settings and functional controls of the device. The transmitter 1402 can be used to output numeric or character information through the first interface; the transmitter 1402 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1402 can also include a display device such as a display screen .

In the embodiment of the present application, in one case, the processor 1403 is used to recommend items for information associated with the user through the first model in the corresponding embodiment of FIG. 4 or the target model in the corresponding embodiment of FIG. 9 .

The embodiment of the present application also relates to a training device. Figure 15 is a schematic structural diagram of the training device provided by the embodiment of the present application. As shown in Figure 15, the training device 1500 is implemented by one or more servers. The training device 1500 can vary greatly due to different configurations or performance, and can include one or more central processing units (CPU) 1514 (eg, one or more processors) and memory 1532, one or more storage media 1530 (eg, one or more mass storage devices) storing applications 1542 or data 1544. Among them, the memory 1532 and the storage medium 1530 may be short-term storage or persistent storage. The program stored in the storage medium 1530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device. Furthermore, the central processor 1514 may be configured to communicate with the storage medium 1530 and execute a series of instruction operations in the storage medium 1530 on the training device 1500 .

The training device 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input and output interfaces 1558; or, one or more operating systems 1541, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.

Specifically, the training device can execute the model training method in the embodiment corresponding to FIG. 8 or FIG. 9 .

Embodiments of the present application also relate to a computer storage medium. The computer-readable storage medium stores a program for performing signal processing. When the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or, The computer is caused to perform the steps performed by the aforementioned training device.

Embodiments of the present application also relate to a computer program product that stores instructions that, when executed by a computer, cause the computer to perform the steps performed by the foregoing execution device, or cause the computer to perform the steps performed by the foregoing training device. A step of.

The execution device, training device or terminal device provided by the embodiment of the present application may specifically be a chip. The chip includes: a processing unit and a communication unit. The processing unit may be, for example, a processor. The communication unit may be, for example, an input/output interface. Pins or circuits, etc. The processing unit can execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit within the chip, such as a register, cache, etc. The storage unit may also be a storage unit located outside the chip in the wireless access device, such as Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (random access memory, RAM), etc.

Specifically, please refer to Figure 16. Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip can be represented as a neural network processor NPU 1600. The NPU 1600 serves as a co-processor and is mounted to the host CPU (Host CPU). ), tasks are allocated by the Host CPU. The core part of the NPU is the arithmetic circuit 1603. The arithmetic circuit 1603 is controlled by the controller 1604 to extract the matrix data in the memory and perform multiplication operations.

In some implementations, the computing circuit 1603 includes multiple processing units (Process Engine, PE). In some implementations, arithmetic circuit 1603 is a two-dimensional systolic array. The arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1603 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit. The operation circuit takes matrix A data and matrix B from the input memory 1601 to perform matrix operations, and the partial result or final result of the matrix is stored in an accumulator (accumulator) 1608 .

The unified memory 1606 is used to store input data and output data. The weight data directly passes through the storage unit access controller (Direct Memory Access Controller, DMAC) 1605, and the DMAC is transferred to the weight memory 1602. Input data is also transferred to unified memory 1606 via DMAC.

BIU is the Bus Interface Unit, that is, the bus interface unit 1613, which is used for the interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1609.

The bus interface unit 1613 (Bus Interface Unit, BIU for short) is used to fetch the memory 1609 to obtain instructions from the external memory, and is also used for the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 or the weight data to the weight memory 1602 or the input data to the input memory 1601 .

The vector calculation unit 1607 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of predicted label planes, etc.

In some implementations, vector calculation unit 1607 can store the processed output vectors to unified memory 1606 . For example, the vector calculation unit 1607 can apply a linear function; or a nonlinear function to the output of the operation circuit 1603, such as linear interpolation on the prediction label plane extracted by the convolution layer, or a vector of accumulated values, to generate an activation value. . In some implementations, vector calculation unit 1607 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1603, such as for use in a subsequent layer in a neural network.

The instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604;

The unified memory 1606, the input memory 1601, the weight memory 1602 and the fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.

The processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.

In addition, it should be noted that the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physically separate. The physical unit can be located in one place, or it can be distributed across multiple network units. You can select some or all of the modules according to actual needs to implement The purpose of this embodiment is achieved. In addition, in the drawings of the device embodiments provided in this application, the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.

Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus necessary general hardware. Of course, it can also be implemented by dedicated hardware including dedicated integrated circuits, dedicated CPUs, dedicated memories, Special components, etc. to achieve. In general, all functions performed by computer programs can be easily implemented with corresponding hardware. Moreover, the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits or special-purpose circuits. circuit etc. However, for this application, software program implementation is a better implementation in most cases. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence or that contributes to the existing technology. The computer software product is stored in a readable storage medium, such as a computer floppy disk. , U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk, etc., including several instructions to cause a computer device (which can be a personal computer, training device, or network device, etc.) to execute the steps described in various embodiments of the present application. method.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, the computer instructions may be transferred from a website, computer, training device, or data The center transmits to another website site, computer, training equipment or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or a data center integrated with one or more available media. The available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (Solid State Disk, SSD)), etc.

Claims

A video processing method, characterized in that the method includes:

Obtain the current video frame and the motion vector used in the decoding process of the current video frame;

Based on the motion vector, the feature information of the reference video frame of the current video frame is transformed to obtain the transformed feature information. The feature information of the reference video frame is used in the super-resolution process of the reference video frame by the target model. get;

The target model performs super-resolution on the current video frame based on the transformed feature information to obtain a super-resolved current video frame.
The method according to claim 1, wherein said transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes:

The motion vector and the feature information of the reference video frame are calculated through a warping algorithm to obtain transformed feature information.
The method according to claim 1 or 2, wherein the target model performs super-resolution on the current video frame based on the transformed feature information, and obtaining the super-resolved current video frame includes:

Perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame;

The transformed feature information and the first feature are fused by the target model to obtain the second feature of the current video frame;

Feature extraction is performed on the second feature through the target model to obtain a third feature of the current video frame, and the third feature is used as the current video frame after super-resolution.
The method of claim 3, further comprising:

The third feature and the current video frame are fused through the target model to obtain a super-resolved current video frame.
The method according to claim 4, characterized in that the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
The method of claim 4, further comprising:

Feature extraction is performed on the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.
The method according to any one of claims 1 to 6, characterized in that the current video frame contains N image blocks, and the obtaining the motion vector used in the decoding process of the current video frame includes:

From the compressed video stream, obtain the motion vectors used in the decoding process of M image blocks in the current video frame, N≥2, N>M≥1;

Based on the motion vectors used in the decoding process of the M image blocks, calculate the motion vectors used in the decoding process of the N-M image blocks, or determine the preset value as the motion vector used in the decoding process of the N-M image blocks. The motion vector used.
A video processing method, characterized in that the method includes:

Obtain the current video frame and the residual information used in the decoding process of the current video frame;

The target model performs super-resolution on the current video frame based on the characteristic information of the reference video frame and the residual information to obtain the super-resolved current video frame. The characteristic information of the reference video frame is in the target Obtained from the super-resolution processing of the reference video frame by the model.
The method according to claim 8, characterized in that the target model performs super-resolution on the current video frame based on the feature information of the reference video frame and the residual information, and obtains the super-resolution The current video frame includes:

Perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame;

The feature information of the reference video frame and the first feature are fused by the target model to obtain the second feature of the current video frame;

Perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame;

The target model performs feature extraction on the third feature based on the residual information to obtain a fourth feature of the current video frame, and the fourth feature is used as the current video frame after super-resolution.
The method of claim 9, wherein the residual information includes residual information used in the decoding process of N image blocks in the current video frame, and the target model is based on the The residual information performs feature extraction on the third feature, Obtaining the fourth feature of the current video frame includes:

Use the target model to determine P image blocks whose residual information is greater than the preset residual threshold among the N image blocks, N≥2, N>P≥1;

Features of the third features corresponding to the P image blocks are extracted using the target model to obtain the fourth feature of the current video frame.
The method according to claim 9 or 10, characterized in that, the method further includes:

The fourth feature and the current video frame are fused through the target model to obtain a super-resolved current video frame.
The method according to claim 11, characterized in that the third feature, the fourth feature or the super-resolved current video frame is used as the feature information of the current video frame.
The method according to claim 11, characterized in that, the method further includes:

Feature extraction is performed on the third feature, the fourth feature or the super-resolved current video frame through the target model to obtain feature information of the current video frame.
A model training method, characterized in that the method includes:

Obtain the current video frame and the motion vector used in the decoding process of the current video frame;

Based on the motion vector, the feature information of the reference video frame of the current video frame is transformed to obtain the transformed feature information. The feature information of the reference video frame is used in the super-resolution process of the reference video frame by the model to be trained. get in;

The current video frame is super-resolved by the to-be-trained model based on the transformed feature information to obtain the super-resolved current video frame;

Based on the current video frame after the super-resolution and the current video frame after the real super-resolution, a target loss is obtained, and the target loss is used to indicate the current video frame after the super-resolution and the current video frame after the real super-resolution. differences between;

The parameters of the model to be trained are updated based on the target loss until the model training conditions are met, and the target model is obtained.
A model training method, characterized in that the method includes:

Obtain the current video frame and the residual information used in the decoding process of the current video frame;

The model to be trained performs super-resolution on the current video frame based on the feature information of the reference video frame and the residual information to obtain the super-resolved current video frame. The feature information of the reference video frame is in the Obtained from the super-resolution processing of the reference video frame by the model to be trained;

Based on the current video frame after the super-resolution and the current video frame after the real super-resolution, a target loss is obtained, and the target loss is used to indicate the current video frame after the super-resolution and the current video frame after the real super-resolution. differences between;

The parameters of the model to be trained are updated based on the target loss until the model training conditions are met, and the target model is obtained.
A video processing device, characterized in that the device includes:

An acquisition module, used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;

A transformation module, configured to transform the feature information of a reference video frame of the current video frame based on the motion vector to obtain transformed feature information. The feature information of the reference video frame is used in the target model to transform the reference video frame. Obtained during the super score process;

A super-resolution module, configured to perform super-resolution on the current video frame based on the transformed feature information through the target model to obtain a super-resolved current video frame.
A video processing device, characterized in that the device includes:

An acquisition module, used to acquire the current video frame and the residual information used in the decoding process of the current video frame;

A super-resolution module, configured to perform super-resolution on the current video frame based on the feature information of the reference video frame and the residual information through a target model to obtain the current video frame after super-resolution, and the reference video frame is Feature information is obtained in the super-resolution processing of the reference video frame by the target model.
A model training device, characterized in that the device includes:

The first acquisition module is used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;

A transformation module, configured to transform the feature information of a reference video frame of the current video frame based on the motion vector to obtain transformed feature information. The feature information of the reference video frame is used in the model to be trained to Obtained during the super-resolution process of video frames;

A super-resolution module, configured to perform super-resolution on the current video frame based on the transformed feature information through the model to be trained, to obtain The current video frame after super-resolution;

The second acquisition module is configured to obtain a target loss based on the current video frame after the super-resolution and the current video frame after the real super-resolution, where the target loss is used to indicate the current video frame after the super-resolution and the real super-resolution. The difference between the divided current video frames;

An update module, configured to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.
A model training device, characterized in that the device includes:

The first acquisition module is used to acquire the current video frame and the residual information used in the decoding process of the current video frame;

A super-resolution module, configured to perform super-resolution on the current video frame based on the characteristic information of the reference video frame and the residual information through the model to be trained, and obtain the current video frame after super-resolution, the reference video frame The feature information of is obtained in the super-resolution processing of the reference video frame by the model to be trained;

The second acquisition module is used to obtain a target loss based on the current video frame after the super-resolution and the current video frame after the real super-resolution, where the target loss is used to indicate the current video frame after the super-resolution and the real super-resolution. The difference between the divided current video frames;

An update module, configured to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.
A video processing device, characterized in that the device includes a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the video processing The device performs the method according to any one of claims 1 to 15.
A computer storage medium, characterized in that the computer storage medium stores one or more instructions, which when executed by one or more computers cause the one or more computers to implement any of claims 1 to 15. The method described in 1.
A computer program product, characterized in that the computer program product stores instructions, which when executed by a computer, cause the computer to implement the method described in any one of claims 1 to 15.