WO2024046144A1 - Video processing method and related device

Video processing method and related device

Info

Publication number
WO2024046144A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
current video
super-resolution
feature
Prior art date
Application number
PCT/CN2023/113745
Other languages
English (en)
Chinese (zh)
Inventor
郭佳明
邹学益
刘毅
张恒胜
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024046144A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/137 Motion inside a coding unit, e.g. average field, frame or block difference
    • H04N19/139 Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234309 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4

Definitions

  • This application relates to artificial intelligence (AI) technology, and in particular to a video processing method and related equipment.
  • In the related art, the current video frame and its reference video frame are input into a neural network model, so that the neural network model performs super-resolution reconstruction (also called super-resolution) of the current video frame based on the reference video frame and obtains the super-resolved current video frame.
  • However, the neural network model uses only the reference video frame itself as its reference, so the factors it considers are relatively limited.
  • As a result, the super-resolved current video frame output by the neural network model is not of high quality (it cannot reach an ideal resolution), the image quality of the entire super-resolved video stream is still not good enough, and the user experience is poor.
  • Embodiments of the present application provide a video processing method and related equipment, which have a good super-resolution effect on video frames in the video stream, so that the entire video stream after super-resolution has good image quality, thereby improving user experience.
  • a first aspect of the embodiments of the present application provides a video processing method, which method includes:
  • In this method, the current video frame and the motion vector used in the decoding process of the current video frame can be obtained first.
  • Then, the motion vector used in the decoding process of the current video frame can be used to transform the feature information of the reference video frame, to obtain the transformed feature information, that is, the feature information of the reference video frame aligned to the current video frame. It should be noted that the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model; for that process, refer to the subsequent description of the super-resolution of the current video frame by the target model, which is not expanded upon here.
  • Next, the transformed feature information and the current video frame can be input into the target model (for example, a trained recurrent neural network model), so that the target model performs super-resolution reconstruction of the current video frame based on the transformed feature information, obtaining the super-resolved current video frame.
  • In this way, after the current video frame and the motion vector used in its decoding process are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model. The current video frame can then be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • The target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and that transformed feature information is obtained by transforming the feature information of the reference video frame with the motion vector used in the decoding process of the current video frame. Therefore, when the target model super-resolves the current video frame, it considers not only the information of the reference video frame itself but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so the entire super-resolved video stream has good image quality, thereby improving the user experience.
  • Transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes: calculating the motion vector and the feature information of the reference video frame through a warping algorithm to obtain the transformed feature information.
  • In the foregoing implementation, the motion vector used in the decoding process of the current video frame and the feature information of the reference video frame can be calculated through a warping algorithm (for example, bilinear interpolation, bicubic interpolation, etc.), so as to accurately obtain the transformed feature information, as the sketch below illustrates.
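  • The sketch below shows, under assumed tensor layouts, how decoder motion vectors could warp the reference frame's feature map with bilinear sampling (one of the warping algorithms mentioned above). The function name, the (N, 2, H, W) motion-vector layout and the use of PyTorch's grid_sample are illustrative choices, not the patented implementation; bicubic sampling could be substituted where supported.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_ref: torch.Tensor, mv: torch.Tensor) -> torch.Tensor:
    """Warp reference-frame features toward the current frame with per-pixel motion vectors."""
    n, _, h, w = feat_ref.shape
    # Base sampling grid in pixel coordinates (x to the right, y downward).
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1)                   # (H, W, 2)
    grid = grid.unsqueeze(0) + mv.permute(0, 2, 3, 1)      # displace by (dx, dy): (N, H, W, 2)
    # Normalize to [-1, 1], the coordinate range expected by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(feat_ref, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```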
  • Super-resolving the current video frame based on the transformed feature information through the target model to obtain the super-resolved current video frame includes: performing feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fusing the transformed feature information and the first feature through the target model to obtain the second feature of the current video frame; and performing feature extraction on the second feature through the target model to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • In other words, the target model can first perform feature extraction on the current video frame, thereby obtaining the first feature of the current video frame.
  • After obtaining the first feature of the current video frame, the target model can fuse the transformed feature information and the first feature, thereby obtaining the second feature of the current video frame. After obtaining the second feature, the target model can continue to perform feature extraction on the second feature, thereby obtaining the third feature of the current video frame.
  • The target model can directly use the third feature as the super-resolved current video frame and output it externally.
  • the method further includes: fusing the third feature and the current video frame through the target model to obtain the super-resolved current video frame.
  • the target model can fuse the third feature of the current video frame and the current video frame, thereby obtaining and outputting the super-resolved current video frame.
  • The third feature or the super-resolved current video frame is used as the feature information of the current video frame.
  • That is, the target model can obtain the feature information of the current video frame in several ways: after obtaining the third feature of the current video frame, the target model can directly use the third feature as the feature information of the current video frame and output it for use in the super-resolution of the next video frame; or, after obtaining the super-resolved current video frame, the target model can directly use the super-resolved current video frame as the feature information of the current video frame and output it for use in the super-resolution of the next video frame.
  • The method further includes: performing feature extraction on the third feature or on the super-resolved current video frame through the target model, to obtain the feature information of the current video frame.
  • That is, the target model can also obtain the feature information of the current video frame in the following ways: after obtaining the third feature of the current video frame, the target model can continue to perform feature extraction on the third feature, thereby obtaining the feature information of the current video frame; or, after obtaining the super-resolved current video frame, the target model can continue to perform feature extraction on the super-resolved current video frame, thereby obtaining the feature information of the current video frame. A sketch of this overall per-frame flow follows below.
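  • The following sketch puts the per-frame flow just described into code: feature extraction yields the first feature, fusion with the warped reference features yields the second feature, further extraction yields the third feature, and the output is fused with the upsampled current frame, with the third feature doubling as the feature information handed to the next frame. Layer types, channel counts, the ×2 scale and the class name SRStep are assumptions for illustration only, not the patented network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRStep(nn.Module):
    """One recurrent super-resolution step: extract -> fuse -> extract -> reconstruct."""
    def __init__(self, channels: int = 64, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.extract1 = nn.Conv2d(3, channels, 3, padding=1)          # -> first feature
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)   # -> second feature
        self.extract2 = nn.Conv2d(channels, channels, 3, padding=1)   # -> third feature
        self.to_rgb = nn.Sequential(                                  # upsampling head
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, frame: torch.Tensor, warped_ref_feat: torch.Tensor):
        first = F.relu(self.extract1(frame))                                    # first feature
        second = F.relu(self.fuse(torch.cat([first, warped_ref_feat], dim=1)))  # second feature
        third = F.relu(self.extract2(second))                                   # third feature
        # Fuse with the (bilinearly upsampled) current frame to get the super-resolved output.
        sr = self.to_rgb(third) + F.interpolate(
            frame, scale_factor=self.scale, mode="bilinear", align_corners=False)
        # `third` doubles as the feature information passed to the next frame's step.
        return sr, third
```

  • For the first frame of a stream, which has no reference, a tensor of zeros can stand in for the warped reference features; for later frames the warping sketch above would supply them.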
  • The current video frame contains N image blocks, and obtaining the motion vector used in the decoding process of the current video frame includes: obtaining, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, where N ≥ 2 and N > M ≥ 1; and, based on the motion vectors used in the decoding process of the M image blocks, calculating the motion vectors used in the decoding process of the remaining N-M image blocks, or determining a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • That is, the compressed video stream only provides the motion vectors corresponding to the M image blocks. Since the compressed video stream does not provide the motion vectors corresponding to the remaining N-M image blocks of the current video frame, the motion vectors of these N-M image blocks can be obtained in the following ways: a preset value is used directly as their motion vectors, or their motion vectors are calculated from the motion vectors of the M image blocks. After the motion vectors corresponding to the N-M image blocks are obtained, the motion vectors corresponding to the M image blocks derived from the compressed video stream can be used as the motion vectors used in the decoding process of the M image blocks in the current video frame, as the sketch below illustrates.
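  • Below is a small illustrative helper, not taken from the patent text, for completing the per-block motion-vector field: blocks whose motion vectors are present in the compressed stream keep them, and the remaining N-M blocks either receive a preset value (zero motion here) or are derived from the known blocks by averaging known neighbours (one of several possible derivations). The function name and the dictionary input format are hypothetical.

```python
import numpy as np

def complete_block_mvs(block_mvs: dict, grid_h: int, grid_w: int,
                       preset=(0.0, 0.0), interpolate: bool = False) -> np.ndarray:
    """block_mvs maps (row, col) of an image block to its (dx, dy) from the stream."""
    mv = np.zeros((grid_h, grid_w, 2), dtype=np.float32)
    mv[:] = preset                                   # default: the preset value (zero motion)
    known = np.zeros((grid_h, grid_w), dtype=bool)
    for (r, c), v in block_mvs.items():              # keep the M vectors from the stream
        mv[r, c] = v
        known[r, c] = True
    if interpolate:
        # Derive missing vectors from the known ones, e.g. by averaging known neighbours.
        for r in range(grid_h):
            for c in range(grid_w):
                if not known[r, c]:
                    nbrs = [mv[rr, cc] for rr, cc in
                            ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                            if 0 <= rr < grid_h and 0 <= cc < grid_w and known[rr, cc]]
                    if nbrs:
                        mv[r, c] = np.mean(nbrs, axis=0)
    return mv
```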
  • a second aspect of the embodiment of the present application provides a video processing method, which method includes:
  • the current video frame and the residual information used in the decoding process of the current video frame can be obtained first.
  • The feature information of the reference video frame can also be obtained, and the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input into the target model, so that the target model super-resolves the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame, obtaining the super-resolved current video frame.
  • The feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model; for that process, refer to the subsequent description of the super-resolution of the current video frame by the target model, which is not expanded upon here.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • Super-resolving the current video frame through the target model based on the feature information of the reference video frame and the residual information, to obtain the super-resolved current video frame, includes: performing feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fusing the feature information of the reference video frame and the first feature through the target model to obtain the second feature of the current video frame; performing feature extraction on the second feature through the target model to obtain the third feature of the current video frame; and performing feature extraction on the third feature through the target model based on the residual information to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • In other words, the target model can first perform feature extraction on the current video frame, thereby obtaining the first feature of the current video frame. After obtaining the first feature, the target model fuses the feature information of the reference video frame and the first feature, thereby obtaining the second feature of the current video frame. After obtaining the second feature, the target model can continue to perform feature extraction on the second feature, thereby obtaining the third feature of the current video frame.
  • Then, the target model can continue to perform feature extraction on the third feature of the current video frame based on the residual information used in the decoding process of the current video frame, thereby obtaining the fourth feature of the current video frame.
  • Finally, the target model can use the fourth feature as the super-resolved current video frame and output it externally.
  • The residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and performing feature extraction on the third feature based on the residual information through the target model to obtain the fourth feature of the current video frame includes: determining, through the target model, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, where N ≥ 2 and N > P ≥ 1; and performing, through the target model, feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • That is, the current video frame can be divided into N image blocks, so the residual information used in the decoding process of the current video frame includes the residual information used in the decoding process of the N image blocks.
  • The target model can compare, one by one, the residual information used in the decoding process of each image block with the preset residual threshold, thereby determining the P image blocks whose residual information is greater than the preset residual threshold.
  • After the P image blocks whose residual information is greater than the preset residual threshold are obtained, the target model can perform feature extraction on the part of the third feature corresponding to the P image blocks, while the part of the third feature corresponding to the remaining N-P image blocks remains unchanged, thereby obtaining the fourth feature of the current video frame. A sketch of this residual-gated refinement follows below.
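  • A hedged sketch of the residual-gated refinement described above: per-block residual energy is compared against a threshold, and only the parts of the third feature belonging to the P blocks above the threshold are re-processed, while the rest is carried over unchanged. Block size, threshold value and the refinement layer are assumptions; for clarity the sketch computes the refinement densely and masks it, whereas an efficiency-oriented implementation would process only the active blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def residual_gated_refine(third: torch.Tensor, residual: torch.Tensor,
                          refine: nn.Module, block: int = 16,
                          threshold: float = 0.05) -> torch.Tensor:
    """Refine only the parts of `third` belonging to image blocks with large residuals."""
    # One mean-absolute-residual value per image block.
    energy = F.avg_pool2d(residual.abs().mean(dim=1, keepdim=True), kernel_size=block)
    mask = (energy > threshold).float()                      # 1 marks the P "active" blocks
    mask = F.interpolate(mask, size=third.shape[-2:], mode="nearest")
    # Sketch-only simplification: refine everywhere, then keep refined values on active
    # blocks and the original third feature elsewhere.
    refined = refine(third)
    return mask * refined + (1.0 - mask) * third
```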
  • the method further includes: fusing the fourth feature and the current video frame through the target model to obtain the super-resolved current video frame.
  • the target model can fuse the fourth feature of the current video frame and the current video frame to obtain the super-resolved current video frame.
  • The third feature, the fourth feature or the super-resolved current video frame is used as the feature information of the current video frame.
  • That is, the target model can obtain the feature information of the current video frame in several ways: after obtaining the third feature of the current video frame, the target model can directly use the third feature as the feature information of the current video frame and output it for use in the super-resolution of the next video frame; after obtaining the fourth feature of the current video frame, the target model can directly use the fourth feature as the feature information of the current video frame and output it; or, after obtaining the super-resolved current video frame, the target model can directly use the super-resolved current video frame as the feature information of the current video frame and output it for use in the super-resolution of the next video frame.
  • The method further includes: performing feature extraction on the third feature, the fourth feature or the super-resolved current video frame through the target model, to obtain the feature information of the current video frame.
  • That is, the target model can also obtain the feature information of the current video frame in the following ways: after obtaining the third feature of the current video frame, the target model can continue to perform feature extraction on the third feature, thereby obtaining the feature information of the current video frame; after obtaining the fourth feature of the current video frame, the target model can continue to perform feature extraction on the fourth feature, thereby obtaining the feature information of the current video frame; or, after obtaining the super-resolved current video frame, the target model can continue to perform feature extraction on the super-resolved current video frame, thereby obtaining the feature information of the current video frame.
  • The third aspect of the embodiments of the present application provides a model training method, which includes: obtaining the current video frame and the motion vector used in the decoding process of the current video frame; transforming the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the model to be trained; super-resolving the current video frame through the model to be trained based on the transformed feature information to obtain the super-resolved current video frame; obtaining a target loss based on the super-resolved current video frame and the ground-truth super-resolved current video frame, where the target loss indicates the difference between the super-resolved current video frame and the ground-truth super-resolved current video frame; and updating the parameters of the model to be trained based on the target loss until the model training conditions are met, thereby obtaining the target model. A minimal training-loop sketch follows below.
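  • A minimal training-loop sketch under the description above: the model super-resolves the current frame with the warped reference features, the target loss measures the gap to the ground-truth high-resolution frame, and the parameters are updated until a stopping condition (here a fixed epoch count) is met. The L1 loss, the Adam optimizer and the data-loader layout are assumptions, and warp_features refers to the earlier warping sketch.

```python
import torch
import torch.nn.functional as F

def train(model, data_loader, num_epochs: int = 10, lr: float = 1e-4):
    """Train the super-resolution model against ground-truth high-resolution frames."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):                       # "until the training conditions are met"
        for lr_frame, mv, ref_feat, hr_frame in data_loader:
            warped = warp_features(ref_feat, mv)      # transform the reference-frame features
            sr_frame, _ = model(lr_frame, warped)     # super-resolve the current video frame
            loss = F.l1_loss(sr_frame, hr_frame)      # target loss vs. the ground-truth frame
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```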
  • The target model trained by the above method has the ability to super-resolve video frames. Specifically, after the current video frame and the motion vector used in its decoding process are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model. The current video frame can then be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • The target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and that transformed feature information is obtained by transforming the feature information of the reference video frame with the motion vector used in the decoding process of the current video frame. Therefore, when the target model super-resolves the current video frame, it considers not only the information of the reference video frame itself but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so the entire super-resolved video stream has good image quality, thereby improving the user experience.
  • Transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes: calculating the motion vector and the feature information of the reference video frame through a warping algorithm to obtain the transformed feature information.
  • Super-resolving the current video frame through the model to be trained based on the transformed feature information to obtain the super-resolved current video frame includes: performing feature extraction on the current video frame through the model to be trained to obtain the first feature of the current video frame; fusing the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; and performing feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • the method further includes: fusing the third feature and the current video frame through a model to be trained to obtain the super-resolved current video frame.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the method further includes: extracting features of the third feature or the current video frame after super-resolution through the model to be trained, to obtain feature information of the current video frame.
  • The current video frame contains N image blocks, and obtaining the motion vector used in the decoding process of the current video frame includes: obtaining, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, where N ≥ 2 and N > M ≥ 1; and, based on the motion vectors used in the decoding process of the M image blocks, calculating the motion vectors used in the decoding process of the remaining N-M image blocks, or determining a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • The fourth aspect of the embodiments of the present application provides a model training method, which includes: obtaining the current video frame and the residual information used in the decoding process of the current video frame; super-resolving the current video frame through the model to be trained based on the feature information of the reference video frame and the residual information, to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the model to be trained; obtaining a target loss based on the super-resolved current video frame and the ground-truth super-resolved current video frame, where the target loss indicates the difference between the super-resolved current video frame and the ground-truth super-resolved current video frame; and updating the parameters of the model to be trained based on the target loss until the model training conditions are met, thereby obtaining the target model.
  • The target model trained by the above method has the ability to super-resolve video frames. Specifically, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is then super-resolved through the target model based on the feature information of the reference video frame and the residual information, obtaining the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • the target model not only considers the information of the reference video frame itself, but also considers the difference in pixel values between the reference video frame and the current video frame.
  • the factors considered are relatively comprehensive. Therefore, the current video frame after super-resolution finally output by the target model is of high enough quality (with a relatively ideal resolution), so that the entire video stream after super-resolution has good image quality, thereby improving the user experience.
  • Super-resolving the current video frame through the model to be trained based on the feature information of the reference video frame and the residual information, to obtain the super-resolved current video frame, includes: performing feature extraction on the current video frame through the model to be trained to obtain the first feature of the current video frame; fusing the feature information of the reference video frame and the first feature through the model to be trained to obtain the second feature of the current video frame; performing feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame; and performing feature extraction on the third feature through the model to be trained based on the residual information to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • The residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and performing feature extraction on the third feature based on the residual information through the model to be trained to obtain the fourth feature of the current video frame includes: determining, through the model to be trained, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, where N ≥ 2 and N > P ≥ 1; and performing feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • the method further includes: fusing the fourth feature and the current video frame through the model to be trained to obtain the super-resolved current video frame.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the method further includes: performing feature extraction on the third feature, the fourth feature or the current video frame after super-resolution through the model to be trained, to obtain the feature information of the current video frame.
  • the fifth aspect of the embodiment of the present application provides a video processing device.
  • The device includes: an acquisition module, configured to acquire the current video frame and the motion vector used in the decoding process of the current video frame; a transformation module, configured to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model; and a super-resolution module, configured to super-resolve the current video frame through the target model based on the transformed feature information to obtain the super-resolved current video frame.
  • With the above device, after the current video frame and the motion vector used in its decoding process are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model. The current video frame can then be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • The target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and that transformed feature information is obtained by transforming the feature information of the reference video frame with the motion vector used in the decoding process of the current video frame. Therefore, when the target model super-resolves the current video frame, it considers not only the information of the reference video frame itself but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so the entire super-resolved video stream has good image quality, thereby improving the user experience.
  • the transformation module is used to calculate the motion vector and the feature information of the reference video frame through a warping algorithm to obtain transformed feature information.
  • The super-resolution module is configured to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the transformed feature information and the first feature through the target model to obtain the second feature of the current video frame; and perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • the super-resolution module is also used to fuse the third feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module is also used to extract features of the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.
  • The acquisition module is configured to: acquire, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, where N ≥ 2 and N > M ≥ 1; and, based on the motion vectors used in the decoding process of the M image blocks, calculate the motion vectors used in the decoding process of the remaining N-M image blocks, or determine a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • the sixth aspect of the embodiment of the present application provides a video processing device.
  • The device includes: an acquisition module, configured to acquire the current video frame and the residual information used in the decoding process of the current video frame; and a super-resolution module, configured to super-resolve the current video frame through the target model based on the feature information of the reference video frame and the residual information, to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • With this device, the current video frame and the residual information used in its decoding process are obtained, and the current video frame is super-resolved through the target model based on the feature information of the reference video frame and the residual information, obtaining the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • The super-resolution module is configured to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the feature information of the reference video frame and the first feature through the target model to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame; and perform feature extraction on the third feature through the target model based on the residual information to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • The residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the super-resolution module is configured to: determine, through the target model, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, where N ≥ 2 and N > P ≥ 1; and perform, through the target model, feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • the super-resolution module is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module is also used to extract features of the third feature, the fourth feature or the current video frame after super-resolution through the target model, and obtain the feature information of the current video frame.
  • the seventh aspect of the embodiment of the present application provides a model training device.
  • The device includes: a first acquisition module, configured to acquire the current video frame and the motion vector used in the decoding process of the current video frame; a transformation module, configured to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the model to be trained; a super-resolution module, configured to super-resolve the current video frame through the model to be trained based on the transformed feature information to obtain the super-resolved current video frame; a second acquisition module, configured to obtain a target loss based on the super-resolved current video frame and the ground-truth super-resolved current video frame, where the target loss indicates the difference between the super-resolved current video frame and the ground-truth super-resolved current video frame; and an update module, configured to update the parameters of the model to be trained based on the target loss until the model training conditions are met, thereby obtaining the target model.
  • The target model trained by the above device has the ability to super-resolve video frames. Specifically, after the current video frame and the motion vector used in its decoding process are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model. The current video frame can then be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • The target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and that transformed feature information is obtained by transforming the feature information of the reference video frame with the motion vector used in the decoding process of the current video frame. Therefore, when the target model super-resolves the current video frame, it considers not only the information of the reference video frame itself but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so the entire super-resolved video stream has good image quality, thereby improving the user experience.
  • the transformation module is used to calculate the motion vector and the feature information of the reference video frame through a warping algorithm to obtain transformed feature information.
  • The super-resolution module is configured to: perform feature extraction on the current video frame through the model to be trained to obtain the first feature of the current video frame; fuse the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; and perform feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • the super-resolution module is also used to fuse the third feature and the current video frame through the model to be trained to obtain the current video frame after super-resolution.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module is also used to extract the third feature or the current video frame after super-resolution through the model to be trained, so as to obtain the feature information of the current video frame.
  • The acquisition module is configured to: acquire, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, where N ≥ 2 and N > M ≥ 1; and, based on the motion vectors used in the decoding process of the M image blocks, calculate the motion vectors used in the decoding process of the remaining N-M image blocks, or determine a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • The eighth aspect of the embodiments of the present application provides a model training device, which includes: a first acquisition module, configured to acquire the current video frame and the residual information used in the decoding process of the current video frame; a super-resolution module, configured to super-resolve the current video frame through the model to be trained based on the feature information of the reference video frame and the residual information, to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the model to be trained; and a second acquisition module, configured to obtain a target loss based on the super-resolved current video frame and the ground-truth super-resolved current video frame, where the target loss indicates the difference between the super-resolved current video frame and the ground-truth super-resolved current video frame.
  • the update module is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.
  • The target model trained by the above device has the ability to super-resolve video frames. Specifically, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is then super-resolved through the target model based on the feature information of the reference video frame and the residual information, obtaining the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • The super-resolution module is configured to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the feature information of the reference video frame and the first feature through the target model to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame; and perform feature extraction on the third feature through the target model based on the residual information to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • The residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the super-resolution module is configured to: determine, through the target model, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, where N ≥ 2 and N > P ≥ 1; and perform, through the target model, feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • the super-resolution module is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module is also used to extract features of the third feature, the fourth feature or the current video frame after super-resolution through the target model, and obtain the feature information of the current video frame.
  • A ninth aspect of the embodiments of the present application provides a video processing device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code.
  • When the code is executed, the video processing device performs the method described in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect.
  • A tenth aspect of the embodiments of the present application provides a model training device, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code.
  • When the code is executed, the model training device performs the method described in the third aspect, any possible implementation of the third aspect, the fourth aspect, or any possible implementation of the fourth aspect.
  • An eleventh aspect of the embodiments of the present application provides a circuit system. The circuit system includes a processing circuit configured to perform the method described in the first aspect, any possible implementation of the first aspect, the second aspect, any possible implementation of the second aspect, the third aspect, any possible implementation of the third aspect, the fourth aspect, or any possible implementation of the fourth aspect.
  • A twelfth aspect of the embodiments of the present application provides a chip system. The chip system includes a processor, configured to call a computer program or computer instructions stored in a memory, so that the processor performs the method described in the first aspect, any possible implementation of the first aspect, the second aspect, any possible implementation of the second aspect, the third aspect, any possible implementation of the third aspect, the fourth aspect, or any possible implementation of the fourth aspect.
  • the processor is coupled to the memory through an interface.
  • the chip system further includes a memory, and computer programs or computer instructions are stored in the memory.
  • A thirteenth aspect of the embodiments of the present application provides a computer storage medium. The computer storage medium stores a computer program which, when executed by a computer, causes the computer to implement the method described in the first aspect, any possible implementation of the first aspect, the second aspect, any possible implementation of the second aspect, the third aspect, any possible implementation of the third aspect, the fourth aspect, or any possible implementation of the fourth aspect.
  • A fourteenth aspect of the embodiments of the present application provides a computer program product. The computer program product stores instructions which, when executed by a computer, cause the computer to implement the method described in the first aspect, any possible implementation of the first aspect, the second aspect, any possible implementation of the second aspect, the third aspect, any possible implementation of the third aspect, the fourth aspect, or any possible implementation of the fourth aspect.
  • In the above aspects, after the current video frame and the motion vector used in its decoding process are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution of the reference video frame by the target model. The current video frame can then be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • The target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and that transformed feature information is obtained by transforming the feature information of the reference video frame with the motion vector used in the decoding process of the current video frame. Therefore, when the target model super-resolves the current video frame, it considers not only the information of the reference video frame itself but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so the entire super-resolved video stream has good image quality, thereby improving the user experience.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence
  • Figure 2a is a schematic structural diagram of a video processing system provided by an embodiment of the present application.
  • Figure 2b is another structural schematic diagram of the video processing system provided by the embodiment of the present application.
  • Figure 2c is a schematic diagram of video processing related equipment provided by the embodiment of the present application.
  • Figure 3 is a schematic diagram of the architecture of the system 100 provided by the embodiment of the present application.
  • Figure 4 is a schematic flow chart of the video processing method provided by the embodiment of the present application.
  • Figure 5 is a schematic structural diagram of the target model provided by the embodiment of the present application.
  • Figure 6 is another schematic flowchart of a video processing method provided by an embodiment of the present application.
  • Figure 7 is another structural schematic diagram of the target model provided by the embodiment of the present application.
  • Figure 8 is a schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 9 is another schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present application.
  • Figure 11 is another structural schematic diagram of a video processing device provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of the model training device provided by the embodiment of the present application.
  • Figure 13 is another structural schematic diagram of the model training device provided by the embodiment of the present application.
  • Figure 14 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of the training equipment provided by the embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Embodiments of the present application provide a video processing method and related equipment, which have a good super-resolution effect on video frames in the video stream, so that the entire video stream after super-resolution has good image quality, thereby improving user experience.
  • the naming or numbering of steps in this application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering.
  • the execution order of process steps that have been named or numbered can be changed according to the technical purpose to be achieved, as long as the same or a similar technical effect can be achieved.
  • the division of units presented in this application is a logical division. In actual applications, there may be other ways of division; for example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the units shown or discussed may be implemented through certain interfaces, and the indirect coupling or communication connection between units may be electrical or take other similar forms; this is not restricted in this application.
  • the units or subunits described as separate components may or may not be physically separated and may or may not be physical units, or they may be distributed into multiple circuit units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this application.
  • the current video frame in the video stream to be super-resolved (which can be any video frame in the video stream to be super-resolved) and the reference video frame (for example, the previous video frame and/or the subsequent video frame of the current video frame, etc.) can be input to a neural network model, so that the neural network model super-resolves the current video frame based on the reference video frame and obtains the super-resolved current video frame.
  • the neural network model can also be used to perform the same operations on the remaining video frames as on the current video frame, so that each super-resolved video frame, that is, the entire super-resolved video stream, can be obtained.
  • However, the neural network model only uses the reference video frame itself as the reference, so the factors it considers are relatively limited.
  • As a result, the super-resolved current video frame output by the model is not of high quality (it does not reach the ideal resolution), so the image quality of the entire super-resolved video stream is still not good enough (it does not reach the ideal quality and resolution), resulting in a poor user experience.
  • In addition, the neural network model needs to perform a series of processing operations on all image blocks contained in the entire current video frame one by one, which requires a large amount of computation, so the aforementioned neural-network-based video processing method is difficult to apply to small devices with limited computing power (for example, smartphones, smart watches, etc.).
  • AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence. AI technology obtains the best results by perceiving the environment, acquiring knowledge and using knowledge.
  • artificial intelligence technology is a branch of computer science that attempts to understand the nature of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Using artificial intelligence for data processing is a common application method of artificial intelligence.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence.
  • The above artificial intelligence framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. During this process, the data It has gone through the condensation process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • the basic platform includes related platform guarantees and support such as a distributed computing framework and networks, and can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data at the layer above the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be further formed based on the results of the data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application fields mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • FIG. 2a is a schematic structural diagram of a video processing system provided by an embodiment of the present application.
  • the video processing system includes user equipment and data processing equipment.
  • user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers.
  • the user equipment is the initiator of video processing. As the initiator of the video processing request, the user usually initiates the request through the user equipment.
  • the above-mentioned data processing equipment may be a cloud server, a network server, an application server, a management server, and other equipment or servers with data processing functions.
  • the data processing device receives the video processing request from the smart terminal through the interactive interface, and then performs information processing in the form of machine learning, deep learning, search, reasoning, decision-making, etc. through the memory that stores the data and the processor that processes the data.
  • the memory in the data processing device can be a general term, including local storage and a database that stores historical data.
  • the database can be on the data processing device or on other network servers.
  • the user equipment can receive the user's instructions. For example, the user equipment can obtain the compressed video stream input or selected by the user and then initiate a request to the data processing equipment, so that the data processing equipment executes a video processing application on the compressed video stream obtained by the user equipment, thereby obtaining a processed video stream.
  • the user equipment can obtain the compressed video stream selected by the user and initiate a processing request for the compressed video stream to the data processing device.
  • the data processing device first obtains the compressed video stream (a low-quality, low-resolution video stream) and decodes the compressed video stream to restore the video stream to be super-resolved (which can also be called a decompressed video stream; it is still a low-quality, low-resolution video stream, but contains a larger number of video frames).
  • the data processing device can then perform super-resolution processing on the video stream to be super-resolved, thereby obtaining a super-resolved video stream (a high-quality, high-resolution video stream), and return the super-resolved video stream to the user equipment for the user to view and use.
  • the data processing device can execute the video processing method according to the embodiment of the present application.
  • Figure 2b is another structural schematic diagram of a video processing system provided by an embodiment of the present application.
  • the user equipment directly serves as the data processing equipment. The user equipment can directly obtain the input from the user, and the input is processed directly by the hardware of the user equipment itself; the specific process is similar to that of Figure 2a, so refer to the above description, which will not be repeated here.
  • the user equipment can receive the user's instructions. For example, the user equipment can obtain the compressed video stream selected by the user, then the user equipment itself obtains the compressed video stream (a low-quality, low-resolution video stream) and decodes the compressed video stream to restore the video stream to be super-resolved (which can also be called a decompressed video stream; it is still a low-quality, low-resolution video stream, but contains a larger number of video frames). Then, the user equipment can perform super-resolution processing on the video stream to be super-resolved, thereby obtaining a super-resolved video stream (a high-quality, high-resolution video stream) for the user to watch and use.
  • the user equipment itself can execute the video processing method according to the embodiment of the present application.
  • Figure 2c is a schematic diagram of video processing related equipment provided by the embodiment of the present application.
  • the user equipment in Figure 2a and Figure 2b can be the local device 301 or the local device 302 in Figure 2c
  • the data processing device in Figure 2a can be the execution device 210 in Figure 2c
  • the data storage system 250 can store the data to be processed by the execution device 210; the data storage system 250 can be integrated on the execution device 210, or can be set up on the cloud or on another network server.
  • the processors in Figure 2a and Figure 2b can perform data training/machine learning/deep learning through neural network models or other models (for example, models based on support vector machines), and use the finally trained or learned model to execute the video processing application on the video, thereby obtaining the corresponding processing results.
  • Figure 3 is a schematic diagram of the architecture of the system 100 provided by the embodiment of the present application.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: various to-be-scheduled tasks, callable resources, and other parameters.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing (such as implementing the function of the neural network in this application), the execution device 110 can call the data, codes, etc. in the data storage system 150 for the corresponding processing, and can also store the data, instructions, etc. obtained by the corresponding processing into the data storage system 150.
  • the I/O interface 112 returns the processing results to the client device 140, thereby providing them to the user.
  • the training device 120 can generate corresponding target models/rules based on different training data for different goals or different tasks, and the corresponding target models/rules can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • the training data may be stored in the database 130 and come from training samples collected by the data collection device 160 .
  • the user can manually enter the input data; this manual input can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If requiring the client device 140 to automatically send input data requires the user's authorization, the user can set corresponding permissions in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
  • the client device 140 can also be used as a data collection end to collect the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as new sample data, and store them in the database 130 .
  • Alternatively, as shown in the figure, the I/O interface 112 can directly store the input data input to the I/O interface 112 and the output results output from the I/O interface 112 into the database 130 as new sample data.
  • Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 can also be placed in the execution device 110.
  • the neural network can be obtained by training with the training device 120.
  • An embodiment of the present application also provides a chip, which includes a neural network processor NPU.
  • the chip can be disposed in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111.
  • the chip can also be installed in the training device 120 as shown in Figure 3 to complete the training work of the training device 120 and output the target model/rules.
  • The neural network processor (NPU) is mounted as a co-processor on the main central processing unit (host CPU), and the host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit.
  • the controller controls the arithmetic circuit to extract the data in the memory (weight memory or input memory) and perform operations.
  • the computing circuit includes multiple processing units (PE).
  • the arithmetic circuit is a two-dimensional systolic array.
  • the arithmetic circuit may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory, performs a matrix operation on it with matrix B, and stores the partial result or final result of the obtained matrix in the accumulator.
  • the vector calculation unit can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector computing unit can be used for network calculations in non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
  • the vector computation unit can store the processed output vector into a unified buffer.
  • the vector calculation unit may apply a nonlinear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to an arithmetic circuit, such as for use in a subsequent layer in a neural network.
  • Unified memory is used to store input data and output data.
  • A direct memory access controller (DMAC) transfers the input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.
  • the bus interface unit (BIU) is used to realize the interaction between the main CPU, DMAC and instruction memory through the bus.
  • the instruction fetch buffer connected to the controller is used to store instructions used by the controller
  • the controller is used to call instructions cached in the memory to control the working process of the computing accelerator.
  • the unified memory, input memory, weight memory and instruction memory are all on-chip memories, and the external memory is the memory outside the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs.
  • the output of the arithmetic unit can be: h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)
  • where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. A small numerical illustration is given below.
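  • For illustration only (not part of the patent text), the following is a minimal Python sketch of the single neural unit described above, assuming a sigmoid activation and arbitrarily chosen example values for x_s, W_s and b:

    import numpy as np

    def neural_unit(x, w, b):
        # Output of one neural unit: f(sum_s W_s * x_s + b), with f = sigmoid.
        z = np.dot(w, x) + b
        return 1.0 / (1.0 + np.exp(-z))

    # Example: three inputs x_s with weights W_s and bias b (illustrative values).
    print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), b=0.3))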
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • W is a weight vector, and each value in this vector represents the weight value of a neuron in this layer of neural network.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
  • Because it is desirable for the output of the neural network to be as close as possible to the truly desired value, the predicted value of the current network can be compared with the truly desired target value, and the weight vector of each layer is then updated according to the difference between them (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer of the neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and adjustments are made continuously until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), which is an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so the training of the neural network becomes a process of reducing this loss as much as possible.
  • the neural network can use the error back propagation (BP) algorithm to modify the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
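  • For illustration only, the following toy Python sketch shows the idea of reducing a loss by repeatedly updating parameters with gradients; the single linear unit, squared-error loss, and learning rate are assumptions chosen for brevity and are not taken from the patent:

    # One linear unit y = w * x + b trained by gradient descent.
    x, target = 2.0, 7.0
    w, b, lr = 1.0, 0.0, 0.1
    for _ in range(50):
        y = w * x + b                 # forward propagation
        loss = (y - target) ** 2      # error loss between prediction and target
        dw = 2 * (y - target) * x     # gradients obtained by back propagation
        db = 2 * (y - target)
        w -= lr * dw                  # update parameters so the loss decreases
        b -= lr * db
    print(round(w * x + b, 3), round(loss, 6))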
  • the model training method provided by the embodiments of the present application involves the processing of data sequences and can be specifically applied to methods such as data training, machine learning, and deep learning, which perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training, etc. on the training data (for example, the current video frame in the model training method provided by the embodiments of the present application), and finally obtain a trained neural network.
  • the video processing method provided by the embodiments of the present application can use the above trained neural network: input data (for example, the current video frame in the video processing method provided by the embodiments of the present application) is input into the trained neural network to obtain output data.
  • The model training method and the video processing method provided in the embodiments of this application are inventions based on the same concept, and can also be understood as two parts of a system, or two stages of an overall process, such as a model training stage and a model application stage.
  • Figure 4 is a schematic flow chart of a video processing method provided by an embodiment of the present application. As shown in Figure 4, the method includes:
  • the compressed video stream can be decoded to obtain a video stream to be super-resolved.
  • the compressed video stream at least contains the first video frame, the motion vector and residual information corresponding to the second video frame, the motion vector and residual information corresponding to the third video frame, ..., the motion vector and residual information corresponding to the last video frame.
  • the first video frame can be used as the reference video frame of the second video frame, motion compensation is performed on the first video frame based on the motion vector corresponding to the second video frame to obtain an intermediate video frame, and then the residual information corresponding to the second video frame is superimposed on the intermediate video frame to obtain the second video frame.
  • the decoding of the second video frame is completed.
  • the second video frame can be used as the reference video frame of the third video frame, motion compensation is performed on the second video frame based on the motion vector corresponding to the third video frame to obtain an intermediate video frame, and then the residual information corresponding to the third video frame is superimposed on the intermediate video frame to obtain the third video frame.
  • the decoding of the third video frame is completed.
  • By analogy, the decoding of the fourth video frame, ..., and the last video frame can also be completed, which is equivalent to obtaining the first video frame, the second video frame, the third video frame, ..., and the last video frame; these multiple video frames constitute the video stream to be super-resolved. A minimal sketch of this decoding loop is given below.
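  • For illustration only, the following Python sketch shows the decoding order described above (motion compensation on the reference frame, then superimposing the residual); the motion_compensate helper is an assumed, codec-specific function and the frames are assumed to be arrays that support addition:

    def decode_stream(first_frame, per_frame_data, motion_compensate):
        # first_frame:     decoded first video frame
        # per_frame_data:  list of (motion_vector, residual) pairs for frames 2..last
        # motion_compensate: assumed helper that shifts the reference frame's image
        #                    blocks according to the motion vector
        frames = [first_frame]
        reference = first_frame
        for motion_vector, residual in per_frame_data:
            intermediate = motion_compensate(reference, motion_vector)  # predicted frame
            current = intermediate + residual                           # superimpose residual
            frames.append(current)
            reference = current   # the decoded frame becomes the next reference frame
        return frames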
  • any video frame among multiple video frames included in the video stream to be super-resolved will be schematically introduced below, and this video frame will be called the current video frame.
  • After the current video frame is decoded based on the reference video frame of the current video frame (for example, the previous video frame of the current video frame), the motion vector corresponding to the current video frame, and the residual information corresponding to the current video frame, the motion vector used in the decoding process of the current video frame can also be obtained based on the motion vector corresponding to the current video frame.
  • the motion vector used in the decoding process of the current video frame can be obtained in the following way:
  • If the current video frame contains N image blocks, the motion vector corresponding to the current video frame contains the motion vectors corresponding to the N image blocks.
  • Among them, the motion vector corresponding to the i-th image block is the difference between the position of that image block in the reference video frame and the position of the i-th image block in the current video frame, that is, the movement and change of the position of the i-th image block from the reference video frame to the current video frame.
  • In this case, the motion vectors corresponding to the N image blocks derived from the compressed video stream can be directly used as the motion vectors used in the decoding process of the N image blocks of the current video frame, that is, the motion vector used in the decoding process of the current video frame.
  • In another case, only M image blocks of the current video frame have motion vectors provided in the compressed video stream (M is less than or equal to N, and M is a positive integer greater than or equal to 1), that is, the motion vector corresponding to the current video frame provided by the compressed video stream only contains the motion vectors corresponding to these M image blocks.
  • In this case, the motion vectors corresponding to the remaining N-M image blocks can be obtained in multiple ways, for example, they can be calculated based on the motion vectors corresponding to the M image blocks, or a preset value can be determined as the motion vectors of the N-M image blocks:
  • the motion vectors corresponding to the M image blocks derived from the compressed video stream can be used as the motion vectors used in the decoding process of the M image blocks in the current video frame.
  • and the calculated (or preset) motion vectors corresponding to the N-M image blocks can be used as the motion vectors used in the decoding process of the N-M image blocks in the current video frame, which is equivalent to obtaining the motion vector used in the decoding process of the current video frame, as sketched below.
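  • For illustration only, the following Python sketch fills in per-block motion vectors when only M of the N blocks carry vectors in the stream; assigning a preset value (zero motion here) follows the text, while deriving missing vectors from neighbouring blocks is mentioned as an alternative and is not implemented:

    import numpy as np

    def fill_block_motion_vectors(n_blocks, stream_mvs, preset=(0.0, 0.0)):
        # n_blocks:   total number N of image blocks in the current video frame
        # stream_mvs: {block_index: (dx, dy)} for the M blocks carried in the stream
        # preset:     value assigned to the remaining N-M blocks (zero motion here)
        mvs = np.zeros((n_blocks, 2), dtype=np.float32)
        mvs[:] = preset
        for idx, (dx, dy) in stream_mvs.items():
            mvs[idx] = (dx, dy)
        return mvs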
  • It should be noted that, in the foregoing description, the reference video frame of the current video frame is the previous video frame of the current video frame; the reference video frame can also be the next video frame of the current video frame, the previous two video frames of the current video frame, the next two video frames of the current video frame, etc., and there is no limitation here.
  • the feature information of the reference video frame is obtained during the super-resolution process performed on the reference video frame by the target model.
  • the motion vector used in the decoding process of the current video frame can be used to transform the feature information of the reference video frame (which can also be called the hidden state of the reference video frame) to obtain the transformed feature information of the reference video frame, that is, the feature information of the reference video frame aligned to the current video frame.
  • the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. That is to say, during the super-resolution process of the target model on the reference video frame, the feature information of the reference video frame is obtained. This can be either the intermediate output of the process, the final output, or the reference video frame itself.
  • the super-resolution process of the target model on the reference video frame please refer to the relevant description of the subsequent super-resolution process of the current video frame by the target model, which will not be discussed here.
  • the feature information of the reference video frame can be transformed in the following manner to obtain the transformed feature information:
  • the motion vector used in the decoding process of the current video frame and the feature information of the reference video frame can be calculated through the warp algorithm to obtain the transformed feature information.
  • the calculation process is as shown in the following formula: \hat{h}_{t-1} = Warp(h_{t-1}, MV_t)
  • where MV_t is the motion vector used in the decoding process of the current video frame, h_{t-1} is the feature information of the reference video frame, \hat{h}_{t-1} is the transformed feature information of the reference video frame, and Warp() is the warping algorithm. A hedged sketch of such a warping operation is given below.
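  • For illustration only, the following PyTorch sketch shows one possible warping operation; the use of bilinear grid sampling, the expansion of per-block motion vectors into a dense per-pixel field, and the sign convention (each current-frame position samples the reference at position + MV, following the motion-vector definition above) are assumptions, not the patent's prescribed implementation:

    import torch
    import torch.nn.functional as F

    def warp(feature, motion_vector):
        # feature:       [B, C, H, W] hidden state h_{t-1} of the reference frame
        # motion_vector: [B, 2, H, W] per-pixel displacement (dx, dy) in pixels
        b, _, h, w = feature.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=feature.dtype, device=feature.device),
            torch.arange(w, dtype=feature.dtype, device=feature.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
        coords = grid + motion_vector            # sample reference at position + MV
        coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0   # normalise to [-1, 1]
        coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        norm_grid = torch.stack((coords_x, coords_y), dim=-1)  # [B, H, W, 2]
        return F.grid_sample(feature, norm_grid, mode="bilinear",
                             padding_mode="border", align_corners=True)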
  • Then, the transformed feature information and the current video frame can be input to the target model (for example, a trained recurrent neural network model), so that the target model performs super-resolution reconstruction on the current video frame based on the transformed feature information to obtain the super-resolved current video frame.
  • the target model can perform super-resolution on the current video frame in the following ways, thereby obtaining the current video frame after super-resolution:
  • After the transformed feature information and the current video frame are input to the target model, the target model can first perform feature extraction (for example, convolution processing, etc.) on the current video frame to obtain the first feature of the current video frame.
  • For example, as shown in Figure 5 (Figure 5 is a schematic structural diagram of the target model provided by the embodiment of the present application), the t-th video frame LR_t is the current video frame, and the (t-1)-th video frame is the reference video frame of the t-th video frame LR_t. After the t-th video frame LR_t and the transformed hidden state of the (t-1)-th video frame (obtained by transforming the hidden state h_{t-1} of the (t-1)-th video frame using the motion vector MV_t used in the decoding process of the t-th video frame) are input to the target model, the target model can first perform preliminary feature extraction on the t-th video frame LR_t to obtain the preliminary feature f_t^1 of the t-th video frame (that is, the aforementioned first feature).
  • After obtaining the first feature of the current video frame, the target model can fuse the transformed feature information and the first feature of the current video frame (for example, by splicing processing, etc.) to obtain the second feature of the current video frame. Still as in the above example, after obtaining the preliminary feature f_t^1 of the t-th video frame, the target model can splice (concatenate) the preliminary feature f_t^1 of the t-th video frame with the transformed hidden state of the (t-1)-th video frame to obtain the fusion feature f_t^2 of the t-th video frame (that is, the aforementioned second feature).
  • Next, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the second feature of the current video frame, thereby obtaining the third feature of the current video frame. Still as in the above example, after obtaining the fusion feature f_t^2 of the t-th video frame, the target model can continue to perform feature extraction on the fusion feature f_t^2 of the t-th video frame to obtain the further feature f_t^3 of the t-th video frame.
  • Finally, the target model can fuse the third feature of the current video frame and the current video frame (for example, by addition processing, etc.), thereby obtaining and outputting the super-resolved current video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can add the further feature f_t^3 of the t-th video frame and the t-th video frame LR_t to obtain and output the super-resolved t-th video frame SR_t.
  • the target model can obtain the feature information (hidden state) of the current video frame in a variety of ways:
  • the target model can directly use the third feature of the current video frame as the feature information of the current video frame and output it for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can use it as the hidden state h_t of the t-th video frame and output the hidden state h_t of the t-th video frame.
  • Alternatively, after obtaining the super-resolved current video frame, the target model can directly use the super-resolved current video frame as the feature information of the current video frame and output it for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the super-resolved t-th video frame SR_t, the target model can use it as the hidden state h_t of the t-th video frame and output the hidden state h_t of the t-th video frame.
  • Alternatively, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can perform feature extraction on the further feature f_t^3 of the t-th video frame, thereby obtaining and outputting the hidden state h_t of the t-th video frame.
  • Alternatively, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the super-resolved current video frame, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the super-resolved t-th video frame SR_t, the target model can perform feature extraction on the super-resolved t-th video frame SR_t, thereby obtaining and outputting the hidden state h_t of the t-th video frame.
  • It should be noted that the target model may also not fuse the third feature with the current video frame, but directly use the third feature as the super-resolved current video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can directly use it as the super-resolved t-th video frame SR_t and output the super-resolved t-th video frame SR_t. A hedged sketch of this Figure 5 style forward pass is given below.
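  • For illustration only, the following PyTorch sketch mirrors the Figure 5 style structure described above; the layer types, channel counts, activation functions, and the final projection convolution back to image channels are assumptions, and upsampling layers are omitted for brevity (the sketch keeps the input spatial size):

    import torch
    import torch.nn as nn

    class RecurrentSRNet(nn.Module):
        # Minimal sketch of the Figure 5 style target model (assumed layer sizes).
        def __init__(self, channels=3, feat=64):
            super().__init__()
            self.extract1 = nn.Conv2d(channels, feat, 3, padding=1)   # LR_t -> f_t^1
            self.fuse = nn.Conv2d(feat * 2, feat, 3, padding=1)       # cat(f_t^1, warped h_{t-1}) -> f_t^2
            self.extract2 = nn.Conv2d(feat, feat, 3, padding=1)       # f_t^2 -> f_t^3
            self.to_image = nn.Conv2d(feat, channels, 3, padding=1)   # assumed projection back to image channels

        def forward(self, lr_t, warped_hidden):
            f1 = torch.relu(self.extract1(lr_t))                                  # first feature
            f2 = torch.relu(self.fuse(torch.cat([f1, warped_hidden], dim=1)))     # second feature
            f3 = torch.relu(self.extract2(f2))                                    # third feature, reused as h_t
            sr_t = lr_t + self.to_image(f3)                                       # fuse with the current frame (addition)
            return sr_t, f3                                                       # super-resolved frame and hidden state h_t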
  • the super-resolution processing for the current video frame is completed.
  • For the remaining video frames in the video stream to be super-resolved, the same operations as those performed on the current video frame can also be performed, so that the super-resolved video stream can be obtained; a sketch of this per-frame loop is given below.
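  • For illustration only, the following Python sketch combines the warp and model sketches above into a per-frame loop; initializing the hidden state to zeros for the first frame and the number of feature channels are assumptions:

    def super_resolve_stream(model, frames, motion_vectors, feat_channels=64):
        # frames:         list of [B, C, H, W] decoded low-resolution frames
        # motion_vectors: list of [B, 2, H, W] motion fields used in decoding
        outputs = []
        hidden = None
        for lr_t, mv_t in zip(frames, motion_vectors):
            if hidden is None:
                # No reference frame yet: start from an all-zero hidden state (assumption).
                b, _, h, w = lr_t.shape
                warped = lr_t.new_zeros(b, feat_channels, h, w)
            else:
                warped = warp(hidden, mv_t)   # align h_{t-1} to the current frame
            sr_t, hidden = model(lr_t, warped)
            outputs.append(sr_t)
        return outputs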
  • In summary, after the current video frame and the motion vector used in the decoding process of the current video frame are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved by the target model based on the transformed feature information, thereby obtaining the super-resolved current video frame.
  • the target model can perform super-resolution on the current video frame based on the transformed feature information of the reference video frame, where the transformed feature information is obtained by transforming the feature information of the reference video frame based on the motion vector used in the decoding process of the current video frame. It can be seen that, in the super-resolution process of the current video frame, the target model considers not only the information of the reference video frame itself but also the positional correspondence of image blocks between the reference video frame and the current video frame, so the factors considered are relatively comprehensive. Therefore, the super-resolved current video frame finally output by the target model has sufficiently high quality (a relatively ideal resolution), so that the entire super-resolved video stream has good image quality, thereby improving user experience.
  • Figure 6 is another schematic flowchart of a video processing method provided by an embodiment of the present application. As shown in Figure 6, the method includes:
  • the compressed video stream can be decoded to obtain a video stream to be super-resolved.
  • the compressed video stream at least contains the first video frame, the motion vector and residual information corresponding to the second video frame, the motion vector and residual information corresponding to the third video frame, ..., the motion vector and residual information corresponding to the last video frame.
  • the first video frame can be used as the reference video frame of the second video frame, motion compensation is performed on the first video frame based on the motion vector corresponding to the second video frame to obtain an intermediate video frame, and then the residual information corresponding to the second video frame is superimposed on the intermediate video frame to obtain the second video frame.
  • the decoding of the second video frame is completed.
  • the second video frame can be used as the reference video frame of the third video frame, motion compensation is performed on the second video frame based on the motion vector corresponding to the third video frame to obtain an intermediate video frame, and then the residual information corresponding to the third video frame is superimposed on the intermediate video frame to obtain the third video frame.
  • the decoding of the third video frame is completed.
  • By analogy, the decoding of the fourth video frame, ..., and the last video frame can also be completed, which is equivalent to obtaining the first video frame, the second video frame, the third video frame, ..., and the last video frame; these multiple video frames constitute the video stream to be super-resolved.
  • any video frame among multiple video frames included in the video stream to be super-resolved will be schematically introduced below, and this video frame will be called the current video frame.
  • After the current video frame is decoded based on the reference video frame of the current video frame (for example, the previous video frame of the current video frame), the motion vector corresponding to the current video frame, and the residual information corresponding to the current video frame, the residual information used in the decoding process of the current video frame can also be obtained based on the residual information corresponding to the current video frame.
  • the residual information corresponding to the current video frame provided by the compressed video stream can be used as the residual information used in the decoding process of the current video frame.
  • Super-resolution is performed on the current video frame to obtain the current video frame after super-resolution.
  • the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the feature information of the reference video frame (also called the hidden state of the reference video frame) can also be obtained, and the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the target model, so that the target model performs super-resolution on the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame, and obtains the super-resolved current video frame.
  • the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. That is to say, during the super-resolution process of the target model on the reference video frame, the feature information of the reference video frame is obtained. This can be either the intermediate output of the process, the final output, or the reference video frame itself.
  • the super-resolution process of the target model on the reference video frame please refer to the relevant description of the subsequent super-resolution process of the current video frame by the target model, which will not be discussed here.
  • the target model can perform super-resolution on the current video frame in the following ways, thereby obtaining the current video frame after super-resolution:
  • After the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the target model, the target model can first perform feature extraction (for example, convolution processing, etc.) on the current video frame to obtain the first feature of the current video frame.
  • For example, as shown in Figure 7 (Figure 7 is another schematic structural diagram of the target model provided by the embodiment of the present application), after the compressed video stream is decoded, multiple video frames can be obtained, where the t-th video frame LR_t is the current video frame and the (t-1)-th video frame is the reference video frame of the t-th video frame LR_t. After the t-th video frame LR_t, the residual information Res_t used in the decoding process of the current video frame, and the hidden state h_{t-1} of the (t-1)-th video frame are input to the target model, the target model can first perform preliminary feature extraction on the t-th video frame LR_t to obtain the preliminary feature f_t^1 of the t-th video frame (that is, the aforementioned first feature).
  • After obtaining the first feature of the current video frame, the target model fuses the feature information of the reference video frame and the first feature of the current video frame (for example, by splicing processing, etc.) to obtain the second feature of the current video frame. Still as in the above example, after obtaining the preliminary feature f_t^1 of the t-th video frame, the target model can splice (concatenate) the preliminary feature f_t^1 of the t-th video frame with the hidden state h_{t-1} of the (t-1)-th video frame to obtain the fusion feature f_t^2 of the t-th video frame (that is, the aforementioned second feature).
  • Next, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the second feature of the current video frame, thereby obtaining the third feature of the current video frame. Still as in the above example, after obtaining the fusion feature f_t^2 of the t-th video frame, the target model can continue to perform feature extraction on the fusion feature f_t^2 of the t-th video frame to obtain the further feature f_t^3 of the t-th video frame.
  • Then, based on the residual information used in the decoding process of the current video frame, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame, thereby obtaining the fourth feature of the current video frame. Still as in the above example, after the further feature f_t^3 of the t-th video frame is obtained, the target model can use the residual information Res_t used in the decoding process of the current video frame to perform feature extraction on the further feature f_t^3 of the t-th video frame, thereby obtaining the further feature f_t^4 of the t-th video frame.
  • Finally, the target model can fuse the fourth feature of the current video frame and the current video frame (for example, by addition processing, etc.) to obtain the super-resolved current video frame. Still as in the above example, after obtaining the further feature f_t^4 of the t-th video frame, the target model can add the further feature f_t^4 of the t-th video frame and the t-th video frame LR_t to obtain and output the super-resolved t-th video frame SR_t.
  • the target model can obtain the fourth feature of the current video frame in the following way:
  • the target model can sequentially compare the residual information used in the decoding process of each image block with a preset residual threshold (the size of the threshold can be set according to actual requirements, and there is no restriction here) to determine P image blocks whose residual information is greater than the preset residual threshold (P is less than N, and P is a positive integer greater than or equal to 1).
  • After determining the P image blocks whose residual information is greater than the preset residual threshold, the target model can perform feature extraction only on the part of the third feature of the current video frame corresponding to the P image blocks, while the other part of the third feature corresponding to the remaining N-P image blocks remains unchanged, thereby obtaining the fourth feature of the current video frame. A hedged sketch of this residual-gated step is given below.
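  • For illustration only, the following PyTorch sketch selects the high-residual blocks and refines only their part of the third feature; the threshold value, block size, the refine module, and the dense-mask formulation are assumptions (a real implementation would gather only the selected P blocks so that the N-P unselected blocks are genuinely skipped and computation is saved):

    import torch
    import torch.nn.functional as F

    def residual_gated_refine(f3, residual_energy, refine, threshold=0.05, block=16):
        # f3:              [B, C, H, W] third feature of the current video frame
        # residual_energy: [B, 1, H, W] magnitude of the residual used in decoding
        # refine:          a small conv module applied where the mask is set (assumed)
        pooled = F.avg_pool2d(residual_energy, block)          # per-block mean residual
        mask = (pooled > threshold).float()                    # blocks above the preset threshold
        mask = F.interpolate(mask, size=f3.shape[-2:], mode="nearest")
        refined = refine(f3)
        # Fourth feature: refined values on high-residual blocks, original values elsewhere.
        f4 = mask * refined + (1.0 - mask) * f3
        return f4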
  • the target model can obtain the feature information (hidden state) of the current video frame in a variety of ways:
  • the target model can directly use the third feature of the current video frame as the feature information of the current video frame and output it for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can use it as the hidden state h_t of the t-th video frame and output the hidden state h_t of the t-th video frame.
  • Alternatively, after obtaining the fourth feature of the current video frame, the target model can directly use the fourth feature of the current video frame as the feature information of the current video frame and output it for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the further feature f_t^4 of the t-th video frame, the target model can use it as the hidden state h_t of the t-th video frame and output the hidden state h_t of the t-th video frame.
  • Alternatively, after obtaining the super-resolved current video frame, the target model can directly use the super-resolved current video frame as the feature information of the current video frame and output it for use in the super-resolution process of the next video frame. Still as in the above example, after obtaining the super-resolved t-th video frame SR_t, the target model can use it as the hidden state h_t of the t-th video frame and output the hidden state h_t of the t-th video frame.
  • Alternatively, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the third feature of the current video frame, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the further feature f_t^3 of the t-th video frame, the target model can perform feature extraction on the further feature f_t^3 of the t-th video frame, thereby obtaining and outputting the hidden state h_t of the t-th video frame.
  • Alternatively, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the fourth feature of the current video frame, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the further feature f_t^4 of the t-th video frame, the target model can perform feature extraction on the further feature f_t^4 of the t-th video frame, thereby obtaining and outputting the hidden state h_t of the t-th video frame.
  • Alternatively, the target model can continue to perform feature extraction (for example, convolution processing, etc.) on the super-resolved current video frame, thereby obtaining the feature information of the current video frame. Still as in the above example, after obtaining the super-resolved t-th video frame SR_t, the target model can perform feature extraction on the super-resolved t-th video frame SR_t, thereby obtaining and outputting the hidden state h_t of the t-th video frame.
  • It should be noted that the target model may also not fuse the fourth feature with the current video frame, but directly use the fourth feature as the super-resolved current video frame. Still as in the above example, after obtaining the further feature f_t^4 of the t-th video frame, the target model can directly use it as the super-resolved t-th video frame SR_t and output the super-resolved t-th video frame SR_t.
  • the super-resolution processing for the current video frame is completed.
  • For the remaining video frames in the video stream to be super-resolved, the same operations as those performed on the current video frame can also be performed, so that the super-resolved video stream can be obtained.
  • In summary, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is then super-resolved by the target model based on the feature information of the reference video frame and the residual information, so that the super-resolved current video frame is obtained, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • Because the residual information is taken into account, the target model only needs to perform all of the processing on some of the image blocks contained in the current video frame and does not need to perform all of the processing on the other image blocks contained in the current video frame, which reduces the amount of computation required, so the video processing method based on the target model can be applied to small devices with limited computing power.
  • Figure 8 is a schematic flow chart of the model training method provided by the embodiment of the present application. As shown in Figure 8, the method includes:
  • a batch of training data can be obtained first.
  • the batch of training data includes the current video frame and the motion vector used in the decoding process of the current video frame. It should be noted that the current video frame after real super-resolution (that is, the real super-resolution result of the current video frame) is known.
  • In a possible implementation, the current video frame contains N image blocks, and obtaining the motion vector used in the decoding process of the current video frame includes: obtaining, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, where N >= 2 and N > M >= 1; and, based on the motion vectors used in the decoding process of the M image blocks, calculating the motion vectors used in the decoding process of the N-M image blocks, or determining a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained.
  • the motion vector used in the decoding process of the current video frame can be used to transform the feature information of the reference video frame (which can also be called the hidden state of the reference video frame) to obtain the transformed feature information of the reference video frame, that is, the feature information of the reference video frame aligned to the current video frame.
  • the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained. That is to say, during the super-resolution process of the reference video frame by the model to be trained, the feature information of the reference video frame can be either an intermediate output or a final output of the process.
  • the super-resolution process of the model to be trained on the reference video frame please refer to the relevant description of the subsequent super-resolution process of the model to be trained on the current video frame, which will not be discussed here.
  • transforming the feature information of the reference video frame based on the motion vector to obtain the transformed feature information includes: calculating the motion vector and the feature information of the reference video frame through a warping algorithm to obtain the transformed feature information.
  • the transformed feature information and the current video frame can be input to the model to be trained, so that the model to be trained can perform super-resolution reconstruction of the current video frame based on the transformed feature information, and obtain the super-resolution the current video frame.
  • In a possible implementation, super-resolving the current video frame based on the transformed feature information through the model to be trained to obtain the super-resolved current video frame includes: performing feature extraction on the current video frame through the model to be trained to obtain the first feature of the current video frame; fusing the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; and performing feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame, where the third feature is the super-resolved current video frame.
  • the method further includes: fusing the third feature and the current video frame through the model to be trained to obtain the super-resolved current video frame.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the method further includes: extracting features of the third feature or the current video frame after super-resolution through the model to be trained, to obtain feature information of the current video frame.
  • the target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution.
  • the current video frame after super-resolution and the current video frame after real super-resolution can be calculated through the preset loss function to obtain the target loss.
  • the target loss is used to indicate the result after super-resolution. The difference between the current video frame and the current video frame after real super-resolution.
  • the parameters of the model to be trained can be updated based on the target loss, and the next batch of training data can be used to continue training the model with the updated parameters until the model training conditions are met (for example, the target loss converges), so as to obtain the target model in the embodiment shown in Figure 4 (a minimal sketch of such a training step follows).
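  • A minimal sketch of one such training step is shown below, reusing the SRStep sketch above; the L1 loss and the Adam optimizer are assumptions, and any preset loss function and update rule could be substituted.

```python
# Hypothetical training step: compute a target loss between the super-resolved frame
# and the ground-truth high-resolution frame, then update the parameters of the model
# to be trained; training continues until a stopping condition (e.g. loss convergence).
import torch

model = SRStep()                                    # model to be trained (sketch above)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.L1Loss()                         # assumed preset loss function

def train_step(frame_lr, warped_hidden, frame_hr_true):
    sr_frame, hidden = model(frame_lr, warped_hidden)
    target_loss = loss_fn(sr_frame, frame_hr_true)  # difference from the real super-resolved frame
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item(), hidden.detach()      # hidden state feeds the next frame
```
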
  • the target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, after the current video frame and the motion vector used in the decoding process of the current video frame are obtained, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector to obtain the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution process of the target model on the reference video frame. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame.
  • the target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and the transformed feature information is obtained by transforming the feature information of the reference video frame based on the motion vector used in the decoding process of the current video frame. It can be seen that, in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so that the entire super-resolved video stream has good image quality, thereby improving user experience.
  • Figure 9 is another schematic flowchart of a model training method provided by an embodiment of the present application. As shown in Figure 9, the method includes:
  • a batch of training data can be obtained first.
  • the batch of training data includes the current video frame and the residual information used in the decoding process of the current video frame. It should be noted that the current video frame after real super-resolution (that is, the real super-resolution result of the current video frame) is known.
  • the feature information of the reference video frame (also called the hidden state of the reference video frame) can also be obtained, and the current video frame, the residual information used in the decoding process of the current video frame, and the feature information of the reference video frame are input to the model to be trained, so that the model to be trained super-resolves the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained. That is to say, during the super-resolution process of the reference video frame by the model to be trained, the feature information of the reference video frame can be either an intermediate output or a final output of the process.
  • for the super-resolution process performed by the model to be trained on the reference video frame, refer to the description of the subsequent super-resolution process performed by the model to be trained on the current video frame, which is not repeated here.
  • super-resolving the current video frame based on the feature information of the reference video frame and the residual information through the model to be trained, and obtaining the super-resolved current video frame, includes: performing feature extraction on the current video frame to obtain the first feature of the current video frame; fusing the feature information of the reference video frame and the first feature through the model to be trained to obtain the second feature of the current video frame; performing feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame; and performing feature extraction on the third feature based on the residual information through the model to be trained to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame (an end-to-end sketch of this variant is given a few paragraphs below).
  • the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and performing feature extraction on the third feature based on the residual information through the model to be trained to obtain the fourth feature of the current video frame includes: determining, through the model to be trained, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, N≥2, N>P≥1; and performing feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame (a minimal sketch of this block selection is given below).
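  • The sketch below illustrates one possible (assumed) way to restrict the extra feature extraction to image blocks whose decoded residual exceeds a threshold; the block size, the threshold value, and the decision to keep the unrefined third feature in the remaining blocks are assumptions made for illustration.

```python
# Hypothetical sketch of residual-guided refinement: only image blocks whose decoded
# residual energy exceeds a threshold are refined further (P of the N blocks); the
# remaining blocks keep the third feature unchanged.
import torch
import torch.nn.functional as F

def refine_high_residual_blocks(third_feat, residual, refine_conv,
                                block: int = 16, threshold: float = 0.05):
    """third_feat: (N, C, H, W) third feature; residual: (N, 1, H, W) residual magnitude;
    refine_conv: a convolutional module performing the extra feature extraction."""
    block_energy = F.avg_pool2d(residual.abs(), kernel_size=block)         # one value per block
    mask = (block_energy > threshold).float()
    mask = F.interpolate(mask, size=third_feat.shape[-2:], mode="nearest") # back to feature size
    refined = refine_conv(third_feat)                                      # feature extraction
    # Fourth feature: refined features in high-residual blocks, original features elsewhere.
    return mask * refined + (1.0 - mask) * third_feat
```
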
  • the method further includes: fusing the fourth feature and the current video frame through the model to be trained to obtain the super-resolved current video frame.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the method further includes: performing feature extraction on the third feature, the fourth feature or the current video frame after super-resolution through the model to be trained, to obtain the feature information of the current video frame.
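  • Putting the steps of this variant together, the following end-to-end sketch (with assumed layer sizes, block size, and threshold) fuses the unwarped hidden state of the reference frame with the first feature of the current frame and refines only the high-residual blocks before reconstruction; it is an illustration, not the architecture of this application.

```python
# Hypothetical end-to-end sketch of the residual-guided variant: the hidden state of the
# reference frame is fused without warping, and the extra refinement of the third feature
# is restricted to image blocks with large decoded residuals.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGuidedSRStep(nn.Module):
    def __init__(self, channels: int = 64, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.extract = nn.Conv2d(3, channels, 3, padding=1)          # first feature
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)  # second feature
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)    # third feature
        self.block_refine = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_rgb = nn.Sequential(nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
                                    nn.PixelShuffle(scale))

    def forward(self, frame, hidden, residual, block=16, threshold=0.05):
        first = F.relu(self.extract(frame))
        second = F.relu(self.fuse(torch.cat([first, hidden], dim=1)))
        third = F.relu(self.refine(second))
        # Fourth feature: extra feature extraction only where the decoded residual is large.
        energy = F.avg_pool2d(residual.abs(), kernel_size=block)
        mask = F.interpolate((energy > threshold).float(), size=third.shape[-2:], mode="nearest")
        fourth = mask * F.relu(self.block_refine(third)) + (1 - mask) * third
        upsampled = F.interpolate(frame, scale_factor=self.scale, mode="bilinear",
                                  align_corners=False)
        return upsampled + self.to_rgb(fourth), fourth   # output frame and new hidden state
```
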
  • the target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution.
  • the preset loss function can be used to compute the target loss from the current video frame after super-resolution and the current video frame after real super-resolution.
  • the target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution.
  • the parameters of the model to be trained can be updated based on the target loss, and the next batch of training data can be used to continue training the model with the updated parameters until the model training conditions are met (for example, the target loss converges), so as to obtain the target model in the embodiment shown in Figure 6.
  • the target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information of the reference video frame and the residual information, so as to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • Figure 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present application. As shown in Figure 10, the device includes:
  • the acquisition module 1001 is used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;
  • the transformation module 1002 is used to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information.
  • the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model;
  • the super-resolution module 1003 is used to perform super-resolution on the current video frame based on the transformed feature information through the target model to obtain the super-resolved current video frame.
  • the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame.
  • the target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and the transformed feature information is obtained by transforming the feature information of the reference video frame based on the motion vector used in the decoding process of the current video frame. It can be seen that, in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so that the entire super-resolved video stream has good image quality, thereby improving user experience.
  • the transformation module 1002 is configured to calculate the feature information of the motion vector and the reference video frame through a warping algorithm to obtain transformed feature information.
  • the super-resolution module 1003 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the transformed feature information and the first feature through the target model to obtain the second feature of the current video frame; and perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • the super-resolution module 1003 is also used to fuse the third feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module 1003 is also used to extract features of the third feature or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.
  • the acquisition module 1001 is used to: acquire, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, N≥2, N>M≥1; and calculate the motion vectors used in the decoding process of the N-M image blocks based on the motion vectors used in the decoding process of the M image blocks, or determine a preset value as the motion vectors used in the decoding process of the N-M image blocks (a minimal sketch of this completion step is given below).
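  • A minimal sketch of one possible way to complete the missing motion vectors is shown below; whether the remaining N-M blocks receive a preset value (here zero) or an estimate averaged from neighbouring known blocks is an assumption made for illustration.

```python
# Hypothetical sketch: motion vectors are available from the compressed stream for M of
# the N image blocks (e.g. inter-coded blocks); for the remaining N - M blocks they are
# either filled with a preset value or estimated from the neighbouring known blocks.
import numpy as np

def complete_motion_vectors(mv_blocks, known_mask, preset=(0.0, 0.0), use_preset=True):
    """mv_blocks: (rows, cols, 2) per-block motion vectors; known_mask: (rows, cols) bool."""
    mv = mv_blocks.copy()
    if use_preset:
        mv[~known_mask] = preset                       # preset value for the unknown blocks
        return mv
    rows, cols = known_mask.shape
    for r in range(rows):
        for c in range(cols):
            if known_mask[r, c]:
                continue
            # Average the motion vectors of the known 4-neighbours, if any are available.
            neigh = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            vals = [mv_blocks[i, j] for i, j in neigh
                    if 0 <= i < rows and 0 <= j < cols and known_mask[i, j]]
            mv[r, c] = np.mean(vals, axis=0) if vals else preset
    return mv
```
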
  • Figure 11 is another schematic structural diagram of a video processing device provided by an embodiment of the present application. As shown in Figure 11, the device includes:
  • the acquisition module 1101 is used to acquire the current video frame and the residual information used in the decoding process of the current video frame;
  • the super-resolution module 1102 is used to perform super-resolution on the current video frame based on the feature information and residual information of the reference video frame through the target model to obtain the current video frame after super-resolution.
  • the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information of the reference video frame and the residual information, so as to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • the super-resolution module 1102 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the feature information of the reference video frame and the first feature through the target model to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame; and perform feature extraction on the third feature based on the residual information through the target model to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the super-resolution module 1102 is used to: determine, through the target model, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, N≥2, N>P≥1; and perform, through the target model, feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • the super-resolution module 1102 is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module 1102 is also used to perform feature extraction on the third feature, the fourth feature, or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.
  • Figure 12 is a schematic structural diagram of a model training device provided by an embodiment of the present application. As shown in Figure 12, the device includes:
  • the first acquisition module 1201 is used to acquire the current video frame and the motion vector used in the decoding process of the current video frame;
  • the transformation module 1202 is used to transform the feature information of the reference video frame of the current video frame based on the motion vector to obtain the transformed feature information.
  • the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the model to be trained;
  • the super-resolution module 1203 is used to perform super-resolution on the current video frame based on the transformed feature information through the model to be trained, and obtain the current video frame after super-resolution;
  • the second acquisition module 1204 is used to obtain the target loss based on the current video frame after super-resolution and the current video frame after real super-resolution.
  • the target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution;
  • the update module 1205 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.
  • the target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, after obtaining the current video frame and the motion vector used in the decoding process of the current video frame, the feature information of the reference video frame of the current video frame can be transformed based on the motion vector, thereby obtaining the transformed feature information, where the feature information of the reference video frame is obtained during the super-resolution process of the reference video frame by the target model. Then, the current video frame can be super-resolved based on the transformed feature information through the target model, thereby obtaining the super-resolved current video frame.
  • the target model can super-resolve the current video frame based on the transformed feature information of the reference video frame, and the transformed feature information is obtained by transforming the feature information of the reference video frame based on the motion vector used in the decoding process of the current video frame. It can be seen that, in the super-resolution process of the current video frame by the target model, not only the information of the reference video frame itself is considered, but also the positional correspondence between image blocks of the reference video frame and image blocks of the current video frame. Since the factors considered are relatively comprehensive, the super-resolved current video frame finally output by the target model is of sufficiently high quality (with a relatively ideal resolution), so that the entire super-resolved video stream has good image quality, thereby improving user experience.
  • the transformation module 1202 is configured to calculate the motion vector and the feature information of the reference video frame through a warping algorithm to obtain transformed feature information.
  • the super-resolution module 1203 is used to: perform feature extraction on the current video frame through the model to be trained to obtain the first feature of the current video frame; fuse the transformed feature information and the first feature through the model to be trained to obtain the second feature of the current video frame; and perform feature extraction on the second feature through the model to be trained to obtain the third feature of the current video frame, where the third feature is used as the super-resolved current video frame.
  • the super-resolution module 1203 is also used to fuse the third feature and the current video frame through the model to be trained to obtain the current video frame after super-resolution.
  • the third feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module 1203 is also used to extract features of the third feature or the current video frame after super-resolution through the model to be trained, to obtain feature information of the current video frame.
  • the first acquisition module 1201 is used to: acquire, from the compressed video stream, the motion vectors used in the decoding process of M image blocks in the current video frame, N≥2, N>M≥1; and calculate the motion vectors used in the decoding process of the N-M image blocks based on the motion vectors used in the decoding process of the M image blocks, or determine a preset value as the motion vectors used in the decoding process of the N-M image blocks.
  • Figure 13 is another structural schematic diagram of a model training device provided by an embodiment of the present application. As shown in Figure 13, the device includes:
  • the first acquisition module 1301 is used to acquire the current video frame and the residual information used in the decoding process of the current video frame;
  • the super-resolution module 1302 is used to perform super-resolution on the current video frame based on the feature information and residual information of the reference video frame through the model to be trained, and obtain the current video frame after super-resolution.
  • the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the model to be trained;
  • the second acquisition module 1303 is used to obtain the target loss based on the current video frame after super-resolution and the current video frame after real super-resolution.
  • the target loss is used to indicate the difference between the current video frame after super-resolution and the current video frame after real super-resolution;
  • the update module 1304 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met and the target model is obtained.
  • the target model trained in the embodiment of this application has the ability to super-resolve video frames. Specifically, the current video frame and the residual information used in the decoding process of the current video frame are obtained; the current video frame is super-resolved through the target model based on the feature information of the reference video frame and the residual information, so as to obtain the super-resolved current video frame, where the feature information of the reference video frame is obtained during the super-resolution processing of the reference video frame by the target model.
  • the target model can perform super-resolution of the current video frame based on the feature information of the reference video frame and the residual information used in the decoding process of the current video frame.
  • the super-resolution module 1302 is used to: perform feature extraction on the current video frame through the target model to obtain the first feature of the current video frame; fuse the feature information of the reference video frame and the first feature through the target model to obtain the second feature of the current video frame; perform feature extraction on the second feature through the target model to obtain the third feature of the current video frame; and perform feature extraction on the third feature based on the residual information through the target model to obtain the fourth feature of the current video frame, where the fourth feature is used as the super-resolved current video frame.
  • the residual information includes the residual information used in the decoding process of N image blocks in the current video frame, and the super-resolution module 1302 is used to: determine, through the target model, P image blocks whose residual information is greater than a preset residual threshold among the N image blocks, N≥2, N>P≥1; and perform, through the target model, feature extraction on the features corresponding to the P image blocks in the third feature to obtain the fourth feature of the current video frame.
  • the super-resolution module 1302 is also used to fuse the fourth feature and the current video frame through the target model to obtain the current video frame after super-resolution.
  • the third feature, the fourth feature or the current video frame after super-resolution is used as the feature information of the current video frame.
  • the super-resolution module 1302 is also used to perform feature extraction on the third feature, the fourth feature, or the current video frame after super-resolution through the target model to obtain feature information of the current video frame.
  • FIG. 14 is a schematic structural diagram of the execution device provided by the embodiment of the present application.
  • the execution device 1400 can be embodied as a mobile phone, a tablet, a laptop, a smart wearable device, a server, etc., and is not limited here.
  • the video processing device described in the corresponding embodiment of FIG. 10 or FIG. 11 may be deployed on the execution device 1400 to implement the video processing function in the corresponding embodiment of FIG. 4 or FIG. 6 .
  • the execution device 1400 includes: a receiver 1401, a transmitter 1402, a processor 1403 and a memory 1404 (the number of processors 1403 in the execution device 1400 can be one or more, one processor is taken as an example in Figure 14) , wherein the processor 1403 may include an application processor 14031 and a communication processor 14032.
  • the receiver 1401, the transmitter 1402, the processor 1403, and the memory 1404 may be connected by a bus or other means.
  • Memory 1404 may include read-only memory and random access memory and provides instructions and data to processor 1403 .
  • a portion of memory 1404 may also include non-volatile random access memory (NVRAM).
  • the memory 1404 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1403 controls the operations of the execution device.
  • various components of the execution device are coupled together through a bus system.
  • the bus system may also include a power bus, a control bus, a status signal bus, etc.
  • various buses are called bus systems in the figure.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 1403 or implemented by the processor 1403.
  • the processor 1403 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1403 .
  • the above-mentioned processor 1403 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • the processor 1403 can implement or execute each method, step and logical block diagram disclosed in the embodiment of this application.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 1404.
  • the processor 1403 reads the information in the memory 1404 and completes the steps of the above method in combination with its hardware.
  • the receiver 1401 may be configured to receive input numeric or character information and generate signal inputs related to performing relevant settings and functional controls of the device.
  • the transmitter 1402 can be used to output numeric or character information through the first interface; the transmitter 1402 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1402 can also include a display device such as a display screen .
  • the processor 1403 is used to perform video processing through the target model in the corresponding embodiment of FIG. 4 or FIG. 6 .
  • FIG. 15 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 1500 is implemented by one or more servers.
  • the training device 1500 can vary greatly due to different configurations or performance, and can include one or more central processing units (CPU) 1514 (eg, one or more processors) and memory 1532, one or more storage media 1530 (eg, one or more mass storage devices) storing applications 1542 or data 1544.
  • the memory 1532 and the storage medium 1530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device.
  • the central processor 1514 may be configured to communicate with the storage medium 1530 and execute a series of instruction operations in the storage medium 1530 on the training device 1500 .
  • the training device 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input and output interfaces 1558; or, one or more operating systems 1541, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.
  • the training device can execute the model training method in the embodiment corresponding to FIG. 8 or FIG. 9 .
  • Embodiments of the present application also relate to a computer storage medium.
  • the computer-readable storage medium stores a program for performing signal processing.
  • when the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • Embodiments of the present application also relate to a computer program product that stores instructions that, when executed by a computer, cause the computer to perform the steps performed by the aforementioned execution device, or cause the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device or terminal device provided by the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • the communication unit may be, for example, an input/output interface, a pin, a circuit, or the like.
  • the processing unit can execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
  • Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 1600.
  • the NPU 1600, as a coprocessor, is mounted to the host CPU (Host CPU), and the Host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1603.
  • the arithmetic circuit 1603 is controlled by the controller 1604 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1603 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1603 is a two-dimensional systolic array.
  • the arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1603 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes the data of matrix A from the input memory 1601, performs a matrix operation with matrix B, and stores the partial result or final result of the matrix in the accumulator (accumulator) 1608 .
  • the unified memory 1606 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 1602 through the storage unit access controller (Direct Memory Access Controller, DMAC) 1605.
  • Input data is also transferred to unified memory 1606 via DMAC.
  • BIU is the Bus Interface Unit, that is, the bus interface unit 1613, which is used for the interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1609.
  • the bus interface unit 1613 (Bus Interface Unit, BIU for short) is used for the instruction fetch buffer 1609 to obtain instructions from the external memory, and is also used for the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 or the weight data to the weight memory 1602 or the input data to the input memory 1601 .
  • the vector calculation unit 1607 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of predicted label planes, etc.
  • vector calculation unit 1607 can store the processed output vectors to unified memory 1606 .
  • the vector calculation unit 1607 can apply a linear function or a nonlinear function to the output of the operation circuit 1603, for example, linear interpolation on the prediction label plane extracted by the convolution layer, or a vector of accumulated values, to generate an activation value.
  • vector calculation unit 1607 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1603, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604;
  • the unified memory 1606, the input memory 1601, the weight memory 1602 and the fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • a physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus necessary general-purpose hardware. Of course, it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, all functions performed by computer programs can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits or dedicated circuits. However, for this application, a software program implementation is a better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, and includes several instructions to cause a computer device (which can be a personal computer, a training device, or a network device, etc.) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (Solid State Disk, SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A video processing method and a related device thereof have a satisfactory super-resolution effect on a video frame in a video stream, and can enable the entire super-resolved video stream to have good image quality, thereby improving user experience. The method of the present application comprises: obtaining a current video frame and a motion vector used in the decoding process of the current video frame; transforming feature information of a reference video frame of the current video frame on the basis of the motion vector to obtain transformed feature information, the feature information of the reference video frame being obtained in the super-resolution process of a target model on the reference video frame; and performing super-resolution on the current video frame by means of the target model on the basis of the transformed feature information, to obtain the super-resolved current video frame.
PCT/CN2023/113745 2022-08-30 2023-08-18 Procédé de traitement vidéo et son dispositif associé WO2024046144A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211049719.5A CN115623242A (zh) 2022-08-30 2022-08-30 一种视频处理方法及其相关设备
CN202211049719.5 2022-08-30

Publications (1)

Publication Number Publication Date
WO2024046144A1 true WO2024046144A1 (fr) 2024-03-07

Family

ID=84856895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/113745 WO2024046144A1 (fr) 2022-08-30 2023-08-18 Procédé de traitement vidéo et son dispositif associé

Country Status (2)

Country Link
CN (1) CN115623242A (fr)
WO (1) WO2024046144A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623242A (zh) * 2022-08-30 2023-01-17 华为技术有限公司 一种视频处理方法及其相关设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9794483B1 (en) * 2016-08-22 2017-10-17 Raytheon Company Video geolocation
CN112465698A (zh) * 2019-09-06 2021-03-09 华为技术有限公司 一种图像处理方法和装置
CN114339260A (zh) * 2020-09-30 2022-04-12 华为技术有限公司 图像处理方法及装置
CN115623242A (zh) * 2022-08-30 2023-01-17 华为技术有限公司 一种视频处理方法及其相关设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9794483B1 (en) * 2016-08-22 2017-10-17 Raytheon Company Video geolocation
CN112465698A (zh) * 2019-09-06 2021-03-09 华为技术有限公司 一种图像处理方法和装置
CN114339260A (zh) * 2020-09-30 2022-04-12 华为技术有限公司 图像处理方法及装置
CN115623242A (zh) * 2022-08-30 2023-01-17 华为技术有限公司 一种视频处理方法及其相关设备

Also Published As

Publication number Publication date
CN115623242A (zh) 2023-01-17

Similar Documents

Publication Publication Date Title
WO2022022274A1 (fr) Procédé et appareil d'instruction de modèles
WO2022042713A1 (fr) Procédé d'entraînement d'apprentissage profond et appareil à utiliser dans un dispositif informatique
WO2022116856A1 (fr) Structure de modèle, procédé de formation de modèle, et procédé et dispositif d'amélioration d'image
WO2022179581A1 (fr) Procédé de traitement d'images et dispositif associé
WO2022111617A1 (fr) Procédé et appareil d'entraînement de modèle
WO2023231954A1 (fr) Procédé de débruitage de données et dispositif associé
WO2022111387A1 (fr) Procédé de traitement de données et appareil associé
WO2022179586A1 (fr) Procédé d'apprentissage de modèle, et dispositif associé
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
WO2024046144A1 (fr) Procédé de traitement vidéo et son dispositif associé
WO2021169366A1 (fr) Procédé et appareil d'amélioration de données
WO2024001806A1 (fr) Procédé d'évaluation de données basé sur un apprentissage fédéré et dispositif associé
CN111950700A (zh) 一种神经网络的优化方法及相关设备
CN114266897A (zh) 痘痘类别的预测方法、装置、电子设备及存储介质
WO2022179603A1 (fr) Procédé de réalité augmentée et dispositif associé
WO2022156475A1 (fr) Procédé et appareil de formation de modèle de réseau neuronal, et procédé et appareil de traitement de données
US20240185573A1 (en) Image classification method and related device thereof
WO2024114659A1 (fr) Procédé de génération de résumé et dispositif associé
WO2023246735A1 (fr) Procédé de recommandation d'article et dispositif connexe associé
WO2023197910A1 (fr) Procédé de prédiction de comportement d'utilisateur et dispositif associé
WO2023185541A1 (fr) Procédé de formation de modèle et dispositif associé
WO2024067113A1 (fr) Procédé de prédiction d'action et dispositif associé
CN113627421A (zh) 一种图像处理方法、模型的训练方法以及相关设备
WO2023197857A1 (fr) Procédé de partitionnement de modèle et dispositif associé
WO2024017282A1 (fr) Procédé et dispositif de traitement de données

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859178

Country of ref document: EP

Kind code of ref document: A1