WO2023185284A1 - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
WO2023185284A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
pixel
target frame
correlation
reference feature
Prior art date
Application number
PCT/CN2023/075906
Other languages
English (en)
French (fr)
Inventor
俞济洋
刘金根
薄列峰
梅涛
周伯文
Original Assignee
网银在线(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网银在线(北京)科技有限公司
Publication of WO2023185284A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 - Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 - Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks

Definitions

  • the present disclosure relates to the field of video processing, and in particular, to a video processing method and device.
  • Super-resolution refers to improving the resolution of the original image through hardware or software methods.
  • the process of obtaining a high-resolution image through a series of low-resolution images is super-resolution reconstruction.
  • the core idea of super-resolution reconstruction is to exchange temporal bandwidth (obtaining multiple frame image sequences of the same scene) for spatial resolution to achieve the conversion of temporal resolution to spatial resolution.
  • a video processing method including: performing feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames; for each pixel in the feature map of the target frame, determining the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel; constructing a first reference feature map based on the pixels in the feature maps of the adjacent frames that have the greatest correlation; and performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
  • performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map includes: performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, where the second reference feature map incorporates universal pixel features.
  • determining the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel includes: determining the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames; and selecting, according to the correlation, the pixel with the greatest correlation from among the neighborhood pixels.
  • constructing the first reference feature map based on the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel includes: determining the first reference feature corresponding to the pixel based on the feature of the maximally correlated pixel in the feature maps of the adjacent frames and the corresponding correlation; and constructing the first reference feature map from the first reference features corresponding to the pixels.
  • the video processing method further includes: determining the correlation between each pixel in the feature map of the target frame and multiple universal pixel features; fusing the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel; and constructing the second reference feature map from the second reference features corresponding to the pixels.
  • fusing the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel includes: normalizing the correlations; using the normalized correlations as weights, performing a weighted average over the multiple universal pixel features; and taking the result of the weighted average as the second reference feature corresponding to the pixel.
  • performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map includes: inputting the feature map of the target frame, the first reference feature map, and the second reference feature map into a feature decoding network model to output a first reconstructed image; and superimposing the first reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame, to obtain a second reconstructed image.
  • the network model used for feature encoding includes multiple layers of ResNet modules.
  • the feature decoding network model includes multiple layers of ResNet modules and an upsampling module.
  • a video processing device including: a feature encoding module configured to perform feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames;
  • a first building module configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel, and to construct a first reference feature map from the maximally correlated pixels in the feature maps of the adjacent frames;
  • a reconstruction module configured to perform super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
  • the reconstruction module is configured to perform super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, where the second reference feature map incorporates universal pixel features.
  • the first building module is configured to: determine the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames; and select, according to the correlation, the pixel with the greatest correlation from among the neighborhood pixels.
  • the first building module is configured to: determine the first reference feature corresponding to the pixel based on the feature of the maximally correlated pixel in the feature maps of the adjacent frames and the corresponding correlation; and construct the first reference feature map from the first reference features corresponding to the pixels.
  • the video processing device further includes a second building module configured to: determine the correlation between each pixel in the feature map of the target frame and multiple universal pixel features; fuse the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel; and construct the second reference feature map from the second reference features corresponding to the pixels.
  • the second building module is configured to: normalize the correlations; use the normalized correlations as weights to perform a weighted average over the multiple universal pixel features; and take the result of the weighted average as the second reference feature corresponding to the pixel.
  • the second building module is configured to: input the feature map of the target frame, the first reference feature map, and the second reference feature map into a feature decoding network model to output a first reconstructed image; and superimpose the first reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame, to obtain a second reconstructed image.
  • the feature encoding network model includes multiple layers of ResNet modules.
  • the feature decoding network model includes multiple layers of ResNet modules and an upsampling module.
  • a video processing device including: a memory; and a processor coupled to the memory, the processor being configured to execute the above video processing method based on instructions stored in the memory.
  • a computer-readable storage medium on which computer program instructions are stored, and when the instructions are executed by a processor, the above-mentioned video processing method is implemented.
  • Figure 1 is a schematic flowchart of some embodiments of the video processing method of the present disclosure.
  • FIG. 2 is a schematic flowchart of other embodiments of the video processing method of the present disclosure.
  • Figure 3 is a schematic flowchart of some embodiments of constructing a second reference feature map of the present disclosure.
  • Figure 4 is a comparison diagram of video frame processing effects between embodiments of the present disclosure and related technologies.
  • Figure 5 is a schematic structural diagram of some embodiments of the video processing device of the present disclosure.
  • FIG. 6 is a schematic structural diagram of some other embodiments of the video processing device of the present disclosure.
  • Figure 7a is a schematic structural diagram of some embodiments of the feature encoding network model of the present disclosure.
  • Figure 7b is a schematic structural diagram of some embodiments of the feature decoding network model of the present disclosure.
  • Figure 8 is a schematic structural diagram of some other embodiments of the video processing device of the present disclosure.
  • Figure 9 is a schematic structural diagram of some other embodiments of the video processing device of the present disclosure.
  • in all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting; accordingly, other examples of the exemplary embodiments may have different values.
  • in the related art, super-resolution processing can be performed through convolutional neural networks to restore image details in low-resolution videos.
  • such methods mainly fill in the missing details of low-resolution video frames by fusing the local details of adjacent frames, and they have the following shortcomings. First, before fusing the local details of adjacent frames, the related art needs to rely on methods such as optical flow to compute the image correspondence between video frames, which leads to poor computational accuracy when there is large motion in the video, as well as high demands on computing resources and slow speed. Second, since the adjacent frames available for reference are also low-resolution, they contribute little to enhancing the sharpness of the key frame.
  • a technical problem to be solved by the present disclosure is to provide a video processing method and device that can improve the efficiency of video super-resolution processing, reduce its computing-resource requirements, and improve the quality of the super-resolution reconstructed video.
  • a cross-frame attention mechanism is introduced to achieve pixel-level matching between adjacent frames and the target frame. Specifically, the pixels in the feature maps of the adjacent frames that have the greatest correlation with the pixels of the target frame are determined, and the first reference feature map is constructed from these maximally correlated pixels. In this way, the present disclosure does not need to explicitly compute the pixel correspondence between video frames using methods such as optical flow, thereby avoiding the problems caused in the related art by relying on such methods to compute the image correspondence between video frames.
  • Figure 1 is a schematic flowchart of some embodiments of the video processing method of the present disclosure.
  • in step 110, feature encoding is performed on the target frame of the video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
  • the target frame is the video frame to be processed by super-resolution
  • the adjacent frames are one or more video frames adjacent to the video frame to be super-resolution processed. For example, the 3 preceding and 3 following frames adjacent to the target frame may be taken as adjacent frames, or the 2 preceding frames, or the 4 following frames, and so on.
  • a key frame in the video is taken as the target frame, and one or more video frames adjacent to the key frame are taken as adjacent frames.
  • the video is super-resolution processed in a sliding-window manner: each time, the key frame to be processed and its 3 preceding and 3 following adjacent frames are processed according to the flow shown in Figure 1 to obtain the super-resolution reconstructed image of that key frame.
  • a feature encoding network model is used to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames.
  • the feature dimension of the encoded feature map is larger than the feature dimension of the original video frame.
  • a three-channel RGB video frame is input into the feature encoding network model, and a 128-channel feature map is obtained.
  • the feature encoding network model is a network model composed of multiple layers of residual network (ResNet) modules, for example a network model composed of 5 ResNet modules.
  • without affecting the implementation of the present disclosure, the feature encoding network used in the present disclosure can also adopt other network model structures, such as an autoencoder or a Residual Dense Network (RDN).
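  • As an illustration of the encoder just described, the sketch below stacks residual blocks to lift a 3-channel RGB frame to a 128-channel feature map. It is a minimal PyTorch sketch under the stated assumptions (5 ResNet modules, 128 channels); names such as FeatureEncoder and ResBlock are hypothetical, not taken from the patent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: conv -> ReLU -> conv, plus an identity skip (see Figure 7a)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureEncoder(nn.Module):
    """Lifts a 3-channel RGB frame to a higher-dimensional (e.g. 128-channel) feature map."""
    def __init__(self, in_channels: int = 3, feat_channels: int = 128, num_blocks: int = 5):
        super().__init__()
        self.head = nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(feat_channels) for _ in range(num_blocks)])

    def forward(self, frames):                 # frames: (B, 3, H, W)
        return self.blocks(self.head(frames))  # -> (B, 128, H, W)

# Encode a sliding window of 7 low-resolution frames: the target frame
# plus its 3 preceding and 3 following adjacent frames.
encoder = FeatureEncoder()
window = torch.randn(7, 3, 64, 64)
feature_maps = encoder(window)                 # (7, 128, 64, 64)
```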
  • in step 120, for each pixel in the feature map of the target frame, the pixel in the feature maps of the adjacent frames that has the greatest correlation with that pixel is determined.
  • the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames is determined, and the pixel with the greatest correlation is selected from the neighborhood pixels according to the correlation.
  • the neighborhood pixels of a pixel in the feature map of an adjacent frame are the pixels in that feature map that are spatially adjacent to the pixel of the target frame. For example, taking a 9*9 pixel area centered on pixel a as the neighborhood of that pixel, pixel a has 9*9 = 81 neighborhood pixels in a given adjacent frame.
  • the correlation degree of pixels describes the degree of similarity between pixels.
  • the correlation between two pixels can be calculated as follows: the inner product of the feature vector of pixel a and the feature vector of pixel b is taken as the correlation between pixel a and pixel b.
  • the target frame is the 4th frame in the video
  • the adjacent frames of the target frame are the 1st, 2nd, 3rd, 5th, 6th, and 7th frames in the video
  • 6*9*9 correlation values can thus be obtained; according to these correlations, the pixel with the greatest correlation with pixel a is selected from the 6*9*9 neighborhood pixels. The above process is performed separately for each pixel in the feature map of the target frame, so that the most relevant pixel in the feature maps of these six adjacent frames can be found for every pixel of the target frame's feature map.
  • in step 130, a first reference feature map is constructed based on the pixels in the feature maps of the adjacent frames that have the greatest correlation with the pixels of the target frame.
  • the first reference feature corresponding to the pixel is determined based on the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation; the first reference feature map is then constructed from the first reference features corresponding to each pixel in the feature map of the target frame.
  • the product of the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation is taken as the first reference feature corresponding to the pixel.
  • the first reference features corresponding to each pixel in the feature map of the target frame can be obtained, and further, the first reference feature map can be obtained.
  • the pixel features in the first reference feature map are the first reference features corresponding to the pixels in the feature map of the target frame.
  • taking the product of the maximally correlated pixel feature and the corresponding correlation as the first reference feature is an example only; without affecting the implementation of the present disclosure, other specific implementations that determine the first reference feature from the maximally correlated pixel feature and the corresponding correlation can also be adopted.
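  • To make steps 120 and 130 concrete, the following sketch implements the cross-frame attention described above: for every target-frame pixel it computes inner-product correlations against the 9*9 neighborhoods of all adjacent frames (gathered with F.unfold), picks the maximally correlated candidate, and scales that candidate's feature by its correlation to form the first reference feature. This is an illustrative reading of the text, not the patent's actual code; the function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def first_reference_map(target_feat: torch.Tensor, neighbor_feats: torch.Tensor,
                        window: int = 9) -> torch.Tensor:
    """
    target_feat:    (C, H, W) feature map of the target frame.
    neighbor_feats: (T, C, H, W) feature maps of the T adjacent frames.
    Returns the (C, H, W) first reference feature map.
    """
    C, H, W = target_feat.shape
    T = neighbor_feats.shape[0]
    pad = window // 2

    # Gather the window*window neighborhood of every pixel in every adjacent frame:
    # (T, C*window*window, H*W) -> (T, C, window*window, H*W).
    patches = F.unfold(neighbor_feats, kernel_size=window, padding=pad)
    patches = patches.view(T, C, window * window, H * W)

    # Correlation = inner product of the target pixel's feature vector with each
    # neighborhood feature vector: (T, window*window, H*W).
    corr = (patches * target_feat.view(1, C, 1, H * W)).sum(dim=1)

    # Pool the candidates of all T frames (e.g. 6*9*9 per pixel) and take the best.
    corr = corr.reshape(T * window * window, H * W)
    patches = patches.permute(1, 0, 2, 3).reshape(C, T * window * window, H * W)
    best_corr, best_idx = corr.max(dim=0)          # (H*W,) each

    idx = best_idx.view(1, 1, H * W).expand(C, 1, H * W)
    best_feat = patches.gather(1, idx).squeeze(1)  # (C, H*W)

    # First reference feature = best-matching feature scaled by its correlation.
    return (best_feat * best_corr).view(C, H, W)
```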
  • a cross-frame attention mechanism is introduced through steps 120 and 130 to achieve pixel-level matching between adjacent frames and the target frame.
  • the embodiments of the present disclosure therefore do not need to explicitly compute the pixel correspondence between video frames using methods such as optical flow, thereby avoiding the problems caused in the related art by relying on such methods to compute the image correspondence between video frames: poor computational accuracy, high demands on computing resources, and slow speed. This helps improve the efficiency of video super-resolution processing, reduce its computing-resource requirements, and improve the quality of the super-resolution reconstructed video.
  • in step 140, super-resolution reconstruction is performed based on the feature map of the target frame and the first reference feature map.
  • the feature map of the target frame and the first reference feature map are input into the feature decoding network model, and the image output by the model is used as the super-resolution reconstructed image, that is, the finally reconstructed high-resolution image corresponding to the target frame.
  • the function of the feature decoding network model is to fuse the feature map of the target frame with the first reference feature map and to remap the fused result from the high-dimensional feature space back to the low-dimensional feature space, obtaining a high-resolution reconstructed image.
  • a 128-channel target frame feature map and a 128-channel first reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
  • the feature decoding network model is a network model composed of a multi-layer ResNet module and an upsampling module, such as a network model composed of a 40-layer ResNet module and an upsampling module.
  • the detailed composition of the ResNet module is shown in Figure 7a
  • the detailed composition of the upsampling module is shown in Figure 7b.
  • the ResNet module includes a convolution layer, a ReLU activation layer, and a convolution layer.
  • the upsampling module includes a convolution layer, a Pixel Shuffle layer, and two Leaky ReLU processing layers.
  • the ResNet modules are responsible for fusing the feature map of the target frame with the first reference feature map to obtain fused high-dimensional, low-resolution image features; the upsampling module is responsible for reorganizing the high-dimensional, low-resolution image features into low-dimensional, high-resolution image features, for example reorganizing 128-channel low-resolution features into 3-channel high-resolution RGB image features.
  • Pixel Shuffle is an upsampling method that can effectively enlarge a reduced feature map; it can replace interpolation or deconvolution methods for upsampling.
  • Leaky ReLU and ReLU are two commonly used activation functions in deep learning.
  • without affecting the implementation of the present disclosure, the feature decoding network used in the present disclosure can adopt other network model structures besides a network model composed of a 40-layer ResNet module and an upsampling module, such as RDN.
  • the feature map of the target frame and the first reference feature map are input into the feature decoding network model to output a first reconstructed image; the feature map obtained by bilinear interpolation of the feature map of the target frame is then superimposed with the first reconstructed image to obtain a second reconstructed image, which is used as the finally reconstructed high-resolution image corresponding to the target frame.
  • superimposing the reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame can further improve the image resolution.
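  • The decoding path described above can be sketched as follows: residual blocks fuse the concatenated feature maps, Pixel Shuffle-based upsampling blocks (Figure 7b) rearrange 128-channel low-resolution features into a high-resolution RGB image, and a bilinearly upsampled base is added as a global residual. The 4x scale factor and module names are assumptions; note also that the text speaks of interpolating the target frame's feature map, while this sketch interpolates the low-resolution frame itself so that the channel counts line up with the RGB output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Conv -> ReLU -> conv with an identity skip (Figure 7a)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class UpsampleBlock(nn.Module):
    """Conv -> Pixel Shuffle -> Leaky ReLU; doubles the spatial resolution (Figure 7b)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, 3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return F.leaky_relu(self.shuffle(self.conv(x)), negative_slope=0.1)

class FeatureDecoder(nn.Module):
    """Fuses the target and reference feature maps and reconstructs an RGB image."""
    def __init__(self, feat_channels: int = 128, num_blocks: int = 40, scale: int = 4):
        super().__init__()
        self.scale = scale
        # Three 128-channel inputs: target features plus the two reference maps.
        self.fuse = nn.Conv2d(3 * feat_channels, feat_channels, 3, padding=1)
        self.body = nn.Sequential(*[ResBlock(feat_channels) for _ in range(num_blocks)])
        self.up = nn.Sequential(*[UpsampleBlock(feat_channels) for _ in range(scale // 2)])
        self.tail = nn.Conv2d(feat_channels, 3, 3, padding=1)

    def forward(self, target_feat, ref1, ref2, lr_frame):
        x = self.fuse(torch.cat([target_feat, ref1, ref2], dim=1))
        first_recon = self.tail(self.up(self.body(x)))   # first reconstructed image
        # Second reconstructed image: superimpose a bilinearly upsampled base.
        base = F.interpolate(lr_frame, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)
        return first_recon + base
```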
  • super-resolution processing of the target frame of the video is achieved through the above steps. Compared with the related art, this not only improves the efficiency of video super-resolution processing and reduces its computing-resource requirements, but also improves the quality of the super-resolution reconstructed video.
  • FIG. 2 is a schematic flowchart of other embodiments of the video processing method of the present disclosure.
  • in step 210, feature encoding is performed on the target frame of the video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
  • a feature encoding network model is used to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames.
  • the feature dimension of the encoded feature map is larger than the feature dimension of the original video frame.
  • a three-channel RGB video frame is input into the feature encoding network model, and a 128-channel feature map is obtained.
  • in this way, the feature dimensionality can be increased and richer image details extracted, which in turn helps improve the effect of image super-resolution processing.
  • in step 220, for each pixel in the feature map of the target frame, the pixel in the feature maps of the adjacent frames that has the greatest correlation with that pixel is determined.
  • the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames is determined, and the pixel with the greatest correlation is selected from the neighborhood pixels according to the correlation.
  • in step 230, a first reference feature map is constructed based on the pixels in the feature maps of the adjacent frames that have the greatest correlation.
  • the product of the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation is taken as the first reference feature corresponding to the pixel.
  • the first reference feature corresponding to each pixel in the feature map of the target frame can be obtained, and further, the first reference feature map can be obtained based on this.
  • the pixel features in the first reference feature map are the first reference features corresponding to the pixels in the feature map of the target frame.
  • in step 240, a second reference feature map is constructed.
  • the second reference feature map incorporates universal pixel features.
  • the universal pixel features are image detail information learned from known high-resolution video images and their corresponding low-resolution images.
  • the universal pixel features are stored in the form of a two-dimensional matrix. For example, a 256*128 two-dimensional matrix is used; this matrix consists of 256 feature vectors of 128 channels each, and each feature vector stores one of the most representative universal pixel features learned during training from known high-resolution video images and their corresponding low-resolution images.
  • a second reference feature map is constructed from the two-dimensional matrix.
  • in step 250, super-resolution reconstruction is performed based on the feature map of the target frame, the first reference feature map, and the second reference feature map.
  • the feature map of the target frame, the first reference feature map, and the second reference feature map are input into the feature decoding network model, and the image output by the model is used as a super-resolution reconstructed image.
  • the function of the feature decoding network model is to fuse the feature map of the target frame with the first reference feature map and the second reference feature map, and to remap the fused result from the high-dimensional feature space back to the low-dimensional feature space, obtaining a high-resolution reconstructed image.
  • a 128-channel target frame feature map, a 128-channel first reference feature map, and a 128-channel second reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
  • the feature map of the target frame, the first reference feature map, and the second reference feature map are input into the feature decoding network model to output a first reconstructed image; the feature map obtained by bilinear interpolation of the feature map of the target frame is superimposed with the first reconstructed image to obtain a second reconstructed image, which is used as the final super-resolution reconstructed image.
  • after the high-resolution reconstructed image is obtained through the feature decoding network model, superimposing it with the feature map obtained by bilinear interpolation of the feature map of the target frame can further improve the image resolution.
  • tests based on the widely recognized Vimeo90K-Test and Vid4 datasets verified that the method of the disclosed embodiments performs well in general small-displacement video super-resolution processing.
  • tests based on the Parkour dataset verified that the method of the disclosed embodiments also performs well in large-displacement video super-resolution processing.
  • the quantitative metrics are the widely used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
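  • For reference, PSNR, the first of the two metrics named above, is derived from the mean squared error between the reconstructed and ground-truth images; a minimal implementation for images scaled to [0, 1] follows.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```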
  • super-resolution processing of the target frame of the video is achieved through the above steps.
  • a cross-frame attention mechanism is introduced to achieve pixel-level matching between adjacent frames and the target frame, so that the present disclosure does not need to explicitly compute the pixel correspondence between video frames using methods such as optical flow. This avoids the poor computational accuracy, high computing-resource requirements, and slow speed caused in the related art by relying on such methods to compute the image correspondence between video frames, and helps improve the efficiency of video super-resolution processing, reduce its computing-resource requirements, and improve the quality of the super-resolution reconstructed video.
  • in addition to the image details of adjacent frames, the image details of universal pixels are also fused, which can increase image detail to a certain extent and improve the quality of the super-resolution reconstructed video.
  • Figure 3 is a schematic flowchart of some embodiments of constructing the second reference feature map of the present disclosure. Step 240 in the previous embodiment is described in detail below with reference to Figure 3.
  • in step 241, the correlation between each pixel in the feature map of the target frame and multiple universal pixel features is determined.
  • the universal pixel features are stored in the form of a two-dimensional matrix.
  • the two-dimensional matrix is composed of multiple feature vectors.
  • each feature vector stores one of the most representative universal pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the feature map of the target frame, the correlation between the pixel and each feature vector in the two-dimensional matrix is calculated.
  • the above two-dimensional matrix is obtained through pre-training.
  • during training, the two-dimensional matrix gradually converges from its initial random values to the final result.
  • after training, the above two-dimensional matrix is fixed.
  • in this way, the two-dimensional matrix is formed, and it can be consulted when performing super-resolution processing on the current video, thereby improving the quality of video super-resolution processing.
  • in step 242, the multiple universal pixel features are fused according to the correlation to obtain the second reference features corresponding to the pixels in the feature map of the target frame.
  • the correlations obtained in step 241 are normalized; the normalized correlations are used as weights, a weighted average is performed over the multiple universal pixel features, and the result of the weighted average is taken as the second reference feature corresponding to the pixel.
  • a 256*128 two-dimensional matrix is used.
  • This two-dimensional matrix consists of 256 128-channel feature vectors.
  • each feature vector stores one of the most representative universal pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the feature map of the target frame, the correlation between the pixel and the 256 feature vectors in the two-dimensional matrix is calculated, yielding 256 correlation values. After these 256 correlation values are normalized, the normalized values are used as weights, a weighted average is performed over the 256 feature vectors, and the result of the weighted average is taken as the second reference feature corresponding to the pixel.
  • a second reference feature map is constructed based on the second reference features corresponding to the pixels in the feature map of the target frame.
  • the pixel features in the second reference feature map are the second reference features corresponding to the pixels in the feature map of the target frame.
  • through the above steps, the construction of the second reference feature map is completed.
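  • The construction just described amounts to soft attention over a learned codebook: each target-frame pixel queries the universal feature vectors, the correlations are normalized, and the weighted average becomes that pixel's second reference feature. A sketch follows; using softmax as the normalization is an assumption (the text only says the correlations are normalized), and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def second_reference_map(target_feat: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """
    target_feat: (C, H, W) feature map of the target frame, e.g. C = 128.
    codebook:    (N, C) universal pixel features, e.g. a 256*128 matrix.
    Returns the (C, H, W) second reference feature map.
    """
    C, H, W = target_feat.shape
    pixels = target_feat.reshape(C, H * W).t()   # (H*W, C): one query per pixel

    corr = pixels @ codebook.t()                 # (H*W, N) inner-product correlations
    weights = F.softmax(corr, dim=1)             # normalize the N correlations per pixel

    fused = weights @ codebook                   # (H*W, C) weighted average of features
    return fused.t().reshape(C, H, W)

# Usage: the codebook is learned during training and fixed afterwards.
codebook = torch.randn(256, 128)
ref2 = second_reference_map(torch.randn(128, 64, 64), codebook)
```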
  • with the first and second reference feature maps constructed through the above steps, super-resolution reconstruction of the target frame fuses not only the image details of adjacent frames but also the image details of universal pixels, which can increase image detail to a certain extent and improve the quality of the super-resolution reconstructed video.
  • Figure 4 is a comparison of video-frame processing effects between embodiments of the present disclosure and the related art. As shown in Figure 4, it contains three rows and four columns of images. The first column (from left to right) shows example video frames, and the second to fourth columns are enlargements of the area selected by the rectangular box in each example frame. The second column is the result of bicubic-interpolation upscaling of the low-resolution video frame, the third column is the result of processing the low-resolution video frame with the video processing method of the embodiments of the present disclosure, and the last column is the result of processing the low-resolution video frame with a current industry-leading super-resolution method.
  • the third row shows an example of a sports video.
  • the objects in this type of video have large displacements.
  • current industry-leading methods generally cannot restore high-resolution details in such videos, but the present disclosure has been tested and shown to maintain good super-resolution performance in such large-displacement videos.
  • FIG. 5 is a schematic structural diagram of some embodiments of the video processing device of the present disclosure.
  • the video processing device in the embodiment of the present disclosure includes: a feature encoding module 510 , a first building module 520 , and a reconstruction module 530 .
  • the feature encoding module 510 is configured to perform feature encoding on the target frame of the video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
  • the target frame is the video frame to be processed by super-resolution
  • the adjacent frames are one or more video frames adjacent to the video frame to be super-resolution processed. For example, the 3 preceding and 3 following frames adjacent to the target frame may be taken as adjacent frames, or the 2 preceding frames, or the 4 following frames, and so on.
  • the feature encoding module 510 uses a feature encoding network model to perform feature encoding on the target frame and its adjacent frames to obtain feature maps of the target frame and feature maps of adjacent frames.
  • the feature dimension of the encoded feature map is larger than the feature dimension of the original video frame.
  • a three-channel RGB video frame is input into the feature encoding network model, and a 128-channel feature map is obtained.
  • the feature encoding network model is a network model composed of multiple layers of ResNet (residual network) modules, such as a network model composed of 5 ResNet modules.
  • without affecting the implementation of the present disclosure, the feature encoding network used in the present disclosure can also adopt other network model structures, such as an autoencoder or a Residual Dense Network (RDN).
  • using the feature encoding module to perform feature encoding on the target frame and its adjacent frames can raise the feature dimensionality and extract richer image detail information, which helps improve the effect of image super-resolution processing.
  • the first building module 520 is configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel, and to construct a first reference feature map from the maximally correlated pixels in the feature maps of the adjacent frames.
  • the first building module 520 determines the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames, and selects the pixel with the greatest correlation from the neighborhood pixels according to the correlation. In these embodiments, the first building module 520 computes only the correlations between pixels in the target frame and their neighborhood pixels in the adjacent frames; compared with embodiments that compute pixel correlations between the target frame and the adjacent frames exhaustively, this greatly reduces the amount of computation.
  • the first building module 520 determines the first reference feature corresponding to the pixel based on the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation, and constructs the first reference feature map from the first reference features corresponding to each pixel in the feature map of the target frame.
  • the first building module 520 takes the product of the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation as the first reference feature corresponding to the pixel.
  • the first reference features corresponding to each pixel in the feature map of the target frame can be obtained, and further, the first reference feature map can be obtained.
  • the pixel features in the first reference feature map are the first reference features corresponding to the pixels in the feature map of the target frame.
  • taking the product of the maximally correlated pixel feature and the corresponding correlation as the first reference feature is an example only.
  • the first building module 520 may also adopt other specific implementations that determine the first reference feature from the maximally correlated pixel feature and the corresponding correlation.
  • the first building module computes the pixel correlations between the feature maps of the target frame and the adjacent frames and constructs the first reference feature map from the maximally correlated pixels selected from the adjacent frames' feature maps. The present disclosure therefore does not need to explicitly compute the pixel correspondence between video frames using methods such as optical flow, avoiding the poor computational accuracy, high computing-resource requirements, and slow speed caused in the related art by relying on such methods, and helping to improve the efficiency of video super-resolution processing, reduce its computing-resource requirements, and improve the quality of the super-resolution reconstructed video.
  • the reconstruction module 530 is configured to perform super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
  • the reconstruction module 530 inputs the feature map of the target frame and the first reference feature map into the feature decoding network model and uses the image output by the model as the super-resolution reconstructed image, that is, the finally reconstructed high-resolution image corresponding to the target frame.
  • the function of the feature decoding network model is to fuse the feature map of the target frame with the first reference feature map and to remap the fused result from the high-dimensional feature space back to the low-dimensional feature space, obtaining a high-resolution reconstructed image.
  • a 128-channel target frame feature map and a 128-channel first reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
  • the feature decoding network model is a network model composed of a multi-layer ResNet module and an upsampling module, such as a network model composed of a 40-layer ResNet module and an upsampling module.
  • without affecting the implementation of the present disclosure, the feature decoding network used in the present disclosure can adopt other network model structures besides a network model composed of a 40-layer ResNet module and an upsampling module, such as RDN.
  • the reconstruction module 530 inputs the feature map of the target frame and the first reference feature map into the feature decoding network model to output a first reconstructed image; the feature map obtained by bilinear interpolation of the feature map of the target frame is superimposed with the first reconstructed image to obtain a second reconstructed image, which is used as the finally reconstructed high-resolution image corresponding to the target frame.
  • superimposing the reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame can further improve the image resolution.
  • the above device realizes super-resolution processing of the target frame of the video. Compared with the related art, it not only improves the efficiency of video super-resolution processing and reduces its computing-resource requirements, but also improves the quality of the super-resolution reconstructed video.
  • FIG. 6 is a schematic structural diagram of some other embodiments of the video processing device of the present disclosure.
  • the video processing device in the embodiment of the present disclosure includes: a feature encoding module 610, a first building module 620, a second building module 630, and a reconstruction module 640.
  • the feature encoding module 610 is configured to perform feature encoding on the target frame of the video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
  • the feature encoding module 610 uses a feature encoding network model to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames.
  • the feature dimension of the encoded feature map is larger than the feature dimension of the original video frame.
  • for example, a three-channel RGB video frame is input into the feature encoding network model to obtain a 128-channel feature map.
  • in this way, the feature dimensionality can be increased and richer image detail information extracted, which helps improve the effect of image super-resolution processing.
  • the first building module 620 is configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel, and to construct a first reference feature map from the maximally correlated pixels in the feature maps of the adjacent frames.
  • the first building module 620 determines the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames, and selects the pixel with the greatest correlation from the neighborhood pixels according to the correlation.
  • the first building module 620 takes the product of the feature of the maximally correlated pixel selected from the feature maps of the adjacent frames and the corresponding correlation as the first reference feature corresponding to the pixel. In this way, the first reference feature corresponding to each pixel in the feature map of the target frame can be obtained, and the first reference feature map can then be obtained.
  • the pixel features in the first reference feature map are the first reference features corresponding to the pixels in the feature map of the target frame.
  • the second building module 630 is configured to build a second reference feature map.
  • the second reference feature map incorporates universal pixel features.
  • the universal pixel features are image detail information learned from high-resolution videos.
  • the universal pixel features are stored in the form of a two-dimensional matrix. For example, a 256*128 two-dimensional matrix is used; this matrix consists of 256 feature vectors of 128 channels each, and each feature vector stores one of the most representative universal pixel features learned from high-resolution videos during training.
  • a second reference feature map is constructed from the two-dimensional matrix.
  • the second building module 630 determines the correlation between each pixel in the feature map of the target frame and multiple universal pixel features, fuses the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel, and constructs the second reference feature map from the second reference features corresponding to the pixels.
  • the second building module 630 normalizes the obtained correlations, uses the normalized correlations as weights, performs a weighted average over the multiple universal pixel features, and takes the result of the weighted average as the second reference feature corresponding to the pixel in the feature map of the target frame.
  • for example, a 256*128 two-dimensional matrix is used, consisting of 256 feature vectors of 128 channels each, where each feature vector stores one of the most representative universal pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the feature map of the target frame, the correlation between the pixel and the 256 feature vectors in the matrix is calculated, yielding 256 correlation values; after these are normalized, the normalized values are used as weights, a weighted average is performed over the 256 feature vectors, and the result is taken as the second reference feature corresponding to the pixel.
  • the reconstruction module 640 is configured to perform super-resolution reconstruction based on the feature map of the target frame, the first reference feature map, and the second reference feature map.
  • the reconstruction module 640 inputs the feature map of the target frame, the first reference feature map, and the second reference feature map into the feature decoding network model and uses the image output by the model as the super-resolution reconstructed image. That is, the function of the feature decoding network model is to fuse the feature map of the target frame with the first and second reference feature maps and to remap the fused result from the high-dimensional feature space back to the low-dimensional feature space, obtaining a high-resolution reconstructed image. For example, in some specific examples, a 128-channel target-frame feature map, a 128-channel first reference feature map, and a 128-channel second reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
  • the reconstruction module 640 inputs the feature map of the target frame, the first reference feature map, and the second reference feature map into the feature decoding network model to output a first reconstructed image; the feature map obtained by bilinear interpolation of the feature map of the target frame is superimposed with the first reconstructed image to obtain a second reconstructed image, which is used as the final super-resolution reconstructed image.
  • superimposing the result with the feature map obtained by bilinear interpolation of the feature map of the target frame can further improve the image resolution.
  • the method further includes: training each module in the video processing device before performing super-resolution processing on the current video based on the video processing device.
  • the training process can be carried out in an end-to-end manner.
  • the entire network training process is as follows: first, disconnect the second building module and train only the other three modules, with the loss function being the L1 distance (i.e., Manhattan distance) between the output high-resolution image and the correctly labeled high-resolution sample image; next, fix the other three modules and train only the two-dimensional matrix used in the second building module, with the loss function being the L1 distance between the input and the output of the network model used to train the two-dimensional matrix; finally, train all modules of the video processing device, with the loss function being the L1 distance between the output high-resolution image and the correctly labeled high-resolution sample image.
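  • The three training stages described above can be expressed by toggling requires_grad on the module groups and minimizing L1 losses, as in the sketch below. All module and function names are hypothetical stand-ins; forward_fn runs the full pipeline, and codebook_fn returns the input and output of the network model used to train the two-dimensional matrix.

```python
import itertools
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def three_stage_training(encoder, builder1, reconstructor, codebook_module,
                         forward_fn, codebook_fn, loader, steps: int = 1000):
    """Three-stage schedule: (1) train everything except the codebook,
    (2) train only the codebook, (3) train all modules end to end."""
    l1 = nn.L1Loss()  # the L1 (Manhattan) distance used at every stage
    stages = [
        ([encoder, builder1, reconstructor],
         lambda lr, hr: l1(forward_fn(lr), hr)),        # stage 1
        ([codebook_module],
         lambda lr, hr: l1(*codebook_fn(lr))),          # stage 2: input-vs-output L1
        ([encoder, builder1, reconstructor, codebook_module],
         lambda lr, hr: l1(forward_fn(lr), hr)),        # stage 3
    ]
    all_modules = [encoder, builder1, reconstructor, codebook_module]
    for trainable, loss_fn in stages:
        for m in all_modules:
            set_trainable(m, any(m is t for t in trainable))
        opt = torch.optim.Adam(
            itertools.chain(*(m.parameters() for m in trainable)), lr=1e-4)
        for _, (lr_batch, hr_batch) in zip(range(steps), loader):
            opt.zero_grad()
            loss_fn(lr_batch, hr_batch).backward()
            opt.step()
```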
  • the above device realizes super-resolution processing of the target frame of the video.
  • a cross-frame attention mechanism is introduced to achieve pixel-level matching between adjacent frames and the target frame, so that the present disclosure does not need to explicitly compute the pixel correspondence between video frames using methods such as optical flow. This avoids the poor computational accuracy, high computing-resource requirements, and slow speed caused in the related art by relying on such methods to compute the image correspondence between video frames, and helps improve the efficiency of video super-resolution processing, reduce its computing-resource requirements, and improve the quality of the super-resolution reconstructed video.
  • in addition to the image details of adjacent frames, the image details of universal pixels are also fused, which can increase image detail to a certain extent and improve the quality of the super-resolution reconstructed video.
  • Figure 8 is a schematic structural diagram of some other embodiments of the video processing device of the present disclosure.
  • the device includes memory 810 and processor 820, wherein: memory 810 may be a disk, flash memory, or any other non-volatile storage medium.
  • the memory is used to store instructions in the embodiments corresponding to Figures 1-3.
  • Processor 820 is coupled to memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller.
  • the processor 820 is used to execute instructions stored in the memory.
  • the device 900 includes a memory 910 and a processor 920 .
  • the processor 920 is coupled to the memory 910 via a bus 930.
  • the device 900 can also be connected to an external storage device 950 through a storage interface 940 to access external data, and can also be connected to a network or another computer system (not shown) through a network interface 960, which will not be described in detail here.
  • the efficiency of video processing can be improved, the requirements for computing resources for video super-resolution processing can be reduced, and the quality of the video reconstructed by super-resolution can be improved.
  • a computer-readable storage medium has computer program instructions stored thereon, and when the instructions are executed by a processor, the steps of the methods in the embodiments corresponding to Figures 1-3 are implemented.
  • embodiments of the present disclosure may be provided as methods, apparatuses, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
  • the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure discloses a video processing method and device, relating to the technical field of video processing. The video processing method includes: performing feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames; for each pixel in the feature map of the target frame, determining the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel; constructing a first reference feature map based on the pixels in the feature maps of the adjacent frames that have the greatest correlation; and performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.

Description

Video processing method and device
Cross-reference to related application
This application is based on and claims priority to the CN application No. 202210334768.7 filed on March 31, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of video processing, and in particular to a video processing method and device.
Background
Super-resolution refers to improving the resolution of an original image through hardware or software methods; the process of obtaining one high-resolution image from a series of low-resolution images is super-resolution reconstruction. The core idea of super-resolution reconstruction is to trade temporal bandwidth (acquiring a multi-frame image sequence of the same scene) for spatial resolution, realizing the conversion of temporal resolution into spatial resolution.
Summary
According to one aspect of the present disclosure, a video processing method is provided, including: performing feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames; for each pixel in the feature map of the target frame, determining the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel; constructing a first reference feature map based on the pixels in the feature maps of the adjacent frames that have the greatest correlation; and performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
In some embodiments, performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map includes: performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, where the second reference feature map incorporates universal pixel features.
In some embodiments, determining the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel includes: determining the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames; and selecting, according to the correlation, the pixel with the greatest correlation from among the neighborhood pixels.
In some embodiments, constructing the first reference feature map based on the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel includes: determining the first reference feature corresponding to the pixel based on the feature of the maximally correlated pixel in the feature maps of the adjacent frames and the corresponding correlation; and constructing the first reference feature map from the first reference features corresponding to the pixels.
In some embodiments, the video processing method further includes: determining the correlation between each pixel in the feature map of the target frame and multiple universal pixel features; fusing the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel; and constructing the second reference feature map from the second reference features corresponding to the pixels.
In some embodiments, fusing the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel includes: normalizing the correlations; using the normalized correlations as weights, performing a weighted average over the multiple universal pixel features; and taking the result of the weighted average as the second reference feature corresponding to the pixel.
In some embodiments, performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map includes: inputting the feature map of the target frame, the first reference feature map, and the second reference feature map into a feature decoding network model to output a first reconstructed image; and superimposing the first reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame, to obtain a second reconstructed image.
In some embodiments, the network model used for feature encoding includes multiple layers of ResNet modules.
In some embodiments, the feature decoding network model includes multiple layers of ResNet modules and an upsampling module.
According to another aspect of the present disclosure, a video processing device is provided, including: a feature encoding module configured to perform feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames; a first building module configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames that has the greatest correlation with the pixel, and to construct a first reference feature map from the maximally correlated pixels in the feature maps of the adjacent frames; and a reconstruction module configured to perform super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
In some embodiments, the reconstruction module is configured to perform super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, where the second reference feature map incorporates universal pixel features.
In some embodiments, the first building module is configured to: determine the correlation between the pixel and its neighborhood pixels in the feature maps of the adjacent frames; and select, according to the correlation, the pixel with the greatest correlation from among the neighborhood pixels.
In some embodiments, the first building module is configured to: determine the first reference feature corresponding to the pixel based on the feature of the maximally correlated pixel in the feature maps of the adjacent frames and the corresponding correlation; and construct the first reference feature map from the first reference features corresponding to the pixels.
In some embodiments, the video processing device further includes a second building module configured to: determine the correlation between each pixel in the feature map of the target frame and multiple universal pixel features; fuse the multiple universal pixel features according to the correlation to obtain the second reference feature corresponding to the pixel; and construct the second reference feature map from the second reference features corresponding to the pixels.
In some embodiments, the second building module is configured to: normalize the correlations; use the normalized correlations as weights to perform a weighted average over the multiple universal pixel features; and take the result of the weighted average as the second reference feature corresponding to the pixel.
In some embodiments, the second building module is configured to: input the feature map of the target frame, the first reference feature map, and the second reference feature map into a feature decoding network model to output a first reconstructed image; and superimpose the first reconstructed image with the feature map obtained by bilinear interpolation of the feature map of the target frame, to obtain a second reconstructed image.
In some embodiments, the feature encoding network model includes multiple layers of ResNet modules.
In some embodiments, the feature decoding network model includes multiple layers of ResNet modules and an upsampling module.
According to another aspect of the present disclosure, a video processing device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the above video processing method based on instructions stored in the memory.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored; when the instructions are executed by a processor, the above video processing method is implemented.
通过以下参照附图对本公开的示例性实施例的详细描述,本公开的其它特征及其优点将会变得清楚。
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of some embodiments of the video processing method of the present disclosure.
FIG. 2 is a schematic flowchart of other embodiments of the video processing method of the present disclosure.
FIG. 3 is a schematic flowchart of some embodiments of constructing the second reference feature map according to the present disclosure.
FIG. 4 is a comparison of video frame processing results between an embodiment of the present disclosure and the related art.
FIG. 5 is a schematic structural diagram of some embodiments of the video processing apparatus of the present disclosure.
FIG. 6 is a schematic structural diagram of other embodiments of the video processing apparatus of the present disclosure.
FIG. 7a is a schematic structural diagram of some embodiments of the feature encoding network model of the present disclosure.
FIG. 7b is a schematic structural diagram of some embodiments of the feature decoding network model of the present disclosure.
FIG. 8 is a schematic structural diagram of other embodiments of the video processing apparatus of the present disclosure.
FIG. 9 is a schematic structural diagram of further embodiments of the video processing apparatus of the present disclosure.
DETAILED DESCRIPTION
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Meanwhile, it should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present disclosure or on its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail; where appropriate, however, such techniques, methods, and devices should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than as a limitation; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it therefore need not be discussed further in subsequent drawings.
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to specific embodiments and the accompanying drawings.
In the related art, super-resolution processing can be performed through convolutional neural networks to restore image details in low-resolution videos. Such methods mainly fill in the details missing from low-resolution video frames by fusing the local details of adjacent frames, and they have the following shortcomings. First, before fusing the local details of adjacent frames, the related art needs to rely on methods such as optical flow to compute the image correspondence between video frames, which leads to poor computational accuracy when there is large motion in the video, as well as high demands on computing resources and slow speed. Second, since the adjacent frames available for reference are themselves low-resolution, they contribute little to enhancing the sharpness of the key frame.
A technical problem to be solved by the present disclosure is to provide a video processing method and apparatus that can improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video.
Compared with the related art, in some embodiments of the present disclosure, when fusing the image details of adjacent frames, a cross-frame attention mechanism is introduced to achieve pixel-level matching between the adjacent frames and the target frame. Specifically, the pixel in the feature maps of the adjacent frames having the highest correlation with a pixel of the target frame is determined, and the first reference feature map is constructed from those highest-correlation pixels. In this way, the present disclosure does not need to explicitly compute the pixel correspondence between video frames based on methods such as optical flow, thereby solving the problems in the related art of poor computational accuracy, high demands on computing resources, and slow speed caused by relying on optical flow and similar methods to compute the image correspondence between video frames. This helps improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video. Further, in some embodiments of the present disclosure, not only the image details of adjacent frames but also the image details of generic pixels are fused during video super-resolution reconstruction, which can increase image details to a certain extent and improve the quality of the super-resolution reconstructed video.
FIG. 1 is a schematic flowchart of some embodiments of the video processing method of the present disclosure.
In step 110, feature encoding is performed on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
The target frame is a video frame to be super-resolution processed, and the adjacent frames are one or more video frames adjacent to it. For example, the 3 frames before and the 3 frames after the target frame may be taken as adjacent frames; or the 2 frames before the target frame; or the 4 frames after the target frame; and so on.
In some embodiments, a key frame of the video is taken as the target frame, and one or more video frames adjacent to the key frame are taken as the adjacent frames. For example, the video is super-resolution processed with a sliding window: each time, the key frame to be processed together with its 3 preceding and 3 following frames is taken and processed according to the flow shown in FIG. 1 to obtain the super-resolution reconstructed image of that key frame.
In step 110, a feature encoding network model is used to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames. The feature dimension of the encoded feature maps is larger than that of the original video frames. For example, in one specific example, a three-channel RGB video frame is input into the feature encoding network model to obtain a 128-channel feature map.
Exemplarily, the feature encoding network model is a network model composed of multiple layers of residual network (ResNet) modules, for example, 5 layers of ResNet modules. Those skilled in the art can understand that, without affecting the implementation of the present disclosure, the feature encoding network used in the present disclosure may adopt other network model structures besides a network composed of 5 layers of ResNet modules, such as an autoencoder or a Residual Dense Network (RDN).
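The patent publishes no reference code; the following PyTorch sketch shows one plausible shape for such an encoder, assuming 5 residual blocks of the Conv-ReLU-Conv form shown in FIG. 7a and a 3-to-128-channel lift. The class names, hyperparameters, and layer ordering are illustrative assumptions, not the applicant's implementation.
```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Conv -> ReLU -> Conv with an identity skip, per the Fig. 7a description
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureEncoder(nn.Module):
    # Lifts 3-channel RGB frames to 128-channel feature maps
    def __init__(self, in_ch: int = 3, feat_ch: int = 128, n_blocks: int = 5):
        super().__init__()
        self.head = nn.Conv2d(in_ch, feat_ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(feat_ch) for _ in range(n_blocks)])

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.blocks(self.head(frame))  # (B, 128, H, W)
```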
In the embodiments of the present disclosure, performing feature encoding on the target frame and its adjacent frames through the feature encoding network model can raise the feature dimension and extract richer image detail information, which in turn helps improve the effect of image super-resolution processing.
In step 120, for each pixel in the feature map of the target frame, the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel is determined.
In some embodiments, for each pixel in the feature map of the target frame, the correlations between the pixel and its neighborhood pixels in the feature maps of the adjacent frames are determined, and the pixel with the highest correlation is selected from the neighborhood pixels according to the correlations. In these embodiments, by computing only the correlations between a pixel in the target frame and its neighborhood pixels in the adjacent frames, the amount of computation is greatly reduced compared with other embodiments that compute the pixel correlations between the target frame and the adjacent frames one by one.
A pixel's neighborhood pixels in the feature map of an adjacent frame are the pixels in that feature map that are spatially adjacent to the position of the target-frame pixel. Exemplarily, taking the 9*9 pixel region centered on pixel a as the pixel's neighborhood, pixel a has 9*9 neighborhood pixels in each adjacent frame.
The correlation between pixels describes their degree of similarity. Exemplarily, the correlation between two pixels can be computed as follows: the inner product of the feature vector of pixel a and the feature vector of pixel b is taken as the correlation between pixel a and pixel b.
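In symbols, writing $f_a, f_b \in \mathbb{R}^{C}$ for the encoded feature vectors of pixels $a$ and $b$ (with $C = 128$ in the example above), this correlation is the inner product:
$$r(a, b) = \langle f_a, f_b \rangle = \sum_{c=1}^{C} f_a[c]\, f_b[c]$$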
For example, suppose the target frame is the 4th frame of the video and its adjacent frames are frames 1, 2, 3, 5, 6, and 7. For pixel a in the feature map of the 4th frame, its correlations with the 9*9 neighborhood pixels in each of the 6 adjacent-frame feature maps are computed, yielding 6*9*9 correlation values; according to these correlations, the pixel with the highest correlation with pixel a is selected from the 6*9*9 neighborhood pixels. The above process is performed for each pixel in the target-frame feature map, so that the pixel with the highest correlation with each pixel of the target-frame feature map can be found in the feature maps of the 6 adjacent frames.
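A minimal sketch of this neighborhood matching follows, assuming the windowed search is implemented with F.unfold; the function name best_match and the tensor layout are assumptions for illustration, and a real implementation would likely chunk the unfolded patches to bound memory.
```python
import torch
import torch.nn.functional as F

def best_match(target_feat, neighbor_feats, window=9):
    """For each target pixel, find the highest-correlation pixel among the
    window*window neighborhood pixels in every adjacent-frame feature map.

    target_feat:    (B, C, H, W) feature map of the target frame
    neighbor_feats: (B, T, C, H, W) feature maps of T adjacent frames
    Returns matched features (B, C, H, W) and their correlations (B, 1, H, W).
    """
    B, T, C, H, W = neighbor_feats.shape
    pad = window // 2
    k = window * window
    # Unfold each adjacent frame into per-pixel neighborhoods: (B, T, C, k, H, W)
    patches = F.unfold(neighbor_feats.flatten(0, 1), window, padding=pad)
    patches = patches.view(B, T, C, k, H, W)
    # Inner product between the target pixel and every neighborhood pixel
    corr = (target_feat.view(B, 1, C, 1, H, W) * patches).sum(dim=2)  # (B, T, k, H, W)
    corr = corr.view(B, T * k, H, W)
    best_corr, idx = corr.max(dim=1, keepdim=True)                    # (B, 1, H, W)
    # Gather the feature vector of the winning neighborhood pixel
    patches = patches.permute(0, 2, 1, 3, 4, 5).reshape(B, C, T * k, H, W)
    best_feat = patches.gather(2, idx.unsqueeze(1).expand(B, C, 1, H, W)).squeeze(2)
    return best_feat, best_corr
```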
In step 130, the first reference feature map is constructed according to the pixels in the adjacent-frame feature maps having the highest correlation.
In some embodiments, for each pixel in the feature map of the target frame, the first reference feature corresponding to the pixel is determined according to the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation; the first reference feature map is then constructed from the first reference features corresponding to the pixels of the target-frame feature map.
In some optional implementations, for each pixel in the feature map of the target frame, the product of the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation is taken as the first reference feature corresponding to the pixel. In this way, the first reference feature corresponding to each pixel of the target-frame feature map can be obtained, and the first reference feature map can then be obtained. The pixel features in the first reference feature map are the first reference features corresponding to the pixels in the target-frame feature map.
Those skilled in the art can understand that taking the product of the highest-correlation pixel feature and the corresponding correlation as the first reference feature is only an example. Without affecting the implementation of the present invention, other specific implementations that determine the first reference feature from the highest-correlation pixel feature and the corresponding correlation may also be adopted.
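Continuing the sketch above, the first reference feature map then follows by scaling each matched feature with its correlation, per the optional implementation just described; another combination rule would slot in at the final line.
```python
def build_first_reference(target_feat, neighbor_feats, window=9):
    # One optional implementation from the text: the matched feature scaled
    # by its correlation, per pixel; best_match is the sketch defined earlier.
    best_feat, best_corr = best_match(target_feat, neighbor_feats, window)
    return best_feat * best_corr  # (B, C, H, W) first reference feature map
```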
In the embodiments of the present disclosure, when fusing the image details of adjacent frames, steps 120 and 130 introduce a cross-frame attention mechanism to achieve pixel-level matching between the adjacent frames and the target frame. In this way, the embodiments of the present disclosure do not need to explicitly compute the pixel correspondence between video frames based on methods such as optical flow, thereby solving the problems in the related art of poor computational accuracy, high demands on computing resources, and slow speed caused by relying on optical flow and similar methods to compute the image correspondence between video frames. This helps improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video.
In step 140, super-resolution reconstruction is performed according to the feature map of the target frame and the first reference feature map.
In some embodiments, the feature map of the target frame and the first reference feature map are input into a feature decoding network model, and the image output by the model is taken as the super-resolution reconstructed image, i.e., the finally reconstructed high-resolution image corresponding to the target frame. The role of the decoding network model is to fuse the target-frame feature map with the first reference feature map and to remap the fused image from the high-dimensional feature space back to a low-dimensional feature space to obtain a high-resolution reconstructed image. For example, in one specific example, the 128-channel target-frame feature map and the 128-channel first reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
Exemplarily, the feature decoding network model is a network model composed of multiple layers of ResNet modules and an upsampling module, for example, 40 layers of ResNet modules plus an upsampling module. In some embodiments, the detailed composition of the ResNet module is shown in FIG. 7a and that of the upsampling module in FIG. 7b. The ResNet module includes a convolutional layer, a ReLU activation layer, and a further convolutional layer. The upsampling module includes a convolutional layer, a Pixel Shuffle operation followed by a Leaky ReLU layer, and another Leaky ReLU layer. The ResNet modules are responsible for fusing the target-frame feature map with the first reference feature map to obtain a fused high-dimensional low-resolution image; the upsampling module is responsible for reorganizing the high-dimensional low-resolution image features into low-dimensional high-resolution image features, for example, reorganizing 128-channel low-resolution features into 3-channel high-resolution RGB image features. This process uses the industry-standard Pixel Shuffle operation. Pixel Shuffle is an upsampling method that can effectively enlarge a reduced feature map and can replace interpolation or deconvolution for upsampling. Leaky ReLU and ReLU are two activation functions commonly used in deep learning.
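As a companion to the encoder sketch (and reusing its ResBlock), here is one way this decoder could be realized. The fusion by channel concatenation, the overall x4 scale, and the Leaky ReLU slope are assumptions; the text fixes only the module types (ResNet blocks, convolution, Pixel Shuffle, Leaky ReLU).
```python
class UpsampleBlock(nn.Module):
    # Conv -> PixelShuffle -> LeakyReLU, roughly following the Fig. 7b description
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

class FeatureDecoder(nn.Module):
    # Fuses the concatenated feature maps and maps them back to a 3-channel image
    def __init__(self, feat_ch=128, n_inputs=2, n_blocks=40, scale=4):
        super().__init__()
        self.fuse = nn.Conv2d(feat_ch * n_inputs, feat_ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(feat_ch) for _ in range(n_blocks)])
        ups, s = [], scale
        while s > 1:                      # e.g. x4 upsampling as two x2 stages
            ups.append(UpsampleBlock(feat_ch))
            s //= 2
        self.up = nn.Sequential(*ups)
        self.tail = nn.Conv2d(feat_ch, 3, 3, padding=1)

    def forward(self, *feature_maps):     # target feature map + reference map(s)
        x = self.fuse(torch.cat(feature_maps, dim=1))
        return self.tail(self.up(self.blocks(x)))
```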
Those skilled in the art can understand that, without affecting the implementation of the present disclosure, the feature decoding network used in the present disclosure may adopt other network model structures besides a network composed of 40 layers of ResNet modules and an upsampling module, such as an RDN.
In other embodiments, the feature map of the target frame and the first reference feature map are input into the feature decoding network model to output a first reconstructed image; a feature map obtained by performing bilinear interpolation on the target-frame feature map is superimposed on the first reconstructed image to obtain a second reconstructed image, which is taken as the finally reconstructed high-resolution image corresponding to the target frame. In these embodiments, after the high-resolution reconstructed image is obtained through the feature decoding network model, superimposing it with the bilinearly interpolated feature map of the target frame can further improve the resolution of the reconstructed image.
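A sketch of this residual combination follows, with one caveat: superimposing a bilinearly interpolated 128-channel feature map directly onto a 3-channel image does not typecheck as literally stated, so this sketch adopts the common residual-SR reading in which the low-resolution target frame itself is bilinearly upsampled and added. Treat that reading, and the helper's signature, as assumptions.
```python
import torch.nn.functional as F

def reconstruct(decoder, target_frame, target_feat, reference_maps, scale=4):
    # First reconstructed image from the decoder, then a bilinear skip connection
    first = decoder(target_feat, *reference_maps)       # (B, 3, scale*H, scale*W)
    skip = F.interpolate(target_frame, scale_factor=scale,
                         mode="bilinear", align_corners=False)
    return first + skip                                  # second reconstructed image
```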
In the embodiments of the present disclosure, super-resolution processing of the target frame of the video is achieved through the above steps. Compared with the related art, this not only improves the efficiency of video super-resolution processing and reduces its demands on computing resources, but also improves the quality of the super-resolution reconstructed video.
FIG. 2 is a schematic flowchart of other embodiments of the video processing method of the present disclosure.
In step 210, feature encoding is performed on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
In this embodiment, a feature encoding network model is used to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames. The feature dimension of the encoded feature maps is larger than that of the original video frames. For example, in one specific example, a three-channel RGB video frame is input into the feature encoding network model to obtain a 128-channel feature map.
In this embodiment, performing feature encoding on the target frame and its adjacent frames through the feature encoding network model can raise the feature dimension and extract richer image detail information, which in turn helps improve the effect of image super-resolution processing.
In step 220, for each pixel in the feature map of the target frame, the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel is determined.
In this embodiment, for each pixel in the feature map of the target frame, the correlations between the pixel and its neighborhood pixels in the feature maps of the adjacent frames are determined, and the pixel with the highest correlation is selected from the neighborhood pixels according to the correlations. By computing only the correlations between a pixel in the target frame and its neighborhood pixels in the adjacent frames, the amount of computation is greatly reduced compared with computing the pixel correlations between the target frame and the adjacent frames one by one.
In step 230, the first reference feature map is constructed according to the pixels in the adjacent-frame feature maps having the highest correlation.
In this embodiment, for each pixel in the feature map of the target frame, the product of the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation is taken as the first reference feature corresponding to the pixel. In this way, the first reference feature corresponding to each pixel of the target-frame feature map can be obtained, and the first reference feature map can then be obtained accordingly. The pixel features in the first reference feature map are the first reference features corresponding to the pixels in the target-frame feature map.
In step 240, a second reference feature map is constructed, where the second reference feature map fuses generic pixel features.
The generic pixel features are image detail information learned from known high-resolution video images and their corresponding low-resolution images. In some embodiments, the generic pixel features are stored in the form of a two-dimensional matrix. For example, a 256*128 matrix is used, composed of 256 feature vectors of 128 channels each; each feature vector stores the most representative generic pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. In these embodiments, the second reference feature map is constructed from this matrix.
In step 250, super-resolution reconstruction is performed according to the feature map of the target frame, the first reference feature map, and the second reference feature map.
In some embodiments, the feature map of the target frame, the first reference feature map, and the second reference feature map are input into the feature decoding network model, and the image output by the model is taken as the super-resolution reconstructed image. The role of the decoding network model is to fuse the target-frame feature map with the first and second reference feature maps and to remap the fused image from the high-dimensional feature space back to a low-dimensional feature space to obtain a high-resolution reconstructed image. For example, in one specific example, the 128-channel target-frame feature map, the 128-channel first reference feature map, and the 128-channel second reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
In other embodiments, the feature map of the target frame, the first reference feature map, and the second reference feature map are input into the feature decoding network model to output a first reconstructed image; a feature map obtained by performing bilinear interpolation on the target-frame feature map is superimposed on the first reconstructed image to obtain a second reconstructed image, which is taken as the final super-resolution reconstructed image. In these embodiments, after the high-resolution reconstructed image is obtained through the feature decoding network model, superimposing it with the bilinearly interpolated feature map of the target frame can further improve the image resolution.
In a specific implementation, tests on the academically recognized Vimeo90K-Test and Vid4 datasets verified that the method of the embodiments of the present disclosure performs very well on video super-resolution with generally small displacements, and tests on the Parkour dataset verified that it also performs very well on large-displacement video super-resolution. When measuring the visual difference between the super-resolution results and the correctly annotated (ground-truth) high-resolution videos, the quantitative metrics used are the academically recognized peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
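For reference, PSNR can be computed as below (SSIM is more involved and is typically taken from an existing library such as scikit-image rather than re-implemented); this helper is auxiliary evaluation tooling, not part of the claimed method.
```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0):
    # Peak signal-to-noise ratio between a reconstruction and its ground truth
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```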
In the embodiments of the present disclosure, super-resolution processing of the target frame of the video is achieved through the above steps. Compared with the related art, by introducing a cross-frame attention mechanism when fusing the image details of adjacent frames to achieve pixel-level matching between the adjacent frames and the target frame, the present disclosure does not need to explicitly compute the pixel correspondence between video frames based on methods such as optical flow. This solves the problems in the related art of poor computational accuracy, high demands on computing resources, and slow speed caused by relying on optical flow and similar methods to compute the image correspondence between video frames, and helps improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video. Further, in the embodiments of the present disclosure, performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map means that the reconstruction of the target frame fuses not only the image details of the adjacent frames but also the image details of generic pixels, which can increase image details to a certain extent and improve the quality of the super-resolution reconstructed video.
FIG. 3 is a schematic flowchart of some embodiments of constructing the second reference feature map according to the present disclosure. Step 240 of the previous embodiment is described in detail below with reference to FIG. 3.
In step 241, the correlations between a pixel in the feature map of the target frame and multiple generic pixel features are determined.
In some embodiments, the generic pixel features are stored in the form of a two-dimensional matrix composed of multiple feature vectors; each feature vector stores the most representative generic pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the target-frame feature map, the correlation between the pixel and each feature vector in the matrix is computed.
In some embodiments, the above two-dimensional matrix is obtained through pre-training: during pre-training, the matrix gradually converges from initial random values to its final result, and during video super-resolution processing the matrix is fixed. By learning some image features of typical videos in the pre-training stage to form this matrix, the matrix can be consulted when super-resolution processing the current video, improving the quality of video super-resolution processing.
In step 242, the multiple generic pixel features are fused according to the correlations to obtain the second reference feature corresponding to the pixel in the target-frame feature map.
In some embodiments, the correlations obtained in step 241 are normalized; the normalized correlations are used as weights to perform a weighted average over the multiple generic pixel features, and the result of the weighted average is taken as the second reference feature corresponding to the pixel.
For example, a 256*128 two-dimensional matrix is used, composed of 256 feature vectors of 128 channels each; each feature vector stores the most representative generic pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the target-frame feature map, its correlations with the 256 feature vectors in the matrix are computed, yielding 256 correlation values. After these 256 correlation values are normalized, the normalized values are used as weights to perform a weighted average over the 256 feature vectors, and the result of the weighted average is taken as the second reference feature corresponding to the pixel.
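A sketch of this codebook-style attention follows, assuming the 256*128 matrix is held as a learnable parameter and the normalization is a softmax (the text says only "normalized", so softmax is an assumption; any normalization summing to one would fit).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenericFeatureBank(nn.Module):
    # A learned 256x128 matrix of generic pixel features, fixed at inference
    def __init__(self, n_entries: int = 256, feat_ch: int = 128):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(n_entries, feat_ch))

    def forward(self, target_feat):                            # (B, C, H, W)
        B, C, H, W = target_feat.shape
        pix = target_feat.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
        corr = pix @ self.bank.t()                             # (B*H*W, 256)
        weights = F.softmax(corr, dim=-1)                      # normalized correlations
        second = weights @ self.bank                           # weighted average
        return second.view(B, H, W, C).permute(0, 3, 1, 2)     # (B, C, H, W)
```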
In step 243, the second reference feature map is constructed according to the second reference features corresponding to the pixels in the target-frame feature map.
The pixel features in the second reference feature map are the second reference features corresponding to the pixels in the target-frame feature map.
In the embodiments of the present disclosure, the second reference feature map is well constructed through the above steps. When super-resolution reconstructing the target frame, introducing the first reference feature map and the second reference feature map constructed through the above steps means that the reconstruction fuses not only the image details of the adjacent frames but also the image details of generic pixels, which can increase image details to a certain extent and improve the quality of the super-resolution reconstructed video.
FIG. 4 compares the video frame processing results of an embodiment of the present disclosure with those of the related art. As shown in FIG. 4, the figure contains three rows and four columns of images: from left to right, the first column shows example video frames, and the second to fourth columns show enlarged images of the regions selected by the rectangular boxes in the example frames. The second column is the result of enlarging the low-resolution video frame by bicubic interpolation, the third column is the result of processing the low-resolution video frame with the video processing method of the embodiments of the present disclosure, and the last column is the result of processing the low-resolution video frame with a current industry-leading super-resolution method. In the first row, because optical-flow computation is inaccurate for repetitive local patterns (such as guitar strings), super-resolution with the industry-leading method tends to produce erroneous textures, whereas the present disclosure, which does not need to compute the pixel correspondence between video frames, recovers the repetitive textures well. In the second row, the image obtained with the industry-leading method is not sharp enough, because many image details have been permanently lost in the low-resolution video and useful information cannot be obtained from the other frames of this video alone; the present disclosure, having stored image detail information summarized from other videos during training, has a certain capability to add image detail. The third row shows an example of a sports video, in which objects undergo large displacements; current industry-leading methods generally cannot recover high-resolution details in such videos, whereas tests show that the present disclosure maintains good super-resolution performance in such large-displacement videos as well.
FIG. 5 is a schematic structural diagram of some embodiments of the video processing apparatus of the present disclosure. As shown in FIG. 5, the video processing apparatus in the embodiments of the present disclosure includes: a feature encoding module 510, a first construction module 520, and a reconstruction module 530.
The feature encoding module 510 is configured to perform feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
The target frame is a video frame to be super-resolution processed, and the adjacent frames are one or more video frames adjacent to it. For example, the 3 frames before and the 3 frames after the target frame may be taken as adjacent frames; or the 2 frames before the target frame; or the 4 frames after the target frame; and so on.
In some embodiments, the feature encoding module 510 uses a feature encoding network model to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames. The feature dimension of the encoded feature maps is larger than that of the original video frames. For example, in one specific example, a three-channel RGB video frame is input into the feature encoding network model to obtain a 128-channel feature map.
Exemplarily, the feature encoding network model is a network model composed of multiple layers of ResNet (residual network) modules, for example, 5 layers of ResNet modules. Those skilled in the art can understand that, without affecting the implementation of the present disclosure, the feature encoding network used in the present disclosure may adopt other network model structures besides a network composed of 5 layers of ResNet modules, such as an autoencoder or a Residual Dense Network (RDN).
In the embodiments of the present disclosure, performing feature encoding on the target frame and its adjacent frames through the feature encoding module can raise the feature dimension and extract richer image detail information, which in turn helps improve the effect of image super-resolution processing.
The first construction module 520 is configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and construct a first reference feature map according to that highest-correlation pixel.
In some embodiments, for each pixel in the feature map of the target frame, the first construction module 520 determines the correlations between the pixel and its neighborhood pixels in the feature maps of the adjacent frames and selects the pixel with the highest correlation from the neighborhood pixels according to the correlations. In these embodiments, by computing only the correlations between a pixel in the target frame and its neighborhood pixels in the adjacent frames, the first construction module 520 greatly reduces the amount of computation compared with other embodiments that compute the pixel correlations between the target frame and the adjacent frames one by one.
In some embodiments, for each pixel in the feature map of the target frame, the first construction module 520 determines the first reference feature corresponding to the pixel according to the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation, and constructs the first reference feature map according to the first reference features corresponding to the pixels of the target-frame feature map.
In some optional implementations, for each pixel in the feature map of the target frame, the first construction module 520 takes the product of the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation as the first reference feature corresponding to the pixel. In this way, the first reference feature corresponding to each pixel of the target-frame feature map can be obtained, and the first reference feature map can then be obtained. The pixel features in the first reference feature map are the first reference features corresponding to the pixels in the target-frame feature map.
Those skilled in the art can understand that taking the product of the highest-correlation pixel feature and the corresponding correlation as the first reference feature is only an example. Without affecting the implementation of the present invention, the first construction module 520 may also adopt other specific implementations that determine the first reference feature from the highest-correlation pixel feature and the corresponding correlation.
In the embodiments of the present disclosure, the first construction module computes the pixel correlations between the feature maps of the target frame and the adjacent frames and constructs the first reference feature map from the highest-correlation pixels selected from the adjacent-frame feature maps, so that the present disclosure does not need to explicitly compute the pixel correspondence between video frames based on methods such as optical flow. This solves the problems in the related art of poor computational accuracy, high demands on computing resources, and slow speed caused by relying on optical flow and similar methods to compute the image correspondence between video frames, and helps improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video.
The reconstruction module 530 is configured to perform super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
In some embodiments, the reconstruction module 530 inputs the feature map of the target frame and the first reference feature map into the feature decoding network model and takes the image output by the model as the super-resolution reconstructed image, i.e., the finally reconstructed high-resolution image corresponding to the target frame. The role of the decoding network model is to fuse the target-frame feature map with the first reference feature map and to remap the fused image from the high-dimensional feature space back to a low-dimensional feature space to obtain a high-resolution reconstructed image. For example, in some specific examples, the 128-channel target-frame feature map and the 128-channel first reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
Exemplarily, the feature decoding network model is a network model composed of multiple layers of ResNet modules and an upsampling module, for example, 40 layers of ResNet modules plus an upsampling module. Those skilled in the art can understand that, without affecting the implementation of the present disclosure, the feature decoding network used in the present disclosure may adopt other network model structures besides a network composed of 40 layers of ResNet modules and an upsampling module, such as an RDN.
In other embodiments, the reconstruction module 530 inputs the feature map of the target frame and the first reference feature map into the feature decoding network model to output a first reconstructed image, superimposes a feature map obtained by performing bilinear interpolation on the target-frame feature map onto the first reconstructed image to obtain a second reconstructed image, and takes the second reconstructed image as the finally reconstructed high-resolution image corresponding to the target frame. In these embodiments, after the high-resolution reconstructed image is obtained through the feature decoding network model, superimposing it with the bilinearly interpolated feature map of the target frame can further improve the resolution of the reconstructed image.
In the embodiments of the present disclosure, super-resolution processing of the target frame of the video is achieved through the above apparatus. Compared with the related art, this not only improves the efficiency of video super-resolution processing and reduces its demands on computing resources, but also improves the quality of the super-resolution reconstructed video.
FIG. 6 is a schematic structural diagram of other embodiments of the video processing apparatus of the present disclosure. As shown in FIG. 6, the video processing apparatus in the embodiments of the present disclosure includes: a feature encoding module 610, a first construction module 620, a second construction module 630, and a reconstruction module 640.
The feature encoding module 610 is configured to perform feature encoding on a target frame of a video and its adjacent frames to obtain feature maps of the target frame and its adjacent frames.
In this embodiment, the feature encoding module 610 uses a feature encoding network model to perform feature encoding on the target frame and its adjacent frames to obtain the feature map of the target frame and the feature maps of the adjacent frames. The feature dimension of the encoded feature maps is larger than that of the original video frames. For example, in some specific examples, a three-channel RGB video frame is input into the feature encoding network model to obtain a 128-channel feature map.
In this embodiment, performing feature encoding on the target frame and its adjacent frames through the feature encoding module 610 can raise the feature dimension and extract richer image detail information, which in turn helps improve the effect of image super-resolution processing.
The first construction module 620 is configured to, for each pixel in the feature map of the target frame, determine the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and construct a first reference feature map according to that highest-correlation pixel.
In some embodiments, for each pixel in the feature map of the target frame, the first construction module 620 determines the correlations between the pixel and its neighborhood pixels in the feature maps of the adjacent frames and selects the pixel with the highest correlation from the neighborhood pixels according to the correlations. By computing only the correlations between a pixel in the target frame and its neighborhood pixels in the adjacent frames, the amount of computation is greatly reduced compared with computing the pixel correlations between the target frame and the adjacent frames one by one.
In some embodiments, for each pixel in the feature map of the target frame, the first construction module 620 takes the product of the feature of the highest-correlation pixel selected from the adjacent-frame feature maps and the corresponding correlation as the first reference feature corresponding to the pixel. In this way, the first reference feature corresponding to each pixel of the target-frame feature map can be obtained, and the first reference feature map can then be obtained accordingly. The pixel features in the first reference feature map are the first reference features corresponding to the pixels in the target-frame feature map.
The second construction module 630 is configured to construct a second reference feature map, where the second reference feature map fuses generic pixel features.
The generic pixel features are image detail information learned from high-resolution videos. In some embodiments, the generic pixel features are stored in the form of a two-dimensional matrix. For example, a 256*128 matrix is used, composed of 256 feature vectors of 128 channels each; each feature vector stores the most representative generic pixel features learned from high-resolution videos during training. In these embodiments, the second reference feature map is constructed from this matrix.
In some embodiments, the second construction module 630 determines the correlations between a pixel in the feature map of the target frame and multiple generic pixel features, fuses the multiple generic pixel features according to the correlations to obtain the second reference feature corresponding to the pixel, and constructs the second reference feature map according to the second reference features corresponding to the pixels.
In some embodiments, the second construction module 630 normalizes the obtained correlations, uses the normalized correlations as weights to perform a weighted average over the multiple generic pixel features, and takes the result of the weighted average as the second reference feature corresponding to the pixel in the target-frame feature map. For example, a 256*128 two-dimensional matrix is used, composed of 256 feature vectors of 128 channels each; each feature vector stores the most representative generic pixel features learned during training from known high-resolution video images and their corresponding low-resolution images. For each pixel in the target-frame feature map, its correlations with the 256 feature vectors in the matrix are computed, yielding 256 correlation values. After these 256 correlation values are normalized, the normalized values are used as weights to perform a weighted average over the 256 feature vectors, and the result of the weighted average is taken as the second reference feature corresponding to the pixel.
The reconstruction module 640 is configured to perform super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map.
In some embodiments, the reconstruction module 640 inputs the feature map of the target frame, the first reference feature map, and the second reference feature map into the feature decoding network model and takes the image output by the model as the super-resolution reconstructed image. The role of the decoding network model is to fuse the target-frame feature map with the first and second reference feature maps and to remap the fused image from the high-dimensional feature space back to a low-dimensional feature space to obtain a high-resolution reconstructed image. For example, in some specific examples, the 128-channel target-frame feature map, the 128-channel first reference feature map, and the 128-channel second reference feature map are input into the feature decoding network model to obtain a high-resolution three-channel RGB image.
In other embodiments, the reconstruction module 640 inputs the feature map of the target frame, the first reference feature map, and the second reference feature map into the feature decoding network model to output a first reconstructed image, superimposes a feature map obtained by performing bilinear interpolation on the target-frame feature map onto the first reconstructed image to obtain a second reconstructed image, and takes the second reconstructed image as the final super-resolution reconstructed image. In these embodiments, after the high-resolution reconstructed image is obtained through the feature decoding network model, superimposing it with the bilinearly interpolated feature map of the target frame can further improve the image resolution.
In some embodiments, the following is also included: before super-resolution processing the current video with the video processing apparatus, each module of the video processing apparatus is trained. Overall, training can be carried out end-to-end, and the whole network is trained as follows: first, the second construction module is disconnected and only the other three modules are trained, with the loss function being the L1 distance (i.e., the Manhattan distance) between the output high-resolution image and the correctly annotated high-resolution sample image; next, the other three modules are fixed and only the two-dimensional matrix used by the second construction module is trained, with the loss function being the L1 distance between the input and the output of the network model used to train the matrix; then, all modules of the video processing apparatus are trained, with the loss function being the L1 distance between the output high-resolution image and the correctly annotated high-resolution sample image.
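This staged schedule could be sketched as follows; the optimizer choice, learning rate, epoch counts, and the helper forward_sr (standing in for the full encode-match-decode pipeline assembled from the earlier sketches) are all hypothetical.
```python
import torch

def l1(a, b):
    return (a - b).abs().mean()  # L1 (Manhattan) distance used as the loss

def staged_training(encoder, decoder, bank, forward_sr, loader, epochs=(1, 1, 1)):
    """Three-stage schedule described above (illustrative sketch only).

    forward_sr(encoder, decoder, bank_or_None, lr_clip) is a hypothetical
    helper producing the reconstructed high-resolution frame; loader yields
    (lr_clip, hr_frame) with lr_clip of shape (B, T, C, H, W).
    """
    # Stage 1: disconnect the second construction module, train the rest
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)
    for _ in range(epochs[0]):
        for lr_clip, hr_frame in loader:
            loss = l1(forward_sr(encoder, decoder, None, lr_clip), hr_frame)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: freeze the rest, train only the 2-D matrix; the loss is the
    # L1 distance between the bank module's input features and its output
    opt = torch.optim.Adam(bank.parameters(), lr=1e-4)
    for _ in range(epochs[1]):
        for lr_clip, _ in loader:
            feat = encoder(lr_clip[:, lr_clip.shape[1] // 2]).detach()  # center frame
            loss = l1(bank(feat), feat)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 3: end-to-end fine-tuning of all modules
    params = [*encoder.parameters(), *decoder.parameters(), *bank.parameters()]
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs[2]):
        for lr_clip, hr_frame in loader:
            loss = l1(forward_sr(encoder, decoder, bank, lr_clip), hr_frame)
            opt.zero_grad(); loss.backward(); opt.step()
```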
In the embodiments of the present disclosure, super-resolution processing of the target frame of the video is achieved through the above apparatus. Compared with the related art, by introducing a cross-frame attention mechanism when fusing the image details of adjacent frames to achieve pixel-level matching between the adjacent frames and the target frame, the present disclosure does not need to explicitly compute the pixel correspondence between video frames based on methods such as optical flow. This solves the problems in the related art of poor computational accuracy, high demands on computing resources, and slow speed caused by relying on optical flow and similar methods to compute the image correspondence between video frames, and helps improve the efficiency of video super-resolution processing, reduce its demands on computing resources, and improve the quality of the super-resolution reconstructed video. Further, in the embodiments of the present disclosure, performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map means that the reconstruction of the target frame fuses not only the image details of the adjacent frames but also the image details of generic pixels, which can increase image details to a certain extent and improve the quality of the super-resolution reconstructed video.
FIG. 8 is a schematic structural diagram of other embodiments of the video processing apparatus of the present disclosure. The apparatus includes a memory 810 and a processor 820. The memory 810 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and is used to store the instructions of the embodiments corresponding to FIGS. 1-3. The processor 820, coupled to the memory 810, may be implemented as one or more integrated circuits, such as a microprocessor or a microcontroller, and is used to execute the instructions stored in the memory.
In some embodiments, as also shown in FIG. 9, the apparatus 900 includes a memory 910 and a processor 920, with the processor 920 coupled to the memory 910 via a BUS 930. The apparatus 900 may further be connected to an external storage device 950 via a storage interface 940 to retrieve external data, and may be connected to a network or another computer system (not shown) via a network interface 960, which is not described in detail here.
In this embodiment, by storing data instructions in the memory and processing those instructions with the processor, video processing efficiency can be improved, the demands of video super-resolution processing on computing resources can be reduced, and the quality of the super-resolution reconstructed video can be improved.
In other embodiments, a computer-readable storage medium has computer program instructions stored thereon which, when executed by a processor, implement the steps of the methods in the embodiments corresponding to FIGS. 1-3. Those skilled in the art should understand that embodiments of the present disclosure may be provided as methods, apparatuses, or computer program products. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk memory, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The present disclosure has thus been described in detail. Some details well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. Based on the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed here.
Although some specific embodiments of the present disclosure have been described in detail by way of examples, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that the above embodiments may be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (18)

  1. A video processing method, comprising:
    performing feature encoding on a target frame of a video and adjacent frames thereof, to obtain feature maps of the target frame and the adjacent frames;
    for each pixel in the feature map of the target frame, determining a pixel in the feature maps of the adjacent frames having the highest correlation with said pixel;
    constructing a first reference feature map according to the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel; and
    performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
  2. The video processing method according to claim 1, wherein performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map comprises:
    performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, wherein the second reference feature map fuses generic pixel features.
  3. The video processing method according to claim 1 or 2, wherein determining the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel comprises:
    determining correlations between said pixel and its neighborhood pixels in the feature maps of the adjacent frames; and
    selecting, according to the correlations, the pixel with the highest correlation from the neighborhood pixels.
  4. The video processing method according to claim 1 or 2, wherein constructing a first reference feature map according to the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel comprises:
    determining a first reference feature corresponding to said pixel according to the feature of the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and the corresponding correlation; and
    constructing the first reference feature map according to the first reference feature corresponding to said pixel.
  5. The video processing method according to claim 2, further comprising:
    determining correlations between a pixel in the feature map of the target frame and a plurality of generic pixel features;
    fusing the plurality of generic pixel features according to the correlations, to obtain a second reference feature corresponding to said pixel; and
    constructing the second reference feature map according to the second reference feature corresponding to said pixel.
  6. The video processing method according to claim 5, wherein fusing the plurality of generic pixel features according to the correlations to obtain the second reference feature corresponding to said pixel comprises:
    normalizing the correlations; and
    using the normalized correlations as weights, performing a weighted average over the plurality of generic pixel features, and taking the result of the weighted average as the second reference feature corresponding to said pixel.
  7. The video processing method according to claim 2, wherein performing super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and the second reference feature map comprises:
    inputting the feature map of the target frame, the first reference feature map, and the second reference feature map into a feature decoding network model, to output a first reconstructed image.
  8. The video processing method according to claim 1, wherein performing super-resolution reconstruction according to the feature map of the target frame and the first reference feature map comprises:
    inputting the feature map of the target frame and the first reference feature map into a feature decoding network model, to output a first reconstructed image.
  9. The video processing method according to claim 7 or 8, wherein performing super-resolution reconstruction further comprises:
    superimposing the first reconstructed image and a feature map obtained by performing bilinear interpolation on the feature map of the target frame, to obtain a second reconstructed image.
  10. The video processing method according to claim 1, wherein the network model used for feature encoding comprises multiple layers of ResNet modules.
  11. The video processing method according to claim 7, wherein the feature decoding network model comprises multiple layers of ResNet modules and an upsampling module.
  12. The video processing method according to claim 4, wherein determining the first reference feature corresponding to said pixel according to the feature of the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and the corresponding correlation, comprises:
    taking the product of the feature of the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel and the corresponding correlation as the first reference feature corresponding to said pixel.
  13. A video processing apparatus, comprising:
    a feature encoding module configured to perform feature encoding on a target frame of a video and adjacent frames thereof, to obtain feature maps of the target frame and the adjacent frames;
    a first construction module configured to,
    for each pixel in the feature map of the target frame, determine a pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and
    construct a first reference feature map according to the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel; and
    a reconstruction module configured to perform super-resolution reconstruction according to the feature map of the target frame and the first reference feature map.
  14. The video processing apparatus according to claim 13, wherein the reconstruction module is configured to:
    perform super-resolution reconstruction according to the feature map of the target frame, the first reference feature map, and a second reference feature map, wherein the second reference feature map fuses generic pixel features.
  15. The video processing apparatus according to claim 13 or 14, wherein the first construction module is configured to:
    determine correlations between said pixel and its neighborhood pixels in the feature maps of the adjacent frames; and
    select, according to the correlations, the pixel with the highest correlation from the neighborhood pixels.
  16. The video processing apparatus according to claim 13 or 14, wherein the first construction module is configured to:
    determine a first reference feature corresponding to said pixel according to the feature of the pixel in the feature maps of the adjacent frames having the highest correlation with said pixel, and the corresponding correlation; and
    construct the first reference feature map according to the first reference feature corresponding to said pixel.
  17. A video processing apparatus, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute the video processing method according to any one of claims 1 to 12 based on instructions stored in the memory.
  18. A computer-readable storage medium having computer program instructions stored thereon, wherein the instructions, when executed by a processor, implement the video processing method according to any one of claims 1 to 12.
PCT/CN2023/075906 2022-03-31 2023-02-14 Video processing method and apparatus WO2023185284A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210334768.7 2022-03-31
CN202210334768.7A CN114648446A (zh) 2022-03-31 Video processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023185284A1 true WO2023185284A1 (zh) 2023-10-05

Family

ID=81995198

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075906 WO2023185284A1 (zh) 2022-03-31 2023-02-14 视频处理方法和装置

Country Status (2)

Country Link
CN (1) CN114648446A (zh)
WO (1) WO2023185284A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648446A (zh) 2022-03-31 2022-06-21 网银在线(北京)科技有限公司 Video processing method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583112A (zh) * 2020-04-29 2020-08-25 华南理工大学 Method, system, apparatus and storage medium for video super-resolution
CN112700392A (zh) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN113570606A (zh) * 2021-06-30 2021-10-29 北京百度网讯科技有限公司 Target segmentation method, apparatus and electronic device
CN114648446A (zh) * 2022-03-31 2022-06-21 网银在线(北京)科技有限公司 Video processing method and apparatus


Also Published As

Publication number Publication date
CN114648446A (zh) 2022-06-21

Similar Documents

Publication Publication Date Title
CN109903228B (zh) 一种基于卷积神经网络的图像超分辨率重建方法
Li et al. MDCN: Multi-scale dense cross network for image super-resolution
Isobe et al. Revisiting temporal modeling for video super-resolution
CN110120011B (zh) 一种基于卷积神经网络和混合分辨率的视频超分辨方法
CN111028150B (zh) 一种快速时空残差注意力视频超分辨率重建方法
CN110136062B (zh) 一种联合语义分割的超分辨率重建方法
CN111311490A (zh) 基于多帧融合光流的视频超分辨率重建方法
CN110163801B (zh) 一种图像超分辨和着色方法、系统及电子设备
CN111524068A (zh) 一种基于深度学习的变长输入超分辨率视频重建方法
CN111861961A (zh) 单幅图像超分辨率的多尺度残差融合模型及其复原方法
Zhang et al. NTIRE 2023 challenge on image super-resolution (x4): Methods and results
Li et al. Deep interleaved network for image super-resolution with asymmetric co-attention
CN109949217B (zh) 基于残差学习和隐式运动补偿的视频超分辨率重建方法
US20230153946A1 (en) System and Method for Image Super-Resolution
WO2023185284A1 (zh) 视频处理方法和装置
CN112017116A (zh) 基于非对称卷积的图像超分辨率重建网络及其构建方法
Liu et al. A deep recursive multi-scale feature fusion network for image super-resolution
Li Image super-resolution using attention based densenet with residual deconvolution
CN114332625A (zh) 基于神经网络的遥感图像彩色化和超分辨率方法及系统
CN112785502B (zh) 一种基于纹理迁移的混合相机的光场图像超分辨率方法
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
WO2024040973A1 (zh) 一种基于堆叠沙漏网络的多尺度融合去雾方法
Zhou et al. Deep fractal residual network for fast and accurate single image super resolution
CN117196948A (zh) 一种基于事件数据驱动的视频超分辨率方法
Li et al. Parallel-connected residual channel attention network for remote sensing image super-resolution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777663

Country of ref document: EP

Kind code of ref document: A1