WO2022179087A1 - Video processing method and apparatus - Google Patents

Video processing method and apparatus

Info

Publication number
WO2022179087A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
features
video processing
feature
video
Prior art date
Application number
PCT/CN2021/118552
Other languages
French (fr)
Chinese (zh)
Inventor
刘晶晶
徐宁
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110213511.1A (CN113034412B)
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022179087A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80 - Camera processing pipelines; Components thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10024 - Color image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Definitions

  • The present disclosure relates to the field of video technology, and more particularly, to a video processing method and apparatus.
  • The original video quality is often unsatisfactory to users.
  • The original video often has problems such as being too bright or too dark, or having insufficient color saturation.
  • With post-processing video quality enhancement technology, the original image quality can be greatly improved. Therefore, many users have a strong demand for video quality enhancement.
  • A video processing method is provided, comprising: acquiring two adjacent frames of a video as a first image and a second image, wherein the first image is located after the second image in the video; generating, based on the first image and the second image, a three-dimensional look-up table for each image region of the first image; and generating, based on the three-dimensional look-up table of each image region of the first image, a quality-enhanced image of the first image.
  • The step of generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image may include: obtaining global image features and local image features from the first image and the second image; and generating the three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
  • The step of obtaining global image features and local image features from the first image and the second image may include: performing convolutional feature extraction on the first image and the second image, respectively; performing convolutional feature fusion on the convolutional features of the first image and the convolutional features of the second image; and obtaining the global image features and the local image features based on the fused convolutional features.
  • The step of performing convolutional feature extraction on the first image and the second image may include: inputting the first image and the second image, respectively, into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
  • The step of performing convolutional feature fusion on the convolutional features of the first image and the second image may include: calculating a similarity between the convolutional features of the first image and the convolutional features of the second image; determining, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and performing convolutional feature fusion on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
  • The step of obtaining global image features and local image features based on the fused convolutional features may include: inputting the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
  • The step of generating a three-dimensional look-up table for each image region of the first image based on the global image features and the local image features may include: fusing the global image features and the local image features to obtain a fused feature vector for each image region; and generating the three-dimensional look-up table for each image region of the first image according to the fused feature vector of that region.
  • The step of generating the quality-enhanced image of the first image may include: converting the red, green, and blue primary color values of the first image based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image.
  • The step of acquiring two adjacent frames of the video as the first image and the second image may include: acquiring two adjacent frames of the video; performing low-resolution conversion on the acquired frames; and taking the two converted low-resolution images as the first image and the second image.
  • A video processing apparatus is provided, comprising: an image acquisition unit configured to acquire two adjacent frames of a video as a first image and a second image, wherein the first image is located after the second image in the video; a look-up table generation unit configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image; and an image quality enhancement unit configured to generate a quality-enhanced image of the first image based on the three-dimensional look-up table of each image region of the first image.
  • The look-up table generation unit may be configured to: obtain global image features and local image features from the first image and the second image; and generate the three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
  • The look-up table generation unit may be configured to: perform convolutional feature extraction on the first image and the second image, respectively; perform convolutional feature fusion on the convolutional features of the first image and the convolutional features of the second image; and obtain the global image features and the local image features based on the fused convolutional features.
  • The look-up table generation unit may be configured to: input the first image and the second image, respectively, into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
  • The look-up table generation unit may be configured to: calculate a similarity between the convolutional features of the first image and the convolutional features of the second image; determine, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and perform convolutional feature fusion on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
  • The look-up table generation unit may be configured to input the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
  • The look-up table generation unit may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate the three-dimensional look-up table for each image region of the first image according to the fused feature vector of that region.
  • The image quality enhancement unit may be configured to: convert the red, green, and blue primary color values of the first image based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image.
  • The image acquisition unit may be configured to: acquire two adjacent frames of the video; perform low-resolution conversion on the acquired frames; and take the two converted low-resolution images as the first image and the second image.
  • An electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the video processing method according to the exemplary embodiments of the present disclosure.
  • A non-volatile computer-readable storage medium is provided, having stored thereon a computer program that, when executed by a processor of an electronic device, causes the electronic device to perform the video processing method according to the exemplary embodiments of the present disclosure.
  • A computer program product is provided, including a computer program/instructions which, when executed by a processor, implement the video processing method according to the exemplary embodiments of the present disclosure.
  • FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
  • FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure.
  • FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.
  • FIG. 3 shows a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure.
  • FIG. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure.
  • In related video quality enhancement methods, deep neural networks can be used to learn, from massive video material, how to convert the original red, green, and blue (hereinafter, RGB) pixel values of a video into new RGB values, so as to achieve image quality enhancement.
  • Such big-data-driven methods not only make full use of a large number of professional videos, but can also adaptively adjust the image quality enhancement algorithm according to the user's current shooting content.
  • In addition, by fusing the enhancement parameters of the preceding and following frames, coherence of the image quality enhancement across the video is achieved.
  • The model used for RGB value conversion in the related art is a piecewise linear function, which converts the original values independently in the three color channels R, G, and B.
  • However, the mathematical expressive power of a piecewise linear function is inherently limited, and it cannot fully capture the mapping relationship between the original RGB values and the enhanced RGB values.
  • Furthermore, because the conversion is performed separately on the three color channels, the rich semantics carried by the correlations among the R, G, and B channels are ignored.
  • The scenes and contents of videos shot by users vary widely; in some scenarios, this type of method is prone to insufficient enhancement or distortion.
  • In addition, these methods rely only on local image features to drive the enhancement algorithm, ignoring the benefits that global features bring to image quality enhancement.
  • In other related methods, an adaptive three-dimensional look-up table (hereinafter, also referred to as 3D LUT) can be learned from massive image data to model the mapping relationship between original RGB values and enhanced RGB values.
  • Such a method does not require algorithm experts to manually set the various parameters of the 3D LUT.
  • At the same time, a 3D LUT with higher complexity can better adapt to a wide range of original image qualities.
  • However, the related art uses only the global visual features of an image to drive the enhancement algorithm when learning the 3D LUT, ignoring that different image parts require different enhancement methods. More critically, because the continuity of the enhancement cannot be guaranteed, such image quality enhancement techniques cannot be directly applied to video enhancement.
  • A video processing method and apparatus according to exemplary embodiments of the present disclosure will now be described in detail with reference to FIGS. 1 to 5.
  • FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
  • FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure.
  • FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.
  • a processor acquires images of two adjacent frames of a video as a first image and a second image.
  • the first image comes after the second image in the video.
  • When two adjacent frames of the video are acquired as the first image and the second image, two adjacent frames (for example, but not limited to, the t-th frame image and the (t-1)-th frame image) may be acquired first; low-resolution conversion is then performed on the two acquired frames, and the two converted low-resolution images (for example, but not limited to, the t-th frame low-resolution image and the (t-1)-th frame low-resolution image in FIG. 2A) are taken as the first image and the second image.
  • the low resolution may be, for example, but not limited to, 256x256.
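As a rough illustration of this acquisition step, the sketch below downscales two adjacent frames before feature extraction. It is a minimal sketch assuming PyTorch tensors in (N, C, H, W) layout; the 256x256 target size and bilinear resampling are illustrative choices, not requirements of the disclosure.

```python
import torch
import torch.nn.functional as F

def to_low_resolution(frame_t: torch.Tensor, frame_prev: torch.Tensor, size=(256, 256)):
    """Downscale frame t and frame t-1 (N, 3, H, W) to a common low resolution."""
    first_image = F.interpolate(frame_t, size=size, mode="bilinear", align_corners=False)
    second_image = F.interpolate(frame_prev, size=size, mode="bilinear", align_corners=False)
    return first_image, second_image
```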
  • In step S102, a three-dimensional look-up table for each image region of the first image is generated by a processor (e.g., a processor for look-up table generation) based on the first image and the second image.
  • Compared with the related art, in the exemplary embodiments of the present disclosure, the adaptive 3D LUT not only uses the global visual features of the image to drive the enhancement algorithm, but also takes into account the local visual and semantic differences within each image.
  • In other words, different image parts can use different enhancement methods. For example, by dividing the image into a grid structure (for example, but not limited to, the grid structure in FIG. 2B), where each image part in the grid corresponds to its own enhancement parameters, local adaptivity of image quality enhancement can be obtained, as sketched below.
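A minimal sketch of the grid idea follows: the image plane is split into an n x n grid, and every pixel is mapped to the index of the cell whose enhancement parameters (3D LUT) it will use. The 16x16 grid size and the helper name `region_index_map` are assumptions made for illustration only.

```python
import torch

def region_index_map(height: int, width: int, grid: int = 16) -> torch.Tensor:
    """Map every pixel (y, x) to the index of its grid cell, row-major in [0, grid * grid)."""
    ys = torch.arange(height)
    xs = torch.arange(width)
    row = torch.clamp(ys * grid // height, max=grid - 1)   # cell row per image row, (H,)
    col = torch.clamp(xs * grid // width, max=grid - 1)    # cell column per image column, (W,)
    return row[:, None] * grid + col[None, :]              # (H, W) tensor of cell indices
```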
  • When generating the three-dimensional look-up table for each image region of the first image, global image features and local image features may first be obtained from the first image and the second image, and the three-dimensional look-up table for each image region of the first image may then be generated based on the global image features and the local image features.
  • Unlike single-image enhancement, video enhancement also pursues the stability of image quality across consecutive frames.
  • When obtaining the global image features and local image features from the first image and the second image, convolutional feature extraction may first be performed on the first image and the second image respectively; convolutional feature fusion is then performed on the convolutional features of the first image and the convolutional features of the second image; and the global image features and local image features are obtained based on the fused convolutional features. That is, by fusing the visual features of the preceding and following frames, continuity of the enhancement in the temporal domain can be obtained.
  • When convolutional feature extraction is performed on the first image and the second image respectively, the first image and the second image may each be input into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
  • Here, a self-attention module is embedded in the convolutional layers of the convolutional neural network.
  • A convolutional neural network with self-attention modules embedded in its convolutional layers can capture the positional and semantic relationships between image parts. The advantages are: 1) the positional relationships yield spatial smoothness of the image quality enhancement; 2) the semantic relationships improve the correlation of local enhancement effects. For example, when both blue sky and grass appear in an image, the enhancement can be automatically adjusted to handle such a combination.
  • In general, while preserving the local adaptivity of image quality enhancement, the self-attention module adds constraints on the spatial smoothness and the semantic relevance of the enhancement. One plausible form of such a block is sketched below.
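The disclosure does not specify the network architecture, so the block below is only one plausible way to embed a self-attention module behind a convolutional layer (a non-local-style attention over spatial positions). All layer sizes, the class name, and the residual weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvWithSelfAttention(nn.Module):
    """Convolution followed by spatial self-attention, so that features carry
    positional and semantic relations between image parts."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.query = nn.Conv2d(out_ch, out_ch // 8, kernel_size=1)
        self.key = nn.Conv2d(out_ch, out_ch // 8, kernel_size=1)
        self.value = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned weight of the attention branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = torch.relu(self.conv(x))                              # (N, C, H, W)
        n, c, h, w = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)              # (N, HW, C/8)
        k = self.key(feat).flatten(2)                                # (N, C/8, HW)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)   # (N, HW, HW)
        v = self.value(feat).flatten(2)                              # (N, C, HW)
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)            # attend over positions
        return feat + self.gamma * out                               # residual connection
```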
  • When convolutional feature fusion is performed on the convolutional features of the first image and the second image, the similarity between the convolutional features of the first image and the convolutional features of the second image may first be calculated; then, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image are determined respectively; and convolutional feature fusion is performed on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
  • Each pixel of the convolutional feature map can be regarded as a visual feature vector of the corresponding image part.
  • The visual similarity of the preceding and following video frames may be measured by calculating the cosine similarity between the two feature vectors. Since the feature vectors correspond to different image regions, each element of the similarity matrix actually represents the visual similarity of the corresponding positions in the two images. For image regions with high similarity, the convolutional features of frame t-1 receive a higher fusion weight; for regions with low similarity, the convolutional features of frame t are used as much as possible. When the local scene changes little, convolutional feature fusion helps maintain the coherence of the enhancement across the video; and when the local scene undergoes significant visual changes, the image quality enhancement strategy is automatically adjusted to cope with them. A sketch of this fusion rule follows.
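Below is a sketch of this fusion rule, assuming the two feature maps are already spatially aligned tensors of shape (N, C, H, W). Mapping the per-position cosine similarity into [0, 1] and using it directly as the weight of the previous frame's features is one simple choice; the disclosure does not fix the exact weighting function.

```python
import torch
import torch.nn.functional as F

def fuse_adjacent_features(feat_t: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
    """Fuse frame-t and frame-(t-1) convolutional features by per-position cosine similarity.

    Similar regions lean on the previous frame (temporal coherence); dissimilar
    regions fall back to the current frame (local scene change)."""
    sim = F.cosine_similarity(feat_t, feat_prev, dim=1)   # (N, H, W), values in [-1, 1]
    w_prev = ((sim + 1.0) / 2.0).unsqueeze(1)             # (N, 1, H, W), values in [0, 1]
    return w_prev * feat_prev + (1.0 - w_prev) * feat_t
```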
  • The global image features and local image features can be obtained by inputting the fused convolutional features into a fully connected layer and a convolutional layer, respectively, thereby improving the accuracy of feature extraction.
  • When generating the three-dimensional look-up table for each image region of the first image, the global image features and local image features may first be fused to obtain a fused feature vector for each image region; the three-dimensional look-up table of each image region of the first image is then generated according to the fused feature vector of that region, thereby improving the accuracy of the three-dimensional look-up table.
  • FIG. 3 shows a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure.
  • As shown in FIG. 3, the fully connected layer and the convolutional layer can be used to obtain the global image features and the local image features respectively, after which the global feature vector is accumulated element-wise with each local feature vector.
  • The global image features capture the overall visual characteristics of the image, such as brightness, saturation, and scene.
  • The local image features can be used to fine-tune the global features according to the local semantic information and spatial location within the image.
  • The fused features thus obtained have both global and local adaptivity, so that diverse user videos can be handled and a better image quality enhancement effect can be obtained. A sketch of this global/local feature head follows.
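Continuing the sketch under the same assumptions, the fused feature map is pooled and passed through a fully connected layer for the global features and through a convolution for the local features, and the global vector is then broadcast-added to every local feature vector. Channel sizes and the grid resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalHead(nn.Module):
    """Produce a global feature vector and a grid of local feature vectors,
    then fuse them by element-wise accumulation."""

    def __init__(self, in_ch: int = 128, feat_dim: int = 64, grid: int = 16):
        super().__init__()
        self.fc_global = nn.Linear(in_ch, feat_dim)      # global branch (fully connected)
        self.conv_local = nn.Conv2d(in_ch, feat_dim, 1)  # local branch (convolutional)
        self.grid = grid

    def forward(self, fused_feat: torch.Tensor) -> torch.Tensor:
        g = self.fc_global(F.adaptive_avg_pool2d(fused_feat, 1).flatten(1))  # (N, D)
        l = F.adaptive_avg_pool2d(self.conv_local(fused_feat), self.grid)    # (N, D, g, g)
        return l + g[:, :, None, None]   # one fused feature vector per image region
```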
  • Each image region of the first image thus yields a feature vector, which is used to drive the generation of its three-dimensional look-up table.
  • Different 3D look-up tables mean different image quality enhancement effects. One possible mapping from a region's feature vector to its 3D LUT is sketched below.
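The disclosure does not spell out how a region's feature vector is turned into its 3D LUT. One common design, shown here purely as an illustrative assumption, is to keep a small set of learnable basis LUTs and let each region's feature vector predict blending weights over them; the class name, the number of basis LUTs, and the 33-point LUT resolution are all hypothetical.

```python
import torch
import torch.nn as nn

class RegionLUTGenerator(nn.Module):
    """Predict a per-region 3D LUT as a weighted combination of learnable basis LUTs."""

    def __init__(self, feat_dim: int = 64, n_basis: int = 3, lut_size: int = 33):
        super().__init__()
        # Basis LUTs: (n_basis, 3 output channels, S, S, S) sampled over the R/G/B cube.
        self.basis = nn.Parameter(torch.randn(n_basis, 3, lut_size, lut_size, lut_size) * 0.01)
        self.weights = nn.Linear(feat_dim, n_basis)

    def forward(self, region_feat: torch.Tensor) -> torch.Tensor:
        # region_feat: (N, D, g, g) -> per-region blending weights (N, g, g, n_basis)
        w = self.weights(region_feat.permute(0, 2, 3, 1))
        # Blend the basis LUTs for every region: (N, g, g, 3, S, S, S)
        return torch.einsum("nhwk,kcxyz->nhwcxyz", w, self.basis)
```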
  • A quality-enhanced image of the first image is then generated by a processor (e.g., a processor for image quality enhancement) based on the three-dimensional look-up table of each image region of the first image.
  • Specifically, the red, green, and blue primary color values of the first image may be converted based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image. That is, based on the three-dimensional look-up table corresponding to each pixel position, as shown in FIG. 2A, the RGB values in frame t of the original input video are converted to generate an enhanced image of frame t. A sketch of such a conversion follows.
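As an illustration of the conversion step, the sketch below applies a single 3D LUT to an RGB image with trilinear interpolation via grid_sample; in the method described above, each pixel would instead use the LUT of its own image region (for example, selected with a region index map such as the one sketched earlier). The function name and the assumption that RGB values are normalized to [0, 1] are illustrative.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """Convert RGB values with a 3D LUT using trilinear interpolation.

    image: (N, 3, H, W), values in [0, 1]; lut: (3, S, S, S), indexed as [out_channel, b, g, r]."""
    n, _, h, w = image.shape
    r, g, b = image[:, 0], image[:, 1], image[:, 2]             # each (N, H, W)
    # grid_sample expects coordinates in [-1, 1], ordered (x, y, z) = (r, g, b).
    grid = torch.stack([r, g, b], dim=-1).view(n, 1, h, w, 3) * 2.0 - 1.0
    lut_batch = lut.unsqueeze(0).expand(n, -1, -1, -1, -1)      # (N, 3, S, S, S)
    out = F.grid_sample(lut_batch, grid, mode="bilinear", align_corners=True)
    return out.view(n, 3, h, w)                                  # quality-enhanced RGB values
```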
  • In the example of FIG. 2A, the t-th frame image and the (t-1)-th frame image of the video are first obtained, and each is converted into a low-resolution image.
  • The two low-resolution images are then input into a convolutional neural network (CNN) capable of capturing the positional and semantic relationships between image parts.
  • Convolutional feature extraction is performed by the convolutional neural network (CNN), yielding the convolutional features of the t-th frame image and the convolutional features of the (t-1)-th frame image.
  • The video processing method according to the exemplary embodiments of the present disclosure has been described above with reference to FIGS. 1 to 3.
  • A video processing apparatus and its units according to an exemplary embodiment of the present disclosure will now be described with reference to FIG. 4.
  • FIG. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
  • The video processing apparatus includes an image acquisition unit 41, a look-up table generation unit 42, and an image quality enhancement unit 43.
  • the image acquisition unit 41 is configured to acquire images of two adjacent frames of the video as the first image and the second image.
  • the first image comes after the second image in the video.
  • The image acquisition unit 41 may be configured to: acquire two adjacent frames of the video; perform low-resolution conversion on the acquired frames; and take the two converted low-resolution images as the first image and the second image.
  • The look-up table generation unit 42 is configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image.
  • The look-up table generation unit 42 may be configured to: obtain global image features and local image features from the first image and the second image; and generate the three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
  • The look-up table generation unit 42 may be configured to: perform convolutional feature extraction on the first image and the second image, respectively; perform convolutional feature fusion on the convolutional features of the first image and the second image; and obtain the global image features and the local image features based on the fused convolutional features.
  • The look-up table generation unit 42 may be configured to: input the first image and the second image, respectively, into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
  • The look-up table generation unit 42 may be configured to: calculate the similarity between the convolutional features of the first image and the convolutional features of the second image; determine, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and perform convolutional feature fusion on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
  • The look-up table generation unit 42 may be configured to input the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
  • The look-up table generation unit 42 may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate the three-dimensional look-up table for each image region of the first image according to the fused feature vector of that region.
  • The image quality enhancement unit 43 is configured to generate a quality-enhanced image of the first image based on the three-dimensional look-up table of each image region of the first image.
  • The image quality enhancement unit 43 may be configured to: convert the red, green, and blue primary color values of the first image based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image.
  • the video processing apparatus has been described above with reference to FIG. 4 .
  • an electronic device according to an exemplary embodiment of the present disclosure will be described with reference to FIG. 5 .
  • FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure.
  • the electronic device 500 includes at least one memory 501 and at least one processor 502.
  • the at least one memory 501 stores a computer-executable instruction set.
  • When the computer-executable instruction set is executed by the at least one processor 502, the video processing method according to the exemplary embodiments of the present disclosure is performed.
  • the electronic device 500 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above set of instructions.
  • the electronic device 500 is not necessarily a single electronic device, but can also be a collection of any device or circuit capable of individually or jointly executing the above-mentioned instructions (or instruction sets).
  • Electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
  • processor 502 may include a central processing unit (CPU), graphics processing unit (GPU), programmable logic device, special purpose processor system, microcontroller, or microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • Processor 502 may execute instructions or code stored in memory 501, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
  • the memory 501 may be integrated with the processor 502, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Furthermore, memory 501 may comprise a separate device such as an external disk drive, storage array, or any other storage device that may be used by a database system.
  • the memory 501 and the processor 502 may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor 502 to read files stored in the memory.
  • the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of electronic device 500 may be connected to each other via a bus and/or network.
  • A computer-readable storage medium including instructions is also provided, for example the memory 501 including instructions; the instructions can be executed by the processor 502 of the electronic device 500 to complete the above method.
  • the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • a computer program product comprising computer programs/instructions which, when executed by a processor, implement the exemplary embodiments according to the present disclosure method of video processing.
  • The video processing apparatus and its units shown in FIG. 4 may each be configured as software, hardware, firmware, or any combination thereof to perform specific functions.
  • The electronic device shown in FIG. 5 is not limited to the components shown above; components may be added or removed as needed, and the above components may also be combined.
  • By acquiring two adjacent frames of a video as a first image and a second image, generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image, and generating a quality-enhanced image of the first image based on the three-dimensional look-up table of each image region, video image quality enhancement is completed automatically with one click and without user participation.
  • In addition, global and local adaptivity is obtained through global/local feature fusion, so that diverse user videos can be handled and a better image quality enhancement effect can be obtained.

Abstract

A video processing method and an apparatus. The video processing method comprises: obtaining two adjacent image frames of a video to serve as a first image and a second image, wherein the first image is situated after the second image in the video; generating a three-dimensional lookup table of each image region of the first image on the basis of the first image and the second image; and generating an enhanced quality image of the first image on the basis of the three-dimensional lookup table of each image region of the first image.

Description

Video processing method and apparatus
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure is based on a Chinese patent application with application number 202110213511.1, filed on February 25, 2021, and claims the priority of that Chinese patent application, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of video technology, and more particularly, to a video processing method and apparatus.
BACKGROUND
With the popularization of smartphones, more and more users use their personal mobile phones to shoot videos to record daily moments in life. Limited by the camera hardware of the mobile phone, the shooting environment, and the user's shooting skills, the original video quality is often unsatisfactory to users; for example, the original video is often too bright or too dark, or lacks color saturation. With post-processing video quality enhancement technology, the original image quality can be greatly improved. Therefore, many users have a strong demand for video quality enhancement.
SUMMARY OF THE INVENTION
According to an exemplary embodiment of the present disclosure, a video processing method is provided, comprising: acquiring two adjacent frames of a video as a first image and a second image, wherein the first image is located after the second image in the video; generating, based on the first image and the second image, a three-dimensional look-up table for each image region of the first image; and generating, based on the three-dimensional look-up table of each image region of the first image, a quality-enhanced image of the first image.
In some embodiments, the step of generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image may include: obtaining global image features and local image features from the first image and the second image; and generating the three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
In some embodiments, the step of obtaining global image features and local image features from the first image and the second image may include: performing convolutional feature extraction on the first image and the second image, respectively; performing convolutional feature fusion on the convolutional features of the first image and the convolutional features of the second image; and obtaining the global image features and the local image features based on the fused convolutional features.
In some embodiments, the step of performing convolutional feature extraction on the first image and the second image may include: inputting the first image and the second image, respectively, into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
In some embodiments, the step of performing convolutional feature fusion on the convolutional features of the first image and the second image may include: calculating a similarity between the convolutional features of the first image and the convolutional features of the second image; determining, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and performing convolutional feature fusion on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
In some embodiments, the step of obtaining global image features and local image features based on the fused convolutional features may include: inputting the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
In some embodiments, the step of generating a three-dimensional look-up table for each image region of the first image based on the global image features and the local image features may include: fusing the global image features and the local image features to obtain a fused feature vector for each image region; and generating the three-dimensional look-up table for each image region of the first image according to the fused feature vector of that region.
In some embodiments, the step of generating the quality-enhanced image of the first image may include: converting the red, green, and blue primary color values of the first image based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image.
In some embodiments, the step of acquiring two adjacent frames of the video as the first image and the second image may include: acquiring two adjacent frames of the video; performing low-resolution conversion on the acquired frames; and taking the two converted low-resolution images as the first image and the second image.
According to an exemplary embodiment of the present disclosure, a video processing apparatus is provided, comprising: an image acquisition unit configured to acquire two adjacent frames of a video as a first image and a second image, wherein the first image is located after the second image in the video; a look-up table generation unit configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image; and an image quality enhancement unit configured to generate a quality-enhanced image of the first image based on the three-dimensional look-up table of each image region of the first image.
In some embodiments, the look-up table generation unit may be configured to: obtain global image features and local image features from the first image and the second image; and generate the three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
In some embodiments, the look-up table generation unit may be configured to: perform convolutional feature extraction on the first image and the second image, respectively; perform convolutional feature fusion on the convolutional features of the first image and the convolutional features of the second image; and obtain the global image features and the local image features based on the fused convolutional features.
In some embodiments, the look-up table generation unit may be configured to: input the first image and the second image, respectively, into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between image parts.
In some embodiments, the look-up table generation unit may be configured to: calculate a similarity between the convolutional features of the first image and the convolutional features of the second image; determine, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and perform convolutional feature fusion on the two sets of convolutional features based on the first fusion weight and the second fusion weight.
In some embodiments, the look-up table generation unit may be configured to input the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
In some embodiments, the look-up table generation unit may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate the three-dimensional look-up table for each image region of the first image according to the fused feature vector of that region.
In some embodiments, the image quality enhancement unit may be configured to: convert the red, green, and blue primary color values of the first image based on the three-dimensional look-up table of each image region of the first image, to obtain the quality-enhanced image of the first image.
In some embodiments, the image acquisition unit may be configured to: acquire two adjacent frames of the video; perform low-resolution conversion on the acquired frames; and take the two converted low-resolution images as the first image and the second image.
According to an exemplary embodiment of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the video processing method according to the exemplary embodiments of the present disclosure.
According to an exemplary embodiment of the present disclosure, a non-volatile computer-readable storage medium is provided, having stored thereon a computer program that, when executed by a processor of an electronic device, causes the electronic device to perform the video processing method according to the exemplary embodiments of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, including a computer program/instructions which, when executed by a processor, implement the video processing method according to the exemplary embodiments of the present disclosure.
In the technical solutions provided by the embodiments of the present disclosure: 1) video image quality enhancement is completed automatically with one click, without user participation; 2) diverse user videos can be handled, with a good enhancement effect and temporal coherence.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.
FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure.
FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.
FIG. 3 shows a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure.
FIG. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure.
DETAILED DESCRIPTION
In order to enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following examples do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
It should be noted that, in the present disclosure, "at least one of several items" covers three parallel cases: "any one of the items", "any combination of several of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Most existing video image quality enhancement methods require users to manually set various picture parameters. This not only requires the user to have a certain degree of photography knowledge; manually adjusting image quality parameters also often costs the user considerable effort and time. In addition, it is difficult for manually enhanced video quality to reach a professional level. Therefore, an automatic, adaptive video image quality enhancement method is needed to automatically improve the original video quality so that it is visually more aesthetically pleasing.
In related video quality enhancement methods, deep neural networks can be used to learn, from massive video material, how to convert the original red, green, and blue (hereinafter, RGB) pixel values of a video into new RGB values, so as to achieve image quality enhancement. Such big-data-driven methods not only make full use of a large number of professional videos, but can also adaptively adjust the image quality enhancement algorithm according to the user's current shooting content. In addition, by fusing the enhancement parameters of the preceding and following frames, coherence of the image quality enhancement across the video is achieved.
The model used for RGB value conversion in the related art is a piecewise linear function, which converts the original values independently in the three color channels R, G, and B. However, the mathematical expressive power of a piecewise linear function is inherently limited, and it cannot fully capture the mapping relationship between the original RGB values and the enhanced RGB values. Furthermore, because the conversion is performed separately on the three color channels, the rich semantics carried by the correlations among the R, G, and B channels are ignored. The scenes and contents of videos shot by users vary widely; in some scenarios, this type of method is prone to insufficient enhancement or distortion. In addition, such methods rely only on local image features to drive the enhancement algorithm, ignoring the benefits that global features bring to image quality enhancement.
Most current leading image quality enhancement methods focus on single-image enhancement rather than video enhancement. In related video quality enhancement methods, deep neural networks can also be used to learn, from massive image data, an adaptive three-dimensional look-up table (hereinafter, also referred to as 3D LUT) that models the mapping relationship between original RGB values and enhanced RGB values. "Adaptive" means that the 3D LUT can be automatically adjusted according to the visual characteristics of the image. Compared with traditional 3D LUT image enhancement technology, this kind of method does not require algorithm experts to manually set the various parameters of the 3D LUT. At the same time, a 3D LUT with higher complexity can better adapt to a wide range of original image qualities. However, the related art uses only the global visual features of an image to drive the enhancement algorithm when learning the 3D LUT, ignoring that different image parts require different enhancement methods. More critically, because the continuity of the enhancement cannot be guaranteed, such image quality enhancement techniques cannot be directly applied to video enhancement.
下面,将参照图1至图5具体描述根据本公开的示例性实施例的视频处理方法及装置。Hereinafter, a video processing method and apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to FIGS. 1 to 5 .
图1示出根据本公开的示例性实施例的视频处理方法的流程图。图2A示出根据本公开的示例性实施例的进行视频处理的示意图。图2B示出根据本公开的示例性实施例的网格结构的示例。FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure. FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure. FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.
参照图1,在步骤S101,由处理器(例如,用于图像获取的处理器)获取视频的相邻两帧图像作为第一图像和第二图像。这里,在视频中第一图像位于第二图像之后。Referring to FIG. 1, in step S101, a processor (eg, a processor for image acquisition) acquires images of two adjacent frames of a video as a first image and a second image. Here, the first image comes after the second image in the video.
在本公开的示例性实施例中,在获取视频的相邻两帧图像作为第一图像和第二图像时,可首先获取视频的相邻两帧图像(例如,但不限于,第t帧图像和第t-1帧图像),然后对获取的相邻两帧图像进行低分辨率转换,并将转换后的两帧低分辨率图像(例如,但不限于,图2A中的第t帧低分辨率图像和第t-1帧低分辨率图像)作为第一图像和第二图像。这里,低分辨率可以是例如,但不限于,256x256。In an exemplary embodiment of the present disclosure, when two adjacent frames of images of a video are acquired as the first image and the second image, two adjacent frames of images of the video (for example, but not limited to, the t-th frame of images) may be acquired first. and the t-1th frame image), then low-resolution conversion is performed on the acquired adjacent two frames of images, and the converted two low-resolution images (for example, but not limited to, the t-th frame in Figure 2A is low-resolution high-resolution image and the t-1th frame low-resolution image) as the first image and the second image. Here, the low resolution may be, for example, but not limited to, 256x256.
在步骤S102,由处理器(例如,用于查询表生成的处理器)基于第一图像和第二图像生成第一图像的每个图像区域的三维查询表。At step S102, a three-dimensional look-up table for each image region of the first image is generated by a processor (eg, a processor for look-up table generation) based on the first image and the second image.
与相关技术相比,在本公开的示例性实施例中,自适应3D LUT不仅使用图像全局视觉特征来驱动增强算法,也可考虑各个图像局部的视觉和语义差异。换句话说,不同的图像局部可采用不一样的增强方法。例如,可通过将图像划分为网格结构(例如,但不限于图2B中的网格结构),网格中每一个图像局部对应不同的增强参数,可以获得画质增强在局部的自适应性。Compared with the related art, in the exemplary embodiment of the present disclosure, the adaptive 3D LUT not only uses the global visual features of the image to drive the enhancement algorithm, but can also take into account the local visual and semantic differences of each image. In other words, different image parts can use different enhancement methods. For example, by dividing the image into a grid structure (for example, but not limited to the grid structure in FIG. 2B ), each image in the grid locally corresponds to different enhancement parameters, so that the local self-adaptability of image quality enhancement can be obtained. .
In an exemplary embodiment of the present disclosure, when the three-dimensional lookup table of each image region of the first image is generated based on the first image and the second image, global image features and local image features may first be obtained from the first image and the second image, and the three-dimensional lookup table of each image region of the first image may then be generated based on the global image features and the local image features.
Unlike single-image enhancement, video enhancement also pursues stability of the image quality across consecutive frames. In an exemplary embodiment of the present disclosure, when the global image features and the local image features are obtained from the first image and the second image, convolutional features may first be extracted from the first image and the second image separately, the convolutional features of the first image and the convolutional features of the second image may then be fused, and the global image features and the local image features may be obtained based on the fused convolutional features. That is, by fusing the visual features of the preceding and current frames, temporal continuity of the enhancement can be obtained.
In an exemplary embodiment of the present disclosure, when convolutional features are extracted from the first image and the second image separately, the first image and the second image may each be input into a convolutional neural network whose convolutional layers embed a self-attention module, so as to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between local regions of the image. A convolutional neural network with self-attention modules embedded in its convolutional layers can capture the positional and semantic relationships between image regions. The benefits are: 1) the positional relationships give the quality enhancement spatial smoothness over the image; and 2) the semantic relationships improve the coherence of the local enhancement effects. For example, when blue sky and grass appear in an image at the same time, the enhancement can be adjusted automatically to handle such a combination. In summary, while preserving the local adaptivity of the quality enhancement, the self-attention module adds constraints on the spatial smoothness of the image and on the semantic coherence of the enhancement.
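The following sketch illustrates, under stated assumptions, one way a self-attention module could be embedded in a convolutional stage; the layer sizes, the single-head non-local formulation, and the placement of the attention module after the convolution are illustrative choices and are not prescribed by the disclosure.

```python
# A minimal sketch of a convolutional block with an embedded self-attention
# module. It assumes a low-resolution input so the (HW x HW) attention map
# stays small.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c/8)
        k = self.key(x).flatten(2)                      # (b, c/8, hw)
        v = self.value(x).flatten(2).transpose(1, 2)    # (b, hw, c)
        # Pairwise relations between all image positions.
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (b, hw, hw)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out  # residual connection keeps the local convolutional features

class ConvAttentionBlock(nn.Module):
    """One convolutional stage with an embedded self-attention module."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.attention = SelfAttention2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.attention(self.conv(x))
```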
In an exemplary embodiment of the present disclosure, when the convolutional features of the first image and the convolutional features of the second image are fused, the similarity between the convolutional features of the first image and the convolutional features of the second image may first be computed; a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image may then be determined based on the similarity; and the convolutional features of the first image and the convolutional features of the second image may then be fused based on the first fusion weight and the second fusion weight. Each pixel of a convolutional feature map can be regarded as a visual feature vector of the corresponding local image region. In an exemplary embodiment of the present disclosure, the visual similarity between the preceding and current video frames may be measured by computing the cosine similarity between the two feature vectors. Since the feature vectors correspond to different image regions, each element of the similarity matrix in fact represents the visual similarity of the corresponding positions in the two images. For image regions with high similarity, the convolutional features of frame t-1 receive a higher fusion weight; for regions with low similarity, the convolutional features of frame t are used as much as possible. When the local scene changes little, the convolutional feature fusion helps maintain the coherence of the enhancement over the video; when the local scene undergoes a significant visual change, the enhancement strategy can be adjusted automatically to cope with it.
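A minimal sketch of this similarity-weighted fusion is given below; the exact mapping from cosine similarity to fusion weight (here, clamping negative similarities to zero) is an assumption for illustration.

```python
# Similarity-weighted fusion of the two feature maps. Each spatial position is
# treated as a feature vector; its cosine similarity across frames decides how
# much of the previous frame's feature is kept at that position.
import torch
import torch.nn.functional as F

def fuse_features(feat_t: torch.Tensor, feat_t_minus_1: torch.Tensor) -> torch.Tensor:
    """feat_* have shape (B, C, H, W); positions are compared channel-wise."""
    # Cosine similarity per spatial position, in [-1, 1]; shape (B, H, W).
    sim = F.cosine_similarity(feat_t, feat_t_minus_1, dim=1)
    # Map to a fusion weight in [0, 1]: high similarity -> rely more on frame t-1.
    w_prev = sim.clamp(min=0.0).unsqueeze(1)   # second fusion weight (frame t-1)
    w_curr = 1.0 - w_prev                      # first fusion weight (frame t)
    return w_curr * feat_t + w_prev * feat_t_minus_1
```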
In an exemplary embodiment of the present disclosure, when the global image features and the local image features are obtained based on the fused convolutional features, the fused convolutional features may be input into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features, thereby improving the accuracy of feature extraction.
In an exemplary embodiment of the present disclosure, when the three-dimensional lookup table of each image region of the first image is generated based on the global image features and the local image features, the global image features and the local image features may first be fused to obtain a fused feature vector for each image region, and the three-dimensional lookup table of each image region of the first image may then be generated from the fused feature vector of each image region, thereby improving the accuracy of the three-dimensional lookup table. FIG. 3 is a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, based on the fused convolutional features, a fully connected layer and a convolutional layer may first be used to obtain the global image features and the local image features, respectively, and the global feature vector is then accumulated element-wise onto each local feature. Under this design, the global image features capture the overall visual characteristics of the image, such as brightness, saturation, and scene type, while the local image features are used to fine-tune the global features according to the local semantic information and spatial position of the image. The fused features obtained in this way are both globally and locally adaptive, so that they can handle diverse user videos and yield a better quality enhancement effect.
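The sketch below illustrates one possible realization of the fusion in FIG. 3, in which a fully connected branch yields a single global vector, a convolutional branch yields one local vector per grid region, and the global vector is added element-wise to every local vector; the pooling operations, channel sizes, and grid size are assumptions introduced for illustration.

```python
# Global/local feature fusion: a fully connected branch produces one global
# vector, a convolutional branch produces one local vector per grid region,
# and the global vector is accumulated element-wise onto every local vector.
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    def __init__(self, channels: int = 128, feat_dim: int = 64, grid: int = 8):
        super().__init__()
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, feat_dim),                 # fully connected layer
        )
        self.local_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(grid),                    # one cell per image region
            nn.Conv2d(channels, feat_dim, kernel_size=1),  # convolutional layer
        )

    def forward(self, fused_conv_feat: torch.Tensor) -> torch.Tensor:
        g = self.global_branch(fused_conv_feat)            # (B, feat_dim)
        l = self.local_branch(fused_conv_feat)              # (B, feat_dim, grid, grid)
        # Element-wise accumulation of the global vector onto every region.
        return l + g[:, :, None, None]                      # (B, feat_dim, grid, grid)
```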
In an exemplary embodiment of the present disclosure, each image region of the first image yields a feature vector that drives the generation of a three-dimensional lookup table, and different three-dimensional lookup tables mean different quality enhancement effects.
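One way such a feature vector could drive the generation of a 3D LUT, assumed here purely for illustration and not mandated by the disclosure, is to maintain a small bank of learnable basis LUTs and let each region's feature vector predict mixing weights over them.

```python
# Per-region 3D LUT generation from a feature vector via a learnable LUT bank.
# The number of basis LUTs and the LUT resolution are illustrative choices.
import torch
import torch.nn as nn

class RegionLUTGenerator(nn.Module):
    def __init__(self, feat_dim: int = 64, n_basis: int = 3, lut_size: int = 33):
        super().__init__()
        # Basis LUTs of shape (n_basis, 3, S, S, S), mapping an RGB bin to RGB.
        self.basis_luts = nn.Parameter(
            torch.randn(n_basis, 3, lut_size, lut_size, lut_size) * 0.01)
        self.to_weights = nn.Linear(feat_dim, n_basis)

    def forward(self, region_feat: torch.Tensor) -> torch.Tensor:
        """region_feat: (N_regions, feat_dim) -> one LUT (3, S, S, S) per region."""
        w = torch.softmax(self.to_weights(region_feat), dim=-1)      # (N, n_basis)
        return torch.einsum("nk,kcxyz->ncxyz", w, self.basis_luts)   # (N, 3, S, S, S)
```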
In step S103, a processor (for example, a processor for image quality enhancement) generates a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image. By introducing the adaptive three-dimensional lookup table into the video quality enhancement framework, video quality enhancement is completed automatically in one click without user participation.
In an exemplary embodiment of the present disclosure, when the quality-enhanced image of the first image is generated, the red, green, and blue primary color values of the first image may be converted based on the three-dimensional lookup table of each image region of the first image, to obtain the quality-enhanced image of the first image. That is, based on the three-dimensional lookup table corresponding to each image pixel position, as shown in FIG. 2A, the RGB values in frame t of the original input video are converted to generate the quality-enhanced image of frame t.
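The following sketch illustrates applying per-region 3D LUTs to the full-resolution frame; it uses nearest-neighbor lookup for brevity (practical implementations would typically interpolate between LUT entries), and the regular-grid assignment of pixels to regions is an assumption.

```python
# Apply one 3D LUT per grid region to the full-resolution RGB frame.
# Each LUT has shape (3, S, S, S), indexed by the (r, g, b) bin of a pixel.
import torch

def apply_region_luts(frame: torch.Tensor, luts: torch.Tensor, grid: int = 8) -> torch.Tensor:
    """frame: (3, H, W) RGB in [0, 1]; luts: (grid*grid, 3, S, S, S)."""
    _, h, w = frame.shape
    s = luts.shape[-1]
    idx = (frame * (s - 1)).round().long().clamp(0, s - 1)      # (3, H, W) bin indices
    out = torch.empty_like(frame)
    for gy in range(grid):                                       # iterate image regions
        for gx in range(grid):
            ys = slice(gy * h // grid, (gy + 1) * h // grid)
            xs = slice(gx * w // grid, (gx + 1) * w // grid)
            lut = luts[gy * grid + gx]                           # (3, S, S, S)
            r, g, b = idx[0, ys, xs], idx[1, ys, xs], idx[2, ys, xs]
            out[:, ys, xs] = lut[:, r, g, b]                     # look up enhanced RGB
    return out
```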
In an exemplary embodiment, as shown in FIG. 2A, the t-th frame and the (t-1)-th frame of the video are first acquired and converted into low-resolution images, and the low-resolution t-th frame and (t-1)-th frame are respectively input into a convolutional neural network (CNN) capable of capturing the positional and semantic relationships between local regions of an image. Convolutional feature extraction is performed by the CNN to obtain the convolutional features of the t-th frame and the convolutional features of the (t-1)-th frame. A similarity matrix between the convolutional features of the t-th frame and those of the (t-1)-th frame is computed, and the two sets of convolutional features are fused based on the similarity matrix. Then, based on the fused convolutional features, a fully connected layer and a convolutional layer are used to obtain global and local image features, respectively, and the global feature vector is accumulated element-wise onto each local feature to obtain a feature vector for the image region corresponding to each image pixel position. The feature vector of each image region drives the generation of a three-dimensional lookup table (3D LUT). Finally, based on the 3D LUT corresponding to each image pixel position, the RGB values in frame t of the original input video are converted to generate the quality-enhanced image of frame t.
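Under the same assumptions, the pieces sketched above could be composed for a single frame roughly as follows; all class and function names refer to the illustrative helpers defined earlier and are not part of the disclosure.

```python
# A non-limiting sketch composing the illustrative helpers defined earlier
# (acquire_low_res_pair, ConvAttentionBlock, fuse_features, GlobalLocalFusion,
# RegionLUTGenerator, apply_region_luts) into a single-frame enhancement pass.
import torch
import torch.nn as nn

class FrameEnhancer(nn.Module):
    def __init__(self, grid: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(ConvAttentionBlock(3, 64),
                                      ConvAttentionBlock(64, 128))
        self.fusion = GlobalLocalFusion(channels=128, feat_dim=64, grid=grid)
        self.lut_gen = RegionLUTGenerator(feat_dim=64)
        self.grid = grid

    def forward(self, frame_t: torch.Tensor, frame_t_minus_1: torch.Tensor) -> torch.Tensor:
        low_t, low_prev = acquire_low_res_pair(frame_t, frame_t_minus_1)
        f_t = self.backbone(low_t.unsqueeze(0))        # convolutional features, frame t
        f_prev = self.backbone(low_prev.unsqueeze(0))  # convolutional features, frame t-1
        fused = fuse_features(f_t, f_prev)             # similarity-weighted temporal fusion
        region_feat = self.fusion(fused)               # (1, 64, grid, grid)
        region_feat = region_feat.flatten(2).transpose(1, 2).reshape(-1, 64)
        luts = self.lut_gen(region_feat)               # one 3D LUT per region
        return apply_region_luts(frame_t, luts, grid=self.grid)
```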
The video processing method according to the exemplary embodiments of the present disclosure has been described above with reference to FIGS. 1 to 3. Hereinafter, a video processing apparatus and its units according to an exemplary embodiment of the present disclosure will be described with reference to FIG. 4.
FIG. 4 is a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to FIG. 4, the video processing apparatus includes an image acquisition unit 41, a lookup table generation unit 42, and an image quality enhancement unit 43.
The image acquisition unit 41 is configured to acquire two adjacent frames of a video as a first image and a second image. Here, the first image follows the second image in the video.
In an exemplary embodiment of the present disclosure, the image acquisition unit 41 may be configured to: acquire two adjacent frames of the video; perform low-resolution conversion on the two acquired frames; and use the two converted low-resolution frames as the first image and the second image.
The lookup table generation unit 42 is configured to generate, based on the first image and the second image, a three-dimensional lookup table for each image region of the first image.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: obtain global image features and local image features from the first image and the second image; and generate the three-dimensional lookup table of each image region of the first image based on the global image features and the local image features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: extract convolutional features from the first image and the second image separately; fuse the convolutional features of the first image and the convolutional features of the second image; and obtain the global image features and the local image features based on the fused convolutional features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to input the first image and the second image respectively into a convolutional neural network whose convolutional layers embed a self-attention module, to obtain convolutional features of the first image and the second image that contain the positional and semantic relationships between local regions of the image.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: compute the similarity between the convolutional features of the first image and the convolutional features of the second image; determine, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image; and fuse the convolutional features of the first image and the convolutional features of the second image based on the first fusion weight and the second fusion weight.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to input the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate the three-dimensional lookup table of each image region of the first image from the fused feature vector of each image region.
The image quality enhancement unit 43 is configured to generate a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image.
In an exemplary embodiment of the present disclosure, the image quality enhancement unit 43 may be configured to convert the red, green, and blue primary color values of the first image based on the three-dimensional lookup table of each image region of the first image, to obtain the quality-enhanced image of the first image.
With respect to the apparatus in the above embodiment, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
The video processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to FIG. 4. Next, an electronic device according to an exemplary embodiment of the present disclosure is described with reference to FIG. 5.
FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure.
Referring to FIG. 5, the electronic device 500 includes at least one memory 501 and at least one processor 502. The at least one memory 501 stores a set of computer-executable instructions which, when executed by the at least one processor 502, performs the video processing method according to the exemplary embodiments of the present disclosure.
As an example, the electronic device 500 may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. Here, the electronic device 500 need not be a single electronic device, and may be any aggregate of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interfaced locally or remotely (for example, via wireless transmission).
In the electronic device 500, the processor 502 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 502 may execute instructions or code stored in the memory 501, and the memory 501 may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
The memory 501 may be integrated with the processor 502, for example, with RAM or flash memory arranged within an integrated-circuit microprocessor or the like. In addition, the memory 501 may include a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 501 and the processor 502 may be operatively coupled, or may communicate with each other, for example, through I/O ports or network connections, so that the processor 502 can read files stored in the memory.
In addition, the electronic device 500 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium including instructions is also provided, for example, the memory 501 including instructions, which are executable by the processor 502 of the device 500 to perform the above method. In some embodiments, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, the computer program product including a computer program/instructions which, when executed by a processor, implement the video processing method according to the exemplary embodiments of the present disclosure.
The video processing method and apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to FIGS. 1 to 5. However, it should be understood that the video processing apparatus and its units shown in FIG. 4 may each be configured as software, hardware, firmware, or any combination thereof that performs a specific function, that the electronic device shown in FIG. 5 is not limited to the components shown above, and that components may be added, removed, or combined as needed.
According to the video processing method and apparatus of the present disclosure, two adjacent frames of a video are acquired as a first image and a second image, a three-dimensional lookup table is generated for each image region of the first image based on the first image and the second image, and a quality-enhanced image of the first image is generated based on the three-dimensional lookup table of each image region of the first image, so that video quality enhancement is completed automatically in one click without user participation.
In addition, according to the video processing method and apparatus of the present disclosure, a self-attention module may be embedded in the convolutional layers of the neural network, so that constraints on the spatial smoothness of the image and on the semantic coherence of the enhancement are added while the local adaptivity of the enhancement is preserved.
In addition, according to the video processing method and apparatus of the present disclosure, global and local adaptivity can be obtained through global/local feature fusion, so that diverse user videos can be handled and a better quality enhancement effect can be obtained.
Other embodiments of the present disclosure will readily occur to those skilled in the art from consideration of the specification and practice of what is disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and the examples are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (21)

  1. A video processing method, comprising:
    acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video;
    generating, based on the first image and the second image, a three-dimensional lookup table for each image region of the first image;
    generating a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image.
  2. The video processing method according to claim 1, wherein the generating of the three-dimensional lookup table of each image region of the first image based on the first image and the second image comprises:
    obtaining global image features and local image features from the first image and the second image;
    generating the three-dimensional lookup table of each image region of the first image based on the global image features and the local image features.
  3. The video processing method according to claim 2, wherein the obtaining of the global image features and the local image features from the first image and the second image comprises:
    extracting convolutional features from the first image and the second image, respectively;
    fusing the convolutional features of the first image and the convolutional features of the second image;
    obtaining the global image features and the local image features based on the fused convolutional features.
  4. The video processing method according to claim 3, wherein the extracting of the convolutional features from the first image and the second image respectively comprises:
    inputting the first image and the second image respectively into a convolutional neural network whose convolutional layers embed a self-attention module, to obtain convolutional features of the first image and the second image that contain positional relationships and semantic relationships between local regions of the image.
  5. The video processing method according to claim 3, wherein the fusing of the convolutional features of the first image and the convolutional features of the second image comprises:
    computing a similarity between the convolutional features of the first image and the convolutional features of the second image;
    determining, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image;
    fusing the convolutional features of the first image and the convolutional features of the second image based on the first fusion weight and the second fusion weight.
  6. The video processing method according to claim 3, wherein the obtaining of the global image features and the local image features based on the fused convolutional features comprises:
    inputting the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
  7. The video processing method according to claim 2, wherein the generating of the three-dimensional lookup table of each image region of the first image based on the global image features and the local image features comprises:
    fusing the global image features and the local image features to obtain a fused feature vector of each image region;
    generating the three-dimensional lookup table of each image region of the first image according to the fused feature vector of each image region.
  8. The video processing method according to claim 1, wherein the generating of the quality-enhanced image of the first image comprises:
    converting red, green, and blue primary color values of the first image based on the three-dimensional lookup table of each image region of the first image, to obtain the quality-enhanced image of the first image.
  9. The video processing method according to claim 1, wherein the acquiring of the two adjacent frames of the video as the first image and the second image comprises:
    acquiring two adjacent frames of the video;
    performing low-resolution conversion on the two acquired frames, and using the two converted low-resolution frames as the first image and the second image.
  10. A video processing apparatus, comprising:
    an image acquisition unit configured to acquire two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video;
    a lookup table generation unit configured to generate, based on the first image and the second image, a three-dimensional lookup table for each image region of the first image; and
    an image quality enhancement unit configured to generate a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image.
  11. The video processing apparatus according to claim 10, wherein the lookup table generation unit is configured to:
    obtain global image features and local image features from the first image and the second image;
    generate the three-dimensional lookup table of each image region of the first image based on the global image features and the local image features.
  12. The video processing apparatus according to claim 11, wherein the lookup table generation unit is configured to:
    extract convolutional features from the first image and the second image, respectively;
    fuse the convolutional features of the first image and the convolutional features of the second image;
    obtain the global image features and the local image features based on the fused convolutional features.
  13. The video processing apparatus according to claim 12, wherein the lookup table generation unit is configured to:
    input the first image and the second image respectively into a convolutional neural network whose convolutional layers embed a self-attention module, to obtain convolutional features of the first image and the second image that contain positional relationships and semantic relationships between local regions of the image.
  14. The video processing apparatus according to claim 12, wherein the lookup table generation unit is configured to:
    compute a similarity between the convolutional features of the first image and the convolutional features of the second image;
    determine, based on the similarity, a first fusion weight for the convolutional features of the first image and a second fusion weight for the convolutional features of the second image;
    fuse the convolutional features of the first image and the convolutional features of the second image based on the first fusion weight and the second fusion weight.
  15. The video processing apparatus according to claim 12, wherein the lookup table generation unit is configured to:
    input the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain the global image features and the local image features.
  16. The video processing apparatus according to claim 11, wherein the lookup table generation unit is configured to:
    fuse the global image features and the local image features to obtain a fused feature vector of each image region;
    generate the three-dimensional lookup table of each image region of the first image according to the fused feature vector of each image region.
  17. The video processing apparatus according to claim 10, wherein the image quality enhancement unit is configured to:
    convert red, green, and blue primary color values of the first image based on the three-dimensional lookup table of each image region of the first image, to obtain the quality-enhanced image of the first image.
  18. The video processing apparatus according to claim 10, wherein the image acquisition unit is configured to:
    acquire two adjacent frames of the video;
    perform low-resolution conversion on the two acquired frames, and use the two converted low-resolution frames as the first image and the second image.
  19. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the video processing method according to any one of claims 1 to 9.
  20. A non-volatile computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor of an electronic device, the electronic device is caused to perform a video processing method, the video processing method comprising:
    acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video;
    generating, based on the first image and the second image, a three-dimensional lookup table for each image region of the first image;
    generating a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image.
  21. A computer program product comprising a computer program/instructions, wherein, when the computer program/instructions are executed by a processor, a video processing method is implemented, the video processing method comprising:
    acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video;
    generating, based on the first image and the second image, a three-dimensional lookup table for each image region of the first image;
    generating a quality-enhanced image of the first image based on the three-dimensional lookup table of each image region of the first image.
PCT/CN2021/118552 2021-02-25 2021-09-15 Video processing method and apparatus WO2022179087A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110213511.1 2021-02-25
CN202110213511.1A CN113034412B (en) 2021-02-25 Video processing method and device

Publications (1)

Publication Number Publication Date
WO2022179087A1 (en) 2022-09-01

Family

ID=76462081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118552 WO2022179087A1 (en) 2021-02-25 2021-09-15 Video processing method and apparatus

Country Status (1)

Country Link
WO (1) WO2022179087A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101902550A (en) * 2009-05-28 2010-12-01 佳能株式会社 Image processing apparatus and image processing method
CN102769759A (en) * 2012-07-20 2012-11-07 上海富瀚微电子有限公司 Digital image color correcting method and realizing device
CN109934776A (en) * 2018-12-25 2019-06-25 北京奇艺世纪科技有限公司 Model generating method, video enhancement method, device and computer readable storage medium
US10679584B1 (en) * 2017-11-01 2020-06-09 Gopro, Inc. Systems and methods for transforming presentation of visual content
CN111681177A (en) * 2020-05-18 2020-09-18 腾讯科技(深圳)有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN113034412A (en) * 2021-02-25 2021-06-25 北京达佳互联信息技术有限公司 Video processing method and device

Also Published As

Publication number Publication date
CN113034412A (en) 2021-06-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927513

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 11205A DATED 16.01.2024)