WO2022179087A1

WO2022179087A1 - Video processing method and apparatus

Info

Publication number: WO2022179087A1
Application number: PCT/CN2021/118552
Authority: WO
Inventors: 刘晶晶; 徐宁
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2021-02-25
Filing date: 2021-09-15
Publication date: 2022-09-01
Also published as: CN113034412A

Abstract

A video processing method and an apparatus. The video processing method comprises: obtaining two adjacent image frames of a video to serve as a first image and a second image, wherein the first image is situated after the second image in the video; generating a three-dimensional lookup table of each image region of the first image on the basis of the first image and the second image; and generating an enhanced quality image of the first image on the basis of the three-dimensional lookup table of each image region of the first image.

Description

Video processing method and device

CROSS-REFERENCE TO RELATED APPLICATIONS

The present disclosure is based on a Chinese patent application with application number 202110213511.1 and an application date of February 25, 2021, and claims the priority of the Chinese patent application, the entire contents of which are incorporated herein by reference.

technical field

The present disclosure relates to the field of video technology. More particularly, the present disclosure relates to a video processing method and apparatus.

Background technique

With the popularization of smartphones, more and more users use their personal mobile phones to shoot videos to record daily moments in life. Limited by the camera hardware parameters of the mobile phone, the shooting environment and the user's shooting skills, the original video quality is often unsatisfactory to users. For example, the original video often has problems such as too bright or too dark, or insufficient color saturation. . Through the later video quality enhancement technology, the original image quality can be greatly improved. Therefore, many users have a strong and huge demand for video quality enhancement.

SUMMARY OF THE INVENTION

According to an exemplary embodiment of the present disclosure, a video processing method is provided, comprising: acquiring two adjacent frames of images of a video as a first image and a second image, wherein the first image is located after the second image in the video; based on The first image and the second image generate a three-dimensional look-up table of each image area of the first image; based on the three-dimensional look-up table of each image area of the first image, a quality-enhanced image of the first image is generated.

In some embodiments, the step of generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image may include: obtaining global image features and local image features from the first image and the second image; global image features and local image features to generate a three-dimensional look-up table for each image region of the first image.

In some embodiments, the step of obtaining global image features and local image features from the first image and the second image may include: performing convolutional feature extraction on the first image and the second image, respectively; Perform convolution feature fusion with the convolution features of the second image; obtain global image features and local image features based on the fused convolution features.

In some embodiments, the step of separately performing convolutional feature extraction on the first image and the second image may include: respectively inputting the first image and the second image into a convolutional neural network in which a self-attention module is embedded in the convolutional layer. The network obtains the convolutional features of the first image and the second image including the positional relationship and the semantic relationship between the image parts.

In some embodiments, the step of performing convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image may include: calculating a difference between the convolution feature of the first image and the convolution feature of the second image based on the similarity; determine the first fusion weight of the convolution feature of the first image and the second fusion weight of the convolution feature of the second image respectively; based on the first fusion weight and the second fusion weight The convolutional feature of , and the convolutional feature of the second image perform convolutional feature fusion.

In some embodiments, the step of obtaining global image features and local image features based on the fused convolutional features may include: inputting the fused convolutional features into a fully connected layer and a convolutional layer, respectively, to obtain global image features and local image features image features.

In some embodiments, the step of generating a three-dimensional lookup table for each image region of the first image based on the global image feature and the local image feature may include: fusing the global image feature and the local image feature to obtain each fused image The feature vector of the region; the three-dimensional lookup table of each image region of the first image is generated according to the fused feature vector of each image region.

In some embodiments, the step of generating the quality-enhanced image of the first image may include: converting the red, green, and blue three primary color values of the first image based on a three-dimensional look-up table of each image area of the first image to obtain the first image. The quality of the image enhances the image.

In some embodiments, the step of acquiring two adjacent frames of images of the video as the first image and the second image may include: acquiring two adjacent frames of images of the video; performing low-resolution conversion on the acquired two adjacent frames of images, And take the converted two low-resolution images as the first image and the second image.

According to an exemplary embodiment of the present disclosure, there is provided a video processing apparatus, comprising: an image acquisition unit configured to acquire images of two adjacent frames of a video as a first image and a second image, wherein the first image in the video located after the second image; a look-up table generation unit configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image; and an image quality enhancement unit configured to A three-dimensional look-up table for each image region generates a quality-enhanced image of the first image.

In some embodiments, the look-up table generation unit may be configured to: obtain global image features and local image features from the first image and the second image; generate each image region of the first image based on the global image features and the local image features 3D look-up table.

In some embodiments, the lookup table generating unit may be configured to: perform convolution feature extraction on the first image and the second image, respectively; perform convolution feature extraction on the convolution feature of the first image and the convolution feature of the second image Fusion; global image features and local image features are obtained based on the fused convolutional features.

In some embodiments, the look-up table generation unit may be configured to: respectively input the first image and the second image into a convolutional neural network with a self-attention module embedded in the convolutional layer to obtain the first image and the second image The convolutional features that contain the positional and semantic relations between image parts.

In some embodiments, the look-up table generating unit may be configured to: calculate a similarity between the convolutional features of the first image and the convolutional features of the second image; The first fusion weight and the second fusion weight of the convolution feature of the second image; based on the first fusion weight and the second fusion weight, convolution feature fusion is performed on the convolution feature of the first image and the convolution feature of the second image .

In some embodiments, the lookup table generation unit may be configured to input the fused convolutional features to the fully connected layer and the convolutional layer, respectively, to obtain global image features and local image features.

In some embodiments, the look-up table generating unit may be configured to: fuse the global image feature and the local image feature to obtain a feature vector of each fused image region; generate a first feature vector according to the fused feature vector of each image region A three-dimensional look-up table for each image region of the image.

In some embodiments, the image quality enhancement unit may be configured to: based on the three-dimensional look-up table of each image area of the first image, convert the values of the three primary colors of red, green and blue in the first image to obtain image quality enhancement of the first image image.

In some embodiments, the image acquisition unit may be configured to: acquire two adjacent frames of images of the video; perform low-resolution conversion on the acquired two adjacent frames of images, and use the converted two low-resolution images as the first an image and a second image.

According to an exemplary embodiment of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement a A video processing method of an exemplary embodiment of the present disclosure.

According to an exemplary embodiment of the present disclosure, there is provided a non-volatile computer-readable storage medium having stored thereon a computer program that, when executed by a processor of an electronic device, causes the electronic device to perform the execution according to the present disclosure An exemplary embodiment of the video processing method.

According to an exemplary embodiment of the present disclosure, there is provided a computer program product including a computer program/instructions which, when executed by a processor, implement the video processing method according to the exemplary embodiment of the present disclosure.

In the technical solutions provided by the embodiments of the present disclosure, 1. One-click automatic completion of video image quality enhancement without user participation; 2. Various user videos can be processed, with good enhancement effect and coherence in time domain.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, and together with the description, serve to explain the principles of the present disclosure and do not unduly limit the present disclosure.

FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.

FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure.

FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.

FIG. 3 shows a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure.

FIG. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.

FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure.

Detailed ways

In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first", "second" and the like in the description and claims of the present disclosure and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or sequence. It is to be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following examples are not intended to represent all implementations consistent with this disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.

It should be noted here that "at least one of several items" in the present disclosure all means including "any one of the several items", "a combination of any of the several items", The three categories of "the whole of the several items" are juxtaposed. For example, "including at least one of A and B" includes the following three parallel situations: (1) including A; (2) including B; (3) including A and B. Another example is "execute at least one of step 1 and step 2", which means the following three parallel situations: (1) execute step 1; (2) execute step 2; (3) execute step 1 and step 2.

Most existing video image quality enhancement methods require users to manually set various image parameters. This not only requires the user to have a certain degree of photography knowledge, but also requires a considerable amount of energy and time to manually adjust the image quality parameters. In addition, it is difficult to manually enhance the video quality to a more professional level. Therefore, an automatic self-adaptive video image quality enhancement method is required to automatically enhance the original video image quality so that it has an enhanced aesthetic effect visually.

In related video quality enhancement methods, deep neural networks can be used to learn from massive video materials how to convert the original video red, green and blue (hereinafter, also referred to as RGB) pixel values into new RGB values, to To achieve the purpose of image quality enhancement. Such big data-driven methods not only make full use of a large number of professional videos, but also adaptively adjust the image quality enhancement algorithm in combination with the user's current shooting content. In addition, by fusing the enhancement parameters of the front and rear frames, the coherence of the image quality enhancement on the video is achieved.

The model for RGB value conversion in the related art uses a piecewise linear function, and independently converts the original value in three color channels of R, G, and B. However, the mathematical expression ability of the piecewise linear function is inherently insufficient, and it cannot completely obtain the mapping relationship between the original RGB values and the enhanced RGB values. Furthermore, since the transformation is done separately on the three color channels, the rich semantics brought by the R, G, and B color associations are ignored. The scenes and contents of videos shot by users are ever-changing. In some scenarios, this type of method is prone to insufficient image quality enhancement or distortion. In addition, these methods only rely on local image features to drive enhancement algorithms, ignoring the benefits of global features for image quality enhancement.

Most of the current leading image quality enhancement methods focus on single image enhancement rather than video enhancement. In related video image quality enhancement methods, deep neural networks can also be used to learn an adaptive 3D lookup table (hereinafter, also referred to as 3D LUT) from massive image data, which is used to simulate the original RGB values and enhanced RGB values. the mapping relationship between them. The so-called adaptive means that the 3D LUT can be automatically adjusted according to the visual characteristics of the image. Compared with the traditional 3D LUT image enhancement technology, this method does not require algorithm experts to manually set various parameters of the 3D LUT. At the same time, 3D LUTs with higher complexity can better adapt to various original image quality. Related technologies only use the global visual features of images to drive enhancement algorithms when learning 3D LUTs, ignoring that different image parts require different enhancement methods. What's more fatal is that the image quality enhancement technology cannot be directly used for video enhancement because the continuity of enhancement cannot be guaranteed.

Hereinafter, a video processing method and apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to FIGS. 1 to 5 .

FIG. 1 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure. FIG. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure. FIG. 2B illustrates an example of a grid structure according to an exemplary embodiment of the present disclosure.

Referring to FIG. 1, in step S101, a processor (eg, a processor for image acquisition) acquires images of two adjacent frames of a video as a first image and a second image. Here, the first image comes after the second image in the video.

In an exemplary embodiment of the present disclosure, when two adjacent frames of images of a video are acquired as the first image and the second image, two adjacent frames of images of the video (for example, but not limited to, the t-th frame of images) may be acquired first. and the t-1th frame image), then low-resolution conversion is performed on the acquired adjacent two frames of images, and the converted two low-resolution images (for example, but not limited to, the t-th frame in Figure 2A is low-resolution high-resolution image and the t-1th frame low-resolution image) as the first image and the second image. Here, the low resolution may be, for example, but not limited to, 256x256.

At step S102, a three-dimensional look-up table for each image region of the first image is generated by a processor (eg, a processor for look-up table generation) based on the first image and the second image.

Compared with the related art, in the exemplary embodiment of the present disclosure, the adaptive 3D LUT not only uses the global visual features of the image to drive the enhancement algorithm, but can also take into account the local visual and semantic differences of each image. In other words, different image parts can use different enhancement methods. For example, by dividing the image into a grid structure (for example, but not limited to the grid structure in FIG. 2B ), each image in the grid locally corresponds to different enhancement parameters, so that the local self-adaptability of image quality enhancement can be obtained. .

In an exemplary embodiment of the present disclosure, when generating a three-dimensional lookup table for each image region of the first image based on the first image and the second image, global image features and local image features may be first obtained from the first image and the second image image features, and then generate a three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.

Different from single image enhancement, video enhancement also pursues the stability of image quality on consecutive frames. In an exemplary embodiment of the present disclosure, when the global image feature and the local image feature are obtained from the first image and the second image, convolutional feature extraction may be performed on the first image and the second image, respectively, and then the first image and the second image are respectively extracted. The convolutional features of the image and the convolutional features of the second image are fused with convolutional features, and then global image features and local image features are obtained based on the fused convolutional features. That is, by fusing the visual features of the front and rear frames, the continuity of the enhancement in the temporal domain can be obtained.

In an exemplary embodiment of the present disclosure, when the convolutional feature extraction is performed on the first image and the second image, respectively, a self-attention module can be embedded by inputting the first image and the second image into the convolutional layer respectively. The convolutional neural network is used to obtain the convolutional features of the first image and the second image including the positional relationship and semantic relationship between the image parts. Here, a self-attention module is embedded in the convolutional layers of the convolutional neural network. Convolutional neural networks with self-attention modules embedded in convolutional layers can be used to capture the positional and semantic relationships between image parts. The advantages of doing so are: 1) the smoothness of image quality enhancement in the image space is obtained through the positional relationship; 2) the correlation of the local enhancement effect is improved through the semantic relationship. For example, when both blue sky and grass appear in an image, the enhancements can be automatically adjusted to handle such a combination. Overall, the self-attention module imposes constraints on the spatial smoothness and enhanced semantic relevance of images while preserving the local adaptation for image quality enhancement.

In an exemplary embodiment of the present disclosure, when convolution feature fusion is performed on the convolution feature of the first image and the convolution feature of the second image, the convolution feature of the first image and the convolution feature of the second image may be calculated first Then, based on the similarity, the first fusion weight of the convolutional feature of the first image and the second fusion weight of the convolutional feature of the second image are determined respectively, and then based on the first fusion weight and the second fusion weight weight to perform convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image. Each pixel on the convolutional feature can represent a visual feature vector of the corresponding image part. In an exemplary embodiment of the present disclosure, the visual similarity of the preceding and following video frames may be measured by calculating the cosine similarity between the two feature vectors. Since the feature vectors correspond to different image regions, each element of the similarity matrix actually represents the visual similarity of the corresponding positions in the two images. For image areas with high similarity, the convolutional features of frame t-1 have a higher fusion weight; for areas with low similarity, the convolutional features of frame t are used as much as possible. When the local scene changes little, convolutional feature fusion helps to maintain the coherence of enhancement on the video; and when the local scene has significant visual changes, the image quality enhancement strategy can also be automatically adjusted to deal with it.

In an exemplary embodiment of the present disclosure, when global image features and local image features are obtained based on the fused convolutional features, the global image features can be obtained by inputting the fused convolutional features into the fully connected layer and the convolutional layer, respectively. Image features and local image features, thereby improving the accuracy of feature extraction.

In an exemplary embodiment of the present disclosure, when generating a three-dimensional lookup table for each image region of the first image based on the global image feature and the local image feature, the global image feature and the local image feature may be first fused to obtain a fusion Then, a three-dimensional look-up table of each image area of the first image is generated according to the fused feature vector of each image area, thereby improving the accuracy of the three-dimensional look-up table. FIG. 3 shows a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure. As shown in Figure 3, based on the fused convolutional features, the fully connected layer and the convolutional layer can be used to obtain global image features and local image features respectively, and then the global feature vector is accumulated element by element with each local feature. In this setting, global image features are used to obtain the overall visual characteristics of the image, such as brightness, saturation, scene, and so on. The local image features can be used to fine-tune the global features according to the local semantic information and spatial location of the image. The fusion features thus obtained have global and local adaptability, so that they can cope with diverse user videos and can obtain better image quality enhancement effects.

In an exemplary embodiment of the present disclosure, each image region of the first image will generate a feature vector, which is used to drive the generation of a three-dimensional look-up table. And different 3D lookup tables mean different picture quality enhancement effects.

In step S103, a quality-enhanced image of the first image is generated by a processor (eg, a processor for image quality enhancement) based on the three-dimensional lookup table of each image area of the first image. By introducing an adaptive 3D lookup table into the video quality enhancement framework, one-click automatic video quality enhancement is realized without user participation.

In an exemplary embodiment of the present disclosure, when the image quality-enhanced image of the first image is generated, the three primary color values of red, green and blue of the first image may be converted based on the three-dimensional lookup table of each image area of the first image, A quality-enhanced image of the first image is obtained. That is, based on the three-dimensional look-up table corresponding to the pixel position of each image, as shown in FIG. 2A , the RGB values in frame t of the original input video are converted to generate an enhanced image of frame t.

In an exemplary embodiment, as shown in FIG. 2A , the t-th frame image and the t-1-th frame image of the video are first obtained, and the t-th frame image and the t-1-th frame image are converted into low-resolution images, respectively. The t-th frame image and the t-1-th frame image of the low-resolution image are input into a convolutional neural network (CNN) with the function of capturing the positional and semantic relationships between image parts. The convolutional feature extraction is performed through the convolutional neural network (CNN), and the convolutional feature of the t-th frame image and the convolutional feature of the t-1th frame image are obtained. Calculate the similarity matrix of the convolution feature of the t frame image and the convolution feature of the t-1 frame image, and based on the similarity matrix, the convolution feature of the t frame image and the convolution feature of the t-1 frame image Fusion. Then, based on the fused convolutional features, the fully connected layer and the convolutional layer are used to obtain global and local image features, respectively, and the global feature vector is accumulated element by element with each local feature to obtain an image corresponding to each image pixel position. feature vector of the region. A three-dimensional look-up table (3D LUT) is generated by driving the feature vector of each image area. Finally, based on the 3D LUT corresponding to each image pixel position, the RGB values in the t frame of the original input video are converted to generate an image quality enhanced image of the t frame.

The video processing method according to the exemplary embodiment of the present disclosure has been described above with reference to FIGS. 1 to 3 . Hereinafter, a video processing apparatus and a unit thereof according to an exemplary embodiment of the present disclosure will be described with reference to FIG. 4 .

4 , the video processing apparatus includes an image acquisition unit 41 , a lookup table generation unit 42 and an image quality enhancement unit 43 .

The image acquisition unit 41 is configured to acquire images of two adjacent frames of the video as the first image and the second image. Here, the first image comes after the second image in the video.

In an exemplary embodiment of the present disclosure, the image acquisition unit 41 may be configured to: acquire two adjacent frames of images of the video; perform low-resolution conversion on the acquired two adjacent frames of images, and convert the converted two frames into low-resolution images. Resolution images as the first image and the second image.

The look-up table generation unit 42 is configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image.

In an exemplary embodiment of the present disclosure, the look-up table generating unit 42 may be configured to: obtain global image features and local image features from the first image and the second image; generate the first image based on the global image features and the local image features 3D look-up table for each image region.

In an exemplary embodiment of the present disclosure, the look-up table generating unit 42 may be configured to: perform convolution feature extraction on the first image and the second image, respectively; Convolution feature fusion is performed on the features; global image features and local image features are obtained based on the fused convolution features.

In an exemplary embodiment of the present disclosure, the look-up table generating unit 42 may be configured to: respectively input the first image and the second image into a convolutional neural network with a self-attention module embedded in the convolutional layer, and obtain the first image and the second image respectively. Convolutional features of the image and the second image that contain the positional and semantic relationships between image parts.

In an exemplary embodiment of the present disclosure, the look-up table generating unit 42 may be configured to: calculate the similarity between the convolutional feature of the first image and the convolutional feature of the second image; determine the first image based on the similarity, respectively The first fusion weight of the convolution feature of the second image and the second fusion weight of the convolution feature of the second image; based on the first fusion weight and the second fusion weight, the convolution feature of the first image and the convolution feature of the second image are combined. Perform convolutional feature fusion.

In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to input the fused convolutional features to the fully connected layer and the convolutional layer, respectively, to obtain global image features and local image features.

In an exemplary embodiment of the present disclosure, the look-up table generating unit 42 may be configured to: fuse the global image feature and the local image feature to obtain a feature vector of each fused image region; The feature vector generates a three-dimensional look-up table for each image region of the first image.

The image quality enhancement unit 43 is configured to generate an image quality enhanced image of the first image based on the three-dimensional look-up table for each image region of the first image.

In an exemplary embodiment of the present disclosure, the image quality enhancement unit 43 may be configured to: based on the three-dimensional look-up table of each image area of the first image, convert the values of the three primary colors of red, green and blue in the first image to obtain the first image. The quality of the image enhances the image.

Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment of the method, and will not be described in detail here.

The video processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to FIG. 4 . Next, an electronic device according to an exemplary embodiment of the present disclosure will be described with reference to FIG. 5 .

Referring to FIG. 5 , the electronic device 500 includes at least one memory 501 and at least one processor 502. The at least one memory 501 stores a computer-executable instruction set. When the computer-executable instruction set is executed by the at least one processor 502, the execution A method of video processing according to an exemplary embodiment of the present disclosure.

As an example, the electronic device 500 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above set of instructions. Here, the electronic device 500 is not necessarily a single electronic device, but can also be a collection of any device or circuit capable of individually or jointly executing the above-mentioned instructions (or instruction sets). Electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).

In electronic device 500, processor 502 may include a central processing unit (CPU), graphics processing unit (GPU), programmable logic device, special purpose processor system, microcontroller, or microprocessor. By way of example and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

Processor 502 may execute instructions or code stored in memory 501, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.

The memory 501 may be integrated with the processor 502, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Furthermore, memory 501 may comprise a separate device such as an external disk drive, storage array, or any other storage device that may be used by a database system. The memory 501 and the processor 502 may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor 502 to read files stored in the memory.

Additionally, the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of electronic device 500 may be connected to each other via a bus and/or network.

According to an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium including instructions, such as a memory 501 including instructions, which can be executed by the processor 502 of the apparatus 500 to complete the above method. In some embodiments, the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

According to exemplary embodiments of the present disclosure, there may also be provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the exemplary embodiments according to the present disclosure method of video processing.

The video processing method and apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to FIGS. 1 to 5 . However, it should be understood that: the video processing apparatus and its units shown in FIG. 4 may be respectively configured as software, hardware, firmware or any combination of the above items to perform specific functions, and the electronic device shown in FIG. 5 is not It is limited to include the components shown above, but some components may be added or deleted as needed, and the above components may also be combined.

According to the video processing method and device of the present disclosure, the three-dimensional lookup table of each image area of the first image is generated based on the first image and the second image by acquiring two adjacent frames of images of the video as the first image and the second image, Based on the three-dimensional lookup table of each image area of the first image, an image quality enhancement image of the first image is generated, so that one-click automatic completion of video image quality enhancement is realized without user participation.

In addition, according to the video processing method and device of the present disclosure, by embedding a self-attention module in the convolutional layer of the neural network, it is possible to increase the spatial smoothness of the image and enhance the semantic correlation while preserving the enhanced local adaptation. Sexual constraints.

In addition, according to the video processing method and device of the present disclosure, global and local adaptability can be obtained through global/local feature fusion, so that various user videos can be dealt with and better image quality enhancement effect can be obtained.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of what is disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or techniques in the technical field not disclosed by the present disclosure . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

A video processing method, comprising:

Acquiring two adjacent frames of images of the video as the first image and the second image, wherein the first image is located after the second image in the video;

generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image;

A quality-enhanced image of the first image is generated based on the three-dimensional lookup table for each image region of the first image.
The video processing method of claim 1, wherein the step of generating a three-dimensional lookup table for each image region of the first image based on the first image and the second image comprises:

obtaining global image features and local image features from the first image and the second image;

A three-dimensional look-up table for each image region of the first image is generated based on the global image features and the local image features.
The video processing method of claim 2, wherein the step of obtaining global image features and local image features from the first image and the second image comprises:

Perform convolution feature extraction on the first image and the second image respectively;

Perform convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image;

Global image features and local image features are obtained based on the fused convolutional features.
The video processing method according to claim 3, wherein the step of performing convolution feature extraction on the first image and the second image respectively comprises:

The first image and the second image are respectively input into the convolutional neural network with the self-attention module embedded in the convolution layer, and the convolution of the first image and the second image including the positional relationship and semantic relationship between the image parts is obtained. feature.
The video processing method according to claim 3, wherein the step of performing convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image comprises:

calculating the similarity between the convolution feature of the first image and the convolution feature of the second image;

Determine the first fusion weight of the convolution feature of the first image and the second fusion weight of the convolution feature of the second image based on the similarity;

Convolution feature fusion is performed on the convolution feature of the first image and the convolution feature of the second image based on the first fusion weight and the second fusion weight.
video processing method according to claim 3, wherein, the step of obtaining global image feature and local image feature based on the convolution feature after fusion comprises:

The fused convolutional features are input to the fully connected layer and the convolutional layer respectively to obtain global image features and local image features.
The video processing method according to claim 2, wherein the step of generating a three-dimensional look-up table for each image region of the first image based on the global image feature and the local image feature comprises:

Fusion of global image features and local image features to obtain the feature vector of each fused image region;

A three-dimensional look-up table for each image region of the first image is generated according to the fused feature vector of each image region.
The video processing method according to claim 1, wherein the step of generating a quality-enhanced image of the first image comprises:

Based on the three-dimensional look-up table of each image area of the first image, the values of the three primary colors of red, green and blue in the first image are converted to obtain an image quality enhanced image of the first image.
The video processing method according to claim 1, wherein the step of acquiring two adjacent frames of images of the video as the first image and the second image comprises:

Obtain two adjacent frames of the video;

Low-resolution conversion is performed on the acquired adjacent two frames of images, and the converted two frames of low-resolution images are used as the first image and the second image.
A video processing device, comprising:

an image acquisition unit, configured to acquire two adjacent frames of images of the video as the first image and the second image, wherein the first image is located after the second image in the video;

a look-up table generation unit configured to generate a three-dimensional look-up table for each image region of the first image based on the first image and the second image; and

The image quality enhancement unit is configured to generate an image quality enhanced image of the first image based on the three-dimensional lookup table of each image area of the first image.
The video processing apparatus according to claim 10, wherein the look-up table generating unit is configured to:

obtaining global image features and local image features from the first image and the second image;

A three-dimensional look-up table for each image region of the first image is generated based on the global image features and the local image features.
The video processing apparatus according to claim 11, wherein the look-up table generating unit is configured to:

Perform convolution feature extraction on the first image and the second image respectively;

Perform convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image;

Global image features and local image features are obtained based on the fused convolutional features.
The video processing apparatus according to claim 12, wherein the look-up table generating unit is configured to:

The first image and the second image are respectively input into the convolutional neural network with the self-attention module embedded in the convolution layer, and the convolution of the first image and the second image including the positional relationship and semantic relationship between the image parts is obtained. feature.
The video processing apparatus according to claim 12, wherein the look-up table generating unit is configured to:

calculating the similarity between the convolution feature of the first image and the convolution feature of the second image;

Determine the first fusion weight of the convolution feature of the first image and the second fusion weight of the convolution feature of the second image based on the similarity;

Convolution feature fusion is performed on the convolution feature of the first image and the convolution feature of the second image based on the first fusion weight and the second fusion weight.
The video processing apparatus according to claim 12, wherein the look-up table generating unit is configured to:

The fused convolutional features are input to the fully connected layer and the convolutional layer respectively to obtain global image features and local image features.
The video processing apparatus according to claim 11, wherein the look-up table generating unit is configured to:

Fusion of global image features and local image features to obtain the feature vector of each fused image region;

A three-dimensional look-up table for each image region of the first image is generated according to the fused feature vector of each image region.
The video processing apparatus according to claim 10, wherein the image quality enhancement unit is configured to:

Based on the three-dimensional look-up table of each image area of the first image, the values of the three primary colors of red, green and blue in the first image are converted to obtain an image quality enhanced image of the first image.
The video processing apparatus of claim 10, wherein the image acquisition unit is configured to:

Get two adjacent frames of video;

Low-resolution conversion is performed on the acquired adjacent two frames of images, and the converted two frames of low-resolution images are used as the first image and the second image.
An electronic device comprising:

processor;

a memory for storing the processor-executable instructions;

Wherein, the processor is configured to execute the instructions to implement the video processing method as claimed in any one of claims 1 to 9.
A non-volatile computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor of an electronic device, the electronic device is caused to execute a video processing method, the video processing method comprising:

Acquiring two adjacent frames of images of the video as the first image and the second image, wherein the first image is located after the second image in the video;

generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image;

A quality-enhanced image of the first image is generated based on the three-dimensional lookup table for each image region of the first image.
A computer program product, comprising a computer program/instruction, wherein, when the computer program/instruction is executed by a processor, a video processing method is implemented, the video processing method comprising:

Acquiring two adjacent frames of images of the video as the first image and the second image, wherein the first image is located after the second image in the video;

generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image;

A quality-enhanced image of the first image is generated based on the three-dimensional lookup table for each image region of the first image.