CN113034412A - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
CN113034412A
Authority
CN
China
Prior art keywords
image
features
convolution
feature
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110213511.1A
Other languages
Chinese (zh)
Other versions
CN113034412B (en)
Inventor
刘晶晶
徐宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110213511.1A priority Critical patent/CN113034412B/en
Priority claimed from CN202110213511.1A external-priority patent/CN113034412B/en
Publication of CN113034412A publication Critical patent/CN113034412A/en
Priority to PCT/CN2021/118552 priority patent/WO2022179087A1/en
Application granted granted Critical
Publication of CN113034412B publication Critical patent/CN113034412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80 Camera processing pipelines; Components thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The disclosure relates to a video processing method and device. The video processing method includes: acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video; generating a three-dimensional lookup table for each image region of the first image based on the first image and the second image; and generating a quality-enhanced image of the first image based on the three-dimensional lookup table for each image region of the first image. With the video processing method and device, video quality enhancement can be completed automatically with one click, without any user involvement.

Description

Video processing method and device
Technical Field
The present disclosure relates to the field of video technology. More particularly, the present disclosure relates to a video processing method and apparatus.
Background
With the popularity of smartphones, more and more users shoot videos with their personal phones to record everyday moments of life. The original video quality is often unsatisfactory: for example, the picture is too bright or too dark, or the colors are not saturated enough. Post-processing video quality enhancement can greatly improve the original quality, so many users have a strong demand for it.
However, most existing video quality enhancement methods require the user to manually set various picture parameters. This demands a certain amount of photographic expertise, and manually tuning the quality parameters usually takes considerable effort and time. In addition, manually enhanced video quality rarely reaches a professional level. There is therefore a need for an adaptive method that enhances the original video quality automatically, so that the result is visually improved and more aesthetically pleasing.
Disclosure of Invention
Exemplary embodiments of the present disclosure provide a video processing method and apparatus that solve at least some of the problems of video processing in the related art, although a given embodiment need not solve any particular one of those problems.
According to an exemplary embodiment of the present disclosure, there is provided a video processing method including: acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video; generating a three-dimensional lookup table for each image region of the first image based on the first image and the second image; and generating a quality-enhanced image of the first image based on the three-dimensional lookup table for each image region of the first image.
Optionally, the step of generating a three-dimensional lookup table for each image region of the first image based on the first image and the second image may include: obtaining global image features and local image features from the first image and the second image; and generating a three-dimensional lookup table for each image region of the first image based on the global image features and the local image features.
Optionally, the step of obtaining the global image features and the local image features from the first image and the second image may include: performing convolution feature extraction on the first image and the second image respectively; performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image; and obtaining the global image features and the local image features based on the fused convolution features.
Optionally, the step of performing convolution feature extraction on the first image and the second image respectively may include: inputting the first image and the second image respectively into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolution features of the first image and of the second image, wherein the convolution features include the positional and semantic relationships between image parts.
Optionally, the step of performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image may include: calculating the similarity between the convolution features of the first image and the convolution features of the second image; determining, based on the similarity, a first fusion weight for the convolution features of the first image and a second fusion weight for the convolution features of the second image; and performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image based on the first fusion weight and the second fusion weight.
Optionally, the step of obtaining the global image features and the local image features based on the fused convolution features may include: inputting the fused convolution features into a fully connected layer and a convolutional layer respectively, to obtain the global image features and the local image features.
Optionally, the step of generating a three-dimensional lookup table for each image region of the first image based on the global image features and the local image features may include: fusing the global image features and the local image features to obtain a fused feature vector for each image region; and generating a three-dimensional lookup table for each image region of the first image according to the fused feature vector of that image region.
Optionally, the step of generating the quality-enhanced image of the first image may include: converting the red, green, and blue (RGB) values of the first image based on the three-dimensional lookup table for each image region of the first image, to obtain the quality-enhanced image of the first image.
Optionally, the step of acquiring two adjacent frames of a video as the first image and the second image may include: acquiring two adjacent frames of the video; converting the two acquired frames to low resolution; and taking the two low-resolution frames as the first image and the second image.
According to an exemplary embodiment of the present disclosure, there is provided a video processing apparatus including: an image acquisition unit configured to acquire two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video; a lookup table generation unit configured to generate a three-dimensional lookup table for each image region of the first image based on the first image and the second image; and an image quality enhancement unit configured to generate a quality-enhanced image of the first image based on the three-dimensional lookup table for each image region of the first image.
Optionally, the lookup table generation unit may be configured to: obtain global image features and local image features from the first image and the second image; and generate a three-dimensional lookup table for each image region of the first image based on the global image features and the local image features.
Optionally, the lookup table generation unit may be configured to: perform convolution feature extraction on the first image and the second image respectively; perform convolution feature fusion on the convolution features of the first image and the convolution features of the second image; and obtain the global image features and the local image features based on the fused convolution features.
Optionally, the lookup table generation unit may be configured to: input the first image and the second image respectively into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolution features of the first image and of the second image, wherein the convolution features include the positional and semantic relationships between image parts.
Optionally, the lookup table generation unit may be configured to: calculate the similarity between the convolution features of the first image and the convolution features of the second image; determine, based on the similarity, a first fusion weight for the convolution features of the first image and a second fusion weight for the convolution features of the second image; and perform convolution feature fusion on the convolution features of the first image and the convolution features of the second image based on the first fusion weight and the second fusion weight.
Optionally, the lookup table generation unit may be configured to: input the fused convolution features into a fully connected layer and a convolutional layer respectively, to obtain the global image features and the local image features.
Optionally, the lookup table generation unit may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate a three-dimensional lookup table for each image region of the first image according to the fused feature vector of that image region.
Optionally, the image quality enhancement unit may be configured to: convert the red, green, and blue (RGB) values of the first image based on the three-dimensional lookup table for each image region of the first image, to obtain the quality-enhanced image of the first image.
Optionally, the image acquisition unit may be configured to: acquire two adjacent frames of the video; convert the two acquired frames to low resolution; and take the two low-resolution frames as the first image and the second image.
According to an exemplary embodiment of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement a video processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to execute a video processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement a video processing method according to an exemplary embodiment of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
1. video quality enhancement is completed automatically with one click, without user participation;
2. the method can handle a wide variety of user videos, achieves a good enhancement effect, and is coherent in the time domain.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 illustrates a flow chart of a video processing method according to an exemplary embodiment of the present disclosure.
Fig. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure.
Fig. 2B illustrates an example of a mesh structure according to an exemplary embodiment of the present disclosure.
Fig. 3 illustrates a schematic diagram of fusing global image features and local image features according to an exemplary embodiment of the present disclosure.
Fig. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of an electronic device 500 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Herein, the expression "at least one of several items" covers three parallel cases: "any one of the items", "any combination of several of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In a related video quality enhancement method, a deep neural network can learn, from a large amount of video material, how to convert the original red, green, and blue (hereinafter RGB) pixel values of a video into new RGB values, so as to enhance image quality. This data-driven approach not only takes full advantage of a large corpus of professional videos, but can also adapt the enhancement algorithm to the content the user is currently shooting. In addition, temporal consistency of the enhancement is achieved by fusing the enhancement parameters of the preceding and following frames.
The RGB conversion model in the related art uses piecewise linear functions and converts the original values independently in the R, G, and B color channels. However, the expressive power of a piecewise linear function is inherently limited, and it cannot fully capture the mapping between the original RGB values and the enhanced RGB values. Moreover, because the conversion is done separately on the three color channels, the rich semantics carried by the correlation between R, G, and B are ignored. The scenes and contents of videos shot by users vary widely, and in some scenes this method easily produces insufficient enhancement or visible distortion. Furthermore, the method relies only on local image features to drive the enhancement algorithm and ignores the benefit that global features bring to quality enhancement.
Most leading image quality enhancement methods currently focus on single-image enhancement rather than video enhancement. In another related method, a deep neural network can learn an adaptive three-dimensional look-up table (hereinafter 3D LUT) from a large amount of image data to model the mapping between the original RGB values and the enhanced RGB values. "Adaptive" means that the 3D LUT is adjusted automatically according to the visual characteristics of the image. Compared with traditional 3D LUT enhancement, this approach does not require an expert to manually set the parameters of the 3D LUT, and a more complex 3D LUT can better adapt to diverse input qualities. However, when learning the 3D LUT, the related art uses only the global visual features of the image to drive the enhancement and neglects that different local parts of an image need different enhancement. More critically, this technique cannot be applied directly to video enhancement because temporal continuity of the enhancement cannot be guaranteed.
Hereinafter, a video processing method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 5.
Fig. 1 illustrates a flow chart of a video processing method according to an exemplary embodiment of the present disclosure. Fig. 2A shows a schematic diagram of video processing according to an exemplary embodiment of the present disclosure. Fig. 2B illustrates an example of a mesh structure according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, in step S101, two adjacent frame images of a video are acquired as a first image and a second image. Here, the first image is located after the second image in the video.
In an exemplary embodiment of the present disclosure, when acquiring two adjacent frames of a video as the first image and the second image, the two adjacent frames (for example, but not limited to, the t-th frame and the (t-1)-th frame) may first be acquired, the two acquired frames may then be converted to low resolution, and the two low-resolution frames (for example, but not limited to, the t-th and (t-1)-th low-resolution frames in Fig. 2A) may be taken as the first image and the second image. Here, the low resolution may be, for example, but not limited to, 256 × 256.
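As an illustration only (the patent text itself contains no code), the following Python sketch shows one way step S101 could be realized; the file name "input.mp4", the use of OpenCV, and the fixed 256 × 256 target size are assumptions made for the example.

import cv2

cap = cv2.VideoCapture("input.mp4")
ok_prev, frame_prev = cap.read()      # frame t-1 (the "second image")
ok_curr, frame_curr = cap.read()      # frame t   (the "first image")
if ok_prev and ok_curr:
    low_res_prev = cv2.resize(frame_prev, (256, 256))  # second image, low resolution
    low_res_curr = cv2.resize(frame_curr, (256, 256))  # first image, low resolution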
In step S102, a three-dimensional lookup table for each image region of the first image is generated based on the first image and the second image.
In contrast to the related art, in exemplary embodiments of the present disclosure the adaptive 3D LUT is driven not only by the global visual features of the image but also, optionally, by the local visual and semantic differences within the image. In other words, different image parts may use different enhancement methods. For example, the enhancement can be made locally adaptive by dividing the image into a mesh structure (such as, but not limited to, the mesh shown in Fig. 2B), where each cell of the mesh corresponds to its own local enhancement parameters.
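The following sketch, offered only as an illustration and assuming a 4 × 4 mesh (the patent does not fix a mesh resolution), shows how a frame could be divided into the image regions of such a mesh, each of which later receives its own three-dimensional lookup table.

import numpy as np

def split_into_grid(image: np.ndarray, n_grid: int = 4):
    # Divide an (H, W, C) image into n_grid x n_grid rectangular regions.
    h, w = image.shape[:2]
    cell_h, cell_w = h // n_grid, w // n_grid
    regions = []
    for i in range(n_grid):
        for j in range(n_grid):
            regions.append(image[i * cell_h:(i + 1) * cell_h,
                                 j * cell_w:(j + 1) * cell_w])
    return regions  # one entry per image region of the mesh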
In an exemplary embodiment of the present disclosure, when generating the three-dimensional lookup table for each image region of the first image based on the first image and the second image, global image features and local image features may first be obtained from the first image and the second image, and the three-dimensional lookup table for each image region of the first image may then be generated based on the global image features and the local image features.
Unlike single-image enhancement, video enhancement also requires that picture quality remain stable across successive frames. In an exemplary embodiment of the present disclosure, when obtaining the global image features and the local image features from the first image and the second image, convolution feature extraction may first be performed on the first image and the second image respectively, convolution feature fusion may then be performed on the convolution features of the first image and the convolution features of the second image, and the global image features and the local image features may then be obtained based on the fused convolution features. That is, by fusing the visual features of the preceding and following frames, the enhancement becomes continuous in the time domain.
In an exemplary embodiment of the present disclosure, when performing convolution feature extraction on the first image and the second image, the first image and the second image may be respectively input into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolution features of the first image and of the second image that include the positional and semantic relationships between image parts. A convolutional neural network with a self-attention module embedded in its convolutional layers can capture the positional and semantic relationships between image parts. This has two benefits: 1) the positional relationships make the quality enhancement spatially smooth across the image; 2) the semantic relationships improve the consistency of the local enhancement effects. For example, when blue sky and grass appear in the same image, the enhancement is adjusted automatically to handle that combination. In short, the self-attention module adds constraints on the spatial smoothness and semantic consistency of the enhancement while preserving its local adaptivity.
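The sketch below is a hedged illustration rather than the patent's own network: it shows a small convolutional backbone with a non-local-style self-attention module embedded between convolutional layers, so that the output features carry relationships between image parts. The channel counts, the position of the module, and the residual connection are all assumptions made for the example.

import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)                # (b, h*w, c/8)
        k = self.key(x).flatten(2)                                  # (b, c/8, h*w)
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)   # (b, h*w, h*w): relations between image parts
        v = self.value(x).flatten(2).transpose(1, 2)                # (b, h*w, c)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                              # residual keeps local detail

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
    SelfAttention2d(64),                                            # self-attention embedded in the convolution stack
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
)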
In an exemplary embodiment of the present disclosure, when performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image, the similarity between the convolution features of the first image and the convolution features of the second image may first be calculated; a first fusion weight for the convolution features of the first image and a second fusion weight for the convolution features of the second image may then be determined based on the similarity; and the two sets of convolution features may then be fused based on the first fusion weight and the second fusion weight. Each pixel of a convolution feature map can be regarded as a visual feature vector corresponding to a local part of the image. In an exemplary embodiment of the present disclosure, the visual similarity of the two video frames may be measured by the cosine similarity between two such feature vectors. Because the feature vectors correspond to different image regions, each element of the similarity matrix represents the visual similarity of the corresponding position in the two images. In image regions with high similarity, the convolution features of frame t-1 receive a higher fusion weight; in regions with low similarity, the convolution features of frame t are used as much as possible. When the local scene changes little, this fusion helps keep the enhancement consistent across the video; when the local scene changes significantly, the enhancement strategy adjusts automatically to cope with the visual change.
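A minimal sketch of such similarity-weighted fusion follows, assuming the convolution features are (B, C, H, W) tensors as produced by the illustrative backbone above; mapping the cosine similarity to fusion weights by clamping it to [0, 1] is an assumption, since the patent does not specify the exact weighting function.

import torch
import torch.nn.functional as F

def fuse_features(feat_t: torch.Tensor, feat_t_minus_1: torch.Tensor) -> torch.Tensor:
    # Cosine similarity over the channel dimension: one value per spatial position.
    sim = F.cosine_similarity(feat_t, feat_t_minus_1, dim=1)   # (B, H, W), values in [-1, 1]
    w_prev = sim.clamp(min=0.0).unsqueeze(1)                   # weight for frame t-1 (high where frames look alike)
    w_curr = 1.0 - w_prev                                      # weight for frame t (high where they differ)
    return w_curr * feat_t + w_prev * feat_t_minus_1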
In an exemplary embodiment of the present disclosure, when obtaining the global image features and the local image features based on the fused convolution features, the fused convolution features may be input into a fully connected layer and a convolutional layer respectively to obtain the global image features and the local image features, which improves the accuracy of feature extraction.
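The two branches can be sketched as follows; the pooling before the fully connected layer and the feature dimensions are assumptions made only for this example.

import torch.nn as nn

class GlobalLocalHeads(nn.Module):
    def __init__(self, channels=64, feat_dim=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, feat_dim)                          # global branch (fully connected layer)
        self.local_conv = nn.Conv2d(channels, feat_dim, 3, padding=1)    # local branch (convolutional layer)

    def forward(self, fused):                                            # fused: (B, C, H, W)
        global_feat = self.fc(self.pool(fused).flatten(1))               # (B, feat_dim)
        local_feat = self.local_conv(fused)                              # (B, feat_dim, H, W)
        return global_feat, local_feat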
In an exemplary embodiment of the present disclosure, when generating the three-dimensional lookup table for each image region of the first image based on the global image features and the local image features, the global image features and the local image features may first be fused to obtain a fused feature vector for each image region, and the three-dimensional lookup table for each image region of the first image may then be generated from that fused feature vector, which improves the accuracy of the lookup table. Fig. 3 illustrates the fusion of global image features and local image features according to an exemplary embodiment of the present disclosure. As shown in Fig. 3, the fully connected layer and the convolutional layer may first be used to obtain the global image features and the local image features from the fused convolution features, and the global feature vector may then be accumulated element by element with each local feature. In this setting, the global image features capture overall visual characteristics of the image, such as brightness, saturation, and scene, while the local image features fine-tune the global features according to the semantic information, spatial position, and so on of each local part of the image. The resulting fused features are adaptive both globally and locally, so they can cope with diverse user videos and achieve a better quality enhancement effect.
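The element-by-element accumulation of Fig. 3 can be sketched as below; pooling the local feature map down to the mesh resolution before the addition is an assumption of the example, not a detail stated in the patent.

import torch.nn.functional as F

def fuse_global_local(global_feat, local_feat, n_grid=4):
    # global_feat: (B, D); local_feat: (B, D, H, W) -> one D-dimensional vector per mesh cell
    local_cells = F.adaptive_avg_pool2d(local_feat, n_grid)              # (B, D, n_grid, n_grid)
    fused = local_cells + global_feat.unsqueeze(-1).unsqueeze(-1)        # element-wise add, broadcast over the mesh
    return fused                                                          # per-region feature vectors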
In an exemplary embodiment of the present disclosure, each image region of the first image produces a feature vector that drives the generation of a three-dimensional lookup table, and different three-dimensional lookup tables produce different image quality enhancement effects.
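One possible, hedged realization of driving a 3D LUT from a feature vector is to predict mixing weights over a small bank of learnable basis LUTs, as sketched below; this basis-LUT formulation is a common choice for adaptive 3D LUTs but is an assumption here, since the patent does not prescribe the exact mechanism.

import torch
import torch.nn as nn

class LUTGenerator(nn.Module):
    def __init__(self, feat_dim=64, n_basis=3, lut_size=33):
        super().__init__()
        self.weights = nn.Linear(feat_dim, n_basis)
        # n_basis learnable 3D LUTs, each mapping an (r, g, b) bin to an output (r, g, b) color
        self.basis = nn.Parameter(torch.randn(n_basis, 3, lut_size, lut_size, lut_size) * 0.01)

    def forward(self, region_feat):                                   # region_feat: (B, feat_dim)
        w = torch.softmax(self.weights(region_feat), dim=-1)          # (B, n_basis) mixing weights
        return torch.einsum("bn,ncdhw->bcdhw", w, self.basis)         # (B, 3, S, S, S): one 3D LUT per region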
In step S103, a quality-enhanced image of the first image is generated based on the three-dimensional lookup table for each image region of the first image. By introducing the adaptive three-dimensional lookup table into the video quality enhancement framework, video quality enhancement can be completed automatically with one click, without user involvement.
In an exemplary embodiment of the present disclosure, when generating the quality-enhanced image of the first image, the three-dimensional lookup table for each image region of the first image may be used to convert the red, green, and blue values of the first image, to obtain the quality-enhanced image of the first image. That is, as shown in Fig. 2A, the RGB values in frame t of the original input video are converted based on the three-dimensional lookup table corresponding to each image pixel position, generating the quality-enhanced image of frame t.
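The lookup itself can be sketched as below; using grid_sample for trilinear interpolation and the (x, y, z) ordering of the color axes are implementation assumptions rather than details taken from the patent.

import torch
import torch.nn.functional as F

def apply_lut(frame_rgb: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    # frame_rgb: (B, 3, H, W) with values in [0, 1]; lut: (B, 3, S, S, S)
    grid = frame_rgb.permute(0, 2, 3, 1) * 2.0 - 1.0           # (B, H, W, 3), scaled to [-1, 1]
    # flip so x, y, z index the LUT's W, H, D axes; this ordering must match how the LUT was built
    grid = grid.flip(-1).unsqueeze(1)                           # (B, 1, H, W, 3)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)  # trilinear interpolation on a 5D input
    return out.squeeze(2)                                       # (B, 3, H, W) converted RGB values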
In one exemplary embodiment, as shown in Fig. 2A, the t-th frame image and the (t-1)-th frame image of a video are first acquired and converted into low-resolution images, and the two low-resolution images are respectively input into a convolutional neural network (CNN) capable of capturing the positional and semantic relationships between image parts. Convolution feature extraction is performed by the CNN to obtain the convolution features of the t-th frame image and of the (t-1)-th frame image. A similarity matrix between the two sets of convolution features is calculated, and the convolution features of the two frames are fused based on the similarity matrix. Then, based on the fused convolution features, the global and local image features are obtained using the fully connected layer and the convolutional layer respectively, and the global feature vector is accumulated element by element with each local feature to obtain the feature vector of the image region corresponding to each image pixel position. The feature vector of each image region drives the generation of a three-dimensional look-up table (3D LUT) for that region. Finally, the RGB values in frame t of the original input video are converted based on the 3D LUT corresponding to each image pixel position, generating the quality-enhanced image of frame t.
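Tying the illustrative pieces above together, a hedged end-to-end pass for frame t might look as follows; all names reuse the example sketches above rather than anything defined in the patent, and processing the mesh cells with hard boundaries (no smoothing across cell edges) is a deliberate simplification.

import torch.nn.functional as F

def enhance_frame(frame_t, frame_t_minus_1, backbone, heads, lut_gen, n_grid=4):
    # frame_t, frame_t_minus_1: (1, 3, H, W) tensors with values in [0, 1]
    small_t = F.interpolate(frame_t, size=(256, 256), mode="bilinear", align_corners=False)
    small_p = F.interpolate(frame_t_minus_1, size=(256, 256), mode="bilinear", align_corners=False)
    fused = fuse_features(backbone(small_t), backbone(small_p))   # similarity-weighted fusion of frames t and t-1
    g_feat, l_feat = heads(fused)                                 # global / local image features
    region_feats = fuse_global_local(g_feat, l_feat, n_grid)      # (1, D, n_grid, n_grid)

    out = frame_t.clone()
    _, _, H, W = frame_t.shape
    ch, cw = H // n_grid, W // n_grid
    for i in range(n_grid):
        for j in range(n_grid):
            lut = lut_gen(region_feats[:, :, i, j])               # one 3D LUT for this mesh cell
            patch = frame_t[:, :, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            out[:, :, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw] = apply_lut(patch, lut)
    return out                                                    # quality-enhanced frame t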
The video processing method according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 1 to 3. Hereinafter, a video processing apparatus and units thereof according to an exemplary embodiment of the present disclosure will be described with reference to fig. 4.
Fig. 4 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, the video processing apparatus includes an image acquisition unit 41, a lookup table generation unit 42, and an image quality enhancement unit 43.
The image acquisition unit 41 is configured to acquire two adjacent frame images of a video as a first image and a second image. Here, the first image is located after the second image in the video.
In an exemplary embodiment of the present disclosure, the image acquisition unit 41 may be configured to: acquire two adjacent frames of the video; convert the two acquired frames to low resolution; and take the two low-resolution frames as the first image and the second image.
The lookup table generation unit 42 is configured to generate a three-dimensional lookup table for each image region of the first image based on the first image and the second image.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: obtain global image features and local image features from the first image and the second image; and generate a three-dimensional lookup table for each image region of the first image based on the global image features and the local image features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: perform convolution feature extraction on the first image and the second image respectively; perform convolution feature fusion on the convolution features of the first image and the convolution features of the second image; and obtain the global image features and the local image features based on the fused convolution features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: input the first image and the second image respectively into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolution features of the first image and of the second image, wherein the convolution features include the positional and semantic relationships between image parts.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: calculate the similarity between the convolution features of the first image and the convolution features of the second image; determine, based on the similarity, a first fusion weight for the convolution features of the first image and a second fusion weight for the convolution features of the second image; and perform convolution feature fusion on the convolution features of the first image and the convolution features of the second image based on the first fusion weight and the second fusion weight.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: input the fused convolution features into a fully connected layer and a convolutional layer respectively, to obtain the global image features and the local image features.
In an exemplary embodiment of the present disclosure, the lookup table generation unit 42 may be configured to: fuse the global image features and the local image features to obtain a fused feature vector for each image region; and generate a three-dimensional lookup table for each image region of the first image according to the fused feature vector of that image region.
The image quality enhancement unit 43 is configured to generate a quality-enhanced image of the first image based on the three-dimensional lookup table for each image region of the first image.
In an exemplary embodiment of the present disclosure, the image quality enhancement unit 43 may be configured to: convert the red, green, and blue (RGB) values of the first image based on the three-dimensional lookup table for each image region of the first image, to obtain the quality-enhanced image of the first image.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The video processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 4. Next, an electronic device according to an exemplary embodiment of the present disclosure is described with reference to fig. 5.
Fig. 5 is a block diagram of an electronic device 500 according to an example embodiment of the present disclosure.
Referring to fig. 5, an electronic device 500 includes at least one memory 501 and at least one processor 502, the at least one memory 501 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 502, perform a method of video processing according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 500 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device 500 need not be a single electronic device; it can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) individually or jointly. The electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).
In the electronic device 500, the processor 502 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 502 may execute instructions or code stored in the memory 501, wherein the memory 501 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 501 may be integrated with the processor 502, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 501 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 501 and the processor 502 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 502 is able to read files stored in the memory.
In addition, the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
There is also provided, in accordance with an example embodiment of the present disclosure, a computer-readable storage medium, such as the memory 501, comprising instructions executable by the processor 502 of the electronic device 500 to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, which comprises computer programs/instructions, which when executed by a processor, implement the method of video processing according to an exemplary embodiment of the present disclosure.
The video processing method and apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 5. However, it should be understood that: the video processing apparatus and its units shown in fig. 4 may be respectively configured as software, hardware, firmware, or any combination thereof to perform a specific function, the electronic device shown in fig. 5 is not limited to including the above-shown components, but some components may be added or deleted as needed, and the above components may also be combined.
According to the video processing method and apparatus of the present disclosure, two adjacent frames of a video are acquired as a first image and a second image, a three-dimensional lookup table is generated for each image region of the first image based on the first image and the second image, and a quality-enhanced image of the first image is generated based on the three-dimensional lookup table for each image region, so that video quality enhancement is completed automatically with one click and without user participation.
In addition, according to the video processing method and apparatus of the present disclosure, embedding a self-attention module in a convolutional layer of the neural network adds constraints on the spatial smoothness of the image and the semantic consistency of the enhancement while preserving the local adaptivity of the enhancement.
In addition, according to the video processing method and apparatus of the present disclosure, global and local adaptivity is obtained through global/local feature fusion, so that diverse user videos can be handled and a better quality enhancement effect can be obtained.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A video processing method, comprising:
acquiring two adjacent frames of a video as a first image and a second image, wherein the first image follows the second image in the video;
generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image; and
generating a quality-enhanced image of the first image based on the three-dimensional look-up table for each image region of the first image.
2. The video processing method of claim 1, wherein the step of generating a three-dimensional look-up table for each image region of the first image based on the first image and the second image comprises:
obtaining global image features and local image features from the first image and the second image;
generating a three-dimensional look-up table for each image region of the first image based on the global image features and the local image features.
3. The video processing method of claim 2, wherein the step of obtaining global image features and local image features from the first image and the second image comprises:
performing convolution feature extraction on the first image and the second image respectively;
performing convolution feature fusion on the convolution feature of the first image and the convolution feature of the second image;
and obtaining the global image features and the local image features based on the fused convolution features.
4. The video processing method of claim 3, wherein the step of performing convolution feature extraction on the first image and the second image respectively comprises:
inputting the first image and the second image respectively into a convolutional neural network with a self-attention module embedded in its convolutional layers, to obtain convolution features of the first image and of the second image, wherein the convolution features include the positional and semantic relationships between image parts.
5. The video processing method of claim 3, wherein the step of performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image comprises:
calculating the similarity between the convolution features of the first image and the convolution features of the second image;
determining, based on the similarity, a first fusion weight for the convolution features of the first image and a second fusion weight for the convolution features of the second image; and
performing convolution feature fusion on the convolution features of the first image and the convolution features of the second image based on the first fusion weight and the second fusion weight.
6. The video processing method according to claim 3, wherein the step of obtaining the global image features and the local image features based on the fused convolution features comprises:
inputting the fused convolution features into a fully connected layer and a convolutional layer respectively, to obtain the global image features and the local image features.
7. The video processing method of claim 2, wherein the step of generating a three-dimensional look-up table for each image region of the first image based on the global image features and the local image features comprises:
fusing the global image features and the local image features to obtain a fused feature vector for each image region; and
generating a three-dimensional look-up table for each image region of the first image according to the fused feature vector of that image region.
8. A video processing apparatus, comprising:
an image acquisition unit configured to acquire two adjacent frames of images of a video as a first image and a second image, wherein the first image is positioned after the second image in the video;
a lookup table generation unit configured to generate a three-dimensional lookup table for each image region of the first image based on the first image and the second image; and
an image quality enhancement unit configured to generate a quality-enhanced image of the first image based on the three-dimensional lookup table for each image region of the first image.
9. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the video processing method of any of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, which when executed by a processor of an electronic device causes the electronic device to perform the video processing method of any one of claims 1 to 7.
CN202110213511.1A 2021-02-25 2021-02-25 Video processing method and device Active CN113034412B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110213511.1A CN113034412B (en) 2021-02-25 Video processing method and device
PCT/CN2021/118552 WO2022179087A1 (en) 2021-02-25 2021-09-15 Video processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110213511.1A CN113034412B (en) 2021-02-25 Video processing method and device

Publications (2)

Publication Number Publication Date
CN113034412A true CN113034412A (en) 2021-06-25
CN113034412B CN113034412B (en) 2024-04-19



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1578400A (en) * 2003-07-15 2005-02-09 Samsung Electronics Co Ltd Apparatus and method for adaptive processing of a video signal based on noise state
WO2014001610A1 (en) * 2012-06-25 2014-01-03 Nokia Corporation Method, apparatus and computer program product for human-face features extraction
CN106550244A (en) * 2015-09-16 2017-03-29 广州市动景计算机科技有限公司 Picture quality enhancement method and device for video images
WO2018187622A1 (en) * 2017-04-05 2018-10-11 Lyrical Labs Holdings, Llc Video processing and encoding
CN107403413A (en) * 2017-04-14 2017-11-28 杭州当虹科技有限公司 A video multi-frame denoising and enhancement method
CN108416288A (en) * 2018-03-04 2018-08-17 南京理工大学 First-person-view interactive action recognition method based on global and local network fusion
CN110858388A (en) * 2018-08-23 2020-03-03 网宿科技股份有限公司 Method and device for enhancing video image quality
CN109934776A (en) * 2018-12-25 2019-06-25 北京奇艺世纪科技有限公司 Model generating method, video enhancement method, device and computer readable storage medium
CN109978777A (en) * 2019-02-01 2019-07-05 深圳锐取信息技术股份有限公司 Brightness of image adjusting method and device
CN110278415A (en) * 2019-07-02 2019-09-24 浙江大学 A network camera video quality improvement method
CN110996174A (en) * 2019-12-19 2020-04-10 深圳市迅雷网络技术有限公司 Video image quality enhancement method and related equipment thereof
CN111274892A (en) * 2020-01-14 2020-06-12 北京科技大学 Robust remote sensing image change detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Yi; Wang Dongsheng; Song Wenjie; Fu Mengyin: "Dynamic video stitching method fusing image semantics", Systems Engineering and Electronics, no. 12 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022179087A1 (en) * 2021-02-25 2022-09-01 Beijing Dajia Internet Information Technology Co Ltd Video processing method and apparatus
CN117082309A (en) * 2023-07-24 2023-11-17 北京天译科技有限公司 Meteorological service short video processing method and platform system based on artificial intelligence

Also Published As

Publication number Publication date
WO2022179087A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
Li et al. Low-light image and video enhancement using deep learning: A survey
CN115242992B (en) Video processing method, device, electronic equipment and storage medium
KR20170017911A (en) Methods and systems for color processing of digital images
WO2022227308A1 (en) Image processing method and apparatus, device, and medium
CN111652830A (en) Image processing method and device, computer readable medium and terminal equipment
CN113297937B (en) Image processing method, device, equipment and medium
CN114640783B (en) Photographing method and related equipment
CN113518185A (en) Video conversion processing method and device, computer readable medium and electronic equipment
WO2010118658A1 (en) Video processing method and device
CN110807735A (en) Image processing method, image processing device, terminal equipment and computer readable storage medium
Zhao et al. End-to-end denoising of dark burst images using recurrent fully convolutional networks
RU2770748C1 (en) Method and apparatus for image processing, device and data carrier
Liu et al. Progressive complex illumination image appearance transfer based on CNN
Eilertsen The high dynamic range imaging pipeline
CN110838088B (en) Multi-frame noise reduction method and device based on deep learning and terminal equipment
Liu et al. Color enhancement using global parameters and local features learning
Shao et al. Hybrid conditional deep inverse tone mapping
WO2024067461A1 (en) Image processing method and apparatus, and computer device and storage medium
US20170163852A1 (en) Method and electronic device for dynamically adjusting gamma parameter
CN113034412B (en) Video processing method and device
CN113034412A (en) Video processing method and device
WO2022052820A1 (en) Data processing method, system, and apparatus
CN111383289A (en) Image processing method, image processing device, terminal equipment and computer readable storage medium
CN111338627B (en) Front-end webpage theme color adjustment method and device
CN113628121B (en) Method and device for processing and training multimedia data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant