WO2022268181A1 - Video enhancement processing methods and apparatus, electronic device and storage medium - Google Patents

Video enhancement processing methods and apparatus, electronic device and storage medium

Info

Publication number
WO2022268181A1
WO2022268181A1 (PCT/CN2022/100898)
Authority
WO
WIPO (PCT)
Prior art keywords
image
enhanced
video
network
enhancement
Prior art date
Application number
PCT/CN2022/100898
Other languages
French (fr)
Chinese (zh)
Inventor
王学嘉
崔文学
刘天鸿
姜峰
刘绍辉
赵德斌
吴钊
吴平
高莹
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
哈尔滨工业大学 (Harbin Institute of Technology)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation) and 哈尔滨工业大学 (Harbin Institute of Technology)
Publication of WO2022268181A1 publication Critical patent/WO2022268181A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of image processing, and in particular to a video enhancement processing method, device, electronic equipment and storage medium.
  • Video application scenarios have become more flexible and diverse, and the range of video resolutions has gradually increased, which means higher requirements for video compression quality.
  • Compressed video has problems of distortion and compression noise, and the compressed and restored video has different degrees of quality loss. How to reduce these quality losses and improve video quality has become an important field of video processing.
  • the video compression coding standards H.265/HEVC and H.266/VVC mainly use in-loop filtering technology for the post-processing of compressed images, which includes the deblocking filter (DF), sample adaptive offset (SAO) and the adaptive loop filter (ALF).
  • deblocking filtering includes two stages: a filtering decision and a filtering operation; SAO divides the reconstructed pixels into categories by selecting an appropriate classifier, and then applies different compensation values to different categories of pixels; ALF selects appropriate filter coefficients according to directionality and activity computed from gradients.
  • image enhancement processing based on deep learning mainly relies on prior knowledge acquired from an external training set, and there is room for improvement in the enhancement of video quality.
  • the main purpose of the embodiments of the present application is to provide a video enhancement processing method, device, electronic equipment and storage medium, which aims to improve the display quality of video compressed and reconstructed images and enhance the viewing effect of users.
  • An embodiment of the present application provides a video enhancement processing method, which includes the following steps: determining an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides another video enhancement processing method, which includes the following steps: determining an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and transmitting the spatio-temporal feature map and the preset feature enhancement network.
  • the embodiment of the present application also provides another video enhancement processing method, which includes the following steps: receiving a spatio-temporal feature map and a preset feature enhancement network; processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • An embodiment of the present application provides a video enhancement processing device, which includes: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, used to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; a feature enhancement module, used to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and an enhanced image module, used to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides another video enhancement processing device, which includes: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, used to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and an encoding and sending module, used to transmit the spatio-temporal feature map and the preset feature enhancement network.
  • the embodiment of the present application also provides another video enhancement processing device, which includes: a decoding and receiving module, used to receive a spatio-temporal feature map and a preset feature enhancement network; a feature enhancement module, used to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and an enhanced image module, used to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides an electronic device, which includes: one or more processors; and a memory, used to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video enhancement processing method described in any one of the embodiments of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the video enhancement processing method as described in any one of the embodiments of the present application is implemented.
  • In the present application, the enhanced auxiliary image of the image to be enhanced is determined, the image to be enhanced and the enhanced auxiliary image are processed by the preset feature extraction network to obtain the spatio-temporal feature map, the spatio-temporal feature map is processed by the preset feature enhancement network to generate a superimposed image, and the image to be enhanced is processed according to the superimposed image to generate a video-enhanced image; the display quality of the image is thereby improved based on the spatio-temporal characteristics of the video reconstructed images, improving the display effect of the video and enhancing the viewing experience of the user.
  • FIG. 1 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 2a is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 2b is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 2c is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a convolution residual block provided by an embodiment of the present application.
  • FIG. 5 is a transmission example diagram of a network model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of partial region image processing provided by an embodiment of the present application.
  • FIG. 7 is a block diagram of video enhancement processing provided by an embodiment of the present application.
  • FIG. 8 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 9 is an example diagram of a feature extraction process provided by an embodiment of the present application.
  • FIG. 10 is an example diagram of a three-dimensional deformable convolutional network provided by an embodiment of the present application.
  • FIG. 11 is an example diagram of a feature enhancement process provided by an embodiment of the present application.
  • FIG. 12 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 13 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 14 is an example diagram of a video enhancement process provided by an embodiment of the present application.
  • FIG. 15 is an example diagram of another video enhancement process provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 1, the method provided by the embodiment of the present application includes the following steps:
  • Step 110: Determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • the image to be enhanced may be an image that needs to be enhanced for screen display effect
  • the image may be an image generated after video decoding, and the image has loss compared with the video image before compression
  • the enhanced auxiliary image may assist the image to be enhanced for display
  • the enhanced auxiliary image can be associated with the image to be enhanced in time and space.
  • the enhanced auxiliary image can be the previous frame or the next frame of the image to be enhanced on the video timeline.
  • the enhanced auxiliary image can also be an image associated with the image to be enhanced, for example, one that contains the same objects or whose dimensions are in proportion to those of the image to be enhanced.
  • the reconstructed image can refer to the image generated by decoding the video data that was produced by compressing and transforming the original video image.
  • the reconstructed image has compression distortion characteristics.
  • the reconstructed image can be used as a reference image for inter-frame coding, or it can be generated by video decoding.
  • one or more frames of reconstructed images may be selected from the reconstructed images generated by decoding compressed video data as the enhanced auxiliary images based on the image to be enhanced. It can be understood that the image to be enhanced and the enhanced auxiliary image are related in time and space.
  • Referring to FIG. 2a, the image to be enhanced is the reconstructed image at time t, and two frames of reconstructed images at times t-2 and t-1 before it and two frames at times t+1 and t+2 after it can be selected as enhanced auxiliary images. Referring to FIG. 2b, the image to be enhanced is the reconstructed image at time t, and, at an interval of one frame, two frames of reconstructed images at times t-4 and t-2 before it and two frames at times t+2 and t+4 after it are taken as enhanced auxiliary images. Or, as shown in FIG. 2c, the current frame is the image to be enhanced, and the reconstructed images of the two I-frames before and after the current frame may be selected as enhanced auxiliary images, where an I-frame is an intra-coded frame.
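  • As an illustration of the selection patterns in FIGS. 2a and 2b, the following is a minimal sketch; the function name and the interval parameter are illustrative and not part of this application:

```python
# A minimal sketch of auxiliary-frame selection around the image to be enhanced
# at time t; "interval" is an illustrative parameter, not the patent's term.
def select_auxiliary_indices(t, num_before=2, num_after=2, interval=1):
    before = [t - interval * k for k in range(num_before, 0, -1)]
    after = [t + interval * k for k in range(1, num_after + 1)]
    return [i for i in before + after if i >= 0]

print(select_auxiliary_indices(10))              # FIG. 2a pattern: [8, 9, 11, 12]
print(select_auxiliary_indices(10, interval=2))  # FIG. 2b pattern: [6, 8, 12, 14]
```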
  • Step 120: Determine the spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and the preset feature extraction network.
  • the preset feature extraction network can be a pre-trained neural network, which can be used to extract the spatiotemporal features between the image to be enhanced and the enhanced auxiliary image
  • the preset feature extraction network can be a deformable convolutional neural network whose input can be three-dimensional.
  • the preset feature extraction network can be generated by training on a large number of reconstructed images.
  • the image to be enhanced and the enhanced auxiliary image can be input into the preset feature extraction network, and the spatio-temporal feature map of the image to be enhanced and the enhanced auxiliary image can be determined through the processing of the preset feature extraction network, where the spatio-temporal feature map can be the result output by the preset feature extraction network and can reflect, in the form of a map, the spatio-temporal feature relationship between the image to be enhanced and the enhanced auxiliary image.
  • the spatio-temporal feature correlation can include the data representation of the feature in the image or the degree of pixel change.
  • Step 130: Process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
  • the preset feature enhancement network may be a neural network model for processing spatiotemporal feature maps
  • the preset feature enhancement network may be a convolutional neural network
  • the preset feature enhancement network may be generated by training on a large number of feature maps containing spatio-temporal features.
  • the result output by the preset feature enhancement network can be a two-dimensional image, which can be used to enhance the display effect of the image to be enhanced.
  • the two-dimensional image can include information corresponding to spatio-temporal features and/or intra-frame features.
  • after being trained on a large number of feature maps, the preset feature enhancement network can generate a superimposed image from one or more items of spatio-temporal feature data included in the spatio-temporal feature map.
  • the superimposed image can include the information that needs to be supplemented at each position in the image to be enhanced, and the information can include luminance values, chrominance values, color values, etc.
  • the spatio-temporal feature map can be input into the preset feature enhancement network and processed by it, converting the spatio-temporal feature map into a superimposed image that supplements the image to be enhanced; the information contained in the superimposed image can be used to supplement the image to be enhanced, so as to enhance the display effect of the reconstructed image.
  • Step 140: Process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the superimposed image can be used to enhance the display effect of the image to be enhanced.
  • pixel values in the superimposed image, such as luminance or chrominance information, can be extracted, and the corresponding area in the image to be enhanced can be enhanced according to the average of those pixel values, for example by increasing or decreasing the corresponding pixel values by that average; alternatively, the superimposed image can be directly superimposed on the image to be enhanced, increasing or decreasing each position in the image to be enhanced by the pixel value at the corresponding position of the superimposed image, and the resulting image is used as the video-enhanced image.
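  • A minimal sketch of this superposition step, assuming the superimposed image is a residual with the same shape as the image to be enhanced and pixel values normalized to [0, 1]; the names are illustrative:

```python
import torch

def apply_overlay(to_enhance: torch.Tensor, overlay: torch.Tensor) -> torch.Tensor:
    # Add the predicted residual at every position, then clip to the valid range.
    return (to_enhance + overlay).clamp(0.0, 1.0)
```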
  • In the present application, the enhanced auxiliary image of the image to be enhanced is determined, the image to be enhanced and the enhanced auxiliary image are processed by the preset feature extraction network to obtain the spatio-temporal feature map, the spatio-temporal feature map is processed by the preset feature enhancement network to generate a superimposed image, and the image to be enhanced is processed according to the superimposed image to generate a video-enhanced image; the display quality of the image is thereby improved based on the spatio-temporal characteristics of the video reconstructed images, improving the display effect of the video and enhancing the viewing experience of the user.
  • the determination of the enhanced auxiliary image of the image to be enhanced includes: obtaining, in time sequence, a threshold number of reconstructed images respectively before and/or after the image to be enhanced from the reconstructed image set generated by video decoding, and using them as enhanced auxiliary images, where the set of reconstructed images includes at least two frames of reconstructed images.
  • the time sequence may be the playback time sequence of the video corresponding to the reconstructed image
  • the threshold number may be the number of frames for extracting the reconstructed image
  • the threshold numbers before and after the image to be enhanced may be the same or different. For example, 2 frames of reconstructed images may be extracted before the image to be enhanced as enhanced auxiliary images, and 3 frames of reconstructed images may be extracted after the image to be enhanced as enhanced auxiliary images.
  • a threshold number of reconstructed images can be extracted from the reconstructed images before the image to be enhanced as enhanced auxiliary images, and a threshold number of reconstructed images can likewise be extracted from the reconstructed images after the image to be enhanced as enhanced auxiliary images.
  • the processing of the image to be enhanced according to the superimposed image to generate a video enhanced image includes:
  • the superimposed image is superimposed on the image to be enhanced, and the resulting image is used as the video-enhanced image.
  • the superimposed image can be superimposed on the image to be enhanced by adding or subtracting the pixel value of the corresponding position of the superimposed image at each position in the image to be enhanced, thereby realizing the processing of the image to be enhanced, and the processed image is used as the video-enhanced image.
  • the preset feature extraction network includes at least one 3D deformable convolution residual block, and the 3D deformable convolution residual block includes at least a 3D deformable convolution layer and an activation function.
  • the preset feature extraction network may be a three-dimensional convolutional neural network, which may be composed of one or more convolutional residual blocks, each of which may include at least a 3D deformable convolutional layer and an activation function.
  • the preset feature extraction network may be composed of multiple three-dimensional deformable convolution residual blocks.
  • FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application.
  • each 3D deformable convolution residual block can be as shown in FIG. 3: the input (the image to be enhanced and the enhanced auxiliary image) passes through a 3D deformable convolution layer, an activation function, and another 3D deformable convolution layer, is superimposed with itself, and is then output; the output result can serve as the input data of the next 3D deformable convolution residual block in the preset feature extraction network.
  • the activation function can include LReLU activation function, sigmoid function, tanh function, etc.
  • the number of three-dimensional deformable convolution residual blocks in the preset feature extraction network can be N; as N increases, the parameter complexity of the entire network increases significantly, and the network training and computation time also increase.
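  • The following is a minimal PyTorch sketch of one residual block of FIG. 3. PyTorch and torchvision provide no built-in 3D deformable convolution, so nn.Conv3d stands in for the DCN3D layers; a real 3D deformable convolution would keep the same topology:

```python
import torch.nn as nn

class DeformableConvResBlock3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # stand-in for DCN3D
        self.act = nn.LeakyReLU(0.1, inplace=True)                            # LReLU activation
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # stand-in for DCN3D

    def forward(self, x):
        # DCN3D -> activation -> DCN3D, then superimpose with the input itself (residual)
        return x + self.conv2(self.act(self.conv1(x)))
```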
  • the preset feature enhancement network includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
  • the preset feature enhancement network can be a pre-trained convolutional neural network, which can include convolutional layers and activation function layers; the spatio-temporal feature map can pass through a two-dimensional convolutional layer, an activation function, and another two-dimensional convolutional layer, and then be superimposed with itself to form a residual, so as to enhance the salience of the spatio-temporal features in the spatio-temporal feature map.
  • FIG. 4 is a schematic diagram of a convolutional residual block structure provided by an embodiment of the present application.
  • the spatio-temporal feature map can pass through a two-dimensional convolutional layer, an activation function, and another two-dimensional convolutional layer, and then be superimposed with itself to form a residual, where the two-dimensional convolution can be a two-dimensional deformable convolutional network (Deformable Convolutional Networks, DCN) or a two-dimensional convolutional neural network (Convolutional Neural Networks, CNN).
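  • A minimal PyTorch sketch of the convolutional residual block of FIG. 4, using plain nn.Conv2d; per the text above, a 2D deformable convolution could be substituted:

```python
import torch.nn as nn

class ConvResBlock2D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # conv -> activation -> conv, then add the input to form the residual
        return x + self.conv2(self.act(self.conv1(x)))
```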
  • the network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in the code stream and/or the transport layer.
  • the network model can be the organizational structure of the preset feature enhancement network and the preset feature extraction network, which can also be called the network structure, model representation or network topology; it can include the number of convolutional layers, the number of pooling layers, the connection relationships between convolutional layers and pooling layers, and so on. The network parameters can include the specific weight coefficients and biases of the convolutional layers, pooling layers and activation functions in the network.
  • the preset feature enhancement network and the preset feature extraction network can be transmitted in the code stream and/or at the transport layer. For example, the encoding end can write the network models and network parameters of the preset feature enhancement network and the preset feature extraction network into the code stream and send the code stream to the decoding end; the encoding end can also send the network models and network parameters to a server through the transport layer, then send the identification number of the preset feature enhancement network or the preset feature extraction network to the decoding end through the code stream, and the decoding end requests the network models and network parameters of the preset feature enhancement network and the preset feature extraction network from the server according to the identification number.
  • the network model and the network parameters are located in at least one of the following: video code stream, supplementary enhancement information of the video code stream, video application information, system layer media attribute description unit, and media track.
  • the preset feature extraction network and the preset feature enhancement network can be composed of network models and network parameters, which can be transmitted through one or more types of information among the video code stream, the supplemental enhancement information, the video usability information, the system layer media attribute description unit and the media track.
  • the network model used by the preset feature extraction network and the preset feature enhancement network describes the organizational structure of the network and is designed before training; it can also be called the network structure, the model representation, or the network topology.
  • Network parameters are obtained during network model training, including but not limited to weights and biases.
  • the network model and network parameters can be written into the video code stream at the encoding end and sent to the decoding end together with the video code stream, or they can be transmitted separately out-of-band.
  • One organizational relationship for network models can be of the form adopted by PyTorch, as follows:
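  • The original listing is not reproduced here; the sketch below, which reuses the DeformableConvResBlock3D class from the FIG. 3 sketch above, only illustrates the kind of PyTorch-style organization meant. N, the channel counts and the file name are illustrative:

```python
import torch
import torch.nn as nn

class FeatureExtractionNet(nn.Module):
    def __init__(self, in_channels=1, channels=64, num_blocks=4):
        super().__init__()
        self.head = nn.Conv3d(in_channels, channels, 3, padding=1)   # low- to high-order features
        self.blocks = nn.Sequential(
            *[DeformableConvResBlock3D(channels) for _ in range(num_blocks)])  # N residual blocks
        self.bottleneck = nn.Conv3d(channels, channels, 1)            # fuses features into the map

    def forward(self, frames):  # frames: (batch, C, T, H, W)
        return self.bottleneck(self.blocks(self.head(frames)))

# Network parameters can then be stored or transmitted in PyTorch's .pth format:
model = FeatureExtractionNet()
torch.save(model.state_dict(), "fext_nn_parameters.pth")
```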
  • the network parameters can be transmitted or stored in the .pth format of PyTorch.
  • the network model and network parameters can also adopt other formats, such as NNEF (Neural Network Exchange Format), ONNX (Open Neural Network Exchange), TensorFlow format, Caffe format, etc.
  • when the network model and network parameters are written into the video code stream, they can be written into the Supplemental Enhancement Information (SEI) of the video code stream, for example using the structure shown in Table 1.
  • the network model and network parameters can also be written into the video usability information (Video Usability Information, VUI) in the video code stream.
  • when the network model and network parameters are written to the transport layer, they can be written into the system layer media attribute description unit, such as a descriptor of the transport stream, a data unit of the file format (for example, in a Box), or the media description information of the transport stream, such as the Media Presentation Description (MPD).
  • the network model and network parameters for feature extraction and the network model and network parameters for feature enhancement are stored in different media tracks, and different types of sample entries (for example, identified by four-character codes) are defined to distinguish the stored data types, such as network models and network parameters; moreover, indication information for feature extraction and feature enhancement is given in the sample entry.
  • the specific network model and network parameters are stored in the samples of the media track.
  • the indication information in the media track is implemented as follows:
  • feature_extraction_flag indicates whether feature extraction network information is included: 1, included; 0, not included.
  • feature_enhancement_flag indicates whether feature enhancement network information is included: 1, included; 0, not included.
  • fext_nn_model_flag indicates whether the feature extraction network model is included: 1, included; 0, not included.
  • fext_nn_parameter_flag indicates whether the feature extraction network parameters are included: 1, included; 0, not included.
  • fenh_nn_model_flag indicates whether the feature enhancement network model is included: 1, included; 0, not included.
  • fenh_nn_parameter_flag indicates whether the feature enhancement network parameters are included: 1, included; 0, not included.
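  • A minimal sketch of packing the six indication flags above into one byte of a sample entry; the bit layout and the two reserved bits are assumptions, not the normative syntax of this application:

```python
FLAG_NAMES = ["feature_extraction_flag", "feature_enhancement_flag",
              "fext_nn_model_flag", "fext_nn_parameter_flag",
              "fenh_nn_model_flag", "fenh_nn_parameter_flag"]

def pack_nn_indication(**flags) -> bytes:
    # Pack the six flags MSB-first, followed by two reserved zero bits.
    bits = [flags.get(name, 0) for name in FLAG_NAMES] + [0, 0]
    value = 0
    for b in bits:
        value = (value << 1) | (b & 1)
    return bytes([value])

def unpack_nn_indication(data: bytes) -> dict:
    v = data[0]
    return {name: (v >> (7 - i)) & 1 for i, name in enumerate(FLAG_NAMES)}

packed = pack_nn_indication(feature_extraction_flag=1, fext_nn_model_flag=1)
print(unpack_nn_indication(packed))  # both set flags read back as 1
```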
  • the indication information may be indicated at the file level, such as indicated in the related media header data box (MediaHeaderBox) under the media information data box (MediaInformationBox), or indicated in other data boxes (Box) at the file level.
  • the indication information may also be indicated at the media track level, such as indicated in a corresponding sample entry.
  • the number N of the 3D deformable convolution residual blocks in the preset feature extraction network is determined according to the video attribute corresponding to the reconstructed image and/or the device processing performance.
  • the preset feature extraction network is a three-dimensional deformable convolutional neural network.
  • the network model of this neural network can include multiple three-dimensional deformable convolution residual blocks, whose number can be determined by the video attributes of the compressed video corresponding to the reconstructed image and/or by the device processing performance, where the video attribute can be information reflecting the type of video, for example conference video or movie video.
  • the device processing performance can be the performance of the device performing image enhancement; for example, a high-performance device can use a larger number of three-dimensional deformable convolution residual blocks, while a low-performance device can use a smaller number.
  • the video attribute may include at least one of the following: video type and application scenario.
  • different numbers of three-dimensional deformable convolution residual blocks can be configured in the preset feature extraction network according to the video type and/or application scenario corresponding to the reconstructed image, so as to adapt to the image display requirements of different video types or application scenarios. For example, when the reconstructed image belongs to a video conference, a preset feature extraction network with a smaller number of 3D deformable convolution residual blocks can be selected to extract spatio-temporal features from the reconstructed image, to meet the real-time requirement of the video; or, when the video corresponding to the reconstructed image is played on a movie website, a larger number of 3D deformable convolution residual blocks can be selected to extract spatio-temporal features, to meet the high-quality requirement of the video.
  • the number N of 3D deformable convolution residual blocks in the preset feature extraction network can be set according to the video type or application scenario, or according to actual computing power and resources. For example, if the computing power of the encoding end is strong, more three-dimensional deformable convolution residual blocks can be used to extract features better.
  • the encoder can train network models with different numbers of 3D deformable convolution residual blocks, and use different network models according to the needs of the decoder.
  • the number M of convolutional residual blocks in the preset feature enhancement network can also be set according to video types or application scenarios, or according to actual computing power and resources.
  • if the computing power of the decoding end is weak, fewer two-dimensional convolutional residual blocks can be used; although the feature enhancement effect is slightly worse, the real-time performance of the decoding end is guaranteed.
  • the network model can be sent from the encoding end to the decoding end, and can also be stored on the server. If the network model is stored on the server, then the decoding end obtains the network model from the server.
  • it further includes: respectively training at least one preset feature extraction network and at least one preset feature enhancement network for different video types and/or application scenarios.
  • the preset feature extraction network and preset feature enhancement network can be pre-trained according to different video types and/or application scenarios, and the preset feature extraction network and preset feature enhancement network used when processing images to be enhanced can differ across video types and application scenarios.
  • the preset feature extraction network and the preset feature enhancement network may be neural networks with a fixed network model, and multiple sets of network parameters may be trained according to video types or application scenarios; for example, there may be one set of network parameters each for strenuous-motion scenarios, video conferencing scenarios and surveillance scenarios.
  • the encoding end selects the corresponding network parameters according to the current video type, performs feature extraction to generate a spatio-temporal feature map, and then sends the spatio-temporal feature map and the corresponding feature enhancement network parameters to the decoding end.
  • the encoding end can send the currently used set of network parameters to the decoding end, and retransmit a new set when choosing to use another set of network parameters.
  • it is also possible to establish a communication link between the encoding end and the decoding end to send all network parameters in advance; during subsequent communication, the encoding end only sends the index of the currently used network parameters, and the decoding end selects the corresponding network parameters according to the index.
  • the network parameters can also be default parameters shared by the encoding end and the decoding end, so that the encoding end does not need to send them to the decoding end; the decoding end uses the default network parameters, or only needs to select the corresponding network parameters according to an index.
  • the network parameters can also be stored on the server.
  • the encoding end only needs to send the network parameter index, and the decoding end applies to the server to obtain the corresponding network parameters according to the index information.
  • the weight parameter can be a parameter reflecting the display priority of different areas in the reconstructed image. For example, if the center of the picture needs to be highlighted, a larger weight parameter can be set for the center of the picture, while the four corners of the picture, which viewers rarely notice, can be given smaller weight parameters.
  • the weight parameter can also be used to reflect the display priority among different frame images, for example, the key frame in the reconstructed image can use a weight parameter with a larger value.
  • weight parameters can be used to weight the information of the image to be enhanced and the enhanced auxiliary image.
  • the weight parameters can be preset; for example, different weight parameters can be set for different regions in an image, for different images, or for different content displayed in an image. For example, the luminance component of the image can be multiplied by the weight parameter before being input into the feature extraction network.
  • the image to be enhanced and the enhanced auxiliary image can be divided into multiple regions, and different weight parameters can be set for each region.
  • the image to be enhanced and the enhanced auxiliary image can be divided into the image center and the four corners of the image, or into areas such as the image content and the image background, and the weight parameter values set for different areas can differ.
  • different weight parameters are set for different enhanced participating images, wherein the enhanced participating images include the image to be enhanced and the enhanced auxiliary image.
  • the image to be enhanced and the auxiliary enhanced image may be recorded as the enhanced participating images, and different weight parameters may be set for a single enhanced participating image.
  • different weight parameters can be set for each frame of the image to be enhanced and the enhanced auxiliary image, and feature extraction is performed after each frame of image is weighted.
  • the value of the weight parameter can be determined by the temporal distance from the current frame on the video timeline. For example, if the current frame image is at time t, the weight parameter of the reconstructed image at time t-1 is larger than the weight parameter at time t-2; the weight parameter can also be determined by the importance of the reconstructed image in the decoding process.
  • for example, since an I-frame is a key frame while P-frames and B-frames are non-key frames, the value of the weight parameter of an I-frame among the image to be enhanced and the enhanced auxiliary images can be greater than the values of the weight parameters of P-frames and B-frames.
  • different weights can also be used within a single frame of the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, for weighted feature extraction.
  • a single-frame image among the image to be enhanced and the enhanced auxiliary images can be divided into regions, a different weight parameter can be set for each region, and the weighted single-frame image is then used for feature extraction, feature enhancement, and so on.
  • for example, a high weight parameter can be used for a region containing a person, and a low weight parameter can be used for the background region.
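  • A minimal sketch combining per-frame weights (larger for frames nearer the current time or for I-frames) with a per-region weight map before feature extraction; all weight values shown are illustrative:

```python
import torch

def weight_enhanced_participating_images(frames: torch.Tensor,
                                         frame_weights: torch.Tensor,
                                         region_weight_map: torch.Tensor) -> torch.Tensor:
    # frames: (T, H, W) luminance; frame_weights: (T,); region_weight_map: (H, W)
    return frames * frame_weights[:, None, None] * region_weight_map[None, :, :]

T, H, W = 5, 64, 64
frames = torch.rand(T, H, W)
frame_weights = torch.tensor([0.6, 0.8, 1.0, 0.8, 0.6])  # nearer to time t -> larger
region = torch.full((H, W), 0.5)
region[16:48, 16:48] = 1.0                               # emphasize the picture centre
weighted = weight_enhanced_participating_images(frames, frame_weights, region)
```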
  • the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
  • the component may be a component of image information, which may include luminance, chrominance or color components, etc.
  • the image to be enhanced and the enhanced auxiliary image may use one or more components in the reconstructed image for image enhancement.
  • for example, if the reconstructed image is a red-green-blue (RGB) image, an image formed by the R component or an image formed by the G component may be used as the image to be enhanced or the enhanced auxiliary image.
  • the image to be enhanced and the enhanced auxiliary image may be reconstructed images consisting of only one component of the image, or of multiple components.
  • the reconstructed image is composed of luminance and chrominance (YUV) components.
  • image enhancement can be performed by applying feature extraction and feature enhancement to the luminance component, or to the chrominance component, or to both the luminance and chrominance components.
  • for example, the reconstructed image is composed of RGB (red, green, blue) components; image enhancement can be performed by applying feature extraction and feature enhancement to each of the three components separately, or to the three components as a whole.
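  • A minimal sketch of enhancing only the luminance (Y) component of YUV reconstructed frames; the enhance callable stands for the feature extraction and feature enhancement pipeline and is an assumption, not defined here:

```python
import numpy as np

def enhance_luma_only(yuv_frames, enhance):
    # yuv_frames: list of (Y, U, V) numpy arrays, the enhanced participating images,
    # with the image to be enhanced in the middle of the list.
    y_stack = np.stack([y for y, _, _ in yuv_frames])  # feed only Y to the networks
    y_enhanced = enhance(y_stack)                      # enhanced Y of the current frame
    _, u, v = yuv_frames[len(yuv_frames) // 2]         # keep the original chroma
    return y_enhanced, u, v
```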
  • the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
  • the image to be enhanced may be a partial area in the reconstructed image, for example, the center or four corners of the picture in the reconstructed image.
  • partial areas may be cropped from the reconstructed image for image enhancement.
  • for example, only a partial area of the reconstructed image can be cropped for feature extraction and feature enhancement, and an enhanced image the size of the cropped area is generated by superimposing the result with the corresponding partial area of the current reconstructed image; the enhanced area can also be superimposed back onto the corresponding cropped area of the current reconstructed image to generate an enhanced image B with the same size as the reconstructed image.
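  • A minimal sketch of enhancing only a cropped region and pasting the result back to obtain an image the size of the reconstruction; the region coordinates and the enhance callable are illustrative assumptions:

```python
import numpy as np

def enhance_region(reconstructed: np.ndarray, top: int, left: int,
                   h: int, w: int, enhance) -> np.ndarray:
    region = reconstructed[top:top + h, left:left + w]
    enhanced_region = enhance(region)          # feature extraction + enhancement on the crop
    out = reconstructed.copy()
    out[top:top + h, left:left + w] = enhanced_region
    return out                                 # enhanced image B, same size as the input
```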
  • the network parameters in the preset feature extraction network and the preset feature enhancement network can be updated during the image enhancement process; for example, the network parameters can be adjusted based on the image enhancement effect after each use, and all network parameters or only some of them can be adjusted.
  • the encoding end can also only send the adjusted network parameters to the decoding end.
  • FIG. 8 is a flow chart of a video enhancement processing method provided by an embodiment of the present application. Referring to FIG. 8, the method of this embodiment includes the following steps:
  • Step S101: Input multi-frame reconstructed images.
  • the reconstructed image refers to an image generated by compressing and encoding an original video image and then decoding the resulting video data, that is, a reconstructed image with compression distortion characteristics.
  • the multi-frame reconstructed images consist of the current reconstructed image and multiple reconstructed frames before and after it on the timeline.
  • the reconstructed image may be a reconstructed image generated during video encoding, and these reconstructed images are used as reference images for inter-frame encoding, or may be a reconstructed image generated during video decoding.
  • a multi-frame reconstructed image refers to several frames of reconstructed images before and after the reconstructed image at the current moment on the timeline. These reconstructed images can be adjacent images on the timeline.
  • for example, if the current reconstructed image is the reconstructed image at time t, two frames of reconstructed images at t-2 and t-1 before it and two frames at t+1 and t+2 after it, five frames of reconstructed images in total, can be used as input.
  • reconstructed images can also be selected at a certain interval: if the current image is the reconstructed image at time t, the two frames at t-4 and t-2 before it and the two frames at t+2 and t+4 after it are selected at an interval of one frame, five reconstructed images in total, as input. Reconstructed images may also be selected according to certain rules, for example the two I-frames (intra-coded frames) before and after the current reconstructed frame. The multiple frames can also be related images rather than timeline neighbors, for example images that all contain a certain object or whose sizes have a certain proportional relationship.
  • Step S102: Perform feature extraction to generate the spatio-temporal feature map.
  • the multi-frame images generate feature information through multiple layers of three-dimensional deformable convolution residual blocks (Residual Block), and convolution fusion is then performed on the feature information to generate the spatio-temporal feature map (Feature Map).
  • each 3D deformable convolution residual block may include a 3D deformable convolution layer and an activation function.
  • the multi-frame data input passes through a three-dimensional deformable convolution (DCN3D), an activation function (Activation Function), and another three-dimensional deformable convolution, is then superimposed with itself and output, and the output result is used as the input of the next module.
  • the multi-frame data may be multi-frame reconstructed images, or the output data of the previous module.
  • the activation function may be LReLU (Leaky Rectified Linear Activation), or other activation functions.
  • the feature information generated after N three-dimensional deformable convolution residual blocks is then fused by a convolution module (Bottleneck) to generate a spatio-temporal feature map.
  • the size of the spatio-temporal feature map is related to the image size and the number of features.
  • a convolution module is added before the 3D deformable convolution residual block to map low-order features to high-order features and increase the number of features.
  • the 3D deformable convolution is an extension of the 2D deformable convolution (DCN) to three dimensions.
  • a 3D offset is first generated through a convolution, the input features are then processed using the 3D offset, and the convolution operation produces the output features; the input can be multi-frame reconstructed images or the output features of the previous module.
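  • A simplified sketch of this mechanism: a convolution predicts a 3D offset field that resamples the input features before a regular convolution. A real DCN3D predicts one offset per kernel tap; this per-voxel variant only illustrates the idea:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformConv3d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.offset_conv = nn.Conv3d(channels, 3, 3, padding=1)  # (dx, dy, dz) per voxel
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x):                      # x: (N, C, D, H, W)
        n, _, d, h, w = x.shape
        offset = self.offset_conv(x).permute(0, 2, 3, 4, 1)      # (N, D, H, W, 3)
        # base sampling grid in normalized [-1, 1] coordinates, order (x, y, z)
        zs = torch.linspace(-1, 1, d, device=x.device)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        z, y, xg = torch.meshgrid(zs, ys, xs, indexing="ij")
        base = torch.stack((xg, y, z), dim=-1).expand(n, -1, -1, -1, -1)
        warped = F.grid_sample(x, base + offset, align_corners=True)
        return self.conv(warped)               # output features from offset-guided sampling
```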
  • Step S103: Perform feature enhancement on the spatio-temporal feature map.
  • the feature enhancement process is shown in Figure 11.
  • the spatio-temporal feature map passes through multiple convolutional residual blocks and then a convolution, such as a 1x1 conv, to recover an enhanced map with the same size as the current reconstructed image, that is, the superimposed image.
  • the number M of convolutional residual blocks is not necessarily equal to the number of three-dimensional deformable convolutional residual blocks in the feature extraction process.
  • the convolutional residual block includes a two-dimensional convolutional layer and an activation function.
  • the input data passes through a two-dimensional convolution, an activation function, and another two-dimensional convolution, and is then superimposed with itself to form the residual.
  • the two-dimensional convolution can be a two-dimensional deformable convolution (DCN), or a two-dimensional convolutional neural network (CNN).
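  • A minimal sketch of this pipeline, reusing the ConvResBlock2D class from the FIG. 4 sketch above: M residual blocks followed by a 1x1 convolution that maps the features back to an image-sized superimposed map; M and the channel counts are illustrative:

```python
import torch.nn as nn

class FeatureEnhancementNet(nn.Module):
    def __init__(self, channels=64, out_channels=1, num_blocks=4):
        super().__init__()
        # M convolutional residual blocks (ConvResBlock2D is sketched after FIG. 4 above)
        self.blocks = nn.Sequential(*[ConvResBlock2D(channels) for _ in range(num_blocks)])
        # 1x1 convolution recovers a map with the same size as the current reconstruction
        self.to_image = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, feature_map):  # feature_map: (N, C, H, W)
        return self.to_image(self.blocks(feature_map))
```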
  • Step S104: Generate the enhanced image.
  • the superimposed image generated in step S103 is superimposed with the current reconstructed image to generate the enhanced image.
  • FIG. 12 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video encoding end. Referring to FIG. 12, the method provided by the embodiment of the present application includes the following steps:
  • Step 210: Determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • Step 220: Determine the spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and the preset feature extraction network.
  • Step 230: Transmit the spatio-temporal feature map and the preset feature enhancement network.
  • the spatio-temporal feature map and the feature enhancement network can be sent to the decoding end, and the decoding end processes the image to be enhanced according to the spatio-temporal feature map and the preset feature enhancement network to improve the display effect of the image to be enhanced.
  • the spatio-temporal feature map and the preset feature enhancement network can be transmitted directly to the decoding end, or they can first be uploaded to a server, from which the decoding end requests them.
  • in an embodiment, before transmitting the spatio-temporal feature map and the preset feature enhancement network, the method also includes: performing compression coding on the spatio-temporal feature map and the preset feature enhancement network, so as to reduce the amount of transmitted data and improve transmission efficiency.
  • the preset feature extraction network, the preset feature enhancement network, and the spatio-temporal feature map may be compressed during transmission to reduce data volume and facilitate transmission or storage.
  • the spatio-temporal feature map and the network models and network parameters of the preset feature extraction network and the preset feature enhancement network can adopt lossless compression methods, such as Huffman coding and arithmetic coding.
  • the network model can be compressed through methods such as parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.
  • the network parameters can also be encoded with lossy compression; for example, quantization can be used to reduce the amount of data required.
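  • A minimal sketch of lossy parameter compression by uniform int8 quantization, one way to reduce the amount of data required; the scheme and names are illustrative:

```python
import torch

def quantize_state_dict(state_dict):
    # Assumes floating-point tensors; each weight tensor drops from 4 bytes to
    # 1 byte per parameter, plus one float scale per tensor.
    q = {}
    for name, w in state_dict.items():
        w = w.float()
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q[name] = ((w / scale).round().to(torch.int8), scale)
    return q

def dequantize_state_dict(q):
    return {name: qw.to(torch.float32) * scale for name, (qw, scale) in q.items()}
```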
  • FIG. 13 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 13, the method provided by the embodiment of the present application includes the following steps:
  • Step 310: Receive a spatio-temporal feature map and a preset feature enhancement network.
  • the spatio-temporal feature map and the preset feature enhancement network can be sent directly from the encoding end to the decoding end or downloaded from the server by the decoding end, and the spatio-temporal feature map and preset feature enhancement network on the server can be uploaded by the encoding end.
  • Step 320: Process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
  • Step 330: Process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the image to be enhanced can be generated by decoding a bit stream, and the bit stream can be sent by the encoding end and received by the decoding end.
  • the information at each position in the superimposed image, such as chrominance, luminance and color values, can be extracted, and the corresponding area in the image to be enhanced can be enhanced according to that information; alternatively, the superimposed image can be directly superimposed on the image to be enhanced, and the resulting image is used as the video-enhanced image.
  • FIG. 14 is an example diagram of a video enhancement process provided by the embodiment of the present application.
  • in the embodiment of the present application, the encoding end performs feature extraction on the multi-frame reconstructed images generated during encoding to produce a spatio-temporal feature map, and transmits the spatio-temporal feature map together with the network model and network parameters of the feature enhancement network to the decoding end; the decoding end enhances the decoded reconstructed image according to the spatio-temporal feature map and the feature enhancement network model and parameters.
  • the spatio-temporal feature map and the network model and network parameters of the feature enhancement network can be transmitted during the enhancement process, separately or together, and can be written into the video code stream or transmitted out-of-band independently of the video code stream.
  • FIG. 15 is an example diagram of another video enhancement process provided by the embodiment of the present application.
  • the network models and network parameters for feature extraction and feature enhancement can also be used only at the decoding end; that is, the decoding end decodes the video code stream and then uses the network models and network parameters for feature extraction and feature enhancement to enhance the decoded reconstructed image. If they are used only at the decoding end, the spatio-temporal feature map output by feature extraction can be used directly as the input of feature enhancement, without storing the spatio-temporal feature map separately.
  • the decoding end can obtain the network models and network parameters for feature extraction and feature enhancement by reading local files or by requesting them from the server, or they can be sent to the decoding end by the encoding end.
  • Fig. 16 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • the device can be realized by software and/or hardware and is generally integrated at the decoding end, and includes: an image extraction module 401, a feature map module 402, a feature enhancement module 403 and an enhanced image module 404.
  • the image extraction module 401 is configured to determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • a feature map module 402 configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
  • the feature enhancement module 403 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
• An image enhancement module 404, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
  • Fig. 17 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
• The device can be implemented in software and/or hardware and is generally integrated at the encoding end, including: an image extraction module 501, a feature map module 502 and an encoding sending module 503.
  • the image extraction module 501 is configured to determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • a feature map module 502 configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
  • An encoding sending module 503, configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
  • Fig. 18 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
• The device can be implemented in software and/or hardware and is generally integrated at the decoding end, including: a decoding receiving module 601, a feature enhancement module 602 and an image enhancement module 603.
• The decoding receiving module 601 is configured to receive a spatio-temporal feature map and a preset feature enhancement network.
  • a feature enhancement module 602 configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
• An image enhancement module 603, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
• In the devices at the encoding end and/or decoding end, the preset feature extraction network includes at least one three-dimensional deformable convolution residual block, and the three-dimensional deformable convolution residual block includes at least a three-dimensional deformable convolutional layer and an activation function.
  • the preset feature enhancement network in the device at the encoding end and/or decoding end includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
• The network model and network parameters of the preset feature enhancement network and the preset feature extraction network in the devices at the encoding end and/or decoding end are transmitted in the code stream and/or at the transport layer.
• The network model and the network parameters in the devices at the encoding end and/or decoding end are located in at least one of the following: the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit, and a media track.
• The number N of three-dimensional deformable convolution residual blocks in the preset feature extraction network in the devices at the encoding end and/or decoding end is determined according to the video attributes corresponding to the reconstructed images and/or the processing capability of the device.
  • the video attribute in the device at the encoding end and/or decoding end includes at least one of the following: video type and application scenario.
  • the device at the encoding end and/or decoding end further includes:
  • the network training module is used to respectively train at least one of the preset feature extraction networks and at least one of the preset feature enhancement networks for video types and/or application scenarios.
  • the device at the encoding end and/or decoding end further includes:
  • a weighting module configured to use weight parameters to weight the information of the image to be enhanced and the enhanced auxiliary image.
  • different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image in the device at the encoding end and/or the decoding end.
  • different weight parameters are set for different enhancement participating images in the devices at the encoding end and/or decoding end, wherein the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
  • the image to be enhanced and the enhanced auxiliary image in the device at the encoding end and/or the decoding end are at least one component of the reconstructed image.
  • the image to be enhanced and the enhanced auxiliary image in the devices at the encoding end and/or decoding end are partial regions of the reconstructed image.
• The image extraction module in the devices at the encoding end and/or decoding end is configured to: acquire, in time order in the reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
• The image enhancement module in the devices at the encoding end and/or decoding end is configured to: superimpose the superimposed image obtained through spatio-temporal feature enhancement on the image to be enhanced, and use the image resulting from the superposition as the video enhanced image.
  • the device at the encoding end and/or decoding end further includes:
  • An encoding and compression module configured to compress and encode the spatio-temporal feature map and the preset feature enhancement network.
• In an example, the video enhancement processing device may include the following modules: a feature extraction module A01, configured to extract features of multi-frame reconstructed images;
• a video encoding module A02, configured to encode the network parameters and the spatio-temporal feature map, and to output encoded reconstructed images as input to the feature extraction module A01;
• a transmission module A03, configured to transmit the encoded video data and, optionally, the encoded network parameters and spatio-temporal feature map;
• a feature enhancement module A04, configured to perform feature enhancement and generate an enhancement map;
• a video decoding module A05, configured to decode the network parameters and the spatio-temporal feature map from the video data and to reconstruct images;
• a transmission module A06, configured to receive the compressed video data and, optionally, the network parameters and spatio-temporal feature map.
• The feature extraction module A01, video encoding module A02, transmission module A03, feature enhancement module A04, video decoding module A05 and transmission module A06 can be implemented using dedicated hardware, or hardware capable of performing the processing in combination with appropriate software.
  • Such hardware or special purpose hardware may include application specific integrated circuits (ASICs), various other circuits, various processors, and the like.
• The functionality may be provided by a single dedicated processor, a single shared processor, or multiple independent processors, some of which may be shared.
• The term "processor" should not be understood to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage devices.
  • the apparatus of this embodiment may be a device in a video application, for example, a mobile phone, a computer, a server, a set-top box, a portable mobile terminal, a digital camera, a TV broadcasting system device, and the like.
• Figure 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a processor 70, a memory 71, an input device 72 and an output device 73; the number of processors 70 in the electronic device can be one or more, and one processor 70 is taken as an example. The processor 70, memory 71, input device 72 and output device 73 in the electronic device can be connected by a bus or in other ways; connection by a bus is taken as an example.
• The memory 71 can be used to store software programs, computer-executable programs and modules, such as the modules corresponding to the video enhancement processing devices in the embodiments of the present application (the image extraction module 401, feature map module 402, feature enhancement module 403 and image enhancement module 404; or the image extraction module 501, feature map module 502 and encoding sending module 503; or the decoding receiving module 601, feature enhancement module 602 and image enhancement module 603).
  • the processor 70 executes various functional applications and data processing of the electronic device by running software programs, instructions and modules stored in the memory 71 , that is, realizes the above-mentioned video enhancement processing method.
• The memory 71 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the electronic device, and the like.
  • the memory 71 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 71 may also include a memory that is remotely located relative to the processor 70, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 72 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the electronic device.
  • the output device 73 may include a display device such as a display screen.
• The embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a video enhancement processing method, the method comprising:
• determining an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
• The division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation.
• Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
• Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
• Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
• Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

Embodiments of the present application provide video enhancement processing methods and an apparatus, an electronic device and a storage medium, a method comprising: determining an enhanced auxiliary image of an image to be enhanced, the enhanced auxiliary image and the image to be enhanced being reconstructed images generated by decoding compressed video data (110); determining a spatiotemporal feature map on the basis of the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network (120); processing the spatiotemporal feature map according to a preset feature enhancement network so as to generate a superimposed image (130); and according to the superimposed image, processing the image to be enhanced so as to generate a video enhanced image (140).

Description

Video enhancement processing method, device, electronic device and storage medium
Cross-Reference to Related Applications
This application is based on the Chinese patent application No. 202110697703.4 filed on June 23, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the field of image processing technologies, and in particular to a video enhancement processing method and apparatus, an electronic device and a storage medium.
Background
With the growing number of video applications, application scenarios have become more flexible and diverse and the range of video resolutions keeps increasing, which places higher requirements on video compression quality. Compressed video suffers from distortion and compression noise, and video restored after compression exhibits varying degrees of quality loss; reducing these losses and improving video quality has become an important field of video processing.
At present, the video compression coding standards H.265/HEVC and H.266/VVC mainly adopt in-loop filtering for the post-processing of compressed images, including the deblocking filter (DF), sample adaptive offset (SAO) and the adaptive loop filter (ALF). Deblocking filtering comprises two stages, the filtering decision and the filtering operation; SAO classifies the reconstructed pixels with a suitable classifier and applies different offset values to the different classes; ALF selects appropriate filter coefficients according to the directionality and activity of the gradients. These traditional methods can remove compression noise and improve compressed video quality to a certain extent, but because the filtering algorithms use fixed parameters they cannot fully restore the mapping relationship between the lossy compressed image and the original image.
In recent years, with the rise of deep learning, the video field has tried to apply deep learning to reduce video compression loss. Compared with traditional video enhancement methods, deep learning learns by itself from big data, discarding hand-crafted features, and can better model the mapping relationship between lossy compressed images and original images, thereby improving video quality. Moreover, since the learning effect of deep learning depends on the amount of training data, its effectiveness, robustness and generalization ability grow as the data volume increases. Compressed video images suffer from blurring and weakened detail information; to address these problems, deep learning mostly enhances video quality on a single-frame basis, but because the true values of the compressed images themselves do not exist, the task is ill-posed, and deep-learning-based image enhancement relies mainly on prior knowledge learned from external training sets, leaving room for improvement in video quality enhancement.
Summary of the Invention
The main purpose of the embodiments of the present application is to provide a video enhancement processing method and apparatus, an electronic device and a storage medium, aiming to improve the display quality of images reconstructed from compressed video and enhance the user's viewing experience.
An embodiment of the present application provides a video enhancement processing method, including the following steps: determining an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides another video enhancement processing method, including the following steps: determining an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and transmitting the spatio-temporal feature map and the preset feature enhancement network.
An embodiment of the present application further provides another video enhancement processing method, including the following steps: receiving a spatio-temporal feature map and a preset feature enhancement network; processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application provides a video enhancement processing apparatus, including: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; a feature enhancement module, configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and an image enhancement module, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides another video enhancement processing apparatus, including: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and an encoding sending module, configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
An embodiment of the present application further provides another video enhancement processing apparatus, including: a decoding receiving module, configured to receive a spatio-temporal feature map and a preset feature enhancement network; a feature enhancement module, configured to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and an image enhancement module, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides an electronic device, including: one or more processors; and a memory configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video enhancement processing method according to any one of the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the video enhancement processing method according to any one of the embodiments of the present application.
In the embodiments of the present application, an enhanced auxiliary image of the image to be enhanced is determined; the image to be enhanced and the enhanced auxiliary image are processed with a preset feature extraction network to obtain a spatio-temporal feature map; the spatio-temporal feature map is processed with a preset feature enhancement network to generate a superimposed image; and the image to be enhanced is processed according to the superimposed image to generate a video enhanced image. Improving image display quality on the basis of the spatio-temporal features of the reconstructed video images improves the display effect of the video and enhances the user's viewing experience.
Brief Description of the Drawings
FIG. 1 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 2a is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 2b is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 2c is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a convolution residual block provided by an embodiment of the present application;
FIG. 5 is an example of the transmission of a network model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of partial-region image processing provided by an embodiment of the present application;
FIG. 7 is a block diagram of video enhancement processing provided by an embodiment of the present application;
FIG. 8 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 9 is an example of a feature extraction process provided by an embodiment of the present application;
FIG. 10 is an example of a three-dimensional deformable convolutional network provided by an embodiment of the present application;
FIG. 11 is an example of a feature enhancement process provided by an embodiment of the present application;
FIG. 12 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 13 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 14 is an example of video enhancement processing provided by an embodiment of the present application;
FIG. 15 is an example of another video enhancement processing provided by an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
In the following description, suffixes such as "module", "part" or "unit" used to denote elements are adopted only to facilitate the description of the present application and have no specific meaning in themselves; therefore "module", "part" and "unit" may be used interchangeably.
FIG. 1 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. The embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 1, the method provided by this embodiment includes the following steps.
Step 110: determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The image to be enhanced may be an image whose display quality needs to be improved; it may be generated after video decoding and exhibits losses relative to the video image before compression. The enhanced auxiliary image assists in enhancing the display of the image to be enhanced and may be related to it in time and space; for example, it may be the frame before or after the image to be enhanced on the video timeline, may contain the same objects, or may be proportional to it in size. A reconstructed image refers to an image decoded from the video data generated by compressing and transforming the original video images; it carries compression distortion and may serve as a reference image for inter-frame coding or be produced by video decoding.
In this embodiment, one or more frames among the reconstructed images generated by decoding the compressed video data may be selected, relative to the image to be enhanced, as enhanced auxiliary images; it can be understood that the image to be enhanced and the enhanced auxiliary images are related in time and space. For example, several reconstructed frames before and after the current time t on the timeline may be used: in FIG. 2a, the image to be enhanced is the reconstructed image at time t, and the two frames at times t-2 and t-1 before it and the two frames at times t+1 and t+2 after it are selected as enhanced auxiliary images; in FIG. 2b, frames are taken at intervals of one frame, i.e. the reconstructed images at t-4 and t-2 before it and at t+2 and t+4 after it; or, as in FIG. 2c, two I-frame (intra-coded frame) reconstructed images before and two after the current frame may be selected as enhanced auxiliary images.
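The neighbour selection of FIGS. 2a and 2b can be sketched as follows (a minimal illustration in Python; boundary handling is not specified in the text, and clamping at the sequence bounds is an assumption):

def select_auxiliary_frames(reconstructed_frames, t, offsets=(-2, -1, 1, 2)):
    # Offsets (-2, -1, 1, 2) give the adjacent-frame pattern of FIG. 2a;
    # offsets (-4, -2, 2, 4) give the interleaved pattern of FIG. 2b.
    last = len(reconstructed_frames) - 1
    return [reconstructed_frames[min(max(t + d, 0), last)] for d in offsets]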
Step 120: determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
The preset feature extraction network may be a pre-trained neural network used to extract the spatio-temporal features between the image to be enhanced and the enhanced auxiliary images; it may be a deformable convolutional neural network with three-dimensional input, and it may be generated by training on a large number of reconstructed images.
The image to be enhanced and the enhanced auxiliary images may be input into the preset feature extraction network, whose processing determines their spatio-temporal feature map. The spatio-temporal feature map is the output of the preset feature extraction network; it reflects, in the form of a map, the spatio-temporal feature relationship between the image to be enhanced and the enhanced auxiliary images, which may include data representations of image features or degrees of pixel change.
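For illustration, the three-dimensional input of the preset feature extraction network might be assembled as below (PyTorch is assumed; the (N, C, D, H, W) layout and placing the image to be enhanced in the middle of the temporal stack are assumptions):

import torch

def build_extraction_input(image_to_enhance, auxiliary_images):
    # Stack the image to be enhanced with its auxiliary images along a
    # temporal depth axis D, so that a three-dimensional convolution can
    # see the frames jointly. Each frame is a (C, H, W) tensor.
    k = len(auxiliary_images) // 2
    frames = auxiliary_images[:k] + [image_to_enhance] + auxiliary_images[k:]
    return torch.stack(frames, dim=1).unsqueeze(0)  # -> (1, C, D, H, W)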
Step 130: process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The preset feature enhancement network may be a neural network model, such as a convolutional neural network, that processes the spatio-temporal feature map; it may be generated by training on a massive number of feature maps containing spatio-temporal features. Its output may be a two-dimensional image used to enhance the display of the image to be enhanced; this image may carry information corresponding to spatio-temporal features and/or intra-frame features. After such training, the network can generate a superimposed image from one or more items of spatio-temporal feature data in the feature map; the superimposed image may contain the information to be supplemented at each position of the image to be enhanced, such as luma values, chroma values and color values.
In this embodiment, the spatio-temporal feature map may be input into the preset feature enhancement network, which processes it and converts it into a superimposed image that supplements the image to be enhanced, so as to improve the display of the reconstructed image.
Step 140: process the image to be enhanced according to the superimposed image to generate a video enhanced image.
The superimposed image may be used to enhance the display of the image to be enhanced. For example, pixel values such as luma or chroma may be extracted from the superimposed image, and the corresponding regions of the image to be enhanced may be enhanced according to the average of those pixel values, i.e. increased or decreased by the corresponding amount; alternatively, the superimposed image may be directly superimposed on the image to be enhanced, increasing or decreasing, at each position, the pixel values carried by the superimposed image, and the image generated by the superposition is taken as the video enhanced image.
In this embodiment of the present application, an enhanced auxiliary image of the image to be enhanced is determined; the image to be enhanced and the enhanced auxiliary image are processed with a preset feature extraction network to obtain a spatio-temporal feature map; the feature map is processed with a preset feature enhancement network to generate a superimposed image; and the image to be enhanced is processed according to the superimposed image to generate a video enhanced image. Improving image display quality on the basis of the spatio-temporal features of the reconstructed video images improves the display effect of the video and enhances the user's viewing experience.
On the basis of the above embodiments, determining the enhanced auxiliary image of the image to be enhanced includes: acquiring, in time order in the reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
The time order may be the playback order of the video to which the reconstructed images correspond, and the threshold number is the number of reconstructed frames to extract; the numbers before and after the image to be enhanced may be the same or different. For example, 2 reconstructed frames may be extracted before the image to be enhanced and 3 after it as enhanced auxiliary images.
In this embodiment, in the reconstructed image set generated by decoding, a threshold number of reconstructed images may be extracted, in video playback order, from those preceding the image to be enhanced and from those following it, as the enhanced auxiliary images.
On the basis of the above embodiments, processing the image to be enhanced according to the superimposed image to generate a video enhanced image includes:
superimposing the superimposed image on the image to be enhanced, and taking the image generated by the superposition as the video enhanced image.
In this embodiment, the superimposed image may be superimposed on the image to be enhanced by adding to, or subtracting from, each position of the image to be enhanced the pixel value at the corresponding position of the superimposed image, and the processed image is taken as the video enhanced image.
On the basis of the above embodiments, the preset feature extraction network includes at least one three-dimensional deformable convolution residual block, and the three-dimensional deformable convolution residual block includes at least a three-dimensional deformable convolutional layer and an activation function.
In this embodiment, the preset feature extraction network may be a three-dimensional convolutional neural network composed of one or more convolution residual blocks, each containing at least a three-dimensional deformable convolutional layer and an activation function.
In an exemplary implementation, the preset feature extraction network may consist of several three-dimensional deformable convolution residual blocks. FIG. 3 is a schematic structural diagram of such a block: the image to be enhanced and the enhanced auxiliary images pass through a three-dimensional deformable convolutional layer, an activation function and another three-dimensional deformable convolutional layer, and the result is added to the block input before being output; the output serves as the input of the next three-dimensional deformable convolution residual block in the network. The activation function may be an LReLU, sigmoid or tanh function. The preset feature extraction network may contain N such blocks; the larger N, the better the video enhancement effect, but the parameter complexity of the whole network increases markedly and training and computation time grow accordingly.
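A minimal sketch of the FIG. 3 block follows (PyTorch assumed; nn.Conv3d stands in for the three-dimensional deformable convolution, whose learned sampling offsets are omitted here, and the channel count is illustrative):

import torch.nn as nn

class Residual3DBlock(nn.Module):
    # Conv -> activation -> conv, plus the identity shortcut of FIG. 3.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # LReLU, one of the listed options
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

Stacking N such blocks, each block's output feeding the next, yields the preset feature extraction network described above.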
Further, on the basis of the above embodiments, the preset feature enhancement network includes at least one convolution residual block, and the convolution residual block includes at least a convolutional layer and an activation function.
In this embodiment, the preset feature enhancement network may be a pre-trained convolutional neural network including convolutional layers and activation function layers; the spatio-temporal feature map passes through a two-dimensional convolutional layer, an activation function and another two-dimensional convolutional layer, and is then added to itself to form a residual, so as to strengthen the saliency of the spatio-temporal features in the feature map.
In an exemplary implementation, FIG. 4 is a schematic structural diagram of such a convolution residual block: the spatio-temporal feature map passes through a two-dimensional convolutional layer, an activation function and another two-dimensional convolutional layer and is added to itself to form a residual, where the two-dimensional convolution may be a two-dimensional deformable convolutional network (DCN) or a two-dimensional convolutional neural network (CNN).
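The FIG. 4 block admits a similar sketch (plain nn.Conv2d is used here, though the text also allows a two-dimensional deformable convolution (DCN) in its place; the channel count is again illustrative):

import torch.nn as nn

class Residual2DBlock(nn.Module):
    # 2-D conv -> activation -> 2-D conv with a skip connection, per FIG. 4.
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)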
The network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in the code stream and/or at the transport layer.
The network model describes the organizational structure of the preset feature enhancement network and the preset feature extraction network; it may also be called the network structure, model representation or network topology, and may include the number of convolutional layers, the number of pooling layers and the connections between convolutional and pooling layers. The network parameters may include the concrete weight coefficients and biases of the convolutional layers, pooling layers and activation functions in the network.
In this embodiment, the preset feature enhancement network and the preset feature extraction network may be transmitted in the code stream and/or at the transport layer. For example, the encoding end may encode their network models and network parameters into a code stream and send it to the decoding end; alternatively, the encoding end may send the network models and parameters to a server via the transport layer and transmit only the identification number of the preset feature enhancement network or preset feature extraction network in the code stream, and the decoding end then requests the network models and parameters from the server according to that identification number.
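The identifier-based variant could look roughly like this on the decoding end (a hypothetical sketch only; the URL layout and the serialized payload format are assumptions):

import urllib.request

def request_network_by_id(server_url, network_id):
    # The code stream carries only the identification number of the preset
    # network; the model and parameters are then fetched from the server.
    with urllib.request.urlopen("%s/networks/%s" % (server_url, network_id)) as resp:
        return resp.read()  # serialized network model and parameters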
On the basis of the above embodiments, the network model and the network parameters are located in at least one of the following: the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit, and a media track.
In this embodiment, the preset feature extraction network and the preset feature enhancement network may consist of a network model and network parameters, which may be transmitted in one or more of the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit and a media track.
In an exemplary implementation, the network model used by the preset feature extraction network and the preset feature enhancement network describes the organizational structure of the network and is designed before training; it may also be called the network structure, model representation or network topology. The network parameters, including but not limited to weights and biases, are obtained during network model training. Referring to FIG. 5, the network model and network parameters may be written into the video code stream at the encoding end and sent to the decoding end together with it, or transmitted separately out of band. One possible organization of the network model is the form adopted by PyTorch, as follows:
[The PyTorch-style model listing is provided as images PCTCN2022100898-appb-000001 and PCTCN2022100898-appb-000002 in the original document.]
The network parameters may be transmitted or stored in PyTorch's .pth format. In an implementation, the network model and network parameters may also adopt other formats, such as NNEF (Neural Network Exchange Format), ONNX (Open Neural Network Exchange), the TensorFlow format or the Caffe format.
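For instance, storing the parameters in .pth format and exporting to one of the interchange formats above might look as follows (a sketch; the file names and the example input are illustrative):

import torch

def save_parameters(net, path="feature_enhancement.pth"):
    # Store only the trained parameters (weights and biases) in PyTorch's
    # .pth format, as suggested above.
    torch.save(net.state_dict(), path)

def export_to_onnx(net, example_input, path="feature_enhancement.onnx"):
    # ONNX export needs a sample input to trace the network.
    torch.onnx.export(net, example_input, path)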
If the network model and network parameters are written into the video code stream, they may be written into the supplemental enhancement information (SEI) of the video code stream, for example with the structure shown in Table 1.
Table 1
[The SEI message syntax of Table 1 is provided as image PCTCN2022100898-appb-000003 in the original document.]
Similarly, the network model and network parameters may be written into the video usability information (VUI) of the video code stream.
If the network model and network parameters are written at the transport layer, they may be written into a system-layer media attribute description unit, for example a descriptor of the transport stream, a data unit of the file format (e.g. a Box), or media description information of the transport stream such as the Media Presentation Description (MPD).
For example, ISO/IEC 14496-12 ISO BMFF may be used to encapsulate the network model and network parameters.
The network model and network parameters for feature extraction and those for feature enhancement are stored in different media tracks; the type of data stored in a track, such as the network model or the network parameters, is identified by defining sample entries of different types (e.g. identified by four-character codes), and indication information for feature extraction and feature enhancement is given in the sample entry. The concrete network model and network parameters are stored in the samples of the media track. The indication information in a media track may be implemented as follows:
aligned(8) class NeutralNetworkInfo()
{
    unsigned int(1) feature_extraction_flag;
    unsigned int(1) feature_enhancement_flag;
    if (feature_extraction_flag == 1)
    {
        unsigned int(1) fext_nn_model_flag;
        unsigned int(1) fext_nn_parameter_flag;
    }
    if (feature_enhancement_flag == 1)
    {
        unsigned int(1) fenh_nn_model_flag;
        unsigned int(1) fenh_nn_parameter_flag;
    }
    bit(6) reserved = 0;
}
feature_extraction_flag indicates whether feature extraction network information is present: 1 means present, 0 means absent.
feature_enhancement_flag indicates whether feature enhancement network information is present: 1 means present, 0 means absent.
fext_nn_model_flag indicates whether the feature extraction network model is present: 1 means present, 0 means absent.
fext_nn_parameter_flag indicates whether the feature extraction network parameters are present: 1 means present, 0 means absent.
fenh_nn_model_flag indicates whether the feature enhancement network model is present: 1 means present, 0 means absent.
fenh_nn_parameter_flag indicates whether the feature enhancement network parameters are present: 1 means present, 0 means absent.
This indication information may be given at the file level, for example in the related MediaHeaderBox under the MediaInformationBox, or in another file-level data box (Box).
It may also be given at the media track level, for example in the corresponding sample entry.
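A reading of the flags defined above can be sketched as follows (assuming the bits are packed most-significant-bit first into the first byte, with the reserved bits following; the optional flags are present only when the corresponding top-level flag equals 1):

def parse_neutral_network_info(first_byte):
    bits = [(first_byte >> (7 - i)) & 1 for i in range(8)]
    info = {"feature_extraction_flag": bits[0],
            "feature_enhancement_flag": bits[1]}
    i = 2
    if info["feature_extraction_flag"] == 1:
        info["fext_nn_model_flag"] = bits[i]
        info["fext_nn_parameter_flag"] = bits[i + 1]
        i += 2
    if info["feature_enhancement_flag"] == 1:
        info["fenh_nn_model_flag"] = bits[i]
        info["fenh_nn_parameter_flag"] = bits[i + 1]
    return info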
In an implementation, regardless of the form in which the network model and network parameters for feature extraction and those for feature enhancement are stored or transmitted, each of them may be stored or transmitted independently.
On the basis of the above embodiments, the number N of three-dimensional deformable convolution residual blocks in the preset feature extraction network is determined according to the video attributes corresponding to the reconstructed images and/or the processing capability of the device.
The preset feature extraction network is a three-dimensional deformable convolutional neural network whose network model may include several three-dimensional deformable convolution residual blocks; their number may be decided by the video attributes of the compressed video corresponding to the reconstructed images and by the device's processing capability. The video attributes may be information reflecting the type of video, for example conference video or movie video; the device processing capability is the capability of the device performing the image enhancement, so that, for example, a high-performance device may use more three-dimensional deformable convolution residual blocks and a low-performance device fewer.
On the basis of the above embodiments, the video attribute may include at least one of the following: video type and application scenario.
In this embodiment of the application, the preset feature extraction network may be configured with different numbers of 3D deformable convolution residual blocks according to the video type and/or application scenario corresponding to the reconstructed image, so as to adapt the image display effect to different video types or application scenarios. For example, when the reconstructed image belongs to a video conference, a preset feature extraction network with fewer 3D deformable convolution residual blocks may be selected to extract the spatio-temporal features of the reconstructed image so as to satisfy the real-time requirement of the video; when the video corresponding to the reconstructed image is played on a movie website, more 3D deformable convolution residual blocks may be selected to extract the spatio-temporal features so as to satisfy the high-quality requirement of the video.
In an exemplary embodiment, the number N of 3D deformable convolution residual blocks in the preset feature extraction network may be set according to the video type or application scenario, or according to the actual computing power and resources. For example, if the computing power of the encoding end is strong, more 3D deformable convolution residual blocks may be used to extract features better. In an embodiment, the encoding end may train network models containing different numbers of 3D deformable convolution residual blocks and use different network models according to the needs of the decoding end.
Similarly, the number M of convolution residual blocks in the preset feature enhancement network may also be set according to the video type or application scenario, or according to the actual computing power and resources. For example, if the computing power of the decoding end is weak, fewer 2D convolution residual blocks may be used; although the feature enhancement effect is slightly worse, the real-time performance of the decoding end is guaranteed.
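One way to make this trade-off concrete is a lookup keyed by video type and device tier, as in the Python sketch below. The table PROFILE, its keys, and the specific block counts are illustrative assumptions, not values prescribed by this application.

    # Hypothetical profile table mapping (video type, device tier) to block counts.
    PROFILE = {
        ('conference', 'low'):  {'N': 2, 'M': 2},   # favour real-time operation
        ('conference', 'high'): {'N': 4, 'M': 4},
        ('movie',      'low'):  {'N': 4, 'M': 2},
        ('movie',      'high'): {'N': 8, 'M': 6},   # favour quality
    }

    def select_block_counts(video_type, device_tier):
        # Returns N (feature extraction blocks) and M (feature enhancement blocks).
        return PROFILE.get((video_type, device_tier), {'N': 4, 'M': 4})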
The network model may be sent from the encoding end to the decoding end, or stored on a server; if the network model is stored on a server, the decoding end obtains the network model from the server.
On the basis of the above embodiments, the method further includes: separately training at least one preset feature extraction network and at least one preset feature enhancement network for each video type and/or application scenario.
The preset feature extraction network and the preset feature enhancement network may be trained in advance for different video types and/or application scenarios, and the preset feature extraction network and preset feature enhancement network used when processing the image to be enhanced may differ across video types and application scenarios.
In an exemplary embodiment, the preset feature extraction network and the preset feature enhancement network may be neural networks with a fixed network model, and multiple sets of network parameters may be trained according to the video type or application scenario. For example, there may be one set of network parameters each for strenuous-motion scenes, video conferencing scenes, and surveillance scenes. The encoding end selects the network parameters corresponding to the current video type, performs feature extraction to generate a spatio-temporal feature map, and then sends the spatio-temporal feature map and the corresponding feature enhancement network parameters to the decoding end.
The manner in which the encoding end and decoding end use multiple sets of network models is not limited. The encoding end may send the currently used set of network parameters to the decoding end and retransmit a new set when it switches to another set. Alternatively, all network parameters may be sent once the communication link between the encoding end and the decoding end is established; during communication the encoding end then only sends the index of the currently used network parameters, and the decoding end selects the corresponding network parameters according to the index. The encoding end and decoding end may also share default network parameters, in which case nothing needs to be sent: the decoding end uses the default network parameters, or simply selects the corresponding network parameters according to an index.
The network parameters may also be stored on a server; the encoding end then only needs to send the network parameter index, and the decoding end requests the corresponding network parameters from the server according to the index information.
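A minimal sketch of this index-based variant follows, assuming a registry of trained parameter sets and a fetch_from_server callable that downloads a parameter file; the registry contents, file names, and indices are hypothetical.

    # Hypothetical registry of trained parameter sets, indexed as in the text.
    PARAMETER_SETS = {
        0: 'params_high_motion.bin',   # strenuous-motion scenes
        1: 'params_conference.bin',    # video conferencing scenes
        2: 'params_surveillance.bin',  # surveillance scenes
    }

    def decoder_fetch_params(index, fetch_from_server):
        # Decoder side: resolve an index signalled by the encoder to parameters.
        try:
            return fetch_from_server(PARAMETER_SETS[index])
        except KeyError:
            raise ValueError(f"unknown network parameter index {index}")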
On the basis of the above embodiments, the method further includes:
    weighting the information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
The weight parameter may be a parameter reflecting the display priority of different regions in the reconstructed image. For example, if the centre of the picture needs to be highlighted, a larger weight parameter may be set for the picture centre, while the four picture corners, which viewers pay little attention to, may be given a smaller weight parameter. Weight parameters may also reflect the display priority among different frames; for example, key frames among the reconstructed images may use larger weight parameters.
In this embodiment of the application, weight parameters may be used to weight the information of the image to be enhanced and the enhanced auxiliary image. The weight parameters may be preset: different weight parameters may be set for different regions of an image, for images of different frames, or for different content displayed in an image. As another example, the luminance component of an image may be multiplied by the weight parameter value before being input into the feature extraction network.
On the basis of the above embodiments, different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
The image to be enhanced and the enhanced auxiliary image may each be divided into multiple regions, and a different weight parameter may be set for each region. For example, they may be divided into the image centre and the four image corners, or into image content and image background regions, and the weight parameter values set for different regions may differ.
On the basis of the above embodiments, different weight parameters are set for different enhancement participating images, where the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
In this embodiment of the application, the image to be enhanced and the enhanced auxiliary image may be collectively referred to as enhancement participating images, and a different weight parameter may be set for each single-frame enhancement participating image.
In an exemplary embodiment, a different weight parameter may be set for each frame among the image to be enhanced and the enhanced auxiliary images, and feature extraction is performed after each frame is weighted. For example, the weight value may be determined by the temporal distance from the current frame on the video timeline: if the current frame is at time t, the weight parameter of the reconstructed image at time t-1 is larger than that of the reconstructed image at time t-2. The weight parameter may also be determined by the importance of the reconstructed image in the decoding process: for example, I frames are key frames while P frames and B frames are non-key frames, so the weight parameters of I frames among the image to be enhanced and the enhanced auxiliary images may be larger than those of P frames and B frames.
In another exemplary embodiment, the information of a single frame among the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, may be weighted with different weights; for example, regions with different quantization parameters may be weighted differently before feature extraction.
In another exemplary embodiment, a single frame among the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, may be divided into regions with a different weight parameter set for each region, and feature extraction, feature enhancement, and other operations are performed after the frame is weighted; for example, a high weight parameter may be used for regions containing people and a low weight parameter for the background region.
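A minimal sketch of the region weighting described above, assuming the frame is a single 2-D plane (for example the luminance component); the centre/border split and the weight values are illustrative, not prescribed.

    import numpy as np

    def weight_frame(frame, center_weight=1.5, corner_weight=0.8):
        # frame: one 2-D plane (e.g. the luminance component) as a float array.
        h, w = frame.shape
        weights = np.full((h, w), corner_weight, dtype=np.float32)
        # the central half of the picture gets the larger weight
        weights[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = center_weight
        return frame.astype(np.float32) * weights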
On the basis of the above embodiments, the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
A component may be a component of the image information, such as a luminance/chrominance component or a colour component. When enhancement is performed, the image to be enhanced and the enhanced auxiliary image may use one or more components of the reconstructed image for image enhancement. For example, if the reconstructed image is a red-green-blue (RGB) image, the image formed by the R component or the image formed by the G component may serve as the image to be enhanced or the enhanced auxiliary image.
In an exemplary embodiment, the image to be enhanced and the enhanced auxiliary image may be only one component of the reconstructed image, or multiple components. For example, if the reconstructed image consists of luminance and chrominance (YUV) components, image enhancement may be performed by applying feature extraction and feature enhancement to the luminance component alone, to the chrominance components alone, or to the luminance and chrominance components together. If the reconstructed image consists of RGB (Red, Green, Blue) components, the three components may each undergo feature extraction and feature enhancement separately, or the three components may undergo feature extraction and feature enhancement as a whole.
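The luminance-only variant can be sketched as follows, assuming the YUV planes are stored separately and using enhance_fn as a placeholder for the feature extraction plus feature enhancement pipeline; both names and the storage layout are assumptions.

    def enhance_luma_only(yuv_frame, enhance_fn):
        # yuv_frame: dict with 2-D 'Y', 'U', 'V' planes (assumed storage layout).
        out = dict(yuv_frame)
        out['Y'] = enhance_fn(yuv_frame['Y'])  # chroma planes pass through unchanged
        return out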
On the basis of the above embodiments, the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
In this embodiment of the application, the image to be enhanced may be a partial region of the reconstructed image, for example the picture centre or the four picture corners; before image enhancement, a partial region may be cropped from the reconstructed image for enhancement.
In an exemplary embodiment, referring to Fig. 6, only a partial region of the reconstructed image may be cropped for feature extraction and feature enhancement. The enhanced video image may then be an enhanced image A of the same size as the cropped region, generated by superimposition over that partial region of the current reconstructed image only; alternatively, the enhancement map may be superimposed on the corresponding cropped region of the current reconstructed image to generate an enhanced image B of the same size as the reconstructed image.
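The two outputs described for Fig. 6 can be sketched as follows; enhance_fn again stands for the feature extraction plus feature enhancement pipeline and is a placeholder, and index bounds are assumed valid.

    import numpy as np

    def enhance_region(recon, enhance_fn, top, left, height, width, full_size=True):
        crop = recon[top:top + height, left:left + width]
        enhanced = enhance_fn(crop)        # feature extraction + enhancement
        if not full_size:
            return enhanced                # image A: same size as the cropped region
        out = np.array(recon, copy=True)   # image B: same size as the reconstruction
        out[top:top + height, left:left + width] = enhanced
        return out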
On the basis of the above embodiments, the network parameters of the preset feature extraction network and the preset feature enhancement network may be updated during the image enhancement process; for example, the network parameters may be adjusted based on the image enhancement effect after each use, and either all network parameters or only some of them may be adjusted. The encoding end may also send only the adjusted network parameters to the decoding end.
In an exemplary embodiment, feature extraction is first performed on the current video reconstructed image and multiple adjacent reconstructed frames to generate a spatio-temporal feature map; feature enhancement is then performed on the spatio-temporal feature map to generate an enhancement map; finally, the current video reconstructed image and the enhancement map are added to obtain the enhanced image. The processing block diagram is shown in Fig. 7. Fig. 8 is a flowchart of a video enhancement processing method provided by an embodiment of the present application; referring to Fig. 8, the method of this embodiment includes the following steps:
Step S101: input multiple frames of reconstructed images.
A reconstructed image here refers to an image obtained by compressing and encoding the original video image into video data and then decoding that video data, i.e. a reconstructed image bearing compression distortion. The multi-frame input consists of the current reconstructed image together with multiple reconstructed frames before and after it on the timeline.
The reconstructed images may be generated during video encoding, where they serve as reference images for inter-frame coding, or during video decoding. Multi-frame reconstructed images are several reconstructed frames before and after the current reconstructed image on the timeline. They may be temporally adjacent: if the current reconstructed image is at time t, the two preceding frames t-2 and t-1 and the two following frames t+1 and t+2 may be selected, giving five reconstructed frames in total as input. They may also be selected at an interval: with the current image at time t and an interval of one frame, the preceding frames t-4 and t-2 and the following frames t+2 and t+4 are selected, again giving five reconstructed frames as input. They may also be selected according to a rule, for example two I-frame (intra-coded frame) reconstructed images before and two after the current reconstructed frame. The multiple frames may also be related in ways other than timeline order, for example all containing a certain object, or having image sizes in a certain proportional relationship.
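The adjacent and interval selection rules can be sketched as follows; the function name and the omission of bounds clamping are simplifications for illustration.

    def select_input_frames(frames, t, mode='adjacent', count=2, step=2):
        # frames: reconstructed frames indexed by time; t: current frame index.
        # Returns the current frame plus `count` frames on each side of it.
        if mode == 'adjacent':                 # t-2, t-1, t, t+1, t+2
            offsets = range(-count, count + 1)
        elif mode == 'interval':               # t-4, t-2, t, t+2, t+4 for step=2
            offsets = range(-count * step, count * step + 1, step)
        else:
            raise ValueError(mode)
        return [frames[t + o] for o in offsets]  # bounds clamping omitted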
Step S102: feature extraction to generate a spatio-temporal feature map.
Feature extraction is performed on the input multi-frame reconstructed images, as shown in Fig. 9. The multi-frame images pass through multiple layers of 3D deformable convolution residual blocks (Residual Block) to generate feature information, and the feature information is then fused by convolution to generate a spatio-temporal feature map (Feature Map). Each 3D deformable convolution residual block may include 3D deformable convolution layers and an activation function: the multi-frame data input passes through a 3D deformable convolution (DCN3D), an activation function (Activation Function), and another 3D deformable convolution (DCN3D), is then added to itself, and the output serves as the input of the next module. The multi-frame data may be the multi-frame reconstructed images or the output data of the previous module. The activation function may be LReLU (Leaky Rectified Linear Activation) or another activation function. There may be N 3D deformable convolution residual blocks; increasing their number improves the quality of the enhanced video, but as the blocks increase, the parameter complexity of the whole network rises markedly and network training and computation take considerably more time.
The feature information generated after the N 3D deformable convolution residual blocks is fused by a convolution module (Bottleneck) to generate the spatio-temporal feature map, whose size depends on the image size and the number of features.
A convolution module is added before the 3D deformable convolution residual blocks to map low-order features to high-order features and increase the number of features.
3D deformable convolution extends 2D deformable convolution (DCN) to three dimensions. As shown in Fig. 10, a convolution first generates a 3D offset, and the 3D offset is then used in a convolution operation on the input features to obtain the output features; the input features may be the multi-frame reconstructed images or the output features of the previous module.
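The residual structure of one block can be sketched in PyTorch as follows. Since no off-the-shelf 3D deformable convolution is assumed to be available, a plain Conv3d stands in for DCN3D so that the sketch runs as written; a true DCN3D layer would additionally predict a 3-D offset field and sample the input accordingly, as described for Fig. 10. The channel count and LeakyReLU slope are illustrative.

    import torch
    import torch.nn as nn

    class DeformableConv3dResBlock(nn.Module):
        # Sketch of one 3D deformable convolution residual block.
        # Input/output shape: (batch, channels, frames, height, width).
        def __init__(self, channels=64):
            super().__init__()
            # Plain Conv3d stands in for DCN3D (offset prediction omitted).
            self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.LeakyReLU(0.1, inplace=True)
            self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            # residual connection: conv -> activation -> conv, added to the input
            return self.conv2(self.act(self.conv1(x))) + x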
Step S103: perform feature enhancement on the spatio-temporal feature map.
The feature enhancement process is shown in Fig. 11. The spatio-temporal feature map passes through multiple convolution residual blocks and then through one convolution, e.g. a 1x1 conv, to recover an enhancement map, i.e. an overlay map, of the same size as the current reconstructed image. The number M of these convolution residual blocks is not necessarily equal to the number of 3D deformable convolution residual blocks in the feature extraction process. A convolution residual block includes 2D convolution layers and an activation function: the input data passes through a 2D convolution, an activation function, and another 2D convolution, and is then added to itself to generate the residual. The 2D convolution may be a 2D deformable convolution (DCN) or a 2D convolutional neural network (CNN).
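Under the same assumptions as the previous sketch (plain Conv2d standing in for the optional 2D deformable convolution; channel count and M illustrative), the enhancement branch and the final addition of step S104 can be sketched together:

    import torch
    import torch.nn as nn

    class FeatureEnhancement(nn.Module):
        # M 2-D conv residual blocks, then a 1x1 conv that collapses the
        # features into a one-channel enhancement (overlay) map.
        def __init__(self, channels=64, M=4):
            super().__init__()
            def block():
                return nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.LeakyReLU(0.1, inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1),
                )
            self.blocks = nn.ModuleList(block() for _ in range(M))
            self.fuse = nn.Conv2d(channels, 1, kernel_size=1)  # the 1x1 conv

        def forward(self, feature_map, reconstruction):
            x = feature_map
            for b in self.blocks:
                x = b(x) + x                  # residual connection per block
            residual = self.fuse(x)           # enhancement map, same size as frame
            return reconstruction + residual  # step S104: add to current frame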
Step S104: generate the enhanced image.
The enhancement map (overlay map) generated in step S103 is superimposed on the current reconstructed image to generate the enhanced image.
Fig. 12 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. This embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally integrated at the video encoding end. Referring to Fig. 12, the method provided by this embodiment of the application includes the following steps:
Step 210: determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
Step 220: determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
Step 230: transmit the spatio-temporal feature map and the preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the feature enhancement network may be sent to the decoding end, and the decoding end processes the image to be enhanced according to the spatio-temporal feature map and the preset feature enhancement network to improve its display effect. The spatio-temporal feature map and the preset feature enhancement network may be transmitted directly to the decoding end, or first uploaded to a server, from which the decoding end obtains them by sending an acquisition request.
On the basis of the above embodiments, before transmitting the spatio-temporal feature map and the preset feature enhancement network, the method further includes:
performing compression coding on the spatio-temporal feature map and the preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the preset feature enhancement network may be compression-coded so as to reduce the amount of transmitted data and improve transmission efficiency.
In an exemplary embodiment, the preset feature extraction network, the preset feature enhancement network, and the spatio-temporal feature map may be compressed during transmission to reduce the data volume and facilitate transmission or storage. The spatio-temporal feature map and the network models and network parameters of the preset feature extraction network and preset feature enhancement network may use lossless compression, such as Huffman coding or arithmetic coding. The network model may be compressed through methods such as parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The network parameters may also use lossy compression coding; for example, quantization may be used to reduce the required amount of data.
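A minimal sketch of the lossy variant, using uniform 8-bit quantization of a flat parameter array; the bit depth and the choice of uniform quantization are illustrative assumptions, not the application's prescribed scheme.

    import numpy as np

    def quantize_params(params, bits=8):
        # Uniform quantization of a float parameter array to unsigned integers.
        lo, hi = params.min(), params.max()
        scale = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against constant arrays
        q = np.round((params - lo) / scale).astype(np.uint8)
        return q, lo, scale   # transmit q plus (lo, scale) for dequantization

    def dequantize_params(q, lo, scale):
        return q.astype(np.float32) * scale + lo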
Fig. 13 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. This embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally integrated at the video decoding end. Referring to Fig. 13, the method provided by this embodiment of the application includes the following steps:
Step 310: receive a spatio-temporal feature map and a preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the preset feature enhancement network may be sent directly from the encoding end to the decoding end, or downloaded to the decoding end from a server to which the encoding end has uploaded them.
Step 320: process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
Step 330: process the image to be enhanced according to the superimposed image to generate a video enhanced image.
The image to be enhanced may be generated by decoding a bitstream, which may be sent by the encoding end and received by the decoding end.
After the spatio-temporal feature map is processed, information at each position of the superimposed image, such as chrominance, luminance, and colour values, may be extracted and used to enhance the display of the corresponding regions of the image to be enhanced; alternatively, the superimposed image may be directly added to the image to be enhanced, and the resulting image used as the video enhanced image.
In an exemplary embodiment, Fig. 14 is an example diagram of video enhancement processing provided by an embodiment of the present application. Referring to Fig. 14, the encoding end performs feature extraction on the multi-frame coded reconstructed images generated during encoding to produce a spatio-temporal feature map, and transmits the spatio-temporal feature map together with the network model and network parameters of the feature enhancement network to the decoding end; the decoding end enhances the decoded reconstructed image according to them. The spatio-temporal feature map and the network model and parameters of the feature enhancement network may be transmitted during the enhancement process, either separately or in combination, and may be written into the video bitstream or transmitted out of band, independently of the video bitstream.
In another exemplary embodiment, Fig. 15 is an example diagram of another video enhancement processing provided by an embodiment of the present application. Referring to Fig. 15, the network models and network parameters for feature extraction and feature enhancement may be used only at the decoding end: the decoding end decodes the video bitstream and then uses the feature extraction and feature enhancement network models and parameters to enhance the decoded reconstructed image. If they are used only at the decoding end, the spatio-temporal feature map output by feature extraction can be fed directly into feature enhancement without being stored separately.
The decoding end may obtain the feature extraction and feature enhancement network models and parameters by reading a local file, from the server, or from the encoding end.
Fig. 16 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the encoding end, and includes: an image extraction module 401, a feature map module 402, a feature enhancement module 403, and an enhanced image module 404.
The image extraction module 401 is configured to determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The feature map module 402 is configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
The feature enhancement module 403 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The enhanced image module 404 is configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
Fig. 17 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the encoding end, and includes: an image extraction module 501, a feature map module 502, and an encoding sending module 503.
The image extraction module 501 is configured to determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The feature map module 502 is configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
The encoding sending module 503 is configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
Fig. 18 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the decoding end, and includes: a decoding receiving module 601, a feature enhancement module 602, and an enhanced image module 603.
The decoding receiving module 601 is configured to receive a spatio-temporal feature map and a preset feature enhancement network.
The feature enhancement module 602 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The enhanced image module 603 is configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
On the basis of the above embodiments, the preset feature extraction network at the encoding end and/or decoding end includes at least one 3D deformable convolution residual block, and the 3D deformable convolution residual block includes at least 3D deformable convolution layers and an activation function.
On the basis of the above embodiments, the preset feature enhancement network in the apparatus at the encoding end and/or decoding end includes at least one convolution residual block, and the convolution residual block includes at least convolution layers and an activation function.
On the basis of the above embodiments, the network models and network parameters of the preset feature enhancement network and the preset feature extraction network in the apparatus at the encoding end and/or decoding end are transmitted in the bitstream and/or the transport layer.
On the basis of the above embodiments, the network model and the network parameters in the apparatus at the encoding end and/or decoding end are located in at least one of the following: the video bitstream, supplemental enhancement information of the video bitstream, video application information, a system-layer media attribute description unit, and a media track.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the number N of 3D deformable convolution residual blocks in the preset feature extraction network is determined according to the video attribute corresponding to the reconstructed image and/or the device processing performance.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the video attribute includes at least one of the following: video type and application scenario.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
a network training module, configured to separately train at least one preset feature extraction network and at least one preset feature enhancement network for each video type and/or application scenario.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
a weighting module, configured to weight the information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, different weight parameters are set for different enhancement participating images, where the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
On the basis of the above embodiments, the image extraction module in the apparatus at the encoding end and/or decoding end is configured to: obtain, in temporal order from a reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
On the basis of the above embodiments, the enhanced image module in the apparatus at the encoding end and/or decoding end is configured to: superimpose the spatio-temporal feature map enhanced with image spatio-temporal features on the image to be enhanced, and use the image generated by the superimposition as the video enhanced image.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
an encoding compression module, configured to perform compression coding on the spatio-temporal feature map and the preset feature enhancement network.
In an exemplary embodiment, the video enhancement processing apparatus provided by an embodiment of the present application may include the following modules:
a feature extraction module A01, configured to extract features from multi-frame reconstructed images;
a video encoding module A02, configured to encode network parameters and the spatio-temporal feature map, and to output coded reconstructed images to the feature extraction module A01;
a transmission module A03, configured to transmit the coded video data, and optionally to encode and transmit the network parameters and the spatio-temporal feature map;
a feature enhancement module A04, configured to perform feature enhancement and generate the enhancement map;
a video decoding module A05, configured to decode the network parameters and the spatio-temporal feature map from the video data and to reconstruct images;
a transmission module A06, configured to transmit the compressed video data, and optionally to decode the network parameters and the spatio-temporal feature map.
The above feature extraction module A01, video encoding module A02, transmission module A03, feature enhancement module A04, video decoding module A05, and transmission module A06 may be implemented using dedicated hardware, or hardware capable of performing processing in combination with appropriate software. Such hardware or dedicated hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and the like. When implemented by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or multiple independent processors, some of which may be shared. Moreover, a processor should not be understood as referring exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage devices.
The apparatus of this embodiment may be a device used in video applications, for example a mobile phone, a computer, a server, a set-top box, a portable mobile terminal, a digital video camera, or television broadcasting system equipment.
Fig. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a processor 70, a memory 71, an input apparatus 72, and an output apparatus 73. The number of processors 70 in the electronic device may be one or more; one processor 70 is taken as an example in Fig. 19. The processor 70, memory 71, input apparatus 72, and output apparatus 73 in the electronic device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 19.
As a computer-readable storage medium, the memory 71 may be used to store software programs, computer-executable programs, and modules, such as the modules corresponding to the video enhancement processing apparatus in the embodiments of the present application (the image extraction module 401, feature map module 402, feature enhancement module 403, and enhanced image module 404; or the image extraction module 501, feature map module 502, and encoding sending module 503; or the decoding receiving module 601, feature enhancement module 602, and enhanced image module 603). The processor 70 runs the software programs, instructions, and modules stored in the memory 71 to execute the various functional applications and data processing of the electronic device, i.e. to implement the video enhancement processing methods described above.
The memory 71 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 71 may include a high-speed random access memory and may also include a non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 71 may further include memories remotely located relative to the processor 70, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 72 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output apparatus 73 may include a display device such as a display screen.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a video enhancement processing method, the method including:
determining an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network;
processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image;
processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
Or:
determining an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network;
transmitting the spatio-temporal feature map and the preset feature enhancement network.
Or:
receiving a spatio-temporal feature map and a preset feature enhancement network;
processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image;
processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
From the above description of the implementations, those skilled in the art can clearly understand that the present application may be implemented by software plus necessary general-purpose hardware, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present application.
It is worth noting that, in the above apparatus embodiments, the included units and modules are divided merely according to functional logic, but the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are merely for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application.
Those of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above and the functional modules/units in the systems and devices may be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or another transport mechanism, and may include any information delivery media.
Several embodiments of the present application have been described above with reference to the accompanying drawings, but the scope of rights of the present application is not limited thereby. Any modifications, equivalent replacements, and improvements made by those skilled in the art without departing from the scope and essence of the present application shall fall within the scope of rights of the present application.

Claims (23)

  1. 一种视频增强处理方法,所述方法包括:A video enhancement processing method, the method comprising:
    确定待增强图像的增强辅助图像,其中,所述增强辅助图像和所述待增强图像为压缩视频数据解码生成的重建图像;determining an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    基于所述待增强图像、所述增强辅助图像和预设特征提取网络确定时空特征图;determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network;
    根据预设特征增强网络处理所述时空特征图以生成叠加图像;processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image;
    根据所述叠加图像处理所述待增强图像以生成视频增强图像。The image to be enhanced is processed according to the superimposed image to generate a video enhanced image.
  2. 根据权利要求1所述的方法,其中,所述预设特征提取网络包括至少一个三维可变形卷积残差块,所述三维可变形卷积残差块至少包括三维可变形卷积层和激活函数。The method according to claim 1, wherein the preset feature extraction network comprises at least one 3D deformable convolutional residual block, and the 3D deformable convolutional residual block comprises at least a 3D deformable convolutional layer and an activation function.
  3. 根据权利要求1所述的方法,其中,所述预设特征增强网络包括至少一个卷积残差块,所述卷积残差块至少包括卷积层和激活函数。The method according to claim 1, wherein the preset feature enhancement network includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
  4. The method according to claim 1, wherein the network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in a code stream and/or a transport layer.
  5. The method according to claim 4, wherein the network model and the network parameters are located in at least one of the following: a video code stream, supplemental enhancement information of a video code stream, video application information, a system-layer media attribute description unit, or a media track.
  6. The method according to claim 2, wherein the number N of the three-dimensional deformable convolution residual blocks in the preset feature extraction network is determined according to a video attribute corresponding to the reconstructed image and/or device processing performance.
  7. The method according to claim 6, wherein the video attribute comprises at least one of the following: a video type or an application scenario.
  8. The method according to claim 1, further comprising:
    training at least one preset feature extraction network and at least one preset feature enhancement network separately for video types and/or application scenarios.
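As an illustration of claim 8 only: a sketch of per-scenario training, assuming (reconstructed stack, frame to enhance, original frame) triplets for each video type or application scenario. The claims do not fix a loss function, so the MSE objective below is an assumption.

```python
import torch
import torch.nn.functional as F

def train_for_scenario(triplets, feat_net, enh_net, steps=1000, lr=1e-4):
    """Claim 8 sketch: one (feat_net, enh_net) pair is trained per video
    type / application scenario. `triplets` yields (recon_stack, center,
    original): the decoded frame stack, the frame to enhance, and its
    uncompressed counterpart. MSE is an assumed training objective."""
    params = list(feat_net.parameters()) + list(enh_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _, (recon_stack, center, original) in zip(range(steps), triplets):
        residual = enh_net(feat_net(recon_stack))
        loss = F.mse_loss(center + residual, original)
        opt.zero_grad()
        loss.backward()
        opt.step()
```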
  9. The method according to claim 1, further comprising:
    weighting information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
  10. The method according to claim 9, wherein different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
  11. The method according to claim 9, wherein different weight parameters are set for different enhancement-participating images, wherein the enhancement-participating images include the image to be enhanced and the enhanced auxiliary image.
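An illustrative sketch of the weighting of claims 9-11, assuming per-image scalar weights (claim 11) and optional, hypothetical per-region weight maps (claim 10):

```python
import torch

def weight_participating_images(frames, frame_weights, region_maps=None):
    """Claims 9-11 sketch: apply one weight per participating image and,
    optionally, per-pixel/per-region weight maps. `region_maps` is a
    hypothetical list of (B, 1, H, W) weight tensors."""
    weighted = []
    for i, frame in enumerate(frames):
        w = frame * frame_weights[i]      # per-image weight (claim 11)
        if region_maps is not None:
            w = w * region_maps[i]        # per-region weights, broadcast over channels (claim 10)
        weighted.append(w)
    return torch.stack(weighted, dim=2)   # (B, C, T, H, W) stack for the feature network
```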
  12. The method according to claim 1, wherein the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
  13. The method according to claim 1, wherein the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
  14. The method according to claim 1, wherein the determining an enhanced auxiliary image for the image to be enhanced comprises:
    acquiring, in chronological order, a threshold number of reconstructed images before and/or after the image to be enhanced from a reconstructed image set generated by video decoding as enhanced auxiliary images, wherein the reconstructed image set includes at least two frames of reconstructed images.
  15. The method according to claim 1, wherein the processing the image to be enhanced according to the superimposed image to generate a video-enhanced image comprises:
    superimposing the superimposed image on the image to be enhanced, and using the image generated after superposition as the video-enhanced image.
  16. A video enhancement processing method, the method comprising:
    determining an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and
    transmitting the spatio-temporal feature map and a preset feature enhancement network.
  17. The method according to claim 16, wherein, before the transmitting the spatio-temporal feature map and the preset feature enhancement network, the method further comprises:
    performing compression encoding on the spatio-temporal feature map and the preset feature enhancement network.
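Illustrative only: a sketch of the sending side of claims 16-17, where `torch.save` is merely a placeholder for the compression encoding of claim 17, not the codec the patent contemplates.

```python
import io
import torch

def encode_for_transmission(stack, feat_net, enh_net):
    """Claims 16-17 sketch: the sending side derives the spatio-temporal
    feature map and transmits it together with the preset feature
    enhancement network. Serialization here stands in for the
    compression encoding step of claim 17."""
    features = feat_net(stack)
    payload = io.BytesIO()
    torch.save({"features": features, "enh_net": enh_net.state_dict()}, payload)
    return payload.getvalue()   # bytes for the code stream / transport layer
```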
  18. A video enhancement processing method, the method comprising:
    receiving a spatio-temporal feature map and a preset feature enhancement network;
    processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and
    processing an image to be enhanced according to the superimposed image to generate a video-enhanced image.
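Illustrative only: the matching receiving side of claim 18, restoring the transmitted feature map and enhancement network and applying the superposition; the payload format matches the hypothetical sketch for claims 16-17 above.

```python
import io
import torch

def receive_and_enhance(payload, frame_to_enhance, enh_net):
    """Claim 18 sketch: the receiving side restores the feature map and
    the enhancement network, regenerates the superimposed image, and
    adds it to the image to be enhanced."""
    blob = torch.load(io.BytesIO(payload))
    enh_net.load_state_dict(blob["enh_net"])   # preset feature enhancement network
    residual = enh_net(blob["features"])       # superimposed image
    return frame_to_enhance + residual         # video-enhanced image
```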
  19. A video enhancement processing apparatus, the apparatus comprising:
    an image extraction module, configured to determine an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network;
    a feature enhancement module, configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and
    an enhanced image module, configured to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  20. A video enhancement processing apparatus, the apparatus comprising:
    an image extraction module, configured to determine an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and
    an encoding and sending module, configured to transmit the spatio-temporal feature map and a preset feature enhancement network.
  21. A video enhancement processing apparatus, the apparatus comprising:
    a decoding and receiving module, configured to receive a spatio-temporal feature map and a preset feature enhancement network;
    a feature enhancement module, configured to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and
    an enhanced image module, configured to process an image to be enhanced according to the superimposed image to generate a video-enhanced image.
  22. An electronic device, comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the video enhancement processing method according to any one of claims 1-15, 16-17 and 18.
  23. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video enhancement processing method according to any one of claims 1-15, 16-17 or 18.
PCT/CN2022/100898 2021-06-23 2022-06-23 Video enhancement processing methods and apparatus, electronic device and storage medium WO2022268181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110697703.4 2021-06-23
CN202110697703.4A CN115511756A (en) 2021-06-23 2021-06-23 Video enhancement processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022268181A1 (en)

Family

ID=84500144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100898 WO2022268181A1 (en) 2021-06-23 2022-06-23 Video enhancement processing methods and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN115511756A (en)
WO (1) WO2022268181A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385302A (en) * 2023-04-07 2023-07-04 北京拙河科技有限公司 Dynamic blur elimination method and device for optical group camera

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190297276A1 (en) * 2018-03-20 2019-09-26 EndoVigilant, LLC Endoscopy Video Feature Enhancement Platform
CN112381716A (en) * 2020-11-18 2021-02-19 爱像素(深圳)智能科技有限公司 Image enhancement method based on generation type countermeasure network
CN112801900A (en) * 2021-01-21 2021-05-14 北京航空航天大学 Video blur removing method for generating countermeasure network based on bidirectional cyclic convolution
CN112862675A (en) * 2020-12-29 2021-05-28 成都东方天呈智能科技有限公司 Video enhancement method and system for space-time super-resolution

Also Published As

Publication number Publication date
CN115511756A (en) 2022-12-23

Similar Documents

Publication Publication Date Title
AU2012394396B2 (en) Processing high dynamic range images
US10182235B2 (en) Hardware efficient sparse FIR filtering in layered video coding
US10013746B2 (en) High dynamic range video tone mapping
KR20170020288A (en) Methods, systems and aparatus for hdr to hdr inverse tone mapping
WO2019134557A1 (en) Method and device for processing video image
RU2758035C2 (en) Method and device for reconstructing image data by decoded image data
US10542265B2 (en) Self-adaptive prediction method for multi-layer codec
WO2016192937A1 (en) Methods, apparatus, and systems for hdr tone mapping operator
US20200404339A1 (en) Loop filter apparatus and method for video coding
WO2022268181A1 (en) Video enhancement processing methods and apparatus, electronic device and storage medium
CN112235606A (en) Multi-layer video processing method, system and readable storage medium
JP2023085337A (en) Method and apparatus of cross-component linear modeling for intra prediction, decoder, encoder, and program
WO2022156688A1 (en) Layered encoding and decoding methods and apparatuses
Hanhart et al. Evaluation of JPEG XT for high dynamic range cameras
EP4133730A1 (en) Combining high-quality foreground with enhanced low-quality background
WO2023087598A1 (en) Enhanced picture generation method and apparatus, storage medium and electronic apparatus
WO2020181540A1 (en) Video processing method and device, encoding apparatus, and decoding apparatus
US20220321887A1 (en) Image or video coding on basis of transform skip - and palette coding-related data
US20220400250A1 (en) Image or video coding based on quantization parameter information for palette coding or transform unit
WO2020140889A1 (en) Quantization and dequantization method and device
WO2024061660A1 (en) Dynamic structures for volumetric data coding
CN114494051A (en) Image processing method and device, electronic equipment and readable storage medium
Vaidya et al. DCT based image compression for low bit rate video processing and band limited communication
CN116188603A (en) Image processing method and device
Maaroof et al. H264 Video Compression Technique with Retinex Enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22827673

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE