WO2022268181A1 - Video enhancement processing methods and apparatus, electronic device and storage medium - Google Patents

Video enhancement processing methods and apparatus, electronic device and storage medium

Info

Publication number
WO2022268181A1
WO2022268181A1 (PCT/CN2022/100898)
Authority
WO
WIPO (PCT)
Prior art keywords
image
enhanced
video
network
enhancement
Prior art date
Application number
PCT/CN2022/100898
Other languages
French (fr)
Chinese (zh)
Inventor
王学嘉
崔文学
刘天鸿
姜峰
刘绍辉
赵德斌
吴钊
吴平
高莹
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
哈尔滨工业大学 (Harbin Institute of Technology)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation) and 哈尔滨工业大学 (Harbin Institute of Technology)
Publication of WO2022268181A1 publication Critical patent/WO2022268181A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of image processing, and in particular to a video enhancement processing method, device, electronic equipment and storage medium.
  • Video application scenarios have become more flexible and diverse, and the range of video resolutions has gradually increased, which means higher requirements for video compression quality.
  • Compressed video has problems of distortion and compression noise, and the compressed and restored video has different degrees of quality loss. How to reduce these quality losses and improve video quality has become an important field of video processing.
  • the video compression coding standards H.265/HEVC and H.266/VVC mainly use in-loop filtering technology for the post-processing of compressed images, which includes the deblocking filter (DF), sample adaptive offset (SAO) and the adaptive loop filter (ALF).
  • deblocking filtering includes two stages: a filtering decision and a filtering operation; SAO divides the reconstructed pixels into categories by selecting an appropriate classifier, and then applies different compensation values to different categories of pixels; ALF selects appropriate filter coefficients according to directionality and activity computed from gradients.
  • image enhancement processing based on deep learning mainly relies on prior knowledge acquired from an external training set, and there is room for improvement in the enhancement of video quality.
  • the main purpose of the embodiments of the present application is to provide a video enhancement processing method, device, electronic equipment and storage medium, which aims to improve the display quality of video compressed and reconstructed images and enhance the viewing effect of users.
  • An embodiment of the present application provides a video enhancement processing method, which includes the following steps: determining an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides another video enhancement processing method, which includes the following steps: determining an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and transmitting the spatio-temporal feature map and the preset feature enhancement network.
  • the embodiment of the present application also provides another video enhancement processing method, which includes the following steps: receiving a spatio-temporal feature map and a preset feature enhancement network; processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • An embodiment of the present application provides a video enhancement processing device, which includes: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, used to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; a feature enhancement module, used to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and an enhanced image module, used to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides another video enhancement processing device, which includes: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, used to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and an encoding and sending module, used to transmit the spatio-temporal feature map and the preset feature enhancement network.
  • the embodiment of the present application also provides another video enhancement processing device, which includes: a decoding and receiving module, used to receive a spatio-temporal feature map and a preset feature enhancement network; a feature enhancement module, used to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and an enhanced image module, used to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the embodiment of the present application also provides an electronic device, which includes: one or more processors; and a memory, used to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video enhancement processing method described in any one of the embodiments of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the video enhancement processing method as described in any one of the embodiments of the present application is implemented.
  • In the present application, the enhanced auxiliary image of the image to be enhanced is determined, the image to be enhanced and the enhanced auxiliary image are processed by the preset feature extraction network to obtain the spatio-temporal feature map, the spatio-temporal feature map is processed by the preset feature enhancement network to generate a superimposed image, and the image to be enhanced is processed according to the superimposed image to generate a video-enhanced image; the display quality of the image is thereby improved based on the spatio-temporal characteristics of the video reconstructed images, improving the display effect of the video and enhancing the viewing experience of the user.
  • FIG. 1 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 2a is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 2b is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 2c is a selection example diagram of an enhanced auxiliary image provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a convolution residual block provided by an embodiment of the present application.
  • FIG. 5 is a transmission example diagram of a network model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of partial region image processing provided by an embodiment of the present application.
  • FIG. 7 is a block diagram of video enhancement processing provided by an embodiment of the present application.
  • FIG. 8 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 9 is an example diagram of a feature extraction process provided by an embodiment of the present application.
  • FIG. 10 is an example diagram of a three-dimensional deformable convolutional network provided by an embodiment of the present application.
  • FIG. 11 is an example diagram of a feature enhancement process provided by an embodiment of the present application.
  • FIG. 12 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 13 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • FIG. 14 is an example diagram of a video enhancement process provided by an embodiment of the present application.
  • FIG. 15 is an example diagram of another video enhancement process provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 1, the method provided by the embodiment of the present application includes the following steps:
  • Step 110: Determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • the image to be enhanced may be an image that needs to be enhanced for screen display effect
  • the image may be an image generated after video decoding, and the image has loss compared with the video image before compression
  • the enhanced auxiliary image may assist the image to be enhanced for display
  • the enhanced auxiliary image can be associated with the image to be enhanced in time and space.
  • the enhanced auxiliary image can be the previous frame or the next frame of the image to be enhanced on the video timeline.
  • the enhanced auxiliary image can also be an image associated with the image to be enhanced, for example, one that contains the same objects or whose dimensions are in proportion to those of the image to be enhanced.
  • the reconstructed image can refer to the image generated by decoding the video data that was produced by compressing and transforming the original video image.
  • the reconstructed image has compression distortion characteristics.
  • the reconstructed image can be used as a reference image for inter-frame coding, or it can be generated by video decoding.
  • one or more frames of reconstructed images may be selected from the reconstructed images generated by decoding compressed video data as the enhanced auxiliary images based on the image to be enhanced. It can be understood that the image to be enhanced and the enhanced auxiliary image are related in time and space.
  • Referring to FIG. 2a, the image to be enhanced is the reconstructed image at time t, and two frames of reconstructed images at times t-2 and t-1 before it and two frames at times t+1 and t+2 after it can be selected as enhanced auxiliary images. Referring to FIG. 2b, the image to be enhanced is the reconstructed image at time t, and, at an interval of one frame, two frames of reconstructed images at times t-4 and t-2 before it and two frames at times t+2 and t+4 after it are taken as enhanced auxiliary images. Or, as shown in FIG. 2c, the current frame is the image to be enhanced, and the reconstructed images of the two I-frames before and after the current frame may be selected as enhanced auxiliary images, where an I-frame is an intra-coded frame.
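  • As an illustration of the selection patterns in FIGS. 2a and 2b, the following is a minimal sketch; the function name and the interval parameter are illustrative and not part of this application:

```python
# A minimal sketch of auxiliary-frame selection around the image to be enhanced
# at time t; "interval" is an illustrative parameter, not the patent's term.
def select_auxiliary_indices(t, num_before=2, num_after=2, interval=1):
    before = [t - interval * k for k in range(num_before, 0, -1)]
    after = [t + interval * k for k in range(1, num_after + 1)]
    return [i for i in before + after if i >= 0]

print(select_auxiliary_indices(10))              # FIG. 2a pattern: [8, 9, 11, 12]
print(select_auxiliary_indices(10, interval=2))  # FIG. 2b pattern: [6, 8, 12, 14]
```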
  • Step 120: Determine the spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and the preset feature extraction network.
  • the preset feature extraction network can be a pre-trained neural network, which can be used to extract the spatiotemporal features between the image to be enhanced and the enhanced auxiliary image
  • the preset feature extraction network can be a deformable convolutional neural network whose input can be three-dimensional.
  • the preset feature extraction network can be generated by training on a large number of reconstructed images.
  • the image to be enhanced and the enhanced auxiliary image can be input into the preset feature extraction network, and the spatio-temporal feature map of the image to be enhanced and the enhanced auxiliary image can be determined through the processing of the preset feature extraction network, where the spatio-temporal feature map can be the result output by the preset feature extraction network and can reflect, in the form of a map, the spatio-temporal feature relationship between the image to be enhanced and the enhanced auxiliary image.
  • the spatio-temporal feature correlation can include the data representation of the feature in the image or the degree of pixel change.
  • Step 130: Process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
  • the preset feature enhancement network may be a neural network model for processing spatiotemporal feature maps
  • the preset feature enhancement network may be a convolutional neural network
  • the preset feature enhancement network may be generated by training on a large number of feature maps containing spatio-temporal features.
  • the result output by the preset feature enhancement network can be a two-dimensional image, which can be used to enhance the display effect of the image to be enhanced.
  • the two-dimensional image can include information corresponding to spatio-temporal features and/or intra-frame features.
  • after being trained on a large number of feature maps, the preset feature enhancement network can generate a superimposed image from one or more items of spatio-temporal feature data included in the spatio-temporal feature map.
  • the superimposed image can include the information that needs to be supplemented at each position in the image to be enhanced, and the information can include luminance values, chrominance values, color values, etc.
  • the spatio-temporal feature map can be input into the preset feature enhancement network and processed by it, converting the spatio-temporal feature map into a superimposed image that supplements the image to be enhanced; the information contained in the superimposed image can be used to supplement the image to be enhanced, so as to enhance the display effect of the reconstructed image.
  • Step 140: Process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the superimposed image can be used to enhance the display effect of the image to be enhanced.
  • pixel values in the superimposed image, such as luminance or chrominance information, can be extracted, and the corresponding area in the image to be enhanced can be enhanced according to the average of those pixel values, for example by increasing or decreasing the corresponding pixel values by that average; alternatively, the superimposed image can be directly superimposed on the image to be enhanced, increasing or decreasing each position in the image to be enhanced by the pixel value at the corresponding position of the superimposed image, and the resulting image is used as the video-enhanced image.
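  • A minimal sketch of this superposition step, assuming the superimposed image is a residual with the same shape as the image to be enhanced and pixel values normalized to [0, 1]; the names are illustrative:

```python
import torch

def apply_overlay(to_enhance: torch.Tensor, overlay: torch.Tensor) -> torch.Tensor:
    # Add the predicted residual at every position, then clip to the valid range.
    return (to_enhance + overlay).clamp(0.0, 1.0)
```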
  • In the present application, the enhanced auxiliary image of the image to be enhanced is determined, the image to be enhanced and the enhanced auxiliary image are processed by the preset feature extraction network to obtain the spatio-temporal feature map, the spatio-temporal feature map is processed by the preset feature enhancement network to generate a superimposed image, and the image to be enhanced is processed according to the superimposed image to generate a video-enhanced image; the display quality of the image is thereby improved based on the spatio-temporal characteristics of the video reconstructed images, improving the display effect of the video and enhancing the viewing experience of the user.
  • the determination of the enhanced auxiliary image of the image to be enhanced includes: obtaining, in time sequence, a threshold number of reconstructed images respectively before and/or after the image to be enhanced from the reconstructed image set generated by video decoding, and using them as enhanced auxiliary images, where the set of reconstructed images includes at least two frames of reconstructed images.
  • the time sequence may be the playback time sequence of the video corresponding to the reconstructed image
  • the threshold number may be the number of frames for extracting the reconstructed image
  • the threshold numbers before and after the image to be enhanced may be the same or different. For example, 2 frames of reconstructed images may be extracted before the image to be enhanced as enhanced auxiliary images, and 3 frames of reconstructed images may be extracted after the image to be enhanced as enhanced auxiliary images.
  • a threshold number of reconstructed images can be extracted from the reconstructed images before the image to be enhanced as enhanced auxiliary images, and a threshold number of reconstructed images can likewise be extracted from the reconstructed images after the image to be enhanced as enhanced auxiliary images.
  • the processing of the image to be enhanced according to the superimposed image to generate a video enhanced image includes:
  • the superimposed image is superimposed on the image to be enhanced, and the resulting image is used as the video-enhanced image.
  • the superimposed image can be superimposed on the image to be enhanced by adding or subtracting the pixel value of the corresponding position of the superimposed image at each position in the image to be enhanced, thereby realizing the processing of the image to be enhanced, and the processed image is used as the video-enhanced image.
  • the preset feature extraction network includes at least one 3D deformable convolution residual block, and the 3D deformable convolution residual block includes at least a 3D deformable convolution layer and an activation function.
  • the preset feature extraction network may be a three-dimensional convolutional neural network, which may be composed of one or more convolutional residual blocks, each of which may include at least a 3D deformable convolutional layer and an activation function.
  • the preset feature extraction network may be composed of multiple three-dimensional deformable convolution residual blocks.
  • FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application.
  • each 3D deformable convolution residual block can be as shown in FIG. 3: the input (the image to be enhanced and the enhanced auxiliary image) passes through a 3D deformable convolution layer, an activation function, and another 3D deformable convolution layer, is superimposed with itself, and is then output; the output result can serve as the input data of the next 3D deformable convolution residual block in the preset feature extraction network.
  • the activation function can include LReLU activation function, sigmoid function, tanh function, etc.
  • the number of three-dimensional deformable convolution residual blocks in the preset feature extraction network can be N; as N increases, the parameter complexity of the entire network increases significantly, and the network training and computation time also increase.
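  • The following is a minimal PyTorch sketch of one residual block of FIG. 3. PyTorch and torchvision provide no built-in 3D deformable convolution, so nn.Conv3d stands in for the DCN3D layers; a real 3D deformable convolution would keep the same topology:

```python
import torch.nn as nn

class DeformableConvResBlock3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # stand-in for DCN3D
        self.act = nn.LeakyReLU(0.1, inplace=True)                            # LReLU activation
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)  # stand-in for DCN3D

    def forward(self, x):
        # DCN3D -> activation -> DCN3D, then superimpose with the input itself (residual)
        return x + self.conv2(self.act(self.conv1(x)))
```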
  • the preset feature enhancement network includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
  • the preset feature enhancement network can be a pre-trained convolutional neural network, which can include convolutional layers and activation function layers; the spatio-temporal feature map can pass through a two-dimensional convolutional layer, an activation function, and another two-dimensional convolutional layer, and then be superimposed with itself to form a residual, so as to enhance the salience of the spatio-temporal features in the spatio-temporal feature map.
  • FIG. 4 is a schematic diagram of a convolutional residual block structure provided by an embodiment of the present application.
  • the spatio-temporal feature map can pass through a two-dimensional convolutional layer, an activation function, and another two-dimensional convolutional layer, and then be superimposed with itself to form a residual, where the two-dimensional convolution can be a two-dimensional deformable convolutional network (Deformable Convolutional Networks, DCN) or a two-dimensional convolutional neural network (Convolutional Neural Networks, CNN).
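  • A minimal PyTorch sketch of the convolutional residual block of FIG. 4, using plain nn.Conv2d; per the text above, a 2D deformable convolution could be substituted:

```python
import torch.nn as nn

class ConvResBlock2D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # conv -> activation -> conv, then add the input to form the residual
        return x + self.conv2(self.act(self.conv1(x)))
```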
  • the network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in the code stream and/or the transport layer.
  • the network model can be the organizational structure of the preset feature enhancement network and the preset feature extraction network, which can also be called the network structure, model representation or network topology; it can include the number of convolutional layers, the number of pooling layers, the connection relationships between convolutional layers and pooling layers, and so on. The network parameters can include the specific weight coefficients and biases of the convolutional layers, pooling layers and activation functions in the network.
  • the preset feature enhancement network and the preset feature extraction network can be transmitted in the code stream and/or at the transport layer. For example, the encoding end can write the network models and network parameters of the preset feature enhancement network and the preset feature extraction network into the code stream and send the code stream to the decoding end; the encoding end can also send the network models and network parameters to a server through the transport layer, then send the identification number of the preset feature enhancement network or the preset feature extraction network to the decoding end through the code stream, and the decoding end requests the network models and network parameters of the preset feature enhancement network and the preset feature extraction network from the server according to the identification number.
  • the network model and the network parameters are located in at least one of the following: video code stream, supplementary enhancement information of the video code stream, video application information, system layer media attribute description unit, and media track.
  • the preset feature extraction network and the preset feature enhancement network can be composed of network models and network parameters, which can be transmitted through one or more types of information among the video code stream, the supplemental enhancement information, the video usability information, the system layer media attribute description unit and the media track.
  • the network model used by the preset feature extraction network and the preset feature enhancement network describes the organizational structure of the network and is designed before training; it can also be called the network structure, the model representation, or the network topology.
  • Network parameters are obtained during network model training, including but not limited to weights and biases.
  • the network model and network parameters can be written into the video code stream at the encoding end and sent to the decoding end together with the video code stream, or they can be transmitted separately out-of-band.
  • One organizational relationship for network models can be of the form adopted by PyTorch, as follows:
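  • The original listing is not reproduced here; the sketch below, which reuses the DeformableConvResBlock3D class from the FIG. 3 sketch above, only illustrates the kind of PyTorch-style organization meant. N, the channel counts and the file name are illustrative:

```python
import torch
import torch.nn as nn

class FeatureExtractionNet(nn.Module):
    def __init__(self, in_channels=1, channels=64, num_blocks=4):
        super().__init__()
        self.head = nn.Conv3d(in_channels, channels, 3, padding=1)   # low- to high-order features
        self.blocks = nn.Sequential(
            *[DeformableConvResBlock3D(channels) for _ in range(num_blocks)])  # N residual blocks
        self.bottleneck = nn.Conv3d(channels, channels, 1)            # fuses features into the map

    def forward(self, frames):  # frames: (batch, C, T, H, W)
        return self.bottleneck(self.blocks(self.head(frames)))

# Network parameters can then be stored or transmitted in PyTorch's .pth format:
model = FeatureExtractionNet()
torch.save(model.state_dict(), "fext_nn_parameters.pth")
```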
  • the network parameters can be transmitted or stored in the .pth format of PyTorch.
  • the network model and network parameters can also adopt other formats, such as NNEF (Neural Network Exchange Format), ONNX (Open Neural Network Exchange), TensorFlow format, Caffe format, etc.
  • when the network model and network parameters are written into the video code stream, they can be written into the Supplemental Enhancement Information (SEI) of the video code stream, for example using the structure shown in Table 1.
  • the network model and network parameters can also be written into the video usability information (Video Usability Information, VUI) in the video code stream.
  • when the network model and network parameters are written to the transport layer, they can be written into the system layer media attribute description unit, such as a descriptor of the transport stream, a data unit of the file format (for example, in a Box), or the media description information of the transport stream, such as the Media Presentation Description (MPD).
  • the network model and network parameters for feature extraction and the network model and network parameters for feature enhancement are stored in different media tracks, and different types of sample entries (for example, identified by four-character codes) are defined to distinguish the stored data types, such as network models and network parameters; moreover, indication information for feature extraction and feature enhancement is given in the sample entry.
  • the specific network model and network parameters are stored in the samples of the media track.
  • the indication information in the media track is implemented as follows:
  • feature_extraction_flag indicates whether feature extraction network information is included: 1, included; 0, not included.
  • feature_enhancement_flag indicates whether feature enhancement network information is included: 1, included; 0, not included.
  • fext_nn_model_flag indicates whether the feature extraction network model is included: 1, included; 0, not included.
  • fext_nn_parameter_flag indicates whether the feature extraction network parameters are included: 1, included; 0, not included.
  • fenh_nn_model_flag indicates whether the feature enhancement network model is included: 1, included; 0, not included.
  • fenh_nn_parameter_flag indicates whether the feature enhancement network parameters are included: 1, included; 0, not included.
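  • A minimal sketch of packing the six indication flags above into one byte of a sample entry; the bit layout and the two reserved bits are assumptions, not the normative syntax of this application:

```python
FLAG_NAMES = ["feature_extraction_flag", "feature_enhancement_flag",
              "fext_nn_model_flag", "fext_nn_parameter_flag",
              "fenh_nn_model_flag", "fenh_nn_parameter_flag"]

def pack_nn_indication(**flags) -> bytes:
    # Pack the six flags MSB-first, followed by two reserved zero bits.
    bits = [flags.get(name, 0) for name in FLAG_NAMES] + [0, 0]
    value = 0
    for b in bits:
        value = (value << 1) | (b & 1)
    return bytes([value])

def unpack_nn_indication(data: bytes) -> dict:
    v = data[0]
    return {name: (v >> (7 - i)) & 1 for i, name in enumerate(FLAG_NAMES)}

packed = pack_nn_indication(feature_extraction_flag=1, fext_nn_model_flag=1)
print(unpack_nn_indication(packed))  # both set flags read back as 1
```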
  • the indication information may be indicated at the file level, such as indicated in the related media header data box (MediaHeaderBox) under the media information data box (MediaInformationBox), or indicated in other data boxes (Box) at the file level.
  • the indication information may also be indicated at the media track level, such as indicated in a corresponding sample entry.
  • the number N of the 3D deformable convolution residual blocks in the preset feature extraction network is determined according to the video attribute corresponding to the reconstructed image and/or the device processing performance.
  • the preset feature extraction network is a three-dimensional deformable convolutional neural network.
  • the network model of this neural network can include multiple three-dimensional deformable convolution residual blocks, whose number can be determined by the video attributes of the compressed video corresponding to the reconstructed image and/or by the device processing performance, where the video attribute can be information reflecting the type of video, for example conference video or movie video.
  • the device processing performance can be the performance of the device performing image enhancement; for example, a high-performance device can use a larger number of three-dimensional deformable convolution residual blocks, while a low-performance device can use a smaller number.
  • the video attribute may include at least one of the following: video type and application scenario.
  • different numbers of three-dimensional deformable convolution residual blocks can be configured in the preset feature extraction network according to the video type and/or application scenario corresponding to the reconstructed image, so as to adapt to the image display requirements of different video types or application scenarios. For example, when the reconstructed image belongs to a video conference, a preset feature extraction network with a smaller number of 3D deformable convolution residual blocks can be selected to extract spatio-temporal features from the reconstructed image, to meet the real-time requirement of the video; or, when the video corresponding to the reconstructed image is played on a movie website, a larger number of 3D deformable convolution residual blocks can be selected to extract spatio-temporal features, to meet the high-quality requirement of the video.
  • the number N of 3D deformable convolution residual blocks in the preset feature extraction network can be set according to the video type or application scenario, or according to actual computing power and resources. For example, if the computing power of the encoding end is strong, more three-dimensional deformable convolution residual blocks can be used to extract features better.
  • the encoder can train network models with different numbers of 3D deformable convolution residual blocks, and use different network models according to the needs of the decoder.
  • the number M of convolutional residual blocks in the preset feature enhancement network can also be set according to video types or application scenarios, or according to actual computing power and resources.
  • if the computing power of the decoding end is weak, fewer two-dimensional convolutional residual blocks can be used; although the feature enhancement effect is slightly worse, the real-time performance of the decoding end is guaranteed.
  • the network model can be sent from the encoding end to the decoding end, and can also be stored on the server. If the network model is stored on the server, then the decoding end obtains the network model from the server.
  • it further includes: respectively training at least one preset feature extraction network and at least one preset feature enhancement network for different video types and/or application scenarios.
  • the preset feature extraction network and preset feature enhancement network can be pre-trained according to different video types and/or application scenarios, and the preset feature extraction network and preset feature enhancement network used when processing images to be enhanced can differ across video types and application scenarios.
  • the preset feature extraction network and the preset feature enhancement network may be neural networks with a fixed network model, and multiple sets of network parameters may be trained according to video types or application scenarios; for example, there may be one set of network parameters each for strenuous-motion scenarios, video conferencing scenarios and surveillance scenarios.
  • the encoding end selects the corresponding network parameters according to the current video type, performs feature extraction to generate a spatio-temporal feature map, and then sends the spatio-temporal feature map and the corresponding feature enhancement network parameters to the decoding end.
  • the encoding end can send the currently used set of network parameters to the decoding end, and retransmit a new set when choosing to use another set of network parameters.
  • it is also possible to establish a communication link between the encoding end and the decoding end to send all network parameters in advance; during subsequent communication, the encoding end only sends the index of the currently used network parameters, and the decoding end selects the corresponding network parameters according to the index.
  • the network parameters can also be default parameters shared by the encoding end and the decoding end, so that the encoding end does not need to send them to the decoding end; the decoding end uses the default network parameters, or only needs to select the corresponding network parameters according to an index.
  • the network parameters can also be stored on the server.
  • the encoding end only needs to send the network parameter index, and the decoding end applies to the server to obtain the corresponding network parameters according to the index information.
  • the weight parameter can be a parameter reflecting the display priority of different areas in the reconstructed image. For example, if the center of the picture needs to be highlighted, a larger weight parameter can be set for the center of the picture, while the four corners of the picture, which viewers rarely notice, can be given smaller weight parameters.
  • the weight parameter can also be used to reflect the display priority among different frame images, for example, the key frame in the reconstructed image can use a weight parameter with a larger value.
  • weight parameters can be used to weight the information of the image to be enhanced and the enhanced auxiliary image.
  • the weight parameters can be preset; for example, different weight parameters can be set for different regions in an image, for different images, or for different content displayed in an image. For example, the luminance component of the image can be multiplied by the weight parameter before being input into the feature extraction network.
  • the image to be enhanced and the enhanced auxiliary image can be divided into multiple regions, and different weight parameters can be set for each region.
  • the image to be enhanced and the enhanced auxiliary image can be divided into the image center and the four corners of the image, or into areas such as the image content and the image background, and the weight parameter values set for different areas can differ.
  • different weight parameters are set for different enhanced participating images, wherein the enhanced participating images include the image to be enhanced and the enhanced auxiliary image.
  • the image to be enhanced and the auxiliary enhanced image may be recorded as the enhanced participating images, and different weight parameters may be set for a single enhanced participating image.
  • different weight parameters can be set for each frame of the image to be enhanced and the enhanced auxiliary image, and feature extraction is performed after each frame of image is weighted.
  • the value of the weight parameter can be determined by the temporal distance from the current frame on the video timeline. For example, if the current frame image is at time t, the weight parameter of the reconstructed image at time t-1 is larger than the weight parameter at time t-2; the weight parameter can also be determined by the importance of the reconstructed image in the decoding process.
  • for example, since an I-frame is a key frame while P-frames and B-frames are non-key frames, the value of the weight parameter of an I-frame among the image to be enhanced and the enhanced auxiliary images can be greater than the values of the weight parameters of P-frames and B-frames.
  • different weights can also be used within a single frame of the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, for weighted feature extraction.
  • a single-frame image among the image to be enhanced and the enhanced auxiliary images can be divided into regions, a different weight parameter can be set for each region, and the weighted single-frame image is then used for feature extraction, feature enhancement, and so on.
  • for example, a high weight parameter can be used for a region containing a person, and a low weight parameter can be used for the background region.
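  • A minimal sketch combining per-frame weights (larger for frames nearer the current time or for I-frames) with a per-region weight map before feature extraction; all weight values shown are illustrative:

```python
import torch

def weight_enhanced_participating_images(frames: torch.Tensor,
                                         frame_weights: torch.Tensor,
                                         region_weight_map: torch.Tensor) -> torch.Tensor:
    # frames: (T, H, W) luminance; frame_weights: (T,); region_weight_map: (H, W)
    return frames * frame_weights[:, None, None] * region_weight_map[None, :, :]

T, H, W = 5, 64, 64
frames = torch.rand(T, H, W)
frame_weights = torch.tensor([0.6, 0.8, 1.0, 0.8, 0.6])  # nearer to time t -> larger
region = torch.full((H, W), 0.5)
region[16:48, 16:48] = 1.0                               # emphasize the picture centre
weighted = weight_enhanced_participating_images(frames, frame_weights, region)
```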
  • the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
  • the component may be a component of image information, which may include luminance, chrominance or color components, etc.
  • the image to be enhanced and the enhanced auxiliary image may use one or more components in the reconstructed image for image enhancement.
  • for example, if the reconstructed image is a red-green-blue (RGB) image, an image formed by the R component or an image formed by the G component may be used as the image to be enhanced or the enhanced auxiliary image.
  • the image to be enhanced and the enhanced auxiliary image may be reconstructed images consisting of only one component of the image, or of multiple components.
  • the reconstructed image is composed of luminance and chrominance (YUV) components.
  • image enhancement can be performed by applying feature extraction and feature enhancement to the luminance component, or to the chrominance component, or to both the luminance and chrominance components.
  • for example, the reconstructed image is composed of RGB (red, green, blue) components; image enhancement can be performed by applying feature extraction and feature enhancement to each of the three components separately, or to the three components as a whole.
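  • A minimal sketch of enhancing only the luminance (Y) component of YUV reconstructed frames; the enhance callable stands for the feature extraction and feature enhancement pipeline and is an assumption, not defined here:

```python
import numpy as np

def enhance_luma_only(yuv_frames, enhance):
    # yuv_frames: list of (Y, U, V) numpy arrays, the enhanced participating images,
    # with the image to be enhanced in the middle of the list.
    y_stack = np.stack([y for y, _, _ in yuv_frames])  # feed only Y to the networks
    y_enhanced = enhance(y_stack)                      # enhanced Y of the current frame
    _, u, v = yuv_frames[len(yuv_frames) // 2]         # keep the original chroma
    return y_enhanced, u, v
```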
  • the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
  • the image to be enhanced may be a partial area in the reconstructed image, for example, the center or four corners of the picture in the reconstructed image.
  • partial areas may be cropped from the reconstructed image for image enhancement.
  • for example, only a partial area of the reconstructed image can be cropped for feature extraction and feature enhancement, and an enhanced image the size of the cropped area is generated by superimposing the result with the corresponding partial area of the current reconstructed image; the enhanced area can also be superimposed back onto the corresponding cropped area of the current reconstructed image to generate an enhanced image B with the same size as the reconstructed image.
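  • A minimal sketch of enhancing only a cropped region and pasting the result back to obtain an image the size of the reconstruction; the region coordinates and the enhance callable are illustrative assumptions:

```python
import numpy as np

def enhance_region(reconstructed: np.ndarray, top: int, left: int,
                   h: int, w: int, enhance) -> np.ndarray:
    region = reconstructed[top:top + h, left:left + w]
    enhanced_region = enhance(region)          # feature extraction + enhancement on the crop
    out = reconstructed.copy()
    out[top:top + h, left:left + w] = enhanced_region
    return out                                 # enhanced image B, same size as the input
```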
  • the network parameters in the preset feature extraction network and the preset feature enhancement network can be updated during the image enhancement process; for example, the network parameters can be adjusted based on the image enhancement effect after each use, and all network parameters or only some of them can be adjusted.
  • the encoding end can also only send the adjusted network parameters to the decoding end.
  • FIG. 8 is a flow chart of a video enhancement processing method provided by an embodiment of the present application. Referring to FIG. 8, the method of this embodiment includes the following steps:
  • Step S101: Input multi-frame reconstructed images.
  • the reconstructed image refers to an image generated by compressing and encoding an original video image and then decoding the resulting video data, that is, a reconstructed image with compression distortion characteristics.
  • the multi-frame reconstructed images consist of the current reconstructed image and multiple reconstructed frames before and after it on the timeline.
  • the reconstructed image may be a reconstructed image generated during video encoding, and these reconstructed images are used as reference images for inter-frame encoding, or may be a reconstructed image generated during video decoding.
  • a multi-frame reconstructed image refers to several frames of reconstructed images before and after the reconstructed image at the current moment on the timeline. These reconstructed images can be adjacent images on the timeline.
  • for example, if the current reconstructed image is the reconstructed image at time t, two frames of reconstructed images at t-2 and t-1 before it and two frames at t+1 and t+2 after it, five frames of reconstructed images in total, can be used as input.
  • reconstructed images can also be selected at a certain interval: if the current image is the reconstructed image at time t, the two frames at t-4 and t-2 before it and the two frames at t+2 and t+4 after it are selected at an interval of one frame, five reconstructed images in total, as input. Reconstructed images may also be selected according to certain rules, for example the two I-frames (intra-coded frames) before and after the current reconstructed frame. The multiple frames can also be related images rather than timeline neighbors, for example images that all contain a certain object or whose sizes have a certain proportional relationship.
  • Step S102: Perform feature extraction to generate the spatio-temporal feature map.
  • the multi-frame images generate feature information through multiple layers of three-dimensional deformable convolution residual blocks (Residual Block), and convolution fusion is then performed on the feature information to generate the spatio-temporal feature map (Feature Map).
  • each 3D deformable convolution residual block may include a 3D deformable convolution layer and an activation function.
  • the multi-frame data input passes through a three-dimensional deformable convolution (DCN3D), an activation function (Activation Function), and another three-dimensional deformable convolution, is then superimposed with itself and output, and the output result is used as the input of the next module.
  • the multi-frame data may be multi-frame reconstructed images, or the output data of the previous module.
  • the activation function may be LReLU (Leaky Rectified Linear Activation), or other activation functions.
  • the feature information generated after N three-dimensional deformable convolution residual blocks is then fused by a convolution module (Bottleneck) to generate a spatio-temporal feature map.
  • the size of the spatio-temporal feature map is related to the image size and the number of features.
  • a convolution module is added before the 3D deformable convolution residual block to map low-order features to high-order features and increase the number of features.
  • the 3D deformable convolution is an extension of the 2D deformable convolution (DCN) to three dimensions.
  • a 3D offset is first generated through a convolution, the input features are then processed using the 3D offset, and the convolution operation produces the output features; the input can be multi-frame reconstructed images or the output features of the previous module.
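  • A simplified sketch of this mechanism: a convolution predicts a 3D offset field that resamples the input features before a regular convolution. A real DCN3D predicts one offset per kernel tap; this per-voxel variant only illustrates the idea:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformConv3d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.offset_conv = nn.Conv3d(channels, 3, 3, padding=1)  # (dx, dy, dz) per voxel
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x):                      # x: (N, C, D, H, W)
        n, _, d, h, w = x.shape
        offset = self.offset_conv(x).permute(0, 2, 3, 4, 1)      # (N, D, H, W, 3)
        # base sampling grid in normalized [-1, 1] coordinates, order (x, y, z)
        zs = torch.linspace(-1, 1, d, device=x.device)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        z, y, xg = torch.meshgrid(zs, ys, xs, indexing="ij")
        base = torch.stack((xg, y, z), dim=-1).expand(n, -1, -1, -1, -1)
        warped = F.grid_sample(x, base + offset, align_corners=True)
        return self.conv(warped)               # output features from offset-guided sampling
```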
  • Step S103: Perform feature enhancement on the spatio-temporal feature map.
  • the feature enhancement process is shown in Figure 11.
  • the spatio-temporal feature map passes through multiple convolutional residual blocks and then a convolution, such as a 1x1 conv, to recover an enhanced map with the same size as the current reconstructed image, that is, the superimposed image.
  • the number M of convolutional residual blocks is not necessarily equal to the number of three-dimensional deformable convolutional residual blocks in the feature extraction process.
  • the convolutional residual block includes a two-dimensional convolutional layer and an activation function.
  • the input data passes through a two-dimensional convolution, an activation function, and another two-dimensional convolution, and is then superimposed with itself to form the residual.
  • the two-dimensional convolution can be a two-dimensional deformable convolution (DCN), or a two-dimensional convolutional neural network (CNN).
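  • A minimal sketch of this pipeline, reusing the ConvResBlock2D class from the FIG. 4 sketch above: M residual blocks followed by a 1x1 convolution that maps the features back to an image-sized superimposed map; M and the channel counts are illustrative:

```python
import torch.nn as nn

class FeatureEnhancementNet(nn.Module):
    def __init__(self, channels=64, out_channels=1, num_blocks=4):
        super().__init__()
        # M convolutional residual blocks (ConvResBlock2D is sketched after FIG. 4 above)
        self.blocks = nn.Sequential(*[ConvResBlock2D(channels) for _ in range(num_blocks)])
        # 1x1 convolution recovers a map with the same size as the current reconstruction
        self.to_image = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, feature_map):  # feature_map: (N, C, H, W)
        return self.to_image(self.blocks(feature_map))
```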
  • Step S104: Generate the enhanced image.
  • the superimposed image generated in step S103 is superimposed with the current reconstructed image to generate the enhanced image.
  • FIG. 12 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video encoding end. Referring to FIG. 12, the method provided by the embodiment of the present application includes the following steps:
  • Step 210: Determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • Step 220: Determine the spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and the preset feature extraction network.
  • Step 230: Transmit the spatio-temporal feature map and the preset feature enhancement network.
  • the spatio-temporal feature map and the feature enhancement network can be sent to the decoding end, and the decoding end processes the image to be enhanced according to the spatio-temporal feature map and the preset feature enhancement network to improve the display effect of the image to be enhanced.
  • the spatio-temporal feature map and the preset feature enhancement network can be transmitted directly to the decoding end, or they can first be uploaded to a server, from which the decoding end requests them.
  • in an embodiment, before transmitting the spatio-temporal feature map and the preset feature enhancement network, the method also includes: performing compression coding on the spatio-temporal feature map and the preset feature enhancement network, so as to reduce the amount of transmitted data and improve transmission efficiency.
  • the preset feature extraction network, the preset feature enhancement network, and the spatio-temporal feature map may be compressed during transmission to reduce data volume and facilitate transmission or storage.
  • the spatio-temporal feature map and the network models and network parameters of the preset feature extraction network and the preset feature enhancement network can adopt lossless compression methods, such as Huffman coding and arithmetic coding.
  • the network model can be compressed through methods such as parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.
  • the network parameters can also be encoded with lossy compression; for example, quantization can be used to reduce the amount of data required.
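  • A minimal sketch of lossy parameter compression by uniform int8 quantization, one way to reduce the amount of data required; the scheme and names are illustrative:

```python
import torch

def quantize_state_dict(state_dict):
    # Assumes floating-point tensors; each weight tensor drops from 4 bytes to
    # 1 byte per parameter, plus one float scale per tensor.
    q = {}
    for name, w in state_dict.items():
        w = w.float()
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q[name] = ((w / scale).round().to(torch.int8), scale)
    return q

def dequantize_state_dict(q):
    return {name: qw.to(torch.float32) * scale for name, (qw, scale) in q.items()}
```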
  • FIG. 13 is a flow chart of a video enhancement processing method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to enhancing the image display quality of decoded video. The method can be executed by a video enhancement processing device, which can be realized by software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 13, the method provided by the embodiment of the present application includes the following steps:
  • Step 310: Receive a spatio-temporal feature map and a preset feature enhancement network.
  • the spatio-temporal feature map and the preset feature enhancement network can be sent directly from the encoding end to the decoding end or downloaded from the server by the decoding end, and the spatio-temporal feature map and preset feature enhancement network on the server can be uploaded by the encoding end.
  • Step 320: Process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
  • Step 330: Process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  • the image to be enhanced can be generated by decoding a bit stream, and the bit stream can be sent by the encoding end and received by the decoding end.
  • the information at each position in the superimposed image, such as chrominance, luminance and color values, can be extracted, and the corresponding area in the image to be enhanced can be enhanced according to that information; alternatively, the superimposed image can be directly superimposed on the image to be enhanced, and the resulting image is used as the video-enhanced image.
  • FIG. 14 is an example diagram of a video enhancement process provided by the embodiment of the present application.
  • in the embodiment of the present application, the encoding end performs feature extraction on the multi-frame reconstructed images generated during encoding to produce a spatio-temporal feature map, and transmits the spatio-temporal feature map together with the network model and network parameters of the feature enhancement network to the decoding end; the decoding end enhances the decoded reconstructed image according to the spatio-temporal feature map and the feature enhancement network model and parameters.
  • the spatio-temporal feature map and the network model and network parameters of the feature enhancement network can be transmitted during the enhancement process, separately or together, and can be written into the video code stream or transmitted out-of-band independently of the video code stream.
  • FIG. 15 is an example diagram of another video enhancement process provided by the embodiment of the present application.
  • the network models and network parameters for feature extraction and feature enhancement can also be used only at the decoding end; that is, the decoding end decodes the video code stream and then uses the network models and network parameters for feature extraction and feature enhancement to enhance the decoded reconstructed image. If they are used only at the decoding end, the spatio-temporal feature map output by feature extraction can be used directly as the input of feature enhancement, without storing the spatio-temporal feature map separately.
  • the decoding end can obtain the network models and network parameters for feature extraction and feature enhancement by reading local files or by requesting them from the server, or they can be sent to the decoding end by the encoding end.
  • Fig. 16 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • the device can be realized by software and/or hardware and is generally integrated at the decoding end, and includes: an image extraction module 401, a feature map module 402, a feature enhancement module 403 and an enhanced image module 404.
  • the image extraction module 401 is configured to determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • a feature map module 402 configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
  • the feature enhancement module 403 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
• An image enhancement module 404, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
  • Fig. 17 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
• The device can be implemented in software and/or hardware and is generally integrated at the encoding end, including: an image extraction module 501, a feature map module 502 and an encoding sending module 503.
  • the image extraction module 501 is configured to determine an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
  • a feature map module 502 configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
  • An encoding sending module 503, configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
  • Fig. 18 is a schematic structural diagram of a video enhancement processing device provided by an embodiment of the present application, which can execute the video enhancement processing method provided by any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
• The device can be implemented in software and/or hardware and is generally integrated at the decoding end, including: a decoding receiving module 601, a feature enhancement module 602 and an image enhancement module 603.
• The decoding receiving module 601 is configured to receive a spatio-temporal feature map and a preset feature enhancement network.
  • a feature enhancement module 602 configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
• An image enhancement module 603, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
• In the devices at the encoding end and/or decoding end, the preset feature extraction network includes at least one three-dimensional deformable convolution residual block, and the three-dimensional deformable convolution residual block includes at least a three-dimensional deformable convolutional layer and an activation function.
  • the preset feature enhancement network in the device at the encoding end and/or decoding end includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
• The network model and network parameters of the preset feature enhancement network and the preset feature extraction network in the devices at the encoding end and/or decoding end are transmitted in the code stream and/or at the transport layer.
• The network model and the network parameters in the devices at the encoding end and/or decoding end are located in at least one of the following: the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit, and a media track.
• The number N of three-dimensional deformable convolution residual blocks in the preset feature extraction network in the devices at the encoding end and/or decoding end is determined according to the video attributes corresponding to the reconstructed images and/or the processing capability of the device.
  • the video attribute in the device at the encoding end and/or decoding end includes at least one of the following: video type and application scenario.
  • the device at the encoding end and/or decoding end further includes:
  • the network training module is used to respectively train at least one of the preset feature extraction networks and at least one of the preset feature enhancement networks for video types and/or application scenarios.
  • the device at the encoding end and/or decoding end further includes:
  • a weighting module configured to use weight parameters to weight the information of the image to be enhanced and the enhanced auxiliary image.
  • different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image in the device at the encoding end and/or the decoding end.
  • different weight parameters are set for different enhancement participating images in the devices at the encoding end and/or decoding end, wherein the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
  • the image to be enhanced and the enhanced auxiliary image in the device at the encoding end and/or the decoding end are at least one component of the reconstructed image.
  • the image to be enhanced and the enhanced auxiliary image in the devices at the encoding end and/or decoding end are partial regions of the reconstructed image.
• The image extraction module in the devices at the encoding end and/or decoding end is configured to: acquire, in time order in the reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
• The image enhancement module in the devices at the encoding end and/or decoding end is configured to: superimpose the superimposed image obtained through spatio-temporal feature enhancement on the image to be enhanced, and use the image resulting from the superposition as the video enhanced image.
  • the device at the encoding end and/or decoding end further includes:
  • An encoding and compression module configured to compress and encode the spatio-temporal feature map and the preset feature enhancement network.
• In an example, the video enhancement processing device may include the following modules: a feature extraction module A01, configured to extract features of multi-frame reconstructed images;
• a video encoding module A02, configured to encode the network parameters and the spatio-temporal feature map, and to output encoded reconstructed images as input to the feature extraction module A01;
• a transmission module A03, configured to transmit the encoded video data and, optionally, the encoded network parameters and spatio-temporal feature map;
• a feature enhancement module A04, configured to perform feature enhancement and generate an enhancement map;
• a video decoding module A05, configured to decode the network parameters and the spatio-temporal feature map from the video data and to reconstruct images;
• a transmission module A06, configured to receive the compressed video data and, optionally, the network parameters and spatio-temporal feature map.
• The feature extraction module A01, video encoding module A02, transmission module A03, feature enhancement module A04, video decoding module A05 and transmission module A06 can be implemented using dedicated hardware, or hardware capable of performing the processing in combination with appropriate software.
  • Such hardware or special purpose hardware may include application specific integrated circuits (ASICs), various other circuits, various processors, and the like.
• The functionality may be provided by a single dedicated processor, a single shared processor, or multiple independent processors, some of which may be shared.
• The term "processor" should not be understood to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage devices.
  • the apparatus of this embodiment may be a device in a video application, for example, a mobile phone, a computer, a server, a set-top box, a portable mobile terminal, a digital camera, a TV broadcasting system device, and the like.
• Figure 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a processor 70, a memory 71, an input device 72 and an output device 73; the number of processors 70 in the electronic device can be one or more, and one processor 70 is taken as an example. The processor 70, memory 71, input device 72 and output device 73 in the electronic device can be connected by a bus or in other ways; connection by a bus is taken as an example.
• The memory 71 can be used to store software programs, computer-executable programs and modules, such as the modules corresponding to the video enhancement processing devices in the embodiments of the present application (the image extraction module 401, feature map module 402, feature enhancement module 403 and image enhancement module 404; or the image extraction module 501, feature map module 502 and encoding sending module 503; or the decoding receiving module 601, feature enhancement module 602 and image enhancement module 603).
  • the processor 70 executes various functional applications and data processing of the electronic device by running software programs, instructions and modules stored in the memory 71 , that is, realizes the above-mentioned video enhancement processing method.
• The memory 71 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the electronic device, and the like.
  • the memory 71 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 71 may also include a memory that is remotely located relative to the processor 70, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 72 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the electronic device.
  • the output device 73 may include a display device such as a display screen.
• The embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a video enhancement processing method, the method comprising:
• determining an enhanced auxiliary image of an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
• The division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation.
• Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
• Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
• Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
• Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

Embodiments of the present application provide video enhancement processing methods and an apparatus, an electronic device and a storage medium, a method comprising: determining an enhanced auxiliary image of an image to be enhanced, the enhanced auxiliary image and the image to be enhanced being reconstructed images generated by decoding compressed video data (110); determining a spatiotemporal feature map on the basis of the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network (120); processing the spatiotemporal feature map according to a preset feature enhancement network so as to generate a superimposed image (130); and according to the superimposed image, processing the image to be enhanced so as to generate a video enhanced image (140).

Description

Video enhancement processing method, device, electronic device and storage medium
Cross-Reference to Related Applications
This application is based on the Chinese patent application No. 202110697703.4 filed on June 23, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the field of image processing technologies, and in particular to a video enhancement processing method and apparatus, an electronic device and a storage medium.
Background
With the growing number of video applications, application scenarios have become more flexible and diverse and the range of video resolutions keeps increasing, which places higher requirements on video compression quality. Compressed video suffers from distortion and compression noise, and video restored after compression exhibits varying degrees of quality loss; reducing these losses and improving video quality has become an important field of video processing.
At present, the video compression coding standards H.265/HEVC and H.266/VVC mainly adopt in-loop filtering for the post-processing of compressed images, including the deblocking filter (DF), sample adaptive offset (SAO) and the adaptive loop filter (ALF). Deblocking filtering comprises two stages, the filtering decision and the filtering operation; SAO classifies the reconstructed pixels with a suitable classifier and applies different offset values to the different classes; ALF selects appropriate filter coefficients according to the directionality and activity of the gradients. These traditional methods can remove compression noise and improve compressed video quality to a certain extent, but because the filtering algorithms use fixed parameters they cannot fully restore the mapping relationship between the lossy compressed image and the original image.
In recent years, with the rise of deep learning, the video field has tried to apply deep learning to reduce video compression loss. Compared with traditional video enhancement methods, deep learning learns by itself from big data, discarding hand-crafted features, and can better model the mapping relationship between lossy compressed images and original images, thereby improving video quality. Moreover, since the learning effect of deep learning depends on the amount of training data, its effectiveness, robustness and generalization ability grow as the data volume increases. Compressed video images suffer from blurring and weakened detail information; to address these problems, deep learning mostly enhances video quality on a single-frame basis, but because the true values of the compressed images themselves do not exist, the task is ill-posed, and deep-learning-based image enhancement relies mainly on prior knowledge learned from external training sets, leaving room for improvement in video quality enhancement.
Summary of the Invention
The main purpose of the embodiments of the present application is to provide a video enhancement processing method and apparatus, an electronic device and a storage medium, aiming to improve the display quality of images reconstructed from compressed video and enhance the user's viewing experience.
An embodiment of the present application provides a video enhancement processing method, including the following steps: determining an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides another video enhancement processing method, including the following steps: determining an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and transmitting the spatio-temporal feature map and the preset feature enhancement network.
An embodiment of the present application further provides another video enhancement processing method, including the following steps: receiving a spatio-temporal feature map and a preset feature enhancement network; processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application provides a video enhancement processing apparatus, including: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; a feature enhancement module, configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and an image enhancement module, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides another video enhancement processing apparatus, including: an image extraction module, configured to determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data; a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and an encoding sending module, configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
An embodiment of the present application further provides another video enhancement processing apparatus, including: a decoding receiving module, configured to receive a spatio-temporal feature map and a preset feature enhancement network; a feature enhancement module, configured to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and an image enhancement module, configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
An embodiment of the present application further provides an electronic device, including: one or more processors; and a memory configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video enhancement processing method according to any one of the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the video enhancement processing method according to any one of the embodiments of the present application.
In the embodiments of the present application, an enhanced auxiliary image of the image to be enhanced is determined; the image to be enhanced and the enhanced auxiliary image are processed with a preset feature extraction network to obtain a spatio-temporal feature map; the spatio-temporal feature map is processed with a preset feature enhancement network to generate a superimposed image; and the image to be enhanced is processed according to the superimposed image to generate a video enhanced image. Improving image display quality on the basis of the spatio-temporal features of the reconstructed video images improves the display effect of the video and enhances the user's viewing experience.
Brief Description of the Drawings
FIG. 1 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 2a is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 2b is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 2c is an example of the selection of enhanced auxiliary images provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a three-dimensional deformable convolution residual block provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a convolution residual block provided by an embodiment of the present application;
FIG. 5 is an example of the transmission of a network model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of partial-region image processing provided by an embodiment of the present application;
FIG. 7 is a block diagram of video enhancement processing provided by an embodiment of the present application;
FIG. 8 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 9 is an example of a feature extraction process provided by an embodiment of the present application;
FIG. 10 is an example of a three-dimensional deformable convolutional network provided by an embodiment of the present application;
FIG. 11 is an example of a feature enhancement process provided by an embodiment of the present application;
FIG. 12 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 13 is a flowchart of a video enhancement processing method provided by an embodiment of the present application;
FIG. 14 is an example of video enhancement processing provided by an embodiment of the present application;
FIG. 15 is an example of another video enhancement processing provided by an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application;
FIG. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
In the following description, suffixes such as "module", "part" or "unit" used to denote elements are adopted only to facilitate the description of the present application and have no specific meaning in themselves; therefore "module", "part" and "unit" may be used interchangeably.
FIG. 1 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. The embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally deployed at the video decoding end. Referring to FIG. 1, the method provided by this embodiment includes the following steps.
Step 110: determine an enhanced auxiliary image of an image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The image to be enhanced may be an image whose display quality needs to be improved; it may be generated after video decoding and exhibits losses relative to the video image before compression. The enhanced auxiliary image assists in enhancing the display of the image to be enhanced and may be related to it in time and space; for example, it may be the frame before or after the image to be enhanced on the video timeline, may contain the same objects, or may be proportional to it in size. A reconstructed image refers to an image decoded from the video data generated by compressing and transforming the original video images; it carries compression distortion and may serve as a reference image for inter-frame coding or be produced by video decoding.
In this embodiment, one or more frames among the reconstructed images generated by decoding the compressed video data may be selected, relative to the image to be enhanced, as enhanced auxiliary images; it can be understood that the image to be enhanced and the enhanced auxiliary images are related in time and space. For example, several reconstructed frames before and after the current time t on the timeline may be used: in FIG. 2a, the image to be enhanced is the reconstructed image at time t, and the two frames at times t-2 and t-1 before it and the two frames at times t+1 and t+2 after it are selected as enhanced auxiliary images; in FIG. 2b, frames are taken at intervals of one frame, i.e. the reconstructed images at t-4 and t-2 before it and at t+2 and t+4 after it; or, as in FIG. 2c, two I-frame (intra-coded frame) reconstructed images before and two after the current frame may be selected as enhanced auxiliary images.
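The neighbour selection of FIGS. 2a and 2b can be sketched as follows (a minimal illustration in Python; boundary handling is not specified in the text, and clamping at the sequence bounds is an assumption):

def select_auxiliary_frames(reconstructed_frames, t, offsets=(-2, -1, 1, 2)):
    # Offsets (-2, -1, 1, 2) give the adjacent-frame pattern of FIG. 2a;
    # offsets (-4, -2, 2, 4) give the interleaved pattern of FIG. 2b.
    last = len(reconstructed_frames) - 1
    return [reconstructed_frames[min(max(t + d, 0), last)] for d in offsets]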
Step 120: determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network.
The preset feature extraction network may be a pre-trained neural network used to extract the spatio-temporal features between the image to be enhanced and the enhanced auxiliary images; it may be a deformable convolutional neural network with three-dimensional input, and it may be generated by training on a large number of reconstructed images.
The image to be enhanced and the enhanced auxiliary images may be input into the preset feature extraction network, whose processing determines their spatio-temporal feature map. The spatio-temporal feature map is the output of the preset feature extraction network; it reflects, in the form of a map, the spatio-temporal feature relationship between the image to be enhanced and the enhanced auxiliary images, which may include data representations of image features or degrees of pixel change.
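For illustration, the three-dimensional input of the preset feature extraction network might be assembled as below (PyTorch is assumed; the (N, C, D, H, W) layout and placing the image to be enhanced in the middle of the temporal stack are assumptions):

import torch

def build_extraction_input(image_to_enhance, auxiliary_images):
    # Stack the image to be enhanced with its auxiliary images along a
    # temporal depth axis D, so that a three-dimensional convolution can
    # see the frames jointly. Each frame is a (C, H, W) tensor.
    k = len(auxiliary_images) // 2
    frames = auxiliary_images[:k] + [image_to_enhance] + auxiliary_images[k:]
    return torch.stack(frames, dim=1).unsqueeze(0)  # -> (1, C, D, H, W)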
Step 130: process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The preset feature enhancement network may be a neural network model, such as a convolutional neural network, that processes the spatio-temporal feature map; it may be generated by training on a massive number of feature maps containing spatio-temporal features. Its output may be a two-dimensional image used to enhance the display of the image to be enhanced; this image may carry information corresponding to spatio-temporal features and/or intra-frame features. After such training, the network can generate a superimposed image from one or more items of spatio-temporal feature data in the feature map; the superimposed image may contain the information to be supplemented at each position of the image to be enhanced, such as luma values, chroma values and color values.
In this embodiment, the spatio-temporal feature map may be input into the preset feature enhancement network, which processes it and converts it into a superimposed image that supplements the image to be enhanced, so as to improve the display of the reconstructed image.
Step 140: process the image to be enhanced according to the superimposed image to generate a video enhanced image.
The superimposed image may be used to enhance the display of the image to be enhanced. For example, pixel values such as luma or chroma may be extracted from the superimposed image, and the corresponding regions of the image to be enhanced may be enhanced according to the average of those pixel values, i.e. increased or decreased by the corresponding amount; alternatively, the superimposed image may be directly superimposed on the image to be enhanced, increasing or decreasing, at each position, the pixel values carried by the superimposed image, and the image generated by the superposition is taken as the video enhanced image.
In this embodiment of the present application, an enhanced auxiliary image of the image to be enhanced is determined; the image to be enhanced and the enhanced auxiliary image are processed with a preset feature extraction network to obtain a spatio-temporal feature map; the feature map is processed with a preset feature enhancement network to generate a superimposed image; and the image to be enhanced is processed according to the superimposed image to generate a video enhanced image. Improving image display quality on the basis of the spatio-temporal features of the reconstructed video images improves the display effect of the video and enhances the user's viewing experience.
On the basis of the above embodiments, determining the enhanced auxiliary image of the image to be enhanced includes: acquiring, in time order in the reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
The time order may be the playback order of the video to which the reconstructed images correspond, and the threshold number is the number of reconstructed frames to extract; the numbers before and after the image to be enhanced may be the same or different. For example, 2 reconstructed frames may be extracted before the image to be enhanced and 3 after it as enhanced auxiliary images.
In this embodiment, in the reconstructed image set generated by decoding, a threshold number of reconstructed images may be extracted, in video playback order, from those preceding the image to be enhanced and from those following it, as the enhanced auxiliary images.
On the basis of the above embodiments, processing the image to be enhanced according to the superimposed image to generate a video enhanced image includes:
superimposing the superimposed image on the image to be enhanced, and taking the image generated by the superposition as the video enhanced image.
In this embodiment, the superimposed image may be superimposed on the image to be enhanced by adding to, or subtracting from, each position of the image to be enhanced the pixel value at the corresponding position of the superimposed image, and the processed image is taken as the video enhanced image.
On the basis of the above embodiments, the preset feature extraction network includes at least one three-dimensional deformable convolution residual block, and the three-dimensional deformable convolution residual block includes at least a three-dimensional deformable convolutional layer and an activation function.
In this embodiment, the preset feature extraction network may be a three-dimensional convolutional neural network composed of one or more convolution residual blocks, each containing at least a three-dimensional deformable convolutional layer and an activation function.
In an exemplary implementation, the preset feature extraction network may consist of several three-dimensional deformable convolution residual blocks. FIG. 3 is a schematic structural diagram of such a block: the image to be enhanced and the enhanced auxiliary images pass through a three-dimensional deformable convolutional layer, an activation function and another three-dimensional deformable convolutional layer, and the result is added to the block input before being output; the output serves as the input of the next three-dimensional deformable convolution residual block in the network. The activation function may be an LReLU, sigmoid or tanh function. The preset feature extraction network may contain N such blocks; the larger N, the better the video enhancement effect, but the parameter complexity of the whole network increases markedly and training and computation time grow accordingly.
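A minimal sketch of the FIG. 3 block follows (PyTorch assumed; nn.Conv3d stands in for the three-dimensional deformable convolution, whose learned sampling offsets are omitted here, and the channel count is illustrative):

import torch.nn as nn

class Residual3DBlock(nn.Module):
    # Conv -> activation -> conv, plus the identity shortcut of FIG. 3.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # LReLU, one of the listed options
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

Stacking N such blocks, each block's output feeding the next, yields the preset feature extraction network described above.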
Further, on the basis of the above embodiments, the preset feature enhancement network includes at least one convolution residual block, and the convolution residual block includes at least a convolutional layer and an activation function.
In this embodiment, the preset feature enhancement network may be a pre-trained convolutional neural network including convolutional layers and activation function layers; the spatio-temporal feature map passes through a two-dimensional convolutional layer, an activation function and another two-dimensional convolutional layer, and is then added to itself to form a residual, so as to strengthen the saliency of the spatio-temporal features in the feature map.
In an exemplary implementation, FIG. 4 is a schematic structural diagram of such a convolution residual block: the spatio-temporal feature map passes through a two-dimensional convolutional layer, an activation function and another two-dimensional convolutional layer and is added to itself to form a residual, where the two-dimensional convolution may be a two-dimensional deformable convolutional network (DCN) or a two-dimensional convolutional neural network (CNN).
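The FIG. 4 block admits a similar sketch (plain nn.Conv2d is used here, though the text also allows a two-dimensional deformable convolution (DCN) in its place; the channel count is again illustrative):

import torch.nn as nn

class Residual2DBlock(nn.Module):
    # 2-D conv -> activation -> 2-D conv with a skip connection, per FIG. 4.
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)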
The network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in the code stream and/or at the transport layer.
The network model describes the organizational structure of the preset feature enhancement network and the preset feature extraction network; it may also be called the network structure, model representation or network topology, and may include the number of convolutional layers, the number of pooling layers and the connections between convolutional and pooling layers. The network parameters may include the concrete weight coefficients and biases of the convolutional layers, pooling layers and activation functions in the network.
In this embodiment, the preset feature enhancement network and the preset feature extraction network may be transmitted in the code stream and/or at the transport layer. For example, the encoding end may encode their network models and network parameters into a code stream and send it to the decoding end; alternatively, the encoding end may send the network models and parameters to a server via the transport layer and transmit only the identification number of the preset feature enhancement network or preset feature extraction network in the code stream, and the decoding end then requests the network models and parameters from the server according to that identification number.
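The identifier-based variant could look roughly like this on the decoding end (a hypothetical sketch only; the URL layout and the serialized payload format are assumptions):

import urllib.request

def request_network_by_id(server_url, network_id):
    # The code stream carries only the identification number of the preset
    # network; the model and parameters are then fetched from the server.
    with urllib.request.urlopen("%s/networks/%s" % (server_url, network_id)) as resp:
        return resp.read()  # serialized network model and parameters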
On the basis of the above embodiments, the network model and the network parameters are located in at least one of the following: the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit, and a media track.
In this embodiment, the preset feature extraction network and the preset feature enhancement network may consist of a network model and network parameters, which may be transmitted in one or more of the video code stream, supplemental enhancement information of the video code stream, video usability information, a system-layer media attribute description unit and a media track.
In an exemplary implementation, the network model used by the preset feature extraction network and the preset feature enhancement network describes the organizational structure of the network and is designed before training; it may also be called the network structure, model representation or network topology. The network parameters, including but not limited to weights and biases, are obtained during network model training. Referring to FIG. 5, the network model and network parameters may be written into the video code stream at the encoding end and sent to the decoding end together with it, or transmitted separately out of band. One possible organization of the network model is the form adopted by PyTorch, as follows:
[The PyTorch-style model listing is provided as images PCTCN2022100898-appb-000001 and PCTCN2022100898-appb-000002 in the original document.]
The network parameters may be transmitted or stored in PyTorch's .pth format. In an implementation, the network model and network parameters may also adopt other formats, such as NNEF (Neural Network Exchange Format), ONNX (Open Neural Network Exchange), the TensorFlow format or the Caffe format.
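For instance, storing the parameters in .pth format and exporting to one of the interchange formats above might look as follows (a sketch; the file names and the example input are illustrative):

import torch

def save_parameters(net, path="feature_enhancement.pth"):
    # Store only the trained parameters (weights and biases) in PyTorch's
    # .pth format, as suggested above.
    torch.save(net.state_dict(), path)

def export_to_onnx(net, example_input, path="feature_enhancement.onnx"):
    # ONNX export needs a sample input to trace the network.
    torch.onnx.export(net, example_input, path)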
If the network model and network parameters are written into the video code stream, they may be written into the supplemental enhancement information (SEI) of the video code stream, for example with the structure shown in Table 1.
Table 1
[The SEI message syntax of Table 1 is provided as image PCTCN2022100898-appb-000003 in the original document.]
Similarly, the network model and network parameters may be written into the video usability information (VUI) of the video code stream.
If the network model and network parameters are written at the transport layer, they may be written into a system-layer media attribute description unit, for example a descriptor of the transport stream, a data unit of the file format (e.g. a Box), or media description information of the transport stream such as the Media Presentation Description (MPD).
For example, ISO/IEC 14496-12 ISO BMFF may be used to encapsulate the network model and network parameters.
The network model and network parameters for feature extraction and those for feature enhancement are stored in different media tracks; the type of data stored in a track, such as the network model or the network parameters, is identified by defining sample entries of different types (e.g. identified by four-character codes), and indication information for feature extraction and feature enhancement is given in the sample entry. The concrete network model and network parameters are stored in the samples of the media track. The indication information in a media track may be implemented as follows:
aligned(8) class NeutralNetworkInfo()
{
    unsigned int(1) feature_extraction_flag;
    unsigned int(1) feature_enhancement_flag;
    if (feature_extraction_flag == 1)
    {
        unsigned int(1) fext_nn_model_flag;
        unsigned int(1) fext_nn_parameter_flag;
    }
    if (feature_enhancement_flag == 1)
    {
        unsigned int(1) fenh_nn_model_flag;
        unsigned int(1) fenh_nn_parameter_flag;
    }
    bit(6) reserved = 0;
}
feature_extraction_flag indicates whether feature extraction network information is present: 1 means present, 0 means absent.
feature_enhancement_flag indicates whether feature enhancement network information is present: 1 means present, 0 means absent.
fext_nn_model_flag indicates whether the feature extraction network model is present: 1 means present, 0 means absent.
fext_nn_parameter_flag indicates whether the feature extraction network parameters are present: 1 means present, 0 means absent.
fenh_nn_model_flag indicates whether the feature enhancement network model is present: 1 means present, 0 means absent.
fenh_nn_parameter_flag indicates whether the feature enhancement network parameters are present: 1 means present, 0 means absent.
This indication information may be given at the file level, for example in the related MediaHeaderBox under the MediaInformationBox, or in another file-level data box (Box).
It may also be given at the media track level, for example in the corresponding sample entry.
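A reading of the flags defined above can be sketched as follows (assuming the bits are packed most-significant-bit first into the first byte, with the reserved bits following; the optional flags are present only when the corresponding top-level flag equals 1):

def parse_neutral_network_info(first_byte):
    bits = [(first_byte >> (7 - i)) & 1 for i in range(8)]
    info = {"feature_extraction_flag": bits[0],
            "feature_enhancement_flag": bits[1]}
    i = 2
    if info["feature_extraction_flag"] == 1:
        info["fext_nn_model_flag"] = bits[i]
        info["fext_nn_parameter_flag"] = bits[i + 1]
        i += 2
    if info["feature_enhancement_flag"] == 1:
        info["fenh_nn_model_flag"] = bits[i]
        info["fenh_nn_parameter_flag"] = bits[i + 1]
    return info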
In an implementation, regardless of the form in which the network model and network parameters for feature extraction and those for feature enhancement are stored or transmitted, each of them may be stored or transmitted independently.
On the basis of the above embodiments, the number N of three-dimensional deformable convolution residual blocks in the preset feature extraction network is determined according to the video attributes corresponding to the reconstructed images and/or the processing capability of the device.
The preset feature extraction network is a three-dimensional deformable convolutional neural network whose network model may include several three-dimensional deformable convolution residual blocks; their number may be decided by the video attributes of the compressed video corresponding to the reconstructed images and by the device's processing capability. The video attributes may be information reflecting the type of video, for example conference video or movie video; the device processing capability is the capability of the device performing the image enhancement, so that, for example, a high-performance device may use more three-dimensional deformable convolution residual blocks and a low-performance device fewer.
On the basis of the above embodiments, the video attribute may include at least one of the following: video type and application scenario.
In this embodiment of the application, the preset feature extraction network may be configured with different numbers of 3D deformable convolution residual blocks according to the video type and/or application scenario corresponding to the reconstructed image, so as to adapt the image display effect to different video types or application scenarios. For example, when the reconstructed image belongs to a video conference, a preset feature extraction network with fewer 3D deformable convolution residual blocks may be selected to extract the spatio-temporal features of the reconstructed image so as to satisfy the real-time requirement of the video; when the video corresponding to the reconstructed image is played on a movie website, more 3D deformable convolution residual blocks may be selected to extract the spatio-temporal features so as to satisfy the high-quality requirement of the video.
In an exemplary embodiment, the number N of 3D deformable convolution residual blocks in the preset feature extraction network may be set according to the video type or application scenario, or according to the actual computing power and resources. For example, if the computing power of the encoding end is strong, more 3D deformable convolution residual blocks may be used to extract features better. In an embodiment, the encoding end may train network models containing different numbers of 3D deformable convolution residual blocks and use different network models according to the needs of the decoding end.
Similarly, the number M of convolution residual blocks in the preset feature enhancement network may also be set according to the video type or application scenario, or according to the actual computing power and resources. For example, if the computing power of the decoding end is weak, fewer 2D convolution residual blocks may be used; although the feature enhancement effect is slightly worse, the real-time performance of the decoding end is guaranteed.
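One way to make this trade-off concrete is a lookup keyed by video type and device tier, as in the Python sketch below. The table PROFILE, its keys, and the specific block counts are illustrative assumptions, not values prescribed by this application.

    # Hypothetical profile table mapping (video type, device tier) to block counts.
    PROFILE = {
        ('conference', 'low'):  {'N': 2, 'M': 2},   # favour real-time operation
        ('conference', 'high'): {'N': 4, 'M': 4},
        ('movie',      'low'):  {'N': 4, 'M': 2},
        ('movie',      'high'): {'N': 8, 'M': 6},   # favour quality
    }

    def select_block_counts(video_type, device_tier):
        # Returns N (feature extraction blocks) and M (feature enhancement blocks).
        return PROFILE.get((video_type, device_tier), {'N': 4, 'M': 4})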
The network model may be sent from the encoding end to the decoding end, or stored on a server; if the network model is stored on a server, the decoding end obtains the network model from the server.
On the basis of the above embodiments, the method further includes: separately training at least one preset feature extraction network and at least one preset feature enhancement network for each video type and/or application scenario.
The preset feature extraction network and the preset feature enhancement network may be trained in advance for different video types and/or application scenarios, and the preset feature extraction network and preset feature enhancement network used when processing the image to be enhanced may differ across video types and application scenarios.
In an exemplary embodiment, the preset feature extraction network and the preset feature enhancement network may be neural networks with a fixed network model, and multiple sets of network parameters may be trained according to the video type or application scenario. For example, there may be one set of network parameters each for strenuous-motion scenes, video conferencing scenes, and surveillance scenes. The encoding end selects the network parameters corresponding to the current video type, performs feature extraction to generate a spatio-temporal feature map, and then sends the spatio-temporal feature map and the corresponding feature enhancement network parameters to the decoding end.
The manner in which the encoding end and decoding end use multiple sets of network models is not limited. The encoding end may send the currently used set of network parameters to the decoding end and retransmit a new set when it switches to another set. Alternatively, all network parameters may be sent once the communication link between the encoding end and the decoding end is established; during communication the encoding end then only sends the index of the currently used network parameters, and the decoding end selects the corresponding network parameters according to the index. The encoding end and decoding end may also share default network parameters, in which case nothing needs to be sent: the decoding end uses the default network parameters, or simply selects the corresponding network parameters according to an index.
The network parameters may also be stored on a server; the encoding end then only needs to send the network parameter index, and the decoding end requests the corresponding network parameters from the server according to the index information.
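A minimal sketch of this index-based variant follows, assuming a registry of trained parameter sets and a fetch_from_server callable that downloads a parameter file; the registry contents, file names, and indices are hypothetical.

    # Hypothetical registry of trained parameter sets, indexed as in the text.
    PARAMETER_SETS = {
        0: 'params_high_motion.bin',   # strenuous-motion scenes
        1: 'params_conference.bin',    # video conferencing scenes
        2: 'params_surveillance.bin',  # surveillance scenes
    }

    def decoder_fetch_params(index, fetch_from_server):
        # Decoder side: resolve an index signalled by the encoder to parameters.
        try:
            return fetch_from_server(PARAMETER_SETS[index])
        except KeyError:
            raise ValueError(f"unknown network parameter index {index}")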
On the basis of the above embodiments, the method further includes:
    weighting the information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
The weight parameter may be a parameter reflecting the display priority of different regions in the reconstructed image. For example, if the centre of the picture needs to be highlighted, a larger weight parameter may be set for the picture centre, while the four picture corners, which viewers pay little attention to, may be given a smaller weight parameter. Weight parameters may also reflect the display priority among different frames; for example, key frames among the reconstructed images may use larger weight parameters.
In this embodiment of the application, weight parameters may be used to weight the information of the image to be enhanced and the enhanced auxiliary image. The weight parameters may be preset: different weight parameters may be set for different regions of an image, for images of different frames, or for different content displayed in an image. As another example, the luminance component of an image may be multiplied by the weight parameter value before being input into the feature extraction network.
On the basis of the above embodiments, different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
The image to be enhanced and the enhanced auxiliary image may each be divided into multiple regions, and a different weight parameter may be set for each region. For example, they may be divided into the image centre and the four image corners, or into image content and image background regions, and the weight parameter values set for different regions may differ.
On the basis of the above embodiments, different weight parameters are set for different enhancement participating images, where the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
In this embodiment of the application, the image to be enhanced and the enhanced auxiliary image may be collectively referred to as enhancement participating images, and a different weight parameter may be set for each single-frame enhancement participating image.
In an exemplary embodiment, a different weight parameter may be set for each frame among the image to be enhanced and the enhanced auxiliary images, and feature extraction is performed after each frame is weighted. For example, the weight value may be determined by the temporal distance from the current frame on the video timeline: if the current frame is at time t, the weight parameter of the reconstructed image at time t-1 is larger than that of the reconstructed image at time t-2. The weight parameter may also be determined by the importance of the reconstructed image in the decoding process: for example, I frames are key frames while P frames and B frames are non-key frames, so the weight parameters of I frames among the image to be enhanced and the enhanced auxiliary images may be larger than those of P frames and B frames.
In another exemplary embodiment, the information of a single frame among the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, may be weighted with different weights; for example, regions with different quantization parameters may be weighted differently before feature extraction.
In another exemplary embodiment, a single frame among the reconstructed images, such as the image to be enhanced and the enhanced auxiliary images, may be divided into regions with a different weight parameter set for each region, and feature extraction, feature enhancement, and other operations are performed after the frame is weighted; for example, a high weight parameter may be used for regions containing people and a low weight parameter for the background region.
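A minimal sketch of the region weighting described above, assuming the frame is a single 2-D plane (for example the luminance component); the centre/border split and the weight values are illustrative, not prescribed.

    import numpy as np

    def weight_frame(frame, center_weight=1.5, corner_weight=0.8):
        # frame: one 2-D plane (e.g. the luminance component) as a float array.
        h, w = frame.shape
        weights = np.full((h, w), corner_weight, dtype=np.float32)
        # the central half of the picture gets the larger weight
        weights[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = center_weight
        return frame.astype(np.float32) * weights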
On the basis of the above embodiments, the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
A component may be a component of the image information, such as a luminance/chrominance component or a colour component. When enhancement is performed, the image to be enhanced and the enhanced auxiliary image may use one or more components of the reconstructed image for image enhancement. For example, if the reconstructed image is a red-green-blue (RGB) image, the image formed by the R component or the image formed by the G component may serve as the image to be enhanced or the enhanced auxiliary image.
In an exemplary embodiment, the image to be enhanced and the enhanced auxiliary image may be only one component of the reconstructed image, or multiple components. For example, if the reconstructed image consists of luminance and chrominance (YUV) components, image enhancement may be performed by applying feature extraction and feature enhancement to the luminance component alone, to the chrominance components alone, or to the luminance and chrominance components together. If the reconstructed image consists of RGB (Red, Green, Blue) components, the three components may each undergo feature extraction and feature enhancement separately, or the three components may undergo feature extraction and feature enhancement as a whole.
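The luminance-only variant can be sketched as follows, assuming the YUV planes are stored separately and using enhance_fn as a placeholder for the feature extraction plus feature enhancement pipeline; both names and the storage layout are assumptions.

    def enhance_luma_only(yuv_frame, enhance_fn):
        # yuv_frame: dict with 2-D 'Y', 'U', 'V' planes (assumed storage layout).
        out = dict(yuv_frame)
        out['Y'] = enhance_fn(yuv_frame['Y'])  # chroma planes pass through unchanged
        return out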
On the basis of the above embodiments, the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
In this embodiment of the application, the image to be enhanced may be a partial region of the reconstructed image, for example the picture centre or the four picture corners; before image enhancement, a partial region may be cropped from the reconstructed image for enhancement.
In an exemplary embodiment, referring to Fig. 6, only a partial region of the reconstructed image may be cropped for feature extraction and feature enhancement. The enhanced video image may then be an enhanced image A of the same size as the cropped region, generated by superimposition over that partial region of the current reconstructed image only; alternatively, the enhancement map may be superimposed on the corresponding cropped region of the current reconstructed image to generate an enhanced image B of the same size as the reconstructed image.
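The two outputs described for Fig. 6 can be sketched as follows; enhance_fn again stands for the feature extraction plus feature enhancement pipeline and is a placeholder, and index bounds are assumed valid.

    import numpy as np

    def enhance_region(recon, enhance_fn, top, left, height, width, full_size=True):
        crop = recon[top:top + height, left:left + width]
        enhanced = enhance_fn(crop)        # feature extraction + enhancement
        if not full_size:
            return enhanced                # image A: same size as the cropped region
        out = np.array(recon, copy=True)   # image B: same size as the reconstruction
        out[top:top + height, left:left + width] = enhanced
        return out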
On the basis of the above embodiments, the network parameters of the preset feature extraction network and the preset feature enhancement network may be updated during the image enhancement process; for example, the network parameters may be adjusted based on the image enhancement effect after each use, and either all network parameters or only some of them may be adjusted. The encoding end may also send only the adjusted network parameters to the decoding end.
In an exemplary embodiment, feature extraction is first performed on the current video reconstructed image and multiple adjacent reconstructed frames to generate a spatio-temporal feature map; feature enhancement is then performed on the spatio-temporal feature map to generate an enhancement map; finally, the current video reconstructed image and the enhancement map are added to obtain the enhanced image. The processing block diagram is shown in Fig. 7. Fig. 8 is a flowchart of a video enhancement processing method provided by an embodiment of the present application; referring to Fig. 8, the method of this embodiment includes the following steps:
Step S101: input multiple frames of reconstructed images.
A reconstructed image here refers to an image obtained by compressing and encoding the original video image into video data and then decoding that video data, i.e. a reconstructed image bearing compression distortion. The multi-frame input consists of the current reconstructed image together with multiple reconstructed frames before and after it on the timeline.
The reconstructed images may be generated during video encoding, where they serve as reference images for inter-frame coding, or during video decoding. Multi-frame reconstructed images are several reconstructed frames before and after the current reconstructed image on the timeline. They may be temporally adjacent: if the current reconstructed image is at time t, the two preceding frames t-2 and t-1 and the two following frames t+1 and t+2 may be selected, giving five reconstructed frames in total as input. They may also be selected at an interval: with the current image at time t and an interval of one frame, the preceding frames t-4 and t-2 and the following frames t+2 and t+4 are selected, again giving five reconstructed frames as input. They may also be selected according to a rule, for example two I-frame (intra-coded frame) reconstructed images before and two after the current reconstructed frame. The multiple frames may also be related in ways other than timeline order, for example all containing a certain object, or having image sizes in a certain proportional relationship.
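The adjacent and interval selection rules can be sketched as follows; the function name and the omission of bounds clamping are simplifications for illustration.

    def select_input_frames(frames, t, mode='adjacent', count=2, step=2):
        # frames: reconstructed frames indexed by time; t: current frame index.
        # Returns the current frame plus `count` frames on each side of it.
        if mode == 'adjacent':                 # t-2, t-1, t, t+1, t+2
            offsets = range(-count, count + 1)
        elif mode == 'interval':               # t-4, t-2, t, t+2, t+4 for step=2
            offsets = range(-count * step, count * step + 1, step)
        else:
            raise ValueError(mode)
        return [frames[t + o] for o in offsets]  # bounds clamping omitted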
Step S102: feature extraction to generate a spatio-temporal feature map.
Feature extraction is performed on the input multi-frame reconstructed images, as shown in Fig. 9. The multi-frame images pass through multiple layers of 3D deformable convolution residual blocks (Residual Block) to generate feature information, and the feature information is then fused by convolution to generate a spatio-temporal feature map (Feature Map). Each 3D deformable convolution residual block may include 3D deformable convolution layers and an activation function: the multi-frame data input passes through a 3D deformable convolution (DCN3D), an activation function (Activation Function), and another 3D deformable convolution (DCN3D), is then added to itself, and the output serves as the input of the next module. The multi-frame data may be the multi-frame reconstructed images or the output data of the previous module. The activation function may be LReLU (Leaky Rectified Linear Activation) or another activation function. There may be N 3D deformable convolution residual blocks; increasing their number improves the quality of the enhanced video, but as the blocks increase, the parameter complexity of the whole network rises markedly and network training and computation take considerably more time.
The feature information generated after the N 3D deformable convolution residual blocks is fused by a convolution module (Bottleneck) to generate the spatio-temporal feature map, whose size depends on the image size and the number of features.
A convolution module is added before the 3D deformable convolution residual blocks to map low-order features to high-order features and increase the number of features.
3D deformable convolution extends 2D deformable convolution (DCN) to three dimensions. As shown in Fig. 10, a convolution first generates a 3D offset, and the 3D offset is then used in a convolution operation on the input features to obtain the output features; the input features may be the multi-frame reconstructed images or the output features of the previous module.
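The residual structure of one block can be sketched in PyTorch as follows. Since no off-the-shelf 3D deformable convolution is assumed to be available, a plain Conv3d stands in for DCN3D so that the sketch runs as written; a true DCN3D layer would additionally predict a 3-D offset field and sample the input accordingly, as described for Fig. 10. The channel count and LeakyReLU slope are illustrative.

    import torch
    import torch.nn as nn

    class DeformableConv3dResBlock(nn.Module):
        # Sketch of one 3D deformable convolution residual block.
        # Input/output shape: (batch, channels, frames, height, width).
        def __init__(self, channels=64):
            super().__init__()
            # Plain Conv3d stands in for DCN3D (offset prediction omitted).
            self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.LeakyReLU(0.1, inplace=True)
            self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            # residual connection: conv -> activation -> conv, added to the input
            return self.conv2(self.act(self.conv1(x))) + x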
Step S103: perform feature enhancement on the spatio-temporal feature map.
The feature enhancement process is shown in Fig. 11. The spatio-temporal feature map passes through multiple convolution residual blocks and then through one convolution, e.g. a 1x1 conv, to recover an enhancement map, i.e. an overlay map, of the same size as the current reconstructed image. The number M of these convolution residual blocks is not necessarily equal to the number of 3D deformable convolution residual blocks in the feature extraction process. A convolution residual block includes 2D convolution layers and an activation function: the input data passes through a 2D convolution, an activation function, and another 2D convolution, and is then added to itself to generate the residual. The 2D convolution may be a 2D deformable convolution (DCN) or a 2D convolutional neural network (CNN).
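Under the same assumptions as the previous sketch (plain Conv2d standing in for the optional 2D deformable convolution; channel count and M illustrative), the enhancement branch and the final addition of step S104 can be sketched together:

    import torch
    import torch.nn as nn

    class FeatureEnhancement(nn.Module):
        # M 2-D conv residual blocks, then a 1x1 conv that collapses the
        # features into a one-channel enhancement (overlay) map.
        def __init__(self, channels=64, M=4):
            super().__init__()
            def block():
                return nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.LeakyReLU(0.1, inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1),
                )
            self.blocks = nn.ModuleList(block() for _ in range(M))
            self.fuse = nn.Conv2d(channels, 1, kernel_size=1)  # the 1x1 conv

        def forward(self, feature_map, reconstruction):
            x = feature_map
            for b in self.blocks:
                x = b(x) + x                  # residual connection per block
            residual = self.fuse(x)           # enhancement map, same size as frame
            return reconstruction + residual  # step S104: add to current frame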
Step S104: generate the enhanced image.
The enhancement map (overlay map) generated in step S103 is superimposed on the current reconstructed image to generate the enhanced image.
Fig. 12 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. This embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally integrated at the video encoding end. Referring to Fig. 12, the method provided by this embodiment of the application includes the following steps:
Step 210: determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
Step 220: determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
Step 230: transmit the spatio-temporal feature map and the preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the feature enhancement network may be sent to the decoding end, and the decoding end processes the image to be enhanced according to the spatio-temporal feature map and the preset feature enhancement network to improve its display effect. The spatio-temporal feature map and the preset feature enhancement network may be transmitted directly to the decoding end, or first uploaded to a server, from which the decoding end obtains them by sending an acquisition request.
On the basis of the above embodiments, before transmitting the spatio-temporal feature map and the preset feature enhancement network, the method further includes:
performing compression coding on the spatio-temporal feature map and the preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the preset feature enhancement network may be compression-coded so as to reduce the amount of transmitted data and improve transmission efficiency.
In an exemplary embodiment, the preset feature extraction network, the preset feature enhancement network, and the spatio-temporal feature map may be compressed during transmission to reduce the data volume and facilitate transmission or storage. The spatio-temporal feature map and the network models and network parameters of the preset feature extraction network and preset feature enhancement network may use lossless compression, such as Huffman coding or arithmetic coding. The network model may be compressed through methods such as parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The network parameters may also use lossy compression coding; for example, quantization may be used to reduce the required amount of data.
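A minimal sketch of the lossy variant, using uniform 8-bit quantization of a flat parameter array; the bit depth and the choice of uniform quantization are illustrative assumptions, not the application's prescribed scheme.

    import numpy as np

    def quantize_params(params, bits=8):
        # Uniform quantization of a float parameter array to unsigned integers.
        lo, hi = params.min(), params.max()
        scale = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against constant arrays
        q = np.round((params - lo) / scale).astype(np.uint8)
        return q, lo, scale   # transmit q plus (lo, scale) for dequantization

    def dequantize_params(q, lo, scale):
        return q.astype(np.float32) * scale + lo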
Fig. 13 is a flowchart of a video enhancement processing method provided by an embodiment of the present application. This embodiment is applicable to enhancing the image display quality of decoded video. The method may be executed by a video enhancement processing apparatus, which may be implemented in software and/or hardware and is generally integrated at the video decoding end. Referring to Fig. 13, the method provided by this embodiment of the application includes the following steps:
Step 310: receive a spatio-temporal feature map and a preset feature enhancement network.
In this embodiment of the application, the spatio-temporal feature map and the preset feature enhancement network may be sent directly from the encoding end to the decoding end, or downloaded to the decoding end from a server to which the encoding end has uploaded them.
Step 320: process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image.
Step 330: process the image to be enhanced according to the superimposed image to generate a video enhanced image.
The image to be enhanced may be generated by decoding a bitstream, which may be sent by the encoding end and received by the decoding end.
After the spatio-temporal feature map is processed, information at each position of the superimposed image, such as chrominance, luminance, and colour values, may be extracted and used to enhance the display of the corresponding regions of the image to be enhanced; alternatively, the superimposed image may be directly added to the image to be enhanced, and the resulting image used as the video enhanced image.
In an exemplary embodiment, Fig. 14 is an example diagram of video enhancement processing provided by an embodiment of the present application. Referring to Fig. 14, the encoding end performs feature extraction on the multi-frame coded reconstructed images generated during encoding to produce a spatio-temporal feature map, and transmits the spatio-temporal feature map together with the network model and network parameters of the feature enhancement network to the decoding end; the decoding end enhances the decoded reconstructed image according to them. The spatio-temporal feature map and the network model and parameters of the feature enhancement network may be transmitted during the enhancement process, either separately or in combination, and may be written into the video bitstream or transmitted out of band, independently of the video bitstream.
In another exemplary embodiment, Fig. 15 is an example diagram of another video enhancement processing provided by an embodiment of the present application. Referring to Fig. 15, the network models and network parameters for feature extraction and feature enhancement may be used only at the decoding end: the decoding end decodes the video bitstream and then uses the feature extraction and feature enhancement network models and parameters to enhance the decoded reconstructed image. If they are used only at the decoding end, the spatio-temporal feature map output by feature extraction can be fed directly into feature enhancement without being stored separately.
The decoding end may obtain the feature extraction and feature enhancement network models and parameters by reading a local file, from the server, or from the encoding end.
Fig. 16 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the encoding end, and includes: an image extraction module 401, a feature map module 402, a feature enhancement module 403, and an enhanced image module 404.
The image extraction module 401 is configured to determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The feature map module 402 is configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
The feature enhancement module 403 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The enhanced image module 404 is configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
Fig. 17 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the encoding end, and includes: an image extraction module 501, a feature map module 502, and an encoding sending module 503.
The image extraction module 501 is configured to determine an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data.
The feature map module 502 is configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network.
The encoding sending module 503 is configured to transmit the spatio-temporal feature map and the preset feature enhancement network.
Fig. 18 is a schematic structural diagram of a video enhancement processing apparatus provided by an embodiment of the present application. It can execute the video enhancement processing method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. The apparatus may be implemented in software and/or hardware and is generally integrated at the decoding end, and includes: a decoding receiving module 601, a feature enhancement module 602, and an enhanced image module 603.
The decoding receiving module 601 is configured to receive a spatio-temporal feature map and a preset feature enhancement network.
The feature enhancement module 602 is configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image.
The enhanced image module 603 is configured to process the image to be enhanced according to the superimposed image to generate a video enhanced image.
On the basis of the above embodiments, the preset feature extraction network at the encoding end and/or decoding end includes at least one 3D deformable convolution residual block, and the 3D deformable convolution residual block includes at least 3D deformable convolution layers and an activation function.
On the basis of the above embodiments, the preset feature enhancement network in the apparatus at the encoding end and/or decoding end includes at least one convolution residual block, and the convolution residual block includes at least convolution layers and an activation function.
On the basis of the above embodiments, the network models and network parameters of the preset feature enhancement network and the preset feature extraction network in the apparatus at the encoding end and/or decoding end are transmitted in the bitstream and/or the transport layer.
On the basis of the above embodiments, the network model and the network parameters in the apparatus at the encoding end and/or decoding end are located in at least one of the following: the video bitstream, supplemental enhancement information of the video bitstream, video application information, a system-layer media attribute description unit, and a media track.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the number N of 3D deformable convolution residual blocks in the preset feature extraction network is determined according to the video attribute corresponding to the reconstructed image and/or the device processing performance.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the video attribute includes at least one of the following: video type and application scenario.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
a network training module, configured to separately train at least one preset feature extraction network and at least one preset feature enhancement network for each video type and/or application scenario.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
a weighting module, configured to weight the information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, different weight parameters are set for different enhancement participating images, where the enhancement participating images include the image to be enhanced and the enhanced auxiliary image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
On the basis of the above embodiments, in the apparatus at the encoding end and/or decoding end, the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
On the basis of the above embodiments, the image extraction module in the apparatus at the encoding end and/or decoding end is configured to: obtain, in temporal order from a reconstructed image set generated by video decoding, a threshold number of reconstructed images before and/or after the image to be enhanced as enhanced auxiliary images, where the reconstructed image set includes at least two frames of reconstructed images.
On the basis of the above embodiments, the enhanced image module in the apparatus at the encoding end and/or decoding end is configured to: superimpose the spatio-temporal feature map enhanced with image spatio-temporal features on the image to be enhanced, and use the image generated by the superimposition as the video enhanced image.
On the basis of the above embodiments, the apparatus at the encoding end and/or decoding end further includes:
an encoding compression module, configured to perform compression coding on the spatio-temporal feature map and the preset feature enhancement network.
In an exemplary embodiment, the video enhancement processing apparatus provided by an embodiment of the present application may include the following modules:
a feature extraction module A01, configured to extract features from multi-frame reconstructed images;
a video encoding module A02, configured to encode network parameters and the spatio-temporal feature map, and to output coded reconstructed images to the feature extraction module A01;
a transmission module A03, configured to transmit the coded video data, and optionally to encode and transmit the network parameters and the spatio-temporal feature map;
a feature enhancement module A04, configured to perform feature enhancement and generate the enhancement map;
a video decoding module A05, configured to decode the network parameters and the spatio-temporal feature map from the video data and to reconstruct images;
a transmission module A06, configured to transmit the compressed video data, and optionally to decode the network parameters and the spatio-temporal feature map.
The above feature extraction module A01, video encoding module A02, transmission module A03, feature enhancement module A04, video decoding module A05, and transmission module A06 may be implemented using dedicated hardware, or hardware capable of performing processing in combination with appropriate software. Such hardware or dedicated hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and the like. When implemented by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or multiple independent processors, some of which may be shared. Moreover, a processor should not be understood as referring exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage devices.
The apparatus of this embodiment may be a device used in video applications, for example a mobile phone, a computer, a server, a set-top box, a portable mobile terminal, a digital video camera, or television broadcasting system equipment.
Fig. 19 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a processor 70, a memory 71, an input apparatus 72, and an output apparatus 73. The number of processors 70 in the electronic device may be one or more; one processor 70 is taken as an example in Fig. 19. The processor 70, memory 71, input apparatus 72, and output apparatus 73 in the electronic device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 19.
As a computer-readable storage medium, the memory 71 may be used to store software programs, computer-executable programs, and modules, such as the modules corresponding to the video enhancement processing apparatus in the embodiments of the present application (the image extraction module 401, feature map module 402, feature enhancement module 403, and enhanced image module 404; or the image extraction module 501, feature map module 502, and encoding sending module 503; or the decoding receiving module 601, feature enhancement module 602, and enhanced image module 603). The processor 70 runs the software programs, instructions, and modules stored in the memory 71 to execute the various functional applications and data processing of the electronic device, i.e. to implement the video enhancement processing methods described above.
The memory 71 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 71 may include a high-speed random access memory and may also include a non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 71 may further include memories remotely located relative to the processor 70, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 72 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output apparatus 73 may include a display device such as a display screen.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a video enhancement processing method, the method including:
determining an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network;
processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image;
processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
Or:
determining an enhanced auxiliary image of the image to be enhanced, where the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image, and a preset feature extraction network;
transmitting the spatio-temporal feature map and the preset feature enhancement network.
Or:
receiving a spatio-temporal feature map and a preset feature enhancement network;
processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image;
processing the image to be enhanced according to the superimposed image to generate a video enhanced image.
From the above description of the implementations, those skilled in the art can clearly understand that the present application may be implemented by software plus necessary general-purpose hardware, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present application.
It is worth noting that, in the above apparatus embodiments, the included units and modules are divided merely according to functional logic, but the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are merely for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application.
Those of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above and the functional modules/units in the systems and devices may be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or another transport mechanism, and may include any information delivery media.
Several embodiments of the present application have been described above with reference to the accompanying drawings, but the scope of rights of the present application is not limited thereby. Any modifications, equivalent replacements, and improvements made by those skilled in the art without departing from the scope and essence of the present application shall fall within the scope of rights of the present application.

Claims (23)

  1. 一种视频增强处理方法,所述方法包括:A video enhancement processing method, the method comprising:
    确定待增强图像的增强辅助图像,其中,所述增强辅助图像和所述待增强图像为压缩视频数据解码生成的重建图像;determining an enhanced auxiliary image of the image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    基于所述待增强图像、所述增强辅助图像和预设特征提取网络确定时空特征图;determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network;
    根据预设特征增强网络处理所述时空特征图以生成叠加图像;processing the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image;
    根据所述叠加图像处理所述待增强图像以生成视频增强图像。The image to be enhanced is processed according to the superimposed image to generate a video enhanced image.
  2. 根据权利要求1所述的方法,其中,所述预设特征提取网络包括至少一个三维可变形卷积残差块,所述三维可变形卷积残差块至少包括三维可变形卷积层和激活函数。The method according to claim 1, wherein the preset feature extraction network comprises at least one 3D deformable convolutional residual block, and the 3D deformable convolutional residual block comprises at least a 3D deformable convolutional layer and an activation function.
  3. 根据权利要求1所述的方法,其中,所述预设特征增强网络包括至少一个卷积残差块,所述卷积残差块至少包括卷积层和激活函数。The method according to claim 1, wherein the preset feature enhancement network includes at least one convolutional residual block, and the convolutional residual block includes at least a convolutional layer and an activation function.
  4. The method according to claim 1, wherein the network model and network parameters of the preset feature enhancement network and the preset feature extraction network are transmitted in a code stream and/or a transport layer.
  5. The method according to claim 4, wherein the network model and the network parameters are located in at least one of the following: a video code stream, supplemental enhancement information of a video code stream, video application information, a system-layer media attribute description unit, or a media track.
  6. The method according to claim 2, wherein the number N of the three-dimensional deformable convolution residual blocks in the preset feature extraction network is determined according to a video attribute corresponding to the reconstructed image and/or device processing performance.
  7. The method according to claim 6, wherein the video attribute comprises at least one of the following: a video type or an application scenario.
  8. The method according to claim 1, further comprising:
    training at least one preset feature extraction network and at least one preset feature enhancement network separately for video types and/or application scenarios.
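As an illustration of claim 8 only: a sketch of per-scenario training, assuming (reconstructed stack, frame to enhance, original frame) triplets for each video type or application scenario. The claims do not fix a loss function, so the MSE objective below is an assumption.

```python
import torch
import torch.nn.functional as F

def train_for_scenario(triplets, feat_net, enh_net, steps=1000, lr=1e-4):
    """Claim 8 sketch: one (feat_net, enh_net) pair is trained per video
    type / application scenario. `triplets` yields (recon_stack, center,
    original): the decoded frame stack, the frame to enhance, and its
    uncompressed counterpart. MSE is an assumed training objective."""
    params = list(feat_net.parameters()) + list(enh_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _, (recon_stack, center, original) in zip(range(steps), triplets):
        residual = enh_net(feat_net(recon_stack))
        loss = F.mse_loss(center + residual, original)
        opt.zero_grad()
        loss.backward()
        opt.step()
```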
  9. The method according to claim 1, further comprising:
    weighting information of the image to be enhanced and the enhanced auxiliary image using weight parameters.
  10. The method according to claim 9, wherein different weight parameters are set for different regions of the image to be enhanced and the enhanced auxiliary image.
  11. The method according to claim 9, wherein different weight parameters are set for different enhancement-participating images, wherein the enhancement-participating images include the image to be enhanced and the enhanced auxiliary image.
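An illustrative sketch of the weighting of claims 9-11, assuming per-image scalar weights (claim 11) and optional, hypothetical per-region weight maps (claim 10):

```python
import torch

def weight_participating_images(frames, frame_weights, region_maps=None):
    """Claims 9-11 sketch: apply one weight per participating image and,
    optionally, per-pixel/per-region weight maps. `region_maps` is a
    hypothetical list of (B, 1, H, W) weight tensors."""
    weighted = []
    for i, frame in enumerate(frames):
        w = frame * frame_weights[i]      # per-image weight (claim 11)
        if region_maps is not None:
            w = w * region_maps[i]        # per-region weights, broadcast over channels (claim 10)
        weighted.append(w)
    return torch.stack(weighted, dim=2)   # (B, C, T, H, W) stack for the feature network
```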
  12. The method according to claim 1, wherein the image to be enhanced and the enhanced auxiliary image are at least one component of the reconstructed image.
  13. The method according to claim 1, wherein the image to be enhanced and the enhanced auxiliary image are partial regions of the reconstructed image.
  14. The method according to claim 1, wherein the determining an enhanced auxiliary image for the image to be enhanced comprises:
    acquiring, in chronological order, a threshold number of reconstructed images before and/or after the image to be enhanced from a reconstructed image set generated by video decoding as enhanced auxiliary images, wherein the reconstructed image set includes at least two frames of reconstructed images.
  15. The method according to claim 1, wherein the processing the image to be enhanced according to the superimposed image to generate a video-enhanced image comprises:
    superimposing the superimposed image on the image to be enhanced, and using the image generated after superposition as the video-enhanced image.
  16. A video enhancement processing method, the method comprising:
    determining an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    determining a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and
    transmitting the spatio-temporal feature map and a preset feature enhancement network.
  17. The method according to claim 16, wherein, before the transmitting the spatio-temporal feature map and the preset feature enhancement network, the method further comprises:
    performing compression encoding on the spatio-temporal feature map and the preset feature enhancement network.
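Illustrative only: a sketch of the sending side of claims 16-17, where `torch.save` is merely a placeholder for the compression encoding of claim 17, not the codec the patent contemplates.

```python
import io
import torch

def encode_for_transmission(stack, feat_net, enh_net):
    """Claims 16-17 sketch: the sending side derives the spatio-temporal
    feature map and transmits it together with the preset feature
    enhancement network. Serialization here stands in for the
    compression encoding step of claim 17."""
    features = feat_net(stack)
    payload = io.BytesIO()
    torch.save({"features": features, "enh_net": enh_net.state_dict()}, payload)
    return payload.getvalue()   # bytes for the code stream / transport layer
```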
  18. A video enhancement processing method, the method comprising:
    receiving a spatio-temporal feature map and a preset feature enhancement network;
    processing the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and
    processing an image to be enhanced according to the superimposed image to generate a video-enhanced image.
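Illustrative only: the matching receiving side of claim 18, restoring the transmitted feature map and enhancement network and applying the superposition; the payload format matches the hypothetical sketch for claims 16-17 above.

```python
import io
import torch

def receive_and_enhance(payload, frame_to_enhance, enh_net):
    """Claim 18 sketch: the receiving side restores the feature map and
    the enhancement network, regenerates the superimposed image, and
    adds it to the image to be enhanced."""
    blob = torch.load(io.BytesIO(payload))
    enh_net.load_state_dict(blob["enh_net"])   # preset feature enhancement network
    residual = enh_net(blob["features"])       # superimposed image
    return frame_to_enhance + residual         # video-enhanced image
```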
  19. A video enhancement processing apparatus, the apparatus comprising:
    an image extraction module, configured to determine an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network;
    a feature enhancement module, configured to process the spatio-temporal feature map according to a preset feature enhancement network to generate a superimposed image; and
    an enhanced image module, configured to process the image to be enhanced according to the superimposed image to generate a video-enhanced image.
  20. A video enhancement processing apparatus, the apparatus comprising:
    an image extraction module, configured to determine an enhanced auxiliary image for an image to be enhanced, wherein the enhanced auxiliary image and the image to be enhanced are reconstructed images generated by decoding compressed video data;
    a feature map module, configured to determine a spatio-temporal feature map based on the image to be enhanced, the enhanced auxiliary image and a preset feature extraction network; and
    an encoding and sending module, configured to transmit the spatio-temporal feature map and a preset feature enhancement network.
  21. A video enhancement processing apparatus, the apparatus comprising:
    a decoding and receiving module, configured to receive a spatio-temporal feature map and a preset feature enhancement network;
    a feature enhancement module, configured to process the spatio-temporal feature map according to the preset feature enhancement network to generate a superimposed image; and
    an enhanced image module, configured to process an image to be enhanced according to the superimposed image to generate a video-enhanced image.
  22. An electronic device, comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the video enhancement processing method according to any one of claims 1-15, 16-17 and 18.
  23. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video enhancement processing method according to any one of claims 1-15, 16-17 or 18.
PCT/CN2022/100898 2021-06-23 2022-06-23 Video enhancement processing methods and apparatus, electronic device and storage medium WO2022268181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110697703.4 2021-06-23
CN202110697703.4A CN115511756A (en) 2021-06-23 2021-06-23 Video enhancement processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022268181A1 (en)

Family

ID=84500144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100898 WO2022268181A1 (en) 2021-06-23 2022-06-23 Video enhancement processing methods and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN115511756A (en)
WO (1) WO2022268181A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385302A (en) * 2023-04-07 2023-07-04 北京拙河科技有限公司 Dynamic blur elimination method and device for optical group camera

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190297276A1 (en) * 2018-03-20 2019-09-26 EndoVigilant, LLC Endoscopy Video Feature Enhancement Platform
CN112381716A (en) * 2020-11-18 2021-02-19 爱像素(深圳)智能科技有限公司 Image enhancement method based on generation type countermeasure network
CN112801900A (en) * 2021-01-21 2021-05-14 北京航空航天大学 Video blur removing method for generating countermeasure network based on bidirectional cyclic convolution
CN112862675A (en) * 2020-12-29 2021-05-28 成都东方天呈智能科技有限公司 Video enhancement method and system for space-time super-resolution

Also Published As

Publication number Publication date
CN115511756A (en) 2022-12-23

Similar Documents

Publication Publication Date Title
AU2012394396B2 (en) Processing high dynamic range images
US10182235B2 (en) Hardware efficient sparse FIR filtering in layered video coding
US10013746B2 (en) High dynamic range video tone mapping
KR20170020288A (en) Methods, systems and aparatus for hdr to hdr inverse tone mapping
WO2019134557A1 (en) Method and device for processing video image
RU2758035C2 (en) Method and device for reconstructing image data by decoded image data
US10542265B2 (en) Self-adaptive prediction method for multi-layer codec
WO2016192937A1 (en) Methods, apparatus, and systems for hdr tone mapping operator
US20200404339A1 (en) Loop filter apparatus and method for video coding
WO2022268181A1 (en) Video enhancement processing methods and apparatus, electronic device and storage medium
CN112235606A (en) Multi-layer video processing method, system and readable storage medium
JP2023085337A (en) Method and apparatus of cross-component linear modeling for intra prediction, decoder, encoder, and program
WO2022156688A1 (en) Layered encoding and decoding methods and apparatuses
Hanhart et al. Evaluation of JPEG XT for high dynamic range cameras
EP4133730A1 (en) Combining high-quality foreground with enhanced low-quality background
WO2023087598A1 (en) Enhanced picture generation method and apparatus, storage medium and electronic apparatus
WO2020181540A1 (en) Video processing method and device, encoding apparatus, and decoding apparatus
US20220321887A1 (en) Image or video coding on basis of transform skip - and palette coding-related data
US20220400250A1 (en) Image or video coding based on quantization parameter information for palette coding or transform unit
WO2020140889A1 (en) Quantization and dequantization method and device
WO2024061660A1 (en) Dynamic structures for volumetric data coding
CN114494051A (en) Image processing method and device, electronic equipment and readable storage medium
Vaidya et al. DCT based image compression for low bit rate video processing and band limited communication
CN116188603A (en) Image processing method and device
Maaroof et al. H264 Video Compression Technique with Retinex Enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22827673

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE