CN111726633B - Compressed video stream recoding method based on deep learning and significance perception - Google Patents

Compressed video stream recoding method based on deep learning and significance perception

Info

Publication number
CN111726633B
CN111726633B (application number CN202010394906.1A)
Authority
CN
China
Prior art keywords
frame
image
video
compressed
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010394906.1A
Other languages
Chinese (zh)
Other versions
CN111726633A (en)
Inventor
李永军
李莎莎
杜浩浩
邓浩
陈立家
曹雪
王赞
陈竞
李鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202010394906.1A
Publication of CN111726633A
Application granted
Publication of CN111726633B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being a group of pictures [GOP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a compressed video stream recoding method based on deep learning and significance perception, which comprises the following steps: constructing and training a deep learning model for compressed-domain video image saliency detection; inputting the compressed video image X to be recoded into the trained compressed-domain video image saliency detection deep learning model CDVNet of step 1; partially decoding the compressed video image X to be re-encoded with the compressed-domain video image saliency detection deep learning model CDVNet; and recoding the video image with HEVC (High Efficiency Video Coding) using the updated quantization parameter of each coding unit. The method extracts saliency features in the compressed domain and performs saliency detection directly in the compressed code stream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which can perform feature extraction and saliency detection only after the compressed video has been fully decompressed to the pixel domain, and gives the method a small amount of computation and low time consumption.

Description

Compressed video stream recoding method based on deep learning and significance perception
Technical Field
The invention relates to the technical field of video image processing, in particular to a compressed video stream recoding method based on deep learning and significance perception in the technical field of video image compression.
Background
The systematization and standardization of video image compression techniques such as JPEG, JPEG2000, H.264/AVC and HEVC have made it routine for massive amounts of video image data to be stored and transmitted in compressed form. Subject to commercial, privacy or bandwidth constraints, some applications need to provide or transmit already-compressed image data at different resolutions. For example, when high-definition video is transmitted over a bandwidth-limited network, both the resolution and the transmission rate must be reduced; in a space-based integrated combat command system, the hyperspectral images sent from a communication satellite to a military command center differ in grade from those sent to each individual soldier. In addition, the display accuracy of the various display devices and communication terminals on the market differs greatly, so video images of different resolutions are also required. Already-compressed video image data therefore has to be re-encoded efficiently to meet the requirements of different transmission bandwidths and different code rates for application scenarios such as display terminals and communication terminals.
At present, recoding of a compressed video image is mainly realized by cascading two independent image decoders and encoders: the input compressed video image data is completely decoded to restore the pixel-domain signal of the original video image, and the video is then compressed a second time according to the requirements of the target application scene. Nanrui Group Co., Ltd. disclosed a video image recompression method in its patent application "A video image recompression method" (application No. 201811379107.6, publication No. CN109640100A). That method completely decodes the compressed video image, classifies the video segments obtained by dividing the original video with an SBD technique, processes the different types of video segments separately, and finally recompresses them as required. The method achieves a certain gain in compression ratio, but its "full decompression, full compression" structure cannot make good use of the information obtained during the first compression; it wastes computation and cache resources, the compression time is long, and real-time processing is difficult to achieve.
Disclosure of Invention
The invention aims to provide a compressed video stream re-encoding method based on deep learning and significance perception which overcomes the defect that, in the prior art, pixel-domain saliency detection must completely decompress the compressed video back to the pixel domain before feature extraction and saliency detection can be carried out.
In order to achieve the purpose, the invention adopts the following technical scheme:
the compressed video stream recoding method based on deep learning and significance perception comprises the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of a compressed domain video image used for training and a corresponding video image significance mapping chart;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet;
Step 2, inputting the compressed video image X to be recoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1;
Step 3, partially decoding the compressed video image X to be recoded with the compressed-domain video image saliency detection deep learning model CDVNet. Specifically, the compressed video image X to be recoded is partially decoded to obtain:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be recoded;
the height H and width W of the video frame image;
the quantization parameter QP and the number of quantization parameters l_QP;
the number G of groups of pictures (GOP) of the compressed video image X to be re-encoded, the number F of video frames in each GOP, the number K of coding units (CU) contained in each frame, and the total number R of video frames;
step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
The HEVC coding technique described in step 8 employs the international standard H.265 established in 2013.
The invention has the beneficial effects that:
First, saliency features are extracted in the compressed domain: saliency detection is carried out directly in the compressed code stream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which must completely decompress the compressed video back to the pixel domain before feature extraction and saliency detection can be carried out, and gives the method a small amount of computation and low time consumption;
Second, because a deep convolutional neural network is used, high-level saliency features are extracted from the code stream by the constructed and trained network model CDVNet. This overcomes the limitation of traditional detection methods, whose saliency of interest exists only in low-level visual information such as image brightness, chrominance and edges; the method can extract high-level image features and handles the deep characterization of scene saliency well;
Third, because the invention adopts an improved algorithm based on the R-λ model, different quantization step sizes are assigned, according to the quantization parameters of the model, to the salient and non-salient regions of the fused feature map, so that the bit rate is distributed reasonably. This avoids video distortion and loss of perceptual quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1: the invention relates to a compressed video stream recoding method based on deep learning and significance perception, which comprises the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of compressed domain video images used for training and corresponding video image significance mapping maps;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted.
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet.
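For illustration only, the setup of step 1 can be sketched in PyTorch as below. The exact CDVNet architecture and loss formula are given only as formula images in the original, so the sketch assumes the standard focal-loss form suggested by the α = 0.5 / γ = 2 description, and `CDVNet` is a hypothetical placeholder for the ResNeXt-based saliency network.

```python
import torch
import torch.nn as nn

class FocalSaliencyLoss(nn.Module):
    """Focal-style loss assumed from the description: alpha balances positive and
    negative samples, gamma down-weights easy samples."""
    def __init__(self, alpha=0.5, gamma=2.0, eps=1e-7):
        super().__init__()
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def forward(self, s, g):
        # s: predicted saliency probability per residual-DCT macroblock, in (0, 1)
        # g: ground-truth saliency map (1 = salient, 0 = not salient)
        s = s.clamp(self.eps, 1.0 - self.eps)
        pos = -self.alpha * (1.0 - s) ** self.gamma * g * torch.log(s)
        neg = -(1.0 - self.alpha) * s ** self.gamma * (1.0 - g) * torch.log(1.0 - s)
        return (pos + neg).mean()

# Hypothetical training setup mirroring step 1.3 (Batch = 64, lr = 0.001, 200 epochs,
# momentum 0.9 supplied through Adam's first-moment coefficient):
# model = CDVNet()                      # ResNeXt-backbone saliency network (not shown)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# criterion = FocalSaliencyLoss(alpha=0.5, gamma=2.0)
```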
Step 2, inputting the compressed video image X to be recoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1.
Step 3, utilizing the significance of the video image in the compressed domain to detect the deep learning model CDVNet to decode the part of the compressed video image X to be recoded; in particular, the method comprises the following steps of,
partially decoding the compressed video image X to be recoded to obtain
The predicted residual DCT coefficient of each frame of image of the compressed video image X to be recoded;
height H and width W of the video frame image;
quantization parameter QP, number of quantization parameters lQP
The number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F of each group of GOPs, the number K of coding units CU contained in each frame, and the total number of frames R of the video image.
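For illustration only, the quantities produced by partial decoding can be carried in a small container such as the sketch below; the field names are hypothetical and simply mirror the symbols listed above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PartialDecodeResult:
    residual_dct: np.ndarray   # prediction-residual DCT coefficients of each frame
    height: int                # H, height of the video frame image
    width: int                 # W, width of the video frame image
    qp: int                    # quantization parameter QP
    num_qp_levels: int         # l_QP, number of quantization parameters
    num_gops: int              # G, number of GOPs
    frames_per_gop: int        # F, video frames per GOP
    cus_per_frame: int         # K, coding units per frame
    total_frames: int          # R, total number of video frames
```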
Step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
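Step 5 then amounts to a single forward pass through the trained model. The sketch below assumes CDVNet is a PyTorch module taking a zero-centred tensor of residual DCT coefficients and returning per-macroblock saliency probabilities; shapes and the normalization used are illustrative assumptions.

```python
import torch

def global_saliency_map(cdvnet, residual_dct):
    # Step 5.1: normalize the residual DCT coefficients so the data is centred around 0
    x = torch.as_tensor(residual_dct, dtype=torch.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)
    # Step 5.2: run the trained compressed-domain saliency model to obtain GSFI
    with torch.no_grad():
        gsfi = cdvnet(x.unsqueeze(0)).squeeze(0)
    return gsfi.cpu().numpy()
```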
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
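A minimal sketch of the CU-level computation of step 7.5: λ = α · bpp^β with the given initial values, and QP derived from λ with the standard HEVC rate-control relation QP = C1 · ln λ + C2 (the QP formula itself is an image in the original, so that relation is inferred from the constants C1 = 4.2005 and C2 = 13.7122).

```python
import math

def cu_lambda_qp(bpp, alpha=3.2005, beta=-1.367, c1=4.2005, c2=13.7122):
    # bpp: target bits per pixel of the coding unit (T_CU divided by its pixel count)
    lam = alpha * (bpp ** beta)        # R-lambda model: lambda = alpha * bpp^beta
    qp = c1 * math.log(lam) + c2       # assumed standard QP-lambda relation
    return lam, int(round(min(51.0, max(0.0, qp))))   # clip QP to the HEVC range [0, 51]
```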
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
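Steps 7.1 to 7.8 form three nested loops (GOP, frame, CU). The control flow can be sketched as below, with the `alloc_*` callbacks standing in for the bit-allocation formulas of steps 7.2 to 7.4 (which appear only as images in the original) and `cu_lambda_qp` taken from the previous sketch.

```python
def allocate_and_quantize(num_gops, frames_per_gop, cus_per_frame,
                          alloc_gop_bits, alloc_frame_bits, alloc_cu_bits,
                          cu_pixels, cu_lambda_qp):
    """Returns a per-(GOP, frame, CU) table of (target bits, lambda, QP)."""
    plan = {}
    for g in range(num_gops):                      # step 7.2: GOP-level target bits
        t_gop = alloc_gop_bits(g)
        for f in range(frames_per_gop):            # step 7.3: frame-level target bits
            t_frame = alloc_frame_bits(g, f, t_gop)
            for k in range(cus_per_frame):         # step 7.4: CU-level target bits
                t_cu = alloc_cu_bits(g, f, k, t_frame)
                lam, qp = cu_lambda_qp(t_cu / cu_pixels)   # step 7.5: R-lambda model
                plan[(g, f, k)] = (t_cu, lam, qp)
    return plan
```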
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
Firstly, the saliency feature extraction based on the compression domain is adopted, and the data information obtained by partial decoding is utilized to carry out saliency detection in the compressed code stream, so that the defect that in the prior art, the saliency detection based on the pixel domain must completely decompress the compressed videos to the pixel domain before feature extraction and saliency detection can be carried out is overcome, and the method has the advantages of small calculated amount and low time consumption;
secondly, because the method of the deep convolutional neural network is adopted, the high-level saliency characteristics in the code stream are extracted from the constructed and trained network model CDVNet, the defect that the interested saliency obtained by the traditional detection method only exists in the visual information such as the brightness, the chromaticity, the edge and the like of the image is overcome, the capability of extracting the high-level characteristics of the image is realized, and the deep-level characterization problem of scene saliency can be well processed;
thirdly, as the invention adopts an improved algorithm based on an R-lambda model, the quantization step sizes with different sizes are adjusted according to the quantization parameters in the model for the significant region and the non-significant region contained in the fused characteristic diagram to realize the reasonable distribution of the bit rate, thereby overcoming the defects of video distortion and video perception effect reduction, having good coding performance and achieving better subjective quality with higher compression ratio.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (2)

1. A compressed video stream recoding method based on deep learning and significance perception, characterized by comprising the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of a compressed domain video image used for training and a corresponding video image significance mapping chart;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet;
step 2, inputting a compressed video image X to be recoded into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1;
Step 3, partially decoding the compressed video image X to be recoded with the compressed-domain video image saliency detection deep learning model CDVNet. Specifically, the compressed video image X to be recoded is partially decoded to obtain:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be recoded;
the height H and width W of the video frame image;
the quantization parameter QP and the number of quantization parameters l_QP;
the number G of groups of pictures (GOP) of the compressed video image X to be re-encoded, the number F of video frames in each GOP, the number K of coding units (CU) contained in each frame, and the total number R of video frames;
step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
2. The method of claim 1, characterized in that the HEVC coding technique described in step 8 employs the international standard H.265 established in 2013.
CN202010394906.1A 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception Active CN111726633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Publications (2)

Publication Number Publication Date
CN111726633A CN111726633A (en) 2020-09-29
CN111726633B (en) 2021-03-26

Family

ID=72564323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010394906.1A Active CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Country Status (1)

Country Link
CN (1) CN111726633B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022073159A1 (en) * 2020-10-07 2022-04-14 浙江大学 Feature data encoding method, apparatus and device, feature data decoding method, apparatus and device, and storage medium
CN112399177B (en) * 2020-11-17 2022-10-28 深圳大学 Video coding method, device, computer equipment and storage medium
CN112399176B (en) * 2020-11-17 2022-09-16 深圳市创智升科技有限公司 Video coding method and device, computer equipment and storage medium
CN113038279B (en) * 2021-03-29 2023-04-18 京东方科技集团股份有限公司 Video transcoding method and system and electronic device
CN113242433B (en) * 2021-04-27 2022-01-21 中国科学院国家空间科学中心 Image compression method and image compression system based on ARM multi-core heterogeneous processor
CN113709464B (en) * 2021-09-01 2024-08-09 展讯通信(天津)有限公司 Video coding method and related equipment
CN113660498B (en) * 2021-10-20 2022-02-11 康达洲际医疗器械有限公司 Inter-frame image universal coding method and system based on significance detection
CN114866784A (en) * 2022-04-19 2022-08-05 东南大学 Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients
CN115314722B (en) * 2022-06-17 2023-12-08 百果园技术(新加坡)有限公司 Video code rate distribution method, system, equipment and storage medium
CN114786011B (en) * 2022-06-22 2022-11-15 苏州浪潮智能科技有限公司 JPEG image compression method, system, equipment and storage medium
CN115115845A (en) * 2022-07-04 2022-09-27 杭州海康威视数字技术股份有限公司 Image semantic content understanding method and device, electronic equipment and storage medium
CN116847101B (en) * 2023-09-01 2024-02-13 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3364342A1 (en) * 2017-02-17 2018-08-22 Cogisen SRL Method for image processing and video compression
CN107437096B (en) * 2017-07-28 2020-06-26 北京大学 Image classification method based on parameter efficient depth residual error network model
CN109118469B (en) * 2018-06-20 2020-11-17 国网浙江省电力有限公司 Prediction method for video saliency
CN109547803B (en) * 2018-11-21 2020-06-09 北京航空航天大学 Time-space domain significance detection and fusion method
CN109451310B (en) * 2018-11-21 2020-10-09 北京航空航天大学 Rate distortion optimization method and device based on significance weighting
CN109309834B (en) * 2018-11-21 2021-01-05 北京航空航天大学 Video compression method based on convolutional neural network and HEVC compression domain significant information
CN109859166B (en) * 2018-12-26 2023-09-19 上海大学 Multi-column convolutional neural network-based parameter-free 3D image quality evaluation method
CN110135435B (en) * 2019-04-17 2021-05-18 上海师范大学 Saliency detection method and device based on breadth learning system
CN111028153B (en) * 2019-12-09 2024-05-07 南京理工大学 Image processing and neural network training method and device and computer equipment
CN111083477B (en) * 2019-12-11 2020-11-10 北京航空航天大学 HEVC (high efficiency video coding) optimization algorithm based on visual saliency

Also Published As

Publication number Publication date
CN111726633A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111726633B (en) Compressed video stream recoding method based on deep learning and significance perception
US7697783B2 (en) Coding device, coding method, decoding device, decoding method, and programs of same
US9762917B2 (en) Quantization method and apparatus in encoding/decoding
US5892548A (en) Adaptive quantizer with modification of high frequency coefficients
EP1867175B1 (en) Method for locally adjusting a quantization step
JP6141295B2 (en) Perceptually lossless and perceptually enhanced image compression system and method
WO2020238439A1 (en) Video quality-of-service enhancement method under restricted bandwidth of wireless ad hoc network
JP2002543693A (en) Quantization method and video compression device
US6934418B2 (en) Image data coding apparatus and image data server
CN103501438B (en) A kind of content-adaptive method for compressing image based on principal component analysis
CN111131828B (en) Image compression method and device, electronic equipment and storage medium
CN112738533B (en) Machine inspection image regional compression method
CN114793282A (en) Neural network based video compression with bit allocation
JP3532470B2 (en) Techniques for video communication using coded matched filter devices.
CN116916036A (en) Video compression method, device and system
US8139881B2 (en) Method for locally adjusting a quantization step and coding device implementing said method
CN112040231B (en) Video coding method based on perceptual noise channel model
CN101742323B (en) Method and device for coding and decoding re-loss-free video
CN112001854A (en) Method for repairing coded image and related system and device
CN111277835A (en) Monitoring video compression and decompression method combining yolo3 and flownet2 network
CN110493597A (en) A kind of efficiently perception video encoding optimization method
CN113194312B (en) Planetary science exploration image adaptive quantization coding system combined with visual saliency
Peng et al. An optimized algorithm based on generalized difference expansion method used for HEVC reversible video information hiding
CN116982262A (en) State transition for dependent quantization in video coding
CN111491166A (en) Dynamic compression system and method based on content analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant