CN111726633B - Compressed video stream recoding method based on deep learning and significance perception - Google Patents
- Publication number: CN111726633B (application No. CN202010394906.1A)
- Authority
- CN
- China
- Prior art keywords
- frame
- image
- video
- compressed
- video image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H04N19/61: coding/decoding of digital video signals using transform coding in combination with predictive coding
- H04N19/124: quantisation (adaptive coding of digital video signals)
- H04N19/177: adaptive coding in which the coding unit is a group of pictures [GOP]
- H04N19/625: transform coding using the discrete cosine transform [DCT]
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Discrete Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention provides a compressed video stream re-encoding method based on deep learning and saliency perception, comprising the following steps: construct and train a deep learning model for compressed-domain video image saliency detection; input the compressed video image X to be re-encoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in Step 1; partially decode the compressed video image X to be re-encoded with CDVNet; and re-encode the video images with High Efficiency Video Coding (HEVC), using the updated quantization parameter of each coding unit. The method extracts saliency features in the compressed domain and performs saliency detection within the compressed bitstream using the data obtained by partial decoding. It thereby overcomes the drawback of prior-art pixel-domain saliency detection, which can extract features and detect saliency only after the compressed video has been fully decompressed to the pixel domain, and offers low computational cost and low latency.
Description
Technical Field
The invention relates to the technical field of video image processing, and in particular to a compressed video stream re-encoding method based on deep learning and saliency perception in the field of video image compression.
Background
The systematization and standardization of video image compression techniques such as JPEG, JPEG2000, H.264/AVC, and HEVC have made it routine for massive amounts of video image data to be stored and transmitted in compressed form. Subject to commercial, privacy, or bandwidth constraints, some applications must provide or transmit the compressed image data at different resolutions. For example, when high-definition video is transmitted over a bandwidth-limited network, both the resolution and the transmission rate must be reduced; in a space-based integrated combat command system, the hyperspectral images transmitted from a communication satellite to a military command center differ in grade from those transmitted to individual soldiers. Moreover, the display precision of the various display devices and communication terminals on the market varies widely, which likewise calls for video images at different resolutions. Already-compressed video image data must therefore be re-encoded efficiently to meet the demands of different transmission bandwidths and bit rates across application scenarios such as display and communication terminals.
At present, re-encoding of a compressed video image is mainly realized by cascading two independent stages, an image decoder and an encoder: the input compressed video image data are fully decoded to recover the pixel-domain signal of the original video image, which is then compressed a second time according to the requirements of the target application. Nanrui Group Co., Ltd. disclosed a video image recompression method in its patent application "A video image recompression method" (application No. 201811379107.6, publication No. CN109640100A). That method fully decodes the compressed video image, classifies the video segments obtained by partitioning the original video using shot-boundary-detection (SBD) technology, processes the different segment types separately, and finally recompresses as required. The method improves the compression ratio to some extent, but its "full decompression, full compression" structure cannot exploit the information obtained during the first compression: it wastes computation and cache resources, prolongs compression time, and makes real-time processing difficult.
Disclosure of Invention
The object of the invention is to provide a compressed video stream re-encoding method based on deep learning and saliency perception that overcomes the drawback of prior-art pixel-domain saliency detection, which can perform feature extraction and saliency detection only after the compressed video has been fully decompressed to the pixel domain.
To achieve this object, the invention adopts the following technical scheme.
The compressed video stream re-encoding method based on deep learning and saliency perception comprises the following steps:
Step 1: construct and train a deep learning model for compressed-domain video image saliency detection, as follows:
Step 1.1: batch-normalize the discrete cosine transform (DCT) residual coefficients of the compressed-domain video images used for training, together with the corresponding video image saliency maps;
Step 1.2: take a ResNeXt network as the feature-extraction network and construct the compressed-domain video image saliency detection deep learning model CDVNet with the loss function loss of the feature-extraction network,
where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient, and G(i, j) = 0 indicates that it is not; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 balances the uneven proportion of positive and negative samples, and γ = 2 adjusts the rate at which easy samples are down-weighted;
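The loss formula itself does not survive in the text. Its stated properties (α = 0.5 balancing positive and negative samples, γ = 2 down-weighting easy samples, S(i, j) a predicted probability) match the standard focal loss, so a plausible reconstruction, offered only as an assumption, is:

```latex
loss = -\sum_{i,j}\Big[\alpha\,G(i,j)\,\bigl(1-S(i,j)\bigr)^{\gamma}\log S(i,j)
      \;+\;(1-\alpha)\,\bigl(1-G(i,j)\bigr)\,S(i,j)^{\gamma}\log\bigl(1-S(i,j)\bigr)\Big]
```

with α = 0.5 and γ = 2 as given; the summation range and sign conventions are assumptions.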
Step 1.3: feed the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and train it with the stochastic optimization algorithm Adam, using batch size Batch = 64, momentum Momentum = 0.9, and initial learning rate lr = 0.001; train for Epoch = 200 epochs to obtain the trained compressed-domain video image saliency detection deep learning model CDVNet;
Step 2: input the compressed video image X to be re-encoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in Step 1;
Step 3: partially decode the compressed video image X to be re-encoded with the compressed-domain video image saliency detection deep learning model CDVNet. Specifically, partially decoding X yields:
- the prediction-residual DCT coefficients of each frame of X;
- the height H and width W of the video frame images;
- the quantization parameter QP and the number of quantization parameters l_QP;
- the number of groups of pictures (GOPs) G of X, the number of video frames F in each GOP, the number of coding units (CUs) K contained in each frame, and the total number of video frames R;
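The quantities recovered by partial decoding can be grouped in a small container. The sketch below is illustrative only; the patent defines no API, and all field and type names are assumptions:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PartialDecodeResult:
    """Metadata recovered by partial (entropy-only) decoding of stream X.

    Field names are illustrative; the patent does not define an API.
    """
    residual_dct: list   # per-frame prediction-residual DCT coefficient blocks
    height: int          # H
    width: int           # W
    qp: int              # quantization parameter QP
    n_qp_levels: int     # l_QP, number of quantization parameters
    n_gops: int          # G
    frames_per_gop: int  # F
    cus_per_frame: int   # K
    total_frames: int    # R


# e.g. a 2-GOP, 8-frames-per-GOP stream of 64x64 frames
meta = PartialDecodeResult(
    residual_dct=[np.zeros((8, 8)) for _ in range(16)],
    height=64, width=64, qp=32, n_qp_levels=52,
    n_gops=2, frames_per_gop=8, cus_per_frame=4, total_frames=16,
)
assert meta.total_frames == meta.n_gops * meta.frames_per_gop
```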
Step 4: extract local saliency features from the partially decoded compressed video image X of Step 3, as follows:
Step 4.1: initialize the frame index r of the video frame images of the partially decoded compressed video image X to r = 1;
Step 4.2: compute the norm of the quantized prediction-residual DCT coefficients of each macroblock of frame r to obtain the RDCN feature map,
where RDCN denotes the norm of the macroblock's prediction-residual DCT coefficients, and the remaining symbol in the formula denotes the macroblock's motion vector;
Step 4.3: apply max-min normalization to the RDCN feature map of frame r obtained in Step 4.2;
Step 4.4: convolve the max-min-normalized RDCN feature map of Step 4.3 with a 3×3 Gaussian filter to perform spatial filtering;
Step 4.5: apply temporal median filtering over the preceding r frames to the spatially filtered feature map of Step 4.4 to obtain the local saliency feature map SRDCN of frame r,
where Med[·] denotes the median of the spatially filtered feature values over the preceding r frames, taken over the spatially filtered RDCN feature values of the macroblock in row i, column j of frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extract the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, as follows:
Step 5.1: normalize the DCT residual coefficients of X so that the normalized data are distributed around zero;
Step 5.2: input the DCT residual coefficients normalized in Step 5.1 into the CDVNet model trained in Step 1 to obtain the global saliency feature map GSFI of frame r of the video frame images of X;
Step 6: fuse and enhance the local saliency feature map SRDCN and the global saliency feature map GSFI of frame r, as follows:
Step 6.1: fuse the local saliency feature map SRDCN of frame r obtained in Step 4.5 with the global saliency feature map GSFI of frame r obtained in Step 5.2 according to the following formula, obtaining the fused saliency map S_fuse of frame r:
S_fuse = Norm(α·GSFI + β·SRDCN + γ·SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the interval [0, 1], ⊙ denotes the element-wise (Hadamard) product, α = QP/(3·l_QP), β = 2·(1 − (QP − 3)/(3·l_QP)), and QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2: apply saliency enhancement and non-saliency suppression to the fused saliency map S_fuse of frame r with a Gaussian-model-based central saliency map, obtaining the central weight map S_central over the image positions of the fused feature values,
where x_i and y_i denote the image position corresponding to a macroblock, the remaining two symbols in the formula denote the numbers of macroblocks per row and per column of the video frame, and x_c and y_c denote the mean coordinates of the 10 largest values of S_fuse,
where S_fuse(x_i, y_i) are the fused saliency feature values, ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3: combine the fused saliency map S_fuse of frame r obtained in Step 6.1 with the position-enhanced saliency map of Step 6.2 to obtain the final saliency map S_r of frame r:
S_r = S_fuse ⊙ S_central;
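A minimal sketch of the fusion and center-prior enhancement of Step 6. The weights α and β follow the formulas in the text, while the γ weight and the Gaussian σ values are not given and are therefore free parameters here:

```python
import numpy as np


def fuse_saliency(gsfi, srdcn, qp, l_qp, gamma=0.5):
    """S_fuse = Norm(a*GSFI + b*SRDCN + g*SRDCN (.) GSFI).

    a and b follow the patent's formulas; gamma's formula is not shown,
    so it is treated as a free parameter."""
    a = qp / (3.0 * l_qp)
    b = 2.0 * (1.0 - (qp - 3.0) / (3.0 * l_qp))
    s = a * gsfi + b * srdcn + gamma * srdcn * gsfi
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)


def central_map(s_fuse, sigma_frac=0.25):
    """Gaussian center prior around the mean coordinates of the 10 largest
    fused values (the sigma choice is an assumption)."""
    h, w = s_fuse.shape
    top10 = np.argsort(s_fuse, axis=None)[-10:]
    ys, xs = np.unravel_index(top10, s_fuse.shape)
    yc, xc = ys.mean(), xs.mean()
    yy, xx = np.mgrid[0:h, 0:w]
    sy, sx = sigma_frac * h, sigma_frac * w
    return np.exp(-((xx - xc) ** 2 / (2 * sx ** 2) + (yy - yc) ** 2 / (2 * sy ** 2)))


rng = np.random.default_rng(1)
gsfi, srdcn = rng.random((9, 11)), rng.random((9, 11))
s_fuse = fuse_saliency(gsfi, srdcn, qp=32, l_qp=52)
s_r = s_fuse * central_map(s_fuse)   # S_r = S_fuse (.) S_central
assert s_r.shape == (9, 11)
```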
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
Step 7: construct the region-of-interest R-λ model, as follows:
Step 7.1: initialize the GOP index g of the compressed video image X obtained in Step 3, the frame index f within each GOP, and the coding-unit (CU) index k within each frame, all to 1;
Step 7.2: using the final saliency map S_r of frame r obtained in Step 6.3, reallocate the target bit number T_G to GOP g of the partially decoded compressed video image X of Step 3,
where T_G is the target number of bits allocated to GOP g, R_u is the target bit rate per frame, f_ps is the video frame rate, δ is an offset with default value 0.75, γ is the ROI ratio, N_GSFI is the number of salient macroblocks in the GOP, and the resulting saliency weight varies between 0.75 and 1.75;
Step 7.3: obtain the target bit number T_F of frame f,
where T_F is the target number of bits of the current frame, R_GOPcoded is the number of bits the current GOP has already consumed, ω_i is the frame-level bit-allocation weight adjusted according to the target bits, the coding structure, and the characteristics of the coded frames, and the remaining symbol denotes the number of not-yet-coded frames;
Step 7.4: obtain the target bit number T_CU of the k-th coding unit CU,
where P_CU is the probability value, within each frame, of the macroblock's RDCN-normalized feature value;
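The T_G, T_F, and T_CU formulas do not survive in the text. The sketch below follows the usual hierarchical R-λ allocation (remaining GOP budget split by frame weights, then by per-CU saliency probability P_CU) and should be read as an assumption about the patent's scheme, not a reproduction of it:

```python
import numpy as np


def frame_target_bits(t_gop, bits_consumed, weights, current):
    """Frame-level allocation in the usual R-lambda style: the remaining GOP
    budget is split according to the weights of the not-yet-coded frames
    (exact formula assumed)."""
    remaining = weights[current:]
    return (t_gop - bits_consumed) * weights[current] / sum(remaining)


def cu_target_bits(t_frame, saliency_map):
    """CU-level allocation: each CU receives a share proportional to its
    normalized saliency (P_CU), so salient CUs get more bits."""
    p_cu = saliency_map / saliency_map.sum()
    return t_frame * p_cu


t_gop = 80_000.0                 # target bits for this GOP
weights = [2.0, 1.0, 1.0, 1.0]   # e.g. heavier weight for the key frame
t_f = frame_target_bits(t_gop, bits_consumed=0.0, weights=weights, current=0)
cu_bits = cu_target_bits(t_f, np.array([[1.0, 3.0], [2.0, 2.0]]))
assert abs(cu_bits.sum() - t_f) < 1e-6   # CU budgets sum to the frame budget
```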
Step 7.5: compute the quantization parameter QP and the λ value of the k-th coding unit CU from the R-λ model, as follows:
λ = α × bpp^β;
QP = C1 × ln λ + C2;
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, continuously updated adaptively to the content, and C1 = 4.2005, C2 = 13.7122;
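Step 7.5 uses the standard HEVC R-λ model. With the constants given in the text (initial α = 3.2005, β = −1.367, C1 = 4.2005, C2 = 13.7122), a sketch of the λ and QP computation is below; the QP formula itself is not shown in the text and is the standard one implied by these constants:

```python
import math


def rlambda_qp(bpp, alpha=3.2005, beta=-1.367, c1=4.2005, c2=13.7122):
    """Standard HEVC R-lambda model: lambda = alpha * bpp^beta,
    QP = c1 * ln(lambda) + c2, clipped to the valid HEVC range [0, 51]."""
    lam = alpha * (bpp ** beta)
    qp = c1 * math.log(lam) + c2
    return lam, max(0, min(51, round(qp)))


# fewer bits per pixel -> larger lambda -> larger (coarser) QP
lam_low, qp_low = rlambda_qp(bpp=0.1)
lam_high, qp_high = rlambda_qp(bpp=0.5)
assert lam_low > lam_high
assert 0 <= qp_low <= 51 and 0 <= qp_high <= 51
```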
Step 7.6: increment the CU index k by 1 and check whether the incremented index equals the total number of coding units K; if so, go to Step 7.7; otherwise, go to Step 7.4;
Step 7.7: increment the frame index f by 1 and check whether the incremented index equals the number of video frames F in the GOP; if so, go to Step 7.8; otherwise, go to Step 7.3;
Step 7.8: increment the GOP index g by 1 and check whether the incremented index equals the total number of GOPs G; if so, go to Step 8; otherwise, go to Step 7.1;
Step 8: re-encode the video images with High Efficiency Video Coding (HEVC), using the updated quantization parameter of each coding unit.
The HEVC coding technique in Step 8 follows the international standard H.265 established in 2013.
The beneficial effects of the invention are:
First, saliency features are extracted in the compressed domain: saliency detection is performed within the compressed bitstream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which can extract features and detect saliency only after fully decompressing the compressed video to the pixel domain, and therefore requires little computation and time.
Second, because a deep convolutional neural network is used, high-level saliency features are extracted from the bitstream by the constructed and trained network model CDVNet. This overcomes the limitation of traditional detection methods, whose saliency estimates rely only on low-level visual information such as luminance, chrominance, and edges; the method can extract high-level image features and handles the deep characterization of scene saliency well.
Third, because an improved R-λ-model-based algorithm is used, quantization step sizes of different magnitudes are assigned, via the model's quantization parameters, to the salient and non-salient regions of the fused feature map, achieving a rational bit-rate allocation. This avoids video distortion and degraded perceptual quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1: the invention relates to a compressed video stream recoding method based on deep learning and significance perception, which comprises the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of compressed domain video images used for training and corresponding video image significance mapping maps;
step 1.2, taking a Resnext network as a feature extraction network, and constructing a compressed domain video image saliency detection deep learning model CDVNet by using a loss function loss of the feature extraction network; specifically, the method comprises the following steps: the loss function loss of the feature extraction network is
Wherein, G (i, j) ═ 1 indicates that the image position corresponding to the ith row and jth column residual DCT macro block is significant, and G (i, j) ═ 0 indicates that the image position corresponding to the ith row and jth column residual DCT macro block is not significant; s (i, j) represents the probability that the residual DCT coefficient of the ith row and the jth column is predicted to be a significance value; wherein α is 0.5 and γ is 2; further, taking alpha as 0.5 to balance the uneven proportion of the positive and negative samples; taking gamma 2 is used to adjust the rate of simple sample weight reduction;
step 1.3, sending the DCT residual coefficients of the Batch of normalized compressed domain video images and the corresponding video image significance mapping maps into a compressed domain video image significance detection deep learning model CDVNet, and training the compressed domain video image significance detection deep learning model CDVNet by using a random optimization algorithm Adam, wherein the size of a training Batch is that Batch is 64, Momentum is that Momentum is 0.9, and the learning rate is initially set as lr is 0.001; and (5) training the batch of Epoch to 200, and finally obtaining the trained compressed domain video image saliency detection deep learning model CDVNet.
And 2, inputting the compressed video image X to be recoded into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1.
Step 3, utilizing the significance of the video image in the compressed domain to detect the deep learning model CDVNet to decode the part of the compressed video image X to be recoded; in particular, the method comprises the following steps of,
partially decoding the compressed video image X to be recoded to obtain
The predicted residual DCT coefficient of each frame of image of the compressed video image X to be recoded;
height H and width W of the video frame image;
quantization parameter QP, number of quantization parameters lQP;
The number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F of each group of GOPs, the number K of coding units CU contained in each frame, and the total number of frames R of the video image.
Step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
step 4.2, calculating the norm after quantizing the prediction residual DCT coefficient of each macro block in the r frame in the video frame image in the step 4.1 to obtain the RDCN characteristic diagram, and specifically adopting the following method:
wherein RDCN is the norm of the DCT coefficient of the prediction residual error,the motion vector of the motion;
step 4.3, performing maximum and minimum value normalization on the RDCN feature map of the r frame in the video frame image obtained in the step 4.2;
4.4, performing convolution on the RDCN characteristic diagram normalized by the maximum and minimum values obtained in the step 4.3 by using a Gaussian filter of 3 multiplied by 3 to realize spatial filtering;
step 4.5, performing motion median filtering on the feature map subjected to spatial filtering in the step 4.4 by using the previous r frame to obtain a local saliency feature map SRDCN of the r frame in the video frame image; specifically, the following method is adopted:
wherein, Med [ ·]Represents the median of the spatially filtered prior r frame eigenvalues,is the RDCN characteristic value after spatial filtering of the ith row and jth column macro block of the r-t frame in the video frame image, and t belongs to {1,2, … r-2 };
and 5: the method for extracting the high-level saliency features of the compressed video image X by using the compressed domain video image saliency detection deep learning model CDVNet comprises the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
step 6, fusing and enhancing the local saliency characteristic map SRDCN and the global saliency characteristic map GSFI of the r frame in the video frame image, wherein the method comprises the following steps:
step 6.1, fusing the local saliency characteristic map SRDCN of the r frame in the video frame image obtained in the step 4.5 and the global saliency characteristic map GSFI of the r frame of the video frame image obtained in the step 5.2 according to the following formula to obtain a fused saliency map S of the r frame of the video frame imagefuse:
Sfuse=Norm(α·GSFI+β·SRDCN+γ·SRDCN⊙GSFI);
Wherein Norm (. cndot.) represents normalization to [0,1 ]]Interval, [ alpha ] indicates dot product, [ alpha ] indicates QP/(3 · l)QP),
β=2·(1-(QP-3)/(3·lQP)),Here QP and lQPThe quantization parameters and the number of the quantization parameters obtained by decoding the compressed video image part;
step 6.2, a fusion saliency mapping image S of the r-th frame of the video frame image through the central saliency map based on the Gaussian model according to the following formulafusePerforming significance enhancement and non-significance inhibition to obtain a position S in the image corresponding to the fused characteristic valuecentral:
Wherein x isiAnd yiIndicating the picture corresponding to the macro blockIn the position of (a) in the first, indicating the number of macroblocks per line of the video frame,indicating the number of macroblocks in each column of the video frame. Wherein xcAnd ycDenotes SfuseThe mean of the coordinates of the first 10 maxima, and
wherein S isfuse(xi,yi) Is a fused significant feature value, Sfuse(x1,y1)≥Sfuse(x2,y2)≥…≥Sfuse(xN,yN);
Step 6.3, the fusion significance mapping map S of the r frame of the video frame image obtained in the step 6.1 is obtained through the following formulafuseCombining the position of the enhanced saliency characteristic map obtained in the step 6.2 to obtain a final saliency map S of the No. r frame of the video frame imager:
Sr=Sfuse⊙Scentral;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
step 7.2, combining the final saliency map S of the r-th frame of the video frame image obtained in step 6.3rReallocating the target bit number T to the GOP group of the compressed video image X partially decoded in the step 3 according to the following formulaG:
Wherein, TGTarget number of bits, R, allocated for group g GOPuFor a target code rate per frame, fpsVideo frame rate, δ is offset, default is 0.75, γ is ROI ratio,NGSFIfor the number of significant macroblocks in a GOP group,varying between 0.75 and 1.75;
step 7.3, obtaining the target bit number T of the f frame according to the following formulaF:
Wherein, TFNumber of bits, R, of the current frameGOPcodedIs the target number of bits, ω, that the current GOP has consumediIs a frame-level bit allocation weight adjusted according to the target bit, the coding structure and the characteristics of the coded frame, and the coded is the number of uncoded images;
step 7.4, obtaining the target bit T of the kth coding unit CU according to the following formulaCU;
Wherein, PCUObtaining the macro block after normalizing for RDCNThe probability value of the characteristic value of (2) in each frame;
step 7.5, calculating the quantization parameter QP value and the lambda value of the kth coding unit CU according to the R-lambda model, and specifically adopting the following method:
λ=α×bppβ;
wherein alpha and beta are parameters related to the characteristics of the sequence content, the initial values are 3.2005 and-1.367, alpha and beta are continuously updated according to the self-adaptation of the content, C1=4.2005,C2=13.7122;
Step 7.6, adding 1 to the serial number K of the coding unit, and judging whether the serial number K of the coding unit after adding one is equal to the total number K of the coding unit; if yes, executing step 7.7, otherwise, executing step 7.4;
step 7.7, adding 1 to the sequence number F of the video frame, and judging whether the frame sequence number F after adding 1 is equal to the number F of the video frames in the GOP group; if yes, executing step 7.8, otherwise, executing step 7.3;
step 7.8, adding 1 to the number G of the GOP groups, and judging whether the sequence number G of the GOP groups after adding 1 is equal to the total number G of the GOPs; if yes, executing step 8, otherwise, executing step 7.1;
and 8, performing video image recoding by using an HEVC (high efficiency video coding) technology and combining the updated quantization parameter of each coding unit.
Firstly, the invention adopts saliency feature extraction based on the compressed domain, using the data obtained by partial decoding to perform saliency detection directly in the compressed code stream. This overcomes the drawback of prior-art pixel-domain saliency detection, which must fully decompress the compressed video to the pixel domain before feature extraction and saliency detection can be carried out, and therefore has the advantages of a small amount of computation and low time consumption;
secondly, because the method adopts a deep convolutional neural network, high-level saliency features in the code stream are extracted by the constructed and trained network model CDVNet. This overcomes the limitation of traditional detection methods, in which the saliency of interest is found only in low-level visual information such as the luminance, chrominance and edges of the image; the invention thus gains the ability to extract high-level image features and can handle the deep-level characterization of scene saliency well;
thirdly, since the invention adopts an improved algorithm based on the R-λ model, quantization steps of different sizes are applied, according to the quantization parameters in the model, to the salient and non-salient regions contained in the fused feature map, achieving a reasonable allocation of the bit rate. This overcomes the drawbacks of video distortion and reduced perceptual quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (2)
1. A method for recoding a compressed video stream based on deep learning and significance perception, characterized by comprising the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on the discrete cosine transform (DCT) residual coefficients of the compressed-domain video images used for training and the corresponding video image saliency maps;
step 1.2, taking a ResNeXt network as the feature extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet with the loss function loss of the feature extraction network; specifically, the loss function loss of the feature extraction network is
loss = -Σ_{i,j} [ α·G(i,j)·(1-S(i,j))^γ·log(S(i,j)) + (1-α)·(1-G(i,j))·S(i,j)^γ·log(1-S(i,j)) ]
Wherein, G (i, j) ═ 1 indicates that the image position corresponding to the ith row and jth column residual DCT macro block is significant, and G (i, j) ═ 0 indicates that the image position corresponding to the ith row and jth column residual DCT macro block is not significant; s (i, j) represents the probability that the residual DCT coefficient of the ith row and the jth column is predicted to be a significance value; wherein α is 0.5 and γ is 2; further, taking alpha as 0.5 to balance the uneven proportion of the positive and negative samples; taking gamma 2 is used to adjust the rate of simple sample weight reduction;
step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, wherein the training batch size is Batch = 64, the momentum is Momentum = 0.9, the learning rate is initially set to lr = 0.001, and the number of training epochs is Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet;
step 2, inputting a compressed video image X to be recoded into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1;
step 3, partially decoding the compressed video image X to be recoded for the compressed-domain video image saliency detection deep learning model CDVNet; in particular, partially decoding the compressed video image X to be recoded yields:
the predicted residual DCT coefficients of each frame of the compressed video image X to be recoded;
the height H and width W of the video frame image;
the quantization parameter QP and the number of quantization parameters l_QP;
the number G of groups of pictures (GOP) of the compressed video image X to be recoded, the number F of video frames in each GOP group, the number K of coding units CU contained in each frame, and the total number R of video images;
step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
step 4.2, calculating the norm of the quantized prediction residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, specifically by the following method:
wherein RDCN is the norm of the prediction residual DCT coefficients, and the remaining symbol in the formula denotes the motion vector;
step 4.3, performing maximum and minimum value normalization on the RDCN feature map of the r frame in the video frame image obtained in the step 4.2;
step 4.4, convolving the max-min-normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
step 4.5, performing motion median filtering, over the previous frames, on the feature map spatially filtered in step 4.4, to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image; specifically, the following method is adopted:
wherein Med[·] denotes the median of the spatially filtered feature values of the previous frames, the remaining symbol denotes the spatially filtered RDCN feature value of the macroblock at row i, column j of the (r−t)-th frame of the video frame image, and t ∈ {1, 2, …, r−2};
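Steps 4.2-4.5 can be sketched as follows. The exact norm used for RDCN is not recoverable from this excerpt (the formula is an image), so the L1 norm below is an assumption; the temporal median of step 4.5 is taken per position over the previous frames' spatially filtered maps.

```python
import statistics

def rdcn_map(residual_dct_blocks):
    """Per-macroblock norm of the quantized prediction-residual DCT
    coefficients (L1 norm assumed), followed by max-min normalisation
    to [0, 1] (steps 4.2-4.3). Input: rows of macroblocks, each a flat
    list of DCT coefficients."""
    m = [[sum(abs(c) for c in blk) for blk in row] for row in residual_dct_blocks]
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on flat maps
    return [[(v - lo) / span for v in row] for row in m]

def temporal_median(prev_maps):
    """Motion median filter (step 4.5): per-position median over the
    spatially filtered feature maps of the previous frames."""
    h, w = len(prev_maps[0]), len(prev_maps[0][0])
    return [[statistics.median(f[i][j] for f in prev_maps) for j in range(w)]
            for i in range(h)]
```

The median over previous frames suppresses transient noise in the residual energy while keeping persistently active (moving) regions salient.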
step 5, extracting the high-level saliency features of the compressed video image X by using the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
step 6, fusing and enhancing the local saliency characteristic map SRDCN and the global saliency characteristic map GSFI of the r frame in the video frame image, wherein the method comprises the following steps:
step 6.1, fusing the local saliency feature map SRDCN of the r-th frame of the video frame image obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula, to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α·GSFI + β·SRDCN + γ·SRDCN ⊙ GSFI);
wherein Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise product, α = QP/(3·l_QP), β = 2·(1 − (QP − 3)/(3·l_QP)); here QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
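The fusion of step 6.1 can be sketched as below. The weights a and b follow the formulas given in the text; the weight of the element-wise product term (γ in the formula) is not given in this excerpt, so it is left as a free parameter here.

```python
def fuse_saliency(gsfi, srdcn, qp, l_qp, gamma=1.0):
    """Fuse global (GSFI) and local (SRDCN) saliency maps per step 6.1:
    S_fuse = Norm(a*GSFI + b*SRDCN + gamma*SRDCN.*GSFI), with
    a = QP/(3*l_qp) and b = 2*(1 - (QP-3)/(3*l_qp)) as in the text.
    gamma is an assumption (its value is an image in the original)."""
    a = qp / (3 * l_qp)
    b = 2 * (1 - (qp - 3) / (3 * l_qp))
    fused = [[a * g + b * s + gamma * s * g for g, s in zip(g_row, s_row)]
             for g_row, s_row in zip(gsfi, srdcn)]
    # Norm(.): max-min normalisation to [0, 1]
    flat = [v for row in fused for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in fused]
```

Note the design intent readable from the weights: a grows with QP while b shrinks, so at coarse quantization (weak residual signal) the learned global map dominates, and at fine quantization the cheap local RDCN map dominates.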
step 6.2, performing saliency enhancement and non-saliency suppression on the fused saliency map S_fuse of the r-th frame of the video frame image through a central saliency map based on a Gaussian model, according to the following formula, to obtain the map S_central of the positions in the image corresponding to the fused feature values:
wherein x_i and y_i indicate the position in the image to which the macroblock corresponds, and the two remaining symbols denote the number of macroblocks in each row and each column of the video frame respectively; x_c and y_c denote the means of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) is a fused saliency feature value and S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
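A sketch of the centre prior of step 6.2. The Gaussian is centred on the mean coordinate of the top-k maxima as described; the exact Gaussian expression is an image in the original, so the isotropic form and the default bandwidth (a third of the map diagonal) used here are assumptions.

```python
import math

def center_prior(s_fuse, top_k=10, sigma=None):
    """Gaussian centre map around the mean coordinate of the top_k maxima
    of the fused saliency map (step 6.2). sigma defaults to one third of
    the map diagonal -- an assumption, not from the original text."""
    h, w = len(s_fuse), len(s_fuse[0])
    # top_k strongest cells, sorted by fused saliency value
    cells = sorted(((s_fuse[i][j], i, j) for i in range(h) for j in range(w)),
                   reverse=True)[:top_k]
    yc = sum(i for _, i, _ in cells) / len(cells)
    xc = sum(j for _, _, j in cells) / len(cells)
    if sigma is None:
        sigma = math.hypot(h, w) / 3
    return [[math.exp(-((j - xc) ** 2 + (i - yc) ** 2) / (2 * sigma ** 2))
             for j in range(w)] for i in range(h)]
```

Multiplying this map into S_fuse (as in step 6.3) boosts values near the detected centre of attention and damps the periphery.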
Step 6.3, combining the fused saliency map S_fuse of the r-th frame of the video frame image obtained in step 6.1 with the position-enhanced saliency map obtained in step 6.2 through the following formula, to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number r, and judging whether the incremented serial number r is equal to the total number R of video frames; if yes, executing step 7; otherwise, executing step 4.2;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
step 7.2, combining the final saliency map S_r of the r-th frame of the video frame image obtained in step 6.3, reallocating the target bit number T_G to the GOP group of the compressed video image X partially decoded in step 3 according to the following formula:
Wherein, T_G is the target number of bits allocated to the g-th GOP group, R_u is the target code rate of each frame, fps is the video frame rate, δ is an offset whose default value is 0.75, γ is the ROI ratio, N_GSFI is the number of salient macroblocks in the GOP group, and the remaining weighting term varies between 0.75 and 1.75;
step 7.3, obtaining the target bit number T_F of the f-th frame according to the following formula:
Wherein, T_F is the target number of bits of the current frame, R_GOPcoded is the number of bits the current GOP has already consumed, ω_i is a frame-level bit allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of images not yet coded;
step 7.4, obtaining the target bit number T_CU of the k-th coding unit CU according to the following formula;
Wherein, P_CU is the probability value, within each frame, of the macroblock's feature value after RDCN normalization;
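The CU-level allocation of step 7.4 can be sketched as follows. The exact formula is an image in the original; allocating each CU a share of the frame budget proportional to its normalised saliency probability P_CU is the standard reading of such saliency-weighted rate control and is an assumption here (the GOP- and frame-level formulas of steps 7.2-7.3 would feed `frame_bits`).

```python
def allocate_cu_bits(frame_bits, cu_saliency):
    """Saliency-proportional CU-level bit allocation (step 7.4, assumed
    form): each coding unit receives frame_bits * P_CU, where P_CU is its
    normalised RDCN saliency probability within the frame."""
    total = sum(cu_saliency) or 1.0  # guard all-zero saliency
    return [frame_bits * s / total for s in cu_saliency]
```

Salient CUs thus receive larger budgets, which the R-λ model of step 7.5 then converts into smaller QPs (finer quantization) for the regions of interest.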
step 7.5, calculating the quantization parameter QP value and the lambda value of the kth coding unit CU according to the R-lambda model, and specifically adopting the following method:
λ = α × bpp^β, QP = C1 × ln(λ) + C2;
wherein α and β are parameters related to the characteristics of the sequence content, with initial values of 3.2005 and -1.367 respectively; α and β are continuously updated adaptively according to the content, and C1 = 4.2005, C2 = 13.7122;
Step 7.6, adding 1 to the serial number k of the coding unit, and judging whether the incremented serial number k is equal to the total number K of coding units; if yes, executing step 7.7; otherwise, executing step 7.4;
step 7.7, adding 1 to the sequence number f of the video frame, and judging whether the incremented frame sequence number f is equal to the number F of video frames in the GOP group; if yes, executing step 7.8; otherwise, executing step 7.3;
step 7.8, adding 1 to the sequence number g of the GOP group, and judging whether the incremented sequence number g is equal to the total number G of GOP groups; if yes, executing step 8; otherwise, executing step 7.1;
step 8, performing video image recoding by using the HEVC (high efficiency video coding) technique in combination with the updated quantization parameter of each coding unit.
2. The method for recoding the compressed video stream based on deep learning and significance perception according to claim 1, wherein the HEVC coding technique described in step 8 employs the international standard H.265 established in 2013.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010394906.1A CN111726633B (en) | 2020-05-11 | 2020-05-11 | Compressed video stream recoding method based on deep learning and significance perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111726633A CN111726633A (en) | 2020-09-29 |
CN111726633B true CN111726633B (en) | 2021-03-26 |
Family
ID=72564323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010394906.1A Active CN111726633B (en) | 2020-05-11 | 2020-05-11 | Compressed video stream recoding method based on deep learning and significance perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111726633B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022073159A1 (en) * | 2020-10-07 | 2022-04-14 | 浙江大学 | Feature data encoding method, apparatus and device, feature data decoding method, apparatus and device, and storage medium |
CN112399177B (en) * | 2020-11-17 | 2022-10-28 | 深圳大学 | Video coding method, device, computer equipment and storage medium |
CN112399176B (en) * | 2020-11-17 | 2022-09-16 | 深圳市创智升科技有限公司 | Video coding method and device, computer equipment and storage medium |
CN113038279B (en) * | 2021-03-29 | 2023-04-18 | 京东方科技集团股份有限公司 | Video transcoding method and system and electronic device |
CN113242433B (en) * | 2021-04-27 | 2022-01-21 | 中国科学院国家空间科学中心 | Image compression method and image compression system based on ARM multi-core heterogeneous processor |
CN113709464B (en) * | 2021-09-01 | 2024-08-09 | 展讯通信(天津)有限公司 | Video coding method and related equipment |
CN113660498B (en) * | 2021-10-20 | 2022-02-11 | 康达洲际医疗器械有限公司 | Inter-frame image universal coding method and system based on significance detection |
CN114866784A (en) * | 2022-04-19 | 2022-08-05 | 东南大学 | Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients |
CN115314722B (en) * | 2022-06-17 | 2023-12-08 | 百果园技术(新加坡)有限公司 | Video code rate distribution method, system, equipment and storage medium |
CN114786011B (en) * | 2022-06-22 | 2022-11-15 | 苏州浪潮智能科技有限公司 | JPEG image compression method, system, equipment and storage medium |
CN115115845A (en) * | 2022-07-04 | 2022-09-27 | 杭州海康威视数字技术股份有限公司 | Image semantic content understanding method and device, electronic equipment and storage medium |
CN116847101B (en) * | 2023-09-01 | 2024-02-13 | 易方信息科技股份有限公司 | Video bit rate ladder prediction method, system and equipment based on transform network |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3364342A1 (en) * | 2017-02-17 | 2018-08-22 | Cogisen SRL | Method for image processing and video compression |
CN107437096B (en) * | 2017-07-28 | 2020-06-26 | 北京大学 | Image classification method based on parameter efficient depth residual error network model |
CN109118469B (en) * | 2018-06-20 | 2020-11-17 | 国网浙江省电力有限公司 | Prediction method for video saliency |
CN109547803B (en) * | 2018-11-21 | 2020-06-09 | 北京航空航天大学 | Time-space domain significance detection and fusion method |
CN109451310B (en) * | 2018-11-21 | 2020-10-09 | 北京航空航天大学 | Rate distortion optimization method and device based on significance weighting |
CN109309834B (en) * | 2018-11-21 | 2021-01-05 | 北京航空航天大学 | Video compression method based on convolutional neural network and HEVC compression domain significant information |
CN109859166B (en) * | 2018-12-26 | 2023-09-19 | 上海大学 | Multi-column convolutional neural network-based parameter-free 3D image quality evaluation method |
CN110135435B (en) * | 2019-04-17 | 2021-05-18 | 上海师范大学 | Saliency detection method and device based on breadth learning system |
CN111028153B (en) * | 2019-12-09 | 2024-05-07 | 南京理工大学 | Image processing and neural network training method and device and computer equipment |
CN111083477B (en) * | 2019-12-11 | 2020-11-10 | 北京航空航天大学 | HEVC (high efficiency video coding) optimization algorithm based on visual saliency |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111726633B (en) | Compressed video stream recoding method based on deep learning and significance perception | |
US7697783B2 (en) | Coding device, coding method, decoding device, decoding method, and programs of same | |
US9762917B2 (en) | Quantization method and apparatus in encoding/decoding | |
US5892548A (en) | Adaptive quantizer with modification of high frequency coefficients | |
EP1867175B1 (en) | Method for locally adjusting a quantization step | |
JP6141295B2 (en) | Perceptually lossless and perceptually enhanced image compression system and method | |
WO2020238439A1 (en) | Video quality-of-service enhancement method under restricted bandwidth of wireless ad hoc network | |
JP2002543693A (en) | Quantization method and video compression device | |
US6934418B2 (en) | Image data coding apparatus and image data server | |
CN103501438B (en) | A kind of content-adaptive method for compressing image based on principal component analysis | |
CN111131828B (en) | Image compression method and device, electronic equipment and storage medium | |
CN112738533B (en) | Machine inspection image regional compression method | |
CN114793282A (en) | Neural network based video compression with bit allocation | |
JP3532470B2 (en) | Techniques for video communication using coded matched filter devices. | |
CN116916036A (en) | Video compression method, device and system | |
US8139881B2 (en) | Method for locally adjusting a quantization step and coding device implementing said method | |
CN112040231B (en) | Video coding method based on perceptual noise channel model | |
CN101742323B (en) | Method and device for coding and decoding re-loss-free video | |
CN112001854A (en) | Method for repairing coded image and related system and device | |
CN111277835A (en) | Monitoring video compression and decompression method combining yolo3 and flownet2 network | |
CN110493597A (en) | A kind of efficiently perception video encoding optimization method | |
CN113194312B (en) | Planetary science exploration image adaptive quantization coding system combined with visual saliency | |
Peng et al. | An optimized algorithm based on generalized difference expansion method used for HEVC reversible video information hiding | |
CN116982262A (en) | State transition for dependent quantization in video coding | |
CN111491166A (en) | Dynamic compression system and method based on content analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||