CN111726633B - Compressed video stream recoding method based on deep learning and significance perception - Google Patents

Compressed video stream recoding method based on deep learning and significance perception

Info

Publication number
CN111726633B
CN111726633B (application number CN202010394906.1A)
Authority
CN
China
Prior art keywords
frame
image
video
compressed
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010394906.1A
Other languages
Chinese (zh)
Other versions
CN111726633A (en)
Inventor
李永军
李莎莎
杜浩浩
邓浩
陈立家
曹雪
王赞
陈竞
李鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202010394906.1A
Publication of CN111726633A
Application granted
Publication of CN111726633B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being a group of pictures [GOP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a compressed video stream recoding method based on deep learning and significance perception, which comprises the following steps: constructing and training a deep learning model for compressed-domain video image saliency detection; inputting the compressed video image X to be recoded into the trained compressed-domain video image saliency detection deep learning model CDVNet of step 1; partially decoding the compressed video image X to be re-encoded with the compressed-domain video image saliency detection deep learning model CDVNet; and recoding the video image with HEVC (High Efficiency Video Coding) using the updated quantization parameter of each coding unit. The method extracts saliency features in the compressed domain and performs saliency detection directly in the compressed code stream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which can perform feature extraction and saliency detection only after the compressed video has been fully decompressed to the pixel domain, and gives the method a small amount of computation and low time consumption.

Description

Compressed video stream recoding method based on deep learning and significance perception
Technical Field
The invention relates to the technical field of video image processing, in particular to a compressed video stream recoding method based on deep learning and significance perception in the technical field of video image compression.
Background
The systematization and standardization of video image compression techniques such as JPEG, JPEG2000, H.264/AVC and HEVC have made it routine for massive amounts of video image data to be stored and transmitted in compressed form. Subject to commercial, privacy or bandwidth constraints, some applications need to provide or transmit already-compressed image data at different resolutions. For example, when high-definition video is transmitted over a bandwidth-limited network, both the resolution and the transmission rate must be reduced; in a space-based integrated combat command system, the hyperspectral images sent from a communication satellite to a military command center differ in grade from those sent to each individual soldier. In addition, the display accuracy of the various display devices and communication terminals on the market differs greatly, so video images of different resolutions are also required. Already-compressed video image data therefore has to be re-encoded efficiently to meet the requirements of different transmission bandwidths and different code rates for application scenarios such as display terminals and communication terminals.
At present, recoding of a compressed video image is mainly realized by cascading two independent image decoders and encoders: the input compressed video image data is completely decoded to restore the pixel-domain signal of the original video image, and the video is then compressed a second time according to the requirements of the target application scene. Nanrui Group Co., Ltd. disclosed a video image recompression method in its patent application "A video image recompression method" (application No. 201811379107.6, publication No. CN109640100A). That method completely decodes the compressed video image, classifies the video segments obtained by dividing the original video with an SBD technique, processes the different types of video segments separately, and finally recompresses them as required. The method achieves a certain gain in compression ratio, but its "full decompression, full compression" structure cannot make good use of the information obtained during the first compression; it wastes computation and cache resources, the compression time is long, and real-time processing is difficult to achieve.
Disclosure of Invention
The invention aims to provide a compressed video stream re-encoding method based on deep learning and significance perception which overcomes the defect that, in the prior art, pixel-domain saliency detection must completely decompress the compressed video back to the pixel domain before feature extraction and saliency detection can be carried out.
In order to achieve the purpose, the invention adopts the following technical scheme:
the compressed video stream recoding method based on deep learning and significance perception comprises the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of a compressed domain video image used for training and a corresponding video image significance mapping chart;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet;
Step 2, inputting the compressed video image X to be recoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1;
Step 3, partially decoding the compressed video image X to be recoded with the compressed-domain video image saliency detection deep learning model CDVNet. Specifically, the compressed video image X to be recoded is partially decoded to obtain:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be recoded;
the height H and width W of the video frame image;
the quantization parameter QP and the number of quantization parameters l_QP;
the number G of groups of pictures (GOP) of the compressed video image X to be re-encoded, the number F of video frames in each GOP, the number K of coding units (CU) contained in each frame, and the total number R of video frames;
step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
The HEVC coding technique described in step 8 employs the international standard H.265 established in 2013.
The invention has the beneficial effects that:
First, saliency features are extracted in the compressed domain: saliency detection is carried out directly in the compressed code stream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which must completely decompress the compressed video back to the pixel domain before feature extraction and saliency detection can be carried out, and gives the method a small amount of computation and low time consumption;
Second, because a deep convolutional neural network is used, high-level saliency features are extracted from the code stream by the constructed and trained network model CDVNet. This overcomes the limitation of traditional detection methods, whose saliency of interest exists only in low-level visual information such as image brightness, chrominance and edges; the method can extract high-level image features and handles the deep characterization of scene saliency well;
Third, because the invention adopts an improved algorithm based on the R-λ model, different quantization step sizes are assigned, according to the quantization parameters of the model, to the salient and non-salient regions of the fused feature map, so that the bit rate is distributed reasonably. This avoids video distortion and loss of perceptual quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1: the invention relates to a compressed video stream recoding method based on deep learning and significance perception, which comprises the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of compressed domain video images used for training and corresponding video image significance mapping maps;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted.
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet.
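For illustration only, the setup of step 1 can be sketched in PyTorch as below. The exact CDVNet architecture and loss formula are given only as formula images in the original, so the sketch assumes the standard focal-loss form suggested by the α = 0.5 / γ = 2 description, and `CDVNet` is a hypothetical placeholder for the ResNeXt-based saliency network.

```python
import torch
import torch.nn as nn

class FocalSaliencyLoss(nn.Module):
    """Focal-style loss assumed from the description: alpha balances positive and
    negative samples, gamma down-weights easy samples."""
    def __init__(self, alpha=0.5, gamma=2.0, eps=1e-7):
        super().__init__()
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def forward(self, s, g):
        # s: predicted saliency probability per residual-DCT macroblock, in (0, 1)
        # g: ground-truth saliency map (1 = salient, 0 = not salient)
        s = s.clamp(self.eps, 1.0 - self.eps)
        pos = -self.alpha * (1.0 - s) ** self.gamma * g * torch.log(s)
        neg = -(1.0 - self.alpha) * s ** self.gamma * (1.0 - g) * torch.log(1.0 - s)
        return (pos + neg).mean()

# Hypothetical training setup mirroring step 1.3 (Batch = 64, lr = 0.001, 200 epochs,
# momentum 0.9 supplied through Adam's first-moment coefficient):
# model = CDVNet()                      # ResNeXt-backbone saliency network (not shown)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# criterion = FocalSaliencyLoss(alpha=0.5, gamma=2.0)
```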
Step 2, inputting the compressed video image X to be recoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1.
Step 3, utilizing the significance of the video image in the compressed domain to detect the deep learning model CDVNet to decode the part of the compressed video image X to be recoded; in particular, the method comprises the following steps of,
partially decoding the compressed video image X to be recoded to obtain
The predicted residual DCT coefficient of each frame of image of the compressed video image X to be recoded;
height H and width W of the video frame image;
quantization parameter QP, number of quantization parameters lQP
The number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F of each group of GOPs, the number K of coding units CU contained in each frame, and the total number of frames R of the video image.
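For illustration only, the quantities produced by partial decoding can be carried in a small container such as the sketch below; the field names are hypothetical and simply mirror the symbols listed above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PartialDecodeResult:
    residual_dct: np.ndarray   # prediction-residual DCT coefficients of each frame
    height: int                # H, height of the video frame image
    width: int                 # W, width of the video frame image
    qp: int                    # quantization parameter QP
    num_qp_levels: int         # l_QP, number of quantization parameters
    num_gops: int              # G, number of GOPs
    frames_per_gop: int        # F, video frames per GOP
    cus_per_frame: int         # K, coding units per frame
    total_frames: int          # R, total number of video frames
```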
Step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
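Step 5 then amounts to a single forward pass through the trained model. The sketch below assumes CDVNet is a PyTorch module taking a zero-centred tensor of residual DCT coefficients and returning per-macroblock saliency probabilities; shapes and the normalization used are illustrative assumptions.

```python
import torch

def global_saliency_map(cdvnet, residual_dct):
    # Step 5.1: normalize the residual DCT coefficients so the data is centred around 0
    x = torch.as_tensor(residual_dct, dtype=torch.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)
    # Step 5.2: run the trained compressed-domain saliency model to obtain GSFI
    with torch.no_grad():
        gsfi = cdvnet(x.unsqueeze(0)).squeeze(0)
    return gsfi.cpu().numpy()
```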
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
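A minimal sketch of the CU-level computation of step 7.5: λ = α · bpp^β with the given initial values, and QP derived from λ with the standard HEVC rate-control relation QP = C1 · ln λ + C2 (the QP formula itself is an image in the original, so that relation is inferred from the constants C1 = 4.2005 and C2 = 13.7122).

```python
import math

def cu_lambda_qp(bpp, alpha=3.2005, beta=-1.367, c1=4.2005, c2=13.7122):
    # bpp: target bits per pixel of the coding unit (T_CU divided by its pixel count)
    lam = alpha * (bpp ** beta)        # R-lambda model: lambda = alpha * bpp^beta
    qp = c1 * math.log(lam) + c2       # assumed standard QP-lambda relation
    return lam, int(round(min(51.0, max(0.0, qp))))   # clip QP to the HEVC range [0, 51]
```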
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
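Steps 7.1 to 7.8 form three nested loops (GOP, frame, CU). The control flow can be sketched as below, with the `alloc_*` callbacks standing in for the bit-allocation formulas of steps 7.2 to 7.4 (which appear only as images in the original) and `cu_lambda_qp` taken from the previous sketch.

```python
def allocate_and_quantize(num_gops, frames_per_gop, cus_per_frame,
                          alloc_gop_bits, alloc_frame_bits, alloc_cu_bits,
                          cu_pixels, cu_lambda_qp):
    """Returns a per-(GOP, frame, CU) table of (target bits, lambda, QP)."""
    plan = {}
    for g in range(num_gops):                      # step 7.2: GOP-level target bits
        t_gop = alloc_gop_bits(g)
        for f in range(frames_per_gop):            # step 7.3: frame-level target bits
            t_frame = alloc_frame_bits(g, f, t_gop)
            for k in range(cus_per_frame):         # step 7.4: CU-level target bits
                t_cu = alloc_cu_bits(g, f, k, t_frame)
                lam, qp = cu_lambda_qp(t_cu / cu_pixels)   # step 7.5: R-lambda model
                plan[(g, f, k)] = (t_cu, lam, qp)
    return plan
```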
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
Firstly, the saliency feature extraction based on the compression domain is adopted, and the data information obtained by partial decoding is utilized to carry out saliency detection in the compressed code stream, so that the defect that in the prior art, the saliency detection based on the pixel domain must completely decompress the compressed videos to the pixel domain before feature extraction and saliency detection can be carried out is overcome, and the method has the advantages of small calculated amount and low time consumption;
secondly, because the method of the deep convolutional neural network is adopted, the high-level saliency characteristics in the code stream are extracted from the constructed and trained network model CDVNet, the defect that the interested saliency obtained by the traditional detection method only exists in the visual information such as the brightness, the chromaticity, the edge and the like of the image is overcome, the capability of extracting the high-level characteristics of the image is realized, and the deep-level characterization problem of scene saliency can be well processed;
thirdly, as the invention adopts an improved algorithm based on an R-lambda model, the quantization step sizes with different sizes are adjusted according to the quantization parameters in the model for the significant region and the non-significant region contained in the fused characteristic diagram to realize the reasonable distribution of the bit rate, thereby overcoming the defects of video distortion and video perception effect reduction, having good coding performance and achieving better subjective quality with higher compression ratio.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (2)

1. A compressed video stream recoding method based on deep learning and significance perception, characterized by comprising the following steps:
step 1, constructing and training a compressed domain video image significance detection deep learning model, and specifically adopting the following method:
step 1.1, carrying out batch normalization on Discrete Cosine Transform (DCT) residual coefficients of a compressed domain video image used for training and a corresponding video image significance mapping chart;
Step 1.2, taking a ResNeXt network as the feature-extraction network, and constructing the compressed-domain video image saliency detection deep learning model CDVNet together with the loss function loss of the feature-extraction network. The loss function loss of the feature-extraction network is given by the formula shown as an image in the original, where G(i, j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient and G(i, j) = 0 indicates that it is not salient; S(i, j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 and γ = 2, where α = 0.5 balances the uneven proportion of positive and negative samples and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feeding the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the compressed-domain video image saliency detection deep learning model CDVNet, and training it with the stochastic optimization algorithm Adam, with training batch size Batch = 64, momentum Momentum = 0.9, initial learning rate lr = 0.001 and number of epochs Epoch = 200, finally obtaining the trained compressed-domain video image saliency detection deep learning model CDVNet;
step 2, inputting a compressed video image X to be recoded into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1;
Step 3, partially decoding the compressed video image X to be recoded with the compressed-domain video image saliency detection deep learning model CDVNet. Specifically, the compressed video image X to be recoded is partially decoded to obtain:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be recoded;
the height H and width W of the video frame image;
the quantization parameter QP and the number of quantization parameters l_QP;
the number G of groups of pictures (GOP) of the compressed video image X to be re-encoded, the number F of video frames in each GOP, the number K of coding units (CU) contained in each frame, and the total number R of video frames;
step 4, extracting local significant features of the partially decoded compressed video image X to be recoded in the step 3; specifically, the method comprises the following steps:
step 4.1, initializing the frame number r of the video frame image of the partially decoded compressed video image X to be recoded to 1;
Step 4.2, calculating the norm of the quantized prediction-residual DCT coefficients of each macroblock in the r-th frame of the video frame image of step 4.1 to obtain the RDCN feature map, according to the formula given as an image in the original, where RDCN is the norm of the prediction-residual DCT coefficients and the remaining symbol in the formula denotes the motion vector of the macroblock;
Step 4.3, performing max-min normalization on the RDCN feature map of the r-th frame of the video frame image obtained in step 4.2;
Step 4.4, convolving the max-min normalized RDCN feature map obtained in step 4.3 with a 3 × 3 Gaussian filter to realize spatial filtering;
Step 4.5, applying temporal (motion) median filtering over the previous r frames to the spatially filtered feature map of step 4.4 to obtain the local saliency feature map SRDCN of the r-th frame of the video frame image, according to the formula given as an image in the original, where Med[·] denotes the median of the spatially filtered feature values of the previous frames, taken over the spatially filtered RDCN values of the macroblock in row i, column j of the frames r − t, with t ∈ {1, 2, …, r − 2};
Step 5: extracting the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, comprising the following steps:
step 5.1, normalizing the DCT residual coefficient of the compressed video image X, so that the normalized data is distributed around a 0 value;
step 5.2, inputting the DCT residual coefficient normalized in the step 5.1 into the compressed domain video image significance detection deep learning model CDVNet trained in the step 1 to obtain a global significance characteristic map GSFI of the r-th frame of the video frame image of the compressed video image X;
Step 6, fusing and enhancing the local saliency feature map SRDCN and the global saliency feature map GSFI of the r-th frame of the video frame image, comprising the following steps:
Step 6.1, fusing the local saliency feature map SRDCN of the r-th frame obtained in step 4.5 and the global saliency feature map GSFI of the r-th frame obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of the r-th frame of the video frame image:
S_fuse = Norm(α · GSFI + β · SRDCN + γ · SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the [0, 1] interval, ⊙ denotes the element-wise (dot) product, α = QP / (3 · l_QP), β = 2 · (1 − (QP − 3) / (3 · l_QP)), and γ is given by the formula shown as an image in the original; QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, enhancing saliency and suppressing non-saliency in the fused saliency map S_fuse of the r-th frame by means of a central saliency map based on a Gaussian model (formula given as an image in the original), obtaining the central map S_central over the image positions corresponding to the fused feature values; here x_i and y_i denote the image position corresponding to a macroblock, the horizontal and vertical spreads of the Gaussian are determined by the number of macroblocks in each row and in each column of the video frame, and x_c and y_c denote the mean of the coordinates of the first 10 maxima of S_fuse, where S_fuse(x_i, y_i) are the fused saliency feature values ordered so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combining the fused saliency map S_fuse of the r-th frame obtained in step 6.1 with the position of the enhanced saliency map obtained in step 6.2 to obtain the final saliency map S_r of the r-th frame of the video frame image:
S_r = S_fuse ⊙ S_central;
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
step 7, constructing an R-lambda model of the region of interest, comprising the following steps:
step 7.1, respectively initializing the GOP group number g of the compressed video image X obtained in the step 3, the video frame number f of each group of GOPs and the number k of the coding unit CU of each frame to 1;
Step 7.2, combining the final saliency map S_r of the r-th frame obtained in step 6.3, reallocating the target bit number T_G to the GOPs of the partially decoded compressed video image X of step 3 according to the formula given as an image in the original, where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate per frame, fps is the video frame rate, δ is an offset with a default value of 0.75, γ is the ROI ratio (its formula is also given as an image in the original), N_GSFI is the number of salient macroblocks in the GOP, and the resulting ratio varies between 0.75 and 1.75;
Step 7.3, obtaining the target bit number T_F of the f-th frame according to the formula given as an image in the original, where T_F is the target number of bits of the current frame, R_GOPcoded is the target number of bits already consumed by the current GOP, ω_i is a frame-level bit-allocation weight adjusted according to the target bits, the coding structure and the characteristics of the coded frames, and the remaining term is the number of frames not yet coded;
Step 7.4, obtaining the target bits T_CU of the k-th coding unit CU according to the formula given as an image in the original, where P_CU is the probability value, within the frame, of the macroblock's RDCN-normalized feature value;
Step 7.5, calculating the quantization parameter QP value and the λ value of the k-th coding unit CU according to the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln(λ) + C2,
where α and β are parameters related to the characteristics of the sequence content, with initial values 3.2005 and −1.367, and are continuously updated adaptively according to the content; C1 = 4.2005 and C2 = 13.7122;
Step 7.6, adding 1 to the coding unit index k, and judging whether the incremented index equals the total number K of coding units; if yes, executing step 7.7; otherwise, returning to step 7.4;
Step 7.7, adding 1 to the video frame index f, and judging whether the incremented frame index equals the number F of video frames in the GOP; if yes, executing step 7.8; otherwise, returning to step 7.3;
Step 7.8, adding 1 to the GOP index g, and judging whether the incremented GOP index equals the total number G of GOPs; if yes, executing step 8; otherwise, returning to step 7.1;
Step 8, recoding the video image with HEVC (High Efficiency Video Coding), combining the updated quantization parameter of each coding unit.
2. The method of claim 1, characterized in that the HEVC coding technique described in step 8 employs the international standard H.265 established in 2013.
CN202010394906.1A 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception Active CN111726633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Publications (2)

Publication Number Publication Date
CN111726633A CN111726633A (en) 2020-09-29
CN111726633B (en) 2021-03-26

Family

ID=72564323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010394906.1A Active CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Country Status (1)

Country Link
CN (1) CN111726633B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022073159A1 (en) * 2020-10-07 2022-04-14 浙江大学 Feature data encoding method, apparatus and device, feature data decoding method, apparatus and device, and storage medium
CN112399177B (en) * 2020-11-17 2022-10-28 深圳大学 Video coding method, device, computer equipment and storage medium
CN112399176B (en) * 2020-11-17 2022-09-16 深圳市创智升科技有限公司 Video coding method and device, computer equipment and storage medium
CN113038279B (en) * 2021-03-29 2023-04-18 京东方科技集团股份有限公司 Video transcoding method and system and electronic device
CN113242433B (en) * 2021-04-27 2022-01-21 中国科学院国家空间科学中心 Image compression method and image compression system based on ARM multi-core heterogeneous processor
CN113709464B (en) * 2021-09-01 2024-08-09 展讯通信(天津)有限公司 Video coding method and related equipment
CN113660498B (en) * 2021-10-20 2022-02-11 康达洲际医疗器械有限公司 Inter-frame image universal coding method and system based on significance detection
CN114866784A (en) * 2022-04-19 2022-08-05 东南大学 Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients
CN115314722B (en) * 2022-06-17 2023-12-08 百果园技术(新加坡)有限公司 Video code rate distribution method, system, equipment and storage medium
CN114786011B (en) * 2022-06-22 2022-11-15 苏州浪潮智能科技有限公司 JPEG image compression method, system, equipment and storage medium
CN115115845A (en) * 2022-07-04 2022-09-27 杭州海康威视数字技术股份有限公司 Image semantic content understanding method and device, electronic equipment and storage medium
CN116847101B (en) * 2023-09-01 2024-02-13 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3364342A1 (en) * 2017-02-17 2018-08-22 Cogisen SRL Method for image processing and video compression
CN107437096B (en) * 2017-07-28 2020-06-26 北京大学 Image classification method based on parameter efficient depth residual error network model
CN109118469B (en) * 2018-06-20 2020-11-17 国网浙江省电力有限公司 Prediction method for video saliency
CN109547803B (en) * 2018-11-21 2020-06-09 北京航空航天大学 Time-space domain significance detection and fusion method
CN109451310B (en) * 2018-11-21 2020-10-09 北京航空航天大学 Rate distortion optimization method and device based on significance weighting
CN109309834B (en) * 2018-11-21 2021-01-05 北京航空航天大学 Video compression method based on convolutional neural network and HEVC compression domain significant information
CN109859166B (en) * 2018-12-26 2023-09-19 上海大学 Multi-column convolutional neural network-based parameter-free 3D image quality evaluation method
CN110135435B (en) * 2019-04-17 2021-05-18 上海师范大学 Saliency detection method and device based on breadth learning system
CN111028153B (en) * 2019-12-09 2024-05-07 南京理工大学 Image processing and neural network training method and device and computer equipment
CN111083477B (en) * 2019-12-11 2020-11-10 北京航空航天大学 HEVC (high efficiency video coding) optimization algorithm based on visual saliency

Also Published As

Publication number Publication date
CN111726633A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111726633B (en) Compressed video stream recoding method based on deep learning and significance perception
US7697783B2 (en) Coding device, coding method, decoding device, decoding method, and programs of same
US9762917B2 (en) Quantization method and apparatus in encoding/decoding
US5892548A (en) Adaptive quantizer with modification of high frequency coefficients
EP1867175B1 (en) Method for locally adjusting a quantization step
JP6141295B2 (en) Perceptually lossless and perceptually enhanced image compression system and method
WO2020238439A1 (en) Video quality-of-service enhancement method under restricted bandwidth of wireless ad hoc network
JP2002543693A (en) Quantization method and video compression device
US6934418B2 (en) Image data coding apparatus and image data server
CN103501438B (en) A kind of content-adaptive method for compressing image based on principal component analysis
CN111131828B (en) Image compression method and device, electronic equipment and storage medium
CN112738533B (en) Machine inspection image regional compression method
CN114793282A (en) Neural network based video compression with bit allocation
JP3532470B2 (en) Techniques for video communication using coded matched filter devices.
CN116916036A (en) Video compression method, device and system
US8139881B2 (en) Method for locally adjusting a quantization step and coding device implementing said method
CN112040231B (en) Video coding method based on perceptual noise channel model
CN101742323B (en) Method and device for coding and decoding re-loss-free video
CN112001854A (en) Method for repairing coded image and related system and device
CN111277835A (en) Monitoring video compression and decompression method combining yolo3 and flownet2 network
CN110493597A (en) A kind of efficiently perception video encoding optimization method
CN113194312B (en) Planetary science exploration image adaptive quantization coding system combined with visual saliency
Peng et al. An optimized algorithm based on generalized difference expansion method used for HEVC reversible video information hiding
CN116982262A (en) State transition for dependent quantization in video coding
CN111491166A (en) Dynamic compression system and method based on content analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant