Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-user-oriented HDR video dynamic range scalable coding method, which can enable the same HDR video stream to be displayed on different dynamic range display devices of multiple users at the same time.
The technical scheme adopted by the invention is that a multi-user-oriented HDR video dynamic range scalable coding method comprises the following steps:
(1) converting the input HDR video, through a conversion process based on perceptual quantization (PQ), into HDR video sequences of a plurality of dynamic range levels represented by different quantization depths;
(2) establishing a dynamic range scalable model (DRSM) that decomposes an HDR video frame into one SDR base frame and a plurality of residual signal frames (RSFs), wherein the RSFs represent the difference information between two adjacent dynamic range levels, and recording the maximum value and the minimum value of the original RSFs;
(3) performing median-filtering preprocessing on the RSF sequences according to statistical analysis and perceptual-characteristic analysis, using the luminance masking effect of the human eye to filter out pixels in the RSFs that have little influence on perceived quality while retaining the overall difference represented by the RSFs;
(4) encoding the processed RSF sequences and the SDR sequence with a unified HEVC encoder into a dynamic-range-scalable video bitstream, and at the same time encoding and transmitting the recorded maximum and minimum values of the RSFs as supplemental enhancement information (SEI) to assist HDR video reconstruction at the decoding end;
(5) at the decoding end, decoding the bitstream and reconstructing, through the inverse process of the DRSM, SDR and HDR videos of different dynamic range quantization depths, so that the HDR video content can be displayed on the MDR display devices of multiple users.
The beneficial effects of the invention are as follows: through a dynamic range scalable model (DRSM) that takes the perceptual characteristics of HDR video into account, the method decomposes an HDR video stream into one standard dynamic range (SDR) video and a plurality of residual signal frame (RSF) sequences, forming a dynamic-range-scalable bitstream that meets the needs of multiple users whose display devices have different dynamic ranges; in addition, the RSFs are filtered in light of the luminance masking effect and human visual perception, which improves the coding efficiency of the RSFs and hence the efficiency of the coding method as a whole.
In the step (1), the PQ-based conversion of the input HDR video comprises the following steps:
firstly, the HDR RGB image data in the original OpenEXR format is converted into R'G'B' in the perceptual domain through the non-linear function of PQ;
secondly, color space conversion from R'G'B' to Y'CbCr is carried out through a 3 × 3 conversion matrix;
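thirdly, the converted data is quantized into integer data of different bit depths, namely:
$$D_{Y'} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(219 \times 2^{b-8} \times Y' + 2^{b-4}\right)\right)$$
$$D_{Cb} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(224 \times 2^{b-8} \times Cb + 2^{b-1}\right)\right)$$
$$D_{Cr} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(224 \times 2^{b-8} \times Cr + 2^{b-1}\right)\right)$$
wherein $(Y', Cb, Cr)$ represents the 4:4:4 floating-point data obtained by color space conversion, $(D_{Y'}, D_{Cb}, D_{Cr})$ represents the quantized integer data, $\mathrm{Clip3}(\cdot)$ represents the two-sided clipping function, $219 \times 2^{b-8}$ represents the luminance scale, $2^{b-4}$ represents the luminance signal offset, $224 \times 2^{b-8}$ represents the chrominance scale, $2^{b-1}$ represents the color difference signal offset, $b$ represents the quantization depth, and $\mathrm{Round}(\cdot)$ represents the rounding function;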
and fourthly, the 4:4:4 chroma format is downsampled to the 4:2:0 chroma format, yielding a Y'CbCr video sequence suited to the subsequent HEVC coding system.
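The per-pixel part of step (1) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the PQ curve uses the published SMPTE ST 2084 constants, while the BT.2020 non-constant-luminance weights for the 3 × 3 matrix and the round-half-up rule are assumptions, since the text names neither. The function names are hypothetical, and chroma downsampling to 4:2:0 is omitted.

```python
import math

# SMPTE ST 2084 (PQ) constants.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32


def pq_encode(luminance_nits):
    """Map linear light in cd/m^2 (0..10000) to a PQ signal in [0, 1]."""
    y = max(luminance_nits, 0.0) / 10000.0
    y_m1 = y ** M1
    return ((C1 + C2 * y_m1) / (1.0 + C3 * y_m1)) ** M2


def rgb_to_ycbcr(r, g, b):
    """R'G'B' -> Y'CbCr using BT.2020 non-constant-luminance weights (assumed)."""
    y = 0.2627 * r + 0.6780 * g + 0.0593 * b
    return y, (b - y) / 1.8814, (r - y) / 1.4746


def clip3(lo, hi, v):
    """Clamp v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


def round_half_up(v):
    # Assumed rounding rule; Python's built-in round() rounds halves to even.
    return math.floor(v + 0.5)


def quantize_ycbcr(y, cb, cr, bits):
    """Quantize floating-point Y'CbCr to integers of the given bit depth."""
    top = (1 << bits) - 1
    scale = 1 << (bits - 8)                 # 2^(b-8)
    luma_offset = 1 << (bits - 4)           # 2^(b-4)
    chroma_offset = 1 << (bits - 1)         # 2^(b-1)
    d_y = clip3(0, top, round_half_up(219 * scale * y + luma_offset))
    d_cb = clip3(0, top, round_half_up(224 * scale * cb + chroma_offset))
    d_cr = clip3(0, top, round_half_up(224 * scale * cr + chroma_offset))
    return d_y, d_cb, d_cr


def convert_pixel(r_nits, g_nits, b_nits, bits):
    """Linear RGB in cd/m^2 -> quantized (D_Y', D_Cb, D_Cr)."""
    rp, gp, bp = (pq_encode(c) for c in (r_nits, g_nits, b_nits))
    return quantize_ycbcr(*rgb_to_ycbcr(rp, gp, bp), bits)
```

Calling `convert_pixel` with the same linear pixel and `bits` set to 8, 10 or 12 gives the corresponding sample in each dynamic range level.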
In the step (2), the specific process of establishing a dynamic range scalable model DRSM is as follows:
firstly, dynamic range up-sampling is performed on the video content of the lower dynamic range level to obtain an HDR video at the next higher quantization depth, namely: $V_d'(x,y) = V_{d-\Delta d}(x,y) \ll 2$, $d \in \{10,12,14,16\}$, $\Delta d = 2$, wherein $V_d'$ denotes the HDR video sequence obtained from $V_{d-\Delta d}$ by dynamic range up-sampling, whose dynamic range is one level higher than that of $V_{d-\Delta d}$;
secondly, the difference from the originally converted HDR video sequence of the same dynamic range level is taken and, in order to suit the HEVC encoder for SDR video, the residual obtained by decomposition is quantized into RSFs with the same quantization depth as the SDR sequence; the RSFs represent the difference information between two adjacent dynamic range levels, namely
$$S_{RSFi}(x,y) = Q\left(V_d(x,y) - V_d'(x,y)\right)$$
$d \in \{10,12,14,16\}$, $i \in N^*$, wherein $V_d'$ denotes the HDR video sequence obtained from $V_{d-\Delta d}$ by dynamic range up-sampling, $V_d$ denotes the original HDR video sequence, and $Q(\cdot)$ denotes the quantization, i.e. the normalized residual data is further quantized to a data range of the same quantization depth as the SDR video frame, to achieve compatibility with the data encoding of the SDR video frame;
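The decomposition of step (2) can be sketched as below. The up-sampling and the difference follow the text; the exact form of the quantizer is not given there, so the min-max uniform quantizer with round-half-up used here is an assumption, and the function names are hypothetical. Frames are plain lists of rows of integers.

```python
def upsample_dynamic_range(frame_low):
    """V_d' = V_(d-2) << 2: lift each pixel to the next dynamic range level."""
    return [[p << 2 for p in row] for row in frame_low]


def quantize_residual(residual, bits=8):
    """Min-max uniform quantizer (assumed form of Q).

    Returns (rsf, p_min, p_max); p_min and p_max are the values to record
    for transmission as SEI.
    """
    flat = [p for row in residual for p in row]
    p_min, p_max = min(flat), max(flat)
    span = p_max - p_min
    top = (1 << bits) - 1
    if span == 0:
        # Constant residual: every sample maps to 0, p_min carries the value.
        return [[0 for _ in row] for row in residual], p_min, p_max
    # Integer form of floor((p - p_min) * top / span + 0.5).
    rsf = [[(2 * (p - p_min) * top + span) // (2 * span) for p in row]
           for row in residual]
    return rsf, p_min, p_max


def decompose(frame_low, frame_high, bits=8):
    """Residual between the original higher level and the up-sampled lower level."""
    up = upsample_dynamic_range(frame_low)
    residual = [[h - u for h, u in zip(h_row, u_row)]
                for h_row, u_row in zip(frame_high, up)]
    return quantize_residual(residual, bits)
```

For example, `decompose([[100, 101]], [[403, 402]])` up-samples the lower level to `[[400, 404]]`, forms the residual `[[3, -2]]`, and returns the 8-bit RSF together with the recorded minimum and maximum.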
In the step (3), the specific process of median filtering the RSF sequences is as follows:
firstly, according to the luminance masking effect in human visual perception, the human eye has low sensitivity to detail in flat regions and low sensitivity to distortion in complex regions; for a flat region of the picture, filtering the corresponding region of the RSFs removes information to which the eye is insensitive; for a region of complex content, the corresponding region of the RSFs contains little information, and filtering does not affect the expression of the valuable content of the region;
secondly, the pixel-value characteristics of RSFso, the residual signal of the BalloonFestival sequence before RSF quantization, are counted;
the statistical analysis of the RSFs shows that complex regions contain a large number of isolated noise points, flat regions contain isolated data points that are not easily perceived by the user, and regions containing both foreground and background carry easily perceived edge and texture information;
considering the luminance masking effect of the human eye, namely that the eye is sensitive to texture and detail in a scene that is uniformly bright or uniformly dark but insensitive to texture and detail in a scene containing bright and dark regions at the same time, and since analysis shows that most HDR video scenes contain bright and dark regions at the same time, the RSFs can be preprocessed by median filtering, which smooths the content at the corresponding positions of the RSFs while retaining the overall difference characteristics between adjacent dynamic range levels;
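The median-filtering preprocessing of step (3) can be illustrated with a small pure-Python filter. This is a sketch under stated assumptions: the text specifies only a square median window, so the border handling (replicating the edge pixel) and the function name are hypothetical choices.

```python
def median_filter(frame, window=3):
    """2-D median filter with an odd square window.

    Border pixels are handled by replicating the nearest edge pixel,
    which is an assumption; the source does not specify border handling.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    height, width = len(frame), len(frame[0])
    radius = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            samples = sorted(
                frame[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            )
            row.append(samples[len(samples) // 2])  # middle of an odd count
        out.append(row)
    return out
```

An isolated point in an otherwise flat RSF region is removed, whereas a step edge between two flat regions is kept, which matches the stated aim of discarding isolated data points while retaining the overall difference between levels.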
In the step (5), the specific process of reconstructing the HDR video at the decoding end through the inverse process of the DRSM is as follows:
firstly, an SDR video facing a standard dynamic range display device is obtained by directly decoding an SDR video code stream through an HEVC decoder;
secondly, the HDR video for the high dynamic range display device can be obtained by reconstructing an inverse process of the DRSM, that is:
$$V_{Rd}(x,y) = \left(V_{R(d-\Delta d)}(x,y) \ll 2\right) + Q_{inv}\left(S_{RSFi}(x,y)\right)$$
$d \in \{10,12,14,16\}$, $\Delta d = 2$, $i \in N^*$, wherein $V_{Rd}$ represents the reconstructed HDR video sequence with a dynamic range of $d$ bits, $V_{R(d-\Delta d)}$ represents the reconstructed video sequence of the next lower dynamic range level, $d - \Delta d$ bits, and $V_{Rd}(x,y)$ represents the pixel value at the coordinate position $(x,y)$ in the reconstructed $d$-bit video frame; if the resolution of the video frame is $L \times W$, then $\{(x,y) \mid x = 0,1,2,\ldots,L-1;\ y = 0,1,2,\ldots,W-1\}$;
$Q_{inv}(p)$ represents the inverse quantization of the pixel value $p$, which restores the quantized residual to its original range, wherein $p_{max}$ and $p_{min}$ are obtained from the supplemental enhancement information.
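One level of the decoder-side reconstruction can be sketched as follows. The inverse quantizer is written as the counterpart of a min-max uniform quantizer, which is an assumed form because the text only states that $p_{max}$ and $p_{min}$ come from the SEI; the function names are hypothetical.

```python
def dequantize_residual(rsf, p_min, p_max, bits=8):
    """Assumed inverse of a min-max uniform quantizer.

    Maps each `bits`-bit sample back to the integer range [p_min, p_max]
    using the extrema carried in the SEI.
    """
    top = (1 << bits) - 1
    span = p_max - p_min
    # Integer form of p_min + floor(q * span / top + 0.5).
    return [[p_min + (2 * q * span + top) // (2 * top) for q in row]
            for row in rsf]


def reconstruct_level(frame_low, rsf, p_min, p_max, bits=8):
    """V_Rd = (V_R(d-2) << 2) + Q_inv(S_RSF) for one dynamic range level."""
    residual = dequantize_residual(rsf, p_min, p_max, bits)
    return [[(lo << 2) + r for lo, r in zip(lo_row, r_row)]
            for lo_row, r_row in zip(frame_low, residual)]
```

Under this assumed quantizer the residual is recovered exactly whenever its span does not exceed $2^b - 1$ and the RSF samples reach the decoder unchanged; filtering and lossy HEVC coding of the RSFs make the reconstruction approximate.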
Detailed Description
The invention is further described below with reference to the accompanying drawings in combination with specific embodiments so that those skilled in the art can practice the invention with reference to the description, and the scope of the invention is not limited to the specific embodiments.
The invention relates to a multi-user-oriented HDR video dynamic range scalable coding method, which comprises the following steps:
1. converting an input HDR video, through a conversion process based on perceptual quantization (PQ), into HDR video sequences of a plurality of dynamic range levels represented by different quantization depths (such as 8 bit, 10 bit and 12 bit);
2. in order that existing MDR display devices can present high-quality HDR video pictures to users, a dynamic range scalable model (DRSM) is proposed, which decomposes one HDR video frame into one SDR base frame and a plurality of residual signal frames (RSFs), the RSFs representing the difference information between two adjacent dynamic range levels;
3. perceptual filtering preprocessing is carried out on the RSF sequences, which are then encoded together with the SDR sequence by an HEVC (High Efficiency Video Coding) encoder suited to SDR (standard dynamic range) video, forming a dynamic-range-scalable video bitstream;
4. the maximum value and the minimum value of the RSFs are encoded and transmitted as supplemental enhancement information (SEI) to assist HDR video reconstruction at the decoding end;
5. at the decoding end, the bitstream is decoded and SDR and HDR videos of different dynamic range quantization depths are reconstructed through the inverse process of the DRSM, so that the HDR video content can be displayed on the MDR display devices of multiple users.
Fig. 1 is a general implementation block diagram of a multi-user-oriented HDR video dynamic range scalable coding method, which takes luminance depths of 8 bits, 10 bits, and 12bits as examples, and the specific implementation steps are as follows:
1. the input HDR video is converted, through a conversion process based on perceptual quantization (PQ), into HDR video sequences of a plurality of dynamic range levels represented by different quantization depths; taking luminance depths of 8 bit, 10 bit and 12 bit as examples, the sequences are denoted $V_{SDR\_8bit}$, $V_{SDR\_10bit}$ and $V_{SDR\_12bit}$ respectively;
2. Converting the HDR-RGB image data in the original OpenEXR format into R ' G ' B ' of a perception domain through a non-linear function of PQ;
3. color space conversion from R 'G' B 'to Y' CbCr is achieved via a 3 x 3 conversion matrix;
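4. the converted data is quantized into 8-bit, 10-bit and 12-bit integer data,
$$D_{Y'} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(219 \times 2^{b-8} \times Y' + 2^{b-4}\right)\right)$$
$$D_{Cb} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(224 \times 2^{b-8} \times Cb + 2^{b-1}\right)\right)$$
$$D_{Cr} = \mathrm{Clip3}\left(0,\ 2^{b}-1,\ \mathrm{Round}\left(224 \times 2^{b-8} \times Cr + 2^{b-1}\right)\right)$$
wherein $(Y', Cb, Cr)$ represents the 4:4:4 floating-point data obtained by color space conversion, $(D_{Y'}, D_{Cb}, D_{Cr})$ represents the quantized integer data, $\mathrm{Clip3}(\cdot)$ represents the two-sided clipping function (i.e. clipping to the range 0 to $2^{b}-1$), $219 \times 2^{b-8}$ represents the luminance scale, $2^{b-4}$ represents the luminance signal offset, $224 \times 2^{b-8}$ represents the chrominance scale, $2^{b-1}$ represents the color difference signal offset, $b$ represents the quantization depth, and $\mathrm{Round}(\cdot)$ represents the rounding function;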
5. downsampling a 4:4:4 chroma format into a 4:2:0 chroma format, and converting to obtain 8bit, 10bit and 12bit Y' CbCr video sequences of the 4:2:0 chroma format so as to adapt to a subsequent HEVC coding system;
6. dynamic range up-sampling is performed on the video content of the lower dynamic range to obtain the HDR video at the higher quantization depth, that is,
$$V_{HDR10}'(x,y) = V_{SDR8}(x,y) \ll 2$$
$$V_{HDR12}'(x,y) = V_{HDR10}(x,y) \ll 2$$
wherein $V_{SDR8}(x,y)$ and $V_{HDR10}(x,y)$ represent the pixel values at the coordinate position $(x,y)$ in the 8-bit and 10-bit video frames respectively (i.e. the $(D_{Y'}, D_{Cb}, D_{Cr})$ described above after chroma downsampling), $V_{HDR10}'(x,y)$ and $V_{HDR12}'(x,y)$ represent the pixel values obtained from $V_{SDR8}(x,y)$ and $V_{HDR10}(x,y)$ by dynamic range up-sampling, $\ll 2$ denotes a left shift by 2 bits, and if the resolution of the video frame is $L \times W$, then $\{(x,y) \mid x = 0,1,2,\ldots,L-1;\ y = 0,1,2,\ldots,W-1\}$;
7. in order to suit the HEVC encoder for SDR video, the residual obtained by decomposition is quantized into RSFs with the same quantization depth as the SDR sequence, so as to represent the difference information between two adjacent dynamic range levels,
$$S_{RSF1}(x,y) = Q\left(V_{HDR10}(x,y) - V_{HDR10}'(x,y)\right)$$
$$S_{RSF2}(x,y) = Q\left(V_{HDR12}(x,y) - V_{HDR12}'(x,y)\right)$$
wherein $S_{RSF1}$ and $S_{RSF2}$ represent the 8-bit-to-10-bit $RSF_1$ and the 10-bit-to-12-bit $RSF_2$ respectively; $Q(p)$ denotes a uniform quantization function, i.e. the normalized residual data is further quantized to a data range of the same quantization depth as the SDR video frame to achieve compatibility with the data coding of the SDR video frame; $b$ denotes the quantization depth, with $b = 8$ matching the SDR video frame data; $p_{max}$ and $p_{min}$ respectively represent the maximum value and the minimum value of all pixel values, and the $p_{max}$ and $p_{min}$ of each RSF frame are recorded;
8. a dynamic range scalable model (DRSM) is thus established according to steps 6 and 7: one HDR video frame is decomposed into one SDR base frame and a plurality of residual signal frames (RSFs), the RSFs represent the difference information between two adjacent dynamic range levels, and the maximum and minimum values of the original RSFs are recorded;
9. according to the luminance masking effect in human visual perception, the human eye has low sensitivity to detail in flat regions and low sensitivity to distortion in complex regions; for example, in fig. 2 the sky inside the dashed box is a flat region whose content is relatively smooth and carries little information, so filtering the corresponding region of the RSFs removes information to which the eye is insensitive; the grassland and the people are regions of complex content in which the eye tolerates large distortion, the corresponding region of the RSFs contains little information, and filtering does not affect the expression of the valuable content of the region;
10. the pixel-value characteristics of RSFso are counted for the BalloonFestival sequence, wherein RSFso represents the residual signal before RSF quantization, and RSF1o and RSF2o represent the original residual signals from 8 bit to 10 bit and from 10 bit to 12 bit respectively; the pixel values of both RSF1o and RSF2o are found to be integers lying in the interval [-7, 6] and concentrated mainly near 0; the maximum and minimum RSFso pixel values of the first 20 frames of the BalloonFestival sequence, i.e. the content encoded and transmitted as SEI, are listed in table 1 below;
11. through statistical analysis of RSFs, a large amount of isolated noise point information is contained in a complex region, isolated data point information which is not easy to be sensed by a user exists in a flat region, and information of easily sensed edge and texture characteristics exists in a region with a foreground and a background;
12. considering that human eyes have a brightness masking effect, namely the human eyes are sensitive to texture and detail information in a single bright area or a single dark area and are insensitive to texture and detail in a scene containing the bright and dark areas at the same time, most HDR video sequence scenes contain the bright and dark areas at the same time through analysis, the RSFs can be preprocessed in a median filtering mode, so that the content of corresponding positions of the RSFs tends to be smooth, and the overall difference characteristic between adjacent dynamic range levels can be reserved;
13. pixel points which have little influence on the perception quality in the RSFs are effectively filtered by using the human eye brightness masking effect, and the total difference which can be reflected by the RSFs is reserved;
14. the processed RSF sequences and the SDR sequence are each encoded by a unified HEVC (High Efficiency Video Coding) encoder into a dynamic-range-scalable video bitstream, and at the same time the recorded maximum and minimum values of the RSFs are encoded and transmitted as supplemental enhancement information (SEI) to assist HDR video reconstruction at the decoding end;
15. the SDR video ($V_{RSDR8}$) for 8-bit display devices is obtained by directly decoding the SDR video bitstream with an HEVC decoder;
16. the HDR videos ($V_{RHDR10}$ and $V_{RHDR12}$) for 10-bit and 12-bit display devices can be reconstructed through the inverse process of the DRSM,
$$V_{RHDR10}(x,y) = \left(V_{RSDR8}(x,y) \ll 2\right) + Q_{inv}\left(S_{RSF1}(x,y)\right)$$
$$V_{RHDR12}(x,y) = \left(V_{RHDR10}(x,y) \ll 2\right) + Q_{inv}\left(S_{RSF2}(x,y)\right)$$
wherein $V_{RHDR10}(x,y)$ and $V_{RHDR12}(x,y)$ denote the pixel values at the coordinate position $(x,y)$ in the reconstructed 10-bit and 12-bit video frames respectively, $V_{RSDR8}(x,y)$ represents the pixel value of the 8-bit SDR decoded video frame at the coordinate position $(x,y)$, and if the resolution of the video frame is $L \times W$, then $\{(x,y) \mid x = 0,1,2,\ldots,L-1;\ y = 0,1,2,\ldots,W-1\}$; $Q_{inv}(p)$ represents the inverse quantization of the pixel value $p$, with $b = 8$, and $p_{max}$ and $p_{min}$ are obtained from the SEI information.
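How a decoder might serve displays of different bit depths from one layered stream can be sketched as follows. The routine stops after as many RSF layers as the target display needs, so an 8-bit display uses the base frame alone. The layer container, the function names and the min-max form of the inverse quantizer are assumptions made for illustration; entropy decoding and the actual SEI syntax are out of scope.

```python
def inverse_quantize(q, p_min, p_max, bits=8):
    """Assumed inverse quantizer: map a `bits`-bit sample to [p_min, p_max]."""
    top = (1 << bits) - 1
    return p_min + (2 * q * (p_max - p_min) + top) // (2 * top)


def reconstruct_for_display(sdr8, layers, target_bits):
    """Rebuild a frame for a display of `target_bits` (8, 10, 12, ...).

    `sdr8` is the decoded 8-bit base frame. `layers` is a list of
    (rsf_frame, p_min, p_max) tuples ordered RSF1, RSF2, ..., where
    p_min and p_max are the per-frame extrema read from the SEI.
    """
    steps, remainder = divmod(target_bits - 8, 2)
    if remainder or not 0 <= steps <= len(layers):
        raise ValueError("unsupported target bit depth")
    frame = sdr8
    for rsf, p_min, p_max in layers[:steps]:
        # One level up: left shift by 2 bits, then add the restored residual.
        frame = [
            [(pix << 2) + inverse_quantize(q, p_min, p_max)
             for pix, q in zip(frame_row, rsf_row)]
            for frame_row, rsf_row in zip(frame, rsf)
        ]
    return frame
```

With two layers available, `target_bits=8` returns the base frame unchanged, `target_bits=10` applies RSF1 only, and `target_bits=12` applies RSF1 and then RSF2.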
Next, the encoding method of the present invention was tested to prove the effectiveness and feasibility of the encoding method of the present invention.
The HDR video test sequences used in the test all come from a recognized test database provided by MPEG, namely BalloonFestival, SunRise, Market3 and Tibul2; the resolution is 1920 × 1080, the original frame image format is OpenEXR, and the first frame of each sequence is shown in fig. 3.
Table 1 summarizes the coding bit-rate statistics of the BalloonFestival sequence. The bit-rate consumption differs greatly before and after the filtering preprocessing of the RSFs. For QPs of 12, 17, 22 and 27, experiments were run with median filter windows of 3 × 3, 7 × 7, 11 × 11 and 15 × 15 under the all-intra configuration, and the bit-rate shares were counted for five cases: the four filtered cases and the case before RSF processing. Taking the BalloonFestival sequence as an example, when QP is 12 the average bit rates of the all-intra coded SDR, RSF1 and RSF2 are 58342.58 (7.88%), 304769.44 (41.17%) and 377137.01 (50.95%) respectively. The SDR bit rate, which carries the basic picture content, thus accounts for only 7.88% of the total, while the RSF1 and RSF2 bit rates, which represent the difference information between dynamic range levels, together account for as much as 92.12%; such high bit-rate consumption is unfavorable for practical coding and transmission. Median-filtering the RSFs in light of human visual perception filters out scattered data points within a local block through the chosen window, effectively reducing the coding bit rate of the RSFs while retaining the overall difference between adjacent dynamic range levels. In table 1, "non" in the Medfilt column indicates that the RSFs are encoded directly without processing, and W × W (W = 3, 7, 11, 15) indicates the size of the median filter window, in which case all RSFs are encoded after filtering preprocessing.
The table reports the influence of different degrees of filtering preprocessing on bit-rate consumption under different QPs, the bit-rate share in each case, and the bit-rate reduction of each filtering case relative to the unprocessed case at the same QP. The data show that filtering the RSFs before encoding reduces the bit rate substantially, by up to 88.18% compared with direct encoding without processing.
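The bit-rate shares quoted above for QP 12 can be checked with a few lines of arithmetic. The three average bit rates are taken from the text; the helper names are hypothetical, and the reduction helper is shown only as the definition of the relative saving, without reproducing any table entries.

```python
def bitrate_shares(rates):
    """Percentage of the total bit rate consumed by each layer."""
    total = sum(rates.values())
    return {name: round(100.0 * rate / total, 2) for name, rate in rates.items()}


def bitrate_reduction(unfiltered, filtered):
    """Relative saving (%) of a filtered case against the unprocessed case."""
    return 100.0 * (unfiltered - filtered) / unfiltered


# Average bit rates of the BalloonFestival sequence at QP 12, unfiltered RSFs.
qp12 = {"SDR": 58342.58, "RSF1": 304769.44, "RSF2": 377137.01}
shares = bitrate_shares(qp12)
rsf_share = round(shares["RSF1"] + shares["RSF2"], 2)
print(shares, rsf_share)
```

The computed shares agree with the 7.88%, 41.17% and 50.95% stated for SDR, RSF1 and RSF2, and the two RSF layers together give the 92.12% figure.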
TABLE 1
Table 2 shows the BD-rate (%) of the methods of the present invention relative to the original reference platform. Scheme one, Proposed1, denotes the coding scheme using 15 × 15 window filtering; scheme two, Proposed2, denotes the coding scheme using 11 × 11 window filtering. Scheme one has the best rate-distortion performance, saving 32.03% and 31.28% of the bit rate on average and up to 59.0% at most; average bit-rate savings of 4.05% and 4.30% are also reported. The BD-rate fluctuates considerably because the filtering preprocessing effectively removes isolated data points in the RSFs, which improves intra-frame correlation and markedly reduces the bit rate without greatly affecting the reconstruction quality. The RSFs of the BalloonFestival sequence contain a large amount of gradual-change information and still consume a large bit rate after quantization and filtering, so the performance of scheme one is similar to that of HM-16.4. The RSFs of the SunRise sequence contain more information in the brighter and darker regions; filtering removes isolated noise points well while retaining valuable content, and the best method saves 25.5% and 26.5% of the bit rate. The RSFs of the Market3 sequence contain less and smoother content information; once isolated noise points are filtered out, the coding correlation of the RSFs improves greatly and the coding bit rate falls, with the best method saving 59.0% and 59.1% and the second-best method still saving 49.8% and 50.3%. The RSFs of the Tibul2 sequence carry more information in edge regions and on uneven surfaces; filtering effectively removes the meaningless noise on the uneven surfaces, and the best method saves 39.1% and 38.3%.
TABLE 2
Table 3 shows the BD-rate results (%) of different filtering schemes compared with the scheme without filtering. To study the influence of filtering the RSFs on scalable coding performance, four filtering schemes are compared with the unfiltered scheme, and after coding and reconstruction the results are expressed as BD-rate measured by PSNR and by HDR-VDP-2.2. In table 3, Proposed1 denotes the scheme using 3 × 3 window filtering, Proposed2 the scheme using 7 × 7 window filtering, Proposed3 the scheme using 11 × 11 window filtering, and Proposed4 the scheme using 15 × 15 window filtering. Compared with the scheme that applies no filtering after the DRSM, every filtering scheme saves a large amount of bit rate, which further shows that appropriate filtering can effectively remove meaningless scattered data points, increase the correlation exploited by intra-frame coding of the RSFs and save bit rate, while retaining the overall difference information between dynamic range levels for reconstruction.
TABLE 3
Fig. 4, 5, 6 and 7 are rate-distortion curves plotted from the HDR-VDP-2.2 quality index against bit-rate consumption. The method that directly encodes all sequences generated by the DRSM layering is denoted Proposed0. Fig. 4 shows the rate-distortion curves for reconstructing the 12-bit BalloonFestival sequence: the performance of Proposed0 is far below that of the HM platform, appropriate filtering preprocessing improves the coding performance, and both Proposed3 and Proposed4 are close to or better than the HM platform coding algorithm. Fig. 5 shows the rate-distortion curves for reconstructing the 12-bit SunRise sequence: once the filter window grows to a certain size, further enlargement has little influence on the rate-distortion performance, which indicates that the filter window has a saturation threshold, and Proposed3 and Proposed4 perform slightly better than the HM platform coding algorithm. Fig. 6 shows the rate-distortion curves for reconstructing the 12-bit Market3 sequence: the performance improves slowly when the filter window is large, because the RSFs of this sequence contain less inter-level dynamic range difference information and more of it is removed by filtering, yet the rate-distortion performance of Proposed2, Proposed3 and Proposed4 is greatly improved compared with the HM platform. Fig. 7 shows the rate-distortion curves for reconstructing the 12-bit Tibul2 sequence: the performance of Proposed1 is likewise poor, the rate-distortion performance improves as the filter window grows, and Proposed3 and Proposed4 generally outperform the HM platform coding algorithm.