CN114449279A - Video coding based on video quality metrics - Google Patents


Info

Publication number
CN114449279A
Authority
CN
China
Prior art keywords
video
difference
determining
block
encoding
Prior art date
Legal status
Pending
Application number
CN202111300571.3A
Other languages
Chinese (zh)
Inventor
许继征 (Xu Jizheng)
张莉 (Zhang Li)
Current Assignee
Lemon Inc Cayman Island
Original Assignee
Lemon Inc Cayman Island
Priority date
Filing date
Publication date
Application filed by Lemon Inc Cayman Island
Publication of CN114449279A

Classifications

    • H ELECTRICITY > H04 ELECTRIC COMMUNICATION TECHNIQUE > H04N PICTORIAL COMMUNICATION, e.g. TELEVISION > H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals (all entries below fall under this subclass)
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H04N19/13 Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • H04N19/147 Data rate or code amount at the encoder output according to rate distortion criteria
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H04N19/174 Coding unit being an image region, e.g. an object, the region being a slice, e.g. a line of blocks or a group of blocks
    • H04N19/186 Coding unit being a colour or a chrominance component
    • H04N19/42 Implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/577 Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
    • H04N19/86 Pre-processing or post-processing involving reduction of coding artifacts, e.g. of blockiness
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Abstract

Embodiments of the present disclosure relate to video encoding methods, electronic devices, and computer storage media. The method comprises the following steps: determining a video quality metric for a video block, the video quality metric comprising at least one of: video multi-method assessment fusion (VMAF), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), or visual information fidelity (VIF); determining encoding parameters for encoding the video block based on the video quality metric; and encoding the video block into a bitstream based on the encoding parameters. Embodiments of the present disclosure can thereby optimize the quality of video coding.

Description

Video coding based on video quality metrics
Technical Field
Embodiments of the present disclosure relate to the field of computers, and more particularly, to a video encoding method, an electronic device, and a computer storage medium.
Background
With the continuous development of multimedia technology, video has become an important part of people's daily life and entertainment. For example, people can watch a wide variety of video programs online through mobile devices.
In recent years, video coding techniques have also developed rapidly. In April 2018, VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) jointly formed the Joint Video Experts Team (JVET) to work on the VVC standard, with the goal of reducing the bit rate by 50% compared to HEVC (High Efficiency Video Coding). In video coding, on the one hand it is desirable to increase the degree of compression so as to reduce the network or storage overhead of video transmission; on the other hand, it is also desirable to obtain higher-quality video.
Disclosure of Invention
Embodiments of the present disclosure provide a scheme for video encoding.
According to a first aspect of the present disclosure, a video encoding method is presented. The method comprises the following steps: determining a video quality metric for a video block, the video quality metric comprising at least one of: video multi-method assessment fusion (VMAF), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), or visual information fidelity (VIF); determining encoding parameters for encoding the video block based on the video quality metric; and encoding the video block into a bitstream based on the encoding parameters.
According to a second aspect of the present disclosure, an electronic device is presented. The device comprises a memory and a processor, wherein the memory stores one or more computer instructions that, when executed by the processor, implement a method comprising: determining a video quality metric for a video block, the video quality metric comprising at least one of: video multi-method assessment fusion (VMAF), structural similarity (SSIM), multi-scale structural similarity (MS-SSIM), or visual information fidelity (VIF); determining encoding parameters for encoding the video block based on the video quality metric; and encoding the video block into a bitstream based on the encoding parameters.
In a third aspect of the disclosure, a computer storage medium is provided having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method described according to the first aspect.
In a fourth aspect of the present disclosure, there is provided a computer storage medium having stored thereon a codestream of video generated by a video processing apparatus executing the method described in accordance with the first aspect.
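As a rough, non-authoritative illustration of the method of the first aspect, the sketch below computes a single-window SSIM score for a block against a trial reconstruction and derives a quantization parameter from it. The function names, the QP range, the SSIM target, and the adjustment rule are hypothetical placeholders for whatever mapping an encoder actually uses.

```python
import numpy as np

def block_ssim(x: np.ndarray, y: np.ndarray,
               c1: float = 6.5025, c2: float = 58.5225) -> float:
    """Single-window SSIM between two equally sized 8-bit blocks."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def choose_qp(orig_block: np.ndarray, recon_block: np.ndarray,
              base_qp: int = 32, target_ssim: float = 0.95) -> int:
    """Hypothetical rule: spend more bits (lower QP) on blocks whose trial
    reconstruction falls below the SSIM target, save bits otherwise."""
    score = block_ssim(orig_block, recon_block)
    if score < target_ssim:
        return max(0, base_qp - 4)   # poor perceptual quality -> finer quantization
    return min(51, base_qp + 2)      # already good -> coarser quantization saves rate
```

In practice an encoder would fold such a metric into its rate-distortion decisions rather than apply a fixed threshold; the sketch only shows the direction of the dependency between the metric and the encoding parameter.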
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
fig. 1 illustrates a block diagram of an example video encoding system, in accordance with some embodiments of the present disclosure;
fig. 2 illustrates a block diagram of an example video encoder, in accordance with some embodiments of the present disclosure;
fig. 3 illustrates a block diagram of an example video decoder, in accordance with some embodiments of the present disclosure;
fig. 4 illustrates a flow diagram of a video encoding process in accordance with some embodiments of the present disclosure;
FIG. 5 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the terms "include" and its derivatives should be interpreted as being inclusive, i.e., "including but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
Fig. 1 illustrates a block diagram of an example video encoding system 100, in accordance with some embodiments of the present disclosure.
As shown in fig. 1, video encoding system 100 may include a source device 110 and a destination device 120. Source device 110 generates encoded video data and may be referred to as a video encoding device. Destination device 120 may decode the encoded video data generated by source device 110 and may be referred to as a video decoding device.
The source device 110 may include a video source 112, a video encoder 114, and an input/output (I/O) interface 116.
The video source 112 may include a source such as a video capture device, an interface that receives video data from a video content provider, and/or a computer graphics system for generating video data, or a combination of such sources. The video data may include one or more pictures. The video encoder 114 encodes the video data from the video source 112 to generate a bitstream. The bitstream may include a sequence of bits that forms an encoded representation of the video data. The bitstream may include encoded pictures and related data. An encoded picture is an encoded representation of a picture. The related data may include sequence parameter sets, picture parameter sets, and other syntax structures. The I/O interface 116 may include a modulator/demodulator (modem) and/or a transmitter. The encoded video data may be transmitted directly to the destination device 120 via the I/O interface 116 over the network 130a. The encoded video data may also be stored on a storage medium/server 130b for access by the destination device 120. The destination device 120 may include an I/O interface 126, a video decoder 124, and a display device 122. The I/O interface 126 may include a receiver and/or a modem. The I/O interface 126 may obtain encoded video data from the source device 110 or the storage medium/server 130b. The video decoder 124 may decode the encoded video data. The display device 122 may display the decoded video data to a user. The display device 122 may be integrated with the destination device 120, or may be external to the destination device 120, the destination device 120 being configured to interface with an external display device. The video encoder 114 and the video decoder 124 may operate in accordance with video compression standards such as the High Efficiency Video Coding (HEVC) standard, the VVC standard, and other current and/or future standards.
Fig. 2 shows a block diagram of an example video encoder 200; the video encoder 200 may be the video encoder 114 in the system 100 shown in fig. 1.
Video encoder 200 may be configured to perform any or all of the techniques of this disclosure. In the example of fig. 2, video encoder 200 includes a number of functional components. The techniques described in this disclosure may be shared among the various components of video encoder 200. In some examples, a processor may be configured to perform any or all of the techniques described in this disclosure. The functional components of video encoder 200 may include a partition unit 201, a prediction unit 202 (which may include a mode selection unit 203, a motion estimation unit 204, a motion compensation unit 205, and an intra prediction unit 206), a residual generation unit 207, a transform unit 208, a quantization unit 209, an inverse quantization unit 210, an inverse transform unit 211, a reconstruction unit 212, a buffer 213, and an entropy encoding unit 214.
In other examples, video encoder 200 may include more, fewer, or different functional components. In one example, the prediction unit may comprise an Intra Block Copy (IBC) unit. The IBC unit may perform prediction in IBC mode, where the at least one reference picture is a picture in which the current video block is located.
Furthermore, some components, such as the motion estimation unit 204 and the motion compensation unit 205, may be highly integrated, but are shown separately in the example of fig. 2 for purposes of explanation.
Partition unit 201 may partition a current picture into one or more video blocks. The video encoder 200 and the video decoder 300 may support various video block sizes.
The mode selection unit 203 may, for example, select one of the encoding modes (intra or inter) based on the error results, and provide the resulting intra- or inter-encoded block to the residual generation unit 207 to generate residual block data and to the reconstruction unit 212 to reconstruct the encoded block for use as part of a reference picture. In some examples, mode selection unit 203 may select a combined intra and inter prediction (CIIP) mode, in which the prediction is based on an inter prediction signal and an intra prediction signal. In the case of inter prediction, mode selection unit 203 may also select the resolution of the motion vector for the block (e.g., sub-pixel or integer-pixel precision).
To perform inter prediction on the current video block, motion estimation unit 204 may generate motion information for the current video block by comparing one or more reference frames from buffer 213 to the current video block. Motion compensation unit 205 may determine a predictive video block for the current video block based on the motion information and decoded samples from buffer 213 for pictures other than the picture associated with the current video block (e.g., a reference picture).
The motion estimation unit 204 and the motion compensation unit 205 may perform different operations on the current video block, e.g., depending on whether the current video block is in an I slice, a P slice, or a B slice.
In some examples, motion estimation unit 204 may perform uni-directional prediction on the current video block, and motion estimation unit 204 may search for a reference video block of the current video block in a list 0 or list 1 reference picture. Motion estimation unit 204 may then generate a reference index indicating a reference picture in list 0 or list 1 that includes the reference video block, and a motion vector indicating the spatial displacement between the current video block and the reference video block. Motion estimation unit 204 may output the reference index, the prediction direction indicator, and the motion vector as motion information of the current video block. The motion compensation unit 205 may generate a prediction video block of the current block based on a reference video block indicated by motion information of the current video block.
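Purely as an illustration of the block-matching search described above, the following sketch performs an integer-pel full search with a sum-of-absolute-differences (SAD) cost over a small window of one reference picture. The block size, search range, and cost function are assumptions; practical encoders use much faster search strategies and fractional-pel refinement.

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur: np.ndarray, ref: np.ndarray, bx: int, by: int,
                bsize: int = 16, search_range: int = 8):
    """Return the (dx, dy) displacement minimizing SAD for the block at (bx, by)."""
    block = cur[by:by + bsize, bx:bx + bsize]
    best = (0, 0)
    best_cost = sad(block, ref[by:by + bsize, bx:bx + bsize])
    h, w = ref.shape
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > h or x + bsize > w:
                continue  # candidate block falls outside the reference picture
            cost = sad(block, ref[y:y + bsize, x:x + bsize])
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best, best_cost
```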
In other examples, motion estimation unit 204 may perform bi-prediction on the current video block. Motion estimation unit 204 may search the reference pictures in list 0 for a reference video block of the current video block, and may also search the reference pictures in list 1 for another reference video block of the current video block. Motion estimation unit 204 may then generate reference indices indicating the reference pictures in list 0 and list 1 that contain the reference video blocks, and motion vectors indicating the spatial displacements between the reference video blocks and the current video block. Motion estimation unit 204 may output the reference indices and the motion vectors of the current video block as the motion information of the current video block. Motion compensation unit 205 may generate a prediction video block for the current video block based on the reference video blocks indicated by the motion information of the current video block. In some examples, motion estimation unit 204 does not output the full set of motion information for the current video block to, for example, entropy encoding unit 214. Instead, motion estimation unit 204 may signal the motion information of the current video block with reference to the motion information of another video block. For example, motion estimation unit 204 may determine that the motion information of the current video block is sufficiently similar to the motion information of a neighboring video block.
In one example, motion estimation unit 204 may indicate to video decoder 300 that the current video block has the same value of motion information as another video block in a syntax structure associated with the current video block.
In another example, motion estimation unit 204 may identify another video block and a Motion Vector Difference (MVD) in a syntax structure associated with the current video block. The motion vector difference indicates a difference between the motion vector of the current video block and the motion vector of the indicated video block. Video decoder 300 may use the indicated motion vector and motion vector difference for the video block to determine the motion vector for the current video block.
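The relationship signalled in this example can be stated in a couple of lines; the tuple representation and the quarter-pel units used in the example below are assumptions made only for illustration.

```python
def reconstruct_mv(predictor_mv: tuple, mvd: tuple) -> tuple:
    """Decoder-side rule: the motion vector of the current block is the motion
    vector of the indicated video block plus the transmitted motion vector
    difference (MVD)."""
    return (predictor_mv[0] + mvd[0], predictor_mv[1] + mvd[1])

# Example: indicated block's MV is (6, -2) in quarter-pel units, signalled MVD is (1, 3)
assert reconstruct_mv((6, -2), (1, 3)) == (7, 1)
```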
As described above, the video encoder 200 may predictively signal motion vectors. Two examples of predictive signaling techniques that may be implemented by video encoder 200 include advanced motion vector prediction (AMVP) and merge mode signaling.
The intra prediction unit 206 may perform intra prediction on the current video block. When intra-prediction unit 206 performs intra-prediction on the current video block, intra-prediction unit 206 may generate prediction data for the current video block based on decoded samples of other video blocks in the same picture. The prediction data for the current video block may include a prediction video block and various syntax elements.
Residual generation unit 207 may generate residual data for the current video block by subtracting (e.g., as indicated by a minus sign) the prediction video block of the current video block from the current video block. The residual data for the current video block may include residual video blocks corresponding to different sample compositions of samples in the current video block.
In other examples, for example in skip mode, the current video block may not have residual data for the current video block, and the residual generation unit 207 may not perform the subtraction operation.
Transform processing unit 208 may generate one or more transform coefficient video blocks for the current video block by applying one or more transforms to the residual video blocks associated with the current video block.
After transform processing unit 208 generates a transform coefficient video block associated with the current video block, quantization unit 209 may quantize the transform coefficient video block associated with the current video block based on a Quantization Parameter (QP) value associated with the current video block.
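To make the role of the QP concrete, the sketch below applies uniform scalar quantization to a block of transform coefficients using the conventional rule of thumb that the step size roughly doubles for every increase of 6 in QP. The scaling constant and the absence of rounding offsets and scaling matrices are simplifying assumptions, not the exact behavior of any particular codec.

```python
import numpy as np

def qp_to_step(qp: int) -> float:
    """Illustrative step size: doubles every 6 QP steps (constant factor is an assumption)."""
    return 0.625 * 2.0 ** (qp / 6.0)

def quantize(coeffs: np.ndarray, qp: int) -> np.ndarray:
    """Map transform coefficients to integer levels with a QP-dependent step size."""
    return np.round(coeffs / qp_to_step(qp)).astype(np.int32)

def dequantize(levels: np.ndarray, qp: int) -> np.ndarray:
    """Inverse mapping used for reconstruction; the rounding error is the coding distortion."""
    return levels.astype(np.float64) * qp_to_step(qp)
```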
Inverse quantization unit 210 and inverse transform unit 211 may apply inverse quantization and inverse transforms, respectively, to the transform coefficient video blocks to reconstruct residual video blocks from the transform coefficient video blocks. Reconstruction unit 212 may add the reconstructed residual video block to corresponding samples from one or more prediction video blocks generated by the prediction unit to produce a reconstructed video block associated with the current block for storage in buffer 213.
After reconstruction unit 212 reconstructs the video block, a loop filtering operation may be performed to reduce video block artifacts in the video block.
Entropy encoding unit 214 may receive data from other functional components of video encoder 200. When entropy encoding unit 214 receives the data, entropy encoding unit 214 may perform one or more entropy encoding operations to generate entropy encoded data and output a bitstream that includes the entropy encoded data.
Fig. 3 shows a block diagram of an example video decoder 300; the video decoder 300 may be the video decoder 124 in the system 100 shown in fig. 1. Video decoder 300 may be configured to perform any or all of the techniques of this disclosure. In the example of fig. 3, the video decoder 300 includes a number of functional components. The techniques described in this disclosure may be shared among the various components of the video decoder 300. In some examples, a processor may be configured to perform any or all of the techniques described in this disclosure.
In the example of fig. 3, the video decoder 300 includes an entropy decoding unit 301, a motion compensation unit 302, an intra prediction unit 303, an inverse transform unit 304, an inverse quantization unit 305, a reconstruction unit 306, and a buffer 307. In some examples, video decoder 300 may perform a decoding process that is generally the inverse of the encoding process described with respect to video encoder 200 (fig. 2).
The entropy decoding unit 301 may retrieve the encoded bitstream. The encoded bitstream may include entropy encoded video data (e.g., encoded blocks of video data). Entropy decoding unit 301 may decode entropy encoded video data, and motion compensation unit 302 may determine motion information, including motion vectors, motion vector precision, reference picture list indices, and other motion information, from the entropy decoded video data. The motion compensation unit 302 may determine this information, for example, by performing AMVP and merge mode.
Motion compensation unit 302 may use the motion vectors and/or MVDs received in the bitstream to identify the prediction video block in a reference picture in buffer 307.
The motion compensation unit 302 generates a motion compensation block, possibly based on an interpolation filter, to perform interpolation. An identifier of an interpolation filter for sub-pixel precision motion estimation may be included in the syntax element.
Motion compensation unit 302 may use the interpolation filters used by video encoder 200 during encoding of the video block to calculate interpolated values for sub-integer pixels of the reference block. Motion compensation unit 302 may determine the interpolation filter used by video encoder 200 from the received syntax information and use the interpolation filter to generate the prediction block.
The motion compensation unit 302 may use some syntax information to determine the block size for encoding a frame and/or slice of an encoded video sequence, segmentation information describing how each macroblock of a picture of the encoded video sequence is segmented, a mode indicating how each segmentation is encoded, one or more reference frames (and reference frame lists) of each inter-coded block, and other information for decoding the encoded video sequence.
The intra prediction unit 303 may use, for example, an intra prediction mode received in the bitstream to form a prediction block from spatially neighboring blocks. The inverse quantization unit 305 inversely quantizes (i.e., dequantizes) the quantized video block coefficients provided in the bitstream and decoded by the entropy decoding unit 301. The inverse transform unit 304 applies an inverse transform.
The reconstruction unit 306 may add the residual blocks to the corresponding prediction blocks generated by the motion compensation unit 302 or the intra prediction unit 303 to form decoded blocks. If desired, a loop filter may also be applied to filter the decoded blocks to remove blocking artifacts. The decoded video blocks are then stored in a buffer 307, which provides reference blocks for subsequent motion compensation and also produces decoded video for presentation on a display device.
An example video encoding scheme according to an embodiment of the present disclosure will be described below.
1. Abstract
The present application relates to image/video systems and encoding techniques. It may be applicable to existing and future image/video coding systems or standards.
2. Background of the invention
Video coding standards have evolved primarily through the development of the ITU-T and ISO/IEC standards. ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC), and H.265/HEVC standards. Since H.262, video coding standards have been based on a hybrid video coding structure, in which temporal prediction plus transform coding is used. In 2015, VCEG and MPEG jointly formed the Joint Video Exploration Team (JVET) to explore future video coding technologies beyond HEVC. Since then, JVET has adopted many new methods and put them into a reference software named the Joint Exploration Model (JEM). In April 2018, VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) jointly formed the Joint Video Experts Team (JVET) to work on the VVC standard, with the goal of reducing the bit rate by 50% compared to HEVC (High Efficiency Video Coding).
The latest version of the HEVC standard can be found at: https://www.itu.int/rec/recommendation.asp?lang=en&parent=T-REC-H.265-201911-I
The latest version of the VVC standard, i.e. Versatile Video Coding, can be found at: http://phenix.it-sudparis.eu/jvet/doc_end_user/documents/19_Teleconference/wg11/JVET-S2001-v17.zip
2.1 H.264
Advanced Video Coding (AVC), also known as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated integer-DCT coding. As of September 2019, it was used by 91% of video industry developers for recording, compressing, and distributing video content. It supports resolutions up to and including 8K UHD.
The intent of the H.264/AVC project was to create a standard capable of providing good video quality at substantially lower bit rates than previous standards (i.e. half or less of the bit rate of MPEG-2, H.263, or MPEG-4 Part 2) without increasing the complexity of the design so much that it would be impractical or excessively expensive to implement. This was achieved with a reduced-complexity integer discrete cosine transform, variable block-size partitioning, and inter-picture prediction. An additional goal was to provide enough flexibility to allow the standard to be applied to a wide variety of applications on a wide variety of networks and systems, including low and high bit rates, low- and high-resolution video, broadcast, DVD storage, RTP/IP packet networks, and ITU-T multimedia telephony systems. The H.264 standard can be viewed as a "family of standards" composed of a number of different profiles, although its "High profile" is by far the most commonly used format. A specific decoder decodes at least one, but not necessarily all, of the profiles. The standard describes the format of the encoded data and how the data is decoded, but it does not specify algorithms for encoding video; that is left open as a matter for encoder designers to decide, and a wide variety of encoding schemes have been developed. H.264 is typically used for lossy compression, although it is also possible to create truly lossless-coded regions within lossy-coded pictures, or to support rare use cases for which the entire encoding is lossless.
H.264 was standardized jointly by the ITU-T Video Coding Experts Group (VCEG) of Study Group 16 and the ISO/IEC JTC1 Moving Picture Experts Group (MPEG). The collaborative partnership for the project was known as the Joint Video Team (JVT). The ITU-T H.264 standard and the ISO/IEC MPEG-4 AVC standard (formally ISO/IEC 14496-10, MPEG-4 Part 10, Advanced Video Coding) are jointly maintained so that they have identical technical content. Drafting of the first version of the standard was completed in May 2003, and subsequent editions added various extensions to its functionality. High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is a successor to H.264/MPEG-4 AVC developed by the same organizations, while the earlier standard remains in common use.
H.264 is probably the most common video coding format on blu-ray discs. It is also widely used for streaming internet resources such as Video from Netflix, Hulu, Prime Video, Vimeo, YouTube and iTunes Store, networking software such as Adobe Flash Player and Microsoft Silverlight, and various high definition television broadcasts over terrestrial (ATSC, ISDB-T, DVB-T or DVB-T2), cable (DVB-C) and satellite (DVB-S and DVB-S2) systems.
H.264 is protected by patents owned by various parties. A license covering most (but not all) patents essential to H.264 is administered by a patent pool managed by MPEG LA.
Commercial use of patented H.264 technologies requires the payment of royalties to MPEG LA and other patent owners. MPEG LA has allowed end users to stream internet video using H.264 technology free of charge, and Cisco Systems pays a license fee to MPEG LA on behalf of the users of its open-source H.264 encoder.
The H.264 video format has a very broad application range, covering all forms of digitally compressed video from low-bit-rate internet streaming to HDTV broadcast and digital cinema applications with nearly lossless coding. With H.264, bit rate savings of 50% or more compared to MPEG-2 Part 2 are achieved. For example, H.264 has been reported to provide digital satellite TV quality equivalent to current MPEG-2 implementations at less than half the bit rate, with current MPEG-2 implementations operating at around 3.5 Mbit/s and H.264 at only 1.5 Mbit/s. Sony states that its 9 Mbit/s AVC recording mode is equivalent in picture quality to the HDV format, which uses about 18-25 Mbit/s.
To ensure compatibility and problem-free adoption of H.264/AVC, many standards bodies have amended or added to their video-related standards so that users of those standards can employ H.264/AVC. Both the Blu-ray Disc format and the now-discontinued HD DVD format include H.264/AVC High Profile as one of three mandatory video compression formats. The Digital Video Broadcasting (DVB) project approved the use of H.264/AVC for broadcast television in 2004.
The Advanced Television Systems Committee (ATSC) standards body in the United States approved H.264/AVC for broadcast television in July 2008, although the standard has not yet been used for fixed ATSC broadcasts within the United States. It has also been approved for use with the more recent ATSC-M/H (Mobile/Handheld) standard, using the AVC and SVC portions of H.264.
The closed circuit television and video surveillance markets have incorporated this technology into many products.
Many common DSLRs use H.264 video wrapped in QuickTime MOV containers as the native recording format.
H.264/AVC/MPEG-4 Part 10 contains a number of new features that allow it to compress video much more efficiently than older standards and to provide more flexibility for application in a wide variety of network environments. In particular, some key features include:
multi-frame inter prediction, comprising the following characteristics:
Using previously encoded pictures as references in a much more flexible way than in past standards, allowing up to 16 reference frames (or 32 reference fields, in the case of interlaced encoding) to be used in some cases. In profiles that support non-IDR frames, most levels specify that sufficient buffering should be available to allow for at least 4 or 5 reference frames at maximum resolution. This is in contrast to prior standards, where the limit was typically one reference frame or, in the case of conventional "B pictures" (B frames), two.
Variable block-size motion compensation (VBSMC), with block sizes as large as 16 × 16 and as small as 4 × 4, enabling accurate segmentation of motion regions. The supported luma prediction block sizes include 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4, where multiple sizes may be used together in a single macroblock. When chroma subsampling is used, the chroma prediction block size is correspondingly smaller.
For a B macroblock consisting of 16 4 × 4 partitions, up to 32 motion vectors (one or two per partition) can be used per macroblock. The motion vectors of each 8x8 or larger partition may point to different reference pictures.
The ability to use any macroblock type in B frames, including I macroblocks, resulting in more efficient encoding when using B frames. This feature was notably left out of MPEG-4 ASP.
Six-tap filtering for the derivation of half-pel luma sample predictions, for sharper sub-pixel motion compensation. Quarter-pixel motion is derived by linear interpolation of the half-pel values, to save processing power (a short sketch of this filter follows this feature list).
One quarter pixel accuracy of motion compensation, enabling accurate description of the displacement of the motion region. For chroma, the resolution is typically reduced by half in both the vertical and horizontal directions (see 4: 2: 0), and therefore motion compensation for chroma uses one-eighth of the chroma pixel grid unit.
Weighted prediction, which allows the encoder to specify the use of scaling and offset when performing motion compensation, can provide significant performance advantages in special scenarios (e.g., fade-black, fade-in, and cross-fade transitions). This includes implicit weighted prediction for B frames and explicit weighted prediction for P frames.
Spatial prediction from the edges of neighboring blocks for "intra" coding, rather than the "DC"-only prediction found in MPEG-2 Part 2 and the transform coefficient prediction found in H.263v2 and MPEG-4 Part 2. This includes luma prediction block sizes of 16x16, 8x8, and 4x4 (of which only one type can be used within each macroblock).
Integer discrete cosine transform (integer DCT), a discrete cosine transform where the transform is an integer approximation of a standard DCT. It has selectable block sizes and exact match integer computations to reduce complexity, including:
one exactly matched, integer 4x4 spatial block transform, allows for exact placement of the residual signal, with almost no "ringing" as is common in previous codec designs. It is similar to the standard DCT used in previous standards, but uses a smaller block size and simple integer processing. Unlike cosine-based formulas and tolerances in earlier standards (e.g., h.261 and MPEG-2), integer processing provides an accurately specified decoding result.
An exact-match integer 8x8 spatial block transform, allowing highly correlated regions to be compressed more efficiently than with the 4x4 transform. The design is based on the standard DCT, but simplified and made to provide exactly specified decoding.
Adaptive encoder selection between 4 × 4 and 8 × 8 transform block sizes in integer transform operation.
A secondary Hadamard transform performed on the "DC" coefficients of the primary spatial transform applied to chroma DC coefficients (and, in one special case, also luma) to obtain even more compression in smooth regions.
Lossless macroblock coding characteristics, including:
a lossless "PCM macroblock" representation mode, in which the video data samples are represented directly, allows a perfect representation of a particular region, and allows a strict limitation on the amount of encoded data per macroblock.
An enhanced lossless macroblock representation mode, able to perfectly represent a particular area, while typically using much fewer bits than PCM mode.
Flexible interlaced video coding features, including:
macroblock adaptive frame field (MBAFF) coding, using a macroblock pair structure for pictures coded as frames, allows processing 16 × 16 macroblocks in field mode (in contrast to MPEG-2, in pictures coded as frames, field mode processing results in processing 16 × 8 half-macroblocks).
Picture adaptive frame field coding (PAFF or PicAFF), allowing for freely chosen picture mix coding as a complete frame, with two fields coded either combined together or as a single field alone.
A quantization design including:
Logarithmic step size control for easier bit rate management by encoders and simplified inverse-quantization scaling
Frequency-customized quantization scaling matrices selected by the encoder for perception-based quantization optimization
An in-loop deblocking filter to help prevent blockiness common to other DCT-based image compression techniques, thereby improving visual appearance and compression efficiency
An entropy coding design comprising:
context-adaptive binary arithmetic coding (CABAC), an algorithm for lossless compression of syntax elements in a video stream given the probability of syntax elements in a Context. CABAC compresses data more efficiently than CAVLC, but requires more processing to decode.
Context-adaptive variable-length coding (CAVLC), a lower-complexity alternative to CABAC for the coding of quantized transform coefficient values. Although of lower complexity than CABAC, CAVLC is more elaborate and more efficient than the methods typically used to code coefficients in other prior designs.
A common simple and highly structured variable-length coding (VLC) technique for many of the syntax elements not coded by CABAC or CAVLC, referred to as exponential-Golomb (or Exp-Golomb) coding (a short sketch follows this feature list).
Loss resilience features, including:
the Network Abstraction Layer (NAL) definition allows the same video syntax to be used in many Network environments. A very basic design concept of h.264 is to generate independent packets to eliminate Header duplication in Header Extension Codes (HECs) like MPEG-4. This is achieved by separating information related to a plurality of segments from the media stream. The combination of advanced parameters is called a parameter set. The h.264 specification includes two types of parameter sets: a sequence parameter set and a picture parameter set. The active sequence parameter set remains unchanged throughout the encoded video sequence and the active picture parameter set remains unchanged in the encoded picture. The sequence and picture parameter set structures contain information such as picture size, optional coding modes employed, and macroblock to slice group mapping.
Flexible Macroblock Ordering (FMO), also called slice group, and Arbitrary Slice Ordering (ASO), which are techniques for reorganizing the presentation order of basic regions (macroblocks) in an image. Generally considered as error/loss robustness features, FMO and ASO may also be used for other purposes.
Data Partitioning (DP), a property that can separate more and less important syntax elements into different packets, enabling Unequal Error Protection (UEP) and other types of error/loss robustness improvements to be applied.
Redundant Slices (RS), an error/loss robustness feature, allows the encoder to send additional representations of the image region (typically with lower fidelity) that can be used if the primary representation is corrupted or lost.
Frame numbering, a feature that allows the creation of "sub-sequences", enabling temporal scalability by the optional inclusion of extra pictures between other pictures, and the detection and concealment of losses of entire pictures, which can occur due to network packet losses or channel errors.
Switching slices, called SP and SI slices, allowing an encoder to direct a decoder to jump into an ongoing video stream for purposes such as video streaming bit rate switching and "trick mode" operation. When a decoder jumps into the middle of a video stream using the SP/SI feature, it can get an exact match to the decoded pictures at that location in the video stream despite using different pictures, or no pictures at all, as references prior to the switch.
A simple automatic process for preventing the accidental emulation of start codes, which are special sequences of bits in the coded data that allow random access into the bitstream and recovery of byte alignment in systems that may lose byte synchronization.
Supplemental Enhancement Information (SEI) and Video Usability Information (VUI), which are additional information that can be inserted into a bitstream (also called a codestream) for various purposes, such as indicating the color space used by the video content or various constraints applicable to encoding. SEI messages may contain any user-defined metadata payload or other message with syntax and semantics defined in the standard.
Auxiliary pictures, which can be used for purposes such as alpha compositing.
Support for monochrome (4:0:0), 4:2:0, 4:2:2, and 4:4:4 chroma subsampling (depending on the selected profile).
Support for sample bit depth precision ranging from 8 to 14 bits per sample (depending on the selected profile).
The ability to encode individual color planes as distinct pictures with their own slice structures, macroblock modes, motion vectors, etc., allowing encoders to be designed with a simple parallelization structure (supported only in the three 4:4:4-capable profiles).
Picture order counting, a feature that serves to keep the ordering of pictures and the sample values in decoded pictures isolated from timing information, allowing timing information to be carried and controlled/changed separately by a system without affecting decoded picture content.
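Picking up the sub-pixel interpolation item from the list above: H.264 derives half-pel luma samples with a 6-tap filter whose taps are (1, -5, 20, 20, -5, 1), and quarter-pel samples by rounding averages of neighbouring full/half-pel values. The one-dimensional sketch below omits edge handling and the intermediate-precision 2-D case; it is illustrative rather than a complete description of the interpolation process.

```python
def half_pel(samples, i):
    """Half-pel luma sample between full-pel positions i and i+1, using the
    H.264 6-tap filter (1, -5, 20, 20, -5, 1) with rounding and clipping to 8 bits.
    `samples` is a 1-D sequence of full-pel luma values; edge handling is omitted."""
    acc = (samples[i - 2] - 5 * samples[i - 1] + 20 * samples[i]
           + 20 * samples[i + 1] - 5 * samples[i + 2] + samples[i + 3] + 16) >> 5
    return max(0, min(255, acc))

def quarter_pel(sample_a, sample_b):
    """Quarter-pel value: rounded average of the two nearest full/half-pel samples."""
    return (sample_a + sample_b + 1) >> 1
```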
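Likewise, for the Exp-Golomb item in the list: an order-0 Exp-Golomb codeword for a non-negative integer k consists of the binary representation of k + 1 preceded by a matching number of leading zero bits. A small sketch of the mapping (the bit-string representation is just for readability):

```python
def exp_golomb_encode(k: int) -> str:
    """Order-0 Exp-Golomb codeword for a non-negative integer k, as a bit string."""
    bits = bin(k + 1)[2:]                 # binary representation of k + 1
    return "0" * (len(bits) - 1) + bits   # prefix of leading zeros, then the value

def exp_golomb_decode(codeword: str) -> int:
    """Inverse mapping: count leading zeros, then read that many more bits plus one."""
    zeros = len(codeword) - len(codeword.lstrip("0"))
    return int(codeword[zeros:2 * zeros + 1], 2) - 1

# 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...
assert [exp_golomb_encode(k) for k in range(4)] == ["1", "010", "011", "00100"]
assert exp_golomb_decode("00100") == 3
```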
These techniques, and others, help h.264 perform significantly better than any previous standard in various application environments. H.264 generally performs better than MPEG-2 video-generally achieving the same quality at half or less bit rate, especially on high bit rate and high resolution video content.
Like the other ISO/IEC MPEG video standards, H.264/AVC has a reference software implementation that can be freely downloaded. Its main purpose is to demonstrate H.264/AVC features, rather than to be a useful application in itself. Some reference hardware design work has also been carried out by the Moving Picture Experts Group. The aspects described above include features across all profiles of H.264. A profile for a codec is a set of features of that codec identified to meet a certain set of specifications for intended applications. This means that many of the features listed are not supported in some profiles. The various profiles of H.264/AVC are discussed in the next section.
2.2 high efficiency video coding
High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is a video compression standard designed as part of the MPEG-H project as a successor to the widely used Advanced Video Coding (AVC, H.264, or MPEG-4 Part 10). In comparison to AVC, HEVC offers from 25% to 50% better data compression at the same level of video quality, or substantially improved video quality at the same bit rate. It supports resolutions up to 8192x4320, including 8K UHD, and unlike the primarily 8-bit AVC, HEVC's higher-fidelity Main 10 profile has been incorporated into nearly all supporting hardware.
While AVC uses integer discrete cosine transforms (DCTs) with 4x4 and 8x8 block sizes, HEVC uses integer DCT and DST transforms with varied block sizes between 4x4 and 32x32. The High Efficiency Image Format (HEIF) is based on HEVC. As of 2019, HEVC was used by 43% of video developers, making it the second most widely used video coding format after AVC.
Most video coding standards are designed primarily to achieve the highest coding efficiency. Coding efficiency is the ability to encode video at the lowest possible bit rate while maintaining a certain level of video quality. There are two standard ways to measure the coding efficiency of a video coding standard: one is to use an objective metric, such as peak signal-to-noise ratio (PSNR), and the other is to use subjective assessment of video quality. Subjective assessment of video quality is considered the most important way to measure a video coding standard, because humans perceive video quality subjectively.
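The objective metric mentioned above has a simple closed form; for 8-bit video, PSNR between an original and a reconstructed frame can be computed as follows.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio, in dB, between two equally sized 8-bit frames."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical frames: distortion is zero
    return 10.0 * np.log10(max_value ** 2 / mse)
```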
HEVC benefits from using larger Coding Tree Units (CTUs). This has been demonstrated in PSNR testing using the HM-8.0HEVC encoder, where the encoder is forced to use progressively smaller CTU sizes. For all test sequences, the HEVC bit rate increases by 2.2% when a 32 × 32 CTU code is mandatory and by 11.0% when a 16 × 16 CTU code is mandatory, compared to a 64 × 64 CTU code. In class a test sequences, the resolution of the video is 2560 × 1600, and the HEVC bitrate is increased by 5.7% when using a CTU size of 32 × 32 compared to a CTU size of 64 × 64; when using a CTU size of 16 × 16, an increase of 28.2% is obtained. Tests have shown that the large CTU size improves coding efficiency while also reducing decoding time.
In terms of coding efficiency, the HEVC Main Profile (MP) has been compared with the H.264/MPEG-4 AVC High Profile (HP), the MPEG-4 Advanced Simple Profile (ASP), the H.263 High Latency Profile (HLP), and the H.262/MPEG-2 Main Profile (MP). The comparison used video encoded for entertainment applications, 12 different bit rates for 9 video test sequences, and the HM-8.0 HEVC encoder. Of the 9 video test sequences, 5 were at HD resolution and 4 were at WVGA (800x480) resolution. The bit rate reductions for HEVC were determined based on PSNR, with HEVC showing a bit rate reduction of 35.4% compared to H.264/MPEG-4 AVC HP, 63.7% compared to MPEG-4 ASP, 65.1% compared to H.263 HLP, and 70.8% compared to H.262/MPEG-2 MP.
HEVC MP was also compared to h.264/MPEG-4AVC HP in terms of subjective video quality. Video coding is used for entertainment applications, with 4 different bit rates for the 9 video test sequences and the use of the HM-5.0HEVC encoder. Subjective evaluations are performed at an earlier date than PSNR comparisons, so earlier versions of HEVC encoders with somewhat lower performance are used. The bit rate reduction is determined based on subjective evaluation using mean opinion scores. The overall subjective bitrate of HEVC MP is reduced by 49.3% compared to h.264/MPEG-4AVC HP.
The École Polytechnique Fédérale de Lausanne (EPFL) conducted a study to evaluate the subjective video quality of HEVC at resolutions higher than HDTV. The study used three videos with resolutions of 3840x1744, 3840x2048, and 3840x2160. The five-second video sequences showed streets, traffic, and a scene from an open-source computer-animated movie. The video sequences were encoded at five different bit rates using the HM-6.1.1 HEVC encoder and the JM-18.3 H.264/MPEG-4 AVC encoder. The subjective bit rate reductions were determined based on subjective assessment using mean opinion score values. The study compared HEVC MP with H.264/MPEG-4 AVC HP and showed that, for HEVC MP, the average bit rate reduction based on PSNR was 44.4%, while the average bit rate reduction based on subjective video quality was 66.5%.
In an HEVC performance comparison published in 4 months of 2013, HEVC MP and Main10 Profile (M10P) were compared to h.264/MPEG-4AVC HP and High 10Profile (H10P) using a 3840 × 2160 video sequence. Video sequences are encoded using an HM-10.0HEVC encoder and a JM-18.4 h.264/MPEG-4AVC encoder. The average bit rate of the video between frames based on PSNR is reduced by 45%.
HEVC was designed to substantially improve coding efficiency compared with H.264/MPEG-4 AVC HP, i.e. to reduce bit rate requirements by half at comparable image quality, at the expense of increased computational complexity. HEVC was designed with the goal of allowing video content to have a data compression ratio of up to 1000:1. Depending on the application requirements, HEVC encoders can trade off computational complexity, compression rate, robustness to errors, and encoding delay time. Two of the key features where HEVC was improved compared with H.264/MPEG-4 AVC are support for higher-resolution video and improved parallel processing methods.
The goal of HEVC is the next generation of high definition television displays and content capture systems that have progressive scan frame rates and display resolutions ranging from QVGA (320x240) to 4320p (7680x4320) and that improve image quality in terms of noise level, color space and dynamic range.
Video coding layer
Starting from H.261, the HEVC video coding layer uses the same "hybrid" approach used in all modern video standards: inter-/intra-picture prediction and 2D transform coding. An HEVC encoder first splits each picture into block-shaped regions; the first picture of a video sequence (and the first picture at each random access point) is coded using only intra-picture prediction. Intra-picture prediction means that a block in a picture is predicted based only on information in that picture. For all other pictures, inter-picture prediction is used, in which prediction information from other pictures is used. After the prediction methods are finished and the picture has passed through the loop filters, the final picture representation is stored in the decoded picture buffer. Pictures stored in the decoded picture buffer can be used for the prediction of other pictures.
The design philosophy of HEVC is that progressive video is assumed, and no coding tools are added specifically for interlaced video. HEVC does not include interlace-specific coding tools such as MBAFF and PAFF; instead, HEVC sends metadata that describes how the interlaced video was sent. Interlaced video may be sent either by coding each frame as a separate picture or by coding each field as a separate picture. For interlaced video, HEVC can switch between frame coding and field coding using Sequence Adaptive Frame Field (SAFF), which allows the coding mode to be changed for each video sequence. This allows interlaced video to be sent with HEVC without needing special interlaced decoding processes to be added to HEVC decoders.
Color space
The HEVC standard supports multiple color spaces such as general film, NTSC, PAL, rec.601, rec.709, rec.2020, rec.2100, SMPTE 170M, SMPTE 240M, sRGB, sYCC, xvYCC, XYZ, and externally specified color spaces. HEVC supports color coding representations such as RGB, YCbCr, and YCoCg.
Coding tree unit
HEVC replaces the 16x16 pixel macroblock used in previous standards with a Coding Tree Unit (CTU) that can use a larger block structure of up to 64x64 samples and can better subdivide the picture into structures of variable size. HEVC initially divides the image into 64 × 64, 32 × 32, or 16 × 16 CTUs, with larger pixel blocks generally increasing coding efficiency.
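As a rough illustrative sketch (not part of this disclosure), the recursive quadtree splitting of a CTU into coding blocks can be outlined as follows; the split decision function should_split is a hypothetical stand-in for the rate-distortion based mode decision a real encoder would use.

```python
# Minimal sketch of CTU quadtree partitioning. should_split() is a hypothetical
# stand-in for the encoder's rate-distortion based split decision.
def partition_ctu(x, y, size, min_size, should_split):
    """Return a list of (x, y, size) coding blocks covering the CTU."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        blocks = []
        for dy in (0, half):
            for dx in (0, half):
                blocks += partition_ctu(x + dx, y + dy, half, min_size, should_split)
        return blocks
    return [(x, y, size)]

# Example: split every block larger than 32x32 once.
blocks = partition_ctu(0, 0, 64, 8, lambda x, y, s: s > 32)
print(blocks)  # four 32x32 coding blocks
```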
Inverse transformation
HEVC specifies four Transform Unit (TU) sizes of 4x4, 8x8, 16x16, and 32x32 to encode the prediction residual. A CTB may be recursively divided into 4 or more TUs. TUs use integer basis functions based on the Discrete Cosine Transform (DCT). In addition, 4x4 luma transform blocks belonging to intra-coded regions are transformed using an integer transform derived from the Discrete Sine Transform (DST). This provides a 1% bit rate reduction but is limited to 4x4 luma transform blocks, because the benefit is marginal for the other transform cases. Chroma uses the same TU sizes as luma, so there is no 2x2 chroma transform.
Parallel processing tool
Tiles allow a picture to be divided into a grid of rectangular regions that can be independently decoded/encoded. The main purpose of tiles is to enable parallel processing. Tiles can be decoded independently, and they even allow random access to certain regions of pictures in a video stream.
Wavefront Parallel Processing (WPP): a slice is divided into rows of CTUs, where the first row is decoded normally but each additional row requires decisions made in the previous row. WPP has the entropy encoder use information from the preceding row of CTUs, which allows a parallel processing approach that can achieve better compression than tiles.
Tiles and WPP are both permitted, but are optional. If tiles are present, they must be at least 64 pixels high and 256 pixels wide, with a level-specific limit on the number of tiles allowed.
Slices can be decoded largely independently of each other; their main purpose is to allow resynchronization in case of data loss in the video stream. Slices are self-contained in the sense that prediction is not performed across slice boundaries, although information across slice boundaries may still be needed when loop filtering a picture. Slices consist of CTUs decoded in raster scan order, and different coding types can be used, i.e., I-type, P-type, or B-type slices.
Dependent slices allow a system to access data related to tiles or WPP more quickly than if the entire slice had to be decoded. The main purpose of dependent slices is to enable low-delay video coding.
Entropy coding
The context-adaptive binary arithmetic coding (CABAC) algorithm used by HEVC is substantially similar to the CABAC algorithm in H.264/MPEG-4 AVC. CABAC is the only entropy coding method allowed in HEVC, whereas h.264/MPEG-4AVC allows two entropy coding methods. In HEVC, CABAC and the entropy coding of transform coefficients are designed for higher throughput than in h.264/MPEG-4AVC, while maintaining higher compression efficiency for the larger transform block sizes than a simple extension would. For example, the number of context-coded bins has been reduced by a factor of 8, and the design of the CABAC bypass mode has been improved to increase throughput. Another improvement in HEVC is that the dependencies between the coded data have been changed to further increase throughput. Context modeling in HEVC has also been improved compared with h.264/MPEG-4AVC, so that CABAC can better select contexts that improve efficiency.
Intra prediction
HEVC specifies 33 directional intra prediction modes, whereas h.264/MPEG-4AVC specifies 8 directional intra prediction modes. HEVC also specifies DC intra prediction and planar prediction modes. The DC intra prediction mode generates a mean value by averaging reference samples and can be used for flat surfaces. The planar prediction mode in HEVC supports all block sizes defined in HEVC, whereas the planar prediction mode in h.264/MPEG-4AVC is limited to a block size of 16 × 16 pixels. Intra prediction modes use data from neighboring prediction blocks that have been previously decoded within the same picture.
Motion compensation
For interpolation of fractional luma sample positions, HEVC uses the separable application of a one-dimensional 8-tap filter for half-sample positions and a 7-tap filter for quarter-sample positions. In contrast, h.264/MPEG-4AVC uses a two-stage process that first derives values at half-sample positions using separable one-dimensional 6-tap interpolation followed by integer rounding, and then applies linear interpolation between values at adjacent half-sample positions to generate values at quarter-sample positions. HEVC achieves improved precision due to the longer interpolation filters and the elimination of the intermediate rounding error. For 4:2:0 video, chroma samples are interpolated with separable one-dimensional 4-tap filtering at 1/8-sample precision, whereas h.264/MPEG-4AVC uses only a 2-tap bilinear filter (also at 1/8-sample precision).
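For illustration only, a one-dimensional half-sample interpolation step can be sketched as below. The coefficients are the commonly cited HEVC luma half-sample filter taps; the exact rounding, clipping, and bit-depth handling of a real codec are simplified assumptions here.

```python
# Sketch of separable 1-D half-sample interpolation (luma), assuming the
# commonly cited 8-tap HEVC filter; clipping and bit-depth handling omitted.
HALF_PEL_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]  # taps sum to 64

def interp_half_sample(row, i):
    """Interpolate the half-sample position between row[i] and row[i + 1]."""
    acc = 0
    for k, tap in enumerate(HALF_PEL_TAPS):
        acc += tap * row[i - 3 + k]   # taps centered between i and i + 1
    return (acc + 32) >> 6            # normalize by 64 with rounding

row = [10, 12, 15, 20, 30, 45, 60, 70, 75, 78]
print(interp_half_sample(row, 4))     # half-sample value between row[4] and row[5]
```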
As with h.264/MPEG-4AVC, weighted prediction of HEVC may be used with uni-prediction (where a single predictor is used) or bi-prediction (where predictors from two prediction blocks are combined).
Motion vector prediction
HEVC defines a signed 16-bit range for horizontal and vertical motion vectors. This was added to the mvLX variables of HEVC at the July 2012 HEVC meeting. HEVC allows a horizontal/vertical MV range of -32768 to 32767, which, given the quarter-sample precision used by HEVC, corresponds to a range of -8192 to 8191.75 luma samples. In contrast, h.264/MPEG-4AVC allows a horizontal MV range of -2048 to 2047.75 luma samples and a vertical MV range of -512 to 511.75 luma samples.
HEVC allows two MV modes: Advanced Motion Vector Prediction (AMVP) and merge mode. AMVP uses data from the reference picture and can also use data from neighboring prediction blocks. The merge mode allows MVs to be inherited from neighboring prediction blocks. The merge mode in HEVC is similar to the "skipped" and "direct" motion inference modes in h.264/MPEG-4AVC, but with two improvements. The first improvement is that HEVC uses index information to select one of several available candidates. The second improvement is that HEVC uses information from the reference picture list and reference picture index.
Loop filter
HEVC specifies two loop filters that are applied sequentially, first applying a deblocking filter (DBF) and then applying a Sample Adaptive Offset (SAO) filter. Both loop filters are applied to the inter prediction loop, i.e., the filtered pictures are stored in a Decoded Picture Buffer (DPB) as a reference for inter prediction.
Deblocking filter
The DBF is similar to the one used in H.264/MPEG-4AVC, but with a simpler design and better support for parallel processing. In HEVC the DBF applies only to an 8 × 8 sample grid, whereas in H.264/MPEG-4AVC it applies to a 4 × 4 sample grid. The DBF uses an 8x8 sample grid because this causes no noticeable degradation while significantly improving parallel processing, since the DBF no longer causes cascading interactions with other operations. Another change is that HEVC allows only three DBF strengths: 0 to 2. HEVC also requires that the DBF first apply horizontal filtering to the vertical edges of the picture, and only then apply vertical filtering to the horizontal edges of the picture. This allows multiple parallel threads to be used for the DBF.
Sample adaptive offset
The SAO filter is applied after the DBF and is designed to better reconstruct the original signal amplitudes by applying offsets stored in a lookup table in the bitstream. For each CTB, the SAO filter can be disabled or applied in one of two modes: edge offset mode or band offset mode. The edge offset mode operates by comparing a sample value with two of its eight neighboring sample values along one of four directional gradient patterns. Based on the comparison with these two neighboring samples, the sample is classified into one of five categories: minimum, maximum, an edge where the sample has the lower value, an edge where the sample has the higher value, or monotonic. For each of the first four categories, an offset is applied. The band offset mode applies an offset based on the amplitude of a single sample. Samples are classified by their amplitude into 32 bands (histogram bins). Offsets are specified for four consecutive of the 32 bands, because in flat areas prone to banding artifacts the sample amplitudes tend to be concentrated in a small range. The SAO filter is intended to improve picture quality and to reduce banding and ringing artifacts.
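As an illustrative sketch (not the normative SAO process), the edge offset classification of a sample against its two neighbors along a chosen gradient direction might look like this; the category numbering and the offset values are placeholders, not values from any standard.

```python
# Sketch of SAO edge-offset classification: compare a sample c with its two
# neighbors a and b along one of the four gradient directions.
# Illustrative category numbering: 0 = monotonic (no offset), 1 = local minimum,
# 2 = edge with c on the lower side, 3 = edge with c on the higher side,
# 4 = local maximum.
def sao_edge_category(a, c, b):
    if c < a and c < b:
        return 1                      # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                      # edge, current sample has the lower value
    if (c > a and c == b) or (c == a and c > b):
        return 3                      # edge, current sample has the higher value
    if c > a and c > b:
        return 4                      # local maximum
    return 0                          # monotonic: no offset applied

offsets = {1: 2, 2: 1, 3: -1, 4: -2}  # hypothetical offsets for the four categories
a, c, b = 100, 96, 101
c_filtered = c + offsets.get(sao_edge_category(a, c, b), 0)
```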
Range extension
The range extensions are additional profiles, levels, and techniques to support requirements beyond consumer video playback:
Profiles supporting bit depths beyond 10 bits and different luma/chroma bit depths.
Intra profiles, for use when file size is far less important than random-access decoding speed.
Still-picture profiles, forming the basis of the High Efficiency Image File Format, with no limit on picture size or complexity (level 8.5). Unlike all other levels, no minimum decoder capability is required, only a reasonable best-effort fallback.
These new profiles include enhanced coding features, many of which support efficient screen content coding or high-speed processing:
Persistent Rice adaptation, a general optimization of entropy coding.
Higher-precision weighted prediction, which benefits higher bit depths.
Cross-component prediction, which exploits imperfect YCbCr color decorrelation by using the luma (or G) component to predict the chroma (or R/B) components; the gain can reach up to 7% for YCbCr 4:4:4 and up to 26% for RGB video, especially for screen content.
Intra smoothing control, allowing the encoder to turn smoothing on or off per block rather than per frame.
Modifications to transform skip:
Residual DPCM (RDPCM), allowing the residual data to be coded more optimally than with the typical zigzag scan.
Block size flexibility, supporting transform skip for block sizes up to 32x32 (version 1 supports transform skip only for 4x4 blocks).
Rotation of 4x4 blocks, a potential efficiency improvement.
Transform skip contexts, enabling DCT and RDPCM blocks to carry separate contexts.
Extended precision processing, which makes the decoding of low-bit-depth video somewhat more accurate.
CABAC bypass alignment, a decoding optimization for the high-throughput 4:4:4 16 Intra profile.
HEVC version 2 adds several Supplemental Enhancement Information (SEI) messages:
Color remapping: maps one color space to another.
Knee function: hints for conversion between dynamic ranges, in particular from HDR to SDR.
Mastering display color volume.
Time code, for archival purposes.
Screen content coding extension
Additional coding tool options were added in the Screen Content Coding (SCC) extensions draft of March 2016:
Adaptive color transform.
Adaptive motion vector resolution.
Intra block copy.
Palette mode.
The ITU-T version of the standard with the added SCC extensions (approved in 2016 and published in March 2017) added support for the hybrid log-gamma (HLG) transfer function and the ICtCp color matrix. This enables HEVC version four to support both HDR transfer functions defined in rec.2100.
HEVC version four adds several pieces of supplemental enhancement information, including:
Alternative transfer characteristics SEI message, providing information on the preferred transfer function to be used. Its main purpose is to deliver HLG video in a manner that is backward compatible with legacy devices.
Ambient viewing environment SEI message, providing information about the ambient light of the environment used when authoring the video.
2.3 versatile video coding
Versatile Video Coding (VVC), also known as H.266, MPEG-I Part 3, and Future Video Coding (FVC), is a video compression standard finalized on 6 July 2020 by the Joint Video Experts Team (JVET) of the MPEG working group of ISO/IEC JTC1 and the VCEG working group of ITU-T; it is the successor to High Efficiency Video Coding (HEVC, also known as ITU-T H.265 and MPEG-H Part 2).
2.4 video quality metrics
■ PSNR (Peak Signal to Noise Ratio)
The PSNR index is calculated on pixel values:

PSNR = 10 · log10( MAX^2 / MSE )

where MAX is the maximum pixel value and MSE is the mean square error between the distorted image and the reference image. For an image of size M × N, the mean square error is:

MSE = (1 / (M · N)) · Σ_{i=0..M-1} Σ_{j=0..N-1} ( R(i, j) - D(i, j) )^2

where R(i, j) represents a reference image sample, D(i, j) represents a distorted image sample, and M and N represent the width and height of the image.
In the development of HEVC and VVC, another PSNR definition is used, namely

PSNR = 10 · log10( (255 · 2^(BD - 8))^2 / MSE )

where BD is the bit depth of the input signal.
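The two PSNR definitions above can be written as a short sketch, assuming 8-bit or higher-bit-depth samples held in NumPy arrays; the peak-value convention 255 · 2^(BD - 8) in psnr_bd follows one common reference-software convention and should be treated as an assumption.

```python
import numpy as np

def mse(ref, dist):
    """Mean square error between a reference image and a distorted image."""
    diff = ref.astype(np.float64) - dist.astype(np.float64)
    return np.mean(diff * diff)

def psnr(ref, dist, max_val=255.0):
    """PSNR in dB, with max_val being MAX, the maximum pixel value."""
    e = mse(ref, dist)
    return float('inf') if e == 0 else 10.0 * np.log10(max_val ** 2 / e)

def psnr_bd(ref, dist, bit_depth=10):
    """PSNR variant with the peak value tied to the bit depth BD (one common convention)."""
    return psnr(ref, dist, max_val=255.0 * (2 ** (bit_depth - 8)))

# Toy usage on random 8-bit images.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64))
dist = np.clip(ref + rng.integers(-2, 3, size=(64, 64)), 0, 255)
print(round(psnr(ref, dist), 2))
```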
■ SSIM (Structural Similarity)
The SSIM index is calculated over various windows of an image. The measure between two windows x and y of common size N × N is:

SSIM(x, y) = ( (2·μx·μy + c1) · (2·σxy + c2) ) / ( (μx^2 + μy^2 + c1) · (σx^2 + σy^2 + c2) )

where μx is the average of x and μy is the average of y; σx^2 is the variance of x; σy^2 is the variance of y; σxy is the covariance of x and y; c1 = (k1·L)^2 and c2 = (k2·L)^2 are two variables that stabilize the division with a weak denominator; L is the dynamic range of the pixel values; by default k1 = 0.01 and k2 = 0.03.
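The per-window formula above can be illustrated with a minimal sketch. It computes SSIM over one pair of windows with plain (unweighted) statistics, whereas full implementations aggregate over sliding, Gaussian-weighted windows.

```python
import numpy as np

def ssim_window(x, y, L=255.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized windows x and y (plain, unweighted statistics)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()                 # variances sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()       # covariance sigma_xy
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```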
■ MS-SSIM (Multi Scale-SSIM, multiscale structural similarity)
A more advanced form of SSIM, called multi-scale SSIM (MS-SSIM), models the multi-scale processing of the early visual system by applying a multi-stage sub-sampling process and evaluating the image at multiple scales. It performs as well as or better than SSIM on different subjective image and video databases.
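A highly simplified multi-scale sketch is given below: SSIM is evaluated at several dyadic scales obtained by 2x down-sampling, and the per-scale values are combined with exponent weights. The weights shown are the ones commonly cited in the MS-SSIM literature; a faithful MS-SSIM separates the luminance term from the contrast/structure terms, which this sketch does not.

```python
import numpy as np

MS_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]  # commonly cited 5-scale weights

def downsample2x(img):
    """2x2 mean pooling as a simple stand-in for low-pass filtering and decimation."""
    h, w = img.shape
    img = img[:h - h % 2, :w - w % 2]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def ms_ssim_simplified(x, y, ssim_fn, weights=MS_WEIGHTS):
    """Combine per-scale SSIM values with exponent weights (simplified)."""
    score = 1.0
    for i, w in enumerate(weights):
        score *= max(ssim_fn(x, y), 0.0) ** w   # clamp to avoid fractional powers of negatives
        if i < len(weights) - 1:
            x, y = downsample2x(x), downsample2x(y)
    return score
```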
■ VIF (Visual Information Fidelity)
Images and videos of the three-dimensional visual environment come from a common class: the class of natural scenes. Natural scenes form a tiny subspace of the space of all possible signals, and researchers have developed sophisticated models to characterize their statistics. Most real-world distortion processes disturb these statistics and make the image or video signal unnatural. The VIF index quantifies the information shared between a test image and a reference image using a natural scene statistics (NSS) model and a distortion (channel) model. The VIF index is furthermore based on the assumption that this shared information is an aspect of fidelity that is closely related to visual quality. In contrast to prior approaches based on human visual system (HVS) error sensitivity and structural measurement, this statistical approach, applied in an information-theoretic setting, yields a full reference (FR) quality assessment (QA) method that does not rely on any HVS or viewing-geometry parameters, nor on any constants requiring optimization, yet is competitive with state-of-the-art quality assessment methods.
Specifically, the reference image is modeled as the output of a stochastic "natural" source that passes through the HVS channel and is later processed by the brain. The information content of the reference image is quantified as the mutual information between the input and the output of the HVS channel. This is the information the brain could ideally extract from the output of the HVS. The same measure is then quantified in the presence of an image distortion channel that distorts the output of the natural source before it passes through the HVS channel, thereby measuring the information the brain could ideally extract from the test image.
■ VMAF (Video Multi-Method Assessment Fusion)
VMAF predicts video quality using existing image quality metrics and other features:
Visual Information Fidelity (VIF): information fidelity loss considered at four different spatial scales.
Detail Loss Metric (DLM): measures loss of detail and impairments that distract the viewer.
Mean Co-Located Pixel Difference (MCPD): measures the temporal difference between frames on the luminance component.
Anti-noise signal-to-noise ratio (AN-SNR).
The above features are fused using SVM-based regression to provide a single output score in the range of 0-100 per video frame, where 100 represents the same quality as the reference video. These scores are then temporally pooled across the entire video sequence using an arithmetic mean to provide an overall Differential Mean Opinion Score (DMOS).
Due to the public availability of training source code ("VMAF Development Kit," VDK), the fusion method can be retrained and evaluated based on different video data sets and features.
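The fusion and temporal pooling steps can be illustrated with a sketch: per-frame features are mapped to a per-frame score by a trained regressor (here a stand-in callable), and the per-frame scores are pooled with an arithmetic mean. The feature names and the toy regressor are purely illustrative and do not reproduce the actual VMAF model.

```python
# Sketch of VMAF-style fusion and temporal pooling. `regressor` stands in for
# the trained SVM regression model; the per-frame features are illustrative.
def vmaf_like_score(frames_features, regressor):
    """frames_features: list of dicts, e.g. {'vif': ..., 'dlm': ..., 'mcpd': ...}."""
    per_frame = []
    for feat in frames_features:
        raw = regressor(feat)                        # fuse features into one value
        per_frame.append(min(max(raw, 0.0), 100.0))  # clamp to the 0-100 range
    return sum(per_frame) / len(per_frame)           # arithmetic-mean temporal pooling

# Toy usage with a hypothetical linear "regressor".
features = [{'vif': 0.92, 'dlm': 0.88, 'mcpd': 1.5},
            {'vif': 0.90, 'dlm': 0.85, 'mcpd': 2.0}]
toy_regressor = lambda f: 60 * f['vif'] + 40 * f['dlm'] - 0.5 * f['mcpd']
print(vmaf_like_score(features, toy_regressor))
```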
3. Problems to be solved
Optimizing VMAF/SSIM/MS-SSIM is a difficult problem to solve.
4. Example scheme
1. The VMAF/SSIM/MS-SSIM difference for each block may be calculated or estimated.
2. The bit difference for each block may be calculated or estimated.
3. In the above example, a block is defined as a video unit that is smaller than a picture (frame).
a. In one example, the block may be a fixed size (e.g., 64x 64).
b. In one example, the blocks may be CTUs/CTBs.
c. In one example, the block may be a sub-picture/tile/CTU row.
d. In one example, a block may contain only a luma component.
e. In one example, a block may contain three components.
f. In one example, the definition of a block may depend on coding information, such as a tree partition structure (e.g., single/double tree), color format, slice type, reference frame information, temporal layer identifier, and the like.
4. In the above examples, the difference is defined as the difference between the VMAF/SSIM/MS-SSIM value of the current block and that of the reference block.
a. In one example, the reference block is a reference block for motion prediction.
5. Each block may be assigned a weight when performing rate-distortion optimization (RDO).
a. In one example, the weight is applied to the distortion part.
6. The weight may be based on the VMAF/SSIM/MS-SSIM difference and the bit difference of a block (see the sketch after this list).
a. In one example, when the bit difference is not zero, the weight may be proportional to the ratio of the VMAF/SSIM/MS-SSIM difference to the bit difference.
b. In one example, the weight may be clipped to a range.
7. The weight may be based on the VMAF/SSIM/MS-SSIM difference, the bit difference, and the mean square error MSE of the block.
a. In one example, the weight may be a weighted average of the VMAF/SSIM/MS-SSIM difference and the mean square error, divided by the bit difference.
8. The weight may be based on the VMAF/SSIM/MS-SSIM difference and the mean square error.
a. In one example, the weight may be a weighted average of the VMAF/SSIM/MS-SSIM difference and the mean square error.
9. The above methods are also applicable to other metrics.
a. In one example, the above method may also be applied to VIF.
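As a non-normative sketch of items 5 to 8 above, a per-block RDO weight could be derived from the metric difference, the bit difference, and the block MSE as follows; the mixing coefficient alpha, the clipping range, and the fallback value are illustrative placeholders, not values defined by this disclosure.

```python
# Sketch of a per-block RDO weight derived from a quality-metric difference
# (VMAF/SSIM/MS-SSIM/VIF), a bit difference, and the block MSE.
# alpha, w_min, w_max and default_weight are illustrative placeholders.
def block_rdo_weight(metric_diff, bit_diff, mse, alpha=0.5,
                     w_min=0.25, w_max=4.0, default_weight=1.0):
    combined = alpha * metric_diff + (1.0 - alpha) * mse   # weighted average (items 7-8)
    if bit_diff != 0:
        weight = combined / bit_diff                       # ratio form (items 6-7)
    else:
        weight = default_weight                            # fallback when the bit difference is 0
    return min(max(weight, w_min), w_max)                  # clip to a range (item 6b)

# The weight then scales the distortion term of the per-block RD cost (item 5a):
#   cost = weight * distortion + lambda_ * rate
```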
Fig. 4 shows a flow diagram of a video encoding method 400 according to an embodiment of the present disclosure. The method 400 may be implemented by, for example, the encoder 114 in fig. 1 or the encoder 200 in fig. 2.
As shown in fig. 4, at block 410, an encoder determines a video quality metric for a video block, the video quality metric including at least one of: video multi-method assessment fusion VMAF, structural similarity SSIM, multi-scale structural similarity MS-SSIM, or visual information fidelity VIF. At block 420, the encoder determines encoding parameters for encoding the video block based on the video quality metric. At block 430, the encoder encodes the video block into a codestream based on the encoding parameters.
In this manner, embodiments of the present disclosure are able to optimize the video encoding process based on the video quality metric. In addition, because video multi-method assessment fusion VMAF, structural similarity SSIM, multi-scale structural similarity MS-SSIM, and visual information fidelity VIF better reflect the subjective quality of video, embodiments of the present disclosure can further improve the subjective quality of the video.
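A minimal skeleton of the three steps of method 400 might look as follows; compute_metric, derive_parameters, and encode_block are hypothetical helpers introduced only for illustration, not functions defined by this disclosure.

```python
# Skeleton of method 400: determine a quality metric for the block (410),
# derive encoding parameters from it (420), then encode the block (430).
def encode_with_quality_metric(block, reference, compute_metric,
                               derive_parameters, encode_block):
    metric = compute_metric(block, reference)   # e.g. VMAF/SSIM/MS-SSIM/VIF
    params = derive_parameters(metric)          # e.g. an RDO distortion weight
    return encode_block(block, params)          # portion of the codestream
```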
In some embodiments, the video quality metric is a first video quality metric, and determining encoding parameters for encoding the video block based on the first video quality metric comprises: determining a second video quality metric for a reference block of the video blocks; and determining the encoding parameter based on a first difference between the first video quality metric and the second video quality metric.
Such a reference block may be, for example, a reference block for motion prediction of the video block, which may be located in the same frame or in a different frame of the video block.
In some embodiments, determining the encoding parameter based on the first difference comprises: determining a second difference between the first bit representation of the video block and a second bit representation of a reconstructed block of the video block; and determining the encoding parameter based on the first difference and the second difference. As discussed above, embodiments of the present disclosure may determine the encoding parameter based on both the delta and the bit delta of the VMAF/SSIM/MS-SSIM/VIF.
In some embodiments, determining the encoding parameter based on the first difference and the second difference comprises: determining the encoding parameter based on a first ratio of the first difference to the second difference if the second difference is not zero.
Illustratively, when the bit difference is not 0, the value of the encoding parameter may be proportional to the ratio of the VMAF/SSIM/MS-SSIM/VIF difference to the bit difference. Further, if the bit difference is 0, the encoding parameter may be set to a predetermined value, for example.
In some embodiments, determining the encoding parameter based on the first ratio of the first difference and the second difference comprises: determining a ratio range within which the first ratio falls; and determining a predetermined encoding parameter corresponding to the ratio range as the encoding parameter.
For example, a plurality of ratio ranges may be set, and each ratio range may correspond to a predetermined encoding parameter, for example. After determining the first ratio, the corresponding encoding parameter may be determined according to a ratio range in which the first ratio falls.
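For the ratio-range mapping just described, a sketch might use a small lookup table; the ranges and parameter values below are hypothetical placeholders.

```python
# Sketch of mapping a first ratio (metric difference / bit difference) to a
# predetermined encoding parameter via ratio ranges. Ranges and values are hypothetical.
RATIO_RANGES = [
    (0.0, 0.5, 0.5),           # (low, high, predetermined encoding parameter)
    (0.5, 2.0, 1.0),
    (2.0, float('inf'), 2.0),
]

def parameter_from_ratio(ratio):
    for low, high, param in RATIO_RANGES:
        if low <= ratio < high:
            return param
    return 1.0  # fallback for ratios outside all configured ranges
```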
In some embodiments, determining the encoding parameter based on the first difference comprises: determining a Mean Square Error (MSE) for the video block; and determining the coding parameter based on the first difference and the mean square error. It should be appreciated that the mean square error MSE may be calculated according to the method described in section 2.4.
In some embodiments, determining the encoding parameter based on the first difference and the mean square error comprises: determining the coding parameter based on a first weighted sum of the first difference and the mean square error. For example, the weight parameter may be a weighted average of the VMAF/SSIM/MS-SSIM/VIF difference and the mean square error. It should be understood that other weighting coefficients are possible.
In some embodiments, determining the encoding parameter based on the first difference and the second difference comprises: determining a mean square error of the video block; and determining the coding parameter based on the first difference, the second difference, and the mean square error.
In some embodiments, determining the encoding parameter based on the first difference, the second difference, and the mean square error comprises: determining a second weighted sum of the first difference and the mean square error; and determining the coding parameter based on a first ratio of the second weighted sum to the second difference if the second difference is not zero.
For example, the second weighted sum may be determined based on a weighted average of the VMAF/SSIM/MS-SSIM/VIF difference and the mean square error. It should be understood that other weighting coefficients are possible. Additionally, when the bit difference is not 0, the value of the encoding parameter may be proportional to the ratio of the second weighted sum to the bit difference. Further, if the bit difference is 0, the encoding parameter may be set to a predetermined value, for example.
In some embodiments, the encoding parameters include weight coefficients for a rate distortion optimization process for the video block. Thus, embodiments of the present disclosure may adjust parameters of the encoding process based on the video quality metric. Illustratively, the weight coefficient may be a weight applied to the distortion part in a rate distortion optimization process.
In some embodiments, the video block comprises one of: a video block having a predetermined size; a coding tree unit CTU or a coding tree block CTB in the video; or a sub-picture, tile, or CTU row.
In some embodiments, the video block is: a video block including only a luminance component; or a video block that includes one luminance component and two chrominance components.
In some embodiments, the video block is determined based on coding information, the coding information including at least one of: tree partition structure, color format, slice type, reference frame information, or temporal layer identifier.
Embodiments of the present disclosure may be described based on the following examples, it being understood that features in the following examples may be combined in an appropriate manner.
Example 1. a video encoding method, comprising:
determining a video quality metric for a video block, the video quality metric comprising at least one of: video multi-method assessment fusion VMAF, structural similarity SSIM, multi-scale structural similarity MS-SSIM, or visual information fidelity VIF;
determining encoding parameters for encoding the video block based on the video quality metric; and
encoding the video block into a codestream based on the encoding parameters.
Example 2. the method of example 1, wherein the video quality metric is a first video quality metric, and determining encoding parameters for encoding the video block based on the first video quality metric comprises:
determining a second video quality metric for a reference block of the video blocks; and
determining the encoding parameter based on a first difference between the first video quality metric and the second video quality metric.
Example 3. the method of example 2, wherein determining the encoding parameter based on the first difference comprises:
determining a second difference between the first bit representation of the video block and a second bit representation of a reconstructed block of the video block;
determining the encoding parameter based on the first difference and the second difference.
Example 4. the method of example 3, wherein determining the encoding parameter based on the first difference and the second difference comprises:
determining the encoding parameter based on a first ratio of the first difference to the second difference if the second difference is not zero.
Example 5. the method of example 4, wherein determining the encoding parameter based on the first ratio of the first difference and the second difference comprises:
determining a ratio range within which the first ratio falls; and
determining a predetermined encoding parameter corresponding to the ratio range as the encoding parameter.
Example 6. the method of example 2, wherein determining the encoding parameter based on the first difference comprises:
determining a Mean Square Error (MSE) for the video block; and
determining the encoding parameter based on the first difference and the mean square error.
Example 7. the method of example 6, wherein determining the coding parameter based on the first difference and the mean square error comprises:
determining the coding parameter based on a first weighted sum of the first difference and the mean square error.
Example 8. the method of example 3, wherein determining the encoding parameter based on the first difference and the second difference comprises:
determining a mean square error of the video block; and
determining the encoding parameter based on the first difference, the second difference, and the mean square error.
Example 9. the method of example 8, wherein determining the encoding parameter based on the first difference, the second difference, and the mean square error comprises:
determining a second weighted sum of the first difference and the mean square error; and
determining the encoding parameter based on a first ratio of the second weighted sum to the second difference if the second difference is not zero.
Example 10 the method of any of examples 1 to 9, wherein the encoding parameters comprise weight coefficients for a rate distortion optimization process of the video block.
Example 11. the method of any of examples 1 to 10, wherein the video block comprises one of:
a video block having a predetermined size;
a coding tree unit CTU or a coding tree block CTB in the video; or
Sub-picture, Tile, or CTU row.
Example 12. the method of any of examples 1 to 11, wherein the video block is:
a video block including only a luminance component; or
A video block comprising one luminance component and two chrominance components.
Example 13. the method of any of examples 1 to 12, wherein the video block is determined based on coding information, the coding information including at least one of:
tree partition structure, color format, slice type, reference frame information, or temporal layer identifier.
Example 14. an electronic device, comprising:
a memory and a processor;
wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any of examples 1 to 13.
Example 15 a computer storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method of any of examples 1-13.
Example 16 a computer storage medium having stored thereon a codestream of video generated by a video processing apparatus executing a method according to any one of examples 1 to 13.
Fig. 5 illustrates a block diagram of a computing device/server 500 in which one or more embodiments of the present disclosure may be implemented. It should be appreciated that the computing device/server 500 illustrated in FIG. 5 is merely exemplary and should not be construed as limiting in any way the functionality and scope of the embodiments described herein.
As shown in fig. 5, computing device/server 500 is in the form of a general purpose computing device. Components of computing device/server 500 may include, but are not limited to, one or more processors or processing units 510, memory 520, storage 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. The processing unit 510 may be a real or virtual processor and may be capable of performing various processes according to programs stored in the memory 520. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device/server 500.
Computing device/server 500 typically includes a number of computer storage media. Such media may be any available media that is accessible by computing device/server 500 and includes, but is not limited to, volatile and non-volatile media, removable and non-removable media. Memory 520 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory), or some combination thereof. Storage 530 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that may be capable of being used to store information and/or data (e.g., training data for training) and that may be accessed within computing device/server 500.
Computing device/server 500 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 5, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. Memory 520 may include a computer program product 525 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 540 enables communication with other computing devices over a communication medium. Additionally, the functionality of the components of computing device/server 500 may be implemented in a single computing cluster or multiple computing machines capable of communicating over a communications connection. Thus, computing device/server 500 may operate in a networked environment using logical connections to one or more other servers, network Personal Computers (PCs), or another network node.
The input device 550 may be one or more input devices such as a mouse, keyboard, trackball, or the like. Output device 560 may be one or more output devices such as a display, speakers, printer, or the like. Computing device/server 500 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., as desired through communication unit 540, with one or more devices that enable a user to interact with computing device/server 500, or with any device (e.g., network card, modem, etc.) that enables computing device/server 500 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which one or more computer instructions are stored, wherein the one or more computer instructions are executed by a processor to implement the above-described method.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products implemented in accordance with the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing has described implementations of the present disclosure, and the above description is illustrative, not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen in order to best explain the principles of implementations, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (16)

1. A video encoding method, comprising:
determining a video quality metric for a video block, the video quality metric comprising at least one of: video multi-method assessment fusion VMAF, structural similarity SSIM, multi-scale structural similarity MS-SSIM, or visual information fidelity VIF;
determining encoding parameters for encoding the video block based on the video quality metric; and
encoding the video block into a codestream based on the encoding parameters.
2. The method of claim 1, wherein the video quality metric is a first video quality metric, and determining encoding parameters for encoding the video block based on the first video quality metric comprises:
determining a second video quality metric for a reference block of the video block; and
determining the encoding parameter based on a first difference between the first video quality metric and the second video quality metric.
3. The method of claim 2, wherein determining the encoding parameter based on the first difference comprises:
determining a second difference between the first bit representation of the video block and a second bit representation of a reconstructed block of the video block; and
determining the encoding parameter based on the first difference and the second difference.
4. The method of claim 3, wherein determining the encoding parameter based on the first difference and the second difference comprises:
determining the encoding parameter based on a first ratio of the first difference to the second difference if the second difference is not zero.
5. The method of claim 4, wherein determining the encoding parameter based on the first ratio of the first difference and the second difference comprises:
determining a ratio range within which the first ratio falls; and
determining a predetermined encoding parameter corresponding to the ratio range as the encoding parameter.
6. The method of claim 2, wherein determining the encoding parameter based on the first difference comprises:
determining a Mean Square Error (MSE) for the video block; and
determining the encoding parameter based on the first difference and the mean square error.
7. The method of claim 6, wherein determining the encoding parameter based on the first difference and the mean square error comprises:
determining the coding parameter based on a first weighted sum of the first difference and the mean square error.
8. The method of claim 3, wherein determining the encoding parameter based on the first difference and the second difference comprises:
determining a mean square error of the video block; and
determining the encoding parameter based on the first difference, the second difference, and the mean square error.
9. The method of claim 8, wherein determining the encoding parameter based on the first difference, the second difference, and the mean square error comprises:
determining a second weighted sum of the first difference and the mean square error; and
determining the encoding parameter based on a first ratio of the second weighted sum to the second difference if the second difference is not zero.
10. The method of any of claims 1-9, wherein the encoding parameters comprise weight coefficients for a rate-distortion optimization process for the video block.
11. The method of any of claims 1-10, wherein the video block comprises one of:
a video block having a predetermined size;
a coding tree unit CTU or a coding tree block CTB in the video; or
Sub-picture, Tile, or CTU row.
12. The method of any one of claims 1-11, wherein the video block is:
a video block including only a luminance component; or alternatively
A video block comprising one luminance component and two chrominance components.
13. The method of any of claims 1-12, wherein the video block is determined based on coding information, the coding information comprising at least one of:
tree partition structure, color format, slice type, reference frame information, or temporal layer identifier.
14. An electronic device, comprising:
a memory and a processor;
wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any one of claims 1 to 13.
15. A computer storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method of any one of claims 1 to 13.
16. A computer storage medium having stored thereon a codestream of video generated by a video processing apparatus executing the method according to any one of claims 1 to 13.
CN202111300571.3A 2020-11-06 2021-11-04 Video coding based on video quality metrics Pending CN114449279A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063110791P 2020-11-06 2020-11-06
US63/110,791 2020-11-06

Publications (1)

Publication Number Publication Date
CN114449279A true CN114449279A (en) 2022-05-06

Family

ID=81362470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111300571.3A Pending CN114449279A (en) 2020-11-06 2021-11-04 Video coding based on video quality metrics

Country Status (1)

Country Link
CN (1) CN114449279A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination