WO2022257134A1 - Video encoding/decoding method, device and system, and storage medium - Google Patents


Info

Publication number
WO2022257134A1
WO2022257134A1 (PCT/CN2021/099827)
Authority
WO
WIPO (PCT)
Prior art keywords
current block
mode
block
intra
autoencoder
Application number
PCT/CN2021/099827
Other languages
French (fr)
Chinese (zh)
Inventor
戴震宇
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司
Priority to CN202180098992.4A (CN117441336A)
Priority to PCT/CN2021/099827 (WO2022257134A1)
Publication of WO2022257134A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/157: Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N 19/159: Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/593: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Abstract

A video encoding/decoding method, device and system, and a storage medium. During decoding, an identifier of the intra prediction mode of a current block is parsed from a received bitstream, and the intra prediction mode of the current block is determined from a plurality of selectable intra prediction modes according to the identifier, where the selectable intra prediction modes include an autoencoder mode. During encoding, the intra prediction mode of a current block is selected from a plurality of selectable intra prediction modes that include an autoencoder mode; intra prediction is performed on the current block on the basis of that mode, and an identifier of the mode is encoded and written into the bitstream. According to embodiments of the present disclosure, autoencoder-based image compression is introduced into a block-based video codec, thereby improving codec performance.

Description

A video encoding/decoding method, device, system, and storage medium
Technical Field
Embodiments of the present disclosure relate to, but are not limited to, video compression technology, and in particular to a video encoding/decoding method, device, system, and storage medium.
Background
Digital video compression technology compresses the huge amount of digital image and video data so that it can be transmitted and stored efficiently. With the proliferation of Internet video and ever-increasing demands for video quality, better digital video compression techniques are still needed to improve video quality and to reduce the bandwidth and traffic pressure of digital video transmission.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
An embodiment of the present disclosure provides a video decoding method, including:
parsing an identifier of an intra prediction mode of a current block from a received bitstream, and determining the intra prediction mode of the current block from a plurality of selectable intra prediction modes according to the identifier, wherein the plurality of selectable intra prediction modes include an autoencoder mode; and
performing intra prediction on the current block according to the intra prediction mode of the current block.
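The two decoding steps above can be sketched as follows. This is an illustrative sketch only: the identifier values and mode names in the table are assumptions for demonstration; the disclosure only requires that the selectable modes include an autoencoder mode.

```python
# Hypothetical mode table: identifier parsed from the bitstream -> intra
# prediction mode. The numeric values and names are illustrative.
SELECTABLE_INTRA_MODES = {
    0: "PLANAR",
    1: "DC",
    2: "ANGULAR",
    3: "AUTOENCODER",  # the autoencoder mode introduced by this disclosure
}

def determine_intra_mode(mode_id: int) -> str:
    """Map the identifier parsed from the bitstream to an intra mode."""
    if mode_id not in SELECTABLE_INTRA_MODES:
        raise ValueError(f"unknown intra prediction mode identifier: {mode_id}")
    return SELECTABLE_INTRA_MODES[mode_id]
```

A decoder would then dispatch to the prediction routine for the returned mode, for example invoking the autoencoder's decoding network when the returned mode is "AUTOENCODER".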
An embodiment of the present disclosure further provides a video encoding method, including:
selecting an intra prediction mode of a current block from a plurality of selectable intra prediction modes, the plurality of selectable intra prediction modes including an autoencoder mode; and
performing intra prediction on the current block based on the intra prediction mode of the current block, and encoding an identifier of the intra prediction mode of the current block and writing it into a bitstream.
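On the encoding side, the mode choice is typically made by comparing a cost across the candidate modes. The sketch below assumes a rate-distortion cost per mode; the cost values and mode names are invented for illustration and are not taken from the disclosure.

```python
# Choose the intra mode with the smallest rate-distortion (RD) cost; the
# chosen mode's identifier is what gets entropy-coded into the bitstream.
def select_intra_mode(rd_costs: dict) -> str:
    """Return the name of the candidate mode with the lowest RD cost."""
    return min(rd_costs, key=rd_costs.get)

# Dummy RD costs for a block where the autoencoder mode happens to win.
rd_costs = {"PLANAR": 11.9, "DC": 12.4, "ANGULAR": 10.8, "AUTOENCODER": 9.7}
chosen = select_intra_mode(rd_costs)
```

In a real encoder, the RD cost of the autoencoder mode would account for the bits spent on the supplementary information as well as the distortion of the nonlinear prediction.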
An embodiment of the present disclosure further provides a video decoding device, including a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the video decoding method of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a video decoding device, including an intra prediction processing unit, wherein the intra prediction processing unit includes:
a mode selection unit, configured to parse an identifier of the intra prediction mode of a current block from a received bitstream and determine the intra prediction mode of the current block from a plurality of selectable intra prediction modes according to the identifier, the plurality of selectable intra prediction modes including an autoencoder mode, and to activate, according to the intra prediction mode of the current block, a corresponding prediction unit to perform intra prediction on the current block; and
an autoencoder prediction unit, configured to, when the intra prediction mode of the current block is the autoencoder mode, perform, based on the decoding network of the autoencoder corresponding to the autoencoder mode, a nonlinear transform on the supplementary information of the current block, or on the supplementary information of the current block together with reconstructed reference information adjacent to the current block, to obtain predicted values of the pixels in the current block.
An embodiment of the present disclosure further provides a video encoding device, including a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the video encoding method of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a video encoding device, including an intra prediction processing unit, wherein the intra prediction processing unit includes:
a mode selection unit, configured to select the intra prediction mode of a current block from a plurality of selectable intra prediction modes, activate, according to the intra prediction mode of the current block, a corresponding prediction unit to perform intra prediction on the current block, and encode an identifier of the intra prediction mode of the current block and write it into a bitstream, wherein the plurality of selectable intra prediction modes include an autoencoder mode; and
an autoencoder prediction unit, configured to, when the intra prediction mode of the current block is the autoencoder mode, perform, based on the encoding network of the autoencoder, a first nonlinear transform on the original values of the pixels in the current block and/or reconstructed reference information adjacent to the current block, to obtain supplementary information of the current block; and, based on the decoding network of the autoencoder, perform a second nonlinear transform on the supplementary information of the current block, or on the supplementary information of the current block together with the adjacent reconstructed reference information, to obtain predicted values of the pixels in the current block.
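The two transforms described above can be illustrated with a toy stand-in: a tiny "encoding network" maps the current block to low-dimensional supplementary information, and a "decoding network" maps that information back to a predicted block. The single tanh layers, random weights, and dimensions are all illustrative assumptions; the disclosure does not specify the network architecture here, and for simplicity this sketch feeds only the current block's pixels to the encoding network, although the disclosure also allows neighboring reconstructed reference information as input.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK = 4    # 4x4 current block (illustrative size)
LATENT = 2   # size of the supplementary information (illustrative)

# Random, untrained weights standing in for the trained networks.
W_enc = rng.standard_normal((LATENT, BLOCK * BLOCK))
W_dec = rng.standard_normal((BLOCK * BLOCK, LATENT))

def first_transform(block: np.ndarray) -> np.ndarray:
    """Encoding network: current block -> supplementary information."""
    return np.tanh(W_enc @ block.flatten())

def second_transform(side_info: np.ndarray) -> np.ndarray:
    """Decoding network: supplementary information -> predicted block."""
    return np.tanh(W_dec @ side_info).reshape(BLOCK, BLOCK)

block = rng.random((BLOCK, BLOCK))        # stand-in for original pixel values
side_info = first_transform(block)        # written into the bitstream
prediction = second_transform(side_info)  # reproducible at the decoder
```

The key property this sketch demonstrates is that the second transform uses only the (small) supplementary information, so the decoder can reproduce the same prediction from the bitstream without access to the original pixels.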
An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding device of any embodiment of the present disclosure and the video decoding device of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video encoding method or the video decoding method of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a bitstream, wherein the bitstream is generated according to the video encoding method of an embodiment of the present disclosure, and the bitstream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block and the supplementary information of the current block; or the bitstream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block, the supplementary information of the current block, and one or more of the following: a network parameter identifier of the current block, residual values of the pixels in the current block, and a residual identifier of the current block.
Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief Description of the Drawings
The accompanying drawings are provided to aid understanding of the embodiments of the present disclosure; they constitute a part of the specification and, together with the embodiments, serve to explain the technical solutions of the present disclosure, but they do not limit those technical solutions.
FIG. 1 is a structural block diagram of a video encoding and decoding system usable in embodiments of the present disclosure;
FIG. 2 is a structural block diagram of a video encoder usable in embodiments of the present disclosure;
FIG. 3 is a structural block diagram of a video decoder usable in embodiments of the present disclosure;
FIG. 4 is a schematic diagram of traditional intra prediction modes;
FIG. 5 is a framework diagram of autoencoder-based image compression;
FIG. 6 is a schematic structural diagram of an autoencoder transplanted into the video codec framework according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of reference pixels adjacent to the current block according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of the intra prediction processing unit of a video encoder after the autoencoder has been transplanted in, according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the modules participating in encoding when residual processing is skipped in the first autoencoder mode in a video encoder according to an embodiment of the present disclosure;
FIG. 10 is a flowchart of a video encoding method according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of the intra prediction processing unit of a video decoder after the autoencoder has been transplanted in, according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of the modules participating in decoding when residual processing is skipped in the second autoencoder mode in a video decoder according to an embodiment of the present disclosure;
FIG. 13 is a flowchart of a video decoding method according to an embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of a video encoder according to an embodiment of the present disclosure;
FIG. 15A to FIG. 15D, FIG. 16A to FIG. 16D, and FIG. 17A to FIG. 17D are three sets of comparison diagrams between reconstructed blocks obtained with different intra prediction methods and the original blocks.
Detailed Description
The present disclosure describes a number of embodiments, but the description is illustrative rather than restrictive, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein.
In the description of the present disclosure, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment described as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments. Herein, "and/or" describes an association between related objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. "A plurality of" means two or more. In addition, to describe the technical solutions of the embodiments clearly, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will understand that words such as "first" and "second" limit neither quantity nor execution order, nor do they require the items to be different.
In describing representative exemplary embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of steps described herein, it should not be limited to that particular sequence; as those of ordinary skill in the art will appreciate, other orders of steps are possible. Therefore, the particular order of the steps set forth in the specification should not be construed as a limitation on the claims. Furthermore, claims directed to the method and/or process should not be limited to performing their steps in the order written; those skilled in the art can readily appreciate that the order may be varied while remaining within the spirit and scope of the embodiments of the present disclosure.
Internationally, mainstream video coding standards include H.264/Advanced Video Coding (AVC), H.265/High Efficiency Video Coding (HEVC), H.266/Versatile Video Coding (VVC), the MPEG (Moving Picture Experts Group) standards, the AOM (Alliance for Open Media) formats, AVS (Audio Video coding Standard), extensions of these standards, and any other custom standards. Through video compression techniques, these standards reduce the amount of data that must be transmitted and stored, achieving more efficient video encoding, decoding, transmission, and storage.
The mainstream video coding standards above all adopt a block-based hybrid coding scheme. Using the block as the basic unit, this scheme first performs intra prediction or inter prediction, then transforms and quantizes the prediction residual (also simply called the residual), and finally entropy-codes the syntax elements related to partitioning and prediction together with the quantized residual, producing an encoded video bitstream (herein simply called the bitstream; also known as the bit stream or code stream). During encoding and decoding, in-loop filtering can also be applied to improve the quality of the reconstructed picture.
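The hybrid coding loop just described can be sketched minimally as follows, with the transform and entropy-coding steps omitted for brevity: form a residual against the prediction, quantize it, and reconstruct. The block values and the quantization step size are arbitrary illustrative numbers.

```python
import numpy as np

def encode_block(original: np.ndarray, prediction: np.ndarray, qstep: float = 4.0):
    """Residual -> (simplified) quantization; the result would be entropy-coded."""
    residual = original - prediction
    return np.round(residual / qstep).astype(int)

def reconstruct_block(prediction: np.ndarray, quantized: np.ndarray, qstep: float = 4.0):
    """Inverse quantization plus prediction, as the decoder would do."""
    return prediction + quantized * qstep

original = np.array([[100.0, 104.0], [96.0, 92.0]])  # toy 2x2 block
prediction = np.full((2, 2), 100.0)                  # toy intra prediction
q = encode_block(original, prediction)
recon = reconstruct_block(prediction, q)
```

Real codecs additionally transform the residual (e.g. with a DCT) before quantization and apply in-loop filtering to the reconstruction; this sketch only shows the predict/residual/quantize/reconstruct skeleton shared by the standards above.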
In H.264/AVC, the input picture is divided into fixed-size blocks that serve as the basic coding unit, called macroblocks (MB, Macro Block). A macroblock comprises one luma block and two chroma blocks, the luma block being 16×16. With 4:2:0 sampling, each chroma block is half the size of the luma block in each dimension. For prediction, a macroblock is further divided into smaller blocks depending on the prediction mode. In intra prediction, the macroblock can be divided into 16×16, 8×8, or 4×4 sub-blocks, each of which is intra-predicted separately. For transform and quantization, the macroblock is divided into 4×4 or 8×8 sub-blocks, and the residual in each sub-block is transformed and quantized separately to obtain the quantized coefficients.
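The 4:2:0 relationship above is simple arithmetic, shown here for concreteness: each chroma plane is subsampled by a factor of two horizontally and vertically.

```python
def chroma_block_size_420(luma_w: int, luma_h: int):
    """Chroma block dimensions for 4:2:0 sampling: half the luma size per axis."""
    return luma_w // 2, luma_h // 2

# A 16x16 luma macroblock therefore carries two 8x8 chroma blocks.
cb_size = chroma_block_size_420(16, 16)
```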
Compared with H.264/AVC, H.265/HEVC improves multiple coding stages. In H.265/HEVC, a picture is partitioned into Coding Tree Units (CTU, Coding Tree Unit); the CTU is the basic coding unit, corresponding to the macroblock in H.264/AVC. A CTU contains one luma Coding Tree Block (CTB, Coding Tree Block) and two chroma coding tree blocks; in the H.265/HEVC standard, the maximum CU size is generally 64×64. To adapt to diverse video content and characteristics, the CTU is iteratively partitioned in quadtree (QT, Quad Tree) fashion into a series of Coding Units (CU, Coding Unit); the CU is the basic unit of intra/inter coding. A CU contains one luma Coding Block (CB, Coding Block) and two chroma coding blocks together with the related syntax structures; the maximum CU size equals the CTU size and the minimum CU size is 8×8. Depending on the prediction method, the leaf-node CUs obtained from the coding-tree partitioning fall into three types: intra CUs (intra prediction), inter CUs (inter prediction), and skipped CUs. A skipped CU can be regarded as a special case of an inter CU that carries neither motion information nor residual information.
A leaf-node CU contains one or more Prediction Units (PU, Prediction Unit); H.265/HEVC supports PUs from 4×4 to 64×64, with eight partitioning modes in total. For the intra coding mode there are two possible partitioning modes: Part_2Nx2N and Part_NxN. For the residual signal, a CU is partitioned into Transform Units (TU, Transform Unit) using a residual quadtree. A TU contains one luma Transform Block (TB, Transform Block) and two chroma transform blocks. Only square partitioning is allowed; a CB is divided into one or four PBs. A TU uses a single transform and quantization process, with supported sizes from 4×4 to 32×32. Unlike earlier coding standards, in inter prediction a TB may cross PB boundaries to further maximize the coding efficiency of inter coding.
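The quadtree CTU partitioning described above can be sketched as a simple recursion: split a square block into four equal sub-blocks until a split decision says stop or the minimum CU size of 8×8 is reached. The split-decision function here is a hypothetical stand-in; a real encoder decides splits by rate-distortion optimization.

```python
MIN_CU = 8  # minimum CU size in H.265/HEVC

def split_quadtree(x: int, y: int, size: int, should_split):
    """Return the leaf CUs of a quadtree partition as (x, y, size) tuples."""
    if size <= MIN_CU or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_quadtree(x + dx, y + dy, half, should_split)
    return leaves

# Example: split a 64x64 CTU once, keeping the four 32x32 children as leaves.
cus = split_quadtree(0, 0, 64, lambda x, y, s: s == 64)
```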
In VVC/H.266, a coded picture is first divided into coding tree units (CTUs) similar to those of HEVC, but the maximum size is increased from 64×64 to 128×128. H.266/VVC introduces quadtree plus nested Multi-Type Tree (MTT, Multi-Type Tree) partitioning, where the MTT includes the Binary Tree (BT) and the Ternary Tree (TT); it unifies the H.265/HEVC concepts of CU, PU, and TU and supports more flexible CU partition shapes. The CTU is partitioned according to a quadtree structure, and the leaf nodes are further partitioned by the MTT. The leaf nodes of the multi-type tree become coding units (CUs); when a CU is no larger than the maximum transform unit (64×64), no further partitioning is performed for subsequent prediction and transform. In most cases CU, PU, and TU have the same size. Considering the different characteristics of luma and chroma and the parallelism of practical implementations, in VVC/H.266 chroma may use a separate partition tree structure rather than following the luma partition tree. In H.266/VVC, the chroma partitioning of I frames uses a chroma separate tree, while the chroma partitioning of P and B frames follows the luma partitioning.
FIG. 1 is a block diagram of a video encoding and decoding system applicable to embodiments of the present disclosure. As shown in FIG. 1, the system is divided into an encoding-side device 1 and a decoding-side device 2; the encoding-side device 1 generates the bitstream, and the decoding-side device 2 decodes it. The encoding-side device 1 and the decoding-side device 2 may each include one or more processors and memory coupled to the processor(s), such as random access memory, electrically erasable programmable read-only memory, flash memory, or other media. The encoding-side device 1 and the decoding-side device 2 may be implemented with various devices, such as desktop computers, mobile computing devices, notebook computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, in-vehicle computers, or other similar devices.
The decoding-side device 2 may receive the bitstream from the encoding-side device 1 via a link 3. The link 3 includes one or more media or devices capable of moving the bitstream from the encoding-side device 1 to the decoding-side device 2. In one example, the link 3 includes one or more communication media that enable the encoding-side device 1 to send the bitstream directly to the decoding-side device 2. The encoding-side device 1 may modulate the bitstream according to a communication standard (for example, a wireless communication protocol) and send the modulated bitstream to the decoding-side device 2. The one or more communication media may include wireless and/or wired communication media, such as the radio frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (for example, the Internet). The one or more communication media may include routers, switches, base stations, or other equipment that facilitates communication from the encoding-side device 1 to the decoding-side device 2. In another example, the bitstream may also be output from the output interface 15 to a storage device, and the decoding-side device 2 may read the stored data from the storage device via streaming or downloading. The storage device may include any of a variety of distributed-access or locally-accessed data storage media, such as hard disk drives, Blu-ray discs, digital versatile discs, compact discs, flash memory, volatile or non-volatile memory, file servers, and so on.
In the example shown in FIG. 1, the encoding-side device 1 includes a data source 11, an encoder 13, and an output interface 15. In some examples, the data source 11 may include a video capture device (for example, a camera), an archive containing previously captured data, a feed interface for receiving data from a content provider, a computer graphics system for generating data, or a combination of these sources. The encoder 13 encodes the data from the data source 11 and outputs it to the output interface 15, which may include at least one of a regulator, a modem, and a transmitter.
In the example shown in FIG. 1, the decoding-side device 2 includes an input interface 21, a decoder 23, and a display device 25. In some examples, the input interface 21 includes at least one of a receiver and a modem. The input interface 21 may receive the bitstream via the link 3 or from a storage device. The decoder 23 decodes the received bitstream. The display device 25 is used to display the decoded data; it may be integrated with the other components of the decoding-side device 2 or provided separately. The display device 25 may be, for example, a liquid crystal display, a plasma display, an organic light-emitting diode display, or another type of display device. In other examples, the decoding-side device 2 may not include the display device 25, or may include other apparatus or equipment that makes use of the decoded data.
The encoder 13 and the decoder 23 of FIG. 1 may each be implemented with any one of, or any combination of, the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, or hardware. If the present disclosure is implemented partly in software, the instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware by one or more processors, thereby implementing the methods of the present disclosure.
FIG. 2 is a structural block diagram of an exemplary video encoder. In this example, the description is based mainly on the terminology and block-partitioning scheme of the H.265/HEVC standard, but the structure of this video encoder can also be used with H.264/AVC, VVC/H.266, MPEG, AOM, AVS and similar standards, as well as successors to and extensions of these standards.
As shown in the figure, the video encoder 20 is used to encode video data and generate a code stream. The video encoder 20 includes a prediction processing unit 100, a partitioning unit 101, a residual generation unit 102, a transform processing unit 104, a quantization unit 106, an inverse quantization unit 108, an inverse transform processing unit 110, a reconstruction unit 112, a filter unit 113, a decoded picture buffer 114, and an entropy coding unit 116. The prediction processing unit 100 includes an inter prediction processing unit 121 and an intra prediction processing unit 126. In other embodiments, the video encoder 20 may contain more, fewer or different functional components than in this example. The residual generation unit 102 and the reconstruction unit 112 are both represented in the figure by circles with plus signs.
The partitioning unit 101 cooperates with the prediction processing unit 100 to partition the received video data into slices, CTUs or other larger units. The video data received by the partitioning unit 101 may be a video sequence including video frames such as I-frames, P-frames or B-frames.
The prediction processing unit 100 may partition a CTU into CUs and perform intra predictive coding or inter predictive coding on the CUs. When intra coding a CU, a 2N×2N CU may be partitioned into 2N×2N or N×N prediction units (PUs) for intra prediction. When inter predicting a CU, a 2N×2N CU may be partitioned into PUs of 2N×2N, 2N×N, N×2N, N×N or other sizes for inter prediction; asymmetric partitioning of PUs may also be supported.
The inter prediction processing unit 121 may perform inter prediction on a PU and generate prediction data for the PU, the prediction data including the prediction block of the PU, the motion information of the PU and various syntax elements.
The intra prediction processing unit 126 may perform intra prediction on a PU and generate prediction data for the PU. The prediction data of a PU may include the prediction block of the PU and various syntax elements. The intra prediction processing unit 126 may try multiple selectable intra prediction modes and select the intra prediction mode with the lowest cost to perform intra prediction on the PU.
The residual generation unit 102 may generate the residual block of a CU based on the original block of the CU and the prediction blocks of the PUs into which the CU is partitioned.
The transform processing unit 104 may partition a CU into one or more transform units (TUs); the residual block associated with a TU is a sub-block obtained by partitioning the residual block of the CU. A coefficient block associated with the TU is generated by applying one or more transforms to the residual block associated with the TU. For example, the transform processing unit 104 may apply a discrete cosine transform (DCT), a directional transform or another transform to the residual block associated with the TU, converting the residual block from the pixel domain to the frequency domain. In some cases the transform processing may also be skipped.
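For illustration of the pixel-to-frequency conversion mentioned above, the following is a minimal sketch of a 1-D DCT-II in plain Python. It is not the integer transform any codec actually uses; real encoders apply scaled integer approximations of this transform separably to the rows and columns of the residual block.

```python
import math

def dct2_1d(residual):
    """Orthonormal 1-D DCT-II: for smooth inputs, energy concentrates
    in the low-frequency (leftmost) coefficients."""
    n = len(residual)
    coeffs = []
    for k in range(n):
        s = sum(residual[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

# A flat residual row: all energy lands in the first (DC) coefficient.
row = [10.0, 10.0, 10.0, 10.0]
print([round(c, 3) for c in dct2_1d(row)])
```

A flat input compacts into a single nonzero coefficient, which is why transform coefficients of natural residuals cluster toward the top-left of the block.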
The quantization unit 106 may quantize the coefficients in a coefficient block based on a selected quantization parameter (QP). Quantization may introduce quantization loss; the degree of quantization applied to the coefficient block can be adjusted by adjusting the QP value. In some cases the quantization processing may also be skipped.
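The relationship between the QP value and the degree of quantization can be sketched as follows. In HEVC-style codecs the quantization step size roughly doubles for every increase of 6 in QP, Qstep ≈ 2^((QP−4)/6); the nearest-integer rounding below is a deliberate simplification of the encoder's actual quantization.

```python
def qstep(qp):
    # HEVC-style step size: doubles for every +6 in QP.
    return 2 ** ((qp - 4) / 6.0)

def quantize(coeffs, qp):
    step = qstep(qp)
    return [round(c / step) for c in coeffs]

def dequantize(levels, qp):
    step = qstep(qp)
    return [lvl * step for lvl in levels]

coeffs = [100.0, -37.5, 8.0, 0.5]
levels = quantize(coeffs, 22)            # larger QP -> larger step -> coarser levels
recon = dequantize(levels, 22)
print(levels, [round(r, 2) for r in recon])
```

Raising QP from 22 to 28 doubles the step (8 to 16), halving the magnitude of the transmitted levels at the price of larger reconstruction error.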
The inverse quantization unit 108 and the inverse transform unit 110 may respectively apply inverse quantization and an inverse transform to the coefficient block to obtain the reconstructed residual block associated with the TU.
The reconstruction unit 112 may generate the reconstructed block of the CU based on the reconstructed residual block and the prediction block generated by the prediction processing unit 100.
The filter unit 113 performs in-loop filtering on the reconstructed block, which is then stored in the decoded picture buffer 114. The intra prediction processing unit 126 may extract reconstructed reference information adjacent to a PU from the reconstructed blocks cached in the decoded picture buffer 114 in order to perform intra prediction on the PU. The inter prediction processing unit 121 may use reference pictures containing reconstructed blocks cached in the decoded picture buffer 114 to perform inter prediction on PUs of other pictures.
The entropy coding unit 116 may perform entropy coding operations on the received data (such as syntax elements, quantized coefficient blocks and motion information), for example context-adaptive variable-length coding (CAVLC) or context-based adaptive binary arithmetic coding (CABAC), and output the code stream (i.e., the coded video bitstream).
FIG. 3 is a structural block diagram of an exemplary video decoder. In this example, the description is based mainly on the terminology and block-partitioning scheme of the H.265/HEVC standard, but the structure of this video decoder can also be used for video decoding under H.264/AVC, VVC/H.266 and other similar standards.
The video decoder 30 may decode the received code stream and output decoded video data. As shown in the figure, the video decoder 30 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transform processing unit 156, a reconstruction unit 158 (represented in the figure by a circle with a plus sign), a filter unit 159, and a picture buffer 160. In other embodiments, the video decoder 30 may contain more, fewer or different functional components.
The entropy decoding unit 150 may entropy-decode the received code stream to extract information such as syntax elements, quantized coefficient blocks and PU motion information. The prediction processing unit 152, the inverse quantization unit 154, the inverse transform processing unit 156, the reconstruction unit 158 and the filter unit 159 may all perform their corresponding operations based on the syntax elements extracted from the code stream.
As a functional component performing the reconstruction operation, the inverse quantization unit 154 may inverse-quantize the quantized coefficient block associated with a TU. The inverse transform processing unit 156 may apply one or more inverse transforms to the inverse-quantized coefficient block in order to generate the reconstructed residual block of the TU.
The prediction processing unit 152 includes an inter prediction processing unit 162 and an intra prediction processing unit 164. If a PU is coded using intra prediction, the intra prediction processing unit 164 may determine the intra prediction mode of the PU based on the syntax elements parsed from the code stream, and perform intra prediction according to the determined intra prediction mode and the reconstructed reference information adjacent to the PU obtained from the picture buffer 160, generating the prediction block of the PU. If the PU is coded using inter prediction, the inter prediction processing unit 162 may determine one or more reference blocks of the PU based on the motion information of the PU and the corresponding syntax elements, and generate the prediction block of the PU based on the reference blocks.
The reconstruction unit 158 may obtain the reconstructed block of the CU based on the reconstructed residual block associated with the TU and the prediction block of the PU generated by the prediction processing unit 152 (i.e., the intra prediction data or inter prediction data).
The filter unit 159 may perform in-loop filtering on the reconstructed block of the CU to obtain a reconstructed picture. The reconstructed pictures are stored in the picture buffer 160. The picture buffer 160 may provide reference pictures for subsequent motion compensation, intra prediction, inter prediction and so on, and may also output the reconstructed video data as decoded video data for presentation on a display device.
The basic idea of intra prediction is to use the correlation between adjacent pixels to remove spatial redundancy. In some video coding and decoding methods, intra prediction typically applies various intra prediction modes to the current block, selects the optimal intra prediction mode for the current block through rate-distortion optimization (RDO), performs intra predictive coding according to that optimal mode, and writes the information of that intra prediction mode into the code stream. The decoding end parses the intra prediction mode from the code stream and obtains the prediction data of the current block by predicting according to that intra prediction mode.
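The mode decision described above can be sketched as minimizing the rate-distortion cost J = D + λR over the candidate modes. The candidate names, distortion values and rate values below are hypothetical stand-ins for the encoder's real measurements, used only to show the selection loop.

```python
def select_intra_mode(candidates, lam):
    """Pick the mode minimizing J = D + lambda * R.
    `candidates` maps mode name -> (distortion, rate_in_bits)."""
    best_mode, best_cost = None, float("inf")
    for mode, (dist, rate) in candidates.items():
        cost = dist + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Hypothetical measurements for three candidate modes.
cands = {"planar": (120.0, 10), "dc": (150.0, 8), "angular_34": (90.0, 30)}
print(select_intra_mode(cands, lam=2.0))
```

Note that the cheapest mode to signal ("dc") and the lowest-distortion mode ("angular_34") both lose here: λ trades the two quantities off, and "planar" gives the best balance at λ = 2.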
Through successive generations of digital video coding standards, the non-angular modes of intra prediction have remained relatively stable, comprising the mean (DC) mode and the planar mode, while the number of angular modes has kept increasing as the standards have evolved. Taking the international H-series digital video coding standards as an example, the H.264/AVC standard has only 8 angular prediction modes and 1 non-angular prediction mode, while H.265/HEVC extends these to 33 angular prediction modes and 2 non-angular prediction modes. In H.266/VVC, the intra prediction modes are extended further. For luma blocks, the selectable intra prediction modes include mode 0 (planar mode), mode 1 (DC mode), and modes 2 to 66 (all angular modes), as shown in FIG. 4. For chroma blocks, in addition to the planar, DC and angular modes, VTM also provides the matrix weighted intra prediction (MIP) mode and the cross-component linear model prediction (CCLM) mode. The MIP mode is unique to VVC, while counterparts of the CCLM mode also exist in other advanced standards, for example the Chroma from Luma (CfL) mode of AV1 and the Two Step Cross-component Prediction Mode (TSCPM) of AVS3.
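For reference, the simplest of the non-angular modes listed above, the DC (mean) mode, fills the whole prediction block with the average of the neighboring reconstructed samples. The sketch below follows that idea; the exact reference-sample selection and rounding differ per standard.

```python
def dc_predict(top_row, left_col, width, height):
    """DC intra prediction: every predicted pixel is the rounded mean of
    the reconstructed reference samples above and to the left of the block."""
    refs = list(top_row) + list(left_col)
    dc = (sum(refs) + len(refs) // 2) // len(refs)  # integer rounding
    return [[dc] * width for _ in range(height)]

pred = dc_predict(top_row=[100, 102, 104, 106],
                  left_col=[98, 100, 102, 104],
                  width=4, height=4)
print(pred[0])
```

This also illustrates why DC works well only for flat regions: any texture or gradient in the block is flattened to a single value, which is exactly the limitation the angular and learned modes address.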
In some video coding and decoding methods, the transform can be divided into a primary transform and a secondary transform. The transform is carried out after the prediction stage. In currently popular codecs the transform is applied to the residual, the aim being to concentrate the energy of the residual as much as possible in the top-left corner of the block. The transform may also offer multiple selectable modes; taking VVC as an example, the transform kernel of the primary transform can be DCT2-DCT2, DST7-DST7, DCT8-DCT8, DST7-DCT8 or DCT8-DST7.
In some video coding and decoding methods, quantization is applied to the transformed residual block. Commonly used quantization methods include the rate-distortion optimized quantization (RDOQ) algorithm and trellis-coded quantization (TCQ). Both RDOQ and TCQ perform quantization by means of scalar quantizers.
End-to-end image compression based on deep learning has approached or surpassed many traditional algorithms. At present, most end-to-end image compression uses an autoencoder (AE) structure. The autoencoder here can be understood as a neural network whose input and learning target are the same, its purpose being to make the output close to the input; it can be used to extract features. The AE treats the encoding end and the decoding end as a whole and trains and optimizes them jointly. The network structure of an exemplary autoencoder is shown in FIG. 5.
At the encoding end, z is the image to be compressed; z is turned into a vector x by a set transformation, expressed as x = g_p(z) (this step is optional). x undergoes a nonlinear transformation through the trained encoding network g_a to obtain y, i.e. y = g_a(x; φ), where φ denotes the network parameters of the encoding network g_a to be trained, such as weights and biases. The floating-point y is turned into q by the set quantization method (not shown in the figure). R denotes the number of bits consumed by q after entropy coding. Since entropy coding is lossless, entropy coding and entropy decoding need not appear in training, and their processing is not shown in the figure.
At the decoding end, q is inverse-quantized to ŷ. A nonlinear transformation through the trained decoding network g_s yields x̂, i.e. x̂ = g_s(ŷ; θ), where θ denotes the network parameters of the decoding network g_s to be trained, such as weights and biases. x̂ is then turned into the reconstructed image ẑ through the set transformation, expressed as ẑ = g_p⁻¹(x̂). The loss between the image to be compressed z and the reconstructed image ẑ is D, expressed as D = d(z, ẑ). Through training on the training set until the expected Lagrangian rate-distortion cost is reached, the network parameters φ of the encoding network and the network parameters θ of the decoding network can be obtained.
Through study of the above traditional video coding/decoding methods and the autoencoder image compression method, the inventors of the present application found that in traditional video coding/decoding methods intra prediction uses linear computation, and for some blocks with complex detail features the subjective quality of the reconstructed block is poor. End-to-end video coding/decoding methods use autoencoder-based image compression technology and implement a complete video codec framework with neural networks, showing good results in regions with complex detail features, but in regions with simple detail features they are usually inferior to traditional predictive coding. In addition, in end-to-end video coding/decoding methods quantization is implemented with a simple scalar quantizer, which cannot be adjusted flexibly and makes it difficult to achieve ideal performance.
The embodiments of the present disclosure introduce the autoencoder-based image compression method into a block-level video codec, extending traditional intra prediction by adding, on top of the traditional intra prediction modes, an intra prediction mode implemented with an autoencoder (referred to as the autoencoder mode). This exploits the good coding performance of autoencoder-based image compression in regions with complex detail features while retaining the good coding performance of traditional predictive coding in regions with simple detail features, improving the performance of the video codec.
In order to introduce the autoencoder-based image compression method into a block-level video codec, an embodiment of the present disclosure provides an autoencoder network suitable for intra prediction in video coding/decoding, as shown in FIG. 6. The autoencoder network consists of an encoding network 51 (corresponding to the encoding network g_a in the figure) and a decoding network 52 (corresponding to the decoding network g_s in the figure). The encoding network 51 and the decoding network 52 may each be composed of a series of hidden layers and activation functions. A hidden layer may be composed of nonlinear computations such as convolutions and/or fully connected computations. The encoding network 51 and the decoding network 52 in the figure each include 4 hidden layers and 3 activation functions arranged between adjacent hidden layers, but this is merely exemplary, and the encoding network 51 and the decoding network 52 may be set to other numbers of layers. In other examples, the encoding network 51 and the decoding network 52 of the autoencoder may also be implemented with other forms of structure.
When training the autoencoder, in the embodiment shown in FIG. 6, sample blocks used in video coding/decoding may be selected, and the original values of the pixels within a sample block together with the reconstructed reference information adjacent to the sample block serve as the input of the encoding network 51, where a first nonlinear transformation is performed to obtain the supplementary information. In this embodiment, the supplementary information and the reconstructed reference information adjacent to the sample block together serve as the input of the decoding network 52, where a second nonlinear transformation is performed to obtain the predicted values of the pixels within the sample block.
The encoding network of the autoencoder may be a dimension-reducing neural network, and the decoding network may be a dimension-raising neural network. In one example, sample blocks between 4×4 and 32×32 may be used during training. For blocks of different sizes, supplementary information with different numbers of elements may be used, with larger blocks using supplementary information with more elements. For example, for blocks smaller than 16×16 (4×8, 4×16, 16×4, 8×4, 4×4, 8×16, 16×8, 8×8), the supplementary information obtained by dimension reduction through the encoding network is a 1×2 vector; for other blocks equal to or larger than 16×16, the obtained supplementary information is a 1×4 vector. But this is merely exemplary; the present disclosure places no restriction on the dimensions of the supplementary information, which may for example also be a matrix composed of elements in multiple rows and columns.
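The mapping from block size to supplementary-information length in this example can be sketched as a simple rule. The function name is illustrative, and "smaller than 16×16" is interpreted here as a sample area below 256, which matches the block list given above (all of 4×4 through 16×8 have fewer than 256 samples).

```python
def supplementary_len(width, height):
    """Number of elements in the latent (supplementary information)
    vector per the example above: 1x2 for blocks smaller than 16x16,
    1x4 for blocks equal to or larger than 16x16 (area threshold
    of 256 samples is an interpretation of the text's block list)."""
    return 2 if width * height < 256 else 4

print(supplementary_len(8, 8), supplementary_len(16, 16))
```

The design intuition is that the latent vector carries only what the adjacent reconstructed references cannot supply, so larger blocks, with more content per reference sample, are allotted a few more latent elements.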
In an exemplary embodiment of the present disclosure, the size of the sample block used for training the autoencoder is 16×16, so there are 256 pixels in the block, i.e., 256 original pixel values. The reconstructed reference information of a sample block includes the reconstructed values of reference pixels adjacent to the sample block. As shown in FIG. 7, the reference pixels adjacent to the sample block 61 may include one or more of the following: one or more rows of pixels 62 above the sample block, one or more rows of pixels 65 above and to the right of the sample block, one or more columns of pixels 63 to the left of the sample block, one or more columns of pixels 66 below and to the left of the sample block, and one or more pixels 64 at the top-left corner of the sample block. In the example shown in FIG. 7, the reconstructed values of the 4 rows of pixels above the sample block, the 4 columns of pixels to the left of the sample block, and the 4×4 pixels at the top-left corner of the sample block are used as the reconstructed reference information of the sample block. For a 16×16 sample block, the number of reconstructed pixel values is thus 4×16 (above) + 4×16 (left) + 4×4 (top-left) = 144. But this is merely exemplary. The regions in which the reference pixels adjacent to the sample block are located, and the number of rows and columns of reference pixels in each region, can be adjusted according to factors such as the size of the sample block. For example, when the sample block size is 4×4, the reconstructed values of the 1 row of pixels above the sample block, the 1 row of pixels above and to the right of the sample block, the 1 column of pixels to the left of the sample block, the 1 column of pixels below and to the left of the sample block, and the 1 pixel at the top-left corner of the sample block may be used as the reconstructed reference information of that sample block.
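The reference-sample count worked out above (144 values for a 16×16 block with 4 reference lines) generalizes as follows. This helper is illustrative and covers only the above/left/top-left-corner layout of FIG. 7, not the variants that also use above-right and below-left references.

```python
def num_reference_samples(block_w, block_h, lines):
    """Count reconstructed reference values for the FIG. 7 layout:
    `lines` rows above the block, `lines` columns to its left, and a
    lines x lines square at the top-left corner."""
    return lines * block_w + lines * block_h + lines * lines

print(num_reference_samples(16, 16, 4))  # 4*16 + 4*16 + 4*4 = 144
```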
Although the embodiment shown in FIG. 6 uses the reconstructed reference information adjacent to the sample block, in another embodiment of the present disclosure the training for intra prediction based on the autoencoder mode may also be performed without using the reconstructed reference information adjacent to the sample block; that is, during training the original values of the pixels within the sample block serve as the input of the encoding network 51, and the supplementary information output by the encoding network 51 serves as the input of the decoding network 52.
In the training process of the autoencoder, blocks with different target bit rates may be trained separately, obtaining multiple sets of network parameters corresponding respectively to blocks with different target bit rates (a set of network parameters includes the network parameters of the encoding network and the network parameters of the decoding network). By setting different target bit rates during training, different network parameters can be trained.
In the training process of the autoencoder, blocks of different sizes may be trained separately with sample sets of blocks of the corresponding sizes, obtaining multiple sets of network parameters corresponding respectively to blocks of different sizes.
In the training process of the autoencoder, blocks of different shapes may be trained separately with different sample sets, obtaining multiple sets of network parameters corresponding respectively to blocks of different shapes.
In the training process of the autoencoder, blocks of different types (such as luma blocks and chroma blocks) may be trained separately with sample sets of blocks of the corresponding types, obtaining multiple sets of network parameters corresponding respectively to blocks of different types. For example, the luma block uses a first set of network parameters and the two chroma blocks use a second set of network parameters; or the luma block uses a first set of network parameters, the first chroma block (Cr component) uses a second set of network parameters, and the second chroma block (Cb component) uses a third set of network parameters.
That is to say, multiple sets of network parameters may be trained separately according to the different characteristics of blocks, and the characteristics of a block may be determined according to one or more of the block's target bit rate, size, shape and type. For example, in one example the characteristics of a block are determined according to its target bit rate, size and type; assuming there are 16 combinations of target bit rate, size and type, there can be 16 kinds of blocks with different characteristics, and training the autoencoder with the sample sets of these 16 kinds of blocks yields 16 sets of network parameters, each set corresponding to blocks of one characteristic. When using the autoencoder for intra prediction, the set of network parameters corresponding to the current block that the autoencoder should use can then be determined according to the characteristics of the current block.
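Selecting which trained parameter set the autoencoder loads can be sketched as a lookup keyed on the block's characteristics. The key fields, the table contents and the size threshold below are all hypothetical; only the pattern of "characteristics in, parameter-set id out" reflects the text.

```python
# Hypothetical table: (target_rate, size_class, block_type) -> parameter-set id.
PARAM_SETS = {
    ("low", "small", "luma"): 0,
    ("low", "small", "chroma"): 1,
    ("low", "large", "luma"): 2,
    ("high", "small", "luma"): 3,
    # ... one entry per trained combination of characteristics
}

def pick_param_set(target_rate, width, height, block_type):
    """Map a block's characteristics to the id of the trained
    (encoding + decoding) network parameter set it should use."""
    size_class = "small" if width * height < 256 else "large"
    key = (target_rate, size_class, block_type)
    if key not in PARAM_SETS:
        raise KeyError(f"no parameters trained for {key}")
    return PARAM_SETS[key]

print(pick_param_set("low", 8, 8, "luma"))
```

Because the encoder and decoder derive the same characteristics from already-signaled information (QP, block dimensions, component), both sides resolve the same parameter set without any extra bits in the stream.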
The coding/decoding methods of the embodiments of the present disclosure are not limited to any specific coding standard. However, as described above, the block partitioning schemes, names and so on differ between video coding standards. For uniformity of terminology, the present disclosure collectively refers to the coding units for which an intra prediction mode can be independently selected and intra prediction can be performed during video encoding and decoding as "blocks". A "block" may be a video coding unit (including luma blocks and chroma blocks), or a luma block contained in a video coding unit, or a chroma block contained in a video coding unit, and the video coding unit may be, but is not limited to, a macroblock (MB), a coding tree unit (CTB), a coding unit (CU), a prediction unit (PU) or a transform unit (TU). Herein, the block currently being processed is referred to as the current block. For example, in encoding, the current block refers to the block currently being encoded; in decoding, the current block refers to the block currently being decoded.
In order to introduce the autoencoder-based image compression method into a block-level video codec, the embodiments of the present disclosure transplant the trained autoencoder into a traditional video coding/decoding framework and treat the autoencoder mode as an independent intra prediction mode.
An embodiment of the present disclosure transplants the trained autoencoder into the intra prediction processing unit of the video coding framework; its structure is shown in FIG. 8. The transplanted intra prediction processing unit 126' includes:
a mode selection unit 1261, configured to select the intra prediction mode of the current block from multiple selectable intra prediction modes, activate the corresponding prediction unit to perform intra prediction on the current block according to the selected mode, and encode an identifier of the intra prediction mode of the current block into the bitstream; the multiple selectable intra prediction modes include an autoencoder mode; and
prediction units of multiple modes, each configured to perform intra prediction on the current block based on the corresponding intra prediction mode to obtain predicted values of the pixels in the current block. The prediction units of the multiple modes include:
an autoencoder prediction unit 1263, configured to, when the intra prediction mode of the current block is the autoencoder mode, apply a first nonlinear transform, based on the encoding network of the autoencoder, to the original values of the pixels in the current block and/or the reconstructed reference information adjacent to the current block, obtaining supplementary information of the current block; and apply a second nonlinear transform, based on the decoding network of the autoencoder, to the supplementary information of the current block, or to the supplementary information together with the adjacent reconstructed reference information, obtaining the predicted values of the pixels in the current block (i.e., generating the prediction block of the current block). These predicted values are the prediction result output in the autoencoder mode.
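The data flow through the autoencoder prediction unit 1263 described above can be sketched as follows. This is only a minimal illustration: `encode_net` and `decode_net` are toy hand-written stand-ins for the trained nonlinear transforms, and all function names and the 2x2 block size are hypothetical, not taken from the disclosure.

```python
import math

def encode_net(original_pixels, reference_pixels):
    # First nonlinear transform: original block and/or adjacent reconstructed
    # references -> compact supplementary information. A toy stand-in that
    # produces two nonlinear features, rounded to integers because the
    # supplementary information is entropy-coded as integers.
    s = original_pixels + reference_pixels
    f1 = math.tanh(sum(s) / len(s) / 255.0)
    f2 = math.tanh(max(s) / 255.0)
    return [round(f1 * 10), round(f2 * 10)]

def decode_net(supplementary, reference_pixels):
    # Second nonlinear transform: supplementary information (optionally with
    # the adjacent references) -> predicted pixel values for the block.
    base = sum(reference_pixels) / len(reference_pixels)
    scale = 1.0 + 0.01 * supplementary[0]
    value = min(255, max(0, round(base * scale)))
    return [value] * 4  # a flattened 2x2 prediction block

def autoencoder_intra_predict(original_pixels, reference_pixels):
    supplementary = encode_net(original_pixels, reference_pixels)  # written to bitstream
    predicted = decode_net(supplementary, reference_pixels)
    return supplementary, predicted
```

The key structural point the sketch shows is that, unlike the traditional modes, the encoder-side prediction consumes the original block and emits side information that the decoder will need.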
The intra prediction modes corresponding to the prediction units of the other modes may include one or more of: planar mode, horizontal mode, vertical mode, DC mode, angular mode, cross-component linear model CCLM mode, or matrix-weighted intra prediction MIP mode. The figure takes the planar mode prediction unit 1262 as an example and omits the prediction units of the other modes.
Although the example shown in FIG. 8 has a single autoencoder prediction unit, in another example multiple autoencoder prediction units may be set in the intra prediction processing unit, each performing intra prediction with a different autoencoder (the autoencoders differing in structure). The selectable intra prediction modes of the embodiments of the present disclosure may thus include multiple autoencoder modes; different autoencoder modes may use different autoencoders to perform intra prediction, or may use the same autoencoder with different sets of network parameters.
As can be seen from FIG. 8, in the video encoder, intra prediction in the autoencoder mode differs from traditional intra prediction. Compared with traditional intra prediction, intra prediction in the autoencoder mode generates supplementary information, which must be additionally encoded into the bitstream and is sent to the entropy encoding unit 116 together with the syntax elements. In addition, intra prediction in the autoencoder mode uses the original values of the pixels in the current block (i.e., the original block of the current block) rather than predicting solely from reconstructed reference information, so the supplementary information contains information extracted from the original block. Traditional intra prediction, by contrast, computes the difference between the original and predicted values of the pixels in the current block to obtain residual values (i.e., generates the residual block of the current block), and the information of the original block is conveyed through these residual values.
The autoencoder is trained with the rate-distortion cost between its output predicted values and the original values as the objective, so when the autoencoder mode is used for intra prediction, the distortion of the predicted values of the pixels in the current block relative to their original values is controllable. Residual processing can therefore be skipped when the autoencoder mode is used for intra prediction; here, residual processing includes residual calculation and the transform, quantization, entropy coding, inverse quantization, and inverse transform of the residual. In an exemplary embodiment of the present disclosure, the video encoder uses a first autoencoder mode to perform intra prediction on the current block; the modules that need to perform the corresponding encoding processing are shown in FIG. 9, and the processing performed by the other modules (including the residual generation unit 102, the transform processing unit 104, the quantization unit 106, the inverse quantization unit 108, and the inverse transform processing unit 110) can all be skipped. The reconstructed residual values of the pixels in the current block default to 0. After receiving the predicted values of the pixels in the current block output by the intra prediction processing unit 126, the reconstruction unit 112 may skip filtering and store the predicted values in the decoded picture buffer 114 as the reconstructed values of the pixels in the current block; alternatively, they may first be filtered by the filter unit 113 and then stored in the decoded picture buffer 114. Skipping residual processing when the intra prediction mode is the first autoencoder mode avoids encoding the residual, which saves overhead, greatly reduces encoding complexity, and lightens the encoder's load.
In another exemplary embodiment of the present disclosure, when a second autoencoder mode is used to perform intra prediction on the current block, residual encoding is performed. Referring to FIG. 8 and FIG. 2, the supplementary information output by the encoding network in the autoencoder prediction unit 1263 must be sent to the entropy encoding unit 116 for encoding, while the predicted values output by the decoding network in the autoencoder prediction unit 1263, like those produced by the prediction units of the traditional modes, must be sent to the residual generation unit 102 to generate residual information; the residual information is subsequently transformed, quantized, inverse quantized, and inverse transformed, and the quantized residual information is encoded into the bitstream. Compared with the foregoing embodiment that skips residual processing, this embodiment involves more complex encoding operations but can achieve better image quality.
The first autoencoder mode and the second autoencoder mode, which differ in how the residual is handled, may serve as two different intra prediction modes for the intra prediction processing unit in the video encoder to choose from, or as two sub-modes of a single autoencoder mode. The encoder compares the coding costs of the two sub-modes: when the coding cost of the current block with residual processing skipped is less than or equal to its coding cost with residual processing, the first autoencoder mode is used, i.e., residual processing is skipped, saving overhead and simplifying computation; when the coding cost with residual processing skipped is greater than that with residual processing, the second autoencoder mode is used, i.e., residual processing is performed, to guarantee video quality.
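The sub-mode decision described above reduces to a single comparison. A minimal sketch, assuming the two coding costs of the current block have already been evaluated (the function and mode names are illustrative, not from the disclosure):

```python
def choose_autoencoder_submode(cost_skip_residual, cost_with_residual):
    # First sub-mode: skip residual processing when its coding cost is
    # less than or equal to the cost of coding the residual.
    # Second sub-mode: code the residual, trading complexity for quality.
    if cost_skip_residual <= cost_with_residual:
        return "first_autoencoder_mode"   # residual processing skipped
    return "second_autoencoder_mode"      # residual coded
```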
An embodiment of the present disclosure provides a video encoding method, and also an intra prediction method for video encoding, as shown in FIG. 10, including:
Step 610: selecting the intra prediction mode of the current block from multiple selectable intra prediction modes, where the multiple selectable intra prediction modes include an autoencoder mode;
Step 620: performing intra prediction on the current block based on its intra prediction mode, and encoding an identifier of the intra prediction mode of the current block into the bitstream.
The video encoding method of this embodiment introduces the autoencoder mode as one of the intra prediction modes, adding an intra prediction mode based on nonlinear computation, which can improve the video encoder's coding performance on images with complex detail features.
In an exemplary embodiment of the present disclosure, when the intra prediction mode of the current block is the autoencoder mode, performing intra prediction on the current block based on its intra prediction mode includes:
obtaining the supplementary information of the current block through an encoding process, the encoding process including applying, based on the encoding network of the autoencoder corresponding to the autoencoder mode, a first nonlinear transform to the original values of the pixels in the current block and/or the reconstructed reference information adjacent to the current block; and
applying, based on the decoding network of the autoencoder corresponding to the autoencoder mode, a second nonlinear transform to the supplementary information of the current block, or to the supplementary information of the current block together with the reconstructed reference information adjacent to the current block, to obtain the predicted values of the pixels in the current block.
In an exemplary embodiment of the present disclosure, the multiple selectable intra prediction modes further include one or more of the following modes: planar mode, horizontal mode, vertical mode, DC mode, angular mode, cross-component linear model CCLM mode, and matrix-weighted intra prediction MIP mode. The present disclosure places no restriction on the other intra prediction modes; they may be intra prediction modes already adopted in existing codec standards, or intra prediction modes not yet adopted in any existing codec standard.
In an exemplary embodiment of the present disclosure, the identifier of the intra prediction mode includes an autoencoder mode flag, where one value of the flag indicates that the intra prediction mode is the autoencoder mode and the other value indicates that it is not; alternatively, the identifier of the intra prediction mode includes an index number of the intra prediction mode.
In an exemplary embodiment of the present disclosure, the reconstructed reference information adjacent to the current block includes the reconstructed values of reference pixels adjacent to the current block, where the reference pixels adjacent to the current block include one or more of the following:
one or more rows of pixels above the current block;
one or more rows of pixels above and to the right of the current block;
one or more columns of pixels to the left of the current block;
one or more columns of pixels below and to the left of the current block;
one or more pixels at the top-left corner of the current block.
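The neighboring reference pixels listed above can be gathered from a reconstructed picture as in the following sketch. It is a simplified assumption that one reference line is used and that all neighbors are available (real codecs must handle picture boundaries and unavailable neighbors); the function name and layout are illustrative.

```python
def gather_reference_pixels(recon, x0, y0, w, h):
    """Collect one line of reconstructed reference pixels around the block
    at (x0, y0) of size w x h: the row above, the row above-right, the
    column to the left, the column below-left, and the top-left corner.
    `recon` is a 2D list of reconstructed samples indexed as recon[y][x]."""
    above       = [recon[y0 - 1][x] for x in range(x0, x0 + w)]
    above_right = [recon[y0 - 1][x] for x in range(x0 + w, x0 + 2 * w)]
    left        = [recon[y][x0 - 1] for y in range(y0, y0 + h)]
    below_left  = [recon[y][x0 - 1] for y in range(y0 + h, y0 + 2 * h)]
    top_left    = [recon[y0 - 1][x0 - 1]]
    return above + above_right + left + below_left + top_left
```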
As described above, in end-to-end video codecs quantization is implemented with a simple scalar quantizer that cannot be adjusted flexibly; after it is transplanted into the intra prediction module of the video encoder framework, this limits the achievable gain in coding performance. Typically, the supplementary information consists of multiple integer values, and these values affect both the number of bits consumed in encoding the supplementary information and the magnitude of the distortion between the predicted and original values; the integer values can be adjusted to minimize the bit overhead and distortion. Therefore, in an embodiment of the present disclosure, the encoding process of the intra prediction further includes: determining multiple fine-tuning schemes for the element values output by the first nonlinear transform; computing the coding cost (e.g., the rate-distortion cost) of the current block under each of the fine-tuning schemes; fine-tuning the element values according to the scheme with the smallest coding cost; and using the fine-tuned element values as the supplementary information of the current block. This embodiment fine-tunes the element values output by the autoencoder's encoding network and selects the optimal fine-tuning scheme by coding cost, which increases quantization flexibility and improves coding performance. The coding costs of the current block under the multiple fine-tuning schemes may be computed separately with a transform mode already selected; alternatively, the multiple fine-tuning schemes may be combined with the selectable transform modes, and the fine-tuning scheme in the combination with the smallest coding cost is taken as the scheme with the smallest coding cost.
Because neural network training requires precise gradient computation, parameters such as network weights are floating-point numbers, so the output of each network layer is also floating-point; the supplementary information, however, must undergo entropy coding, and entropy coding operates on integers. The present disclosure therefore proposes two approaches: the first quantizes the trained network weights so that the autoencoder's encoding network outputs integers when in use; the second quantizes the floating-point output of the autoencoder's encoding network into integers.
For these two cases, the following two fine-tuning approaches can be adopted, respectively:
First: when the element values output by the first nonlinear transform are integers, the multiple fine-tuning schemes include some or all of the combinations obtained by combining the possible fine-tuned values of some or all of the element values output by the first nonlinear transform, where the possible fine-tuned values of an element value include the element value itself, the element value plus 1, and the element value minus 1.
In this approach, the weights of the autoencoder's encoding and decoding networks are integers, meaning the trained network weights have already been quantized, so the supplementary information output by the encoding network is already integer-valued. Because weight quantization introduces distortion, this output is not optimal. The present disclosure therefore tries, element by element from the outside in, adding 1 to, subtracting 1 from, and keeping each element of the supplementary information, using the Lagrangian rate-distortion cost as the criterion to decide the optimal supplementary information. For example, for the 1x2 vector in this implementation, the supplementary information has 3² combinations in total; the bit overhead and the distortion of the reconstructed block are computed for each combination, and the optimal combination is selected as the final supplementary information written into the bitstream. This can improve video coding performance.
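The exhaustive ±1 fine-tuning described above can be sketched as follows; `coding_cost` is a hypothetical placeholder for the Lagrangian rate-distortion evaluation of one candidate, not the codec's actual cost function.

```python
from itertools import product

def fine_tune_integer(elements, coding_cost):
    # For each integer element output by the encoding network, try the
    # value itself, the value plus 1, and the value minus 1; evaluate
    # every combination and keep the one with the smallest coding cost.
    candidates = [(v - 1, v, v + 1) for v in elements]
    return min(product(*candidates), key=coding_cost)
```

For a two-element vector this enumerates 3² = 9 candidates, matching the combination count given above.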
Second: when the element values output by the first nonlinear transform are floating-point numbers, the multiple fine-tuning schemes include some or all of the combinations obtained by combining the possible fine-tuned values of some or all of the element values output by the first nonlinear transform, where the possible fine-tuned values of an element value include the value obtained by rounding the element value up and the value obtained by rounding it down.
For the output of the autoencoder's encoding network, one optional quantization method is rounding to the nearest integer, but quantization brings a loss of precision. Instead of the rounding operation, this embodiment of the present disclosure tries rounding each element value contained in the supplementary information both up and down, combining the attempts into multiple fine-tuning schemes and using the Lagrangian rate-distortion cost as the criterion to decide the optimal supplementary information. For example, for the 1x2 vector in this implementation, the supplementary information has 2² combinations in total; the rate-distortion cost is computed for each combination, and the optimal combination is selected as the final supplementary information written into the bitstream. This can improve video coding performance.
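For the floating-point case, the floor/ceiling fine-tuning can be sketched analogously, again with a hypothetical placeholder cost function standing in for the Lagrangian rate-distortion evaluation:

```python
import math
from itertools import product

def fine_tune_float(elements, coding_cost):
    # For each floating-point element, consider rounding down and rounding
    # up; evaluate all 2^n integer combinations and keep the cheapest one,
    # instead of committing to plain nearest-integer rounding.
    candidates = [(math.floor(v), math.ceil(v)) for v in elements]
    return min(product(*candidates), key=coding_cost)
```

For a two-element vector this enumerates the 2² = 4 candidates mentioned above; note the winning combination need not be the nearest-integer rounding of the original values.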
In an embodiment of the present disclosure, selecting the intra prediction mode of the current block from the multiple selectable intra prediction modes includes:
determining which of the multiple selectable intra prediction modes require cost computation;
computing the coding cost of each intra prediction mode that requires cost computation; and
selecting the intra prediction mode with the smallest coding cost as the intra prediction mode of the current block;
where, for any block, the autoencoder mode is treated as an intra prediction mode requiring cost computation; or the autoencoder mode is treated as requiring cost computation only if the characteristics of the current block match the configured characteristics of blocks permitted to use the autoencoder mode, the characteristics being determined by one or more of target bitrate, size, shape, and type. In one example, the autoencoder mode is treated as an intra prediction mode requiring cost computation only for blocks with sizes between 4x4 and 32x32; for blocks of other sizes, the autoencoder mode is not treated as an intra prediction mode requiring cost computation.
In the embodiments of the present disclosure, the coding cost is measured by the Lagrangian rate-distortion cost J = λR + D, where R is the bit overhead of the codeword after encoding the current block (for the autoencoder mode, the overhead of the supplementary information must additionally be counted relative to the other modes), D is the distortion of the reconstructed block of the current block relative to the original block, and λ is determined from the quantization parameter. When the rate-distortion cost of the autoencoder mode is the smallest, the autoencoder mode flag of the current block may be set to 1 and the supplementary information encoded into the bitstream.
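The rate-distortion driven mode decision above can be sketched as follows; the bit counts, distortions, and mode names are hypothetical inputs, and `lmbda` stands for the λ derived from the quantization parameter:

```python
def lagrangian_cost(bits, distortion, lmbda):
    # J = lambda * R + D, with R the bit overhead of the coded block (for
    # the autoencoder mode, R additionally includes the bits spent on the
    # supplementary information) and D the reconstruction distortion.
    return lmbda * bits + distortion

def select_intra_mode(mode_stats, lmbda):
    # mode_stats maps each candidate mode requiring a full rate-distortion
    # evaluation to its (bits, distortion) pair; pick the smallest J.
    costs = {m: lagrangian_cost(r, d, lmbda) for m, (r, d) in mode_stats.items()}
    return min(costs, key=costs.get)
```

Note how λ arbitrates the trade-off: a larger λ penalizes the extra bits of the supplementary information more heavily, so the autoencoder mode wins only when its distortion reduction justifies its rate overhead.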
In the embodiments of the present disclosure, the autoencoder mode is added to the most probable mode list of traditional intra prediction. The most probable mode list contains multiple screened intra prediction modes that require full rate-distortion optimization; for each intra prediction mode in the list, the rate-distortion cost of the current block under that mode must be computed for comparison. By adding the autoencoder mode to the list and comparing its cost against the other intra prediction modes, the advantages of the autoencoder mode can be fully exploited to optimize the performance of the video encoder.
In one example of this embodiment, when the autoencoder mode includes a first autoencoder mode, in computing the coding cost corresponding to the first autoencoder mode from the Lagrangian rate-distortion formula, the bit overhead in the formula includes the overhead of the supplementary information of the current block and of the identifier of the intra prediction mode, and the distortion in the formula is determined from the difference between the original and predicted values of the pixels in the current block.
In another example of this embodiment, when the autoencoder mode is a second autoencoder mode, in computing the rate-distortion cost corresponding to the second autoencoder mode from the Lagrangian rate-distortion formula, the bit overhead in the formula includes the overhead of the supplementary information of the current block, the residual values of the pixels in the current block, and the identifier of the intra prediction mode of the current block; the distortion in the formula is determined from the original and reconstructed values of the pixels in the current block, where the reconstructed value of a pixel in the current block equals its predicted value plus the reconstructed residual value (the reconstructed residual value being output by the inverse quantization unit).
In one example of this embodiment, restrictions may be imposed on which current blocks may select the autoencoder mode; for example, only blocks with sizes between 4x4 and 32x32 are allowed to use the autoencoder mode, and blocks of different sizes may produce supplementary information of different data volumes. For example, for blocks smaller than 16x16 the supplementary information contains two elements, while for other blocks, of size 16x16 or larger, it contains four elements. Block size is correlated with the effectiveness of intra coding in the autoencoder mode; restricting the sizes of blocks that use the autoencoder makes full use of the mode's advantages while simplifying the computation during mode selection.
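The size-based restrictions in this example can be expressed as a simple helper; the 4x4, 16x16, and 32x32 thresholds are the ones stated above, and the interpretation of "smaller than 16x16" as both dimensions under 16 is an assumption made for the sketch.

```python
def autoencoder_mode_params(width, height):
    """Return None when the block may not use the autoencoder mode;
    otherwise return the number of elements in its supplementary
    information."""
    if not (4 <= width <= 32 and 4 <= height <= 32):
        return None        # outside 4x4..32x32: autoencoder mode disallowed
    if width < 16 and height < 16:
        return 2           # smaller than 16x16: two elements
    return 4               # 16x16 and larger: four elements
```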
In one example of this embodiment, for the luma block and chroma blocks at the same spatial position (e.g., in the same video coding unit), the intra prediction mode selected for the luma block may be used as the intra prediction mode of the chroma blocks at that spatial position. If the autoencoder mode is selected for the luma block of a video coding unit, the autoencoder mode is also selected for the chroma blocks of that video coding unit.
In an embodiment of the present disclosure, the autoencoder has one trained set of network parameters.
In an embodiment of the present disclosure, the autoencoder has multiple trained sets of network parameters, each set containing the network parameters of the autoencoder's encoding network and decoding network. The different sets are trained separately on sample sets of the blocks to which they correspond; the blocks corresponding to different sets of network parameters have different characteristics, determined by one or more of target bitrate, size, shape, and type. In this embodiment, when the intra prediction mode of the current block is the autoencoder mode, the network parameters of the autoencoder's encoding and decoding networks may be determined as follows: according to the characteristics of the current block, the set of network parameters corresponding to the current block is found among the multiple sets, and the network parameters of the encoding and decoding networks are determined from that set. In addition, in this embodiment, the video encoding method further includes: encoding a network parameter identifier of the current block into the bitstream, where the network parameter identifier of the current block indicates which of the multiple sets of network parameters corresponds to the current block.
Adding the network parameter identifier to the bitstream allows the video decoder to quickly determine the network parameters of the autoencoder's decoding network. In another example, the intra prediction processing unit in the video decoder may instead look up the corresponding network parameter identifier from the characteristics of the current block, saving the overhead of encoding the network parameter identifier.
In an embodiment of the present disclosure, when the current block is a video coding unit containing one luma block and two chroma blocks, the autoencoders used for the one luma block and two chroma blocks of the video coding unit may be arranged in three different ways:
The first way uses a single autoencoder, trained with the sample data of the one luma block and two chroma blocks as its input data; the supplementary information output by its encoding network is shared by the one luma block and two chroma blocks. Correspondingly, when performing intra prediction encoding, obtaining the supplementary information of the current block through the encoding process includes: based on the encoding network of the one autoencoder of the autoencoder mode corresponding to the video coding unit, applying a first nonlinear transform to the original values of the pixels in the one luma block and two chroma blocks and/or the reconstructed reference information adjacent to them, to obtain the supplementary information shared by the one luma block and two chroma blocks. Using one autoencoder to perform intra prediction on one luma block and two chroma blocks simultaneously improves computational efficiency and speeds up prediction.
The second way is to use two autoencoders. One autoencoder is trained with sample data related to the one luma block as input data; the supplementary information output by its encoding network is the supplementary information of the one luma block. The other autoencoder is trained with sample data related to the two chroma blocks as input data; the supplementary information output by its encoding network is the supplementary information shared by the two chroma blocks. Correspondingly, when intra prediction encoding is performed, obtaining the supplementary information of the current block through encoding processing includes: based on the encoding network of the one autoencoder corresponding to the luma block in the autoencoder mode, performing a first nonlinear transformation on the original values of the pixels in the one luma block and/or the reconstructed reference information adjacent to the one luma block, to obtain the supplementary information of the one luma block; and, based on the encoding network of the one autoencoder corresponding to the chroma blocks in the autoencoder mode, performing a first nonlinear transformation on the original values of the pixels in the two chroma blocks and/or the reconstructed reference information adjacent to the two chroma blocks, to obtain the supplementary information shared by the two chroma blocks.
The third way is to use three autoencoders. The first autoencoder is trained with sample data related to the one luma block as input data; the supplementary information output by its encoding network is the supplementary information of the one luma block. The second autoencoder is trained with sample data related to the first chroma block of the two chroma blocks as input data; the supplementary information output by its encoding network is the supplementary information of the first chroma block. The third autoencoder is trained with sample data related to the second chroma block of the two chroma blocks as input data; the supplementary information output by its encoding network is the supplementary information of the second chroma block.
Correspondingly, when intra prediction encoding is performed, obtaining the supplementary information of the current block through encoding processing includes: based on the encoding network of the one autoencoder corresponding to the luma block in the autoencoder mode, performing a first nonlinear transformation on the original values of the pixels in the one luma block and/or the reconstructed reference information adjacent to the one luma block, to obtain the supplementary information of the one luma block; based on the encoding network of the one autoencoder corresponding to the first chroma block of the two chroma blocks in the autoencoder mode, performing a first nonlinear transformation on the original values of the pixels in the first chroma block and/or the adjacent reconstructed reference information, to obtain the supplementary information of the first chroma block; and, based on the encoding network of the one autoencoder corresponding to the second chroma block of the two chroma blocks in the autoencoder mode, performing a first nonlinear transformation on the original values of the pixels in the second chroma block and/or the adjacent reconstructed reference information, to obtain the supplementary information of the second chroma block.
If the current block is a luma block included in a video coding unit, when the autoencoder mode is selected, one autoencoder can be used to perform intra prediction on the luma block. If the current block is a chroma block included in a video coding unit, and the two chroma blocks included in the same video coding unit jointly undergo intra prediction mode selection and intra prediction, one autoencoder can be trained with the sample data of the two chroma blocks, and that autoencoder can be used to perform intra prediction on the two chroma blocks to obtain supplementary information shared by the two chroma blocks. If the current block is a chroma block included in a video coding unit, but intra prediction mode selection and intra prediction are performed separately for the two chroma blocks included in the same video coding unit, two autoencoders can be trained separately with the sample data of the two chroma blocks and used to perform intra prediction on the two chroma blocks respectively; the specific manner is similar to the foregoing embodiments and is not repeated here.
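The three arrangements above differ only in how the components of the coding unit are grouped per autoencoder. As a minimal sketch (the function and component names are illustrative, not part of the disclosure), the grouping can be written as:

```python
def autoencoder_groups(arrangement):
    """Return, for each autoencoder, the list of components whose samples
    it is trained on and whose supplementary information it produces."""
    if arrangement == 1:                       # one shared autoencoder
        return [["Y", "Cb", "Cr"]]
    if arrangement == 2:                       # luma / joint chroma
        return [["Y"], ["Cb", "Cr"]]
    if arrangement == 3:                       # one autoencoder per component
        return [["Y"], ["Cb"], ["Cr"]]
    raise ValueError("arrangement must be 1, 2 or 3")

for n in (1, 2, 3):
    print(n, autoencoder_groups(n))            # arrangement n uses n autoencoders
```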
In an embodiment of the present disclosure, the encoding network of the autoencoder is a dimension-reducing neural network.
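To illustrate the dimension-reducing encoding network, the following sketch (hypothetical, not from the disclosure; the layer sizes, ReLU activation, and toy weights are all assumptions) maps a block's original pixels together with adjacent reconstructed reference pixels to a much shorter supplementary-information vector through a small fully connected network:

```python
import numpy as np

def encode_supplementary(block_pixels, ref_pixels, w1, b1, w2, b2):
    """Hypothetical encoding network of the autoencoder: a two-layer MLP
    that applies the first nonlinear transformation and reduces the
    dimension of (block pixels + reference pixels) to a short
    supplementary-information vector."""
    x = np.concatenate([block_pixels.ravel(), ref_pixels.ravel()])
    h = np.maximum(0.0, w1 @ x + b1)          # nonlinear hidden layer (ReLU)
    return w2 @ h + b2                        # low-dimensional output

# Toy shapes: a 4x4 block plus 9 reference pixels -> 4 supplementary values.
rng = np.random.default_rng(0)
block = rng.uniform(0, 255, (4, 4))
refs = rng.uniform(0, 255, 9)
w1, b1 = rng.standard_normal((8, 25)), np.zeros(8)
w2, b2 = rng.standard_normal((4, 8)), np.zeros(4)

side_info = encode_supplementary(block, refs, w1, b1, w2, b2)
print(side_info.shape)  # far fewer values than the 16 block pixels
```

In a real codec the weights would be the trained network parameters and the output would be entropy coded into the stream; the point of the sketch is only the dimension reduction.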
In an embodiment of the video encoding method of the present disclosure, when the intra prediction mode of the current block is the autoencoder mode, the video encoding method further includes: encoding the supplementary information of the current block and writing it into the code stream.
In an example of the video encoding method of the present disclosure, the autoencoder mode includes a first autoencoder mode; when the intra prediction mode of the current block is the first autoencoder mode, the video encoding method further includes: assuming by default that the reconstructed residual values of the pixels in the current block are equal to 0, and skipping residual processing. This method can save the overhead of residual coding, simplify the computational complexity of intra prediction encoding, and reduce the occupation of video encoder resources.
In an example of the video encoding method of the present disclosure, the autoencoder mode includes a second autoencoder mode; when the intra prediction mode of the current block is the second autoencoder mode, the video encoding method further includes: obtaining the residual values of the pixels in the current block according to the differences between the original values and the predicted values of the pixels in the current block; and encoding the residual values of the pixels in the current block and writing them into the code stream. Encoding the residual values includes entropy coding, and may further include one or more of transform and quantization. This method retains residual processing in the autoencoder mode, which can improve the quality of the decoded video. Since intra prediction in the autoencoder mode requires encoding the supplementary information, a larger quantization step size can be used for encoding the residual information.
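A minimal sketch of the residual path in this second autoencoder mode follows. The step value and names are illustrative assumptions, and a real codec would normally apply a transform before quantization and entropy code the levels; the sketch only shows why a larger quantization step is tolerable here (the reconstruction error stays bounded by half the step):

```python
import numpy as np

def code_residual(orig, pred, qstep):
    """Quantize the prediction residual with step qstep, then reconstruct
    as the decoder would."""
    resid = orig - pred                       # residual = original - predicted
    levels = np.round(resid / qstep)          # quantization levels for the stream
    recon_resid = levels * qstep              # dequantized residual at the decoder
    return levels, pred + recon_resid         # reconstructed pixel values

orig = np.array([100.0, 103.0, 98.0, 101.0])
pred = np.array([ 99.0, 100.0, 99.0, 100.0])
levels, recon = code_residual(orig, pred, qstep=2.0)
print(levels)   # levels that would be entropy coded
print(recon)    # reconstruction error is at most qstep / 2
```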
In an embodiment of the video encoding method of the present disclosure, when intra prediction is performed on the current block based on the autoencoder mode, the video encoding method further includes:
when J1 < J2, performing residual processing on the current block, where the residual processing includes: obtaining the residual values of the pixels in the current block according to the differences between the original values and the predicted values of the pixels in the current block, and encoding the residual values of the pixels in the current block and writing them into the code stream;
when J1 ≥ J2, skipping the residual processing of the current block;
where J1 is the coding cost of the current block when residual processing is performed, and J2 is the coding cost of the current block when residual processing is skipped.
The selection between the two residual processing manners in the present disclosure is not limited to the above manner. For example, a threshold-based decision can also be used: when J2 is less than or equal to a set threshold, the residual processing of the current block is skipped, and when J2 is greater than the set threshold, residual processing is performed on the current block.
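The two selection rules above can be sketched as follows (the function names are illustrative; in practice J1 and J2 would be rate-distortion costs of the form J = D + λ·R):

```python
def choose_by_cost(j1, j2):
    """Compare the coding cost with residual processing (j1) against the
    cost with residual processing skipped (j2); code the residual only
    when doing so is strictly cheaper."""
    return "code_residual" if j1 < j2 else "skip_residual"

def choose_by_threshold(j2, threshold):
    """Alternative rule: skip residual processing whenever the skip cost
    j2 is already at or below a preset threshold."""
    return "skip_residual" if j2 <= threshold else "code_residual"

print(choose_by_cost(120.0, 150.0))        # residual processing is cheaper
print(choose_by_threshold(150.0, 200.0))   # skip: j2 is within the threshold
```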
In an embodiment of the video encoding method of the present disclosure, when intra prediction is performed on the current block based on the autoencoder mode and either skipping or performing residual processing can be selected, the video encoding method further includes:
generating a residual identifier of the current block, where, when residual processing is performed on the current block, the residual identifier indicates that residual data of the current block is present in the code stream, and when the residual processing of the current block is skipped, the residual identifier indicates that no residual data of the current block is present in the code stream; and
encoding the residual identifier of the current block and writing it into the code stream.
In one example, the above residual identifier can be represented by one bit, where one value of the bit indicates that residual data of the current block is present in the code stream, and the other value indicates that no residual data of the current block is present in the code stream. In another example, the above residual identifier can be merged with the identifier of the intra prediction mode. For example, when the identifier of the intra prediction mode indicates that the intra prediction mode of the current block is the first autoencoder mode (which may also be a sub-mode of the autoencoder mode), it indicates that no residual data of the current block is present in the code stream; when the identifier of the intra prediction mode indicates that the intra prediction mode of the current block is the second autoencoder mode (which may also be another sub-mode of the autoencoder mode), it indicates that residual data of the current block is present in the code stream.
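Both signalling variants can be sketched as follows (the bit values and mode index values are hypothetical, chosen only for illustration):

```python
# Variant 1: a dedicated one-bit residual identifier.
def write_residual_flag(has_residual):
    """One bit: 1 means residual data is present in the code stream."""
    return 1 if has_residual else 0

# Variant 2: the identifier is folded into the intra prediction mode itself,
# using two autoencoder sub-modes (these index values are assumptions).
AE_MODE_NO_RESIDUAL = 34   # first autoencoder mode: no residual data
AE_MODE_RESIDUAL    = 35   # second autoencoder mode: residual data present

def residual_present(mode_id):
    """Infer residual presence from the signalled mode; no extra bit needed."""
    return mode_id == AE_MODE_RESIDUAL

print(write_residual_flag(True), residual_present(AE_MODE_NO_RESIDUAL))
```

Variant 2 saves the dedicated bit at the cost of one extra mode value in the mode identifier alphabet.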
In an embodiment of the present disclosure, the trained autoencoder is ported into the intra prediction processing unit of the video decoder, whose structure is shown in FIG. 11. The ported intra prediction processing unit 164' includes:
a mode selection unit 1641, configured to parse the identifier of the intra prediction mode of the current block from the received code stream, determine the intra prediction mode of the current block from multiple selectable intra prediction modes according to the identifier, and activate the corresponding prediction unit according to the intra prediction mode of the current block to perform intra prediction on the current block, where the multiple selectable intra prediction modes include an autoencoder mode; and
prediction units of multiple modes, configured to perform intra prediction on the current block based on the corresponding intra prediction modes to obtain the predicted values of the pixels in the current block, including:
an autoencoder prediction unit 1643, configured to, when the intra prediction mode of the current block is the autoencoder mode, perform a nonlinear transformation on the supplementary information of the current block, or on the supplementary information of the current block and the reconstructed reference information adjacent to the current block, based on the decoding network of the autoencoder corresponding to the autoencoder mode, to obtain the predicted values of the pixels in the current block (that is, to generate the prediction block of the current block). In one example, the autoencoder prediction unit 1643 in the video decoder uses the same decoding network as the autoencoder prediction unit 1263 in the video encoder.
Prediction units of other modes may include, for example, prediction units of one or more of planar mode, horizontal mode, vertical mode, DC mode, angular mode, cross-component linear model (CCLM) mode, or matrix-weighted intra prediction (MIP) mode. In the figure, the planar mode prediction unit 1642 is shown as an example, and the other prediction units are omitted.
The identifier of the intra prediction mode and the supplementary information described above are obtained by parsing the code stream; specifically, they can be obtained by the entropy decoding unit decoding the code stream and then be passed to the intra prediction processing unit 164'.
Although the example shown in FIG. 11 shows one autoencoder prediction unit, in another example, the embodiments of the present disclosure may also provide multiple autoencoder prediction units in the intra prediction processing unit, which perform intra prediction using different autoencoders respectively. The selectable intra prediction modes in the embodiments of the present disclosure may include multiple autoencoder modes (such as the first autoencoder mode and the second autoencoder mode); different autoencoder modes may use different autoencoders to perform intra prediction, or may use the same autoencoder configured with different network parameters to perform intra prediction in the different autoencoder modes.
As can be seen from FIG. 11, in the video decoder, the input of intra prediction in the autoencoder mode includes not only the reconstructed reference information but also the supplementary information parsed from the code stream. As described above, the supplementary information contains information extracted from the original block. Similar to the encoding side, when the video decoder performs intra prediction in the autoencoder mode, residual processing can also be skipped, that is, entropy decoding, inverse quantization, inverse transform, and other processing of the residual are no longer performed. In an exemplary embodiment of the present disclosure, when the video decoder performs intra prediction on the current block in the first autoencoder mode, the modules that need to perform the corresponding decoding processing are shown in FIG. 12, and the processing performed by the other modules (including the inverse quantization unit 154 and the inverse transform processing unit 155) can be skipped. The reconstructed residual values of the pixels in the current block are assumed by default to be equal to 0; after the reconstruction unit 158 receives the predicted values of the pixels in the current block output by the intra prediction processing unit 164, the filtering processing can be omitted (filtering processing may also be performed), and the predicted values are stored in the picture buffer 160 as the reconstructed values of the pixels in the current block. Skipping residual processing can reduce the coding overhead and the decoding complexity, and lighten the burden on the decoder.
In another exemplary embodiment of the present disclosure, the video decoder performs intra prediction on the current block in the second autoencoder mode and performs residual processing (such as entropy decoding, inverse quantization, and inverse transform). Referring to FIG. 11 and FIG. 3, the predicted values of the pixels in the current block output by the autoencoder prediction unit 1643 need to be sent to the reconstruction unit 158 and added to the reconstructed residual values of the pixels in the current block obtained through entropy decoding, inverse quantization, and inverse transform, to produce the reconstructed values of the pixels in the current block, which are then filtered by the filter unit 159 (the filtering processing can be skipped in intra coding) and stored in the picture buffer 160. The intra prediction decoding operation corresponding to the second autoencoder mode is relatively complex, but the image quality is relatively good.
The first autoencoder mode and the second autoencoder mode, which differ in their residual processing as described above, can be provided as two different intra prediction modes for the intra prediction processing unit in the video decoder to select from, or as two sub-modes of one autoencoder mode. The coding costs of the two sub-modes are compared: when the coding cost of the current block with residual processing skipped is less than or equal to the coding cost of the current block with residual processing performed, the first autoencoder mode is used, that is, residual processing is skipped to save overhead and simplify computation; when the coding cost of the current block with residual processing skipped is greater than the coding cost of the current block with residual processing performed, the second autoencoder mode is used, that is, residual processing is performed to ensure video quality.
An embodiment of the present disclosure further provides a video decoding method, and also provides an intra prediction method for video decoding, as shown in FIG. 13, including:
Step 710: parsing the identifier of the intra prediction mode of the current block from the received code stream, and determining the intra prediction mode of the current block from multiple selectable intra prediction modes according to the identifier, where the multiple selectable intra prediction modes include an autoencoder mode; and
Step 720: performing intra prediction on the current block according to the intra prediction mode of the current block.
The identifier of the intra prediction mode of the current block in the code stream can be parsed out through entropy decoding.
The video decoding method of this embodiment introduces the autoencoder mode as one of the intra prediction modes, adding an intra prediction mode based on nonlinear computation at the decoding side, which can enhance the performance of the codec in decoding images with complex detail features.
In an exemplary embodiment of the present disclosure, the multiple selectable intra prediction modes further include one or more of planar mode, horizontal mode, vertical mode, DC mode, angular mode, cross-component linear model (CCLM) mode, or matrix-weighted intra prediction (MIP) mode.
In an exemplary embodiment of the present disclosure, the identifier of the intra prediction mode includes an autoencoder mode flag bit, where one value of the autoencoder mode flag bit indicates that the intra prediction mode is the autoencoder mode and the other value indicates that the intra prediction mode is not the autoencoder mode; or, the identifier of the intra prediction mode includes an index number of the intra prediction mode. The supplementary information may be parsed after it is determined, according to the identifier of the intra prediction mode, that the intra prediction mode of the current block is the autoencoder mode.
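The two forms of the mode identifier can be sketched on the parsing side as follows (the syntax layout, mode list, and reserved index value are hypothetical; a real bitstream would entropy code these symbols):

```python
def parse_mode_flag_first(bits, conventional_modes):
    """Variant 1: a leading autoencoder-mode flag bit; when the flag is 0,
    a conventional mode index follows (this layout is an assumption)."""
    if bits[0] == 1:
        return "autoencoder"
    return conventional_modes[bits[1]]

def parse_mode_index(index, autoencoder_index):
    """Variant 2: a single mode index, one value of which is reserved for
    the autoencoder mode (the reserved value is an assumption)."""
    return "autoencoder" if index == autoencoder_index else "conventional"

modes = ["planar", "dc", "horizontal", "vertical"]
print(parse_mode_flag_first([1], modes))       # autoencoder mode selected
print(parse_mode_flag_first([0, 2], modes))    # a conventional mode follows
print(parse_mode_index(34, autoencoder_index=34))
```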
In an exemplary embodiment of the present disclosure, the reconstructed reference information adjacent to the current block includes the reconstructed values of reference pixels adjacent to the current block, and the reference pixels adjacent to the current block include one or more of the following pixels:
one or more rows of pixels above the current block;
one or more rows of pixels above and to the right of the current block;
one or more columns of pixels to the left of the current block;
one or more columns of pixels to the lower left of the current block;
one or more pixels at the upper-left corner of the current block.
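The listed reference positions can be gathered from a reconstructed picture as in the sketch below (the names are illustrative; one reference row/column is used and all the listed neighbours are assumed to lie inside the picture, whereas a real codec pads or substitutes unavailable samples):

```python
import numpy as np

def gather_references(recon, y, x, h, w):
    """Collect reconstructed reference pixels around the h x w block whose
    top-left pixel is at (y, x): one row above, one row above-right, one
    column to the left, one column below-left, and the top-left corner."""
    above       = recon[y - 1, x : x + w]          # row above the block
    above_right = recon[y - 1, x + w : x + 2 * w]  # row above and to the right
    left        = recon[y : y + h, x - 1]          # column to the left
    below_left  = recon[y + h : y + 2 * h, x - 1]  # column to the lower left
    corner      = recon[y - 1, x - 1]              # upper-left corner pixel
    return np.concatenate([above, above_right, left, below_left, [corner]])

recon = np.arange(16 * 16, dtype=float).reshape(16, 16)
refs = gather_references(recon, y=4, x=4, h=4, w=4)
print(refs.size)  # 4 + 4 + 4 + 4 + 1 = 17 reference samples
```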
In an exemplary embodiment of the present disclosure, the decoding network of the autoencoder is a dimension-raising neural network.
In an exemplary embodiment of the present disclosure, when the intra prediction mode of the current block is the autoencoder mode, intra prediction on the decoding side no longer needs the encoding network of the autoencoder; it is sufficient to use the decoding network to perform a nonlinear transformation on the supplementary information to obtain the predicted values. On the decoding side, performing intra prediction on the current block according to the intra prediction mode of the current block includes:
parsing the supplementary information of the current block from the code stream, where the supplementary information can be obtained by performing entropy decoding on the code stream; and
based on the decoding network of the autoencoder corresponding to the autoencoder mode, performing a nonlinear transformation on the supplementary information of the current block, or on the supplementary information of the current block and the reconstructed reference information adjacent to the current block, to obtain the predicted values of the pixels in the current block.
In an exemplary embodiment of the present disclosure, the decoding network has one trained group of network parameters; or, the decoding network has multiple trained groups of network parameters, where the different groups of network parameters are trained separately on the sample sets of their respective corresponding blocks, and the blocks corresponding to the different groups of network parameters have different features, the features being determined according to one or more of the target code rate, size, shape, and type. When the decoding network has multiple trained groups of network parameters, the network parameters of the decoding network are determined in one of the following manners: finding, according to the features of the current block, the group of network parameters corresponding to the current block from the multiple groups of network parameters, and determining the network parameters of the decoding network according to that group; or, parsing the network parameter identifier of the current block from the code stream, and determining the parameters of the decoding network according to the group of network parameters corresponding to the current block indicated by the network parameter identifier. The network parameters of the decoding network can be quickly determined according to the network parameter identifier, reducing the computational complexity.
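The two ways of resolving the parameter group can be sketched as follows (the feature keys, identifier values, and table contents are hypothetical placeholders for trained weight sets):

```python
# Hypothetical table of trained parameter groups, keyed by block features.
PARAM_GROUPS = {
    ("8x8",   "luma"):   {"id": 0, "weights": "w_8x8_luma"},
    ("8x8",   "chroma"): {"id": 1, "weights": "w_8x8_chroma"},
    ("16x16", "luma"):   {"id": 2, "weights": "w_16x16_luma"},
}

def params_by_features(size, block_type):
    """Derive the group from the current block's features; no identifier
    needs to be carried in the code stream."""
    return PARAM_GROUPS[(size, block_type)]

def params_by_identifier(param_id):
    """Look the group up directly by the parsed network parameter
    identifier; faster, at the cost of signalling the identifier."""
    return next(g for g in PARAM_GROUPS.values() if g["id"] == param_id)

print(params_by_features("8x8", "chroma")["id"])
print(params_by_identifier(2)["weights"])
```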
In an exemplary embodiment of the present disclosure, when the current block is a video coding unit and the video coding unit includes one luma block and two chroma blocks, there are, corresponding to the encoding side, three possible intra prediction manners. In this case, performing intra prediction on the current block based on the decoding network of the autoencoder corresponding to the autoencoder mode, that is, performing a nonlinear transformation on the supplementary information of the current block or on the supplementary information of the current block and the reconstructed reference information adjacent to the current block to obtain the predicted values of the pixels in the current block, includes:
based on the decoding network of the one autoencoder corresponding to the video coding unit in the autoencoder mode, performing a nonlinear transformation on the supplementary information shared by the one luma block and the two chroma blocks, or on the supplementary information shared by the one luma block and the two chroma blocks and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the one luma block and the two chroma blocks; or
based on the decoding network of the one autoencoder corresponding to the luma block in the autoencoder mode, performing a nonlinear transformation on the supplementary information of the one luma block, or on the supplementary information of the one luma block and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the one luma block; and, based on the decoding network of the one autoencoder corresponding to the chroma blocks in the autoencoder mode, performing a nonlinear transformation on the supplementary information shared by the two chroma blocks, or on the supplementary information shared by the two chroma blocks and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the two chroma blocks; or
based on the decoding network of the one autoencoder corresponding to the luma block in the autoencoder mode, performing a nonlinear transformation on the supplementary information of the one luma block, or on the supplementary information of the one luma block and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the one luma block; based on the decoding network of the autoencoder corresponding to the first chroma block of the two chroma blocks in the autoencoder mode, performing a nonlinear transformation on the supplementary information of the first chroma block, or on the supplementary information of the first chroma block and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the first chroma block; and, based on the decoding network of the autoencoder corresponding to the second chroma block of the two chroma blocks in the autoencoder mode, performing a nonlinear transformation on the supplementary information of the second chroma block, or on the supplementary information of the second chroma block and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the second chroma block.
In an embodiment of the video decoding method of the present disclosure, the autoencoder mode includes a first autoencoder mode; when the intra prediction mode of the current block is the first autoencoder mode, the video decoding method further includes: assuming by default that the reconstructed residual values of the pixels in the current block are equal to 0, and using the predicted values of the pixels in the current block as the reconstructed values of the pixels in the current block. This intra prediction decoding method corresponds to the intra prediction encoding method, adopted on the encoding side, in the first autoencoder mode in which residual processing is skipped.
In an embodiment of the video decoding method of the present disclosure, the autoencoder mode includes a second autoencoder mode; when the intra prediction mode of the current block is the second autoencoder mode, the video decoding method further includes: parsing the bitstream to obtain reconstructed residual values of pixels in the current block, and adding the predicted values of the pixels in the current block to the reconstructed residual values to obtain reconstructed values of the pixels in the current block. This intra prediction decoding method corresponds to the intra prediction encoding method, adopted on the encoding side, in the second autoencoder mode in which residual processing is performed. In the parsing of the bitstream to obtain the reconstructed residual values of the pixels in the current block, the parsing may include entropy decoding, inverse quantization and inverse transform, and the inverse quantization and/or the inverse transform may also be skipped.
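As an illustrative sketch only (the function and variable names below are hypothetical and do not appear in the disclosure), the decoder-side reconstruction under the two autoencoder modes described above can be expressed as follows: in the first autoencoder mode the reconstructed residual defaults to 0, while in the second autoencoder mode a residual parsed from the bitstream is added to the prediction.

```python
def reconstruct_block(pred, mode, parsed_residual=None):
    """Reconstruct an autoencoder-mode block (2-D list of pixel values).

    mode 1 (first autoencoder mode): residual processing was skipped on the
    encoding side, so the reconstructed residual defaults to 0 and the
    prediction is used directly as the reconstruction.
    mode 2 (second autoencoder mode): a residual parsed from the bitstream
    (entropy decoded, possibly dequantized / inverse transformed) is added.
    """
    if mode == 1:
        return [row[:] for row in pred]  # residual == 0 by default
    if mode == 2:
        # reconstruction = prediction + reconstructed residual, clipped to 8 bits
        return [[min(255, max(0, p + r)) for p, r in zip(prow, rrow)]
                for prow, rrow in zip(pred, parsed_residual)]
    raise ValueError("unknown autoencoder mode")
```

Note that the clipping range (8-bit samples) is an assumption for the sketch; actual bit depth is codec-dependent.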
In an embodiment of the video decoding method of the present disclosure, when it is determined that the intra prediction mode of the current block is an autoencoder mode, the video decoding method further includes:
parsing a residual flag of the current block from the bitstream, where the residual flag is used to indicate whether residual data of the current block is present in the bitstream;
when the residual flag indicates that residual data of the current block is present in the bitstream, parsing the residual data of the current block;
when the residual flag indicates that no residual data of the current block is present in the bitstream, skipping the parsing of the residual data of the current block and taking the reconstructed residual values of pixels in the current block to be 0 by default.
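The residual-flag handling above can be sketched as follows; the `BitReader` class is a minimal hypothetical stand-in for the codec's entropy decoder and is not an API from the disclosure.

```python
class BitReader:
    """Minimal illustrative stand-in for a bitstream entropy decoder."""
    def __init__(self, flags, residuals=None):
        self.flags = list(flags)            # pre-decoded flag bits
        self.residuals = list(residuals or [])  # pre-decoded residual blocks

    def read_flag(self):
        return self.flags.pop(0) == 1

    def read_residual(self, shape):
        # A real decoder would entropy-decode (and possibly dequantize and
        # inverse-transform) residual data here; we pop a prepared block.
        return self.residuals.pop(0)


def parse_block_residual(reader, block_shape):
    """Parse the per-block residual flag and, only when it signals that
    residual data is present, parse that data; otherwise default the
    reconstructed residual of every pixel in the block to 0."""
    if reader.read_flag():
        return reader.read_residual(block_shape)
    h, w = block_shape
    return [[0] * w for _ in range(h)]
```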
An embodiment of the present disclosure provides a video decoding device. As shown in FIG. 14, the device includes a processor 5 and a memory 6 storing a computer program executable on the processor, where the processor 5, when executing the computer program, implements the video decoding method described in any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a video decoding device including an intra prediction processing unit, where the intra prediction processing unit includes:
a mode selection unit, configured to parse an identifier of the intra prediction mode of a current block from a received bitstream, determine the intra prediction mode of the current block from a plurality of selectable intra prediction modes according to the identifier, the plurality of selectable intra prediction modes including an autoencoder mode, and activate, according to the intra prediction mode of the current block, a corresponding prediction unit to perform intra prediction on the current block; and
an autoencoder prediction unit, configured to, when the intra prediction mode of the current block is the autoencoder mode, nonlinearly transform the supplementary information of the current block, or the supplementary information of the current block and reconstructed reference information adjacent to the current block, based on a decoding network of the autoencoder corresponding to the autoencoder mode, to obtain predicted values of pixels in the current block.
An embodiment of the present disclosure further provides a video encoding device (see also FIG. 14), including a processor and a memory storing a computer program executable on the processor, where the processor, when executing the computer program, implements the video encoding method described in any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a video encoding device including an intra prediction processing unit, where the intra prediction processing unit includes:
a mode selection unit, configured to select the intra prediction mode of a current block from a plurality of selectable intra prediction modes, activate, according to the intra prediction mode of the current block, a corresponding prediction unit to perform intra prediction on the current block, and encode an identifier of the intra prediction mode of the current block and write it into a bitstream, where the plurality of selectable intra prediction modes include an autoencoder mode; and
an autoencoder prediction unit, configured to, when the intra prediction mode of the current block is the autoencoder mode, perform a first nonlinear transform on the original values of pixels in the current block and/or reconstructed reference information adjacent to the current block, based on an encoding network of the autoencoder, to obtain supplementary information of the current block; and perform a second nonlinear transform on the supplementary information of the current block, or on the supplementary information of the current block and the adjacent reconstructed reference information, based on a decoding network of the autoencoder, to obtain predicted values of the pixels in the current block.
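As a toy illustration of the two-transform structure of the autoencoder prediction unit (the network sizes and untrained random weights below are placeholders; the disclosure uses trained encoding and decoding networks, and every name here is hypothetical), a first nonlinear transform maps the original block and adjacent reconstructed reference pixels to supplementary information, and a second nonlinear transform maps that supplementary information plus the reference pixels to a prediction:

```python
import math
import random

random.seed(0)

BLOCK = 4   # toy 4x4 block
REF = 8     # toy number of adjacent reconstructed reference pixels
SIDE = 3    # toy number of supplementary-information elements

# Toy, untrained weights standing in for the trained encoding/decoding networks.
W_enc = [[random.gauss(0, 0.1) for _ in range(BLOCK * BLOCK + REF)] for _ in range(SIDE)]
W_dec = [[random.gauss(0, 0.1) for _ in range(SIDE + REF)] for _ in range(BLOCK * BLOCK)]

def layer(x, w):
    # One nonlinear layer: matrix multiply followed by tanh.
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def autoencoder_intra_predict(block, ref):
    """First nonlinear transform (encoding network): original block pixels plus
    adjacent reconstructed reference pixels -> supplementary information.
    Second nonlinear transform (decoding network): supplementary information
    plus the same reference pixels -> predicted block."""
    x = [p / 255.0 for row in block for p in row] + [p / 255.0 for p in ref]
    side = layer(x, W_enc)                         # supplementary information
    y = layer(side + [p / 255.0 for p in ref], W_dec)
    # Map tanh output [-1, 1] back to 8-bit pixel values.
    pred = [[min(255, max(0, round((y[i * BLOCK + j] + 1) * 127.5)))
             for j in range(BLOCK)] for i in range(BLOCK)]
    return side, pred
```

The decoder side only needs `side` (parsed from the bitstream) and the reference pixels to run the second transform, which is why the supplementary information is what gets encoded.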
An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding device described in any embodiment of the present disclosure and the video decoding device described in any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the video encoding method or the video decoding method described in any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a bitstream, where the bitstream is generated according to the video encoding method of an embodiment of the present disclosure, and the bitstream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block and the supplementary information of the current block; or the bitstream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block, the supplementary information of the current block, and one or more of the following: a network parameter identifier of the current block, residual values of pixels in the current block, and a residual flag of the current block.
With the intra prediction and video encoding/decoding methods of the above embodiments of the present disclosure, both subjective and objective prediction quality and coding performance can be improved. FIG. 15A to FIG. 15D, FIG. 16A to FIG. 16D, and FIG. 17A to FIG. 17D are three groups of comparisons between reconstructed blocks and original blocks obtained with different intra prediction methods; each group uses a 32x32 luma block, but with different content. FIG. 15A, FIG. 16A and FIG. 17A show the original luma block and its adjacent reconstructed pixels; FIG. 15B, FIG. 16B and FIG. 17B show the predicted block of the luma block and the adjacent reconstructed pixels obtained by intra prediction in the autoencoder mode; FIG. 15C, FIG. 16C and FIG. 17C show the predicted block of the luma block and the adjacent reconstructed pixels when the supplementary information of the autoencoder is set to 0; and FIG. 15D, FIG. 16D and FIG. 17D show the predicted block of the luma block and the adjacent reconstructed pixels obtained by intra prediction in DC mode. It can be seen that intra prediction in the autoencoder mode yields less distortion.
The embodiments of the present disclosure introduce autoencoder-based image compression into a block-level video codec, extending conventional intra prediction and exploiting the good coding performance of end-to-end compression on complex structures to improve codec performance. In addition, in previous autoencoder networks, the supplementary information output by the autoencoder's encoding network was not further optimized; it was fed directly into the autoencoder's decoding network for decoding. The embodiments of the present disclosure propose fine-tuning the supplementary information so as to minimize the Lagrangian rate-distortion cost and thereby optimize performance.
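The fine-tuning idea can be sketched as a search over candidate adjustments of the supplementary-information elements (here the integer case, where each element is tried as-is, plus 1, and minus 1); `rd_cost` is a hypothetical caller-supplied callable standing in for the Lagrangian cost J = D + λ·R, since computing actual rate and distortion requires the full codec.

```python
import itertools

def finetune_side_info(side, rd_cost):
    """Exhaustively evaluate the candidate adjustments of each integer element
    of the supplementary information -- the element value itself, value + 1,
    and value - 1 -- and keep the combination whose Lagrangian rate-distortion
    cost J = D + lambda * R (computed by the caller-supplied rd_cost) is
    the smallest."""
    candidates = [(v, v + 1, v - 1) for v in side]
    best = min(itertools.product(*candidates), key=rd_cost)
    return list(best)
```

For N elements this evaluates 3^N combinations, so in practice one would restrict the search to a subset of the combinations, as the embodiments allow ("some or all of the combinations").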
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, for example according to a communication protocol. In this manner, computer-readable media may generally correspond to non-transitory tangible computer-readable storage media or to communication media such as signals or carrier waves. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Moreover, any connection may also properly be termed a computer-readable medium. For example, if instructions are transmitted from a website, server or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in the embodiments of the present disclosure to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) in conjunction with suitable software and/or firmware.

Claims (39)

  1. A video decoding method, comprising:
    parsing an identifier of an intra prediction mode of a current block from a received bitstream, and determining the intra prediction mode of the current block from a plurality of selectable intra prediction modes according to the identifier, wherein the plurality of selectable intra prediction modes comprise an autoencoder mode; and
    performing intra prediction on the current block according to the intra prediction mode of the current block.
  2. The video decoding method according to claim 1, wherein:
    the plurality of selectable intra prediction modes further comprise one or more of a planar mode, a horizontal mode, a vertical mode, a DC mode, an angular mode, a cross-component linear model (CCLM) mode, or a matrix weighted intra prediction (MIP) mode.
  3. The video decoding method according to claim 1, wherein:
    when the intra prediction mode of the current block is the autoencoder mode, performing intra prediction on the current block according to the intra prediction mode of the current block comprises:
    parsing supplementary information of the current block from the bitstream; and
    based on a decoding network of an autoencoder corresponding to the autoencoder mode, nonlinearly transforming the supplementary information of the current block, or the supplementary information of the current block and reconstructed reference information adjacent to the current block, to obtain predicted values of pixels in the current block.
  4. The video decoding method according to claim 3, wherein:
    the autoencoder mode comprises a first autoencoder mode; and
    when the intra prediction mode of the current block is the first autoencoder mode, the video decoding method further comprises: taking reconstructed residual values of pixels in the current block to be 0 by default, and using the predicted values of the pixels in the current block as reconstructed values of the pixels in the current block.
  5. The video decoding method according to claim 3, wherein:
    the autoencoder mode comprises a second autoencoder mode; and
    when the intra prediction mode of the current block is the second autoencoder mode, the video decoding method further comprises: parsing the bitstream to obtain reconstructed residual values of pixels in the current block, and adding the predicted values of the pixels in the current block to the reconstructed residual values to obtain reconstructed values of the pixels in the current block.
  6. The video decoding method according to claim 3, wherein:
    the decoding network has one trained set of network parameters; or
    the decoding network has a plurality of trained sets of network parameters, different sets of network parameters being trained separately on sample sets of their respective corresponding blocks, the blocks corresponding to different sets of network parameters having different features, the features being determined according to one or more of target bit rate, size, shape and type.
  7. The video decoding method according to claim 6, wherein:
    when the decoding network has a plurality of trained sets of network parameters, the network parameters of the decoding network are determined in one of the following ways:
    finding, according to the features of the current block, a set of network parameters corresponding to the current block from the plurality of sets of network parameters, and determining the network parameters of the decoding network according to the found set of network parameters; or
    parsing a network parameter identifier of the current block from the bitstream, and determining the parameters of the decoding network according to the set of network parameters corresponding to the current block indicated by the network parameter identifier.
  8. The video decoding method according to claim 3, wherein:
    the current block is a video coding unit, a luma block contained in a video coding unit, or a chroma block contained in a video coding unit, the video coding unit being a macroblock (MB), a coding tree unit (CTB), a coding unit (CU), a prediction unit (PU), or a transform unit (TU).
  9. The video decoding method according to claim 8, wherein:
    when the current block is a video coding unit and the video coding unit contains one luma block and two chroma blocks, performing intra prediction on the current block based on the decoding network of the autoencoder corresponding to the autoencoder mode, by nonlinearly transforming the supplementary information of the current block, or the supplementary information of the current block and the reconstructed reference information adjacent to the current block, to obtain the predicted values of the pixels in the current block, comprises:
    based on a decoding network of one autoencoder of the autoencoder mode corresponding to the video coding unit, nonlinearly transforming supplementary information shared by the one luma block and the two chroma blocks, or the shared supplementary information and adjacent reconstructed reference information, to obtain predicted values of pixels in the one luma block and the two chroma blocks; or
    based on a decoding network of one autoencoder of the autoencoder mode corresponding to the luma block, nonlinearly transforming the supplementary information of the one luma block, or the supplementary information of the one luma block and adjacent reconstructed reference information, to obtain predicted values of pixels in the one luma block; and, based on a decoding network of one autoencoder of the autoencoder mode corresponding to the chroma blocks, nonlinearly transforming supplementary information shared by the two chroma blocks, or the shared supplementary information and adjacent reconstructed reference information, to obtain predicted values of pixels in the two chroma blocks; or
    based on a decoding network of one autoencoder of the autoencoder mode corresponding to the luma block, nonlinearly transforming the supplementary information of the one luma block, or the supplementary information of the one luma block and adjacent reconstructed reference information, to obtain predicted values of pixels in the one luma block; based on a decoding network of the autoencoder of the autoencoder mode corresponding to a first chroma block of the two chroma blocks, nonlinearly transforming the supplementary information of the first chroma block, or the supplementary information of the first chroma block and adjacent reconstructed reference information, to obtain predicted values of pixels in the first chroma block; and, based on a decoding network of the autoencoder of the autoencoder mode corresponding to a second chroma block of the two chroma blocks, nonlinearly transforming the supplementary information of the second chroma block, or the supplementary information of the second chroma block and adjacent reconstructed reference information, to obtain predicted values of pixels in the second chroma block.
  10. The video decoding method according to claim 1, wherein:
    the identifier of the intra prediction mode comprises an autoencoder mode flag bit, one value of the autoencoder mode flag bit indicating that the intra prediction mode is the autoencoder mode, and the other value indicating that the intra prediction mode is not the autoencoder mode; or
    the identifier of the intra prediction mode comprises an index number of the intra prediction mode.
  11. The video decoding method according to claim 3, wherein:
    the reconstructed reference information adjacent to the current block comprises reconstructed values of reference pixels adjacent to the current block, the reference pixels adjacent to the current block comprising one or more of the following:
    one or more rows of pixels above the current block;
    one or more rows of pixels above and to the right of the current block;
    one or more columns of pixels to the left of the current block;
    one or more columns of pixels to the lower left of the current block;
    one or more pixels at the upper left corner of the current block.
  12. The video decoding method according to claim 3, wherein:
    when it is determined that the intra prediction mode of the current block is the autoencoder mode, the video decoding method further comprises:
    parsing a residual flag of the current block from the bitstream, the residual flag being used to indicate whether residual data of the current block is present in the bitstream;
    when the residual flag indicates that residual data of the current block is present in the bitstream, parsing the residual data of the current block; and
    when the residual flag indicates that no residual data of the current block is present in the bitstream, skipping the parsing of the residual data of the current block and taking reconstructed residual values of pixels in the current block to be 0 by default.
  13. A video encoding method, comprising:
    selecting an intra prediction mode of a current block from a plurality of selectable intra prediction modes, the plurality of selectable intra prediction modes comprising an autoencoder mode; and
    performing intra prediction on the current block based on the intra prediction mode of the current block, and encoding an identifier of the intra prediction mode of the current block and writing it into a bitstream.
  14. The video encoding method according to claim 13, wherein:
    the plurality of selectable intra prediction modes further comprise one or more of the following modes: a planar mode, a horizontal mode, a vertical mode, a DC mode, an angular mode, a cross-component linear model (CCLM) mode, and a matrix weighted intra prediction (MIP) mode.
  15. The video encoding method according to claim 13, wherein:
    when the intra prediction mode of the current block is the autoencoder mode, performing intra prediction on the current block based on the intra prediction mode of the current block comprises:
    obtaining supplementary information of the current block through an encoding process, the encoding process comprising: performing, based on an encoding network of an autoencoder corresponding to the autoencoder mode, a first nonlinear transform on original values of pixels in the current block and/or reconstructed reference information adjacent to the current block; and
    based on a decoding network of the autoencoder corresponding to the autoencoder mode, performing a second nonlinear transform on the supplementary information of the current block, or on the supplementary information of the current block and the reconstructed reference information adjacent to the current block, to obtain predicted values of the pixels in the current block.
  16. The video encoding method according to claim 15, wherein:
    when the intra prediction mode of the current block is the autoencoder mode, the video encoding method further comprises: encoding the supplementary information of the current block and writing it into the bitstream.
  17. The video encoding method according to claim 15, wherein:
    the autoencoder mode comprises a first autoencoder mode; and
    when the intra prediction mode of the current block is the first autoencoder mode, the video encoding method further comprises: taking reconstructed residual values of pixels in the current block to be 0 by default, and skipping residual processing.
  18. The video encoding method according to claim 15, wherein:
    the autoencoder mode comprises a second autoencoder mode; and
    when the intra prediction mode of the current block is the second autoencoder mode, the video encoding method further comprises: obtaining residual values of pixels in the current block from differences between the original values and the predicted values of the pixels in the current block; and encoding the residual values of the pixels in the current block and writing them into the bitstream.
  19. The video encoding method according to claim 15, wherein:
    the encoding process further comprises:
    determining a plurality of fine-tuning ways for the element values output by the first nonlinear transform;
    separately calculating the coding cost of the current block under each of the plurality of fine-tuning ways; and
    fine-tuning the element values according to the fine-tuning way with the smallest coding cost, and using the fine-tuned element values as the supplementary information of the current block.
  20. 如权利要求19所述的视频编码方法,其中:The video coding method as claimed in claim 19, wherein:
    所述第一非线性变换输出的元素值为整数；所述多种微调方式包括对所述第一非线性变换输出的部分或全部元素值的可能微调值进行组合得到的部分或全部组合方式，其中，一个元素值的可能微调值包括该元素值，该元素值加1得到的值，及该元素值减1得到的值；或者The element values output by the first nonlinear transformation are integers; the multiple fine-tuning modes include some or all of the combinations obtained by combining the possible fine-tuning values of some or all of the element values output by the first nonlinear transformation, wherein the possible fine-tuning values of an element value include the element value itself, the element value plus 1, and the element value minus 1; or
    所述第一非线性变换输出的元素值为浮点数，所述多种微调方式包括对所述第一非线性变换输出的部分或全部元素值的可能微调值进行组合得到的部分或全部组合方式，其中，一个元素值的可能微调值包括对该元素值向上取整得到的值，及对该元素值向下取整得到的值。The element values output by the first nonlinear transformation are floating-point numbers; the multiple fine-tuning modes include some or all of the combinations obtained by combining the possible fine-tuning values of some or all of the element values output by the first nonlinear transformation, wherein the possible fine-tuning values of an element value include the value obtained by rounding the element value up and the value obtained by rounding it down.
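The fine-tuning search of claims 19 and 20 can be sketched as an exhaustive enumeration over per-element candidates. The `cost_fn` here is a hypothetical stand-in for the encoding cost of claim 19; a real encoder would evaluate a rate-distortion cost for each candidate combination.

```python
import math
from itertools import product

def best_fine_tuning(elements, cost_fn, integer=True):
    """Enumerate the fine-tuning combinations of claims 19-20 and keep the
    one with the smallest coding cost.  Integer element values may stay,
    go up by 1 or down by 1; floating-point values are rounded up or down."""
    if integer:
        candidates = [(v - 1, v, v + 1) for v in elements]
    else:
        candidates = [(math.floor(v), math.ceil(v)) for v in elements]
    return min(product(*candidates), key=cost_fn)

# Hypothetical cost: squared distance to values the decoder predicts well.
target = (3, -1, 5)
cost = lambda tup: sum((a - b) ** 2 for a, b in zip(tup, target))
tuned = best_fine_tuning([2, -1, 5], cost, integer=True)

tuned_float = best_fine_tuning([2.4, -0.6],
                               lambda t: sum(x * x for x in t),
                               integer=False)
```

Enumerating all combinations is exponential in the number of elements, which is presumably why the claim allows restricting the search to "some" of the element values and combinations.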
  21. 如权利要求16至20中任一所述的视频编码方法,其中:The video coding method according to any one of claims 16 to 20, wherein:
    所述从多种可选择的帧内预测模式中选取当前块的帧内预测模式,包括:The selection of the intra-frame prediction mode of the current block from a variety of selectable intra-frame prediction modes includes:
    确定所述多种可选择的帧内预测模式中需进行代价计算的帧内预测模式;determining an intra-frame prediction mode that requires cost calculation among the multiple selectable intra-frame prediction modes;
    分别计算需进行代价计算的所述帧内预测模式的编码代价;Calculating the encoding costs of the intra prediction modes that require cost calculation;
    将编码代价最小的帧内预测模式选取为所述当前块的帧内预测模式;Selecting the intra-frame prediction mode with the smallest encoding cost as the intra-frame prediction mode of the current block;
    其中，对于任意的块，将自编码器模式作为需进行代价计算的帧内预测模式；或者，如所述当前块的特征属于设定的可使用自编码器模式的块的特征，再将自编码器模式作为需进行代价计算的帧内预测模式，所述特征根据目标码率、大小、形状和类型中的一种或多种确定。Wherein, for any block, the autoencoder mode is taken as an intra prediction mode that requires cost calculation; or, the autoencoder mode is taken as an intra prediction mode that requires cost calculation only if the features of the current block belong to the set features of blocks that may use the autoencoder mode, the features being determined according to one or more of target code rate, size, shape and type.
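The gated mode decision of claim 21, sketched with hypothetical per-mode costs and an assumed feature test (block area); the claim leaves both the cost measure and the exact feature set (target code rate, size, shape, type) open.

```python
def select_intra_mode(block, modes, cost_of, can_use_autoencoder=None):
    """Mode decision of claim 21: optionally gate the autoencoder mode on
    block features, compute the encoding cost of every remaining candidate,
    and pick the cheapest."""
    candidates = [m for m in modes
                  if m != "autoencoder"
                  or can_use_autoencoder is None
                  or can_use_autoencoder(block)]
    return min(candidates, key=lambda m: cost_of(block, m))

# Hypothetical per-mode costs; the gate admits the autoencoder mode only
# for blocks of at least 256 pixels.
costs = {"planar": 10.0, "dc": 12.0, "autoencoder": 8.0}
cost_of = lambda b, m: costs[m]
gate = lambda b: b["w"] * b["h"] >= 256

pick_large = select_intra_mode({"w": 32, "h": 32},
                               ["planar", "dc", "autoencoder"], cost_of, gate)
pick_small = select_intra_mode({"w": 4, "h": 4},
                               ["planar", "dc", "autoencoder"], cost_of, gate)
```

Gating on features keeps the expensive autoencoder cost evaluation out of the loop for blocks unlikely to benefit from it.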
  22. 如权利要求21所述的视频编码方法,其中:The video coding method as claimed in claim 21, wherein:
    所述编码代价根据拉格朗日率失真公式计算；在所述自编码器模式包括第一自编码器模式的情况下，根据拉格朗日率失真公式计算所述第一自编码器模式对应的编码代价时，所述拉格朗日率失真公式中的比特开销包括所述当前块的补充信息和帧内预测模式的标识的开销，所述拉格朗日率失真公式中的失真根据所述当前块中像素的原始值和预测值确定。The encoding cost is calculated according to the Lagrangian rate-distortion formula. In the case where the autoencoder mode includes the first autoencoder mode, when the encoding cost corresponding to the first autoencoder mode is calculated according to the Lagrangian rate-distortion formula, the bit overhead in the Lagrangian rate-distortion formula includes the overhead of the supplementary information of the current block and of the identifier of the intra prediction mode, and the distortion in the Lagrangian rate-distortion formula is determined according to the original values and predicted values of the pixels in the current block.
  23. 如权利要求21所述的视频编码方法,其中:The video coding method as claimed in claim 21, wherein:
    所述编码代价根据拉格朗日率失真公式计算；在所述自编码器模式为第二自编码器模式的情况下，根据拉格朗日率失真公式计算所述第二自编码器模式对应的率失真代价时，所述拉格朗日率失真公式中的比特开销包括所述当前块的补充信息、所述当前块中像素的残差值和所述当前块的帧内预测模式的标识的开销；所述拉格朗日率失真公式中的失真根据所述当前块中像素的原始值和重建值确定，所述当前块中像素的重建值等于所述当前块中像素的预测值加上重建的残差值。The encoding cost is calculated according to the Lagrangian rate-distortion formula. In the case where the autoencoder mode is the second autoencoder mode, when the rate-distortion cost corresponding to the second autoencoder mode is calculated according to the Lagrangian rate-distortion formula, the bit overhead in the Lagrangian rate-distortion formula includes the overhead of the supplementary information of the current block, the residual values of the pixels in the current block and the identifier of the intra prediction mode of the current block; the distortion in the Lagrangian rate-distortion formula is determined according to the original values and reconstructed values of the pixels in the current block, where the reconstructed value of a pixel in the current block equals the predicted value of that pixel plus the reconstructed residual value.
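The two Lagrangian costs of claims 22 and 23 differ only in what the rate and distortion terms count. A sketch with made-up numbers: the bit counts and the multiplier λ are illustrative assumptions.

```python
def rd_cost_mode1(distortion_pred, bits_side_info, bits_mode_id, lam):
    """First autoencoder mode (claim 22): no residual is coded, the rate
    counts only the supplementary information and the mode identifier,
    and distortion is measured against the prediction itself."""
    return distortion_pred + lam * (bits_side_info + bits_mode_id)

def rd_cost_mode2(distortion_recon, bits_side_info, bits_residual,
                  bits_mode_id, lam):
    """Second autoencoder mode (claim 23): the residual bits join the
    rate term, and distortion is measured against the reconstruction
    (prediction plus reconstructed residual)."""
    return distortion_recon + lam * (bits_side_info + bits_residual
                                     + bits_mode_id)

lam = 0.5
j1 = rd_cost_mode1(distortion_pred=40.0, bits_side_info=16,
                   bits_mode_id=2, lam=lam)                    # 40 + 0.5 * 18
j2 = rd_cost_mode2(distortion_recon=10.0, bits_side_info=16,
                   bits_residual=50, bits_mode_id=2, lam=lam)  # 10 + 0.5 * 68
```

With these numbers the residual mode is cheaper: spending 50 residual bits buys a 30-point distortion reduction, which λ = 0.5 still rewards.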
  24. 如权利要求15所述的视频编码方法,其中:The video coding method as claimed in claim 15, wherein:
    所述自编码器具有训练出的一组网络参数;或者The autoencoder has a trained set of network parameters; or
    所述自编码器具有训练出的多组网络参数，每组网络参数包含所述自编码器的编码网络和解码网络的网络参数，不同组的网络参数根据各自对应的块的样本集分别训练得到，不同组的网络参数对应的块具有不同的特征，所述特征根据目标码率、大小、形状和类型中的一种或多种确定。The autoencoder has multiple trained sets of network parameters, each set of network parameters containing the network parameters of the encoding network and the decoding network of the autoencoder; different sets of network parameters are trained separately on sample sets of their respective corresponding blocks, and the blocks corresponding to different sets of network parameters have different features, the features being determined according to one or more of target code rate, size, shape and type.
  25. 如权利要求24所述的视频编码方法,其中:The video coding method as claimed in claim 24, wherein:
    在所述当前块的帧内预测模式为自编码器模式的情况下，所述自编码器的编码网络和解码网络的网络参数通过以下方式确定：根据所述当前块的特征从所述多组网络参数中找到与所述当前块对应的一组网络参数，根据该组网络参数确定所述自编码器的编码网络和解码网络的网络参数。In the case where the intra prediction mode of the current block is the autoencoder mode, the network parameters of the encoding network and the decoding network of the autoencoder are determined as follows: a set of network parameters corresponding to the current block is found among the multiple sets of network parameters according to the features of the current block, and the network parameters of the encoding network and the decoding network of the autoencoder are determined according to that set of network parameters.
  26. 如权利要求25所述的视频编码方法,其中:The video coding method as claimed in claim 25, wherein:
    所述视频编码方法还包括：对所述当前块的网络参数标识编码并写入码流，所述当前块的网络参数标识用于指示所述多组网络参数中与所述当前块对应的一组网络参数。The video encoding method further includes: encoding the network parameter identifier of the current block and writing it into the code stream, where the network parameter identifier of the current block is used to indicate the set of network parameters, among the multiple sets, that corresponds to the current block.
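Claims 24 to 26 amount to a lookup from block features to one of several trained parameter sets, with the chosen set's identifier signalled per claim 26. A sketch with hypothetical feature buckets and a two-entry table; the thresholds, key names and weight placeholders are all assumptions.

```python
def block_feature_key(block, target_rate):
    """Feature buckets of claims 24-26: one trained parameter set per
    combination of rate bucket, size bucket, shape and block type.  The
    bucket thresholds here are illustrative assumptions."""
    area = block["w"] * block["h"]
    return (
        "high_rate" if target_rate > 1.0 else "low_rate",
        "large" if area >= 256 else "small",
        "square" if block["w"] == block["h"] else "rect",
        block["type"],  # e.g. "luma" or "chroma"
    )

# Hypothetical table: feature key -> (network parameter identifier that
# would be written to the code stream per claim 26, trained weights).
param_sets = {
    ("low_rate", "small", "square", "luma"): (0, "weights_a"),
    ("high_rate", "large", "square", "luma"): (1, "weights_b"),
}

key = block_feature_key({"w": 32, "h": 32, "type": "luma"}, target_rate=2.0)
param_id, weights = param_sets[key]
```

When the features are derivable at the decoder (size, shape, type), the lookup can be repeated there and the explicit identifier of claim 26 becomes redundant; signalling it covers the cases where they are not.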
  27. 如权利要求13所述的视频编码方法,其中:The video coding method as claimed in claim 13, wherein:
    所述当前块为视频编码单元,或者为视频编码单元包含的亮度块,或者为视频编码单元包含的色度块,所述视频编码单元为宏块MB、编码树单元CTB、编码单元CU、预测单元PU、或变换单元TU。The current block is a video coding unit, or a luma block contained in a video coding unit, or a chrominance block contained in a video coding unit, and the video coding unit is a macroblock MB, a coding tree unit CTB, a coding unit CU, a prediction Unit PU, or transform unit TU.
  28. 如权利要求15所述的视频编码方法,其中:The video coding method as claimed in claim 15, wherein:
    在所述当前块为视频编码单元，所述视频编码单元包含一个亮度块和两个色度块的情况下，所述通过编码处理得到所述当前块的补充信息，包括：In the case where the current block is a video coding unit and the video coding unit contains one luma block and two chroma blocks, obtaining the supplementary information of the current block through the encoding processing includes:
    基于所述自编码器模式的与视频编码单元对应的一个自编码器的编码网络，对所述一个亮度块和两个色度块中像素的原始值和/或所述一个亮度块和两个色度块邻近的已重建参考信息进行第一非线性变换，得到所述一个亮度块和两个色度块共用的补充信息；或者Based on the encoding network of one autoencoder of the autoencoder mode corresponding to the video coding unit, performing the first nonlinear transformation on the original values of the pixels in the one luma block and the two chroma blocks and/or the reconstructed reference information adjacent to the one luma block and the two chroma blocks, to obtain supplementary information shared by the one luma block and the two chroma blocks; or
    基于所述自编码器模式的与亮度块对应的一个自编码器的编码网络，对所述一个亮度块中像素的原始值和/或所述一个亮度块邻近的已重建参考信息进行第一非线性变换，得到所述一个亮度块的补充信息；及，基于所述自编码器模式的与色度块对应的一个自编码器的编码网络，对所述两个色度块中像素的原始值和/或所述两个色度块邻近的已重建参考信息进行第一非线性变换，得到所述两个色度块共用的补充信息；或者Based on the encoding network of one autoencoder of the autoencoder mode corresponding to luma blocks, performing the first nonlinear transformation on the original values of the pixels in the one luma block and/or the reconstructed reference information adjacent to the one luma block, to obtain supplementary information of the one luma block; and, based on the encoding network of one autoencoder of the autoencoder mode corresponding to chroma blocks, performing the first nonlinear transformation on the original values of the pixels in the two chroma blocks and/or the reconstructed reference information adjacent to the two chroma blocks, to obtain supplementary information shared by the two chroma blocks; or
    基于所述自编码器模式的与亮度块对应的一个自编码器的编码网络，对所述一个亮度块中像素的原始值和/或所述一个亮度块邻近的已重建参考信息进行第一非线性变换，得到所述一个亮度块的补充信息；基于所述自编码器模式的与所述两个色度块中第一色度块对应的一个自编码器的编码网络，对所述第一色度块中像素的原始值和/或邻近的已重建参考信息进行第一非线性变换，得到所述第一色度块的补充信息；及，基于所述自编码器模式的与所述两个色度块中第二色度块对应的一个自编码器的编码网络，对所述第二色度块中像素的原始值和/或邻近的已重建参考信息进行第一非线性变换，得到所述第二色度块的补充信息。Based on the encoding network of one autoencoder of the autoencoder mode corresponding to luma blocks, performing the first nonlinear transformation on the original values of the pixels in the one luma block and/or the reconstructed reference information adjacent to the one luma block, to obtain supplementary information of the one luma block; based on the encoding network of one autoencoder of the autoencoder mode corresponding to the first chroma block of the two chroma blocks, performing the first nonlinear transformation on the original values of the pixels in the first chroma block and/or the adjacent reconstructed reference information, to obtain supplementary information of the first chroma block; and, based on the encoding network of one autoencoder of the autoencoder mode corresponding to the second chroma block of the two chroma blocks, performing the first nonlinear transformation on the original values of the pixels in the second chroma block and/or the adjacent reconstructed reference information, to obtain supplementary information of the second chroma block.
  29. 如权利要求13所述的视频编码方法,其中:The video coding method as claimed in claim 13, wherein:
    所述帧内预测模式的标识包括自编码器模式标识位，所述自编码器模式标识位的一个取值表示帧内预测模式为自编码器模式，另一个取值表示帧内预测模式非自编码器模式；或者The identifier of the intra prediction mode includes an autoencoder mode flag bit, one value of which indicates that the intra prediction mode is the autoencoder mode and the other value of which indicates that the intra prediction mode is not the autoencoder mode; or
    所述帧内预测模式的标识包括帧内预测模式的索引号。The identifier of the intra-frame prediction mode includes an index number of the intra-frame prediction mode.
  30. 如权利要求15所述的视频编码方法,其中:The video coding method as claimed in claim 15, wherein:
    所述当前块邻近的已重建参考信息包括所述当前块邻近的参考像素的重建值,所述当前块邻近的参考像素包括以下像素中的一种或多种:The reconstructed reference information adjacent to the current block includes reconstruction values of reference pixels adjacent to the current block, and the reference pixels adjacent to the current block include one or more of the following pixels:
    所述当前块上方的一行或多行像素;one or more rows of pixels above the current block;
    所述当前块右上方的一行或多行像素;One or more rows of pixels at the upper right of the current block;
    所述当前块左侧的一列或多列像素;one or more columns of pixels to the left of the current block;
    所述当前块左下侧的一列或多列像素;One or more columns of pixels on the lower left side of the current block;
    所述当前块左上角的一个或多个像素。One or more pixels in the upper left corner of the current block.
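The five neighbouring reference-pixel groups of claim 30, gathered from a toy reconstructed frame. One row or column of each group is taken here for simplicity, although the claim allows several, and the coordinate convention is an assumption of this sketch.

```python
import numpy as np

def gather_reference_pixels(recon, x, y, w, h):
    """Neighbouring reconstructed reference samples of claim 30: the row
    above, the row above-right, the column to the left, the column
    below-left and the top-left corner sample."""
    return {
        "above":       recon[y - 1, x:x + w],
        "above_right": recon[y - 1, x + w:x + 2 * w],
        "left":        recon[y:y + h, x - 1],
        "below_left":  recon[y + h:y + 2 * h, x - 1],
        "corner":      recon[y - 1, x - 1],
    }

recon = np.arange(64, dtype=int).reshape(8, 8)  # toy reconstructed frame
refs = gather_reference_pixels(recon, x=2, y=2, w=2, h=2)
```

A real codec would additionally pad or substitute these samples when a neighbour falls outside the picture or has not been reconstructed yet, which this sketch omits.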
  31. 如权利要求15所述的视频编码方法,其中:The video coding method as claimed in claim 15, wherein:
    在基于自编码器模式对所述当前块进行帧内预测的情况下,所述视频编码方法还包括:In the case of performing intra prediction on the current block based on an autoencoder mode, the video encoding method further includes:
    在J1<J2的情况下，对所述当前块进行残差处理，所述残差处理包括：根据所述当前块中像素的原始值和预测值之差得到所述当前块中像素的残差值，及，对所述当前块中像素的残差值编码并写入码流；In the case of J1<J2, performing residual processing on the current block, where the residual processing includes: obtaining the residual values of the pixels in the current block from the difference between the original values and the predicted values of the pixels in the current block; and encoding the residual values of the pixels in the current block and writing them into the code stream;
    在J1≥J2的情况下,跳过所述当前块的残差处理;In the case of J1≥J2, skip the residual processing of the current block;
    其中,J1是进行残差处理时所述当前块的编码代价,J2是跳过残差处理时所述当前块的编码代价。Wherein, J1 is the encoding cost of the current block when residual processing is performed, and J2 is the encoding cost of the current block when residual processing is skipped.
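The J1/J2 comparison of claim 31, combined with the residual identifier of claim 32; note that on ties (J1 ≥ J2) the claim skips residual processing. The flag values 1/0 are an assumption of this sketch — the claim only requires that presence and absence of residual data be distinguishable.

```python
def residual_decision(j_with_residual, j_skip):
    """Code the residual only when strictly cheaper (J1 < J2, claim 31);
    the choice is signalled with the residual identifier of claim 32
    (1: residual data present in the code stream, 0: absent)."""
    code_residual = j_with_residual < j_skip
    return code_residual, (1 if code_residual else 0)

decision_cheaper = residual_decision(44.0, 49.0)  # residual pays off
decision_tie = residual_decision(49.0, 49.0)      # tie -> skip residual
```

Preferring skip on ties saves the residual bits whenever they buy nothing, at the cost of one flag per block.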
  32. 如权利要求17或18或31所述的视频编码方法,其中:The video coding method as claimed in claim 17 or 18 or 31, wherein:
    在基于自编码器模式对所述当前块进行帧内预测的情况下,所述视频编码方法还包括:In the case of performing intra prediction on the current block based on an autoencoder mode, the video encoding method further includes:
    生成所述当前块的残差标识，其中，在对所述当前块进行残差处理的情况下，所述残差标识用于指示码流中存在所述当前块的残差数据，在跳过所述当前块的残差处理的情况下，所述残差标识用于指示码流中不存在所述当前块的残差数据；Generating a residual identifier of the current block, where, in the case where residual processing is performed on the current block, the residual identifier is used to indicate that residual data of the current block is present in the code stream, and, in the case where residual processing of the current block is skipped, the residual identifier is used to indicate that no residual data of the current block is present in the code stream;
    对所述当前块的残差标识编码并写入码流。Encode the residual identifier of the current block and write it into a code stream.
  33. 一种视频解码装置，包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器，其中，所述处理器执行所述计算机程序时实现如权利要求1至12中任一所述的视频解码方法。A video decoding apparatus, including a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the video decoding method according to any one of claims 1 to 12.
  34. 一种视频解码装置,包括帧内预测处理单元,其中,所述帧内预测处理单元包括:A video decoding device, comprising an intra prediction processing unit, wherein the intra prediction processing unit includes:
    模式选取单元，用于从接收的码流中解析出当前块的帧内预测模式的标识，根据所述标识从多种可选择的帧内预测模式中确定所述当前块的帧内预测模式，所述多种可选择的帧内预测模式包括自编码器模式；及，根据所述当前块的帧内预测模式激活相应的预测单元执行对当前块的帧内预测；a mode selection unit, configured to parse the identifier of the intra prediction mode of the current block from a received code stream and determine the intra prediction mode of the current block from multiple selectable intra prediction modes according to the identifier, where the multiple selectable intra prediction modes include an autoencoder mode; and to activate, according to the intra prediction mode of the current block, a corresponding prediction unit to perform intra prediction on the current block;
    自编码器预测单元，用于在当前块的帧内预测模式为自编码器模式的情况下，基于自编码器模式对应的自编码器的解码网络，对所述当前块的补充信息或者对所述当前块的补充信息和所述当前块邻近的已重建参考信息进行非线性变换，得到所述当前块中像素的预测值。an autoencoder prediction unit, configured to, in the case where the intra prediction mode of the current block is the autoencoder mode, perform, based on the decoding network of the autoencoder corresponding to the autoencoder mode, a nonlinear transformation on the supplementary information of the current block, or on the supplementary information of the current block and the reconstructed reference information adjacent to the current block, to obtain the predicted values of the pixels in the current block.
  35. 一种视频编码装置，包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器，其中，所述处理器执行所述计算机程序时实现如权利要求13至32中任一所述的视频编码方法。A video encoding apparatus, including a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the video encoding method according to any one of claims 13 to 32.
  36. 一种视频编码装置,包括帧内预测处理单元,其中,所述帧内预测处理单元包括:A video encoding device, including an intra prediction processing unit, wherein the intra prediction processing unit includes:
    模式选取单元，用于从多种可选择的帧内预测模式中选取当前块的帧内预测模式，根据当前块的帧内预测模式激活相应的预测单元执行对当前块的帧内预测，及对当前块的帧内预测模式的标识编码并写入码流；其中，所述多种可选择的帧内预测模式包括自编码器模式；a mode selection unit, configured to select the intra prediction mode of the current block from multiple selectable intra prediction modes, activate a corresponding prediction unit to perform intra prediction on the current block according to the intra prediction mode of the current block, and encode the identifier of the intra prediction mode of the current block and write it into the code stream, where the multiple selectable intra prediction modes include an autoencoder mode;
    自编码器预测单元，用于在当前块的帧内预测模式为自编码器模式的情况下，基于自编码器的编码网络，对当前块中像素的原始值和/或当前块邻近的已重建参考信息进行第一非线性变换，得到当前块的补充信息；及，基于自编码器的解码网络，对当前块的补充信息或者对当前块的补充信息和邻近的已重建参考信息进行第二非线性变换，得到当前块中像素的预测值。an autoencoder prediction unit, configured to, in the case where the intra prediction mode of the current block is the autoencoder mode, perform, based on the encoding network of the autoencoder, a first nonlinear transformation on the original values of the pixels in the current block and/or the reconstructed reference information adjacent to the current block, to obtain supplementary information of the current block; and to perform, based on the decoding network of the autoencoder, a second nonlinear transformation on the supplementary information of the current block, or on the supplementary information of the current block and the adjacent reconstructed reference information, to obtain the predicted values of the pixels in the current block.
  37. 一种视频编解码系统,包括如权利要求35或36所述的视频编码装置,以及如权利要求33或34所述的视频解码装置。A video encoding and decoding system, comprising the video encoding device as claimed in claim 35 or 36, and the video decoding device as claimed in claim 33 or 34.
  38. 一种非瞬态计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，其中，所述计算机程序被处理器执行时实现如权利要求1至31中任一所述的方法。A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 31.
  39. 一种码流，其中，所述码流根据如权利要求13至32中任一所述的视频编码方法生成，所述码流中包括对所述当前块的帧内预测模式的标识和所述当前块的补充信息编码得到的码字；或者，所述码流中包括对所述当前块的帧内预测模式的标识、所述当前块的补充信息以及以下信息中的一种或多种编码得到的码字：所述当前块的网络参数标识、所述当前块中像素的残差值、所述当前块的残差标识。A code stream, wherein the code stream is generated according to the video encoding method of any one of claims 13 to 32, and the code stream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block and the supplementary information of the current block; or the code stream includes codewords obtained by encoding the identifier of the intra prediction mode of the current block, the supplementary information of the current block, and one or more of the following: the network parameter identifier of the current block, the residual values of the pixels in the current block, and the residual identifier of the current block.
PCT/CN2021/099827 2021-06-11 2021-06-11 Video encoding/decoding method, device and system, and storage medium WO2022257134A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180098992.4A CN117441336A (en) 2021-06-11 2021-06-11 Video encoding and decoding method, device, system and storage medium
PCT/CN2021/099827 WO2022257134A1 (en) 2021-06-11 2021-06-11 Video encoding/decoding method, device and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/099827 WO2022257134A1 (en) 2021-06-11 2021-06-11 Video encoding/decoding method, device and system, and storage medium

Publications (1)

Publication Number Publication Date
WO2022257134A1

Family

ID=84425620

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/099827 WO2022257134A1 (en) 2021-06-11 2021-06-11 Video encoding/decoding method, device and system, and storage medium

Country Status (2)

Country Link
CN (1) CN117441336A (en)
WO (1) WO2022257134A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200137384A1 (en) * 2018-10-24 2020-04-30 City University Of Hong Kong Generative adversarial network based intra prediction for video coding
US20200186809A1 (en) * 2018-12-05 2020-06-11 Google Llc Hybrid Motion-Compensated Neural Network with Side-Information Based Video Coding
CN111466115A (en) * 2017-10-13 2020-07-28 弗劳恩霍夫应用研究促进协会 Intra prediction mode concept for block-wise picture coding
CN112333451A (en) * 2020-11-03 2021-02-05 中山大学 Intra-frame prediction method based on generation countermeasure network
CN112514381A (en) * 2019-06-25 2021-03-16 Oppo广东移动通信有限公司 Image encoding and decoding method, encoder, decoder, and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
B. CHOI (TENCENT), Z. LI (TENCENT), W. WANG (TENCENT), W. JIANG (TENCENT), X. XU (TENCENT), S. LIU (TENCENT): "AHG11: Information on inter-prediction coding tool with deep neural network", 20. JVET MEETING; 20201007 - 20201016; TELECONFERENCE; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), no. JVET-T0058, 6 October 2020 (2020-10-06), XP030289824 *

Also Published As

Publication number Publication date
CN117441336A (en) 2024-01-23

Similar Documents

Publication Publication Date Title
TWI745594B (en) Intra filtering applied together with transform processing in video coding
CN110393010B (en) Intra-filtering flags in video coding
RU2723568C2 (en) Determining prediction parameters for non-square video coding units
RU2565365C2 (en) Encoding transform coefficients for video coding
TWI624172B (en) Palette-based video coding
CN103190147B (en) For combined decoding method and the equipment of the syntactic element of video coding
JP6162150B2 (en) Residual quadtree (RQT) coding for video coding
JP5869108B2 (en) Memory efficient context modeling
TWI827606B (en) Trellis coded quantization coefficient coding
US20110317757A1 (en) Intra prediction mode signaling for finer spatial prediction directions
US10356418B2 (en) Video encoding method and apparatus therefor, and video decoding method and apparatus therefor, in which edge type offset is applied
JP2014532375A (en) Sample adaptive offset merged with adaptive loop filter in video coding
JP5937205B2 (en) Run mode based coefficient coding for video coding
SG190202A1 (en) Separately coding the position of a last significant coefficient of a video block in video coding
JP2015511472A (en) Coding loop filter parameters using codebook in video coding
JP2023542841A (en) Multiple neural network models for filtering during video coding
CA3221507A1 (en) Coding and decoding methods, coder and decoder, and storage medium
WO2022257134A1 (en) Video encoding/decoding method, device and system, and storage medium
WO2022257142A1 (en) Video decoding and coding method, device and storage medium
WO2023004590A1 (en) Video decoding and encoding methods and devices, and storage medium
WO2023225854A1 (en) Loop filtering method and device, and video coding/decoding method, device and system
TWI829424B (en) Decoding method, encoding method and apparatus
WO2023039856A1 (en) Video decoding method and device, video encoding method and device, and storage medium
RU2801327C2 (en) Intra prediction method and device for a video sequence
TWI821013B (en) Mehtods and devices for video encoding and decoding

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21944641

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE