WO2023020320A1 - Entropy encoding and decoding method and device - Google Patents

Entropy encoding and decoding method and device

Info

Publication number
WO2023020320A1
Authority
WO
WIPO (PCT)
Prior art keywords
probability distribution
data
information
decoded
estimated probability
Prior art date
Application number
PCT/CN2022/110827
Other languages
French (fr)
Chinese (zh)
Inventor
GUO, Tiansheng (郭天生)
WANG, Jing (王晶)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023020320A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/189Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N19/196Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding being specially adapted for the computation of encoding parameters, e.g. by averaging previously computed encoding parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • the embodiments of the present application relate to the technical field of data compression based on artificial intelligence (AI), and in particular to an entropy encoding and decoding method and device.
  • Video coding (video encoding and decoding) is widely used in digital video applications, such as broadcast digital TV, video transmission over the Internet and mobile networks, real-time conversational applications such as video chat and video conferencing, Digital Versatile Discs (DVDs) and Blu-ray discs, video content capture and editing systems, and camcorders for security applications.
  • Video compression devices typically use software and/or hardware on the source side to encode video data prior to transmission or storage, thereby reducing the amount of data required to represent digital video images. The compressed data is then received by the video decompression device at the destination side.
  • the present application provides an entropy encoding and decoding method and device to improve the accuracy of the estimated probability distribution of data to be encoded, reduce the code rate in the process of entropy encoding and decoding, and thereby reduce the overhead of entropy encoding and decoding.
  • According to a first aspect, the present application provides an entropy encoding method. The method includes: acquiring data to be encoded, where the data to be encoded is non-first data among multiple data included in the current data stream; acquiring reference information, where the reference information includes at least one of first context information and first side information, the first context information being obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information being obtained by inputting the multiple data into a self-attention encoding network; estimating a first estimated probability distribution according to the reference information; and performing entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • The encoded data refers to the data, among the multiple data, that has already been entropy encoded by the encoder. Since no encoded data exists when entropy encoding is performed on the first data of the current data stream, the data to be encoded needs to be non-first data of the current data stream, so that the first context information can be extracted.
  • the first estimated probability distribution estimated according to the reference information may include at least one estimated probability parameter.
  • the at least one estimated probability parameter may include a mean and a variance, and the mean and variance define a Gaussian distribution.
  • the encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution, and then perform entropy encoding on the data to be encoded according to the probability value.
  • the first code stream obtained after performing entropy coding may be in a binary format.
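  • For illustration only, the following minimal sketch (the helper names and the unit-width discretization bin are assumptions, not taken from this application) shows how an encoder might turn an estimated Gaussian into a symbol probability and an ideal entropy-coding cost:

```python
# Minimal sketch, assuming integer-valued symbols and a unit-width bin;
# this is not the application's implementation.
import math

def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(symbol: int, mean: float, scale: float) -> float:
    # Integrate the Gaussian over the unit-width bin centred on the symbol.
    return gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)

p = symbol_probability(3, mean=2.4, scale=1.1)
bits = -math.log2(p)  # a more accurate distribution yields fewer bits
print(f"p={p:.4f}, ~{bits:.2f} bits")
```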
  • Multiple data can also be referred to as multiple data units.
  • The multiple data may include video data, image data, audio data, integer data, and other data with compression/decompression requirements; this is not limited in this embodiment of the present application. Each data corresponds to a piece of position information, and the data to be encoded is not first among the multiple data.
  • the self-attention decoding network is a neural network with a self-attention mechanism (that is, including a self-attention structure).
  • the self-attention mechanism is a variant of the attention mechanism that has a global receptive field and can better capture the internal correlations of data or features.
  • the self-attention decoding network can obtain the weights of all the input encoded data relative to the data to be encoded, and then weight all or part of the input encoded data with the corresponding weights to obtain the first context information. This improves the utilization rate of the encoded data in the process of obtaining the first context information; when the first estimated probability distribution is estimated using the first context information, the accuracy of the first estimated probability distribution can be improved, which reduces the code rate in the entropy encoding process and thereby further reduces the entropy encoding overhead.
  • the self-attention encoding network has a global receptive field, and can obtain the correlation between all the input data and the data to be encoded.
  • the correlation can be the weight of all the input data relative to the data to be encoded.
  • After the self-attention encoding network obtains the weights of all the input data relative to the data to be encoded, it weights the corresponding data according to those weights to obtain the first side information.
  • the self-attention encoding network can weight all or part of the input data with corresponding weights to obtain the first side information. In this way, the utilization rate of data in the process of obtaining the first side information is improved.
  • When the first estimated probability distribution is subsequently estimated using the first side information, the accuracy of the obtained first estimated probability distribution can be further improved, and the code rate in the entropy encoding process can be further reduced, thereby further reducing the entropy encoding overhead.
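  • As an illustrative sketch of the weighting described above (the shapes and the causal-masking choice are assumptions for illustration, not the application's networks), generic scaled dot-product self-attention can be written as:

```python
# Illustrative sketch only: scaled dot-product self-attention of the kind the
# self-attention networks rely on. Q = K = V = x for brevity.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, causal: bool) -> torch.Tensor:
    # x: (sequence_length, feature_dim)
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5           # pairwise correlations
    if causal:
        # The decoding network may only weight already-coded positions.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # global receptive field
    return weights @ x                                  # weighted sum of inputs

data = torch.randn(8, 16)                     # 8 data units, 16-dim features
context = self_attention(data, causal=True)   # cf. first context information
side = self_attention(data, causal=False)     # cf. first side information
```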
  • the reference information may also include at least one of second context information and second side information, so the following cases are possible (a sketch of the probability distribution estimation network shared by these cases follows this list):
  • the reference information includes the first context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the probability distribution estimation network may be a single neural network, or a structure in the self-attention decoding network, which is not limited in this embodiment of the present application.
  • the reference information includes the first context information and the first side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the first side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information and the second context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the second context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the first side information and the second context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information and the second context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the first side information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information and the second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the second context information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the second context information and the second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the first side information, the second context information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information, the second context information and the second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first side information and context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first side information and the context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first side information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first side information and the second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first side information, context information and the second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first side information, the context information and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
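  • As referenced above, a minimal sketch of a probability distribution estimation network covering these cases (the architecture and layer sizes are assumptions for illustration, not taken from this application) could concatenate whichever reference features are available and output the parameters of a Gaussian:

```python
# Assumed sketch: concatenate the available reference features and predict a
# mean and a positive scale per data unit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilityEstimator(nn.Module):
    def __init__(self, feature_dim: int, num_inputs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim * num_inputs, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # mean and raw scale of a Gaussian
        )

    def forward(self, *features: torch.Tensor):
        out = self.net(torch.cat(features, dim=-1))
        mean, raw_scale = out[..., 0], out[..., 1]
        return mean, F.softplus(raw_scale)  # the scale must stay positive

# Example: the case "first context information and first side information".
ctx1, side1 = torch.randn(8, 16), torch.randn(8, 16)
mean, scale = ProbabilityEstimator(feature_dim=16, num_inputs=2)(ctx1, side1)
```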
  • the method further includes: estimating to obtain a second estimated probability distribution; performing entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
  • the second estimated probability distribution may be obtained by estimating according to preset information.
  • the second estimated probability distribution may also be estimated by a learnable model obtained through training. The encoder then calculates the probability value of the first side information according to the second estimated probability distribution, and performs entropy encoding on the first side information according to the probability value.
  • the second code stream can be sent to the decoding side alone, or the second code stream can be added to the first code stream and sent to the decoding side; the embodiment of the present application does not limit the sending method of the second code stream.
  • the reference information further includes second context information, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into a masked convolutional network (Masked Convolution Network).
  • The masked convolutional network consists of masked convolutional layers or regular convolutional layers.
  • the at least one piece of data that meets the preset condition may be at least one piece of data that is adjacent to the data to be encoded in the at least one piece of encoded data.
  • the neighbors of the data to be encoded may be the m pieces of encoded data preceding the data to be encoded, m>0.
  • the neighbors of the data to be encoded may also be the data adjacent to the data to be encoded, or the encoded data within the surrounding n rings of the data to be encoded, n>0; this embodiment of the present application does not limit the definition of neighbors.
  • the coded data is utilized in the process of acquiring the second context information, which can improve the accuracy of the first estimated probability distribution, thereby reducing the code rate in the process of entropy coding and reducing the overhead of entropy coding.
  • The masked convolutional network has a local receptive field and includes masked convolutional layers or regular convolutional layers.
  • the input of the masked convolutional network is at least one piece of data adjacent to the data to be encoded among the at least one encoded data, and the output is the activation feature of the convolution, that is, the second context information.
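  • A PixelCNN-style masked convolution is one assumed realization of such a masked convolutional layer (this sketch is illustrative, not the application's exact network):

```python
# Assumed sketch: a masked 2D convolution whose output at each position
# depends only on already-coded neighbours (above and to the left).
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        mask = torch.ones_like(self.weight)
        _, _, h, w = self.weight.shape
        mask[:, :, h // 2, w // 2:] = 0   # current position and to its right
        mask[:, :, h // 2 + 1:, :] = 0    # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.weight.data *= self.mask     # apply the causal mask, then convolve
        return super().forward(x)

second_context = MaskedConv2d(1, 8, kernel_size=5, padding=2)(torch.randn(1, 1, 16, 16))
```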
  • the reference information further includes second side information, and the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into a hyper encoder network (Hyper Encoder Network); the method further includes: estimating a third estimated probability distribution; and performing entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
  • the at least one piece of data that meets the preset condition may be at least one piece of data that is adjacent to the data to be encoded among the multiple pieces of data.
  • the neighbors of the data to be encoded may be the m1 pieces of data preceding and/or the m2 pieces of data following the data to be encoded, m1, m2>0.
  • the neighbors of the data to be encoded may also be the data adjacent to the data to be encoded, or the data within the surrounding n rings of the data to be encoded, n>0.
  • the hyper encoder network has a local receptive field and includes conventional convolutional layers.
  • the input of the hyper encoder network is at least one piece of data adjacent to the data to be encoded among the multiple data, and the output is the activation feature of the convolution, that is, the second side information.
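  • A hyper encoder of this kind is often sketched as a small stack of strided regular convolutions (the layer sizes below are assumptions for illustration, not taken from this application):

```python
# Assumed sketch: strided regular convolutions summarize a neighbourhood of
# the data into second side information.
import torch
import torch.nn as nn

hyper_encoder = nn.Sequential(
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
)
second_side_info = hyper_encoder(torch.randn(1, 8, 16, 16))  # shape (1, 16, 4, 4)
```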
  • the third code stream can be sent to the decoding side alone, or the third code stream can be added to the first code stream and sent to the decoding side; the embodiment of the present application does not limit the sending method of the third code stream.
  • the method further includes: acquiring the first data among the multiple data; estimating a fourth estimated probability distribution according to preset information; and performing entropy encoding on the first data according to the fourth estimated probability distribution to obtain a fourth code stream.
  • the fourth estimated probability distribution may be obtained by estimating according to preset information.
  • alternatively, a learnable model obtained through training may be used to estimate the fourth estimated probability distribution.
  • the embodiment of the present application does not limit the manner of obtaining the fourth estimated probability distribution.
  • the self-attention encoding network may use, for example, a Transformer encoder, and the self-attention decoding network may use, for example, a Transformer decoder.
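  • For example, standard PyTorch Transformer blocks could realize both paths (a hedged sketch; the hyperparameters and the choice of a causally masked encoder stack are assumptions, not the application's architecture):

```python
# Assumed sketch: one Transformer stack, used bidirectionally for side
# information and causally for context extraction.
import torch
import torch.nn as nn

d_model = 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(1, 8, d_model)  # 8 embedded data units

# Side-information path: full self-attention over all input data.
side_features = encoder(x)

# Context path: a causal mask so position i attends only to coded data < i.
causal = nn.Transformer.generate_square_subsequent_mask(8)
context_features = encoder(x, mask=causal)
```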
  • the first code stream may refer to the first coded bit stream (first encoded bitstream)
  • the second code stream may refer to the second coded bit stream (second encoded bitstream)
  • the third code stream may refer to the third coded bit stream (third encoded bitstream)
  • the fourth code stream may refer to the fourth encoded bitstream.
  • According to a second aspect, the present application provides an entropy decoding method. The method includes: acquiring a first code stream; acquiring reference information, where the reference information includes at least one of first context information and decoded first side information, the first context information being obtained by inputting at least one piece of decoded data into a self-attention decoding network, and the decoded first side information being obtained by performing entropy decoding on a second code stream; estimating a first estimated probability distribution according to the reference information; and performing entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
  • The received first code stream is obtained by performing entropy encoding on the data to be encoded according to a first estimated probability distribution, and the first estimated probability distribution is obtained based on reference information, which may include at least one of the first context information and the decoded first side information. The self-attention decoding network can weight all the input encoded data with corresponding weights to obtain the first context information. In this way, the utilization rate of encoded data in the process of acquiring the first context information is improved.
  • In this way, the accuracy of the obtained first estimated probability distribution can be improved and the code rate in the entropy encoding process can be reduced, thereby reducing the bandwidth occupied when transmitting the first code stream to the decoding side and improving the transmission efficiency of the first code stream.
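  • The decoder stays in sync with the encoder by re-estimating the same distribution before decoding each symbol. The following sketch shows this pseudo-flow with crude stand-in helpers (all names and bodies here are assumptions, not the application's code):

```python
# Assumed sketch of the autoregressive decode loop; the two helpers are
# trivial stand-ins for the estimation network and the entropy decoder.
from typing import List

def estimate_distribution(decoded: List[int], side_info: int) -> float:
    # Stand-in for the probability distribution estimation network.
    return (sum(decoded) + side_info) % 7 / 7.0 + 0.1

def entropy_decode_symbol(bitstream: List[int], dist: float) -> int:
    # Stand-in for an entropy decoder driven by the estimated distribution.
    return bitstream.pop(0)

def decode_stream(bitstream: List[int], first_symbol: int, side_info: int) -> List[int]:
    decoded = [first_symbol]        # the first data comes from its own stream
    while bitstream:
        # Same reference information as the encoder used, so the estimated
        # probability distributions match and decoding stays in sync.
        dist = estimate_distribution(decoded, side_info)
        decoded.append(entropy_decode_symbol(bitstream, dist))
    return decoded

print(decode_stream([4, 1, 3], first_symbol=2, side_info=5))
```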
  • the acquiring reference information further includes: acquiring a second code stream; estimating a second estimated probability distribution; and performing entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information; correspondingly, the reference information further includes the decoded first side information.
  • the second estimated probability distribution estimated by the decoding side needs to be consistent with the second estimated probability distribution estimated by the encoding side.
  • the reference information may also include at least one of the second context information and the decoded second side information, so the following cases are possible:
  • the reference information includes the first context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information and the decoded first side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the decoded first side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information and the second context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the second context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the decoded first side information and the second context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information and the second context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information and the decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information and the decoded second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the decoded first side information and the decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information and the decoded second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the second context information and the decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the second context information and the decoded second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the first context information, the decoded first side information, the second context information and the decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information, the second context information and the decoded second side information into the probability distribution estimation network to obtain the probability distribution Estimate a first estimated probability distribution for the output of the network.
  • the reference information includes the decoded first side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the decoded first side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the decoded first side information and context information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the decoded first side information and context information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes the decoded first side information and the decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the decoded first side information and the decoded second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information includes decoded first side information, context information and decoded second side information
  • estimating and obtaining the first estimated probability distribution according to the reference information may include: inputting the decoded first side information, the context information and the decoded second side information into the probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information further includes second context information
  • the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network.
  • the acquiring reference information further includes: acquiring a third code stream; estimating a third estimated probability distribution; and performing entropy decoding on the third code stream according to the third estimated probability distribution to obtain the decoded second side information; correspondingly, the reference information further includes the decoded second side information.
  • the third estimated probability distribution estimated by the decoding side needs to be consistent with the third estimated probability distribution estimated by the encoding side.
  • the method further includes: acquiring a fourth code stream; estimating a fourth estimated probability distribution according to preset information; and performing entropy decoding on the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, where the decoded first data is the first decoded data among the multiple data.
  • the fourth estimated probability distribution estimated by the decoding side needs to be consistent with the fourth estimated probability distribution estimated by the encoding side.
  • the present application provides an entropy encoding device, which includes: an acquisition module, configured to acquire data to be encoded, where the data to be encoded is non-first data among multiple data included in the current data stream, and to acquire reference information, where the reference information includes at least one of first context information and first side information, the first context information being obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information being obtained by inputting the multiple data into a self-attention encoding network; an estimation module, configured to estimate a first estimated probability distribution according to the reference information; and an encoding module, configured to perform entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • the reference information specifically includes the first context information and the first side information; the estimation module is specifically configured to input the first context information and the first side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the first side information, and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information, the first side information, and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information and second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into a hyper encoder network; the estimation module is specifically configured to input the first context information and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the first side information, and the second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into a hyper encoder network; the estimation module is specifically configured to input the first context information, the first side information, and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the second context information, and the second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into the hyper encoder network, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into the masked convolutional network; the estimation module is specifically configured to input the first context information, the second context information, and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the first side information, the second context information, and the second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into the hyper encoder network, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into the masked convolutional network; the estimation module is specifically configured to input the first context information, the first side information, the second context information, and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first side information and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into a masked convolutional network; the estimation module is specifically configured to input the first side information and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first side information and the second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into a hyper encoder network; the estimation module is specifically configured to input the first side information and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first side information, the second context information, and the second side information, where the second side information is obtained by inputting at least one piece of data that meets a preset condition among the multiple data into the hyper encoder network, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one encoded data into the masked convolutional network; the estimation module is specifically configured to input the first side information, the second context information, and the second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the estimation module is further configured to estimate a second estimated probability distribution; the encoding module is further configured to perform entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
  • the estimation module is further configured to estimate a third estimated probability distribution; the encoding module is further configured to perform entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
  • the acquisition module is further configured to acquire the first data among the multiple data; the estimation module is further configured to estimate a fourth estimated probability distribution according to preset information; and the encoding module is further configured to perform entropy encoding on the first data according to the fourth estimated probability distribution to obtain a fourth code stream.
  • the present application provides an entropy decoding device, the device comprising: an acquisition module, configured to acquire a first code stream and to acquire reference information, where the reference information includes at least one of first context information and decoded first side information, the first context information being obtained by inputting at least one piece of decoded data into a self-attention decoding network, and the decoded first side information being obtained by performing entropy decoding on a second code stream; an estimation module, configured to estimate a first estimated probability distribution according to the reference information; and a decoding module, configured to perform entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
  • the reference information specifically includes the first context information and the decoded first side information; the estimation module is specifically configured to input the first context information and the decoded first side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the decoded first side information, and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information, the decoded first side information, and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream;
  • the estimation module is specifically configured to input the first context information and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the decoded first side information, and the decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream; the estimation module is specifically configured to input the first context information, the decoded first side information, and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the second context information, and the decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information, the second context information, and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the first context information, the decoded first side information, the second context information, and the decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the first context information, the decoded first side information, the second context information, and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the decoded first side information and second context information, where the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the decoded first side information and the second context information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the decoded first side information and the decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream; the estimation module is specifically configured to input the decoded first side information and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the reference information specifically includes the decoded first side information, the second context information, and the decoded second side information, where the decoded second side information is obtained by performing entropy decoding on the third code stream, and the second context information is obtained by inputting at least one piece of data that meets a preset condition among the at least one decoded data into a masked convolutional network; the estimation module is specifically configured to input the decoded first side information, the second context information, and the decoded second side information into a probability distribution estimation network, so as to obtain the first estimated probability distribution output by the probability distribution estimation network.
  • the acquisition module is further configured to acquire the second code stream; the estimation module is further configured to estimate a second estimated probability distribution; and the decoding module is further configured to perform entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
  • the acquisition module is further configured to acquire the third code stream; the estimation module is further configured to estimate a third estimated probability distribution; and the decoding module is further configured to perform entropy decoding on the third code stream according to the third estimated probability distribution to obtain the decoded second side information.
  • the acquisition module is further configured to acquire a fourth code stream; the estimation module is further configured to estimate a fourth estimated probability distribution according to preset information; and the decoding module is further configured to perform entropy decoding on the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, where the decoded first data is the first decoded data among the multiple data.
  • the present application provides an entropy encoding device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the first aspect above.
  • the present application provides an entropy decoding device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the second aspect above.
  • the present application provides a computer-readable storage medium, including a computer program; when the computer program is executed on a computer, the computer is caused to execute the method described in any one of the first to second aspects above.
  • the present application provides a computer program product; the computer program product includes computer program code, and when the computer program code is run on a computer, the computer is caused to execute the method described in any one of the first to second aspects above.
  • FIG. 1 is an exemplary block diagram of a decoding system 10 provided in an embodiment of the present application.
  • FIG. 2 is an exemplary block diagram of a video encoder provided in an embodiment of the present application.
  • FIG. 3 is an exemplary block diagram of a video decoder provided in an embodiment of the present application.
  • FIG. 4 is an exemplary schematic diagram of a candidate image block provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another application scenario provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an encoder in an end-to-end encoding and decoding architecture provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a decoder in an end-to-end codec architecture provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an encoder provided in an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a decoder provided in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an encoder provided in an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a decoder provided in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an encoder provided in an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a decoder provided in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a self-attention structure provided by the embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a self-attention encoding network provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a self-attention decoding network provided by an embodiment of the present application.
  • FIG. 18 is a flowchart of a process 100 of the entropy encoding method provided by the embodiment of the present application.
  • FIG. 19 is a flow chart of the process 200 of the entropy decoding method provided by the embodiment of the present application.
  • FIG. 20 is a flowchart of a process 300 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • FIG. 21 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • FIG. 22 is a schematic diagram of entropy coding performance provided by the embodiment of the present application.
  • FIG. 23 is a flow chart of the process 400 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • FIG. 24 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • FIG. 25 is a flowchart of a process 500 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • FIG. 26 is a flow chart of the process 600 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • FIG. 27 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • the embodiment of the present application provides an AI-based data compression/decompression technology, in particular a neural-network-based data compression/decompression technology, and specifically provides an entropy encoding and decoding technology to improve the traditional hybrid data encoding and decoding system.
  • Data encoding and decoding includes two parts: data encoding and data decoding.
  • Data encoding is performed on the source side (or commonly referred to as the encoder side), and typically involves processing (eg, compressing) raw data to reduce the amount of data needed to represent that raw data (and thus more efficient storage and/or transmission).
  • Data decoding is performed on the destination side (or commonly referred to as the decoder side), and usually involves inverse processing relative to the encoder side to reconstruct the original data.
  • the "codec" of data involved in the embodiments should be understood as “encoding” or “decoding” of data.
  • the encoding part and the decoding part are also collectively referred to as codec (encoding and decoding, CODEC).
  • In the case of lossless data compression, the original data can be reconstructed, i.e., the reconstructed data has the same quality as the original data (assuming no transmission loss or other data loss during storage or transmission).
  • In the case of lossy data compression, further compression is performed by quantization and the like to reduce the amount of data required to represent the original data, and the decoder side cannot completely reconstruct the original data, i.e., the quality of the reconstructed data is lower or worse than that of the original data.
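  • A tiny numeric illustration of why quantization makes compression lossy (the step size and values below are arbitrary): the coarser the step, the fewer distinct values need to be coded, but the rounding cannot be undone at the decoder.

```python
# Arbitrary values; uniform quantization trades reconstruction error for bits.
original = [0.84, -1.37, 2.05]
step = 0.5
quantized = [round(v / step) for v in original]    # fewer distinct values to code
reconstructed = [q * step for q in quantized]      # the decoder cannot undo rounding
print(quantized)        # [2, -3, 4]
print(reconstructed)    # [1.0, -1.5, 2.0] != original
```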
  • the embodiments of the present application may be applied to video data, image data, audio data, integer data, and other data that require compression/decompression.
  • The following takes the encoding of video data (referred to as video coding) as an example; other types of data may refer to the following description and will not be described in detail in this embodiment of the present application. It should be noted that, compared with video coding, the coding process of data such as audio data and integer data does not need to divide the data into blocks; the data can be coded directly.
  • Video coding generally refers to the processing of sequences of images that form a video or video sequence.
  • In the field of video coding, the terms "picture", "frame" or "image" may be used as synonyms.
  • Video coding standards belong to "lossy hybrid video codecs" (i.e., combining spatial and temporal prediction in the pixel domain with 2D transform coding in the transform domain for applying quantization).
  • Each image in a video sequence is usually partitioned into a set of non-overlapping blocks, and coding is usually performed at the block level.
  • On the encoder side, video is usually processed, i.e., encoded, at the block (video block) level, for example, by generating a predicted block through spatial (intra) prediction and temporal (inter) prediction, subtracting the predicted block from the current block (the block currently being processed or to be processed) to obtain a residual block, and transforming and quantizing the residual block in the transform domain to reduce the amount of data to be transmitted (compressed). The decoder side applies the inverse processing relative to the encoder to the encoded or compressed block to reconstruct the current block for representation.
• Furthermore, the encoder needs to repeat the decoder's processing steps so that the encoder and decoder generate the same predictions (e.g., intra and inter predictions) and/or reconstructed pixels for processing, i.e., encoding, subsequent blocks.
  • the encoder 20 and the decoder 30 are described with reference to FIGS. 1-3 .
• FIG. 1 is an exemplary block diagram of a decoding system 10 provided by an embodiment of the present application, for example, a video decoding system 10 (or simply referred to as the decoding system 10) that can utilize the technology of the present application.
• The video encoder 20 (or simply encoder 20) and the video decoder 30 (or simply decoder 30) in the video coding system 10 represent devices that may be used to perform techniques according to the various examples described in this application.
• The decoding system 10 includes a source device 12 for providing encoded image data 21, such as encoded images, to a destination device 14 for decoding the encoded image data 21.
• The source device 12 includes an encoder 20 and, optionally, an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
• The image source 16 may include or be any type of image capture device, for example for capturing real-world images, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images or computer-generated images (e.g., screen content, virtual reality (VR) images, and/or any combination thereof, e.g., augmented reality (AR) images).
  • the image source may be any type of memory or storage that stores any of the above images.
• The image (or image data) 17 may also be referred to as the original image (or original image data) 17.
• The preprocessor 18 is used to receive the original image data 17 and perform preprocessing on the original image data 17 to obtain a preprocessed image (or preprocessed image data) 19.
• The preprocessing performed by the preprocessor 18 may include cropping, color format conversion (e.g., from RGB to YCbCr), color grading, or denoising. It can be understood that the preprocessing unit 18 may be an optional component.
  • a video encoder (or encoder) 20 is used to receive preprocessed image data 19 and provide encoded image data 21 (to be further described below with reference to FIG. 2 etc.).
• The communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version thereof) via the communication channel 13 to another device such as the destination device 14, or to any other device, for storage or direct reconstruction.
• The destination device 14 includes a decoder 30, and may also optionally include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.
• The communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version thereof) directly from the source device 12 or from any other source device such as a storage device (for example, an encoded-image-data storage device), and to provide the encoded image data 21 to the decoder 30.
• The communication interface 22 and the communication interface 28 may be used to send or receive the encoded image data (or encoded data) 21 through a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or through any type of network, such as a wired network, a wireless network, or any combination thereof, or any type of private or public network or any combination thereof.
• The communication interface 22 can be used to encapsulate the encoded image data 21 into a suitable format such as packets, and/or process the encoded image data using any type of transmission encoding or processing, so that it can be transmitted over a communication link or communication network.
• The communication interface 28 corresponds to the communication interface 22 and, for example, can be used to receive the transmitted data and process the transmitted data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain the encoded image data 21.
• Both the communication interface 22 and the communication interface 28 can be configured as one-way communication interfaces, as indicated by the arrow pointing from the source device 12 to the communication channel 13 of the destination device 14 in FIG. 1, or as two-way communication interfaces, and can be used to send and receive messages and the like, to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or data transmission, such as the transmission of encoded image data.
• The video decoder (or decoder) 30 is used to receive the encoded image data 21 and provide decoded image data 31 (to be further described below with reference to FIG. 3, etc.).
  • the post-processor 32 is used to perform post-processing on decoded image data 31 (also referred to as reconstructed image data) such as a decoded image to obtain post-processed image data 33 such as a post-processed image.
• The post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), color grading, cropping, or resampling, or any other processing for producing the decoded image data 31 for display by the display device 34 or the like.
  • the display device 34 is used to receive the post-processed image data 33 to display the image to a user or viewer or the like.
  • Display device 34 may be or include any type of display for representing the reconstructed image, eg, an integrated or external display screen or display.
• The display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
• The decoding system 10 also includes a training engine 25. The training engine 25 is used to train the encoder 20 (in particular the entropy encoding unit 270 in the encoder 20) or the decoder 30 (in particular the entropy decoding unit 304 in the decoder 30) to process an input image or image region or image block to obtain reference information of the image block to be encoded, to process the input reference information to estimate the estimated probability distribution of the image block to be encoded, and to perform entropy encoding according to the estimated probability distribution. For a detailed description of the training engine 25, please refer to the method embodiments below.
• Although FIG. 1 shows the source device 12 and the destination device 14 as independent devices, a device embodiment may also include both the source device 12 and the destination device 14, or the functionality of both, i.e., the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In such embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
  • FIG. 2 is an exemplary block diagram of a video encoder provided in an embodiment of the present application
  • FIG. 3 is an exemplary block diagram of a video decoder provided in an embodiment of the present application
  • Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of FIG. 2 and/or any other encoder system or subsystem described herein.
  • Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein.
  • the processing circuitry 46 may be used to perform various operations discussed below.
  • the device can store software instructions in a suitable non-transitory computer-readable storage medium, and use one or more processors to execute the instructions in hardware, thereby implementing the technology of the present application.
• Either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined encoder/decoder (CODEC).
• The source device 12 and the destination device 14 may comprise any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a cell phone, a smartphone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (such as a content service server or a content distribution server), a broadcast receiving device, a broadcast transmitting device, or a monitoring device, and may use no operating system or any type of operating system.
  • the source device 12 and the destination device 14 may also be devices in a cloud computing scenario, such as virtual machines in a cloud computing scenario.
  • source device 12 and destination device 14 may be equipped with components for wireless communication. Accordingly, source device 12 and destination device 14 may be wireless communication devices.
• The source device 12 and the destination device 14 may install a virtual scene application (APP) such as a virtual reality (VR) application, an augmented reality (AR) application, or a mixed reality (MR) application, and may run the VR application, AR application, or MR application based on user operations (such as clicking, touching, sliding, shaking, or voice control).
  • the source device 12 and the destination device 14 can collect images/videos of any objects in the environment through cameras and/or sensors, and then display virtual objects on the display device according to the collected images/videos.
• The virtual objects may be virtual objects in a VR scene, an AR scene, or an MR scene (that is, objects in a virtual environment).
• The virtual scene applications in the source device 12 and the destination device 14 may be applications built into the source device 12 and the destination device 14, or may be applications provided by a third-party service provider and installed by the user; this is not specifically limited here.
  • source device 12 and destination device 14 may install real-time video transmission applications, such as live broadcast applications.
  • the source device 12 and the destination device 14 can collect images/videos through cameras, and then display the collected images/videos on a display device.
• The video coding system 10 shown in FIG. 1 is merely exemplary, and the techniques provided herein are applicable to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
• In other examples, data is retrieved from local memory, sent over a network, and so on.
  • a video encoding device may encode and store data into memory, and/or a video decoding device may retrieve and decode data from memory.
  • encoding and decoding are performed by devices that do not communicate with each other but simply encode data to memory and/or retrieve and decode data from memory.
• A video coding system may include an imaging device, a video encoder, a video decoder (and/or a video encoder/decoder implemented by a processing circuit), an antenna, one or more processors, one or more memory stores, and/or a display device.
  • Imaging devices, antennas, processing circuits, video encoders, video decoders, processors, memory storage and/or display devices can communicate with each other.
  • a video coding system may include only a video encoder or only a video decoder.
  • an antenna may be used to transmit or receive an encoded bitstream of video data.
  • a display device may be used to present video data.
  • the processing circuit may include application-specific integrated circuit (application-specific integrated circuit, ASIC) logic, a graphics processor, a general-purpose processor, and the like.
  • the video decoding system may also include an optional processor, and the optional processor may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
• The memory storage can be any type of memory, such as volatile memory (for example, static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (for example, flash memory, etc.), and the like.
  • memory storage may be implemented by cache memory.
• The processing circuitry may include memory (e.g., cache memory, etc.) for implementing an image buffer, etc.
• The video encoder 20 implemented with logic circuitry may include an image buffer (e.g., implemented with processing circuitry or memory storage) and a graphics processing unit (e.g., implemented with processing circuitry).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • Graphics processing unit may include video encoder 20 implemented with processing circuitry to implement the various modules discussed with reference to FIG. 2 and/or any other encoder system or subsystem described herein.
  • Logic circuits may be used to perform the various operations discussed herein.
• Video decoder 30 may be implemented by processing circuitry in a similar manner to implement the various modules discussed with reference to video decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein.
• The video decoder 30 implemented with logic circuitry may include an image buffer (e.g., implemented by processing circuitry or memory storage) and a graphics processing unit (e.g., implemented by processing circuitry).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • the graphics processing unit may include video decoder 30 implemented by processing circuitry to implement the various modules discussed with reference to FIG. 3 and/or any other decoder system or subsystem described herein.
  • an antenna may be used to receive an encoded bitstream of video data.
• An encoded bitstream may contain data related to encoded video frames, indicators, index values, mode selection data, and the like, as discussed herein, such as data related to encoding partitions (e.g., transform coefficients or quantized transform coefficients, an optional indicator (as discussed), and/or data defining an encoding partition).
• The video coding system may also include a video decoder 30 coupled to the antenna and used to decode the encoded bitstream. The display device is used to present the video frames.
  • the video decoder 30 may be used to perform a reverse process.
  • the video decoder 30 may be configured to receive and parse such syntax elements and decode the associated video data accordingly.
  • video encoder 20 may entropy encode the syntax elements into an encoded video bitstream.
  • video decoder 30 may parse such syntax elements and decode the related video data accordingly.
  • the video encoder 20 includes an input terminal (or input interface) 201, a residual calculation unit 204, a transformation processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transformation processing unit 212, a reconstruction unit 214, Loop filter 220 , decoded picture buffer (decoded picture buffer, DPB) 230 , mode selection unit 260 , entropy coding unit 270 and output terminal (or output interface) 272 .
  • Mode selection unit 260 may include inter prediction unit 244 , intra prediction unit 254 , and partition unit 262 .
  • Inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown).
  • the video encoder 20 shown in FIG. 2 may also be called a hybrid video encoder or a video encoder based on a hybrid video codec.
  • the inter-frame prediction unit is a trained target model (also called a neural network), and the neural network is used to process an input image or an image region or an image block to generate a prediction value of the input image block.
  • a neural network for inter-frame prediction is used to receive an input image or image region or image block and generate a prediction value for the input image or image region or image block.
• The residual calculation unit 204, the transform processing unit 206, the quantization unit 208, and the mode selection unit 260 constitute the forward signal path of the encoder 20, while the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded picture buffer (DPB) 230, the inter prediction unit 244, and the intra prediction unit 254 constitute the backward signal path of the encoder, where the backward signal path of the encoder 20 corresponds to the signal path of the decoder (see decoder 30 in FIG. 3).
  • Inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, loop filter 220, decoded picture buffer 230, inter prediction unit 244, and intra prediction unit 254 also make up the "built-in decoder" of video encoder 20 .
• The encoder 20 is operable to receive, via an input 201 or the like, an image (or image data) 17, e.g., an image in a sequence of images forming a video or video sequence.
  • the received image or image data may also be a preprocessed image (or preprocessed image data) 19 .
• The image 17 may also be referred to as the current image or the image to be encoded (especially in video coding, to distinguish the current image from other images, such as previously encoded and/or decoded images of the same video sequence, i.e., the video sequence that also includes the current image).
• A (digital) image is or can be viewed as a two-dimensional array or matrix of pixel points with intensity values. Pixel points in the array may also be referred to as pixels (pixel or pel, short for picture element). The number of pixels in the horizontal and vertical directions (or axes) of the array or image determines the size and/or resolution of the image. To represent a color, three color components are usually used, i.e., an image can be represented as or include three pixel arrays. In the RGB format or color space, an image includes corresponding red, green, and blue pixel arrays.
  • each pixel is usually expressed in a luminance/chroma format or color space, such as YCbCr, including a luminance component indicated by Y (sometimes also indicated by L) and two chrominance components indicated by Cb and Cr.
  • the luminance (luma) component Y represents brightness or grayscale level intensity (e.g., both are the same in a grayscale image), while the two chrominance (chroma) components Cb and Cr represent chrominance or color information components .
  • an image in the YCbCr format includes a luminance pixel point array of luminance pixel point values (Y) and two chrominance pixel point arrays of chrominance values (Cb and Cr).
• Images in RGB format can be converted or transformed into YCbCr format and vice versa; this process is also known as color transformation or conversion. If an image is black and white, the image may include only an array of luminance pixels. Correspondingly, an image can be, for example, an array of luma pixels in monochrome format, or an array of luma pixels and two corresponding arrays of chroma pixels in the 4:2:0, 4:2:2, or 4:4:4 color format.
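• As an illustration of such a color conversion, the following Python sketch converts an RGB image to YCbCr using BT.601-style coefficients (the coefficients and the 8-bit value range are assumptions for illustration; the exact conversion depends on the color space definition in use):
```python
import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 array of 8-bit RGB values to YCbCr."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    y  = 0.299 * r + 0.587 * g + 0.114 * b  # luma component (Y)
    cb = 0.564 * (b - y) + 128.0            # blue-difference chroma (Cb)
    cr = 0.713 * (r - y) + 128.0            # red-difference chroma (Cr)
    ycbcr = np.stack([y, cb, cr], axis=-1)
    return np.clip(ycbcr.round(), 0, 255).astype(np.uint8)

image = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
print(rgb_to_ycbcr(image).shape)  # (4, 4, 3)
```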
  • an embodiment of the video encoder 20 may include an image segmentation unit (not shown in FIG. 2 ) for segmenting the image 17 into a plurality of (typically non-overlapping) image blocks 203 .
• These blocks may also be referred to as root blocks, macroblocks (H.264/AVC), coding tree blocks (CTBs), or coding tree units (CTUs) in the H.265/HEVC and VVC standards.
• The segmentation unit can be used to use the same block size for all images of a video sequence and a corresponding grid defining the block size, or to change the block size between images or between subsets or groups of images and segment each image into corresponding blocks.
• The video encoder may be adapted to directly receive the blocks 203 of the image 17, for example one, several, or all of the blocks making up the image 17.
  • the image block 203 may also be referred to as a current image block or an image block to be encoded.
• Like the image 17, the image block 203 is also or can be regarded as a two-dimensional array or matrix of pixels with intensity values (pixel values), but the image block 203 is smaller in size than the image 17.
• In other words, the block 203 may comprise one pixel array (for example, a luma array in the case of a monochrome image 17, or a luma array or a chroma array in the case of a color image), or three pixel arrays (for example, one luma array and two chroma arrays in the case of a color image 17), or any other number and/or type of arrays depending on the color format employed.
  • a block may be an array of M ⁇ N (M columns ⁇ N rows) pixel points, or an array of M ⁇ N transform coefficients, and the like.
  • the video encoder 20 shown in FIG. 2 is used to encode the image 17 block by block, eg, performing encoding and prediction on each block 203 .
• The video encoder 20 shown in FIG. 2 can also be used to segment and/or encode an image using slices (also called video slices), where an image can be segmented or encoded using one or more (typically non-overlapping) slices.
• Each slice may include one or more blocks (for example, coding tree units, CTUs) or one or more block groups (for example, tiles in the H.265/HEVC/VVC standards or bricks in the VVC standard).
• The video encoder 20 shown in FIG. 2 can also be configured to segment and/or encode an image using slices/coding block groups (also called video coding block groups) and/or coding blocks (also called video coding blocks), where an image may be segmented or encoded using one or more (usually non-overlapping) slices/coding block groups; each slice/coding block group may consist of one or more blocks (such as CTUs) or one or more coding blocks, and each coding block may be rectangular or the like in shape and may include one or more complete or partial blocks (such as CTUs).
• The residual calculation unit 204 is used to calculate the residual block 205 from the image block (or original block) 203 and the prediction block 265 (the prediction block 265 is described in detail later), for example, by subtracting the pixel values of the prediction block 265 from the pixel values of the image block 203 pixel by pixel, to obtain the residual block 205 in the pixel domain.
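• As a minimal illustration, the following sketch performs this pixel-by-pixel subtraction (the block size and values are illustrative assumptions):
```python
import numpy as np

# Current block 203 and prediction block 265 (illustrative 8x8 values).
current_block    = np.random.randint(0, 256, (8, 8)).astype(np.int16)
prediction_block = np.random.randint(0, 256, (8, 8)).astype(np.int16)

# Pixel-domain residual block 205: residuals can be negative, so a signed
# integer type is used.
residual_block = current_block - prediction_block
print(residual_block.min(), residual_block.max())
```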
  • the transform processing unit 206 is configured to perform discrete cosine transform (discrete cosine transform, DCT) or discrete sine transform (discrete sine transform, DST) etc. on the pixel point values of the residual block 205 to obtain transform coefficients 207 in the transform domain.
  • the transform coefficients 207 may also be referred to as transform residual coefficients, representing the residual block 205 in the transform domain.
  • Transform processing unit 206 may be configured to apply an integer approximation of DCT/DST, such as the transform specified for H.265/HEVC. This integer approximation is usually scaled by some factor compared to the orthogonal DCT transform. To maintain the norm of the forward and inverse transformed residual blocks, other scaling factors are used as part of the transformation process. The scaling factor is usually chosen according to certain constraints, such as the scaling factor being a power of 2 for the shift operation, the bit depth of the transform coefficients, the trade-off between accuracy and implementation cost, etc.
• For example, a specific scaling factor may be specified for the inverse transform performed on the encoder 20 side by the inverse transform processing unit 212 (and for the corresponding inverse transform performed on the decoder 30 side by, for example, the inverse transform processing unit 312), and, correspondingly, a corresponding scaling factor may be specified for the forward transform performed on the encoder 20 side by the transform processing unit 206.
  • the video encoder 20 (correspondingly, the transform processing unit 206) can be used to output transform parameters such as one or more transform types, for example, directly output or output after encoding or compression by the entropy encoding unit 270 , for example, so that the video decoder 30 can receive and use the transformation parameters for decoding.
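• For illustration, the following sketch applies an orthonormal floating-point 2-D DCT-II to a residual block and inverts it exactly; as noted above, practical codecs such as H.265/HEVC use scaled integer approximations of this transform rather than this floating-point form:
```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix (floating point, for illustration)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row scaling keeps the matrix orthonormal
    return c

n = 8
C = dct_matrix(n)
residual = np.random.randint(-64, 64, (n, n)).astype(float)

coeffs = C @ residual @ C.T        # forward 2-D transform (transform domain)
reconstructed = C.T @ coeffs @ C   # inverse 2-D transform (pixel domain)
assert np.allclose(reconstructed, residual)  # orthonormal: norm is preserved
```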
  • the quantization unit 208 is configured to quantize the transform coefficient 207 by, for example, scalar quantization or vector quantization, to obtain a quantized transform coefficient 209 .
  • Quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209 .
  • the quantization process may reduce the bit depth associated with some or all of the transform coefficients 207 .
  • n-bit transform coefficients may be rounded down to m-bit transform coefficients during quantization, where n is greater than m.
  • the degree of quantization can be modified by adjusting a quantization parameter (quantization parameter, QP).
  • a smaller quantization step size corresponds to finer quantization
  • a larger quantization step size corresponds to coarser quantization.
  • a suitable quantization step size can be indicated by a quantization parameter (quantization parameter, QP).
  • a quantization parameter may be an index to a predefined set of suitable quantization step sizes.
  • Quantization may include dividing by a quantization step size, while corresponding or inverse dequantization performed by the inverse quantization unit 210 or the like may include multiplying by a quantization step size.
• In embodiments according to some standards such as HEVC, the quantization parameter may be used to determine the quantization step size.
  • the quantization step size can be calculated from the quantization parameter using a fixed-point approximation of an equation involving division.
• In an embodiment, the video encoder 20 (correspondingly, the quantization unit 208) can be used to output a quantization parameter (QP), for example, output directly or output after being encoded or compressed by the entropy encoding unit 270, so that the video decoder 30 can receive and use the quantization parameter for decoding.
• The inverse quantization unit 210 is used to apply the inverse quantization of the quantization unit 208 to the quantized coefficients to obtain the dequantized coefficients 211, for example, to apply, according to or using the same quantization step size as the quantization unit 208, the inverse of the quantization scheme applied by the quantization unit 208.
• The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211, corresponding to the transform coefficients 207, although, due to the loss caused by quantization, the dequantized coefficients 211 are usually not exactly the same as the transform coefficients.
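• The following sketch illustrates quantization and inverse quantization with a step size derived from the QP, assuming the HEVC-style relation Qstep ≈ 2^((QP−4)/6), i.e., the step size roughly doubles every 6 QP values (practical implementations use fixed-point integer approximations instead of this floating-point division):
```python
import numpy as np

def q_step(qp: int) -> float:
    # HEVC-style relation: the quantization step size roughly doubles
    # every 6 QP values (floating point here for illustration only).
    return 2.0 ** ((qp - 4) / 6.0)

step = q_step(30)                                   # ~20.16
coeffs = np.array([100.0, -37.0, 12.0, -3.0, 1.0])  # transform coefficients 207

quantized   = np.round(coeffs / step)  # quantized transform coefficients 209
dequantized = quantized * step         # dequantized coefficients 211

# Quantization is lossy: the dequantized coefficients are close to, but
# usually not exactly the same as, the original transform coefficients.
print(quantized)     # [ 5. -2.  1. -0.  0.]
print(dequantized)
```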
• The inverse transform processing unit 212 is configured to apply an inverse transform of the transform applied by the transform processing unit 206, for example, an inverse discrete cosine transform (DCT) or an inverse discrete sine transform (DST), to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the pixel domain.
• The reconstructed residual block 213 may also be referred to as a transform block 213.
• The reconstruction unit 214 (e.g., summer 214) is used to add the transform block 213 (i.e., the reconstructed residual block 213) to the prediction block 265 to obtain the reconstruction block 215 in the pixel domain, for example, by adding the pixel values of the reconstructed residual block 213 to the pixel values of the prediction block 265.
  • the loop filter unit 220 (or “loop filter” 220 for short) is used to filter the reconstructed block 215 to obtain the filtered block 221, or generally used to filter the reconstructed pixels to obtain filtered pixel values.
  • a loop filter unit is used to smooth pixel transitions or improve video quality.
• The loop filter unit 220 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof.
  • the loop filter unit 220 may include a deblocking filter, an SAO filter, and an ALF filter.
  • the order of the filtering process may be deblocking filter, SAO filter and ALF filter.
• In another example, a process called luma mapping with chroma scaling (LMCS) (i.e., an adaptive in-loop reshaper) is added. This process is performed before deblocking.
• In another example, the deblocking filtering process can also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges.
• Although the loop filter unit 220 is shown in FIG. 2 as an in-loop filter, in other configurations the loop filter unit 220 may be implemented as a post-loop filter.
• The filtered block 221 may also be referred to as a filtered reconstruction block 221.
• In an embodiment, the video encoder 20 (correspondingly, the loop filter unit 220) can be used to output loop filter parameters (such as SAO filter parameters, ALF filter parameters, or LMCS parameters), for example, output directly or entropy-encoded by the entropy encoding unit 270 before output, so that the decoder 30 can receive and use the same or different loop filter parameters for decoding.
  • a decoded picture buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by the video encoder 20 when encoding video data.
  • the DPB 230 may be formed from any of a variety of memory devices, such as dynamic random access memory (DRAM), including synchronous DRAM (synchronous DRAM, SDRAM), magnetoresistive RAM (magnetoresistive RAM, MRAM), Resistive RAM (resistive RAM, RRAM) or other types of storage devices.
• The decoded picture buffer 230 may be used to store one or more filtered blocks 221.
  • the decoded picture buffer 230 may also be used to store other previously filtered blocks, such as the previously reconstructed and filtered block 221, of the same current picture or a different picture such as a previous reconstructed picture, and may provide the complete previously reconstructed, i.e. decoded picture (and the corresponding reference blocks and pixels) and/or a partially reconstructed current image (and corresponding reference blocks and pixels), for example for inter-frame prediction.
• The decoded image buffer 230 can also be used to store one or more unfiltered reconstruction blocks 215, or generally unfiltered reconstructed pixels, for example, reconstruction blocks 215 that have not been filtered by the loop filter unit 220, or reconstruction blocks or reconstructed pixels that have not undergone any other processing.
• The mode selection unit 260 includes a segmentation unit 262, an inter prediction unit 244, and an intra prediction unit 254, and is used to receive or obtain original image data such as the block 203 (the current block 203 of the current image 17) and reconstructed image data, e.g., filtered and/or unfiltered reconstructed pixels or reconstruction blocks of the same (current) image and/or of one or more previously decoded images.
• The reconstructed image data is used as reference image data required for prediction, such as inter prediction or intra prediction, to obtain a prediction block 265 or a prediction value 265.
• The mode selection unit 260 can be used to determine or select a partitioning for the current block (including no partitioning) and a prediction mode (such as an intra or inter prediction mode), and to generate a corresponding prediction block 265, which is used to calculate the residual block 205 and to reconstruct the reconstruction block 215.
  • mode selection unit 260 is operable to select a partitioning and prediction mode (e.g., from among the prediction modes supported or available by mode selection unit 260) that provides the best match or the smallest residual (minimum Residual refers to better compression in transmission or storage), or provides minimal signaling overhead (minimum signaling overhead refers to better compression in transmission or storage), or considers or balances both of the above.
  • the mode selection unit 260 may be configured to determine the partition and prediction mode according to rate distortion optimization (RDO), that is, to select the prediction mode that provides the minimum rate distortion optimization.
• The terms "best", "lowest", "optimal", etc. in this context do not necessarily refer to "best", "lowest", or "optimal" overall, but may also refer to situations where a termination or selection criterion is met; e.g., a value above or below a threshold or other constraints may result in a "sub-optimal selection" but reduce complexity and processing time.
• The segmentation unit 262 may be used to segment images of a video sequence into a sequence of coding tree units (CTUs), and the CTU 203 may be further segmented into smaller block portions or sub-blocks (which again form blocks), for example, by iteratively using quad-tree (QT) partitioning, binary-tree (BT) partitioning, or triple-tree (TT) partitioning, or any combination thereof, and to perform prediction, for example, on each of the block portions or sub-blocks, wherein the mode selection includes selecting the tree structure of the partitioned block 203 and selecting the prediction mode applied to each of the block portions or sub-blocks.
• In the following, the partitioning (e.g., performed by the partition unit 262) and the prediction processing (e.g., performed by the inter prediction unit 244 and the intra prediction unit 254) are explained in more detail.
• The segmentation unit 262 may segment (or divide) an image block (or CTU) 203 into smaller parts, such as small blocks of square or rectangular shape.
  • a CTU consists of N ⁇ N luma pixel blocks and two corresponding chrominance pixel blocks.
  • the maximum allowed size of a luma block in a CTU is specified as 128 ⁇ 128 in the developing Versatile Video Coding (VVC) standard, but may be specified in the future to a value other than 128 ⁇ 128, such as 256 ⁇ 256.
  • the CTUs of an image can be pooled/grouped into slices/coded block groups, coded blocks or bricks.
  • a coding block covers a rectangular area of an image, and a coding block can be divided into one or more bricks.
  • a brick consists of multiple CTU rows within an encoded block.
  • a coded block that is not partitioned into multiple bricks may be called a brick.
• However, a brick that is a true subset of a coding block is not referred to as a coding block.
  • VVC supports two coded block group modes, namely raster scan slice/coded block group mode and rectangular slice mode.
• In the raster scan slice/coding block group mode, a slice/coding block group contains a sequence of coding blocks in the coding block raster scan of an image.
• In the rectangular slice mode, a slice contains multiple bricks of an image that together form a rectangular area of the image. The bricks within a rectangular slice are arranged in the brick raster scan order of the image.
  • These smaller blocks can be further divided into smaller parts.
• This is also referred to as tree partitioning or hierarchical tree partitioning, where a root block at root tree level 0 (hierarchy level 0, depth 0) or the like can be recursively split into two or more blocks of the next lower tree level, for example nodes at tree level 1 (hierarchy level 1, depth 1).
• These blocks can in turn be split into two or more blocks of the next lower level, e.g., tree level 2 (hierarchy level 2, depth 2), etc., until the splitting ends (because a termination criterion is met, e.g., maximum tree depth or minimum block size is reached).
  • Blocks that are not further divided are also called leaf blocks or leaf nodes of the tree.
• A tree divided into two parts is called a binary tree (BT), a tree divided into three parts is called a ternary tree (TT), and a tree divided into four parts is called a quad tree (QT).
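• The following minimal sketch illustrates recursive quadtree partitioning with a minimum-block-size termination criterion (a stand-in for the termination criteria mentioned above; a real encoder would also evaluate BT/TT splits and choose among them by RD cost):
```python
def quadtree_split(x, y, size, min_size=8, depth=0):
    """Recursively split a block into four quadrants until the block
    reaches the minimum size; returns the leaf blocks (leaf nodes)."""
    if size <= min_size:  # termination criterion: leaf block
        return [(x, y, size, depth)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_split(x + dx, y + dy, half, min_size, depth + 1)
    return leaves

# Splitting a 64x64 CTU down to 8x8 leaves yields 64 leaf blocks at depth 3.
leaves = quadtree_split(0, 0, 64)
print(len(leaves), leaves[0])  # 64 (0, 0, 8, 3)
```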
• A coding tree unit (CTU) may be or include a CTB of luma pixels, two corresponding CTBs of chroma pixels of an image having three pixel arrays, a CTB of pixels of a monochrome image, or a CTB of pixels of an image coded using three separate color planes and syntax structures (used to code the pixels). Correspondingly, a coding tree block (CTB) can be an N×N block of pixels, where N can be set to a certain value such that a component is divided into CTBs; this is the segmentation.
• A coding unit (CU) may be or include a coding block of luma pixels, two corresponding coding blocks of chroma pixels of an image having three pixel arrays, a coding block of pixels of a monochrome image, or a coding block of pixels of an image coded using three separate color planes and syntax structures (used to code the pixels).
• Correspondingly, a coding block (CB) can be an M×N block of pixels, where M and N can be set to certain values such that a CTB is divided into coding blocks; this is the division.
• In some embodiments, for example according to HEVC, a coding tree unit may be divided into a plurality of CUs by using a quadtree structure denoted as a coding tree.
  • the decision whether to encode an image region using inter (temporal) prediction or intra (spatial) prediction is made at the leaf-CU level.
  • Each leaf-CU can be further divided into one, two or four PUs according to the PU division type.
  • the same prediction process is used within a PU, and relevant information is transmitted to the decoder in units of PUs.
  • the leaf CU can be partitioned into transform units (TUs) according to other quadtree structures similar to the coding tree used for the CU.
• In an embodiment, for example, according to the latest video coding standard currently under development (referred to as Versatile Video Coding (VVC)), a combined quadtree with nested multi-type tree (such as binary and ternary trees) is used as the segmentation structure for partitioning the coding tree unit. In the coding tree structure within a coding tree unit, a CU can be square or rectangular. The coding tree unit (CTU) is first partitioned by a quadtree structure, and the quadtree leaf nodes are then further partitioned by a multi-type tree structure. The multi-type tree leaf nodes are called coding units (CUs); unless the CU is too large for the maximum transform length, this segmentation is used for prediction and transform processing without any further partitioning. In most cases, this means that the CU, PU, and TU have the same block size in the coding block structure of the quadtree with nested multi-type tree. An exception occurs when the maximum supported transform length is smaller than the width or height of a color component of the CU.
• VVC has a unique signaling mechanism for the partition splitting information of the quadtree with nested multi-type tree coding structure. In the signaling mechanism, the coding tree unit (CTU), as the root of the quadtree, is first partitioned by the quadtree structure. Each quadtree leaf node (when sufficiently large) can then be further partitioned by the multi-type tree structure.
• In the multi-type tree structure, a first flag (mtt_split_cu_flag) is signaled to indicate whether the node is further partitioned; when the node is further partitioned, a second flag (mtt_split_cu_vertical_flag) is signaled to indicate the splitting direction.
  • the decoder can derive the multi-type tree division mode (MttSplitMode) of the CU based on predefined rules or tables.
• In some specific implementations, when the width or height of a luma coding block is greater than 64, TT splitting is not allowed; when the width or height of a chroma coding block is greater than 32, TT splitting is also not allowed.
• The pipeline design divides an image into multiple virtual pipeline data units (VPDUs), each defined as a non-overlapping unit in the image.
  • consecutive VPDUs are processed simultaneously in multiple pipeline stages.
• The VPDU size is roughly proportional to the buffer size, so the VPDU needs to be kept small.
  • the VPDU size can be set to the maximum transform block (TB) size.
• However, ternary tree (TT) and binary tree (BT) partitioning may increase the size of the VPDUs. In addition, when a part of a tree node block exceeds the bottom or right image boundary, the tree node block is forced to be split until all pixels of every coded CU are located within the image boundaries.
  • the intra sub-partitions (intra sub-partitions, ISP) tool may vertically or horizontally divide the luma intra prediction block into two or four sub-parts according to the block size.
  • mode selection unit 260 of video encoder 20 may be configured to perform any combination of the segmentation techniques described above.
  • the video encoder 20 is configured to determine or select the best or optimal prediction mode from a set of (predetermined) prediction modes.
  • the set of prediction modes may include, for example, intra prediction modes and/or inter prediction modes.
• The set of intra prediction modes may include 35 different intra prediction modes, e.g., non-directional modes such as the DC (or mean) mode and the planar mode, or directional modes as defined in HEVC, or may include 67 different intra prediction modes, e.g., non-directional modes such as the DC (or mean) mode and the planar mode, or directional modes as defined in VVC.
• As an example, in VVC, the intra prediction result of the planar mode may be further modified using a position dependent intra prediction combination (PDPC) method.
  • the intra prediction unit 254 is configured to generate an intra prediction block 265 by using reconstructed pixels of adjacent blocks of the same current image according to an intra prediction mode in the intra prediction mode set.
  • Intra prediction unit 254 (or generally mode selection unit 260) is also configured to output intra prediction parameters (or generally information indicating the selected intra prediction mode for a block) in the form of syntax elements 266 to entropy encoding unit 270 , to be included in the encoded image data 21, so that the video decoder 30 can perform operations such as receiving and using prediction parameters for decoding.
  • the intra prediction modes in HEVC include DC prediction mode, planar prediction mode and 33 angle prediction modes, a total of 35 candidate prediction modes.
  • the current block can be intra-predicted using the pixels of the reconstructed image blocks on the left and above as references.
• An image block in the peripheral area of the current block that is used for intra prediction of the current block is called a reference block, and the pixels in the reference block are called reference pixels.
• Among them, the DC prediction mode is suitable for areas with flat texture in the current block; all pixels in such an area use the average value of the reference pixels in the reference block as their prediction.
• The planar prediction mode is suitable for image blocks whose texture changes smoothly; for a current block that meets this condition, bilinear interpolation of the reference pixels in the reference block is used as the prediction of all pixels in the current block.
• The angular prediction modes exploit the fact that the texture of the current block is highly correlated with the texture of an adjacent reconstructed image block, and copy the values of the reference pixels in the corresponding reference block along a certain angle as the prediction of all pixels in the current block.
  • the HEVC encoder selects an optimal intra prediction mode from 35 candidate prediction modes for the current block, and writes the optimal intra prediction mode into the video stream.
• The encoder/decoder derives the three most probable modes from the optimal intra prediction modes of the reconstructed image blocks in the surrounding area that used intra prediction. If the optimal intra prediction mode selected for the current block is one of the three most probable modes, a first index is encoded, indicating that the selected optimal intra prediction mode is one of the three most probable modes; if the selected optimal intra prediction mode is not one of the three most probable modes, a second index is encoded, indicating that the selected optimal intra prediction mode is one of the other 32 modes (the modes among the 35 candidate prediction modes other than the above three most probable modes).
  • the HEVC standard uses a 5-bit fixed-length code as the aforementioned second index.
• The method for the HEVC encoder to derive the three most probable modes includes: putting the optimal intra prediction modes of the left adjacent image block and the upper adjacent image block of the current block into a set; if these two optimal intra prediction modes are the same, only one is kept in the set. If the two optimal intra prediction modes are the same and both are angular prediction modes, two angular prediction modes adjacent in angular direction are additionally selected and added to the set; otherwise, the planar prediction mode, the DC mode, and the vertical prediction mode are selected in turn and added to the set until the number of modes in the set reaches 3.
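• The following sketch illustrates this derivation with HEVC-style mode numbering (0 = planar, 1 = DC, 2 to 34 = angular modes); the wrap-around constants used for the two adjacent angular modes follow the HEVC specification but should be treated as illustrative here:
```python
PLANAR, DC, VERTICAL = 0, 1, 26  # HEVC mode numbers

def three_most_probable_modes(left_mode: int, above_mode: int) -> list:
    if left_mode == above_mode:
        if left_mode > 1:
            # Identical angular modes: keep the mode plus the two angular
            # modes adjacent in angular direction (with wrap-around).
            return [left_mode,
                    2 + ((left_mode + 29) % 32),
                    2 + ((left_mode - 2 + 1) % 32)]
        mpm = [left_mode]  # identical non-angular modes: keep only one
    else:
        mpm = [left_mode, above_mode]
    # Fill with planar, DC, and vertical (in turn) until the set has 3 modes.
    for filler in (PLANAR, DC, VERTICAL):
        if len(mpm) == 3:
            break
        if filler not in mpm:
            mpm.append(filler)
    return mpm

print(three_most_probable_modes(10, 10))      # [10, 9, 11]
print(three_most_probable_modes(PLANAR, DC))  # [0, 1, 26]
```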
• After the HEVC decoder performs entropy decoding on the code stream, it obtains the mode information of the current block, which includes a flag indicating whether the optimal intra prediction mode of the current block is among the three most probable modes, and the index of the optimal intra prediction mode of the current block.
• The set of inter prediction modes depends on the available reference pictures (i.e., e.g., at least some previously decoded pictures stored in the DPB 230) and other inter prediction parameters, e.g., on whether the entire reference picture or only a part of the reference picture (e.g., a search window area around the area of the current block) is used to search for the best matching reference block, and/or, e.g., on whether pixel interpolation such as half-pel, quarter-pel, and/or sixteenth-pel interpolation is performed.
  • skip mode and/or direct mode may also be employed.
• The merge candidate list of this mode consists of the following five candidate types in order: spatial MVP from spatially adjacent CUs, temporal MVP from collocated CUs, history-based MVP from a FIFO table, pairwise average MVP, and zero MVs.
  • Decoder side motion vector refinement (DMVR) based on bilateral matching can be used to increase the accuracy of MV in merge mode.
• Merge mode with MVD (MMVD) comes from merge mode with motion vector differences. The MMVD flag is sent immediately after the skip flag and the merge flag to specify whether MMVD mode is used for the CU.
• A CU-level adaptive motion vector resolution (AMVR) scheme may be used. AMVR supports encoding the MVD of a CU at different precisions, and the MVD resolution of the current CU is adaptively selected.
  • a combined inter/intra prediction (CIIP) mode can be applied to the current CU.
  • a weighted average is performed on the inter-frame and intra-frame prediction signals to obtain CIIP prediction.
• The affine motion field of a block is described by the motion vectors of 2 control points (4-parameter model) or 3 control points (6-parameter model).
• Subblock-based temporal motion vector prediction (SbTMVP) is similar to the temporal motion vector prediction (TMVP) in HEVC, but predicts the motion vectors of the sub-CUs within the current CU.
• Bi-directional optical flow (BDOF), previously referred to as BIO, is a simpler version that requires less computation, especially in terms of the number of multiplications and the size of the multiplier.
• In the triangular partition mode, a CU is evenly divided into two triangular parts in one of two ways: diagonal division or anti-diagonal division.
• In addition, the bi-directional prediction mode is extended beyond simple averaging to support weighted averaging of the two prediction signals.
  • the inter prediction unit 244 may include a motion estimation (motion estimation, ME) unit and a motion compensation (motion compensation, MC) unit (both are not shown in FIG. 2 ).
• The motion estimation unit is operable to receive or obtain the image block 203 (the current image block 203 of the current image 17) and the decoded image 231, or at least one or more previously reconstructed blocks, e.g., reconstructed blocks of one or more other/different previously decoded images 231, for motion estimation.
  • a video sequence may comprise a current picture and a previous decoded picture 231, or in other words, the current picture and a previous decoded picture 231 may be part of or form a sequence of pictures forming the video sequence.
• The encoder 20 may be configured to select a reference block from a plurality of reference blocks of the same or different images among a plurality of other images, and to provide the reference image (or reference image index) and/or the offset (spatial offset) between the position (x, y coordinates) of the reference block and the position of the current block to the motion estimation unit as inter prediction parameters.
  • This offset is also called a motion vector (MV).
  • the motion compensation unit is configured to obtain, for example, receive, inter-frame prediction parameters, and perform inter-frame prediction according to or using the inter-frame prediction parameters to obtain an inter-frame prediction block 246 .
• Motion compensation performed by the motion compensation unit may include extracting or generating a prediction block from the motion/block vector determined by motion estimation, and may include performing interpolation to sub-pixel precision. Interpolation filtering can generate additional pixel values from known pixel values, thereby potentially increasing the number of candidate prediction blocks available for encoding an image block.
  • the motion compensation unit may locate the prediction block pointed to by the motion vector in one of the reference image lists.
  • the motion compensation unit may also generate block- and video-slice-related syntax elements for use by video decoder 30 when decoding image blocks of video slices. Additionally, or instead of slices and corresponding syntax elements, coding block groups and/or coding blocks and corresponding syntax elements may be generated or used.
• The motion vectors (MVs) that can be added to the candidate motion vector list as candidates include the MVs of image blocks spatially and temporally adjacent to the current block, wherein the MVs of spatially adjacent image blocks may include the MV of the left candidate image block to the left of the current block and the MV of the upper candidate image block above the current block.
• FIG. 4 is an exemplary schematic diagram of candidate image blocks provided by an embodiment of the present application. As shown in FIG. 4, the set of left candidate image blocks includes {A0, A1}, the set of upper candidate image blocks includes {B0, B1, B2}, and the set of temporally adjacent candidate image blocks includes {C, T}.
• The order may be: first consider the set {A0, A1} of left candidate image blocks of the current block (consider A0 first; if A0 is unavailable, consider A1), then consider the set {B0, B1, B2} of upper candidate image blocks of the current block (consider B0 first; if B0 is unavailable, consider B1; if B1 is unavailable, consider B2), and finally consider the set {C, T} of temporally adjacent candidate image blocks of the current block (consider T first; if T is unavailable, consider C).
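• The following sketch illustrates this selection order, scanning each candidate set and taking the first available block from it (the availability flags are illustrative assumptions):
```python
def pick_first_available(candidates, available):
    """Return the first candidate block marked available, or None."""
    for name in candidates:
        if available.get(name):
            return name
    return None

available = {"A0": False, "A1": True, "B0": True,
             "B1": True, "B2": True, "T": False, "C": True}

for set_name, candidates in [("left",     ["A0", "A1"]),
                             ("above",    ["B0", "B1", "B2"]),
                             ("temporal", ["T", "C"])]:
    print(set_name, "->", pick_first_available(candidates, available))
# left -> A1, above -> B0, temporal -> C
```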
  • the optimal MV is determined from the candidate motion vector list through the rate distortion cost (RD cost), and the candidate motion vector with the smallest RD cost is used as the motion vector predictor (motion vector predictor, MVP).
• The RD cost may be calculated as J = SAD + λR, where J denotes the RD cost, SAD is the sum of absolute differences (SAD) between the pixel values of the prediction block obtained by motion estimation using the candidate motion vector and the pixel values of the current block, R denotes the bit rate, and λ denotes the Lagrangian multiplier.
• The encoding end transmits the index of the determined MVP in the candidate motion vector list to the decoding end. Further, a motion search may be performed in a neighborhood centered on the MVP to obtain the actual motion vector of the current block; the encoding end calculates the motion vector difference (MVD) between the MVP and the actual motion vector, and transmits the MVD to the decoding end.
  • the decoding end parses the index, finds the corresponding MVP in the candidate motion vector list according to the index, parses the MVD, and adds the MVD and the MVP to obtain the actual motion vector of the current block.
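• The following sketch illustrates MVP selection by the RD cost J = SAD + λR together with the MVD computation described above (the prediction blocks, rate estimates, and λ value are illustrative assumptions):
```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences between two blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def select_mvp(current, predictions, rates, lam):
    """Pick the candidate minimizing J = SAD + lambda * R."""
    costs = [sad(current, p) + lam * r for p, r in zip(predictions, rates)]
    return int(np.argmin(costs))

current = np.random.randint(0, 256, (8, 8))
candidate_mvs = [(1, 0), (0, 2)]
predictions = [np.random.randint(0, 256, (8, 8)) for _ in candidate_mvs]

index = select_mvp(current, predictions, rates=[2, 3], lam=4.0)
mvp = candidate_mvs[index]

actual_mv = (1, 1)  # e.g., found by a motion search around the MVP
mvd = (actual_mv[0] - mvp[0], actual_mv[1] - mvp[1])  # transmitted to decoder
print(index, mvp, mvd)  # the decoder reconstructs: actual_mv = mvp + mvd
```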
• The motion information that can be added to the candidate motion information list as candidates includes the motion information of image blocks spatially or temporally adjacent to the current block; for the spatially adjacent and temporally adjacent image blocks, refer to FIG. 4.
• The candidate motion information corresponding to the spatial domain in the candidate motion information list comes from the five spatially adjacent blocks (A0, A1, B0, B1, and B2); if a spatially neighboring block is unavailable or intra-predicted, its motion information is not added to the candidate motion information list.
• The temporal candidate motion information of the current block is obtained by scaling the MV of the block at the corresponding position in the reference frame according to the picture order counts (POCs) of the reference frame and the current frame; it is first judged whether the block at position T in the reference frame is available, and if it is not available, the block at position C is selected. After the above candidate motion information list is obtained, the optimal motion information is determined from the candidate motion information list through the RD cost and used as the motion information of the current block.
  • the encoding end transmits the index value (denoted as merge index) of the position of the optimal motion information in the candidate motion information list to the decoding end.
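• The following sketch illustrates the POC-based temporal MV scaling described above: the MV of the collocated block is scaled by the ratio of POC distances (the POC values are illustrative assumptions):
```python
def scale_temporal_mv(col_mv, poc_cur, poc_cur_ref, poc_col, poc_col_ref):
    """Scale a collocated MV by the ratio of POC distances."""
    tb = poc_cur - poc_cur_ref  # POC distance for the current block
    td = poc_col - poc_col_ref  # POC distance for the collocated block
    scale = tb / td
    return (round(col_mv[0] * scale), round(col_mv[1] * scale))

# The collocated block moved (8, -4) over a POC distance of 4; the current
# block references a frame 2 POCs away, so the MV is halved.
print(scale_temporal_mv((8, -4), poc_cur=10, poc_cur_ref=8,
                        poc_col=12, poc_col_ref=8))  # (4, -2)
```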
  • the entropy encoding unit 270 includes a trained self-attention decoding network 2071 and a self-attention encoding network 2072, and the self-attention decoding network 2071 is used to process an input image or image region or image block to obtain first context information;
  • the self-attention encoding network 2072 is used to process the input image or image region or image block to obtain the first side information.
  • the entropy coding unit 270 is used to apply an entropy coding algorithm or scheme (for example, a variable length coding (variable length coding, VLC) scheme, a context adaptive VLC scheme (context adaptive VLC, CAVLC), an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (context adaptive binary arithmetic coding, CABAC), syntax-based context-adaptive binary arithmetic coding (syntax-based context-adaptive binary arithmetic coding, SBAC), probability interval partitioning entropy (probability interval partitioning entropy, PIPE) coding, or other entropy encoding methods or techniques) to the quantized residual coefficients 209, inter prediction parameters, intra prediction parameters, loop filter parameters and/or other syntax elements; the resulting encoded bitstream 21 can be output through the output terminal 272, so that the video decoder 30 or the like can receive and use the parameters for decoding.
  • Encoded bitstream 21 may be transmitted to video decoder 30 or stored in memory for later transmission or retrieval by video decoder 30 .
  • a non-transform based encoder 20 may directly quantize the residual signal without a transform processing unit 206 for certain blocks or frames.
  • encoder 20 may have quantization unit 208 and inverse quantization unit 210 combined into a single unit.
  • the video decoder 30 is used to receive the encoded image data 21 (eg encoded bit stream 21 ) encoded by the encoder 20 to obtain a decoded image 331 .
  • the coded image data or bitstream comprises information for decoding said coded image data, eg data representing image blocks of a coded video slice (and/or coded block group or coded block) and associated syntax elements.
  • the decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (such as a summer 314), a loop filter 320, a decoded picture buffer (decoded picture buffer, DPB) 330, a mode application unit 360, an inter prediction unit 344, and an intra prediction unit 354.
  • Inter prediction unit 344 may be or include a motion compensation unit.
  • video decoder 30 may perform a decoding process that is substantially inverse to the encoding process described with reference to video encoder 20 of FIG. 2.
  • the entropy decoding unit 304 includes a trained self-attention decoding network 3041 , and the self-attention decoding network 3041 is used to process an input image or image region or image block to obtain first context information.
  • the inverse quantization unit 310 can be functionally the same as the inverse quantization unit 210
  • the inverse transform processing unit 312 can be functionally the same as the inverse transform processing unit 212
  • the reconstruction unit 314 can be functionally the same as the reconstruction unit 214
  • the loop filter 320 may be functionally the same as the loop filter 220
  • the decoded picture buffer 330 may be functionally the same as the decoded picture buffer 230 . Therefore, the explanation of the corresponding elements and functions of the video encoder 20 applies to the corresponding elements and functions of the video decoder 30 accordingly.
  • the entropy decoding unit 304 is used to parse the bitstream 21 (or, in general, the encoded image data 21) and perform entropy decoding on the encoded image data 21 to obtain quantization coefficients 309 and/or decoded coding parameters (not shown in FIG. 3), such as any or all of the inter prediction parameters (such as reference image index and motion vector), intra prediction parameters (such as intra prediction mode or index), transform parameters, quantization parameters, loop filter parameters and/or other syntax elements.
  • the entropy decoding unit 304 may be configured to apply a decoding algorithm or scheme corresponding to the encoding scheme of the entropy encoding unit 270 of the encoder 20 .
  • Entropy decoding unit 304 may also be configured to provide inter-prediction parameters, intra-prediction parameters, and/or other syntax elements to mode application unit 360, as well as other parameters to other units of decoder 30.
  • Video decoder 30 may receive video slice and/or video block level syntax elements. Additionally, or instead of slices and corresponding syntax elements, coding block groups and/or coding blocks and corresponding syntax elements may be received or used.
  • the inverse quantization unit 310 may be configured to receive a quantization parameter (quantization parameter, QP) (or, in general, information related to inverse quantization) and quantization coefficients from the encoded image data 21 (for example, parsed and/or decoded by the entropy decoding unit 304), and to perform, based on the quantization parameter, inverse quantization on the decoded quantization coefficients 309 to obtain inverse quantization coefficients 311, which may also be called transform coefficients 311.
  • the inverse quantization process may include using quantization parameters calculated by video encoder 20 for each video block in the video slice to determine the degree of quantization, as well as the degree of inverse quantization that needs to be performed.
  • the inverse transform processing unit 312 is operable to receive the inverse quantization coefficients 311, also referred to as transform coefficients 311, and apply a transform to them to obtain a reconstructed residual block 313 in the pixel domain.
  • the reconstructed residual block 313 may also be referred to as a transform block 313.
  • the transform may be an inverse transform, such as an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process.
  • the inverse transform processing unit 312 may also be configured to receive transform parameters or corresponding information from the encoded image data 21 (eg, parsed and/or decoded by the entropy decoding unit 304 ) to determine the transform to apply to the dequantized coefficients 311 .
  • the reconstruction unit 314 (for example, the summer 314) is used to add the reconstructed residual block 313 to the prediction block 365 to obtain the reconstructed block 315 in the pixel domain, for example, by adding the pixel values of the reconstructed residual block 313 to the pixel values of the prediction block 365.
  • the loop filter unit 320 is used (in the coding loop or after it) to filter the reconstructed block 315 to obtain a filtered block 321, so as to smooth pixel transitions or otherwise improve video quality.
  • the loop filter unit 320 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (sample-adaptive offset, SAO) filter, or one or more other filters, such as an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof.
  • the loop filter unit 220 may include a deblocking filter, an SAO filter, and an ALF filter. The order of the filtering process may be deblocking filter, SAO filter and ALF filter.
  • LMCS (luma mapping with chroma scaling): this process is performed before deblocking.
  • the deblocking filtering process can also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (sub-block transform, SBT) edges and intra sub-partition (intra sub-partition, ISP) edges.
  • loop filter unit 320 is shown in FIG. 3 as a loop filter, in other configurations, loop filter unit 320 may be implemented as a post-loop filter.
  • the decoded video block 321 in one picture is then stored in a decoded picture buffer 330 which stores the decoded picture 331 as a reference picture for subsequent motion compensation in other pictures and/or for respective output display.
  • the decoder 30 is used to output the decoded image 331 through the output terminal 332 and so on, for display to or viewing by the user.
  • the inter prediction unit 344 may be functionally the same as the inter prediction unit 244 (especially the motion compensation unit), and the intra prediction unit 354 may be functionally the same as the intra prediction unit 254; partitioning and prediction are determined and performed based on the partitioning and/or prediction parameters or corresponding information received from the encoded image data 21 (e.g., parsed and/or decoded by the entropy decoding unit 304).
  • the mode application unit 360 can be used to perform prediction (intra-frame or inter-frame prediction) for each block according to the reconstructed image, block or corresponding pixels (filtered or unfiltered), to obtain the predicted block 365 .
  • the intra prediction unit 354 in the mode application unit 360 is used to generate a prediction block 365 for an image block of the current video slice based on the indicated intra prediction mode and data from previously decoded blocks of the current picture.
  • the inter prediction unit 344 (e.g., the motion compensation unit) generates a prediction block 365 for a video block of the current video slice.
  • the predicted blocks may be generated from one of the reference pictures in one of the reference picture lists.
  • Video decoder 30 may construct reference frame list 0 and list 1 from the reference pictures stored in DPB 330 using a default construction technique.
  • in addition to or instead of slices (e.g., video slices), the same or similar process can be applied to embodiments of coding block groups (e.g., video coding block groups) and/or coding blocks (e.g., video coding blocks); for example, video may be encoded using I, P or B coding block groups and/or coding blocks.
  • the mode application unit 360 is configured to determine prediction information for a video block of the current video slice by parsing motion vectors and other syntax elements, and use the prediction information to generate a prediction block for the current video block being decoded. For example, the mode application unit 360 uses some of the received syntax elements to determine the prediction mode (such as intra prediction or inter prediction) used to encode the video blocks of the video slice, the inter prediction slice type (such as B slice, P slice or GPB slice), construction information for one or more reference picture lists for the slice, the motion vector of each inter-coded video block of the slice, the inter prediction state of each inter-coded video block of the slice, and other information to decode the video blocks within the current video slice.
  • the video decoder 30 of FIG. 3 can also be used to segment and/or decode an image using slices (also called video slices), where an image can be segmented or decoded using one or more slices (typically non-overlapping).
  • each slice may include one or more blocks (e.g., CTUs) or one or more block groups (e.g., coded blocks in the H.265/HEVC/VVC standard and tiles in the VVC standard).
  • the video decoder 30 shown in FIG. 3 can also be configured to segment and/or decode an image using slices/coded block groups (also called video coded block groups) and/or coded blocks (also called video coded blocks), where an image may be segmented or decoded using one or more slices/coded block groups (usually non-overlapping); each slice/coded block group may consist of one or more blocks (such as CTUs) or one or more coded blocks, and each coded block may be rectangular or the like and may include one or more complete or partial blocks (such as CTUs).
  • video decoder 30 may be used to decode encoded image data 21 .
  • decoder 30 may generate an output video stream without loop filter unit 320 .
  • for certain blocks or frames, the non-transform based decoder 30 can directly inverse quantize the residual signal without the inverse transform processing unit 312.
  • video decoder 30 may have inverse quantization unit 310 and inverse transform processing unit 312 combined into a single unit.
  • the processing result of the current step can be further processed, and then output to the next step.
  • further operations such as clipping or shifting operations, may be performed on the processing results of interpolation filtering, motion vector derivation or loop filtering.
  • the value of the motion vector is limited to a predefined range according to the representation bits of the motion vector. If the representation bits of the motion vector are bitDepth, the range is -2^(bitDepth-1) to 2^(bitDepth-1)-1, where "^" represents exponentiation. For example, if bitDepth is set to 16, the range is -32768 to 32767; if bitDepth is set to 18, the range is -131072 to 131071.
  • the value of the derived motion vector (e.g., the MVs of the four 4x4 sub-blocks in an 8x8 block) is constrained such that the maximum difference between the integer parts of the MVs of the four 4x4 sub-blocks is no more than N pixels, for example, no more than 1 pixel.
  • the above are two ways to limit motion vectors based on bitDepth.
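  • A minimal sketch of the first of these limits, clipping each MV component to the bitDepth-derived range:

```python
# Clip an MV component to [-2^(bitDepth-1), 2^(bitDepth-1) - 1].

def clip_mv_component(v, bit_depth=16):
    low = -(1 << (bit_depth - 1))       # e.g. -32768 for bitDepth = 16
    high = (1 << (bit_depth - 1)) - 1   # e.g.  32767 for bitDepth = 16
    return max(low, min(v, high))

assert clip_mv_component(40000, 16) == 32767
assert clip_mv_component(-140000, 18) == -131072
```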
  • embodiments of the decoding system 10, encoder 20 and decoder 30, as well as other embodiments described herein, may also be used for still image processing or coding, that is, the processing or coding of a single image independently of any previous or successive images in a video codec.
  • if image processing is limited to a single image 17, the inter prediction unit 244 (encoder) and the inter prediction unit 344 (decoder) may not be available.
  • all other functions (also referred to as tools or techniques) of video encoder 20 and video decoder 30 are equally applicable to still image processing, such as residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, segmentation 262/362, intra prediction 254/354 and/or loop filtering 220/320, entropy encoding 270 and entropy decoding 304.
  • the video decoding device can be a decoder, such as the video decoder 30 in FIG. 1, or an encoder, such as the video encoder 20 in FIG. 1.
  • the video decoding device includes: an ingress port (or input port) and a receiver unit (receiver unit, Rx) for receiving data; a processor, logic unit or central processing unit (central processing unit, CPU) for processing data, where the processor may be a neural network processor; a transmitter unit (transmitter unit, Tx) and an egress port (or output port) for transmitting data; and a memory for storing data.
  • the video decoding device may also include optical-to-electrical (optical-to-electrical, OE) components and electrical-to-optical (electrical-to-optical, EO) components coupled to the ingress port, the receiver unit, the transmitter unit and the egress port, serving as the egress or ingress of optical or electrical signals.
  • a processor may be implemented as one or more processor chips, cores (eg, multi-core processors), FPGAs, ASICs, and DSPs.
  • the processor communicates with the ingress port, the receiving unit, the transmitting unit, the egress port and the memory.
  • the processor includes a decoding module (eg, a neural network based decoding module).
  • the decoding module implements the embodiments disclosed above. For example, the decoding module performs, processes, prepares, or provides various encoding operations. Thus, the decoding module provides a substantial improvement in the functionality of the video decoding device and effects the switching of the video decoding device to different states.
  • the decode module is implemented as instructions stored in memory and executed by a processor.
  • the memory includes one or more of magnetic disks, tape drives, and solid-state drives, and may be used as an overflow data storage device to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution.
  • the memory can be volatile and/or nonvolatile, and can be read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), ternary content-addressable memory (ternary content-addressable memory, TCAM) and/or static random-access memory (static random-access memory, SRAM).
  • An embodiment of the present application provides an apparatus, which may include a processor, a memory, and a bus.
  • the apparatus may be used as either or both of source device 12 and destination device 14 in FIG. 1 .
  • the processor in the device may be a central processing unit.
  • a processor may be any other type of device, or multiple devices, existing or later developed, capable of manipulating or processing information. Although the disclosed implementations can be implemented using a single processor, such as the one shown, using more than one processor is faster and more efficient.
  • the memory in the apparatus may be a read only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory.
  • the memory can include code and data accessed by the processor through the bus.
  • the memory may also include an operating system and application programs, including at least one program that allows the processor to perform the methods described herein.
  • the application programs may include applications 1 through N, and also include a video coding application that performs the methods described herein.
  • An apparatus may also include one or more output devices, such as displays.
  • the display can be a touch sensitive display that combines the display with touch sensitive elements that can be used to sense touch input.
  • a display can be coupled to the processor via a bus.
  • although the bus in the device is described herein as a single bus, the bus may include multiple buses. Additionally, secondary storage may be directly coupled to the other components of the device or accessed over a network, and may comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. Accordingly, the device may have a wide variety of configurations.
  • Neural network (neural network, NN) is a machine learning model.
  • a neural network can be composed of neural units.
  • a neural unit can refer to a computing unit that takes inputs x_s and an intercept of 1.
  • the output of the computing unit can be: h_{W,b}(x) = f(W^T x) = f(∑_s W_s·x_s + b)
  • W_s is the weight of x_s
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function may be a sigmoid function.
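  • A toy sketch of such a neural unit (the sigmoid activation follows the example above; the numeric values are illustrative):

```python
import math

# A single neural unit: output = f(sum_s W_s * x_s + b), with sigmoid f.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b):
    return sigmoid(sum(w * x for w, x in zip(ws, xs)) + b)

print(neural_unit([0.5, -1.0], [0.8, 0.3], b=0.1))  # one scalar activation
```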
  • a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • a deep neural network (deep neural network, DNN), also known as a multilayer neural network, can be understood as a neural network with many hidden layers; there is no special metric for the "many" here.
  • the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the coefficient from the kth neuron of the (L-1)th layer to the jth neuron of the Lth layer is defined as W_jk^L. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
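  • A hedged sketch of the forward pass through such fully connected layers (the sigmoid activation and the tiny shapes are illustrative assumptions, and training would learn the W and b values shown):

```python
import math

# Layer L applies its weight matrix W (entry W[j][k] links neuron k of layer
# L-1 to neuron j of layer L) plus a bias, then the activation; the input
# layer itself carries no W parameter.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(x, layers):
    # layers: list of (W, b) pairs, one per hidden/output layer
    for W, b in layers:
        x = [sigmoid(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
             for row, b_j in zip(W, b)]
    return x

# A 2-input, 2-hidden, 1-output network with illustrative weights.
out = dense_forward([1.0, 0.5], [([[0.2, -0.4], [0.7, 0.1]], [0.0, 0.1]),
                                 ([[0.5, -0.3]], [0.2])])
```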
  • a convolutional neural network (convolutional neural network, CNN) consists of a feature extractor composed of convolutional layers and pooling layers. The feature extractor can be seen as a filter, and the convolution process can be seen as convolving a trainable filter with an input image or convolutional feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • the convolution layer can include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, and this weight matrix is usually pre-defined. In the process of performing a convolution operation on an image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image.
  • the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix will be extended to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension; but in most cases, instead of using a single weight matrix, multiple weight matrices of the same size (row × column), that is, multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image, where the dimension can be understood as determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features in the image.
  • one weight matrix is used to extract image edge information
  • another weight matrix is used to extract specific colors of the image
  • another weight matrix is used to filter out unwanted noise in the image, to blur the image, and so on.
  • the multiple weight matrices have the same size (row × column), so the feature maps extracted by these weight matrices also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • the weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network can make correct predictions.
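  • A toy sketch of this sliding-window convolution with multiple kernels stacked into a depth dimension (the kernel values are illustrative, not trained weights):

```python
# Each weight matrix (kernel) slides over the input with a given stride, and
# the per-kernel outputs are stacked to form the depth of the feature map.

def conv2d_single(img, kernel, stride=1):
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[u][v] * img[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(0, len(img[0]) - kw + 1, stride)]
            for i in range(0, len(img) - kh + 1, stride)]

def conv2d_multi(img, kernels, stride=1):
    # one output channel per weight matrix: the "multiple" sets the depth
    return [conv2d_single(img, k, stride) for k in kernels]

edges = [[1, 0], [0, -1]]                  # e.g. a kernel reacting to edges
img = [[3, 3, 3], [3, 9, 3], [3, 3, 3]]
feature_maps = conv2d_multi(img, [edges])  # depth 1 here; more kernels -> deeper
```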
  • the initial convolutional layers often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • a pooling layer is often introduced after a convolutional layer: it can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel value in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
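  • A short sketch of 2×2 average/maximum pooling consistent with the description above:

```python
# Each output pixel is the average or maximum of the corresponding sub-region
# of the input, reducing the spatial size.

def pool2d(img, size=2, mode="max"):
    out = []
    for i in range(0, len(img) - size + 1, size):
        row = []
        for j in range(0, len(img[0]) - size + 1, size):
            window = [img[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

assert pool2d([[1, 2], [3, 4]], mode="max") == [[4]]
assert pool2d([[1, 2], [3, 4]], mode="avg") == [[2.5]]
```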
  • after processing by the convolutional layers/pooling layers, the convolutional neural network is not yet able to output the required output information, because, as mentioned earlier, the convolutional layers/pooling layers only extract features and reduce the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network needs to use the neural network layer to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer can include multiple hidden layers, and the parameters contained in these hidden layers can be pre-trained according to the relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer of the entire convolutional neural network is also included; this output layer has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error.
  • recurrent neural networks (recurrent neural network, RNN) are used to process sequence data.
  • in an ordinary neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word in a sentence, the previous words are generally needed, because the preceding and following words in a sentence are not independent. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as that of traditional CNN or DNN.
  • the error backpropagation algorithm is also used, but with one difference: if the RNN is unfolded into a network, the parameters, such as W, are shared, whereas this is not the case in the traditional neural networks described above.
  • the output of each step depends not only on the network of the current step, but also on the states of the previous several steps of the network. This learning algorithm is called backpropagation through time (back propagation through time, BPTT).
  • the convolutional neural network can use the error backpropagation (back propagation, BP) algorithm to correct the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output generates an error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
  • a generative adversarial network (generative adversarial network, GAN) includes at least two modules: one module is a generative model (Generative Model) and the other is a discriminative model (Discriminative Model); these two modules learn from each other through a game to produce better output.
  • Both the generative model and the discriminative model can be neural networks, specifically deep neural networks or convolutional neural networks.
  • the basic principle of GAN is as follows: taking a GAN that generates pictures as an example, suppose there are two networks, G (Generator) and D (Discriminator). G is a network that generates pictures: it receives a random noise z and generates a picture from this noise, denoted G(z). D is a discriminant network used to determine whether a picture is "real".
  • its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: if it is 1, the picture is definitely real; if it is 0, the picture cannot be real.
  • the goal of the generation network G is to generate real pictures as much as possible to deceive the discriminant network D
  • the goal of the discriminant network D is to distinguish the pictures generated by G from real pictures as far as possible. In this way, G and D constitute a dynamic "game" process, which is the "adversarial" part of the "generative adversarial network".
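  • For reference, this game is commonly formalized in the literature by the following minimax objective (a well-known formulation, not quoted from this application):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] +
  \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```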
  • Fig. 5 is a schematic diagram of an application scenario provided by an embodiment of the present application, and Fig. 5 is illustrated by taking data including images/videos as an example.
  • the application scenario is that the device acquires images/videos, performs entropy encoding on the acquired images/videos to obtain code streams, and stores the code streams. When the image/video needs to be output subsequently, the code stream is entropy decoded to obtain the image/video.
  • the device may integrate the functions of the aforementioned source device and destination device.
  • the device includes an encoding network, a super-encoding network, an entropy encoding network, a saving module, a loading module, a super-decoding network, an entropy decoding network and a decoding network.
  • the encoding network is used to extract features from the input images/videos to obtain feature images/videos with low redundancy.
  • the super-encoding network is used to estimate the estimated probability value of each feature element in the feature image/video.
  • the entropy encoding module is used to perform entropy encoding on the corresponding feature element according to the estimated probability value of each feature element to obtain the code stream and store the code stream through the saving module.
  • the subsequent loading module can load the code stream, and the super decoding network is used to estimate the estimated probability value of the code stream corresponding to each feature element in the code stream.
  • the entropy decoding module is used to perform entropy decoding on the corresponding code stream according to the estimated probability value of the code stream corresponding to each feature element to obtain the feature image/video.
  • the decoding network is used to perform inverse feature extraction on feature images/videos to obtain images/videos.
  • the device compresses the image/video to save storage space.
  • the device can store compressed images/videos in an album or a cloud album.
  • FIG. 6 is a schematic diagram of another application scenario provided by the embodiment of the present application.
  • FIG. 6 is illustrated by taking data including images/videos as an example.
  • the application scenario is to acquire images/videos locally, perform JPEG encoding on the acquired data to obtain compressed images/videos, and then send the compressed images/videos to the cloud.
  • the cloud performs JPEG decoding on the compressed image/video to obtain the image/video, and then performs entropy encoding on the image/video to obtain the code stream and store the code stream.
  • when the local device needs to obtain images/videos from the cloud, the cloud performs entropy decoding on the code stream to obtain the images/videos, then performs JPEG encoding on the images/videos to obtain compressed images/videos, and sends the compressed images/videos to the local device. The local device performs JPEG decoding on the compressed images/videos to obtain the images/videos.
  • the cloud may be integrated with the functions of the aforementioned source device and destination device.
  • for the structure of the cloud and the usage of each module, reference may be made to the structure of FIG. 5 and the usage of each module there, and the embodiment of the present application will not repeat them here.
  • JPEG encoding is performed locally or on the cloud to reduce transmission bandwidth
  • image/video compression is performed on the cloud to save storage space.
  • FIG. 7 is a schematic structural diagram of an encoder in an end-to-end encoding and decoding architecture provided by an embodiment of the present application.
  • the encoder includes an encoding network, a quantization module, a super-encoding network, a super-decoding network and an entropy encoding module.
  • the encoding network is used to perform feature extraction on the input current data stream to obtain feature data.
  • the quantization module is used to quantize the feature data, and the quantized feature data passes through the super-encoding network to obtain the code stream 2 of side information. Code stream 2 gets side information through the super decoding network.
  • the entropy encoding module is used to perform entropy encoding on the input feature data by using side information to obtain code stream 1 .
  • FIG. 8 is a schematic structural diagram of a decoder in an end-to-end codec architecture provided by an embodiment of the present application.
  • the decoder includes a decoding network, an entropy decoding module and a super decoding network.
  • the code stream 2 is decoded by the super-decoding network to obtain side information, and the entropy decoding module is used to perform entropy decoding on the code stream 1 according to the side information to obtain feature data.
  • the decoding network is used to perform inverse feature extraction on the feature data to obtain the current data stream.
  • the encoder can obtain reference information, and then estimate the estimated probability distribution of the data to be encoded according to the reference information, and use the estimated probability distribution of the data to be encoded to perform entropy encoding on the data to be encoded to obtain a code stream .
  • the decoder can obtain the reference information, and then estimate the estimated probability distribution of the code stream according to the reference information, and perform entropy decoding on the code stream by using the estimated probability distribution of the code stream.
  • the reference information may include first context information and/or first side information, and further, the reference information may further include second context information and second side information.
  • when entropy encoding is performed on the data to be encoded included in the current data stream, the current data stream includes multiple data; the first context information is obtained by inputting at least one encoded data among the multiple data into the self-attention decoding network, and the first side information is obtained by feeding multiple data in the current data stream into the self-attention encoding network.
  • the second context information is obtained by inputting at least one of the at least one coded data meeting the preset condition into the masked convolutional network.
  • the second side information is obtained by inputting at least one data meeting the preset condition among the multiple data into the supercoding network.
  • the at least one piece of data meeting the preset condition in the at least one piece of coded data may include at least one piece of data in the coded data that is adjacent to the data to be coded.
  • the neighbors of the data to be coded may be the coded data of the first m bits of the data to be coded, m>0.
  • the neighbors of the data to be encoded can be the adjacent data of the data to be encoded, or the encoded data in the peripheral n circles of data around the data to be encoded, etc., n>0; the embodiment of the present application does not limit the neighbors.
  • the first context information is obtained based on at least one encoded data among the plurality of data
  • the second context information is obtained based on at least one data adjacent to the data to be encoded among the at least one encoded data.
  • the first context information has a higher utilization rate of encoded data and more comprehensive content.
  • the at least one piece of data that meets the preset condition among the pieces of data may include at least one piece of data that is adjacent to the data to be encoded among the pieces of data.
  • the neighbors of the data to be encoded may be the first m1 bits and/or the last m2 bits of the data to be encoded, m1, m2 > 0.
  • the neighbors of the data to be encoded may be the adjacent data of the data to be encoded, or the data of the outer n circles around the data to be encoded, etc., n>0; the embodiment of the present application does not limit the neighbors.
  • the first side information is obtained based on a plurality of data
  • the second side information is obtained based on at least one data adjacent to the data to be encoded among the plurality of data.
  • the first side information has a higher utilization rate of data and more comprehensive content.
  • the first context information is obtained by inputting at least one decoded data into the self-attention decoding network
  • the first side information is obtained by entropy decoding the code stream of the first side information.
  • the second context information is obtained by inputting at least one piece of data meeting the preset condition in the at least one piece of decoded data into the masked convolutional network.
  • the second side information is obtained by performing entropy decoding on the code stream of the second side information.
  • the reference information only includes the first context information.
  • FIG. 9 is a schematic structural diagram of an encoder provided as an example, and FIG. 10 is a schematic structural diagram of a decoder provided as an example.
  • the encoder includes an encoding network, a quantization module, a self-attention decoding network and an entropy encoding module.
  • the functions of the networks or modules that are the same as those in FIG. 8 are also the same, and the embodiment of the present application will not repeat them here.
  • the self-attention decoding network is used to extract the first context information from the quantized feature data, and the entropy coding module is used to perform entropy coding on the quantized feature data according to the first context information to obtain a code stream.
  • the decoder includes a self-attention decoding network, an entropy decoding module, and a decoding network.
  • the self-attention decoding network is used to extract the first context information from the decoded data, and the entropy decoding module is used to perform entropy decoding on the code stream according to the first context information.
  • FIG. 11 is a schematic structural diagram of an encoder provided in an embodiment of the present application
  • the encoder includes an encoding network, a self-attention encoding network, a quantization module, a decomposition entropy model, an entropy encoding module, an entropy decoding module, and a self-attention decoding network.
  • the self-attention encoding network is used to extract the first side information from the feature data obtained after feature extraction, the decomposition entropy model is used to estimate the estimated probability distribution of the first side information, and the entropy encoding module is used to perform entropy encoding on the first side information according to the estimated probability distribution of the first side information to obtain code stream 2.
  • the entropy decoding module is configured to perform entropy decoding on the code stream 2 according to the estimated probability distribution of the first side information to obtain the first side information.
  • the self-attention decoding network is used to estimate the estimated probability distribution of the current data stream according to the first side information.
  • the entropy encoding module is used to perform entropy encoding on the quantized feature data according to the estimated probability distribution of the current data stream to obtain the code stream 1 .
  • the decoder includes an entropy decoding module, a self-attention decoding network, and a decoding network.
  • the entropy decoding module is used to perform entropy decoding on the code stream 2 to obtain the first side information
  • the self-attention decoding network is used to estimate the estimated probability distribution of the code stream 1 according to the first side information
  • the entropy decoding module is used to perform entropy decoding on code stream 1 according to the estimated probability distribution of code stream 1.
  • the reference information includes the first context information and the first side information.
  • FIG. 13 is a schematic structural diagram of an encoder provided by an embodiment of the present application.
  • 14 is a schematic structural diagram of a decoder provided in the embodiment of the present application.
  • the encoder includes an encoding network, a self-attention encoding network, a quantization module, a decomposition entropy model, an entropy encoding module, an entropy decoding module and a self-attention decoding network.
  • the self-attention decoding network is used to extract the first context information from the quantized feature data, and estimate the estimated probability distribution of the current data stream according to the first context information and the first side information.
  • the decoder includes an entropy decoding module, a self-attention decoding network, and a decoding network.
  • the self-attention decoding network is used to extract the first context information from the decoded data, and estimate the estimated probability distribution of code stream 1 according to the first context information and the first side information.
  • Both the self-attention decoding network and the self-attention encoding network are neural networks with a self-attention mechanism (ie, including a self-attention structure).
  • the self-attention mechanism is a variant of the attention mechanism, which reduces the dependence on external information and can better obtain the internal correlation of data or features.
  • FIG. 15 is a schematic diagram of a self-attention structure provided by the embodiment of the present application.
  • the input of the self-attention structure includes three tensors: queries (Query, Q), keys (Key, K) and values (Value, V).
  • the self-attention structure includes matrix multiplication (MatMul) operations, scaling (Scale) operations, mask (Mask) operations, and exponential normalization (Softmax) operations.
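  • A hedged, single-head sketch of how the MatMul, Scale, Mask and Softmax operations of FIG. 15 compose (plain Python lists; the learned projections producing Q, K and V are omitted):

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V, masked=True):
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]            # MatMul(Q, K^T) + Scale
    if masked:                                             # Mask future positions
        for i, row in enumerate(scores):
            for j in range(i + 1, len(row)):
                row[j] = float("-inf")
    weights = [softmax(row) for row in scores]             # Softmax
    return [[sum(w * v_row[c] for w, v_row in zip(w_row, V))
             for c in range(len(V[0]))] for w_row in weights]  # MatMul with V

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
ctx = attention(Q, K, V)  # with the mask, row i attends only to positions <= i
```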
  • Figure 16 is a schematic structural diagram of a self-attention encoding network provided by the embodiment of the present application.
  • the self-attention encoding network includes an operation of embedding position encoding into the input, and an N1 part; the N1 part includes multi-head attention mechanism operations, summation and normalization operations, and feedforward operations.
  • FIG. 17 is a schematic structural diagram of a self-attention decoding network provided by an embodiment of the present application.
  • the self-attention decoding network includes an operation of embedding position codes into the input, and an N2 part; the N2 part includes masked multi-head attention mechanism operations, summation and normalization operations, and feedforward operations.
  • FIG. 18 is a flowchart of a process 100 of the entropy encoding method provided by the embodiment of the present application.
  • the process 100 can be executed by an encoder, specifically, it can be executed by an entropy coding unit of the encoder.
  • the process 100 is described as a series of steps or operations. It should be understood that the process 100 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 18. It is assumed that an encoder is being used for a current data stream with multiple data, and a process 100 including the following steps is performed to entropy encode the data.
  • Process 100 may include:
  • Step 101 Obtain the data to be encoded, where the data to be encoded is the non-first encoded data among the multiple data included in the current data stream.
  • Step 102 obtain reference information, the reference information includes at least one of the first context information and the first side information, the first context information is obtained by inputting at least one coded data into the self-attention decoding network, the first side information It is obtained by feeding multiple data into the self-attention encoding network.
  • Step 103 estimating and obtaining a first estimated probability distribution according to the reference information.
  • Step 104 Perform entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
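  • A hedged end-to-end sketch of process 100; the trained networks and the arithmetic coder appear only as placeholder callables, not as interfaces defined by this application:

```python
def entropy_encode_stream(data_stream, attention_decoder, attention_encoder,
                          estimate_distribution, arithmetic_encode):
    side_info = attention_encoder(data_stream)            # first side information
    bitstreams, encoded = [], [data_stream[0]]            # first datum handled separately
    for x in data_stream[1:]:                             # step 101: non-first data to encode
        context = attention_decoder(encoded)              # step 102: first context information
        dist = estimate_distribution(context, side_info)  # step 103: first estimated probability distribution
        bitstreams.append(arithmetic_encode(x, dist))     # step 104: first code stream
        encoded.append(x)                                 # x joins the encoded data
    return bitstreams
```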
  • FIG. 19 is a flowchart of a process 200 of an entropy decoding method provided by an embodiment of the present application.
  • the process 200 can be executed by a decoder, specifically, it can be executed by an entropy decoding unit of the decoder.
  • the process 200 is described as a series of steps or operations. It should be understood that the process 200 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 19. It is assumed that a decoder is being used for a current data stream with multiple data, and a process 200 including the following steps is performed to entropy decode the data.
  • Process 200 may include:
  • Step 201 Acquire a first code stream.
  • Step 202 Obtain reference information.
  • the reference information includes at least one of the first context information and the decoded first side information.
  • the first context information is obtained by inputting at least one decoded data into the self-attention decoding network.
  • the decoded first side information is obtained by performing entropy decoding on the second code stream.
  • Step 203 estimating and obtaining a first estimated probability distribution according to the reference information.
  • Step 204 Perform entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
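  • A matching sketch of process 200; the decoder mirrors the encoder by rebuilding the context from already-decoded data (placeholder callables as before):

```python
def entropy_decode_stream(first_datum, bitstreams, attention_decoder, side_info,
                          estimate_distribution, arithmetic_decode):
    decoded = [first_datum]                               # first datum decoded separately
    for bits in bitstreams:                               # step 201: one first code stream per datum
        context = attention_decoder(decoded)              # step 202: first context information
        dist = estimate_distribution(context, side_info)  # step 203: first estimated probability distribution
        decoded.append(arithmetic_decode(bits, dist))     # step 204: non-first decoded data
    return decoded
```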
  • FIG. 20 is a flowchart of a process 300 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • the process 300 can be performed by an encoder and a decoder, specifically, it can be performed by an entropy encoding unit of the encoder and an entropy decoding unit of the decoder.
  • the process 300 is described as a series of steps or operations. It should be understood that the process 300 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 20. It is assumed that an encoder and a decoder are being used for a current data stream with multiple data, and a process 300 including the following steps is performed to entropy encode and decode the data.
  • Process 300 may include:
  • step 301 the encoder acquires data to be encoded, and the data to be encoded is the non-first encoded data among multiple data included in the current data stream.
  • Multiple data can also be referred to as multiple data units.
  • multiple data can include video data, image data, audio data, integer data, and other data with compression/decompression requirements; the embodiment of the present application does not limit this. Each data corresponds to a piece of position information, and the data to be encoded is not at the first position among the multiple data.
  • the current data stream may be in a one-dimensional format or a two-dimensional format, and the embodiment of the present application does not limit the format of the current data stream.
  • the encoder can directly use the initial data stream as the current data stream, or can flatten an initial data stream in a non-one-dimensional format into a one-dimensional format to get the current data stream; in this case, each data can be regarded as a "word" in a text.
  • when flattening an initial data stream in a non-one-dimensional format into a one-dimensional format, it may be flattened in a preset order.
  • the two-dimensional initial data stream can be flattened in the order of top to bottom and left to right, or in the order of bottom to top and left to right, or in another preset order, as illustrated below; the embodiment of the present application does not limit the order of flattening.
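  • A small illustration of two such flattening orders on a 4×4 initial data stream (the values are illustrative):

```python
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

top_down = [x for row in grid for x in row]             # top to bottom, left to right
bottom_up = [x for row in reversed(grid) for x in row]  # bottom to top, left to right

assert top_down[:4] == [1, 2, 3, 4]
assert bottom_up[:4] == [13, 14, 15, 16]
```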
  • after the data to be encoded is obtained, it can also be quantized, which can reduce the amount of data required to represent the data to be encoded, so that the code rate in the subsequent entropy encoding process is reduced, thereby effectively reducing the entropy encoding overhead.
  • the quantization process may be performed in a manner such as scalar quantization or vector quantization, and the embodiment of the present application does not limit the quantization process manner.
  • entropy encoding is generally performed on the first data first, and then entropy encoding is performed on the data to be encoded.
  • the fourth estimated probability distribution may be obtained by estimating according to preset information.
  • a fourth estimated probability distribution is estimated by using a learnable model obtained through training, and then entropy encoding is performed on the first encoded data according to the fourth estimated probability distribution to obtain a fourth code stream.
  • the embodiment of the present application does not limit the manner of obtaining the fourth estimated probability distribution.
  • step 302 the encoder acquires first context information.
  • the first context information is obtained by inputting at least one encoded data among the plurality of data included in the current data stream into the self-attention decoding network, and the encoded data refers to data that has been entropy-encoded by the encoder among the plurality of data. Since there is no encoded data when performing entropy encoding on the first data of the current data stream, the data to be encoded needs to be the non-first data of the current data stream, so that the first context information can be extracted.
  • the first context information obtained based on at least one of the encoded data among the plurality of data has less data redundancy, and the utilization rate of the encoded data is higher.
  • the code rate in the entropy encoding process is smaller, so inputting at least one encoded data in the plurality of data into the self-attention decoding network to obtain the first context information can reduce the entropy The code rate in the encoding process, thereby reducing the overhead of entropy encoding.
  • the code rate is the average code length required to entropy encode a unit of data.
  • the self-attention decoding network is a neural network with a self-attention mechanism (that is, including a self-attention structure), which has a global receptive field, and can obtain the correlation between all the input encoded data and the data to be encoded.
  • the correlation can be expressed as the weights of all input encoded data relative to the data to be encoded.
  • after the self-attention decoding network obtains the weights of all the input encoded data relative to the data to be encoded, it weights the corresponding encoded data according to the weights to obtain the first context information.
  • the self-attention decoding network may weight all input encoded data with corresponding weights to obtain the first context information.
  • the utilization rate of encoded data in the process of acquiring the first context information is improved.
  • the first estimated probability distribution is subsequently estimated by using the first context information, the accuracy of the obtained first estimated probability distribution can be further improved, and the code rate in the entropy encoding process can be further reduced, thereby further reducing the entropy encoding overhead.
  • the self-attention decoding network may select the input part of the encoded data according to the obtained weight, and weight the part of the encoded data with the corresponding weight to obtain the first context information.
  • the obtained weights may be sorted in descending order, and the encoded data corresponding to the top i1 weights are selected for weighting.
  • the obtained weights may be sorted in ascending order, and the encoded data corresponding to the last i2 weights are selected for weighting.
  • the utilization rate of the coded data with higher weight in the process of obtaining the first context information can be guaranteed, and when the first estimated probability distribution is estimated by using the first context information subsequently , can further improve the accuracy of the obtained first estimated probability distribution, and further reduce the code rate in the process of entropy coding, thereby further reducing the overhead of entropy coding.
  • the self-attention decoding network can perform an embedding operation on each data in the current data stream.
  • the embedding operation refers to converting each data from the original data space to another space.
  • position encoding can also be performed on each data to obtain the position information of each data, and the position information of each data is combined with the data.
  • each data has coordinate information, and position encoding refers to extracting the position information of each data according to its coordinate information.
  • the position information of each data can be combined with the data by bitwise addition or concatenation, and the embodiment of the present application does not limit the position encoding method.
  • the input of the self-attention decoding network includes three tensors Q, K and V.
  • Q, K and V pass through the masked multi-head self-attention mechanism, summation and normalization operations, the multi-head attention mechanism, summation and normalization operations, a feedforward operation, summation and normalization operations, and a linearization operation, and the first context information is output.
  • Q, K, and V refer to tensors of encoded data, for example, tensors obtained by performing the embedding operation and position encoding on the non-first encoded data in the foregoing process.
  • Step 303 the encoder estimates and obtains a first estimated probability distribution according to the first context information.
  • the first estimated probability distribution may comprise at least one estimated probability parameter.
  • the at least one estimated probability parameter may include a mean (mean) and a variance (scale), and the mean and the scale form a Gaussian distribution.
  • the first context information may be input into the probability distribution estimation network to obtain a first estimated probability distribution output by the probability distribution estimation network.
  • the probability distribution estimation network may be a single neural network, or a structure in the self-attention decoding network, which is not limited in this embodiment of the present application.
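Assuming, per the text, a Gaussian parameterized by a mean and a scale, a one-layer estimation head might look like the following sketch; the projection weights and the softplus activation are illustrative choices, not mandated by the text:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
context = rng.standard_normal(d)                 # first context information

# Hypothetical one-layer estimation head: two linear projections.
W_mean  = rng.standard_normal(d)
W_scale = rng.standard_normal(d)

mean  = float(context @ W_mean)
scale = float(np.log1p(np.exp(context @ W_scale)))   # softplus keeps scale > 0

# mean and scale parameterize the Gaussian N(mean, scale^2) that serves as
# the first estimated probability distribution for entropy coding.
```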
  • Figure 21 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • Figure 21 takes as an example an initial data stream in two-dimensional format that needs to be flattened.
  • the initial data stream a includes 16 positions a1 to a16 arranged in 4×4, each position corresponding to one piece of data.
  • the data corresponding to position a10 is data to be encoded, the data corresponding to positions a1 to a9 are all encoded data, and the data corresponding to other positions are unencoded data, and each encoded data corresponds to a first estimated probability distribution.
  • the initial data stream a is flattened into a one-dimensional format from top to bottom and from left to right to obtain a current data stream b including 16 positions a1 to a16 arranged in sequence.
  • the current data stream b is input into the self-attention decoding network, and the self-attention decoding network determines the position information of each data in the current data stream b, and combines the position information of each data with the data.
  • the self-attention decoding network outputs the first context information based on the encoded data in data stream b combined with position information (that is, the data corresponding to positions a1 to a9); the first context information is input to the probability distribution estimation network, and the probability distribution estimation network outputs the first estimated probability distribution, that is, the estimated probability distribution of the data corresponding to position a10.
  • the process shown in FIG. 21 is only an exemplary description, and does not limit the process of obtaining the first estimated probability distribution.
  • Step 304 the encoder performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • the encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution, and then perform entropy encoding on the data to be encoded according to the probability value.
  • the first code stream may be in binary format.
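One common way to turn the first estimated probability distribution into the probability value used by the entropy coder is to integrate the Gaussian over the quantization bin of the symbol. The sketch below assumes integer quantization and an arbitrary 1e-9 probability floor, and also shows the ideal code length an arithmetic coder approaches:

```python
import math

def gaussian_cdf(x, mean, scale):
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(symbol, mean, scale):
    # Probability mass of an integer-quantized symbol under N(mean, scale^2):
    # the CDF difference over the quantization bin [symbol - 0.5, symbol + 0.5].
    p = gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)
    return max(p, 1e-9)        # floor so the entropy coder never sees probability 0

p = symbol_probability(3, mean=2.7, scale=1.2)
ideal_bits = -math.log2(p)     # code length an arithmetic coder approaches
```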
  • the aforementioned steps 301 to 304 are described by taking the estimation to obtain the first estimated probability distribution, and performing entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream as an example.
  • each piece of non-first data included in the current data stream can be used as the data to be encoded in turn, the first estimated probability distribution is obtained according to the process shown in the aforementioned steps 301 to 304, and entropy encoding is performed according to the first estimated probability distribution, so as to obtain a code stream for each piece of non-first data. It should be noted that, after each piece of data is encoded, it is added to the encoded data.
  • Step 305 the encoder sends the first code stream to the decoder.
  • the encoder and the decoder each have a communication interface with an established communication connection, and the encoder can send the first code stream to the communication interface of the decoder through its own communication interface.
  • the encoder performs entropy encoding on each piece of non-first data included in the current data stream to obtain a code stream of each piece of non-first data. Furthermore, the current code stream is obtained according to the code stream of each piece of non-first data, and the current code stream includes the code streams of the multiple pieces of non-first data arranged according to the order in which the encoder encoded them. Of course, the current code stream includes the first code stream. The encoder can then send the current code stream including the first code stream to the decoder.
  • the fourth code stream may be included in the current code stream to be transmitted to the decoder.
  • the encoder sends the fourth code stream to the decoder independently, and this embodiment of the present application does not limit the way of sending the fourth code stream.
  • Step 306 the decoder acquires the first context information.
  • the first code stream belongs to the current code stream received by the decoder, and the decoded data obtained after decoding the first code stream is non-first decoded data among the multiple data contained in the current data stream.
  • the first context information may be obtained by inputting at least one piece of decoded data into the self-attention decoding network, where decoded data refers to data obtained by entropy decoding performed before the first code stream is decoded. Since there is no decoded data yet when entropy decoding is performed on the fourth code stream, the decoded data obtained after decoding the first code stream is non-first decoded data among the multiple data contained in the current data stream, from which the first context information can be extracted.
  • for the acquisition process of the first context information, reference may be made to the aforementioned step 302, and details are not described here in this embodiment of the present application.
  • the decoder when it performs entropy decoding on each code stream in the received current code stream, it usually performs entropy decoding on the fourth code stream first.
  • the decoder can estimate and obtain the fourth estimated probability distribution according to preset information, or estimate the fourth estimated probability distribution using a learnable model obtained through training, and then perform entropy decoding on the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, where the decoded first data is the first decoded data among the plurality of data.
  • the embodiment of the present application does not limit the manner of obtaining the fourth estimated probability distribution.
  • the fourth estimated probability distribution estimated by the decoder needs to be consistent with the fourth estimated probability distribution estimated by the encoder. For example, when the encoder estimates and obtains the fourth estimated probability distribution according to preset information, the decoder obtains the fourth estimated probability distribution according to the same fixed information. When the encoder estimates the fourth estimated probability distribution by using the learnable model obtained through training, the decoder estimates the fourth estimated probability distribution according to the same learnable model, and the estimated fourth estimated probability distribution is the same.
  • Step 307 the decoder estimates and obtains a first estimated probability distribution according to the first context information.
  • the first context information may be input into the probability distribution estimation network to obtain a first estimated probability distribution output by the probability distribution estimation network.
  • the probability distribution estimation network may be a single neural network, or a structure in the self-attention decoding network, which is not limited in this embodiment of the present application.
  • Step 308 the decoder performs entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
  • the decoder may calculate the probability value of the first code stream according to the first estimated probability distribution, and then perform entropy decoding on the first code stream according to the probability value. It should be noted that after each piece of data is obtained through decoding, it is added to the decoded data.
  • the aforementioned steps 306 to 308 are described by taking the first estimated probability distribution obtained through estimation, and performing entropy decoding on the first code stream according to the first estimated probability distribution as an example.
  • Each code stream included in the current code stream can be used as the first code stream respectively, and the first estimated probability distribution is obtained according to the process shown in the foregoing step 306 to step 308, and entropy decoding is performed according to the first estimated probability distribution.
  • the obtained decoded data is in a one-dimensional format.
  • the decoder may transform the decoded data into a two-dimensional format according to the two-dimensional distribution information of the decoded data, so as to obtain two-dimensional decoded data having the same arrangement as the current data stream acquired by the encoder.
  • the two-dimensional distribution information may include the number and arrangement of decoded data arranged in the length direction and width direction of the two-dimensional plane, respectively.
  • the two-dimensional distribution information can be pre-stored in the decoder, or can be sent by the encoder.
  • the embodiment of the present application does not limit the content and acquisition method of the two-dimensional distribution information, as long as it can be guaranteed that the two-dimensional decoded data and the current data stream obtained by the encoder have the same arrangement.
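For the 4×4 example used throughout, the flattening at the encoder and the restoration at the decoder reduce to row-major reshapes, with the two-dimensional distribution information being the (rows, cols) pair; a minimal sketch:

```python
import numpy as np

a = np.arange(1, 17).reshape(4, 4)   # the 4x4 initial data stream a

# Encoder: flatten top-to-bottom, left-to-right (row-major order).
b = a.reshape(-1)                    # current data stream b, positions a1..a16

# Decoder: restore the 2-D arrangement from the one-dimensional decoded data
# using the two-dimensional distribution information (rows, cols).
rows, cols = 4, 4
restored = b.reshape(rows, cols)
assert (restored == a).all()
```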
  • in the related art, the adjacent encoded data of the data to be encoded is determined according to the position information of each data, context information is extracted from the adjacent encoded data by using a masked convolutional neural network, and entropy encoding is then performed on the data to be encoded based on that context information.
  • similarly, the adjacent decoded data of the data corresponding to the code stream to be decoded is determined, context information is extracted from the adjacent decoded data by using the masked convolutional neural network, and the code stream to be decoded is then entropy decoded based on that context information.
  • in this embodiment of the present application, the first context information is extracted from at least one piece of encoded or decoded data without considering the position encoding of each data, so the entropy encoding or entropy decoding of multiple data can be executed in parallel; parallel execution takes less time, and the efficiency of entropy encoding and entropy decoding is improved compared with the related art.
  • in the related art, a masked convolutional neural network is used to extract context information; when extracting the context information, only a local receptive field is used and the utilization rate of encoded or decoded data is low, so the estimated probability distribution obtained based on the context information is less accurate, resulting in high overhead for entropy encoding and entropy decoding.
  • in this embodiment of the present application, the self-attention decoding network with a self-attention mechanism can be used to obtain the weights of all the input encoded or decoded data, and then part or all of the input encoded or decoded data is weighted with the corresponding weights to obtain the first context information.
  • the utilization rate of the encoded data or the decoded data is improved, the data redundancy of the extracted first context information is less, and the accuracy of the obtained estimated probability distribution is further improved.
  • the code rate in the process of entropy encoding is reduced, thereby reducing the overhead of entropy encoding and entropy decoding.
  • Fig. 22 is a schematic diagram of the entropy coding performance provided by the embodiment of the present application. The coordinate system (22a) in Fig. 22 shows the performance of entropy coding the test set using the embodiment of the present application and the related art respectively under the multi-scale structural similarity (MS-SSIM) index, and the coordinate system (22b) shows the performance of entropy coding the test set using the embodiment of the present application and the related art respectively under the peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR) index.
  • the test set is the Kodak test set, and the Kodak test set includes 24 images in Portable Network Graphics (PNG) format.
  • the resolution of the 24 images can be 768 ⁇ 512 or 512 ⁇ 768.
  • in both coordinate systems, the abscissa represents bits per pixel (Bits per pixel, BPP), and the ordinate represents the image quality index (MS-SSIM or PSNR).
  • BPP represents the average number of bits used per pixel; the smaller the value, the higher the compression.
  • MS-SSIM and PSNR are objective standards for evaluating images, and the higher the value, the better the image quality.
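For reference, BPP and PSNR can be computed as below; the example bit count is made up, and MS-SSIM is more involved so it is omitted here:

```python
import numpy as np

def bpp(total_bits, height, width):
    # Average number of bits spent per pixel; smaller means stronger compression.
    return total_bits / (height * width)

def psnr(original, reconstructed, max_val=255.0):
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# e.g. a 768x512 Kodak image compressed to a hypothetical 118,000 bits:
print(round(bpp(118_000, 512, 768), 3))   # -> 0.3 BPP
```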
  • the broken line e1 in the coordinate system (22a) and the coordinate system (22b) represents the embodiment of the present application, and the broken line e2 represents the related technology.
  • the MS-SSIM index and PSNR index of the embodiment of the present application are higher than those of the related art at each code rate point, and under the same compression quality the code rate of the embodiment of the present application is lower than that of the related art, being 17% smaller than the related art, and 15% smaller at high bit rate points. That is, the compression performance of the embodiment of the present application is higher than that of the related art, and the embodiment of the present application can improve the accuracy of the estimated probability distribution of the data to be encoded or the data to be decoded.
  • the encoder obtains the current data stream and the first context information, estimates the first estimated probability distribution according to the first context information, performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream, and sends the first code stream to the decoder; the decoder obtains the first code stream and the first context information, estimates the first estimated probability distribution according to the first context information, and then performs entropy decoding on the first code stream according to the first estimated probability distribution.
  • the first context information is obtained by inputting at least one piece of encoded or decoded data into the self-attention decoding network, and the self-attention decoding network can weight all the input data with corresponding weights to obtain the first context information.
  • the utilization rate of encoded data in the process of acquiring the first context information is improved.
  • the accuracy of the obtained first estimated probability distribution can be improved, the code rate in the entropy encoding process can be further reduced, and the entropy encoding overhead can be further reduced. Therefore, the bandwidth occupancy rate of the first code stream transmitted to the decoder is reduced, and the transmission efficiency of the first code stream transmitted to the decoding side is improved.
  • FIG. 23 is a flowchart of a process 400 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • the process 400 can be performed by an encoder and a decoder, specifically, it can be performed by an entropy encoding unit of the encoder and an entropy decoding unit of the decoder.
  • the process 400 is described as a series of steps or operations. It should be understood that the process 400 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 23. It is assumed that a current data stream containing multiple pieces of data is being encoded and decoded with an encoder and a decoder, and that a process 400 including the following steps is performed to entropy encode and entropy decode the data.
  • Process 400 may include:
  • step 401 the encoder acquires data to be encoded included in the current data stream.
  • the data to be encoded may be the first encoded data or the non-first encoded data among multiple data contained in the current data stream, and the embodiment of the present application does not limit the position of the to-be-encoded data in the current data stream.
  • for this process, reference may be made to the aforementioned step 301, and details are not described here in this embodiment of the present application.
  • Step 402 the encoder obtains the first side information.
  • the first side information is obtained by feeding multiple data into the self-attention encoding network. Taking the initial data stream a shown in FIG. 21 as an example, the data corresponding to positions a1 to a16 can be input into the self-attention encoding network to obtain the first side information.
  • the content of the first side information obtained based on multiple data is relatively comprehensive.
  • the second estimated probability distribution is subsequently estimated by using the first side information, the accuracy of the obtained second estimated probability distribution can be improved, thereby reducing the code rate in the entropy encoding process, and reducing the entropy encoding overhead.
  • the self-attention encoding network is a neural network with a self-attention mechanism (that is, it includes a self-attention structure); it has strong feature transformation ability, so the quality of the extracted first side information is higher.
  • the first estimated probability distribution is estimated by using the first side information, the accuracy of the first estimated probability distribution can be improved. Therefore, the code rate in the process of entropy encoding is reduced, and the overhead of entropy encoding is reduced.
  • the self-attention encoding network has a global receptive field, and can obtain the correlation between all the input data and the data to be encoded.
  • the correlation can be the weight of all the input data relative to the data to be encoded.
  • after the self-attention encoding network obtains the weights of all the input data relative to the data to be encoded, it weights the corresponding data according to those weights to obtain the first side information.
  • the self-attention encoding network can weight all input data with corresponding weights to obtain the first side information. In this way, the utilization rate of data in the process of obtaining the first side information is improved.
  • the first estimated probability distribution is subsequently estimated by using the first side information, the accuracy of the obtained first estimated probability distribution can be further improved, and the code rate in the entropy encoding process can be further reduced, thereby further reducing the entropy encoding overhead.
  • the self-attention encoding network can select part of the input data according to the obtained weight, and weight the part of the data with the corresponding weight to obtain the first side information.
  • for this process, reference may be made to the aforementioned step 302, which will not be described in detail here in this embodiment of the present application.
  • the flexibility in the process of obtaining the first side information can be improved.
  • in this way, the utilization rate of data with higher weights in the process of obtaining the first side information can be guaranteed; when the first side information is used to estimate the first estimated probability distribution, the accuracy of the obtained first estimated probability distribution can be further improved, further reducing the code rate in the entropy coding process and thereby the entropy coding overhead.
  • the structure of the self-attention encoding network can refer to the aforementioned FIG. 16 , which will not be described in detail here in the embodiment of the present application.
  • the input of the self-attention encoding network includes three tensors Q, K, and V, and Q, K, and V sequentially undergo a multi-head attention mechanism, a summation and normalization operation, a feedforward operation, and a summation and normalization operation, and the first side information is output.
  • Q, K, and V refer to tensors of data, for example, tensors obtained by embedding and position encoding the data in the current data stream in the foregoing process.
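The encoding block differs from the decoding block mainly in that its attention is unmasked, so every position contributes; this is what gives the first side information its global, comprehensive content. A minimal single-head sketch under the same illustrative shapes as before:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(x, W_ff1, W_ff2):
    n, d = x.shape
    # No mask here: every datum attends to all positions a1..a16.
    attn = softmax(x @ x.T / np.sqrt(d)) @ x
    h = norm(x + attn)                       # summation and normalization
    ff = np.maximum(h @ W_ff1, 0.0) @ W_ff2  # feedforward
    return norm(h + ff)                      # summation and normalization

rng = np.random.default_rng(4)
x = rng.standard_normal((16, 32))            # all 16 data, embedded with positions
side_info = encoder_block(x, rng.standard_normal((32, 64)),
                          rng.standard_normal((64, 32)))
```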
  • Step 403 the encoder obtains a first estimated probability distribution according to the first side information estimation.
  • the first side information may be input into the probability distribution estimation network to obtain a first estimated probability distribution output by the probability distribution estimation network.
  • the probability distribution estimation network may be a single neural network, or a structure in the self-attention decoding network, which is not limited in this embodiment of the present application.
  • Figure 24 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • Figure 24 takes as an example an initial data stream in two-dimensional format that needs to be flattened, with the self-attention decoding network performing the probability distribution estimation.
  • the initial data stream a includes 16 positions a1 to a16 arranged in 4 ⁇ 4, and each position corresponds to one piece of data.
  • the initial data stream a is flattened into a one-dimensional format from top to bottom and from left to right to obtain a current data stream b including 16 positions a1 to a16 arranged in sequence.
  • the self-attention encoding network determines the position information of each data in the current data stream b, combines the position information of each data with the data, and outputs the first side information based on all the data in stream b combined with position information (that is, the data corresponding to positions a1 to a16).
  • the entropy encoding module uses the second estimated probability distribution to entropy encode the first side information to obtain the code stream of the first side information, and the entropy decoding module uses the second estimated probability distribution to entropy decode the code stream of the first side information to obtain the first side information.
  • the first side information is input to the self-attention decoding network, and the self-attention decoding network outputs the first estimated probability distribution (ie, the estimated probability distribution of the data corresponding to position a10).
  • step 404 the encoder performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • the encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution, and then perform entropy encoding on the data to be encoded according to the probability value.
  • the first code stream may be in binary format. For this process, reference may be made to the foregoing step 304, and details are not described here in this embodiment of the present application.
  • Step 405 the encoder sends the first code stream to the decoder.
  • Step 406 the encoder estimates and obtains a second estimated probability distribution.
  • the second estimated probability distribution may be obtained by estimating according to preset information.
  • the second estimated probability distribution is obtained by estimating the learnable model obtained through training.
  • the embodiment of the present application does not limit the manner of obtaining the second estimated probability distribution.
  • Step 407 the encoder performs entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
  • the encoder may calculate the probability value of the first side information according to the second estimated probability distribution, and then perform entropy encoding on the first side information according to the probability value.
  • the second code stream may be in binary format.
  • Step 408 the encoder sends the second code stream to the decoder.
  • Step 409 the decoder estimates and obtains a second estimated probability distribution.
  • the second estimated probability distribution may be obtained by estimating according to preset information.
  • the second estimated probability distribution may be obtained by estimating the learnable model obtained through training.
  • the embodiment of the present application does not limit the manner of obtaining the second estimated probability distribution.
  • the second estimated probability distribution estimated by the decoder needs to be consistent with the second estimated probability distribution estimated by the encoder. For example, when the encoder estimates and obtains the second estimated probability distribution according to preset information, the decoder obtains the second estimated probability distribution according to the same fixed information. When the encoder estimates the second estimated probability distribution using the learnable model obtained through training, the decoder estimates the second estimated probability distribution according to the same learnable model, and the estimated second estimated probability distributions are the same.
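The consistency requirement can be illustrated trivially: both sides must derive the second estimated probability distribution from identical inputs, or the side-information code stream cannot be decoded. In this sketch the preset values are of course hypothetical; a shared trained model would work the same way:

```python
# Hypothetical preset information shared by encoder and decoder.
PRESET = {"mean": 0.0, "scale": 1.0}

def estimate_second_distribution(preset):
    # Both sides run exactly the same estimation on exactly the same input.
    return (preset["mean"], preset["scale"])

enc_dist = estimate_second_distribution(PRESET)   # at the encoder
dec_dist = estimate_second_distribution(PRESET)   # at the decoder
assert enc_dist == dec_dist   # otherwise the second code stream cannot be decoded
```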
  • Step 410 the decoder performs entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
  • the decoder may calculate the probability value of the second code stream according to the second estimated probability distribution, and then perform entropy decoding on the second code stream according to the probability value.
  • Step 411 the decoder estimates and obtains a first estimated probability distribution according to the decoded first side information.
  • the decoded first side information may be input into the probability distribution estimation network to obtain a first estimated probability distribution output by the probability distribution estimation network.
  • Step 412 the decoder performs entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data.
  • the decoder may calculate the probability value of the first code stream according to the first estimated probability distribution, and then perform entropy decoding on the first code stream according to the probability value. For this process, reference may be made to the foregoing step 308, and details are not described here in this embodiment of the present application.
  • the encoder obtains the data to be encoded contained in the current data stream and the first side information, estimates the first estimated probability distribution according to the first side information, and performs entropy encoding on the data to be encoded accordingly.
  • the first side information is obtained by inputting multiple pieces of data into the self-attention encoding network, and the self-attention encoding network can weight all the input data with corresponding weights to obtain the first side information; the content of the first side information obtained in this way is relatively comprehensive.
  • when the first estimated probability distribution is subsequently estimated using the first side information, the accuracy of the obtained first estimated probability distribution can be improved and the code rate in the entropy coding process can be reduced, thereby reducing the entropy coding overhead and the bandwidth occupancy rate when the first code stream is transmitted to the decoder, and improving the transmission efficiency of the first code stream transmitted to the decoder.
  • FIG. 25 is a flowchart of a process 500 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • the process 500 can be performed by an encoder and a decoder, specifically, it can be performed by an entropy encoding unit of the encoder and an entropy decoding unit of the decoder.
  • the process 500 is described as a series of steps or operations. It should be understood that the process 500 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 25. It is assumed that a current data stream containing multiple pieces of data is being encoded and decoded with an encoder and a decoder, and that a process 500 including the following steps is performed to entropy encode and entropy decode the data.
  • Process 500 may include:
  • step 501 the encoder obtains data to be encoded, and the data to be encoded is the non-first encoded data among multiple data included in the current data stream.
  • Step 502 the encoder acquires first context information and first side information.
  • Step 503 the encoder estimates and obtains a first estimated probability distribution according to the first context information and the first side information.
  • the encoder may aggregate the first context information and the first side information, and estimate and obtain a first estimated probability distribution according to the aggregated information.
  • the encoder may aggregate the first context information and the first side information through an aggregation network.
  • the aggregation network can include a self-attention decoding network.
  • the self-attention decoding network has a self-attention mechanism, which can fully capture the complementarity of the first context information and the first side information, and then use these two kinds of information to efficiently estimate the first estimated probability distribution, thereby improving the accuracy of the estimated first estimated probability distribution.
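A minimal stand-in for the aggregation step is concatenation followed by a projection to the Gaussian parameters; as the text notes, the actual aggregation network may itself be a self-attention decoding network, so this sketch only illustrates the data flow, and the projection weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 32
ctx  = rng.standard_normal(d)   # first context information
side = rng.standard_normal(d)   # decoded first side information

# Concatenate the two complementary information sources and project them
# to the Gaussian parameters (mean, scale) of the first estimated distribution.
agg = np.concatenate([ctx, side])
W = rng.standard_normal((2 * d, 2))       # hypothetical projection weights
mean, raw_scale = agg @ W
scale = np.log1p(np.exp(raw_scale))       # softplus keeps the scale positive
```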
  • Step 504 the encoder performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • Step 505 the encoder sends the first code stream to the decoder.
  • the encoder performs entropy encoding on each piece of non-first data included in the current data stream to obtain a code stream of each piece of non-first data, and the current code stream is then obtained according to these code streams. For the first encoded data, its code stream can be included in the current code stream to be transmitted to the decoder, or the encoder sends its code stream to the decoder separately. For this process, reference may be made to the foregoing step 305, and details are not described here in this embodiment of the present application.
  • Step 506 the encoder estimates to obtain a second estimated probability distribution.
  • Step 507 the encoder performs entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
  • Step 508 the encoder sends the second code stream to the decoder.
  • the encoder can send the second code stream to the decoder separately, or add the second code stream to the first code stream and send it to the decoder; this is not limited in this embodiment of the present application.
  • Step 509 the decoder obtains the first context information.
  • for the manner of acquiring the first context information, reference may be made to the aforementioned step 306, which will not be described in detail here in this embodiment of the present application.
  • Step 510 the decoder estimates and obtains a second estimated probability distribution.
  • the second code stream is the code stream of the first side information.
  • Step 511 the decoder performs entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
  • Step 512 the decoder estimates and obtains a first estimated probability distribution according to the first context information and the decoded first side information.
  • Step 513 the decoder performs entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
  • for the process of performing entropy decoding on the first code stream, reference may be made to the foregoing step 308, and details are not described here in this embodiment of the present application. It should be noted that after each piece of data is decoded, it is added to the decoded data.
  • the encoder obtains the current data stream, the first context information and the first side information, estimates the first estimated probability distribution according to the first context information and the first side information, performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream, and sends the first code stream to the decoder; the encoder also estimates the second estimated probability distribution, performs entropy encoding on the first side information according to the second estimated probability distribution to obtain the second code stream, and sends the second code stream to the decoder; the decoder estimates the second estimated probability distribution, performs entropy decoding on the second code stream according to the second estimated probability distribution to obtain the first side information, estimates the first estimated probability distribution according to the first context information and the first side information, and then performs entropy decoding on the first code stream according to the first estimated probability distribution.
  • the first context information is obtained by inputting at least one piece of decoded data into the self-attention decoding network, and the first side information is obtained by inputting multiple pieces of data into the self-attention encoding network; the self-attention encoding network can weight all the input data with corresponding weights to obtain the first side information. In this way, the utilization rate of encoded data in the process of acquiring the first context information is improved, and the content of the obtained first side information is more comprehensive.
  • the accuracy of the obtained first estimated probability distribution can be improved, the code rate in the entropy coding process can be further reduced, and the entropy coding overhead can be further reduced , thereby reducing the entropy encoding overhead and the bandwidth occupancy rate when the first code stream is transmitted to the decoder, and improving the transmission efficiency of the first code stream transmitted to the decoder.
  • FIG. 26 is a flowchart of a process 600 of the entropy encoding and decoding method provided by the embodiment of the present application.
  • the process 600 can be performed by an encoder and a decoder, specifically, it can be performed by an entropy encoding unit of the encoder and an entropy decoding unit of the decoder.
  • the process 600 is described as a series of steps or operations. It should be understood that the process 600 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 26. It is assumed that a current data stream containing multiple pieces of data is being encoded and decoded with an encoder and a decoder, and that a process 600 including the following steps is performed to entropy encode and entropy decode the data.
  • Process 600 may include:
  • step 601 the encoder acquires data to be encoded, and the data to be encoded is the non-first encoded data among multiple data included in the current data stream.
  • entropy encoding is generally performed on the first data first, and then entropy encoding is performed on the data to be encoded.
  • the fourth estimated probability distribution may be obtained by estimating according to preset information.
  • alternatively, the fourth estimated probability distribution is estimated using a learnable model obtained through training, or is estimated according to the first side information and/or the second side information.
  • Entropy encoding is then performed on the first encoded data according to the fourth estimated probability distribution to obtain a fourth code stream.
  • the embodiment of the present application does not limit the manner of obtaining the fourth estimated probability distribution.
  • Step 602 the encoder acquires first context information, second context information, first side information and second side information.
  • for the manner of acquiring the first context information, refer to the aforementioned step 302, and for the manner of acquiring the first side information, refer to the aforementioned step 402, which will not be repeated in this embodiment of the present application.
  • the second context information is obtained by inputting at least one piece of data that meets the preset condition in the at least one coded data into a masked convolution network (Masked Convolution network).
  • the at least one piece of data that meets the preset condition may be at least one piece of data that is adjacent to the data to be encoded among the at least one encoded data of the plurality of data.
  • the encoded data is used in the process of obtaining the second context information, which can improve the accuracy of the first estimated probability distribution obtained by the subsequent estimation, thereby reducing the code rate in the process of entropy encoding and reducing the overhead of entropy encoding.
  • that is, the first context information in the embodiment shown in FIG. 20 is obtained based on at least one piece of encoded data among the plurality of data, while the second context information in step 602 is obtained based on at least one piece of that encoded data which is adjacent to the data to be encoded.
  • for example, the first context information is obtained based on the data corresponding to positions a1 to a9, while the second context information is obtained based on at least one piece of encoded data adjacent to position a10 (for example, the data corresponding to positions a6 and a9). That is, compared with the second context information, the first context information has a higher utilization rate of encoded data and more comprehensive content.
  • the second side information is obtained by inputting at least one data that meets the preset conditions among the multiple data into a hyperencoder network (Hyper Encoder).
  • the at least one piece of data that meets the preset condition may be at least one piece of data that is adjacent to the data to be encoded among the multiple pieces of data.
  • that is, the first side information in the embodiment shown in FIG. 23 is obtained based on a plurality of data, while the second side information in step 602 is obtained based on at least one piece of data adjacent to the data to be encoded among the plurality of data.
  • for example, the first side information is obtained based on the data corresponding to positions a1 to a16, while the second side information is obtained based on at least one piece of data adjacent to position a10 (such as the data corresponding to positions a6, a9, a11 and a14). That is, compared with the second side information, the first side information has a higher utilization rate of data and more comprehensive content.
  • masked convolutional networks and super-encoding networks have local receptive fields.
  • the masked convolutional network includes a masked convolutional layer or a regular convolutional layer, the input of which is at least one piece of data adjacent to the data to be encoded in at least one encoded data, and the output is the activation feature of the convolution output, that is, the second context information.
  • the super-encoding network includes a conventional convolutional layer, whose input is at least one data adjacent to the data to be encoded among the multiple data, and the output is the activation feature of the convolution output, that is, the second side information.
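A PixelCNN-style mask makes the distinction between the two convolutions concrete: the masked convolution below zeroes the kernel centre and everything after it in raster order, so the second context information depends only on already coded neighbours, while the super-encoding network would use the same loop with an unmasked kernel. The kernel size and weights are illustrative:

```python
import numpy as np

def masked_conv2d(x, weight):
    # "Mask A": zero the kernel centre and every tap after it in raster order,
    # so the output at (i, j) only depends on neighbours coded before (i, j).
    k = weight.shape[0]
    mask = np.ones_like(weight)
    mask[k // 2, k // 2:] = 0.0
    mask[k // 2 + 1:, :] = 0.0
    w = weight * mask
    pad = k // 2
    xp = np.pad(x, pad)                    # zero padding keeps the output size
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(6)
x = rng.standard_normal((4, 4))            # the 4x4 data plane of Figure 21
second_context = np.maximum(masked_conv2d(x, rng.standard_normal((3, 3))), 0.0)
# The super-encoding network would apply the same loop with an unmasked kernel,
# again over a local (kernel-sized) neighbourhood of the data to be encoded.
```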
  • the way the encoder obtains the second context information through the masked convolutional network, the way the encoder obtains the second side information through the super-encoding network, and the architectures of the masked convolutional network and the super-encoding network can all refer to the relevant content of the self-attention decoding network in step 302 above, and are not described here in this embodiment of the present application.
  • the first estimated probability distribution can be obtained by combining the first context information, the first side information, the second context information and the second side information, which can further improve the accuracy of the obtained first estimated probability distribution, reduce the code rate in the entropy encoding process, and thereby reduce the entropy encoding overhead.
  • Step 603 the encoder estimates and obtains a first estimated probability distribution according to the first context information, the second context information, the first side information and the second side information.
  • the encoder may aggregate the first context information, the second context information, the first side information, and the second side information, and estimate and obtain a first estimated probability distribution according to the aggregated information.
  • the encoder may aggregate the first context information, the second context information, the first side information and the second side information through an aggregation network.
  • the aggregation network can include a self-attention decoding network.
  • the self-attention decoding network has a self-attention mechanism, which can fully capture the complementarity of the first context information, the second context information, the first side information and the second side information, and then use these four kinds of information to efficiently estimate the first estimated probability distribution, thereby improving the accuracy of the estimated first estimated probability distribution.
  • FIG. 27 is a schematic diagram of the process of obtaining the first estimated probability distribution provided by the embodiment of the present application.
  • FIG. 27 takes the initial data stream a shown in FIG. 21 as an example: the initial data stream a is flattened from top to bottom and from left to right into a one-dimensional format, and the current data stream b including 16 positions a1 to a16 arranged in sequence is obtained.
  • the current data stream b is fed into the super-encoding network, self-attention encoding network, self-attention decoding network and masked convolutional network respectively.
  • the super-encoding network and the self-attention encoding network output the second side information and the first side information respectively; the factorized entropy model estimates the second estimated probability distribution, and the hyper entropy model estimates the third estimated probability distribution.
  • the entropy encoding module entropy encodes the first side information according to the second estimated probability distribution, the entropy decoding module performs entropy decoding on the first side information according to the second estimated probability distribution, and inputs the entropy decoded first side information into the aggregation network.
  • the entropy encoding module entropy encodes the second side information according to the third estimated probability distribution, the entropy decoding module performs entropy decoding on the second side information according to the third estimated probability distribution, and inputs the entropy decoded second side information into the aggregation network.
  • the self-attention decoding network and the masking convolutional network output the first context information and the second context information respectively, and both the first context information and the second context information are input to the aggregation network.
  • the aggregation network aggregates the input first context information, second context information, first side information and second side information, and outputs the first estimated probability distribution (ie, the estimated probability distribution of the data corresponding to position a10).
  • the process shown in FIG. 27 is only an exemplary description, and does not limit the process of obtaining the first estimated probability distribution.
  • step 604 the encoder performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
  • Step 605 the encoder sends the first code stream to the decoder.
  • the encoder performs entropy encoding on each piece of non-first data included in the current data stream to obtain a code stream of each piece of non-first data, and the current code stream is then obtained according to these code streams. For the first encoded data, the fourth code stream can be included in the current code stream for transmission to the decoder, or the encoder sends the fourth code stream to the decoder separately. For this process, reference may be made to the foregoing step 305, and details are not described here in this embodiment of the present application.
  • Step 606 the encoder estimates and obtains a second estimated probability distribution.
  • Step 607 the encoder performs entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
  • Step 608 the encoder sends the second code stream to the decoder.
  • the encoder can send the second code stream to the decoder separately, or add the second code stream to the first code stream and send it to the decoder; this is not limited in this embodiment of the present application.
  • Step 609 the encoder estimates and obtains a third estimated probability distribution.
  • Step 610 the encoder performs entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
  • Step 611 the encoder sends the third code stream to the decoder.
  • the encoder can send the third code stream to the decoder separately, or add the third code stream to the first code stream and send it to the decoder; this is not limited in this embodiment of the present application.
  • Step 612 the decoder acquires the first context information and the second context information.
  • for the manner of acquiring the first context information, reference may be made to the aforementioned step 306, which will not be described in detail here in this embodiment of the present application.
  • the first code stream belongs to the current code stream received by the decoder, and the decoded data obtained after decoding it is non-first decoded data among the multiple data contained in the current data stream.
  • the second context information may be obtained by inputting at least one piece of decoded data that meets a preset condition into the masked convolutional network.
  • for the masked convolutional network, reference may be made to the foregoing step 602, which will not be described in detail here in this embodiment of the present application.
  • the decoder when it performs entropy decoding on each code stream in the received current code stream, it usually performs entropy decoding on the fourth code stream first.
  • the decoder can estimate and obtain the fourth estimated probability distribution according to preset information.
  • alternatively, the fourth estimated probability distribution is estimated using a learnable model obtained through training, or is estimated according to the first side information and/or the second side information. Entropy decoding is then performed on the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, where the decoded first data is the first decoded data among the plurality of data.
  • the fourth estimated probability distribution estimated by the decoder needs to be consistent with the fourth estimated probability distribution estimated by the encoder. For example, when the encoder estimates and obtains the fourth estimated probability distribution according to preset information, the decoder obtains the fourth estimated probability distribution according to the same fixed information. When the encoder estimates the fourth estimated probability distribution by using the learnable model obtained through training, the decoder estimates the fourth estimated probability distribution according to the same learnable model, and the estimated fourth estimated probability distribution is the same. When the encoder estimates and obtains the fourth estimated probability distribution according to the first side information and the second side information, the decoder obtains the fourth estimated probability distribution according to the first side information and the second side information.
  • Step 613 the decoder estimates and obtains a second estimated probability distribution.
  • the second estimated probability distribution estimated by the decoder needs to be consistent with the second estimated probability distribution estimated by the encoder.
  • Step 614 the decoder performs entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
  • Step 615 the decoder estimates and obtains a third estimated probability distribution.
  • the third estimated probability distribution estimated by the decoder needs to be consistent with the third estimated probability distribution estimated by the encoder.
  • Step 616 the decoder performs entropy decoding on the third code stream according to the third estimated probability distribution to obtain decoded second side information.
  • Step 617 the decoder estimates and obtains a first estimated probability distribution according to the first context information, the second context information, the decoded first side information and the decoded second side information.
  • Step 618 The decoder performs entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among multiple data included in the current data stream.
  • for the process of performing entropy decoding on the first code stream, reference may be made to the foregoing step 308, and details are not described here in this embodiment of the present application. It should be noted that after each piece of data is decoded, it is added to the decoded data.
  • the encoder obtains the data to be encoded contained in the current data stream, the first context information, the second context information, the first side information and the second side information, estimates the first estimated probability distribution according to these four kinds of information, performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream, and sends the first code stream to the decoder; the encoder estimates the second estimated probability distribution and the third estimated probability distribution, performs entropy encoding on the first side information and the second side information according to the second and third estimated probability distributions respectively to obtain the second code stream and the third code stream, and sends the second and third code streams to the decoder; the decoder obtains the first context information and the second context information, and estimates the second estimated probability distribution and the third estimated probability distribution respectively.
  • steps 603 to 605, steps 606 to 608, and steps 609 to 611 can be executed at the same time.
  • step 612, steps 613 to 614, and steps 615 to 616 can be executed simultaneously.
  • each step of the above-mentioned method embodiments may be completed by an integrated logic circuit of hardware in a processor or instructions in the form of software.
  • the processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the methods disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memories mentioned in the above embodiments may be volatile memories or nonvolatile memories, or may include both volatile and nonvolatile memories.
  • the non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (personal computer, server, or network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disc and other media that can store program codes. .

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application provides an entropy encoding and decoding method and device. The entropy encoding method in the present application comprises: obtaining data to be encoded, where the data to be encoded is not the first-encoded data among multiple pieces of data contained in a current data stream; acquiring reference information, where the reference information comprises at least one of first context information and first side information, the first context information is obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information is obtained by inputting the multiple pieces of data into a self-attention encoding network; obtaining a first estimated probability distribution according to the reference information; and performing entropy encoding on the data to be encoded according to the first estimated probability distribution so as to obtain a first code stream. The present application can improve the transmission efficiency of the multiple pieces of data contained in the current data stream.

Description

Entropy encoding and decoding method and device
This application claims priority to Chinese Patent Application No. 202110944357.5, entitled "Entropy encoding and decoding method and device", filed with the China Patent Office on August 17, 2021, which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence (AI) based data compression, and in particular, to an entropy encoding and decoding method and device.
Background
Video coding (video encoding and decoding) is widely used in digital video applications, for example, broadcast digital television, video transmission over the Internet and mobile networks, real-time conversational applications such as video chat and video conferencing, digital versatile discs (DVD) and Blu-ray discs, video content capture and editing systems, and security applications of camcorders.
A large amount of video data needs to be described even for a short movie, which can cause difficulties when the data is to be sent or otherwise transmitted over a network with limited bandwidth capacity. Therefore, video data is usually compressed before being transmitted over modern telecommunications networks. Because memory resources may be limited, the size of a video may also be an issue when the video is stored on a storage device. A video compression device typically uses software and/or hardware on the source side to encode the video data before transmission or storage, thereby reducing the amount of data required to represent digital video images. The compressed data is then received on the destination side by a video decompression device. With limited network resources and an ever-growing demand for higher video quality, improved compression and decompression techniques that can increase the compression ratio with little impact on image quality are needed.
In recent years, applying deep learning to the field of image or video encoding and decoding has gradually become a trend. Related technologies use a preset fixed probability distribution, or use a learnable model obtained through training to determine a probability distribution, and then encode/decode the data to be encoded/decoded based on that probability distribution. However, the probability distribution obtained by these methods has low accuracy, which leads to a large entropy coding overhead and, in turn, low data transmission efficiency.
Summary
The present application provides an entropy encoding and decoding method and device, to improve the accuracy of the estimated probability distribution of the data to be encoded and reduce the code rate in the entropy encoding and decoding process, thereby reducing the entropy encoding and decoding overhead.
According to a first aspect, the present application provides an entropy encoding method. The method includes: obtaining data to be encoded, where the data to be encoded is not the first-encoded data among multiple pieces of data contained in a current data stream; obtaining reference information, where the reference information includes at least one of first context information and first side information, the first context information is obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information is obtained by inputting the multiple pieces of data into a self-attention encoding network; estimating a first estimated probability distribution according to the reference information; and performing entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
Here, the encoded data refers to the data, among the multiple pieces of data, on which the encoder has already performed entropy encoding. Because no encoded data exists yet when the first piece of data of the current data stream is entropy encoded, the data to be encoded needs to be a non-first piece of data of the current data stream so that the first context information can be extracted.
The first estimated probability distribution estimated according to the reference information may include at least one estimated probability parameter. For example, the at least one estimated probability parameter may include a mean and a variance, and the mean and the variance form a Gaussian distribution. The encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution, and then perform entropy encoding on the data to be encoded according to that probability value. The first code stream obtained after entropy encoding may be in a binary format.
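The application does not fix a particular way of turning the Gaussian parameters into symbol probabilities; the sketch below follows the common practice in learned compression of integrating the density over each quantization bin. The half-integer bin width and all tensor values are illustrative assumptions.

```python
import torch

def gaussian_symbol_probs(y, mean, scale, eps=1e-9):
    """Probability of each quantized symbol under N(mean, scale^2),
    evaluated as CDF(y + 0.5) - CDF(y - 0.5)."""
    dist = torch.distributions.Normal(mean, scale)
    probs = dist.cdf(y + 0.5) - dist.cdf(y - 0.5)
    return probs.clamp_min(eps)  # keep every symbol codable

# an ideal entropy coder spends -log2(p) bits per symbol
y = torch.tensor([2.0, -1.0, 0.0])       # quantized data to be encoded
mean = torch.tensor([1.8, -0.7, 0.1])    # from the estimation network
scale = torch.tensor([0.5, 1.0, 0.3])
bits = -torch.log2(gaussian_symbol_probs(y, mean, scale)).sum()
print(f"estimated code length: {bits.item():.2f} bits")
```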
The multiple pieces of data may also be referred to as multiple data units, and may include video data, image data, audio data, integer data, and other data with compression/decompression requirements; the embodiments of the present application do not limit the data type. Each piece of data corresponds to a piece of position information, and the data to be encoded is not in the first position among the multiple pieces of data.
In this entropy encoding method, the self-attention decoding network is a neural network with a self-attention mechanism (that is, a network including a self-attention structure). The self-attention mechanism is a variant of the attention mechanism; it has a global receptive field and can better capture the internal correlations of data or features. The self-attention decoding network can obtain the weights of all input encoded data relative to the data to be encoded, and then weight all or part of the input encoded data with the corresponding weights to obtain the first context information. This improves the utilization of the encoded data in the process of obtaining the first context information. When the first context information is used to estimate the first estimated probability distribution, the accuracy of the first estimated probability distribution can be improved, further reducing the code rate in the entropy encoding process and thereby further reducing the entropy encoding overhead.
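As an illustration only, the following PyTorch sketch shows how a self-attention layer with a causal mask can weight all already-encoded positions to produce context features; the dimensions, head count, and class name are assumptions, not the specific network described by the application.

```python
import torch
import torch.nn as nn

class CausalSelfAttentionContext(nn.Module):
    """Each position attends to all earlier (already-encoded) positions,
    giving a global receptive field over the encoded data; the weighted
    sum serves as the context feature for the next symbol."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, encoded: torch.Tensor):   # (batch, T, dim)
        T = encoded.size(1)
        # boolean mask: True blocks attention, so position i only sees j <= i
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        ctx, weights = self.attn(encoded, encoded, encoded, attn_mask=causal)
        return ctx, weights                      # context features + weights

net = CausalSelfAttentionContext()
feats = torch.randn(1, 5, 64)                   # 5 already-encoded symbols
context, w = net(feats)                         # context: (1, 5, 64)
```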
The self-attention encoding network has a global receptive field and can obtain the correlation between all input data and the data to be encoded, where the correlation may be the weights of all input data relative to the data to be encoded. After obtaining these weights, the self-attention encoding network weights the corresponding data according to the weights to obtain the first side information.
Optionally, the self-attention encoding network may weight all or part of the input data with the corresponding weights to obtain the first side information. This improves the utilization of the data in the process of obtaining the first side information. When the first side information is subsequently used to estimate the first estimated probability distribution, the accuracy of the obtained first estimated probability distribution can be further improved, further reducing the code rate in the entropy encoding process and thereby further reducing the entropy encoding overhead.
In addition to at least one of the first context information and the first side information, the reference information may further include at least one of second context information and second side information. The following cases are therefore possible (a minimal sketch of the probability distribution estimation network used in these cases is shown after the list):
(1) The reference information includes the first context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network. The probability distribution estimation network may be a separate neural network, or may be a structure within the self-attention decoding network; this is not limited in the embodiments of the present application.
(2) The reference information includes the first context information and the first side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the first side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(3) The reference information includes the first context information and the second context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(4) The reference information includes the first context information, the first side information, and the second context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information, and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(5) The reference information includes the first context information and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(6) The reference information includes the first context information, the first side information, and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(7) The reference information includes the first context information, the second context information, and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the second context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(8) The reference information includes the first context information, the first side information, the second context information, and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the first side information, the second context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(9) The reference information includes the first side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(10) The reference information includes the first side information and context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first side information and the context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(11) The reference information includes the first side information and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first side information and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(12) The reference information includes the first side information, context information, and the second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first side information, the context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
When the reference information includes the first side information, the method further includes: estimating a second estimated probability distribution; and performing entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
Optionally, the second estimated probability distribution may be estimated according to preset information, or may be estimated by using a learnable model obtained through training. The probability value of the first side information is then calculated according to the second estimated probability distribution, and entropy encoding is performed on the first side information according to that probability value.
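For illustration, if the preset information were simply a symbol-frequency table shared by the encoder and decoder (a hypothetical choice; the application leaves the preset information unspecified), the second estimated probability distribution could be derived as follows:

```python
# Encoder and decoder share the same table, so the second estimated
# probability distribution is identical on both sides by construction.
preset_counts = {-2: 1, -1: 4, 0: 8, 1: 4, 2: 1}   # hypothetical preset information
total = sum(preset_counts.values())
second_dist = {s: c / total for s, c in preset_counts.items()}
p = second_dist[0]   # probability used to entropy encode a side-info symbol "0"
```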
Optionally, the second code stream may be sent to the decoding side separately, or may be added to the first code stream and sent to the decoding side; the embodiments of the present application do not limit the manner of sending the second code stream.
In a possible implementation, the reference information further includes second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data. The masked convolution network includes a masked convolutional layer or a regular convolutional layer.
For example, the at least one piece of data that meets the preset condition may be at least one piece of data, among the at least one piece of encoded data, that is adjacent to the data to be encoded. For one-dimensional data, the neighbors of the data to be encoded may be the m pieces of encoded data immediately preceding the data to be encoded, where m > 0. For two-dimensional data, the neighbors of the data to be encoded may be the data adjacent to the data to be encoded, or the encoded data within the n surrounding rings of the data to be encoded, where n > 0; the embodiments of the present application do not limit the definition of neighbors.
The encoded data is used in the process of obtaining the second context information, which can improve the accuracy of the first estimated probability distribution, thereby reducing the code rate in the entropy encoding process and reducing the entropy encoding overhead.
The masked convolution network has a local receptive field and includes a masked convolutional layer or a regular convolutional layer. The input of the masked convolution network is at least one piece of data, among the at least one piece of encoded data, that is adjacent to the data to be encoded, and the output is the activation feature of the convolution output, that is, the second context information.
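A common way to realize such a masked convolutional layer (in the style of PixelCNN-type context models; the kernel size and channel counts below are assumptions) is to zero out the kernel weights at and after the current position:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Zeroes the kernel at the current position and at every position that
    follows it in raster order, so the convolution only sees already
    encoded/decoded neighbours (a local receptive field)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0   # current position and to its right
        mask[:, :, kh // 2 + 1:, :] = 0     # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

ctx2 = MaskedConv2d(1, 8, kernel_size=5, padding=2)   # second context information
features = ctx2(torch.randn(1, 1, 16, 16))            # shape (1, 8, 16, 16)
```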
In a possible implementation, the reference information further includes second side information, and the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data. The method further includes: estimating a third estimated probability distribution; and performing entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
For example, the at least one piece of data that meets the preset condition may be at least one piece of data, among the multiple pieces of data, that is adjacent to the data to be encoded. For one-dimensional data, the neighbors of the data to be encoded may be the m1 pieces of data preceding and/or the m2 pieces of data following the data to be encoded, where m1, m2 > 0. For two-dimensional data, the neighbors may be the data adjacent to the data to be encoded, or the data within the n surrounding rings of the data to be encoded, where n > 0.
The hyper encoder network has a local receptive field and includes regular convolutional layers. Its input is at least one piece of data, among the multiple pieces of data, that is adjacent to the data to be encoded, and its output is the activation feature of the convolution output, that is, the second side information.
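For illustration, a hyper encoder is often built from strided convolutions that downsample the input into compact side information, as in the sketch below; the layer configuration and the abs() nonlinearity follow common hyperprior designs and are assumptions here, not a specification of the application's network.

```python
import torch
import torch.nn as nn

class HyperEncoder(nn.Module):
    """Strided convolutions squeeze the data (or its latent representation)
    into compact second side information, which is then entropy encoded
    into the third code stream."""
    def __init__(self, c: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(c, c, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.abs(y))   # abs() follows the common hyperprior design

z = HyperEncoder()(torch.randn(1, 192, 16, 16))   # (1, 192, 4, 4): side information
```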
Optionally, the third code stream may be sent to the decoding side separately, or may be added to the first code stream and sent to the decoding side; the embodiments of the present application do not limit the manner of sending the third code stream.
In a possible implementation, the method further includes: obtaining the first-encoded data among the multiple pieces of data; estimating a fourth estimated probability distribution according to preset information; and performing entropy encoding on the first-encoded data according to the fourth estimated probability distribution to obtain a fourth code stream.
For the first-encoded data, the fourth estimated probability distribution may be estimated according to preset information, or may be estimated by using a learnable model obtained through training; the embodiments of the present application do not limit the manner of obtaining the fourth estimated probability distribution.
In the present application, the self-attention encoding network may be, for example, a Transformer Encoder, and the self-attention decoding network may be, for example, a Transformer Decoder. The first code stream may refer to a first encoded bitstream, the second code stream may refer to a second encoded bitstream, the third code stream may refer to a third encoded bitstream, and the fourth code stream may refer to a fourth encoded bitstream.
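Under the assumption that the two networks are instantiated directly from PyTorch's Transformer modules (dimensions and layer counts chosen arbitrarily for illustration), they could be wired up as follows:

```python
import torch
import torch.nn as nn

# "self-attention encoding network": a Transformer Encoder over all data
enc_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
side_info_net = nn.TransformerEncoder(enc_layer, num_layers=2)

# "self-attention decoding network": a Transformer Decoder over encoded data
dec_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
context_net = nn.TransformerDecoder(dec_layer, num_layers=2)

data = torch.randn(1, 8, 64)                       # 8 data units, embedded
side = side_info_net(data)                         # first side information features
ctx = context_net(tgt=data[:, :4], memory=side)    # context from 4 encoded units
```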
According to a second aspect, the present application provides an entropy decoding method. The method includes: obtaining a first code stream; obtaining reference information, where the reference information includes at least one of first context information and decoded first side information, the first context information is obtained by inputting at least one piece of decoded data into a self-attention decoding network, and the decoded first side information is obtained by performing entropy decoding on a second code stream; estimating a first estimated probability distribution according to the reference information; and performing entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is not the first-decoded data among multiple pieces of data contained in a current data stream.
In this entropy decoding method, the received first code stream is obtained by performing entropy encoding on the data to be encoded according to a first estimated probability distribution, and the first estimated probability distribution is obtained based on reference information, which may include at least one of first context information and decoded first side information. The self-attention decoding network can weight all the input decoded data with the corresponding weights to obtain the first context information. This improves the utilization of the decoded data in the process of obtaining the first context information. When the first context information is used to estimate the first estimated probability distribution, the accuracy of the obtained first estimated probability distribution can be improved and the code rate in the entropy encoding process can be reduced, which reduces the bandwidth occupied when the first code stream is transmitted to the decoding side and improves the transmission efficiency of the first code stream.
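A hypothetical driver loop for the decoding side is sketched below; arith_decoder, context_net, and prob_net are placeholders for an arithmetic decoder, the self-attention decoding network, and the probability distribution estimation network, none of which are specified by this sketch.

```python
def entropy_decode_stream(bitstream, arith_decoder, context_net, prob_net,
                          preset_dist, n_symbols):
    """Hypothetical sketch: each newly decoded symbol is fed back into the
    self-attention network so the next symbol's distribution is estimated
    from all data decoded so far."""
    symbols = [arith_decoder.decode(bitstream, preset_dist)]  # first symbol
    for _ in range(1, n_symbols):
        ctx = context_net(symbols)        # first context information
        dist = prob_net(ctx)              # first estimated probability distribution
        symbols.append(arith_decoder.decode(bitstream, dist))
    return symbols
```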
In a possible implementation, the obtaining reference information further includes: obtaining a second code stream; estimating a second estimated probability distribution; and performing entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information. Correspondingly, the reference information further includes the decoded first side information.
It should be noted that the second estimated probability distribution estimated by the decoding side needs to be consistent with the second estimated probability distribution estimated by the encoding side.
In addition to at least one of the first context information and the decoded first side information, the reference information may further include at least one of second context information and decoded second side information. The following cases are therefore possible:
(1) The reference information includes the first context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(2) The reference information includes the first context information and the decoded first side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the decoded first side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(3) The reference information includes the first context information and the second context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(4) The reference information includes the first context information, the decoded first side information, and the second context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information, and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(5) The reference information includes the first context information and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(6) The reference information includes the first context information, the decoded first side information, and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(7) The reference information includes the first context information, the second context information, and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the second context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(8) The reference information includes the first context information, the decoded first side information, the second context information, and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the first context information, the decoded first side information, the second context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(9) The reference information includes the decoded first side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the decoded first side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(10) The reference information includes the decoded first side information and context information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the decoded first side information and the context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(11) The reference information includes the decoded first side information and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the decoded first side information and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
(12) The reference information includes the decoded first side information, context information, and the decoded second side information.
Correspondingly, estimating the first estimated probability distribution according to the reference information may include: inputting the decoded first side information, the context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information further includes second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data.
In a possible implementation, the obtaining reference information further includes: obtaining a third code stream; estimating a third estimated probability distribution; and performing entropy decoding on the third code stream according to the third estimated probability distribution to obtain decoded second side information. Correspondingly, the reference information further includes the decoded second side information.
It should be noted that the third estimated probability distribution estimated by the decoding side needs to be consistent with the third estimated probability distribution estimated by the encoding side.
In a possible implementation, the method further includes: obtaining a fourth code stream; estimating a fourth estimated probability distribution according to preset information; and performing entropy decoding on the fourth code stream according to the fourth estimated probability distribution to obtain decoded first data, where the decoded first data is the first-decoded data among the multiple pieces of data.
It should be noted that the fourth estimated probability distribution estimated by the decoding side needs to be consistent with the fourth estimated probability distribution estimated by the encoding side.
According to a third aspect, the present application provides an entropy encoding device. The device includes: an obtaining module, configured to obtain data to be encoded, where the data to be encoded is not the first-encoded data among multiple pieces of data contained in a current data stream, and to obtain reference information, where the reference information includes at least one of first context information and first side information, the first context information is obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information is obtained by inputting the multiple pieces of data into a self-attention encoding network; an estimation module, configured to estimate a first estimated probability distribution according to the reference information; and an encoding module, configured to perform entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
In a possible implementation, the reference information specifically includes the first context information and the first side information; and the estimation module is specifically configured to input the first context information and the first side information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first context information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the first side information, and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first context information, the first side information, and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data; and the estimation module is specifically configured to input the first context information and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the first side information, and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data; and the estimation module is specifically configured to input the first context information, the first side information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, second context information, and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first context information, the second context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the first side information, second context information, and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first context information, the first side information, the second context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first side information and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first side information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first side information and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data; and the estimation module is specifically configured to input the first side information and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first side information, second context information, and second side information, where the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and the estimation module is specifically configured to input the first side information, the second context information, and the second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the estimation module is further configured to estimate a second estimated probability distribution; and the encoding module is further configured to perform entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
In a possible implementation, the estimation module is further configured to estimate a third estimated probability distribution; and the encoding module is further configured to perform entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
In a possible implementation, the obtaining module is further configured to obtain the first-encoded data among the multiple pieces of data; the estimation module is further configured to estimate a fourth estimated probability distribution according to preset information; and the encoding module is further configured to perform entropy encoding on the first-encoded data according to the fourth estimated probability distribution to obtain a fourth code stream.
According to a fourth aspect, the present application provides an entropy decoding device. The device includes: an obtaining module, configured to obtain a first code stream and to obtain reference information, where the reference information includes at least one of first context information and decoded first side information, the first context information is obtained by inputting at least one piece of decoded data into a self-attention decoding network, and the decoded first side information is obtained by performing entropy decoding on a second code stream; an estimation module, configured to estimate a first estimated probability distribution according to the reference information; and a decoding module, configured to perform entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is not the first-decoded data among multiple pieces of data contained in a current data stream.
In a possible implementation, the reference information specifically includes the first context information and the decoded first side information; and the estimation module is specifically configured to input the first context information and the decoded first side information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the first context information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the decoded first side information, and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the first context information, the decoded first side information, and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream; and the estimation module is specifically configured to input the first context information and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the decoded first side information, and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream; and the estimation module is specifically configured to input the first context information, the decoded first side information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, second context information, and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the first context information, the second context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the first context information, the decoded first side information, second context information, and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the first context information, the decoded first side information, the second context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the decoded first side information and second context information, where the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the decoded first side information and the second context information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the decoded first side information and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream; and the estimation module is specifically configured to input the decoded first side information and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, the reference information specifically includes the decoded first side information, second context information, and decoded second side information, where the decoded second side information is obtained by performing entropy decoding on a third code stream, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and the estimation module is specifically configured to input the decoded first side information, the second context information, and the decoded second side information into the probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
In a possible implementation, when the reference information includes the decoded first side information, the acquisition module is further configured to acquire the second code stream; the estimation module is further configured to estimate a second estimated probability distribution; and the decoding module is further configured to entropy decode the second code stream according to the second estimated probability distribution, to obtain the decoded first side information.
In a possible implementation, when the reference information includes the decoded second side information, the acquisition module is further configured to acquire the third code stream; the estimation module is further configured to estimate a third estimated probability distribution; and the decoding module is further configured to entropy decode the third code stream according to the third estimated probability distribution, to obtain the decoded second side information.
In a possible implementation, the acquisition module is further configured to acquire a fourth code stream; the estimation module is further configured to estimate a fourth estimated probability distribution according to preset information; and the decoding module is further configured to entropy decode the fourth code stream according to the fourth estimated probability distribution, to obtain decoded first data, where the decoded first data is the data decoded first among the plurality of data.
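For illustration, the following PyTorch sketch shows one way the modules above could be realized: a masked convolution extracts context information from already-decoded data, the context features are fused with decoded side information, and the network outputs per-element parameters of the estimated probability distribution for the entropy decoder. The layer widths, the concatenation-based fusion, and the Gaussian (mean/scale) parameterization are illustrative assumptions, not the specific networks of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution that sees only causal (already-decoded) neighbors."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__(in_ch, out_ch, k, padding=k // 2)
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2:] = 0  # current element and everything after it
        mask[:, :, k // 2 + 1:, :] = 0   # all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, padding=self.padding)

class ProbabilityEstimationNet(nn.Module):
    """Fuses context information with side information into distribution parameters."""
    def __init__(self, data_ch=1, side_ch=8, hidden=32):
        super().__init__()
        self.context = MaskedConv2d(data_ch, hidden)   # context from decoded data
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden + side_ch, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * data_ch, 1),         # mean and scale per element
        )

    def forward(self, decoded_so_far, side_info):
        ctx = F.relu(self.context(decoded_so_far))
        mean, scale = self.fuse(torch.cat([ctx, side_info], dim=1)).chunk(2, dim=1)
        return mean, F.softplus(scale) + 1e-6          # keep the scale positive

# An arithmetic decoder would query, element by element, the CDF built from
# (mean, scale) to entropy-decode the next element; decoded elements feed back
# in through `decoded_so_far`.
net = ProbabilityEstimationNet()
mean, scale = net(torch.zeros(1, 1, 16, 16), torch.randn(1, 8, 16, 16))
```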
In a fifth aspect, this application provides an entropy encoding device, including one or more processors and a memory configured to store one or more programs, where when the one or more programs are executed by the one or more processors, the one or more processors are enabled to implement the method according to any one of the first aspect.
In a sixth aspect, this application provides an entropy decoding device, including one or more processors and a memory configured to store one or more programs, where when the one or more programs are executed by the one or more processors, the one or more processors are enabled to implement the method according to any one of the second aspect.
In a seventh aspect, this application provides a computer-readable storage medium including a computer program, where when the computer program is executed on a computer, the computer is enabled to perform the method according to any one of the first to second aspects.
In an eighth aspect, this application provides a computer program product, where the computer program product includes computer program code, and when the computer program code is run on a computer, the computer is enabled to perform the method according to any one of the first to second aspects.
Description of Drawings
FIG. 1 is an exemplary block diagram of a coding system 10 according to an embodiment of this application;
FIG. 2 is an exemplary block diagram of a video encoder according to an embodiment of this application;
FIG. 3 is an exemplary block diagram of a video decoder according to an embodiment of this application;
FIG. 4 is an exemplary schematic diagram of candidate image blocks according to an embodiment of this application;
FIG. 5 is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 6 is a schematic diagram of another application scenario according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of an encoder in an end-to-end codec architecture according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a decoder in an end-to-end codec architecture according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of an encoder according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a decoder according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of an encoder according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of a decoder according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of an encoder according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a decoder according to an embodiment of this application;
FIG. 15 is a schematic diagram of a self-attention structure according to an embodiment of this application;
FIG. 16 is a schematic structural diagram of a self-attention encoding network according to an embodiment of this application;
FIG. 17 is a schematic structural diagram of a self-attention decoding network according to an embodiment of this application;
FIG. 18 is a flowchart of a process 100 of an entropy encoding method according to an embodiment of this application;
FIG. 19 is a flowchart of a process 200 of an entropy decoding method according to an embodiment of this application;
FIG. 20 is a flowchart of a process 300 of an entropy coding and decoding method according to an embodiment of this application;
FIG. 21 is a schematic diagram of a process of obtaining a first estimated probability distribution according to an embodiment of this application;
FIG. 22 is a schematic diagram of entropy encoding performance according to an embodiment of this application;
FIG. 23 is a flowchart of a process 400 of an entropy coding and decoding method according to an embodiment of this application;
FIG. 24 is a schematic diagram of a process of obtaining a first estimated probability distribution according to an embodiment of this application;
FIG. 25 is a flowchart of a process 500 of an entropy coding and decoding method according to an embodiment of this application;
FIG. 26 is a flowchart of a process 600 of an entropy coding and decoding method according to an embodiment of this application;
FIG. 27 is a schematic diagram of a process of obtaining a first estimated probability distribution according to an embodiment of this application.
Detailed Description
The embodiments of this application provide an AI-based data compression/decompression technology, in particular a neural-network-based data compression/decompression technology, and specifically an entropy coding and decoding technology, to improve conventional hybrid data coding and decoding systems.
Data coding includes two parts: data encoding and data decoding. Data encoding is performed on the source side (usually referred to as the encoder side) and typically includes processing (for example, compressing) raw data to reduce the amount of data needed to represent it (thereby enabling more efficient storage and/or transmission). Data decoding is performed on the destination side (usually referred to as the decoder side) and typically includes processing that is the inverse of the encoder side, to reconstruct the raw data. "Coding" of data in the embodiments should be understood as "encoding" or "decoding" of data. The encoding part and the decoding part are also collectively referred to as a codec (encoding and decoding, CODEC).
In the case of lossless data coding, the raw data can be reconstructed, that is, the reconstructed raw data has the same quality as the raw data (assuming no transmission loss or other data loss during storage or transmission). In the case of lossy data coding, further compression is performed, for example by quantization, to reduce the amount of data needed to represent the raw data, and the decoder side cannot completely reconstruct the raw data; that is, the quality of the reconstructed raw data is lower or worse than that of the raw data.
The embodiments of this application may be applied to video data, image data, audio data, integer data, and other data that needs to be compressed/decompressed. The following describes the embodiments of this application by using the coding of video data (video coding for short) as an example; for other types of data (for example, image data, audio data, integer data, and other data that needs to be compressed/decompressed), refer to the following description, and details are not described again in the embodiments of this application. It should be noted that, compared with video coding, data such as audio data and integer data does not need to be partitioned into blocks during coding, but may be coded directly.
Video coding typically refers to processing a sequence of images that form a video or video sequence. In the field of video coding, the terms "picture", "frame", and "image" may be used as synonyms.
Several video coding standards are "lossy hybrid video codecs" (that is, they combine spatial and temporal prediction in the pixel domain with 2D transform coding in the transform domain for applying quantization). Each image of a video sequence is typically partitioned into a set of non-overlapping blocks, and coding is typically performed at the block level. In other words, the encoder typically processes, that is, encodes, the video at the block (video block) level: for example, a prediction block is generated through spatial (intra) prediction and temporal (inter) prediction; the prediction block is subtracted from the current block (the block currently being processed/to be processed) to obtain a residual block; and the residual block is transformed in the transform domain and quantized to reduce the amount of data to be transmitted (compressed). The decoder side applies inverse processing, relative to the encoder, to the encoded or compressed block, to reconstruct the current block for representation. In addition, the encoder needs to repeat the processing steps of the decoder, so that the encoder and the decoder generate identical predictions (for example, intra prediction and inter prediction) and/or reconstructed pixels for processing, that is, coding, subsequent blocks.
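To make the hybrid loop concrete, the following numpy sketch traces the order of operations for a single block under deliberately simplified assumptions (a whole-block floating-point DCT, a flat quantization step, and a fixed prediction block, none of which is a specific standard's design); the point is that the encoder mirrors the decoder's inverse steps so that both sides predict subsequent blocks from identical reconstructions.

```python
import numpy as np
from scipy.fft import dctn, idctn

QSTEP = 8.0  # illustrative flat quantization step

def encode_block(block, prediction):
    """Encoder-side processing of one block, including the built-in decoder."""
    residual = block - prediction                            # remove the prediction
    levels = np.round(dctn(residual, norm="ortho") / QSTEP)  # transform + quantize
    # `levels` is what would be entropy-coded into the bitstream. The encoder then
    # dequantizes and inverse-transforms exactly as the decoder will:
    recon = prediction + idctn(levels * QSTEP, norm="ortho")
    return levels, np.clip(np.round(recon), 0, 255)

block = np.random.randint(0, 256, (8, 8)).astype(np.float64)
prediction = np.full((8, 8), 128.0)  # stand-in for an intra/inter prediction block
levels, recon = encode_block(block, prediction)
print(np.abs(block - recon).max())   # nonzero: quantization makes the codec lossy
```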
In the following embodiments of the coding system 10, the encoder 20 and the decoder 30 are described with reference to FIG. 1 to FIG. 3.
FIG. 1 is an exemplary block diagram of a coding system 10 according to an embodiment of this application, for example, a video coding system 10 (or coding system 10 for short) that can use the techniques of this application. The video encoder 20 (or encoder 20 for short) and the video decoder 30 (or decoder 30 for short) in the video coding system 10 represent devices that may be configured to perform techniques according to the various examples described in this application.
As shown in FIG. 1, the coding system 10 includes a source device 12 configured to provide encoded image data 21, such as an encoded image, to a destination device 14 configured to decode the encoded image data 21.
The source device 12 includes the encoder 20, and may additionally, that is, optionally, include an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
The image source 16 may include or be any type of image capture device for capturing, for example, a real-world image, and/or any type of image generation device, for example, a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images or computer-generated images (for example, screen content, virtual reality (VR) images, and/or any combination thereof (for example, augmented reality (AR) images)). The image source may be any type of memory or storage that stores any of the foregoing images.
To distinguish from the processing performed by the preprocessor (or preprocessing unit) 18, the image (or image data) 17 may also be referred to as an original image (or original image data) 17.
The preprocessor 18 is configured to receive the original image data 17 and preprocess it to obtain a preprocessed image (or preprocessed image data) 19. For example, the preprocessing performed by the preprocessor 18 may include cropping, color format conversion (for example, from RGB to YCbCr), color grading, or denoising. It can be understood that the preprocessing unit 18 may be an optional component.
The video encoder (or encoder) 20 is configured to receive the preprocessed image data 19 and provide encoded image data 21 (further described below with reference to FIG. 2 and the like).
The communication interface 22 in the source device 12 may be configured to receive the encoded image data 21 and send the encoded image data 21 (or any further processed version thereof) over a communication channel 13 to another device such as the destination device 14, or to any other device, for storage or direct reconstruction.
The destination device 14 includes the decoder 30, and may additionally, that is, optionally, include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.
The communication interface 28 in the destination device 14 is configured to receive the encoded image data 21 (or any further processed version thereof) directly from the source device 12 or from any other source device such as a storage device, for example, an encoded-image-data storage device, and provide the encoded image data 21 to the decoder 30.
The communication interface 22 and the communication interface 28 may be configured to send or receive the encoded image data (or encoded data) 21 over a direct communication link between the source device 12 and the destination device 14, for example, a direct wired or wireless connection, or over any type of network, for example, a wired network, a wireless network, or any combination thereof, or any type of private network and public network or any combination thereof.
For example, the communication interface 22 may be configured to encapsulate the encoded image data 21 into a suitable format such as packets, and/or process the encoded image data using any type of transmission encoding or processing, for transmission over a communication link or communication network.
The communication interface 28 corresponds to the communication interface 22 and may be configured, for example, to receive the transmitted data and process it using any type of corresponding transmission decoding or processing and/or de-encapsulation, to obtain the encoded image data 21.
Both the communication interface 22 and the communication interface 28 may be configured as unidirectional communication interfaces, as indicated in FIG. 1 by the arrow of the communication channel 13 pointing from the source device 12 to the destination device 14, or as bidirectional communication interfaces, and may be configured, for example, to send and receive messages to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or to data transmission such as the transmission of encoded image data.
The video decoder (or decoder) 30 is configured to receive the encoded image data 21 and provide decoded image data 31 (further described below with reference to FIG. 3 and the like).
The post-processor 32 is configured to post-process the decoded image data 31 (also referred to as reconstructed image data), such as a decoded image, to obtain post-processed image data 33, such as a post-processed image. The post-processing performed by the post-processing unit 32 may include, for example, color format conversion (for example, from YCbCr to RGB), color grading, cropping, or resampling, or any other processing for producing the decoded image data 31 for display by the display device 34 or the like.
The display device 34 is configured to receive the post-processed image data 33 to display the image to, for example, a user or viewer. The display device 34 may be or include any type of display for presenting the reconstructed image, for example, an integrated or external display screen or monitor. For example, the display screen may include a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
The coding system 10 further includes a training engine 25. The training engine 25 is configured to train the encoder 20 (in particular, the entropy encoding unit 270 in the encoder 20) or the decoder 30 (in particular, the entropy decoding unit 304 in the decoder 30) to process an input image, image region, or image block to obtain reference information of an image block to be encoded, or to process input reference information to estimate an estimated probability distribution of the image block to be encoded, so that the image block to be encoded is entropy encoded according to the estimated probability distribution. For a detailed description of the training engine 25, refer to the following method embodiments.
Although FIG. 1 shows the source device 12 and the destination device 14 as separate devices, a device embodiment may alternatively include both the source device 12 and the destination device 14 or the functionality of both, that is, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
As will be apparent to a person skilled in the art from the description, the existence and (exact) division of the different units or functionalities in the source device 12 and/or the destination device 14 shown in FIG. 1 may vary depending on the actual device and application.
Refer to FIG. 2 and FIG. 3. FIG. 2 is an exemplary block diagram of a video encoder according to an embodiment of this application, and FIG. 3 is an exemplary block diagram of a video decoder according to an embodiment of this application. The encoder 20 may be implemented by a processing circuit 46 to include the various modules discussed with reference to the encoder 20 of FIG. 2 and/or any other encoder system or subsystem described herein. The decoder 30 may be implemented by the processing circuit 46 to include the various modules discussed with reference to the decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein. The processing circuit 46 may be configured to perform the various operations discussed below. If some of the techniques are implemented in software, a device may store the software instructions in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors, to perform the techniques of this application. Either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined encoder/decoder (CODEC).
The source device 12 and the destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, for example, a notebook or laptop computer, a mobile phone, a smartphone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (for example, a content service server or a content distribution server), a broadcast receiving device, a broadcast transmitting device, or a surveillance device, and may use no operating system or any type of operating system. The source device 12 and the destination device 14 may also be devices in a cloud computing scenario, for example, virtual machines in a cloud computing scenario. In some cases, the source device 12 and the destination device 14 may be equipped with components for wireless communication. Therefore, the source device 12 and the destination device 14 may be wireless communication devices.
The source device 12 and the destination device 14 may have a virtual-scene application (APP) installed, such as a virtual reality (VR) application, an augmented reality (AR) application, or a mixed reality (MR) application, and may run the VR application, AR application, or MR application based on a user operation (for example, tapping, touching, sliding, shaking, or voice control). The source device 12 and the destination device 14 may capture images/videos of any object in the environment through a camera and/or a sensor, and then display a virtual object on the display device according to the captured images/videos. The virtual object may be a virtual object in a VR scene, an AR scene, or an MR scene (that is, an object in a virtual environment).
It should be noted that, in the embodiments of this application, the virtual-scene applications in the source device 12 and the destination device 14 may be applications built into the source device 12 and the destination device 14 themselves, or may be applications provided by a third-party service provider and installed by the user. This is not specifically limited.
In addition, the source device 12 and the destination device 14 may have a real-time video transmission application installed, such as a live-streaming application. The source device 12 and the destination device 14 may capture images/videos through a camera and then display the captured images/videos on a display device.
In some cases, the video coding system 10 shown in FIG. 1 is merely exemplary, and the techniques provided in this application are applicable to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between an encoding device and a decoding device. In other examples, data is retrieved from a local memory, sent over a network, and so on. A video encoding device may encode data and store the data in a memory, and/or a video decoding device may retrieve data from a memory and decode the data. In some examples, encoding and decoding are performed by devices that do not communicate with each other but simply encode data to a memory and/or retrieve data from a memory and decode it.
A video coding system may include an imaging device, a video encoder, a video decoder (and/or a video encoder/decoder implemented by a processing circuit), an antenna, one or more processors, one or more memories, and/or a display device.
The imaging device, the antenna, the processing circuit, the video encoder, the video decoder, the processor, the memory, and/or the display device can communicate with each other. In different instances, the video coding system may include only the video encoder or only the video decoder.
In some instances, the antenna may be configured to transmit or receive an encoded bitstream of video data. In addition, in some instances, the display device may be configured to present the video data. The processing circuit may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like. The video coding system may also include an optional processor, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like. In addition, the memory may be any type of memory, for example, a volatile memory (for example, a static random access memory (SRAM) or a dynamic random access memory (DRAM)) or a nonvolatile memory (for example, a flash memory). In a non-limiting instance, the memory may be implemented by a cache memory. In other instances, the processing circuit may include a memory (for example, a cache) for implementing an image buffer or the like.
In some instances, the video encoder 20 implemented by a logic circuit may include an image buffer (for example, implemented by the processing circuit or the memory) and a graphics processing unit (for example, implemented by the processing circuit). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include the video encoder 20 implemented by the processing circuit, to implement the various modules discussed with reference to FIG. 2 and/or any other encoder system or subsystem described herein. The logic circuit may be configured to perform the various operations discussed herein.
In some instances, the video decoder 30 may be implemented by the processing circuit in a similar manner, to implement the various modules discussed with reference to the video decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein. In some instances, the video decoder 30 implemented by a logic circuit may include an image buffer (implemented by the processing circuit or the memory) and a graphics processing unit (for example, implemented by the processing circuit). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include the video decoder 30 implemented by the processing circuit, to implement the various modules discussed with reference to FIG. 3 and/or any other decoder system or subsystem described herein.
In some instances, the antenna may be configured to receive an encoded bitstream of video data. As discussed, the encoded bitstream may include data related to encoded video frames as discussed herein, such as indicators, index values, and mode selection data, for example, data related to coding partitions (for example, transform coefficients or quantized transform coefficients, optional indicators (as discussed), and/or data defining the coding partitions). The video coding system may also include the video decoder 30 coupled to the antenna and configured to decode the encoded bitstream. The display device is configured to present video frames.
It should be understood that, for the examples described with reference to the video encoder 20 in the embodiments of this application, the video decoder 30 may be configured to perform the reverse process. With regard to signaled syntax elements, the video decoder 30 may be configured to receive and parse such syntax elements and decode the related video data accordingly. In some examples, the video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such instances, the video decoder 30 may parse such syntax elements and decode the related video data accordingly.
For ease of description, the embodiments of this application are described with reference to versatile video coding (VVC) reference software or high-efficiency video coding (HEVC) developed by the joint collaboration team on video coding (JCT-VC) of the ITU-T video coding experts group (VCEG) and the ISO/IEC motion picture experts group (MPEG). A person of ordinary skill in the art understands that the embodiments of this application are not limited to HEVC or VVC.
Encoder and Encoding Method
As shown in FIG. 2, the video encoder 20 includes an input (or input interface) 201, a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter 220, a decoded picture buffer (DPB) 230, a mode selection unit 260, an entropy encoding unit 270, and an output (or output interface) 272. The mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254, and a partitioning unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The video encoder 20 shown in FIG. 2 may also be referred to as a hybrid video encoder or a video encoder based on a hybrid video codec.
Referring to FIG. 2, the inter prediction unit is a trained target model (also referred to as a neural network). The neural network is configured to process an input image, image region, or image block to generate a prediction value of the input image block. For example, the neural network for inter prediction is configured to receive an input image, image region, or image block and generate a prediction value of the input image, image region, or image block.
The residual calculation unit 204, the transform processing unit 206, the quantization unit 208, and the mode selection unit 260 form the forward signal path of the encoder 20, while the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded picture buffer (DPB) 230, the inter prediction unit 244, and the intra prediction unit 254 form the backward signal path of the encoder, where the backward signal path of the encoder 20 corresponds to the signal path of the decoder (see the decoder 30 in FIG. 3). The inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer 230, the inter prediction unit 244, and the intra prediction unit 254 also form the "built-in decoder" of the video encoder 20.
Images and Image Partitioning (Images and Blocks)
The encoder 20 may be configured to receive, through the input 201 or the like, an image (or image data) 17, for example, an image in a sequence of images forming a video or video sequence. The received image or image data may also be a preprocessed image (or preprocessed image data) 19. For simplicity, the following description uses the image 17. The image 17 may also be referred to as a current image or an image to be encoded (in particular, in video coding, to distinguish the current image from other images, for example, previously encoded and/or decoded images of the same video sequence, that is, the video sequence that also includes the current image).
A (digital) image is, or can be regarded as, a two-dimensional array or matrix of samples with intensity values. A sample in the array may also be referred to as a pixel (or pel, short for picture element). The numbers of samples of the array or image in the horizontal and vertical directions (or axes) determine the size and/or resolution of the image. To represent color, three color components are usually employed, that is, the image may be represented as, or include, three sample arrays. In the RGB format or color space, an image includes corresponding red, green, and blue sample arrays. However, in video coding, each pixel is usually represented in a luminance/chrominance format or color space, for example, YCbCr, which includes a luminance component indicated by Y (sometimes also indicated by L) and two chrominance components indicated by Cb and Cr. The luminance (luma) component Y represents the brightness or grayscale intensity (for example, the two are the same in a grayscale image), and the two chrominance (chroma) components Cb and Cr represent the chrominance or color-information components. Correspondingly, an image in the YCbCr format includes a luminance sample array of luminance values (Y) and two chrominance sample arrays of chrominance values (Cb and Cr). An image in the RGB format can be converted or transformed into the YCbCr format and vice versa; this process is also referred to as color transformation or conversion. If an image is monochrome, the image may include only a luminance sample array. Correspondingly, an image may be, for example, a luminance sample array in a monochrome format, or a luminance sample array and two corresponding chrominance sample arrays in the 4:2:0, 4:2:2, or 4:4:4 color format.
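The RGB-to-YCbCr conversion mentioned above can be sketched with the full-range BT.601 coefficients; which matrix and value range a real system uses depends on the applicable video standard, so this particular choice is illustrative only.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """rgb: (..., 3) float array in [0, 255] -> YCbCr in [0, 255] (unclipped)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b                # luma
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b    # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b    # red-difference chroma
    return np.stack([y, cb, cr], axis=-1)

print(rgb_to_ycbcr(np.array([255.0, 0.0, 0.0])))  # pure red -> [~76.2, ~85.0, ~255.5]
```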
In one embodiment, the video encoder 20 may include an image partitioning unit (not shown in FIG. 2) configured to partition the image 17 into a plurality of (usually non-overlapping) image blocks 203. These blocks may also be referred to as root blocks, macroblocks (H.264/AVC), coding tree blocks (CTB), or coding tree units (CTU) in the H.265/HEVC and VVC standards. The partitioning unit may be configured to use the same block size for all images of a video sequence, together with a corresponding grid defining the block size, or to change the block size between images, subsets of images, or groups of images, and to partition each image into corresponding blocks.
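A minimal sketch of such grid partitioning follows, padding the right and bottom edges so that a fixed-size grid covers the whole image; the recursive splitting of each CTU that real encoders additionally perform is omitted here.

```python
import numpy as np

def split_into_blocks(img, block=128):
    """Partition a (H, W) image into non-overlapping block x block units."""
    h, w = img.shape
    img = np.pad(img, ((0, -h % block), (0, -w % block)), mode="edge")
    return [img[y:y + block, x:x + block]
            for y in range(0, img.shape[0], block)
            for x in range(0, img.shape[1], block)]

blocks = split_into_blocks(np.zeros((270, 480), dtype=np.uint8))
print(len(blocks))  # 3 block rows x 4 block columns = 12 blocks after padding
```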
In other embodiments, the video encoder may be configured to directly receive blocks 203 of the image 17, for example, one, several, or all of the blocks forming the image 17. The image block 203 may also be referred to as a current image block or an image block to be encoded.
Like the image 17, the image block 203 is also, or can be regarded as, a two-dimensional array or matrix of samples with intensity values (sample values), but the image block 203 is smaller than the image 17. In other words, the block 203 may include one sample array (for example, a luminance array in the case of a monochrome image 17, or a luminance array or a chrominance array in the case of a color image), three sample arrays (for example, one luminance array and two chrominance arrays in the case of a color image 17), or any other number and/or type of arrays depending on the color format employed. The numbers of samples of the block 203 in the horizontal and vertical directions (or axes) define the size of the block 203. Correspondingly, a block may be an M×N (M columns × N rows) array of samples, an M×N array of transform coefficients, or the like.
In one embodiment, the video encoder 20 shown in FIG. 2 is configured to encode the image 17 block by block, for example, performing encoding and prediction on each block 203.
In one embodiment, the video encoder 20 shown in FIG. 2 may be further configured to partition and/or encode an image by using slices (also referred to as video slices), where an image may be partitioned or encoded using one or more (usually non-overlapping) slices. Each slice may include one or more blocks (for example, coding tree units CTU) or one or more block groups (for example, tiles in the H.265/HEVC/VVC standards and bricks in the VVC standard).
In one embodiment, the video encoder 20 shown in FIG. 2 may be further configured to partition and/or encode an image by using slices/tile groups (also referred to as video tile groups) and/or tiles (also referred to as video tiles), where an image may be partitioned or encoded using one or more (usually non-overlapping) slices/tile groups, and each slice/tile group may include one or more blocks (for example, CTUs), one or more tiles, or the like, where each tile may be rectangular or the like in shape and may include one or more complete or partial blocks (for example, CTUs).
Residual Calculation
The residual calculation unit 204 is configured to calculate a residual block 205 based on the image block (or original block) 203 and a prediction block 265 (the prediction block 265 is described in detail later), for example, by subtracting sample values of the prediction block 265 from sample values of the image block 203 sample by sample (pixel by pixel), to obtain the residual block 205 in the pixel domain.
Transform
The transform processing unit 206 is configured to apply a discrete cosine transform (DCT), a discrete sine transform (DST), or the like to the sample values of the residual block 205, to obtain transform coefficients 207 in a transform domain. The transform coefficients 207 may also be referred to as transform residual coefficients and represent the residual block 205 in the transform domain.
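The benefit of the transform is energy compaction: for typical smooth residual blocks, a few low-frequency coefficients carry nearly all of the energy, which the subsequent quantization and entropy coding exploit. A small scipy sketch (a floating-point DCT, not a standard's integer transform) illustrates this:

```python
import numpy as np
from scipy.fft import dctn

x = np.add.outer(np.arange(8.0), np.arange(8.0))  # a smooth 8x8 ramp "residual"
energy = dctn(x, norm="ortho") ** 2
top3 = np.sort(energy.ravel())[::-1][:3].sum() / energy.sum()
print(f"top 3 of 64 coefficients hold {top3:.1%} of the energy")
```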
The transform processing unit 206 may be configured to apply an integer approximation of DCT/DST, such as the transforms specified for H.265/HEVC. Compared with an orthogonal DCT transform, such an integer approximation is usually scaled by a certain factor. To preserve the norm of a residual block processed by the forward and inverse transforms, additional scaling factors are applied as part of the transform process. A scaling factor is usually chosen according to certain constraints, for example, the scaling factor being a power of 2 for shift operations, the bit depth of the transform coefficients, and the trade-off between accuracy and implementation cost. For example, a specific scaling factor is specified for the inverse transform on the encoder 20 side by the inverse transform processing unit 212 (and for the corresponding inverse transform on the decoder 30 side by, for example, the inverse transform processing unit 312), and correspondingly, a corresponding scaling factor may be specified for the forward transform on the encoder 20 side by the transform processing unit 206.
In one embodiment, the video encoder 20 (correspondingly, the transform processing unit 206) may be configured to output transform parameters such as one or more transform types, for example, directly or after being encoded or compressed by the entropy encoding unit 270, so that, for example, the video decoder 30 can receive and use the transform parameters for decoding.
Quantization
The quantization unit 208 is configured to quantize the transform coefficients 207, for example, by scalar quantization or vector quantization, to obtain quantized transform coefficients 209. The quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209.
The quantization process may reduce the bit depth associated with some or all of the transform coefficients 207. For example, an n-bit transform coefficient may be rounded down to an m-bit transform coefficient during quantization, where n is greater than m. The degree of quantization may be modified by adjusting a quantization parameter (QP). For example, for scalar quantization, different degrees of scaling may be applied to achieve finer or coarser quantization. A smaller quantization step corresponds to finer quantization, and a larger quantization step corresponds to coarser quantization. A suitable quantization step may be indicated by a quantization parameter (QP); for example, the quantization parameter may be an index into a predefined set of suitable quantization steps. For example, a smaller quantization parameter may correspond to fine quantization (a smaller quantization step), and a larger quantization parameter may correspond to coarse quantization (a larger quantization step), or vice versa. Quantization may include division by a quantization step, while the corresponding or inverse dequantization performed by the inverse quantization unit 210 or the like may include multiplication by the quantization step. Embodiments according to some standards such as HEVC may be configured to use a quantization parameter to determine the quantization step. In general, the quantization step may be calculated from the quantization parameter using a fixed-point approximation of an equation including division. Additional scaling factors may be introduced for quantization and dequantization to restore the norm of the residual block, which might be modified because of the scaling used in the fixed-point approximation of the equation for the quantization step and the quantization parameter. In one example implementation, the scaling of the inverse transform and of the dequantization may be combined. Alternatively, a customized quantization table may be used and signaled from the encoder to the decoder, for example, in the bitstream. Quantization is a lossy operation, and the larger the quantization step, the greater the loss.
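The QP-to-step relation can be sketched with the HEVC-style mapping Qstep ≈ 2^((QP−4)/6), under which every increase of 6 in QP doubles the quantization step; the fixed-point tables of a real codec are omitted, so this floating-point sketch conveys the idea only.

```python
import numpy as np

def qstep(qp):
    return 2.0 ** ((qp - 4) / 6.0)  # HEVC-style: +6 in QP doubles the step

coeffs = np.array([100.3, -12.7, 3.2, 0.4])
for qp in (22, 32, 42):
    levels = np.round(coeffs / qstep(qp))  # transmitted quantization levels
    recon = levels * qstep(qp)             # decoder-side dequantization
    print(f"QP={qp}: step={qstep(qp):5.2f}, "
          f"max error={np.abs(coeffs - recon).max():.2f}")
```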
In one embodiment, the video encoder 20 (correspondingly, the quantization unit 208) may be configured to output a quantization parameter (QP), for example, directly or after being encoded or compressed by the entropy encoding unit 270, so that, for example, the video decoder 30 can receive and use the quantization parameter for decoding.
Inverse Quantization
The inverse quantization unit 210 is configured to apply the inverse quantization of the quantization unit 208 to the quantized coefficients to obtain dequantized coefficients 211, for example, by applying, based on or using the same quantization step as the quantization unit 208, the inverse of the quantization scheme applied by the quantization unit 208. The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211 and correspond to the transform coefficients 207, although the dequantized coefficients 211 are usually not exactly the same as the transform coefficients because of the loss caused by quantization.
Inverse Transform
The inverse transform processing unit 212 is configured to apply the inverse of the transform applied by the transform processing unit 206, for example, an inverse discrete cosine transform (DCT) or an inverse discrete sine transform (DST), to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the pixel domain. The reconstructed residual block 213 may also be referred to as a transform block 213.
Reconstruction
The reconstruction unit 214 (for example, a summer 214) is configured to add the transform block 213 (that is, the reconstructed residual block 213) to the prediction block 265, for example, by adding the sample values of the reconstructed residual block 213 and the sample values of the prediction block 265, to obtain a reconstructed block 215 in the pixel domain.
Filtering
The loop filter unit 220 (or "loop filter" 220 for short) is configured to filter the reconstructed block 215 to obtain a filtered block 221 or, in general, to filter reconstructed samples to obtain filtered sample values. For example, the loop filter unit is configured to smooth pixel transitions or improve the video quality. The loop filter unit 220 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 220 may include a deblocking filter, an SAO filter, and an ALF filter, and the filtering process may be performed in the order of the deblocking filter, the SAO filter, and the ALF filter. For another example, a process called luma mapping with chroma scaling (LMCS) (that is, an adaptive in-loop reshaper) is added; this process is performed before deblocking. For another example, the deblocking filtering process may also be applied to internal sub-block edges, for example, affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although the loop filter unit 220 is shown as an in-loop filter in FIG. 2, in other configurations the loop filter unit 220 may be implemented as a post-loop filter. The filtered block 221 may also be referred to as a filtered reconstructed block 221.
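As a toy illustration of the idea behind deblocking (not the H.265/HEVC filter, which uses boundary-strength decisions and multi-tap filters), the following sketch blends the pixel pair straddling a block edge only when its difference is small enough to look like a blocking artifact rather than a real edge:

```python
import numpy as np

def deblock_edge(left_col, right_col, threshold=10.0):
    """Soften weak discontinuities across a vertical block boundary."""
    delta = right_col - left_col
    adj = np.where(np.abs(delta) < threshold, delta / 4.0, 0.0)  # skip real edges
    return left_col + adj, right_col - adj

left = np.array([100.0, 102.0, 101.0])   # last pixel column of the left block
right = np.array([108.0, 140.0, 107.0])  # first column of the right block; row 2 is a real edge
print(deblock_edge(left, right))
```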
In one embodiment, the video encoder 20 (correspondingly, the loop filter unit 220) may be configured to output loop filter parameters (such as SAO filter parameters, ALF filter parameters, or LMCS parameters), for example directly, or after entropy encoding by the entropy encoding unit 270, so that, for example, the decoder 30 can receive and apply the same or different loop filter parameters for decoding.
Decoded picture buffer
The decoded picture buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by the video encoder 20 when encoding video data. The DPB 230 may be formed by any of a variety of memory devices, such as dynamic random access memory (DRAM), including synchronous DRAM (SDRAM), magnetoresistive RAM (MRAM), resistive RAM (RRAM), or other types of memory devices. The decoded picture buffer 230 may be used to store one or more filtered blocks 221. The decoded picture buffer 230 may further be used to store other previously filtered blocks, for example previously reconstructed and filtered blocks 221, of the same current picture or of a different picture such as a previously reconstructed picture, and may provide complete previously reconstructed, that is, decoded, pictures (and the corresponding reference blocks and samples) and/or a partially reconstructed current picture (and the corresponding reference blocks and samples), for example for inter prediction. The decoded picture buffer 230 may also be used to store one or more unfiltered reconstructed blocks 215, or, in general, unfiltered reconstructed samples, for example reconstructed blocks 215 that are not filtered by the loop filter unit 220, or reconstructed blocks or reconstructed samples that have not undergone any other processing.
Mode selection (partitioning and prediction)
The mode selection unit 260 includes a partitioning unit 262, an inter prediction unit 244, and an intra prediction unit 254, and is configured to receive or obtain, from the decoded picture buffer 230 or other buffers (for example, a line buffer, not shown in FIG. 2), original picture data such as the original block 203 (the current block 203 of the current picture 17) and reconstructed picture data, for example filtered and/or unfiltered reconstructed samples or reconstructed blocks of the same (current) picture and/or of one or more previously decoded pictures. The reconstructed picture data are used as reference picture data required for prediction, such as inter prediction or intra prediction, to obtain a prediction block 265 or a predictor 265.
The mode selection unit 260 may be configured to determine or select a partitioning for the current block (including no partitioning) and a prediction mode (for example, an intra or inter prediction mode), and to generate a corresponding prediction block 265 that is used to compute the residual block 205 and to reconstruct the reconstructed block 215.
In one embodiment, the mode selection unit 260 may be configured to select the partitioning and prediction mode (for example, from the prediction modes supported by or available to the mode selection unit 260) that provides the best match, in other words the minimum residual (a minimum residual means better compression for transmission or storage), or the minimum signaling overhead (a minimum signaling overhead means better compression for transmission or storage), or that considers or balances both. The mode selection unit 260 may be configured to determine the partitioning and prediction mode based on rate-distortion optimization (RDO), that is, to select the prediction mode that provides the minimum rate-distortion cost. Terms such as "best", "lowest", and "optimal" in this context do not necessarily refer to an overall "best", "lowest", or "optimal", but may also refer to situations where a termination or selection criterion is met; for example, a value exceeding or falling below a threshold or another constraint may lead to a "suboptimal selection" while reducing complexity and processing time.
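A generic sketch of such a rate-distortion selection loop is shown below, with hypothetical distortion_fn and rate_fn callables standing in for the encoder's distortion and rate estimates; it illustrates only the cost comparison J = D + λ·R, not any particular encoder's search strategy:

    def select_mode(candidates, distortion_fn, rate_fn, lmbda):
        # Pick the candidate (partitioning + prediction mode) minimizing
        # the rate-distortion cost J = D + lambda * R.
        best_mode, best_cost = None, float('inf')
        for mode in candidates:
            cost = distortion_fn(mode) + lmbda * rate_fn(mode)
            if cost < best_cost:
                best_mode, best_cost = mode, cost
        return best_mode

An early-termination threshold inside the loop would yield exactly the "suboptimal but cheaper" behaviour mentioned above.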
In other words, the partitioning unit 262 may be configured to partition a picture in a video sequence into a sequence of coding tree units (CTUs), and the CTU 203 may be further partitioned into smaller block portions or sub-blocks (which again form blocks), for example by iteratively using quad-tree (QT) partitioning, binary-tree (BT) partitioning, or triple-tree (TT) partitioning, or any combination thereof, and, for example, to perform prediction for each of the block portions or sub-blocks, where the mode selection includes selecting the tree structure of the partitioned block 203 and selecting the prediction mode applied to each of the block portions or sub-blocks.
The partitioning (for example, performed by the partitioning unit 262) and the prediction processing (for example, performed by the inter prediction unit 244 and the intra prediction unit 254) performed by the video encoder 20 are described in detail below.
Partitioning
The partitioning unit 262 may partition (or divide) a picture block (or CTU) 203 into smaller portions, for example smaller blocks of square or rectangular shape. For a picture that has three sample arrays, a CTU consists of an N×N block of luma samples and two corresponding blocks of chroma samples. The maximum allowed size of the luma block in a CTU is specified as 128×128 in the versatile video coding (VVC) standard under development, but may in the future be specified as a value different from 128×128, for example 256×256. The CTUs of a picture may be clustered/grouped into slices/coding block groups, coding blocks, or bricks. A coding block covers a rectangular region of a picture, and a coding block may be divided into one or more bricks. A brick consists of a number of CTU rows within a coding block. A coding block that is not partitioned into multiple bricks may also be referred to as a brick. However, a brick that is a true subset of a coding block is not referred to as a coding block. VVC supports two modes of coding block groups, namely the raster-scan slice/coding block group mode and the rectangular slice mode. In the raster-scan coding block group mode, a slice/coding block group contains a sequence of coding blocks in the coding block raster scan of a picture. In the rectangular slice mode, a slice contains a number of bricks of a picture that together form a rectangular region of the picture; the bricks within a rectangular slice are ordered in the brick raster scan order of the slice. These smaller blocks (which may also be referred to as sub-blocks) may be further partitioned into even smaller portions. This is also referred to as tree partitioning or hierarchical tree partitioning, where a root block at, for example, root tree level 0 (hierarchy level 0, depth 0) may be recursively partitioned into two or more blocks of the next lower tree level, for example nodes at tree level 1 (hierarchy level 1, depth 1). These blocks may in turn be partitioned into two or more blocks of the next lower level, for example tree level 2 (hierarchy level 2, depth 2), and so on, until the partitioning ends because a termination criterion is met, for example a maximum tree depth or a minimum block size is reached. Blocks that are not further partitioned are also referred to as leaf blocks or leaf nodes of the tree. A tree partitioned into two portions is referred to as a binary tree (BT), a tree partitioned into three portions is referred to as a ternary tree (TT), and a tree partitioned into four portions is referred to as a quad tree (QT).
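As an illustration of the recursive (hierarchical) tree partitioning just described, the following sketch splits a block by a quad tree until a stop criterion is met; the stop callable and the sizes are assumptions chosen for illustration:

    def split_quadtree(x, y, w, h, stop, depth=0, max_depth=8):
        # Recursively split a block into four quadrants until the stop
        # criterion (e.g. minimum block size) or the maximum depth is met;
        # the returned tuples are the leaf blocks (leaf nodes of the tree).
        if depth == max_depth or stop(w, h):
            return [(x, y, w, h)]
        hw, hh = w // 2, h // 2
        leaves = []
        for dx, dy in ((0, 0), (hw, 0), (0, hh), (hw, hh)):
            leaves += split_quadtree(x + dx, y + dy, hw, hh, stop,
                                     depth + 1, max_depth)
        return leaves

    # e.g. split a 128x128 CTU into 32x32 leaf blocks:
    blocks = split_quadtree(0, 0, 128, 128, stop=lambda w, h: w <= 32)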
For example, a coding tree unit (CTU) may be or include a CTB of luma samples and two corresponding CTBs of chroma samples of a picture that has three sample arrays, or a CTB of samples of a monochrome picture or of a picture that is coded using three separate colour planes and the syntax structures used to code the samples. Correspondingly, a coding tree block (CTB) may be an N×N block of samples, where N may be set to a value such that a component is divided into CTBs, which is the partitioning. A coding unit (CU) may be or include a coding block of luma samples and two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or of a picture that is coded using three separate colour planes and the syntax structures used to code the samples. Correspondingly, a coding block (CB) may be an M×N block of samples, where M and N may be set to values such that a CTB is divided into coding blocks, which is the partitioning.
For example, in an embodiment, a coding tree unit (CTU) may be split into a plurality of CUs according to HEVC by using a quad-tree structure denoted as a coding tree. The decision whether to code a picture region using inter (temporal) prediction or intra (spatial) prediction is made at the leaf-CU level. Each leaf CU may be further split into one, two, or four PUs according to the PU splitting type. Within one PU the same prediction process is applied, and the relevant information is transmitted to the decoder on a PU basis. After the residual block is obtained by applying the prediction process according to the PU splitting type, the leaf CU may be partitioned into transform units (TUs) according to another quad-tree structure similar to the coding tree used for the CU.
For example, in an embodiment, according to the latest video coding standard currently under development, referred to as versatile video coding (VVC), a combined quad tree with nested multi-type trees (for example, binary and ternary trees) is used as the segmentation structure for partitioning a coding tree unit. In the coding tree structure within a coding tree unit, a CU may be square or rectangular. For example, the coding tree unit (CTU) is first partitioned by a quad-tree structure, and the quad-tree leaf nodes are then further partitioned by a multi-type tree structure. There are four splitting types in the multi-type tree structure: vertical binary splitting (SPLIT_BT_VER), horizontal binary splitting (SPLIT_BT_HOR), vertical ternary splitting (SPLIT_TT_VER), and horizontal ternary splitting (SPLIT_TT_HOR). The multi-type tree leaf nodes are referred to as coding units (CUs), and unless a CU is too large for the maximum transform length, this segmentation is used for prediction and transform processing without any further partitioning. In most cases this means that the CU, the PU, and the TU have the same block size in the coding block structure of the quad tree with nested multi-type tree; the exception occurs when the maximum supported transform length is smaller than the width or height of a colour component of the CU. VVC specifies a unique signaling mechanism for the partitioning information in the coding structure of the quad tree with nested multi-type tree. In this signaling mechanism, the coding tree unit (CTU), as the root of the quad tree, is first partitioned by the quad-tree structure. Each quad-tree leaf node (when sufficiently large to allow it) is then further partitioned by the multi-type tree structure. In the multi-type tree structure, a first flag (mtt_split_cu_flag) indicates whether the node is further split; when the node is further split, a second flag (mtt_split_cu_vertical_flag) indicates the splitting direction, and a third flag (mtt_split_cu_binary_flag) indicates whether the split is a binary split or a ternary split. Based on the values of mtt_split_cu_vertical_flag and mtt_split_cu_binary_flag, the decoder can derive the multi-type tree splitting mode (MttSplitMode) of the CU based on predefined rules or a table. It should be noted that, for certain designs, for example the 64×64 luma block and 32×32 chroma pipeline design in VVC hardware decoders, TT splitting is not allowed when the width or height of a luma coding block is greater than 64, and TT splitting is likewise not allowed when the width or height of a chroma coding block is greater than 32. The pipeline design divides a picture into a plurality of virtual pipeline data units (VPDUs), each of which is defined as a non-overlapping unit in the picture. In a hardware decoder, successive VPDUs are processed simultaneously in multiple pipeline stages. In most pipeline stages the VPDU size is roughly proportional to the buffer size, so it is desirable to keep the VPDU small. In most hardware decoders, the VPDU size can be set to the maximum transform block (TB) size. However, in VVC, ternary-tree (TT) and binary-tree (BT) partitioning may increase the VPDU size.
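A sketch of how MttSplitMode could be derived from the three flags is given below; the mapping follows the four splitting types named above, and the string constants are placeholders used for illustration:

    def mtt_split_mode(mtt_split_cu_flag, mtt_split_cu_vertical_flag,
                       mtt_split_cu_binary_flag):
        # Derive the multi-type tree splitting mode of a CU from the
        # signalled flags: direction from the second flag, binary/ternary
        # from the third flag.
        if not mtt_split_cu_flag:
            return 'NO_SPLIT'
        if mtt_split_cu_binary_flag:
            return 'SPLIT_BT_VER' if mtt_split_cu_vertical_flag else 'SPLIT_BT_HOR'
        return 'SPLIT_TT_VER' if mtt_split_cu_vertical_flag else 'SPLIT_TT_HOR'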
In addition, it should be noted that, when a portion of a tree node block exceeds the bottom or the right picture boundary, the tree node block is forced to be split until all samples of every coded CU are located within the picture boundaries.
For example, the intra sub-partitions (ISP) tool may divide a luma intra-predicted block vertically or horizontally into two or four sub-partitions depending on the block size.
In one example, the mode selection unit 260 of the video encoder 20 may be configured to perform any combination of the partitioning techniques described above.
As described above, the video encoder 20 is configured to determine or select the best or optimal prediction mode from a set of (predetermined) prediction modes. The set of prediction modes may include, for example, intra prediction modes and/or inter prediction modes.
Intra prediction
The set of intra prediction modes may include 35 different intra prediction modes, for example non-directional modes such as the DC (or mean) mode and the planar mode, or directional modes as defined in HEVC, or may include 67 different intra prediction modes, for example non-directional modes such as the DC (or mean) mode and the planar mode, or directional modes as defined in VVC. For example, several conventional angular intra prediction modes are adaptively replaced by the wide-angle intra prediction modes defined in VVC for non-square blocks. As another example, to avoid the division operation of DC prediction, only the longer side is used to compute the average for non-square blocks. Furthermore, the intra prediction result of the planar mode may be modified using a position dependent intra prediction combination (PDPC) method.
The intra prediction unit 254 is configured to generate an intra prediction block 265 using reconstructed samples of neighbouring blocks of the same current picture according to an intra prediction mode from the set of intra prediction modes.
The intra prediction unit 254 (or, in general, the mode selection unit 260) is further configured to output intra prediction parameters (or, in general, information indicating the selected intra prediction mode of the block) in the form of syntax elements 266 to the entropy encoding unit 270 for inclusion in the encoded picture data 21, so that, for example, the video decoder 30 can receive and use the prediction parameters for decoding.
The intra prediction modes in HEVC include a DC prediction mode, a planar prediction mode, and 33 angular prediction modes, for a total of 35 candidate prediction modes. The current block may be intra predicted using the pixels of the reconstructed picture blocks to its left and above as reference. A picture block in the surrounding region of the current block that is used for intra prediction of the current block is referred to as a reference block, and the pixels in the reference block are referred to as reference pixels. Among the 35 candidate prediction modes, the DC prediction mode is suited to regions of the current block with flat texture; all pixels in such a region use the average of the reference pixels in the reference block as their prediction. The planar prediction mode is suited to picture blocks whose texture varies smoothly; for a current block meeting this condition, bilinear interpolation over the reference pixels in the reference block is used as the prediction of all pixels in the current block. The angular prediction modes exploit the property that the texture of the current block is highly correlated with the texture of a neighbouring reconstructed picture block, and copy the values of the reference pixels of the corresponding reference block along a certain angle as the prediction of all pixels in the current block.
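As a simplified sketch of the DC prediction mode described above (omitting HEVC's integer arithmetic and boundary smoothing), every sample of the block is filled with the mean of the left and top reference pixels:

    import numpy as np

    def dc_prediction(left_refs, top_refs, block_size):
        # DC mode: predict all samples of the block with the average of
        # the reference pixels from the left and top reconstructed blocks.
        dc = int(round(np.mean(np.concatenate([left_refs, top_refs]))))
        return np.full((block_size, block_size), dc, dtype=np.int32)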
The HEVC encoder selects, for the current block, an optimal intra prediction mode from the 35 candidate prediction modes and writes that optimal intra prediction mode into the video bitstream. To improve the coding efficiency of intra prediction, the encoder/decoder derives three most probable modes from the respective optimal intra prediction modes of the intra-predicted reconstructed picture blocks in the surrounding region. If the optimal intra prediction mode selected for the current block is one of the three most probable modes, a first index is coded to indicate that the selected optimal intra prediction mode is one of the three most probable modes; if the selected optimal intra prediction mode is not one of the three most probable modes, a second index is coded to indicate that the selected optimal intra prediction mode is one of the other 32 modes (the 35 candidate prediction modes excluding the aforementioned three most probable modes). The HEVC standard uses a 5-bit fixed-length code as the aforementioned second index.
The method by which the HEVC encoder derives the three most probable modes includes: placing the optimal intra prediction modes of the left neighbouring picture block and the above neighbouring picture block of the current block into a set, where, if these two optimal intra prediction modes are identical, only one of them is kept in the set. If the two optimal intra prediction modes are identical and both are angular prediction modes, two angular prediction modes adjacent to that angular direction are additionally selected and added to the set; otherwise, the planar prediction mode, the DC prediction mode, and the vertical prediction mode are selected in turn and added to the set until the number of modes in the set reaches 3.
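The following sketch mirrors this derivation, assuming HEVC's mode numbering (0 planar, 1 DC, 26 vertical) and its wrap-around formulas for the two angular modes adjacent to a repeated angular mode; it is an illustration, not a normative implementation:

    PLANAR, DC, VERTICAL = 0, 1, 26   # HEVC mode indices (assumed here)

    def derive_mpms(left_mode, above_mode):
        if left_mode == above_mode:
            if left_mode in (PLANAR, DC):           # non-angular: fixed list
                return [PLANAR, DC, VERTICAL]
            return [left_mode,                       # angular: add the two
                    2 + ((left_mode + 29) % 32),     # neighbouring angular
                    2 + ((left_mode - 1) % 32)]      # modes (wrap-around)
        mpms = [left_mode, above_mode]
        for m in (PLANAR, DC, VERTICAL):             # fill the set up to 3
            if m not in mpms:
                mpms.append(m)
                break
        return mpms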
After entropy decoding the bitstream, the HEVC decoder obtains the mode information of the current block, which includes a flag indicating whether the optimal intra prediction mode of the current block is among the three most probable modes, and either the index of the optimal intra prediction mode of the current block among the three most probable modes or the index of the optimal intra prediction mode of the current block among the other 32 modes.
Inter prediction
In a possible implementation, the set of inter prediction modes depends on the available reference pictures (that is, for example, the aforementioned at least partly previously decoded pictures stored in the DPB 230) and on other inter prediction parameters, for example on whether the entire reference picture or only a portion of the reference picture, for example a search window region around the region of the current block, is used to search for a best-matching reference block, and/or, for example, on whether pixel interpolation, for example half-pel, quarter-pel, and/or sixteenth-pel interpolation, is applied.
In addition to the prediction modes described above, a skip mode and/or a direct mode may be applied.
For example, in extended merge prediction, the merge candidate list of this mode is constructed from the following five types of candidates in order: spatial MVPs from spatially neighbouring CUs, temporal MVPs from collocated CUs, history-based MVPs from a FIFO table, pairwise average MVPs, and zero MVs. Decoder-side motion vector refinement (DMVR) based on bilateral matching may be used to increase the accuracy of the MVs of the merge mode. Merge mode with MVD (MMVD) is derived from the merge mode with motion vector differences. An MMVD flag is signalled immediately after the skip flag and the merge flag to specify whether the MMVD mode is used for the CU. A CU-level adaptive motion vector resolution (AMVR) scheme may be applied. AMVR allows the MVD of a CU to be coded at different precisions. The MVD of the current CU is adaptively selected according to the prediction mode of the current CU. When a CU is coded in merge mode, a combined inter/intra prediction (CIIP) mode may be applied to the current CU. A weighted average of the inter and intra prediction signals is computed to obtain the CIIP prediction. For affine motion compensated prediction, the affine motion field of a block is described by the motion information of two control-point (4-parameter) or three control-point (6-parameter) motion vectors. Subblock-based temporal motion vector prediction (SbTMVP) is similar to the temporal motion vector prediction (TMVP) in HEVC, but predicts the motion vectors of the sub-CUs within the current CU. Bi-directional optical flow (BDOF), formerly referred to as BIO, is a simplified version requiring less computation, in particular with regard to the number of multiplications and the size of the multipliers. In the triangle partition mode, a CU is split evenly into two triangular partitions using either a diagonal or an anti-diagonal split. In addition, the bi-prediction mode is extended beyond simple averaging to support a weighted average of the two prediction signals.
The inter prediction unit 244 may include a motion estimation (ME) unit and a motion compensation (MC) unit (neither shown in FIG. 2). The motion estimation unit may be configured to receive or obtain the picture block 203 (the current picture block 203 of the current picture 17) and the decoded picture 231, or at least one or more previously reconstructed blocks, for example reconstructed blocks of one or more other/different previously decoded pictures 231, for motion estimation. For example, a video sequence may include the current picture and a previously decoded picture 231, or, in other words, the current picture and the previously decoded picture 231 may be part of, or form, the sequence of pictures forming the video sequence.
For example, the encoder 20 may be configured to select a reference block from a plurality of reference blocks of the same or different pictures among a plurality of other pictures, and to provide the reference picture (or reference picture index) and/or the offset (spatial offset) between the position (x, y coordinates) of the reference block and the position of the current block to the motion estimation unit as inter prediction parameters. This offset is also referred to as a motion vector (MV).
The motion compensation unit is configured to obtain, for example receive, an inter prediction parameter, and to perform inter prediction based on or using that inter prediction parameter to obtain an inter prediction block 246. The motion compensation performed by the motion compensation unit may involve extracting or generating the prediction block based on the motion/block vector determined by motion estimation, and may include performing interpolation to sub-pixel precision. Interpolation filtering can generate samples of additional pixels from the samples of known pixels, thereby potentially increasing the number of candidate prediction blocks available for coding a picture block. Upon receiving the motion vector corresponding to the PU of the current picture block, the motion compensation unit may locate the prediction block to which the motion vector points in one of the reference picture lists.
The motion compensation unit may also generate syntax elements related to the blocks and the video slice, for use by the video decoder 30 when decoding the picture blocks of the video slice. In addition to, or as an alternative to, slices and the corresponding syntax elements, coding block groups and/or coding blocks and the corresponding syntax elements may be generated or used.
In the process of obtaining the candidate motion vector list in the advanced motion vector prediction (AMVP) mode, the motion vectors (MVs) that may be added to the candidate motion vector list as candidates include the MVs of the spatially neighbouring and temporally neighbouring picture blocks of the current block, where the MVs of the spatially neighbouring picture blocks may in turn include the MV of a left candidate picture block located to the left of the current block and the MV of an above candidate picture block located above the current block. As an example, refer to FIG. 4, which is an exemplary schematic diagram of candidate picture blocks according to an embodiment of this application. As shown in FIG. 4, the set of left candidate picture blocks includes {A0, A1}, the set of above candidate picture blocks includes {B0, B1, B2}, and the set of temporally neighbouring candidate picture blocks includes {C, T}. All three sets may contribute candidates to the candidate motion vector list; however, according to the existing coding standard, the maximum length of the AMVP candidate motion vector list is 2, so the MVs of at most two picture blocks to be added to the candidate motion vector list need to be determined from the three sets in a specified order. The order may be: first consider the set {A0, A1} of left candidate picture blocks of the current block (consider A0 first, and consider A1 only if A0 is unavailable); then consider the set {B0, B1, B2} of above candidate picture blocks of the current block (consider B0 first, consider B1 if B0 is unavailable, and consider B2 if B1 is unavailable); and finally consider the set {C, T} of temporally neighbouring candidate picture blocks of the current block (consider T first, and consider C if T is unavailable).
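A sketch of this scan order is given below, with the candidates represented as a hypothetical mapping from position name to MV (None meaning unavailable); pruning of duplicate MVs and MV scaling are omitted for brevity:

    def build_amvp_list(candidates, max_len=2):
        # Scan order per the description above: left {A0, A1}, then above
        # {B0, B1, B2}, then temporal {T, C}; take at most one MV per group
        # and stop once max_len MVs have been collected.
        mv_list = []
        for group in (('A0', 'A1'), ('B0', 'B1', 'B2'), ('T', 'C')):
            for pos in group:
                if candidates.get(pos) is not None:
                    mv_list.append(candidates[pos])
                    break
            if len(mv_list) == max_len:
                break
        return mv_list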
After the above candidate motion vector list is obtained, the optimal MV is determined from the candidate motion vector list by means of the rate-distortion cost (RD cost), and the candidate motion vector with the smallest RD cost is used as the motion vector predictor (MVP) of the current block. The rate-distortion cost is calculated by the following formula:
J = SAD + λR
where J denotes the RD cost, SAD is the sum of absolute differences (SAD) between the pixel values of the prediction block obtained by motion estimation using the candidate motion vector and the pixel values of the current block, R denotes the bit rate, and λ denotes the Lagrange multiplier.
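For illustration, the RD cost of one candidate motion vector can be evaluated as sketched below, where rate_bits stands for the estimated rate R of the candidate; this is a direct transcription of the formula above:

    import numpy as np

    def rd_cost(current_block, prediction_block, rate_bits, lmbda):
        # J = SAD + lambda * R, with SAD the sum of absolute differences
        # between the current block and the motion-compensated prediction.
        sad = np.abs(current_block.astype(np.int32) - prediction_block).sum()
        return sad + lmbda * rate_bits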
The encoder side transmits the index of the determined MVP in the candidate motion vector list to the decoder side. Further, a motion search may be performed in a neighbourhood centred on the MVP to obtain the actual motion vector of the current block; the encoder side computes the motion vector difference (MVD) between the MVP and the actual motion vector and transmits the MVD to the decoder side as well. The decoder side parses the index, finds the corresponding MVP in the candidate motion vector list based on that index, parses the MVD, and adds the MVD to the MVP to obtain the actual motion vector of the current block.
In the process of obtaining the candidate motion information list in the merge mode, the motion information that may be added to the candidate motion information list as candidates includes the motion information of the spatially or temporally neighbouring picture blocks of the current block, where for the spatially and temporally neighbouring picture blocks reference may be made to FIG. 4. The candidate motion information corresponding to the spatial domain in the candidate motion information list comes from the five spatially neighbouring blocks (A0, A1, B0, B1, and B2); if a spatially neighbouring block is unavailable or is intra predicted, its motion information is not added to the candidate motion information list. The temporal candidate motion information of the current block is obtained by scaling the MV of the block at the corresponding position in the reference frame according to the picture order counts (POCs) of the reference frame and of the current frame; it is first determined whether the block at position T in the reference frame is available, and if it is not available, the block at position C is selected. After the above candidate motion information list is obtained, the optimal motion information is determined from the candidate motion information list by means of the RD cost and used as the motion information of the current block. The encoder side transmits the index value (denoted merge index) of the position of the optimal motion information in the candidate motion information list to the decoder side.
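The POC-based scaling of the collocated MV can be sketched as follows; the integer rounding and clipping performed by real codecs are omitted, so this illustrates only the ratio of the POC distances (and assumes the two distances are non-zero):

    def scale_temporal_mv(mv_col, poc_cur, poc_cur_ref, poc_col, poc_col_ref):
        # Scale the MV of the collocated block by the ratio of the POC
        # distance of the current picture to its reference and the POC
        # distance of the collocated picture to its reference.
        tb = poc_cur - poc_cur_ref
        td = poc_col - poc_col_ref
        scale = tb / td
        return (mv_col[0] * scale, mv_col[1] * scale)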
Entropy coding
Referring to FIG. 2, the entropy encoding unit 270 includes a trained self-attention decoding network 2071 and a self-attention encoding network 2072. The self-attention decoding network 2071 is configured to process an input picture, picture region, or picture block to obtain first context information; the self-attention encoding network 2072 is configured to process an input picture, picture region, or picture block to obtain first side information.
The entropy encoding unit 270 is configured to apply an entropy encoding algorithm or scheme (for example, a variable length coding (VLC) scheme, a context adaptive VLC scheme (CAVLC), an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy encoding method or technique) to the quantized residual coefficients 209, the inter prediction parameters, the intra prediction parameters, the loop filter parameters, and/or other syntax elements, to obtain encoded picture data 21 that can be output via the output 272, for example in the form of an encoded bitstream 21, so that, for example, the video decoder 30 can receive and use the parameters for decoding. The encoded bitstream 21 may be transmitted to the video decoder 30, or stored in memory for later transmission or retrieval by the video decoder 30.
Other structural variants of the video encoder 20 may be used to encode the video stream. For example, a non-transform-based encoder 20 may quantize the residual signal directly, without the transform processing unit 206, for certain blocks or frames. In another implementation, the encoder 20 may have the quantization unit 208 and the inverse quantization unit 210 combined into a single unit.
Decoder and decoding method
As shown in FIG. 3, the video decoder 30 is configured to receive encoded picture data 21 (for example, the encoded bitstream 21), encoded for example by the encoder 20, to obtain a decoded picture 331. The encoded picture data or bitstream includes information for decoding the encoded picture data, for example data representing the picture blocks of an encoded video slice (and/or coding block group or coding block) and the associated syntax elements.
In the example of FIG. 3, the decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (for example, a summer 314), a loop filter 320, a decoded picture buffer (DPB) 330, a mode application unit 360, an inter prediction unit 344, and an intra prediction unit 354. The inter prediction unit 344 may be or include a motion compensation unit. In some examples, the video decoder 30 may perform a decoding process that is substantially the inverse of the encoding process described with reference to the video encoder 100 of FIG. 2.
Referring to FIG. 3, the entropy decoding unit 304 includes a trained self-attention decoding network 3041, and the self-attention decoding network 3041 is configured to process an input picture, picture region, or picture block to obtain first context information.
As described for the encoder 20, the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer DPB 230, the inter prediction unit 244, and the intra prediction unit 254 also form the "built-in decoder" of the video encoder 20. Correspondingly, the inverse quantization unit 310 may be functionally identical to the inverse quantization unit 210, the inverse transform processing unit 312 may be functionally identical to the inverse transform processing unit 212, the reconstruction unit 314 may be functionally identical to the reconstruction unit 214, the loop filter 320 may be functionally identical to the loop filter 220, and the decoded picture buffer 330 may be functionally identical to the decoded picture buffer 230. Accordingly, the explanations of the corresponding units and functions of the video encoder 20 apply correspondingly to the respective units and functions of the video decoder 30.
Entropy decoding
The entropy decoding unit 304 is configured to parse the bitstream 21 (or, in general, the encoded picture data 21) and to perform entropy decoding on the encoded picture data 21 to obtain, for example, quantized coefficients 309 and/or decoded coding parameters (not shown in FIG. 3), for example any or all of inter prediction parameters (for example, reference picture indices and motion vectors), intra prediction parameters (for example, intra prediction modes or indices), transform parameters, quantization parameters, loop filter parameters, and/or other syntax elements. The entropy decoding unit 304 may be configured to apply a decoding algorithm or scheme corresponding to the encoding scheme of the entropy encoding unit 270 of the encoder 20. The entropy decoding unit 304 may further be configured to provide the inter prediction parameters, the intra prediction parameters, and/or other syntax elements to the mode application unit 360, and to provide other parameters to the other units of the decoder 30. The video decoder 30 may receive syntax elements at the video slice level and/or the video block level. In addition to, or as an alternative to, slices and the corresponding syntax elements, coding block groups and/or coding blocks and the corresponding syntax elements may be received or used.
Inverse quantization
The inverse quantization unit 310 may be configured to receive a quantization parameter (QP) (or, in general, information related to inverse quantization) and quantized coefficients from the encoded picture data 21 (for example, parsed and/or decoded by the entropy decoding unit 304), and to inversely quantize the decoded quantized coefficients 309 based on the quantization parameter to obtain dequantized coefficients 311, which may also be referred to as transform coefficients 311. The inverse quantization process may include using a quantization parameter calculated by the video encoder 20 for each video block in the video slice to determine the degree of quantization and, likewise, the degree of inverse quantization that needs to be applied.
Inverse transform
The inverse transform processing unit 312 may be configured to receive the dequantized coefficients 311, also referred to as transform coefficients 311, and to apply a transform to the dequantized coefficients 311 to obtain a reconstructed residual block 313 in the pixel domain. The reconstructed residual block 313 may also be referred to as a transform block 313. The transform may be an inverse transform, for example an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process. The inverse transform processing unit 312 may further be configured to receive transform parameters or corresponding information from the encoded picture data 21 (for example, parsed and/or decoded by the entropy decoding unit 304) to determine the transform to be applied to the dequantized coefficients 311.
Reconstruction
The reconstruction unit 314 (for example, a summer 314) is configured to add the reconstructed residual block 313 to the prediction block 365 to obtain a reconstructed block 315 in the pixel domain, for example by adding the sample values of the reconstructed residual block 313 and the sample values of the prediction block 365.
Filtering
The loop filter unit 320 (within the coding loop or after it) is configured to filter the reconstructed block 315 to obtain a filtered block 321, for example to smooth pixel transitions or otherwise improve video quality. The loop filter unit 320 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 320 may include a deblocking filter, an SAO filter, and an ALF filter, and the filtering process may be applied in the order deblocking filter, SAO filter, ALF filter. In another example, a process called luma mapping with chroma scaling (LMCS), that is, an adaptive in-loop reshaper, is added; this process is performed before deblocking. In a further example, the deblocking filter process may also be applied to internal sub-block edges, for example affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although the loop filter unit 320 is shown in FIG. 3 as an in-loop filter, in other configurations the loop filter unit 320 may be implemented as a post-loop filter.
Decoded picture buffer
The decoded video blocks 321 of a picture are then stored in the decoded picture buffer 330, which stores decoded pictures 331 as reference pictures for subsequent motion compensation of other pictures and/or for output and display, respectively.
The decoder 30 is configured to output the decoded picture 311, for example via an output 312, for presentation to or viewing by a user.
Prediction
The inter prediction unit 344 may be functionally identical to the inter prediction unit 244 (in particular, to the motion compensation unit), and the intra prediction unit 354 may be functionally identical to the intra prediction unit 254; they decide on the splitting or partitioning and perform prediction based on the partitioning and/or prediction parameters or corresponding information received from the encoded picture data 21 (for example, parsed and/or decoded by the entropy decoding unit 304). The mode application unit 360 may be configured to perform prediction (intra or inter prediction) for each block based on the reconstructed pictures, blocks, or corresponding samples (filtered or unfiltered) to obtain the prediction block 365.
When a video slice is coded as an intra coded (I) slice, the intra prediction unit 354 of the mode application unit 360 is configured to generate the prediction block 365 for a picture block of the current video slice based on the signalled intra prediction mode and on data from previously decoded blocks of the current picture. When a video picture is coded as an inter coded (that is, B or P) slice, the inter prediction unit 344 (for example, the motion compensation unit) of the mode application unit 360 is configured to generate the prediction block 365 for a video block of the current video slice based on the motion vectors and other syntax elements received from the entropy decoding unit 304. For inter prediction, these prediction blocks may be generated from one of the reference pictures in one of the reference picture lists. The video decoder 30 may construct reference frame list 0 and list 1 using a default construction technique based on the reference pictures stored in the DPB 330. The same or a similar process may be applied to embodiments using coding block groups (for example, video coding block groups) and/or coding blocks (for example, video coding blocks), in addition to or instead of slices (for example, video slices); for example, a video may be coded using I, P, or B coding block groups and/or coding blocks.
The mode application unit 360 is configured to determine prediction information for a video block of the current video slice by parsing the motion vectors and other syntax elements, and to use the prediction information to generate the prediction block for the current video block being decoded. For example, the mode application unit 360 uses some of the received syntax elements to determine the prediction mode (for example, intra or inter prediction) used to code the video blocks of the video slice, the inter prediction slice type (for example, B slice, P slice, or GPB slice), the construction information of one or more reference picture lists of the slice, the motion vector of each inter coded video block of the slice, the inter prediction status of each inter coded video block of the slice, and other information, in order to decode the video blocks of the current video slice. The same or a similar process may be applied to embodiments using coding block groups (for example, video coding block groups) and/or coding blocks (for example, video coding blocks), in addition to or instead of slices (for example, video slices); for example, a video may be coded using I, P, or B coding block groups and/or coding blocks.
In one embodiment, the video decoder 30 of FIG. 3 may further be configured to partition and/or decode a picture using slices (also referred to as video slices), where a picture may be partitioned or decoded using one or more (typically non-overlapping) slices. Each slice may include one or more blocks (for example, CTUs) or one or more groups of blocks (for example, coding blocks in the H.265/HEVC/VVC standards and bricks in the VVC standard).
In one embodiment, the video decoder 30 shown in FIG. 3 may further be configured to partition and/or decode a picture using slices/coding block groups (also referred to as video coding block groups) and/or coding blocks (also referred to as video coding blocks), where a picture may be partitioned or decoded using one or more (typically non-overlapping) slices/coding block groups, and each slice/coding block group may include one or more blocks (for example, CTUs) or one or more coding blocks and so on, where each coding block may have a rectangular or other shape and may include one or more complete or partial blocks (for example, CTUs).
Other variants of the video decoder 30 may be used to decode the encoded picture data 21. For example, the decoder 30 may generate the output video stream without the loop filter unit 320. For example, a non-transform-based decoder 30 may inversely quantize the residual signal directly, without the inverse transform processing unit 312, for certain blocks or frames. In another implementation, the video decoder 30 may have the inverse quantization unit 310 and the inverse transform processing unit 312 combined into a single unit.
It should be understood that, in the encoder 20 and the decoder 30, the processing result of a current step may be further processed before being output to the next step. For example, after interpolation filtering, motion vector derivation, or loop filtering, a further operation, such as a clip or shift operation, may be performed on the processing result of the interpolation filtering, the motion vector derivation, or the loop filtering.
It should be noted that further operations may be applied to the derived motion vectors of the current block (including, but not limited to, control-point motion vectors of the affine mode, sub-block motion vectors of the affine, planar, or ATMVP modes, temporal motion vectors, and so on). For example, the value of a motion vector is restricted to a predefined range according to its representation bits. If the representation bits of the motion vector are bitDepth, the range is -2^(bitDepth-1) to 2^(bitDepth-1)-1, where "^" denotes exponentiation. For example, if bitDepth is set to 16, the range is -32768 to 32767; if bitDepth is set to 18, the range is -131072 to 131071. For example, the values of the derived motion vectors (for example, the MVs of the four 4×4 sub-blocks within an 8×8 block) are restricted such that the maximum difference between the integer parts of the MVs of the four 4×4 sub-blocks does not exceed N pixels, for example does not exceed 1 pixel. Two methods of restricting motion vectors according to bitDepth are provided here.
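A simple clipping sketch consistent with the stated range is given below for illustration; it does not reproduce either of the two methods referred to above verbatim:

    def clip_mv(mv, bit_depth=18):
        # Restrict each MV component to the signed range representable
        # with bit_depth bits: [-2^(bitDepth-1), 2^(bitDepth-1) - 1].
        lo = -(1 << (bit_depth - 1))
        hi = (1 << (bit_depth - 1)) - 1
        return tuple(max(lo, min(hi, c)) for c in mv)

    clip_mv((140000, -140000))  # -> (131071, -131072) for bit_depth = 18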
Although the above embodiments primarily describe video coding, it should be noted that the embodiments of the coding system 10, the encoder 20, and the decoder 30, as well as the other embodiments described herein, may also be used for still picture processing or coding, that is, the processing or coding of an individual picture independently of any preceding or consecutive pictures in video coding. In general, if the picture processing is limited to a single picture 17, the inter prediction unit 244 (encoder) and the inter prediction unit 344 (decoder) may not be available. All other functions (also referred to as tools or techniques) of the video encoder 20 and the video decoder 30 are equally applicable to still picture processing, for example residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, partitioning 262/362, intra prediction 254/354 and/or loop filtering 220/320, entropy encoding 270, and entropy decoding 304.
An embodiment of this application provides a video coding device. In one embodiment, the video coding device may be a decoder, such as the video decoder 30 in FIG. 1, or an encoder, such as the video encoder 20 in FIG. 1.
The video coding device includes: an ingress port (or input port) and a receiver unit (Rx) for receiving data; a processor, logic unit, or central processing unit (CPU) for processing the data, where, for example, the processor may be a neural network processor; a transmitter unit (Tx) and an egress port (or output port) for transmitting the data; and a memory for storing the data. The video coding device may further include an optical-to-electrical (OE) component and an electrical-to-optical (EO) component coupled to the ingress port, the receiver unit, the transmitter unit, and the egress port, serving as the egress or ingress of optical or electrical signals.
The processor is implemented by hardware and software. The processor may be implemented as one or more processor chips, cores (for example, a multi-core processor), FPGAs, ASICs, or DSPs. The processor communicates with the ingress port, the receiver unit, the transmitter unit, the egress port, and the memory. The processor includes a coding module (for example, a neural-network-based coding module). The coding module implements the embodiments disclosed above. For example, the coding module performs, processes, prepares, or provides various coding operations. Therefore, the coding module provides a substantial improvement to the functionality of the video coding device and effects the switching of the video coding device between different states. Alternatively, the coding module is implemented as instructions stored in the memory and executed by the processor.
The memory includes one or more disks, tape drives, and solid-state drives, and may be used as an overflow data storage device to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. The memory may be volatile and/or non-volatile, and may be a read-only memory (ROM), a random access memory (RAM), a ternary content-addressable memory (TCAM), and/or a static random-access memory (SRAM).
An embodiment of this application provides an apparatus that may include a processor, a memory, and a bus. The apparatus may be used as either or both of the source device 12 and the destination device 14 in FIG. 1.
The processor in the apparatus may be a central processing unit. Alternatively, the processor may be any other type of device, or multiple devices, existing now or developed in the future, capable of manipulating or processing information. Although the disclosed implementations can be practiced with a single processor such as the one shown, using more than one processor achieves higher speed and efficiency.
In one implementation, the memory in the apparatus may be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as the memory. The memory may include code and data accessed by the processor through the bus. The memory may further include an operating system and application programs, the application programs including at least one program that allows the processor to perform the methods described herein. For example, the application programs may include applications 1 through N, and further include a video coding application that performs the methods described herein.
The apparatus may also include one or more output devices, such as a display. In one example, the display may be a touch-sensitive display that combines the display with a touch-sensitive element operable to sense touch input. The display may be coupled to the processor via the bus.
Although the bus in the apparatus is described herein as a single bus, it may include multiple buses. In addition, secondary storage may be directly coupled to the other components of the apparatus or accessed over a network, and may include a single integrated unit such as a memory card, or multiple units such as multiple memory cards. The apparatus may therefore have a wide variety of configurations.
Since the embodiments of this application involve the application of neural networks, for ease of understanding, some terms used in the embodiments of this application are explained below; these terms also form part of the summary of the invention.
(1) Neural network
A neural network (NN) is a machine learning model. A neural network may be composed of neural units, where a neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:

$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce a non-linear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neural units.
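As a minimal illustration of this operation unit (ours, not part of the application text), the following Python sketch evaluates a single neural unit with a sigmoid activation on arbitrary example values:

```python
import math

def neural_unit(xs, ws, b):
    """Single neural unit: f(sum_s W_s * x_s + b), with a sigmoid activation f."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

print(neural_unit([0.5, -1.0, 2.0], [0.1, 0.4, -0.2], b=0.05))
```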
(2) Deep neural network
A deep neural network (DNN), also known as a multilayer neural network, can be understood as a neural network with many hidden layers; there is no particular metric for "many" here. Divided by the positions of the different layers, the neural network inside a DNN can be categorized into three types of layer: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is necessarily connected to any neuron in the (i+1)-th layer. Although the DNN looks complicated, the work of each layer is actually not complicated; simply put, each layer computes the following linear relationship expression:

$\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$. The superscript 3 represents the layer in which the coefficient $W$ is located, while the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary: the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W_{jk}^{L}$. It should be noted that the input layer has no $W$ parameter. In a deep neural network, more hidden layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
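A minimal sketch of this per-layer computation, assuming an arbitrary two-layer network with a ReLU activation (the shapes and values are illustrative assumptions):

```python
import numpy as np

def dense_layer(x, W, b):
    """One DNN layer: y = alpha(W @ x + b), with ReLU as the activation alpha.
    W[j, k] is the coefficient from neuron k of the previous layer to neuron j of this layer."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # layer 1: 4 -> 8
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)     # layer 2: 8 -> 2
y = dense_layer(dense_layer(x, W1, b1), W2, b2)
print(y.shape)  # (2,)
```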
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure, and is a deep learning architecture. A deep learning architecture means performing multiple levels of learning at different levels of abstraction by means of machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which the individual neurons can respond to the images input into it. A convolutional neural network contains a feature extractor composed of convolutional layers and pooling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving a trainable filter with an input image or a convolutional feature map.
A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. A convolutional layer may include many convolution operators. A convolution operator, also called a kernel, plays a role in image processing equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. In the process of performing a convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby completing the extraction of a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends across the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output of a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension may be understood as being determined by the "multiple" described above. Different weight matrices may be used to extract different features of the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), the feature maps extracted by these weight matrices of the same size also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation. In practical applications, the weight values in these weight matrices need to be obtained through extensive training; the weight matrices formed by the trained weight values may be used to extract information from the input image, enabling the convolutional neural network to make correct predictions. When the convolutional neural network has multiple convolutional layers, the initial convolutional layers often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers become more and more complex, for example high-level semantic features, and features with higher semantics are more applicable to the problem to be solved.
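To make the stride-based sliding of a weight matrix concrete, here is a minimal single-channel 2D convolution sketch (an illustration of the general technique, not code from this application); the example kernel is a standard vertical-edge filter:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution of a single-channel image with one kernel (weight matrix)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # weighted sum over the current window
    return out

edge_kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])  # extracts vertical edges
print(conv2d(np.arange(36.0).reshape(6, 6), edge_kernel, stride=2).shape)  # (2, 2)
```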
Since the number of training parameters often needs to be reduced, a pooling layer often needs to be introduced periodically after a convolutional layer; this may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may compute the pixel values within a specific range of the image to produce an average value as the result of average pooling. The maximum pooling operator may take the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the size of the image. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer; each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
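A corresponding sketch of non-overlapping 2×2 max pooling (again, ours and purely illustrative):

```python
import numpy as np

def max_pool2d(image, size=2):
    """Non-overlapping max pooling: each output pixel is the maximum of a size x size sub-region."""
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
```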
After the processing by the convolutional layers/pooling layers, the convolutional neural network is not yet sufficient to output the required output information. This is because, as described above, the convolutional layers/pooling layers only extract features and reduce the parameters introduced by the input image. However, to generate the final output information (the required class information or other related information), the convolutional neural network needs to use the neural network layer to generate the output of one required class or a group of required classes. Therefore, the neural network layer may include multiple hidden layers, and the parameters contained in the multiple hidden layers may be pre-trained based on related training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
Optionally, after the multiple hidden layers in the neural network layer, the output layer of the entire convolutional neural network is further included. The output layer has a loss function similar to the categorical cross-entropy, specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network is completed, back propagation starts updating the weight values and biases of the aforementioned layers so as to reduce the loss of the convolutional neural network, that is, the error between the result output by the convolutional neural network through the output layer and the ideal result.
(4) Recurrent neural network
Recurrent neural networks (RNN) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many problems, it is still incapable of handling many others. For example, to predict the next word of a sentence, the preceding words generally need to be used, because the words in a sentence are not independent of one another. The reason an RNN is called a recurrent neural network is that the current output of a sequence is also related to the preceding outputs. The concrete manifestation is that the network memorizes the preceding information and applies it to the computation of the current output; that is, the nodes of the hidden layer are no longer unconnected to each other but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. The training of an RNN is the same as the training of a traditional CNN or DNN: the error back-propagation algorithm is likewise used, but with one difference: if the RNN is unrolled, the parameters in it, such as W, are shared, whereas this is not the case for the traditional neural network exemplified above. Moreover, when the gradient descent algorithm is used, the output of each step depends not only on the network of the current step but also on the network states of several preceding steps. This learning algorithm is called back propagation through time (BPTT).
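A minimal sketch of one recurrent step, assuming a plain tanh RNN cell (an illustrative choice; the application does not prescribe a cell type):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: the new hidden state depends on the current input
    and the previous hidden state (the 'memory'). W_x and W_h are shared across steps."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)
h = np.zeros(3)
for x_t in rng.normal(size=(5, 2)):   # a length-5 sequence of 2-dimensional inputs
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)
```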
Why are recurrent neural networks needed when convolutional neural networks already exist? The reason is simple: in a convolutional neural network there is a premise that the elements are independent of each other, and the inputs and outputs are also independent, such as cats and dogs. But in the real world, many elements are interconnected, for example stock prices changing over time, or a person saying: "I like traveling, and my favorite place is Yunnan; I will definitely go there when I have the chance." Filling in the blank here, humans all know that the answer is "Yunnan", because humans infer from the content of the context. But how can a machine do this? RNNs came into being for this purpose. RNNs are intended to give machines the ability to remember, like humans. Therefore, the output of an RNN needs to depend on the current input information and the historical memory information.
(5) Loss function
In the process of training a deep neural network, because it is desired that the output of the deep neural network be as close as possible to the value that is actually to be predicted, the predicted value of the current network may be compared with the actually desired target value, and the weight vector of each layer of the neural network may then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, namely pre-configuring parameters for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, "how to compare the difference between the predicted value and the target value" needs to be defined in advance; this is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
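As a concrete instance of such a loss function, a sketch of the mean squared error (one common choice among many; purely illustrative):

```python
def mse_loss(predictions, targets):
    """Mean squared error: grows as predictions deviate further from the targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

print(mse_loss([0.9, 0.2], [1.0, 0.0]))  # 0.025
```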
(6) Back-propagation algorithm
A convolutional neural network may use the error back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward passing the input signal until the output produces an error loss, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back-propagation algorithm is a back-propagation movement dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
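A toy sketch of one gradient-descent update driven by back-propagated error, for a single linear unit with a squared-error loss (purely illustrative; this is not the application's training procedure):

```python
# Repeated update of y = w * x + b toward a target under loss = (y - target)^2.
w, b, lr = 0.0, 0.0, 0.1
x, target = 2.0, 3.0
for _ in range(50):
    y = w * x + b                 # forward pass
    grad_y = 2 * (y - target)     # d(loss)/dy
    w -= lr * grad_y * x          # back-propagate the error to the weight
    b -= lr * grad_y              # and to the bias
print(round(w * x + b, 4))        # converges toward the target 3.0
```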
(7) Generative adversarial network
A generative adversarial network (GAN) is a deep learning model. The model includes at least two modules: one module is the generative model and the other is the discriminative model; through these two modules learning from each other in a game, a better output is produced. Both the generative model and the discriminative model may be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures: it receives a random noise z and generates a picture from this noise, denoted G(z). D is a discriminative network used to determine whether a picture is "real". Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: if it is 1, the picture is 100% real; if it is 0, the picture cannot be real. In the process of training this generative adversarial network, the goal of the generative network G is to generate pictures as realistic as possible to deceive the discriminative network D, while the goal of the discriminative network D is to distinguish the pictures generated by G from real pictures as far as possible. In this way, G and D constitute a dynamic "game" process, namely the "adversarial" part of the "generative adversarial network". As the outcome of the final game, in an ideal state, G can generate a picture G(z) sufficient to pass the fake off as real, while D finds it difficult to determine whether the picture generated by G is real, that is, D(G(z)) = 0.5. An excellent generative model G is thus obtained, which can be used to generate pictures.
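A sketch of the corresponding losses, assuming the standard non-saturating GAN formulation (our illustrative choice; the application does not specify one):

```python
import math

def gan_losses(d_real, d_fake):
    """Losses given D's outputs on a real sample and on a generated sample G(z)."""
    loss_d = -(math.log(d_real) + math.log(1.0 - d_fake))  # D: score real high, fake low
    loss_g = -math.log(d_fake)                             # G: make D score the fake high
    return loss_d, loss_g

print(gan_losses(d_real=0.9, d_fake=0.5))  # at the ideal equilibrium, D(G(z)) = 0.5
```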
FIG. 5 is a schematic diagram of an application scenario according to an embodiment of this application; FIG. 5 is described using data including images/videos as an example. In this application scenario, a device acquires images/videos, performs entropy encoding on the acquired images/videos to obtain a code stream, and stores the code stream. When the images/videos subsequently need to be output, entropy decoding is performed on the code stream to obtain the images/videos. The device may integrate the functions of the aforementioned source device and destination device.
As shown in FIG. 5, the device includes an encoding network, a super-encoding network, an entropy encoding network, a saving module, a loading module, a super-decoding network, an entropy decoding network, and a decoding network. The encoding network is used to perform feature extraction on the input images/videos to obtain feature images/videos with lower redundancy. The super-encoding network is used to estimate the estimated probability value of each feature element in the feature images/videos. The entropy encoding module then performs entropy encoding on the corresponding feature element according to the estimated probability value of each feature element to obtain a code stream, and stores the code stream through the saving module. The loading module may subsequently load the code stream, and the super-decoding network is used to estimate the estimated probability value of the code stream corresponding to each feature element in the code stream. The entropy decoding module is used to perform entropy decoding on the corresponding code stream according to the estimated probability value of the code stream corresponding to each feature element, to obtain the feature images/videos. The decoding network is used to perform inverse feature extraction on the feature images/videos to obtain the images/videos.
It should be understood that the device compresses the images/videos to save storage space. Optionally, the device may store the compressed images/videos in an album or a cloud album.
FIG. 6 is a schematic diagram of another application scenario according to an embodiment of this application; FIG. 6 is described using data including images/videos as an example. In this application scenario, images/videos are acquired locally, image (JPEG) encoding is performed on the acquired data to obtain compressed images/videos, and the compressed images/videos are then sent to the cloud. The cloud performs JPEG decoding on the compressed images/videos to obtain the images/videos, then performs entropy encoding on the images/videos to obtain a code stream, and stores the code stream. When the local side subsequently needs to obtain the images/videos from the cloud, the cloud performs entropy decoding on the code stream to obtain the images/videos, then performs JPEG encoding on the images/videos to obtain compressed images/videos, and sends the compressed images/videos to the local side. The local side performs JPEG decoding on the compressed images/videos to obtain the images/videos. The cloud may integrate the functions of the aforementioned source device and destination device. For the structure of the cloud and the purposes of the modules, reference may be made to the structure and module purposes of FIG. 5, and details are not repeated here in this embodiment of this application.
It should be understood that the local side or the cloud performs JPEG encoding on the acquired data to reduce transmission bandwidth, and the cloud compresses the images/videos to save storage space.
The method in the embodiments of this application may be applied to an end-to-end (etoe) codec architecture. Please refer to FIG. 7, which is a schematic structural diagram of an encoder in an end-to-end codec architecture according to an embodiment of this application. As shown in FIG. 7, the encoder includes an encoding network, a quantization module, a super-encoding network, a super-decoding network, and an entropy encoding module. The encoding network is used to perform feature extraction on the input current data stream to obtain feature data. The quantization module is used to quantize the feature data, and the quantized feature data passes through the super-encoding network to obtain code stream 2 of the side information. Code stream 2 passes through the super-decoding network to obtain the side information. The entropy encoding module is used to perform entropy encoding on the input feature data by using the side information to obtain code stream 1.
Please refer to FIG. 8, which is a schematic structural diagram of a decoder in an end-to-end codec architecture according to an embodiment of this application. As shown in FIG. 8, the decoder includes a decoding network, an entropy decoding module, and a super-decoding network. Code stream 2 is decoded by the super-decoding network to obtain the side information, and the entropy decoding module is used to perform entropy decoding on code stream 1 according to the side information to obtain the feature data. The decoding network is used to perform inverse feature extraction on the feature data to obtain the current data stream.
In the entropy encoding and decoding methods provided in the embodiments of this application, the encoder may obtain reference information, then estimate the estimated probability distribution of the data to be encoded according to the reference information, and perform entropy encoding on the data to be encoded by using the estimated probability distribution of the data to be encoded to obtain a code stream. The decoder may obtain the reference information, then estimate the estimated probability distribution of the code stream according to the reference information, and perform entropy decoding on the code stream by using the estimated probability distribution of the code stream. The reference information may include first context information and/or first side information; further, the reference information may also include second context information and second side information.
When entropy encoding is performed on the data to be encoded included in the current data stream, the current data stream includes multiple data. The first context information is obtained by inputting at least one encoded data among the multiple data into a self-attention decoding network, and the first side information is obtained by inputting the multiple data in the current data stream into a self-attention encoding network. The second context information is obtained by inputting, into a masked convolutional network, at least one data that meets a preset condition among the at least one encoded data. The second side information is obtained by inputting, into a super-encoding network, at least one data that meets a preset condition among the multiple data.
The at least one data that meets the preset condition among the at least one encoded data may include at least one data in the encoded data that neighbors the data to be encoded. For one-dimensional data, the neighbors of the data to be encoded may be the m encoded data immediately preceding the data to be encoded, m > 0. For two-dimensional data, the neighbors of the data to be encoded may be the data adjacent to the data to be encoded, the encoded data within the surrounding n rings of the data to be encoded, and so on, n > 0; the embodiments of this application do not limit what counts as a neighbor. It can thus be seen that the first context information is obtained based on at least one encoded data among the multiple data, while the second context information is obtained based on at least one data, among that at least one encoded data, that neighbors the data to be encoded. Compared with the second context information, the first context information utilizes the encoded data at a higher rate and has more comprehensive content.
The at least one data that meets the preset condition among the multiple data may include at least one data among the multiple data that neighbors the data to be encoded. For one-dimensional data, the neighbors of the data to be encoded may be the first m_1 bits preceding and/or the last m_2 bits following the data to be encoded, m_1, m_2 > 0. For two-dimensional data, the neighbors of the data to be encoded may be the data adjacent to the data to be encoded, the data within the surrounding n rings of the data to be encoded, and so on, n > 0; the embodiments of this application do not limit what counts as a neighbor. It can thus be seen that the first side information is obtained based on the multiple data, while the second side information is obtained based on at least one data, among the multiple data, that neighbors the data to be encoded. Compared with the second side information, the first side information utilizes the data at a higher rate and has more comprehensive content.
When entropy decoding is performed on the code stream, the first context information is obtained by inputting at least one decoded data into the self-attention decoding network, and the first side information is obtained by performing entropy decoding on the code stream of the first side information. The second context information is obtained by inputting, into the masked convolutional network, at least one data that meets the preset condition among the at least one decoded data. The second side information is obtained by performing entropy decoding on the code stream of the second side information.
The structures of the encoder and the decoder are described below by taking several forms of the reference information as examples. In one implementation, the reference information includes only the first context information. Accordingly, please refer to FIG. 9 and FIG. 10: FIG. 9 is a schematic structural diagram of an encoder according to an embodiment of this application, and FIG. 10 is a schematic structural diagram of a decoder according to an embodiment of this application. As shown in FIG. 9, the encoder includes an encoding network, a quantization module, a self-attention decoding network, and an entropy encoding module. Networks or modules that are the same as those in FIG. 8 above serve the same purposes, and the embodiments of this application do not repeat them here. The self-attention decoding network is used to extract the first context information from the quantized feature data, and the entropy encoding module is used to perform entropy encoding on the quantized feature data according to the first context information to obtain a code stream. As shown in FIG. 10, the decoder includes a self-attention decoding network, an entropy decoding module, and a decoding network. The self-attention decoding network is used to extract the first context information from the decoded data, and the entropy decoding module is used to perform entropy decoding on the code stream according to the first context information.
In another implementation, the reference information includes only the first side information. Accordingly, please refer to FIG. 11 and FIG. 12: FIG. 11 is a schematic structural diagram of an encoder according to an embodiment of this application, and FIG. 12 is a schematic structural diagram of a decoder according to an embodiment of this application. As shown in FIG. 11, the encoder includes an encoding network, a self-attention encoding network, a quantization module, a factorized entropy model, an entropy encoding module, an entropy decoding module, and a self-attention decoding network. The self-attention encoding network is used to extract the first side information from the feature data obtained by feature extraction, the factorized entropy model is used to estimate the estimated probability distribution of the first side information, and the entropy encoding module is used to perform entropy encoding on the first side information according to the estimated probability distribution of the first side information to obtain code stream 2. The entropy decoding module is used to perform entropy decoding on code stream 2 according to the estimated probability distribution of the first side information to obtain the first side information. The self-attention decoding network is used to estimate the estimated probability distribution of the current data stream according to the first side information. The entropy encoding module is used to perform entropy encoding on the quantized feature data according to the estimated probability distribution of the current data stream to obtain code stream 1.
As shown in FIG. 12, the decoder includes an entropy decoding module, a self-attention decoding network, and a decoding network. The entropy decoding module is used to perform entropy decoding on code stream 2 to obtain the first side information, the self-attention decoding network is used to estimate the estimated probability distribution of code stream 1 according to the first side information, and the entropy decoding module is used to perform entropy decoding on code stream 1 according to the estimated probability distribution of code stream 1.
In another implementation, the reference information includes the first context information and the first side information. Accordingly, please refer to FIG. 13 and FIG. 14: FIG. 13 is a schematic structural diagram of an encoder according to an embodiment of this application, and FIG. 14 is a schematic structural diagram of a decoder according to an embodiment of this application. As shown in FIG. 13, the encoder includes an encoding network, a self-attention encoding network, a quantization module, a factorized entropy model, an entropy encoding module, an entropy decoding module, and a self-attention decoding network. For the purposes of the modules, reference may be made to the purposes of the corresponding modules in FIG. 9 and FIG. 11, and the embodiments of this application do not repeat them here. The self-attention decoding network is used to extract the first context information from the quantized feature data, and to estimate the estimated probability distribution of the current data stream according to the first context information and the first side information.
As shown in FIG. 14, the decoder includes an entropy decoding module, a self-attention decoding network, and a decoding network. The self-attention decoding network is used to extract the first context information from the decoded data, and to estimate the estimated probability distribution of code stream 1 according to the first context information and the first side information.
Both the self-attention decoding network and the self-attention encoding network are neural networks equipped with a self-attention mechanism (that is, including a self-attention structure). The self-attention mechanism is a variant of the attention mechanism; it reduces the dependence on external information and is better able to capture the internal correlations of data or features.
Please refer to FIG. 15, which is a schematic diagram of a self-attention structure according to an embodiment of this application. The inputs of the self-attention structure include three tensors: query (Q), key (K), and value (V). The self-attention structure includes a matrix multiplication (MatMul) operation, a scaling (Scale) operation, a mask (Mask) operation, and an exponential normalization (Softmax) operation.
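A minimal sketch of this structure, that is, scaled dot-product attention with an optional mask (ours, for illustration; the tensor shapes are arbitrary assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """MatMul -> Scale -> (optional) Mask -> Softmax -> MatMul, as in FIG. 15."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # MatMul + Scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # Mask: block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # Softmax
    return weights @ V                              # weighted sum of the values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))                # 4 positions, dimension 8
causal = np.tril(np.ones((4, 4), dtype=bool))      # each position attends only to the past
print(scaled_dot_product_attention(Q, K, V, mask=causal).shape)  # (4, 8)
```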
Please refer to FIG. 16, which is a schematic structural diagram of a self-attention encoding network according to an embodiment of this application. The self-attention encoding network includes an operation of embedding positional encoding into the input, and an N1 part; the N1 part includes a multi-head attention operation, summation and normalization operations, and a feed-forward operation.
Please refer to FIG. 17, which is a schematic structural diagram of a self-attention decoding network according to an embodiment of this application. The self-attention decoding network includes an operation of embedding positional encoding into the input, and an N2 part; the N2 part includes a masked multi-head attention operation, summation and normalization operations, and a feed-forward operation.
Please refer to FIG. 18, which is a flowchart of a process 100 of the entropy encoding method according to an embodiment of this application. The process 100 may be performed by an encoder, specifically by an entropy encoding unit of the encoder. The process 100 is described as a series of steps or operations; it should be understood that the process 100 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 18. Assuming that a current data stream with multiple data is being processed by the encoder, the process 100, including the following steps, is performed to entropy-encode the data. The process 100 may include:
Step 101: Obtain the data to be encoded, where the data to be encoded is a non-first-encoded data among the multiple data included in the current data stream.
Step 102: Obtain reference information, where the reference information includes at least one of first context information and first side information, the first context information is obtained by inputting at least one encoded data into the self-attention decoding network, and the first side information is obtained by inputting the multiple data into the self-attention encoding network.
Step 103: Estimate a first estimated probability distribution according to the reference information.
Step 104: Perform entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
Please refer to FIG. 19, which is a flowchart of a process 200 of the entropy decoding method according to an embodiment of this application. The process 200 may be performed by a decoder, specifically by an entropy decoding unit of the decoder. The process 200 is described as a series of steps or operations; it should be understood that the process 200 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 19. Assuming that a current data stream with multiple data is being processed by the decoder, the process 200, including the following steps, is performed to entropy-decode the data. The process 200 may include:
Step 201: Obtain a first code stream.
Step 202: Obtain reference information, where the reference information includes at least one of first context information and decoded first side information, the first context information is obtained by inputting at least one decoded data into the self-attention decoding network, and the decoded first side information is obtained by performing entropy decoding on a second code stream.
Step 203: Estimate a first estimated probability distribution according to the reference information.
Step 204: Perform entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is a non-first-decoded data among the multiple data included in the current data stream.
Please refer to FIG. 20, which is a flowchart of a process 300 of the entropy encoding and decoding method according to an embodiment of this application. The process 300 may be performed by an encoder and a decoder, specifically by an entropy encoding unit of the encoder and an entropy decoding unit of the decoder. The process 300 is described as a series of steps or operations; it should be understood that the process 300 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 20. Assuming that a current data stream with multiple data is being processed by the encoder and the decoder, the process 300, including the following steps, is performed to entropy-encode and entropy-decode the data. The process 300 may include:
Step 301: The encoder obtains the data to be encoded, where the data to be encoded is a non-first-encoded data among the multiple data included in the current data stream.
The multiple data may also be referred to as multiple data units. The multiple data may include video data, image data, audio data, integer data, and other data with compression/decompression requirements; the embodiments of this application do not limit the data type. Each data corresponds to a piece of position information, and the data to be encoded is at a non-first position among the multiple data.
The current data stream may be in a one-dimensional format, a two-dimensional format, or the like; the embodiments of this application do not limit the format of the current data stream. Optionally, when the initial data stream is in a non-one-dimensional format (for example, a two-dimensional format), the encoder may directly use the initial data stream as the current data stream, or may flatten the initial data stream in the non-one-dimensional format into a one-dimensional format to obtain the current data stream; in that case, each data may be regarded as a "word" in a text.
For example, when an initial data stream in a non-one-dimensional format is flattened into a one-dimensional format, it may be flattened in a preset order. For example, when the initial data stream is in a two-dimensional format, the two-dimensional initial data stream may be flattened in top-to-bottom and left-to-right order, in bottom-to-top and left-to-right order, in some other preset order, and so on; the embodiments of this application do not limit the flattening order.
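A sketch of such preset-order flattening, assuming left-to-right order within rows (illustrative only):

```python
def flatten_2d(grid, top_to_bottom=True):
    """Flatten a 2D data stream into 1D in a preset raster order
    (top-to-bottom or bottom-to-top, left-to-right within each row)."""
    rows = grid if top_to_bottom else reversed(grid)
    return [value for row in rows for value in row]

grid = [[1, 2], [3, 4]]
print(flatten_2d(grid))                       # [1, 2, 3, 4]
print(flatten_2d(grid, top_to_bottom=False))  # [3, 4, 1, 2]
```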
In this embodiment of this application, after the data to be encoded is obtained, quantization processing may also be performed on the data to be encoded. This can reduce the amount of data needed to represent the data to be encoded, lowering the code rate in the subsequent entropy encoding process and thereby effectively reducing the entropy encoding overhead. As described above, the quantization processing may be performed by means of, for example, scalar quantization or vector quantization; the embodiments of this application do not limit the quantization method.
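A sketch of uniform scalar quantization, one of the quantization methods mentioned (the step size is an arbitrary assumption):

```python
def scalar_quantize(values, step):
    """Uniform scalar quantization: map each value to the index of its nearest
    multiple of `step`; fewer distinct symbols mean a lower entropy-coding rate."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Approximate reconstruction from the quantization indices."""
    return [i * step for i in indices]

q = scalar_quantize([0.27, -1.93, 3.08], step=0.5)
print(q, dequantize(q, 0.5))  # [1, -4, 6] [0.5, -2.0, 3.0]
```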
It should be noted that, when entropy encoding is performed on the multiple data included in the current data stream, entropy encoding is usually performed on the first data first and then on the data to be encoded. For the first-encoded data, a fourth estimated probability distribution may be estimated according to preset information; alternatively, the fourth estimated probability distribution may be estimated by using a trained learnable model, and entropy encoding is then performed on the first-encoded data according to the fourth estimated probability distribution to obtain a fourth code stream. The embodiments of this application do not limit the manner of obtaining the fourth estimated probability distribution.
Step 302: The encoder obtains the first context information.
The first context information is obtained by inputting at least one encoded data among the multiple data included in the current data stream into the self-attention decoding network, where the encoded data refers to data, among the multiple data, on which the encoder has already performed entropy encoding. Since no encoded data exists yet when the first data of the current data stream is entropy-encoded, the data to be encoded must be a non-first data of the current data stream so that the first context information can be extracted. The first context information obtained based on at least one encoded data among the multiple data has less data redundancy and utilizes the encoded data at a higher rate; when the first estimated probability distribution is subsequently estimated by using the first context information, the accuracy of the obtained first estimated probability distribution can be improved. Since a more accurate first estimated probability distribution yields a lower code rate in the entropy encoding process, inputting at least one encoded data among the multiple data into the self-attention decoding network to obtain the first context information can reduce the code rate in the entropy encoding process and thereby reduce the entropy encoding overhead. Here, the code rate is the average code length required to entropy-encode a unit of data.
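The rate argument can be made concrete with the ideal code length of entropy coding, $-\log_2 p$ bits per symbol: the sketch below (ours, purely illustrative) shows that the total bit count depends directly on how well the probability model predicts each symbol from its context.

```python
import math

def ideal_code_length(symbols, prob_model):
    """Total ideal entropy-coded length in bits: -sum(log2 p(symbol | context)).
    prob_model(context) returns a dict mapping each symbol to its estimated probability."""
    bits, context = 0.0, []
    for s in symbols:
        p = prob_model(context)[s]
        bits += -math.log2(p)      # a better probability estimate -> fewer bits
        context.append(s)
    return bits

uniform = lambda ctx: {0: 0.5, 1: 0.5}   # a context-blind model as the baseline
print(ideal_code_length([0, 1, 1, 0], uniform))  # 4.0 bits
```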
自注意力解码网络为具备自注意力机制(即包括自注意力结构)的神经网络,其具备全局感受野,可以得到输入的所有已编码数据与待编码数据的相关性,该相关性可以为输入的所有已编码数据相对于待编码数据的权重。自注意力解码网络在得到输入的所有已编码数据相对于待编码数据的权重后,根据权重对相应的已编码数据进行加权得到第一上下文信息。The self-attention decoding network is a neural network with a self-attention mechanism (that is, including a self-attention structure), which has a global receptive field, and can obtain the correlation between all the input encoded data and the data to be encoded. The correlation can be expressed as The weight of all encoded data entered relative to the data to be encoded. After the self-attention decoding network obtains the weights of all the input encoded data relative to the data to be encoded, it weights the corresponding encoded data according to the weights to obtain the first context information.
可选地,自注意力解码网络可以对输入的所有已编码数据利用相应的权重进行加权得到第一上下文信息。这样,提高了获取第一上下文信息的过程中对已编码数据的利用率。在后续利用第一上下文信息估计得到第一估计概率分布时,能够进一步提高得到的第一估计概率分布的准确性,进一步减小熵编码过程中的码率,从而进一步减小熵编码开销。Optionally, the self-attention decoding network may weight all input encoded data with corresponding weights to obtain the first context information. In this way, the utilization rate of encoded data in the process of acquiring the first context information is improved. When the first estimated probability distribution is subsequently estimated by using the first context information, the accuracy of the obtained first estimated probability distribution can be further improved, and the code rate in the entropy encoding process can be further reduced, thereby further reducing the entropy encoding overhead.
或者自注意力解码网络可以根据得到的权重选择输入的部分已编码数据,并对部分已编码数据利用相应的权重进行加权得到第一上下文信息。示例地,可以将得到的权重按照从大到小的顺序排序,选择排在前i 1位的权重对应的已编码数据进行加权。或者将得到的权重按照从小到大的顺序排序,选择排在后i 2位的权重对应的已编码数据进行加权。或者选择权重大于i 2的已编码数据进行加权。其中i 1和i 2均小于得到的所有权重的数量。这样,能够提高获取第一上下文信息的过程中的灵活性。且当选择权重较高的已编码数据进行加权时,能够保证获取第一上下文信息过程中对权重较高的已编码数据的利用率,在后续利用第一上下文信息估计得到第一估计概率分布时,能够进一步提高得到的第一估计概率分布的准确性,进一步减小熵编码过程中的码率,从而进一步减小熵编码开销。 Alternatively, the self-attention decoding network may select the input part of the encoded data according to the obtained weight, and weight the part of the encoded data with the corresponding weight to obtain the first context information. For example, the obtained weights may be sorted in descending order, and the coded data corresponding to the top i 1 weights are selected for weighting. Alternatively, the obtained weights are sorted in ascending order, and the encoded data corresponding to the last i 2 weights are selected for weighting. Or select encoded data with a weight greater than i2 for weighting. where both i 1 and i 2 are less than the number of all weights obtained. In this way, the flexibility in the process of acquiring the first context information can be improved. And when the coded data with higher weight is selected for weighting, the utilization rate of the coded data with higher weight in the process of obtaining the first context information can be guaranteed, and when the first estimated probability distribution is estimated by using the first context information subsequently , can further improve the accuracy of the obtained first estimated probability distribution, and further reduce the code rate in the process of entropy coding, thereby further reducing the overhead of entropy coding.
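The selection rules above can be pictured with a toy scalar version of the weighting (inside the network the same idea runs on tensors through the attention mechanism; the function name and sample numbers are assumptions of this sketch):

```python
import numpy as np

def weighted_context(encoded, weights, top_i1=None):
    """Aggregate encoded data into context information.

    With `top_i1=None` every encoded data contributes (full utilization);
    otherwise only the encoded data carrying the `top_i1` largest weights
    are kept, matching the "top i1 weights" selection rule above.
    """
    encoded = np.asarray(encoded, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    if top_i1 is not None:
        keep = np.argsort(weights)[-top_i1:]   # indices of the i1 largest weights
        mask = np.zeros_like(weights)
        mask[keep] = 1.0
        weights = weights * mask
    weights = weights / weights.sum()          # renormalize the kept weights
    return float(np.dot(weights, encoded))

ctx_all  = weighted_context([0.2, 0.7, 0.1], [0.1, 0.8, 0.1])            # all data
ctx_top1 = weighted_context([0.2, 0.7, 0.1], [0.1, 0.8, 0.1], top_i1=1)  # top-1 only
```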
After the current data stream is input into the self-attention decoding network, the network may perform an embedding operation on each data in the current data stream; the embedding operation refers to mapping each data from its original data space into another space. The network then performs positional encoding on each data to obtain the position information of each data and combines that position information with the data. Each data has coordinate information, and positional encoding refers to extracting the position information of each data from its coordinate information. The position information may be combined with the data by element-wise addition or concatenation; the embodiments of this application do not limit the positional encoding method.
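Sinusoidal positional encoding is one common way to derive position information; the text does not mandate a particular encoding function, so the following is only an assumed instance, showing both combination options (element-wise addition and concatenation):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """A standard Transformer-style positional encoding (assumed choice)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

seq_len, d_model = 16, 64
embedded = np.random.randn(seq_len, d_model)        # result of the embedding step
pe = sinusoidal_positions(seq_len, d_model)

combined_add = embedded + pe                        # element-wise addition
combined_cat = np.concatenate([embedded, pe], -1)   # or concatenation
```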
For the structure of the self-attention decoding network, refer to the aforementioned FIG. 17; details are not repeated here. As shown in FIG. 17, the input of the self-attention decoding network includes three tensors Q, K and V. Q, K and V pass in turn through a masked multi-head self-attention mechanism, a sum-and-normalize operation, a multi-head attention mechanism, a sum-and-normalize operation, a feed-forward operation, a sum-and-normalize operation, and a linearization operation, and the first context information is output. Q, K and V are tensors of the encoded data, for example the tensors obtained in the foregoing process by performing the embedding operation and positional encoding on the non-first encoded data.
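A hedged PyTorch sketch of a block with the topology just described. The dimensions, the head count, and the `memory` tensor fed to the second attention stage (for example, side-information features) are assumptions of this sketch rather than details fixed by the text:

```python
import torch
import torch.nn as nn

class AttnDecodingBlock(nn.Module):
    """Masked multi-head self-attention -> sum & normalize -> multi-head
    attention -> sum & normalize -> feed-forward -> sum & normalize -> linear."""

    def __init__(self, d_model=64, n_heads=4, d_ctx=64):
        super().__init__()
        self.masked_mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.out = nn.Linear(d_model, d_ctx)        # final linearization

    def forward(self, x, memory):
        # Causal mask: position t may only attend to already-coded positions <= t.
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h, _ = self.masked_mha(x, x, x, attn_mask=causal)
        x = self.norm1(x + h)                       # sum & normalize
        h, _ = self.mha(x, memory, memory)
        x = self.norm2(x + h)
        x = self.norm3(x + self.ffn(x))
        return self.out(x)                          # first context information
```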
Step 303: The encoder estimates a first estimated probability distribution according to the first context information.
The first estimated probability distribution may include at least one estimated probability parameter. For example, the at least one estimated probability parameter may include a mean (mean) and a variance (scale), with the mean and scale defining a Gaussian distribution.
Optionally, the first context information may be input into a probability distribution estimation network to obtain the first estimated probability distribution output by the network. The probability distribution estimation network may be a separate neural network or a structure within the self-attention decoding network; the embodiments of this application do not limit this.
Refer to FIG. 21, a schematic diagram of the process of obtaining the first estimated probability distribution according to an embodiment of this application. FIG. 21 takes as an example an initial data stream in a two-dimensional format that needs to be flattened. The initial data stream a includes 16 positions a1 to a16 arranged 4×4, each position corresponding to one data. The data at position a10 is the data to be encoded, the data at positions a1 to a9 are all encoded data, the data at the remaining positions are unencoded data, and each encoded data corresponds to one first estimated probability distribution. The initial data stream a is flattened into a one-dimensional format from top to bottom and left to right, yielding a current data stream b with the 16 positions a1 to a16 in sequence. The current data stream b is input into the self-attention decoding network, which determines the position information of each data in b and combines it with the data. Based on the encoded data in data stream b combined with position information (that is, the data at positions a1 to a9), the self-attention decoding network outputs the first context information, which is input into the probability distribution estimation network; the probability distribution estimation network outputs the first estimated probability distribution, that is, the estimated probability distribution of the data at position a10. The process shown in FIG. 21 is only illustrative and does not limit how the first estimated probability distribution is obtained.
Step 304: The encoder entropy encodes the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
The encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution and then entropy encode the data to be encoded according to that probability value. The first code stream may be in binary format.
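One common way, assumed here rather than mandated by the text, to turn the estimated mean and scale into the probability value the entropy coder consumes is to discretize the Gaussian into unit-width bins:

```python
import math

def gaussian_cdf(x, mean, scale):
    # `scale` is treated as a standard deviation here; if the network outputs
    # a variance, take its square root first.
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(x, mean, scale):
    """Probability mass of integer symbol x under the discretized Gaussian."""
    return gaussian_cdf(x + 0.5, mean, scale) - gaussian_cdf(x - 0.5, mean, scale)

p = symbol_probability(2, mean=1.6, scale=0.8)
bits = -math.log2(p)     # ideal entropy-coded length for this symbol
```

The more sharply the estimated distribution peaks at the true symbol, the larger p and the shorter the code, which is why distribution accuracy translates directly into a lower code rate.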
The foregoing steps 301 to 304 are described using the example of estimating the first estimated probability distribution and entropy encoding the data to be encoded according to it to obtain the first code stream. Each non-first data included in the current data stream may in turn be taken as the data to be encoded, the first estimated probability distribution obtained by the process shown in steps 301 to 304, and entropy encoding performed according to it, so that a code stream is obtained for each non-first data. It should be noted that after each data is encoded, it is added to the encoded data.
Step 305: The encoder sends the first code stream to the decoder.
As described in the foregoing embodiments, the encoder and the decoder have communication interfaces with an established communication connection, and the encoder can send the first code stream to the communication interface of the decoder through its own communication interface.
It should be noted that the encoder entropy encodes each non-first encoded data included in the current data stream to obtain a code stream for each non-first data, and then obtains the current code stream from these code streams; the current code stream includes the code streams of the multiple non-first encoded data arranged in the order in which the encoder encoded them. The current code stream of course includes the first code stream. The encoder can then send the current code stream including the first code stream to the decoder. For the first encoded data, after the encoder entropy encodes it according to the fourth estimated probability distribution to obtain the fourth code stream, the fourth code stream may be included in the current code stream for transmission to the decoder, or the encoder may send the fourth code stream to the decoder separately; the embodiments of this application do not limit how the fourth code stream is sent.
Step 306: The decoder obtains the first context information.
As described in step 305, the first code stream is one of the code streams in the current code stream received by the decoder, and the decoded data obtained by decoding the first code stream is non-first decoded data among the multiple data included in the current data stream. The first context information may be obtained by inputting at least one decoded data into the self-attention decoding network, where decoded data refers to data obtained by entropy decoding before the first code stream is decoded. Because no decoded data exists yet when the fourth code stream is entropy decoded, the decoded data obtained from the first code stream must be non-first decoded data among the multiple data for the first context information to be extractable. For the process of obtaining the first context information, refer to step 302; details are not repeated here.
It should be noted that when the decoder entropy decodes the code streams in the received current code stream, it usually entropy decodes the fourth code stream first. The decoder may estimate the fourth estimated probability distribution from preset information, or estimate it with a learnable model obtained through training, and then entropy decode the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, which is the first decoded data among the multiple data. The embodiments of this application do not limit how the fourth estimated probability distribution is obtained.
The fourth estimated probability distribution estimated by the decoder must be consistent with the fourth estimated probability distribution estimated by the encoder. For example, when the encoder estimates it from preset information, the decoder estimates it from the same fixed information; when the encoder estimates it with a learnable model obtained through training, the decoder estimates it with the same learnable model, and the estimated fourth estimated probability distributions are the same.
Step 307: The decoder estimates the first estimated probability distribution according to the first context information.
Optionally, the first context information may be input into the probability distribution estimation network to obtain the first estimated probability distribution output by the network. For this process, refer to step 303; details are not repeated here.
Step 308: The decoder entropy decodes the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among the multiple data included in the current data stream.
The decoder may calculate the probability value of the first code stream according to the first estimated probability distribution and then entropy decode the first code stream according to that probability value. It should be noted that after each decoded data is obtained, it is added to the decoded data.
The foregoing steps 306 to 308 are described using the example of estimating the first estimated probability distribution and entropy decoding the first code stream according to it. Each code stream included in the current code stream may in turn be taken as the first code stream, the first estimated probability distribution obtained by the process shown in steps 306 to 308, and entropy decoding performed according to it.
After all of the current code stream has been entropy decoded, the decoded data obtained is in a one-dimensional format. The decoder may transform the decoded data from the one-dimensional format into two dimensions according to the two-dimensional distribution information of the decoded data, so as to obtain two-dimensional decoded data arranged the same as the current data stream obtained by the encoder. The two-dimensional distribution information may include the number of decoded data arranged along the length and width directions of the two-dimensional plane and the arrangement manner. The two-dimensional distribution information may be pre-stored in the decoder or sent by the encoder; the embodiments of this application limit neither its content nor how it is obtained, as long as the two-dimensional decoded data is guaranteed to have the same arrangement as the current data stream obtained by the encoder.
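A small sketch of this inverse reshaping; the function name, the `order` flag, and the toy sizes are assumptions, and the supported orders must mirror whatever flattening order the encoder used:

```python
import numpy as np

def restore_2d(decoded_1d, height, width, order="top_down_left_right"):
    """Rebuild the 2-D arrangement from 1-D decoded data using the
    two-dimensional distribution information (per-direction counts plus
    the agreed scan order)."""
    grid = np.asarray(decoded_1d).reshape(height, width)   # row-major inverse
    if order == "bottom_up_left_right":
        grid = grid[::-1]                                  # undo bottom-to-top scan
    return grid

print(restore_2d(list(range(16)), height=4, width=4))
```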
In the related art, during encoding, the adjacent encoded data of the data to be encoded are determined according to the position information of each data, context information is extracted from the adjacent encoded data with a masked convolutional neural network, and the data to be encoded is then entropy encoded based on that context information. During decoding, the adjacent decoded data of the data corresponding to the code stream to be decoded are determined according to the position information of the data corresponding to each code stream, context information is extracted from the adjacent decoded data with a masked convolutional neural network, and the code stream to be decoded is then entropy decoded based on that context information.
Because the related art must extract context information from adjacent encoded data or adjacent decoded data, and this process must be performed according to the position information of each data, the entropy encoding or entropy decoding of multiple data must be executed serially in the order of the data; serial execution takes a long time, making entropy encoding and entropy decoding inefficient. In the embodiments of this application, by contrast, the first context information is extracted from at least one encoded data or decoded data without needing to consider the position encoding of each data, so the entropy encoding or entropy decoding of multiple data can be executed in parallel; parallel execution takes less time, improving the efficiency of entropy encoding and entropy decoding compared with the related art.
Furthermore, the related art extracts context information with a masked convolutional neural network, which uses only a local receptive field, so the utilization of encoded or decoded data is low and the estimated probability distribution obtained from the context information is less accurate, leading to high entropy encoding and entropy decoding overhead. In the embodiments of this application, a self-attention decoding network with a self-attention mechanism can be used to obtain the weights of all input encoded or decoded data, and then part or all of the input encoded or decoded data is weighted with the corresponding weights to obtain the first context information. This improves the utilization of the encoded or decoded data, the extracted first context information has less data redundancy, and the accuracy of the obtained estimated probability distribution is further improved. Compared with the related art, the code rate during entropy encoding is reduced, which reduces the overhead of entropy encoding and entropy decoding.
Refer to FIG. 22, a schematic diagram of the entropy encoding performance according to an embodiment of this application. Coordinate system (22a) in FIG. 22 shows the entropy encoding performance of this embodiment and of the related art on a test set under the Multi-Scale Structural Similarity Index Measure (MS-SSIM) metric, and coordinate system (22b) shows the performance under the Peak Signal to Noise Ratio (PSNR) metric. The test set is the Kodak test set, which includes 24 images in Portable Network Graphics (PNG) format with resolutions of 768×512 or 512×768. In the two coordinate systems of FIG. 22, the horizontal axis represents bits per pixel (BPP) and the vertical axis represents the quality metric (MS-SSIM and PSNR, respectively). BPP is the average number of bits used per pixel; the smaller the value, the lower the compression rate. MS-SSIM and PSNR are both objective standards for evaluating images; higher values indicate better image quality. Broken line e1 in coordinate systems (22a) and (22b) represents this embodiment, and broken line e2 represents the related art. As FIG. 22 shows, the MS-SSIM and PSNR of this embodiment are higher than those of the related art at every rate point; at the same compression quality the code rate of this embodiment is lower than that of the related art, typically about 17% lower at low-rate points and about 15% lower at high-rate points. That is, the compression performance of this embodiment is higher than that of the related art, and this embodiment improves the accuracy of the estimated probability distribution obtained for the data to be encoded or decoded.
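For reference, the two quantities on the axes follow directly from their definitions (the example numbers below are invented):

```python
import numpy as np

def bpp(total_bits, height, width):
    """Bits per pixel: average number of bits spent on one pixel."""
    return total_bits / (height * width)

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means better quality."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return 10.0 * np.log10(peak ** 2 / np.mean(diff ** 2))

img = np.random.randint(0, 256, (512, 768))
rec = np.clip(img + np.random.randint(-3, 4, img.shape), 0, 255)
print(bpp(235_930, 512, 768))   # ~0.6 bits per pixel
print(psnr(img, rec))           # large value -> small distortion
```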
In summary, in the entropy encoding and decoding method provided by the embodiments of this application, the encoder obtains the current data stream and the first context information, estimates a first estimated probability distribution according to the first context information, entropy encodes the data to be encoded according to the first estimated probability distribution to obtain a first code stream, and sends the first code stream to the decoder. The decoder obtains the first code stream and the first context information, estimates the first estimated probability distribution according to the first context information, and entropy decodes the first code stream according to it. The first context information is obtained by inputting at least one encoded data or decoded data into the self-attention decoding network, which can weight all input encoded data with the corresponding weights to obtain the first context information. This improves the utilization of encoded data when obtaining the first context information. When the first estimated probability distribution is estimated from the first context information, its accuracy is improved, the code rate during entropy encoding is further reduced, and the entropy encoding overhead is further reduced. This in turn reduces the bandwidth occupied when the first code stream is transmitted to the decoder and improves the transmission efficiency of the first code stream to the decoding side. Moreover, because the position information of individual data need not be considered when obtaining the first context information, the entropy encoding or entropy decoding of multiple data can be executed in parallel; parallel execution takes less time, improving the efficiency of entropy encoding and entropy decoding.
The order of the methods provided in the embodiments of this application may be adjusted appropriately, and steps may be added or removed as the situation requires. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application; the embodiments of this application do not limit this.
Refer to FIG. 23, a flowchart of a process 400 of the entropy encoding and decoding method according to an embodiment of this application. Process 400 may be executed by an encoder and a decoder, specifically by the entropy encoding unit of the encoder and the entropy decoding unit of the decoder. Process 400 is described as a series of steps or operations; it should be understood that process 400 may be executed in various orders and/or concurrently and is not limited to the execution order shown in FIG. 23. Assuming a current data stream with multiple data is being processed by the encoder and decoder, process 400, which includes the following steps, is executed to entropy encode and entropy decode the data. Process 400 may include:
Step 401: The encoder obtains the data to be encoded included in the current data stream.
The data to be encoded may be the first encoded data or non-first encoded data among the multiple data included in the current data stream; the embodiments of this application do not limit the position of the data to be encoded in the current data stream. For this process, refer to step 301; details are not repeated here.
Step 402: The encoder obtains first side information.
The first side information is obtained by inputting the multiple data into a self-attention encoding network. Taking the initial data stream a shown in FIG. 21 as an example, the data at positions a1 to a16 may be input into the self-attention encoding network to obtain the first side information. The first side information obtained from the multiple data is relatively comprehensive, so when the first estimated probability distribution is later estimated from the first side information, the accuracy of the obtained first estimated probability distribution is improved, which reduces the code rate during entropy encoding and thus the entropy encoding overhead.
The self-attention encoding network is a neural network with a self-attention mechanism (that is, it includes a self-attention structure). It has good feature transformation capability and extracts first side information of high quality; when the first estimated probability distribution is later estimated from the first side information, its accuracy is improved, which reduces the code rate during entropy encoding and the entropy encoding overhead.
The self-attention encoding network has a global receptive field and can obtain the correlation between all input data and the data to be encoded; this correlation can be expressed as the weights of all input data relative to the data to be encoded. After obtaining these weights, the self-attention encoding network weights the corresponding data by them to obtain the first side information.
Optionally, the self-attention encoding network may weight all input data with the corresponding weights to obtain the first side information. This improves the utilization of data when obtaining the first side information. When the first estimated probability distribution is later estimated from the first side information, its accuracy is further improved and the code rate during entropy encoding is further reduced, which further reduces the entropy encoding overhead.
Alternatively, the self-attention encoding network may select part of the input data according to the obtained weights and weight that part with the corresponding weights to obtain the first side information. For this process, refer to step 302; details are not repeated here. This increases the flexibility of obtaining the first side information. Moreover, when data with higher weights is selected for weighting, the utilization of that higher-weight data is guaranteed, so when the first estimated probability distribution is later estimated from the first side information, its accuracy is further improved and the code rate during entropy encoding is further reduced, which further reduces the entropy encoding overhead.
In step 402, the structure of the self-attention encoding network may refer to the aforementioned FIG. 16; details are not repeated here. As shown in FIG. 16, the input of the self-attention encoding network includes three tensors Q, K and V, which pass in turn through a multi-head attention mechanism, a sum-and-normalize operation, a feed-forward operation and a sum-and-normalize operation, and the first side information is output. Q, K and V are tensors of the data, for example the tensors obtained in the foregoing process by performing the embedding operation and positional encoding on the data in the current data stream.
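A hedged PyTorch sketch of a block with the FIG. 16 topology, using the same illustrative dimensions as the decoding-network sketch above; note the absence of a causal mask, which is what gives every position the global receptive field described earlier:

```python
import torch
import torch.nn as nn

class AttnEncodingBlock(nn.Module):
    """Multi-head attention -> sum & normalize -> feed-forward -> sum & normalize."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq, d_model); Q = K = V = x
        h, _ = self.mha(x, x, x)            # no mask: every position sees all data
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))  # first-side-information features
```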
Step 403: The encoder estimates a first estimated probability distribution according to the first side information.
Optionally, the first side information may be input into a probability distribution estimation network to obtain the first estimated probability distribution output by the network. The probability distribution estimation network may be a separate neural network or a structure within the self-attention decoding network; the embodiments of this application do not limit this.
Refer to FIG. 24, a schematic diagram of the process of obtaining the first estimated probability distribution according to an embodiment of this application. FIG. 24 takes as an example an initial data stream in a two-dimensional format that needs to be flattened, with the self-attention decoding network performing the probability distribution estimation. The initial data stream a includes 16 positions a1 to a16 arranged 4×4, each position corresponding to one data. The initial data stream a is flattened into a one-dimensional format from top to bottom and left to right, yielding a current data stream b with the 16 positions a1 to a16 in sequence.
The current data stream b is input into the self-attention encoding network, which determines the position information of each data in b, combines it with the data, and outputs the first side information based on all the data in data stream b combined with position information (that is, the data at positions a1 to a16). A factorized entropy model estimates a second estimated probability distribution; the entropy encoding module entropy encodes the first side information with the second estimated probability distribution to obtain the code stream of the first side information, and the entropy decoding module entropy decodes that code stream with the second estimated probability distribution to obtain the first side information. The first side information is input into the self-attention decoding network, which outputs the first estimated probability distribution (that is, the estimated probability distribution of the data at position a10). For this process, refer to step 303; details are not repeated here. The process shown in FIG. 24 is only illustrative and does not limit how the first estimated probability distribution is obtained.
Step 404: The encoder entropy encodes the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
The encoder may calculate the probability value of the data to be encoded according to the first estimated probability distribution and then entropy encode the data to be encoded according to that probability value. The first code stream may be in binary format. For this process, refer to step 304; details are not repeated here.
Step 405: The encoder sends the first code stream to the decoder.
For this process, refer to step 305; details are not repeated here.
Step 406: The encoder estimates a second estimated probability distribution.
Optionally, the second estimated probability distribution may be estimated from preset information, or estimated with a learnable model obtained through training. The embodiments of this application do not limit how the second estimated probability distribution is obtained.
Step 407: The encoder entropy encodes the first side information according to the second estimated probability distribution to obtain a second code stream.
The encoder may calculate the probability value of the first side information according to the second estimated probability distribution and then entropy encode the first side information according to that probability value. The second code stream may be in binary format.
Step 408: The encoder sends the second code stream to the decoder.
For this process, refer to step 305; details are not repeated here.
Step 409: The decoder estimates the second estimated probability distribution.
Optionally, the second estimated probability distribution may be estimated from preset information, or estimated with a learnable model obtained through training; the embodiments of this application do not limit how it is obtained. It should be noted that the second estimated probability distribution estimated by the decoder must be consistent with that estimated by the encoder. For example, when the encoder estimates it from preset information, the decoder estimates it from the same fixed information; when the encoder estimates it with a learnable model obtained through training, the decoder estimates it with the same learnable model, and the estimated second estimated probability distributions are the same.
Step 410: The decoder entropy decodes the second code stream according to the second estimated probability distribution to obtain decoded first side information.
The decoder may calculate the probability value of the second code stream according to the second estimated probability distribution and then entropy decode the second code stream according to that probability value.
Step 411: The decoder estimates the first estimated probability distribution according to the decoded first side information.
Optionally, the decoded first side information may be input into the probability distribution estimation network to obtain the first estimated probability distribution output by the network. For this process, refer to step 307; details are not repeated here.
Step 412: The decoder entropy decodes the first code stream according to the first estimated probability distribution to obtain decoded data.
The decoder may calculate the probability value of the first code stream according to the first estimated probability distribution and then entropy decode the first code stream according to that probability value. For this process, refer to step 308; details are not repeated here.
In summary, in the entropy encoding and decoding method provided by the embodiments of this application, the encoder obtains the data to be encoded included in the current data stream and the first side information, estimates a first estimated probability distribution according to the first side information, entropy encodes the data to be encoded according to it to obtain a first code stream, and sends the first code stream to the decoder. The encoder also estimates a second estimated probability distribution, entropy encodes the first side information according to it to obtain a second code stream, and sends the second code stream to the decoder. The decoder estimates the second estimated probability distribution, entropy decodes the second code stream according to it to obtain the first side information, estimates the first estimated probability distribution according to the first side information, and entropy decodes the first code stream according to the first estimated probability distribution. The first side information is obtained by inputting the multiple data into the self-attention encoding network, which can weight all input data with the corresponding weights; the first side information obtained in this way is relatively comprehensive. When the first estimated probability distribution is later estimated from the first side information, its accuracy is improved and the code rate during entropy encoding is reduced, which reduces the entropy encoding overhead and the bandwidth occupied when the first code stream is transmitted to the decoder, improving the transmission efficiency of the first code stream.
The order of the methods provided in the embodiments of this application may be adjusted appropriately, and steps may be added or removed as the situation requires; for example, the two flows of steps 403 to 405 and steps 406 to 408 may be executed simultaneously. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application; the embodiments of this application do not limit this.
Refer to FIG. 25, a flowchart of a process 500 of the entropy encoding and decoding method according to an embodiment of this application. Process 500 may be executed by an encoder and a decoder, specifically by the entropy encoding unit of the encoder and the entropy decoding unit of the decoder. Process 500 is described as a series of steps or operations; it should be understood that process 500 may be executed in various orders and/or concurrently and is not limited to the execution order shown in FIG. 25. Assuming a current data stream with multiple data is being processed by the encoder and decoder, process 500, which includes the following steps, is executed to entropy encode and entropy decode the data. Process 500 may include:
Step 501: The encoder obtains data to be encoded, which is non-first encoded data among the multiple data included in the current data stream.
For this process, refer to step 301; details are not repeated here.
Step 502: The encoder obtains first context information and first side information.
For this process, refer to steps 302 and 402; details are not repeated here.
Step 503: The encoder estimates a first estimated probability distribution according to the first context information and the first side information.
The encoder may aggregate the first context information and the first side information and estimate the first estimated probability distribution from the aggregated information. Optionally, the encoder may aggregate the first context information and the first side information through an aggregation network. The aggregation network may include a self-attention decoding network; with its self-attention mechanism, it can fully exploit the complementarity of the first context information and the first side information, so the two can subsequently be used to estimate the first estimated probability distribution efficiently, improving its accuracy. For the process of estimating the first estimated probability distribution from the aggregated information, refer to step 303; details are not repeated here.
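A minimal sketch of one possible aggregation: plain concatenation followed by a small network that outputs the (mean, scale) parameters. The layer sizes and the softplus that keeps the scale positive are assumptions of this sketch; the text equally allows the aggregation network itself to be a self-attention decoding network:

```python
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Fuse first context information with first side information, then
    estimate the probability-distribution parameters from the fused vector."""

    def __init__(self, d_ctx=64, d_side=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_ctx + d_side, 128), nn.ReLU(),
            nn.Linear(128, 2),                          # outputs (mean, raw scale)
        )

    def forward(self, ctx, side):
        fused = torch.cat([ctx, side], dim=-1)          # exploit complementarity
        mean, raw_scale = self.net(fused).unbind(-1)
        return mean, nn.functional.softplus(raw_scale)  # keep scale > 0

mean, scale = Aggregator()(torch.randn(1, 64), torch.randn(1, 64))
```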
Step 504: The encoder entropy encodes the data to be encoded according to the first estimated probability distribution to obtain a first code stream.
For this process, refer to step 304; details are not repeated here.
Step 505: The encoder sends the first code stream to the decoder.
It should be noted that the encoder entropy encodes each non-first encoded data included in the current data stream to obtain a code stream for each non-first data, and then obtains the current code stream from these code streams. For the first encoded data, its code stream may be included in the current code stream for transmission to the decoder, or the encoder may send its code stream to the decoder separately. For this process, refer to step 305; details are not repeated here.
Step 506: The encoder estimates a second estimated probability distribution.
For this process, refer to step 406; details are not repeated here.
Step 507: The encoder entropy encodes the first side information according to the second estimated probability distribution to obtain a second code stream.
For this process, refer to step 407; details are not repeated here.
Step 508: The encoder sends the second code stream to the decoder.
Optionally, the encoder may send the second code stream to the decoder separately, or add the second code stream to the first code stream and send them together; the embodiments of this application do not limit how the second code stream is sent. For this process, refer to step 405; details are not repeated here.
Step 509: The decoder obtains the first context information.
For how the first context information is obtained, refer to step 306; details are not repeated here.
Step 510: The decoder estimates the second estimated probability distribution.
The second code stream is the code stream of the first side information. For this process, refer to step 409; details are not repeated here. It should be noted that the second estimated probability distribution estimated by the decoder must be consistent with that estimated by the encoder.
Step 511: The decoder entropy decodes the second code stream according to the second estimated probability distribution to obtain decoded first side information.
For this process, refer to step 410; details are not repeated here.
Step 512: The decoder estimates the first estimated probability distribution according to the first context information and the decoded first side information.
For this process, refer to steps 303 and 403; details are not repeated here.
Step 513: The decoder entropy decodes the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is non-first decoded data among the multiple data included in the current data stream.
For the process of entropy decoding the first code stream, refer to step 308; details are not repeated here. It should be noted that after each data is decoded, it is added to the decoded data.
In summary, in the entropy encoding and decoding method provided by the embodiments of this application, the encoder obtains the current data stream, the first context information and the first side information, estimates a first estimated probability distribution according to the first context information and the first side information, entropy encodes the data to be encoded according to it to obtain a first code stream, and sends the first code stream to the decoder. The encoder also estimates a second estimated probability distribution, entropy encodes the first side information according to it to obtain a second code stream, and sends the second code stream to the decoder. The decoder estimates the second estimated probability distribution, entropy decodes the second code stream according to it to obtain the first side information, estimates the first estimated probability distribution according to the first context information and the first side information, and entropy decodes the first code stream according to the first estimated probability distribution. The first context information is obtained by inputting at least one encoded data or decoded data into the self-attention decoding network, which can weight all input encoded data with the corresponding weights; the first side information is obtained by inputting the multiple data into the self-attention encoding network, which can weight all input data with the corresponding weights. This improves the utilization of encoded data when obtaining the first context information, and the first side information obtained is relatively comprehensive. When the first estimated probability distribution is estimated from the first context information and the first side information, its accuracy is improved and the code rate during entropy encoding is further reduced, which further reduces the entropy encoding overhead as well as the bandwidth occupied when the first code stream is transmitted to the decoder, improving the transmission efficiency of the first code stream. Moreover, because the position information of individual data need not be considered when obtaining the first context information, the entropy encoding or entropy decoding of multiple data can be executed in parallel; parallel execution takes less time, improving the efficiency of entropy encoding and entropy decoding.
The sequence of the steps of the method provided in the embodiments of this application may be adjusted appropriately, and steps may be added or removed as required. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application; this is not limited in the embodiments of this application.
Refer to FIG. 26, which is a flowchart of a process 600 of the entropy encoding and decoding method provided in an embodiment of this application. The process 600 may be performed by an encoder and a decoder, and specifically by the entropy encoding unit of the encoder and the entropy decoding unit of the decoder. The process 600 is described as a series of steps or operations; it should be understood that the steps of the process 600 may be performed in various orders and/or concurrently, and are not limited to the execution order shown in FIG. 26. Assuming that a current data stream with multiple pieces of data is being processed by an encoder and a decoder, the process 600, which includes the following steps, is performed to entropy encode and entropy decode the data. The process 600 may include:
Step 601: The encoder obtains data to be encoded, where the data to be encoded is data that is not encoded first among the multiple pieces of data included in the current data stream.
It should be noted that, when entropy encoding the multiple pieces of data included in the current data stream, the first piece of data is usually entropy encoded before the data to be encoded. For the data encoded first, a fourth estimated probability distribution may be estimated according to preset information, estimated by using a learnable model obtained through training, or estimated according to the first side information and/or the second side information. Entropy encoding is then performed on the first-encoded data according to the fourth estimated probability distribution to obtain a fourth code stream. The manner of obtaining the fourth estimated probability distribution is not limited in this embodiment of this application.
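As an illustrative sketch only (this application does not prescribe any particular implementation), the preset-information case can be realized as a fixed prior that the encoder and the decoder compute identically; the function name `first_symbol_prior` and the alphabet size below are assumptions:

```python
import numpy as np

def first_symbol_prior(num_symbols: int = 256) -> np.ndarray:
    """Hypothetical fixed prior for the first-encoded symbol.

    Because no context or side information is available for the first
    symbol, encoder and decoder can agree on the same preset PMF, for
    example a uniform distribution over the quantized symbol alphabet.
    """
    return np.full(num_symbols, 1.0 / num_symbols)

# Both sides compute the identical distribution, so the entropy coder
# stays synchronized without transmitting anything extra.
p4 = first_symbol_prior()
assert np.isclose(p4.sum(), 1.0)
```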
For this process, refer to the foregoing step 301; details are not repeated here in this embodiment of this application.
Step 602: The encoder obtains the first context information, the second context information, the first side information, and the second side information.
For the manner of obtaining the first context information, refer to the foregoing step 302; for the manner of obtaining the first side information, refer to the foregoing step 402. Details are not repeated here in this embodiment of this application.
The second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data. The at least one piece of data meeting the preset condition may be at least one piece of encoded data that is adjacent to the data to be encoded. Because encoded data is used in the process of obtaining the second context information, the accuracy of the subsequently estimated first estimated probability distribution can be improved, thereby reducing the bit rate in the entropy encoding process and reducing the entropy encoding overhead.
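The following is a minimal PyTorch sketch of a PixelCNN-style masked convolution layer of the kind described above; the channel counts and kernel size are illustrative assumptions, not the architecture actually used by this application:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """A plausible realization of the masked convolution layer: the mask
    zeroes out the kernel center and every position after it in raster
    order, so each output position only sees already-encoded neighbors."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0.0   # center and positions to its right
        mask[kh // 2 + 1:, :] = 0.0     # all rows below the center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask   # enforce causality before each call
        return super().forward(x)

# Example: for a latent tensor of shape (batch, channels, H, W), the output
# at position (i, j) depends only on data above and to the left of (i, j).
ctx2 = MaskedConv2d(192, 384, kernel_size=5, padding=2)(torch.randn(1, 192, 16, 16))
```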
It can be learned from this that the first context information in the foregoing embodiment shown in FIG. 20 is obtained based on at least one piece of encoded data among the multiple pieces of data, whereas the second context information in step 602 is obtained based on at least one piece of that encoded data that is adjacent to the data to be encoded. Taking the initial data stream shown in FIG. 21 as an example, the first context information is obtained based on the data corresponding to positions a1 to a9, and the second context information is obtained based on at least one piece of encoded data adjacent to position a10 (for example, the data corresponding to positions a6 and a9). That is, compared with the second context information, the first context information makes fuller use of the encoded data and has more comprehensive content.
The second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the multiple pieces of data. The at least one piece of data meeting the preset condition may be at least one piece of data that is adjacent to the data to be encoded.
It can be learned from this that the first side information in the foregoing embodiment shown in FIG. 23 is obtained based on the multiple pieces of data, whereas the second side information in step 602 is obtained based on at least one piece of data adjacent to the data to be encoded. Taking the initial data stream shown in FIG. 21 as an example, the first side information is obtained based on the data corresponding to positions a1 to a16, and the second side information is obtained based on at least one piece of data adjacent to position a10 (for example, the data corresponding to positions a6, a9, a11, and a14). That is, compared with the second side information, the first side information makes fuller use of the data and has more comprehensive content.
The masked convolution network and the hyper encoder network have local receptive fields. The masked convolution network includes a masked convolution layer or a regular convolution layer; its input is at least one piece of encoded data adjacent to the data to be encoded, and its output is the activation features of the convolution, that is, the second context information. The hyper encoder network includes regular convolution layers; its input is at least one piece of data adjacent to the data to be encoded among the multiple pieces of data, and its output is the activation features of the convolution, that is, the second side information.
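A hyper encoder network with a local receptive field could, for example, be a short stack of regular strided convolutions. The sketch below is one plausible realization under the description above; the channel counts, depths, and strides are assumptions:

```python
import torch
import torch.nn as nn

# A minimal hyper-encoder sketch: regular (non-masked) convolutions that
# summarize the neighborhood of the data to be encoded into side information.
hyper_encoder = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 192, kernel_size=5, stride=2, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 128, kernel_size=5, stride=2, padding=2),
)

y = torch.randn(1, 192, 16, 16)   # latent data of the current stream
z2 = hyper_encoder(y)             # second side information, 4x downsampled
print(z2.shape)                   # torch.Size([1, 128, 4, 4])
```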
For the manner in which the encoder obtains the second context information through the masked convolution network, the manner in which the encoder obtains the second side information through the hyper encoder network, and the architectures of the masked convolution network and the hyper encoder network, refer to the related description of the self-attention decoding network in the foregoing step 302; details are not repeated here in this embodiment of this application.
In this embodiment of this application, the first estimated probability distribution is subsequently estimated by jointly using the first context information, the first side information, the second context information, and the second side information, which can further improve the accuracy of the obtained first estimated probability distribution, thereby reducing the bit rate in the entropy encoding process and reducing the entropy encoding overhead.
Step 603: The encoder estimates the first estimated probability distribution according to the first context information, the second context information, the first side information, and the second side information.
The encoder may aggregate the first context information, the second context information, the first side information, and the second side information, and estimate the first estimated probability distribution according to the aggregated information. Optionally, the encoder may aggregate the four pieces of information through an aggregation network. The aggregation network may include a self-attention decoding network, whose self-attention mechanism can fully exploit the complementarity of the first context information, the first side information, the second context information, and the second side information, so that the four pieces of information can subsequently be used to efficiently estimate the first estimated probability distribution, improving the accuracy of the estimate. For the process of estimating the first estimated probability distribution according to the aggregated information, refer to the foregoing step 303; details are not repeated here in this embodiment of this application.
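One plausible form of such an aggregation network (this application leaves the exact structure open; a self-attention decoding network may also be used) is to concatenate the four feature maps along the channel axis and regress the parameters of a Gaussian entropy model with 1x1 convolutions. All layer widths below are assumptions:

```python
import torch
import torch.nn as nn

class Aggregation(nn.Module):
    """Hypothetical aggregation head: concatenate the four pieces of
    information and predict (mu, sigma) of the first estimated distribution."""

    def __init__(self, c_each: int = 128, c_latent: int = 192):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(4 * c_each, 640, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(640, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 2 * c_latent, kernel_size=1),
        )

    def forward(self, ctx1, ctx2, side1, side2):
        feats = torch.cat([ctx1, ctx2, side1, side2], dim=1)
        mu, raw_sigma = self.head(feats).chunk(2, dim=1)
        sigma = nn.functional.softplus(raw_sigma) + 1e-6  # keep scale positive
        return mu, sigma
```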
Refer to FIG. 27, which is a schematic diagram of a process of obtaining the first estimated probability distribution according to an embodiment of this application. FIG. 27 takes the initial data stream shown in FIG. 21 as an example: the initial data stream a is flattened into a one-dimensional format in top-to-bottom and left-to-right order, yielding the current data stream b including 16 sequentially arranged positions a1 to a16. The current data stream b is input into the hyper encoder network, the self-attention encoding network, the self-attention decoding network, and the masked convolution network respectively. The hyper encoder network and the self-attention encoding network output the second side information and the first side information respectively; a factorized entropy model estimates the second estimated probability distribution, and a hyper entropy model estimates the third estimated probability distribution. The entropy encoding module entropy encodes the first side information according to the second estimated probability distribution, the entropy decoding module entropy decodes the first side information according to the second estimated probability distribution, and the entropy-decoded first side information is input into the aggregation network. Likewise, the entropy encoding module entropy encodes the second side information according to the third estimated probability distribution, the entropy decoding module entropy decodes the second side information according to the third estimated probability distribution, and the entropy-decoded second side information is input into the aggregation network. The self-attention decoding network and the masked convolution network output the first context information and the second context information respectively, both of which are also input into the aggregation network. The aggregation network aggregates the input first context information, second context information, first side information, and second side information, and outputs the first estimated probability distribution (that is, the estimated probability distribution of the data corresponding to position a10). The process shown in FIG. 27 is merely an example and does not limit the process of obtaining the first estimated probability distribution.
Step 604: The encoder performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream.
For the process of obtaining the first code stream, refer to the foregoing step 304; details are not repeated here in this embodiment of this application.
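Entropy coders operate on discrete probabilities, so a common bridge, assumed here purely for illustration, is to integrate the estimated continuous distribution (for example a Gaussian with parameters mu and sigma) over unit-width bins to obtain a per-symbol PMF:

```python
import math

def gaussian_pmf(mu: float, sigma: float, lo: int = -64, hi: int = 64):
    """Probability of each integer symbol under N(mu, sigma), integrated
    over unit-width bins; the symbol range [lo, hi] is an assumption."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    pmf = {s: max(cdf(s + 0.5) - cdf(s - 0.5), 1e-12) for s in range(lo, hi + 1)}
    total = sum(pmf.values())
    return {s: p / total for s, p in pmf.items()}  # renormalize clipped tails

# A sharper (more accurate) distribution concentrates mass on few symbols,
# so the entropy coder spends fewer bits on the data to be encoded.
p = gaussian_pmf(mu=3.2, sigma=0.7)
bits = -math.log2(p[3])  # ideal code length if the true symbol is 3
```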
Step 605: The encoder sends the first code stream to the decoder.
It should be noted that the encoder entropy encodes each piece of data in the current data stream that is not encoded first, obtaining a code stream for each such piece of data; the current code stream is then obtained from these code streams. For the data encoded first, the fourth code stream may be included in the current code stream for transmission to the decoder, or the encoder may send the fourth code stream to the decoder separately. For this process, refer to the foregoing step 305; details are not repeated here in this embodiment of this application.
Step 606: The encoder estimates the second estimated probability distribution.
For this process, refer to the foregoing step 406; details are not repeated here in this embodiment of this application.
Step 607: The encoder performs entropy encoding on the first side information according to the second estimated probability distribution to obtain the second code stream.
For this process, refer to the foregoing step 407; details are not repeated here in this embodiment of this application.
Step 608: The encoder sends the second code stream to the decoder.
Optionally, the encoder may send the second code stream to the decoder separately, or may add the second code stream to the first code stream and send it to the decoder; the manner of sending the second code stream is not limited in this embodiment of this application. For this process, refer to the foregoing step 305; details are not repeated here.
Step 609: The encoder estimates the third estimated probability distribution.
For this process, refer to the foregoing step 406; details are not repeated here in this embodiment of this application.
Step 610: The encoder performs entropy encoding on the second side information according to the third estimated probability distribution to obtain the third code stream.
For this process, refer to the foregoing step 407; details are not repeated here in this embodiment of this application.
Step 611: The encoder sends the third code stream to the decoder.
Optionally, the encoder may send the third code stream to the decoder separately, or may add the third code stream to the first code stream and send it to the decoder; the manner of sending the third code stream is not limited in this embodiment of this application. For this process, refer to the foregoing step 305; details are not repeated here.
Step 612: The decoder obtains the first context information and the second context information.
For the manner of obtaining the first context information, refer to the foregoing step 306; details are not repeated here in this embodiment of this application.
The first code stream is one of the code streams in the current code stream received by the decoder, and the decoded data obtained by decoding it is data that is not decoded first among the multiple pieces of data included in the current code stream. The second context information may be obtained by inputting, into the masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data. For the masked convolution network, refer to the foregoing step 602; details are not repeated here in this embodiment of this application.
It should be noted that, when entropy decoding the code streams in the received current code stream, the decoder usually entropy decodes the fourth code stream first. The decoder may estimate the fourth estimated probability distribution according to preset information, by using a learnable model obtained through training, or according to the first side information and/or the second side information. The decoder then entropy decodes the fourth code stream according to the fourth estimated probability distribution to obtain the decoded first data, which is the data decoded first among the multiple pieces of data.
The fourth estimated probability distribution estimated by the decoder needs to be consistent with the fourth estimated probability distribution estimated by the encoder. For example, when the encoder estimates the fourth estimated probability distribution according to preset information, the decoder estimates it according to the same preset information; when the encoder estimates it by using a trained learnable model, the decoder estimates it according to the same learnable model, and the resulting distributions are identical; and when the encoder estimates it according to the first side information and the second side information, the decoder likewise estimates it according to the first side information and the second side information.
Step 613: The decoder estimates the second estimated probability distribution.
For this process, refer to the foregoing step 409; details are not repeated here in this embodiment of this application. It should be noted that the second estimated probability distribution estimated by the decoder needs to be consistent with that estimated by the encoder.
Step 614: The decoder performs entropy decoding on the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
For this process, refer to the foregoing step 410; details are not repeated here in this embodiment of this application.
Step 615: The decoder estimates the third estimated probability distribution.
For this process, refer to the foregoing step 409; details are not repeated here in this embodiment of this application. It should be noted that the third estimated probability distribution estimated by the decoder needs to be consistent with that estimated by the encoder.
Step 616: The decoder performs entropy decoding on the third code stream according to the third estimated probability distribution to obtain the decoded second side information.
For this process, refer to the foregoing step 410; details are not repeated here in this embodiment of this application.
Step 617: The decoder estimates the first estimated probability distribution according to the first context information, the second context information, the decoded first side information, and the decoded second side information.
For this process, refer to the foregoing steps 303 and 403; details are not repeated here in this embodiment of this application.
Step 618: The decoder performs entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, where the decoded data is data that is not decoded first among the multiple pieces of data included in the current data stream.
For the process of performing entropy decoding on the first code stream, refer to the foregoing step 308; details are not repeated here in this embodiment of this application. It should be noted that each time a piece of data is decoded, that data is added to the set of decoded data.
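The autoregressive loop implied by this note can be sketched as follows; `estimate_distribution` and `entropy_decode_one` are hypothetical stand-ins for the probability-estimation networks and the entropy decoding module described in this application:

```python
def decode_stream(bitstream, num_symbols, side_info, estimate_distribution,
                  entropy_decode_one, first_symbol):
    """Schematic decode loop: each newly decoded symbol is appended to the
    decoded set so it can feed the context networks for the next symbol."""
    decoded = [first_symbol]             # the first-decoded data
    for _ in range(1, num_symbols):
        # Context is derived from all data decoded so far, plus the
        # (already entropy-decoded) side information.
        dist = estimate_distribution(decoded, side_info)
        symbol = entropy_decode_one(bitstream, dist)
        decoded.append(symbol)           # make it available as context
    return decoded
```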
To sum up, in the entropy encoding and decoding method provided in this embodiment of this application, the encoder obtains the data to be encoded in the current data stream, the first context information, the second context information, the first side information, and the second side information; estimates the first estimated probability distribution according to these four pieces of information; performs entropy encoding on the data to be encoded according to the first estimated probability distribution to obtain the first code stream; and sends the first code stream to the decoder. The encoder estimates the second estimated probability distribution and the third estimated probability distribution, entropy encodes the first side information and the second side information according to them respectively to obtain the second code stream and the third code stream, and sends both code streams to the decoder. The decoder obtains the first context information and the second context information, estimates the second estimated probability distribution and the third estimated probability distribution, entropy decodes the second code stream and the third code stream according to them respectively to obtain the first side information and the second side information, then estimates the first estimated probability distribution according to the first context information, the second context information, the first side information, and the second side information, and finally entropy decodes the first code stream according to the first estimated probability distribution. Jointly estimating the first estimated probability distribution from the first context information, the first side information, the second context information, and the second side information further improves the accuracy of the obtained first estimated probability distribution and reduces the bit rate in the entropy encoding process, thereby reducing the entropy encoding overhead and the bandwidth occupied when the data of the current data stream is transmitted to the decoder, and improving the transmission efficiency of the data included in the current data stream.
The sequence of the steps of the method provided in this embodiment of this application may be adjusted appropriately, and steps may be added or removed as required. For example, the three flows of steps 603 to 605, steps 606 to 608, and steps 609 to 611 may be executed concurrently, and the three flows of step 612, steps 613 to 614, and steps 615 to 616 may likewise be executed concurrently. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application; this is not limited in the embodiments of this application.
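For illustration only, the mutually independent encoder-side flows could be dispatched concurrently, for example with a thread pool; the three `encode_*` callables below are hypothetical wrappers around steps 603 to 605, 606 to 608, and 609 to 611 respectively:

```python
from concurrent.futures import ThreadPoolExecutor

def encode_all(encode_data, encode_side_info_1, encode_side_info_2):
    """Run the three independent encoder flows in parallel, since each
    produces its own code stream (first, second, and third respectively)."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        f1 = pool.submit(encode_data)          # first code stream
        f2 = pool.submit(encode_side_info_1)   # second code stream
        f3 = pool.submit(encode_side_info_2)   # third code stream
        return f1.result(), f2.result(), f3.result()
```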
In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly performed by a hardware encoding processor, or performed by a combination of hardware and software modules in an encoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads information from the memory and completes the steps of the foregoing methods in combination with its hardware.
The memory mentioned in the foregoing embodiments may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM). It should be noted that the memories of the systems and methods described in this specification are intended to include, but are not limited to, these and any other memories of suitable types.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed in this specification can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of this application.
It may be clearly understood by a person skilled in the art that, for convenient and brief description, for the detailed working processes of the foregoing systems, apparatuses, and units, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the described apparatus embodiments are merely examples. For example, the division into units is merely a logical function division and may be another division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be implemented in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual requirements to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each of the units may exist alone physically, or two or more units may be integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (32)

1. An entropy encoding method, wherein the method comprises:
    obtaining data to be encoded, wherein the data to be encoded is data that is not encoded first among a plurality of data comprised in a current data stream;
    obtaining reference information, wherein the reference information comprises at least one of first context information and first side information, the first context information is obtained by inputting at least one piece of encoded data into a self-attention decoding network, and the first side information is obtained by inputting the plurality of data into a self-attention encoding network;
    estimating a first estimated probability distribution according to the reference information; and
    performing entropy encoding on the data to be encoded according to the first estimated probability distribution, to obtain a first code stream.
2. The method according to claim 1, wherein the reference information comprises the first context information and the first side information; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the first side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
3. The method according to claim 1, wherein the reference information comprises the first context information and second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the second context information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
4. The method according to claim 1, wherein the reference information comprises the first context information, the first side information, and second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the first side information, and the second context information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
5. The method according to claim 1, wherein the reference information comprises the first context information and second side information, and the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
6. The method according to claim 1, wherein the reference information comprises the first context information, the first side information, and second side information, and the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the first side information, and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
7. The method according to claim 1, wherein the reference information comprises the first context information, second context information, and second side information, the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the second context information, and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
8. The method according to claim 1, wherein the reference information comprises the first context information, the first side information, second context information, and second side information, the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the first side information, the second context information, and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
9. The method according to claim 1, wherein the reference information comprises the first side information and second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first side information and the second context information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
10. The method according to claim 1, wherein the reference information comprises the first side information and second side information, and the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first side information and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
11. The method according to claim 1, wherein the reference information comprises the first side information, second context information, and second side information, the second side information is obtained by inputting, into a hyper encoder network, at least one piece of data that meets a preset condition among the plurality of data, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of encoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first side information, the second context information, and the second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
12. The method according to any one of claims 1 to 2, 4, 6, and 8 to 11, wherein the method further comprises:
    estimating a second estimated probability distribution; and
    performing entropy encoding on the first side information according to the second estimated probability distribution to obtain a second code stream.
13. The method according to any one of claims 5 to 8 and 10 to 11, wherein the method further comprises:
    estimating a third estimated probability distribution; and
    performing entropy encoding on the second side information according to the third estimated probability distribution to obtain a third code stream.
14. The method according to any one of claims 1 to 13, wherein the method further comprises:
    obtaining the data encoded first among the plurality of data;
    estimating a fourth estimated probability distribution according to preset information; and
    performing entropy encoding on the first-encoded data according to the fourth estimated probability distribution to obtain a fourth code stream.
15. An entropy decoding method, wherein the method comprises:
    obtaining a first code stream;
    obtaining reference information, wherein the reference information comprises at least one of first context information and decoded first side information, the first context information is obtained by inputting at least one piece of decoded data into a self-attention decoding network, and the decoded first side information is obtained by performing entropy decoding on a second code stream;
    estimating a first estimated probability distribution according to the reference information; and
    performing entropy decoding on the first code stream according to the first estimated probability distribution to obtain decoded data, wherein the decoded data is data that is not decoded first among a plurality of data comprised in a current data stream.
16. The method according to claim 15, wherein the reference information comprises the first context information and the decoded first side information; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the decoded first side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
17. The method according to claim 15, wherein the reference information comprises the first context information and second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the second context information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
18. The method according to claim 15, wherein the reference information comprises the first context information, the decoded first side information, and second context information, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the decoded first side information, and the second context information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
19. The method according to claim 15, wherein the reference information comprises the first context information and decoded second side information, and the decoded second side information is obtained by performing entropy decoding on a third code stream; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information and the decoded second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
20. The method according to claim 15, wherein the reference information comprises the first context information, the decoded first side information, and decoded second side information, and the decoded second side information is obtained by performing entropy decoding on a third code stream; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the decoded first side information, and the decoded second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
21. The method according to claim 15, wherein the reference information comprises the first context information, second context information, and decoded second side information, the decoded second side information is obtained by performing entropy decoding on a third code stream, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the second context information, and the decoded second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
22. The method according to claim 15, wherein the reference information comprises the first context information, the decoded first side information, second context information, and decoded second side information, the decoded second side information is obtained by performing entropy decoding on a third code stream, and the second context information is obtained by inputting, into a masked convolution network, at least one piece of data that meets a preset condition among the at least one piece of decoded data; and
    the estimating a first estimated probability distribution according to the reference information comprises:
    inputting the first context information, the decoded first side information, the second context information, and the decoded second side information into a probability distribution estimation network, to obtain the first estimated probability distribution output by the probability distribution estimation network.
  23. 根据权利要求15所述的方法,其特征在于,所述参照信息具体包括所述经解码第一边信息和第二上下文信息,所述第二上下文信息是将所述至少一个已解码数据中符合预设条件的至少一个数据输入遮掩卷积网络得到的;The method according to claim 15, wherein the reference information specifically includes the decoded first side information and second context information, and the second context information is a combination of the at least one decoded data conforming to At least one data of the preset condition is input into the masked convolutional network to obtain;
    所述根据所述参照信息估计得到第一估计概率分布,包括:The estimating and obtaining the first estimated probability distribution according to the reference information includes:
    将所述经解码第一边信息和所述第二上下文信息输入概率分布估计网络,以得到所述概率分布估计网络输出的所述第一估计概率分布。The decoded first side information and the second context information are input into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
  24. The method according to claim 15, wherein the reference information specifically includes the decoded first side information and decoded second side information, the decoded second side information being obtained by entropy decoding a third code stream;
    the estimating a first estimated probability distribution according to the reference information includes:
    inputting the decoded first side information and the decoded second side information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
  25. The method according to claim 15, wherein the reference information specifically includes the decoded first side information, second context information, and decoded second side information, the decoded second side information is obtained by entropy decoding a third code stream, and the second context information is obtained by inputting, into a masked convolutional network, at least one piece of data that meets a preset condition among the at least one piece of decoded data;
    the estimating a first estimated probability distribution according to the reference information includes:
    inputting the decoded first side information, the second context information, and the decoded second side information into a probability distribution estimation network to obtain the first estimated probability distribution output by the probability distribution estimation network.
  26. The method according to any one of claims 15-16, 18, 20, and 22-25, wherein when the reference information includes the decoded first side information, the obtaining reference information includes:
    obtaining the second code stream;
    estimating a second estimated probability distribution; and
    entropy decoding the second code stream according to the second estimated probability distribution to obtain the decoded first side information.
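(Illustrative note: the claims do not fix a particular entropy decoder. In practice the estimated distribution is first discretized into an integer cumulative-frequency table that a range or arithmetic decoder consumes; the sketch below shows only that discretization step, with the symbol range, precision, and tail handling chosen as assumptions.)

```python
import math

def gaussian_cdf_table(mean, scale, lo=-32, hi=32, precision=16):
    """Quantize N(mean, scale^2) over integer symbols lo..hi into the
    strictly increasing integer CDF a range/arithmetic decoder consumes."""
    def ncdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))
    total = 1 << precision
    cdf = [0]
    for sym in range(lo, hi + 1):
        upper = 1.0 if sym == hi else ncdf(sym + 0.5)
        nxt = max(cdf[-1] + 1, round(upper * total))  # >= 1 count per symbol
        cdf.append(min(nxt, total - (hi - sym)))      # leave room for the rest
    return cdf  # length hi - lo + 2, ends exactly at 2**precision

table = gaussian_cdf_table(mean=0.0, scale=1.5)
assert table[-1] == 1 << 16 and all(b > a for a, b in zip(table, table[1:]))
```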
  27. The method according to any one of claims 19-22 and 24-25, wherein when the reference information includes the decoded second side information, the obtaining reference information further includes:
    obtaining the third code stream;
    estimating a third estimated probability distribution; and
    entropy decoding the third code stream according to the third estimated probability distribution to obtain the decoded second side information.
  28. The method according to any one of claims 15-27, wherein the method further includes:
    obtaining a fourth code stream;
    estimating a fourth estimated probability distribution according to preset information; and
    entropy decoding the fourth code stream according to the fourth estimated probability distribution to obtain decoded leading data, the decoded leading data being the first data to be decoded among the plurality of data.
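(Illustrative note: claim 28 covers the bootstrap step, where the first data item has no decoded neighbours and its distribution must come from preset information alone. A uniform preset distribution is the simplest choice consistent with the claim wording; the sketch below assumes it and reuses the integer-CDF convention of the previous sketch.)

```python
def uniform_cdf_table(num_symbols, precision=16):
    """Preset, input-independent CDF that gives every symbol an equal
    share of 2**precision, folding the remainder into the last symbol."""
    total = 1 << precision
    step = total // num_symbols
    cdf = [k * step for k in range(num_symbols)]
    cdf.append(total)
    return cdf

# Used for the leading symbol only; subsequent symbols switch to the
# context/side-information models of the preceding claims.
first_cdf = uniform_cdf_table(num_symbols=65)
assert len(first_cdf) == 66 and first_cdf[-1] == 1 << 16
```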
  29. An entropy encoding device, characterized by comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-14.
  30. An entropy decoding device, characterized by comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 15-28.
  31. A computer-readable storage medium, characterized by comprising a computer program which, when executed on a computer, causes the computer to perform the method according to any one of claims 1-28.
  32. A computer program product, characterized in that the computer program product comprises computer program code which, when run on a computer, causes the computer to perform the method according to any one of claims 1-28.
PCT/CN2022/110827 2021-08-17 2022-08-08 Entropy encoding and decoding method and device WO2023020320A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110944357.5A CN115706798A (en) 2021-08-17 2021-08-17 Entropy encoding and decoding method and device
CN202110944357.5 2021-08-17

Publications (1)

Publication Number Publication Date
WO2023020320A1 (en)

Family

ID=85181332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110827 WO2023020320A1 (en) 2021-08-17 2022-08-08 Entropy encoding and decoding method and device

Country Status (2)

Country Link
CN (1) CN115706798A (en)
WO (1) WO2023020320A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128737B * 2023-04-19 2023-06-27 Nanjing University Image super-resolution network robustness improving device based on data compression

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210099710A1 (en) * 2018-04-01 2021-04-01 Lg Electronics Inc. Method for image coding using convolution neural network and apparatus thereof
CN111868753A * 2018-07-20 2020-10-30 Google LLC Data compression using conditional entropy models
CN112771541A * 2018-09-27 2021-05-07 Google LLC Data compression using integer neural networks
CN111641826A * 2019-03-01 2020-09-08 Hangzhou Hikvision Digital Technology Co., Ltd. Method, device and system for encoding and decoding data
CN111818346A * 2019-04-11 2020-10-23 Fujitsu Limited Image encoding method and apparatus, image decoding method and apparatus
GB202016824D0 (en) * 2020-10-23 2020-12-09 Deep Render Ltd DR big book 3

Also Published As

Publication number Publication date
CN115706798A (en) 2023-02-17

Similar Documents

Publication Publication Date Title
WO2020253858A1 (en) An encoder, a decoder and corresponding methods
WO2022068716A1 (en) Entropy encoding/decoding method and device
WO2020211765A1 (en) An encoder, a decoder and corresponding methods harmonzting matrix-based intra prediction and secoundary transform core selection
WO2020181997A1 (en) An encoder, a decoder and corresponding methods for inter prediction
WO2022063265A1 (en) Inter-frame prediction method and apparatus
EP3928517A1 (en) Method and apparatus for intra prediction using linear model
WO2020228560A1 (en) Method and device for obtaining a list of candidate motion vectors, and codec
CN114125446A (en) Image encoding method, decoding method and device
WO2022111233A1 (en) Intra prediction mode coding method, and apparatus
WO2023020320A1 (en) Entropy encoding and decoding method and device
US20230388490A1 (en) Encoding method, decoding method, and device
WO2020253681A1 (en) Method and device for constructing merge candidate motion information list, and codec
TWI748522B (en) Video encoder, video decoder, and related methods
WO2023011420A1 (en) Encoding method and apparatus, and decoding method and apparatus
WO2020216294A1 (en) Image prediction method and apparatus, and computer-readable storage medium
US20230239500A1 (en) Intra Prediction Method and Apparatus
CN116137659A (en) Inter-coded block partitioning method and apparatus
WO2020259353A1 (en) Entropy coding/decoding method for syntactic element, device, and codec
WO2020224476A1 (en) Image division method, apparatus and device
WO2024012249A1 (en) Method and apparatus for coding image including text, and method and apparatus for decoding image including text
WO2021043138A1 (en) An encoder, a decoder and corresponding methods of filter modification on general intra prediction process
WO2020125761A1 (en) Image block division method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22857632; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)