CN116582685A - AI-based hierarchical residual coding method, apparatus, device and storage medium - Google Patents


Info

Publication number
CN116582685A
Authority
CN
China
Prior art keywords
data
residual
coding
layer
enhancement layer
Prior art date
Legal status
Pending
Application number
CN202310396480.7A
Other languages
Chinese (zh)
Inventor
Gao Hong (高虹)
Yuan Ziyi (袁子逸)
Current Assignee
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202310396480.7A
Publication of CN116582685A


Classifications

    • H04N 19/85: Coding/decoding of digital video signals using pre-processing or post-processing specially adapted for video compression
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • H04N 19/146: Adaptive coding characterised by the data rate or code amount at the encoder output
    • H04N 19/30: Coding/decoding of digital video signals using hierarchical techniques, e.g. scalability
    • H04N 19/42: Coding/decoding characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Abstract

The embodiments of this application disclose an AI-based hierarchical residual coding method, apparatus, device, and storage medium. The method comprises: performing at least one stage of downsampling on video data to be hierarchically coded, taking the layer containing the lowest-resolution data obtained as the base layer, and taking the layer containing the original video data, or that layer together with the layers containing intermediate-resolution data, as the enhancement layer(s); determining base-layer coded data and an enhancement-layer residual; coding the enhancement-layer residual with a deep video coding network to obtain enhancement-layer residual coded data; and generating bitstream data from the residual coded data and the base-layer coded data. This scheme achieves video coding with a high compression ratio, saving storage space and network bandwidth; the enhancement-layer residual coded data preserves video detail and texture, improving visual quality and sharpness; and the deep video coding network improves coding efficiency and exhibits better rate-control capability.

Description

AI-based hierarchical residual coding method, apparatus, device and storage medium
Technical Field
The embodiments of this application relate to the technical field of video data processing, and in particular to an AI-based hierarchical residual coding method, apparatus, device, and storage medium.
Background
With the development of computer and network technologies, applications based on video images keep increasing. Before video can be transmitted it must first be encoded; encoding matters because it makes video easier to store and transmit. By compressing video data, more video files can be stored in a limited storage space, and video can be transmitted faster for viewing over the Internet. Video coding also reduces video traffic, easing network congestion and improving playback quality.
Today, LCEVC (Low Complexity Enhancement Video Coding) is used to encode video. LCEVC is a video coding technique that combines base coding with enhancement coding: a small amount of enhancement coded data is added on top of the base coding to improve video quality and efficiency. In the LCEVC framework, in addition to base-layer encoding, a residual coding module similar to that of a conventional encoder is provided, comprising three main steps: transform, quantization, and entropy coding.
However, the residual information produced by inter-frame motion estimation and motion compensation in conventional coding differs from the residual information produced under the LCEVC framework, where high-frequency information is lost through downsampling and upsampling. Compared with losses from motion estimation and compensation, losses caused only by downsampling and upsampling yield sparser residuals whose distribution is more dispersed and which show no obvious motion trajectory. A more efficient coding scheme is therefore sought for the sparser residual information produced by layered coding, to increase its compression efficiency.
Disclosure of Invention
The embodiments of this application provide an AI-based hierarchical residual coding method, apparatus, device, and storage medium that can deliver higher-quality video while keeping the file size small, encode video faster to reduce transmission latency, and perform finer-grained coding to improve coding efficiency.
In a first aspect, an embodiment of the present application provides an AI-based hierarchical residual coding method, including:
performing at least one stage of downsampling on video data to be hierarchically coded, taking the layer containing the lowest-resolution data obtained as the base layer, and taking the layer containing the original video data, or that layer together with the layers containing intermediate-resolution data, as the enhancement layer(s);
determining base-layer coded data and determining an enhancement-layer residual;
coding the enhancement-layer residual with a deep video coding network to obtain enhancement-layer residual coded data; and
generating bitstream data based on the residual coded data and the base-layer coded data.
In a second aspect, an embodiment of the present application further provides an AI-based hierarchical residual encoding apparatus, including:
a downsampling module configured to perform at least one stage of downsampling on video data to be hierarchically coded, take the layer containing the lowest-resolution data obtained as the base layer, and take the layer containing the original video data, or that layer together with the layers containing intermediate-resolution data, as the enhancement layer(s);
a determining module configured to determine base-layer coded data and determine an enhancement-layer residual;
a coding module configured to code the enhancement-layer residual with a deep video coding network to obtain enhancement-layer residual coded data; and
a generating module configured to generate bitstream data based on the residual coded data and the base-layer coded data.
In a third aspect, an embodiment of the present application further provides an AI-based hierarchical residual encoding apparatus, including:
One or more processors;
a storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the AI-based hierarchical residual coding method of the embodiments of the application.
In a fourth aspect, embodiments of the present application further provide a storage medium storing computer-executable instructions that, when executed by a computer processor, perform the AI-based hierarchical residual coding method of the embodiments of the application.
In a fifth aspect, embodiments of the present application further provide a computer program product comprising a computer program stored in a computer-readable storage medium. At least one processor of a device reads and executes the computer program from the computer-readable storage medium, causing the device to perform the AI-based hierarchical residual coding method of the embodiments of the application.
In the embodiments of the application, at least one stage of downsampling is performed on video data to be hierarchically coded, so that the layer containing the lowest-resolution data obtained serves as the base layer, and the layer containing the original video data, or that layer together with the layers containing intermediate-resolution data, serves as the enhancement layer(s); base-layer coded data and an enhancement-layer residual are determined; the enhancement-layer residual is coded with a deep video coding network to obtain enhancement-layer residual coded data; and bitstream data is generated from the residual coded data and the base-layer coded data. This AI-based hierarchical residual coding method effectively reduces bitstream size, achieves video coding with a high compression ratio, and saves storage space and network bandwidth. Meanwhile, the enhancement-layer residual coded data better preserves video detail and texture, improving visual quality and sharpness, and the deep video coding network delivers higher coding efficiency and better rate-control capability.
Drawings
Fig. 1 is a flowchart of an AI-based hierarchical residual coding method according to Embodiment 1 of the present application;
Fig. 2 is a schematic diagram of an end-to-end scalable coding system for AI-based hierarchical residual coding according to Embodiment 1 of the application;
Fig. 3 is a schematic diagram of a network structure for AI-based hierarchical residual coding according to Embodiment 1 of the application;
Fig. 4 is a flowchart of an AI-based hierarchical residual coding method according to Embodiment 2 of the present application;
Fig. 5 is a schematic diagram of an up-/down-sampling framework for AI-based hierarchical residual coding according to Embodiment 2 of the present application;
Fig. 6 is a schematic structural diagram of an AI-based hierarchical residual coding apparatus according to Embodiment 3 of the present application;
Fig. 7 is a schematic structural diagram of an AI-based hierarchical residual coding device according to Embodiment 4 of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not limiting of embodiments of the application. It should be further noted that, for convenience of description, only some, but not all of the structures related to the embodiments of the present application are shown in the drawings.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. Data so labelled may be interchanged where appropriate, so that the embodiments of the present application can be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more objects. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The AI-based hierarchical residual coding method, apparatus, device, and storage medium provided by the embodiments of the application are described in detail below through specific embodiments and their application scenarios, with reference to the accompanying drawings.
Example 1
Fig. 1 is a flowchart of an AI-based hierarchical residual encoding method according to an embodiment of the present application. As shown in fig. 1, the method specifically includes the following steps:
s101, performing at least one-stage downsampling processing on video data to be subjected to hierarchical coding, taking a layer where the obtained data with the lowest resolution is located as a base layer, and taking a layer where the video data is located or a layer where the video data is located and a layer where the data with the middle resolution is located as an enhancement layer.
First, the usage scenario of this solution may involve processing video to be hierarchically coded, processing video residual coded data, and generating video bitstream data.
Based on the above usage scenario, it can be appreciated that the execution subject of the present application may be an intelligent terminal, a server, or the like, which is not limited herein.
In this scheme, Fig. 2 is a schematic diagram of an end-to-end scalable coding system for AI-based hierarchical residual coding according to an embodiment of the present application. As shown in Fig. 2, video data to be hierarchically coded refers to video data that requires layered compression coding. In hierarchical coding, video data is divided into multiple layers or levels, each containing different information, and each level can be coded according to its importance and availability so as to compress the video data. In this way, video data can be transmitted or stored as needed and played back on different devices and in different network environments. Video data to be hierarchically coded is typically high-definition or ultra-high-definition video, which has a large number of pixels and a high bit rate and therefore needs compression for efficient storage and transmission. For example, a 5-minute high-definition video file may exceed 1 GB; transmitting and storing it directly takes considerable time and space and increases transmission cost. Video data therefore needs to be hierarchically coded to reduce file size and transmission cost without affecting video quality.
The base layer refers to the lowest layer of video coding. It contains the most basic video information, can be decoded and played independently, and serves as the foundation of the other layers. The base layer typically contains the lowest-resolution data of the video, which the other layers can build on to achieve multi-resolution coding.
Multi-resolution coding means that video data can be played on different devices and in different network environments, since different devices and networks can handle video of different resolutions. In multi-resolution coding, the base layer is usually the lowest-resolution video data, while the other layers carry higher-resolution data. In this way, video data can be compressed, transmitted, and played across devices and networks while maintaining high quality. Taking the layer containing the lowest-resolution data of the video to be hierarchically coded as the base layer therefore means that the lowest layer of the video coding contains the lowest-resolution data and serves as the basis for the other layers in multi-resolution coding. Resolution determines the fineness of detail in a bitmap image: in general, the higher the resolution, the more pixels the image contains, the sharper it is, and the better it prints, at the cost of more storage space.
An enhancement layer refers to a coding layer that, relative to the base layer, enhances video quality or provides additional functionality. The enhancement layer may contain higher-resolution, higher-quality video data, more complex coding techniques, or other enhancement functions. With the lowest-resolution layer obtained as the base layer and the layer containing the original video data (together with any intermediate-resolution layers) as enhancement layers, multi-resolution video coding can be achieved.
Downsampling refers to the process of reducing a high-resolution image to a low-resolution image, used here to obtain the base layer in multi-resolution video coding. Downsampling methods include average sampling, max/min pooling, interpolation, and so on. In video data to be hierarchically coded, the resolution of the original video may be very high and unsuitable for direct use as the base layer of multi-resolution coding. The original video data therefore needs to be downsampled to produce data suitable for the base layer. In particular, the downsampled resolution may be half (or less) of the original video resolution, which reduces the amount of data and improves encoder performance.
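As a minimal sketch of the average-sampling variant described above (the function names are illustrative, not from the patent), a single downsampling stage can be implemented as 2x2 average pooling; applying it repeatedly yields the multi-stage pyramid whose lowest-resolution output would serve as the base layer:

```python
import numpy as np

def downsample_2x(frame: np.ndarray) -> np.ndarray:
    # 2x2 average pooling: each output pixel is the mean of a
    # non-overlapping 2x2 block (requires even height and width).
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Two-stage pyramid: full resolution -> 1/2 -> 1/4 (base layer).
frame = np.arange(16.0).reshape(4, 4)   # toy 4x4 luma plane
mid = downsample_2x(frame)              # 2x2 intermediate-resolution layer
base = downsample_2x(mid)               # 1x1 lowest-resolution base layer
```

Each extra stage halves both spatial dimensions, so the base layer carries a quarter (or less) of the pixels of the layer above it.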
S102, determining base layer coding data and determining an enhancement layer residual error.
Base-layer coded data refers to the coded data corresponding to the lowest-resolution data in multi-resolution video coding. It is obtained by compressing and coding the original video data for transmission and storage. Producing base-layer coded data typically involves inter prediction, intra prediction, transform coding, and entropy coding; these steps effectively reduce the redundant information in the video data, achieving compression. Base-layer coded data can also provide reference information for the enhancement layer to facilitate higher-resolution, higher-quality coding.
The enhancement-layer residual refers to the difference between the high-resolution data and the corresponding low-resolution base-layer data, i.e. the value obtained by subtracting the low-resolution data from the high-resolution data. This residual data can be used to enhance the base-layer data to obtain higher-resolution, higher-quality video.
Determining the base-layer coded data requires first determining the resolution and coding parameters of the video to be acquired, then encoding the video with a video encoder and saving the coded output as the base-layer coded data.
The specific determination process may proceed as follows:
1. Determine the resolution and frame rate of the acquired video, and record the video material.
2. Encode the recorded video material with a video encoder. During encoding, appropriate coding parameters must be selected, which may include the frame rate, bit rate, quantization parameters, coding algorithm, and so on, to ensure coding quality and transmission quality.
3. Save the coded data as the base-layer coded data. The coded data can be stored on a local hard disk or in cloud storage, or transmitted over a network to other devices for subsequent processing and analysis.
Determining the enhancement-layer residual may include downsampling the high-resolution data and upsampling the resulting low-resolution data, to ensure that the high- and low-resolution data are compared in the same spatial or frequency domain. The enhancement-layer residual can be computed with various algorithms and techniques, such as bilinear interpolation, wavelet transforms, and predictive filtering.
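The down-then-up comparison described above can be sketched as follows. This is a toy single-plane example using average-pooling downsampling and nearest-neighbour upsampling; the actual filters (bilinear, wavelet, or learned) may differ, and all names are illustrative:

```python
import numpy as np

def upsample_2x(low: np.ndarray) -> np.ndarray:
    # Nearest-neighbour 2x upsampling: replicate each pixel into a 2x2 block.
    return low.repeat(2, axis=0).repeat(2, axis=1)

def enhancement_residual(high: np.ndarray) -> np.ndarray:
    # Residual = high-resolution data minus the upsampled version of its
    # downsampled (base-layer) counterpart, so both sides share one grid.
    h, w = high.shape
    low = high.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return high - upsample_2x(low)

high = np.array([[1.0, 3.0],
                 [5.0, 7.0]])
residual = enhancement_residual(high)   # what the enhancement layer must code
```

Note that the residual of each block sums to zero under this choice of filters: the averaging stage removes the block mean, leaving only the high-frequency detail the enhancement layer has to carry.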
S103, coding the enhancement-layer residual with a deep video coding network to obtain enhancement-layer residual coded data.
Fig. 3 is a schematic diagram of a network structure for AI-based hierarchical residual coding according to an embodiment of the present application. As shown in Fig. 3, the deep video coding network (Deep Video Coding Network, DVC) is a video coding framework based on deep learning whose main idea is to use convolutional neural networks (Convolutional Neural Network, CNN) to encode and decode video data efficiently. Compared with conventional video coding methods, DVC uses CNNs for end-to-end encoding and decoding of video data: it can automatically learn the spatial and temporal characteristics of the video and compress and reconstruct it frame by frame. By exploiting the spatio-temporal correlation of video sequences, DVC achieves efficient compression, reducing transmission bandwidth and storage cost while maintaining video quality. Both the encoder and the decoder of DVC are composed of convolutional neural networks: the encoder extracts features from and encodes video frames, and the decoder decodes the coded data back into the original video frames. DVC can also improve coding quality and efficiency by incorporating additional features such as motion vectors, residual frames, and context information.
The enhancement-layer residual coded data is coded residual data between the enhancement layer and the base layer, i.e. derived from the result of subtracting the base-layer data from the enhancement-layer data. This residual data improves the quality and accuracy of the video and, to some extent, reduces the transmission and storage overhead of the coded data. Specifically, after the original video data is decomposed into a base layer and an enhancement layer, the enhancement layer is further coded to obtain the residual coded data between the two layers. The decoder can use this residual coded data to reconstruct the enhancement-layer portion of the original video, improving video quality and accuracy.
The deep video coding network can be used to code the residual data of the enhancement layer to obtain the enhancement-layer residual coded data. Specifically, the following steps may be adopted:
1. Code the enhancement-layer residual data with the deep video coding network. The network typically consists of two parts, an encoder and a decoder: the encoder maps the input enhancement-layer residual data into a low-dimensional space, and the decoder decodes the coded data in that low-dimensional space back into reconstructed residual data.
2. Compress the coded enhancement-layer residual data and store it in the bitstream. Common compression algorithms include variable-length coding, entropy coding, and the like.
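The encoder/decoder data flow in step 1 can be illustrated with a deliberately simplified stand-in: a fixed linear projection into a low-dimensional code and its pseudo-inverse back. The real DVC encoder and decoder are trained convolutional networks; every name and dimension here is a hypothetical placeholder for the shape of the computation only:

```python
import numpy as np

rng = np.random.default_rng(0)

residual = rng.normal(size=64)       # flattened enhancement-layer residual
W_enc = rng.normal(size=(16, 64))    # "encoder": 64-dim residual -> 16-dim code
code = W_enc @ residual              # low-dimensional code to be stored/sent
W_dec = np.linalg.pinv(W_enc)        # "decoder": approximate inverse mapping
reconstructed = W_dec @ code         # reconstructed residual (lossy)
```

The point of the sketch is the asymmetry: only the 16-dimensional code crosses the channel, and the decoder recovers an approximation of the original 64-dimensional residual from it.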
On the basis of the above technical solutions, optionally, the deep video coding network includes a residual coding sub-network.
Correspondingly, coding the enhancement-layer residual with the deep video coding network to obtain the enhancement-layer residual coded data includes:
coding the enhancement-layer residual with the residual coding sub-network, and taking the output feature-domain code as the enhancement-layer residual coded data.
In this scheme, the residual coding sub-network is a part of the deep video coding network used to code the residual information of the video data. The sub-network may consist of multiple convolutional layers, normalization layers, activation functions, and so on, and can adaptively learn coding patterns suited to different video data. Its input is the residual data, and its output is the coded residual data. The residual coding sub-network helps reduce the volume of residual data and improve the compression efficiency of the video data while preserving reconstruction quality.
The feature-domain code refers to the coding result output by the residual coding sub-network, usually numbers in the form of a vector or matrix of some dimensionality. It can represent various features and characteristics of the input data, including spatial information, color information, motion information, and so on. Feature-domain codes can be transmitted and stored and used in subsequent decoding and reconstruction operations, so that the final image or video data accurately restores the features and details of the original data. In deep video coding, feature-domain codes are generally used to represent the difference between the residual data and the prediction, enabling more efficient compression and transmission of video data.
Coding the enhancement-layer residual with the residual coding sub-network and taking the output feature-domain code as the enhancement-layer residual coded data may proceed as follows:
1. Take the enhancement-layer residual as input and feed it into the residual coding sub-network for coding.
2. The residual coding sub-network is typically composed of a series of convolutional and pooling layers that extract the characteristics of the residual.
3. During coding, the residual coding sub-network compresses the residual information into a set of feature-domain codes that better express the important information in the residual.
4. Output the feature-domain code and use it as the enhancement-layer residual coded data for subsequent decoding and recovery of the enhancement layer.
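The convolutional feature extraction of step 2 can be sketched with a single hand-rolled "valid" convolution. One untrained averaging filter stands in for the sub-network's learned layers; the names and the 5x5 toy input are illustrative only:

```python
import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    # Single-channel 'valid' 2-D convolution (no padding, stride 1),
    # i.e. one convolutional layer without normalisation or activation.
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

residual = np.eye(5)                  # toy 5x5 enhancement-layer residual
kernel = np.ones((3, 3)) / 9.0        # one (untrained) 3x3 averaging filter
feature_code = conv2d_valid(residual, kernel)   # 3x3 feature-domain code
```

A real sub-network stacks many such filters (with learned weights, normalisation, and activations), so the feature-domain code is a multi-channel tensor rather than a single 3x3 map.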
In this scheme, the residual coding sub-network codes the enhancement-layer residual; compressing the residual information reduces the number of model parameters, improving the model's storage efficiency and computation speed. Moreover, since the residual coding sub-network can learn the relevant characteristics of the residual, using the coded residual information as the enhancement-layer residual coded data can further improve the performance and accuracy of the model.
On the basis of the above technical solutions, optionally, the deep video coding network further includes a quantization sub-network.
Correspondingly, coding the enhancement-layer residual with the deep video coding network to obtain the enhancement-layer residual coded data includes:
coding the enhancement-layer residual with the residual coding sub-network, and clipping the output feature-domain code with the quantization sub-network to obtain the enhancement-layer residual coded data.
In this scheme, the quantization sub-network is a neural network structure mainly used to convert floating-point weights and activation values into low-bit-width integer values, reducing the storage and computation cost of the model. During training, the quantization sub-network can minimize quantization error through a series of optimization methods, ensuring that the model retains high accuracy at low bit widths. Quantization sub-networks have important applications in lightweight model design, model compression, and low-power deployment.
The output of the residual coding sub-network is a feature-domain code, which must be compression-coded before it can be transmitted over the network or stored; the quantization sub-network is therefore needed to clip the feature-domain code. Clipping the output feature-domain code with the quantization sub-network means converting continuous real values into a set of discrete symbol and bit representations, compressing the data for storage and transmission. This reduces the amount of data to some extent and improves computation and storage efficiency. It may proceed as follows:
1. Encoding the residual of the enhancement layer using the residual coding sub-network: the enhancement layer residual is taken as input, and the output feature-domain code is obtained through the residual coding sub-network. Since the residual typically has a smaller dynamic range than the original image, this encoding process can significantly reduce the number of bits required for encoding.
2. Clipping the output feature-domain code using the quantization sub-network: the output feature-domain code of the residual coding sub-network is input into the quantization sub-network. The quantization sub-network quantizes the feature-domain code, i.e., maps each element of the code into a smaller set of values, such as the integers from -128 to 127. The quantization sub-network preserves the larger differences between values but loses the smaller ones. This process significantly reduces the number of bits of the feature-domain code, thereby reducing the bandwidth required for network transmission and storage.
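The quantize-and-clip step above can be sketched as a uniform scalar quantizer (a minimal illustration, not the patent's actual quantization sub-network; the step size `step` and the int8-style range [-128, 127] are assumptions for the example):

```python
import numpy as np

def quantize_features(feature_code, step=0.5, lo=-128, hi=127):
    """Uniformly quantize a real-valued feature-domain code and clip each
    element into a small integer range; small differences below `step` are lost."""
    q = np.round(feature_code / step)          # map reals to integer levels
    return np.clip(q, lo, hi).astype(np.int32)  # clip into the target range

def dequantize_features(q, step=0.5):
    """Approximate inverse used on the decoder side."""
    return q.astype(np.float64) * step

features = np.array([-70.3, -0.2, 0.2, 3.14, 90.9])
q = quantize_features(features)     # large magnitudes saturate at -128 / 127
rec = dequantize_features(q)        # reconstruction carries quantization error
```

Note that the clipping makes the mapping lossy twice over: values are rounded to the nearest level, and out-of-range levels saturate at the boundaries.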
After the residual coding data is obtained, the data can be transmitted to a residual decoding sub-network, which processes the residual coding data according to its designed structure and parameters to reconstruct the original signal or recover higher-resolution, finer image, video, or audio data. The output of the residual decoding sub-network may be the residual decoded data.
In the scheme, the residual coding sub-network and the quantization sub-network are adopted, so that the parameter number of the model can be effectively reduced, and the calculation complexity and the storage space requirement of the model are reduced.
On the basis of the above technical solutions, optionally, the depth video coding network further includes a bit estimation sub-network;
accordingly, after the enhancement layer residual error is encoded by using the depth video encoding network to obtain residual error encoded data of the enhancement layer, the method further includes:
and carrying out bit estimation on residual coding data of the enhancement layer by adopting the bit estimation sub-network, and determining the coding compression efficiency of the depth video coding network.
In this scheme, the bit estimation sub-network may refer to a sub-network for estimating the number of bits of each channel of the quantized feature map in the compressed neural network. The method is mainly used for a quantization step in a compressed neural network, wherein quantization is used for converting floating point number parameters in the neural network into smaller integers so as to reduce the storage and calculation cost of the parameters. However, since quantization causes a loss of accuracy, it is necessary to reduce the loss of accuracy due to quantization as much as possible while ensuring the accuracy of the model. The function of the sub-network of bit estimation is to help us to better control the quantized bit number, thus achieving the best balance between storage and computational overhead and model accuracy.
The coding compression efficiency of the depth video coding network may refer to a balance relationship between a compression code rate and video quality obtained by compressing an original video. In compressing video, it is necessary to reduce the compression rate as much as possible while ensuring the video quality, so as to reduce the bandwidth and storage space requirements in transmitting and storing video. Therefore, the coding compression efficiency of a depth video coding network may be an important index for evaluating the performance of the depth video coding network, and the compression rate or bit rate is generally measured, where a smaller compression rate or bit rate indicates a higher compression efficiency, and a smaller space occupied by compressed video.
In a depth video coding network, the residual coding data of the enhancement layer is quantized into discrete integer values by the quantization sub-network, but these integer values are not necessarily optimal, and better approximations may exist. The function of the bit estimation sub-network is to optimize these integer values to obtain a more compact encoded representation. Specifically, for each integer value the bit estimation sub-network may attempt to find an approximation closer to the original residual and calculate the number of bits needed to represent that approximation. In this way, a more compact encoded representation can be obtained, improving coding compression efficiency. The bit estimation sub-network may be used to estimate the bit rate of each data block to determine the coding compression efficiency of the depth video coding network.
The coding compression efficiency of a depth video coding network may be determined as follows:
1. Transmitting the residual coding data of the enhancement layer into the bit estimation sub-network;
2. The bit estimation sub-network processes the input data and outputs a bit rate estimate for each data block;
3. Calculating the compression ratio of each data block from the bit rate estimate and the data block size;
4. Averaging the compression ratios of all the data blocks to obtain the average coding compression efficiency of the whole video sequence.
In the scheme, the bit estimation sub-network is adopted to carry out bit estimation on residual coding data of the enhancement layer, and the coding compression efficiency can be determined by predicting the bit number required by coding. The method is beneficial to controlling the output code rate more accurately when the depth video coding network compresses the video, and ensures that the quality of the compressed video is more stable and controllable.
And S104, generating code stream data based on the residual code data and the base layer code data.
The code stream data can be media data streams such as video, audio or images after compression coding processing. In digital media transmission or storage, code stream data is typically generated by an encoder and stored or transmitted to a decoder for decoding and playback. The code stream data can be realized by different compression coding algorithms to realize efficient data compression and transmission, thereby saving storage space and network bandwidth. In video coding, code stream data is typically composed of a plurality of data units including header information, image and audio data, and the like. Wherein the header information contains metadata and control information about the code stream data, such as resolution of an image, frame rate, encoding format, etc., so that the decoder can correctly decode the code stream data.
The process of generating bitstream data based on residual encoded data and base layer encoded data is commonly referred to as bitstream packing. The code stream data may be generated as follows:
1. Obtaining the base layer encoded data and the residual coding data of the enhancement layer from the encoder.
2. Combining the base layer encoded data and the residual coding data into a single data stream to form the code stream data.
3. Segmenting and organizing the code stream data to meet the requirements of the code stream format. For example, for video coding, the code stream data is typically arranged in frame order, and header and control information such as code rate, resolution, and frame rate is added.
4. Compressing the code stream data to reduce its size and save storage space and network bandwidth. Commonly employed compression algorithms include variable-length coding, entropy coding, and predictive coding.
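A toy code-stream packer along these lines, assuming a made-up 10-byte header (magic, resolution, frame rate, layer count) and length-prefixed layer payloads — not any standardized bitstream format:

```python
import struct

def pack_bitstream(base_layer: bytes, residuals: list, width: int,
                   height: int, fps: int) -> bytes:
    """Header carries metadata; each payload is length-prefixed so the
    decoder can split the stream back into layers."""
    header = struct.pack(">4sHHBB", b"HRC0", width, height, fps,
                         1 + len(residuals))
    body = b"".join(struct.pack(">I", len(p)) + p
                    for p in [base_layer] + residuals)
    return header + body

def unpack_bitstream(stream: bytes):
    """Split the stream into metadata, base layer and residual payloads."""
    magic, width, height, fps, n_layers = struct.unpack(">4sHHBB", stream[:10])
    payloads, off = [], 10
    for _ in range(n_layers):
        (n,) = struct.unpack(">I", stream[off:off + 4])
        payloads.append(stream[off + 4:off + 4 + n])
        off += 4 + n
    return (width, height, fps), payloads[0], payloads[1:]

packed = pack_bitstream(b"\x01\x02", [b"\x03"], 1280, 720, 30)
meta, base, res = unpack_bitstream(packed)
```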
On the basis of the above technical solutions, optionally, after generating code stream data based on the residual encoded data and the base layer encoded data, the method further includes:
providing the code stream data to a receiving end, decoding and up-sampling the code stream data based on the base layer coding data by the receiving end to obtain a decoding up-sampling result of a target enhancement layer, performing pixel domain mapping on the residual coding data based on a depth video decoding network to obtain a pixel domain mapping result, and determining reconstructed video data of the target enhancement layer according to the decoding up-sampling result and the pixel domain mapping result; wherein the convolutional layers in the depth video decoding network and the depth video encoding network are transposed convolutional layers.
In this scheme, the receiving end may refer to a network terminal device, such as a computer, a smart phone, a tablet computer, a television, etc., which may receive the code stream data through the network, and decode the data into the original video and audio contents through the corresponding decoder for playing.
In a depth video coding network, each enhancement layer corresponds to a respective decoder. The encoder converts the input data into a base stream, and the decoder converts the base stream into decoded data. The decoding and up-sampling result of the target enhancement layer may be a reconstruction result of the corresponding target enhancement layer obtained by the receiving end using the base layer encoded data to decode and up-sample. This result may be obtained by inputting the received base layer encoded data into a decoder, through decoding and up-sampling processes.
The pixel domain mapping is performed on the residual coding data based on the depth video decoding network, which may mean that the residual coding data obtained by coding by using the encoder and the decoding result of the base layer data are calculated to obtain a set of pixel values, which represent the pixel value of the target enhancement layer, and the set of pixel values is the pixel domain mapping result.
The reconstructed video data of the target enhancement layer may refer to video data reconstructed at the receiving end according to the decoding up-sampling result and the pixel domain mapping result, which contains information of the base layer and the enhancement layer in the original video data, and the output video data has similar quality and visual effect with the original video data through decoding and up-sampling processing.
The convolutional layer may be a conventional neural network layer that extracts features from the input data. The convolution layer can carry out convolution operation on input data in a sliding window mode to obtain a group of output characteristic diagrams. The parameters of the convolution layer consist of a set of convolution kernels, each convolving a small portion of the input data and generating a corresponding signature.
The transposed convolution layer may be an inverse operation of convolution. It can expand the size of the input data and is typically used to upsample a low-resolution feature map to high resolution. Like the convolutional layer, the transposed convolutional layer consists of a set of learnable parameters, i.e., transposed convolution kernels. The transposed convolution kernel upsamples the low-resolution feature map to high resolution and, in the process, learns how to reconstruct the features, i.e., how to enlarge the feature map while preserving the original information.
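The size expansion can be seen in a naive 1-D transposed convolution (a sketch for intuition only; real decoding networks use learned multi-channel kernels, and the "copy" kernel below is an assumption that happens to reproduce nearest-neighbor upsampling):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Naive 1-D transposed convolution: each input element 'stamps' the
    kernel onto the output at stride-spaced positions.
    Output length = (len(x) - 1) * stride + len(kernel)."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
k = np.array([1.0, 1.0])        # a simple "copy" kernel
y = transposed_conv1d(x, k)     # length (3 - 1) * 2 + 2 = 6
```

With this kernel the output duplicates each input sample, i.e., a 2x upsampling; a learned kernel would instead blend neighboring samples to reconstruct detail.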
In the video transmission process, the encoded code stream data needs to be transmitted to the receiving end for decoding and reconstruction. Specifically, the data may be delivered over a network using a transport protocol (e.g., TCP/IP). Conventional storage media (e.g., an optical disc or USB flash drive) may also be used to provide the code stream data to the receiving end for decoding and reconstruction.
The receiving end needs to decode, reconstruct and up-sample the received code stream data according to the coding mode of the coding network. The method can be carried out according to the following steps:
1. Decoding the received code stream data according to the layered structure of the coding network to obtain the base layer encoded data and the residual coding data of each enhancement layer.
2. Decoding and reconstructing the base layer encoded data to obtain the decoding result of the base layer.
3. Up-sampling the decoding result of the base layer to obtain a decoding up-sampling result with the same size as the target enhancement layer.
4. Carrying out bit estimation on the residual coding data of each enhancement layer to determine its coding compression efficiency.
5. Decoding and reconstructing the residual coding data of each enhancement layer to obtain the decoding result of each enhancement layer.
6. Adding the decoding result of each enhancement layer to the up-sampled decoding result of the base layer to obtain the decoding up-sampling result of the target enhancement layer.
Pixel domain mapping of residual encoded data based on a depth video decoding network can be divided into the following steps:
1. Inputting the residual coding data into the depth video decoding network, where the encoded data is restored to image form by convolutional and transposed convolutional layers.
2. Adding the restored image data to the base layer data to obtain the decoding result of the target enhancement layer.
3. Inputting the decoding result of the target enhancement layer into a pixel domain mapping network and mapping it onto the original resolution of the target enhancement layer through convolutional and transposed convolutional layers.
4. The resulting pixel domain mapping result is the final enhancement layer image data.
At the receiving end, the reconstruction result of the target enhancement layer can be obtained by adding the decoding up-sampling result and the pixel domain mapping result of the target enhancement layer. Specifically, the reconstructed video data of the target enhancement layer may be obtained as follows:
1. Performing a deconvolution operation on the decoding up-sampling result to obtain a pixel domain representation of the decoding result.
2. Reversely mapping the pixel domain mapping result to obtain the feature domain representation used during encoding.
3. Adding the base layer data obtained in the encoding process to the feature domain data obtained in step 2 to obtain the feature domain representation of the target enhancement layer.
4. Performing a deconvolution operation on the feature domain representation of the target enhancement layer to obtain its pixel domain representation.
5. Adding the result obtained in step 4 to the base layer of the original video frame to obtain the final reconstructed video data.
In the scheme, the video is encoded through the depth video encoding network, so that a higher compression ratio can be obtained, the bandwidth requirement in the transmission process is reduced, and the transmission delay and the cost are reduced. The depth video decoding network is adopted to decode, reconstruct and up-sample the coded data, so that higher video quality can be obtained, and the viewing experience of a user is improved. And because of using technologies such as residual error coding and pixel domain mapping, the data volume after coding can be reduced, thereby reducing the bandwidth requirement in the transmission process and reducing network congestion and transmission delay.
In the embodiment of the application, at least one-stage downsampling processing is carried out on video data to be subjected to hierarchical coding, so that a layer where the obtained data with the lowest resolution is located is taken as a base layer, and a layer where the video data is located or a layer where the video data is located and a layer where the data with the middle resolution is located are taken as enhancement layers; determining base layer encoded data and determining an enhancement layer residual; encoding the residual error of the enhancement layer by adopting a depth video encoding network to obtain residual error encoding data of the enhancement layer; and generating code stream data based on the residual code data and the base layer code data. By the AI-based grading residual error coding method, the size of the code stream can be effectively reduced, video coding with high compression ratio is realized, and storage space and network bandwidth are saved. Meanwhile, the residual coding data of the enhancement layer can better keep details and textures of the video, so that the visual quality and definition of the video are improved, and higher coding efficiency and better code rate control capability can be realized by using a depth video coding network.
Example two
Fig. 4 is a flowchart of an AI-based hierarchical residual encoding method according to a second embodiment of the present application. As shown in fig. 4, the method specifically comprises the following steps:
S201, at least one-stage downsampling processing is carried out on the video data to be subjected to hierarchical coding; the layer where the data with the lowest resolution is located is taken as the base layer, and the layer where the video data is located, or the layer where the video data is located together with the layer where the data with the middle resolution is located, is taken as the enhancement layer.
S202, processing the base layer by adopting a base encoder to obtain base layer encoded data.
The base encoder may be a video encoder that compresses a video sequence into base layer encoded data. It is typically a standard video encoder such as H.264 or H.265/HEVC (High Efficiency Video Coding). These encoders adopt advanced compression algorithms and can effectively compress video data to reduce the amount of transmitted data while maintaining video quality. The base encoder typically encodes only the low-resolution portion of the video sequence, i.e., the base layer, and thus can ensure encoding and transmission efficiency while guaranteeing video quality.
The base encoder may process the base layer to obtain base layer encoded data by:
1. Preprocessing: the video sequence is denoised, filtered, and otherwise preprocessed to reduce noise and distortion during encoding.
2. Sampling: the video sequence is downsampled to reduce the video resolution and obtain the base layer data.
3. Spatial prediction: the base encoder performs intra prediction using information within the current frame of the base layer, or inter prediction using information from preceding and following frames, to improve encoding efficiency.
4. Temporal prediction: the encoder divides successive video frames into multiple GOPs (Groups of Pictures) and uses inter prediction within each GOP to further reduce redundant information.
5. Transform and quantization: the prediction error signal is transformed and quantized to reduce its data amount.
6. Entropy coding: the transformed and quantized signal is compressed with an entropy coding algorithm to obtain the final base layer encoded data.
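Steps 4 and 5 — temporal prediction followed by quantization of the prediction error — can be illustrated with a toy frame-difference predictor and a uniform quantizer (the transform is omitted and the quantization step `qstep` is an assumption; a real base encoder such as H.264 is far more elaborate):

```python
import numpy as np

def encode_frame_pair(prev, curr, qstep=4):
    """Temporal prediction: the error w.r.t. the previous frame is quantized,
    so only small integers need to be entropy-coded."""
    error = curr - prev                        # prediction residual
    return np.round(error / qstep).astype(int)  # lossy quantization

def decode_frame(prev, q, qstep=4):
    """Decoder mirrors the encoder: dequantize and add back the prediction."""
    return prev + q * qstep

prev = np.array([100, 100, 100, 100])
curr = np.array([103, 111, 100,  97])
q = encode_frame_pair(prev, curr)   # small symbols, cheap to entropy-code
rec = decode_frame(prev, q)         # reconstruction within qstep/2 per pixel
```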
S203, up-sampling processing is carried out on the base layer coded data to obtain up-sampling results of all enhancement layers; wherein the sampling factors of the up-sampling process and the down-sampling process are the same.
Fig. 5 is a schematic diagram of up- and down-sampling of AI-based hierarchical residual coding according to a second embodiment of the present application. As shown in fig. 5, the up-sampling result of the enhancement layer may be the result of interpolating or copying the base layer data, so that the resolution of the enhancement layer is the same as before the corresponding down-sampling. Specifically, the up-sampling may be implemented by interpolation algorithms such as Lanczos resampling, bilinear interpolation, or bicubic interpolation, or by simply copying pixel values.
In the upsampling and downsampling processes, the sampling factor may represent the ratio of the resolutions before and after sampling. For example, the resolution is downsampled from 720 x 1280 to 360 x 640, and the sampling factor is 1/2 of the original resolution for both width and height. The resolution is up-sampled from 360 x 640 to 720 x 1280, the sampling factor is 2 times the original resolution for both width and height. When the sampling factors of the up-sampling process and the down-sampling process are the same, it means that the ratio of their samples is the same. For example, if the sampling factor is 2, the up-sampling process increases each dimension of the image by a factor of 2, while the down-sampling process decreases each dimension of the image by a factor of 2, the two operations sampling the same ratio.
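The matching of sampling factors can be expressed trivially (a sketch expressing the factor as an integer ratio rather than the fraction 1/2; function names are illustration choices):

```python
def downsample_resolution(width, height, factor):
    """Down-sampling divides each dimension by the sampling factor."""
    return width // factor, height // factor

def upsample_resolution(width, height, factor):
    """Up-sampling with the same factor multiplies each dimension back."""
    return width * factor, height * factor

low = downsample_resolution(1280, 720, 2)   # base layer resolution
back = upsample_resolution(*low, 2)         # matches the original resolution
```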
Upsampling the base layer encoded data may use interpolation methods such as bilinear interpolation, bicubic interpolation, etc. Specifically, the method can be carried out by the following steps:
1. Determining the resolution after up-sampling based on the resolution of the base layer encoded data and the sampling factor.
2. Decoding the base layer encoded data to obtain the pixel values of the base layer.
3. Up-sampling the pixel values of the base layer to the target resolution using an interpolation method.
4. Encoding the up-sampled pixel values to obtain the up-sampling result of the enhancement layer. It should be noted that the up-sampling factors used by different enhancement layers may differ and therefore need to be processed separately.
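Step 3 — interpolation to the target resolution — can be sketched in one dimension with linear interpolation (the 1-D analogue of bilinear interpolation, which applies it along both axes; `np.interp` and the chosen sample positions are illustration choices):

```python
import numpy as np

def upsample_linear_1d(row, factor=2):
    """Up-sample one row of pixel values by linear interpolation.
    Source samples sit at integer positions 0..n-1; the target grid places
    factor*(n-1)+1 evenly spaced samples over the same span."""
    n = len(row)
    dst = np.linspace(0, n - 1, factor * (n - 1) + 1)
    return np.interp(dst, np.arange(n), row)

row = np.array([0.0, 10.0, 20.0])
up = upsample_linear_1d(row, factor=2)  # samples at 0, 0.5, 1, 1.5, 2
```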
And S204, determining an enhancement layer residual error based on the data of the enhancement layer and the up-sampling result.
The enhancement layer residual may be determined by calculating the difference between the enhancement layer data and its corresponding upsampling result. In particular, enhancement layer data may be downsampled to match the resolution of its corresponding upsampled result, and then the difference between them is calculated. This difference is the residual of the enhancement layer, which can be encoded and transmitted to the receiving end.
On the basis of the above technical solutions, optionally, determining an enhancement layer residual based on the enhancement layer data and the upsampling result includes:
Carrying out a pixel-aligned, pixel-wise difference between the data of the enhancement layer and the up-sampling result to obtain a spatial-domain residual calculation result, and determining the spatial-domain residual calculation result as the enhancement layer residual of the current enhancement layer.
In this scheme, the pixel-wise difference may refer to subtracting the pixels at the same positions of two images to obtain a new difference image. In video coding, pixel-wise differences are typically used to calculate the residual between two frames, which can be encoded and transmitted so that the next frame image can be reconstructed at the decoding end. This reduces inter-frame redundancy and achieves a higher compression rate.
The spatial-domain residual calculation result may be used to represent the details and texture information of the enhancement layer, since high-frequency information is typically lost during downsampling and cannot be regenerated by upsampling. By calculating the spatial-domain residual, this information can be reintroduced into the encoded data, thereby improving the visual quality of the video. At the same time, the spatial-domain residual is also useful for video compression, since it can represent the high-frequency information of the enhancement layer data with fewer bits.
The base layer data can be subjected to up-sampling processing to obtain an up-sampling result (see S203). A pixel-wise difference between the enhancement layer original data and the up-sampling result then gives the spatial-domain residual calculation result. This process can be expressed as:
Spatial-domain residual calculation result = enhancement layer original data − up-sampling result;
where the up-sampling result and the original data are pixel-aligned, that is, their pixel positions correspond one to one.
The up-sampling can be implemented by interpolation algorithms such as bilinear interpolation or cubic spline interpolation. Aligning the original data with the up-sampling result by pixel position and taking the pixel-wise difference yields the spatial-domain residual calculation result, i.e., the enhancement layer residual of the current enhancement layer.
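The residual formula above — and the fact that adding the residual back to the up-sampling result recovers the enhancement layer exactly — can be checked with a small sketch (array values are made up):

```python
import numpy as np

def enhancement_layer_residual(enhancement, upsampled_base):
    """Pixel-aligned difference: residual = enhancement data - up-sampling result."""
    assert enhancement.shape == upsampled_base.shape  # pixel positions must align
    return enhancement - upsampled_base

enh = np.array([[12.0, 9.0], [10.0, 14.0]])   # enhancement layer original data
up = np.array([[10.0, 10.0], [10.0, 10.0]])   # up-sampled base layer
res = enhancement_layer_residual(enh, up)
recon = up + res   # decoder side: up-sampling result + residual
```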
In the scheme, the pixel point difference between the enhancement layer and the up-sampling result is calculated to obtain the enhancement layer residual error, and the high-frequency detail information and the texture information in the video can be effectively extracted, so that the video quality is enhanced.
And S205, encoding the residual error of the enhancement layer by adopting a depth video encoding network to obtain residual error encoding data of the enhancement layer.
S206, generating code stream data based on the residual code data and the base layer code data.
In this embodiment, the residual information of the enhancement layer may be generated by using the information of the base layer, so as to reduce the coding amount of the enhancement layer data and further improve the compression efficiency of video coding. Meanwhile, the data of each enhancement layer is obtained by up-sampling the base layer, so that the information circulation among all layers can be ensured, and the quality of video reconstruction is improved. The finally generated code stream data can reduce the size of the code stream while guaranteeing the video quality, improves the transmission efficiency and is more suitable for network transmission.
Example III
Fig. 6 is a schematic structural diagram of an AI-based hierarchical residual encoding apparatus according to a third embodiment of the present application.
As shown in fig. 6, the apparatus specifically includes the following modules:
the downsampling module 301 is configured to perform at least one level of downsampling processing on video data to be hierarchically encoded, so that a layer where data with a lowest resolution is obtained is used as a base layer, and a layer where video data is located or a layer where video data is located and a layer where data with a middle resolution is located is used as an enhancement layer;
a determining module 302, configured to determine base layer encoded data, and determine an enhancement layer residual;
the encoding module 303 is configured to encode the enhancement layer residual error by using a depth video encoding network, so as to obtain residual error encoded data of the enhancement layer;
a generating module 304, configured to generate code stream data based on the residual code data and the base layer code data.
According to the technical scheme provided by the embodiment, the downsampling module is used for performing at least one-stage downsampling processing on video data to be subjected to hierarchical coding, wherein the layer with the lowest resolution is obtained and is used as a base layer, and the layer with the video data or the layer with the video data and the layer with the middle resolution is used as an enhancement layer; the determining module is used for determining base layer coding data and determining an enhancement layer residual error; the coding module is used for coding the residual error of the enhancement layer by adopting a depth video coding network to obtain residual error coding data of the enhancement layer; and the generating module is used for generating code stream data based on the residual error coding data and the base layer coding data. Through the AI-based grading residual error coding device, the size of the code stream can be effectively reduced, the video coding with high compression ratio is realized, and the storage space and the network bandwidth are saved. Meanwhile, the residual coding data of the enhancement layer can better keep details and textures of the video, so that the visual quality and definition of the video are improved, and higher coding efficiency and better code rate control capability can be realized by using a depth video coding network.
The AI-based grading residual coding device in the embodiment of the application can be a device, a component in a terminal, an integrated circuit or a chip. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palm computer, vehicle mounted electronic device, wearable device, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., and the non-mobile electronic device may be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and embodiments of the present application are not limited in particular.
The AI-based hierarchical residual encoding apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an ios operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The AI-based grading residual error coding device provided by the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 and fig. 4, and in order to avoid repetition, a description thereof will not be repeated here.
Example IV
Fig. 7 is a schematic structural diagram of an AI-based hierarchical residual coding apparatus according to an embodiment of the present application, and as shown in fig. 7, the apparatus includes a processor 401, a memory 402, an input device 403, and an output device 404; the number of processors 401 in the device may be one or more, one processor 401 being exemplified in fig. 7; the processor 401, memory 402, input means 403 and output means 404 in the device may be connected by a bus or other means, in fig. 7 by way of example. The memory 402 is used as a computer readable storage medium for storing a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the AI-based hierarchical residual encoding method in an embodiment of the application. The processor 401 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 402, i.e., implements the AI-based hierarchical residual encoding method described above. The input means 403 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output 404 may include a display device such as a display screen.
The present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are configured to perform the AI-based hierarchical residual coding method described in the above embodiments, comprising:
performing at least one stage of downsampling on video data to be hierarchically coded, taking the layer of the lowest-resolution data obtained as a base layer, and taking the layer of the video data, or the layer of the video data together with the layers of intermediate-resolution data, as enhancement layers;
determining base layer encoded data and determining an enhancement layer residual;
encoding the enhancement layer residual with a deep video coding network to obtain residual coded data of the enhancement layer;
and generating bitstream data based on the residual coded data and the base layer encoded data.
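The pipeline summarized above (downsample, encode the base layer, up-sample, take the residual, encode the residual, pack the bitstream) can be sketched numerically. The average-pool downsampling, nearest-neighbour up-sampling, and identity codecs below are illustrative stand-ins for the patent's components, not its actual deep network:

```python
import numpy as np

def downsample(x, factor=2):
    # Average-pool downsampling; a stand-in for the patent's downsampling step.
    h, w = x.shape
    return x[:h - h % factor, :w - w % factor].reshape(
        h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor=2):
    # Nearest-neighbour up-sampling with the same factor as the downsampling.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def encode_hierarchical(frame, base_codec, residual_codec, factor=2):
    base = downsample(frame, factor)            # base layer (lowest resolution)
    base_code = base_codec(base)                # base layer encoded data
    residual = frame - upsample(base, factor)   # enhancement layer residual
    residual_code = residual_codec(residual)    # residual coded data
    return base_code, residual_code             # together they form the bitstream

frame = np.arange(16.0).reshape(4, 4)
# Identity "codecs" keep the sketch self-contained.
base_code, residual_code = encode_hierarchical(
    frame, base_codec=lambda x: x, residual_codec=lambda r: r)
```

With average pooling followed by nearest-neighbour up-sampling, the residual of each block sums to zero, and adding the residual back to the up-sampled base layer reproduces the frame exactly.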
It should be noted that, in the above embodiment of the AI-based hierarchical residual coding apparatus, the units and modules included are divided only according to functional logic, but the division is not limited thereto, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for distinguishing them from one another and are not intended to limit the protection scope of the embodiments of the present application.
In some possible embodiments, aspects of the method provided by the present application may also be implemented in the form of a program product, which includes program code that, when the program product runs on a computer device, causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described in this specification; for example, the computer device may perform the AI-based hierarchical residual coding method described in the examples of the present application. The program product may be implemented using any combination of one or more readable media.

Claims (11)

1. An AI-based hierarchical residual coding method, comprising:
performing at least one stage of downsampling on video data to be hierarchically coded, taking the layer of the lowest-resolution data obtained as a base layer, and taking the layer of the video data, or the layer of the video data together with the layers of intermediate-resolution data, as enhancement layers;
determining base layer encoded data and determining an enhancement layer residual;
encoding the enhancement layer residual with a deep video coding network to obtain residual coded data of the enhancement layer;
and generating bitstream data based on the residual coded data and the base layer encoded data.
2. The AI-based hierarchical residual coding method of claim 1, wherein determining base layer encoded data and determining an enhancement layer residual comprises:
encoding the base layer with a base encoder to obtain base layer encoded data;
up-sampling the base layer encoded data to obtain an up-sampling result for each enhancement layer; wherein the up-sampling and the down-sampling use the same sampling factor;
and determining an enhancement layer residual based on the data of the enhancement layer and the up-sampling result.
3. The AI-based hierarchical residual coding method of claim 2, wherein determining an enhancement layer residual based on the enhancement layer data and the up-sampling result comprises:
performing a pixel-aligned difference between the data of the enhancement layer and the up-sampling result to obtain a spatial-domain residual calculation result, and determining the spatial-domain residual calculation result as the enhancement layer residual of the current enhancement layer.
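The pixel-aligned difference of claim 3 is a plain element-wise subtraction between the enhancement layer data and the up-sampled base layer reconstruction. A tiny illustrative example (the array values are made up):

```python
import numpy as np

# Hypothetical enhancement-layer data and the up-sampled base-layer result.
enhancement = np.array([[10, 12], [14, 16]], dtype=np.float32)
upsampled = np.array([[9, 12], [15, 15]], dtype=np.float32)

# Pixel-aligned difference gives the spatial-domain residual of this layer.
spatial_residual = enhancement - upsampled
```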
4. The AI-based hierarchical residual coding method of claim 1, wherein the deep video coding network includes a residual coding sub-network;
correspondingly, encoding the enhancement layer residual with the deep video coding network to obtain residual coded data of the enhancement layer comprises:
encoding the enhancement layer residual with the residual coding sub-network, and taking the output feature-domain code as the residual coded data of the enhancement layer.
5. The AI-based hierarchical residual coding method of claim 4, wherein the deep video coding network further comprises a quantization sub-network;
correspondingly, encoding the enhancement layer residual with the deep video coding network to obtain residual coded data of the enhancement layer comprises:
encoding the enhancement layer residual with the residual coding sub-network, and clipping the output feature-domain code with the quantization sub-network to obtain the residual coded data of the enhancement layer.
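Claim 5's quantization sub-network clips the feature-domain code into a bounded range. A minimal stand-in that rounds and saturates codes to a fixed integer range (the int8 range and rounding rule are assumptions for illustration, not taken from the patent):

```python
import numpy as np

def quantize_features(codes, qmin=-128, qmax=127):
    # Round the feature-domain codes and clip (saturate) them into a fixed
    # integer range, standing in for the patent's quantization sub-network.
    return np.clip(np.rint(codes), qmin, qmax).astype(np.int8)

codes = np.array([-300.4, -0.6, 0.4, 12.7, 500.0])
q = quantize_features(codes)
```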
6. The AI-based hierarchical residual coding method of claim 4 or 5, wherein the deep video coding network further comprises a bit estimation sub-network;
correspondingly, after encoding the enhancement layer residual with the deep video coding network to obtain residual coded data of the enhancement layer, the method further comprises:
performing bit estimation on the residual coded data of the enhancement layer with the bit estimation sub-network, and determining the coding compression efficiency of the deep video coding network.
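Claim 6's bit estimation sub-network predicts how many bits the residual code will cost. As a rough stand-in, the empirical entropy of the quantized symbols gives such an estimate; a learned prior, as in typical deep codecs, would replace these counted probabilities:

```python
import numpy as np

def estimate_bits(symbols):
    # Empirical-entropy bit estimate for a block of quantized feature codes:
    # total bits ~ per-symbol entropy (bits) * number of symbols.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() * symbols.size)

symbols = np.array([0, 0, 0, 0, 1, 1, 2, 2])
bits = estimate_bits(symbols)  # 1.5 bits/symbol over 8 symbols
```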
7. The AI-based hierarchical residual coding method of claim 1, wherein after generating bitstream data based on the residual coded data and the base layer encoded data, the method further comprises:
providing the bitstream data to a receiving end, such that the receiving end decodes and up-samples the bitstream data based on the base layer encoded data to obtain a decoded up-sampling result of a target enhancement layer, performs pixel-domain mapping on the residual coded data based on a deep video decoding network to obtain a pixel-domain mapping result, and determines reconstructed video data of the target enhancement layer according to the decoded up-sampling result and the pixel-domain mapping result; wherein the convolutional layers in the deep video decoding network and the deep video coding network are transposed convolutional layers.
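The receiver-side reconstruction in claim 7 — up-sample the decoded base layer, then add the pixel-domain mapping of the residual — can be sketched as follows. The nearest-neighbour up-sampling and the toy arrays are assumptions for illustration, not the patent's transposed-convolution decoder:

```python
import numpy as np

def reconstruct(base_decoded, pixel_mapped_residual, factor=2):
    # Receiver side: up-sample the decoded base layer and add the pixel-domain
    # mapping of the residual code to rebuild the target enhancement layer.
    up = np.repeat(np.repeat(base_decoded, factor, axis=0), factor, axis=1)
    return up + pixel_mapped_residual

base_decoded = np.array([[2.0]])                      # decoded base layer
residual = np.array([[0.5, -0.5], [1.0, -1.0]])       # pixel-domain mapping result
frame = reconstruct(base_decoded, residual)
```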
8. An AI-based hierarchical residual coding device, comprising:
a downsampling module, configured to perform at least one stage of downsampling on video data to be hierarchically coded, take the layer of the lowest-resolution data obtained as a base layer, and take the layer of the video data, or the layer of the video data together with the layers of intermediate-resolution data, as enhancement layers;
a determining module, configured to determine base layer encoded data and determine an enhancement layer residual;
a coding module, configured to encode the enhancement layer residual with a deep video coding network to obtain residual coded data of the enhancement layer;
and a generating module, configured to generate bitstream data based on the residual coded data and the base layer encoded data.
9. An AI-based hierarchical residual coding apparatus, comprising: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the AI-based hierarchical residual coding method of any of claims 1-7.
10. A storage medium storing computer-executable instructions for performing the AI-based hierarchical residual coding method of any of claims 1-7 when executed by a computer processor.
11. A computer program product comprising a computer program which, when executed by a processor, implements the AI-based hierarchical residual coding method of any of claims 1-7.
CN202310396480.7A 2023-04-12 2023-04-12 AI-based grading residual error coding method, device, equipment and storage medium Pending CN116582685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310396480.7A CN116582685A (en) 2023-04-12 2023-04-12 AI-based grading residual error coding method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310396480.7A CN116582685A (en) 2023-04-12 2023-04-12 AI-based grading residual error coding method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116582685A true CN116582685A (en) 2023-08-11

Family

ID=87538545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310396480.7A Pending CN116582685A (en) 2023-04-12 2023-04-12 AI-based grading residual error coding method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116582685A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2621248A (en) * 2020-11-27 2024-02-07 V Nova Int Ltd Video encoding using pre-processing
CN116886932A (en) * 2023-09-07 2023-10-13 中移(杭州)信息技术有限公司 Video stream transmission method, device, terminal equipment and storage medium
CN116886932B (en) * 2023-09-07 2023-12-26 中移(杭州)信息技术有限公司 Video stream transmission method, device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
US20220239925A1 (en) Method and apparatus for applying deep learning techniques in video coding, restoration and video quality analysis (vqa)
CN108495130B (en) Video encoding method, video decoding method, video encoding device, video decoding device, terminal, server and storage medium
CN108833916B (en) Video encoding method, video decoding method, video encoding device, video decoding device, storage medium and computer equipment
KR100746011B1 (en) Method for enhancing performance of residual prediction, video encoder, and video decoder using it
EP3637781A1 (en) Video processing method and apparatus
US9602819B2 (en) Display quality in a variable resolution video coder/decoder system
JP7114153B2 (en) Video encoding, decoding method, apparatus, computer equipment and computer program
US20160205413A1 (en) Systems and methods for wavelet and channel-based high definition video encoding
CN116582685A (en) AI-based grading residual error coding method, device, equipment and storage medium
CN110300301B (en) Image coding and decoding method and device
CN101689297B (en) Efficient image representation by edges and low-resolution signal
CN113766249B (en) Loop filtering method, device, equipment and storage medium in video coding and decoding
US11012718B2 (en) Systems and methods for generating a latent space residual
CN105474642A (en) Re-encoding image sets using frequency-domain differences
CN108848377B (en) Video encoding method, video decoding method, video encoding apparatus, video decoding apparatus, computer device, and storage medium
US10972749B2 (en) Systems and methods for reconstructing frames
CN115606179A (en) CNN filter for learning-based downsampling for image and video coding using learned downsampling features
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN115552905A (en) Global skip connection based CNN filter for image and video coding
US20140241423A1 (en) Image coding and decoding methods and apparatuses
RU2683614C2 (en) Encoder, decoder and method of operation using interpolation
Samra Image compression techniques
KR20200044668A (en) AI encoding apparatus and operating method for the same, and AI decoding apparatus and operating method for the same
JP4762486B2 (en) Multi-resolution video encoding and decoding
CN114222127A (en) Video coding method, video decoding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination