CN113949882A - Video coding and decoding method and device based on convolutional neural network - Google Patents

Info

Publication number
CN113949882A
CN113949882A (application CN202111093964.1A)
Authority
CN
China
Prior art keywords
image
residual
coding
decoding
network
Prior art date
Legal status
Pending
Application number
CN202111093964.1A
Other languages
Chinese (zh)
Inventor
约翰·普莱斯特
Current Assignee
Rongming Microelectronics Jinan Co ltd
Original Assignee
Rongming Microelectronics Jinan Co ltd
Priority date
Filing date
Publication date
Application filed by Rongming Microelectronics Jinan Co., Ltd.
Priority to CN202111093964.1A
Publication of CN113949882A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/513 Processing of motion vectors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a convolutional neural network-based video encoding and decoding method and device. The video encoding method comprises the following steps: creating a scaled reference image from the current frame image and encoding it through an image coding network to obtain a compressed image; decoding the compressed image through an image decoding network to obtain a decompressed image and restoring it to the original size to obtain a baseline predicted image; obtaining a baseline residual image from the current frame image and the baseline predicted image; generating a plurality of reference residual images from the current frame image and the reference frames, generating an optimal residual image from the baseline residual image and the plurality of reference residual images using a residual generator, and obtaining the corresponding motion vector map; encoding the optimal residual image through a residual coding network to obtain a residual compressed image; and encoding and outputting, or directly outputting, the compressed image, the residual compressed image and the motion vector map. The image networks and the residual networks are convolutional neural networks.

Description

Video coding and decoding method and device based on convolutional neural network
Technical Field
The invention relates to the technical field of video coding and decoding, and in particular to a convolutional neural network-based video encoding and decoding method and device.
Background
As the demand for real-time streaming media, video on demand, cloud games, and the like increases, the demand for compressed video streams also increases. In addition, the growing popularity of high-definition technologies such as 4K, 8K, and HDR has driven a rapid increase in the volume of video data transmitted over the internet. Transmitting and storing such data is costly, and the cost grows over time. To cope with this trend, video compression technology has advanced in recent years: H.265 and AV1 further increase the compression ratio of video data relative to earlier video compression standards.
As shown in fig. 1, most video codecs currently used to compress video follow a similar algorithm. A frame of video is compared with a reference frame. For the comparison, the frame is divided into many smaller blocks, and each block is compared, using a particular algorithm, with the previous frame or with blocks that have already been encoded. By analyzing the information in the reference frames, these algorithms act as "tools" that predict what needs to be encoded. The difference between the predicted block and the real block is then quantized and encoded with an entropy encoder. The goal of these algorithms, or tools, is to produce the closest possible match to the real image: the better the algorithm, the smaller the error between the prediction and the real image, so the residual entropy is low and it can be coded efficiently at a reduced bit rate.
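The predict-quantize-encode loop described above can be sketched as follows. This is a minimal numpy illustration of a generic hybrid codec, not of any specific standard; the 8x8 block and the quantization step size are arbitrary assumptions:

```python
import numpy as np

# Minimal sketch of the classic hybrid-codec loop: predict a block from a
# reference, quantize the (small) residual, and reconstruct. Only the
# quantized residual would need to be entropy coded.

def quantize(residual, step=8):
    """Uniform quantization of a prediction residual (step is illustrative)."""
    return np.round(residual / step).astype(np.int32)

def dequantize(q, step=8):
    return q.astype(np.float32) * step

rng = np.random.default_rng(0)
prediction = rng.uniform(0, 255, (8, 8)).astype(np.float32)   # predicted block
real_block = prediction + rng.normal(0, 2, (8, 8)).astype(np.float32)

residual = real_block - prediction            # small if the predictor is good
reconstructed = prediction + dequantize(quantize(residual))

# Reconstruction error is bounded by half the quantization step.
assert np.abs(reconstructed - real_block).max() <= 4.0
```

The better the prediction, the smaller (and lower-entropy) the residual, which is exactly why codec "tools" compete on prediction quality.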
However, these newer technologies still cannot keep up with the rapidly growing demand for high-definition and ultra-high-definition video, and further improvement is required. Optimizing a codec for a particular video quality is a complex problem, and there is significant room for development.
Disclosure of Invention
The invention provides a video coding and decoding method and device based on a convolutional neural network, and aims to solve the technical problem of improving the video compression performance.
The video coding method based on the convolutional neural network comprises the following steps:
creating a scaled reference image from the current frame image and encoding it through an image coding network to obtain a compressed image;
decoding the compressed image through an image decoding network to obtain a decompressed image, and restoring the decompressed image to the original size of the current frame image to obtain a baseline predicted image;
obtaining a baseline residual image from the current frame image and the baseline predicted image;
selecting one or more reconstructed coded images as reference frames, performing motion estimation in an inter-frame coding mode, obtaining a motion vector map between the current frame image and each reference frame, and generating a plurality of reference residual images;
generating an optimal residual image from the baseline residual image and the plurality of reference residual images using a residual generator, while obtaining the corresponding motion vector map, which determines, for each block in the optimal residual image, the reference frame or compressed image it is sourced from and the corresponding position;
encoding the optimal residual image through a residual coding network to obtain a residual compressed image;
outputting the compressed image, the residual compressed image and the motion vector map, either after encoding or directly;
wherein the image coding network, the image decoding network and the residual coding network are all convolutional neural networks.
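As a rough sketch, the step sequence above can be expressed in code. The CNN coders are replaced by identity stand-ins and motion estimation is omitted, so this only illustrates the data flow, not the patent's actual networks; the function names and the 1/4 scale are assumptions:

```python
import numpy as np

def downscale(img, f=4):                      # create the scaled reference image
    h, w = img.shape                          # (frame sides assumed divisible by f)
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upscale(img, f=4):                        # restore to original size
    return np.kron(img, np.ones((f, f), dtype=img.dtype))

def encode_frame(frame):
    compressed = downscale(frame)             # stand-in for the image coding network
    baseline_pred = upscale(compressed)       # decode + size restoration
    baseline_residual = frame - baseline_pred # baseline residual image
    # Motion estimation / residual generator omitted: the baseline residual
    # stands in for the "optimal" residual here.
    return compressed, baseline_residual

frame = np.arange(64, dtype=np.float64).reshape(8, 8)
compressed, residual = encode_frame(frame)

# With identity coders, a matching decoder reconstructs the frame exactly.
assert np.allclose(upscale(compressed) + residual, frame)
```

With real (lossy) CNN coders the reconstruction would only be approximate, and the residual path exists precisely to carry what the scaled baseline prediction misses.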
According to some embodiments of the invention, the optimal residual image is generated by a residual generator based on the baseline residual image and a plurality of the reference residual images generated by motion vector scanning.
In some embodiments of the invention, the method further comprises: and decoding the residual compressed image through a residual decoding network to obtain a residual decompressed image.
According to some embodiments of the present invention, the compressed image, the residual compressed image and the motion vector map are entropy encoded to obtain, respectively, the original coding, the residual coding and the mapping coding, which are then output.
The video coding device based on the convolutional neural network comprises the following components:
a size scaling module for creating a scaled reference image from the current frame image;
an image compression module for encoding the scaled reference image through an image coding network to obtain a compressed image;
an image decoding module for decoding the compressed image through an image decoding network to obtain a decompressed image, the decompressed image being restored to the original size of the current frame image to obtain a baseline predicted image;
a residual calculation module for obtaining a baseline residual image from the current frame image and the baseline predicted image, and for selecting one or more reconstructed coded images as reference frames, performing motion estimation in an inter-frame coding mode, obtaining a motion vector map between the current frame image and each reference frame, and generating a plurality of reference residual images;
a residual generator configured to generate an optimal residual image based on the baseline residual image and the plurality of reference residual images, and to obtain the corresponding motion vector map, which determines, for each block in the optimal residual image, the reference frame or compressed image it is sourced from and the corresponding position;
a residual compression module for encoding the optimal residual image through a residual coding network to obtain a residual compressed image;
an encoding output module for outputting the compressed image, the residual compressed image and the motion vector map, either after encoding or directly;
wherein the image coding network, the image decoding network and the residual coding network are all convolutional neural networks.
According to some embodiments of the invention, the residual generator obtains a minimum residual block from the baseline residual image and the plurality of reference residual images, and generates the optimal residual image based on the minimum residual block.
In some embodiments of the invention, the apparatus further comprises: and the residual error reconstruction module is used for decoding the residual error compressed image through a residual error decoding network to obtain a residual error decompressed image.
According to some embodiments of the present invention, the encoding output module is an entropy encoding module configured to entropy encode the compressed image, the residual compressed image and the motion vector map to obtain, respectively, the original coding, the residual coding and the mapping coding, which are then output.
The convolutional neural network-based decoding method according to an embodiment of the present invention decodes video encoded by the convolutional neural network-based video encoding method described above, and comprises:
entropy decoding the original coding, the residual coding and the mapping coding, respectively;
decoding the entropy-decoded original coding through an image decoding network to obtain a decompressed image, and restoring the decompressed image to its original size;
decoding the entropy-decoded residual coding through a residual decoding network to obtain a residual decompressed image;
and performing image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded map and the images of previously decoded frames, completing the decoding and reconstruction of the current frame image.
According to the convolutional neural network-based decoding apparatus of an embodiment of the present invention, the decoding apparatus decodes a video encoded by using the convolutional neural network-based video encoding method as described above, the apparatus includes:
an entropy decoding module for entropy decoding the original coding, the residual coding and the mapping coding, respectively;
an image decoding module for decoding the entropy-decoded original coding through an image decoding network to obtain a decompressed image;
a resizing module for restoring the decompressed image to its original size;
a residual decoding module for decoding the entropy-decoded residual coding through a residual decoding network to obtain a residual decompressed image;
and a frame reconstruction module for performing image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded map and the images of previously decoded frames, completing the decoding and reconstruction of the current frame image.
The video coding and decoding method and device based on the convolutional neural network have the following beneficial effects:
the present invention provides an efficient method of video compression and decompression that can be easily trained to specific targets and is easier to implement in hardware than software-based algorithms. Furthermore, using the video compression method of the present invention, for a very high bit rate given video quality, a high video compression performance can be achieved while focusing on the perceptual quality of the video stream.
Drawings
FIG. 1 is a flow chart of a prior art video compression method;
FIG. 2 is a schematic diagram of video encoding of a convolutional neural network-based video encoder according to an embodiment of the present invention;
FIG. 3 is a flowchart of a convolutional neural network-based video encoding method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a convolutional neural network-based video encoding device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of video decoding of a convolutional neural network-based video decoder according to an embodiment of the present invention;
FIG. 6 is a flowchart of a convolutional neural network-based video decoding method according to an embodiment of the present invention;
fig. 7 is a schematic diagram illustrating a convolutional neural network-based video decoding device according to an embodiment of the present invention.
Reference numerals:
the coding apparatus 100 is provided with a coding device,
a size scaling module 110, an image compression module 120, a decoding module 130, a residual calculation module 140, a residual generator 150, a residual compression module 160, an encoding output module 170, a residual reconstruction module 180,
the decoding apparatus (200) is provided with a decoding unit,
an entropy decoding module 210, an image decoding module 220, a resizing module 230, a residual decoding module 240, a frame reconstruction module 250.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.
The method flows described in this specification and the steps of the flowcharts in the drawings need not be executed strictly in the order of the step numbers; the execution order of the method steps may be changed. Moreover, some steps may be omitted, multiple steps may be combined into one, and/or a single step may be broken down into multiple steps.
Conventional video coding algorithms have been in use for decades. Although they are still being developed, these methods have serious limitations. First, current implementations are software-centric, which makes hardware acceleration and parallelization difficult. Second, they require manual optimization for a given metric, whether PSNR, SSIM or another, and there is no effective way to guarantee that the optimization succeeds.
In recent years, with the development of new artificial intelligence technologies, techniques represented by deep learning have made great breakthroughs in the field of image processing, and new image compression methods have begun to emerge. Unlike traditional approaches that focus on optimizing a fixed set of complex algorithms/tools, new video compression methods can use convolutional neural networks to optimize the quality of the video. These convolutional neural networks are formed by connecting multiple layers of neurons; weights are applied to the connections between layers, and convolutions are used to propagate information between layers. By using such a convolutional neural network and defining a suitable objective function, the network can be trained for many different optimization targets.
At present, CNN (convolutional neural network) based video coding is far from widespread. Most such techniques are at an experimental stage and are not easily implemented in hardware: their implementations remain software-centric, with no hardware parallelization support when executing the algorithms. Moreover, the speed and quality of these implementations still need to be developed and improved.
The invention provides a novel convolutional neural network-based video encoding method that further improves video compression performance while paying attention to the perceptual quality of the video stream.
As shown in fig. 2 and 3, a convolutional neural network-based video encoding method according to an embodiment of the present invention includes:
S100, creating a scaled reference image from the current frame image and encoding it through an image coding network to obtain a compressed image;
S200, decoding the compressed image through an image decoding network to obtain a decompressed image, and restoring the decompressed image to the original size of the current frame image to obtain a baseline predicted image;
S210, obtaining a baseline residual image from the current frame image and the baseline predicted image;
S220, selecting one or more reconstructed coded images as reference frames, performing motion estimation in an inter-frame coding mode, obtaining a motion vector map between the current frame image and each reference frame, and generating a plurality of reference residual images;
It should be noted that a reference residual image is not simply the difference between the current frame image and the reference frame: it is obtained by performing motion estimation for each block in the current frame image, finding the corresponding position in the reference frame, and subtracting the matched reference block from the current block.
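The motion-compensated reference residual described in this note can be sketched as follows. This is a minimal numpy illustration; the block size, search range and the tiny test frames are illustrative assumptions, not from the patent:

```python
import numpy as np

# For one block of the current frame, search a small window in the reference
# frame for the best-matching position (motion estimation), then subtract the
# matched reference block to form the reference residual block.

def reference_residual_block(cur, ref, top, left, bs=4, search=2):
    block = cur[top:top + bs, left:left + bs]
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + bs <= ref.shape[0] and 0 <= x and x + bs <= ref.shape[1]:
                cand = block - ref[y:y + bs, x:x + bs]
                if best is None or np.abs(cand).sum() < np.abs(best).sum():
                    best, best_mv = cand, (dy, dx)
    return best, best_mv

ref = np.zeros((8, 8))
ref[1:5, 1:5] = 1.0                   # a feature in the reference frame
cur = np.zeros((8, 8))
cur[2:6, 2:6] = 1.0                   # the same feature, shifted by (1, 1)

residual, mv = reference_residual_block(cur, ref, 2, 2)
assert mv == (-1, -1)                 # best match found one pixel up-left in ref
assert np.abs(residual).sum() == 0.0  # perfect match -> zero residual
```

A naive difference without motion estimation would have produced a large residual here even though the content is identical, only displaced.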
S300, generating an optimal residual image from the baseline residual image and the plurality of reference residual images using a residual generator, while obtaining the corresponding motion vector map, which determines, for each block in the optimal residual image, the reference frame or compressed image it is sourced from and the corresponding position;
It should be noted that if the video scene changes drastically (or for an I-frame, where no reference frame is used), the blocks in the current frame image cannot find suitable corresponding blocks in the reference frames using motion estimation. In that case the baseline residual image of step S210 may be selected as the optimal residual image.
If the video is completely still and all of the current frame's information can be obtained from the reference frame, the baseline path, i.e., the compressed image of step S100, is not needed. The compressed image may then be omitted (or an empty compressed image transmitted) to further reduce the bit rate.
Between these two extremes, some blocks in the current frame image cannot find suitable corresponding blocks in the reference frames using motion estimation, and for those blocks the baseline residual image of step S210 is chosen; other blocks can find suitable corresponding positions in a reference frame through motion estimation, and for them the best of the reference residual images of step S220 is chosen.
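The per-block choice among residual candidates can be sketched as follows. This is a hedged numpy illustration of the idea, not the patent's residual generator; the block size, test data and the "minimum-energy" criterion are assumptions:

```python
import numpy as np

# For each block, pick whichever candidate residual image (index 0 = baseline,
# others = reference residuals) has the least energy, and record the source
# index per block -- the analogue of the motion vector map's source field.

def select_optimal_residual(candidates, bs=4):
    h, w = candidates[0].shape
    optimal = np.empty((h, w))
    source_map = np.empty((h // bs, w // bs), dtype=int)
    for by in range(0, h, bs):
        for bx in range(0, w, bs):
            blocks = [c[by:by + bs, bx:bx + bs] for c in candidates]
            best = int(np.argmin([np.abs(b).sum() for b in blocks]))
            optimal[by:by + bs, bx:bx + bs] = blocks[best]
            source_map[by // bs, bx // bs] = best
    return optimal, source_map

baseline = np.full((4, 8), 3.0)               # two 4x4 blocks, energy 48 each
ref = np.zeros((4, 8)); ref[:, 4:] = 5.0      # ref better on the left, worse right

optimal, src = select_optimal_residual([baseline, ref])
assert src.tolist() == [[1, 0]]               # left from ref, right from baseline
assert np.abs(optimal).sum() == 48.0          # mixed result has minimal energy
```

This mirrors the three cases above: blocks with good motion matches take a reference residual, the rest fall back to the baseline residual.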
S400, encoding the optimal residual image through a residual coding network to obtain a residual compressed image;
S500, outputting the compressed image, the residual compressed image and the motion vector map, either after encoding or directly;
The image coding network, the image decoding network, the residual coding network and the residual decoding network mentioned below are all convolutional neural networks.
According to some embodiments of the invention, the optimal residual image is generated by a residual generator based on the baseline residual image and the plurality of reference residual images generated by motion vector scanning. It should be noted that a previously decoded frame may also be subtracted from the current frame to generate a reference residual image. In this case, in order to find the block in the reference frame most similar to a block in the current frame, a motion vector scan is performed: the decoded reference frame is shifted, within the search distance, relative to the previous frame. Where blocks of the shifted reference frame match blocks of the previous frame, the residual is minimal, and the motion vectors of those blocks are thereby found. This scanning process is performed over the entire frame.
In some embodiments of the invention, as shown in fig. 2, the method further comprises: and decoding the residual compressed image through a residual decoding network to obtain a residual decompressed image. It will be appreciated that the retrieved reconstructed frame may be used as a reference frame in the decoding of subsequent frames.
According to some embodiments of the present invention, the compressed image, the residual compressed image and the motion vector map are entropy encoded to obtain, respectively, the original coding, the residual coding and the mapping coding, which are then output.
According to some embodiments of the present invention, the ratio of the size of the scaled reference image to the size of the current frame image is one of: 1/2, 1/4, 1/6, 1/8. It should be noted that reducing the size of the original image reduces the amount of computation in the subsequent image coding and compression, but in order to retain the image information of the original image the size should not be reduced too far. For example, the size of the scaled reference image may be 1/4 of the size of the current frame image; the present invention does not specifically limit the scaled size.
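For a sense of the computational saving, scaling each linear dimension by 1/r reduces the pixel count handled by the image coding network by a factor of r squared. A quick check for the ratios listed, assuming a 1080p frame purely for illustration:

```python
# Pixel-count reduction for the listed scaling ratios: a 1/r linear scale
# means 1/r**2 of the pixels. Frame size 1920x1080 is an assumed example.

width, height = 1920, 1080
for r in (2, 4, 6, 8):
    scaled_pixels = (width // r) * (height // r)
    # the scaled image carries exactly 1/r**2 of the original pixels
    assert scaled_pixels * r * r == width * height

# e.g. 1/4 scaling: 480 x 270 = 129600 pixels, versus 2073600 originally
assert (1920 // 4) * (1080 // 4) == 129600
```

This is why the text warns against scaling too aggressively: the pixel (and information) budget drops quadratically with the ratio.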
As shown in fig. 4, the convolutional neural network-based video encoding apparatus 100 according to an embodiment of the present invention includes: a size scaling module 110, an image compression module 120, a decoding module 130, a residual calculation module 140, a residual generator 150, a residual compression module 160, and an encoding output module 170.
The size scaling module 110 is configured to create a scaled reference image from the current frame image;
the image compression module 120 is configured to encode the scaled reference image through an image coding network to obtain a compressed image;
the decoding module 130 is configured to decode the compressed image through an image decoding network to obtain a decompressed image, which is restored to the original size of the current frame image to obtain a baseline predicted image;
the residual calculation module 140 is configured to obtain a baseline residual image from the current frame image and the baseline predicted image, and to select one or more reconstructed coded images as reference frames, perform motion estimation in an inter-frame coding mode, obtain a motion vector map between the current frame image and each reference frame, and generate a plurality of reference residual images;
the residual generator 150 is configured to generate an optimal residual image based on the baseline residual image and the plurality of reference residual images, and to obtain the corresponding motion vector map, which determines, for each block in the optimal residual image, the reference frame or compressed image it is sourced from and the corresponding position;
the residual compression module 160 is configured to encode the optimal residual image through a residual coding network to obtain a residual compressed image;
the encoding output module 170 is configured to output the compressed image, the residual compressed image and the motion vector map, either after encoding or directly;
wherein the image coding network, the image decoding network and the residual coding network are all convolutional neural networks.
According to some embodiments of the present invention, the residual generator 150 obtains the minimum-residual block from the baseline residual image and the plurality of reference residual images, and generates the optimal residual image based on these minimum-residual blocks. It should be noted that a previously decoded frame may also be subtracted from the current frame to generate a reference residual image. In this case, in order to find the block in the reference frame most similar to a block in the current frame, a motion vector scan is performed: the decoded reference frame is shifted, within the search distance, relative to the previous frame. Where blocks of the shifted reference frame match blocks of the previous frame, the residual is minimal, and the motion vectors of those blocks are thereby found. This scanning process is performed over the entire frame.
In some embodiments of the invention, as shown in fig. 4, the apparatus further comprises: and the residual error reconstruction module 180 is configured to decode the residual error compressed image through a residual error decoding network to obtain a residual error decompressed image. It will be appreciated that the retrieved reconstructed frame may be used as a reference frame in the decoding of subsequent frames.
According to some embodiments of the present invention, the encoding output module is an entropy encoding module configured to entropy encode the compressed image, the residual compressed image and the motion vector map to obtain, respectively, the original coding, the residual coding and the mapping coding, which are then output.
According to some embodiments of the present invention, the ratio of the size of the scaled reference image to the size of the current frame image is one of: 1/2, 1/4, 1/6, 1/8. It should be noted that reducing the size of the original image reduces the amount of computation in the subsequent image coding and compression, but in order to retain the image information of the original image the size should not be reduced too far. For example, the size of the scaled reference image may be 1/4 of the size of the current frame image; the present invention does not specifically limit the scaled size.
As shown in fig. 5 and 6, according to the convolutional neural network-based decoding method of the embodiment of the present invention, the decoding method decodes a video encoded by using the above convolutional neural network-based video encoding method, and the method includes:
D100, entropy decoding the original coding, the residual coding and the mapping coding, respectively;
D210, decoding the entropy-decoded original coding through an image decoding network to obtain a decompressed image, and restoring the decompressed image to its original size;
D220, decoding the entropy-decoded residual coding through a residual decoding network to obtain a residual decompressed image;
D300, performing image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded map and the images of previously decoded frames, completing the decoding and reconstruction of the current frame image.
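The reconstruction step D300 can be sketched as follows, assuming identity coders, a block-level source map and zero motion for simplicity; all names and the tiny test frames are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

# Sketch of reconstruction: the size-restored decompressed image provides the
# baseline prediction; blocks flagged in the decoded map are instead predicted
# from a reference frame; the residual is then added on top.

def upscale(img, f=4):
    return np.kron(img, np.ones((f, f), dtype=img.dtype))

def reconstruct(decompressed, residual, source_map, reference, bs=4):
    baseline = upscale(decompressed)
    out = np.empty_like(baseline)
    for by in range(0, out.shape[0], bs):
        for bx in range(0, out.shape[1], bs):
            if source_map[by // bs, bx // bs] == 0:        # baseline prediction
                pred = baseline[by:by + bs, bx:bx + bs]
            else:                                          # reference-frame prediction
                pred = reference[by:by + bs, bx:bx + bs]
            out[by:by + bs, bx:bx + bs] = pred + residual[by:by + bs, bx:bx + bs]
    return out

decompressed = np.array([[1.0, 2.0]])         # tiny "decoded" low-res image
reference = np.full((4, 8), 7.0)              # previously decoded frame
residual = np.zeros((4, 8))
source_map = np.array([[0, 1]])               # left block baseline, right block ref

out = reconstruct(decompressed, residual, source_map, reference)
assert np.allclose(out[:, :4], 1.0)           # baseline-predicted block
assert np.allclose(out[:, 4:], 7.0)           # reference-predicted block
```

A real decoder would additionally apply per-block motion vectors when fetching the reference prediction, rather than the zero-motion copy assumed here.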
As shown in fig. 7, the convolutional neural network-based decoding apparatus 200 according to an embodiment of the present invention decodes video encoded by the convolutional neural network-based video encoding method described above, and includes: an entropy decoding module 210, an image decoding module 220, a resizing module 230, a residual decoding module 240, and a frame reconstruction module 250.
The entropy decoding module 210 is configured to perform entropy decoding on the original encoding, the residual encoding, and the mapping encoding;
the image decoding module 220 is configured to decode the entropy-decoded original code through an image decoding network to obtain a decompressed image;
the resizing module 230 is configured to restore the decompressed image to its original size;
the residual decoding module 240 is configured to decode the entropy-decoded residual coding through a residual decoding network to obtain a residual decompressed image;
the frame reconstruction module 250 is configured to perform image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded mapping, and the decoded frame images, completing the image decoding and reconstruction of the current frame.
The video coding and decoding method and device based on the convolutional neural network have the following beneficial effects:
the present invention provides an efficient method of video compression and decompression that can be easily trained to specific targets and is easier to implement in hardware than software-based algorithms. Furthermore, using the video compression method of the present invention, for a very high bit rate given video quality, a high video compression performance can be achieved while focusing on the perceptual quality of the video stream.
Hereinafter, a video encoding and decoding method and apparatus based on a convolutional neural network according to the present invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description is only exemplary in nature and should not be taken as a specific limitation on the invention.
The CNN-based video coding method is mainly divided into two components, both realized using convolutional neural networks.
As shown in fig. 2, a scaled reference image is first created. The current implementation suggests a scaling ratio of 1/4; in theory, however, any scaling suitable for the application may be used. The image is then encoded by a convolutional neural network (CNN), which may be implemented by any convolutional neural network method suitable for image compression. For example, one implementation may employ the recurrent neural network (RNN) approach proposed by Toderici. The present invention is not limited to any particular neural network for image compression, and those skilled in the art may use any effective neural network implementation.
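The basic building block of such an image coding network is a strided convolution. The minimal sketch below uses an untrained, purely illustrative 2x2 averaging kernel (a trained network would learn many such kernels across channels):

```python
def conv2d_valid(image, kernel, stride=2):
    """One strided 'valid' 2-D convolution over a single-channel image,
    the elementary operation of a convolutional image coding network."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h - kh + 1, stride):
        row = []
        for x in range(0, w - kw + 1, stride):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

image = [[1.0] * 6 for _ in range(6)]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # illustrative averaging weights
print(conv2d_valid(image, kernel))      # [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

A stride of 2 halves each spatial dimension, which is how successive layers compress the scaled reference image into a compact code.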
The scaled image is then decoded by an inverse (transposed) convolutional neural network. This operation can be understood as the inverse of the image compression network described above, restoring the compressed image to the original image (possibly with errors, whose magnitude depends on the quality of the CNN coding). As with the encoder, the decoding neural network may be implemented by any network that effectively restores image quality. The decoded image is then re-enlarged to full size to produce a decoded image, which is subtracted from the original frame to create a residual image.
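The re-enlargement and subtraction steps can be sketched as follows; nearest-neighbour upscaling and the helper names are illustrative assumptions (the described method leaves the interpolation choice open):

```python
def upscale_nearest(small, ratio):
    """Re-enlarge a decoded low-resolution image by pixel repetition."""
    return [[small[y // ratio][x // ratio]
             for x in range(len(small[0]) * ratio)]
            for y in range(len(small) * ratio)]

def baseline_residual(original, predicted):
    """Subtract the re-enlarged decoded image from the original frame."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, predicted)]

small = [[5.0]]
pred = upscale_nearest(small, 2)      # [[5.0, 5.0], [5.0, 5.0]]
orig = [[6.0, 4.0], [5.0, 5.0]]
print(baseline_residual(orig, pred))  # [[1.0, -1.0], [0.0, 0.0]]
```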
In addition, previously decoded frames may also be subtracted from the current frame to produce a residual. In this case, a motion vector scan is performed to find, for each block of the current frame, the most similar block in the reference frame: the decoded reference frame is shifted relative to the current frame within a search distance. When a block of the shifted reference frame matches a block of the current frame, the residual is minimized and the motion vector of that block is found. This scanning process is performed over the entire frame.
The comparison may be made against a single previous frame, multiple previous frames, or, in some implementations, future frames; the number of frames used for motion vector scanning may be one or more. The residual generator then finds the lowest residual for each block of the frame. A block may be any number of pixels, depending on how a particular implementation is optimized. A map of motion vectors is then created describing which block corresponds to which reference frame, so that the decoder can recreate the image.
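The motion vector scan for one block can be sketched as an exhaustive search minimizing the sum of absolute differences (SAD); the search range, the SAD criterion, and all function names here are illustrative assumptions rather than the claimed implementation:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def get_block(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def best_motion_vector(current, reference, y, x, size, search=1):
    """Shift the reference within +/-`search` pixels of (y, x) and keep
    the offset whose block yields the smallest residual."""
    target = get_block(current, y, x, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= len(reference) - size and 0 <= rx <= len(reference[0]) - size:
                cost = sad(target, get_block(reference, ry, rx, size))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best  # (residual cost, dy, dx)

reference = [[0, 0, 0, 0],
             [0, 9, 9, 0],
             [0, 9, 9, 0],
             [0, 0, 0, 0]]
current = [[9, 9, 0, 0],
           [9, 9, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
print(best_motion_vector(current, reference, 0, 0, 2))  # (0, 1, 1)
```

Repeating this over every block of the frame, and over every candidate reference frame, yields the motion vector map that the decoder later consults.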
The residual image, the scaled image, and the map are entropy encoded using an algorithm such as Huffman coding and then placed in the bitstream to create the compressed output.
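As a sketch of the entropy-coding stage, a minimal Huffman code book builder is shown below; this stands in for whatever entropy coder (Huffman, arithmetic, or none at all) a given implementation chooses, and its symbol model is a deliberate simplification:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code book (symbol -> bit string) for a sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol stream
        return {next(iter(freq)): "0"}
    # heap entries: (count, unique tiebreak, partial code book)
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c0, _, book0 = heapq.heappop(heap)
        c1, _, book1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in book0.items()}
        merged.update({s: "1" + code for s, code in book1.items()})
        heapq.heappush(heap, (c0 + c1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

book = huffman_code("aaaabbc")
bits = "".join(book[s] for s in "aaaabbc")
print(len(bits))  # 10 bits, versus 14 bits at a fixed 2 bits/symbol
```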
Some frames are also reconstructed in the encoder so that they can serve as reference frames for future frames. Since not every frame is needed as a reference during encoding, performance can be optimized by decoding only those frames that will be used as references.
The resulting residual image is compressed through a residual convolutional neural network.
On the decoder side, the already decoded reference frame(s) are used together with the inverse convolutional networks: the scaled image frame is decoded, enlarged back to the original size, the residual image is generated with the residual inverse CNN, and, guided by the motion vector map, the original frame is reconstructed anew.
The invention trains separate convolutional neural networks for the image CNN and the residual CNN. This allows each CNN to be optimized for the type of information it typically processes. It should be noted that the residual network and the image network used in the present invention may create a different set of weights for each path according to different target parameters, such as PSNR or MS-SSIM; by changing their objective functions, these CNNs can be trained to optimize the bitstream for different targets.
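To make the objective-function choice concrete, a PSNR target can be sketched as below (maximizing PSNR is equivalent to minimizing MSE, whereas an MS-SSIM objective would instead weight structural similarity); the 8-bit peak value and helper names are illustrative assumptions:

```python
import math

def mse(a, b):
    """Mean squared error between two equal-sized 2-D images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / n

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    e = mse(a, b)
    return float("inf") if e == 0 else 10.0 * math.log10(peak * peak / e)

ref = [[100.0, 100.0], [100.0, 100.0]]
test = [[100.0, 100.0], [100.0, 110.0]]
# mse = 100/4 = 25, so psnr = 10*log10(255^2/25) ~ 34.15 dB
print(round(psnr(ref, test), 2))  # 34.15
```

A training loop would simply use `-psnr` (or `1 - ms_ssim`) as the loss for the corresponding set of weights.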
It should be noted that the present invention can be implemented using any image convolutional neural network; the particular convolutional network described here is not critical to the invention. The entropy encoder can be implemented in many different ways, or eliminated entirely, with the bitstream output directly from the convolutional neural network.
Motion vector scanning may use many different scanning implementations, including block-based scanning methods, binary tree search methods, or algorithms that look up the motion vector map of a previous frame.
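One such faster alternative to exhaustive scanning is the classic three-step (logarithmic) search, sketched below as a hypothetical pure-Python illustration; the starting step size and helper names are assumptions:

```python
def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def get_block(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def three_step_search(current, reference, y, x, size, step=4):
    """Probe the 8 neighbours at a coarse step, recentre on the best
    candidate, halve the step, and repeat until the step reaches zero."""
    target = get_block(current, y, x, size)
    cy, cx = y, x
    while step >= 1:
        best = (sad(target, get_block(reference, cy, cx, size)), cy, cx)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                ry, rx = cy + dy, cx + dx
                if 0 <= ry <= len(reference) - size and 0 <= rx <= len(reference[0]) - size:
                    cost = sad(target, get_block(reference, ry, rx, size))
                    if cost < best[0]:
                        best = (cost, ry, rx)
        _, cy, cx = best
        step //= 2
    return (cy - y, cx - x)  # motion vector (dy, dx)

# bright 2x2 block at (4,4) in the current frame, at (6,6) in the reference
reference = [[0.0] * 12 for _ in range(12)]
current = [[0.0] * 12 for _ in range(12)]
for i in range(2):
    for j in range(2):
        reference[6 + i][6 + j] = 9.0
        current[4 + i][4 + j] = 9.0
mv = three_step_search(current, reference, 4, 4, 2, step=2)
print(mv)  # (2, 2)
```

Compared with the exhaustive scan, this visits O(log(search range)) rings of candidates per block at the cost of possibly missing the global minimum.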
In summary, the present invention provides an efficient method for compressing video that can be easily trained toward specific targets and is easier to implement in hardware than software-based algorithms. Furthermore, using the video compression method of the present invention, the video quality for a given bit rate can be very high.
While the invention has been described in connection with specific embodiments thereof, it is to be understood from the appended drawings and description that the invention may be embodied in other specific forms without departing from its spirit or scope.

Claims (10)

1. A method for video coding based on a convolutional neural network, comprising:
creating a scaling reference image through the current frame image and coding through an image coding network to obtain a compressed image;
decoding the compressed image through an image decoding network to obtain a decompressed image, and restoring the decompressed image to the original size of the current frame image to obtain a baseline predicted image;
acquiring a baseline residual image from the current frame image and the baseline predicted image;
selecting one or more reconstructed coded images as reference frames, performing motion estimation in an inter-frame coding mode, obtaining a motion vector mapping between the current frame image and each reference frame, and generating a plurality of reference residual images;
generating, by a residual generator, an optimal residual image based on the baseline residual image and the plurality of reference residual images, while obtaining a corresponding motion vector map for determining, for each block in the optimal residual image, the reference frame or the compressed image from which the block derives and the corresponding location;
coding the optimal residual image through a residual coding network to obtain a residual compressed image;
outputting the compressed image, the residual compressed image, and the motion vector map, either after encoding or directly;
wherein the image coding network, the image decoding network and the residual coding network are all convolutional neural networks.
2. The convolutional neural network-based video coding method of claim 1, wherein the optimal residual image is generated by a residual generator based on the baseline residual image and a plurality of the reference residual images generated by a plurality of motion vector scans.
3. The convolutional neural network-based video encoding method of claim 1, wherein the method further comprises: and decoding the residual compressed image through a residual decoding network to obtain a residual decompressed image.
4. The convolutional neural network-based video coding method of claim 1, wherein the original coding, the residual coding, and the mapping coding are obtained and output after entropy coding the compressed image, the residual compressed image, and the motion vector mapping, respectively.
5. A convolutional neural network-based video encoding apparatus, comprising:
a size scaling module for creating a scaled reference image from the current frame image;
the image compression module is used for coding the scaling reference image through an image coding network to obtain a compressed image;
the image decoding module is used for decoding the compressed image through an image decoding network to obtain a decompressed image, the decompressed image being restored to the original size of the current frame image to obtain a baseline predicted image;
the residual calculation module is used for acquiring a baseline residual image from the current frame image and the baseline predicted image; the motion estimation module is used for selecting one or more reconstructed coded images as reference frames, performing motion estimation in an inter-frame coding mode, obtaining a motion vector mapping between the current frame image and each reference frame, and generating a plurality of reference residual images;
a residual generator configured to generate an optimal residual image based on the baseline residual image and the plurality of reference residual images, and obtain a corresponding motion vector map, where the motion vector map is used to determine, for each block in the optimal residual image, the reference frame or the compressed image from which the block derives and the corresponding position;
the residual error compression module is used for coding the optimal residual error image through a residual error coding network to obtain a residual error compressed image;
the coding output module is used for outputting the compressed image, the residual compressed image, and the motion vector mapping, either after encoding or directly;
wherein the image coding network, the image decoding network and the residual coding network are all convolutional neural networks.
6. The convolutional neural network-based video coding device of claim 5, wherein the residual generator obtains a minimum residual block from the baseline residual image and the plurality of reference residual images, and generates the optimal residual image based on the minimum residual block.
7. The convolutional neural network-based video encoding device of claim 5, wherein the device further comprises: and the residual error reconstruction module is used for decoding the residual error compressed image through a residual error decoding network to obtain a residual error decompressed image.
8. The convolutional neural network-based video coding device of claim 5, wherein the coding output module is an entropy coding module, and is configured to perform entropy coding on the compressed image, the residual compressed image, and the motion vector map to obtain and output an original code, a residual code, and a mapping code, respectively.
9. A convolutional neural network-based decoding method for decoding a video encoded by the convolutional neural network-based video encoding method according to any one of claims 1 to 4, the method comprising:
entropy decoding the image coding, the residual coding and the mapping coding respectively;
decoding the entropy-decoded image coding through an image decoding network to obtain a decompressed image, and restoring the decompressed image to its original size;
decoding the residual coding after entropy decoding through a residual decoding network to obtain a residual decompressed image;
and performing image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded mapping, and the decoded frame images, to complete the image decoding and reconstruction of the current frame.
10. A convolutional neural network-based decoding apparatus for decoding a video encoded by the convolutional neural network-based video encoding method according to any one of claims 1 to 4, the apparatus comprising:
an entropy decoding module for entropy decoding the original encoding, the residual encoding and the mapping encoding, respectively;
the image decoding module is used for decoding the original code after entropy decoding through an image decoding network to obtain a decompressed image;
the size adjusting module is used for restoring the decompressed image to its original size;
the residual decoding module is used for decoding the entropy-decoded residual coding through a residual decoding network to obtain a residual decompressed image;
and the frame reconstruction module is used for performing image reconstruction based on the size-restored decompressed image, the residual decompressed image, the decoded mapping, and the decoded frame images, to complete the image decoding and reconstruction of the current frame.
CN202111093964.1A 2021-09-17 2021-09-17 Video coding and decoding method and device based on convolutional neural network Pending CN113949882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111093964.1A CN113949882A (en) 2021-09-17 2021-09-17 Video coding and decoding method and device based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN113949882A true CN113949882A (en) 2022-01-18

Family

ID=79328318

Country Status (1)

Country Link
CN (1) CN113949882A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115760923A (en) * 2022-12-08 2023-03-07 上海人工智能创新中心 Passive non-vision field target real-time positioning and tracking method and system
CN115760923B (en) * 2022-12-08 2024-05-28 上海人工智能创新中心 Passive non-visual field target real-time positioning tracking method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination