CN114466199A - Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard - Google Patents


Info

Publication number
CN114466199A
Authority
CN
China
Prior art keywords
frame
reference frame
image
model
inter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210376248.2A
Other languages
Chinese (zh)
Inventor
蒋先涛
张纪庄
郭咏梅
郭咏阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Kangda Kaineng Medical Technology Co ltd
Original Assignee
Ningbo Kangda Kaineng Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Kangda Kaineng Medical Technology Co ltd filed Critical Ningbo Kangda Kaineng Medical Technology Co ltd
Priority to CN202210376248.2A priority Critical patent/CN114466199A/en
Publication of CN114466199A publication Critical patent/CN114466199A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/58Motion compensation with long-term prediction, i.e. the reference frame for a current frame not being the temporally closest one

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a reference frame generation method and system applicable to the VVC (Versatile Video Coding) standard, relating to the technical field of image processing and comprising the following steps: extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set; training a PredNet model with the training set and updating the parameter weights of the model; acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model; replacing the reference frames in the target area of the frame buffer with the preferred reference frames; obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer; and performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time. The invention inserts the PredNet model into the process of performing motion compensation with the reference frames in the frame buffer, so that the pixel accuracy of the reference frames in the frame buffer relative to the current frame is higher, and a decoded image of higher quality is obtained while the code rate required for decoding is lower.

Description

Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard
Technical Field
The invention relates to the technical field of image processing, in particular to a reference frame generation method and a reference frame generation system applicable to the VVC (Versatile Video Coding) standard.
Background
As people demand video content supporting higher resolution, higher quality and more diversity, the need has arisen for a next-generation video coding standard exceeding the capability of High-Efficiency Video Coding (HEVC). For this reason, MPEG and VCEG jointly formed the Joint Video Experts Team (JVET) and developed the next-generation video coding standard, the Versatile Video Coding (VVC) standard. The compression capability of VVC far exceeds that of the previous-generation HEVC standard: VVC has roughly twice the compression efficiency of HEVC. In particular, in the inter-frame coding process, the smaller the compression distortion of the reference frame and the more similar it is to the actual content of the current frame, the better the inter-frame prediction effect.
Motion compensation is an important technology in the inter-frame coding process and is inherited from the coding methods of the previous generation. Motion compensation is performed by storing decoded reconstructed reference pictures and applying them to the prediction of the currently coded frame. The quality of motion-compensated prediction is highly dependent on the available reference pictures, and the better the motion compensation performs, the lower the bit rate needed to code the prediction error. However, when a video frame contains irregular or large moving blocks, there is little correlation between the current frame and the reference frame. This problem becomes more pronounced at high resolutions such as 4K and 8K.
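As a toy illustration of the mechanism described above — decoded pictures are stored in a buffer and the most recent ones are reused as references for predicting the next frame — the following sketch may help. All names are illustrative; real VVC reference-picture management is far more elaborate:

```python
# Toy sketch of reference-picture handling: decoded frames enter a buffer,
# and the most recently decoded ones serve as references for the next frame.
# All names are illustrative; real VVC buffer management is much richer.

def select_references(decoded_buffer, num_refs):
    """Pick the most recently decoded pictures as candidate references."""
    return decoded_buffer[-num_refs:]

decoded_buffer = []
for frame_id in range(5):
    refs = select_references(decoded_buffer, num_refs=2)
    # ... motion-compensated prediction of frame_id from `refs` goes here ...
    decoded_buffer.append(frame_id)  # reconstructed frame joins the buffer

print(select_references(decoded_buffer, 2))  # → [3, 4]
```

The point of the sketch is only the data flow: prediction quality for each new frame is bounded by whatever the buffer happens to contain.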
Disclosure of Invention
The present invention provides a deep-learning method capable of predicting and generating reference frames under nonlinear motion to solve the above problems. The method forms a new model framework by dynamically training PredNet, obtaining reference frames usable in the frame buffer as training proceeds. Specifically, the reference frame generation method applicable to the VVC coding standard comprises the following steps:
S1: judging whether the inter-frame image corresponding to the current time in the target video segment is the last frame of the video segment; if not, entering step S2, and if so, ending;
S2: extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set;
S3: training a PredNet model with the training set and updating the parameter weights of the model;
S4: acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model;
S5: replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
S6: obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer;
S7: performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time, and returning to step S1.
Further, in the step S1, the target video segment is a video segment obtained by dividing the video decoding process with a preset video length as a dividing point.
Further, the target region is a region formed by continuous reference frames of preset frame numbers corresponding to the current time and the subsequent time in the frame buffer.
Further, in step S7, the motion compensation specifically comprises: acquiring the coding information corresponding to the residual between the predicted frame and the inter-frame image at the current time.
The invention also provides a reference frame generation system applicable to the VVC coding standard, which comprises:
a system control unit, for keeping the system running until the inter-frame image corresponding to the current time in the target video segment reaches the last frame of the video segment;
a model training unit, for extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set, and training the PredNet model with the training set to update the parameter weights of the model;
a model prediction unit, for acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model, and replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
and a motion compensation unit, for obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer, and performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time.
Further, in the system control unit, the target video segment is a video segment obtained by dividing a video decoding process with a preset video length as a division point.
Further, the target region is a region formed by continuous reference frames of preset frame numbers corresponding to the current time and the subsequent time in the frame buffer.
Further, in the motion compensation unit, the motion compensation specifically comprises: acquiring the coding information corresponding to the residual between the predicted frame and the inter-frame image at the current time.
Compared with the prior art, the invention at least has the following beneficial effects:
(1) according to the reference frame generation method and system applicable to the VVC coding standard, based on the relationship between the inter-frame prediction effect and distortion, the PredNet model is inserted into the conventional process of performing motion compensation with the reference frames in the frame buffer, so that the pixel accuracy of the reference frames in the frame buffer relative to the current frame is higher, the code rate required for decoding is lower (the coding efficiency is higher), and a decoded image of higher quality is obtained;
(2) the characteristics of image motion are fully considered: the PredNet model is trained in real time on the inter-frame image at the current time and its parameter weights are updated, so that the model better adapts to the motion characteristics of the inter-frame image of each frame and motion compensation is performed better.
Drawings
FIG. 1 is a diagram of method steps for a reference frame generation method applicable to VVC coding standards;
FIG. 2 is a system diagram of a reference frame generation system that is applicable to the VVC encoding standard;
FIG. 3 is a diagram showing the overall structure of the PredNet model;
fig. 4 is a schematic diagram of motion compensation.
Detailed Description
The following are specific embodiments of the present invention and are further described with reference to the drawings, but the present invention is not limited to these embodiments.
Example one
In the VVC coding standard, inter prediction plays an important role in achieving efficient video coding. To improve coding capability, the challenge is to raise compression efficiency while maintaining video quality. To this end, since the HEVC coding standard, motion-compensated prediction algorithms have been adopted to achieve this goal: a new frame is formed by displacing macroblocks or sub-blocks from other frames according to motion vectors, and those other frames become reference frames. The higher the pixel precision between a good reference frame and the current frame, the smaller the difference between them, the smaller the code rate required to transmit the information corresponding to that difference, and the higher the overall coding efficiency.
However, in the existing inter-frame coding process, taking the VVC coding standard as an example, the inter-frame images reconstructed during decoding of the compressed video are stored in the Decoded Picture Buffer (DPB), and during motion compensation several frames are selected from the DPB to construct a Reference Picture List (RPL) for inter-frame prediction; the smaller the compression distortion of a reference frame and the more similar it is to the content of the current frame, the better the inter-frame prediction effect. The reference frames, however, cannot be selected in a targeted manner according to the motion characteristics of the current frame, which means that when an inter-frame image with a large change in motion characteristics occurs, the compression distortion of the reference frame becomes too large, thereby reducing coding efficiency. Based on this, in order to improve the accuracy of the reference frames, as shown in fig. 1, the present invention provides a reference frame generation method applicable to the VVC coding standard, comprising the steps of:
S1: judging whether the inter-frame image corresponding to the current time in the target video segment is the last frame of the video segment; if not, entering step S2, and if so, ending;
S2: extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set;
S3: training a PredNet model with the training set and updating the parameter weights of the model;
S4: acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model;
S5: replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
S6: obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer;
S7: performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time, and returning to step S1.
The target video segment is obtained by dividing the video decoding process by taking a preset video length as a dividing point.
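The S1-S7 loop above can be sketched in code as follows. The `ToyPredictor` class stands in for the PredNet model and every function name here is hypothetical, so this is only a shape-of-the-algorithm illustration, not the patented implementation:

```python
class ToyPredictor:
    """Stand-in for the PredNet model (illustrative only): it merely blends
    the stored reference frames with the current frame."""
    def __init__(self):
        self.refs = []

    def train(self, refs):                 # S3: "update the parameter weights"
        self.refs = [list(f) for f in refs]

    def predict(self, current):            # S4: produce preferred reference frames
        return [[(a + b) / 2 for a, b in zip(f, current)] for f in self.refs]


def motion_estimate(refs, current):
    """Toy motion estimation: choose the reference closest to the current frame."""
    return min(refs, key=lambda f: sum(abs(a - b) for a, b in zip(f, current)))


def decode_segment(segment, frame_buffer, model, window=2):
    residuals = []
    for current in segment:                        # S1: loop until the last frame
        model.train(frame_buffer[-window:])        # S2+S3: target area as training set
        preferred = model.predict(current)         # S4: preferred reference frames
        frame_buffer[-window:] = preferred         # S5: replace the target area
        pred = motion_estimate(frame_buffer[-window:], current)   # S6: predicted frame
        residuals.append([c - p for c, p in zip(current, pred)])  # S7: residual for coding
        frame_buffer.append(current)               # reconstructed frame enters the buffer
    return residuals


buf = [[0, 0], [2, 2]]
res = decode_segment([[2, 2], [4, 4]], buf, ToyPredictor())
```

Frames are lists of "pixels" here; the essential structure is that model training, buffer replacement and motion compensation all run inside one per-frame loop, matching the closed-loop design described later in the embodiment.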
The invention is explained below through a further decomposition. The overall architecture of the PredNet model is shown in fig. 3: as can be seen from the figure, the left half shows the whole network stacked along two dimensions, time and network layer, while the right half shows the concrete implementation of each module. Each module consists of four units:
A_l: the input convolution layer (conv & pool); for the first layer this is the target image, and for higher layers it is the convolution (followed by ReLU) of the prediction error E of the layer below;
R_l: the convolutional LSTM layer (conv & LSTM);
Â_l: the prediction layer (conv), obtained by convolving the R unit and applying ReLU;
E_l: the error indication layer (ReLU & subtract). To explain this unit further: since it employs the ReLU activation function, any part of the difference between A_l and Â_l (obtained by the subtract) that is less than zero would be zeroed out; therefore A_l and Â_l are subtracted from each other in both directions, the two results are concatenated, and the result is passed through a ReLU layer, i.e. E_l = [ReLU(A_l − Â_l); ReLU(Â_l − A_l)]. E_l is then transmitted to A_(l+1) as the input of the next layer.
The R_l unit accepts as inputs the error E_l^(t−1) of the previous time step, the state R_l^(t−1) of the layer at the previous time step, and the higher-layer prediction feature R_(l+1)^t at the current time step (propagated from top to bottom), and predicts the features of the current time step from these three. The predicted features are convolved by the Â_l unit to obtain a prediction, which is compared with A_l.
The overall loss of the entire model is the weighted sum of the prediction errors over every layer and every time step (the per-layer error weights λ_l and the per-time-step error weights λ_t are determined by experiment). The network state is updated in both the horizontal (time) direction and the vertical (layer) direction, with the vertical direction updated first: the errors E_l of each layer are obtained by forward propagation from bottom to top, and then the states R_l of the RNN units are computed from top to bottom. After the network has been updated at time t, it is updated at time t+1. Thus, for the network at each time t, the inputs are the RNN states R^(t−1) of the previous time and the target image A_0 of the current time.
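The error unit and the overall loss described above can be sketched as follows. Tensor shapes and weight values here are illustrative placeholders; real PredNet units operate on convolutional feature maps:

```python
import numpy as np

def error_unit(a, a_hat):
    """E_l = [ReLU(A_l - Â_l); ReLU(Â_l - A_l)]: concatenating both signed
    differences lets both positive and negative error survive the ReLU."""
    relu = lambda x: np.maximum(x, 0.0)
    return np.concatenate([relu(a - a_hat), relu(a_hat - a)])

def total_loss(errors, layer_w, time_w):
    """Weighted sum of mean absolute prediction error over layers and time
    steps: errors[t][l] is the error tensor of layer l at time step t."""
    return sum(
        time_w[t] * sum(layer_w[l] * e.mean() for l, e in enumerate(per_layer))
        for t, per_layer in enumerate(errors)
    )

a = np.array([1.0, -2.0, 3.0])       # "target" features A_l
a_hat = np.array([0.0, 0.0, 5.0])    # "predicted" features Â_l
e = error_unit(a, a_hat)             # → [1, 0, 0, 0, 2, 2]
```

Note how the under-prediction at the first element and the over-predictions at the other two each appear in exactly one half of the concatenated error vector, which is why the double-subtract-and-concatenate construction is needed at all.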
From the above analysis of the PredNet model, it can be seen that the model transfers the motion deviation between the inter-frame images at successive times well and avoids the prediction misalignment caused by a suddenly appearing large motion deviation. It is precisely this property of the PredNet model that makes it very suitable for use in inter-frame coding. However, for the already established VVC coding standard, how to apply the PredNet model reasonably is the question this invention considers.
Based on the above considerations, let us look at inter-frame coding again. In image transmission technology, moving images, particularly television images, are the main object of interest. A moving image is a temporal image sequence consisting of successive image frames spaced one frame period apart in time, and it has greater correlation in time than in space. In most television images the detail change between adjacent frames is small, i.e. video images have strong inter-frame correlation, and by exploiting this correlation, inter-frame coding can achieve a compression ratio higher than intra-frame coding. For still images or slowly moving images, some frames can be transmitted less frequently, for example every other frame; for the frames not transmitted, the data of the previous frame held in the frame memory at the receiving end are used in their place, with no visible effect, because the human eye requires higher spatial resolution for static or slowly moving parts of the image while its temporal resolution requirements are lower. This method is called the frame repetition method; it is widely applied in videophone and videoconference systems, where the image frame rate is generally 1-15 frames/second.
Predictive coding is used to eliminate the temporal correlation of the image sequence: instead of directly transmitting the pixel value x of the current frame, the difference between x and the corresponding pixel x' of the previous or subsequent frame is transmitted; this is called inter-frame prediction. When there is a moving object in the image, simple prediction cannot achieve a good result. As shown in fig. 4, the current frame has exactly the same background as the previous frame, but the small ball has shifted by one position; if the pixel values of the (k-1)-th frame are simply used as the predicted values of the k-th frame, the prediction error within the circles shown by the solid and dotted lines is not zero. If the direction and speed of the ball's movement are known, its position in the k-th frame can be deduced from its position in the (k-1)-th frame, while the background (ignoring the occluded part) is still replaced by the background of the previous frame; the (k-1)-th frame image corrected for the ball's displacement is then used as the prediction of the k-th frame, which is much more accurate than simple prediction and achieves a higher data compression ratio. This prediction method is called motion-compensated inter-frame prediction.
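The moving-ball example above can be reproduced numerically on a 1-D "frame" (all values are illustrative): simple prediction leaves a residual at both the old and the new ball positions, while motion-compensated prediction cancels it entirely.

```python
# Simple prediction vs. motion-compensated prediction on a 1-D "frame".

def shift(frame, d):
    """Displace frame contents by d samples, exposing background zeros."""
    n = len(frame)
    out = [0] * n
    for i, v in enumerate(frame):
        if 0 <= i + d < n:
            out[i + d] = v
    return out

prev = [0, 9, 0, 0, 0]   # frame k-1: "ball" (value 9) at position 1
curr = [0, 0, 9, 0, 0]   # frame k:   ball moved one position to the right

# Simple prediction (use frame k-1 as-is): error at both ball positions.
simple_residual = [c - p for c, p in zip(curr, prev)]   # [0, -9, 9, 0, 0]

# Motion-compensated prediction: apply the known motion vector (+1) first.
mc_pred = shift(prev, 1)
mc_residual = [c - p for c, p in zip(curr, mc_pred)]    # [0, 0, 0, 0, 0]
```

The all-zero motion-compensated residual is exactly the effect fig. 4 illustrates: knowing the motion makes the prediction error (and hence the bits to transmit) vanish for this toy scene.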
In the VVC coding standard, motion compensation is performed by subtracting the prediction derived from a reference frame from the current frame to form a residual; the residual is transformed and encoded to carry the information required by the decoder and is output to the decoder, thereby implementing inter-frame coding during decoding. Therefore, if the residual between the reference frame and the current frame can be reduced, the efficiency of inter-frame coding can be greatly improved. The invention therefore processes the current frame with the PredNet model, transferring the error between the inter-frame images at successive times frame by frame, thereby obtaining preferred reference frames of higher precision; the reference frames in the original frame buffer are replaced by these preferred reference frames, which are then used for motion compensation, improving both coding efficiency and coding quality.
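The residual round-trip just described can be sketched as follows (names are illustrative, and transform/entropy coding are omitted): the encoder transmits current minus prediction, and the decoder adds the residual back to its own identical prediction.

```python
# Residual round-trip: encoder sends (current - prediction); the decoder,
# holding the same prediction, reconstructs the frame by adding it back.

def encode_residual(current, prediction):
    return [c - p for c, p in zip(current, prediction)]

def reconstruct(prediction, residual):
    return [p + r for p, r in zip(prediction, residual)]

prediction = [10, 12, 14]            # motion-compensated prediction
current = [11, 12, 13]               # actual frame at the encoder
residual = encode_residual(current, prediction)   # [1, 0, -1] is transmitted
assert reconstruct(prediction, residual) == current
```

A more accurate prediction (for example, one built from a better reference frame) shrinks the residual magnitudes and hence the bits needed to code them, which is the lever the invention pulls.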
It should be noted that adding a new frame image to the frame buffer would greatly complicate VVC decoding control. Therefore, in the present invention, the reference frames in the target area of the frame buffer taken with the current time as reference (the area formed by the consecutive reference frames of the preset frame number corresponding to the current time and subsequent times in the frame buffer) are replaced by the preferred reference frames, thereby avoiding the high computational demand that complicated decoding control would cause.
Further, because of the characteristics of image motion (for non-smooth motion images), a PredNet model with fixed parameter weights may not be well suited to every motion scene. Therefore, on the basis of the above, before the current frame is processed by the PredNet model, the reference frames in the target area of the frame buffer taken with the current time as reference are extracted as a training set, and the PredNet model is trained on this set to update its parameter weights. After the current frame has been processed by the PredNet model, the reference frames are updated as well, so that the method provided by the invention forms a closed-loop architecture centered on the current frame.
In particular, considering the complexity of decoding control, when reference frames are extracted to train the PredNet model, and when the PredNet model obtains the preferred reference frames from the current inter-frame image and replaces the reference frames in the frame buffer, the number of frames extracted and the number replaced are kept equal: both are the consecutive reference frames of the preset frame number corresponding to the current time and subsequent times in the frame buffer.
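The fixed-size replacement rule above can be sketched as follows (names are illustrative): because the number of frames extracted for training equals the number replaced, the buffer length seen by decoding control never changes.

```python
# In-place replacement of the target area: the buffer never grows or
# shrinks, so decoding control is unaffected. Names are illustrative.

def replace_target_area(frame_buffer, preferred, window):
    assert len(preferred) == window, "extracted and replaced counts must match"
    size_before = len(frame_buffer)
    frame_buffer[-window:] = preferred     # in-place replacement, no insertion
    assert len(frame_buffer) == size_before
    return frame_buffer

buf = ["r0", "r1", "r2", "r3"]
replace_target_area(buf, ["p2", "p3"], window=2)
# buf is now ["r0", "r1", "p2", "p3"]
```

The in-place slice assignment is the whole point: inserting an extra frame instead would change the buffer size and trigger the complicated decoding-control path the invention avoids.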
Example two
In order to better understand the technical points of the present invention, this embodiment explains the present invention in the form of a system structure, as shown in fig. 2, a reference frame generating system applicable to the VVC coding standard includes:
a system control unit, for keeping the system running until the inter-frame image corresponding to the current time in the target video segment reaches the last frame of the video segment;
a model training unit, for extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set, and training the PredNet model with the training set to update the parameter weights of the model;
a model prediction unit, for acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model, and replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
and a motion compensation unit, for obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer, and performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time.
Further, in the system control unit, the target video segment is a video segment obtained by dividing the video decoding process with a preset video length as a division point.
Further, the target region is a region formed by continuous reference frames of preset frame numbers corresponding to the current time and the subsequent time in the frame buffer.
Further, in the motion compensation unit, the motion compensation specifically includes: and acquiring coding information corresponding to a residual error between the prediction frame and the inter-frame image at the current moment.
EXAMPLE III
In order to better demonstrate the feasibility of the invention, this embodiment verifies it with a specific set of data. To evaluate the performance of the proposed algorithm, it was implemented in the VVC reference software VTM-5.0, and its performance was verified by comparing the Bjøntegaard delta rate (BD-rate) of the proposed algorithm against the VVC reference software, using the Low Delay (LD) configuration, Class C and Class D test video sequences, a group-of-pictures (GOP) size of 5, and dynamic PredNet training in units of one GOP.
As can be seen from the experimental results in Table 1, the Y, U and V columns compare the bit rate under our algorithm with that of VTM. Overall, the proposed method improves coding efficiency by 0.36%, 2.06% and 1.48% for the Y, U and V components respectively; in particular, for the Class C sequences the mean improvements for Y, U and V are 0.58%, 1.7% and 0.49% respectively.
Table 1:
(Table 1 is reproduced as an image in the original publication; it lists the BD-rate comparison between the proposed algorithm and VTM-5.0 for the Y, U and V components of the test sequences.)
in summary, the reference frame generation method and system applicable to the VVC coding standard according to the present invention insert the PredNet model into the conventional motion compensation process using the reference frame in the frame buffer based on the relationship between the inter-frame prediction effect and the distortion, so that the pixel accuracy of the reference frame and the current frame in the frame buffer is higher, and thus the code rate required for decoding is lower (the coding efficiency is higher), and a decoded image with higher quality is obtained.
The characteristics of image motion are fully considered, the PredNet model is trained in real time through the interframe image at the current moment, and the parameter weight of the PredNet model is updated, so that the PredNet model can better adapt to the motion characteristics of the interframe image of each frame, and the motion compensation is better performed.
It should be noted that all the directional indicators (such as up, down, left, right, front, and rear … …) in the embodiment of the present invention are only used to explain the relative position relationship between the components, the movement situation, etc. in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indicator is changed accordingly.
Furthermore, descriptions in the present invention referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.

Claims (8)

1. A method for generating a reference frame applicable to a VVC coding standard, comprising the steps of:
S1: judging whether the inter-frame image corresponding to the current time in the target video segment is the last frame of the video segment; if not, entering step S2, and if so, ending;
S2: extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set;
S3: training a PredNet model with the training set and updating the parameter weights of the model;
S4: acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model;
S5: replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
S6: obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer;
S7: performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time, and returning to step S1.
2. The method as claimed in claim 1, wherein in step S1, the target video segment is a video segment obtained by dividing a video decoding process with a preset video length as a dividing point.
3. The method as claimed in claim 1, wherein the target region is a region formed by consecutive reference frames corresponding to a predetermined number of frames in the frame buffer at a current time and a subsequent time.
4. The method as claimed in claim 1, wherein in step S7 the motion compensation specifically comprises: acquiring the coding information corresponding to the residual between the predicted frame and the inter-frame image at the current time.
5. A reference frame generation system applicable to a VVC coding standard, comprising:
a system control unit, for keeping the system running until the inter-frame image corresponding to the current time in the target video segment reaches the last frame of the video segment;
a model training unit, for extracting the reference frames in the target area of the frame buffer, taken with the current time as reference, as a training set, and training the PredNet model with the training set to update the parameter weights of the model;
a model prediction unit, for acquiring the preferred reference frames obtained after inputting the inter-frame image at the current time into the PredNet model, and replacing the reference frames in the target area of the frame buffer with the preferred reference frames;
and a motion compensation unit, for obtaining a predicted frame for the inter-frame image at the current time by performing motion estimation on the reference frames in the target area of the frame buffer, and performing motion compensation in the video decoding process based on the predicted frame and the inter-frame image at the current time.
6. The system of claim 5, wherein the target video segment is a segment obtained by dividing the video to be decoded at intervals of a predetermined video length.
7. The system of claim 5, wherein the target region consists of a predetermined number of consecutive reference frames in the frame buffer at and after the current time.
8. The system of claim 5, wherein the motion compensation unit performs motion compensation by acquiring coding information corresponding to the residual between the predicted frame and the inter-frame image at the current time.
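The per-frame loop recited in claims 1-8 (train PredNet on the buffered references, substitute the predicted preferred reference, then code the residual between the predicted frame and the current inter-frame image) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: `MockPredNet`, `decode_segment`, and the `window` parameter are hypothetical names, the toy model simply averages frames so the loop is runnable, and motion estimation is elided.

```python
import numpy as np

class MockPredNet:
    """Hypothetical stand-in for the PredNet video-prediction network.
    The claims fix no API; this toy averages the current inter-frame
    image with the most recent training frame so the loop executes."""

    def fit(self, frames):
        # Dynamic-learning step (S3): "update weights" on the reference
        # frames extracted from the target region of the frame buffer.
        self.last = frames[-1].astype(np.float32)

    def predict(self, frame):
        # S4: produce the preferred reference frame for this inter frame.
        return (frame.astype(np.float32) + self.last) / 2.0

def decode_segment(inter_frames, frame_buffer, window=3):
    """Sketch of the claimed loop: train on the target-region references
    (S2-S3), replace the buffered reference with the preferred one
    (S4-S5), then motion-compensate via the residual between predicted
    frame and current inter-frame image (S6-S7, claim 4)."""
    model = MockPredNet()
    residuals = []
    for frame in inter_frames:
        training_set = frame_buffer[-window:]   # target region of the buffer
        model.fit(training_set)                 # update parameter weights
        preferred = model.predict(frame)        # preferred reference frame
        frame_buffer[-1] = preferred            # replace buffer reference
        predicted = preferred                   # motion estimation elided
        residuals.append(frame.astype(np.float32) - predicted)
        frame_buffer.append(frame)              # decoded frame re-enters buffer
    return residuals
```

In a real decoder the residuals would be entropy-coded; here they are returned directly so the effect of the substituted reference can be inspected.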
CN202210376248.2A 2022-04-12 2022-04-12 Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard Pending CN114466199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210376248.2A CN114466199A (en) 2022-04-12 2022-04-12 Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard


Publications (1)

Publication Number Publication Date
CN114466199A true 2022-05-10

Family

ID=81417445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210376248.2A Pending CN114466199A (en) 2022-04-12 2022-04-12 Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard

Country Status (1)

Country Link
CN (1) CN114466199A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100061461A1 (en) * 2008-09-11 2010-03-11 On2 Technologies Inc. System and method for video encoding using constructed reference frame
US20110200113A1 (en) * 2008-10-16 2011-08-18 Sk Telecom. Co., Ltd. Method and apparatus for generating a reference frame and method and apparatus for encoding/decoding image using the same
CN108989799A (en) * 2017-06-02 2018-12-11 阿里巴巴集团控股有限公司 A kind of selection method, device and the electronic equipment of coding unit reference frame
JP2018201117A (en) * 2017-05-26 2018-12-20 日本電気株式会社 Video encoder, video encoding method and program
US20190306526A1 (en) * 2018-04-03 2019-10-03 Electronics And Telecommunications Research Institute Inter-prediction method and apparatus using reference frame generated based on deep learning
CN111860975A (en) * 2020-06-30 2020-10-30 中国地质大学(武汉) Rainfall approaching prediction method based on generation countermeasure network
CN112333451A (en) * 2020-11-03 2021-02-05 中山大学 Intra-frame prediction method based on generation countermeasure network
CN113132735A (en) * 2019-12-30 2021-07-16 北京大学 Video coding method based on video frame generation
CN113810715A (en) * 2021-08-18 2021-12-17 南京航空航天大学 Video compression reference image generation method based on void convolutional neural network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Takafumi Katayama, et al.: "Reference Frame Generation Algorithm using Dynamical Learning PredNet for VVC", 2021 IEEE International Conference on Consumer Electronics (ICCE) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665415A (en) * 2022-10-27 2023-01-31 华医数字(湖北)医疗技术股份有限公司 Perception-based interframe image coding rate distortion optimization method and system
CN115665415B (en) * 2022-10-27 2023-09-29 华医数字(湖北)医疗技术股份有限公司 Inter-frame image coding rate distortion optimization method and system based on perception

Similar Documents

Publication Publication Date Title
AU2002324085C1 (en) Method for sub-pixel value interpolation
US6625320B1 (en) Transcoding
KR101344425B1 (en) Multi-view image coding method, multi-view image decoding method, multi-view image coding device, multi-view image decoding device, multi-view image coding program, and multi-view image decoding program
JP4373702B2 (en) Moving picture encoding apparatus, moving picture decoding apparatus, moving picture encoding method, moving picture decoding method, moving picture encoding program, and moving picture decoding program
US6560284B1 (en) Video coder/decoder
AU2002324085A1 (en) Method for sub-pixel value interpolation
JP2006129436A (en) Non-integer pixel sharing for video encoding
KR100534192B1 (en) Motion-compensated predictive image encoding and decoding
WO2012098845A1 (en) Image encoding method, image encoding device, image decoding method, and image decoding device
JP2011061302A (en) Moving picture encoder
CN114466199A (en) 2022-04-12 2022-04-12 Reference frame generation method and system applicable to VVC (Versatile Video Coding) coding standard
JPH10191393A (en) Multi-view-point image coder
JP3086585B2 (en) Stereo image coding device
EP2014096A1 (en) Method and apparatus for encoding multi-view moving pictures
CN107409211A (en) A kind of video coding-decoding method and device
JPH05308628A (en) Moving image encoder
WO2008097104A1 (en) Method for pixel prediction with low complexity
GB2379820A (en) Interpolating values for sub-pixels
CN111556314A (en) Computer image processing method
KR100242649B1 (en) Method for estimating motion using mesh structure
AU2007237319B2 (en) Method for sub-pixel value interpolation
KR100928272B1 (en) Motion estimation method and apparatus for video coding
JPH09261661A (en) Method for forming bidirectional coding picture from two reference pictures
JP2006203280A (en) Image decoding method and apparatus
JP2018050150A (en) Image coding device, image coding method, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220510