CN113573076A - Method and apparatus for video encoding - Google Patents

Method and apparatus for video encoding Download PDF

Info

Publication number
CN113573076A
Authority
CN
China
Prior art keywords
frame
video
video frame
reconstructed
reference frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010358452.2A
Other languages
Chinese (zh)
Inventor
刘家瑛 (Liu Jiaying)
王晶 (Wang Jing)
胡煜章 (Hu Yuzhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Huawei Technologies Co Ltd
Original Assignee
Peking University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Huawei Technologies Co Ltd
Priority to CN202010358452.2A
Publication of CN113573076A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H04N 19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application relates to video coding technology in the field of artificial intelligence, and provides a method and an apparatus for video coding that can generate more accurate prediction frames and thereby improve video coding efficiency. The video coding method includes the following steps: acquiring reconstructed frames of a fixed number of second video frames that precede a first video frame to be encoded in a video sequence; generating a synthesized reference frame of the first video frame from the reconstructed frames of the second video frames and a global long-term memory of the video sequence, wherein the global long-term memory is determined from the reconstructed frame and the synthesized reference frame of each of a plurality of video frames preceding the first video frame in the video sequence; and encoding the first video frame according to the synthesized reference frame of the first video frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

Description

Method and apparatus for video encoding
Technical Field
The present application relates to video coding techniques in the field of artificial intelligence, and more particularly, to a method and apparatus for video coding.
Background
As users' demand for high-quality video grows, new-generation video coding standards such as High Efficiency Video Coding (HEVC) have been proposed. Video consists of consecutive video frames, and there is strong correlation between adjacent frames. It is this temporal correlation that inter-frame prediction in mainstream video coding standards exploits to achieve data compression: an encoded reconstructed frame is placed into a reference frame list and used as a reference for encoding the next frame, and temporal redundancy between adjacent frames is removed by block-level motion estimation. However, this conventional reference mechanism has drawbacks. For example, because of its block-level linear motion search, conventional inter-frame prediction has difficulty characterizing irregular motion patterns such as rotational motion, which prevents further improvement of video compression efficiency. Therefore, improving the inter-frame prediction mechanism so that it can better handle complex motion between frames is of great significance for the further development of encoders.
In recent years, deep learning techniques have developed rapidly and shown good performance in the field of computer vision, and there have accordingly been efforts to improve inter-frame prediction with deep learning. Video encoders support a low-delay encoding configuration, in which the coding order of the frames matches their numbering in the video sequence, i.e., video frames are encoded sequentially from front to back. This configuration is suitable for scenarios such as live streaming. Because of this coding structure, there is strong temporal continuity across the sequence of frames to be encoded. One existing method therefore uses deep learning to improve video coding efficiency by feeding several already-encoded frames into a neural network, outputting a predicted frame for the next frame, and using that prediction as an additional reference frame when encoding the next frame.
However, in this method only a fixed number of frames are input to the neural network when the prediction frame is generated, and the information contained in the remaining encoded frames is not used. How to further improve video coding efficiency is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The present application provides a method and an apparatus for video coding that can generate more accurate prediction frames and thereby further improve video coding efficiency.
In a first aspect, a method for video coding is provided, the method comprising:
acquiring reconstructed frames of a fixed number of second video frames that precede a first video frame to be encoded in a video sequence;
generating a synthesized reference frame of the first video frame from the reconstructed frames of the second video frames and a global long-term memory of the video sequence, wherein the global long-term memory is determined from the reconstructed frame and the synthesized reference frame of each of a plurality of video frames preceding the first video frame in the video sequence;
encoding the first video frame according to the synthesized reference frame of the first video frame.
Thus, the embodiments of the present application generate a synthesized reference frame of a first video frame to be encoded in a video sequence from the reconstructed frames of a fixed number of second video frames preceding the first video frame and from the global long-term memory of the video sequence, and then encode the first video frame according to the synthesized reference frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.
The fixed number of second video frames preceding the first video frame may be, for example, the two video frames immediately preceding the first video frame, the three video frames immediately preceding it, or only the video frame immediately preceding it; this is not limited in the embodiments of the present application. The reconstructed frames of the fixed number of second video frames preceding the first video frame contain the short-term temporal information available before the first video frame is encoded.
The global long-term memory of the video sequence may be determined from a plurality of already-encoded video frames in the video sequence, for example from the reconstructed frame and the synthesized reference frame of each of those video frames. The plurality of video frames may be all of the encoded video frames in the video sequence or only part of them; this is not limited in the embodiments of the present application.
Since the reconstructed frames of the fixed number of second video frames carry the strong temporal correlation between adjacent video frames, and the global long-term memory carries the long-term temporal information of the video sequence, generating a synthesized reference frame from both and using it as a reference when the first video frame is encoded helps give the synthesized reference frame the capability of describing complex motion (such as nonlinear motion and rotational motion) within a video sequence.
It should be noted that the video coding method of the embodiments of the present application may be applied in a low-delay coding configuration, in which video frames are encoded sequentially, i.e., in the same order as their frame numbers. For example, before the t-th video frame (t > 2) is encoded, the (t-1)-th and (t-2)-th frames have already been encoded, so their reconstructed frames can be obtained before the t-th video frame is encoded. Consequently, the embodiments of the present application can advance the global long-term memory in step with the encoding process.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes:
acquiring a reconstructed frame of the first video frame;
and updating the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame. Here, the difference may be regarded as the error produced when the first video frame was encoded.
Thus, by updating the global long-term memory in real time according to the difference between the reconstructed frame and the synthesized reference frame of a video frame, the embodiments of the present application can encode the sequence of video frames iteratively and pass long-term temporal information on continuously, so that the synthesized reference frame of each video frame in the sequence is able to describe complex motion within the video sequence, which further improves coding performance. For example, the global long-term memory can be updated dynamically from the encoding of the first frame through the encoding of the last frame, so that temporal information is fully exploited, more accurate prediction frames are generated, and video coding efficiency is improved.
In some embodiments, the difference may be obtained by a module or unit inside the reference frame generation module, which updates the long-term memory according to the difference. In other embodiments, the difference may be obtained by a dedicated module or unit, for example a memory update module, which updates the long-term memory according to the difference; this is not limited in the embodiments of the present application.
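For illustration only, the following minimal Python sketch shows one way the difference could be computed and folded back into the global long-term memory after each frame is encoded; the callable name `memory_updater` and the argument names are assumptions introduced here and are not part of the claims.

```python
# Hedged sketch: names and tensor semantics are assumptions for illustration,
# not the claimed implementation.
def update_global_memory(memory_updater, global_memory, reconstructed_frame, synthesized_reference):
    # Difference between the reconstructed frame and the synthesized reference frame,
    # i.e. the error produced when the current frame was encoded.
    residual = reconstructed_frame - synthesized_reference
    # The memory updater (e.g. a small recurrent network) folds this error into the
    # global long-term memory so later synthesized reference frames can account for it.
    return memory_updater(residual, global_memory)
```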
With reference to the first aspect, in certain implementations of the first aspect, the method further includes:
extracting feature information from the reconstructed frames of the second video frames;
wherein generating the synthesized reference frame of the first video frame from the reconstructed frames of the second video frames and the global long-term memory of the video sequence comprises:
inputting the feature information of the reconstructed frames of the second video frames and the global long-term memory into a reference frame generation network model, and obtaining the synthesized reference frame of the first video frame, wherein the reference frame generation network model is trained on a training data sample set, the training data sample set comprises a plurality of video sequence samples, and each video sequence sample comprises lossless video frames and video frames obtained by lossy compression of the lossless video frames.
Thus, through deep learning, the embodiments of the present application take the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in a video sequence, together with the global long-term memory of the video sequence, as the input of a neural network model, output the synthesized reference frame of the first video frame, and then encode the first video frame according to the synthesized reference frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.
With reference to the first aspect, in certain implementations of the first aspect, inputting the feature information of the reconstructed frames of the second video frames and the global long-term memory into the reference frame generation network model and obtaining the synthesized reference frame of the first video frame includes:
obtaining the synthesized reference frame of the first video frame according to the following formula:

$$\hat{I}_t = \sum_{i} M_i \odot \left( K_i \circledast I_i \right)$$

where $\hat{I}_t$ denotes the synthesized reference frame of the first video frame, $t$ denotes the frame number of the first video frame, $i$ denotes the frame number of a second video frame preceding the first video frame, $I_i$ denotes the reconstructed frame of the $i$-th video frame, $\circledast$ denotes the local convolution operation in deep learning, $K_i$ denotes the set of convolution kernel coefficients of all pixels in the first video frame, $\odot$ denotes the pixel-level dot-product operation used to perform a pixel-level weighted addition of the results of locally convolving the input frames, with weight matrix $M_i$, $\varepsilon_t$ denotes the global long-term memory when the first video frame is encoded, $\gamma_t$ denotes the feature information of the reconstructed frames of the second video frames, $t$ and $i$ are positive integers, and $t > 2$.
Therefore, in the embodiments of the present application a convolution kernel is generated for each pixel individually, i.e., local convolution. This has stronger expressive power than a regression mode that uses the same kernel for all pixels, achieves a better regression effect, and thereby helps generate a more accurate prediction frame and further improves video coding efficiency. Here, regression refers to the process of generating the synthesized reference frame.
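For clarity, the per-pixel regression can be written out elementwise as below; the kernel support $\Omega$ and the per-pixel kernel notation $K_i^{(x,y)}$ are introduced here purely for illustration and do not appear in the original formula.

$$\hat{I}_t(x,y) = \sum_{i} M_i(x,y) \sum_{(m,n)\in\Omega} K_i^{(x,y)}(m,n)\, I_i(x+m,\, y+n)$$

That is, every output pixel $(x,y)$ is regressed with its own kernel $K_i^{(x,y)}$ predicted by the network, rather than with a single kernel shared by all pixels.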
With reference to the first aspect, in certain implementations of the first aspect, the second video frames are the two video frames immediately preceding the first video frame, where the first video frame is the third video frame in the video sequence or a video frame after the third video frame.
In some embodiments, when the first and second video frames of the video sequence are encoded, the number of reconstructed frames available as input to the reference frame generation module is insufficient, so the reference frame generation module may not generate a synthesized reference frame for the first or the second video frame.
Then, for example starting from the fourth video frame, the reference frame generation module and the encoder module may enter normal operation: the reference frame generation module keeps generating a synthesized reference frame from the long-term memory and the reconstructed frames of the two preceding video frames, and the encoder module reads the synthesized reference frame and generates the reconstructed frame of the currently encoded video frame. Optionally, the encoder module may obtain the difference between the reconstructed frame and the synthesized reference frame of the video frame, so that the reference frame generation module can update the long-term memory according to that difference. This process may be repeated until all video frames in the video sequence have been encoded.
With reference to the first aspect, in certain implementations of the first aspect, the global long-term memory is 0 when the first video frame is the third video frame in the video sequence. That is, the global long-term memory may be set to 0 before the third video frame is encoded.
With reference to the first aspect, in certain implementations of the first aspect, the encoding the first video frame according to a synthesized reference frame of the first video frame includes:
acquiring a reference frame list of the first video frame, wherein the reference frame list comprises reconstructed frames of at least two video frames that have been encoded;
removing, from the reference frame list, the reconstructed frame whose frame number differs most from the frame number of the first video frame, and adding the synthesized reference frame of the first video frame at the position of the removed reconstructed frame;
encoding the first video frame according to the reference frame list.
Therefore, because the reconstructed frame whose frame number differs most from that of the first video frame has the weakest temporal correlation with the first video frame, removing that reconstructed frame from the reference frame list and adding the synthesized reference frame of the first video frame in its place, to be used as a reference in the current encoding process, exploits the temporal correlation between different frames and achieves a better compression effect.
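A minimal Python sketch of this list-management rule is given below; the representation of the list as (frame number, frame) pairs is an illustrative assumption, not the encoder's actual internal data structure.

```python
def insert_synthesized_reference(reference_list, current_frame_number, synthesized_frame):
    """Replace the temporally most distant reconstructed frame with the synthesized reference frame."""
    # Index of the entry whose frame number differs most from the current frame number;
    # this entry is assumed to have the weakest temporal correlation with the current frame.
    farthest = max(range(len(reference_list)),
                   key=lambda k: abs(current_frame_number - reference_list[k][0]))
    # Put the synthesized reference frame at the position of the removed reconstructed frame.
    reference_list[farthest] = (current_frame_number, synthesized_frame)
    return reference_list
```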
In a second aspect, there is provided a video coding apparatus configured to perform the method of the first aspect or any possible implementation manner of the first aspect, and in particular, the apparatus includes a module configured to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a third aspect, an apparatus for video coding is provided, including a memory and a processor. The memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory; when the processor executes the instructions stored in the memory, the apparatus for video coding is caused to perform the method of the first aspect or of any possible implementation manner of the first aspect.
In a fourth aspect, there is provided a computer readable medium for storing a computer program comprising instructions for carrying out the method of the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, a computer program product comprising instructions is provided, which when run on a computer, causes the computer to perform the first aspect or any one of the possible implementations of the first aspect.
It should be understood that, for the beneficial effects achieved by the second to fifth aspects and their corresponding implementation manners, reference is made to the beneficial effects achieved by the first aspect and its corresponding implementation manners; they are not repeated here.
Drawings
Fig. 1 shows a schematic block diagram of a system for video encoding provided by an embodiment of the present application.
Fig. 2 shows a schematic flow chart of a method for video encoding provided by an embodiment of the present application.
Fig. 3 shows a specific example of encoding a tth frame video frame.
Fig. 4 shows a specific example of video coding provided by the embodiment of the present application.
Fig. 5 shows an example of a PSNR curve of a scheme of video encoding of an embodiment of the present application.
Fig. 6 shows a schematic block diagram of an apparatus for video encoding according to an embodiment of the present application.
Fig. 7 is a schematic block diagram illustrating another apparatus for video encoding according to an embodiment of the present disclosure.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic block diagram of a system 100 for video encoding provided by an embodiment of the present application. For example, the system 100 may be disposed on an intelligent video storage and playback device, such as an electronic product (mobile phone, television, computer, etc.) with a camera function, a video playback function, or a video storage function. As shown in fig. 1, the system 100 includes a reference frame generation module 110 and an encoder module 120. For example, the reference frame generation module 110 may be a long short-term memory (LSTM) network, and the encoder module 120 may be an HEVC video encoder.
The reference frame generation module 110 is configured to generate a synthesized reference frame of a first video frame to be encoded from already-encoded video frames in a video sequence. Here, the already-encoded video frames include all or part of the video frames encoded before the first video frame. For example, besides the fixed number of already-encoded video frames immediately preceding the first video frame, they may also include video frames encoded earlier than those; this is not limited in the embodiments of the present application.
In some embodiments, the information derived from the encoded video frames may include the reconstructed frames of a fixed number of second video frames preceding the first video frame currently to be encoded, and a global long-term memory (which may also be referred to as the long-term memory, the global memory, etc., without limitation) of the video sequence. A reconstructed frame is the video frame obtained when the encoder-side device (e.g., the encoder module 120 in the system 100) simulates the decoder-side device to recover the video frame.
The fixed number of second video frames preceding the first video frame may be, for example, the two video frames immediately preceding the first video frame, the three video frames immediately preceding it, or only the video frame immediately preceding it; this is not limited in the embodiments of the present application. The reconstructed frames of the fixed number of second video frames preceding the first video frame contain the short-term temporal information available before the first video frame is encoded.
The global long-term memory of the video sequence may be determined from a plurality of already-encoded video frames in the video sequence, for example from the reconstructed frame and the synthesized reference frame of each of those video frames. The plurality of video frames may be all of the encoded video frames in the video sequence or only part of them; this is not limited in the embodiments of the present application.
For example, when the video frame currently to be encoded is the first video frame, the global long-term memory input to the reference frame generation module 110 may be determined from the reconstructed frame and the synthesized reference frame of each of a plurality of (e.g., all) video frames preceding the first video frame in the video sequence.
Since the reconstructed frames of the fixed number of second video frames carry the strong temporal correlation between adjacent video frames, and the global long-term memory carries the long-term temporal information of the video sequence, generating a synthesized reference frame from both and using it as a reference when the first video frame is encoded helps give the synthesized reference frame the capability of describing complex motion (such as nonlinear motion and rotational motion) within a video sequence.
In the embodiment of the present application, the synthesized reference frame may also be referred to as a reference frame, a predicted frame, and the like, which is not limited in the embodiment of the present application.
The encoder module 120 is configured to use the synthesized reference frame of the first video frame as an additional reference frame for the encoding process of the first video frame to encode the first video frame.
Thus, the embodiments of the present application generate a synthesized reference frame of a first video frame to be encoded in a video sequence from the reconstructed frames of a fixed number of second video frames preceding the first video frame and from the global long-term memory of the video sequence, and then encode the first video frame according to the synthesized reference frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.
Optionally, the encoder module 120 may further obtain a reconstructed frame of the first video frame, for example, simulate a decoding-end device to recover the first video frame, so as to obtain the reconstructed frame of the first video frame.
In some embodiments, the system 100 may maintain the global long term memory. For example, after encoding of a video frame is completed, the global long-term memory is updated based on the reconstructed frame of the video frame. For example, after obtaining the reconstructed frame of the first video frame, a difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame may be determined, and the global long-term memory may be updated according to the difference. The updated global long-term memory may be used in the generation of a synthesized reference frame for a video frame next to the first video frame. Specifically, the generation process and the encoding process of the synthesized reference frame of the next frame are similar to the generation process and the encoding process of the synthesized reference frame of the first video frame, and are not repeated again.
Thus, by updating the global long-term memory in real time according to the difference between the reconstructed frame and the synthesized reference frame of a video frame, the sequence of video frames can be encoded iteratively and long-term temporal information can be passed on continuously, so that the synthesized reference frame of each video frame in the video sequence is able to describe complex motion within the video sequence, which further improves coding performance. For example, the global long-term memory can be updated dynamically from the encoding of the first frame through the encoding of the last frame, so that temporal information is fully exploited, more accurate prediction frames are generated, and video coding efficiency is improved.
Fig. 2 shows a schematic flow chart of a method 200 for video encoding provided by an embodiment of the present application. The method 200 may be applied to the system 100 for video encoding shown in fig. 1. As shown in fig. 2, method 200 includes steps 210 through 240.
At 210, reconstructed frames of a fixed number of second video frames preceding a first video frame to be encoded in a video sequence are obtained. For the fixed number of second video frames, refer to the description above; details are not repeated here.
Fig. 3 shows a specific example of encoding the t-th video frame. Here, the t-th video frame is an example of the first video frame described above. As shown in Fig. 3, the fixed number of second video frames preceding the t-th video frame may be the two video frames immediately preceding it, i.e., the (t-1)-th and (t-2)-th video frames, where t is a positive integer and t > 2. In this case, the reconstructed frames of the second video frames are the reconstructed frame of the (t-1)-th video frame (the (t-1)-th reconstructed frame in Fig. 3) and the reconstructed frame of the (t-2)-th video frame (the (t-2)-th reconstructed frame in Fig. 3).
It should be noted that the method for video coding according to the embodiment of the present application may be applied to a low delay coding configuration. In the low delay coding configuration, the coding order of the video frames is the same as the order of their frame numbers, i.e., sequential coding. For example, before the t-th frame (t > 2) video frame is encoded, the t-1 th frame and the t-2 th frame are encoded. Therefore, before the t-th frame video frame is encoded, the reconstructed frames of the t-1-th frame and the t-2-th frame video frame can be obtained firstly.
Therefore, the embodiments of the present application may be designed for the low-delay coding configuration; in this configuration the frames are encoded in order, i.e., the coding order is consistent with the frame numbers, so the global long-term memory can be advanced in step with the encoding process.
At 220, a synthesized reference frame of the first video frame is generated from the reconstructed frames of the second video frames and a global long-term memory of the video sequence, wherein the global long-term memory is determined from the reconstructed frame and the synthesized reference frame of each of a plurality of video frames preceding the first video frame in the video sequence. For the global long-term memory and the synthesized reference frame, refer to the description above; details are not repeated here.
Illustratively, with continued reference to Fig. 3, after the reconstructed frames of the (t-1)-th and (t-2)-th video frames are acquired, the reference frame generation module may be used to generate the synthesized reference frame of the t-th frame (i.e., the t-th synthesized reference frame in Fig. 3). For example, the input of the reference frame generation module at this point can be the reconstructed frames of the (t-1)-th and (t-2)-th video frames together with the long-term memory of the video sequence. Using the long-term memory and these two reconstructed frames as the input of the reference frame generation module provides richer temporal information.
At 230, the first video frame is encoded according to the synthesized reference frame of the first video frame.
Illustratively, with continued reference to Fig. 3, after the reference frame generation module generates the t-th synthesized reference frame, the encoder module may read it. In some alternative embodiments, the encoder module may place the t-th synthesized reference frame in a reference frame list inside the encoder module and complete the encoding of the t-th video frame according to the reference frame list. The reference frame list may include at least two video frames, which serve as references in the encoding of the current frame.
In some embodiments, the encoder module may maintain a reference frame list for each video frame during its encoding. For example, before the encoder module has obtained the synthesized reference frame of the t-th video frame, the reference frame list corresponding to the t-th frame may store the reconstructed frames of several already-encoded video frames. After the encoder module obtains the synthesized reference frame of the t-th video frame, it may remove from the reference frame list the reconstructed frame whose frame number differs most from the frame number of the currently encoded video frame (e.g., the first video frame or the t-th video frame), and add the synthesized reference frame of the current video frame obtained from the reference frame generation module to the reference frame list, for example at the position of the removed reconstructed frame; this is not limited in the embodiments of the present application.
Because the reconstructed frame whose frame number differs most from that of the first video frame has the weakest temporal correlation with the first video frame, removing that reconstructed frame from the reference frame list and adding the synthesized reference frame of the first video frame to the list, to be used as a reference in the current encoding process, exploits the temporal correlation between different frames and achieves a better compression effect.
Thus, the embodiments of the present application generate a synthesized reference frame of a first video frame to be encoded in a video sequence from the reconstructed frames of a fixed number of second video frames preceding the first video frame and from the global long-term memory of the video sequence, and then encode the first video frame according to the synthesized reference frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.
In some optional embodiments, the method 200 may further include step 240: obtaining a reconstructed frame of the first video frame (the t-th reconstructed frame in Fig. 3), and updating the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
Illustratively, with continued reference to Fig. 3, the encoder-side device (e.g., the encoder module) may emulate the decoder side to recover the t-th video frame, thereby obtaining the t-th reconstructed frame. After the t-th reconstructed frame is obtained, the difference between the t-th reconstructed frame and the t-th synthesized reference frame may be calculated. Here, this difference may be regarded as the error produced when encoding the t-th video frame. After the difference is obtained, the maintained long-term memory can be dynamically updated according to it, so that in future predictions (i.e., when generating the synthesized reference frame of the next frame, such as the (t+1)-th frame) the reference frame generation module is aware of the error produced while encoding the t-th video frame, which helps generate more accurate synthesized reference frames in the future and improves video coding efficiency.
In some embodiments, the difference may be obtained by a module or unit inside the reference frame generation module, which updates the long-term memory according to the difference. In other embodiments, the difference may be obtained by a dedicated module or unit, for example a memory update module, which updates the long-term memory according to the difference; this is not limited in the embodiments of the present application.
It should be noted that, in the video encoding process of Fig. 3, the number of reconstructed frames available as input to the reference frame generation module is insufficient when the first and second frames of the video sequence are encoded, so the reference frame generation module may not generate a synthesized reference frame for the first or the second video frame at that stage.
Before the third video frame is encoded, the reconstructed frames of the first and second video frames have already been generated, so the reference frame generation module can generate the synthesized reference frame of the third video frame at this point. This is also the first time during the encoding of the whole sequence that the reference frame generation module generates a synthesized reference frame, so no long-term memory of the video sequence exists yet; before this, the reference frame generation module may set the long-term memory to 0. The long-term memory set to 0 and the reconstructed frames of the first and second video frames are then used together as the input of the synthesis process, and the synthesized reference frame of the third video frame is output. From the fourth video frame onwards, the reference frame generation module and the encoder module can enter normal operation: the reference frame generation module keeps generating a synthesized reference frame from the long-term memory and the reconstructed frames of the two preceding video frames, and the encoder module reads the synthesized reference frame and generates the reconstructed frame of the currently encoded video frame. Optionally, the encoder module may obtain the difference between the reconstructed frame and the synthesized reference frame of the video frame, so that the reference frame generation module can update the long-term memory according to that difference. The above process may be repeated until all video frames in the video sequence have been encoded.
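The per-frame workflow described above (no synthesized reference frame for the first two frames, long-term memory set to 0 before the third frame, then generate, encode and update for every subsequent frame) could be organised as in the following Python sketch; `encoder`, `reference_frame_generator` and `memory_updater` are placeholder callables used only for illustration, not the modules of a specific codec.

```python
def encode_sequence(frames, encoder, reference_frame_generator, memory_updater):
    """Illustrative low-delay encoding loop; not the claimed implementation."""
    reconstructed = []      # reconstructed frames of already-encoded video frames
    global_memory = None    # global long-term memory; set to 0 before the third frame
    for t, frame in enumerate(frames):          # low-delay: frames are encoded in order
        if t < 2:
            # Not enough reconstructed frames yet: encode without a synthesized reference.
            recon = encoder.encode(frame, extra_reference=None)
        else:
            if global_memory is None:
                global_memory = 0               # first use of the long-term memory
            # Synthesize a reference frame from the two previous reconstructed frames
            # and the global long-term memory.
            synthesized = reference_frame_generator(reconstructed[t - 2],
                                                    reconstructed[t - 1],
                                                    global_memory)
            recon = encoder.encode(frame, extra_reference=synthesized)
            # Fold the coding error of this frame back into the long-term memory.
            global_memory = memory_updater(recon - synthesized, global_memory)
        reconstructed.append(recon)
    return reconstructed
```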
A specific example of video coding provided in the embodiment of the present application is described in detail below with reference to fig. 4. It should be noted that the following examples are intended only to assist those skilled in the art in understanding and implementing embodiments of the present invention, and are not intended to limit the scope of embodiments of the present invention. Equivalent alterations and modifications may be effected by those skilled in the art in light of the examples set forth herein, and such alterations and modifications are intended to be within the scope of the embodiments of the invention.
For example, the system architecture of Fig. 4 may be implemented on a platform with an i7-9700K CPU, 32 GB of memory and a GTX 1080 Ti GPU, running Ubuntu 18.04. Referring to Fig. 4, the system includes a feature extraction module 410, a reference frame generation module 420, an encoder module 430 and a memory update module 440. The feature extraction module 410, the reference frame generation module 420 and the memory update module 440 may each be implemented as a neural network module. The system in Fig. 4 may therefore also be referred to as a neural network model or a neural network system.
In the example shown in Fig. 4, the feature extraction module 410 may be used to extract feature information from the reconstructed frames of a fixed number of second video frames preceding the first video frame (e.g., the (t-1)-th reconstructed frame $I_{t-1}$ and the (t-2)-th reconstructed frame $I_{t-2}$). The extracted feature information of the reconstructed frames of the second video frames and the global long-term memory $\varepsilon_t$ are then input into the reference frame generation module 420, and the synthesized reference frame $\hat{I}_t$ of the first video frame is obtained. The encoder module 430 may then encode the first video frame according to the synthesized reference frame $\hat{I}_t$. Thereafter, the encoder module may further obtain and output the reconstructed frame $I_t$ of the first video frame; the memory update module 440 obtains the reconstructed frame $I_t$ of the first video frame and the difference $R_t$ between the synthesized reference frame $\hat{I}_t$ and the reconstructed frame $I_t$, and the memory update network inside the memory update module 440 then updates the global long-term memory $\varepsilon_t$ according to the difference $R_t$.
Each neural network module in Fig. 4 may be obtained by training with a training data sample set, where the training data sample set includes a plurality of video sequence samples and each video sequence sample includes lossless video frames and video frames obtained by lossy compression of those lossless video frames.
Specific parameters of each neural network module in the system of Fig. 4 are described below. Taking the encoding of the t-th video frame as an example, it is assumed that the height and width of the two images input into the neural network model are h and w, respectively.
The feature extraction module 410:
Input: the reconstructed frames of the previous two frames (i.e., the (t-2)-th reconstructed frame $I_{t-2}$ and the (t-1)-th reconstructed frame $I_{t-1}$). The number of input channels of the feature extraction module 410 may be 6, i.e., the two reconstructed frames are concatenated along the channel dimension (each reconstructed frame has 3 input channels, namely the three RGB channels).
Network parameters: as shown in table 1 below:
TABLE 1
(The content of Table 1 is provided only as an image in the original publication and is not reproduced here.)
It should be noted that the sequence numbers in Table 1 are the names obtained by numbering the layers of the feature extraction module 410 from left to right; for example, the leftmost convolution block in the feature extraction module 410 is called convolution 1, and so on. A dashed line indicates that the outputs of two parts are added, such as the output of convolution 1 and that of upsampling 5. This is the skip-connection structure common in autoencoders, used to convey low-level information.
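Since Table 1 is available only as an image, the exact layer configuration cannot be reproduced here; the following PyTorch-style sketch only illustrates the general structure implied by the description (6 input channels from two concatenated RGB reconstructed frames, and a skip connection that adds an encoder output to a decoder output, as in an autoencoder). All channel counts and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Illustrative encoder-decoder with one skip connection; not the published architecture."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(6, 32, 3, padding=1)               # two RGB frames concatenated -> 6 channels
        self.conv2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # downsampling path
        self.conv3 = nn.Conv2d(64, 64, 3, padding=1)
        self.up = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)  # upsampling path

    def forward(self, recon_prev2, recon_prev1):
        x = torch.cat([recon_prev2, recon_prev1], dim=1)   # concatenate along the channel dimension
        f1 = F.relu(self.conv1(x))
        f2 = F.relu(self.conv2(f1))
        f3 = F.relu(self.conv3(f2))
        return self.up(f3) + f1        # skip connection: add encoder and decoder outputs
```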
The reference frame generation module 420 includes a local convolution parameter generation network 421 and a weight generation network 422.
Local convolution parameter generation network 421:
Input:
(1) the sum of the outputs of convolution 2 and upsampling 2 of the feature extraction module 410, denoted $\gamma_t$;
(2) the long-term memory, denoted $\varepsilon_t$.
Input integration: (1) and (2) are concatenated along the channel dimension.
Output: local convolution coefficients with 51 channels and the same height and width as the input image.
Network parameters: as shown in table 2 below:
TABLE 2
(The content of Table 2 is provided only as an image in the original publication and is not reproduced here.)
Weight generation network 422:
Input: the sum of the outputs of upsampling 5 and convolution 1.
Output: a weight matrix with 1 channel and the same height and width as the input image, i.e., $M_{t-1}$ as marked in Fig. 4. In addition, $M_{t-2}$ in Fig. 4 may be obtained by subtracting $M_{t-1}$ from an all-ones matrix.
Network parameters: as shown in table 3 below:
TABLE 3
(The content of Table 3 is provided only as an image in the original publication and is not reproduced here.)
The synthesized reference frame is generated (i.e., the operation of the part 423 indicated by the dashed line in Fig. 4) according to the following formula (1):

$$\hat{I}_t = \sum_{i=t-2}^{t-1} M_i \odot \left( K_i \circledast I_i \right) \qquad (1)$$

where $I_i$ denotes a reconstructed frame of the two input video frames, $\circledast$ denotes the local convolution operation in deep learning, $K_i$ denotes the set of convolution kernel coefficients of all pixels in the first video frame (a convolution kernel coefficient is generated separately for each pixel), and $\odot$ denotes the pixel-level dot-product operation used to perform a pixel-level weighted addition of the results of locally convolving the two input frames, with weight matrix $M_i$.
Therefore, in the embodiments of the present application a convolution kernel is generated for each pixel individually, i.e., local convolution. This has stronger expressive power than a regression mode that uses the same kernel for all pixels, achieves a better regression effect, and thereby helps generate a more accurate prediction frame and further improves video coding efficiency. Here, regression refers to the process of generating the synthesized reference frame.
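To make the local-convolution regression of formula (1) concrete, the following PyTorch-style sketch applies a separately predicted kernel at every pixel of each input reconstructed frame and blends the two results with the weight matrices $M_{t-1}$ and $M_{t-2} = 1 - M_{t-1}$. The kernel size, the layout of the kernel tensor and the tensor shapes are assumptions, since Tables 2 and 3 are not reproduced here.

```python
import torch
import torch.nn.functional as F

def local_convolution(frame, kernels, kernel_size):
    """Apply a per-pixel (local) convolution kernel to `frame`.

    frame:   (B, 3, H, W) reconstructed frame
    kernels: (B, kernel_size * kernel_size, H, W), one kernel per output pixel (assumed layout)
    """
    b, c, h, w = frame.shape
    pad = kernel_size // 2
    # Extract a kernel_size x kernel_size patch around every pixel: (B, 3*k*k, H*W)
    patches = F.unfold(frame, kernel_size, padding=pad)
    patches = patches.view(b, c, kernel_size * kernel_size, h, w)
    weights = kernels.view(b, 1, kernel_size * kernel_size, h, w)
    # Weighted sum over each patch = per-pixel convolution
    return (patches * weights).sum(dim=2)

def synthesize_reference(recon_prev1, recon_prev2, k_prev1, k_prev2, m_prev1, kernel_size=5):
    """Formula (1): sum over i of M_i (dot) (K_i locally convolved with I_i), with M_{t-2} = 1 - M_{t-1}."""
    m_prev2 = 1.0 - m_prev1                      # complementary weight matrix
    out1 = m_prev1 * local_convolution(recon_prev1, k_prev1, kernel_size)
    out2 = m_prev2 * local_convolution(recon_prev2, k_prev2, kernel_size)
    return out1 + out2
```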
The memory update network in the memory update module 440:
Input:
(1) the difference $R_t$ between the t-th reconstructed frame $I_t$ and the synthesized reference frame $\hat{I}_t$, with 3 channels and the same height and width as the input pictures;
(2) the LSTM state of the previous step, $h_{i-1}, c_{i-1}$.
Output: the ConvLSTM state at the next time step, i.e., $h_i, c_i$, with 32 channels.
Network parameters: as shown in table 4 below:
TABLE 4
(The content of Table 4 is provided only as an image in the original publication and is not reproduced here.)
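Table 4 is available only as an image, so the exact ConvLSTM configuration is not reproducible here; the sketch below shows a generic ConvLSTM cell with a 3-channel residual input and a 32-channel state, matching the input/output description above. The gate arrangement and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class MemoryUpdateCell(nn.Module):
    """Illustrative ConvLSTM cell that folds the residual R_t into the (h, c) long-term memory."""
    def __init__(self, in_channels=3, hidden_channels=32, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, residual, h_prev, c_prev):
        # residual: (B, 3, H, W) difference between the reconstructed and the synthesized frame
        x = torch.cat([residual, h_prev], dim=1)
        i, f, o, g = torch.chunk(self.gates(x), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g          # updated cell state (part of the long-term memory)
        h = o * torch.tanh(c)           # updated hidden state
        return h, c
```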
Next, a training mode of the neural network model in fig. 4 is described.
Illustratively, the embodiments of the present application may train the neural network model using the Vimeo-90K data set, which may include 89,800 video sequences, each consisting of 7 consecutive lossless video frames, denoted $\{I_1, I_2, \ldots, I_7\}$. First, the data set may be processed: each frame of each video sequence is lossily compressed to obtain a degraded video frame sequence, denoted here $\{\tilde{I}_1, \tilde{I}_2, \ldots, \tilde{I}_7\}$, which simulates the image quality loss produced in a real encoding process. Meanwhile, the neural network model also maintains a global long-term memory $\varepsilon$, where the global long-term memory input during the prediction of the t-th video frame may be denoted $\varepsilon_t$.
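One way to build such a degraded counterpart of each lossless frame for training is sketched below; the use of JPEG re-compression and the quality value are only stand-ins for "lossy compression", since the publication does not specify which codec or quality setting was used.

```python
import io
from PIL import Image

def degrade_frame(lossless_frame: Image.Image, quality: int = 30) -> Image.Image:
    """Simulate the quality loss of real encoding by lossy re-compression (illustrative only)."""
    buffer = io.BytesIO()
    lossless_frame.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")

def degrade_sequence(frames):
    # frames: the 7 consecutive lossless frames of one Vimeo-90K sample, as PIL images
    return [degrade_frame(f) for f in frames]
```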
The following is one example of a training process:
step 1: in a degraded video frame sequence
the first two lossy frames, i.e., $\tilde{I}_1$ and $\tilde{I}_2$, are selected, and the initial memory $\varepsilon_3$ is set to 0, in order to predict the third frame. These three inputs are fed into the reference frame generation module, and the predicted frame $\hat{I}_3$ of the third frame is computed. Next, the error between the predicted frame $\hat{I}_3$ and the lossless third frame $I_3$ is calculated and back-propagated, and the network parameters are updated. Then, the difference between the predicted frame $\hat{I}_3$ and the lossy frame $\tilde{I}_3$ is calculated, and the long-term memory $\varepsilon_3$ (initially 0) is updated to $\varepsilon_4$ for the prediction of the fourth frame.
Step 2: the next two frames of the last network forward propagated input frame and the long term memory after the last update are selected as input. For example, if the input during the last cycle was
$\tilde{I}_t$ and $\tilde{I}_{t+1}$, then the inputs this time are the lossy frames $\tilde{I}_{t+1}$ and $\tilde{I}_{t+2}$. Likewise, after the previous forward pass the long-term memory has been updated to $\varepsilon_{t+3}$. Therefore, $\tilde{I}_{t+1}$, $\tilde{I}_{t+2}$ and $\varepsilon_{t+3}$ are input into the network and a forward pass is performed to obtain the predicted frame $\hat{I}_{t+3}$; the error between the predicted frame $\hat{I}_{t+3}$ and the lossless frame $I_{t+3}$ is calculated and back-propagated, and the network parameters are updated. Then, the difference between the predicted frame $\hat{I}_{t+3}$ and the lossy frame $\tilde{I}_{t+3}$ is calculated, and the long-term memory is updated to $\varepsilon_{t+4}$.
Step 3: step 2 is repeated until all frames in the video sequence have been used. Since each video sequence in the training data has 7 consecutive frames, step 2 is repeated 4 times for each video sequence.
Step 4: other video sequences in the training data are selected, and the above three steps are repeated until the neural network converges.
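Putting steps 1 to 4 together, one training iteration over a single 7-frame sample could look like the following Python sketch. The L1 loss, the optimizer and the exact signatures of `reference_frame_generator` and `memory_updater` are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F

def train_on_sequence(lossless, lossy, reference_frame_generator, memory_updater, optimizer):
    """One pass over a 7-frame sample: predict frames 3..7 and update the memory at each step."""
    memory = 0                                   # initial long-term memory (epsilon_3) set to 0
    loss_value = 0.0
    for t in range(2, len(lossless)):            # step 1 for the 3rd frame, then 4 repeats of step 2
        # Forward pass: the two previous lossy frames and the current long-term memory.
        predicted = reference_frame_generator(lossy[t - 2], lossy[t - 1], memory)
        # Back-propagate the error against the lossless target frame and update the parameters.
        loss = F.l1_loss(predicted, lossless[t])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Update the long-term memory from the difference to the lossy frame,
        # for use when predicting the next frame.
        memory = memory_updater(predicted.detach() - lossy[t], memory)
        loss_value = loss.item()
    return loss_value
```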
Fig. 5 shows an example of a peak signal-to-noise ratio (PSNR) curve of the video encoding scheme of an embodiment of the present application. For example, the encoder of the present application may be tested on the fourier horizon test sequence under the low-delay encoding configuration. As shown in Fig. 5, compared with the conventional HEVC scheme, the memory-enhanced auto-regressive network (MAAR-Net) scheme of the embodiment of the present application achieves a BD-rate (Bjontegaard delta rate) gain of 10.6%. In Fig. 5, the abscissa is the bit rate (bitrate) in Kbps, and the ordinate is the luma peak signal-to-noise ratio (Y-PSNR) in dB.
Thus, through deep learning, the embodiments of the present application take the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in a video sequence, together with the global long-term memory of the video sequence, as the input of a neural network model, output the synthesized reference frame of the first video frame, and then encode the first video frame according to the synthesized reference frame. Because the synthesized reference frame is able to describe complex motion within a video sequence, the embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.
In a prior-art video encoding scheme, when an LSTM network is used to generate a reference frame, only a fixed number of reconstructed frames preceding the currently encoded video frame are input into the reference-frame-generating network. Taking four input frames preceding the currently encoded video frame as an example, the LSTM state is updated once for each input frame until all four reconstructed frames have been input, i.e., the LSTM state is updated four times, and the generated reference frame of the currently encoded video frame is finally output. However, the reference frame generation process then only contains the information of the four input reconstructed frames, and the information contained in the remaining encoded reconstructed frames is not used. In addition, the generation of each reference frame is independent, i.e., the memory of the network is reset before each reference frame is generated, which blocks the transfer of long-term temporal information, so accurate prediction frames cannot be generated.
In the embodiment of the present application, a global long-term memory is maintained from the beginning of the encoding process of the video sequence, and the time domain span of the global long-term memory may be as long as hundreds of frames. For example, the global long-term memory is dynamically updated all the time from the encoding process of the first frame to the encoding process of the last frame, so that the input of the network is ensured to contain a sufficiently long time domain span, and further, the time domain information can be fully utilized to generate a more accurate prediction frame, and the efficiency of video encoding is improved.
It should be noted that, in the video playing device, the scheme provided in the embodiment of the present application may be implemented in the form of a hardware chip, or may be implemented in the form of a software code, and the embodiment of the present application is not limited to this.
An embodiment of the present application further provides a video encoding apparatus, please refer to fig. 6. The video encoding apparatus 600 may be, for example, a video storage and playing device. In this embodiment, the apparatus 600 may include an obtaining unit 610, a generating unit 620 and an encoding unit 630.
An obtaining unit 610 is configured to obtain reconstructed frames of a fixed number of second video frames before a first video frame to be encoded in a video sequence.
A generating unit 620, configured to generate a synthesized reference frame of the first video frame according to the reconstructed frame of the second video frame and a global long-term memory of the video sequence, where the global long-term memory is determined according to the reconstructed frame of each video frame and the synthesized reference frame of each video frame in a plurality of video frames before the first video frame in the video sequence.
An encoding unit 630, configured to encode the first video frame according to the synthesized reference frame of the first video frame.
In some possible implementations, the method further includes:
an updating unit, configured to acquire the reconstructed frame of the first video frame and update the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
In some possible implementations, the obtaining unit 610 is further configured to extract feature information of a reconstructed frame of the second video frame;
the generating unit 620 is specifically configured to input the feature information of the reconstructed frames of the second video frames and the global long-term memory into a reference frame generation network model and obtain the synthesized reference frame of the first video frame, where the network model is trained on a training data sample set, the training data sample set includes a plurality of video sequence samples, and each video sequence sample includes lossless video frames and video frames obtained by lossy compression of the lossless video frames.
In some possible implementations, the generating unit 620 is specifically configured to obtain the synthesized reference frame of the first video frame according to the following formula:
$\hat{I}_t = \sum_i M_i \odot \left( K_i \circledast I_i \right)$

where $\hat{I}_t$ represents the synthesized reference frame of the first video frame, t represents the frame number of the first video frame, i represents the frame number of a second video frame preceding the first video frame, $I_i$ represents the reconstructed frame of the i-th video frame, $\circledast$ represents the local convolution operation in deep learning, $K_i$ represents the set of convolution kernel coefficients of all pixels in the first video frame, $\odot$ represents a pixel-level dot-product operation used to perform a pixel-level weighted addition, with the weight matrix $M_i$, of the results obtained by locally convolving the input frames, $\varepsilon_t$ represents the global long-term memory when the first video frame is encoded, $\gamma_t$ represents the feature information of the reconstructed frame of the second video frame, the convolution kernel coefficients $K_i$ and the weight matrices $M_i$ are output by the reference frame generation network model from the inputs $\gamma_t$ and $\varepsilon_t$, t and i are positive integers, and t is greater than 2.
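To make the formula concrete, the following NumPy sketch (an illustration only, under the assumption that the network has already produced the per-pixel kernel coefficients $K_i$ and weight matrices $M_i$ from the feature information and the global long-term memory; the naive loops are written for clarity, not efficiency) computes the synthesized reference frame as the pixel-level weighted sum of locally convolved reconstructed frames:

import numpy as np

def local_convolution(frame, kernels):
    # Per-pixel (local) convolution: each pixel of the frame has its own kernel.
    # frame:   (H, W) reconstructed frame I_i.
    # kernels: (H, W, k, k) kernel coefficients K_i, one k x k kernel per pixel.
    h, w, k, _ = kernels.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.empty((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.sum(patch * kernels[y, x])
    return out

def synthesize_reference(reconstructed_frames, kernels, weights):
    # Synthesized reference frame = sum over i of M_i (pixel-wise) * (K_i locally convolved with I_i).
    # reconstructed_frames: list of (H, W) frames I_i.
    # kernels:              list of (H, W, k, k) kernel sets K_i.
    # weights:              list of (H, W) weight matrices M_i.
    ref = np.zeros_like(reconstructed_frames[0], dtype=np.float32)
    for frame, k_i, m_i in zip(reconstructed_frames, kernels, weights):
        ref += m_i * local_convolution(frame.astype(np.float32), k_i)
    return ref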
In some possible implementations, the second video frames include the two video frames immediately preceding the first video frame, where the first video frame is the third video frame in the video sequence or a video frame after the third video frame.
In some possible implementations, the global long-term memory is 0 if the first video frame is the third video frame in the video sequence.
In some possible implementations, the encoding unit 630 is specifically configured to: obtain a reference frame list of the first video frame, where the reference frame list includes reconstructed frames of at least two encoded video frames; remove, from the reference frame list, the reconstructed frame whose frame number differs most from the frame number of the first video frame, and add the synthesized reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list; and encode the first video frame according to the reference frame list.
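As a hedged illustration of this reference frame list handling (Python; the dictionary layout of the list entries and the frame number recorded for the synthesized reference frame are assumptions made only for this sketch), the synthesized reference frame replaces, at the same position, the reconstructed frame whose frame number differs most from that of the first video frame:

def insert_synthesized_reference(reference_list, current_frame_number, synthesized_reference):
    # reference_list: list of dicts {"frame_number": int, "frame": array} for encoded frames.
    # Index of the entry whose frame number differs most from the current frame number.
    idx = max(range(len(reference_list)),
              key=lambda j: abs(reference_list[j]["frame_number"] - current_frame_number))
    # The synthesized reference frame takes the position of the removed reconstructed frame.
    reference_list[idx] = {"frame_number": current_frame_number,
                           "frame": synthesized_reference}
    return reference_list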
Fig. 7 is a schematic hardware structure diagram of an apparatus 700 for video encoding according to an embodiment of the present application. The apparatus 700 shown in fig. 7 may be regarded as a computer device, and may serve as an implementation of the video encoding apparatus according to the embodiment of the present application or as an implementation that performs the video encoding method according to the embodiment of the present application. The apparatus 700 includes a processor 701, a memory 702, an input/output interface 703 and a bus 705, and may further include a communication interface 704. The processor 701, the memory 702, the input/output interface 703 and the communication interface 704 are communicatively connected to each other via the bus 705.
The processor 701 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the functions required to be executed by the modules in the video encoding apparatus according to the embodiment of the present application, or to perform the video encoding method according to the embodiment of the present application. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or EPROM, or a register. The storage medium is located in the memory 702; the processor 701 reads the information in the memory 702 and, in combination with its hardware, completes the functions required to be executed by the modules included in the video encoding apparatus according to the embodiment of the present application, or performs the video encoding method according to the method embodiment of the present application.
The memory 702 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 702 may store an operating system and other application programs. When the functions required to be executed by the modules included in the video encoding apparatus according to the embodiment of the present application, or the video encoding method according to the embodiment of the present application, are implemented by software or firmware, the program code for implementing the technical solutions provided by the embodiment of the present application is stored in the memory 702, and the processor 701 executes the operations required to be executed by the modules included in the video encoding apparatus, or performs the video encoding method provided by the embodiment of the present application.
The input/output interface 703 is used for receiving input data and information and outputting data such as an operation result.
The communication interface 704 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 700 and other devices or communication networks. The communication interface 704 may serve as the obtaining unit or a sending unit of the apparatus.
Bus 705 may include a pathway to transfer information between various components of device 700, such as processor 701, memory 702, input/output interface 703, and communication interface 704.
It should be noted that although the apparatus 700 shown in fig. 7 only shows the processor 701, the memory 702, the input/output interface 703, the communication interface 704 and the bus 705, in a specific implementation process, a person skilled in the art should understand that the apparatus 700 further comprises other devices necessary for normal operation, for example, a display for displaying video data to be played. Also, those skilled in the art will appreciate that the apparatus 700 may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that apparatus 700 may also include only those components necessary to implement embodiments of the present application, and need not include all of the components shown in FIG. 7.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Embodiments of the present application further provide a computer-readable storage medium that includes a computer program; when the computer program runs on a computer, the computer is caused to execute the method provided by the above method embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the method provided by the above method embodiments.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that the descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent a particular limitation to the number of devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of video encoding, comprising:
acquiring reconstructed frames of a fixed number of second video frames preceding a first video frame to be encoded in a video sequence;
generating a composite reference frame of the first video frame from a reconstructed frame of the second video frame and a global long-term memory of the video sequence, wherein the global long-term memory is determined from a reconstructed frame of each video frame and a composite reference frame of each video frame in a plurality of video frames preceding the first video frame in the video sequence;
encoding the first video frame according to the synthesized reference frame of the first video frame.
2. The method of claim 1, further comprising:
acquiring a reconstructed frame of the first video frame;
updating the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
3. The method of claim 1 or 2, further comprising:
extracting feature information of the reconstructed frame of the second video frame;
wherein the generating a synthesized reference frame for the first video frame from the reconstructed frame for the second video frame and the global long-term memory for the video sequence comprises:
inputting the feature information of the reconstructed frame of the second video frame and the global long-term memory into a reference frame generation network model to obtain the synthesized reference frame of the first video frame, wherein the reference frame generation network model is obtained by training with a training data sample set, the training data sample set comprises a plurality of video sequence samples, and each video sequence sample comprises lossless video frames and video frames obtained by lossy compression of the lossless video frames.
4. The method of claim 3, wherein the inputting the feature information of the reconstructed frame of the second video frame and the global long-term memory into the reference frame generation network model to obtain the synthesized reference frame of the first video frame comprises:
obtaining a synthesized reference frame of the first video frame according to the following formula:
$\hat{I}_t = \sum_i M_i \odot \left( K_i \circledast I_i \right)$

wherein $\hat{I}_t$ represents the synthesized reference frame of the first video frame, t represents the frame number of the first video frame, i represents the frame number of a second video frame preceding the first video frame, $I_i$ represents the reconstructed frame of the i-th video frame, $\circledast$ represents the local convolution operation in deep learning, $K_i$ represents the set of convolution kernel coefficients of all pixels in the first video frame, $\odot$ represents a pixel-level dot-product operation used to perform a pixel-level weighted addition, with the weight matrix $M_i$, of the results obtained by locally convolving the input frames, $\varepsilon_t$ represents the global long-term memory when the first video frame is encoded, $\gamma_t$ represents the feature information of the reconstructed frame of the second video frame, the convolution kernel coefficients $K_i$ and the weight matrices $M_i$ are output by the reference frame generation network model from the inputs $\gamma_t$ and $\varepsilon_t$, t and i are positive integers, and t is greater than 2.
5. The method according to any one of claims 1-4, wherein the second video frames comprise the two video frames immediately preceding the first video frame, and wherein the first video frame is the third video frame in the video sequence or a video frame subsequent to the third video frame.
6. The method of claim 5, wherein the global long term memory is 0 if the first video frame is a third video frame in the video sequence.
7. The method according to any one of claims 1-6, wherein the encoding the first video frame according to the synthesized reference frame of the first video frame comprises:
acquiring a reference frame list of the first video frame, wherein the reference frame list comprises reconstructed frames of at least two video frames which are encoded;
removing, from the reference frame list, the reconstructed frame whose frame number differs most from the frame number of the first video frame, and adding the synthesized reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list;
encoding the first video frame according to the reference frame list.
8. An apparatus for video encoding, comprising:
an obtaining unit, configured to obtain reconstructed frames of a fixed number of second video frames preceding a first video frame to be encoded in a video sequence;
a generating unit, configured to generate a synthesized reference frame of the first video frame according to a reconstructed frame of the second video frame and a global long-term memory of the video sequence, wherein the global long-term memory is determined according to a reconstructed frame of each video frame and a synthesized reference frame of each video frame in a plurality of video frames before the first video frame in the video sequence;
an encoding unit, configured to encode the first video frame according to the synthesized reference frame of the first video frame.
9. The apparatus of claim 8, further comprising:
an updating unit, configured to obtain a reconstructed frame of the first video frame, and update the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
10. The apparatus according to claim 8 or 9, wherein the obtaining unit is further configured to:
extract feature information of the reconstructed frame of the second video frame;
wherein the generating unit is specifically configured to:
input the feature information of the reconstructed frame of the second video frame and the global long-term memory into a reference frame generation network model to obtain the synthesized reference frame of the first video frame, wherein the reference frame generation network model is obtained by training with a training data sample set, the training data sample set comprises a plurality of video sequence samples, and each video sequence sample comprises lossless video frames and video frames obtained by lossy compression of the lossless video frames.
11. The apparatus according to claim 10, wherein the generating unit is specifically configured to:
obtain the synthesized reference frame of the first video frame according to the following formula:
$\hat{I}_t = \sum_i M_i \odot \left( K_i \circledast I_i \right)$

wherein $\hat{I}_t$ represents the synthesized reference frame of the first video frame, t represents the frame number of the first video frame, i represents the frame number of a second video frame preceding the first video frame, $I_i$ represents the reconstructed frame of the i-th video frame, $\circledast$ represents the local convolution operation in deep learning, $K_i$ represents the set of convolution kernel coefficients of all pixels in the first video frame, $\odot$ represents a pixel-level dot-product operation used to perform a pixel-level weighted addition, with the weight matrix $M_i$, of the results obtained by locally convolving the input frames, $\varepsilon_t$ represents the global long-term memory when the first video frame is encoded, $\gamma_t$ represents the feature information of the reconstructed frame of the second video frame, the convolution kernel coefficients $K_i$ and the weight matrices $M_i$ are output by the reference frame generation network model from the inputs $\gamma_t$ and $\varepsilon_t$, t and i are positive integers, and t is greater than 2.
12. The apparatus according to any one of claims 8-11, wherein the second video frames comprise the two video frames immediately preceding the first video frame, and wherein the first video frame is the third video frame in the video sequence or a video frame subsequent to the third video frame.
13. The apparatus of claim 12, wherein the global long term memory is 0 if the first video frame is a third video frame in the video sequence.
14. The apparatus according to any of claims 8-13, wherein the encoding unit is specifically configured to:
acquire a reference frame list of the first video frame, wherein the reference frame list comprises reconstructed frames of at least two encoded video frames;
remove, from the reference frame list, the reconstructed frame whose frame number differs most from the frame number of the first video frame, and add the synthesized reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list; and
encode the first video frame according to the reference frame list.
15. An apparatus of video encoding, comprising: a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions, and when the processor executes the instructions stored by the memory, the apparatus for video encoding is configured to perform the method of any of claims 1-7.
CN202010358452.2A 2020-04-29 2020-04-29 Method and apparatus for video encoding Pending CN113573076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358452.2A CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video encoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010358452.2A CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video encoding

Publications (1)

Publication Number Publication Date
CN113573076A true CN113573076A (en) 2021-10-29

Family

ID=78158614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358452.2A Pending CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video encoding

Country Status (1)

Country Link
CN (1) CN113573076A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070199011A1 (en) * 2006-02-17 2007-08-23 Sony Corporation System and method for high quality AVC encoding
CN101272494A (en) * 2008-01-25 2008-09-24 浙江大学 Video encoding/decoding method and device using synthesized reference frame
US20170111652A1 (en) * 2015-10-15 2017-04-20 Cisco Technology, Inc. Low-complexity method for generating synthetic reference frames in video coding
US20190289321A1 (en) * 2016-11-14 2019-09-19 Google Llc Video Frame Synthesis with Deep Learning
CN109101858A (en) * 2017-06-20 2018-12-28 北京大学 Action identification method and device
US20190297326A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Video prediction using spatially displaced convolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANG Yan et al.: "Rate Allocation Algorithm for H.264 SVC Spatial-Temporal Scalable Coding", Acta Scientiarum Naturalium Universitatis Pekinensis (Natural Science Edition), no. 05, pages 17-27 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115834905A (en) * 2023-02-09 2023-03-21 北京大学 Inter-frame prediction method, device, electronic equipment and medium
CN115834905B (en) * 2023-02-09 2023-04-11 北京大学 Inter-frame prediction method, device, electronic equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination