CN113573076A - Method and apparatus for video coding

Publication number: CN113573076A
Authority: CN (China)
Prior art keywords: frame, video, video frame, reconstructed, frames
Legal status: Pending
Application number: CN202010358452.2A
Other languages: Chinese (zh)
Inventors: 刘家瑛 (Jiaying Liu), 王晶 (Jing Wang), 胡煜章 (Yuzhang Hu)
Assignees: Peking University; Huawei Technologies Co Ltd


Classifications

    • H04N 19/51 Motion estimation or motion compensation (H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals > H04N 19/50 using predictive coding > H04N 19/503 involving temporal prediction)
    • G06N 3/045 Combinations of networks (G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/04 Architecture, e.g. interconnection topology)
    • H04N 19/587 Predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H04N 19/90 Coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals


Abstract

The present application relates to video coding technology in the field of artificial intelligence, and provides a video coding method and apparatus capable of generating more accurate prediction frames and thereby improving video coding efficiency. The video coding method includes: acquiring reconstructed frames of a fixed number of second video frames preceding a first video frame to be encoded in a video sequence; generating a synthetic reference frame of the first video frame according to the reconstructed frames of the second video frames and a global long-term memory of the video sequence, where the global long-term memory is determined according to the reconstructed frame and the synthetic reference frame of each of a plurality of video frames preceding the first video frame in the video sequence; and encoding the first video frame according to the synthetic reference frame of the first video frame. Because the synthetic reference frame is able to describe complex motion across the video sequence, embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

Description

Method and Apparatus for Video Coding

Technical Field

The present application relates to video coding technology in the field of artificial intelligence, and more particularly, to a video coding method and apparatus.

Background

As users' demand for high-quality video keeps growing, new generations of video coding standards such as High Efficiency Video Coding (HEVC) continue to be proposed. A video consists of consecutive video frames, and adjacent video frames are strongly correlated. Inter-frame prediction in mainstream video coding standards exploits exactly this temporal correlation to compress data: an already-encoded reconstructed frame is placed into a reference frame list and serves as a reference for encoding the next frame, and block-level motion estimation removes the temporal redundancy between adjacent frames. The traditional reference mode has shortcomings, however. For example, constrained by its block-level linear motion search mechanism, traditional inter-frame prediction struggles to describe irregular motion patterns such as rotation, which hinders further improvement of video compression efficiency. How to improve the inter-frame prediction mechanism so that it handles complex motion between frames better is therefore of great significance for the further development of encoders.

In recent years, deep learning has developed rapidly and shown strong performance in computer vision, and there have accordingly been attempts to use deep learning to improve inter-frame prediction. Video encoders provide a low-delay encoding configuration, in which the encoding order of the frames matches their numbering in the video sequence, that is, frames are encoded sequentially from first to last. This configuration suits usage scenarios such as live streaming. Because of this coding structure, the sequence of frames to be encoded exhibits strong temporal continuity. One existing method therefore feeds several already-encoded frames into a neural network, outputs a predicted frame for the next frame, and uses that prediction as an additional reference frame in the encoding of the next frame, thereby improving video coding efficiency.

However, when this method generates a reference frame, only a fixed number of frames are fed into the neural network, and the information contained in the remaining encoded frames goes unused. How to further improve video coding efficiency is therefore an urgent problem.

Summary of the Invention

The present application provides a video coding method and apparatus capable of generating more accurate prediction frames and thereby improving video coding efficiency.

According to a first aspect, a video coding method is provided. The method includes:

acquiring reconstructed frames of a fixed number of second video frames preceding a first video frame to be encoded in a video sequence;

generating a synthetic reference frame of the first video frame according to the reconstructed frames of the second video frames and a global long-term memory of the video sequence, where the global long-term memory is determined according to the reconstructed frame of each of a plurality of video frames preceding the first video frame in the video sequence and the synthetic reference frame of each of those video frames; and

encoding the first video frame according to the synthetic reference frame of the first video frame.

Thus, embodiments of the present application generate a synthetic reference frame of the first video frame according to the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in the video sequence and according to the global long-term memory of the video sequence, and then encode the first video frame according to that synthetic reference frame. Because the synthetic reference frame is able to describe complex motion across the video sequence, embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

The fixed number of second video frames preceding the first video frame may be, for example, the two video frames immediately preceding the first video frame, the three video frames immediately preceding it, or the single video frame immediately preceding it; this is not limited in the embodiments of the present application. The reconstructed frames of the fixed number of second video frames preceding the first video frame carry the short-term temporal information available before the first video frame is encoded.

The global long-term memory of the video sequence may be determined according to a plurality of already-encoded video frames in the video sequence, for example according to the reconstructed frame and the synthetic reference frame of each of those video frames. Exemplarily, the plurality of video frames may be all or some of the already-encoded video frames in the video sequence, which is not limited in the embodiments of the present application.

Because the reconstructed frames of the fixed number of second video frames carry the strong temporal correlation between adjacent video frames, and the global long-term memory carries the long-term temporal information of the video sequence, generating the synthetic reference frame from both, and using it as a reference frame when encoding the first video frame, helps give the generated synthetic reference frame the ability to describe complex motion (for example, nonlinear motion or rotational motion) across the video sequence.

It should be noted that the video coding method of the embodiments of the present application may be applied in a low-delay encoding configuration. In a low-delay configuration, video frames are encoded in the same order as their frame numbers, that is, sequentially. For example, before the t-th video frame (t > 2) is encoded, the (t-1)-th and (t-2)-th video frames have already been encoded, so the reconstructed frames of the (t-1)-th and (t-2)-th video frames can be obtained before the t-th video frame is encoded. In this way, the embodiments of the present application can advance the global long-term memory and the encoding process in step.

With reference to the first aspect, in some implementations of the first aspect, the method further includes:

obtaining a reconstructed frame of the first video frame; and

updating the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthetic reference frame of the first video frame. Here, the difference can be regarded as the error produced while encoding the first video frame.

Thus, by updating the global long-term memory in real time according to the difference between the reconstructed frame of a video frame and its synthetic reference frame, the embodiments of the present application can encode the sequence of video frames iteratively, passing long-term temporal information along continuously, so that the synthetic reference frame of every video frame in the sequence is able to describe complex motion across the video sequence, further improving coding performance. For example, the present application can keep the global long-term memory dynamically updated from the encoding of the first frame to the end of the encoding of the last frame, making full use of temporal information to generate more accurate prediction frames and improve video coding efficiency.

In some embodiments, the difference may be obtained by a module or unit inside the reference frame generation module, which updates the long-term memory according to the difference. In other embodiments, a dedicated module or unit, for example a memory update module, may obtain the difference and update the long-term memory accordingly; this is not limited in the embodiments of the present application.

With reference to the first aspect, in some implementations of the first aspect, the method further includes:

extracting feature information of the reconstructed frames of the second video frames,

where generating the synthetic reference frame of the first video frame according to the reconstructed frames of the second video frames and the global long-term memory of the video sequence includes:

inputting the feature information of the reconstructed frames of the second video frames and the global long-term memory into a reference frame generation network model, and obtaining the synthetic reference frame of the first video frame, where the reference frame generation network model is obtained by training with a training data sample set; the training data sample set includes a plurality of video sequence samples, each of which includes lossless video frames and video frames obtained by applying lossy compression to the lossless video frames.

Thus, through deep learning, embodiments of the present application can take the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in the video sequence, together with the global long-term memory of the video sequence, as the input of a neural network model, output the synthetic reference frame of the first video frame, and then encode the first video frame according to that synthetic reference frame. Because the synthetic reference frame is able to describe complex motion across the video sequence, embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

With reference to the first aspect, in some implementations of the first aspect, inputting the feature information of the reconstructed frames of the second video frames and the global long-term memory into the reference frame generation network model and obtaining the synthetic reference frame of the first video frame includes:

obtaining the synthetic reference frame of the first video frame according to the following formula:

$$\hat{I}_t = \sum_{i \in \{t-1,\, t-2\}} M_i \odot \left( K_i \circledast I_i \right)$$

where Î_t denotes the synthetic reference frame of the first video frame; t denotes the frame number of the first video frame; i denotes the frame number of a second video frame preceding the first video frame; I_i denotes the reconstructed frame of the i-th video frame; ⊛ denotes the local convolution operation in deep learning; K_i denotes the set of convolution kernel coefficients of all pixels in the first video frame; ⊙ denotes the pixel-level element-wise multiplication used to weight and sum, pixel by pixel, the results of locally convolving the input frames, with weight matrix M_i; ε_t denotes the global long-term memory when the first video frame is encoded; γ_t denotes the feature information of the reconstructed frames of the second video frames (K_i and M_i are generated from γ_t and ε_t); and t and i are positive integers with t > 2.

Thus, embodiments of the present application generate convolution kernel coefficients separately for each pixel, that is, they use local convolution. Compared with applying the same regression to all pixels, this has stronger expressive power and therefore achieves a better regression result, which helps generate more accurate prediction frames and improves video coding efficiency. Here, regression refers to the process of generating the synthetic reference frame; a sketch of the operation follows.
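As an illustration only, the following PyTorch sketch shows one way the local convolution and pixel-level weighted blending above could be realized; the function names, the 5x5 kernel size, and the tensor layout are assumptions for the example, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def local_convolution(frame, kernels):
    """Apply a distinct convolution kernel at every pixel (K_i applied to I_i).

    frame:   (B, C, H, W) reconstructed frame I_i
    kernels: (B, k*k, H, W) per-pixel kernel coefficients K_i
    """
    b, c, h, w = frame.shape
    k = int(kernels.shape[1] ** 0.5)                    # e.g. 5 for 5x5 kernels
    # Gather the k*k neighbourhood of every pixel: (B, C*k*k, H*W).
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)
    patches = patches.view(b, c, k * k, h, w)
    # Weight each neighbourhood by its own kernel and sum over kernel taps.
    return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (B, C, H, W)

def synthesize_reference(recon_frames, kernels, weights):
    """Î_t = sum_i M_i ⊙ (K_i locally convolved with I_i)."""
    out = torch.zeros_like(recon_frames[0])
    for frame, k_i, m_i in zip(recon_frames, kernels, weights):
        out = out + m_i * local_convolution(frame, k_i)  # m_i: (B, 1, H, W)
    return out
```

For two input frames, `weights` could be `[m, 1 - m]` with `m` produced by a weight generation network, matching the pixel-level blending in the formula above.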

With reference to the first aspect, in some implementations of the first aspect, the second video frames include the two video frames preceding the first video frame, where the first video frame is the third video frame in the video sequence or a video frame after the third video frame.

In some embodiments, when the first and second video frames of the whole video sequence are encoded, the number of reconstructed frames available as input to the reference frame generation module is insufficient, so the reference frame generation module may generate no synthetic reference frame for the first video frame or the second video frame.

Afterwards, for example from the fourth video frame on, the reference frame generation module and the encoder module can enter their normal working mode: the reference frame generation module keeps generating synthetic reference frames from the long-term memory and the reconstructed frames of the two preceding video frames, and the encoder module reads each synthetic reference frame and produces a reconstructed frame of the currently encoded video frame. Optionally, the encoder module can obtain the difference between the reconstructed frame of the video frame and the synthetic reference frame, so that the reference frame generation module can update the long-term memory according to that difference. This flow repeats until all video frames in the sequence have been encoded.

With reference to the first aspect, in some implementations of the first aspect, when the first video frame is the third video frame in the video sequence, the global long-term memory is 0. In addition, the global long-term memory may be set to 0 before the third video frame is encoded.

With reference to the first aspect, in some implementations of the first aspect, encoding the first video frame according to the synthetic reference frame of the first video frame includes:

obtaining a reference frame list of the first video frame, the reference frame list including reconstructed frames of at least two already-encoded video frames;

removing, from the reference frame list, the reconstructed frame whose frame number differs most from the frame number of the first video frame, and adding the synthetic reference frame of the first video frame to the reference frame list at the position of the removed reconstructed frame; and

encoding the first video frame according to the reference frame list.

Because the reconstructed frame whose frame number differs most from that of the first video frame has the weakest temporal correlation with the first video frame, removing it from the reference frame list and adding the synthetic reference frame of the first video frame in its place, for use as a reference in the current encoding process, exploits the temporal relationship between different frames and achieves a better compression result, as the sketch below illustrates.
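A minimal sketch of this list-management rule; the data layout is an assumption for illustration.

```python
def update_reference_list(ref_list, synthetic_ref, current_no):
    """Replace the temporally farthest reference with the synthetic one.

    ref_list: list of (frame_no, reconstructed_frame) pairs.
    The synthetic reference keeps the position of the removed entry.
    """
    # The frame whose number differs most from the current frame number
    # has the weakest temporal correlation with the current frame.
    farthest = max(range(len(ref_list)),
                   key=lambda j: abs(current_no - ref_list[j][0]))
    ref_list[farthest] = (current_no, synthetic_ref)
    return ref_list
```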

According to a second aspect, a video coding apparatus is provided for performing the method in the first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes modules for performing the method in the first aspect or any possible implementation of the first aspect.

According to a third aspect, a video coding apparatus is provided, including a memory and a processor. The memory stores instructions, and the processor executes the instructions stored in the memory; when the processor executes those instructions, the execution causes the video coding apparatus to perform the method in the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, a computer-readable medium is provided for storing a computer program, the computer program including instructions for performing the method in the first aspect or any possible implementation of the first aspect.

According to a fifth aspect, a computer program product containing instructions is provided; when the computer program product runs on a computer, it causes the computer to perform the method in the first aspect or any possible implementation of the first aspect.

It should be understood that, for the beneficial effects achieved by the second to fifth aspects and their corresponding implementations, reference may be made to the beneficial effects achieved by the first aspect and its corresponding implementations; details are not repeated.

Brief Description of the Drawings

FIG. 1 is a schematic block diagram of a video coding system provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of a video coding method provided by an embodiment of the present application.

FIG. 3 shows a specific example of encoding the t-th video frame.

FIG. 4 shows a specific example of video coding provided by an embodiment of the present application.

FIG. 5 shows an example of a PSNR curve of the video coding scheme of an embodiment of the present application.

FIG. 6 is a schematic block diagram of a video coding apparatus provided by an embodiment of the present application.

FIG. 7 is a schematic block diagram of another video coding apparatus provided by an embodiment of the present application.

Detailed Description

The technical solutions of the present application are described below with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a video coding system 100 provided by an embodiment of the present application. Exemplarily, the system 100 may be deployed on an intelligent video storage and playback device, for example an electronic product with a camera, video playback, or video storage function (a mobile phone, television, computer, and so on). As shown in FIG. 1, the system 100 includes a reference frame generation module 110 and an encoder module 120. Exemplarily, the reference frame generation module 110 may be a long short-term memory (LSTM) network, and the encoder module 120 may be an HEVC video encoder.

The reference frame generation module 110 generates a synthetic reference frame of the first video frame to be encoded according to already-encoded video frames in the video sequence. Here, the already-encoded video frames include all or some of the video frames encoded before the first video frame. For example, besides a fixed number of already-encoded video frames immediately preceding the first video frame, they may also include already-encoded video frames preceding that fixed number of frames; this is not limited in the embodiments of the present application.

In some embodiments, the inputs derived from the already-encoded video frames may include the reconstructed frames of a fixed number of second video frames preceding the first video frame currently to be encoded, and the global long-term memory of the video sequence (also called the long-term memory, the global memory, and so on, without limitation). Here, a reconstructed frame is the video frame obtained when the encoding-side device (for example, the encoder module 120 in the system 100) emulates the decoding-side device to recover the video frame.

The fixed number of second video frames preceding the first video frame may be, for example, the two video frames immediately preceding the first video frame, the three video frames immediately preceding it, or the single video frame immediately preceding it; this is not limited in the embodiments of the present application. The reconstructed frames of the fixed number of second video frames preceding the first video frame carry the short-term temporal information available before the first video frame is encoded.

The global long-term memory of the video sequence may be determined according to a plurality of already-encoded video frames in the video sequence, for example according to the reconstructed frame and the synthetic reference frame of each of those video frames. Exemplarily, the plurality of video frames may be all or some of the already-encoded video frames in the video sequence, which is not limited in the embodiments of the present application.

For example, when the video frame currently to be encoded is the first video frame, the global long-term memory input to the reference frame generation module 110 may be determined according to the reconstructed frame and the synthetic reference frame of each of a plurality of (for example, all) video frames preceding the first video frame in the video sequence.

Because the reconstructed frames of the fixed number of second video frames carry the strong temporal correlation between adjacent video frames, and the global long-term memory carries the long-term temporal information of the video sequence, generating the synthetic reference frame from both, and using it as a reference frame when encoding the first video frame, helps give the generated synthetic reference frame the ability to describe complex motion (for example, nonlinear motion or rotational motion) across the video sequence.

In the embodiments of the present application, the synthetic reference frame may also be called a reference frame, a prediction frame, and so on, which is not limited in the embodiments of the present application.

The encoder module 120 uses the synthetic reference frame of the first video frame as an additional reference frame in the encoding process of the first video frame, and encodes the first video frame accordingly.

Thus, embodiments of the present application generate a synthetic reference frame of the first video frame according to the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in the video sequence and according to the global long-term memory of the video sequence, and then encode the first video frame according to that synthetic reference frame. Because the synthetic reference frame is able to describe complex motion across the video sequence, embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

Optionally, the encoder module 120 may also obtain a reconstructed frame of the first video frame, for example by emulating the decoding-side device to recover the first video frame.

In some embodiments, the system 100 may maintain the global long-term memory. For example, after the encoding of a video frame is completed, the global long-term memory is updated according to the reconstructed frame of that video frame. Exemplarily, after the reconstructed frame of the first video frame is obtained, the difference between the reconstructed frame of the first video frame and the synthetic reference frame of the first video frame may be determined, and the global long-term memory may be updated according to that difference. The updated global long-term memory can then be used in generating the synthetic reference frame of the video frame following the first video frame; the generation and encoding of that frame's synthetic reference frame are similar to those of the first video frame and are not repeated here.

Thus, by updating the global long-term memory in real time according to the difference between the reconstructed frame of a video frame and its synthetic reference frame, the embodiments of the present application can encode the sequence of video frames iteratively, passing long-term temporal information along continuously, so that the synthetic reference frame of every video frame in the sequence is able to describe complex motion across the video sequence, further improving coding performance. For example, the present application can keep the global long-term memory dynamically updated from the encoding of the first frame to the end of the encoding of the last frame, making full use of temporal information to generate more accurate prediction frames and improve video coding efficiency.

FIG. 2 is a schematic flowchart of a video coding method 200 provided by an embodiment of the present application. The method 200 may be applied to the video coding system 100 shown in FIG. 1. As shown in FIG. 2, the method 200 includes steps 210 to 240.

210: Acquire reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in the video sequence. Exemplarily, for the fixed number of second video frames, refer to the description above; details are not repeated here.

FIG. 3 shows a specific example of encoding the t-th video frame. Here, the t-th video frame is an example of the first video frame. As shown in FIG. 3, the fixed number of second video frames preceding the t-th video frame may be the two video frames preceding it, namely the (t-1)-th and (t-2)-th video frames, where t is a positive integer and t > 2. In this case, the reconstructed frames of the second video frames are the reconstructed frame of the (t-1)-th video frame (the t-1 reconstructed frame in FIG. 3) and the reconstructed frame of the (t-2)-th video frame (the t-2 reconstructed frame in FIG. 3).

It should be noted that the video coding method of the embodiments of the present application may be applied in a low-delay encoding configuration, in which video frames are encoded in the same order as their frame numbers, that is, sequentially. For example, before the t-th video frame (t > 2) is encoded, the (t-1)-th and (t-2)-th video frames have already been encoded, so their reconstructed frames can be obtained before the t-th video frame is encoded.

Thus, the embodiments of the present application can be designed for the low-delay encoding configuration, in which frames are encoded in order, that is, the encoding order matches the frame numbering; the embodiments of the present application can therefore advance the global long-term memory and the encoding process in step.

220: Generate a synthetic reference frame of the first video frame according to the reconstructed frames of the second video frames and the global long-term memory of the video sequence, where the global long-term memory is determined according to the reconstructed frame of each of a plurality of video frames preceding the first video frame in the video sequence and the synthetic reference frame of each of those video frames. For the global long-term memory and the synthetic reference frame, refer to the description above; details are not repeated here.

Exemplarily, still referring to FIG. 3, after the reconstructed frames of the (t-1)-th and (t-2)-th video frames are obtained, the reference frame generation module may be used to generate the synthetic reference frame of the t-th video frame (the t synthetic reference frame in FIG. 3). Exemplarily, the inputs of the reference frame generation module are then the reconstructed frames of the (t-1)-th and (t-2)-th video frames together with the long-term memory of the video sequence; feeding the long-term memory alongside the two reconstructed frames ensures that sufficiently rich temporal information is provided as input.

230: Encode the first video frame according to the synthetic reference frame of the first video frame.

Exemplarily, continuing to refer to FIG. 3, after the reference frame generation module generates the t synthetic reference frame, the encoder module may read it. In some optional embodiments, the encoder module may put the t synthetic reference frame into the reference frame list in the encoder module and complete the encoding of the t-th video frame according to that list, where the reference frame list may include at least two video frames used as references in the encoding of the current frame.

In some embodiments, while encoding each video frame, the encoder module may maintain a reference frame list corresponding to that frame. Exemplarily, before the encoder module obtains the synthetic reference frame of the t-th video frame, the reference frame list corresponding to the t-th frame may hold the reconstructed frames of several already-encoded video frames. After the encoder module obtains the synthetic reference frame of the t-th video frame, it may remove from the reference frame list the reconstructed frame whose frame number differs most from the frame number of the currently encoded video frame (for example, the first video frame, or the t-th video frame), and add the synthetic reference frame of the current video frame, obtained from the reference frame generation module, to the reference frame list, for example at the position of the removed reconstructed frame; this is not limited in the embodiments of the present application.

Because the reconstructed frame whose frame number differs most from that of the first video frame has the weakest temporal correlation with the first video frame, removing it from the reference frame list and adding the synthetic reference frame of the first video frame in its place, for use as a reference in the current encoding process, exploits the temporal relationship between different frames and achieves a better compression result.

Thus, embodiments of the present application generate a synthetic reference frame of the first video frame according to the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in the video sequence and according to the global long-term memory of the video sequence, and then encode the first video frame according to that synthetic reference frame. Because the synthetic reference frame is able to describe complex motion across the video sequence, embodiments of the present application can generate more accurate prediction frames and improve video coding efficiency.

In some optional embodiments, the method 200 may further include step 240: obtain the reconstructed frame of the first video frame (the t reconstructed frame), and update the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthetic reference frame of the first video frame.

Exemplarily, continuing to refer to FIG. 3, the encoding-side device (for example, the encoder module) may emulate the decoding side and recover the t-th video frame, obtaining the t reconstructed frame. The difference between the t reconstructed frame and the t synthetic reference frame can then be computed; this difference can be regarded as the error produced while encoding the t-th video frame. After obtaining the difference, the maintained long-term memory can be dynamically updated according to it, so that in the future prediction process (that is, when generating the synthetic reference frame of the next frame, for example frame t+1) the reference frame generation module is aware of the error produced while encoding the t-th video frame, which helps generate more accurate synthetic reference frames and improves video coding efficiency.

In some embodiments, the difference may be obtained by a module or unit inside the reference frame generation module, which updates the long-term memory according to the difference. In other embodiments, a dedicated module or unit, for example a memory update module, may obtain the difference and update the long-term memory accordingly; this is not limited in the embodiments of the present application.

It should be noted that, in the video encoding process of FIG. 3, when the first and second video frames of the whole video sequence are encoded, the number of reconstructed frames available as input to the reference frame generation module is insufficient, so the reference frame generation module may generate no synthetic reference frame for the first or second video frame.

Before the third video frame is encoded, the reconstructed frames of the first and second video frames have already been generated, so the reference frame generation module can then generate the synthetic reference frame of the third video frame. This is also the first time during the encoding of the whole sequence that the reference frame generation module generates a synthetic reference frame, so no long-term memory of the video sequence exists yet; beforehand, the reference frame generation module may set the long-term memory to 0. The long-term memory set to 0 and the reconstructed frames of the first and second video frames then serve together as the inputs of the synthesis process, which outputs the synthetic reference frame of the third video frame. Afterwards, from the fourth video frame on, the reference frame generation module and the encoder module enter their normal working mode: the reference frame generation module keeps generating synthetic reference frames from the long-term memory and the reconstructed frames of the two preceding video frames, and the encoder module reads each synthetic reference frame and produces a reconstructed frame of the currently encoded video frame. Optionally, the encoder module can obtain the difference between the reconstructed frame of the video frame and the synthetic reference frame, so that the reference frame generation module can update the long-term memory according to that difference. This flow repeats until all video frames in the sequence have been encoded; the sketch below summarizes the loop.
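As an illustration, the whole loop might look like the following sketch; `generate_reference`, `encode_frame`, and `update_memory` are hypothetical stand-ins for the reference frame generation module, the encoder module, and the memory update step.

```python
def encode_sequence(frames, generate_reference, encode_frame, update_memory):
    """Low-delay encoding with a global long-term memory (sketch)."""
    memory = 0                      # global long-term memory, set to 0 initially
    reconstructions = []
    for t, frame in enumerate(frames):
        if t < 2:
            # First two frames: too few reconstructions for synthesis,
            # so no synthetic reference frame is generated.
            recon = encode_frame(frame, synthetic_ref=None)
        else:
            synthetic_ref = generate_reference(
                reconstructions[t - 1], reconstructions[t - 2], memory)
            recon = encode_frame(frame, synthetic_ref=synthetic_ref)
            # The residual is the error made while encoding frame t; it
            # drives the memory update used for future predictions.
            memory = update_memory(memory, recon - synthetic_ref)
        reconstructions.append(recon)
    return reconstructions
```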

Below, with reference to FIG. 4, a specific example of video coding provided by an embodiment of the present application is described in detail. It should be noted that the following example is merely intended to help those skilled in the art understand and implement the embodiments of the present invention, not to limit their scope. Those skilled in the art can make equivalent transformations or modifications based on the example given here, and such transformations or modifications shall still fall within the scope of the embodiments of the present invention.

Exemplarily, the system architecture of FIG. 4 may be implemented on an Ubuntu 18.04 platform equipped with an i7-9700K CPU, 32 GB of memory, and a GTX 1080 Ti GPU. Referring to FIG. 4, the system includes a feature extraction module 410, a reference frame generation module 420, an encoder module 430, and a memory update module 440. The feature extraction module 410, the reference frame generation module 420, and the memory update module 440 may be implemented by neural network modules; the system in FIG. 4 may therefore also be called a neural network model or a neural network system.

In the example shown in FIG. 4, the feature extraction module 410 may extract the feature information of the reconstructed frames of a fixed number of second video frames preceding the first video frame (for example, the t-1 reconstructed frame I_{t-1} and the t-2 reconstructed frame I_{t-2}); the extracted feature information of those reconstructed frames and the global long-term memory ε_t are then input into the reference frame generation network model, and the synthetic reference frame Î_t of the first video frame is obtained. The encoder module 430 may then encode the first video frame according to the synthetic reference frame Î_t. Afterwards, the encoder module may also obtain and output the reconstructed frame I_t of the first video frame; the memory update module 440 may obtain the reconstructed frame I_t and the difference R_t between the synthetic reference frame Î_t and the reconstructed frame I_t, and the memory update network in the memory update module 440 then updates the global long-term memory ε_t according to the difference R_t.

Each neural network module in FIG. 4 may be obtained by training with a training data sample set; the training data sample set includes a plurality of video sequence samples, each of which includes lossless video frames and video frames obtained by applying lossy compression to the lossless video frames.

The specific parameters of each neural network module in the system of FIG. 4 are described below, taking the encoding of the t-th video frame as an example and assuming that the two input frames have height h and width w.

Feature extraction module 410:

Input: the reconstructed frames of the previous two frames (the t-2 reconstructed frame I_{t-2} and the t-1 reconstructed frame I_{t-1} in FIG. 4). The number of input channels of the feature extraction module 410 may be 6, that is, the two reconstructed frames concatenated along the channel dimension (each reconstructed frame contributes 3 input channels, namely RGB).

Network parameters: as shown in Table 1 below.

Table 1 [the network parameter table is reproduced as an image in the original publication and is not recoverable here]

It should be noted that the labels in Table 1 name the blocks of the feature extraction module 410 numbered from left to right; for example, the leftmost convolution block in the feature extraction module 410 is called convolution 1, and so on. A dashed line indicates that the outputs of two parts are added, for example the outputs of convolution 1 and upsampling 5. This is the skip connection structure common in autoencoders, used to pass low-level information, as the sketch below illustrates.
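Because the exact layer parameters of Table 1 are not available here, the following is only a rough PyTorch sketch of an encoder-decoder with a 6-channel input (two concatenated RGB reconstructions) and one skip connection; all layer widths are assumptions, and even spatial dimensions are assumed so that the skip connection shapes match.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Encoder-decoder over two concatenated RGB reconstructions (sketch)."""

    def __init__(self, width=32):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(6, width, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(
            nn.Conv2d(width, 2 * width, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU())

    def forward(self, recon_t2, recon_t1):
        x = torch.cat([recon_t2, recon_t1], dim=1)  # 3 + 3 = 6 channels
        f1 = self.conv1(x)
        f2 = self.down(f1)
        # Skip connection: add low-level features to the upsampled path.
        return self.up(f2) + f1
```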

The reference frame generation module 420 includes a local convolution parameter generation network 421 and a weight generation network 422.

Local convolution parameter generation network 421:

Input:

(1) the sum of the outputs of convolution 2 and upsampling 2 of the feature extraction module 410, denoted γ_t;

(2) the long-term memory, denoted ε_t.

Input integration: (1) and (2) are concatenated along the channel dimension.

Output: local convolution coefficients with 51 channels and the same height and width as the input image.

Network parameters: as shown in Table 2 below:

Table 2

[Table 2 is provided as an image in the original publication and is not reproduced here.]

Weight generation network 422:

Input: the sum of the outputs of upsampling 5 and convolution 1.

Output: a weight matrix with 1 channel and the same height and width as the input image, namely M_{t-1} as marked in FIG. 4. In addition, M_{t-2} in FIG. 4 can be obtained by subtracting M_{t-1} from an all-ones matrix.

Network parameters: as shown in Table 3 below:

Table 3

[Table 3 is provided as an image in the original publication and is not reproduced here.]

The synthetic reference frame is generated (that is, the operation of the dashed box 423 in FIG. 4 is computed) according to the following formula (1):

Î_t = Σ_{i ∈ {t-2, t-1}} M_i ⊙ (K_i ⊛ I_i)    (1)

where I_i denotes the reconstructed frames of the two input video frames, ⊛ denotes the local convolution operation in deep learning, K_i denotes the set of convolution kernel coefficients of all pixels in the first video frame (that is, convolution kernel coefficients are generated separately for each pixel), and ⊙ denotes the pixel-level dot multiplication operation, which performs a pixel-level weighted addition of the results obtained by locally convolving the two input frames, with weight matrix M_i.

Therefore, the embodiment of the present application generates convolution kernel coefficients separately for each pixel, that is, it uses local convolution. Compared with applying the same regression to all pixels, this has stronger expressive power and can therefore achieve a better regression effect, which helps to generate more accurate predicted frames and thus improves video coding efficiency. Here, regression refers to the process of generating the synthetic reference frame; a sketch of this operation is given below.
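To make the local convolution of formula (1) concrete, the following is a hedged PyTorch sketch, not the patented implementation: the kernel size k = 5 is a guess, the 51-channel coefficient map from network 421 is assumed to pack two k×k per-pixel kernels (one per input frame), and each per-pixel kernel is assumed to be shared across the R, G and B channels.

# Sketch of formula (1): per-pixel ("local") convolution of the two previous
# reconstructed frames, blended with the complementary masks M_{t-1} and
# M_{t-2} = 1 - M_{t-1}. Kernel size and coefficient layout are assumptions.
import torch
import torch.nn.functional as F

def local_conv(frame, kernels, k=5):
    """frame: (N, 3, H, W); kernels: (N, k*k, H, W), one kernel per pixel."""
    n, c, h, w = frame.shape
    # Gather each pixel's k x k neighborhood: (N, 3*k*k, H*W).
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)
    patches = patches.view(n, c, k * k, h, w)
    # Weight every neighborhood by that pixel's own kernel and sum: K_i * I_i.
    return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (N, 3, H, W)

def synthesize_reference(rec_tm1, rec_tm2, coeffs, mask_tm1, k=5):
    """coeffs: (N, 2*k*k, H, W) from network 421; mask_tm1: (N, 1, H, W)."""
    k_tm1, k_tm2 = coeffs[:, :k * k], coeffs[:, k * k:]
    out_tm1 = local_conv(rec_tm1, k_tm1, k)
    out_tm2 = local_conv(rec_tm2, k_tm2, k)
    # Pixel-level weighted addition (the masked sum of formula (1)).
    return mask_tm1 * out_tm1 + (1.0 - mask_tm1) * out_tm2

The unfold call is what distinguishes local convolution from ordinary convolution: every pixel is filtered with its own kernel instead of a single shared one.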

Memory updating network 440:

Input:

(1) the difference R_t between the reconstructed frame I_t of frame t and the synthetic reference frame Î_t, with 3 channels and the same height and width as the input picture;

(2) the states h_{i-1} and c_{i-1} from the previous iteration of the LSTM.

Output: the ConvLSTM state at the next time step, namely h_i and c_i, each with 32 channels.

Network parameters: as shown in Table 4 below:

Table 4

[Table 4 is provided as an image in the original publication and is not reproduced here.]
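As background for Table 4, a single ConvLSTM update of the kind described above typically has the following form; the 3×3 gate convolutions and the exact wiring are assumptions, since the table is only available as an image.

# Sketch of a ConvLSTM cell for the memory update. Input: the residual R_t
# (3 channels); state: h, c (32 channels each). Gate kernel size is assumed.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch=3, hid_ch=32):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, r_t, h, c):
        z = self.gates(torch.cat([r_t, h], dim=1))
        i, f, o, g = z.chunk(4, dim=1)
        c_next = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_next = torch.sigmoid(o) * torch.tanh(c_next)
        return h_next, c_next  # updated state carried as the long-term memory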

The training of the neural network model in FIG. 4 is described below.

Exemplarily, the embodiment of the present application may use the Vimeo-90K dataset to train the neural network model. This dataset contains 89,800 video sequences, and each video sequence has 7 consecutive lossless video frames, which can be denoted {I_1, I_2, …, I_7}. First, the dataset can be processed, for example by applying lossy compression to every frame of each video sequence, to obtain degraded video frame sequences, denoted here as {Ĩ_1, Ĩ_2, …, Ĩ_7}, which simulate the image quality loss produced in a real encoding process. In addition, the neural network model also maintains a global long-term memory ε; during the prediction of the t-th video frame, the input global long-term memory is denoted ε_t.
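One simple way to construct such lossless/lossy training pairs is to round-trip each frame through a lossy codec. The patent does not name the codec, so the JPEG recompression below is purely an illustrative stand-in:

# Illustrative construction of a lossless/degraded frame pair. JPEG is used
# here only as a stand-in for the unspecified lossy compression.
from io import BytesIO
from PIL import Image

def degrade_frame(path, quality=25):
    """Return (lossless, degraded) RGB images for one video frame."""
    lossless = Image.open(path).convert("RGB")
    buf = BytesIO()
    lossless.save(buf, format="JPEG", quality=quality)  # lossy round-trip
    buf.seek(0)
    return lossless, Image.open(buf).convert("RGB")

# One Vimeo-90K-style sample: 7 consecutive frames of one sequence
# (the file layout here is hypothetical).
# pairs = [degrade_frame(f"sequence/im{k}.png") for k in range(1, 8)]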

The following is an example of the training process:

Step 1: From the degraded video frame sequence {Ĩ_1, Ĩ_2, …, Ĩ_7}, select the first two lossy frames, namely Ĩ_1 and Ĩ_2, set the initial memory ε_3 to 0, and predict the third frame. These three inputs are fed into the reference frame generation module, which computes the predicted frame Î_3 of the third frame. Next, the loss function between the predicted frame Î_3 and the lossless third frame I_3 is computed, the error is back-propagated, and the network parameters are updated. Then, the prediction error between the predicted frame Î_3 and the lossy frame Ĩ_3 is computed, and the long-term memory, initially 0, is updated from ε_3 to ε_4 for the prediction of the fourth frame.

Step 2: Take as input the two frames following the input frames of the previous forward pass of the network, together with the long-term memory after the previous update. For example, if the inputs of the previous iteration were Ĩ_t and Ĩ_{t+1}, then the inputs this time are the lossy frames Ĩ_{t+1} and Ĩ_{t+2}; meanwhile, after the previous forward pass, the long-term memory has already been updated to ε_{t+3}. Therefore, Ĩ_{t+1}, Ĩ_{t+2} and ε_{t+3} are fed into the network together and a forward pass is performed to obtain the predicted frame Î_{t+3}. The loss function between the predicted frame Î_{t+3} and the lossless frame I_{t+3} is computed, and the error is back-propagated to update the network parameters. Then, the prediction error between the predicted frame Î_{t+3} and the lossy frame Ĩ_{t+3} is computed, and the long-term memory is updated to ε_{t+4}.

Step 3: Repeat step 2 until all frames in the video sequence have been used. Since each video sequence in the training data has 7 consecutive frames, step 2 is repeated 4 times for each video sequence.

Step 4: Select other video sequences from the training data and repeat the above three steps until the neural network converges. The per-sequence loop of steps 1 to 3 is sketched below.
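Putting steps 1 to 3 together, the per-sequence loop can be sketched as follows. The model interfaces (initial_memory, predict, update_memory) and the L1 loss are assumptions for illustration; the patent does not specify the loss function.

# Sketch of the per-sequence training loop (steps 1-3). `model` bundles the
# networks of FIG. 4; its interface is invented for illustration.
import torch.nn.functional as F

def train_sequence(model, optimizer, lossless, lossy):
    """lossless, lossy: lists of 7 frame tensors, each of shape (1, 3, H, W)."""
    memory = model.initial_memory()          # epsilon_3 = 0 before frame 3
    for t in range(2, 7):                    # predict frames 3..7 (0-based 2..6)
        pred = model.predict(lossy[t - 2], lossy[t - 1], memory)
        loss = F.l1_loss(pred, lossless[t])  # loss vs. the lossless frame
        optimizer.zero_grad()
        loss.backward()                      # back-propagate, update parameters
        optimizer.step()
        residual = lossy[t] - pred.detach()  # prediction error vs. lossy frame
        memory = model.update_memory(memory, residual)  # memory for next frame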

FIG. 5 shows an example of the peak signal-to-noise ratio (PSNR) curves of the video coding solution of the embodiment of the present application. Exemplarily, the encoder of the present application can be tested on the FourPeople test sequence under the low-delay coding configuration. As shown in FIG. 5, compared with the conventional HEVC solution, the memory-augmented auto-regressive network (MAAR-Net) solution of the embodiment of the present application achieves a 10.6% BD-rate (Bjontegaard delta rate) gain. In FIG. 5, the abscissa is the bitrate in kbps, and the ordinate is the luma peak signal-to-noise ratio (Y-PSNR) in dB.
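For reference, BD-rate summarizes the average bitrate difference between two rate-distortion curves at equal quality. The standard Bjontegaard calculation, not specific to this patent, fits log-rate as a cubic polynomial of PSNR and integrates over the overlapping PSNR range:

# Standard Bjontegaard delta-rate between an anchor and a test codec, each
# given as matching lists of (bitrate in kbps, Y-PSNR in dB) measurements.
# A negative result means the test codec needs less bitrate at equal quality.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    la, lt = np.log10(rate_anchor), np.log10(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)         # log-rate as cubic in PSNR
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100           # percent bitrate change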

Therefore, in the embodiment of the present application, through deep learning, the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in a video sequence, together with the global long-term memory of the video sequence, are taken as the input of a neural network model, which outputs the synthetic reference frame of the first video frame; the first video frame is then encoded according to the synthetic reference frame. Since the synthetic reference frame is capable of describing the complex motion between the frames of a video sequence, the embodiment of the present application can generate more accurate predicted frames and improve video coding efficiency.

In one prior-art video coding scheme, in the process of using an LSTM network to generate a reference frame, only a fixed number of reconstructed frames preceding the currently coded video frame are fed into the network to generate the reference frame. Taking four input frames preceding the currently coded video frame as an example, each time one frame is input, the LSTM state is updated once, until all four reconstructed frames have been input, that is, the LSTM state has been updated four times; the network finally outputs the generated reference frame of the currently coded video frame. However, in this reference frame generation procedure, only the information of the four input reconstructed frames is included, and the information contained in the remaining already-encoded reconstructed frames is not exploited. In addition, each reference frame generation process is independent, that is, the memory of the network is reset before each reference frame is generated, which blocks the transfer of long-term temporal information and therefore prevents the generation of accurate predicted frames.

In the embodiment of the present application, by contrast, the global long-term memory is maintained from the start of the encoding of the video sequence, and its temporal span may be as long as hundreds of frames. For example, from the encoding of the first frame to the encoding of the last frame, the global long-term memory is dynamically updated throughout, which ensures that the input of the network covers a sufficiently long temporal span; the temporal information can thus be fully exploited to generate more accurate predicted frames and improve the efficiency of video coding. The encoding loop implied by this design is sketched below.
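A minimal sketch of that loop follows; encoder and model stand for a block-based encoder and the networks of FIG. 4, and every interface here is an assumption made for illustration.

# Sketch of the encode loop: the global long-term memory is created once and
# carried across the whole sequence, never reset per frame. All interfaces
# are placeholders.
def encode_sequence(frames, model, encoder):
    memory = model.initial_memory()                      # epsilon = 0
    recon = [encoder.encode(frames[0]), encoder.encode(frames[1])]
    for t in range(2, len(frames)):
        synth = model.predict(recon[t - 2], recon[t - 1], memory)
        rec_t = encoder.encode(frames[t], reference=synth)
        memory = model.update_memory(memory, rec_t - synth)  # residual R_t
        recon.append(rec_t)
    return recon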

It should be noted that, on a video playback device, the solution provided by the embodiments of the present application may be implemented in the form of a hardware chip or in the form of software code, which is not limited in the embodiments of the present application.

An embodiment of the present application further provides an apparatus for video encoding; refer to FIG. 6. Exemplarily, the video encoding apparatus 600 may be a video storage and playback device. In this embodiment of the present application, the apparatus 600 may include an acquiring unit 610, a generating unit 620 and an encoding unit 630.

The acquiring unit 610 is configured to acquire the reconstructed frames of a fixed number of second video frames preceding the first video frame to be encoded in a video sequence.

The generating unit 620 is configured to generate a synthetic reference frame of the first video frame according to the reconstructed frames of the second video frames and the global long-term memory of the video sequence, where the global long-term memory is determined according to the reconstructed frame of each video frame and the synthetic reference frame of each video frame among a plurality of video frames preceding the first video frame in the video sequence.

The encoding unit 630 is configured to encode the first video frame according to the synthetic reference frame of the first video frame.

In some possible implementations, the apparatus further includes:

an updating unit, configured to acquire the reconstructed frame of the first video frame and update the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthetic reference frame of the first video frame.

In some possible implementations, the acquiring unit 610 is further configured to extract feature information of the reconstructed frames of the second video frames;

where the generating unit 620 is specifically configured to input the feature information of the reconstructed frames of the second video frames and the global long-term memory into a reference frame generation network model to acquire the synthetic reference frame of the first video frame, where the reference frame generation network model is obtained by training with a training data sample set, the training data sample set includes a plurality of video sequence samples, and each video sequence sample includes lossless video frames as well as video frames obtained by applying lossy compression to the lossless video frames.

In some possible implementations, the generating unit 620 is specifically configured to acquire the synthetic reference frame of the first video frame according to the following formula:

Î_t = Σ_{i ∈ {t-2, t-1}} M_i ⊙ (K_i ⊛ I_i)

where Î_t denotes the synthetic reference frame of the first video frame, t denotes the frame number of the first video frame, i denotes the frame number of a second video frame preceding the first video frame, I_i denotes the reconstructed frame of the i-th video frame, ⊛ denotes the local convolution operation in deep learning, K_i denotes the set of convolution kernel coefficients of all pixels in the first video frame, ⊙ denotes the pixel-level dot multiplication operation, which performs a pixel-level weighted addition of the results obtained by locally convolving the input frames, with weight matrix M_i, ε_t denotes the global long-term memory when the first video frame is encoded, and γ_t denotes the feature information of the reconstructed frames of the second video frames; t and i are positive integers, and t > 2.

In some possible implementations, the second video frames include the two video frames immediately preceding the first video frame, where the first video frame is the third video frame in the video sequence or a video frame after the third video frame.

In some possible implementations, when the first video frame is the third video frame in the video sequence, the global long-term memory is 0.

In some possible implementations, the encoding unit 630 is specifically configured to: acquire a reference frame list of the first video frame, where the reference frame list includes the reconstructed frames of at least two already-encoded video frames; remove from the reference frame list the reconstructed frame corresponding to the frame number whose difference from the frame number of the first video frame is largest, and add the synthetic reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list; and encode the first video frame according to the reference frame list. This list update is sketched below.
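As an illustration of this list management policy (the data structures below are invented for illustration, not the patented implementation):

# Sketch of the reference frame list update: the reconstructed frame whose
# frame number is farthest from the current frame is replaced, in place, by
# the synthetic reference frame.
def update_reference_list(ref_list, current_num, synth_frame):
    """ref_list: list of (frame_number, frame) pairs for encoded frames."""
    idx = max(range(len(ref_list)),
              key=lambda k: abs(current_num - ref_list[k][0]))
    ref_list[idx] = (current_num, synth_frame)  # reuse the freed position
    return ref_list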

FIG. 7 is a schematic diagram of the hardware structure of an apparatus 700 for video encoding according to an embodiment of the present application. The apparatus 700 shown in FIG. 7 can be regarded as a computer device. The apparatus 700 can serve as an implementation of the apparatus for video encoding of the embodiments of the present application, and can also serve as an implementation of the method for video encoding of the embodiments of the present application. The apparatus 700 includes a processor 701, a memory 702, an input/output interface 703 and a bus 705, and may further include a communication interface 704. The processor 701, the memory 702, the input/output interface 703 and the communication interface 704 are communicatively connected to one another through the bus 705.

The processor 701 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, configured to execute related programs so as to implement the functions required to be performed by the modules in the apparatus for processing media data of the embodiments of the present application, or to execute the method for processing media data of the method embodiments of the present application. The processor 701 may be an integrated circuit chip with signal processing capability. In an implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 702; the processor 701 reads the information in the memory 702 and, in combination with its hardware, completes the functions required to be performed by the modules included in the apparatus for processing media data of the embodiments of the present application, or executes the method for processing media data of the method embodiments of the present application.

The memory 702 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 702 may store an operating system and other application programs. When the functions required to be performed by the modules included in the apparatus for processing media data of the embodiments of the present application are implemented by software or firmware, or when the method for processing media data of the method embodiments of the present application is executed, the program code for implementing the technical solutions provided by the embodiments of the present application is stored in the memory 702, and the processor 701 performs the operations required to be performed by the modules included in the apparatus for processing media data, or executes the method for processing media data provided by the method embodiments of the present application.

The input/output interface 703 is configured to receive input data and information, and to output data such as operation results.

The communication interface 704 uses a transceiving apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 700 and other devices or communication networks. It can serve as an acquiring module or a sending module in the processing apparatus.

The bus 705 may include a path for transferring information between the various components of the apparatus 700 (for example, the processor 701, the memory 702, the input/output interface 703 and the communication interface 704).

It should be noted that although the apparatus 700 shown in FIG. 7 only shows the processor 701, the memory 702, the input/output interface 703, the communication interface 704 and the bus 705, in a specific implementation process, those skilled in the art should understand that the apparatus 700 also includes other components necessary for normal operation, for example, a display for displaying the video data to be played. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 700 may further include hardware components implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 700 may also include only the components necessary for implementing the embodiments of the present application, and does not have to include all the components shown in FIG. 7.

Embodiments of the present application further provide a computer-readable storage medium. The computer-readable storage medium includes a computer program which, when run on a computer, causes the computer to execute the method provided by the above method embodiments.

Embodiments of the present application further provide a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to execute the method provided by the above method embodiments.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

It should be understood that the terms "first", "second" and the like appearing in the embodiments of the present application are only used to illustrate and distinguish the described objects; they imply no order, nor do they represent any special limitation on the number of devices in the embodiments of the present application, and they cannot constitute any limitation on the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Changes or substitutions that any person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of video encoding, comprising:
acquiring reconstructed frames of a fixed number of second video frames before a first video frame to be coded in a video sequence;
generating a composite reference frame of the first video frame from a reconstructed frame of the second video frame and a global long-term memory of the video sequence, wherein the global long-term memory is determined from a reconstructed frame of each video frame and a composite reference frame of each video frame in a plurality of video frames preceding the first video frame in the video sequence;
encoding the first video frame according to the synthesized reference frame of the first video frame.
2. The method of claim 1, further comprising:
acquiring a reconstructed frame of the first video frame;
and updating the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
3. The method of claim 1 or 2, further comprising:
extracting characteristic information of a reconstructed frame of the second video frame;
wherein the generating a synthesized reference frame for the first video frame from the reconstructed frame for the second video frame and the global long-term memory for the video sequence comprises:
inputting the characteristic information of the reconstructed frame of the second video frame and the global long-term memory into a reference frame generation network model, and acquiring a synthesized reference frame of the first video frame, wherein the reference frame generation network model is obtained by training with a training data sample set, the training data sample set comprises a plurality of video sequence samples, and each video sequence sample comprises a lossless video frame and a video frame acquired by performing lossy compression on the lossless video frame.
4. The method of claim 3, wherein the inputting the characteristic information of the reconstructed frame of the second video frame and the global long-term memory into the reference frame generation network model, and acquiring the synthesized reference frame of the first video frame comprises:
obtaining a synthesized reference frame of the first video frame according to the following formula:
Î_t = Σ_{i ∈ {t-2, t-1}} M_i ⊙ (K_i ⊛ I_i)
wherein Î_t represents the composite reference frame of the first video frame, t represents the frame number of the first video frame, i represents the frame number of a second video frame preceding the first video frame, I_i represents the reconstructed frame of the i-th frame of video, ⊛ represents a local convolution operation in deep learning, K_i represents the set of convolution kernel coefficients of all pixels in the first video frame, ⊙ represents a pixel-level dot multiplication operation for performing a pixel-level weighted addition of the results obtained by performing local convolution on the input frames, with a weight matrix of M_i, ε_t represents the global long-term memory when the first video frame is encoded, and γ_t represents the characteristic information of the reconstructed frame of the second video frame; t and i are positive integers, and t > 2.
5. The method according to any of claims 1-4, wherein the second video frames comprise the two video frames immediately preceding the first video frame, and wherein the first video frame is a third video frame in the video sequence or a video frame subsequent to the third video frame.
6. The method of claim 5, wherein the global long term memory is 0 if the first video frame is a third video frame in the video sequence.
7. The method according to any of claims 1-6, wherein said encoding said first video frame based on a synthesized reference frame of said first video frame comprises:
acquiring a reference frame list of the first video frame, wherein the reference frame list comprises reconstructed frames of at least two video frames which have been encoded;
removing, from the reference frame list, the reconstructed frame corresponding to the frame number having the largest difference from the frame number of the first video frame, and adding the synthesized reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list;
encoding the first video frame according to the reference frame list.
8. An apparatus for video encoding, comprising:
an acquiring unit, configured to acquire reconstructed frames of a fixed number of second video frames before a first video frame to be coded in a video sequence;
a generating unit, configured to generate a synthesized reference frame of the first video frame according to a reconstructed frame of the second video frame and a global long-term memory of the video sequence, wherein the global long-term memory is determined according to a reconstructed frame of each video frame and a synthesized reference frame of each video frame in a plurality of video frames before the first video frame in the video sequence;
and an encoding unit, configured to encode the first video frame according to the synthesized reference frame of the first video frame.
9. The apparatus of claim 8, further comprising:
an updating unit, configured to acquire the reconstructed frame of the first video frame and update the global long-term memory according to the difference between the reconstructed frame of the first video frame and the synthesized reference frame of the first video frame.
10. The apparatus according to claim 8 or 9, wherein the obtaining unit is further configured to:
extract characteristic information of a reconstructed frame of the second video frame;
wherein the generating unit is specifically configured to:
input the characteristic information of the reconstructed frame of the second video frame and the global long-term memory into a reference frame generation network model, and acquire a synthesized reference frame of the first video frame, wherein the reference frame generation network model is obtained by training with a training data sample set, the training data sample set comprises a plurality of video sequence samples, and each video sequence sample comprises a lossless video frame and a video frame acquired by performing lossy compression on the lossless video frame.
11. The apparatus according to claim 10, wherein the generating unit is specifically configured to:
obtain a synthesized reference frame of the first video frame according to the following formula:
Î_t = Σ_{i ∈ {t-2, t-1}} M_i ⊙ (K_i ⊛ I_i)
wherein Î_t represents the composite reference frame of the first video frame, t represents the frame number of the first video frame, i represents the frame number of a second video frame preceding the first video frame, I_i represents the reconstructed frame of the i-th frame of video, ⊛ represents a local convolution operation in deep learning, K_i represents the set of convolution kernel coefficients of all pixels in the first video frame, ⊙ represents a pixel-level dot multiplication operation for performing a pixel-level weighted addition of the results obtained by performing local convolution on the input frames, with a weight matrix of M_i, ε_t represents the global long-term memory when the first video frame is encoded, and γ_t represents the characteristic information of the reconstructed frame of the second video frame; t and i are positive integers, and t > 2.
12. The apparatus according to any of claims 8-11, wherein the second video frames comprise the two video frames immediately preceding the first video frame, and wherein the first video frame is a third video frame in the video sequence or a video frame subsequent to the third video frame.
13. The apparatus of claim 12, wherein the global long term memory is 0 if the first video frame is a third video frame in the video sequence.
14. The apparatus according to any of claims 8-13, wherein the encoding unit is specifically configured to:
acquiring a reference frame list of the first video frame, wherein the reference frame list comprises reconstructed frames of at least two video frames which have been encoded;
removing, from the reference frame list, the reconstructed frame corresponding to the frame number having the largest difference from the frame number of the first video frame, and adding the synthesized reference frame of the first video frame at the position of the removed reconstructed frame in the reference frame list;
encoding the first video frame according to the reference frame list.
15. An apparatus of video encoding, comprising: a processor and a memory, wherein the memory is configured to store instructions and the processor is configured to execute the instructions, and when the processor executes the instructions stored by the memory, the apparatus for video encoding is configured to perform the method of any of claims 1-7.
CN202010358452.2A 2020-04-29 2020-04-29 Method and apparatus for video coding Pending CN113573076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358452.2A CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010358452.2A CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video coding

Publications (1)

Publication Number Publication Date
CN113573076A (en) 2021-10-29

Family ID: 78158614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358452.2A Pending CN113573076A (en) 2020-04-29 2020-04-29 Method and apparatus for video coding

Country Status (1)

Country Link
CN (1) CN113573076A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070199011A1 (en) * 2006-02-17 2007-08-23 Sony Corporation System and method for high quality AVC encoding
CN101272494A (en) * 2008-01-25 2008-09-24 浙江大学 Video encoding and decoding method and device using synthetic reference frame
US20170111652A1 (en) * 2015-10-15 2017-04-20 Cisco Technology, Inc. Low-complexity method for generating synthetic reference frames in video coding
US20190289321A1 (en) * 2016-11-14 2019-09-19 Google Llc Video Frame Synthesis with Deep Learning
CN109101858A (en) * 2017-06-20 2018-12-28 北京大学 Action identification method and device
US20190297326A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Video prediction using spatially displaced convolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANG Yan et al.: "Rate allocation algorithm for H.264 SVC spatial-temporal scalable coding", Journal of Peking University (Natural Science Edition), no. 05, pages 17-27 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115834905A (en) * 2023-02-09 2023-03-21 北京大学 Inter-frame prediction method, device, electronic equipment and medium
CN115834905B (en) * 2023-02-09 2023-04-11 北京大学 Inter-frame prediction method, device, electronic equipment and medium
CN118736382A (en) * 2024-07-03 2024-10-01 南京策之汉安全科技有限公司 Big data analysis system using deep neural network

Similar Documents

Publication Publication Date Title
US20200145692A1 (en) Video processing method and apparatus
US11436710B2 (en) Method and device for mapping LDR video into HDR video
CN107105278B (en) The video coding and decoding system that motion vector automatically generates
WO2023005140A1 (en) Video data processing method, apparatus, device, and storage medium
CN111008938B (en) A real-time multi-frame bit enhancement method based on content and continuity guidance
CN105144231B (en) Method and apparatus for selecting image dynamic range conversion operator
EP4300958A1 (en) Video image encoding method, video image decoding method and related devices
CN112235569B (en) Quick video classification method, system and device based on H264 compressed domain
CN112465698A (en) Image processing method and device
JP7589427B2 (en) Video frame compression method, video frame expansion method and device
CN111800629A (en) Video decoding method, encoding method, and video decoder and encoder
CN110827380A (en) Image rendering method and device, electronic equipment and computer readable medium
CN113573076A (en) Method and apparatus for video coding
KR20120049881A (en) Vector embedded graphics coding
WO2023246926A1 (en) Model training method, video encoding method, and video decoding method
CN114554205A (en) Image coding and decoding method and device
CN112261417B (en) Video pushing method and system, equipment and readable storage medium
JP5087405B2 (en) Initializing bit-accurate seeds of pseudo-random number generators used in video systems
CN117376605A (en) Frame inserting method, sending card and playing system based on deep learning
JP7574521B2 (en) Method and apparatus for hierarchical audio/video or image compression - Patents.com
KR20240064698A (en) Feature map encoding and decoding method and device
CN115442617A (en) Video processing method and device based on video coding
US10034007B2 (en) Non-subsampled encoding techniques
CN112738522A (en) Video coding method and device
CN118175320B (en) Video frame processing method, device and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 20211029)