CN110290386B - Low-bit-rate human motion video coding system and method based on a generative adversarial network - Google Patents

Low-bit-rate human motion video coding system and method based on a generative adversarial network

Info

Publication number
CN110290386B
CN110290386B (application CN201910479249.8A)
Authority
CN
China
Prior art keywords
characteristic
memory
coding
module
bone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910479249.8A
Other languages
Chinese (zh)
Other versions
CN110290386A (en
Inventor
陈志波
吴耀军
何天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910479249.8A priority Critical patent/CN110290386B/en
Publication of CN110290386A publication Critical patent/CN110290386A/en
Application granted granted Critical
Publication of CN110290386B publication Critical patent/CN110290386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/65Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using error resilience
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network. The invention makes full use of the content structure information of human motion video, decomposing the video content information into two parts: memory features, which contain the global appearance attribute information, and skeleton features, which express the human motion information. An attention mechanism and a generative adversarial network are then used to achieve more efficient low-bit-rate video compression.

Description

Low-bit-rate human motion video coding system and method based on a generative adversarial network
Technical Field
The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network, and belongs to the technical field of video image coding and compression.
Background
The traditional hybrid video coding frameworks (MPEG-2, H.264, H.265, etc.) decompose video compression into four basic steps: prediction, transform, quantization and entropy coding. Within this coding framework, redundant information between adjacent video frames is removed mainly by motion compensation on fixed-size blocks. Through years of development, the performance of traditional encoders has improved continuously. With the rise of deep learning, models that implement video compression coding with deep learning have been proposed. Unlike traditional coding, deep-learning-based video coding can learn better transforms. In addition, an end-to-end deep learning model can train itself, with all modules adjusted automatically toward a unified optimization objective.
The above coding schemes all take pixel-level fidelity as the optimization target, and the structural information of the video sequence content is not fully exploited in the coding process. For example, in surveillance video applications in the public security field, where person identification is very important, a large number of videos containing human body motion need to be encoded and then analyzed at a later stage; structurally, such videos actually consist of two parts, global appearance attribute information and human body motion information. The present invention therefore considers using the structural information of human motion video to further improve coding efficiency.
Disclosure of Invention
The invention solves the following problem: overcoming the defects of the prior art, it provides a human motion video coding system and method based on a generative adversarial network that make full use of the structure information of human motion video content, decompose the video content information into two parts, memory features containing the global appearance attribute information and skeleton features expressing the human motion information, and use an attention mechanism and a generative adversarial network to achieve more efficient low-bit-rate video compression.
The technical scheme of the invention is as follows: a low-bit-rate human motion video coding system based on a generative adversarial network, which integrates structure information extraction, structure information fusion and feature information coding/decoding for human motion video content, and comprises a memory feature extraction module, a memory feature coding and decoding module, a skeleton feature extraction module, a skeleton feature coding and decoding module, a recall attention module and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order; the output obtained after the last video frame has been input is the memory feature;
the memory feature coding and decoding module comprises a memory feature coding module and a memory feature decoding module; the memory feature is input into the memory feature coding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then applies memory feature entropy coding to the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features containing the position information of the key nodes of the human body; the skeleton features are the positions of the key human body nodes and joints, the key nodes including the head, hands, feet and body; to improve coding efficiency, the skeleton features are input into the skeleton feature coding and decoding module for further coding and compression;
the skeleton feature coding and decoding module codes and decodes the skeleton features of the coded video frames; the skeleton features are input into the skeleton feature coding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then applies skeleton feature entropy coding to the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first applies skeleton feature entropy decoding to the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to the residual reconstruction to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses, using an attention mechanism, the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function of the generation network, i.e. to optimize the training of the generator.
In the memory feature extraction module, the memory feature contains the global appearance attribute information of the consecutive frames of a group of pictures; a group of pictures is a set of consecutive video frames of a certain length; the global appearance attribute information of the consecutive frames is the video appearance attribute information extracted by inputting the frames of the group of pictures into the convolutional recurrent neural network, and includes the appearance of the background within the video and the appearance attributes of the facial features and clothing of the people in the scene.
The structure of the convolutional recurrent neural network in the memory feature extraction module is as follows: it comprises a convolutional recurrent layer, which establishes temporal relations like a recurrent neural network while describing local spatial features like a convolutional neural network; its number of channels is 128, its kernel size is 5 and its stride is 2.
In the memory feature coding and decoding module, the memory feature coding module comprises two parts: quantization of the memory feature and entropy coding of the quantized feature. In memory feature quantization, each feature value of the memory feature is quantized separately with an existing quantization method to obtain the quantized memory feature. Memory feature entropy coding applies an existing entropy coding method to the quantized memory feature, obtaining the code stream of the memory feature part for transmission. The memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts: memory feature entropy decoding processes the transmitted code stream of the memory feature part with the entropy decoding method corresponding to the entropy coding, obtaining the quantized memory feature reconstruction, and the inverse quantization operation, the inverse of memory feature quantization, then yields the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame.
The skeleton feature coding and decoding module comprises a skeleton feature compression coding part and a skeleton feature decoding part. The coding part comprises predictive coding and entropy coding. Predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the same node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node of the current frame is computed, and the resulting residual is passed to skeleton feature entropy coding. Skeleton feature entropy coding is an entropy coder for skeleton features: the residual obtained from skeleton feature predictive coding is input into the entropy coder and entropy-coded with an existing entropy coding method, obtaining the code stream information of the skeleton features. The skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature coding by means of entropy decoding and predictive decoding: skeleton feature entropy decoding applies the corresponding existing entropy decoding method to the skeleton feature code stream, recovering the residual produced by skeleton feature predictive coding, and skeleton feature predictive decoding adds the node coordinate of the previous moment to the residual obtained by entropy decoding, yielding the decoded skeleton feature reconstruction.
In the skeleton feature extraction module, the human pose estimation network is a PAF network with a two-branch structure. One branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the joint confidence maps, which express the position of each detected node and the confidence that the node belongs to a human body node. The other branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the PAFs (part affinity fields). The PAFs are a set of 2D vector fields, each of which encodes the position and orientation of a limb; they are learned and predicted jointly with the joint confidence maps.
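As an illustrative sketch, assuming PyTorch and treating the channel widths, ReLU activations and limb count as assumptions (the text above fixes only the kernel sizes and layer counts, and 18 body nodes are used elsewhere in this document), such a two-branch head could be written as:

    import torch.nn as nn

    def branch(in_ch, out_ch, width=128):
        # Three 3x3 convolutions followed by two 1x1 convolutions.
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(inplace=True)]
            ch = width
        layers += [nn.Conv2d(ch, width, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(width, out_ch, 1)]
        return nn.Sequential(*layers)

    class PAFHead(nn.Module):
        def __init__(self, in_ch, n_joints=18, n_limbs=19):
            super().__init__()
            self.conf = branch(in_ch, n_joints)    # joint confidence maps
            self.paf = branch(in_ch, 2 * n_limbs)  # one 2D vector field per limb

        def forward(self, features):
            # Both branches are predicted jointly from shared backbone features.
            return self.conf(features), self.paf(features)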
The low-bit-rate human motion video coding method based on a generative adversarial network according to the invention comprises the following steps:
(1) the consecutive frames of a group of pictures are input in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) the memory feature is input into the memory feature coding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) the video frame to be coded is input into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) the skeleton features are input into the skeleton feature coding module, where predictive coding and entropy coding of the skeleton features yield the code stream of the skeleton feature part for output;
(5) the code stream of the memory feature part is input into the memory feature decoding module, which applies memory feature entropy decoding and then memory feature inverse quantization to obtain the memory feature reconstruction; meanwhile, the skeleton feature code stream is input into the skeleton feature decoding module, which applies entropy decoding and then predictive decoding to obtain the skeleton feature reconstruction;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input into the recall attention module, which uses an attention mechanism to obtain the fused feature combining the two parts of information;
(7) the fused feature is input as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and this score is used to train the generator.
Compared with the prior art, the invention has the following advantages:
(1) unlike traditional video coding based on blocks and pixel-level distortion, the method decomposes the video information into memory features and skeleton features for the first time and makes full use of the video structure information, thereby further improving coding efficiency;
(2) the invention reconstructs the video frames with a generative adversarial network, which can recover in a generative manner the information lost during coding, thereby improving subjective quality;
(3) for the processing of the decomposed features, the invention is the first to propose fusing the features obtained from the video decomposition in a recall attention module; the memory feature reconstruction and the skeleton feature reconstruction are fused effectively using an attention mechanism, ensuring that the video frame can be recovered and reconstructed from this information.
Drawings
FIG. 1 is a general block diagram of the system of the present invention;
FIG. 2 is a block diagram of a memory feature codec module according to the present invention;
FIG. 3 is a block diagram of the skeleton feature coding and decoding module according to the present invention;
FIG. 4 is a block diagram of a recall attention module according to the present invention;
FIG. 5 is a subjective quality comparison of the present invention with a conventional coding method.
Detailed Description
The technical scheme of the invention is described clearly and completely below with reference to the memory feature coding/decoding method, the skeleton feature coding/decoding method, the recall attention module and the generative adversarial network implementation of the invention. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the present invention without inventive effort fall within the protection scope of the present invention.
As shown in FIG. 1, the coding framework of the present invention includes the following modules: memory feature extraction, memory feature coding/decoding, skeleton feature extraction, skeleton feature coding/decoding, the recall attention module and the generative adversarial network. Wherein:
the memory feature extraction network is used for extracting memory features containing image group continuous frame global appearance attribute information, and the main structure of the memory feature extraction network is a convolution cyclic neural network. The method comprises a convolution cycle layer, wherein the convolution cycle layer establishes a time sequence relation similar to a cyclic neural network, and can describe local spatial features like the convolution neural network. The number of channels of the convolution cycle layer is 128, the kernel size is 5, and the step size is 2.
Memory feature coding and decoding mainly comprises a memory feature coding part and a memory feature decoding part. The coding part mainly comprises quantization of the memory feature and entropy coding of the quantized feature, and the decoding part comprises the corresponding entropy decoding and inverse quantization. The entropy coder/decoder is typically an arithmetic coder/decoder, and quantization typically uses scalar quantization.
In the embodiment of the present invention, as shown in FIG. 2, the memory feature is first quantized with scalar quantization, and entropy coding is then used to further remove redundancy and obtain the final code stream. To further remove the redundancy of the quantized memory feature and improve coding efficiency, the embodiment uses a hyper-prior network to model the probability of each value of the quantized memory feature, and entropy-codes the quantized memory feature according to the predicted probability distribution.
To obtain the probability of a specific memory feature, the memory feature is input into the hyper-prior network to extract a variable z, which is then transmitted as side information via arithmetic coding. At the encoding and decoding ends, following existing deep-learning-based image coding methods, the probability distribution of the memory feature is modeled as a Gaussian distribution, and the variable z is used to predict the mean and variance of that distribution.
In addition, the quantization of the memory feature is integrated into the network of the memory feature coding module, and standard scalar quantization is not differentiable, which would make the network untrainable; therefore, uniform noise is used to replace the standard scalar quantization process during training. The specific formula is:

m̂ = m + ε

where m is the memory feature before quantization, m̂ is the quantized memory feature, and ε is noise uniformly distributed between −1/2 and 1/2.
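A minimal PyTorch sketch of this training-time surrogate, together with rate estimation under the Gaussian model predicted from the hyper-prior, is given below; the function names and the bin-probability computation follow common practice in learned compression and are assumptions rather than the exact implementation of this embodiment:

    import torch

    def quantize(m, training):
        if training:
            # Replace rounding by additive noise uniform on (-1/2, 1/2),
            # keeping the operation differentiable.
            return m + torch.empty_like(m).uniform_(-0.5, 0.5)
        return torch.round(m)  # plain scalar quantization at test time

    def rate_bits(m_hat, mu, sigma):
        # mu, sigma: mean and standard deviation of the Gaussian model,
        # predicted from the hyper-prior variable z.
        gauss = torch.distributions.Normal(mu, sigma)
        # Probability mass of the quantization bin [m_hat - 1/2, m_hat + 1/2].
        p = gauss.cdf(m_hat + 0.5) - gauss.cdf(m_hat - 0.5)
        return (-torch.log2(p.clamp_min(1e-9))).sum()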
Skeleton feature extraction extracts the skeleton information that expresses the human motion in the current coded video frame, whose main form is the positions of the key nodes of the human body. Using the existing PAFs human pose estimation network, this module outputs skeleton features containing the position information of the key human body nodes.
Skeleton feature coding and decoding mainly comprises skeleton feature compression coding and decoding. The coding part exploits the temporal redundancy of the skeleton features: predictive coding is applied first, and the residual between the predicted value and the true value is input into an entropy coder to obtain the code stream output. The decoding part reconstructs the skeleton features through entropy decoding and predictive decoding. The entropy codec is typically an arithmetic codec or similar.
The recall attention module uses an attention mechanism to fuse the memory feature information and the skeleton feature information: on the basis of the memory feature containing the global appearance attribute information, the skeleton feature containing the human motion information of the specific frame is combined to obtain the fused feature information.
The specific implementation of the recall attention module in the embodiment of the present invention is shown in FIG. 4. The attention mechanism used in the embodiment can be expressed as a function of a query and key-value pairs. The corresponding formula is:

R(Q, K, V) = [W V^T + Q, V]

where Q is the query matrix, K is the key matrix, V is the value matrix, R is the fused feature information, and [·, ·] denotes concatenation. The matrix W is obtained from the query matrix Q and the key matrix K; the specific calculation formula is:

W = Q K^T
The key matrix K and the value matrix V are the features obtained by convolving the memory feature reconstruction m̂ with two different convolution kernels of size 1, with the number of channels equal to the input and stride 1, while the query matrix Q is the feature obtained by convolving the skeleton feature reconstruction with a convolution kernel of size 1, with the number of channels equal to the input and stride 1.
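The following PyTorch sketch illustrates one reading of these formulas; the flattening of spatial positions into matrix rows, the equal spatial sizes of the two reconstructions, and the placement of the transposes are assumptions made so that the shapes are consistent:

    import torch
    import torch.nn as nn

    class RecallAttention(nn.Module):
        def __init__(self, ch):
            super().__init__()
            # 1x1 convolutions, channel count equal to the input, stride 1.
            self.q = nn.Conv2d(ch, ch, 1)  # query from the skeleton reconstruction
            self.k = nn.Conv2d(ch, ch, 1)  # key from the memory reconstruction
            self.v = nn.Conv2d(ch, ch, 1)  # value from the memory reconstruction

        def forward(self, skel_rec, mem_rec):
            b, c, h, w = skel_rec.shape
            q = self.q(skel_rec).flatten(2).transpose(1, 2)  # (B, HW, C)
            k = self.k(mem_rec).flatten(2).transpose(1, 2)   # (B, HW, C)
            v = self.v(mem_rec).flatten(2).transpose(1, 2)   # (B, HW, C)
            w_mat = q @ k.transpose(1, 2)                    # W = Q K^T, (B, HW, HW)
            fused = torch.cat([w_mat @ v + q, v], dim=-1)    # [W V^T + Q, V]
            return fused.transpose(1, 2).reshape(b, 2 * c, h, w)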
In the embodiment of the present invention, the skeleton feature codec compresses the position coordinates of 18 nodes of the human body; its basic structure is shown in FIG. 3.
For the coordinate information of each body node, the embodiment uses the coordinate of the node at the previous moment as the predicted value, and the residual between the coordinate of the node at the current moment and the predicted value is the information actually coded. The residual information is further compressed with the adaptive arithmetic coding commonly used in traditional coding schemes, and the resulting code stream is the final code stream of the skeleton feature part. Decoding is similar: the residual information is first obtained by arithmetic decoding, and the skeleton feature reconstruction is then obtained by predictive decoding.
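A minimal Python sketch of this predictive coding loop follows; encode_residuals and decode_residuals stand in for the adaptive arithmetic coder and are hypothetical names:

    def encode_skeleton(prev_nodes, cur_nodes, encode_residuals):
        # prev_nodes / cur_nodes: 18 (x, y) coordinates; the previous
        # frame's coordinates serve as the prediction, so only the
        # residuals are entropy-coded.
        residuals = [(cx - px, cy - py)
                     for (px, py), (cx, cy) in zip(prev_nodes, cur_nodes)]
        return encode_residuals(residuals)

    def decode_skeleton(prev_nodes, bitstream, decode_residuals):
        # Predictive decoding: adding each residual back to the previous
        # coordinate gives a lossless skeleton feature reconstruction.
        residuals = decode_residuals(bitstream)
        return [(px + dx, py + dy)
                for (px, py), (dx, dy) in zip(prev_nodes, residuals)]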
The generative adversarial network of the embodiment of the invention comprises a generator and a discriminator. The generator G takes the fused feature representation output by the recall attention module as conditional input to obtain the video frame reconstruction; the generator network used by the invention is a pix2pixHD network. The discriminator of the embodiment comprises two parts, a spatial discriminator D_I and a temporal discriminator D_V, and the network structure of the discriminators is a VGG network. The input of the spatial discriminator D_I is the skeleton feature S of the current frame together with the generated frame information or the real frame information X; it is used to judge whether the generated video frame is close to a real natural image. The input of the temporal discriminator D_V is the skeleton features S of the current and previous frames together with the generated frame information or the real frame information; its main purpose is to ensure continuity between adjacent frames of the generated video. Together, the two discriminators form the adversarial loss l_adv of the network, calculated as:

l_adv = Σ_{D ∈ {D_I, D_V}} ( E_{(s_t, x_t)}[log D(s_t, x_t)] + E_{s_t}[log(1 − D(s_t, G(s_t)))] )

where s_t is the skeleton feature information at time t, x_t is the video frame at time t, G is the generator of the generative adversarial network, and E denotes the expectation.
The loss function of the generative adversarial network G is designed as follows:

l = l_adv + λ_comp · l_comp + λ_fm · l_fm + λ_VGG · l_VGG

where l_adv is the adversarial loss corresponding to the two discriminators, l_comp is the code stream size of the memory feature, and l_fm and l_VGG are the feature matching loss and the VGG network perceptual loss adopted with reference to existing work on generative adversarial networks; λ_comp, λ_fm and λ_VGG are the weights of the corresponding losses. In the embodiment of the invention, the weights are set to λ_comp = 1, λ_fm = 10 and λ_VGG = 10.
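As a short sketch (plain Python; the individual loss terms are assumed to be computed elsewhere), the combined generator objective is simply the weighted sum above:

    def generator_loss(l_adv, l_comp, l_fm, l_vgg,
                       lam_comp=1.0, lam_fm=10.0, lam_vgg=10.0):
        # Weights follow the embodiment: lambda_comp=1, lambda_fm=10, lambda_VGG=10.
        return l_adv + lam_comp * l_comp + lam_fm * l_fm + lam_vgg * l_vgg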
The invention structurally decomposes human motion video into memory information and skeleton information and makes full use of the structure information of the video content, thereby achieving quality superior to traditional video coding methods (H.264 and H.265).
The coding performance of the invention was tested on the KTH data set and the APE data set. For the KTH data set, 8 video sequences were randomly extracted as the test set. For the APE data set, 7 video sequences were randomly extracted for testing. The specific performance results can be seen in FIG. 5 and Table 1.
FIG. 5 is a subjective quality comparison of the present invention and a conventional encoding method. As can be seen from the figure, the present invention recovers more detail than the conventional method and does not suffer from severe blurring or blocking artifacts.
Table 1 compares the average coding performance of the present invention and conventional coding schemes over the test sequences. As can be seen from the table, the invention achieves coding performance equivalent to H.264 on the KTH data set, and its PSNR is 3.61 dB higher than that of H.265 on the APE data set. In this comparison, the code stream used on the two data sets is only about 50% of that of the conventional coding schemes.
Table 1 comparison of average coding performance of the present invention and conventional coding schemes over test sequences
(The table is reproduced as an image in the original publication; its key figures are summarized in the preceding paragraph.)
In summary, the invention provides a low-bit-rate human motion video compression model based on a generative adversarial network whose performance is greatly improved over current standard coding methods. The invention has great application prospects in video coding involving human body motion, for example surveillance video in the public security field.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions that can easily be conceived by those skilled in the art within the technical scope disclosed herein shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A low-bit-rate human motion video coding system based on a generative adversarial network, characterized by comprising: a memory feature extraction module, a memory feature coding and decoding module, a skeleton feature extraction module, a skeleton feature coding and decoding module, a recall attention module and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order; the output obtained after the last video frame has been input is the memory feature;
the memory feature coding and decoding module comprises a memory feature coding module and a memory feature decoding module; the memory feature is input into the memory feature coding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then applies memory feature entropy coding to the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features containing the position information of the key nodes of the human body; the skeleton features are the positions of the key human body nodes and joints, the key nodes including the head, hands, feet and body;
the skeleton feature coding and decoding module codes and decodes the skeleton features of the coded video frames; the skeleton features are input into the skeleton feature coding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then applies skeleton feature entropy coding to the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first applies skeleton feature entropy decoding to the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to the residual reconstruction to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses, using an attention mechanism, the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function of the generation network to optimize the training of the generator;
in the memory feature extraction module, the memory feature contains the global appearance attribute information of the consecutive frames of a group of pictures; a group of pictures is a set of consecutive video frames of a certain length; the global appearance attribute information of the consecutive frames is the video appearance attribute information extracted by inputting the frames of the group of pictures into the convolutional recurrent neural network, and includes the appearance of the background within the video and the appearance attributes of the facial features and clothing of the people in the scene;
in the memory feature coding and decoding module, the memory feature coding module comprises two parts, quantization of the memory feature and entropy coding of the quantized feature; in memory feature quantization, each feature value of the memory feature is quantized separately with an existing quantization method to obtain the quantized memory feature; memory feature entropy coding applies an existing entropy coding method to the quantized memory feature, obtaining the code stream of the memory feature part for transmission; the memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts; memory feature entropy decoding processes the transmitted code stream of the memory feature part with the entropy decoding method corresponding to the entropy coding, obtaining the quantized memory feature reconstruction; memory feature inverse quantization, the inverse of memory feature quantization, then yields the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame;
the skeleton feature coding and decoding module comprises a skeleton feature compression coding part and a skeleton feature compression decoding part; the coding part comprises predictive coding and entropy coding; predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the same node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node of the current frame is computed, and the resulting residual is passed to skeleton feature entropy coding; skeleton feature entropy coding is an entropy coder for skeleton features, into which the residual obtained from skeleton feature predictive coding is input and entropy-coded with an existing entropy coding method, obtaining the code stream information of the skeleton features; the skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature coding by means of entropy decoding and predictive decoding; skeleton feature entropy decoding applies the corresponding existing entropy decoding method to the skeleton feature code stream, recovering the residual produced by skeleton feature predictive coding; and skeleton feature predictive decoding adds the node coordinate of the previous moment to the residual obtained by skeleton feature entropy decoding, obtaining the decoded skeleton feature reconstruction.
2. A low-bit-rate human motion video coding method based on a generative adversarial network, using the system of claim 1, characterized by comprising the following steps:
(1) the consecutive frames of a group of pictures are input in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) the memory feature is input into the memory feature coding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) the video frame to be coded is input into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) the skeleton features are input into the skeleton feature coding module, where predictive coding and entropy coding of the skeleton features yield the code stream of the skeleton feature part for output;
(5) the code stream of the memory feature part is input into the memory feature decoding module, which applies memory feature entropy decoding and then memory feature inverse quantization to obtain the memory feature reconstruction; meanwhile, the skeleton feature code stream is input into the skeleton feature decoding module, which applies entropy decoding and then predictive decoding to obtain the skeleton feature reconstruction;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input into the recall attention module, which uses an attention mechanism to obtain the fused feature combining the two parts of information;
(7) the fused feature is input as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and this score is used to train the generator.
CN201910479249.8A 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network Active CN110290386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910479249.8A CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910479249.8A CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN110290386A CN110290386A (en) 2019-09-27
CN110290386B true CN110290386B (en) 2022-09-06

Family

ID=68003085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910479249.8A Active CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN110290386B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942463B (en) * 2019-10-30 2021-03-16 杭州电子科技大学 Video target segmentation method based on generation countermeasure network
CN110929242B (en) * 2019-11-20 2020-07-10 上海交通大学 Method and system for carrying out attitude-independent continuous user authentication based on wireless signals
CN112950729A (en) * 2019-12-10 2021-06-11 山东浪潮人工智能研究院有限公司 Image compression method based on self-encoder and entropy coding
CN111967340B (en) * 2020-07-27 2023-08-04 中国地质大学(武汉) Visual perception-based abnormal event detection method and system


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101587962B1 (en) * 2010-12-22 2016-01-28 한국전자통신연구원 Motion capture apparatus and method
KR20130047194A (en) * 2011-10-31 2013-05-08 한국전자통신연구원 Apparatus and method for 3d appearance creation and skinning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108174225A (en) * 2018-01-11 2018-06-15 上海交通大学 Filter achieving method and system in coding and decoding video loop based on confrontation generation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109086869A (en) * 2018-07-16 2018-12-25 北京理工大学 A kind of human action prediction technique based on attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-End Facial Image Compression with Integrated Semantic Distortion Metric; Tianyu He et al.; 2018 IEEE Visual Communications and Image Processing (VCIP); 2019-04-25; full text *
Research on action recognition with multi-model fusion; Tian Man et al.; Electronic Measurement Technology; 2018-10-23 (No. 20); full text *

Also Published As

Publication number Publication date
CN110290386A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110290386B (en) Low-bit-rate human motion video coding system and method based on a generative adversarial network
VR An enhanced coding algorithm for efficient video coding
CN112203093B (en) Signal processing method based on deep neural network
CN108924558B (en) Video predictive coding method based on neural network
CN108174218B (en) Video coding and decoding system based on learning
CN112866694A (en) Intelligent image compression optimization method combining asymmetric volume block and condition context
CN113132727B (en) Scalable machine vision coding method and training method of motion-guided image generation network
Xia et al. An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal
CN113822147A (en) Deep compression method for semantic task of cooperative machine
CN115880762B (en) Human-machine hybrid vision-oriented scalable face image coding method and system
CN104539961A (en) Scalable video encoding system based on hierarchical structure progressive dictionary learning
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
CN113132735A (en) Video coding method based on video frame generation
CN115052147B (en) Human body video compression method and system based on generative model
CN110677624B (en) Monitoring video-oriented foreground and background parallel compression method based on deep learning
CN116156202A (en) Method, system, terminal and medium for realizing video error concealment
Wu et al. Memorize, then recall: a generative framework for low bit-rate surveillance video compression
CN110677644B (en) Video coding and decoding method and video coding intra-frame predictor
CN114422795A (en) Face video coding method, decoding method and device
CN113068041B (en) Intelligent affine motion compensation coding method
CN116437089B (en) Depth video compression method based on key target
CN104363454A (en) Method and system for video coding and decoding of high-bit-rate images
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
Mital et al. Deep stereo image compression with decoder side information using wyner common information
Lei et al. An end-to-end face compression and recognition framework based on entropy coding model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant