CN110290386B - Low-bit-rate human motion video coding system and method based on a generative adversarial network - Google Patents

Low-bit-rate human motion video coding system and method based on a generative adversarial network

Info

Publication number
CN110290386B
CN110290386B (application CN201910479249.8A)
Authority
CN
China
Prior art keywords
characteristic
memory
coding
module
bone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910479249.8A
Other languages
Chinese (zh)
Other versions
CN110290386A (en
Inventor
陈志波
吴耀军
何天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201910479249.8A priority Critical patent/CN110290386B/en
Publication of CN110290386A publication Critical patent/CN110290386A/en
Application granted granted Critical
Publication of CN110290386B publication Critical patent/CN110290386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/65Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using error resilience
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network. The invention makes full use of the content structure information of human motion video, decomposing the video content information into two parts: memory features, which contain the global appearance attribute information, and skeleton features, which express the human motion information. An attention mechanism and a generative adversarial network are then used to achieve more efficient low-bit-rate video compression.

Description

Low-bit-rate human motion video coding system and method based on a generative adversarial network
Technical Field
The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network, and belongs to the technical field of video image coding and compression.
Background
The traditional hybrid video coding frameworks (MPEG-2, H.264, H.265, etc.) decompose video compression into four basic steps: prediction, transform, quantization and entropy coding. Within this coding framework, redundant information between adjacent video frames is removed mainly by motion compensation on fixed-size blocks. Through years of development, the performance of traditional encoders has improved continuously. With the rise of deep learning, models that implement video compression coding with deep learning have been proposed. Unlike traditional coding, deep-learning-based video coding can learn better transforms. In addition, an end-to-end deep learning model can train itself, with all modules adjusted automatically toward a unified optimization objective.
The above coding schemes all take pixel-level fidelity as the optimization target, and the structural information of the video sequence content is not fully exploited in the coding process. For example, in surveillance video applications in the public security field, where person identification is very important, a large number of videos containing human body motion need to be encoded and then analyzed at a later stage; structurally, such videos actually consist of two parts, global appearance attribute information and human body motion information. The present invention therefore considers using the structural information of human motion video to further improve coding efficiency.
Disclosure of Invention
The invention solves the following problem: overcoming the defects of the prior art, it provides a human motion video coding system and method based on a generative adversarial network that make full use of the structure information of human motion video content, decompose the video content information into two parts, memory features containing the global appearance attribute information and skeleton features expressing the human motion information, and use an attention mechanism and a generative adversarial network to achieve more efficient low-bit-rate video compression.
The technical scheme of the invention is as follows: a low-bit-rate human motion video coding system based on a generative adversarial network, which integrates structure information extraction, structure information fusion and feature information coding/decoding for human motion video content, and comprises a memory feature extraction module, a memory feature coding and decoding module, a skeleton feature extraction module, a skeleton feature coding and decoding module, a recall attention module and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order; the output obtained after the last video frame has been input is the memory feature;
the memory feature coding and decoding module comprises a memory feature coding module and a memory feature decoding module; the memory feature is input into the memory feature coding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then applies memory feature entropy coding to the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features containing the position information of the key nodes of the human body; the skeleton features are the positions of the key human body nodes and joints, the key nodes including the head, hands, feet and body; to improve coding efficiency, the skeleton features are input into the skeleton feature coding and decoding module for further coding and compression;
the skeleton feature coding and decoding module codes and decodes the skeleton features of the coded video frames; the skeleton features are input into the skeleton feature coding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then applies skeleton feature entropy coding to the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first applies skeleton feature entropy decoding to the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to the residual reconstruction to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses, using an attention mechanism, the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function of the generation network, i.e. to optimize the training of the generator.
In the memory feature extraction module, the memory feature contains the global appearance attribute information of the consecutive frames of a group of pictures; a group of pictures is a set of consecutive video frames of a certain length; the global appearance attribute information of the consecutive frames is the video appearance attribute information extracted by inputting the frames of the group of pictures into the convolutional recurrent neural network, and includes the appearance of the background within the video and the appearance attributes of the facial features and clothing of the people in the scene.
The structure of the convolutional recurrent neural network in the memory feature extraction module is as follows: it comprises a convolutional recurrent layer, which establishes temporal relations like a recurrent neural network while describing local spatial features like a convolutional neural network; its number of channels is 128, its kernel size is 5 and its stride is 2.
In the memory feature coding and decoding module, the memory feature coding module comprises two parts: quantization of the memory feature and entropy coding of the quantized feature. In memory feature quantization, each feature value of the memory feature is quantized separately with an existing quantization method to obtain the quantized memory feature. Memory feature entropy coding applies an existing entropy coding method to the quantized memory feature, obtaining the code stream of the memory feature part for transmission. The memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts: memory feature entropy decoding processes the transmitted code stream of the memory feature part with the entropy decoding method corresponding to the entropy coding, obtaining the quantized memory feature reconstruction, and the inverse quantization operation, the inverse of memory feature quantization, then yields the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame.
The skeleton feature coding and decoding module comprises a skeleton feature compression coding part and a skeleton feature decoding part. The coding part comprises predictive coding and entropy coding. Predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the same node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node of the current frame is computed, and the resulting residual is passed to skeleton feature entropy coding. Skeleton feature entropy coding is an entropy coder for skeleton features: the residual obtained from skeleton feature predictive coding is input into the entropy coder and entropy-coded with an existing entropy coding method, obtaining the code stream information of the skeleton features. The skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature coding by means of entropy decoding and predictive decoding: skeleton feature entropy decoding applies the corresponding existing entropy decoding method to the skeleton feature code stream, recovering the residual produced by skeleton feature predictive coding, and skeleton feature predictive decoding adds the node coordinate of the previous moment to the residual obtained by entropy decoding, yielding the decoded skeleton feature reconstruction.
In the skeleton feature extraction module, the human pose estimation network is a PAF network with a two-branch structure. One branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the joint confidence maps, which express the position of each detected node and the confidence that the node belongs to a human body node. The other branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the PAFs (part affinity fields). The PAFs are a set of 2D vector fields, each of which encodes the position and orientation of a limb; they are learned and predicted jointly with the joint confidence maps.
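As an illustrative sketch, assuming PyTorch and treating the channel widths, ReLU activations and limb count as assumptions (the text above fixes only the kernel sizes and layer counts, and 18 body nodes are used elsewhere in this document), such a two-branch head could be written as:

    import torch.nn as nn

    def branch(in_ch, out_ch, width=128):
        # Three 3x3 convolutions followed by two 1x1 convolutions.
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(inplace=True)]
            ch = width
        layers += [nn.Conv2d(ch, width, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(width, out_ch, 1)]
        return nn.Sequential(*layers)

    class PAFHead(nn.Module):
        def __init__(self, in_ch, n_joints=18, n_limbs=19):
            super().__init__()
            self.conf = branch(in_ch, n_joints)    # joint confidence maps
            self.paf = branch(in_ch, 2 * n_limbs)  # one 2D vector field per limb

        def forward(self, features):
            # Both branches are predicted jointly from shared backbone features.
            return self.conf(features), self.paf(features)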
The low-bit-rate human motion video coding method based on a generative adversarial network according to the invention comprises the following steps:
(1) the consecutive frames of a group of pictures are input in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) the memory feature is input into the memory feature coding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) the video frame to be coded is input into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) the skeleton features are input into the skeleton feature coding module, where predictive coding and entropy coding of the skeleton features yield the code stream of the skeleton feature part for output;
(5) the code stream of the memory feature part is input into the memory feature decoding module, which applies memory feature entropy decoding and then memory feature inverse quantization to obtain the memory feature reconstruction; meanwhile, the skeleton feature code stream is input into the skeleton feature decoding module, which applies entropy decoding and then predictive decoding to obtain the skeleton feature reconstruction;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input into the recall attention module, which uses an attention mechanism to obtain the fused feature combining the two parts of information;
(7) the fused feature is input as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and this score is used to train the generator.
Compared with the prior art, the invention has the following advantages:
(1) unlike traditional video coding based on blocks and pixel-level distortion, the method decomposes the video information into memory features and skeleton features for the first time and makes full use of the video structure information, thereby further improving coding efficiency;
(2) the invention reconstructs the video frames with a generative adversarial network, which can recover in a generative manner the information lost during coding, thereby improving subjective quality;
(3) for the processing of the decomposed features, the invention is the first to propose fusing the features obtained from the video decomposition in a recall attention module; the memory feature reconstruction and the skeleton feature reconstruction are fused effectively using an attention mechanism, ensuring that the video frame can be recovered and reconstructed from this information.
Drawings
FIG. 1 is a general block diagram of the system of the present invention;
FIG. 2 is a block diagram of a memory feature codec module according to the present invention;
FIG. 3 is a block diagram of the skeleton feature coding and decoding module according to the present invention;
FIG. 4 is a block diagram of a recall attention module according to the present invention;
FIG. 5 is a subjective quality comparison of the present invention with a conventional coding method.
Detailed Description
The technical scheme of the invention is described clearly and completely below with reference to the memory feature coding/decoding method, the skeleton feature coding/decoding method, the recall attention module and the generative adversarial network implementation of the invention. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the present invention without inventive effort fall within the protection scope of the present invention.
As shown in FIG. 1, the coding framework of the present invention includes the following modules: memory feature extraction, memory feature coding/decoding, skeleton feature extraction, skeleton feature coding/decoding, the recall attention module and the generative adversarial network. Wherein:
the memory feature extraction network is used for extracting memory features containing image group continuous frame global appearance attribute information, and the main structure of the memory feature extraction network is a convolution cyclic neural network. The method comprises a convolution cycle layer, wherein the convolution cycle layer establishes a time sequence relation similar to a cyclic neural network, and can describe local spatial features like the convolution neural network. The number of channels of the convolution cycle layer is 128, the kernel size is 5, and the step size is 2.
Memory feature coding and decoding mainly comprises a memory feature coding part and a memory feature decoding part. The coding part mainly comprises quantization of the memory feature and entropy coding of the quantized feature, and the decoding part comprises the corresponding entropy decoding and inverse quantization. The entropy coder/decoder is typically an arithmetic coder/decoder, and quantization typically uses scalar quantization.
In the embodiment of the present invention, as shown in FIG. 2, the memory feature is first quantized with scalar quantization, and entropy coding is then used to further remove redundancy and obtain the final code stream. To further remove the redundancy of the quantized memory feature and improve coding efficiency, the embodiment uses a hyper-prior network to model the probability of each value of the quantized memory feature, and entropy-codes the quantized memory feature according to the predicted probability distribution.
To obtain the probability of a specific memory feature, the memory feature is input into the hyper-prior network to extract a variable z, which is then transmitted as side information via arithmetic coding. At the encoding and decoding ends, following existing deep-learning-based image coding methods, the probability distribution of the memory feature is modeled as a Gaussian distribution, and the variable z is used to predict the mean and variance of that distribution.
In addition, the quantization of the memory feature is integrated into the network of the memory feature coding module, and standard scalar quantization is not differentiable, which would make the network untrainable; therefore, uniform noise is used to replace the standard scalar quantization process during training. The specific formula is:

m̂ = m + ε

where m is the memory feature before quantization, m̂ is the quantized memory feature, and ε is noise uniformly distributed between −1/2 and 1/2.
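A minimal PyTorch sketch of this training-time surrogate, together with rate estimation under the Gaussian model predicted from the hyper-prior, is given below; the function names and the bin-probability computation follow common practice in learned compression and are assumptions rather than the exact implementation of this embodiment:

    import torch

    def quantize(m, training):
        if training:
            # Replace rounding by additive noise uniform on (-1/2, 1/2),
            # keeping the operation differentiable.
            return m + torch.empty_like(m).uniform_(-0.5, 0.5)
        return torch.round(m)  # plain scalar quantization at test time

    def rate_bits(m_hat, mu, sigma):
        # mu, sigma: mean and standard deviation of the Gaussian model,
        # predicted from the hyper-prior variable z.
        gauss = torch.distributions.Normal(mu, sigma)
        # Probability mass of the quantization bin [m_hat - 1/2, m_hat + 1/2].
        p = gauss.cdf(m_hat + 0.5) - gauss.cdf(m_hat - 0.5)
        return (-torch.log2(p.clamp_min(1e-9))).sum()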
Skeleton feature extraction extracts the skeleton information that expresses the human motion in the current coded video frame, whose main form is the positions of the key nodes of the human body. Using the existing PAFs human pose estimation network, this module outputs skeleton features containing the position information of the key human body nodes.
Skeleton feature coding and decoding mainly comprises skeleton feature compression coding and decoding. The coding part exploits the temporal redundancy of the skeleton features: predictive coding is applied first, and the residual between the predicted value and the true value is input into an entropy coder to obtain the code stream output. The decoding part reconstructs the skeleton features through entropy decoding and predictive decoding. The entropy codec is typically an arithmetic codec or similar.
The recall attention module uses an attention mechanism to fuse the memory feature information and the skeleton feature information: on the basis of the memory feature containing the global appearance attribute information, the skeleton feature containing the human motion information of the specific frame is combined to obtain the fused feature information.
The specific implementation of the recall attention module in the embodiment of the present invention is shown in FIG. 4. The attention mechanism used in the embodiment can be expressed as a function of a query and key-value pairs. The corresponding formula is:

R(Q, K, V) = [W V^T + Q, V]

where Q is the query matrix, K is the key matrix, V is the value matrix, R is the fused feature information, and [·, ·] denotes concatenation. The matrix W is obtained from the query matrix Q and the key matrix K; the specific calculation formula is:

W = Q K^T
The key matrix K and the value matrix V are the features obtained by convolving the memory feature reconstruction m̂ with two different convolution kernels of size 1, with the number of channels equal to the input and stride 1, while the query matrix Q is the feature obtained by convolving the skeleton feature reconstruction with a convolution kernel of size 1, with the number of channels equal to the input and stride 1.
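The following PyTorch sketch illustrates one reading of these formulas; the flattening of spatial positions into matrix rows, the equal spatial sizes of the two reconstructions, and the placement of the transposes are assumptions made so that the shapes are consistent:

    import torch
    import torch.nn as nn

    class RecallAttention(nn.Module):
        def __init__(self, ch):
            super().__init__()
            # 1x1 convolutions, channel count equal to the input, stride 1.
            self.q = nn.Conv2d(ch, ch, 1)  # query from the skeleton reconstruction
            self.k = nn.Conv2d(ch, ch, 1)  # key from the memory reconstruction
            self.v = nn.Conv2d(ch, ch, 1)  # value from the memory reconstruction

        def forward(self, skel_rec, mem_rec):
            b, c, h, w = skel_rec.shape
            q = self.q(skel_rec).flatten(2).transpose(1, 2)  # (B, HW, C)
            k = self.k(mem_rec).flatten(2).transpose(1, 2)   # (B, HW, C)
            v = self.v(mem_rec).flatten(2).transpose(1, 2)   # (B, HW, C)
            w_mat = q @ k.transpose(1, 2)                    # W = Q K^T, (B, HW, HW)
            fused = torch.cat([w_mat @ v + q, v], dim=-1)    # [W V^T + Q, V]
            return fused.transpose(1, 2).reshape(b, 2 * c, h, w)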
In the embodiment of the present invention, the skeleton feature codec compresses the position coordinates of 18 nodes of the human body; its basic structure is shown in FIG. 3.
For the coordinate information of each body node, the embodiment uses the coordinate of the node at the previous moment as the predicted value, and the residual between the coordinate of the node at the current moment and the predicted value is the information actually coded. The residual information is further compressed with the adaptive arithmetic coding commonly used in traditional coding schemes, and the resulting code stream is the final code stream of the skeleton feature part. Decoding is similar: the residual information is first obtained by arithmetic decoding, and the skeleton feature reconstruction is then obtained by predictive decoding.
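A minimal Python sketch of this predictive coding loop follows; encode_residuals and decode_residuals stand in for the adaptive arithmetic coder and are hypothetical names:

    def encode_skeleton(prev_nodes, cur_nodes, encode_residuals):
        # prev_nodes / cur_nodes: 18 (x, y) coordinates; the previous
        # frame's coordinates serve as the prediction, so only the
        # residuals are entropy-coded.
        residuals = [(cx - px, cy - py)
                     for (px, py), (cx, cy) in zip(prev_nodes, cur_nodes)]
        return encode_residuals(residuals)

    def decode_skeleton(prev_nodes, bitstream, decode_residuals):
        # Predictive decoding: adding each residual back to the previous
        # coordinate gives a lossless skeleton feature reconstruction.
        residuals = decode_residuals(bitstream)
        return [(px + dx, py + dy)
                for (px, py), (dx, dy) in zip(prev_nodes, residuals)]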
The generative adversarial network of the embodiment of the invention comprises a generator and a discriminator. The generator G takes the fused feature representation output by the recall attention module as conditional input to obtain the video frame reconstruction; the generator network used by the invention is a pix2pixHD network. The discriminator of the embodiment comprises two parts, a spatial discriminator D_I and a temporal discriminator D_V, and the network structure of the discriminators is a VGG network. The input of the spatial discriminator D_I is the skeleton feature S of the current frame together with the generated frame information or the real frame information X; it is used to judge whether the generated video frame is close to a real natural image. The input of the temporal discriminator D_V is the skeleton features S of the current and previous frames together with the generated frame information or the real frame information; its main purpose is to ensure continuity between adjacent frames of the generated video. Together, the two discriminators form the adversarial loss l_adv of the network, calculated as:

l_adv = Σ_{D ∈ {D_I, D_V}} ( E_{(s_t, x_t)}[log D(s_t, x_t)] + E_{s_t}[log(1 − D(s_t, G(s_t)))] )

where s_t is the skeleton feature information at time t, x_t is the video frame at time t, G is the generator of the generative adversarial network, and E denotes the expectation.
The loss function of the generative adversarial network G is designed as follows:

l = l_adv + λ_comp · l_comp + λ_fm · l_fm + λ_VGG · l_VGG

where l_adv is the adversarial loss corresponding to the two discriminators, l_comp is the code stream size of the memory feature, and l_fm and l_VGG are the feature matching loss and the VGG network perceptual loss adopted with reference to existing work on generative adversarial networks; λ_comp, λ_fm and λ_VGG are the weights of the corresponding losses. In the embodiment of the invention, the weights are set to λ_comp = 1, λ_fm = 10 and λ_VGG = 10.
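As a short sketch (plain Python; the individual loss terms are assumed to be computed elsewhere), the combined generator objective is simply the weighted sum above:

    def generator_loss(l_adv, l_comp, l_fm, l_vgg,
                       lam_comp=1.0, lam_fm=10.0, lam_vgg=10.0):
        # Weights follow the embodiment: lambda_comp=1, lambda_fm=10, lambda_VGG=10.
        return l_adv + lam_comp * l_comp + lam_fm * l_fm + lam_vgg * l_vgg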
The invention structurally decomposes human motion video into memory information and skeleton information and makes full use of the structure information of the video content, thereby achieving quality superior to traditional video coding methods (H.264 and H.265).
The coding performance of the invention was tested on the KTH data set and the APE data set. For the KTH data set, 8 video sequences were randomly extracted as the test set. For the APE data set, 7 video sequences were randomly extracted for testing. The specific performance results can be seen in FIG. 5 and Table 1.
FIG. 5 is a subjective quality comparison of the present invention and a conventional encoding method. As can be seen from the figure, the present invention recovers more detail than the conventional method and does not suffer from severe blurring or blocking artifacts.
Table 1 compares the average coding performance of the present invention and conventional coding schemes over the test sequences. As can be seen from the table, the invention achieves coding performance equivalent to H.264 on the KTH data set, and its PSNR is 3.61 dB higher than that of H.265 on the APE data set. In this comparison, the code stream used on the two data sets is only about 50% of that of the conventional coding schemes.
Table 1 comparison of average coding performance of the present invention and conventional coding schemes over test sequences
(The table is reproduced as an image in the original publication; its key figures are summarized in the preceding paragraph.)
In summary, the invention provides a low-bit-rate human motion video compression model based on a generative adversarial network whose performance is greatly improved over current standard coding methods. The invention has great application prospects in video coding involving human body motion, for example surveillance video in the public security field.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions that can easily be conceived by those skilled in the art within the technical scope disclosed herein shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A low-bit-rate human motion video coding system based on a generative adversarial network, characterized by comprising: a memory feature extraction module, a memory feature coding and decoding module, a skeleton feature extraction module, a skeleton feature coding and decoding module, a recall attention module and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order; the output obtained after the last video frame has been input is the memory feature;
the memory feature coding and decoding module comprises a memory feature coding module and a memory feature decoding module; the memory feature is input into the memory feature coding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then applies memory feature entropy coding to the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features containing the position information of the key nodes of the human body; the skeleton features are the positions of the key human body nodes and joints, the key nodes including the head, hands, feet and body;
the skeleton feature coding and decoding module codes and decodes the skeleton features of the coded video frames; the skeleton features are input into the skeleton feature coding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then applies skeleton feature entropy coding to the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first applies skeleton feature entropy decoding to the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to the residual reconstruction to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses, using an attention mechanism, the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function of the generation network to optimize the training of the generator;
in the memory feature extraction module, the memory feature contains the global appearance attribute information of the consecutive frames of a group of pictures; a group of pictures is a set of consecutive video frames of a certain length; the global appearance attribute information of the consecutive frames is the video appearance attribute information extracted by inputting the frames of the group of pictures into the convolutional recurrent neural network, and includes the appearance of the background within the video and the appearance attributes of the facial features and clothing of the people in the scene;
in the memory feature coding and decoding module, the memory feature coding module comprises two parts, quantization of the memory feature and entropy coding of the quantized feature; in memory feature quantization, each feature value of the memory feature is quantized separately with an existing quantization method to obtain the quantized memory feature; memory feature entropy coding applies an existing entropy coding method to the quantized memory feature, obtaining the code stream of the memory feature part for transmission; the memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts; memory feature entropy decoding processes the transmitted code stream of the memory feature part with the entropy decoding method corresponding to the entropy coding, obtaining the quantized memory feature reconstruction; memory feature inverse quantization, the inverse of memory feature quantization, then yields the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame;
the skeleton feature coding and decoding module comprises a skeleton feature compression coding part and a skeleton feature compression decoding part; the coding part comprises predictive coding and entropy coding; predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the same node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node of the current frame is computed, and the resulting residual is passed to skeleton feature entropy coding; skeleton feature entropy coding is an entropy coder for skeleton features, into which the residual obtained from skeleton feature predictive coding is input and entropy-coded with an existing entropy coding method, obtaining the code stream information of the skeleton features; the skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature coding by means of entropy decoding and predictive decoding; skeleton feature entropy decoding applies the corresponding existing entropy decoding method to the skeleton feature code stream, recovering the residual produced by skeleton feature predictive coding; and skeleton feature predictive decoding adds the node coordinate of the previous moment to the residual obtained by skeleton feature entropy decoding, obtaining the decoded skeleton feature reconstruction.
2. A low-bit-rate human motion video coding method based on a generative adversarial network, using the system of claim 1, characterized by comprising the following steps:
(1) the consecutive frames of a group of pictures are input in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) the memory feature is input into the memory feature coding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) the video frame to be coded is input into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) the skeleton features are input into the skeleton feature coding module, where predictive coding and entropy coding of the skeleton features yield the code stream of the skeleton feature part for output;
(5) the code stream of the memory feature part is input into the memory feature decoding module, which applies memory feature entropy decoding and then memory feature inverse quantization to obtain the memory feature reconstruction; meanwhile, the skeleton feature code stream is input into the skeleton feature decoding module, which applies entropy decoding and then predictive decoding to obtain the skeleton feature reconstruction;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input into the recall attention module, which uses an attention mechanism to obtain the fused feature combining the two parts of information;
(7) the fused feature is input as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and this score is used to train the generator.
CN201910479249.8A 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network Active CN110290386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910479249.8A CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910479249.8A CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN110290386A CN110290386A (en) 2019-09-27
CN110290386B true CN110290386B (en) 2022-09-06

Family

ID=68003085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910479249.8A Active CN110290386B (en) 2019-06-04 2019-06-04 Low-bit-rate human motion video coding system and method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN110290386B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942463B (en) * 2019-10-30 2021-03-16 杭州电子科技大学 Video target segmentation method based on generation countermeasure network
CN110929242B (en) * 2019-11-20 2020-07-10 上海交通大学 Method and system for carrying out attitude-independent continuous user authentication based on wireless signals
CN112950729A (en) * 2019-12-10 2021-06-11 山东浪潮人工智能研究院有限公司 Image compression method based on self-encoder and entropy coding
CN111967340B (en) * 2020-07-27 2023-08-04 中国地质大学(武汉) Visual perception-based abnormal event detection method and system


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101587962B1 (en) * 2010-12-22 2016-01-28 한국전자통신연구원 Motion capture apparatus and method
KR20130047194A (en) * 2011-10-31 2013-05-08 한국전자통신연구원 Apparatus and method for 3d appearance creation and skinning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108174225A (en) * 2018-01-11 2018-06-15 上海交通大学 Filter achieving method and system in coding and decoding video loop based on confrontation generation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109086869A (en) * 2018-07-16 2018-12-25 北京理工大学 A kind of human action prediction technique based on attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-End Facial Image Compression with Integrated Semantic Distortion Metric; Tianyu He et al.; 2018 IEEE Visual Communications and Image Processing (VCIP); 2019-04-25; full text *
Research on action recognition with multi-model fusion; Tian Man et al.; Electronic Measurement Technology; 2018-10-23 (No. 20); full text *

Also Published As

Publication number Publication date
CN110290386A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110290386B (en) Low-bit-rate human motion video coding system and method based on a generative adversarial network
VR An enhanced coding algorithm for efficient video coding
CN112203093B (en) Signal processing method based on deep neural network
CN108924558B (en) Video predictive coding method based on neural network
CN108174218B (en) Video coding and decoding system based on learning
CN112866694A (en) Intelligent image compression optimization method combining asymmetric volume block and condition context
CN113132727B (en) Scalable machine vision coding method and training method of motion-guided image generation network
Xia et al. An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal
CN113822147A (en) Deep compression method for semantic task of cooperative machine
CN115880762B (en) Human-machine hybrid vision-oriented scalable face image coding method and system
CN104539961A (en) Scalable video encoding system based on hierarchical structure progressive dictionary learning
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
CN113132735A (en) Video coding method based on video frame generation
CN115052147B (en) Human body video compression method and system based on generative model
CN110677624B (en) Monitoring video-oriented foreground and background parallel compression method based on deep learning
CN116156202A (en) Method, system, terminal and medium for realizing video error concealment
Wu et al. Memorize, then recall: a generative framework for low bit-rate surveillance video compression
CN110677644B (en) Video coding and decoding method and video coding intra-frame predictor
CN114422795A (en) Face video coding method, decoding method and device
CN113068041B (en) Intelligent affine motion compensation coding method
CN116437089B (en) Depth video compression method based on key target
CN104363454A (en) Method and system for video coding and decoding of high-bit-rate images
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
Mital et al. Deep stereo image compression with decoder side information using wyner common information
Lei et al. An end-to-end face compression and recognition framework based on entropy coding model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant