CN110290386A - A low bit-rate human motion video coding system and method based on a generative adversarial network - Google Patents
- Publication number
- CN110290386A (application CN201910479249.8A)
- Authority
- CN
- China
- Prior art keywords
- character
- memory
- skeleton
- coding
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H—ELECTRICITY; H04—ELECTRIC COMMUNICATION TECHNIQUE; H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION; H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/124—Quantisation
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/177—adaptive coding characterised by the coding unit, the unit being a group of pictures [GOP]
- H04N19/50—using predictive coding
- H04N19/65—using error resilience
- H04N19/85—using pre-processing or post-processing specially adapted for video compression
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Abstract
The present invention relates to a low bit-rate human motion video coding system and method based on a generative adversarial network. It makes full use of the structural information of human motion video content: the video content is decomposed into two parts, a memory feature carrying global appearance attribute information and a skeleton feature expressing body motion information, and an attention mechanism is used together with a generative adversarial network to achieve more efficient low bit-rate video compression.
Description
Technical field
The present invention relates to a low bit-rate human motion video coding system and method based on a generative adversarial network, and belongs to the field of video image coding and compression.
Background art
Traditional hybrid video coding frameworks (MPEG-2, H.264, H.265, etc.) decompose video compression into four basic steps: prediction, transform, quantization, and entropy coding. In these frameworks, the redundancy between consecutive video frames is removed mainly by motion compensation performed on fixed-size blocks. Through years of development, the performance of conventional codecs has improved continuously. With the development of deep learning, models that realize video compression coding with deep networks have been proposed. Unlike traditional coding, deep-learning-based video coding can learn better transforms, and an end-to-end deep model can train itself, automatically adjusting every module toward a unified optimization objective.
The above coding schemes take pixel-level fidelity as the optimization objective, so the structural information of the video sequence content is not fully exploited during encoding. For example, in surveillance video applications in the public-safety field, the identification of people is extremely important, and a large number of videos containing human motion must be encoded and later analyzed; such videos are in fact structured as two parts, global appearance attribute information and body motion information. The present invention therefore considers using the structural information of human motion video to further improve coding efficiency.
Summary of the invention
The technical problem solved by the present invention: overcoming the deficiencies of the prior art by providing a low bit-rate human motion video coding system and method based on a generative adversarial network. The system makes full use of the structural information of human motion video content, decomposing the video content into a memory feature carrying global appearance attribute information and a skeleton feature expressing body motion information, and uses an attention mechanism together with a generative adversarial network to achieve more efficient low bit-rate video compression.
The technical solution of the present invention: a low bit-rate human motion video coding system based on a generative adversarial network, which integrates structural-information extraction for human motion video content, fusion of that structural information, and feature-information coding/decoding. It comprises: a memory feature extraction module, a memory feature coding/decoding module, a skeleton feature extraction module, a skeleton feature coding/decoding module, a recall attention module, and a generative adversarial network module, in which:
The memory feature extraction module is based on a convolutional recurrent neural network: the video frames of a group of pictures are fed into the network in temporal order, and the output obtained after the last video frame has been input is the memory feature.
The memory feature coding/decoding module includes two sub-modules, memory feature encoding and memory feature decoding. The memory feature is input to the memory feature encoding module, which first quantizes it to obtain the quantized memory feature and then entropy-codes the quantized memory feature, producing the bitstream transmitted for the memory feature part. That bitstream is input to the memory feature decoding module, which first entropy-decodes it to recover the quantized memory feature and then inverse-quantizes the result to obtain the memory feature reconstruction, which is input to the recall attention module.
The skeleton feature extraction module inputs the video frame to be encoded into a human pose estimation network to estimate node positions, obtaining a skeleton feature containing the positions of the key nodes of the human body; the skeleton feature consists of the key nodes and joint positions of the body, where the key nodes include the head, hands, feet and torso. To improve coding efficiency, the skeleton feature is input to the skeleton feature coding/decoding module for further compression.
The skeleton feature coding/decoding module encodes and decodes the skeleton feature of the video frame being encoded. The skeleton feature is input to the skeleton feature encoding part, which first applies predictive coding to obtain the residual information that is actually transmitted, then entropy-codes the residual to obtain the bitstream transmitted for the skeleton feature part. The transmitted bitstream is input to the skeleton feature decoding part, which first entropy-decodes it to recover the skeleton feature residual reconstruction, then applies prediction decoding to obtain the final skeleton feature reconstruction, which is input to the recall attention module.
The recall attention module uses an attention mechanism to fuse the memory feature reconstruction produced by the memory feature decoding module with the skeleton feature reconstruction produced by the skeleton feature decoding module, yielding the fused feature information.
The generative adversarial network module includes two parts, a generator and a discriminator. The generator takes the fused feature information produced by the recall attention module as input to the generation network and produces the generated video frame. The discriminator judges whether the generated video frame is consistent with a real natural video frame and outputs a score; as part of the generation network, this score is used in its training, i.e. as part of the loss function that optimizes the training of the generator.
In the memory feature extraction module, the memory feature contains the global appearance attribute information of the consecutive frames of a group of pictures. A group of pictures is a set of consecutive video frames of a certain length. The global appearance attribute information is the video appearance attribute information extracted by feeding the frames of the group of pictures into the convolutional recurrent neural network; it covers the background appearance in the video and the face and clothing appearance attributes of the people in the scene.
The structure of the convolutional recurrent neural network in the memory feature extraction module is as follows: it contains one convolutional recurrent layer, which establishes temporal relationships in the manner of a recurrent neural network while capturing local spatial features in the manner of a convolutional neural network. The convolutional recurrent layer has 128 channels, a kernel size of 5, and a stride of 2.
In the memory feature coding/decoding module, the memory feature encoding module includes two parts: quantization of the memory feature and entropy coding of the quantized feature. The quantization applies an existing quantization method to each value of the memory feature individually, producing the quantized memory feature. The entropy coding applies an existing entropy coding method to the quantized memory feature, producing the bitstream transmitted for the memory feature part. The memory feature decoding module includes the corresponding entropy decoding and inverse quantization. The entropy decoding applies the method matching the memory feature entropy coding to the transmitted bitstream, recovering the quantized memory feature; the inverse quantization applies the operation inverse to the memory feature quantization to the quantized memory feature, producing the memory feature reconstruction, which is input to the recall attention module for the reconstruction of video frames.
The skeleton feature coding/decoding module includes two parts, skeleton feature compression coding and decoding. The encoding part consists of predictive coding and entropy coding. The predictive coding exploits the temporal redundancy of the skeleton feature: for the coordinate of each node, the coordinate of the same node in the previous frame of the encoded video serves as the predicted value, and a residual is computed between the predicted value and the true coordinate of the corresponding node in the current frame; the resulting residual is then entropy-coded. The skeleton feature entropy coding is an entropy coder for the skeleton feature: using an existing entropy coding method, the residual produced by the predictive coding is fed into the entropy coder to obtain the skeleton feature bitstream. The decoding part uses the bitstream produced by the skeleton feature encoding to realize lossless reconstruction of the skeleton feature through entropy decoding and prediction decoding: the entropy decoding applies an existing entropy decoding method to the skeleton feature bitstream to recover the residual produced by the predictive coding, and the prediction decoding adds that residual to the node coordinate of the previous moment to obtain the decoded skeleton feature reconstruction.
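The predictive coding and prediction decoding of node coordinates described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: joint coordinates are assumed to be stored as an array of shape (frames, nodes, 2), and the entropy-coding stage is omitted.

```python
import numpy as np

def predictive_encode(joints):
    """Temporal predictive coding of joint coordinates.

    joints: array of shape (T, N, 2) -- x/y coordinates of N key nodes
    over T frames. The first frame has no predictor and is kept as-is;
    every later frame is represented by its residual against the previous
    frame's coordinates.
    """
    residuals = np.empty_like(joints)
    residuals[0] = joints[0]                  # no predictor for the first frame
    residuals[1:] = joints[1:] - joints[:-1]  # residual vs. previous frame
    return residuals                          # residuals would go to the entropy coder

def predictive_decode(residuals):
    """Inverse of predictive_encode: a cumulative sum restores coordinates."""
    return np.cumsum(residuals, axis=0)

# The round trip is lossless on integer pixel coordinates:
joints = np.array([[[10, 20], [30, 40]],
                   [[12, 21], [29, 42]],
                   [[15, 23], [27, 45]]])
recon = predictive_decode(predictive_encode(joints))
assert np.array_equal(recon, joints)
```

Because the residuals of slowly moving joints cluster near zero, they compress much better under entropy coding than the raw coordinates would.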
In the skeleton feature extraction module, the human pose estimation network is a PAF network whose structure is divided into two branches. One branch passes through 3 convolutional layers with kernel size 3 followed by 2 convolutional layers with kernel size 1x1 to produce joint confidence maps, which express the position of each detected node and the confidence that the node belongs to a human body. The other branch passes through 3 convolutional layers with kernel size 3 followed by 2 convolutional layers with kernel size 1x1 to produce PAFs (part affinity fields), a set of 2D vector fields in which each 2D vector encodes the position and direction of a limb; they are learned and predicted jointly with the joint confidence maps.
The present invention also provides a low bit-rate human motion video coding method based on a generative adversarial network, comprising the following steps:
(1) the consecutive frames of a group of pictures are sequentially input to the memory feature extraction module, which uses a convolutional recurrent neural network to produce the memory feature;
(2) the memory feature is input to the memory feature encoding module, which quantizes it and then entropy-codes the quantized memory feature, producing the memory feature bitstream;
(3) the video frame to be encoded is input to the skeleton feature extraction module, which obtains the skeleton feature with a human pose estimation network;
(4) the skeleton feature is input to the skeleton feature encoding module, where it is predictively coded and the residual is entropy-coded, producing the skeleton feature bitstream;
(5) the memory feature bitstream is input to the memory feature decoding module, which entropy-decodes it and then inverse-quantizes the result to obtain the memory feature reconstruction; at the same time the skeleton feature bitstream is input to the skeleton feature decoding module, which entropy-decodes it and then applies prediction decoding to obtain the skeleton feature reconstruction;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input to the recall attention module, which fuses the two parts of information into the fused feature using an attention mechanism;
(7) the fused feature serves as the conditional input of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, a discriminator scores whether the generator's output conforms to natural video, and this score is used for training the generator.
The advantages of the present invention over the prior art are:
(1) unlike conventional video coding based on blocks and pixel-level distortion, the present invention for the first time decomposes video information into a memory feature and a skeleton feature, making full use of video structural information and thereby further improving coding efficiency;
(2) the present invention reconstructs video frames with a generative adversarial network; unlike traditional coding, adversarial generation can restore information lost from the video during decoding, improving subjective quality;
(3) for the processing of the decomposed features, the recall attention module fuses the features obtained after the video decomposition: using an attention mechanism, the memory feature reconstruction and the skeleton feature reconstruction are effectively merged, which guarantees that the video frame is restored and reconstructed according to this information.
Description of the drawings
Fig. 1 is the overall framework of the system of the present invention;
Fig. 2 is a structural block diagram of the memory feature coding/decoding module in the present invention;
Fig. 3 is a structural block diagram of the skeleton feature coding/decoding module in the present invention;
Fig. 4 is a structural block diagram of the recall attention module in the present invention;
Fig. 5 is a comparison of the subjective quality of the present invention and a traditional coding method.
Specific embodiment
The technical solution of the present invention is described clearly and completely below with reference to a memory feature coding/decoding method, a skeleton feature coding/decoding method, a recall attention module, and a generative adversarial network implementation of the invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the coding framework of the present invention includes the following modules: memory feature extraction, memory feature coding/decoding, skeleton feature extraction, skeleton feature coding/decoding, the recall attention module, and the generative adversarial network. In which:
The role of the memory feature extraction network is to extract the memory feature containing the global appearance attribute information of the consecutive frames of a group of pictures; its main structure is a convolutional recurrent neural network. It contains one convolutional recurrent layer, which establishes temporal relationships like a recurrent neural network while capturing local spatial features like a convolutional neural network. The convolutional recurrent layer has 128 channels, a kernel size of 5, and a stride of 2.
The memory feature codec mainly includes two parts, memory feature encoding and memory feature decoding. The encoding part mainly includes quantization of the memory feature and entropy coding of the quantized feature, and the decoding part includes the corresponding entropy decoding and inverse quantization. The entropy coder/decoder is generally an arithmetic coder/decoder, and the quantization generally uses scalar quantization.
In an embodiment of the present invention, the memory feature codec is shown in Fig. 2: the memory feature is first quantized with scalar quantization, and entropy coding then further removes its redundancy to obtain the final bitstream. To further remove the redundancy of the quantized memory feature and improve coding efficiency, the embodiment uses a hyperprior network to model the probability of each value of the quantized memory feature, and entropy-codes the quantized memory feature according to the predicted probability distribution.
To obtain the probability of a specific memory feature value, the memory feature is fed into the hyperprior network to extract a variable z, which is transmitted as side information by arithmetic coding. At the encoding and decoding ends, following existing deep-learning-based image coding methods, the probability distribution of the memory feature is modeled as a Gaussian, and the variable z is used to predict the mean and variance of that distribution.
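The effect of this Gaussian model on code length can be shown with a small sketch. This illustrates the principle only (it is not the patent's network): given a mean and scale predicted from z, the estimated code length of a quantized value is the negative log-probability of its quantization bin under that Gaussian.

```python
import math

def gaussian_bits(m_hat, mean, scale):
    """Estimated code length (in bits) of a quantized value m_hat under a
    Gaussian model N(mean, scale^2), obtained by integrating the density
    over the quantization bin [m_hat - 0.5, m_hat + 0.5]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))
    p = max(cdf(m_hat + 0.5) - cdf(m_hat - 0.5), 1e-12)
    return -math.log2(p)

# A value near the predicted mean is cheap to code; an outlier is expensive.
cheap = gaussian_bits(0.0, mean=0.0, scale=1.0)
dear = gaussian_bits(5.0, mean=0.0, scale=1.0)
assert cheap < dear
```

The better the hyperprior predicts the mean and variance, the tighter the bins' probabilities match the actual feature values and the shorter the arithmetic-coded bitstream becomes.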
In addition, the quantization of the memory feature is integrated into the memory feature encoding network, and standard scalar quantization would make that network non-differentiable during training; therefore, during training, uniform noise is used in place of the standard scalar quantization. The specific formula is as follows:
M̂ = M + ε
where M̂ is the memory feature after quantization, M is the memory feature before quantization, and ε is noise uniformly distributed between −1/2 and 1/2.
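A minimal sketch of this train-time surrogate, assuming the memory feature is held in a numpy array (the real module applies this inside the encoding network):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(m, training):
    """Train-time surrogate for scalar quantization: additive uniform noise
    on (-1/2, 1/2) keeps the operation differentiable; at test time the
    feature is actually rounded to the nearest integer."""
    if training:
        return m + rng.uniform(-0.5, 0.5, size=m.shape)
    return np.round(m)

m = np.array([0.2, 1.7, -3.4])
m_train = quantize(m, training=True)
m_test = quantize(m, training=False)
assert np.all(np.abs(m_train - m) <= 0.5)  # noise stays within half a bin
assert np.array_equal(m_test, np.array([0.0, 2.0, -3.0]))
```

The noise has the same width as a quantization bin, so the training-time statistics of the surrogate match those of true rounding while gradients can still flow through.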
The role of the skeleton feature extraction is to extract the skeleton information that expresses the body motion in the video frame currently being encoded, mainly in the form of the positions of the key nodes of the human body. Using the existing PAFs human pose estimation network, a skeleton feature containing the positions of the human key nodes is output from this module.
The skeleton feature codec mainly includes two parts, skeleton feature compression coding and decoding. The encoding part exploits the temporal redundancy of the skeleton feature: predictive coding is applied first, and the residual between the predicted value and the true value is fed into an entropy coder to obtain the bitstream. The decoding part reconstructs the skeleton feature mainly through entropy decoding and prediction decoding. The entropy coder/decoder is usually an arithmetic codec.
The recall attention module mainly uses an attention mechanism to fuse the memory feature information and the skeleton feature information: on the basis of the memory feature containing the global appearance attribute information, the skeleton feature containing the body motion information of the particular frame is combined to obtain the fused feature information.
The specific implementation of the recall attention module in an embodiment of the present invention is shown in Fig. 4. As can be seen from the figure, the attention mechanism used in the embodiment can in fact be expressed as a function of three terms: a query (Q), keys (K), and values (V). The corresponding formula is as follows:
R(Q, K, V) = [W·Vᵀ + Q, V]
where Q is the query matrix, K is the key matrix, V is the value matrix, R is the fused feature information, and [·, ·] denotes concatenation. The matrix W is obtained from the query matrix Q and the key matrix K; the specific calculation formula is as follows:
W = Q·Kᵀ
The memory feature reconstruction M̂ is passed through two different convolutions with kernel size 1, stride 1, and the same number of channels as the input; the resulting features serve as the key matrix K and the value matrix V, respectively. The query matrix Q is the feature obtained from one convolution with kernel size 1, stride 1, and the same number of channels as the input.
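A toy sketch of this fusion, with flattened spatial positions as matrix rows. Two assumptions are made that the translated text leaves open: the query Q is taken from a second input (assumed here to be the skeleton reconstruction, since the module fuses both features), and W·Vᵀ is read as W·V so that the matrix shapes are consistent; no softmax normalization is applied since none is stated in the patent.

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution over flattened spatial positions is just a matrix
    multiply: x has shape (n_positions, c_in), w has shape (c_in, c_out)."""
    return x @ w

def attention_fuse(mem_rec, skel_rec, w_q, w_k, w_v):
    """Fuse memory and skeleton reconstructions: K and V come from 1x1
    convolutions of the memory reconstruction, Q from a 1x1 convolution of
    the (assumed) skeleton reconstruction; W = Q K^T, output [W V + Q, V]."""
    q = conv1x1(skel_rec, w_q)
    k = conv1x1(mem_rec, w_k)
    v = conv1x1(mem_rec, w_v)
    w = q @ k.T                      # (n, n) affinity between positions
    return np.concatenate([w @ v + q, v], axis=-1)

rng = np.random.default_rng(0)
n, c = 4, 3                          # 4 spatial positions, 3 channels
mem_rec = rng.standard_normal((n, c))
skel_rec = rng.standard_normal((n, c))
w_q, w_k, w_v = (rng.standard_normal((c, c)) for _ in range(3))
fused = attention_fuse(mem_rec, skel_rec, w_q, w_k, w_v)
assert fused.shape == (n, 2 * c)     # concatenation doubles the channel count
```

The concatenation preserves the raw values V alongside the attention-reweighted combination, so the generator receives both the appearance content and its motion-conditioned reweighting.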
In an embodiment of the present invention, the actual object compressed by the skeleton feature codec is the position coordinates of 18 nodes of the human body; the basic structure is shown in Fig. 3.
For the coordinate information of each body node, the embodiment uses the coordinate of the node at the previous moment as the predicted value, and during actual encoding the residual between the current node coordinate and the predicted value is the information actually coded. This residual information is further compressed with the adaptive arithmetic coding commonly used in conventional coding schemes, and the resulting bitstream is the final bitstream of the skeleton feature part. The decoding process is similar: the residual information is first obtained by arithmetic decoding, and prediction decoding then yields the skeleton feature reconstruction.
The generative adversarial network in an embodiment of the present invention includes two parts, a generator and a discriminator. The generator G takes the fused feature expression output by the recall attention module as conditional input and produces the video frame reconstruction; the generator network used by the present invention is the pix2pixHD network. The discriminator of the embodiment includes two parts, a spatial discriminator D_I and a temporal discriminator D_x, both with a VGG network structure. The input of the spatial discriminator D_I is the skeleton feature S of the current frame together with the generated frame information and the true frame information X; it judges whether the generated video frame is close to a true natural image. The input of the temporal discriminator D_x is the skeleton features S of the current frame and the previous frame together with the corresponding generated and true frame information; its role is mainly to guarantee the continuity between consecutive generated video frames. Together the two discriminators constitute the adversarial loss l_adv of the network. Its calculation formula is as follows:
where s_t is the skeleton feature information at time t, x_t is the video frame at time t, G is the generator, and E is the expectation.
The loss function of the generative adversarial network G is designed as follows:
L = l_adv + λ_comp·l_comp + λ_fm·l_fm + λ_VGG·l_VGG
where l_adv is the adversarial loss corresponding to the two discriminators and l_comp is the bitstream size of the memory feature; l_fm and l_VGG are the feature matching loss and the VGG network perceptual loss added with reference to existing generative adversarial network works; and λ_comp, λ_fm and λ_VGG are the weights of the corresponding losses. In an embodiment of the present invention the weights are set to λ_comp = 1, λ_fm = 10, λ_VGG = 10.
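The weighted objective above can be written directly as a small helper. This is a sketch of the combination step only; the individual loss terms would come from the discriminators, the bitstream-size estimate, and the feature-matching/VGG networks.

```python
def generator_loss(l_adv, l_comp, l_fm, l_vgg,
                   lam_comp=1.0, lam_fm=10.0, lam_vgg=10.0):
    """Weighted generator objective:
    L = l_adv + lam_comp*l_comp + lam_fm*l_fm + lam_vgg*l_vgg,
    with the embodiment's weights (1, 10, 10) as defaults."""
    return l_adv + lam_comp * l_comp + lam_fm * l_fm + lam_vgg * l_vgg

# With the embodiment's weights, unit losses give 1 + 1 + 10 + 10 = 22.
assert generator_loss(1.0, 1.0, 1.0, 1.0) == 22.0
```

Including the bitstream term l_comp in the generator objective is what ties reconstruction quality to the compression rate, making the whole system trainable end-to-end toward a rate-distortion trade-off.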
The present invention structurally decomposes a human motion video into two parts, memory information and skeleton information, making full use of the structural information of the video content, and thereby achieves quality superior to conventional video coding methods (H.264, H.265).
The present invention was tested for coding performance on the KTH dataset and the APE dataset. For the KTH dataset, 8 of its video sequences were randomly selected as the test set; for the APE dataset, 7 video sequences were randomly extracted for testing. The specific performance results are shown in Fig. 5 and Table 1.
Fig. 5 compares the subjective quality of the present invention and a traditional coding method. As can be seen from the figure, the present invention restores more detailed information than the traditional method and does not suffer from severe blurring or blocking artifacts.
Table 1 compares the average coding performance of the present invention and conventional coding schemes on the test sequences. As can be seen from the table, the present invention achieves coding performance comparable to H.264 on the KTH dataset, while on the APE dataset the PSNR of the present invention is 3.61 dB higher than that of H.265. In the above comparison, on both datasets the bitstream size used by the method of the present invention is only about 50% of that of the conventional coding schemes.
Table 1. Average coding performance comparison of the present invention and conventional coding schemes on the test sequences
In short, the present invention proposes a low bit-rate human motion video compression model based on a generative adversarial network whose performance is significantly improved over current coding-standard methods. For the coding of videos containing human motion, such as surveillance video in the public-safety field, the present invention has great application prospects.
The above are only preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily occur to anyone skilled in the art within the technical scope disclosed by the present invention shall be included within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A low bit-rate human motion video coding system based on a generative adversarial network, characterized by comprising: a memory feature extraction module, a memory feature coding/decoding module, a skeleton feature extraction module, a skeleton feature coding/decoding module, a recall attention module and a generative adversarial network module, wherein:
the memory feature extraction module, based on a convolutional recurrent neural network, inputs the video frames of a group of pictures into the convolutional recurrent neural network in temporal order, and the output obtained after the last video frame is input is the memory feature;
the memory feature coding/decoding module comprises a memory feature coding module and a memory feature decoding module; the memory feature is input into the memory feature coding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then applies memory feature entropy coding to the quantized output to obtain the bitstream of the memory feature part for transmission; the bitstream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the bitstream to obtain the quantized memory feature reconstruction, and then inversely quantizes the quantized output to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be encoded into a human pose estimation network for node position estimation, obtaining a skeleton feature containing the position information of key human nodes; the skeleton feature consists of the key human nodes and joint positions, the key human nodes including the head, hands, feet and body;
the skeleton feature coding/decoding module encodes and decodes the skeleton feature of the video frame to be encoded; the skeleton feature is input into the skeleton feature coding part, which first performs predictive coding on the input skeleton feature to obtain the residual information actually transmitted, and then applies skeleton feature residual entropy coding to the residual information to obtain the bitstream of the skeleton feature part for transmission; the transmitted bitstream is input into the skeleton feature decoding part, which first performs skeleton feature entropy decoding on the bitstream to obtain the skeleton feature residual reconstruction, and then performs prediction decoding on the skeleton feature residual reconstruction to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module uses an attention mechanism to fuse the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator takes the fused feature information obtained by the recall attention module as input to the generation network to obtain a generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the training of the generation network, i.e., part of the loss function, to optimize the training of the generator.
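The memory feature of claim 1 is simply the hidden state left after feeding a group of pictures through a recurrent network in temporal order. The patent does not disclose the exact architecture; the sketch below uses a hypothetical plain (non-convolutional) recurrent update with made-up weight matrices `W_x` and `W_h` only to illustrate the "last-frame hidden state as memory feature" idea:

```python
import numpy as np

def extract_memory_feature(frames, W_x, W_h):
    """Toy recurrent update standing in for the convolutional recurrent
    network of the memory feature extraction module: frames are fed in
    temporal order, and the hidden state remaining after the last frame
    is taken as the memory feature."""
    h = np.zeros(W_h.shape[0])
    for x in frames:  # frames in temporal order
        h = np.tanh(W_x @ x + W_h @ h)
    return h

# Stand-in data: 8 frames flattened to 16-dim vectors, 4-dim memory.
rng = np.random.default_rng(0)
frames = [rng.normal(size=16) for _ in range(8)]
W_x = rng.normal(size=(4, 16)) * 0.1
W_h = rng.normal(size=(4, 4)) * 0.1
memory = extract_memory_feature(frames, W_x, W_h)
```

In the real system each `x` would be a full video frame and the update a convolutional recurrent cell, but the information flow is the same.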
2. The low bit-rate human motion video coding system based on a generative adversarial network according to claim 1, characterized in that: in the memory feature extraction module, the memory feature contains the global appearance attribute information of the successive frames of the group of pictures; the group of pictures is a set consisting of a certain number of successive video frames; the global appearance attribute information of the successive frames of the group of pictures is the video appearance attribute information extracted by inputting the video frames of the group of pictures into the convolutional recurrent neural network, and the video appearance attribute information includes the background appearance in the video and the appearance attribute information of the face and clothing of the human body in the scene.
3. The low bit-rate human motion video coding system based on a generative adversarial network according to claim 1, characterized in that: in the memory feature coding/decoding module, the memory feature coding module comprises two parts: quantization of the memory feature and entropy coding of the quantized feature; the memory feature quantization applies an existing quantization method to each feature value in the memory feature respectively, obtaining the quantized memory feature after each feature value is quantized; the memory feature entropy coding is an entropy coding method for the quantized memory feature, which, according to an existing entropy coding method, entropy-codes the quantized memory feature to obtain the bitstream of the memory feature part for transmission; the memory feature decoding module comprises the two corresponding parts of entropy decoding and inverse quantization; the memory feature entropy decoding, according to an entropy decoding method corresponding to the memory feature entropy coding, entropy-decodes the bitstream transmitted for the memory feature part, obtaining the quantized memory feature reconstruction; the memory feature inverse quantization, according to the inverse of the memory feature quantization method, inversely quantizes the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module for the reconstruction of the video frame.
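Claim 3 leaves the quantizer as "an existing quantization method". A minimal sketch, assuming uniform scalar quantization with a hypothetical step size of 0.1 (the entropy coding of the integer indices, e.g. arithmetic coding, is not shown):

```python
import numpy as np

def quantize(feature, step=0.1):
    """Uniform scalar quantization of each memory feature value; the
    resulting integer indices are what would be entropy coded into the
    transmitted bitstream."""
    return np.round(feature / step).astype(np.int32)

def dequantize(indices, step=0.1):
    """Inverse quantization at the decoder: maps indices back to
    reconstruction levels, giving the memory feature reconstruction."""
    return indices.astype(np.float64) * step

feature = np.array([0.23, -1.07, 0.049, 2.5])
recon = dequantize(quantize(feature))
```

The round-trip error of such a quantizer is bounded by half the step size, which is the lossy part of the memory feature path.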
4. The low bit-rate human motion video coding system based on a generative adversarial network according to claim 1, characterized in that: the skeleton feature coding/decoding module comprises two parts: skeleton feature compression coding and decoding; the skeleton feature coding part comprises two parts: predictive coding and entropy coding; the predictive coding exploits the temporal redundancy of the skeleton feature, taking for the coordinate of each node the node coordinate of the frame preceding the encoded video frame as the predicted value, and computing the residual between the node predicted value and the true node coordinate of the frame; the resulting residual then undergoes skeleton feature entropy coding; the skeleton feature entropy coding is an entropy coder for the skeleton feature, which, according to an existing entropy coding method, entropy-codes the residual obtained by skeleton feature predictive coding to obtain the bitstream information of the skeleton feature; the skeleton feature decoding part uses the bitstream obtained after skeleton feature coding and achieves lossless reconstruction of the skeleton feature through entropy decoding and prediction decoding; the skeleton feature entropy decoding, according to an existing entropy decoding method, entropy-decodes the skeleton feature bitstream to obtain the residual produced by skeleton feature predictive coding; the skeleton feature prediction decoding then adds the node coordinates of the previous moment to the residual obtained by skeleton feature entropy decoding, obtaining the decoded skeleton feature reconstruction.
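The predictive coding loop of claim 4 can be sketched directly: each frame's node coordinates are predicted by the previous frame's, only the residual is transmitted, and decoding accumulates residuals. The bootstrap for the first frame (sent as-is below) is an assumption, since the claim does not specify it:

```python
import numpy as np

def encode_residuals(coords_seq):
    """Predictive coding of per-frame node coordinates: the predictor
    for each frame is the previous frame's coordinates, and only the
    residual goes on to entropy coding. First frame sent as-is
    (assumed bootstrap)."""
    residuals = [coords_seq[0].copy()]
    for prev, cur in zip(coords_seq, coords_seq[1:]):
        residuals.append(cur - prev)
    return residuals

def decode_residuals(residuals):
    """Prediction decoding: add each residual to the previously decoded
    coordinates, recovering the skeleton feature losslessly."""
    coords = [residuals[0].copy()]
    for r in residuals[1:]:
        coords.append(coords[-1] + r)
    return coords

# 3 frames, each with 4 integer (x, y) node coordinates.
rng = np.random.default_rng(1)
seq = [rng.integers(0, 100, size=(4, 2)) for _ in range(3)]
decoded = decode_residuals(encode_residuals(seq))
```

Because the coordinates are integers and the residual/accumulate pair is exact, the reconstruction is lossless, matching the claim.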
5. A low bit-rate human motion video coding method based on a generative adversarial network, characterized by comprising the following steps:
(1) the successive frames of a group of pictures are sequentially input into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) the memory feature is input into the memory feature coding module, which quantizes the memory feature and then applies memory feature entropy coding, obtaining the bitstream output of the memory feature part;
(3) the video frame to be encoded is input into the skeleton feature extraction module, which obtains the skeleton feature using a human pose estimation network;
(4) the skeleton feature is input into the skeleton feature coding module; after predictive coding, the skeleton feature undergoes skeleton feature entropy coding, obtaining the bitstream output of the skeleton feature part;
(5) the bitstream of the memory feature part is input into the memory feature decoding module, which entropy-decodes the memory feature bitstream and then obtains the memory feature reconstruction by memory feature inverse quantization; meanwhile, the skeleton feature bitstream is input into the skeleton feature decoding module, which entropy-decodes the skeleton feature bitstream and then obtains the skeleton feature reconstruction by prediction decoding;
(6) the memory feature reconstruction and the skeleton feature reconstruction are input into the recall attention module, which uses an attention mechanism to obtain the fused feature combining the two parts of information;
(7) the fused feature is used as the conditional input of the generative adversarial network generator to obtain the reconstructed video frame; during training, the discriminator gives a score on whether the video frame reconstruction generated by the generator conforms to natural video, and this score is used for training the generator.
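Step (6) fixes only that an attention mechanism fuses the two reconstructions; the exact form is not disclosed. A hypothetical sketch, assuming both reconstructions are same-length vectors and the skeleton reconstruction acts as a query whose softmax weights gate the memory channels:

```python
import numpy as np

def recall_attention_fuse(memory_recon, skeleton_recon):
    """Hypothetical attention fusion: elementwise relevance scores
    between the two reconstructions are softmax-normalized and used to
    gate the memory feature, then concatenated with the skeleton
    feature as the fused conditional input to the generator."""
    scores = memory_recon * skeleton_recon
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return np.concatenate([weights * memory_recon, skeleton_recon])

memory_recon = np.array([0.2, -0.5, 0.9, 0.1])
skeleton_recon = np.array([0.4, 0.0, 0.7, -0.3])
fused = recall_attention_fuse(memory_recon, skeleton_recon)
```

Any learned attention (e.g. query/key projections) would slot into the same position in the pipeline.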
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910479249.8A CN110290386B (en) | 2019-06-04 | 2019-06-04 | Low-bit-rate human motion video coding system and method based on generation countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110290386A true CN110290386A (en) | 2019-09-27 |
CN110290386B CN110290386B (en) | 2022-09-06 |
Family
ID=68003085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910479249.8A Active CN110290386B (en) | 2019-06-04 | 2019-06-04 | Low-bit-rate human motion video coding system and method based on generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110290386B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929242A (en) * | 2019-11-20 | 2020-03-27 | 上海交通大学 | Method and system for carrying out attitude-independent continuous user authentication based on wireless signals |
CN110942463A (en) * | 2019-10-30 | 2020-03-31 | 杭州电子科技大学 | Video target segmentation method based on generation countermeasure network |
CN111967340A (en) * | 2020-07-27 | 2020-11-20 | 中国地质大学(武汉) | Abnormal event detection method and system based on visual perception |
CN112950729A (en) * | 2019-12-10 | 2021-06-11 | 山东浪潮人工智能研究院有限公司 | Image compression method based on self-encoder and entropy coding |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120163675A1 (en) * | 2010-12-22 | 2012-06-28 | Electronics And Telecommunications Research Institute | Motion capture apparatus and method |
US20130107003A1 (en) * | 2011-10-31 | 2013-05-02 | Electronics And Telecommunications Research Institute | Apparatus and method for reconstructing outward appearance of dynamic object and automatically skinning dynamic object |
CN108174225A (en) * | 2018-01-11 | 2018-06-15 | 上海交通大学 | Filter achieving method and system in coding and decoding video loop based on confrontation generation network |
CN108596149A (en) * | 2018-05-10 | 2018-09-28 | 上海交通大学 | The motion sequence generation method for generating network is fought based on condition |
CN109086869A (en) * | 2018-07-16 | 2018-12-25 | 北京理工大学 | A kind of human action prediction technique based on attention mechanism |
Non-Patent Citations (2)
Title |
---|
TIANYU HE et al.: "End-to-End Facial Image Compression with Integrated Semantic Distortion Metric", 2018 IEEE Visual Communications and Image Processing (VCIP) |
TIAN Man et al.: "Research on multi-model fusion action recognition", Electronic Measurement Technology |
Also Published As
Publication number | Publication date |
---|---|
CN110290386B (en) | 2022-09-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||