CN110290386B - Low-bit-rate human motion video coding system and method based on generative adversarial network
- Publication number
- CN110290386B (application CN201910479249.8A)
- Authority
- CN
- China
- Prior art keywords
- characteristic
- memory
- coding
- module
- skeleton
- Prior art date
- Legal status: Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/124—Quantisation
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/177—Adaptive coding characterised by the coding unit, the unit being a group of pictures [GOP]
- H04N19/50—Predictive coding
- H04N19/65—Error resilience
- H04N19/85—Pre-processing or post-processing specially adapted for video compression
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network. The system fully exploits the content structure of human motion video by decomposing the video content information into two parts: a memory feature containing global appearance attribute information and a skeleton feature expressing the human motion information. An attention mechanism and a generative adversarial network are then used to achieve more efficient low-bit-rate video compression.
Description
Technical Field
The invention relates to a low-bit-rate human motion video coding system and method based on a generative adversarial network, and belongs to the technical field of video image coding and compression.
Background
The traditional hybrid video coding framework (MPEG-2, H.264, H.265, etc.) decomposes video compression into four basic steps: prediction, transformation, quantization, and entropy coding. Within this framework, redundant information between adjacent video frames is removed mainly by motion compensation on fixed-size blocks. Through years of development, the performance of traditional encoders has improved continuously. With the development of deep learning, models that implement video compression coding with deep learning have been proposed. Unlike traditional coding, deep-learning-based video coding can learn better transforms. In addition, an end-to-end deep learning model can train itself, with all modules adjusted automatically toward a unified optimization objective.
The above coding schemes all take pixel-level fidelity as the optimization target, and the structural information of the video content is not fully exploited during coding. For example, surveillance video in the public security field is very important for person identification, and a large number of videos containing human motion must be encoded and later analyzed. Structurally, such videos consist of two parts: global appearance attribute information and human motion information. The invention therefore exploits the structural information of human motion video to further improve coding efficiency.
Disclosure of Invention
The invention addresses the deficiencies of the prior art by providing a human motion video coding system and method based on a generative adversarial network. It fully exploits the structure of human motion video content by decomposing the video content information into two parts, a memory feature containing global appearance attribute information and a skeleton feature expressing the human motion information, and uses an attention mechanism and a generative adversarial network to achieve more efficient low-bit-rate video compression.
The technical scheme of the invention is as follows: a low-bit-rate human motion video coding system based on a generative adversarial network, integrating structure information extraction, structure information fusion, and feature information coding/decoding for human motion video content, comprises a memory feature extraction module, a memory feature codec module, a skeleton feature extraction module, a skeleton feature codec module, a recall attention module, and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order, and the output obtained after the last video frame has been input is the memory feature;
the memory feature codec module comprises a memory feature encoding module and a memory feature decoding module; the memory feature is input into the memory feature encoding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then entropy-codes the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes it to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features that contain the position information of the key nodes of the human body; the skeleton features are the positions of the key nodes and joints of the human body, the key nodes comprising the head, hands, feet, and torso; to improve coding efficiency, the skeleton features are input into the skeleton feature codec module for further compression;
the skeleton feature codec module encodes and decodes the skeleton features of the coded video frame; the skeleton features are input into the skeleton feature encoding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then entropy-codes the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first entropy-decodes the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module using an attention mechanism, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function for training the generation network, i.e. to optimize the training of the generator.
In the memory feature extraction module, the memory feature comprises the global appearance attribute information of the continuous frames of the group of pictures. The group of pictures is a set of continuous video frames of a certain length. The global appearance attribute information of the continuous frames is the video appearance attribute information extracted by inputting the video frames of the group of pictures into the convolutional recurrent neural network; it comprises the appearance of the background within the video and the appearance attribute information of the facial features and clothing of the people within the scene.
The convolutional recurrent neural network in the memory feature extraction module comprises a convolutional recurrent layer, which establishes temporal relations in the manner of a recurrent neural network while describing local spatial features in the manner of a convolutional neural network; it has 128 channels, a kernel size of 5, and a stride of 2.
In the memory feature codec module, the memory feature encoding module comprises two parts: quantization of the memory feature and entropy coding of the quantized feature. The memory feature quantization quantizes each feature value in the memory feature according to an existing quantization method to obtain the quantized memory feature. The memory feature entropy coding entropy-codes the quantized memory feature according to an existing entropy coding method to obtain the code stream of the memory feature part for transmission. The memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts: the memory feature entropy decoding entropy-decodes the code stream transmitted by the memory feature part according to the entropy decoding method corresponding to the memory feature entropy coding, obtaining the quantized memory feature reconstruction; the inverse quantization operation, applied according to the inverse of the memory feature quantization, then yields the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame.
The skeleton feature codec module comprises a skeleton feature compression encoding part and a skeleton feature decoding part. The skeleton feature encoding part comprises predictive coding and entropy coding. The predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the corresponding node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node in the current frame is computed, and the resulting residual is passed to the skeleton feature entropy coding. The skeleton feature entropy coding is an entropy coder for the skeleton features: the residual obtained by the skeleton feature predictive coding is entropy-coded according to an existing entropy coding method to obtain the code stream information of the skeleton features. The skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature encoding through entropy decoding and predictive decoding: the skeleton feature entropy decoding entropy-decodes the skeleton feature code stream according to the corresponding entropy decoding method to obtain the residual produced by the skeleton feature predictive coding, and the skeleton feature predictive decoding adds the node coordinates at the previous moment to the residual obtained by the skeleton feature entropy decoding to obtain the decoded skeleton feature reconstruction.
In the skeleton feature extraction module, the human pose estimation network is a PAF network with a two-branch structure. One branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the joint confidence maps, which express the position of each detected node and the confidence that the node belongs to a human body node. The other branch passes through 3 convolutional layers with kernel size 3 and then 2 convolutional layers with kernel size 1x1 to obtain the part affinity fields (PAFs). PAFs are a set of 2D vectors, each encoding the position and orientation of a limb; they are learned and predicted jointly with the joint confidence maps.
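For illustration, a minimal PyTorch sketch of such a two-branch head follows. The channel counts (128 input feature channels, 18 confidence maps, 38 PAF channels) are assumptions for the sketch; the patent text specifies only the layer counts and kernel sizes.

```python
import torch.nn as nn

def pose_branch(in_ch, out_ch, mid_ch=128):
    """Three 3x3 convolutions followed by two 1x1 convolutions."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

confidence_branch = pose_branch(128, 18)  # one confidence map per body node (assumed)
paf_branch = pose_branch(128, 38)         # one 2D vector field per limb (assumed)
```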
The low-bit-rate human motion video coding method of the invention, based on a generative adversarial network, comprises the following steps:
(1) inputting the continuous frames of a group of pictures in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) inputting the memory feature into the memory feature encoding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) inputting the video frame to be coded into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) inputting the skeleton features into the skeleton feature encoding module, which applies predictive coding and entropy coding to the skeleton features, obtaining the code stream of the skeleton feature part for output;
(5) inputting the code stream of the memory feature part into the memory feature decoding module, which entropy-decodes the memory feature code stream and then inverse-quantizes it to obtain the memory feature reconstruction, and inputting the skeleton feature code stream into the skeleton feature decoding module, which entropy-decodes the skeleton feature code stream and then applies predictive decoding to obtain the skeleton feature reconstruction;
(6) inputting the memory feature reconstruction and the skeleton feature reconstruction into the recall attention module, which obtains the fused feature combining the two parts of information using an attention mechanism;
(7) inputting the fused feature as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and the score is used to train the generator.
Compared with the prior art, the invention has the following advantages:
(1) unlike traditional video coding based on blocks and pixel-level distortion, the method decomposes the video information into memory features and skeleton features for the first time and makes full use of the video structure information, thereby further improving coding efficiency;
(2) the invention reconstructs video frames with a generative adversarial network, which can recover in a generative manner the information lost during decoding, thereby improving subjective quality;
(3) for the processing of the decomposed features, the invention first proposes fusing the features obtained from the video decomposition in a recall attention module, which uses an attention mechanism to effectively fuse the memory feature reconstruction and the skeleton feature reconstruction, ensuring that the video frame can be recovered and reconstructed from this information.
Drawings
FIG. 1 is a general block diagram of the system of the present invention;
FIG. 2 is a block diagram of a memory feature codec module according to the present invention;
FIG. 3 is a block diagram of the structure of the bone feature encoding/decoding module according to the present invention;
FIG. 4 is a block diagram of a recall attention module according to the present invention;
FIG. 5 is a subjective quality comparison of the invention and a traditional coding method.
Detailed Description
The technical scheme of the invention is described clearly and completely below in connection with the memory feature coding/decoding method, the skeleton feature coding/decoding method, the recall attention module, and the generative adversarial network implementation of the invention. It should be understood that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the present invention without inventive effort fall within the protection scope of the invention.
As shown in FIG. 1, the coding framework of the invention includes the following modules: memory feature extraction, memory feature coding/decoding, skeleton feature extraction, skeleton feature coding/decoding, the recall attention module, and the generative adversarial network. Wherein:
the memory feature extraction network is used for extracting memory features containing image group continuous frame global appearance attribute information, and the main structure of the memory feature extraction network is a convolution cyclic neural network. The method comprises a convolution cycle layer, wherein the convolution cycle layer establishes a time sequence relation similar to a cyclic neural network, and can describe local spatial features like the convolution neural network. The number of channels of the convolution cycle layer is 128, the kernel size is 5, and the step size is 2.
The memory feature codec mainly comprises a memory feature encoding part and a memory feature decoding part. The encoding part comprises quantization of the memory feature and entropy coding of the quantized feature; the decoding part comprises the corresponding entropy decoding and inverse quantization. The entropy coder/decoder is typically an arithmetic coder/decoder, and quantization typically uses scalar quantization.
In the embodiment of the invention, as shown in FIG. 2, the memory feature is first quantized using scalar quantization, and entropy coding then further removes its redundancy to obtain the final code stream. To further remove the redundancy of the quantized memory feature and improve coding efficiency, the embodiment uses a hyperprior network to model the probability of each value of the quantized memory feature, and the quantized memory feature is entropy-coded according to the predicted probability distribution.
To obtain the probability of a specific memory feature, the memory feature is input into the hyperprior network to extract a variable z, which is then transmitted as side information via arithmetic coding. At the encoding and decoding ends, following existing deep-learning-based image coding methods, the probability distribution of the memory feature is modeled as a Gaussian distribution, and the variable z is used to predict the mean and variance of that distribution.
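As a minimal sketch of how such a Gaussian entropy model yields a rate estimate, assuming unit-width quantization bins (the function name and the numeric clamp are illustrative, not from the patent):

```python
import torch

def memory_feature_bits(m_hat, mu, sigma):
    """m_hat: quantized memory feature; mu, sigma: Gaussian mean and scale
    predicted from the hyperprior variable z (tensors of the same shape)."""
    gauss = torch.distributions.Normal(mu, sigma)
    # Probability mass of the unit-width quantization bin around each value.
    p = gauss.cdf(m_hat + 0.5) - gauss.cdf(m_hat - 0.5)
    return -torch.log2(p.clamp_min(1e-9)).sum()  # estimated bits to code m_hat
```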
In addition, the quantization of the memory feature is integrated into the memory feature encoding network, and standard scalar quantization is non-differentiable, which would make the network untrainable; therefore, uniform noise replaces the standard scalar quantization process during training. The specific formula is as follows:

m̂ = m + ε,  ε ~ U(−1/2, 1/2)

In the above formula, m is the memory feature before quantization, m̂ is its training-time approximation, and ε is uniformly distributed noise on [−1/2, 1/2].
The skeleton feature extraction extracts the skeleton information expressing the human motion in the video frame currently being coded, mainly in the form of the positions of the key nodes of the human body. Using the existing PAFs human pose estimation network, this module outputs skeleton features containing the position information of the key nodes of the human body.
The skeleton feature codec mainly comprises skeleton feature compression encoding and decoding. The encoding part exploits the temporal redundancy of the skeleton features: predictive coding is applied first, and the residual between the predicted value and the true value is input into an entropy coder to obtain the code stream output. The decoding part reconstructs the skeleton features through entropy decoding and predictive decoding. The entropy codec is typically an arithmetic codec.
The recall attention module uses an attention mechanism to fuse the memory feature information and the skeleton feature information: on the basis of the memory feature containing the global appearance attribute information, it incorporates the skeleton feature containing the human motion of the specific frame, obtaining the fused feature information.
The specific implementation of the recall attention module of the embodiment of the invention is shown in FIG. 4. As the figure shows, the attention mechanism used in the embodiment can be expressed as a function of a query and key-value pairs. The corresponding formula is as follows:
R(Q, K, V) = [WV^T + Q, V]
wherein Q is the query matrix, K is the key matrix, V is the value matrix, R is the fused feature information, and [·, ·] denotes concatenation. The matrix W is obtained from the query matrix Q and the key matrix K, with the specific calculation formula:
W = QK^T
memory feature reconstructionThe features obtained after convolution with two different convolution kernels of size 1, the number of channels being the same as the input and the step length being 1 will be respectively used as the actual key matrix K and value matrix V, while the query matrix Q is the features obtained after convolution with a convolution kernel of size 1, the number of channels being the same as the input and the step length being 1.
In the embodiment of the invention, the actual object compressed by the skeleton feature codec is the position coordinates of the 18 nodes of the human body; its basic structure is shown in FIG. 3.
For the coordinate information of each node of the body, the embodiment uses the node's coordinates at the previous moment as the predicted value, and in the actual coding process the residual between the node's coordinates at the current moment and the predicted value is the information actually coded. The residual information is further compressed with the adaptive arithmetic coding commonly used in traditional coding schemes, and the resulting code stream is the final code stream of the skeleton feature part. The decoding process is similar: the residual information is first obtained through arithmetic decoding, and the skeleton feature reconstruction is then obtained through predictive decoding.
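A minimal NumPy sketch of this predictive codec, assuming integer coordinates for the 18 nodes per frame; the adaptive arithmetic coder is left abstract, and the residuals below are what it would compress:

```python
import numpy as np

def encode_skeleton(coords):
    """coords: (T, 18, 2) integer joint coordinates. Frame 0 is coded
    directly; later frames are coded as residuals against the previous
    frame's coordinates."""
    pred = np.concatenate([np.zeros_like(coords[:1]), coords[:-1]], axis=0)
    return coords - pred  # residuals, to be entropy-coded

def decode_skeleton(residuals):
    """Lossless inverse: previous-frame prediction plus residual is
    exactly a cumulative sum over time."""
    return np.cumsum(residuals, axis=0)
```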
The generative adversarial network of the embodiment of the invention comprises a generator and a discriminator. The generator G takes the fused feature representation output by the recall attention module as conditional input to obtain the video frame reconstruction; the generator network used by the invention is a pix2pixHD network. The discriminator of the embodiment comprises a spatial discriminator D_I and a temporal discriminator D_T, both with a VGG network structure. The input of the spatial discriminator D_I is the skeleton feature S of the current frame together with the generated frame information and the real frame information X; it judges whether the generated video frame is close to a real natural image. The input of the temporal discriminator D_T is the skeleton features S of the current and previous frames together with the generated frame information and the real frame information; its main purpose is to ensure continuity between adjacent frames of the generated video. The two discriminators together form the adversarial loss l_adv of the network, calculated as follows:
l_adv = Σ_{D ∈ {D_I, D_T}} E_{s,x}[log D(s_t, x_t)] + E_s[log(1 − D(s_t, G(s_t)))]

where s_t is the skeleton feature information at time t, x_t is the video frame at time t, G is the generator of the adversarial network, and E denotes expectation.
The loss function of the generator G is designed as follows:
l = l_adv + λ_comp·l_comp + λ_fm·l_fm + λ_VGG·l_VGG
where l_adv is the adversarial loss corresponding to the two discriminators, l_comp is the code stream size of the memory feature, and l_fm and l_VGG are the feature matching loss and the VGG perceptual loss adopted from existing work on generative adversarial networks. λ_comp, λ_fm, and λ_VGG are the weights of the corresponding losses; in the embodiment of the invention they are set to λ_comp = 1, λ_fm = 10, and λ_VGG = 10.
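Expressed as a sketch, the combination is a weighted sum with the embodiment's weights as defaults; the individual loss terms are placeholders assumed to be computed elsewhere:

```python
def generator_loss(l_adv, l_comp, l_fm, l_vgg,
                   lam_comp=1.0, lam_fm=10.0, lam_vgg=10.0):
    """l = l_adv + λ_comp·l_comp + λ_fm·l_fm + λ_VGG·l_VGG."""
    return l_adv + lam_comp * l_comp + lam_fm * l_fm + lam_vgg * l_vgg
```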
The invention structurally decomposes human motion video into two parts, memory information and skeleton information, and makes full use of the structure information of the video content, thereby achieving quality superior to traditional video coding methods (H.264 and H.265).
The invention was tested for coding performance on the KTH and APE data sets. For the KTH data set, 8 video sequences were randomly extracted as the test set; for the APE data set, 7 video sequences were randomly extracted for testing. Specific performance results are shown in FIG. 5 and Table 1.
FIG. 5 is a subjective quality comparison of the invention and a traditional coding method. As the figure shows, the invention recovers more detail than the traditional method and does not exhibit severe blurring or blocking artifacts.
Table 1 compares the average coding performance of the invention and traditional coding schemes over the test sequences. As the table shows, the invention achieves coding performance comparable to H.264 on the KTH data set, and on the APE data set its PSNR is 3.61 dB higher than that of H.265. In these comparisons, the code stream used on the two data sets is only about 50% of that of the traditional coding schemes.
Table 1. Comparison of the average coding performance of the invention and traditional coding schemes over the test sequences
In summary, the invention provides a low-bit-rate human motion video compression model based on a generative adversarial network whose performance is greatly improved over current standard coding methods. The invention has broad application prospects in the coding of video containing human motion, such as surveillance video in the public security field.
The above is only a preferred embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or substitution that can readily be conceived by a person skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. The protection scope of the invention is therefore subject to the protection scope of the claims.
Claims (2)
1. A low-bit-rate human motion video coding system based on a generative adversarial network, characterized by comprising: a memory feature extraction module, a memory feature codec module, a skeleton feature extraction module, a skeleton feature codec module, a recall attention module, and a generative adversarial network module, wherein:
the memory feature extraction module inputs the video frames of a group of pictures into a convolutional recurrent neural network in temporal order, and the output obtained after the last video frame has been input is the memory feature;
the memory feature codec module comprises a memory feature encoding module and a memory feature decoding module; the memory feature is input into the memory feature encoding module, which first quantizes the memory feature to obtain the quantized memory feature output, and then entropy-codes the quantized memory feature to obtain the code stream of the memory feature part for transmission; the code stream of the memory feature part is input into the memory feature decoding module, which first entropy-decodes the code stream to obtain the quantized memory feature reconstruction, and then inverse-quantizes it to obtain the memory feature reconstruction, which is input into the recall attention module;
the skeleton feature extraction module inputs the video frame to be coded into a human pose estimation network for node position estimation, obtaining skeleton features that contain the position information of the key nodes of the human body, the skeleton features being the positions of the key nodes and joints of the human body, and the key nodes comprising the head, hands, feet, and torso;
the skeleton feature codec module encodes and decodes the skeleton features of the coded video frame; the skeleton features are input into the skeleton feature encoding part, which first applies predictive coding to the input skeleton features to obtain the residual information actually transmitted, and then entropy-codes the residual information to obtain the code stream of the skeleton feature part for transmission; the transmitted code stream is input into the skeleton feature decoding part, which first entropy-decodes the code stream to obtain the skeleton feature residual reconstruction, and then applies predictive decoding to obtain the final skeleton feature reconstruction, which is input into the recall attention module;
the recall attention module fuses the memory feature reconstruction obtained by the memory feature decoding module with the skeleton feature reconstruction obtained by the skeleton feature decoding module using an attention mechanism, obtaining the fused feature information;
the generative adversarial network module comprises a generator and a discriminator; the generator inputs the fused feature information obtained by the recall attention module into the generation network to obtain the generated video frame; the discriminator judges whether the generated video frame is consistent with a real natural video frame and gives a score, which is used as part of the loss function for training the generation network, i.e. to optimize the training of the generator;
in the memory feature extraction module, the memory feature comprises the global appearance attribute information of the continuous frames of the group of pictures; the group of pictures is a set of continuous video frames of a certain length; the global appearance attribute information of the continuous frames is the video appearance attribute information extracted by inputting the video frames of the group of pictures into the convolutional recurrent neural network, and comprises the appearance of the background within the video and the appearance attribute information of the facial features and clothing of the people within the scene;
in the memory feature codec module, the memory feature encoding module comprises quantization of the memory feature and entropy coding of the quantized feature; the memory feature quantization quantizes each feature value in the memory feature according to an existing quantization method to obtain the quantized memory feature; the memory feature entropy coding entropy-codes the quantized memory feature according to an existing entropy coding method to obtain the code stream of the memory feature part for transmission; the memory feature decoding module comprises the corresponding entropy decoding and inverse quantization parts; the memory feature entropy decoding entropy-decodes the code stream transmitted by the memory feature part according to the entropy decoding method corresponding to the memory feature entropy coding to obtain the quantized memory feature reconstruction; the memory feature inverse quantization applies the inverse quantization method corresponding to the memory feature quantization to the quantized memory feature to obtain the memory feature reconstruction, which is input into the recall attention module for reconstructing the video frame;
the skeleton feature codec module comprises a skeleton feature compression encoding part and a skeleton feature decoding part; the skeleton feature encoding part comprises predictive coding and entropy coding; the predictive coding exploits the temporal redundancy of the skeleton features: for each node coordinate, the coordinate of the corresponding node in the previous frame of the coded video frame is used as the predicted value, the residual between the predicted value and the true coordinate of the corresponding node in the current frame is computed, and the resulting residual is passed to the skeleton feature entropy coding; the skeleton feature entropy coding is an entropy coder for the skeleton features, and the residual obtained by the skeleton feature predictive coding is entropy-coded according to an existing entropy coding method to obtain the code stream information of the skeleton features; the skeleton feature decoding part achieves lossless reconstruction of the skeleton features from the code stream obtained after skeleton feature encoding through entropy decoding and predictive decoding; the skeleton feature entropy decoding entropy-decodes the skeleton feature code stream according to the corresponding entropy decoding method to obtain the residual produced by the skeleton feature predictive coding; and the skeleton feature predictive decoding adds the node coordinates at the previous moment to the residual obtained by the skeleton feature entropy decoding to obtain the decoded skeleton feature reconstruction.
2. A low-bit-rate human motion video coding method based on a generative adversarial network using the system of claim 1, characterized in that the method comprises the following steps:
(1) inputting the continuous frames of a group of pictures in order into the memory feature extraction module, which obtains the memory feature output using a convolutional recurrent neural network;
(2) inputting the memory feature into the memory feature encoding module, which quantizes the memory feature and then entropy-codes it, obtaining the code stream output of the memory feature part;
(3) inputting the video frame to be coded into the skeleton feature extraction module, which obtains the skeleton features using a human pose estimation network;
(4) inputting the skeleton features into the skeleton feature encoding module, which applies predictive coding and entropy coding to the skeleton features, obtaining the code stream of the skeleton feature part for output;
(5) inputting the code stream of the memory feature part into the memory feature decoding module, which entropy-decodes the memory feature code stream and then inverse-quantizes it to obtain the memory feature reconstruction; meanwhile, inputting the skeleton feature code stream into the skeleton feature decoding module, which entropy-decodes the skeleton feature code stream and then applies predictive decoding to obtain the skeleton feature reconstruction;
(6) inputting the memory feature reconstruction and the skeleton feature reconstruction into the recall attention module, which obtains the fused feature combining the two parts of information using an attention mechanism;
(7) inputting the fused feature as the condition of the generator of the generative adversarial network to obtain the video frame reconstruction; during training, the discriminator gives a score for whether the video frame reconstruction produced by the generator conforms to natural video, and the score is used to train the generator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910479249.8A CN110290386B (en) | 2019-06-04 | 2019-06-04 | Low-bit-rate human motion video coding system and method based on generative adversarial network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910479249.8A CN110290386B (en) | 2019-06-04 | 2019-06-04 | Low-bit-rate human motion video coding system and method based on generative adversarial network
Publications (2)
Publication Number | Publication Date |
---|---|
CN110290386A CN110290386A (en) | 2019-09-27 |
CN110290386B true CN110290386B (en) | 2022-09-06 |
Family
ID=68003085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910479249.8A Active CN110290386B (en) | 2019-06-04 | 2019-06-04 | Low-bit-rate human motion video coding system and method based on generative adversarial network
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110290386B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942463B (en) * | 2019-10-30 | 2021-03-16 | Hangzhou Dianzi University | Video target segmentation method based on generative adversarial network |
CN110929242B (en) * | 2019-11-20 | 2020-07-10 | Shanghai Jiao Tong University | Method and system for carrying out attitude-independent continuous user authentication based on wireless signals |
CN112950729A (en) * | 2019-12-10 | 2021-06-11 | Shandong Inspur Artificial Intelligence Research Institute Co., Ltd. | Image compression method based on self-encoder and entropy coding |
CN111967340B (en) * | 2020-07-27 | 2023-08-04 | China University of Geosciences (Wuhan) | Visual perception-based abnormal event detection method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101587962B1 (en) * | 2010-12-22 | 2016-01-28 | Electronics and Telecommunications Research Institute | Motion capture apparatus and method |
KR20130047194A (en) * | 2011-10-31 | 2013-05-08 | Electronics and Telecommunications Research Institute | Apparatus and method for 3d appearance creation and skinning |
- 2019-06-04: application CN201910479249.8A filed in China (later granted as CN110290386B, active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108174225A (en) * | 2018-01-11 | 2018-06-15 | Shanghai Jiao Tong University | Filter implementation method and system in a video coding/decoding loop based on generative adversarial network |
CN108596149A (en) * | 2018-05-10 | 2018-09-28 | Shanghai Jiao Tong University | Motion sequence generation method based on conditional generative adversarial network |
CN109086869A (en) * | 2018-07-16 | 2018-12-25 | Beijing Institute of Technology | Human action prediction method based on attention mechanism |
Non-Patent Citations (2)
Title |
---|
End-to-End Facial Image Compression with Integrated Semantic Distortion Metric; Tianyu He et al.; 2018 IEEE Visual Communications and Image Processing (VCIP); 2019-04-25; full text *
Research on multi-model fusion action recognition; Tian Man et al.; Electronic Measurement Technology; 2018-10-23 (No. 20); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110290386A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110290386B (en) | Low-bit-rate human motion video coding system and method based on generative adversarial network | |
VR | An enhanced coding algorithm for efficient video coding | |
CN112203093B (en) | Signal processing method based on deep neural network | |
CN108924558B (en) | Video predictive coding method based on neural network | |
CN108174218B (en) | Video coding and decoding system based on learning | |
CN112866694A (en) | Intelligent image compression optimization method combining asymmetric volume block and condition context | |
CN113132727B (en) | Scalable machine vision coding method and training method of motion-guided image generation network | |
Xia et al. | An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal | |
CN113822147A (en) | Deep compression method for semantic task of cooperative machine | |
CN115880762B (en) | Human-machine hybrid vision-oriented scalable face image coding method and system | |
CN104539961A (en) | Scalable video encoding system based on hierarchical structure progressive dictionary learning | |
CN116233445B (en) | Video encoding and decoding processing method and device, computer equipment and storage medium | |
CN113132735A (en) | Video coding method based on video frame generation | |
CN115052147B (en) | Human body video compression method and system based on generative model | |
CN110677624B (en) | Monitoring video-oriented foreground and background parallel compression method based on deep learning | |
CN116156202A (en) | Method, system, terminal and medium for realizing video error concealment | |
Wu et al. | Memorize, then recall: a generative framework for low bit-rate surveillance video compression | |
CN110677644B (en) | Video coding and decoding method and video coding intra-frame predictor | |
CN114422795A (en) | Face video coding method, decoding method and device | |
CN113068041B (en) | Intelligent affine motion compensation coding method | |
CN116437089B (en) | Depth video compression method based on key target | |
CN104363454A (en) | Method and system for video coding and decoding of high-bit-rate images | |
WO2023225808A1 (en) | Learned image compress ion and decompression using long and short attention module | |
Mital et al. | Deep stereo image compression with decoder side information using wyner common information | |
Lei et al. | An end-to-end face compression and recognition framework based on entropy coding model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |