WO2023143349A1 - Facial video encoding method, decoding method and device - Google Patents

Facial video encoding method, decoding method and device

Info

Publication number
WO2023143349A1
WO2023143349A1 (PCT/CN2023/073054)
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
facial video
target
motion estimation
facial
Prior art date
Application number
PCT/CN2023/073054
Other languages
English (en)
French (fr)
Inventor
王钊
陈柏林
叶琰
王诗淇
Original Assignee
阿里巴巴(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 filed Critical 阿里巴巴(中国)有限公司
Publication of WO2023143349A1 publication Critical patent/WO2023143349A1/zh


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/184Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being bits, e.g. of the compressed video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a facial video encoding method, decoding method and device.
  • video coding and decoding devices have been widely used in various scenarios, such as video conferencing, live video broadcasting, and the like.
  • embodiments of the present application provide a facial video encoding method, decoding method, and device to at least partially solve the above-mentioned problems.
  • a facial video coding method including:
  • a facial video decoding method including:
  • the facial video bitstream includes: an encoded reference facial video frame and encoded compact feature information; the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed;
  • sparse motion estimation is performed based on the reference compact features and the target compact features to obtain a sparse motion estimation map, where the sparse motion estimation map represents, in a preset sparse feature domain, the relative motion relationship between the target facial video frame and the reference facial video frame;
  • a reconstructed facial video frame corresponding to the target facial video frame is obtained.
  • a model training method including:
  • the target facial video frame sample is input into the feature extraction model to obtain the target compact feature sample; the target compact feature sample and the reference facial video frame sample are encoded respectively to obtain the facial video bitstream sample;
  • a perceptual loss function and an adversarial loss function are constructed respectively according to the initial reconstructed facial video frame sample and the target facial video frame sample; a rate-distortion loss function is obtained based on the initial reconstructed facial video frame sample, the target facial video frame sample, and the transmission bit rate corresponding to the target compact feature sample;
  • the perceptual loss function, the adversarial loss function, and the rate-distortion loss function are fused to obtain a training loss function; according to the training loss function, the feature extraction model and the deformed image prediction model are trained.
  • an electronic device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other via the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the facial video encoding method described in the first aspect, or the operations corresponding to the facial video decoding method described in the second aspect, or the operations corresponding to the model training method described in the third aspect.
  • a computer storage medium on which a computer program is stored, where the program, when executed by a processor, implements the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the model training method described in the third aspect.
  • a computer program product including computer instructions, where the computer instructions instruct a computing device to perform the operations corresponding to the facial video encoding method described in the first aspect, or the operations corresponding to the facial video decoding method described in the second aspect, or the operations corresponding to the model training method described in the third aspect.
  • in the encoding stage, target compact features are extracted from the target facial video frame, and the facial video bitstream is obtained by encoding these target compact features. Since the target compact features represent the key feature information in the target facial video frame, they capture the key information of the entire facial video frame with a small amount of data; therefore, the facial video bitstream obtained by encoding the target compact features also has a small data volume, and the corresponding bitstream during video stream transmission is also small (a lower bit rate).
  • in the decoding stage, the facial video bitstream obtained above is decoded, and facial video frame reconstruction is then performed based on the decoded target compact features that represent the key feature information in the target facial video frame, so the quality difference between the reconstructed video frame and the target facial video frame is also small.
  • the embodiment of the present application can reduce the encoding bit rate on the premise of ensuring the facial video reconstruction quality, and better meet the requirements of low bit rate facial video encoding.
  • Fig. 1 is a schematic framework diagram of a codec method based on deep video generation;
  • Fig. 2 is a flow chart of the steps of a facial video encoding method according to Embodiment 1 of the present application;
  • Fig. 3 is a schematic diagram of a scenario example in the embodiment shown in Fig. 2;
  • Fig. 4 is a flow chart of the steps of a facial video decoding method according to Embodiment 2 of the present application;
  • Fig. 5 is a schematic diagram of a scenario example in the embodiment shown in Fig. 4;
  • Fig. 6 is a schematic diagram of another scenario example in the embodiment shown in Fig. 4;
  • FIG. 7 is a flow chart of steps of a model training method according to Embodiment 3 of the present application.
  • Fig. 8 is a schematic diagram of a scenario example in the embodiment shown in Fig. 7;
  • FIG. 9 is a structural block diagram of a facial video encoding device according to Embodiment 4 of the present application.
  • FIG. 10 is a structural block diagram of a facial video decoding device according to Embodiment 5 of the present application.
  • Fig. 11 is a structural block diagram of a model training device according to Embodiment 6 of the present application.
  • FIG. 12 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present application.
  • FIG. 1 is a schematic framework diagram of a codec method based on deep video generation.
  • the main principle of this method is to deform the reference frame based on the motion of the frame to be encoded to obtain a reconstructed frame corresponding to the frame to be encoded.
  • the basic framework of the codec method based on depth video generation is described below in conjunction with Figure 1:
  • the encoder uses a key point extractor to extract the target key point information of the target facial video frame to be encoded and encodes the target key point information; meanwhile, the reference facial video frame is encoded with a traditional image coding method (such as VVC, HEVC, etc.).
  • the motion estimation module in the decoder extracts the reference key point information of the reference facial video frame through the key point extractor, and performs dense motion estimation based on the reference key point information and the target key point information to obtain a dense motion estimation map and an occlusion map. The dense motion estimation map represents, in the feature domain characterized by the key point information, the relative motion relationship between the target facial video frame and the reference facial video frame; the occlusion map represents the degree to which each pixel in the target facial video frame is occluded.
  • the third step is the decoding stage.
  • the generation module in the decoder performs deformation processing on the reference facial video frame based on the dense motion estimation map to obtain the deformation processing result, and then multiplies the deformation processing result by the occlusion map to output the reconstructed facial video frame.
  • in the method shown in Fig. 1, the facial video frame is reconstructed based on key point information extracted from the facial video frame, and the key point information is explicitly represented information.
  • during encoding, the data volume of the key point information cannot be further reduced according to specific requirements on encoding bit consumption; therefore, the above method cannot meet the encoding requirements of low-bit-rate facial video frames.
  • in addition, in a reconstructed facial video frame obtained from key point information, the facial pose information, expression information and the like usually cannot be reconstructed accurately, that is, the reconstruction quality of the video frame is low.
  • in the embodiments of the present application, facial video frame reconstruction is performed based on compact features that are extracted from the facial video frame and represent its key feature information.
  • compared with key point information, the compact feature is an implicit feature: it can not only represent the key feature information in the video frame, but the size of the compact feature matrix can also be further reduced according to specific requirements on bit consumption. That is to say, compact features can represent the key information of the entire facial video frame with a small amount of data; therefore, the facial video bitstream obtained by encoding the compact features has a small data volume, and the corresponding bitstream during video stream transmission is also small (a lower bit rate).
  • for a facial video frame, the key feature information may include facial feature position information, pose information, expression information, and the like. Therefore, compared with key point information, the information represented by compact features is richer, and the image quality of the reconstructed video frame is closer to that of the original target facial video frame.
  • FIG. 2 is a flow chart of steps of a face video encoding method according to Embodiment 1 of the present application. Specifically, the facial video encoding method provided in this embodiment includes the following steps:
  • Step 202 acquiring target facial video frames and reference facial video frames to be encoded.
  • step 204 feature extraction is performed on the target facial video frame to obtain target compact features, which represent key feature information in the target facial video frame.
  • a machine learning model may be used to perform feature extraction on the target facial video frame, so as to obtain target compact features.
  • the target face video frame can be input into the pre-trained feature extraction model, so that the feature extraction model outputs the target compact features of each target face video frame.
  • the key feature information may specifically be: facial features position information, posture information, expression information, and the like.
  • in the embodiments of the present application, the structure and parameters of the feature extraction model are not limited and can be set according to actual needs; for example, the feature extraction model can be a U-Net network composed of convolutional layers and generalized divisive normalization (GDN) layers, and so on.
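  • For illustration only, the following PyTorch-style sketch shows what such a compact feature extractor could look like. The layer counts, channel widths, the compact feature resolution, and the simplified GDN stand-in are assumptions made for this example, not the network actually used in the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Simplified stand-in for a generalized divisive normalization layer."""
    def __init__(self, channels):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):
        # divisive normalization: x / sqrt(beta + gamma * x^2), mixed across channels
        w = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = F.conv2d(x * x, w, self.beta)
        return x / torch.sqrt(norm.clamp(min=1e-6))

class CompactFeatureExtractor(nn.Module):
    """Hypothetical encoder mapping a face frame to a small 'compact feature' map."""
    def __init__(self, in_ch=3, mid_ch=64, compact_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=2, padding=1), SimpleGDN(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1), SimpleGDN(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1), SimpleGDN(mid_ch),
            # final projection to a low-dimensional, low-resolution feature map
            nn.Conv2d(mid_ch, compact_ch, 3, stride=2, padding=1),
        )

    def forward(self, frame):            # frame: (B, 3, 256, 256)
        return self.net(frame)           # compact features: (B, 4, 16, 16)

extractor = CompactFeatureExtractor()
target_compact = extractor(torch.randn(1, 3, 256, 256))
print(target_compact.shape)              # torch.Size([1, 4, 16, 16])
```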
  • Step 206 respectively encode the target compact feature and the reference facial video frame to obtain a facial video bit stream.
  • specifically, the reference facial video frame may be encoded with relatively small quantization distortion so that its complete data is retained, for example using Versatile Video Coding (VVC); the target compact features may be encoded through quantization and entropy coding.
  • Fig. 3 is a schematic diagram of a scene corresponding to Embodiment 1 of the present application.
  • an example of a specific scene will be used to describe the embodiment of the present application:
  • further, in some embodiments, in order to further reduce the bit rate of facial video encoding, a differential operation can be performed on the target compact features of adjacent target facial video frames, and the difference obtained by the differential operation is then encoded to form the facial video bitstream.
  • feature extraction is performed on each target facial video frame to obtain the target compact features of each target facial video frame; a differential operation is performed on the target compact features of two adjacent target facial video frames to obtain a target compact feature residual; and the target compact feature residual and the reference facial video frame are encoded respectively to obtain the facial video bitstream.
  • compared with encoding directly based on the target compact features, in the above manner the encoding is performed based on the difference between target compact features to obtain the facial video bitstream. The data volume of the difference between target compact features is smaller than the data volume of the target compact features themselves; therefore, performing the encoding based on this difference can effectively reduce the bit rate of facial video encoding, as sketched below.
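  • The residual coding of compact features between adjacent frames could, for instance, look like the following sketch. The quantization step size, the closed-loop prediction against reconstructed features, and the NumPy representation are illustrative assumptions; the entropy coding stage is omitted.

```python
import numpy as np

def encode_compact_residuals(compact_feats, q_step=0.1):
    """Encoder side: quantize the first frame's compact features, then only the
    residual of each frame against the previously reconstructed features.
    compact_feats: list of np.ndarray compact feature maps, one per target frame."""
    payload, prev_rec = [], None
    for feat in compact_feats:
        pred = prev_rec if prev_rec is not None else np.zeros_like(feat)
        symbols = np.round((feat - pred) / q_step)     # quantized residual
        payload.append(symbols)                        # these symbols would then be entropy coded
        prev_rec = pred + symbols * q_step             # track what the decoder will reconstruct
    return payload

def decode_compact_residuals(payload, q_step=0.1):
    """Decoder side: accumulate the dequantized residuals back into per-frame features."""
    feats, prev_rec = [], None
    for symbols in payload:
        pred = prev_rec if prev_rec is not None else np.zeros_like(symbols)
        prev_rec = pred + symbols * q_step
        feats.append(prev_rec)
    return feats
```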
  • in the encoding stage, target compact features are extracted from the target facial video frame, and the facial video bitstream is obtained by encoding these target compact features. Because the target compact features represent the key feature information in the target facial video frame with a small amount of data, the facial video bitstream obtained by encoding them also has a small data volume, and the corresponding bitstream during video stream transmission is also small (a lower bit rate).
  • in this way, the encoding bit rate can be reduced, and the requirements of low-bit-rate facial video encoding can be better met.
  • the facial video encoding method provided in Embodiment 1 of the present application can be executed by a video encoding end (encoder) and is used to encode facial video files, so as to compress the digital bandwidth of facial video files. It is applicable to many different scenarios, for example, conventional storage and streaming of video games involving faces. Specifically:
  • game video frames can be encoded with the facial video encoding method provided in the embodiments of the present application to form a corresponding video code stream for storage and transmission in video streaming services or other similar applications. Another example is low-latency scenarios such as video conferencing and live video streaming:
  • the facial video data collected by a video acquisition device can be encoded with the facial video encoding method provided in the embodiments of the present application to form a corresponding video code stream and sent to a conference terminal, which decodes the video code stream to obtain the corresponding facial video picture. As a further example, in a virtual reality scenario, the facial video data collected by a video acquisition device can be encoded with the facial video encoding method provided in the embodiments of the present application to form a corresponding video code stream and sent to a virtual-reality-related device (such as VR glasses); the VR device decodes the video code stream to obtain the corresponding facial video picture and implements the corresponding VR function based on the facial video picture, and so on.
  • FIG. 4 is a flow chart of steps of a face video decoding method according to Embodiment 2 of the present application. Specifically, the facial video decoding method provided by this embodiment includes the following steps:
  • Step 402 acquire the facial video bit stream, the facial video bit stream includes: encoded reference facial video frames and encoded compact feature information.
  • the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed.
  • the encoded compact feature information corresponds to compact feature information that is obtained by performing feature extraction on each target facial video frame and represents its key feature information, and may also correspond to the difference between the target compact features of adjacent target facial video frames.
  • Step 404 decoding the encoded reference facial video frame, and performing feature extraction on the decoded reference facial video frame to obtain reference compact features.
  • a machine learning model can be used to extract features from reference facial video frames to obtain reference compact features.
  • the reference facial video frame can be input into the pre-trained feature extraction model, so that the feature extraction model outputs the reference compact features of each reference facial video frame.
  • Step 406 decoding the encoded compact feature information to obtain the target compact feature of the target facial video frame.
  • when the encoded compact feature information corresponds to the compact feature information obtained by performing feature extraction on each target facial video frame, the encoded compact feature information can be decoded directly to obtain the target compact features of the target facial video frame.
  • when the encoded compact feature information corresponds to the difference between the target compact features of adjacent target facial video frames, then after the target compact features of the previous target facial video frame are obtained, the target compact features of the next target facial video frame are calculated based on the decoded difference between target compact features.
  • Step 408 perform sparse motion estimation based on the reference compact feature and the target compact feature, and obtain a sparse motion estimation map.
  • the sparse motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in a preset sparse feature domain.
  • the relative motion relationship can be characterized in several ways. For example, at the pixel level, the relative motion relationship between corresponding pixels in the two facial video frames can be calculated; alternatively, feature extraction can be performed on the two facial video frames to obtain relatively sparse feature maps, and the relative motion relationship between the two facial video frames is then calculated at the level of these feature maps (the feature domain).
  • in this way, the relative motion relationship between the reference facial video frame and the target facial video frame in the compact feature domain can be obtained.
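  • A minimal sketch of a sparse motion estimator operating in the compact feature domain is given below. The network shape and the 2-channel flow-field output are assumptions for illustration, since the application does not fix a specific architecture in this passage.

```python
import torch
import torch.nn as nn

class SparseMotionEstimator(nn.Module):
    """Hypothetical sparse motion estimator: predicts a coarse 2-channel flow field
    (dx, dy per cell) in the low-resolution compact feature domain."""
    def __init__(self, compact_ch=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * compact_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2, 3, padding=1),    # coarse (dx, dy) offset per feature cell
        )

    def forward(self, ref_compact, tgt_compact):
        x = torch.cat([ref_compact, tgt_compact], dim=1)
        return self.net(x)                          # sparse motion map, e.g. (B, 2, 16, 16)

# usage sketch with compact features of shape (B, 4, 16, 16)
sparse_flow = SparseMotionEstimator()(torch.randn(1, 4, 16, 16), torch.randn(1, 4, 16, 16))
```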
  • Step 410 according to the sparse motion estimation map and the reference facial video frame, obtain the reconstructed facial video frame corresponding to the target facial video frame.
  • Fig. 5 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application.
  • an example of a specific scene will be used to describe the embodiment of the present application:
  • the sparse motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in the sparse feature domain, that is to say, it represents a relatively coarse relative motion relationship. Therefore, if a reconstructed facial video frame corresponding to the target facial video frame is generated directly from the sparse motion estimation map and the reference facial video frame, the quality difference between the obtained reconstructed facial video frame and the target facial video frame may be large.
  • therefore, in some embodiments of the present application, obtaining the reconstructed facial video frame corresponding to the target facial video frame according to the sparse motion estimation map and the reference facial video frame may include:
  • the dense motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in a preset dense feature domain;
  • a reconstructed facial video frame corresponding to the target facial video frame is obtained.
  • in the above manner, a dense motion estimation map is obtained, that is, the relative motion relationship between the target facial video frame and the reference facial video frame is characterized in a denser feature domain, which is more accurate than the relative motion relationship represented by the sparse motion estimation map. Therefore, generating the reconstructed facial video frame based on the dense motion estimation map and the reference facial video frame can improve the quality of the reconstructed facial video frame.
  • the dense motion estimation is performed according to the compact feature difference and the initial reconstructed facial video frame to obtain a dense motion estimation map, which may further include:
  • a reconstructed facial video frame corresponding to the target facial video frame is obtained, including:
  • the reconstructed facial video frame corresponding to the target facial video frame is obtained.
  • specifically, the reference facial video frame can first be deformed according to the dense motion estimation map to obtain a deformed facial video frame, and the deformed facial video frame can then be adjusted based on the occlusion map to obtain the final reconstructed facial video frame.
  • between the reference facial video frame and the target facial video frame, the face may be turned by a certain angle; in this case, some pixels in the video frame may be occluded.
  • for example, the face in the reference facial video frame is a frontal face, while the face in the target facial video frame is turned slightly to the left or right by a certain angle; in this case, some pixels are occluded.
  • in the above manner, on the basis of the dense motion estimation map, the probability of each pixel in the video frame being occluded is also taken into account, and the reference facial video frame is deformed based on both the dense motion estimation map and the occlusion map, so that a more accurate reconstructed facial video frame is obtained.
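  • The deform-then-mask operation described above can be illustrated with the following PyTorch-style sketch. The normalized-flow convention, the use of grid_sample, and the occlusion map range of [0, 1] are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def warp_with_occlusion(reference_frame, dense_flow, occlusion_map):
    """Warp the reference frame with a dense motion (flow) field, then attenuate pixels
    according to the occlusion map. dense_flow: (B, 2, H, W) offsets in normalized [-1, 1]
    grid coordinates; occlusion_map: (B, 1, H, W) with 1 = fully visible."""
    b, _, h, w = reference_frame.shape
    # identity sampling grid in [-1, 1]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=reference_frame.device),
        torch.linspace(-1, 1, w, device=reference_frame.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # displace the grid by the dense motion estimate: (B, 2, H, W) -> (B, H, W, 2)
    grid = grid + dense_flow.permute(0, 2, 3, 1)
    warped = F.grid_sample(reference_frame, grid, align_corners=True)
    return warped * occlusion_map                   # suppress occluded regions
```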
  • performing feature extraction on the decoded reference facial video frame to obtain the reference compact features may include: inputting the reference facial video frame into the feature extraction model, so that the feature extraction model outputs the reference compact features of each reference facial video frame.
  • the reference facial video frame is deformed to obtain the initial reconstructed facial video frame corresponding to the target facial video frame, which may include:
  • the sparse motion estimation map and the reference facial video frame are input into the warped image prediction model, so that the warped image prediction model outputs the initial reconstructed facial video frame corresponding to the target facial video frame.
  • performing dense motion estimation based on the compact feature difference and the initial reconstructed facial video frame to obtain a dense motion estimation map may include:
  • inputting the compact feature difference and the initial reconstructed facial video frame into the dense motion estimation model, so that the dense motion estimation model outputs the dense motion estimation map.
  • the dense motion estimation map, the reference facial video frame, and the occlusion map are input into the generative model, so that the generative model outputs a reconstructed facial video frame corresponding to the target facial video frame.
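  • As an illustration of the dense motion estimation model just described, the sketch below takes the compact feature difference and the initial reconstruction and predicts a dense flow field together with an occlusion map. The upsampling choice, layer sizes, and the two prediction heads are hypothetical, not the model of the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMotionEstimator(nn.Module):
    """Hypothetical dense motion estimation model: from the compact feature difference and
    the initial reconstruction, predicts a full-resolution flow field and an occlusion map."""
    def __init__(self, compact_ch=4, img_ch=3, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(compact_ch + img_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)    # dense (dx, dy) per pixel
        self.occ_head = nn.Conv2d(hidden, 1, 3, padding=1)     # per-pixel visibility

    def forward(self, feat_diff, initial_recon):
        # upsample the low-resolution compact feature difference to image resolution
        h, w = initial_recon.shape[-2:]
        diff_up = F.interpolate(feat_diff, size=(h, w), mode="bilinear", align_corners=False)
        x = self.backbone(torch.cat([diff_up, initial_recon], dim=1))
        return self.flow_head(x), torch.sigmoid(self.occ_head(x))   # dense flow, occlusion map
```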
  • FIG. 6 is a schematic diagram of another scenario corresponding to Embodiment 2 of the present application. This scenario builds on the scenario shown in FIG. 5 by further introducing dense motion estimation and an occlusion map to obtain the final reconstructed facial video frame. Specifically:
  • the reference facial video frame is deformed based on the sparse motion estimation map to obtain the initial reconstructed facial video frame corresponding to the target facial video frame; at the same time, a differential operation is performed on the reference compact features and the target compact features; then, dense motion estimation is performed based on the initial reconstructed facial video frame and the result of the above differential operation, so that a dense motion estimation map and an occlusion map are obtained; finally, the final reconstructed facial video frame corresponding to the target facial video frame is obtained based on the dense motion estimation map, the occlusion map, and the reference facial video frame.
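  • The overall decoder-side flow of this scenario might be wired together as in the sketch below; all module names are hypothetical placeholders standing in for the models described above.

```python
def reconstruct_target_frame(ref_frame, ref_compact, tgt_compact,
                             sparse_estimator, warped_image_predictor,
                             dense_estimator, generator):
    """Hypothetical wiring of the decoder-side models described in this scenario."""
    # 1) coarse motion in the compact feature domain
    sparse_flow = sparse_estimator(ref_compact, tgt_compact)
    # 2) initial reconstruction from the sparse motion map and the reference frame
    initial_recon = warped_image_predictor(sparse_flow, ref_frame)
    # 3) refinement: dense motion estimation from the compact feature difference
    feat_diff = tgt_compact - ref_compact
    dense_flow, occlusion_map = dense_estimator(feat_diff, initial_recon)
    # 4) final reconstruction from dense motion, occlusion map, and the reference frame
    return generator(dense_flow, ref_frame, occlusion_map)
```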
  • in the decoding stage, the facial video bitstream obtained in the encoding stage is decoded, and the facial video frame is then reconstructed based on the decoded target compact features. Because the target compact features represent the key feature information in the target facial video frame, the quality difference between the reconstructed video frame obtained based on the target compact features and the target facial video frame is also small.
  • the facial video decoding method in this embodiment can be executed by any suitable electronic device with data processing capability, including but not limited to a server, a PC, and the like.
  • FIG. 7 is a flowchart of steps of a model training method according to Embodiment 3 of the present application. Specifically, the model training method provided in this embodiment includes the following steps:
  • Step 702 Input the target facial video frame samples into the feature extraction model to obtain target compact feature samples; respectively encode the target compact feature samples and reference facial video frame samples to obtain facial video bitstream samples.
  • the structure and parameters of the feature extraction model are not limited, and can be set according to actual needs.
  • the feature extraction model can be a U-Net network based on a combination of convolutional layers and generalized division normalization layers. ,etc.
  • Step 704 decode the facial video bitstream samples to obtain reference facial video frame samples and target compact feature samples; input the reference facial video frame samples into the feature extraction model to obtain reference compact feature samples.
  • the feature extraction model in this step may be exactly the same as the feature extraction model in step 702, so as to obtain a reference compact feature sample corresponding to the target compact feature sample.
  • Step 706 Perform sparse motion estimation based on the reference compact feature samples and target compact feature samples to obtain a sparse motion estimation sample map; input the sparse motion estimation sample map and reference facial video frame samples into the warped image prediction model to obtain an initial reconstructed facial video frame sample.
  • the structure and parameters of the deformed image prediction model are not limited, and can be set according to actual needs.
  • it can also be a U-Net based on a combination of convolutional layers and generalized division normalization layers. network, etc.
  • for the specific execution process of each of the above steps 702 to 706, reference may be made to the corresponding steps in Embodiment 1 or Embodiment 2 above, which will not be repeated here.
  • Step 708: Construct a perceptual loss function and an adversarial loss function respectively according to the initial reconstructed facial video frame sample and the target facial video frame sample; and obtain a rate-distortion loss function based on the initial reconstructed facial video frame sample, the target facial video frame sample, and the transmission bit rate corresponding to the target compact feature sample.
  • the perceptual loss function can be constructed as follows:
  • the adversarial loss function can be constructed in the following way: the initial reconstructed facial video frame sample and the target facial video frame sample are input into the pre-trained classifier at the same time, and the corresponding adversarial loss function is constructed based on the classification result (whether it is the same type of video frame) .
  • the construction process of the rate-distortion loss function may include: first obtaining the transmission bit rate corresponding to the target compact feature sample; then constructing a distortion function based on the initial reconstructed facial video frame sample and the target facial video frame sample (the embodiments of the present application do not limit the specific method used to construct the distortion function; for example, a deep image structure and texture similarity algorithm can be used); and then fusing the above transmission bit rate and the constructed distortion function (for example by addition) to obtain the rate-distortion loss function.
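  • As an illustration of how these three loss terms could be implemented, the following sketch assumes a pretrained feature network `feat_net`, a discriminator, a non-saturating GAN formulation, and a plain-MSE distortion placeholder; these are hypothetical choices, not the exact constructions of the present application.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(feat_net, recon, target):
    """L1 distance between deep features of a fixed, pretrained network (e.g. VGG)."""
    return F.l1_loss(feat_net(recon), feat_net(target))

def generator_adversarial_loss(discriminator, recon):
    """The reconstruction network is rewarded when the discriminator labels recon as real."""
    fake_logits = discriminator(recon)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

def discriminator_loss(discriminator, recon, target):
    """Real/fake classification loss; recon is detached so only the discriminator is updated."""
    real_logits = discriminator(target)
    fake_logits = discriminator(recon.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def rate_distortion_loss(recon, target, compact_bits, lambda_d=100.0):
    """Estimated rate of the compact features plus a distortion term; the application mentions
    a deep image structure/texture similarity metric, and MSE is used here only as a stand-in."""
    return compact_bits + lambda_d * F.mse_loss(recon, target)
```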
  • step 710 the perceptual loss function, the adversarial loss function and the rate-distortion loss function are fused to obtain the training loss function; according to the training loss function, the feature extraction model and the deformed image prediction model are trained.
  • specifically, corresponding weight values can be set for the perceptual loss function, the adversarial loss function, and the rate-distortion loss function, and the three loss functions are then summed based on the set weight values to obtain the final training loss function, for example:
  • L = λ_per · L_per + λ_GD · L_GD + λ_RD · L_RD
  • where L is the final training loss function; L_per is the perceptual loss function; L_GD is the adversarial loss function; L_RD is the rate-distortion loss function; and λ_per, λ_GD, λ_RD are the corresponding weight values.
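  • A possible fusion of the three losses into a single training step is sketched below; the weight values are placeholders and are not taken from the application.

```python
# placeholder weights for fusing the three loss terms; not values from the application
LAMBDA_PER, LAMBDA_GD, LAMBDA_RD = 10.0, 1.0, 1.0

def training_step(optimizer, l_per, l_gd, l_rd):
    """One optimization step on the fused loss L = λ_per*L_per + λ_GD*L_GD + λ_RD*L_RD."""
    loss = LAMBDA_PER * l_per + LAMBDA_GD * l_gd + LAMBDA_RD * l_rd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # updates the feature extraction model and the deformed image prediction model
    return loss.item()
```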
  • Fig. 8 is a schematic diagram of a scene corresponding to Embodiment 3 of the present application.
  • an example of a specific scene is used to describe the embodiment of the present application:
  • in other embodiments of the present application, a dense motion estimation model and a generative model can be further introduced, and the following improvements are made on the basis of the training scheme shown in FIG. 8: the sparse motion estimation sample map and the reference facial video frame sample are input into the deformed image prediction model to obtain the initial reconstructed facial video frame sample; a differential operation is performed on the reference compact feature sample and the target compact feature sample to obtain a compact feature sample difference; the compact feature sample difference and the initial reconstructed facial video frame sample are input into the dense motion estimation model to be trained to obtain a dense motion estimation sample map and an occlusion sample map; and the dense motion estimation sample map, the reference facial video frame sample, and the occlusion sample map are input into the generative model to be trained to obtain the reconstructed facial video frame.
  • correspondingly, a first perceptual loss function is constructed according to the initial reconstructed facial video frame sample and the target facial video frame sample; a second perceptual loss function is constructed according to the reconstructed facial video frame and the target facial video frame sample; an adversarial loss function is constructed according to the reconstructed facial video frame and the target facial video frame sample; a rate-distortion loss function is obtained based on a distortion loss function constructed from the reconstructed facial video frame and the target facial video frame sample, together with the transmission bit rate corresponding to the target compact feature sample; the first perceptual loss function, the second perceptual loss function, the adversarial loss function, and the rate-distortion loss function are fused to obtain the training loss function; and according to the training loss function, the feature extraction model, the deformed image prediction model, the dense motion estimation model, and the generative model to be trained are trained, so as to obtain the trained feature extraction model, deformed image prediction model, dense motion estimation model, and generative model.
  • the model training method in this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to a server, a PC, and the like.
  • FIG. 9 is a structural block diagram of a facial video encoding device according to Embodiment 4 of the present application.
  • the facial video encoding device provided by the embodiment of the present application includes:
  • Facial video frame acquiring module 902 configured to acquire target facial video frame and reference facial video frame to be encoded.
  • the feature extraction module 904 is configured to perform feature extraction on the target facial video frame to obtain target compact features, which represent key feature information in the target facial video frame.
  • the encoding module 906 is configured to encode the target compact feature and the reference facial video frame respectively to obtain a facial video bit stream.
  • the target facial video frame includes a plurality of consecutive facial video frames;
  • the feature extraction module 904 is specifically used to perform feature extraction on each target facial video frame to obtain the target compact features of each target facial video frame;
  • the encoding module 906 is specifically used for: performing a differential operation on the target compact features of two adjacent target facial video frames to obtain target compact feature residuals;
  • the target compact feature residual and the reference facial video frame are encoded separately to obtain the facial video bitstream.
  • the feature extraction module 904 is specifically configured to: respectively input each target facial video frame into the feature extraction model, so that the feature extraction model outputs the target compact features of each target facial video frame.
  • the facial video encoding device of this embodiment is used to implement the corresponding facial video encoding methods in the aforementioned multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the facial video encoding device of this embodiment reference may be made to the description of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 10 is a structural block diagram of a facial video decoding device according to Embodiment 5 of the present application.
  • the facial video decoding device provided by the embodiment of the present application includes:
  • the video bitstream acquisition module 1002 is used to acquire the facial video bitstream, where the facial video bitstream includes an encoded reference facial video frame and encoded compact feature information; the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed;
  • the first decoding module 1004 is used to decode the encoded reference facial video frame and perform feature extraction on the decoded reference facial video frame to obtain reference compact features;
  • the second decoding module 1006 is used to decode the encoded compact feature information to obtain the target compact feature of the target facial video frame;
  • the sparse motion estimation module 1008 is used to perform sparse motion estimation based on the reference compact features and the target compact features to obtain a sparse motion estimation map, where the sparse motion estimation map represents, in a preset sparse feature domain, the relative motion relationship between the target facial video frame and the reference facial video frame;
  • the reconstructed facial video frame obtaining module 1010 is configured to obtain a reconstructed facial video frame corresponding to the target facial video frame according to the sparse motion estimation map and the reference facial video frame.
  • the reference facial video frame is deformed based on the sparse motion estimation map to obtain the initial reconstructed facial video frame corresponding to the target facial video frame;
  • the dense motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in a preset dense feature domain;
  • a reconstructed facial video frame corresponding to the target facial video frame is obtained.
  • when executing the step of performing dense motion estimation according to the compact feature difference and the initial reconstructed facial video frame to obtain a dense motion estimation map, the reconstructed facial video frame obtaining module 1010 is specifically configured to:
  • the compact feature difference and the initial reconstructed facial video frame are used for dense motion estimation to obtain a dense motion estimation map and an occlusion map.
  • the occlusion map represents the degree of occlusion of each pixel in the target facial video frame;
  • when executing the step of obtaining the reconstructed facial video frame corresponding to the target facial video frame, the module is specifically used to obtain the reconstructed facial video frame corresponding to the target facial video frame according to the dense motion estimation map, the reference facial video frame, and the occlusion map.
  • the first decoding module 1004 is specifically used to:
  • the reconstructed facial video frame obtaining module 1010 when executing the step of deforming the reference facial video frame based on the sparse motion estimation map to obtain the initial reconstructed facial video frame corresponding to the target facial video frame , specifically for: inputting the sparse motion estimation map and the reference facial video frame into the deformed image prediction model, so that the deformed image prediction model outputs the initial reconstructed facial video frame corresponding to the target facial video frame.
  • the reconstructed facial video frame obtaining module 1010 is specifically used to:
  • the compact feature difference and the initial reconstructed facial video frame are input to the dense motion estimation model, so that the dense motion estimation model outputs a dense motion estimation map.
  • the reconstructed facial video frame obtaining module 1010 when performing the step of obtaining the reconstructed facial video frame corresponding to the target facial video frame according to the dense motion estimation map, the reference facial video frame and the occlusion map , specifically for:
  • the dense motion estimation map, the reference facial video frame, and the occlusion map are input into the generative model, so that the generative model outputs a reconstructed facial video frame corresponding to the target facial video frame.
  • the facial video decoding device in this embodiment is used to implement the corresponding facial video decoding methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the facial video decoding device of this embodiment reference may be made to the description of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 11 is a structural block diagram of a model training device according to Embodiment 6 of the present application.
  • the model training device provided in the embodiment of the present application includes:
  • the target compact feature sample and the reference facial video frame sample are encoded respectively to obtain the facial video bitstream sample;
  • the initial reconstructed facial video frame sample obtaining module 1106 is used to perform sparse motion estimation based on the reference compact feature sample and the target compact feature sample to obtain a sparse motion estimation sample map, and to input the sparse motion estimation sample map and the reference facial video frame sample into the deformed image prediction model to obtain the initial reconstructed facial video frame sample;
  • the model training module 1110 is used to fuse the perceptual loss function, the adversarial loss function and the rate-distortion loss function to obtain the training loss function; according to the training loss function, the feature extraction model and the deformed image prediction model are trained.
  • the model training device in this embodiment is used to implement the corresponding model training methods in the aforementioned multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the model training device of this embodiment reference may be made to the descriptions of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 12 shows a schematic structural diagram of an electronic device according to Embodiment 7 of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 1202, a communication interface (Communications Interface) 1204, a memory (memory) 1206, and a communication bus 1208.
  • the processor 1202 , the communication interface 1204 , and the memory 1206 communicate with each other through the communication bus 1208 .
  • the communication interface 1204 is used for communicating with other electronic devices or servers.
  • the processor 1202 is configured to execute the program 1210. Specifically, it may execute the above-mentioned facial video encoding method, or the facial video decoding method, or the relevant steps in the embodiment of the model training method.
  • the program 1210 may include program codes including computer operation instructions.
  • the processor 1202 may be a CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the smart device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 1206 is used to store the program 1210 .
  • the memory 1206 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
  • the program 1210 can specifically be used to cause the processor 1202 to perform the following operations: acquiring the target facial video frame to be encoded and the reference facial video frame; performing feature extraction on the target facial video frame to obtain target compact features, where the target compact features represent the key feature information in the target facial video frame; and encoding the target compact features and the reference facial video frame respectively to obtain the facial video bitstream.
  • alternatively, the program 1210 can specifically be used to cause the processor 1202 to perform the following operations: acquiring the facial video bitstream, where the facial video bitstream includes an encoded reference facial video frame and encoded compact feature information, and the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed; decoding the encoded reference facial video frame, and performing feature extraction on the decoded reference facial video frame to obtain reference compact features; decoding the encoded compact feature information to obtain the target compact features of the target facial video frame; performing sparse motion estimation based on the reference compact features and the target compact features to obtain a sparse motion estimation map, where the sparse motion estimation map represents, in a preset sparse feature domain, the relative motion relationship between the target facial video frame and the reference facial video frame; and obtaining, according to the sparse motion estimation map and the reference facial video frame, the reconstructed facial video frame corresponding to the target facial video frame.
  • alternatively, the program 1210 can specifically be used to cause the processor 1202 to perform the following operations: inputting the target facial video frame sample into the feature extraction model to obtain the target compact feature sample; encoding the target compact feature sample and the reference facial video frame sample respectively to obtain the facial video bitstream sample; decoding the facial video bitstream sample to obtain the reference facial video frame sample and the target compact feature sample; inputting the reference facial video frame sample into the feature extraction model to obtain the reference compact feature sample; performing sparse motion estimation based on the reference compact feature sample and the target compact feature sample to obtain a sparse motion estimation sample map; inputting the sparse motion estimation sample map and the reference facial video frame sample into the deformed image prediction model to obtain the initial reconstructed facial video frame sample; constructing a perceptual loss function and an adversarial loss function respectively according to the initial reconstructed facial video frame sample and the target facial video frame sample; obtaining a rate-distortion loss function based on the initial reconstructed facial video frame sample, the target facial video frame sample, and the transmission bit rate corresponding to the target compact feature sample; fusing the perceptual loss function, the adversarial loss function, and the rate-distortion loss function to obtain a training loss function; and training the feature extraction model and the deformed image prediction model according to the training loss function.
  • for each step in the program 1210, reference may be made to the corresponding steps and the corresponding descriptions of the units in the above embodiments of the facial video encoding method, the facial video decoding method, or the model training method, which will not be repeated here.
  • Those skilled in the art can clearly understand that for the convenience and brevity of description, the specific working process of the above-described devices and modules can refer to the corresponding process description in the foregoing method embodiments, and details are not repeated here.
  • in the encoding stage, target compact features are extracted from the target facial video frame, and the facial video bitstream is obtained by encoding these target compact features. Because the target compact features represent the key feature information in the target facial video frame with a small amount of data, the facial video bitstream obtained by encoding them also has a small data volume, and the corresponding bitstream during video stream transmission is also small (a lower bit rate).
  • in the decoding stage, the facial video bitstream obtained above is decoded, and facial video frame reconstruction is then performed based on the decoded target compact features that represent the key feature information in the target facial video frame, so the quality difference between the reconstructed video frame and the target facial video frame is also small.
  • the embodiment of the present application can reduce the encoding bit rate on the premise of ensuring the facial video reconstruction quality, and better meet the requirements of low bit rate facial video encoding.
  • An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instruction instructs a computing device to perform operations corresponding to any method in the foregoing multiple method embodiments.
  • each component/step described in the embodiment of the present application can be divided into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into New components/steps to achieve the purpose of the embodiment of the present application.
  • the above method according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored on a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored on a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • it can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the facial video encoding method, the facial video decoding method, or the model training method described herein is implemented.
  • in addition, when a general-purpose computer accesses the code for implementing the facial video encoding method, the facial video decoding method, or the model training method shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for executing the facial video encoding method, the facial video decoding method, or the model training method shown herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application provide a facial video encoding method, a decoding method, and a device. The facial video encoding method includes: acquiring a target facial video frame to be encoded and a reference facial video frame; performing feature extraction on the target facial video frame to obtain target compact features, where the target compact features represent key feature information in the target facial video frame; and encoding the target compact features and the reference facial video frame respectively to obtain a facial video bitstream. The embodiments of the present application can reduce the encoding bit rate while ensuring facial video coding quality, and better meet the requirements of low-bit-rate facial video encoding.

Description

Facial video encoding method, decoding method and device
This application claims priority to the Chinese patent application No. 202210085278.8, entitled "Facial video encoding method, decoding method and device", filed with the Chinese Patent Office on January 25, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a facial video encoding method, decoding method and device.
Background
With the continuous development of video codec technology, video codec devices have been widely used in various scenarios, such as video conferencing, live video streaming and the like.
At present, relatively traditional video codec methods are usually aimed at general natural scenes, and encode and decode video frames using methods such as block-based motion estimation and the discrete cosine transform.
When a traditional video encoding method is used to encode facial video, in order to guarantee video coding quality, the achievable coding efficiency is usually low, which cannot meet the demand for low-bit-rate facial video encoding.
Summary
In view of this, embodiments of the present application provide a facial video encoding method, decoding method and device, to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, a facial video encoding method is provided, including:
acquiring a target facial video frame to be encoded and a reference facial video frame;
performing feature extraction on the target facial video frame to obtain target compact features, where the target compact features represent key feature information in the target facial video frame;
encoding the target compact features and the reference facial video frame respectively to obtain a facial video bitstream.
According to a second aspect of the embodiments of the present application, a facial video decoding method is provided, including:
acquiring a facial video bitstream, where the facial video bitstream includes an encoded reference facial video frame and encoded compact feature information; the encoded compact feature information represents key feature information of a target facial video frame to be reconstructed;
decoding the encoded reference facial video frame, and performing feature extraction on the decoded reference facial video frame to obtain reference compact features;
decoding the encoded compact feature information to obtain target compact features of the target facial video frame;
performing sparse motion estimation based on the reference compact features and the target compact features to obtain a sparse motion estimation map, where the sparse motion estimation map represents, in a preset sparse feature domain, the relative motion relationship between the target facial video frame and the reference facial video frame;
obtaining, according to the sparse motion estimation map and the reference facial video frame, a reconstructed facial video frame corresponding to the target facial video frame.
According to a third aspect of the embodiments of the present application, a model training method is provided, including:
inputting a target facial video frame sample into a feature extraction model to obtain a target compact feature sample; encoding the target compact feature sample and a reference facial video frame sample respectively to obtain a facial video bitstream sample;
decoding the facial video bitstream sample to obtain the reference facial video frame sample and the target compact feature sample; inputting the reference facial video frame sample into the feature extraction model to obtain a reference compact feature sample;
performing sparse motion estimation based on the reference compact feature sample and the target compact feature sample to obtain a sparse motion estimation sample map; inputting the sparse motion estimation sample map and the reference facial video frame sample into a deformed image prediction model to obtain an initial reconstructed facial video frame sample;
constructing a perceptual loss function and an adversarial loss function respectively according to the initial reconstructed facial video frame sample and the target facial video frame sample; obtaining a rate-distortion loss function based on the initial reconstructed facial video frame sample, the target facial video frame sample, and the transmission bit rate corresponding to the target compact feature sample;
fusing the perceptual loss function, the adversarial loss function and the rate-distortion loss function to obtain a training loss function; and training the feature extraction model and the deformed image prediction model according to the training loss function.
According to a fourth aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface and a communication bus, where the processor, the memory and the communication interface communicate with each other through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the facial video encoding method of the first aspect, or operations corresponding to the facial video decoding method of the second aspect, or operations corresponding to the model training method of the third aspect.
According to a fifth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the facial video encoding method of the first aspect, or the facial video decoding method of the second aspect, or the model training method of the third aspect.
According to a sixth aspect of the embodiments of the present application, a computer program product is provided, including computer instructions, where the computer instructions instruct a computing device to perform operations corresponding to the facial video encoding method of the first aspect, or operations corresponding to the facial video decoding method of the second aspect, or operations corresponding to the model training method of the third aspect.
根据本申请实施例提供的面部视频编码方法以及解码方法，在编码阶段，是对目标面部视频帧进行了目标紧凑特征提取，并通过对上述目标紧凑特征的编码得到的面部视频比特流，由于目标紧凑特征是表征目标面部视频帧中的关键特征信息的特征，其通过较小的数据量表征了整个面部视频帧中的关键信息，因此，通过对目标紧凑特征的编码得到的面部视频比特流，其数据量也较小，在进行视频流传输时对应的比特流也较小（码率较低），另外，在解码阶段，对上述得到的面部视频比特流进行解码，再基于解码得到的表征目标面部视频帧中关键特征信息的目标紧凑特征，进行面部视频帧重构，得到的重构视频帧与目标面部视频帧间的质量差异也较小。综上，本申请实施例，可以在保证面部视频重建质量的前提下，降低编码码率，更好地满足了低码率面部视频编码的需求。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请实施例中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。
图1为基于深度视频生成的编解码方法的框架示意图;
图2为根据本申请实施例一的一种面部视频编码方法的步骤流程图;
图3为图2所示实施例中的一种场景示例的示意图;
图4为根据本申请实施例二的一种面部视频解码方法的步骤流程图;
图5为图4所示实施例中的一种场景示例的示意图;
图6为图4所示实施例中的另一种场景示例的示意图;
图7为根据本申请实施例三的一种模型训练方法的步骤流程图;
图8为图7所示实施例中的一种场景示例的示意图;
图9为根据本申请实施例四的一种面部视频编码装置的结构框图;
图10为根据本申请实施例五的一种面部视频解码装置的结构框图;
图11为根据本申请实施例六的一种模型训练装置的结构框图;
图12为根据本申请实施例七的一种电子设备的结构示意图。
具体实施方式
为了使本领域的人员更好地理解本申请实施例中的技术方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请实施例一部分实施例,而不是全部的实施例。基于本申请实施例中的实施例,本领域普通技术人员所获得的所有其他实施例,都应当属于本申请实施例保护的范围。
参见图1,图1为基于深度视频生成的编解码方法的框架示意图。该方法的主要原理是基于待编码帧的运动对参考帧进行形变,以得到待编码帧对应的重建帧。下面结合图1对基于深度视频生成的编解码方法的基本框架进行说明:
第一步,编码阶段,编码器采用关键点提取器提取待编码的目标面部视频帧的目标关键点信息,并对目标关键点信息编码;同时,采用传统的图像编码方法(如VVC、HEVC等)对参考面部视频帧进行编码。
第二步，解码阶段，解码器中的运动估计模块，通过关键点提取器提取参考面部视频帧的参考关键点信息；并基于参考关键点信息和目标关键点信息进行稠密运动估计，得到稠密运动估计图和遮挡图，其中，稠密运动估计图表征关键点信息表征的特征域中，目标面部视频帧与参考面部视频帧之间的相对运动关系；遮挡图表征目标面部视频帧中各像素点被遮挡的程度。
第三步,解码阶段,解码器中的生成模块基于稠密运动估计图对参考面部视频帧进行形变处理,得到形变处理结果,再将形变处理结果与遮挡图相乘,从而输出重建面部视频帧。
图1所示方法中，是基于从面部视频帧中提取到的关键点信息进行面部视频帧重建的，而关键点信息为显式表示的信息，在编码过程中，关键点信息的数据量无法根据对编码比特消耗的具体要求进一步减小，因此，上述方法无法满足低码率面部视频帧的编码要求。
另外,基于关键点信息得到的重建面部视频帧,与原始目标面部视频帧相比,其面部姿态信息以及表情信息等通常无法较为准确地得到重建,也就是说,视频帧的重建质量较低。
本申请实施例中,基于从面部视频帧提取到表征其关键特征信息的紧凑特征进行面部视频帧重建。与关键点信息相比,紧凑特征这一隐式特征,其不但能够表征视频帧中的关键特征信息,而且,紧凑特征矩阵的大小,可以根据比特消耗的具体要求进一步地减小,也就是说,紧凑特征可以通过较小的数据量表征整个面部视频帧中的关键信息,因此,通过对紧凑特征的编码得到的面部视频比特流,其数据量也较小,在进行视频流传输时对应的比特流也较小(码率较低)。
另外,对于面部视频帧而言,关键特征信息可以包括五官位置信息、姿态信息以及表情信息等等。因此,与关键点信息相比,紧凑特征表征的信息更丰富,进而,得到的重建视频帧与原始目标面部视频帧的图像质量也更加接近。
下面结合本申请实施例附图进一步说明本申请实施例具体实现。
实施例一
参照图2,图2为根据本申请实施例一的一种面部视频编码方法的步骤流程图。具体地,本实施例提供的面部视频编码方法包括以下步骤:
步骤202,获取待编码的目标面部视频帧和参考面部视频帧。
步骤204,对目标面部视频帧进行特征提取,得到目标紧凑特征,目标紧凑特征表征目标面部视频帧中的关键特征信息。
本申请实施例中,可以借助机器学习模型对目标面部视频帧进行特征提取,从而得到目标紧凑特征。具体地:可以将目标面部视频帧输入预先训练完成的特征提取模型中,以使特征提取模型输出各目标面部视频帧的目标紧凑特征。
对于面部视频帧而言,关键特征信息具体可以为:五官位置信息、姿态信息以及表情信息等等。
本申请实施例中，对于特征提取模型的结构和参数不做限定，可以根据实际需要进行设定，例如：特征提取模型可以为基于卷积层和广义除法归一化层组合而成的U-Net网络，等等。
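为便于理解上述“卷积层+广义除法归一化层”的组合方式，下面给出一个示意性的PyTorch风格草图（仅为在上述描述基础上作出的假设性示例，并非本申请实施例的实际实现；其中的类名、层数、通道数以及GDN的简化写法均为示例性假设）：

```python
import torch
import torch.nn as nn

class SimpleGDN(nn.Module):
    """广义除法归一化(GDN)的简化示意实现，仅作说明用途。"""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(torch.eye(channels).view(channels, channels, 1, 1) * 0.1)
        self.eps = eps

    def forward(self, x):
        # 以 1x1 卷积实现跨通道加权平方和，再做除法归一化
        norm = nn.functional.conv2d(x * x, self.gamma, bias=self.beta)
        return x / torch.sqrt(norm.clamp_min(self.eps))

class CompactFeatureExtractor(nn.Module):
    """示意性的紧凑特征提取网络：逐级下采样卷积 + GDN，输出低维紧凑特征。"""
    def __init__(self, in_ch=3, mid_ch=64, feat_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=2, padding=1), SimpleGDN(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1), SimpleGDN(mid_ch),
            nn.Conv2d(mid_ch, feat_ch, 3, stride=2, padding=1),
        )

    def forward(self, frame):
        # frame: (N, 3, H, W) -> 紧凑特征: (N, feat_ch, H/8, W/8)
        return self.net(frame)
```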
步骤206,分别对目标紧凑特征和参考面部视频帧进行编码,得到面部视频比特流。
具体地,针对参考面部视频帧,可以采用相对较小的量化失真进行编码,编码过程保留参考面部视频帧的完整数据,例如:可以采用通用视频编码(VVC)的方式,对参考面部视频帧进行编码。针对目标紧凑特征,则可以通过量化及熵编码的方式,进行编码。
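作为对目标紧凑特征“量化及熵编码”步骤的一个示意性草图（熵编码的具体实现此处不展开，函数名与量化步长均为示例性假设）：

```python
import numpy as np

def quantize_compact_feature(feature, step=0.05):
    """对紧凑特征做均匀量化，返回整数符号，供后续熵编码使用（示意）。"""
    return np.round(np.asarray(feature, dtype=np.float32) / step).astype(np.int32)

def dequantize_compact_feature(symbols, step=0.05):
    """解码端反量化，恢复近似的紧凑特征（示意）。"""
    return symbols.astype(np.float32) * step

# 熵编码本身（如算术编码/区间编码）此处不展开，仅示意其接口形式（假设性接口）：
# bitstream = entropy_encode(symbols)
# symbols   = entropy_decode(bitstream)
```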
参见图3,图3为本申请实施例一对应的场景示意图,以下,将参考图3所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
分别获取目标面部视频帧a,以及,参考面部视频帧a0;对目标面部视频帧a进行特征提取,得到目标紧凑特征;对目标紧凑特征以及参考面部视频帧a0分别进行编码,从而得到面部视频比特流,后续可以将面部视频比特流传送至解码端,以通过解码端对基于面部视频比特流进行面部视频流解码,从而得到目标面部视频帧a对应的重构面部视频帧。
进一步的,在本申请一些实施例中,为了进一步降低面部视频编码的码率,可以基于相邻目标面部视频帧的目标紧凑特征,进行差分运算,再对差分运算得到的差值进行编码以形成面部视频比特流。
具体过程如下:
分别对各目标面部视频帧进行特征提取,得到各目标面部视频帧的目标紧凑特征;对相邻两个目标面部视频帧的目标紧凑特征进行差分运算,得到目标紧凑特征残差;分别对目标紧凑特征残差和参考面部视频帧进行编码,得到面部视频比特流。
与直接基于目标紧凑特征进行编码处理的方式相比,上述方式中,是基于目标紧凑特征之间的差值进行编码处理,从而得到面部视频比特流的,显然,目标紧凑特征之间的差值的数据量小于目标紧凑特征本身的数据量,因此,基于目标紧凑特征之间的差值进行编码处理,可以有效降低面部视频编码的码率。
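下面给出相邻目标紧凑特征差分运算的一个示意性草图（仅作说明，函数名为示例性假设）：

```python
import numpy as np

def compact_feature_residuals(features):
    """对相邻目标面部视频帧的紧凑特征做差分：
    第一帧保留原始紧凑特征，其余帧仅保留与前一帧的残差（示意）。"""
    feats = [np.asarray(f, dtype=np.float32) for f in features]
    residuals = [feats[0]]
    for prev, curr in zip(feats[:-1], feats[1:]):
        residuals.append(curr - prev)
    return residuals
```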
本申请实施例中,在编码阶段,是对目标面部视频帧进行了目标紧凑特征提取,并通过对上述目标紧凑特征的编码得到的面部视频比特流,由于目标紧凑特征是表征目标面部视频帧中的关键特征信息的特征,其通过较小的数据量表征了整个面部视频帧中的关键信息,因此,通过对目标紧凑特征的编码得到的面部视频比特流,其数据量也较小,在进行视频流传输时对应的比特流也较小(码率较低)。本申请实施例,可以降低编码码率,更好地满足了低码率面部视频编码的需求。
本申请实施例一提供的面部视频编码方法，可以由视频编码端（编码器）执行，用于对面部视频文件进行编码，以实现对面部视频文件的数字带宽进行压缩。其可以适用于多种不同的场景，如：常规的涉及面部的视频游戏的存储和流式传输，具体地：可以通过本申请实施例提供的面部视频编码方法对游戏视频帧进行编码，形成对应的视频码流，以在视频流服务或者其他类似的应用中存储和传输；又如：视频会议、视频直播等低延时场景，具体地：可以通过本申请实施例提供的面部视频编码方法对视频采集设备采集到的面部视频数据进行编码，形成对应的视频码流，并发送至会议终端，通过会议终端对视频码流进行解码从而得到对应的面部视频画面；还如：虚拟现实场景，可以通过本申请实施例提供的面部视频编码方法对视频采集设备采集到的面部视频数据进行编码，形成对应的视频码流，并发送至虚拟现实相关设备（如VR虚拟眼镜等），通过VR设备对视频码流进行解码从而得到对应的面部视频画面，并基于面部视频画面实现对应的VR功能，等等。
实施例二
参照图4,图4为根据本申请实施例二的一种面部视频解码方法的步骤流程图。具体地,本实施例提供的面部视频解码方法包括以下步骤:
步骤402,获取面部视频比特流,面部视频比特流包括:编码后参考面部视频帧和编码后紧凑特征信息。
其中,编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息。
本申请实施例中,编码后紧凑特征信息对应为对各目标面部视频帧进行特征提取得到的用于表征关键特征信息的紧凑特征信息,也可以对应为:相邻的目标面部视频帧的目标紧凑特征之间的差值。
步骤404,解码编码后参考面部视频帧,并对解码得到的参考面部视频帧进行特征提取,得到参考紧凑特征。
可以借助机器学习模型对参考面部视频帧进行特征提取,从而得到参考紧凑特征。具体地:可以将参考面部视频帧输入预先训练完成的特征提取模型中,以使特征提取模型输出各参考面部视频帧的参考紧凑特征。
步骤406,解码编码后紧凑特征信息,得到目标面部视频帧的目标紧凑特征。
当编码后紧凑特征信息对应为对各目标面部视频帧进行特征提取得到的用于表征关键特征信息的紧凑特征信息时,可以对编码后紧凑特征信息进行解码处理,从而得到目标面部视频帧的目标紧凑特征;当编码后紧凑特征信息对应为:相邻的目标面部视频帧的目标紧凑特征之间的差值时,则可以在获取到前一目标面部视频帧的目标紧凑特征之后,基于解码后的目标紧凑特征之间的差值,计算出后一目标面部视频帧的目标紧凑特征。
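与上述第二种情形（相邻目标紧凑特征之间的差值）对应，解码端由残差逐帧累加恢复各目标紧凑特征的过程可以示意如下（仅作说明，函数名为示例性假设）：

```python
import numpy as np

def rebuild_compact_features(first_feature, residuals):
    """解码端由首帧紧凑特征与后续各帧残差逐帧累加，恢复各目标帧的紧凑特征（示意）。"""
    features = [np.asarray(first_feature, dtype=np.float32)]
    for res in residuals:
        features.append(features[-1] + np.asarray(res, dtype=np.float32))
    return features
```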
步骤408,基于参考紧凑特征和目标紧凑特征进行稀疏运动估计,得到稀疏运动估计图。
稀疏运动估计图表征在预设的稀疏特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系。
两个不同的面部视频帧之间,可以通过若干种方式表征其相对运动关系,例如:可以在像素级别上,分别计算两个面部视频帧中的每个像素之间的相对运动关系;也可以,对两个面部视频帧进行特征提取,得到对应的较为稀疏的特征图,从而在上述特征图(特征域)级别上,分别计算两个面部视频帧之间的相对运动关系。
本步骤中，则可以是按照后一种方式，在紧凑特征级别上，基于参考紧凑特征和目标紧凑特征，得到紧凑特征域中，参考面部视频帧与目标面部视频帧之间的相对运动关系。
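本申请实施例并不限定稀疏运动估计的具体网络结构，下面仅给出一个在紧凑特征级别上回归低分辨率运动场的假设性草图（类名、层数、通道数等均为示例性假设）：

```python
import torch
import torch.nn as nn

class SparseMotionEstimator(nn.Module):
    """示意性稀疏运动估计：由参考/目标紧凑特征回归一个低分辨率的二维运动场。"""
    def __init__(self, feat_ch=16, mid_ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch * 2, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 2, 3, padding=1),  # 每个位置输出 (dx, dy)
        )

    def forward(self, ref_feat, tgt_feat):
        # ref_feat / tgt_feat: (N, C, h, w)；输出稀疏运动估计图: (N, 2, h, w)
        return self.net(torch.cat([ref_feat, tgt_feat], dim=1))
```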
步骤410,根据稀疏运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧。
参见图5,图5本申请实施例二对应的场景示意图,以下,将参考图5所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
获取由编码后参考面部视频帧和编码后紧凑特征信息组成的面部视频比特流;对编码后参考面部视频帧进行解码,从而得到参考面部视频帧,并对参考面部视频帧进行特征提取,得到参考紧凑特征;另外,对编码后紧凑特征信息进行解码,从而得到目标面部视频帧的目标紧凑特征;然后,基于得到的参考紧凑特征和目标紧凑特征进行稀疏运动估计,从而得到稀疏运动估计图;最后,即可根据稀疏运动估计图和参考面部视频帧,进行面部视频帧重建,从而得到与目标面部视频帧对应的重建面部视频帧。
由于稀疏运动估计图表征的是在稀疏特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系,也就是说,稀疏运动估计图表征的是一个较为粗略的相对运动关系,因此,直接根据稀疏运动估计图和参考面部视频帧,生成与目标面部视频帧对应的重建面部视频帧,得到的重建面部视频帧与目标面部视频帧之间的质量差异可能较大。
因此,为了进一步提高重建面部视频帧的质量,进一步地,在其中一些实施例中,根据稀疏运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧,可以包括:
基于稀疏运动估计图,对参考面部视频帧进行形变处理,得到目标面部视频帧对应的初始重建面部视频帧;
对参考紧凑特征和目标紧凑特征进行差分运算,得到紧凑特征差值;
根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,稠密运动估计图表征在预设的稠密特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系;
根据稠密运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧。
上述方式中，再次基于紧凑特征之间的差值，以及，基于稀疏运动估计图和参考面部视频帧生成的初始重建面部视频帧，得到了稠密运动估计图，也即：在更为稠密的特征域中，目标面部视频帧与参考面部视频帧之间的相对运动关系，该相对运动关系相较于稀疏运动估计图表征的相对运动关系，则更为精准，因此，基于稠密运动估计图和参考面部视频帧，生成重建面部视频帧，可以提高重建面部视频帧的质量。
进一步地,在其中一些实施例中,根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,进一步地可以包括:
根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图和遮挡图,遮挡图表征目标面部视频帧中各像素点被遮挡的程度;
根据稠密运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧,包括:
根据稠密运动估计图、参考面部视频帧以及遮挡图,得到与目标面部视频帧对应的重建面部视频帧。
具体地,可以先根据稠密运动估计图,对参考面部视频帧进行形变处理,得到形变面部视频帧,再基于遮挡图,对形变面部视频帧进一步进行形变处理,得到最终的重建面部视频帧。
在参考面部视频帧和目标面部视频帧中,面部可能会发生一定角度的扭转,此时,可能会存在视频帧中的某些像素点被遮挡的情况。例如:参考面部视频帧中的面部为正面的面部,而目标面部视频帧的面部则稍微像左侧或者右侧转动了一定角度,此时,则存在被遮挡的像素点。
因此,为了进一步提升重建面部视频帧的质量,在生成重建面部视频帧的过程中,可以在考虑稠密运动估计图的基础上,同时考虑视频帧中各像素点被遮挡的概率,基于稠密运动估计图和遮挡图,对参考面部视频帧进行变形处理,从而得到更为精准的重建面部视频帧。
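作为“先按稠密运动估计图形变、再与遮挡图相乘”这一处理的示意性草图（假设稠密运动估计图以逐像素光流形式给出、遮挡图取值近似于[0, 1]，函数名均为示例性假设）：

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """按稠密运动估计图（光流形式，单位为像素）对参考帧做形变处理（示意）。"""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # (N, 2, H, W)
    # 归一化到 [-1, 1] 供 grid_sample 使用
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)               # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def reconstruct_frame(ref_frame, dense_flow, occlusion_map):
    """先按稠密运动估计图形变，再与遮挡图逐像素相乘，得到重建面部视频帧（示意）。"""
    warped = warp_with_flow(ref_frame, dense_flow)
    return warped * occlusion_map   # occlusion_map: (N, 1, H, W)
```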
进一步的,为提高面部视频解码的整体效率,上述一些步骤,可以借助机器学习模型进行处理,具体地:
对解码得到的参考面部视频帧进行特征提取,得到参考紧凑特征,可以包括:
将解码得到的参考面部视频帧输入特征提取模型,以使特征提取模型输出参考紧凑特征。
基于稀疏运动估计图,对参考面部视频帧进行形变处理,得到目标面部视频帧对应的初始重建面部视频帧,可以包括:
将稀疏运动估计图和参考面部视频帧输入形变图像预估模型,以使形变图像预估模型输出目标面部视频帧对应的初始重建面部视频帧。
根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,可以包括:
将紧凑特征差值和初始重建面部视频帧输入稠密运动估计模型,以使稠密运动估计模型输出稠密运动估计图。
根据稠密运动估计图、参考面部视频帧以及遮挡图,得到与目标面部视频帧对应的重建面部视频帧,可以包括:
将稠密运动估计图、参考面部视频帧以及遮挡图输入生成模型,以使生成模型输出与目标面部视频帧对应的重建面部视频帧。
参见图6,图6为本申请实施例二对应的另一场景示意图,该场景是在图5所示场景的基础上,同时基于稠密运动估计图、遮挡图,对参考面部视频帧进行形变处理,得到最终的重建面部视频帧,具体地:
在图5的基础上，根据稀疏运动估计图和参考面部视频帧，进行面部视频帧重建，从而得到与目标面部视频帧对应的初始重建面部视频帧；同时，对参考紧凑特征和目标紧凑特征进行差分运算，得到差分运算结果（即紧凑特征差值）；再基于初始重建面部视频帧和上述差分运算结果，进行稠密运动估计，同时得到稠密运动估计图和遮挡图；最后，基于稠密运动估计图、遮挡图以及参考面部视频帧，得到最终的与目标面部视频帧对应的重建面部视频帧。
本申请实施例中,在解码阶段,对编码阶段得到的面部视频比特流进行解码,再基于解码得到的目标紧凑特征,进行面部视频帧重构,由于目标紧凑特征能够表征目标面部视频帧中关键特征信息,因此,基于目标紧凑特征得到的重构视频帧与目标面部视频帧间的质量差异也较小。本申请实施例,可以降低编码码率的同时,得到较高质量的重构面部视频帧。
本实施例的面部视频解码方法可以由任意适当的具有数据处理能力的电子设备执行，包括但不限于：服务器、PC机等。
实施例三
参照图7,图7为根据本申请实施例三的一种模型训练方法的步骤流程图。具体地,本实施例提供的模型训练方法包括以下步骤:
步骤702,将目标面部视频帧样本输入特征提取模型,得到目标紧凑特征样本;分别对目标紧凑特征样本和参考面部视频帧样本进行编码,得到面部视频比特流样本。
本申请中,对于特征提取模型的结构和参数不做限定,可以根据实际需要进行设定,例如:特征提取模型可以为基于卷积层和广义除法归一化层组合而成的U-Net网络,等等。
步骤704,解码面部视频比特流样本,得到参考面部视频帧样本和目标紧凑特征样本;将参考面部视频帧样本输入特征提取模型,得到参考紧凑特征样本。
本步骤中的特征提取模型，可以为与步骤702中的特征提取模型完全相同的模型，以便于得到与目标紧凑特征样本对应的参考紧凑特征样本。
步骤706,基于参考紧凑特征样本和目标紧凑特征样本进行稀疏运动估计,得到稀疏运动估计样本图;将稀疏运动估计样本图和参考面部视频帧样本输入形变图像预估模型,得到初始重建面部视频帧样本。
本申请中,对于形变图像预估模型的结构和参数也不做限定,可以根据实际需要进行设定,例如:也可以为基于卷积层和广义除法归一化层组合而成的U-Net网络,等等。
上述步骤702-步骤706中各步骤的具体执行过程,可以参考上述实施例一或者实施例二中的对应步骤,此处不再赘述。
步骤708,根据初始重建面部视频帧样本和目标面部视频帧样本,分别构建感知损失函数和对抗损失函数;基于初始重建面部视频帧、目标面部视频帧样本以及目标紧凑特征样本对应的传输码率,得到率失真损失函数。
具体地,可以通过如下方式构建感知损失函数:
分别将初始重建面部视频帧样本和目标面部视频帧样本输入预设的已训练完成的图像分类模型,例如VGG-19网络模型,从而分别得到初始重建面部视频帧样本对应的初始特征图和目标面部视频帧样本对应的目标特征图;然后再基于初始特征图和目标特征图,进行均方误差计算,从而得到感知损失函数。
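以VGG-19作为图像分类模型为例，感知损失的计算可以示意如下（所选取的特征层数、torchvision权重加载方式等均为示例性假设）：

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# 取 VGG-19 的前若干层作为固定特征提取器（具体截取层数为示例性选择）
_vgg_features = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(recon_frame, target_frame):
    """将初始重建帧与目标帧分别送入图像分类网络取特征图，再计算均方误差（示意）。"""
    with torch.no_grad():
        target_feat = _vgg_features(target_frame)
    recon_feat = _vgg_features(recon_frame)
    return F.mse_loss(recon_feat, target_feat)
```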
对抗损失函数可以通过如下方式构建:将初始重建面部视频帧样本和目标面部视频帧样本同时输入预先已训练完成的分类器,基于分类结果(是否为同一类型的视频帧)构建对应的对抗损失函数。
率失真损失函数的构建过程可以包括:先获取目标紧凑特征样本对应的传输码率,然后,再基于初始重建面部视频帧样本和目标面部视频帧样本,构建失真函数(本申请实施例中,对于构建失真函数所采用的具体方式不做限定,例如:可以使用深度图像结构和纹理相似性算法构建失真函数,等等),再对上述传输码率和构建的失真函数进行融合(如相加等),得到率失真损失函数。
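率失真损失函数的构建可以示意如下（其中码率项以紧凑特征符号的似然估计近似，失真项以均方误差代替，二者均为示例性假设，实际也可采用深度图像结构和纹理相似性等度量）：

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(recon_frame, target_frame, feature_likelihoods, lam=1.0):
    """率失真损失的示意：码率项用紧凑特征符号的负对数似然估计（假设由熵模型给出），
    失真项此处以 MSE 代替，再将两项融合（相加）。"""
    rate = -torch.log2(feature_likelihoods.clamp_min(1e-9)).sum() / target_frame.numel()
    distortion = F.mse_loss(recon_frame, target_frame)
    return rate + lam * distortion
```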
步骤710,对感知损失函数、对抗损失函数以及率失真损失函数进行融合,得到训练损失函数;根据训练损失函数,对特征提取模型和形变图像预估模型进行训练。
具体地,可以分别为感知损失函数、对抗损失函数以及率失真损失函数设定对应的权重值,然后基于设定的各权重值,对感知损失函数、对抗损失函数以及率失真损失函数进行加和处理,从而得到最终的训练损失函数。具体地,可参见如下公式:
L = λ1·Lper + λ2·LGD + λ3·LRD
其中,L为最终的训练损失函数;Lper为感知损失函数;LGD为对抗损失函数;LRD为率失真损失函数;λ1、λ2、λ3分别为感知损失函数的权重值、对抗损失函数的权重值,以及率失真损失函数的权重值。
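与上式对应的加权融合可以示意如下（权重取值仅为示例）：

```python
def total_training_loss(l_per, l_gd, l_rd, lambda1=1.0, lambda2=0.1, lambda3=1.0):
    """按上式对感知损失、对抗损失与率失真损失加权求和，得到训练损失函数（示意）。"""
    return lambda1 * l_per + lambda2 * l_gd + lambda3 * l_rd
```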
参见图8,图8为本申请实施例三对应的场景示意图,以下,将参考图8所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
将目标面部视频帧样本输入待训练的特征提取模型,得到目标紧凑特征样本;分别对目标紧凑特征样本和参考面部视频帧样本进行编码,得到面部视频比特流样本;解码面部视频比特流样本中的编码后参考面部视频帧样本,得到参考面部视频帧样本;解码面部视频比特流样本中的编码后紧凑特征样本,得到目标紧凑特征样本;再将参考面部视频帧样本输入上述特征提取模型,得到参考紧凑特征样本;基于参考紧凑特征样本和目标紧凑特征样本进行稀疏运动估计,得到稀疏运动估计样本图;将稀疏运动估计样本图和参考面部视频帧样本输入待训练的形变图像预估模型,得到初始重建面部视频帧样本;根据初始重建面部视频帧样本和目标面部视频帧样本,分别构建感知损失函数Lper和对抗损失函数LGD;基于初始重建面部视频帧、目标面部视频帧样本以及目标紧凑特征样本对应的传输码率,得到率失真损失函数LRD;对Lper、LGD以及LRD进行融合,得到训练损失函数L;根据L,对上述待训练的特征提取模型和形变图像预估模型进行训练,从而得到训练完成的特征提取模型和形变图像预估模型。
进一步地,与图6对应地,在其中一些实施例中,还可以进一步地引入稠密运动估计模型以及生成模型,在图8所示训练程序的基础上,进行如下改进:将稀疏运动估计样本图和参考面部视频帧样本输入形变图像预估模型,得到初始重建面部视频帧样本;对参考紧凑特征样本和目标紧凑特征样本进行差分运算,得到紧凑特征样本差值;将紧凑特征样本差值和初始重建面部视频帧样本输入待训练的稠密运动估计模型,得到稠密运动估计样本图和遮挡样本图;将稠密运动估计样本图、参考面部视频帧样本以及遮挡样本图输入待训练的生成模型,得到重建面部视频帧。
之后，根据初始重建面部视频帧样本和目标面部视频帧样本，构建第一感知损失函数；根据重建面部视频帧和目标面部视频帧样本，构建第二感知损失函数；基于重建面部视频帧和目标面部视频帧样本，构建对抗损失函数；基于重建面部视频帧和目标面部视频帧样本构建的失真损失函数，以及，目标紧凑特征样本对应的传输码率，得到率失真损失函数；对第一感知损失函数、第二感知损失函数、对抗损失函数以及率失真损失函数进行融合，得到训练损失函数；根据训练损失函数，对上述待训练的特征提取模型、形变图像预估模型、稠密运动估计模型以及生成模型进行训练，得到训练完成的特征提取模型、形变图像预估模型、稠密运动估计模型以及生成模型。
本实施例的模型训练方法可以由任意适当的具有数据处理能力的电子设备执行，包括但不限于：服务器、PC机等。
实施例四
参见图9,图9为根据本申请实施例四的一种面部视频编码装置的结构框图。本申请实施例提供的面部视频编码装置包括:
面部视频帧获取模块902,用于获取待编码的目标面部视频帧和参考面部视频帧。
特征提取模块904,用于对目标面部视频帧进行特征提取,得到目标紧凑特征,目标紧凑特征表征目标面部视频帧中的关键特征信息。
编码模块906,用于分别对目标紧凑特征和参考面部视频帧进行编码,得到面部视频比特流。
可选地,在其中一些实施例中,目标面部视频帧为多个连续面部视频帧;特征提取模块904,具体用于:分别对各目标面部视频帧进行特征提取,得到各目标面部视频帧的目标紧凑特征;
编码模块906,具体用于:对相邻两个目标面部视频帧的目标紧凑特征进行差分运算,得到目标紧凑特征残差;
分别对目标紧凑特征残差和参考面部视频帧进行编码,得到面部视频比特流。
可选地,在其中一些实施例中,特征提取模块904,具体用于:分别将各目标面部视频帧输入特征提取模型,以使特征提取模型输出各目标面部视频帧的目标紧凑特征。
本实施例的面部视频编码装置用于实现前述多个方法实施例中相应的面部视频编码方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的面部视频编码装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例五
参见图10,图10为根据本申请实施例五的一种面部视频解码装置的结构框图。本申请实施例提供的面部视频解码装置包括:
视频比特流获取模块1002,用于获取面部视频比特流,面部视频比特流包括:编码后参考面部视频帧和编码后紧凑特征信息;编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;
第一解码模块1004,用于解码编码后参考面部视频帧,并对解码得到的参考面部视频帧进行特征提取,得到参考紧凑特征;
第二解码模块1006,用于解码编码后紧凑特征信息,得到目标面部视频帧的目标紧凑特征;
稀疏运动估计模块1008,用于基于参考紧凑特征和目标紧凑特征进行稀疏运动估计,得到稀疏运动估计图,稀疏运动估计图表征在预设的稀疏特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系;
重建面部视频帧得到模块1010,用于根据稀疏运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧。
可选地,在其中一些实施例中,重建面部视频帧得到模块1010,具体用于:
基于稀疏运动估计图，对参考面部视频帧进行形变处理，得到目标面部视频帧对应的初始重建面部视频帧；
对参考紧凑特征和目标紧凑特征进行差分运算,得到紧凑特征差值;
根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,稠密运动估计图表征在预设的稠密特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系;
根据稠密运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧。
可选地,在其中一些实施例中,重建面部视频帧得到模块1010在执行根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图的步骤时,具体用于:根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图和遮挡图,遮挡图表征目标面部视频帧中各像素点被遮挡的程度;在执行根据稠密运动估计图和参考面部视频帧,得到与目标面部视频帧对应的重建面部视频帧的步骤时,具体用于:根据稠密运动估计图、参考面部视频帧以及遮挡图,得到与目标面部视频帧对应的重建面部视频帧。
可选地，在其中一些实施例中，第一解码模块1004在执行对解码得到的参考面部视频帧进行特征提取，得到参考紧凑特征的步骤时，具体用于：
将解码得到的参考面部视频帧输入特征提取模型,以使特征提取模型输出参考紧凑特征。
可选地,在其中一些实施例中,重建面部视频帧得到模块1010,在执行基于稀疏运动估计图,对参考面部视频帧进行形变处理,得到目标面部视频帧对应的初始重建面部视频帧步骤时,具体用于:将稀疏运动估计图和参考面部视频帧输入形变图像预估模型,以使形变图像预估模型输出目标面部视频帧对应的初始重建面部视频帧。
可选地,在其中一些实施例中,重建面部视频帧得到模块1010,在执行根据紧凑特征差值和初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图步骤时,具体用于:将紧凑特征差值和初始重建面部视频帧输入稠密运动估计模型,以使稠密运动估计模型输出稠密运动估计图。
可选地,在其中一些实施例中,重建面部视频帧得到模块1010,在执行根据稠密运动估计图、参考面部视频帧以及遮挡图,得到与目标面部视频帧对应的重建面部视频帧的步骤时,具体用于:
将稠密运动估计图、参考面部视频帧以及遮挡图输入生成模型,以使生成模型输出与目标面部视频帧对应的重建面部视频帧。
本实施例的面部视频解码装置用于实现前述多个方法实施例中相应的面部视频解码方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的面部视频解码装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例六
参见图11,图11为根据本申请实施例六的一种模型训练装置的结构框图。本申请实施例提供的模型训练装置包括:
面部视频比特流样本得到模块1102,用于将目标面部视频帧样本输入特征提取模型,得到目标紧凑特征样本;分别对目标紧凑特征样本和参考面部视频帧样本进行编码,得到面部视频比特流样本;
紧凑特征样本得到模块1104，用于解码面部视频比特流样本，得到参考面部视频帧样本和目标紧凑特征样本；将参考面部视频帧样本输入特征提取模型，得到参考紧凑特征样本；
初始重建面部视频帧样本得到模块1106,用于基于参考紧凑特征样本和目标紧凑特征样本进行稀疏运动估计,得到稀疏运动估计样本图;将稀疏运动估计样本图和参考面部视频帧样本输入形变图像预估模型,得到初始重建面部视频帧样本;
率失真损失函数得到模块1108,用于根据初始重建面部视频帧样本和目标面部视频帧样本,分别构建感知损失函数和对抗损失函数;基于初始重建面部视频帧、目标面部视频帧样本以及目标紧凑特征样本对应的传输码率,得到率失真损失函数;
模型训练模块1110,用于对感知损失函数、对抗损失函数以及率失真损失函数进行融合,得到训练损失函数;根据训练损失函数,对特征提取模型和形变图像预估模型进行训练。
本实施例的模型训练装置用于实现前述多个方法实施例中相应的模型训练方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的模型训练装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例七
参照图12,示出了根据本申请实施例七的一种电子设备的结构示意图,本申请具体实施例并不对电子设备的具体实现做限定。
如图12所示，该电子设备可以包括：处理器（processor）1202、通信接口（Communications Interface）1204、存储器（memory）1206、以及通信总线1208。
其中:
处理器1202、通信接口1204、以及存储器1206通过通信总线1208完成相互间的通信。
通信接口1204,用于与其它电子设备或服务器进行通信。
处理器1202,用于执行程序1210,具体可以执行上述面部视频编码方法,或者,面部视频解码方法,或者,模型训练方法实施例中的相关步骤。
具体地,程序1210可以包括程序代码,该程序代码包括计算机操作指令。
处理器1202可能是CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本申请实施例的一个或多个集成电路。智能设备包括的一个或多个处理器,可以是同一类型的处理器,如一个或多个CPU;也可以是不同类型的处理器,如一个或多个CPU以及一个或多个ASIC。
存储器1206,用于存放程序1210。存储器1206可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
程序1210具体可以用于使得处理器1202执行以下操作：获取待编码的目标面部视频帧和参考面部视频帧；对所述目标面部视频帧进行特征提取，得到目标紧凑特征，所述目标紧凑特征表征所述目标面部视频帧中的关键特征信息；分别对所述目标紧凑特征和所述参考面部视频帧进行编码，得到面部视频比特流。
或者,
程序1210具体可以用于使得处理器1202执行以下操作：获取面部视频比特流，面部视频比特流包括：编码后参考面部视频帧和编码后紧凑特征信息；所述编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息；解码所述编码后参考面部视频帧，并对解码得到的参考面部视频帧进行特征提取，得到参考紧凑特征；解码所述编码后紧凑特征信息，得到所述目标面部视频帧的目标紧凑特征；基于所述参考紧凑特征和所述目标紧凑特征进行稀疏运动估计，得到稀疏运动估计图，所述稀疏运动估计图表征在预设的稀疏特征域中，所述目标面部视频帧与所述参考面部视频帧之间的相对运动关系；根据所述稀疏运动估计图和所述参考面部视频帧，得到与所述目标面部视频帧对应的重建面部视频帧。
或者,
程序1210具体可以用于使得处理器1202执行以下操作：将目标面部视频帧样本输入特征提取模型，得到目标紧凑特征样本；分别对所述目标紧凑特征样本和参考面部视频帧样本进行编码，得到面部视频比特流样本；解码所述面部视频比特流样本，得到所述参考面部视频帧样本和所述目标紧凑特征样本；将所述参考面部视频帧样本输入所述特征提取模型，得到参考紧凑特征样本；基于所述参考紧凑特征样本和所述目标紧凑特征样本进行稀疏运动估计，得到稀疏运动估计样本图；将所述稀疏运动估计样本图和所述参考面部视频帧样本输入形变图像预估模型，得到初始重建面部视频帧样本；根据所述初始重建面部视频帧样本和所述目标面部视频帧样本，分别构建感知损失函数和对抗损失函数；基于所述初始重建面部视频帧、所述目标面部视频帧样本以及所述目标紧凑特征样本对应的传输码率，得到率失真损失函数；对所述感知损失函数、对抗损失函数以及率失真损失函数进行融合，得到训练损失函数；根据所述训练损失函数，对所述特征提取模型和所述形变图像预估模型进行训练。
程序1210中各步骤的具体实现可以参见上述面部视频编码方法，或者，面部视频解码方法，或者，模型训练方法实施例中的相应步骤和单元中对应的描述，在此不赘述。所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的设备和模块的具体工作过程，可以参考前述方法实施例中的对应过程描述，在此不再赘述。
通过本实施例的电子设备,在编码阶段,是对目标面部视频帧进行了目标紧凑特征提取,并通过对上述目标紧凑特征的编码得到的面部视频比特流,由于目标紧凑特征是表征目标面部视频帧中的关键特征信息的特征,其通过较小的数据量表征了整个面部视频帧中的关键信息,因此,通过对目标紧凑特征的编码得到的面部视频比特流,其数据量也较小,在进行视频流传输时对应的比特流也较小(码率较低),另外,在解码阶段,对上述得到的面部视频比特流进行解码,再基于解码得到的表征目标面部视频帧中关键特征信息的目标紧凑特征,进行面部视频帧重构,得到的重构视频帧与目标面部视频帧间的质量差异也较小。综上,本申请实施例,可以在保证面部视频重建质量的前提下,降低编码码率,更好地满足了低码率面部视频编码的需求。
本申请实施例还提供了一种计算机程序产品,包括计算机指令,该计算机指令指示计算设备执行上述多个方法实施例中的任一方法对应的操作。
需要指出,根据实施的需要,可将本申请实施例中描述的各个部件/步骤拆分为更多部件/步骤,也可将两个或多个部件/步骤或者部件/步骤的部分操作组合成新的部件/步骤,以实现本申请实施例的目的。
上述根据本申请实施例的方法可在硬件、固件中实现,或者被实现为可存储在记录介质(诸如CD ROM、RAM、软盘、硬盘或磁光盘)中的软件或计算机代码,或者被实现通过网络下载的原始存储在远程记录介质或非暂时机器可读介质中并将被存储在本地记录介质中的计算机代码,从而在此描述的方法可被存储在使用通用计算机、专用处理器或者可编程或专用硬件(诸如ASIC或FPGA)的记录介质上的这样的软件处理。可以理解,计算机、处理器、微处理器控制器或可编程硬件包括可存储或接收软件或计算机代码的存储组件(例如,RAM、ROM、闪存等),当软件或计算机代码被计算机、处理器或硬件访问且执行时,实现在此描述的面部视频编码方法,或者,面部视频解码方法,或者,模型训练方法。此外,当通用计算机访问用于实现在此示出的面部视频编码方法,或者,面部视频解码方法,或者,模型训练方法的代码时,代码的执行将通用计算机转换为用于执行在此示出的面部视频编码方法,或者,面部视频解码方法,或者,模型训练方法的专用计算机。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及方法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
以上实施方式仅用于说明本申请实施例,而并非对本申请实施例的限制,有关技术领域的普通技术人员,在不脱离本申请实施例的精神和范围的情况下,还可以做出各种变化和变型,因此所有等同的技术方案也属于本申请实施例的范畴,本申请实施例的专利保护范围应由权利要求限定。

Claims (12)

  1. 一种面部视频编码方法,包括:
    获取待编码的目标面部视频帧和参考面部视频帧;
    对所述目标面部视频帧进行特征提取,得到目标紧凑特征,所述目标紧凑特征表征所述目标面部视频帧中的关键特征信息;
    分别对所述目标紧凑特征和所述参考面部视频帧进行编码,得到面部视频比特流。
  2. 根据权利要求1所述的方法,其中,所述目标面部视频帧为多个连续面部视频帧;所述对所述目标面部视频帧进行特征提取,得到所述目标面部视频帧的紧凑特征,包括:
    分别对各目标面部视频帧进行特征提取,得到各目标面部视频帧的目标紧凑特征;
    所述分别对所述目标紧凑特征和所述参考面部视频帧进行编码,得到面部视频比特流,包括:
    对相邻两个目标面部视频帧的目标紧凑特征进行差分运算,得到目标紧凑特征残差;
    分别对所述目标紧凑特征残差和所述参考面部视频帧进行编码,得到面部视频比特流。
  3. 根据权利要求2所述的方法,其中,所述分别对各目标面部视频帧进行特征提取,得到各目标面部视频帧的目标紧凑特征,包括:
    分别将各目标面部视频帧输入特征提取模型,以使所述特征提取模型输出各目标面部视频帧的目标紧凑特征。
  4. 一种面部视频解码方法,包括:
    获取面部视频比特流,所述面部视频比特流包括:编码后参考面部视频帧和编码后紧凑特征信息;所述编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;
    解码所述编码后参考面部视频帧,并对解码得到的参考面部视频帧进行特征提取,得到参考紧凑特征;
    解码所述编码后紧凑特征信息,得到所述目标面部视频帧的目标紧凑特征;
    基于所述参考紧凑特征和所述目标紧凑特征进行稀疏运动估计,得到稀疏运动估计图,所述稀疏运动估计图表征在预设的稀疏特征域中,所述目标面部视频帧与所述参考面部视频帧之间的相对运动关系;
    根据所述稀疏运动估计图和所述参考面部视频帧,得到与所述目标面部视频帧对应的重建面部视频帧。
  5. 根据权利要求4所述的方法,其中,所述根据所述稀疏运动估计图和所述参考面部视频帧,得到与所述目标面部视频帧对应的重建面部视频帧,包括:
    基于所述稀疏运动估计图，对所述参考面部视频帧进行形变处理，得到所述目标面部视频帧对应的初始重建面部视频帧；
    对所述参考紧凑特征和所述目标紧凑特征进行差分运算,得到紧凑特征差值;
    根据所述紧凑特征差值和所述初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,所述稠密运动估计图表征在预设的稠密特征域中,所述目标面部视频帧与所述参考面部视频帧之间的相对运动关系;
    根据所述稠密运动估计图和所述参考面部视频帧,得到与所述目标面部视频帧对应的重建面部视频帧。
  6. 根据权利要求5所述的方法,其中,所述根据所述紧凑特征差值和所述初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,包括:
    根据所述紧凑特征差值和所述初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图和遮挡图,所述遮挡图表征所述目标面部视频帧中各像素点被遮挡的程度;
    所述根据所述稠密运动估计图和所述参考面部视频帧,得到与所述目标面部视频帧对应的重建面部视频帧,包括:
    根据所述稠密运动估计图、所述参考面部视频帧以及所述遮挡图,得到与所述目标面部视频帧对应的重建面部视频帧。
  7. 根据权利要求6所述的方法,其中,所述对解码得到的参考面部视频帧进行特征提取,得到参考紧凑特征,包括:
    将解码得到的参考面部视频帧输入特征提取模型,以使所述特征提取模型输出参考紧凑特征。
  8. 根据权利要求6所述的方法,其中,所述基于所述稀疏运动估计图,对所述参考面部视频帧进行形变处理,得到所述目标面部视频帧对应的初始重建面部视频帧,包括:
    将所述稀疏运动估计图和所述参考面部视频帧输入形变图像预估模型,以使所述形变图像预估模型输出目标面部视频帧对应的初始重建面部视频帧。
  9. 根据权利要求6所述的方法,其中,所述根据所述紧凑特征差值和所述初始重建面部视频帧进行稠密运动估计,得到稠密运动估计图,包括:
    将所述紧凑特征差值和所述初始重建面部视频帧输入稠密运动估计模型,以使所述稠密运动估计模型输出稠密运动估计图。
  10. 根据权利要求6所述的方法,其中,所述根据所述稠密运动估计图、所述参考面部视频帧以及所述遮挡图,得到与所述目标面部视频帧对应的重建面部视频帧,包括:
    将所述稠密运动估计图、所述参考面部视频帧以及所述遮挡图输入生成模型,以使所述生成模型输出与所述目标面部视频帧对应的重建面部视频帧。
  11. 一种模型训练方法,包括:
    将目标面部视频帧样本输入特征提取模型，得到目标紧凑特征样本；分别对所述目标紧凑特征样本和参考面部视频帧样本进行编码，得到面部视频比特流样本；
    解码所述面部视频比特流样本,得到所述参考面部视频帧样本和所述目标紧凑特征样本;将所述参考面部视频帧样本输入所述特征提取模型,得到参考紧凑特征样本;
    基于所述参考紧凑特征样本和所述目标紧凑特征样本进行稀疏运动估计,得到稀疏运动估计样本图;将所述稀疏运动估计样本图和所述参考面部视频帧样本输入形变图像预估模型,得到初始重建面部视频帧样本;
    根据所述初始重建面部视频帧样本和所述目标面部视频帧样本,分别构建感知损失函数和对抗损失函数;基于所述初始重建面部视频帧、所述目标面部视频帧样本以及所述目标紧凑特征样本对应的传输码率,得到率失真损失函数;
    对所述感知损失函数、对抗损失函数以及率失真损失函数进行融合,得到训练损失函数;根据所述训练损失函数,对所述特征提取模型和所述形变图像预估模型进行训练。
  12. 一种电子设备,包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;
    所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如权利要求1-3中任一项所述的面部视频编码方法对应的操作,或者,如权利要求4-9中任一项所述的面部视频解码方法对应的操作,或者,如权利要求11中所述的模型训练方法对应的操作。
PCT/CN2023/073054 2022-01-25 2023-01-19 一种面部视频编码方法、解码方法及装置 WO2023143349A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210085278.8A CN114422795A (zh) 2022-01-25 2022-01-25 一种面部视频编码方法、解码方法及装置
CN202210085278.8 2022-01-25

Publications (1)

Publication Number Publication Date
WO2023143349A1 true WO2023143349A1 (zh) 2023-08-03

Family

ID=81276556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073054 WO2023143349A1 (zh) 2022-01-25 2023-01-19 一种面部视频编码方法、解码方法及装置

Country Status (2)

Country Link
CN (1) CN114422795A (zh)
WO (1) WO2023143349A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114422795A (zh) * 2022-01-25 2022-04-29 阿里巴巴(中国)有限公司 一种面部视频编码方法、解码方法及装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190215482A1 (en) * 2018-01-05 2019-07-11 Facebook, Inc. Video Communication Using Subtractive Filtering
CN109831638A (zh) * 2019-01-23 2019-05-31 广州视源电子科技股份有限公司 视频图像传输方法、装置、交互智能平板和存储介质
CN113132735A (zh) * 2019-12-30 2021-07-16 北京大学 一种基于视频帧生成的视频编码方法
CN114422795A (zh) * 2022-01-25 2022-04-29 阿里巴巴(中国)有限公司 一种面部视频编码方法、解码方法及装置

Also Published As

Publication number Publication date
CN114422795A (zh) 2022-04-29

Similar Documents

Publication Publication Date Title
Cai et al. End-to-end optimized roi image compression
Liu et al. Neural video coding using multiscale motion compensation and spatiotemporal context model
WO2021208247A1 (zh) 一种视频图像的拟态压缩方法、装置、存储介质及终端
CN104096362B (zh) 基于游戏者关注区域改进视频流的码率控制比特分配
Wu et al. Learned block-based hybrid image compression
WO2023143101A1 (zh) 一种面部视频编码方法、解码方法及装置
CN111630570A (zh) 图像处理方法、设备及计算机可读存储介质
WO2023143349A1 (zh) 一种面部视频编码方法、解码方法及装置
WO2023246926A1 (zh) 模型训练方法、视频编码方法及解码方法
CN116233445B (zh) 视频的编解码处理方法、装置、计算机设备和存储介质
WO2023246923A1 (zh) 视频编码方法、解码方法、电子设备及存储介质
Fang et al. 3dac: Learning attribute compression for point clouds
Akbari et al. Learned multi-resolution variable-rate image compression with octave-based residual blocks
Zheng et al. Context tree-based image contour coding using a geometric prior
Jiang et al. Multi-modality deep network for extreme learned image compression
CN111885384B (zh) 带宽受限下基于生成对抗网络的图片处理和传输方法
Tan et al. Image compression algorithms based on super-resolution reconstruction technology
Pinheiro et al. NF-PCAC: Normalizing Flow based Point Cloud Attribute Compression
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
WO2020053688A1 (en) Rate distortion optimization for adaptive subband coding of regional adaptive haar transform (raht)
WO2023143331A1 (zh) 一种面部视频编码方法、解码方法及装置
Yang et al. Graph-convolution network for image compression
CN114449286A (zh) 一种视频编码方法、解码方法及装置
CN107770537B (zh) 基于线性重建的光场图像压缩方法
CN114205585A (zh) 面部视频编码方法、解码方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23746230

Country of ref document: EP

Kind code of ref document: A1