WO2023143101A1 - Facial video encoding method, decoding method and device - Google Patents

Facial video encoding method, decoding method and device

Info

Publication number
WO2023143101A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial
information
dimensional
target
latent
Prior art date
Application number
PCT/CN2023/071943
Other languages
English (en)
French (fr)
Inventor
王钊
李彬哲
叶琰
王诗淇
Original Assignee
阿里巴巴(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司
Publication of WO2023143101A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a facial video encoding method, decoding method and device.
  • more traditional video encoding and decoding methods usually extract and describe facial information from facial video frames based on two-dimensional features. The two-dimensional features are themselves obtained by mapping the original three-dimensional face, and this mapping process introduces a certain amount of warping and distortion; therefore, facial video encoding and decoding based on such two-dimensional features ultimately yields reconstructed facial video frames of poor quality.
  • embodiments of the present application provide a facial video encoding method, decoding method, and device to at least partially solve the above-mentioned problems.
  • a facial video coding method including:
  • a model training method including:
  • a code rate loss function is constructed according to the transmission code rate corresponding to the facial video bitstream samples; a distortion loss function is constructed according to the reconstructed facial video frame sample and the target facial video frame sample;
  • a training loss function is obtained based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
  • an electronic device including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other via the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the model training method described in the third aspect.
  • a computer storage medium on which a computer program is stored; when the program is executed by a processor, the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the model training method described in the third aspect is implemented.
  • a computer program product including computer instructions, the computer instructions instructing a computing device to perform the operations corresponding to the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the model training method described in the third aspect.
  • Fig. 1 is a schematic framework diagram of a codec method based on deep video generation;
  • FIG. 2 is a schematic diagram of a scene of facial video communication provided according to an embodiment of the present application.
  • Fig. 3 is a flow chart of the steps of a facial video coding method according to Embodiment 1 of the present application;
  • Fig. 4 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 3;
  • Fig. 6 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 5;
  • FIG. 7 is a flow chart of steps of a model training method according to Embodiment 3 of the present application.
  • Fig. 11 is a structural block diagram of a model training device according to Embodiment 6 of the present application.
  • FIG. 12 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present application.
  • FIG. 1 is a schematic framework diagram of a codec method based on deep video generation.
  • the main principle of this method is to deform the reference frame based on the motion of the frame to be encoded, so as to obtain a reconstructed frame corresponding to the frame to be encoded.
  • the basic framework of the codec method based on deep video generation is described below in conjunction with Figure 1:
  • FIG. 2 is a schematic diagram of a scene of facial video communication provided according to an embodiment of the present application.
  • the whole communication process includes: the facial video encoding process performed by the sending end, and the facial video decoding process performed by the receiving end.
  • FIG. 3 is a flow chart of steps of a face video encoding method according to Embodiment 1 of the present application.
  • the facial video encoding method provided in the embodiment of the present application may correspond to the facial video encoding process performed by the sending end in FIG. 2 above.
  • the facial video encoding method provided in this embodiment includes the following steps:
  • Step 302 acquiring the target facial video frame to be encoded and the 3D facial template corresponding to the reference facial video frame.
  • a 3D facial template is a digital representation of the face in space, and admits a variety of representation methods, such as point clouds, polygon meshes, voxels, etc. For a 3D facial model represented by voxels, for example, the specific description form can be the position information of each voxel, the pixel value of each voxel, and so on.
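  • as a rough illustration of the kind of data structure a voxel-based template might use (the field names and layout below are our own illustrative assumptions, not taken from this application), a minimal sketch:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoxelFaceTemplate:
    """Hypothetical voxel representation of a 3D facial template: each
    occupied voxel stores its grid position and a pixel (color) value."""
    positions: np.ndarray     # (N, 3) integer voxel coordinates
    pixel_values: np.ndarray  # (N, 3) RGB value per voxel

# A toy 2-voxel "template"; a real template would come from 3D facial
# reconstruction on the reference frame, optionally edited manually.
template = VoxelFaceTemplate(
    positions=np.array([[10, 20, 5], [11, 20, 5]]),
    pixel_values=np.array([[200, 180, 170], [198, 179, 169]]),
)
```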
  • specifically, the template may be obtained based on the reference facial video frame using a three-dimensional facial reconstruction algorithm, optionally combined with manual interactive operations.
  • the process of obtaining a three-dimensional face template may include:
  • the three-dimensional facial reconstruction is performed based on the reference facial video frame to obtain an initial three-dimensional facial template; in response to an editing operation on the initial three-dimensional facial template, the three-dimensional facial template is obtained.
  • that is, the initial 3D facial template can first be obtained with an existing 3D facial reconstruction algorithm, but its accuracy may be low; the position information, pixel value, and so on of each voxel can then be adjusted manually so as to represent the face in the reference facial video frame more accurately.
  • Step 304 performing feature extraction on the target facial video frame and the three-dimensional facial template to obtain target three-dimensional facial description information.
  • the 3D face description information is the information used to drive the 3D face template, and correspondingly, the target 3D face description information is the information used to drive the 3D face template to obtain the 3D face corresponding to the target face video frame.
  • the 3D facial description information may include: 3D expression information, 3D translation information, 3D angle information, 3D texture information, 3D shape information, and so on.
  • the specific content of the three-dimensional facial description information is not limited, and one or more of the above information may be selected as the facial description information according to actual needs.
  • Step 306 encoding target 3D facial description information to obtain facial video bit stream.
  • the reference three-dimensional facial description information and the target three-dimensional facial description information are differentially calculated to obtain differential three-dimensional facial description information; the differential three-dimensional facial description information is encoded to obtain latent coding information, the dimension of the latent coding information being smaller than that of the differential three-dimensional facial description information; the latent coding information and the reference three-dimensional facial description information are then each entropy coded to obtain the facial video bit stream.
  • in this encoding scheme, the facial video bit stream is generated based on the differential result of the reference and target 3D facial description information. Since the data volume of the differential result is lower than that of the target 3D facial description information, this approach can effectively reduce the coding rate of the facial video stream compared with generating the bitstream directly from the target 3D facial description information. Moreover, after the differential result is obtained, it is further encoded (dimension-reduction processing) so that the dimension of the resulting latent coding information is smaller than that of the differential result, which further reduces the amount of data to be coded and thus the coding rate.
  • the machine learning model can be used to encode the differential 3D facial description information, so as to obtain latent encoding information.
  • the specific method can be as follows: the differential 3D facial description information is input into a fully connected encoding model, so that the fully connected encoding model outputs the latent coding information.
  • the following operations may be performed after obtaining the latent coding information: acquire the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; perform a differential operation on the latent coding information and the preceding latent coding information to obtain differential latent coding information.
  • the facial video bitstream can be obtained in the following ways, including:
  • the differential latent coding information and the reference 3D facial description information are respectively entropy coded to obtain the facial video bit stream.
  • in this way, after the latent coding information of the target facial video frame is obtained, a differential operation is performed between it and the preceding latent coding information corresponding to the previous facial video frame to obtain differential latent coding information, and the facial video bitstream is then generated based on the differential latent coding information; since the data volume of the differential latent coding information is smaller than that of the latent coding information, this further reduces the coding rate.
  • FIG. 4 is a schematic diagram of a specific scenario example in the embodiment shown in FIG. 3. In the following, the embodiment of the present application is described with a specific scenario example, referring to the schematic diagram shown in FIG. 4:
  • Acquire the reference facial video frame, the target facial video frame, and the 3D facial template; perform feature extraction on the reference facial video frame and the 3D facial template to obtain the reference 3D facial description information ω_r = {β_r, θ_r, l_r}; perform feature extraction on the target facial video frame and the 3D facial template to obtain the target 3D facial description information ω_t = {β_t, θ_t, l_t}; after quantizing ω_r, perform a differential operation with ω_t and input the result into the fully connected encoding model to output the latent coding information η_t; quantize η_t to obtain η_t′, perform a differential operation between η_t′ and the preceding latent coding information η_(t-1)′ of the previous facial video frame, and entropy-code the result; at the same time, quantize the reference 3D facial description information ω_r to obtain ω_r′ and entropy-code ω_r′, thereby obtaining the facial video bit stream.
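  • a minimal end-to-end sketch of this encoding flow, assuming a uniform rounding quantizer and treating the fully connected encoder and the entropy coder as injectable stand-ins (every function name below is illustrative, not from this application):

```python
import numpy as np

def quantize(x: np.ndarray, step: float = 0.01) -> np.ndarray:
    """Uniform scalar quantization (an assumed choice; the application
    does not specify the quantizer)."""
    return np.round(x / step).astype(np.int32)

def encode_frame(w_t, w_r, fc_encoder, entropy_encode, eta_prev_q, step=0.01):
    """Encode one target frame's description w_t against reference w_r,
    following Fig. 4: residual -> FC encoder -> quantize -> temporal
    residual -> entropy coding."""
    w_r_q = quantize(w_r, step)              # quantized reference description
    diff = w_t - w_r_q * step                # differential 3D description
    eta_t = fc_encoder(diff)                 # latent coding information
    eta_t_q = quantize(eta_t, step)          # quantized latent
    latent_residual = eta_t_q - eta_prev_q   # difference vs. previous frame's latent
    bitstream = (entropy_encode(latent_residual), entropy_encode(w_r_q))
    return bitstream, eta_t_q                # becomes the next frame's preceding latent

# Toy usage with stand-ins for the learned encoder and the entropy coder.
rng = np.random.default_rng(0)
fc_encoder = lambda d: d[:8]                 # stand-in dimension reduction
entropy_encode = lambda v: v.tobytes()       # stand-in entropy coder
w_r, w_t = rng.normal(size=16), rng.normal(size=16)
bits, eta_q = encode_frame(w_t, w_r, fc_encoder, entropy_encode,
                           eta_prev_q=np.zeros(8, dtype=np.int32))
```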
  • in the embodiment of the present application, in the encoding stage, the three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bit stream is obtained by encoding this information. Since the face itself is three-dimensional, describing it directly with three-dimensional facial description information is more accurate; facial video frames reconstructed from this more accurate description then differ only slightly in quality from the target facial video frame, which improves the quality of facial video frame reconstruction.
  • the facial video encoding method in this embodiment can be executed by any suitable electronic device with data processing capability, including but not limited to: server, PC, etc.
  • FIG. 5 is a flowchart of steps of a facial video decoding method according to Embodiment 2 of the present application.
  • the facial video decoding method provided in the embodiment of this application may correspond to the facial video decoding method executed by the receiving end in Figure 2 above.
  • the facial video decoding method provided by this embodiment includes the following steps:
  • Step 502 acquire facial video bit stream and three-dimensional facial template.
  • the facial video bit stream is obtained based on the target three-dimensional facial description information corresponding to the target facial video frame.
  • the 3D facial template can be transmitted directly from the encoding end, or can be obtained after receiving the reference facial video frame from the encoding end, based on the reference facial video frame, using a 3D facial reconstruction algorithm, optionally combined with manual interactive operations.
  • a specific method of obtaining a three-dimensional face template please refer to the detailed introduction in step 302, which will not be repeated here.
  • the 3D face description information is the information used to drive the 3D face template, and correspondingly, the target 3D face description information is the information used to drive the 3D face template to obtain the 3D face corresponding to the target face video frame.
  • the 3D facial description information may include: 3D expression information, 3D translation information, 3D angle information, 3D texture information, 3D shape information, and so on.
  • the specific content of the three-dimensional facial description information is not limited, and one or more of the above information may be selected as the facial description information according to actual needs.
  • Step 504 decoding the facial video bit stream to obtain target 3D facial description information.
  • the facial video bitstream may also include coded reference 3D facial description information; correspondingly, decoding the facial video bitstream to obtain the target 3D facial description information includes: performing entropy decoding on the facial video bitstream to obtain the latent coding information and the reference 3D facial description information; decoding the latent coding information to obtain the differential 3D facial description information; and summing the reference 3D facial description information and the differential 3D facial description information to obtain the target 3D facial description information. The differential 3D facial description information is obtained by performing a differential operation on the reference 3D facial description information and the target 3D facial description information.
  • a machine learning model can be used to decode the latent coding information to obtain differential 3D facial description information.
  • the latent coding information can be input into a fully connected decoding model so that the fully connected decoding model can output differential 3D facial description information .
  • performing entropy decoding on the facial video bit stream to obtain latent coding information and reference three-dimensional facial description information may include: performing entropy decoding on the facial video bit stream to obtain differential latent coding information and reference three-dimensional facial description information; acquiring the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; and summing the differential latent coding information and the preceding latent coding information to obtain the latent coding information.
  • Step 506 Based on the description information of the target 3D face, deform the 3D face template to obtain a reconstructed face video frame corresponding to the target face video frame.
  • FIG. 6 is a schematic diagram of a specific scenario example in the embodiment shown in FIG. 5.
  • an example of a specific scenario will be used to describe the embodiment of the present application:
  • Acquire the facial video bit stream and the 3D facial template, and perform entropy decoding on the facial video bit stream to obtain the quantized reference 3D facial description information ω_r′ and the differential latent coding information respectively; sum the differential latent coding information with the preceding latent coding information η_(t-1)′ corresponding to the previous facial video frame to obtain the quantized latent coding information η_t′; input η_t′ into the fully connected decoding model to obtain the differential 3D facial description information η_t; sum η_t and ω_r′ to obtain the reconstructed target 3D facial description information ω_t′; and, based on ω_t′ and the 3D facial template, use the 3DMM algorithm to obtain the reconstructed facial video frame.
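  • a matching decoder-side sketch under the same assumptions (stand-in callables, uniform dequantization; names are illustrative):

```python
import numpy as np

def decode_frame(latent_residual, w_r_q, fc_decoder, eta_prev_q,
                 deform_3dmm, template, step: float = 0.01):
    """Decode one frame following Fig. 6: temporal summation -> FC decoder ->
    add reference description -> drive the 3D facial template."""
    eta_t_q = latent_residual + eta_prev_q                 # recover quantized latent
    diff = fc_decoder(eta_t_q.astype(np.float32) * step)   # differential description
    w_t_rec = w_r_q * step + diff                          # reconstructed description
    frame = deform_3dmm(template, w_t_rec)                 # deformation, e.g. a 3DMM
    return frame, eta_t_q

# Toy usage with stand-ins for the learned decoder and the deformation step.
fc_decoder = lambda z: np.concatenate([z, z])   # stand-in dimension expansion
deform_3dmm = lambda tpl, w: tpl + w.mean()     # stand-in deformation
frame, eta_q = decode_frame(np.ones(8, dtype=np.int32),
                            w_r_q=np.zeros(16, dtype=np.int32),
                            fc_decoder=fc_decoder,
                            eta_prev_q=np.zeros(8, dtype=np.int32),
                            deform_3dmm=deform_3dmm,
                            template=np.zeros(16))
```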
  • in the embodiment of the present application, in the encoding stage, the three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bit stream is obtained by encoding this information. Since the face itself is three-dimensional, describing it directly with three-dimensional facial description information is more accurate; in the decoding stage, facial video frames are then reconstructed based on this more accurate description, and the quality difference between the reconstructed and target facial video frames is small, which improves the quality of facial video frame reconstruction.
  • the facial video decoding method in this embodiment can be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
  • FIG. 7 is a flowchart of steps of a model training method according to Embodiment 3 of the present application. Specifically, the model training method provided in this embodiment includes the following steps:
  • Step 702 Obtain target 3D face description sample information according to the target face video frame samples and 3D face template samples.
  • Step 704 Input the target 3D facial description sample information into the fully connected coding model to be trained to obtain latent coding sample information.
  • Step 706 Encode the latent coding sample information to obtain facial video bitstream samples.
  • Step 708 Decode the facial video bitstream samples to obtain the latent coding sample information; and input the latent coding sample information into the fully connected decoding model to be trained to obtain the target 3D facial description sample information.
  • Step 710 Based on the target 3D face description sample information, deform the 3D face template sample to obtain a reconstructed face video frame sample.
  • Step 712 Construct a rate loss function according to the transmission code rate corresponding to the facial video bitstream samples; construct a distortion loss function according to the reconstructed facial video frame samples and the target facial video frame samples.
  • step 714 a training loss function is obtained based on the rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
  • optionally, to further reduce the coding rate, while performing step 702, reference 3D facial description sample information may also be obtained based on a reference facial video frame sample and the 3D facial template sample; correspondingly, step 704 - step 710 may include:
  • based on the reference 3D facial description sample information and the target 3D facial description sample information, obtaining differential 3D facial description sample information; inputting the differential 3D facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information; encoding the latent coding sample information and the reference 3D facial description sample information to obtain the facial video bitstream sample; decoding the facial video bitstream sample to obtain the latent coding sample information and the reference 3D facial description sample information; inputting the latent coding sample information into the fully connected decoding model to be trained to obtain differential 3D facial description sample information; obtaining the target 3D facial description sample information based on the reference and differential 3D facial description sample information; and, based on the target 3D facial description sample information, deforming the 3D facial template sample to obtain the reconstructed facial video frame sample.
  • when constructing the distortion loss function, the result of a mean error calculation (e.g., MAE, Mean Absolute Error) between the target 3D facial description sample information and the reconstructed 3D facial description sample information can be used directly as the distortion loss function. Alternatively, the target facial video frame sample and the reconstructed facial video frame sample can first be input into an existing facial feature point extraction model to obtain, respectively, the target facial feature points corresponding to the target facial video frame sample and the reconstructed facial feature points corresponding to the reconstructed facial video frame sample, and a mean-error-based distortion loss function can then be constructed from the position information of the target facial feature points and that of the reconstructed facial feature points. The two approaches can also be fused: the mean error computed directly between the target and reconstructed 3D facial description sample information is used as a first distortion loss function, and is weighted and fused with a second distortion loss function obtained from the position information of the target and reconstructed facial feature points, so as to obtain the final training loss function.
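  • as a sketch of the two distortion terms just described — MAE over the description parameters, and a mean error over feature-point positions — with the feature-point extractor abstracted as a callable (all names here are illustrative):

```python
import numpy as np

def param_distortion(w_target: np.ndarray, w_recon: np.ndarray) -> float:
    """First distortion loss: MAE between the target and reconstructed
    3D facial description sample information."""
    return float(np.mean(np.abs(w_target - w_recon)))

def landmark_distortion(frame_target, frame_recon, extract_landmarks) -> float:
    """Second distortion loss: mean error between the feature-point positions
    extracted from the target and the reconstructed facial video frames."""
    p_t = extract_landmarks(frame_target)   # (K, 2) target feature points
    p_r = extract_landmarks(frame_recon)    # (K, 2) reconstructed feature points
    return float(np.mean(np.abs(p_t - p_r)))
```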
  • correspondingly, weight values can be set for the code rate loss function, the first distortion loss function, and the second distortion loss function, and the above three types of loss functions are then summed based on the set weight values to obtain the final training loss function. Specifically, see the following formula:
  • L = λ_1·L_R + λ_2·L_M + λ_3·L_L
  • where L is the final training loss function; L_R is the code rate loss function; L_M is the first distortion loss function; L_L is the second distortion loss function; and λ_1, λ_2, λ_3 are the weight values of the code rate loss function, the first distortion loss function, and the second distortion loss function, respectively.
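  • combining the three terms with the weights from the formula above, a minimal sketch (the rate surrogate and the placeholder weight values are our assumptions, not the patent's):

```python
def rate_loss(bitstream_bits: float, num_pixels: int) -> float:
    """Code rate loss L_R: here simply bits-per-pixel of the encoded sample
    (one plausible reading of 'transmission code rate'; an assumption)."""
    return bitstream_bits / num_pixels

def training_loss(L_R: float, L_M: float, L_L: float,
                  lam1: float = 0.01, lam2: float = 1.0, lam3: float = 1.0) -> float:
    """Weighted sum L = λ_1·L_R + λ_2·L_M + λ_3·L_L used to train the
    fully connected encoding and decoding models jointly."""
    return lam1 * L_R + lam2 * L_M + lam3 * L_L
```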
  • FIG. 8 is a schematic diagram of a scene corresponding to Embodiment 3 of the present application.
  • the code rate loss function is constructed based on the transmission code rate corresponding to the facial video bitstream samples;
  • the first distortion loss function is constructed based on the target 3D facial description sample information and the reconstructed 3D facial description sample information;
  • the second distortion loss function is obtained based on the target facial video frame samples and the reconstructed facial video frame samples.
  • the model training method in this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
  • FIG. 9 is a structural block diagram of a facial video encoding device according to Embodiment 4 of the present application.
  • the facial video encoding device provided by the embodiment of the present application includes:
  • the first obtaining module 902 is used to obtain the target facial video frame to be encoded and the 3D facial template corresponding to the reference facial video frame;
  • the target three-dimensional facial description information obtaining module 904 is used to perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information;
  • the encoding module 906 is used for encoding target 3D facial description information to obtain facial video bit stream.
  • the facial video encoding device also includes:
  • the reference three-dimensional facial description information obtaining module is used for feature extraction on the reference facial video frame and the three-dimensional facial template, and obtains the reference three-dimensional facial description information;
  • the encoding module 906 is specifically used to perform a differential operation on the reference 3D facial description information and the target 3D facial description information to obtain differential 3D facial description information; encode the differential 3D facial description information to obtain latent coding information, the dimension of the latent coding information being smaller than that of the differential 3D facial description information; and entropy-encode the latent coding information and the reference 3D facial description information respectively to obtain the facial video bit stream.
  • the encoding module 906 is specifically configured to:
  • the differential 3D facial description information is input to a fully connected encoding model, so that the fully connected encoding model outputs latent encoding information.
  • the facial video encoding device also includes:
  • the differential latent coding information obtaining module is used to, after the latent coding information is obtained, acquire the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame, and perform a differential operation on the latent coding information and the preceding latent coding information to obtain differential latent coding information;
  • correspondingly, when the encoding module 906 executes the step of entropy-encoding the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bit stream, it is specifically used to: entropy-encode the differential latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bit stream.
  • the target 3D facial description information includes at least one of the following: 3D expression information, 3D translation information, and 3D angle information.
  • the facial video encoding device of this embodiment is used to implement the corresponding facial video encoding methods in the aforementioned multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the facial video encoding device of this embodiment reference may be made to the description of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 10 is a structural block diagram of a facial video decoding device according to Embodiment 5 of the present application.
  • the facial video decoding device provided by the embodiment of the present application includes:
  • the second obtaining module 1002 is used to obtain facial video bitstream and three-dimensional facial template; facial video bitstream is obtained based on the target three-dimensional facial description information corresponding to the target facial video frame;
  • the decoding module 1004 is used to decode the facial video bit stream to obtain the target three-dimensional facial description information
  • the reconstructed face video frame obtaining module 1006 is configured to perform deformation processing on the three-dimensional face template based on the target three-dimensional face description information to obtain a reconstructed face video frame corresponding to the target face video frame.
  • the facial video bitstream also includes encoded reference 3D facial description information; the decoding module 1004 is specifically configured to: perform entropy decoding on the facial video bitstream to obtain latent coding information and reference 3D facial description information; decode the latent coding information to obtain differential 3D facial description information, which is obtained by a differential operation on the reference and target 3D facial description information; and sum the reference 3D facial description information and the differential 3D facial description information to obtain the target 3D facial description information.
  • the decoding module 1004 is specifically configured to:
  • the latent encoding information is fed into the fully-connected decoding model, so that the fully-connected decoding model outputs differential 3D facial description information.
  • when the decoding module 1004 performs the step of entropy-decoding the facial video bitstream to obtain the latent coding information and the reference 3D facial description information, it is specifically configured to: perform entropy decoding on the facial video bitstream to obtain differential latent coding information and reference three-dimensional facial description information; acquire the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; and sum the differential latent coding information and the preceding latent coding information to obtain the latent coding information.
  • the facial video decoding device in this embodiment is used to implement the corresponding facial video decoding methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the facial video decoding device of this embodiment reference may be made to the description of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • the target three-dimensional facial description sample information extraction module 1102 is used to obtain the target three-dimensional facial description sample information according to the target facial video frame sample and the three-dimensional facial template sample;
  • a fully connected coding module 1104 configured to input the target three-dimensional face description sample information into the fully connected coding model to be trained to obtain latent coding sample information;
  • the video stream sample obtaining module 1106 is used to encode the latent coding sample information to obtain facial video bitstream samples;
  • the fully connected decoding module 1108 is used to decode the facial video bitstream samples to obtain latent coding sample information; and input the latent coding sample information into the fully connected decoding model to be trained to obtain target three-dimensional facial description sample information;
  • the reconstructed facial video frame sample obtaining module 1110 is used to deform the 3D facial template sample based on the target 3D facial description sample information to obtain the reconstructed facial video frame sample;
  • the model training device in this embodiment is used to implement the corresponding model training methods in the aforementioned multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • the function implementation of each module in the model training device of this embodiment reference may be made to the descriptions of corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 12 shows a schematic structural diagram of an electronic device according to Embodiment 7 of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 1202, a communication interface (Communications Interface) 1204, a memory (memory) 1206, and a communication bus 1208.
  • the processor 1202 , the communication interface 1204 , and the memory 1206 communicate with each other through the communication bus 1208 .
  • the communication interface 1204 is used for communicating with other electronic devices or servers.
  • the processor 1202 is configured to execute the program 1210. Specifically, it may execute the above-mentioned facial video encoding method, or the facial video decoding method, or the relevant steps in the embodiment of the model training method.
  • the program 1210 may include program codes including computer operation instructions.
  • the processor 1202 may be a CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the smart device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 1206 is used to store the program 1210 .
  • the memory 1206 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
  • the program 1210 can specifically be used to make the processor 1202 perform the following operations: obtain the target facial video frame to be encoded and the three-dimensional facial template corresponding to the reference facial video frame; perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information; and encode the target three-dimensional facial description information to obtain a facial video bit stream.
  • the program 1210 can specifically be used to make the processor 1202 perform the following operations: acquire a facial video bitstream and a three-dimensional facial template, the facial video bitstream being obtained based on the target three-dimensional facial description information corresponding to the target facial video frame; decode the facial video bit stream to obtain the target three-dimensional facial description information; and, based on the target three-dimensional facial description information, deform the three-dimensional facial template to obtain the reconstructed facial video frame corresponding to the target facial video frame.
  • the program 1210 can specifically be used to make the processor 1202 perform the following operations: obtain the target 3D facial description sample information according to the target facial video frame sample and the 3D facial template sample; input the target 3D facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information; encode the latent coding sample information to obtain facial video bitstream samples; decode the facial video bitstream samples to obtain the latent coding sample information, and input the latent coding sample information into the fully connected decoding model to be trained to obtain the target 3D facial description sample information; based on the target 3D facial description sample information, deform the 3D facial template sample to obtain a reconstructed facial video frame sample; construct a code rate loss function according to the transmission code rate corresponding to the facial video bitstream samples; construct a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample; and obtain a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
  • with the electronic device of this embodiment, in the encoding stage, the three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bit stream is obtained by encoding this information. Since the face itself is three-dimensional, describing it directly with three-dimensional facial description information is more accurate; facial video frames reconstructed from this more accurate description then differ only slightly in quality from the target facial video frame. The embodiments of the present application can thus improve the quality of facial video frame reconstruction.
  • it can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the facial video encoding method, the facial video decoding method, or the model training method described herein is implemented. Furthermore, when a general-purpose computer accesses code for implementing the facial video encoding method, the facial video decoding method, or the model training method shown here, the execution of the code converts the general-purpose computer into a special-purpose computer for executing the corresponding method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a facial video encoding method, a decoding method, and a device. The facial video encoding method includes: acquiring a target facial video frame to be encoded and a three-dimensional facial template corresponding to a reference facial video frame; performing feature extraction on the target facial video frame and the three-dimensional facial template to obtain target three-dimensional facial description information; and encoding the target three-dimensional facial description information to obtain a facial video bitstream. In the embodiments of the present application, the face is described using three-dimensional facial description information, so the description information is more accurate; facial video frames are then reconstructed based on this more accurate three-dimensional facial description information, and the quality difference between the reconstructed facial video frame and the target facial video frame is small. The embodiments of the present application can improve the quality of facial video frame reconstruction.

Description

Facial video encoding method, decoding method and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 25, 2022, with application number 202210085764.X and entitled "Facial video encoding method, decoding method and device", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a facial video encoding method, decoding method and device.
Background
With the continuous development of video coding and decoding technology, video codec devices have been widely used in various scenarios, such as video conferencing, live video streaming, and so on.
At present, more traditional video coding and decoding methods usually extract and describe facial information from facial video frames based on two-dimensional features. The two-dimensional features are themselves obtained by mapping the original three-dimensional face, and this mapping process introduces a certain amount of warping and distortion; therefore, facial video coding and decoding based on such two-dimensional features ultimately yields reconstructed facial video frames of poor quality.
Summary
In view of this, the embodiments of the present application provide a facial video encoding method, decoding method, and device, so as to at least partially solve the above problem.
According to a first aspect of the embodiments of the present application, a facial video encoding method is provided, including:
acquiring a target facial video frame to be encoded and a three-dimensional facial template corresponding to a reference facial video frame;
performing feature extraction on the target facial video frame and the three-dimensional facial template to obtain target three-dimensional facial description information;
encoding the target three-dimensional facial description information to obtain a facial video bitstream.
According to a second aspect of the embodiments of the present application, a facial video decoding method is provided, including:
acquiring a facial video bitstream and a three-dimensional facial template, the facial video bitstream being obtained based on target three-dimensional facial description information corresponding to a target facial video frame;
decoding the facial video bitstream to obtain the target three-dimensional facial description information;
performing deformation processing on the three-dimensional facial template based on the target three-dimensional facial description information to obtain a reconstructed facial video frame corresponding to the target facial video frame.
According to a third aspect of the embodiments of the present application, a model training method is provided, including:
obtaining target three-dimensional facial description sample information according to a target facial video frame sample and a three-dimensional facial template sample;
inputting the target three-dimensional facial description sample information into a fully connected encoding model to be trained to obtain latent coding sample information;
encoding the latent coding sample information to obtain a facial video bitstream sample;
decoding the facial video bitstream sample to obtain the latent coding sample information, and inputting the latent coding sample information into a fully connected decoding model to be trained to obtain the target three-dimensional facial description sample information;
performing deformation processing on the three-dimensional facial template sample based on the target three-dimensional facial description sample information to obtain a reconstructed facial video frame sample;
constructing a code rate loss function according to the transmission code rate corresponding to the facial video bitstream sample; constructing a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample;
obtaining a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
According to a fourth aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other via the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the facial video encoding method according to the first aspect, or the facial video decoding method according to the second aspect, or the model training method according to the third aspect.
According to a fifth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the facial video encoding method according to the first aspect, or the facial video decoding method according to the second aspect, or the model training method according to the third aspect is implemented.
According to a sixth aspect of the embodiments of the present application, a computer program product is provided, including computer instructions that instruct a computing device to perform the operations corresponding to the facial video encoding method according to the first aspect, or the facial video decoding method according to the second aspect, or the model training method according to the third aspect.
According to the facial video encoding and decoding methods provided in the embodiments of the present application, in the encoding stage, three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bitstream is obtained by encoding this three-dimensional facial description information. Since the face itself is three-dimensional, describing the face directly with three-dimensional facial description information yields more accurate description information; when facial video frames are then reconstructed based on this more accurate three-dimensional facial description information, the quality difference between the reconstructed facial video frame and the target facial video frame is small. The embodiments of the present application can therefore improve the quality of facial video frame reconstruction.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the embodiments of the present application, and those of ordinary skill in the art can also obtain other drawings from them.
Fig. 1 is a schematic framework diagram of a codec method based on deep video generation;
Fig. 2 is a schematic diagram of a facial video communication scenario provided according to an embodiment of the present application;
Fig. 3 is a flowchart of the steps of a facial video encoding method according to Embodiment 1 of the present application;
Fig. 4 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 3;
Fig. 5 is a flowchart of the steps of a facial video decoding method according to Embodiment 2 of the present application;
Fig. 6 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 5;
Fig. 7 is a flowchart of the steps of a model training method according to Embodiment 3 of the present application;
Fig. 8 is a schematic diagram of a scenario example in the embodiment shown in Fig. 7;
Fig. 9 is a structural block diagram of a facial video encoding device according to Embodiment 4 of the present application;
Fig. 10 is a structural block diagram of a facial video decoding device according to Embodiment 5 of the present application;
Fig. 11 is a structural block diagram of a model training device according to Embodiment 6 of the present application;
Fig. 12 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
Referring to Fig. 1, Fig. 1 is a schematic framework diagram of a codec method based on deep video generation. The main principle of this method is to deform the reference frame based on the motion of the frame to be encoded, so as to obtain a reconstructed frame corresponding to the frame to be encoded. The basic framework of the codec method based on deep video generation is described below with reference to Fig. 1:
In the first step, in the encoding stage, the encoder uses a keypoint extractor to extract target keypoint information from the target facial video frame to be encoded and encodes the target keypoint information; at the same time, the reference facial video frame is encoded with a traditional image coding method (such as VVC, HEVC, etc.).
In the second step, in the decoding stage, the motion estimation module in the decoder extracts reference keypoint information from the reference facial video frame through the keypoint extractor, and performs dense motion estimation based on the reference keypoint information and the target keypoint information to obtain a dense motion estimation map and an occlusion map. The dense motion estimation map characterizes, in the feature domain represented by the keypoint information, the relative motion between the target facial video frame and the reference facial video frame; the occlusion map characterizes the degree to which each pixel in the target facial video frame is occluded.
In the third step, in the decoding stage, the generation module in the decoder deforms the reference facial video frame based on the dense motion estimation map to obtain a deformation result, and then multiplies the deformation result by the occlusion map to output the reconstructed facial video frame, as sketched below.
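A minimal sketch of this third step, assuming the dense motion estimation map is given as a normalized sampling grid and the occlusion map as per-pixel weights in [0, 1] (how the baseline actually parameterizes these is not specified here):

```python
import torch
import torch.nn.functional as F

def generate_reconstruction(reference, motion_grid, occlusion):
    """Warp the reference frame with the dense motion estimate, then weight
    by the occlusion map, as in the Fig. 1 baseline.
    reference: (N, 3, H, W); motion_grid: (N, H, W, 2) in [-1, 1], (x, y) order;
    occlusion: (N, 1, H, W) in [0, 1]."""
    warped = F.grid_sample(reference, motion_grid, align_corners=True)
    return warped * occlusion

# Toy identity-motion example: the grid samples each pixel from itself.
N, H, W = 1, 4, 4
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                        indexing="ij")
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, H, W, 2)
out = generate_reconstruction(torch.rand(N, 3, H, W), grid,
                              torch.ones(N, 1, H, W))
```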
In the method shown in Fig. 1, facial information is extracted and described based on two-dimensional information (keypoint information) extracted from two-dimensional facial video frames, and video frames are then reconstructed on that basis. Since two-dimensional features are themselves obtained by mapping the original three-dimensional face, and this mapping process introduces a certain amount of warping and distortion, coding and decoding facial video based on such two-dimensional features ultimately yields reconstructed facial video frames of poor quality.
In the embodiments of the present application, in the encoding stage, three-dimensional facial description information is extracted from the target facial video frame based on a three-dimensional facial template, and the facial video bitstream is obtained by encoding this three-dimensional facial description information. Since the face itself is three-dimensional, describing the face directly with three-dimensional facial description information yields more accurate description information; facial video frames reconstructed from this more accurate description then differ only slightly in quality from the target facial video frames. The quality of facial video frame reconstruction can therefore be improved.
The specific implementation of the embodiments of the present application is further described below with reference to the accompanying drawings.
Referring to Fig. 2, Fig. 2 is a schematic diagram of a facial video communication scenario provided according to an embodiment of the present application. For ease of understanding, the overall facial video communication process of the embodiments of the present application is first explained with reference to Fig. 2. The whole communication process includes: a facial video encoding process performed by the sending end, and a facial video decoding process performed by the receiving end.
The facial video encoding process performed by the sending end includes: acquiring a captured facial video composed of multiple consecutive facial video frames, as well as the three-dimensional facial template corresponding to the face in the facial video; inputting each facial video frame {I_t | t = 0, 1, 2, ..., N} and the three-dimensional facial template into the feature extractor of the sending end, so as to obtain the three-dimensional facial description information of each facial video frame I_t (Fig. 2 takes three-dimensional expression information β_t, three-dimensional translation information l_t, and three-dimensional angle information θ_t as examples of three-dimensional facial description information, which does not constitute a limitation on its specific content); and then encoding this three-dimensional facial description information to obtain the facial video bitstream χ_t, which is sent to the receiving end for decoding.
The facial video decoding process performed by the receiving end includes: decoding the received facial video bitstream χ_t to obtain the three-dimensional facial description information, and performing deformation processing on the three-dimensional facial template based on this description information (Fig. 2 takes deformation of the three-dimensional facial template with a 3D morphable face model (3DMM) as an example, which does not constitute a limitation on the deformation method in this application) to obtain the reconstructed facial video, for use by subsequent applications based on three-dimensional facial models (such as immersive virtual reality, video conferencing/live streaming, and so on).
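As a rough sketch of what "driving" the template with {β_t, θ_t, l_t} can look like in a 3DMM-style model (a linear expression basis followed by rotation and translation; the array shapes and names below are assumptions for illustration, and rendering the deformed mesh to an image is omitted):

```python
import numpy as np

def drive_template(vertices, expr_basis, beta, theta, l):
    """Deform a 3D facial template with 3DMM-style parameters.
    vertices: (V, 3) template mesh; expr_basis: (V, 3, E) expression basis;
    beta: (E,) expression coefficients; theta: (3,) Euler angles in radians;
    l: (3,) translation vector."""
    deformed = vertices + expr_basis @ beta          # expression offset, (V, 3)
    cx, cy, cz = np.cos(theta); sx, sy, sz = np.sin(theta)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return deformed @ (Rz @ Ry @ Rx).T + l           # rotate, then translate

# Toy usage: 4 vertices, 2 expression components.
verts = np.zeros((4, 3))
basis = np.random.default_rng(1).normal(size=(4, 3, 2))
out = drive_template(verts, basis, beta=np.array([0.5, -0.2]),
                     theta=np.array([0.0, 0.1, 0.0]), l=np.array([0.0, 0.0, 1.0]))
```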
Embodiment 1
Referring to Fig. 3, Fig. 3 is a flowchart of the steps of a facial video encoding method according to Embodiment 1 of the present application. The facial video encoding method provided in this embodiment of the present application may correspond to the facial video encoding process performed by the sending end in Fig. 2 above. Specifically, the facial video encoding method provided in this embodiment includes the following steps:
Step 302: Acquire the target facial video frame to be encoded and the three-dimensional facial template corresponding to the reference facial video frame.
The three-dimensional facial template is a digital representation of the face in space and admits a variety of representation methods, such as point clouds, polygon meshes, voxels, and so on. For example, for a three-dimensional facial model represented by voxels, the specific description form can be the position information of each voxel, the pixel value of each voxel, and so on.
Specifically, the template may be obtained based on the reference facial video frame using a three-dimensional facial reconstruction algorithm, optionally combined with manual interactive operations. For example, the process of obtaining the three-dimensional facial template may include:
performing three-dimensional facial reconstruction based on the reference facial video frame to obtain an initial three-dimensional facial template; obtaining the three-dimensional facial template in response to an editing operation on the initial three-dimensional facial template.
That is to say, an initial three-dimensional facial template can first be obtained using an existing three-dimensional facial reconstruction algorithm; however, the accuracy of the initial template may be low. The position information, pixel values, and so on of each voxel can then be adjusted manually so as to represent the face in the reference facial video frame more accurately.
Step 304: Perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information.
The three-dimensional facial description information is the information used to drive the three-dimensional facial template; correspondingly, the target three-dimensional facial description information is the information used to drive the three-dimensional facial template to obtain the three-dimensional face corresponding to the target facial video frame.
The three-dimensional facial description information may include: three-dimensional expression information, three-dimensional translation information, three-dimensional angle information, three-dimensional texture information, three-dimensional shape information, and so on. The embodiments of the present application do not limit the specific content of the three-dimensional facial description information; one or more of the above types of information may be selected as the facial description information according to actual needs.
Step 306: Encode the target three-dimensional facial description information to obtain the facial video bitstream.
To reduce the coding rate, optionally, in some of the embodiments, before encoding the target three-dimensional facial description information to obtain the facial video bitstream, the method may further include:
performing feature extraction on the reference facial video frame and the three-dimensional facial template to obtain reference three-dimensional facial description information. Correspondingly, encoding the target three-dimensional facial description information to obtain the facial video bitstream may include: performing a differential operation on the reference three-dimensional facial description information and the target three-dimensional facial description information to obtain differential three-dimensional facial description information; encoding the differential three-dimensional facial description information to obtain latent coding information, the dimension of the latent coding information being smaller than that of the differential three-dimensional facial description information; and entropy-coding the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
In the above coding scheme, first, the facial video bitstream is generated based on the differential result of the reference and target three-dimensional facial description information. Since the data volume of the differential result is lower than that of the target three-dimensional facial description information, this approach can effectively reduce the coding rate of the facial video stream compared with generating the bitstream directly from the target three-dimensional facial description information. Second, after the differential result is obtained, it is further encoded (dimension-reduction processing), so that the dimension of the resulting latent coding information is smaller than that of the differential result; this further reduces the amount of data to be coded, and generating the facial video bitstream from the dimension-reduced latent coding information can therefore further reduce the coding rate.
Further, a machine learning model may be used to encode the differential three-dimensional facial description information to obtain the latent coding information. Specifically, the differential three-dimensional facial description information may be input into a fully connected encoding model, so that the fully connected encoding model outputs the latent coding information.
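A minimal sketch of such a fully connected encoder/decoder pair in PyTorch (the layer sizes and activations are our assumptions; the application only requires that the encoder reduce the dimension of the differential description information and that the decoder invert the mapping):

```python
import torch
import torch.nn as nn

DESC_DIM, LATENT_DIM = 64, 16   # assumed description / latent dimensions

class FCEncoder(nn.Module):
    """Maps differential 3D facial description information to a lower-dimensional latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DESC_DIM, 32), nn.ReLU(),
                                 nn.Linear(32, LATENT_DIM))
    def forward(self, x):
        return self.net(x)

class FCDecoder(nn.Module):
    """Maps the (quantized) latent back to differential description information."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(),
                                 nn.Linear(32, DESC_DIM))
    def forward(self, z):
        return self.net(z)

diff = torch.randn(1, DESC_DIM)
latent = FCEncoder()(diff)          # (1, LATENT_DIM): dimension-reduced
recon_diff = FCDecoder()(latent)    # (1, DESC_DIM)
```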
Optionally, in some of the embodiments, to further reduce the coding rate, the following operations may also be performed after the latent coding information is obtained:
acquiring the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; performing a differential operation on the latent coding information and the preceding latent coding information to obtain differential latent coding information.
Correspondingly, the facial video bitstream may be obtained in the following way:
entropy-coding the differential latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
In this way, after the latent coding information of the target facial video frame is obtained, a differential operation is performed between it and the preceding latent coding information corresponding to the previous facial video frame to obtain differential latent coding information, and the facial video bitstream is then generated based on the differential latent coding information. Since the data volume of the differential latent coding information is smaller than that of the latent coding information, generating the facial video bitstream in this way can further reduce the coding rate.
Referring to Fig. 4, Fig. 4 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 3. The embodiment of the present application is described below with a specific scenario example, with reference to the schematic diagram shown in Fig. 4:
Acquire the reference facial video frame, the target facial video frame, and the three-dimensional facial template. Perform feature extraction on the reference facial video frame and the three-dimensional facial template to obtain the reference three-dimensional facial description information ω_r = {β_r, θ_r, l_r}; perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information ω_t = {β_t, θ_t, l_t}. After quantizing ω_r, perform a differential operation with ω_t and input the result into the fully connected encoding model to output the latent coding information η_t; quantize η_t to obtain η_t′, perform a differential operation between η_t′ and the preceding latent coding information η_(t-1)′ corresponding to the previous facial video frame, and entropy-code the result. At the same time, quantize the reference three-dimensional facial description information ω_r to obtain ω_r′ and entropy-code ω_r′, thereby obtaining the facial video bitstream.
In this embodiment of the present application, in the encoding stage, three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bitstream is obtained by encoding this information. Since the face itself is three-dimensional, describing it directly with three-dimensional facial description information is more accurate; facial video frames subsequently reconstructed from this more accurate description then differ only slightly in quality from the target facial video frame, which improves the quality of facial video frame reconstruction.
The facial video encoding method of this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
Embodiment 2
Referring to Fig. 5, Fig. 5 is a flowchart of the steps of a facial video decoding method according to Embodiment 2 of the present application. The facial video decoding method provided in this embodiment of the present application may correspond to the facial video decoding process performed by the receiving end in Fig. 2 above. Specifically, the facial video decoding method provided in this embodiment includes the following steps:
Step 502: Acquire the facial video bitstream and the three-dimensional facial template.
The facial video bitstream is obtained based on the target three-dimensional facial description information corresponding to the target facial video frame.
The three-dimensional facial template may be transmitted directly from the encoding end, or may be obtained, after receiving the reference facial video frame from the encoding end, based on the reference facial video frame using a three-dimensional facial reconstruction algorithm, optionally combined with manual interactive operations. For the specific method of obtaining the three-dimensional facial template, refer to the detailed introduction in Step 302, which is not repeated here.
The three-dimensional facial description information is the information used to drive the three-dimensional facial template; correspondingly, the target three-dimensional facial description information is the information used to drive the three-dimensional facial template to obtain the three-dimensional face corresponding to the target facial video frame.
The three-dimensional facial description information may include: three-dimensional expression information, three-dimensional translation information, three-dimensional angle information, three-dimensional texture information, three-dimensional shape information, and so on. The embodiments of the present application do not limit the specific content of the three-dimensional facial description information; one or more of the above types of information may be selected as the facial description information according to actual needs.
Step 504: Decode the facial video bitstream to obtain the target three-dimensional facial description information.
Optionally, the facial video bitstream may also include the encoded reference three-dimensional facial description information. Correspondingly, decoding the facial video bitstream to obtain the target three-dimensional facial description information includes:
performing entropy decoding on the facial video bitstream to obtain the latent coding information and the reference three-dimensional facial description information; decoding the latent coding information to obtain the differential three-dimensional facial description information; and performing a summation operation on the reference three-dimensional facial description information and the differential three-dimensional facial description information to obtain the target three-dimensional facial description information. The differential three-dimensional facial description information is obtained by performing a differential operation on the reference and target three-dimensional facial description information.
Further, a machine learning model may be used to decode the latent coding information to obtain the differential three-dimensional facial description information. Specifically, the latent coding information may be input into a fully connected decoding model, so that the fully connected decoding model outputs the differential three-dimensional facial description information.
Further, performing entropy decoding on the facial video bitstream to obtain the latent coding information and the reference three-dimensional facial description information may include:
performing entropy decoding on the facial video bitstream to obtain the differential latent coding information and the reference three-dimensional facial description information; acquiring the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; and performing a summation operation on the differential latent coding information and the preceding latent coding information to obtain the latent coding information.
Step 506: Based on the target three-dimensional facial description information, perform deformation processing on the three-dimensional facial template to obtain the reconstructed facial video frame corresponding to the target facial video frame.
Referring to Fig. 6, Fig. 6 is a schematic diagram of a specific scenario example in the embodiment shown in Fig. 5. The embodiment of the present application is described below with a specific scenario example, with reference to the schematic diagram shown in Fig. 6:
Acquire the facial video bitstream and the three-dimensional facial template, and perform entropy decoding on the facial video bitstream to obtain the quantized reference three-dimensional facial description information ω_r′ and the differential latent coding information respectively. Perform a summation operation on the differential latent coding information and the preceding latent coding information η_(t-1)′ corresponding to the previous facial video frame to obtain the quantized latent coding information η_t′; input η_t′ into the fully connected decoding model to obtain the differential three-dimensional facial description information η_t; perform a summation operation on η_t and ω_r′ to obtain the reconstructed target three-dimensional facial description information ω_t′; and, based on ω_t′ and the three-dimensional facial template, use the 3DMM algorithm to obtain the reconstructed facial video frame.
In this embodiment of the present application, in the encoding stage, three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bitstream is obtained by encoding this information. Since the face itself is three-dimensional, describing it directly with three-dimensional facial description information is more accurate; in the decoding stage, facial video frames are then reconstructed based on this more accurate description, and the quality difference between the reconstructed and target facial video frames is small, which improves the quality of facial video frame reconstruction.
The facial video decoding method of this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
Embodiment 3
Referring to Fig. 7, Fig. 7 is a flowchart of the steps of a model training method according to Embodiment 3 of the present application. Specifically, the model training method provided in this embodiment includes the following steps:
Step 702: Obtain target three-dimensional facial description sample information according to the target facial video frame sample and the three-dimensional facial template sample.
Step 704: Input the target three-dimensional facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information.
Step 706: Encode the latent coding sample information to obtain a facial video bitstream sample.
Step 708: Decode the facial video bitstream sample to obtain the latent coding sample information; and input the latent coding sample information into the fully connected decoding model to be trained to obtain the target three-dimensional facial description sample information.
Step 710: Based on the target three-dimensional facial description sample information, perform deformation processing on the three-dimensional facial template sample to obtain a reconstructed facial video frame sample.
Step 712: Construct a code rate loss function according to the transmission code rate corresponding to the facial video bitstream sample; construct a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample.
Step 714: Obtain a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
Optionally, to further reduce the coding rate, while performing Step 702, reference three-dimensional facial description sample information may also be obtained based on a reference facial video frame sample and the three-dimensional facial template sample. Correspondingly, Steps 704 to 710 may include:
obtaining differential three-dimensional facial description sample information based on the reference and target three-dimensional facial description sample information; inputting the differential three-dimensional facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information; encoding the latent coding sample information and the reference three-dimensional facial description sample information to obtain the facial video bitstream sample; decoding the facial video bitstream sample to obtain the latent coding sample information and the reference three-dimensional facial description sample information; inputting the latent coding sample information into the fully connected decoding model to be trained to obtain differential three-dimensional facial description sample information; obtaining the target three-dimensional facial description sample information based on the reference and differential three-dimensional facial description sample information; and, based on the target three-dimensional facial description sample information, deforming the three-dimensional facial template sample to obtain the reconstructed facial video frame sample.
In Step 712, when constructing the distortion loss function, the result of a mean error calculation (such as MAE, Mean Absolute Error) between the target three-dimensional facial description sample information and the reconstructed three-dimensional facial description sample information may be used directly as the distortion loss function. Alternatively, the target facial video frame sample and the reconstructed facial video frame sample may first be input into an existing facial feature point extraction model to obtain, respectively, the target facial feature points corresponding to the target facial video frame sample and the reconstructed facial feature points corresponding to the reconstructed facial video frame sample, and a mean-error-based distortion loss function may then be constructed from the position information of the target facial feature points and that of the reconstructed facial feature points. The two approaches may also be fused: the mean error computed directly between the target and reconstructed three-dimensional facial description sample information is used as a first distortion loss function and is weighted and fused with a second distortion loss function obtained from the position information of the target and reconstructed facial feature points, thereby obtaining the final training loss function.
Correspondingly, in Step 712, corresponding weight values may be set for the code rate loss function, the first distortion loss function, and the second distortion loss function, and the above three types of loss functions are then summed based on the set weight values to obtain the final training loss function. Specifically, see the following formula:
L = λ_1·L_R + λ_2·L_M + λ_3·L_L
where L is the final training loss function; L_R is the code rate loss function; L_M is the first distortion loss function; L_L is the second distortion loss function; and λ_1, λ_2, λ_3 are the weight values of the code rate loss function, the first distortion loss function, and the second distortion loss function, respectively.
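Putting the pieces together, one hypothetical rate-distortion training step over these losses might look as follows (the additive-noise quantization proxy and the simple rate surrogate are assumptions commonly used in learned compression, not details given in this application; the feature-point term is omitted for brevity):

```python
import torch

def train_step(fc_encoder, fc_decoder, optimizer, diff_desc, w_target, w_ref_q,
               lam=(0.01, 1.0, 1.0)):
    """One sketched training step minimizing L = λ_1·L_R + λ_2·L_M + λ_3·L_L."""
    eta = fc_encoder(diff_desc)
    eta_q = eta + torch.empty_like(eta).uniform_(-0.5, 0.5)  # differentiable "quantization"
    w_recon = w_ref_q + fc_decoder(eta_q)          # reconstructed description
    L_R = eta_q.abs().mean()                       # crude stand-in for the rate term
    L_M = (w_recon - w_target).abs().mean()        # first distortion loss (MAE)
    L_L = torch.tensor(0.0)                        # feature-point loss omitted here
    loss = lam[0] * L_R + lam[1] * L_M + lam[2] * L_L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```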
For the specific execution process of each step in this embodiment of the present application, refer to the corresponding steps in the foregoing embodiments, which are not repeated here.
Referring to Fig. 8, Fig. 8 is a schematic diagram of the scenario corresponding to Embodiment 3 of the present application. On the basis of Fig. 4 and Fig. 6, this figure adds the code rate loss function, the first distortion loss function, and the second distortion loss function. It can be seen from Fig. 8 that: the code rate loss function is constructed based on the transmission code rate corresponding to the facial video bitstream sample; the first distortion loss function is constructed based on the target and reconstructed three-dimensional facial description sample information; and the second distortion loss function is obtained based on the target and reconstructed facial video frame samples.
The model training method of this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
Embodiment 4
Referring to Fig. 9, Fig. 9 is a structural block diagram of a facial video encoding device according to Embodiment 4 of the present application. The facial video encoding device provided in this embodiment of the present application includes:
a first obtaining module 902, used to obtain the target facial video frame to be encoded and the three-dimensional facial template corresponding to the reference facial video frame;
a target three-dimensional facial description information obtaining module 904, used to perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information;
an encoding module 906, used to encode the target three-dimensional facial description information to obtain the facial video bitstream.
Optionally, in some of the embodiments, the facial video encoding device further includes:
a reference three-dimensional facial description information obtaining module, used to perform feature extraction on the reference facial video frame and the three-dimensional facial template to obtain the reference three-dimensional facial description information;
the encoding module 906 is specifically used to perform a differential operation on the reference three-dimensional facial description information and the target three-dimensional facial description information to obtain differential three-dimensional facial description information; encode the differential three-dimensional facial description information to obtain latent coding information, the dimension of the latent coding information being smaller than that of the differential three-dimensional facial description information; and entropy-encode the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
Optionally, in some of the embodiments, when the encoding module 906 performs the step of encoding the differential three-dimensional facial description information to obtain the latent coding information, it is specifically used to:
input the differential three-dimensional facial description information into the fully connected encoding model, so that the fully connected encoding model outputs the latent coding information.
Optionally, in some of the embodiments, the facial video encoding device further includes:
a differential latent coding information obtaining module, used to, after the latent coding information is obtained, acquire the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame, and perform a differential operation on the latent coding information and the preceding latent coding information to obtain differential latent coding information;
correspondingly, when the encoding module 906 performs the step of entropy-encoding the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream, it is specifically used to: entropy-encode the differential latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
Optionally, in some of the embodiments, the target three-dimensional facial description information includes at least one of the following: three-dimensional expression information, three-dimensional translation information, and three-dimensional angle information.
The facial video encoding device of this embodiment is used to implement the corresponding facial video encoding methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here. In addition, for the functional implementation of each module in the facial video encoding device of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which is likewise not repeated here.
Embodiment 5
Referring to Fig. 10, Fig. 10 is a structural block diagram of a facial video decoding device according to Embodiment 5 of the present application. The facial video decoding device provided in this embodiment of the present application includes:
a second obtaining module 1002, used to obtain the facial video bitstream and the three-dimensional facial template, the facial video bitstream being obtained based on the target three-dimensional facial description information corresponding to the target facial video frame;
a decoding module 1004, used to decode the facial video bitstream to obtain the target three-dimensional facial description information;
a reconstructed facial video frame obtaining module 1006, used to perform deformation processing on the three-dimensional facial template based on the target three-dimensional facial description information to obtain the reconstructed facial video frame corresponding to the target facial video frame.
Optionally, in some of the embodiments, the facial video bitstream further includes encoded reference three-dimensional facial description information; the decoding module 1004 is specifically used to: perform entropy decoding on the facial video bitstream to obtain latent coding information and reference three-dimensional facial description information; decode the latent coding information to obtain differential three-dimensional facial description information, the differential three-dimensional facial description information being obtained by a differential operation on the reference and target three-dimensional facial description information; and sum the reference three-dimensional facial description information and the differential three-dimensional facial description information to obtain the target three-dimensional facial description information.
Optionally, in some of the embodiments, when the decoding module 1004 performs the step of decoding the latent coding information to obtain the differential three-dimensional facial description information, it is specifically used to:
input the latent coding information into the fully connected decoding model, so that the fully connected decoding model outputs the differential three-dimensional facial description information.
Optionally, in some of the embodiments, when the decoding module 1004 performs the step of entropy-decoding the facial video bitstream to obtain the latent coding information and the reference three-dimensional facial description information, it is specifically used to: perform entropy decoding on the facial video bitstream to obtain differential latent coding information and reference three-dimensional facial description information; acquire the preceding latent coding information corresponding to the facial video frame immediately before the target facial video frame; and sum the differential latent coding information and the preceding latent coding information to obtain the latent coding information.
The facial video decoding device of this embodiment is used to implement the corresponding facial video decoding methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here. In addition, for the functional implementation of each module in the facial video decoding device of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which is likewise not repeated here.
Embodiment 6
Referring to Fig. 11, Fig. 11 is a structural block diagram of a model training device according to Embodiment 6 of the present application. The model training device provided in this embodiment of the present application includes:
a target three-dimensional facial description sample information extraction module 1102, used to obtain target three-dimensional facial description sample information according to the target facial video frame sample and the three-dimensional facial template sample;
a fully connected encoding module 1104, used to input the target three-dimensional facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information;
a video stream sample obtaining module 1106, used to encode the latent coding sample information to obtain a facial video bitstream sample;
a fully connected decoding module 1108, used to decode the facial video bitstream sample to obtain the latent coding sample information, and to input the latent coding sample information into the fully connected decoding model to be trained to obtain the target three-dimensional facial description sample information;
a reconstructed facial video frame sample obtaining module 1110, used to perform deformation processing on the three-dimensional facial template sample based on the target three-dimensional facial description sample information to obtain a reconstructed facial video frame sample;
a loss function construction module 1112, used to construct a code rate loss function according to the transmission code rate corresponding to the facial video bitstream sample, and to construct a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample;
a training module 1114, used to obtain a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
The model training device of this embodiment is used to implement the corresponding model training methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here. In addition, for the functional implementation of each module in the model training device of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which is likewise not repeated here.
Embodiment 7
Referring to Fig. 12, a schematic structural diagram of an electronic device according to Embodiment 7 of the present application is shown; the specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in Fig. 12, the electronic device may include: a processor (processor) 1202, a communication interface (Communications Interface) 1204, a memory (memory) 1206, and a communication bus 1208.
Wherein:
the processor 1202, the communication interface 1204, and the memory 1206 communicate with each other through the communication bus 1208.
The communication interface 1204 is used to communicate with other electronic devices or servers.
The processor 1202 is used to execute the program 1210, and may specifically perform the relevant steps in the above facial video encoding method, facial video decoding method, or model training method embodiments.
Specifically, the program 1210 may include program code, and the program code includes computer operation instructions.
The processor 1202 may be a CPU, an application-specific integrated circuit (ASIC, Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 1206 is used to store the program 1210. The memory 1206 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 1210 may specifically be used to cause the processor 1202 to perform the following operations: acquire the target facial video frame to be encoded and the three-dimensional facial template corresponding to the reference facial video frame; perform feature extraction on the target facial video frame and the three-dimensional facial template to obtain the target three-dimensional facial description information; and encode the target three-dimensional facial description information to obtain the facial video bitstream.
Alternatively,
the program 1210 may specifically be used to cause the processor 1202 to perform the following operations: acquire a facial video bitstream and a three-dimensional facial template, the facial video bitstream being obtained based on the target three-dimensional facial description information corresponding to the target facial video frame; decode the facial video bitstream to obtain the target three-dimensional facial description information; and, based on the target three-dimensional facial description information, perform deformation processing on the three-dimensional facial template to obtain the reconstructed facial video frame corresponding to the target facial video frame.
Alternatively,
the program 1210 may specifically be used to cause the processor 1202 to perform the following operations: obtain target three-dimensional facial description sample information according to the target facial video frame sample and the three-dimensional facial template sample; input the target three-dimensional facial description sample information into the fully connected encoding model to be trained to obtain latent coding sample information; encode the latent coding sample information to obtain a facial video bitstream sample; decode the facial video bitstream sample to obtain the latent coding sample information, and input the latent coding sample information into the fully connected decoding model to be trained to obtain the target three-dimensional facial description sample information; based on the target three-dimensional facial description sample information, perform deformation processing on the three-dimensional facial template sample to obtain a reconstructed facial video frame sample; construct a code rate loss function according to the transmission code rate corresponding to the facial video bitstream sample; construct a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample; and obtain a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
For the specific implementation of each step in the program 1210, reference may be made to the corresponding descriptions in the corresponding steps and units of the above facial video encoding method, facial video decoding method, or model training method embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which are likewise not repeated here.
With the electronic device of this embodiment, in the encoding stage, the three-dimensional facial description information is extracted from the target facial video frame based on the three-dimensional facial template, and the facial video bitstream is obtained by encoding this three-dimensional facial description information. Since the face itself is three-dimensional, describing the face directly with three-dimensional facial description information yields more accurate description information; facial video frames reconstructed from this more accurate description then differ only slightly in quality from the target facial video frame. The embodiments of the present application can improve the quality of facial video frame reconstruction.
The embodiments of the present application further provide a computer program product, including computer instructions that instruct a computing device to perform the operations corresponding to any one of the methods in the foregoing method embodiments.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps, so as to achieve the purpose of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein can be processed as such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the facial video encoding method, the facial video decoding method, or the model training method described herein is implemented. Furthermore, when a general-purpose computer accesses code for implementing the facial video encoding method, the facial video decoding method, or the model training method shown here, the execution of the code converts the general-purpose computer into a special-purpose computer for executing the corresponding method.
Those of ordinary skill in the art may realize that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the embodiments of the present application.
The above implementations are only used to illustrate the embodiments of the present application and are not intended to limit them. Those of ordinary skill in the relevant technical field can also make various changes and modifications without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application shall be defined by the claims.

Claims (12)

  1. A facial video encoding method, comprising:
    acquiring a target facial video frame to be encoded and a three-dimensional facial template corresponding to a reference facial video frame;
    performing feature extraction on the target facial video frame and the three-dimensional facial template to obtain target three-dimensional facial description information;
    encoding the target three-dimensional facial description information to obtain a facial video bitstream.
  2. The method according to claim 1, wherein before the encoding the target three-dimensional facial description information to obtain a facial video bitstream, the method further comprises:
    performing feature extraction on the reference facial video frame and the three-dimensional facial template to obtain reference three-dimensional facial description information;
    the encoding the target three-dimensional facial description information to obtain a facial video bitstream comprises:
    performing a differential operation on the reference three-dimensional facial description information and the target three-dimensional facial description information to obtain differential three-dimensional facial description information;
    encoding the differential three-dimensional facial description information to obtain latent coding information, a dimension value of the latent coding information being smaller than that of the differential three-dimensional facial description information;
    performing entropy coding on the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
  3. The method according to claim 2, wherein the encoding the differential three-dimensional facial description information to obtain latent coding information comprises:
    inputting the differential three-dimensional facial description information into a fully connected encoding model, so that the fully connected encoding model outputs the latent coding information.
  4. The method according to claim 2, wherein after the obtaining latent coding information, the method further comprises:
    acquiring preceding latent coding information corresponding to a facial video frame immediately before the target facial video frame;
    performing a differential operation on the latent coding information and the preceding latent coding information to obtain differential latent coding information;
    the performing entropy coding on the latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream comprises:
    performing entropy coding on the differential latent coding information and the reference three-dimensional facial description information respectively to obtain the facial video bitstream.
  5. The method according to claim 1, wherein the target three-dimensional facial description information comprises at least one of the following: three-dimensional expression information, three-dimensional translation information, and three-dimensional angle information.
  6. The method according to claim 1, wherein the process of obtaining the three-dimensional facial template comprises:
    performing three-dimensional facial reconstruction based on the reference facial video frame to obtain an initial three-dimensional facial template;
    obtaining the three-dimensional facial template in response to an editing operation on the initial three-dimensional facial template.
  7. A facial video decoding method, comprising:
    acquiring a facial video bitstream and a three-dimensional facial template, the facial video bitstream being obtained based on target three-dimensional facial description information corresponding to a target facial video frame;
    decoding the facial video bitstream to obtain the target three-dimensional facial description information;
    performing deformation processing on the three-dimensional facial template based on the target three-dimensional facial description information to obtain a reconstructed facial video frame corresponding to the target facial video frame.
  8. The method according to claim 7, wherein the facial video bitstream further comprises encoded reference three-dimensional facial description information;
    the decoding the facial video bitstream to obtain the target three-dimensional facial description information comprises:
    performing entropy decoding on the facial video bitstream to obtain latent coding information and reference three-dimensional facial description information;
    decoding the latent coding information to obtain differential three-dimensional facial description information, the differential three-dimensional facial description information being obtained by performing a differential operation on the reference three-dimensional facial description information and the target three-dimensional facial description information;
    performing a summation operation on the reference three-dimensional facial description information and the differential three-dimensional facial description information to obtain the target three-dimensional facial description information.
  9. The method according to claim 8, wherein the decoding the latent coding information to obtain differential three-dimensional facial description information comprises:
    inputting the latent coding information into a fully connected decoding model, so that the fully connected decoding model outputs the differential three-dimensional facial description information.
  10. The method according to claim 8, wherein the performing entropy decoding on the facial video bitstream to obtain latent coding information and reference three-dimensional facial description information comprises:
    performing entropy decoding on the facial video bitstream to obtain differential latent coding information and reference three-dimensional facial description information;
    acquiring preceding latent coding information corresponding to a facial video frame immediately before the target facial video frame;
    performing a summation operation on the differential latent coding information and the preceding latent coding information to obtain the latent coding information.
  11. A model training method, comprising:
    obtaining target three-dimensional facial description sample information according to a target facial video frame sample and a three-dimensional facial template sample;
    inputting the target three-dimensional facial description sample information into a fully connected encoding model to be trained to obtain latent coding sample information;
    encoding the latent coding sample information to obtain a facial video bitstream sample;
    decoding the facial video bitstream sample to obtain the latent coding sample information, and inputting the latent coding sample information into a fully connected decoding model to be trained to obtain the target three-dimensional facial description sample information;
    performing deformation processing on the three-dimensional facial template sample based on the target three-dimensional facial description sample information to obtain a reconstructed facial video frame sample;
    constructing a code rate loss function according to a transmission code rate corresponding to the facial video bitstream sample; constructing a distortion loss function according to the reconstructed facial video frame sample and the target facial video frame sample;
    obtaining a training loss function based on the code rate loss function and the distortion loss function, so as to train the fully connected encoding model and the fully connected decoding model.
  12. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus;
    the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the facial video encoding method according to any one of claims 1-6, or operations corresponding to the facial video decoding method according to any one of claims 7-10, or operations corresponding to the model training method according to claim 11.
PCT/CN2023/071943 2022-01-25 2023-01-12 Facial video encoding method, decoding method and device WO2023143101A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210085764.X 2022-01-25
CN202210085764.XA CN114531561A (zh) 2022-01-25 2022-01-25 Facial video encoding method, decoding method and device

Publications (1)

Publication Number Publication Date
WO2023143101A1 true WO2023143101A1 (zh) 2023-08-03

Family

ID=81623512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071943 WO2023143101A1 (zh) 2022-01-25 2023-01-12 Facial video encoding method, decoding method and device

Country Status (2)

Country Link
CN (1) CN114531561A (zh)
WO (1) WO2023143101A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114531561A (zh) * 2022-01-25 2022-05-24 阿里巴巴(中国)有限公司 一种面部视频编码方法、解码方法及装置
CN114898020A (zh) 2022-05-26 2022-08-12 唯物(杭州)科技有限公司 Real-time facial driving method and apparatus for a 3D character, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330281B1 (en) * 1999-08-06 2001-12-11 Richfx Ltd. Model-based view extrapolation for interactive virtual reality systems
CN104618721A (zh) * 2015-01-28 2015-05-13 山东大学 Face video encoding and decoding method at very low bit rate based on feature modeling
CN110472558A (zh) * 2019-08-13 2019-11-19 上海掌门科技有限公司 Image processing method and apparatus
CN113570684A (zh) * 2021-01-22 2021-10-29 腾讯科技(深圳)有限公司 Image processing method and apparatus, computer device, and storage medium
CN114531561A (zh) * 2022-01-25 2022-05-24 阿里巴巴(中国)有限公司 Facial video encoding method, decoding method and device

Also Published As

Publication number Publication date
CN114531561A (zh) 2022-05-24

Similar Documents

Publication Publication Date Title
WO2023143101A1 (zh) Facial video encoding method, decoding method and device
WO2022267641A1 (zh) Image dehazing method and system based on cyclic generative adversarial network
CN112991203B (zh) Image processing method and apparatus, electronic device, and storage medium
CN111669587B (zh) Mimetic compression method and apparatus for video images, storage medium, and terminal
CN110290387B (zh) Image compression method based on generative model
CN111386551A (zh) Method and device for predictive encoding and decoding of point clouds
WO2020237646A1 (zh) Image processing method and device, and computer-readable storage medium
Bird et al. 3d scene compression through entropy penalized neural representation functions
WO2023246923A1 (zh) Video encoding method, decoding method, electronic device, and storage medium
CN113688907A (zh) Model training and video processing methods, apparatus, device, and storage medium
WO2023246926A1 (zh) Model training method, video encoding method, and decoding method
CN112203098A (zh) Mobile image compression method based on edge feature fusion and super-resolution
WO2023143349A1 (zh) Facial video encoding method, decoding method and device
US20220335560A1 (en) Watermark-Based Image Reconstruction
US20240146963A1 (en) Method and apparatus for talking face video compression
CN114373023A (zh) Point-based lossy compression and reconstruction apparatus and method for point cloud geometry
CN111885384B (zh) Image processing and transmission method based on generative adversarial network under bandwidth constraints
Pinheiro et al. Nf-pcac: Normalizing flow based point cloud attribute compression
CN114339190B (zh) Communication method, apparatus, device, and storage medium
WO2023143331A1 (zh) Facial video encoding method, decoding method and device
WO2022067776A1 (zh) Point cloud decoding and encoding methods, decoder, encoder, and codec system
CN114449286A (zh) Video encoding method, decoding method and device
CN113132755B (zh) Scalable human-machine collaborative image coding method and system, and decoder training method
Yang et al. Graph-convolution network for image compression
CN114401406A (zh) Facial video encoding method, decoding method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23745984

Country of ref document: EP

Kind code of ref document: A1