WO2023143331A1 - Facial video encoding method, decoding method and device - Google Patents

Facial video encoding method, decoding method and device

Info

Publication number
WO2023143331A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial video
video frame
target
facial
frame
Prior art date
Application number
PCT/CN2023/073013
Other languages
English (en)
French (fr)
Inventor
王钊
陈柏林
叶琰
王诗淇
Original Assignee
阿里巴巴(中国)有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202210085777.7A external-priority patent/CN114401406A/zh
Priority claimed from CN202210085252.3A external-priority patent/CN114205585A/zh
Application filed by 阿里巴巴(中国)有限公司
Publication of WO2023143331A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/184: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being bits, e.g. of the compressed video stream

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a facial video encoding method, decoding method and device.
  • existing video coding algorithms usually select the first frame of the video as the reference frame when encoding and decoding facial video, and encode and decode the subsequent video frames against it to obtain the corresponding reconstructed facial video frames.
  • the reconstructed facial video frames obtained in this way have poor texture quality, and the accuracy of the motion description is also low; that is, the quality of the reconstructed facial video frames produced by the above existing video coding algorithms is relatively poor.
  • embodiments of the present application provide a facial video encoding method, decoding method, and device to at least partially solve the above-mentioned problems.
  • a facial video coding method including:
  • a facial video decoding method including:
  • the facial video bitstream includes: a plurality of encoded reference facial video frames and encoded compact feature information; the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed;
  • the facial video frame is reconstructed to obtain the fused facial video frame corresponding to the target facial video frame.
  • a method for generating a reference facial video frame including:
  • the information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in each initial reference facial video frame
  • the target facial video frame is added to the reference frame list as a newly added reference facial video frame.
  • a model training method including:
  • each reference facial video frame sample and the corresponding driving information sample are input into the initial second generative model to obtain the initial reconstructed facial video frame sample corresponding to each reference facial video frame sample;
  • each initial reconstructed facial video frame sample is input into the initial fusion model to obtain the fused facial video frame sample;
  • an electronic device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another via the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to execute the operations corresponding to the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the reference facial video frame generation method described in the third aspect, or the model training method described in the fourth aspect.
  • a computer storage medium on which a computer program is stored; when the program is executed by a processor, the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the reference facial video frame generation method described in the third aspect, or the model training method described in the fourth aspect is implemented.
  • a computer program product including computer instructions, where the computer instructions instruct a computing device to perform the operations corresponding to the facial video encoding method described in the first aspect, or the facial video decoding method described in the second aspect, or the reference facial video frame generation method described in the third aspect, or the model training method described in the fourth aspect.
  • the target facial video frame to be encoded is encoded and decoded based on a plurality of reference facial video frames, so as to obtain the fused facial video frame corresponding to the target facial video frame.
  • a single reference facial video frame limits the reconstruction of texture information and motion information; in the embodiments of the present application, the fused facial video frame is obtained based on multiple reference facial video frames, so its texture quality and motion information draw on multiple different reference facial video frames at the same time, and the quality difference between the reconstructed result and the target facial video frame is therefore small.
  • in this way, the embodiments of the present application improve the reconstruction quality of facial video frames.
  • Fig. 1 is a schematic framework diagram of a codec method based on depth video generation
  • Fig. 2 is a flow chart of the steps of a facial video encoding method according to Embodiment 1 of the present application;
  • Fig. 3 is a schematic diagram of a scenario example in the embodiment shown in Fig. 2;
  • Fig. 4 is a flow chart of the steps of a facial video coding method according to Embodiment 1 of the present application;
  • Fig. 5 is a flow chart of the steps of a facial video decoding method according to Embodiment 2 of the present application.
  • FIG. 6 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application.
  • FIG. 7 is a flow chart of the steps of a facial video decoding method according to Embodiment 3 of the present application.
  • Figure 8 is a schematic diagram of the scene corresponding to the third embodiment of the present application.
  • Fig. 9 is a schematic flow chart of obtaining a fusion facial video frame according to Embodiment 3 of the present application.
  • Fig. 10 is another schematic flow chart of obtaining a fusion facial video frame according to Embodiment 3 of the present application.
  • FIG. 11 is a schematic flow diagram of a method for generating a reference facial video frame according to Embodiment 4 of the present application.
  • FIG. 12 is a flow chart of steps of a model training method according to Embodiment 5 of the present application.
  • Fig. 13 is a structural block diagram of a facial video encoding device according to Embodiment 6 of the present application.
  • FIG. 14 is a structural block diagram of a facial video decoding device according to Embodiment 7 of the present application.
  • Fig. 15 is a structural block diagram of a model training device according to Embodiment 8 of the present application.
  • FIG. 16 is a schematic structural diagram of an electronic device according to Embodiment 9 of the present application.
  • FIG. 1 is a schematic framework diagram of a codec method based on depth video generation.
  • the main principle of this method is to deform the reference frame based on the motion of the frame to be encoded to obtain a reconstructed frame corresponding to the frame to be encoded.
  • the basic framework of the codec method based on depth video generation is described below in conjunction with Figure 1:
  • in the first step, the encoding stage, the encoder uses a key point extractor to extract the target key point information of the target facial video frame to be encoded and encodes the target key point information; meanwhile, a traditional image coding method (such as VVC, HEVC, etc.) is adopted to encode the reference facial video frame.
  • in the second step, the motion estimation module in the decoder extracts the reference key point information of the reference facial video frame through the key point extractor, and performs dense motion estimation based on the reference key point information and the target key point information to obtain a dense motion estimation map and an occlusion map, wherein the dense motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in the feature domain represented by the key point information, and the occlusion map represents the degree of occlusion of each pixel in the target facial video frame.
  • the third step is the decoding stage.
  • the generation module in the decoder performs deformation processing on the reference facial video frame based on the dense motion estimation map to obtain the deformation processing result, and then multiplies the deformation processing result by the occlusion map to output the reconstructed facial video frame.
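  • As an illustration only, the following minimal sketch (written in Python with PyTorch, with assumed tensor shapes and the hypothetical name generate_reconstruction, none of which are defined by the present application) shows how a reference frame might be deformed by a dense motion estimation map and then attenuated by an occlusion map:

```python
# Minimal sketch of the deformation-and-occlusion generation step described above.
# Tensor layouts and value ranges are illustrative assumptions.
import torch
import torch.nn.functional as F

def generate_reconstruction(reference_frame: torch.Tensor,   # (1, 3, H, W)
                            dense_motion: torch.Tensor,      # (1, H, W, 2), sampling grid in [-1, 1]
                            occlusion_map: torch.Tensor      # (1, 1, H, W), values in [0, 1]
                            ) -> torch.Tensor:
    # Deform the reference frame according to the dense motion estimation map.
    warped = F.grid_sample(reference_frame, dense_motion, align_corners=True)
    # Down-weight regions the motion model cannot explain, as indicated by the occlusion map.
    return warped * occlusion_map
```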
  • subsequent video frames are encoded and decoded based on a single reference frame to obtain corresponding reconstructed facial video frames.
  • the texture quality and motion information of the reconstructed facial video frame obtained in this way mainly depend on the single reference facial video frame; that is to say, in the reconstruction process, the single reference facial video frame limits the reconstruction of texture information and motion information, making the texture quality of the reconstructed facial video frame lower than that of the original facial video frame to be encoded, while the accuracy of the motion description is also lower. That is, the resulting reconstructed facial video frames are of poor quality.
  • in the embodiments of the present application, however, the fused facial video frame is obtained based on a plurality of reference facial video frames, and its texture quality and motion information refer to a plurality of different reference facial video frames at the same time.
  • therefore, the quality difference between the fused facial video frame and the target facial video frame is small, which improves the reconstruction quality of facial video frames.
  • FIG. 2 is a flow chart of steps of a face video encoding method according to Embodiment 1 of the present application. Specifically, the facial video encoding method provided in this embodiment includes the following steps:
  • Step 202 acquire the target facial video frame to be encoded and a plurality of initial reference facial video frames in the reference frame list.
  • the specific setting method and data of the initial reference facial video frame are not limited, and can be selected according to actual conditions.
  • the initial reference facial video frame can be any video frame in the facial video whose timestamp is earlier than the target facial video frame; it may also be a video frame selected from the facial video according to a preset selection rule, and so on.
  • for example, the initial reference facial video frames may be the first preset number of frames in the facial video.
  • Step 204 Encode a plurality of initial reference facial video frames and target facial video frames respectively to obtain a facial video bit stream.
  • VVC: Versatile Video Coding.
  • feature extraction can be performed on the target facial video frame first to obtain the target compact feature of the target facial video frame, and the target compact feature is then encoded, wherein the target compact feature represents the key feature information in the target facial video frame, such as facial-feature position information, posture information, expression information, etc.
  • the machine learning model can be used to perform feature extraction on the target facial video frame, so as to obtain the target compact features.
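  • The following is a hedged sketch of what such a feature extraction model might look like; the architecture, layer sizes and feature dimension are assumptions made for illustration and are not specified by the present application:

```python
# Illustrative compact-feature extractor: a small CNN that maps a facial video frame
# to a low-dimensional feature vector representing its key information.
import torch
import torch.nn as nn

class CompactFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling to a compact descriptor
        )
        self.head = nn.Linear(64, feature_dim)  # compact feature vector

    def forward(self, frame: torch.Tensor) -> torch.Tensor:  # frame: (N, 3, H, W)
        x = self.backbone(frame).flatten(1)
        return self.head(x)
```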
  • Fig. 3 is a schematic diagram of a scene corresponding to Embodiment 1 of the present application.
  • an example of a specific scene will be used to describe the embodiment of the present application:
  • acquire the target facial video frame a and a plurality of initial reference facial video frames in the reference frame list: A1, A2, A3 and A4 (the number of initial reference facial video frames is not limited in the embodiments of the present application; only 4 are used in Fig. 3 for illustration, which does not constitute a limitation); A1, A2, A3 and A4 are encoded respectively, and a is also encoded (for example, compact feature extraction can be performed on a and the compact feature encoded) to obtain the facial video bitstream.
  • after obtaining the facial video bit stream, the method also includes:
  • the information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in the initial reference facial video frame; if there is an information difference value greater than the preset threshold, the target facial video frame is added to the reference frame list as a newly added reference facial video frame to update the reference frame list; the newly added reference facial video frame is used to encode other video frames to obtain the first facial video bitstream.
  • specifically, the target facial video frame can be added to the reference frame list as a newly added reference facial video frame, and the initial reference facial video frame corresponding to the maximum information difference value is deleted, so that the updated reference frame list is obtained; the updated reference frame list is then used to encode and decode the facial video frames after the current target facial video frame to obtain the corresponding reconstructed facial video frames.
  • for example, the mean square error calculation can be performed to obtain the pixel-level information difference value; alternatively, feature extraction is performed on the target facial video frame and the initial reference facial video frame to obtain features that characterize the key feature information of the target facial video frame and features that characterize the key feature information of the initial reference facial video frame, and then, based on the above two features, the mean square error is calculated to obtain the information difference value in the feature domain, and so on.
  • the information difference value is greater than the preset threshold, it indicates that the information difference between the target facial video frame and the initial reference facial video frame is relatively large.
  • if encoding and decoding operations are performed on the target facial video frame based on such an initial reference facial video frame, the information difference between the reconstructed facial video frame and the target facial video frame will be relatively large, and the reconstruction quality of the facial video frame is difficult to guarantee.
  • therefore, in the embodiments of the present application, when it is determined that the information difference between the target facial video frame and the initial reference facial video frame is relatively large, the target facial video frame is used as a newly added reference facial video frame against which other video frames are encoded, obtaining the first facial video bitstream.
  • this reduces the problem of poor facial video reconstruction quality caused by encoding and decoding the target facial video frame based on the initial reference facial video frame when the information difference between the initial reference facial video frame and the facial video frame to be encoded is large, improving the quality of facial video reconstruction and thereby obtaining higher-quality reconstructed facial video frames.
  • when the target facial video frame is used as a newly added reference facial video frame, it can be encoded in the following manner:
  • the newly added reference facial video frame is used as a key frame and is encoded independently, with relatively small quantization distortion.
  • the encoding process retains the complete data of the newly added reference facial video frame.
  • when decoding, only the newly added reference facial video frame itself is needed to complete the decoding process.
  • in the process of encoding the target facial video frame, the target facial video frame is not directly encoded and decoded based on the initial reference facial video frames in the reference frame list; instead, the information difference value between the target facial video frame and the initial reference facial video frame is calculated first. If the information difference value is large (greater than the preset threshold), the target facial video frame is no longer encoded and decoded based on the initial reference facial video frame; instead, the target facial video frame is used as a new (added) reference facial video frame to encode other video frames, obtaining the first facial video bitstream.
  • the reference facial video frame actually used in the encoding and decoding process is determined based on the degree of information difference between the target facial video frame and the initial reference facial video frame.
  • when the degree of difference is large, the target facial video frame is used as a new reference facial video frame, and encoding and decoding no longer depend on the initial reference facial video frames; therefore, the quality of facial video encoding and decoding can be improved, and a higher-quality reconstructed facial video frame can be obtained.
  • a plurality of reference facial video frames and the target facial video frame are encoded to obtain the facial video bit stream, so that after the decoding end obtains the facial video bit stream, it can decode the target facial video frame based on the multiple reference facial video frames and then obtain the fused and reconstructed facial video frame corresponding to the target facial video frame.
  • the facial video encoding method provided in Embodiment 1 of the present application can be executed by a video encoding terminal (encoder) and is used to encode facial video files, so as to realize digital broadband compression of facial video files. It can be applied to a variety of different scenarios. For example, in the storage and streaming of conventional video games involving faces, the game video frames can be encoded by the facial video encoding method provided by the embodiments of the present application to form a corresponding video code stream, which can be stored and transmitted in video streaming services or other similar applications.
  • as another example, in low-latency scenarios such as video conferencing and live video broadcasting, the facial video encoding method provided by the embodiments of the present application can be used to encode the facial video data collected by the video acquisition device to form a corresponding video code stream and send it to the conference terminal, which decodes the video code stream to obtain the corresponding facial video picture; similarly, in virtual reality scenarios, the facial video data collected by the video acquisition device can be encoded by the facial video encoding method provided by the embodiments of the present application.
  • the number of reference facial video frames contained in the reference frame list is constant (it can be set according to actual conditions); therefore, when the target facial video frame is determined to be a newly added reference facial video frame, the reference frame list needs to be updated to obtain the updated reference frame list, so as to perform encoding and decoding operations on subsequent facial video frames.
  • the updating method of reference frame list can be:
  • the update method of the reference frame list can be: adding the newly added reference facial video frame to the reference frame list, and deleting one initial reference facial video frame from all the initial reference facial video frames, to obtain the updated reference frame list.
  • the initial reference facial video frame to be deleted can be chosen according to the information difference value between each initial reference facial video frame and the newly added reference facial video frame, for example, deleting the initial reference facial video frame with the largest information difference value; alternatively, it can be chosen according to the timestamps of the initial reference facial video frames, for example, deleting the initial reference facial video frame with the earliest timestamp, and so on.
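  • The following sketch illustrates one possible update policy of this kind; the data structures, the info_difference callable and the fixed list size are illustrative assumptions, not the concrete implementation of the present application:

```python
# Sketch of a reference-frame-list update: append the newly added reference frame and,
# if the list exceeds its fixed size, drop one initial reference frame.
from typing import Callable, List

def update_reference_list(reference_list: List,      # current reference facial video frames
                          new_reference_frame,        # target frame promoted to a reference frame
                          info_difference: Callable,  # info_difference(frame_a, frame_b) -> float
                          max_size: int) -> List:
    reference_list = reference_list + [new_reference_frame]
    if len(reference_list) > max_size:
        # One option from the text: drop the initial reference frame that differs most
        # from the newly added reference frame (dropping the oldest frame is the alternative).
        diffs = [info_difference(ref, new_reference_frame) for ref in reference_list[:-1]]
        reference_list.pop(diffs.index(max(diffs)))
    return reference_list
```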
  • if there is an information difference value less than or equal to the preset threshold, then based on the corresponding initial reference facial video frame, the target facial video frame is encoded to obtain a second facial video bit stream.
  • the specific method for encoding the target facial video frame based on the initial reference facial video frame to obtain the second facial video bitstream, and for decoding the second facial video bitstream, is not limited; any existing method for encoding and decoding the current frame by means of a reference frame may be used, which will not be repeated here.
  • in the process of encoding the target facial video frame, the information difference value between the target facial video frame and the initial reference facial video frame is calculated first; if the information difference value is large (greater than the preset threshold), the target facial video frame is used as a new (newly added) reference facial video frame for encoding and decoding operations, thereby obtaining the reconstructed facial video frame corresponding to the target facial video frame; if the information difference value is small, the target facial video frame is encoded and decoded based on the initial reference facial video frame to obtain the corresponding reconstructed facial video frame.
  • in this way, when the information difference between the initial reference facial video frame and the facial video frame to be encoded is large, the problem of poor facial video reconstruction quality caused by encoding and decoding the target facial video frame based on the initial reference facial video frame can be avoided, improving the quality of facial video reconstruction and thereby obtaining higher-quality reconstructed facial video frames.
  • the facial video encoding method in this embodiment can be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
  • FIG. 4 is a flow chart of steps of a face video encoding method according to Embodiment 1 of the present application.
  • the number of initial reference facial video frames can be set in advance according to actual needs, and the specific number is not limited here.
  • the facial video encoding method provided by the present embodiment comprises the following steps:
  • Step 402 respectively calculating information difference values between the target facial video frame to be encoded and each initial reference facial video frame in the reference frame list. If there is a candidate information difference value less than or equal to the preset threshold among all the information difference values, step 404 is performed; if all the information difference values are greater than the preset threshold, step 408 is performed.
  • the information difference value between the target facial video frame and the initial reference facial video frame can be calculated in the following two different ways:
  • in the first way, based on the pixel value of each pixel in the target facial video frame to be encoded and the pixel value of each pixel in the initial reference facial video frame, a mean square error (Mean-Square Error, MSE) calculation is performed to obtain the information difference value between the target facial video frame and the initial reference facial video frame.
  • MSE: mean square error.
  • in the second way, feature extraction is performed on the target facial video frame to be encoded to obtain the compact feature of the target facial video frame, and feature extraction is performed on the initial reference facial video frame in the reference frame list to obtain the compact feature of the initial reference facial video frame; based on the compact feature of the target facial video frame and the compact feature of the initial reference facial video frame, the mean square error calculation is carried out to obtain the information difference value between the target facial video frame and the initial reference facial video frame in the reference frame list.
  • the compact feature of the target facial video frame represents the key feature information in the target facial video frame;
  • the compact feature of the initial reference facial video frame represents the key feature information in the initial reference facial video frame.
  • specifically, the target facial video frame to be encoded can be input into a pre-trained feature extraction model, so that the feature extraction model outputs the compact feature of the target facial video frame; and the initial reference facial video frame in the reference frame list is input into the feature extraction model, so that the feature extraction model outputs the compact feature of the initial reference facial video frame.
  • the specific structure and parameters of the feature extraction model are not limited, and may be set according to actual conditions.
  • the first way calculates the information difference value at the pixel level, so the accuracy of the calculation result is higher.
  • the second way calculates the information difference value from the perspective of the extracted features, that is, in the feature domain. Since the features are extracted from the facial video frame, they are effectively a down-sampled representation of the video frame, so the amount of calculation is small and the calculation efficiency is high.
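  • The two calculations can be sketched as follows; the frames are assumed to be numpy arrays and extract_compact_feature is a hypothetical stand-in for the pre-trained feature extraction model, so this is illustrative only:

```python
# Pixel-level and feature-domain information difference values, both as mean square errors.
import numpy as np

def pixel_info_difference(target_frame: np.ndarray, reference_frame: np.ndarray) -> float:
    # First way: MSE over pixel values (higher accuracy, more computation).
    diff = target_frame.astype(np.float64) - reference_frame.astype(np.float64)
    return float(np.mean(diff ** 2))

def feature_info_difference(target_frame: np.ndarray, reference_frame: np.ndarray,
                            extract_compact_feature) -> float:
    # Second way: MSE between compact features (down-sampled, cheaper to compute).
    target_feat = extract_compact_feature(target_frame)
    reference_feat = extract_compact_feature(reference_frame)
    return float(np.mean((target_feat - reference_feat) ** 2))
```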
  • Step 404 Determine the information difference value with the smallest value from the candidate information difference values as the target information difference value.
  • Step 406 Based on the initial reference facial video frame corresponding to the target information difference value, encode the target facial video frame to obtain a second facial video bit stream. At this point, the encoding process ends.
  • that is, the initial reference facial video frame with the minimum information difference value can be selected from the multiple initial reference facial video frames.
  • based on the initial reference facial video frame corresponding to the target information difference value, the target facial video frame is encoded to obtain the second facial video bit stream; the specific method for encoding the target facial video frame based on this initial reference facial video frame and for decoding the second facial video bitstream is not limited, and any existing method for encoding and decoding the current frame by using a reference frame may be used, and details are not repeated here.
  • Step 408 the target facial video frame is used as a newly added reference facial video frame, and the newly added reference facial video frame is used for encoding other video frames to obtain a first facial video bit stream.
  • if any initial reference facial video frame is used as the reference frame to encode and decode the target facial video frame, the information difference between the reconstructed facial video frame and the target facial video frame will be relatively large, and the reconstruction quality of the facial video frame is difficult to guarantee.
  • therefore, in the embodiments of the present application, when it is determined that the information difference between the target facial video frame and all the initial reference facial video frames is large, the target facial video frame is used as a newly added reference facial video frame against which other video frames are encoded, obtaining the first facial video bitstream.
  • in this way, the quality of facial video reconstruction can be improved, thereby obtaining higher-quality reconstructed facial video frames.
  • Step 410 add the new reference facial video frame to the reference frame list, and delete the initial reference facial video frame with the earliest time stamp, to obtain the updated reference frame list.
  • in the process of encoding the target facial video frame, the information difference values between the target facial video frame and each initial reference facial video frame are calculated first; if all the information difference values are large (greater than the preset threshold), the target facial video frame is used as a new (newly added) reference facial video frame for encoding and decoding operations; if there is a small information difference value, the target facial video frame is encoded and decoded based on the initial reference facial video frame corresponding to the minimum difference value, to obtain the corresponding reconstructed facial video frame.
  • when the target facial video frame is encoded and decoded based on the initial reference facial video frame corresponding to the minimum difference value, the difference between that initial reference facial video frame and the target facial video frame is the smallest; therefore, with encoding and decoding operations based on the initial reference facial video frame with the smallest difference, the quality of facial video reconstruction is higher, and the quality of the obtained reconstructed facial video frame is also higher.
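  • The encoder-side decision described in steps 402 to 410 can be sketched as follows; the helper callables (info_difference, encode_with_reference, encode_as_new_reference) are hypothetical placeholders, not components defined by the present application:

```python
# Encoder-side decision: encode against the closest reference frame when one is close
# enough, otherwise promote the target frame to a newly added reference frame.
def encode_target_frame(target_frame, reference_list, threshold, info_difference,
                        encode_with_reference, encode_as_new_reference):
    diffs = [info_difference(target_frame, ref) for ref in reference_list]
    candidates = [d for d in diffs if d <= threshold]
    if candidates:
        # Some reference frame is close enough: encode against the closest one (steps 404/406).
        best_ref = reference_list[diffs.index(min(candidates))]
        return encode_with_reference(target_frame, best_ref)   # second facial video bitstream
    # All differences exceed the threshold: the target frame becomes a new reference
    # frame and is encoded independently (steps 408/410).
    return encode_as_new_reference(target_frame)               # first facial video bitstream
```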
  • FIG. 5 is a flowchart of steps of a facial video decoding method according to Embodiment 2 of the present application. Specifically, the facial video decoding method provided by this embodiment includes the following steps:
  • Step 502 acquire facial video bit stream.
  • Step 504 decode the facial video bit stream to obtain the target driving information of the target facial video frame and the target identification information indicating whether the target facial video frame is a newly added reference facial video frame.
  • the target driving information may be information that characterizes the key feature information of the target facial video frame, where the key feature information may include: facial features position information, posture information, expression information, and so on.
  • in the embodiments of the present application, the specific form of the target driving information is not limited; for example, it can be the key point information extracted by a facial key point extractor, or it can be an implicit representation extracted by a feature extraction model, i.e., a compact feature matrix (vector) with a smaller amount of data and richer representation information, and so on.
  • the target identification information in this step is generated by the encoding end during the process of encoding the target facial video frame and sent to the decoding end.
  • the encoding end can determine whether the target facial video frame is a newly added reference facial video frame based on the information difference values between the target facial video frame and the initial reference facial video frames in the reference frame list. Specifically, if all the information difference values are greater than the preset threshold, it is determined that the target facial video frame is a newly added reference facial video frame, which can be used to encode other video frames; if there is an information difference value less than or equal to the preset threshold, it is determined that the target facial video frame is a non-newly added reference facial video frame, and the target facial video frame may be encoded based on the initial reference facial video frame in the reference frame list to obtain the facial video bit stream, which is sent to the decoding end.
  • Step 506 if the target facial video frame is a non-newly added reference facial video frame, respectively acquire reference driving information of multiple reference facial video frames in the reference frame list.
  • the reference driving information may be information characterizing the key feature information of the reference facial video frame, where the key feature information may include: facial feature position information, posture information, expression information, and so on.
  • the specific form of the reference driving information is not limited.
  • for example, it can be the key point information extracted by a facial key point extractor, or an implicit representation extracted by a feature extraction model, i.e., a compact feature matrix (vector) with a smaller amount of data and richer representation information, and so on.
  • the above applies when the target facial video frame is a non-newly added reference facial video frame.
  • if the target facial video frame is a newly added reference facial video frame, the target facial video frame is encoded independently, that is, encoded with relatively small quantization distortion, and the encoding process retains the complete data of the target facial video frame.
  • when decoding it, only the decoding method corresponding to the encoding method needs to be used.
  • Step 508 based on the target driving information and each reference driving information, calculate the information difference value between the target facial video frame and each reference facial video frame.
  • the information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in the reference facial video frame.
  • the information difference value can be calculated in the following way:
  • the mean square error is calculated to obtain the information difference value between the target facial video frame and each reference facial video frame.
  • Step 510 taking the reference facial video frame corresponding to the minimum information difference value as the target reference facial video frame, and obtaining the reconstructed facial video frame based on the target reference facial video frame and target driving information.
  • the specific decoding form for obtaining the reconstructed facial video frame is not limited; the decoding method corresponding to the encoding method used when obtaining the facial video bitstream in step 406 of the embodiment shown in FIG. 4 can be adopted to obtain the reconstructed facial video frame.
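  • Steps 506 to 510 can be sketched as follows; reconstruct is a hypothetical stand-in for the actual generation step, and the driving information is assumed to be represented as numpy arrays, so this is illustrative only:

```python
# Decoder-side selection of the target reference facial video frame for a
# non-newly added target frame, followed by reconstruction.
import numpy as np

def decode_non_new_frame(target_driving: np.ndarray,
                         reference_frames: list, reference_drivings: list,
                         reconstruct):
    # Step 508: information difference between the target driving information and
    # the driving information of each reference facial video frame.
    diffs = [float(np.mean((target_driving - d) ** 2)) for d in reference_drivings]
    # Step 510: the reference frame with the minimum difference becomes the target reference frame.
    target_reference = reference_frames[int(np.argmin(diffs))]
    return reconstruct(target_reference, target_driving)
```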
  • FIG. 6 is a schematic diagram of a scene corresponding to Embodiment 2 of the present application.
  • an example of a specific scene will be used to describe the embodiment of the present application:
  • the facial video bit stream is decoded to obtain the target driving information D of the target facial video frame and the target identification information, wherein the target identification information indicates that the target facial video frame is a non-newly added reference facial video frame; the reference driving information of each reference facial video frame in the reference frame list is then obtained: D1, D2 and D3 (the number of reference facial video frames contained in the reference frame list can be any integer greater than 1; only 3 reference facial video frames are used in Fig. 6 for illustration).
  • since the target facial video frame is a non-newly added reference facial video frame, the information difference values between the target driving information D and D1, D2 and D3 are calculated, and the reconstructed facial video frame is obtained based on the reference facial video frame corresponding to the minimum information difference value and the target driving information D.
  • the finally determined target reference facial video frame is the reference facial video frame with the minimum information difference from the target facial video frame; therefore, when the facial video frame is reconstructed based on the above-mentioned target reference facial video frame, the quality of the reconstructed facial video frame obtained is also the highest, which improves the reconstruction quality of facial video frames.
  • FIG. 7 is a flow chart of steps of a facial video decoding method according to Embodiment 3 of the present application. Specifically, the facial video decoding method provided by this embodiment includes the following steps:
  • Step 702 acquire the facial video bit stream, the facial video bit stream includes: a plurality of encoded reference facial video frames and encoded compact feature information.
  • the encoded compact feature information represents the key feature information of the target facial video frame to be reconstructed.
  • the encoded compact feature information may correspond to the compact feature information, obtained by feature extraction of the target facial video frame, that represents its key feature information; it may also correspond to the difference between the target compact feature information of adjacent target facial video frames.
  • Step 704 respectively decode a plurality of coded reference facial video frames to obtain a plurality of reference facial video frames.
  • Step 706 decoding the encoded compact feature information to obtain the target compact feature of the target facial video frame.
  • Step 708 based on multiple reference facial video frames and target compact features, perform facial video frame reconstruction to obtain a fused facial video frame corresponding to the target facial video frame.
  • specifically, for each reference facial video frame, facial video frame reconstruction can be performed separately, so as to obtain the initial reconstructed facial video frame corresponding to each reference facial video frame, and the initial reconstructed facial video frames are then fused to obtain the final fused facial video frame; it is also possible to obtain the final fused facial video frame based on all of the multiple reference facial video frames at the same time.
  • Figure 8 is a schematic diagram of the scene corresponding to Embodiment 3 of the present application.
  • an example of a specific scene will be used to describe the embodiment of the present application:
  • the encoded reference facial video frames are decoded to obtain a plurality of reference facial video frames: a1, a2, a3 and a4 (the number of reference facial video frames is not limited in the embodiments of the present application; only 4 are used in Fig. 8 for illustration, which does not constitute a limitation), and the target compact feature is also decoded; finally, facial video frame reconstruction is performed based on the reference facial video frames a1, a2, a3 and a4 and the target compact feature, to obtain the fused facial video frame a0 corresponding to the target facial video frame.
  • facial video frame reconstruction is performed to obtain a fusion facial video frame corresponding to the target facial video frame, which may include:
  • for each reference facial video frame, the facial video frame is reconstructed to obtain the initial reconstructed facial video frame corresponding to that reference facial video frame; the initial reconstructed facial video frames are then fused to obtain the fused facial video frame corresponding to the target facial video frame.
  • the specific process of obtaining the initial reconstructed facial video frame is not limited.
  • for example, it can be: perform feature extraction on each reference facial video frame to obtain the reference compact feature; then perform sparse motion estimation based on the reference compact feature and the target compact feature to obtain a sparse motion estimation map; and, based on the sparse motion estimation map, deform the reference facial video frame to obtain the initial reconstructed facial video frame corresponding to the target facial video frame.
  • the sparse motion estimation map represents the relative motion relationship between the target facial video frame and the reference facial video frame in a preset sparse feature domain.
  • specifically, the weight value corresponding to each initial reconstructed facial video frame can be obtained first; then, based on the weight values, each initial reconstructed facial video frame is subjected to linear weighting processing to obtain the fused facial video frame corresponding to the target facial video frame.
  • linear weighting may be performed on the pixel values corresponding to the pixel points of each initial reconstructed facial video frame based on each weight value, so as to obtain a fused facial video frame corresponding to the target facial video frame.
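  • A minimal sketch of this linear weighting, assuming the frames are numpy arrays and the weight values are already available (e.g. derived elsewhere) and sum to 1:

```python
# Pixel-wise weighted sum of the initial reconstructed facial video frames.
import numpy as np

def fuse_linear(initial_reconstructions: list, weights: list) -> np.ndarray:
    fused = np.zeros_like(initial_reconstructions[0], dtype=np.float64)
    for frame, w in zip(initial_reconstructions, weights):
        fused += w * frame.astype(np.float64)
    return fused
```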
  • fusion processing can also be performed with the help of machine learning models, specifically:
  • Each initial reconstructed facial video frame is input into the fusion model, so that the fusion model outputs a fused facial video frame corresponding to the target facial video frame.
  • in the embodiments of the present application, the specific structure and parameters of the fusion model are not limited, and can be set according to actual needs.
  • the fusion model can be a U-Net network based on a combination of convolutional layers and generalized division normalization layers; it can also be an hourglass model based on multiple downsampling modules and multiple corresponding upsampling modules, and so on.
  • facial video frame reconstruction is performed to obtain a fusion facial video frame corresponding to the target facial video frame, which may include:
  • for each reference facial video frame, the driving information corresponding to the reference facial video frame is obtained, and the driving information includes: a motion estimation map between the reference facial video frame and the target facial video frame; each reference facial video frame and its corresponding driving information are input into the first generative model to obtain the fused facial video frame corresponding to the target facial video frame.
  • the motion estimation map represents the relative motion relationship between the reference facial video frame and the target facial video frame.
  • the motion estimation map can include a sparse motion map and a dense motion map, wherein the sparse motion map represents the relative motion relationship between the target facial video frame and the reference facial video frame in the preset sparse feature domain, and the dense motion map represents the relative motion relationship between the target facial video frame and the reference facial video frame in the dense feature domain.
  • the driving information may also include an occlusion map representing the degree of occlusion of each pixel in the target facial video frame, and so on.
  • FIG. 9 is a schematic flow chart of obtaining fused facial video frames according to Embodiment 3 of the present application.
  • two reference facial video frames are used as an example, which does not constitute a limitation to the embodiment of the present application. Specifically: based on the reference facial video frame 1 and the target compact feature, obtain the corresponding driving information 1 of the reference facial video frame 1; based on the reference facial video frame 2 and the target compact feature, obtain the corresponding driving information 2 of the reference facial video frame 2; The reference facial video frames 1, 2 and driving information 1, 2 are input into the first generative model to obtain fused facial video frames.
  • facial video frame reconstruction is performed to obtain a fusion facial video frame corresponding to the target facial video frame, which may also include:
  • for each reference facial video frame, based on the reference facial video frame and the target compact feature, the driving information corresponding to the reference facial video frame is obtained; each reference facial video frame and its corresponding driving information are input into the second generative model to obtain the initial reconstructed facial video frame corresponding to each reference facial video frame; each initial reconstructed facial video frame is then input into the fusion model, so that the fusion model outputs the fused facial video frame corresponding to the target facial video frame.
  • FIG. 10 is another schematic flow chart for obtaining fused facial video frames according to Embodiment 3 of the present application.
  • here, two reference facial video frames are also used as an example. Specifically: based on reference facial video frame 1 and the target compact feature, the driving information 1 corresponding to reference facial video frame 1 is obtained; based on reference facial video frame 2 and the target compact feature, the driving information 2 corresponding to reference facial video frame 2 is obtained; reference facial video frame 1 and driving information 1 are input into the second generative model to obtain initial reconstructed facial video frame 1, and reference facial video frame 2 and driving information 2 are input into the second generative model to obtain initial reconstructed facial video frame 2; initial reconstructed facial video frame 1 and initial reconstructed facial video frame 2 are input into the fusion model, and finally the fused facial video frame is obtained.
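  • The flow of Fig. 10 can be sketched as follows; the model objects and the derive_driving_info helper are hypothetical callables used only for illustration, not the concrete networks of the present application:

```python
# Per-reference generation followed by fusion: each reference frame is passed through
# the second generative model with its driving information, and the initial
# reconstructions are then merged by the fusion model.
def reconstruct_with_fusion(reference_frames, target_compact_feature,
                            derive_driving_info, second_generator, fusion_model):
    initial_reconstructions = []
    for ref in reference_frames:
        driving = derive_driving_info(ref, target_compact_feature)  # per-reference driving information
        initial_reconstructions.append(second_generator(ref, driving))
    # The fusion model outputs the fused facial video frame corresponding to the target frame.
    return fusion_model(initial_reconstructions)
```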
  • the target facial video frame to be encoded is encoded and decoded based on a plurality of reference facial video frames, so as to obtain the fused and reconstructed facial video frame corresponding to the target facial video frame.
  • when reconstruction relies on a single reference facial video frame, the texture quality and motion information of the result mainly depend on that single frame; that is, in the reconstruction process, the single reference facial video frame limits the reconstruction of texture information and motion information. In the embodiments of the present application, the fused facial video frame is obtained based on multiple reference facial video frames, and its texture quality and motion information refer to multiple different reference facial video frames at the same time; therefore, the quality difference between the resulting fused facial video frame and the target facial video frame is small.
  • the embodiments of the present application improve the reconstruction quality of facial video frames.
  • the facial video decoding method in this embodiment can be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
  • FIG. 11 is a flowchart of steps of a method for generating a reference face video frame according to Embodiment 4 of the present application.
  • the reference facial video frame generation method provided in this embodiment includes the following steps:
  • Step 802 acquiring the target facial video frame and a plurality of initial reference facial video frames in the reference frame list.
  • the specific setting method and data of the initial reference facial video frame are not limited, and can be selected according to actual conditions.
  • the initial reference facial video frame can be any video frame in the facial video whose timestamp is earlier than the target facial video frame; it can also be a video frame selected from the facial video according to a preset selection rule, and so on.
  • for example, the initial reference facial video frames may be the first preset number of frames in the facial video.
  • Step 804 calculating information difference values between the target facial video frame and each initial reference facial video frame.
  • the facial information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in each initial reference facial video frame.
  • for example, the mean square error calculation can be performed based on the pixel values to obtain the pixel-level information difference value.
  • alternatively, feature extraction is performed on the target facial video frame and the initial reference facial video frame to obtain features that characterize the key feature information of the target facial video frame and features that characterize the key feature information of the initial reference facial video frame, and then, based on the above two features, the mean square error is calculated to obtain the information difference value in the feature domain, and so on.
  • Step 806 if there is an information difference value greater than the preset threshold, add the target facial video frame as a new reference facial video frame to the facial reference frame list.
  • in addition, the number of reference facial video frames contained in the facial reference frame list is usually constant (it can be set according to actual conditions); therefore, when the target facial video frame is determined to be a newly added reference facial video frame, the reference frame list needs to be updated to obtain the updated reference frame list, so as to perform encoding and decoding operations on subsequent facial video frames.
  • the update method of the reference frame list may be: adding a newly added reference facial video frame to the reference frame list, and deleting an initial reference facial video frame from all initial reference facial video frames to obtain the updated reference frame list.
  • the initial reference facial video frame to be deleted can be chosen according to the information difference value between each initial reference facial video frame and the newly added reference facial video frame, for example, deleting the initial reference facial video frame with the largest information difference value; alternatively, it can be chosen according to the timestamps of the initial reference facial video frames, for example, deleting the initial reference facial video frame with the earliest timestamp, and so on.
  • in the reference facial video frame generation method provided by the embodiments of the present application, the information difference value between the target facial video frame and each initial reference facial video frame is considered: when there is an initial reference facial video frame with a large information difference value (greater than the preset threshold), it indicates that the information difference between the target facial video frame and one or several initial reference facial video frames is relatively large, that is to say, the information in the facial video frames has undergone a large migration (change); if subsequent facial video frames were encoded and decoded based on an initial reference facial video frame with a large information difference value, the information difference between the reconstructed facial video frame and the target facial video frame might be relatively large.
  • therefore, the target facial video frame is used as a newly added reference facial video frame in the reference frame list, and an updated reference frame list is obtained, so that when subsequent facial video frame reconstruction is performed based on the updated reference frame list, the quality of the reconstructed video frames can be improved.
  • FIG. 12 is a flow chart of steps of a model training method according to Embodiment 5 of the present application. Specifically, the model training method provided in this embodiment includes the following steps:
  • Step 902 Encode a plurality of initial reference facial video frame samples and target facial video frame samples respectively to obtain facial video bit stream samples.
  • Step 904 decoding the facial video bitstream samples to obtain target compact feature samples of multiple initial reference facial video frame samples and target facial video frame samples.
  • Step 906 based on each reference facial video frame sample and the target compact feature sample, obtain a driving information sample corresponding to each reference facial video frame sample.
  • Step 908 Input the reference facial video frame samples and corresponding driving information samples into the initial second generation model to obtain initial reconstructed facial video frame samples respectively corresponding to the reference facial video frame samples.
  • Step 910 input each initial reconstructed facial video frame sample into the initial fusion model to obtain the fused facial video frame sample.
  • Step 912 Construct a loss function based on the fused facial video frame samples and target facial video frame samples, and perform training based on the loss function to obtain a trained second generation model and transition fusion model.
  • Step 914 Based on the target facial video frame sample and multiple initial reference facial video frame samples, based on the trained second generative model, retrain the transition fusion model to obtain the trained fusion model.
  • that is to say, the model training is divided into two stages. In the first stage, based on a plurality of initial reference facial video frame samples and target facial video frame samples, preliminary training is performed on the second generative model and the fusion model to obtain the trained second generative model and the transition fusion model. In the second stage, based on the multiple initial reference facial video frame samples, the target facial video frame samples and the trained second generative model, the transition fusion model is separately fine-tuned to obtain the final trained fusion model.
  • the advantage of the above model training is that the entire training task is divided into two different subtasks, which can reduce the difficulty of machine learning and improve the efficiency of model training compared with the method of training the second generation model and the fusion model at the same time.
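  • The two-stage strategy can be sketched as follows; the L1 reconstruction loss, the optimizers, learning rates and data handling are assumptions made for illustration and do not reflect the actual training configuration of the present application:

```python
# Two-stage training sketch: first train the second generative model and the
# (transition) fusion model jointly, then freeze the generator and fine-tune the
# fusion model alone.
import torch

def train_two_stages(second_generator, fusion_model, data_loader, num_epochs=1):
    loss_fn = torch.nn.L1Loss()

    # Stage 1: joint preliminary training.
    params = list(second_generator.parameters()) + list(fusion_model.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(num_epochs):
        for reference_frames, driving_infos, target_frame in data_loader:
            initial = [second_generator(r, d) for r, d in zip(reference_frames, driving_infos)]
            fused = fusion_model(initial)
            loss = loss_fn(fused, target_frame)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the trained second generative model, fine-tune only the fusion model.
    for p in second_generator.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(fusion_model.parameters(), lr=1e-5)
    for _ in range(num_epochs):
        for reference_frames, driving_infos, target_frame in data_loader:
            with torch.no_grad():
                initial = [second_generator(r, d) for r, d in zip(reference_frames, driving_infos)]
            fused = fusion_model(initial)
            loss = loss_fn(fused, target_frame)
            opt.zero_grad(); loss.backward(); opt.step()
    return second_generator, fusion_model
```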
  • the model training method in this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to: a server, a PC, and the like.
  • FIG. 13 is a structural block diagram of a facial video encoding device according to Embodiment 6 of the present application.
  • the facial video encoding device provided by the embodiment of the present application includes:
  • Facial video frame acquisition module 1002, used for obtaining the target facial video frame to be encoded and a plurality of initial reference facial video frames in the reference frame list;
  • the video stream obtaining module 1004 is used to encode a plurality of initial reference facial video frames and target facial video frames respectively to obtain a facial video bit stream.
  • the facial video encoding device also includes:
  • the difference value calculation module is used to calculate, after the facial video bit stream is obtained, the information difference value between the target facial video frame and each initial reference facial video frame; the information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in the initial reference facial video frame;
  • the reference frame list update module is used to add the target facial video frame as a new reference facial video frame to the reference frame list to update the reference frame list if there is an information difference value greater than the preset threshold;
  • the newly added reference facial video frame is used to encode other video frames to obtain the first facial video bit stream.
  • the difference value calculation module is specifically used for:
  • the mean square error is calculated to obtain the information difference value between the target facial video frame and the initial reference facial video frame in the reference frame list.
  • the difference value calculation module is specifically used for:
  • the mean square error is calculated to obtain the information difference value between the target facial video frame and the initial reference facial video frame.
  • the video stream obtaining module 1004 is specifically used for:
  • the target compact features and each initial reference facial video frame are encoded separately to obtain the facial video bitstream.
  • The facial video encoding device of this embodiment is used to implement the corresponding facial video encoding methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • For the function implementation of each module in the facial video encoding device of this embodiment, reference may be made to the descriptions of the corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 14 is a structural block diagram of a facial video decoding device according to Embodiment 7 of the present application.
  • the facial video decoding device provided by the embodiment of the present application includes:
  • a video stream acquisition module 1102, configured to acquire a facial video bitstream, where the facial video bitstream includes a plurality of encoded reference facial video frames and encoded compact feature information, and the encoded compact feature information represents key feature information of a target facial video frame to be reconstructed;
  • a first decoding module 1104, configured to decode the plurality of encoded reference facial video frames respectively, to obtain a plurality of reference facial video frames;
  • a second decoding module 1106, configured to decode the encoded compact feature information, to obtain a target compact feature of the target facial video frame;
  • a fused facial video frame obtaining module 1108, configured to perform facial video frame reconstruction based on the plurality of reference facial video frames and the target compact feature, to obtain a fused facial video frame corresponding to the target facial video frame.
  • Optionally, in some embodiments, the fused facial video frame obtaining module 1108 is specifically configured to:
  • perform facial video frame reconstruction based on each reference facial video frame and the target compact feature, respectively, to obtain an initial reconstructed facial video frame corresponding to each reference facial video frame;
  • perform fusion processing on the initial reconstructed facial video frames, to obtain the fused facial video frame corresponding to the target facial video frame.
  • Optionally, in some embodiments, when performing fusion processing on the initial reconstructed facial video frames to obtain the fused facial video frame corresponding to the target facial video frame, the fused facial video frame obtaining module 1108 is specifically configured to:
  • acquire a weight value corresponding to each initial reconstructed facial video frame; and based on the weight values, perform linear weighting on the initial reconstructed facial video frames, to obtain the fused facial video frame corresponding to the target facial video frame.
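The linear weighting step could look like the following sketch (illustrative only, assuming pixel-wise weighting); how the per-frame weight values are obtained is not specified here and is assumed to be provided by the caller.

```python
import numpy as np

def fuse_by_linear_weighting(initial_recons, weights):
    """Fuse per-reference initial reconstructions by a pixel-wise weighted sum."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # assumption: normalize the weights so they sum to 1
    stacked = np.stack([np.asarray(f, dtype=np.float64) for f in initial_recons], axis=0)
    # Weighted sum over the frame axis yields one fused frame of the same shape.
    return np.tensordot(w, stacked, axes=1)
```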
  • Optionally, in some embodiments, when performing fusion processing on the initial reconstructed facial video frames to obtain the fused facial video frame corresponding to the target facial video frame, the fused facial video frame obtaining module 1108 is specifically configured to:
  • input the initial reconstructed facial video frames into a fusion model, so that the fusion model outputs the fused facial video frame corresponding to the target facial video frame.
  • Optionally, in some embodiments, the fused facial video frame obtaining module 1108 is specifically configured to:
  • for each reference facial video frame, obtain, based on that reference facial video frame and the target compact feature, driving information corresponding to that reference facial video frame, where the driving information includes a motion estimation map between that reference facial video frame and the target facial video frame;
  • input each reference facial video frame and the driving information corresponding to each reference facial video frame into a first generation model, to obtain the fused facial video frame corresponding to the target facial video frame.
  • Optionally, in some embodiments, the fused facial video frame obtaining module 1108 is specifically configured to:
  • for each reference facial video frame, obtain, based on that reference facial video frame and the target compact feature, driving information corresponding to that reference facial video frame;
  • input each reference facial video frame and the corresponding driving information into a second generation model, to obtain initial reconstructed facial video frames respectively corresponding to the reference facial video frames;
  • input the initial reconstructed facial video frames into a fusion model, so that the fusion model outputs the fused facial video frame corresponding to the target facial video frame.
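The decoder-side pipeline based on the second generation model and the fusion model can be summarized with the sketch below; `derive_driving_info`, `second_generation_model`, and `fusion_model` are placeholders for components whose internal structure is left open here (e.g., a motion-estimation module and learned generation/fusion networks), so this is a structural illustration rather than an implementation.

```python
def reconstruct_fused_frame(ref_frames, target_compact,
                            derive_driving_info, second_generation_model, fusion_model):
    """Hypothetical sketch: each reference frame is generated toward the target
    using its own driving information, then the per-reference results are fused
    into a single reconstructed frame."""
    initial_recons = []
    for ref in ref_frames:
        driving = derive_driving_info(ref, target_compact)  # e.g., a motion estimation map
        initial_recons.append(second_generation_model(ref, driving))
    return fusion_model(initial_recons)
```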
  • The facial video decoding device of this embodiment is used to implement the corresponding facial video decoding methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • For the function implementation of each module in the facial video decoding device of this embodiment, reference may be made to the descriptions of the corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 15 is a structural block diagram of a model training device according to Embodiment 8 of the present application.
  • the model training device provided by the embodiment of the present application includes:
  • a video stream sample obtaining module 1202, configured to encode a plurality of initial reference facial video frame samples and a target facial video frame sample respectively, to obtain facial video bitstream samples;
  • a video stream sample decoding module 1204, configured to decode the encoded facial video bitstream samples, to obtain the plurality of initial reference facial video frame samples and a target compact feature sample of the target facial video frame sample;
  • a driving information sample obtaining module 1206, configured to obtain, based on each reference facial video frame sample and the target compact feature sample, a driving information sample corresponding to each reference facial video frame sample;
  • an initial reconstructed facial video frame sample obtaining module 1208, configured to input each reference facial video frame sample and the corresponding driving information sample into an initial second generation model, to obtain initial reconstructed facial video frame samples respectively corresponding to the reference facial video frame samples;
  • a fused facial video frame sample obtaining module 1210, configured to input the initial reconstructed facial video frame samples into an initial fusion model, to obtain a fused facial video frame sample;
  • a first training module 1212, configured to construct a loss function based on the fused facial video frame sample and the target facial video frame sample, and perform training based on the loss function, to obtain a trained second generation model and a transition fusion model;
  • a second training module 1214, configured to retrain the transition fusion model based on the target facial video frame sample, the plurality of initial reference facial video frame samples, and the trained second generation model, to obtain a trained fusion model.
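As a rough, non-authoritative sketch of the two-stage training described above, the code below assumes PyTorch-style models, a data loader yielding (reference samples, driving information samples, target sample), a reconstruction loss, and a fusion model that accepts a list of initial reconstructions; all of these, as well as the learning rates and epoch counts, are illustrative assumptions. Stage one trains the second generation model and the fusion model jointly; stage two freezes the generation model and fine-tunes the transition fusion model alone.

```python
import torch

def train_two_stage(generator, fusion, data_loader, recon_loss, stage_epochs=(10, 5)):
    # Stage 1: preliminary joint training of the second generation model and the fusion model.
    opt = torch.optim.Adam(list(generator.parameters()) + list(fusion.parameters()), lr=1e-4)
    for _ in range(stage_epochs[0]):
        for refs, drivings, target in data_loader:
            recons = [generator(r, d) for r, d in zip(refs, drivings)]
            loss = recon_loss(fusion(recons), target)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the trained generator and fine-tune the transition fusion model alone.
    for p in generator.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(fusion.parameters(), lr=1e-5)
    for _ in range(stage_epochs[1]):
        for refs, drivings, target in data_loader:
            with torch.no_grad():
                recons = [generator(r, d) for r, d in zip(refs, drivings)]
            loss = recon_loss(fusion(recons), target)
            opt.zero_grad(); loss.backward(); opt.step()
    return generator, fusion
```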
  • The model training device of this embodiment is used to implement the corresponding model training methods in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • For the function implementation of each module in the model training device of this embodiment, reference may be made to the descriptions of the corresponding parts in the foregoing method embodiments, and details are not repeated here.
  • FIG. 16 shows a schematic structural diagram of an electronic device according to Embodiment 9 of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the electronic device.
  • the conference terminal may include: a processor (processor) 1302, a communication interface (Communications Interface) 1304, a memory (memory) 1306, and a communication bus 1308.
  • the processor 1302, the communication interface 1304, and the memory 1306 communicate with each other through the communication bus 1308.
  • the communication interface 1304 is used for communicating with other electronic devices or servers.
  • The processor 1302 is configured to execute the program 1310, and may specifically execute the relevant steps in the above embodiments of the facial video encoding method, the facial video decoding method, the reference facial video frame generation method, or the model training method.
  • Specifically, the program 1310 may include program code, and the program code includes computer operation instructions.
  • the processor 1302 may be a CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the smart device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • The memory 1306 is used to store the program 1310.
  • The memory 1306 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
  • The program 1310 can specifically be used to make the processor 1302 perform the following operations: acquire a target facial video frame to be encoded and a plurality of initial reference facial video frames in a reference frame list; and encode the plurality of initial reference facial video frames and the target facial video frame respectively, to obtain a facial video bitstream.
  • Alternatively, the program 1310 can specifically be used to make the processor 1302 perform the following operations: acquire a facial video bitstream, where the facial video bitstream includes a plurality of encoded reference facial video frames and encoded compact feature information, and the encoded compact feature information represents key feature information of a target facial video frame to be reconstructed; decode the plurality of encoded reference facial video frames respectively, to obtain a plurality of reference facial video frames; decode the encoded compact feature information, to obtain a target compact feature of the target facial video frame; and perform facial video frame reconstruction based on the plurality of reference facial video frames and the target compact feature, to obtain a fused facial video frame corresponding to the target facial video frame.
  • Alternatively, the program 1310 can specifically be used to make the processor 1302 perform the following operations: acquire a target facial video frame and a plurality of initial reference facial video frames in a reference frame list; calculate an information difference value between the target facial video frame and each initial reference facial video frame, where the information difference value represents the degree of difference between the information contained in the target facial video frame and the information contained in each initial reference facial video frame; and if there is an information difference value greater than a preset threshold, delete the initial reference facial video frame corresponding to the maximum information difference value from the reference frame list, and add the target facial video frame to the reference frame list as a newly added reference facial video frame.
  • Alternatively, the program 1310 can specifically be used to make the processor 1302 perform the following operations: encode a plurality of initial reference facial video frame samples and a target facial video frame sample respectively, to obtain facial video bitstream samples; decode the facial video bitstream samples, to obtain the plurality of initial reference facial video frame samples and a target compact feature sample of the target facial video frame sample; obtain, based on each reference facial video frame sample and the target compact feature sample, a driving information sample corresponding to each reference facial video frame sample; input each reference facial video frame sample and the corresponding driving information sample into an initial second generation model, to obtain initial reconstructed facial video frame samples respectively corresponding to the reference facial video frame samples; input the initial reconstructed facial video frame samples into an initial fusion model, to obtain a fused facial video frame sample; construct a loss function based on the fused facial video frame sample and the target facial video frame sample, and perform training based on the loss function, to obtain a trained second generation model and a transition fusion model; and retrain the transition fusion model based on the target facial video frame sample, the plurality of initial reference facial video frame samples, and the trained second generation model, to obtain a trained fusion model.
  • For the specific implementation of each step in the program 1310, reference may be made to the corresponding steps and the corresponding descriptions in the units of the above embodiments of the facial video encoding method, the facial video decoding method, the reference facial video frame generation method, or the model training method, which will not be repeated here.
  • Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working process of the devices and modules described above, reference may be made to the corresponding process description in the foregoing method embodiments, and details are not repeated here.
  • Through the electronic device of this embodiment, the target facial video frame to be encoded is encoded and decoded based on a plurality of reference facial video frames, so as to obtain the fused reconstructed facial video frame corresponding to the target facial video frame.
  • For an existing reconstructed facial video frame obtained based on only a single reference facial video frame, its texture quality and motion information mainly depend on that single reference facial video frame; that is, during reconstruction, the single reference facial video frame limits the reconstruction of texture information and motion information. In the embodiments of the present application, the fused facial video frame is obtained based on multiple reference facial video frames, and its texture quality and motion information simultaneously refer to multiple different reference facial video frames; therefore, the quality difference between the reconstructed fused facial video frame and the target facial video frame is small.
  • In summary, the embodiments of the present application improve the reconstruction quality of facial video frames.
  • An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions instruct a computing device to perform the operations corresponding to any method in the foregoing method embodiments.
  • It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application can be divided into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps, to achieve the purpose of the embodiments of the present application.
  • The above methods according to the embodiments of the present application can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA).
  • It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the facial video encoding method, the facial video decoding method, the reference facial video frame generation method, or the model training method described herein is implemented.
  • In addition, when a general-purpose computer accesses code for implementing the facial video encoding method, the facial video decoding method, the reference facial video frame generation method, or the model training method shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for performing the corresponding method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

本申请实施例提供了一种面部视频编码方法、解码方法及装置。面部视频编码方法包括:获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;分别对多个初始参考面部视频帧和目标面部视频帧进行编码,得到面部视频比特流。本申请实施例中基于多个参考面部视频帧进行编码和解码操作,得到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小,提高了面部视频帧的重建质量。

Description

一种面部视频编码方法、解码方法及装置
本公开要求于2022年01月25日提交中国专利局、申请号为202210085777.7、申请名称为“一种面部视频编码方法、解码方法及装置”的中国专利申请以及于2022年01月25日提交中国专利局、申请号为202210085252.3、申请名称为“面部视频编码方法、解码方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本申请实施例涉及计算机技术领域,尤其涉及一种面部视频编码方法、解码方法及装置。
背景技术
随着视频编码技术不断发展,为了提高视频编码性能,出现了多种多样的视频编码算法。例如:传统的采用基于块的运动估计、离散余弦变换等方法进行的视频编码算法;基于深度学习的端到端视频编码算法,等等。
目前,现有视频编码算法,在进行面部视频编码及解码时,通常选择视频中的第一帧作为参考帧,对后续视频帧进行编码及解码,以得到对应的重建面部视频帧。但是,这样得到的重建面部视频帧,与待编码的原面部视频帧相比,其纹理质量较差,同时,运动描述的精度也较低。也就是说,现有的上述视频编码算法,得到的重建面部视频帧的质量较差。
发明内容
有鉴于此,本申请实施例提供一种面部视频编码方法、解码方法及装置,以至少部分解决上述问题。
根据本申请实施例的第一方面,提供了一种面部视频编码方法,包括:
获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;
分别对所述多个初始参考面部视频帧和所述目标面部视频帧进行编码,得到面部视频比特流。
根据本申请实施例的第二方面,提供了一种面部视频解码方法,包括:
获取面部视频比特流,所述面部视频比特流包括:多个编码后参考面部视频帧和编码后紧凑特征信息;所述编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;
分别解码所述多个编码后参考面部视频帧,得到多个参考面部视频帧;
解码所述编码后紧凑特征信息,得到所述目标面部视频帧的目标紧凑特征;
基于所述多个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与 所述目标面部视频帧对应的融合面部视频帧。
根据本申请实施例的第三方面,提供了一种参考面部视频帧生成方法,包括:
获取目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;
计算目标面部视频帧与各初始参考面部视频帧间的信息差异值,所述信息差异值表征目标面部视频帧中包含的信息与各初始参考面部视频帧中包含的信息之间的差异程度;
若存在大于预设阈值的信息差异值,则将目标面部视频帧作为新增参考面部视频帧添加至所述参考帧列表。
据本申请实施例的第四方面,提供了一种模型训练方法,包括:
分别对多个初始参考面部视频帧样本和目标面部视频帧样本进行编码,得到面部视频比特流样本;
解码所述面部视频比特流样本,得到所述多个初始参考面部视频帧样本和所述目标面部视频帧样本的目标紧凑特征样本;
基于每个参考面部视频帧样本和所述目标紧凑特征样本,得到每个参考面部视频帧样本对应的驱动信息样本;
将各参考面部视频帧样本和对应的各驱动信息样本输入初始第二生成模型,得到分别与各参考面部视频帧样本对应的初始重建面部视频帧样本;
将各初始重建面部视频帧样本输入初始融合模型,得到融合面部视频帧样本;
基于所述融合面部视频帧样本和所述目标面部视频帧样本构建损失函数,并基于所述损失函数进行训练,得到训练完成的第二生成模型和过渡融合模型;
基于所述目标面部视频帧样本和所述多个初始参考面部视频帧样本,基于训练完成的第二生成模型,再次训练所述过渡融合模型,得到训练完成的融合模型。
根据本申请实施例的第五方面,提供了一种电子设备,包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如第一方面所述的面部视频编码方法对应的操作,或者,如第二方面所述的面部视频解码方法对应的操作,或者,如第三方面所述的参考面部视频帧生成方法对应的操作,或者,如第四方面所述的模型训练方法对应的操作。
根据本申请实施例的第六方面,提供了一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现如第一方面所述的面部视频编码方法,或者,如第二方面所述的面部视频解码方法,或者,如第三方面所述的参考面部视频帧生成方法,或者,如第四方面所述的模型训练方法。
根据本申请实施例的第七方面,提供了一种计算机程序产品,包括计算机指令,所述计算机指令指示计算设备执行如第一方面所述的面部视频编码方法对应的操作,或者,如第二方面所述的面部视频解码方法对应的操作,或者,如第三方面所述的参考面部视频帧生成方法对应的操作,或者,如第四方面所述的模型训练方法对应的操作。
根据本申请实施例提供的面部视频编码方法、解码方法及装置,是基于多个参考面部视频帧,对待编码的目标面部视频帧进行编码和解码操作,从而得到与目标面部视频帧对应的融合重构面部视频帧的。对于现有的仅基于单个参考面部视频帧得到的重建面部视频帧,其纹理质量和运动信息主要依赖于该单个参考面部视频帧,也就是说,在重建过程中,该单个参考面部视频帧限制了纹理信息和运动信息的重建,而本申请实施例中基于多个参考面部视频帧得到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小。综上,本申请实施例提高了面部视频帧的重建质量。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请实施例中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。
图1为基于深度视频生成的编解码方法的框架示意图;
图2为根据本申请实施例一的一种面部视频编码方法的步骤流程图;
图3为图2所示实施例中的一种场景示例的示意图;
图4为根据本申请实施例一的一种面部视频编码方法的步骤流程图;
图5为根据本申请实施例二的一种面部视频解码方法的步骤流程图;
图6为本申请实施例二对应的场景示意图;
图7为根据本申请实施例三的一种面部视频解码方法的步骤流程图;
图8为本申请实施例三对应的场景示意图
图9为根据本申请实施例三得到融合面部视频帧的一种流程示意图;
图10为根据本申请实施例三得到融合面部视频帧的另一种流程示意图;
图11为根据本申请实施例四的一种参考面部视频帧生成方法的流程示意图;
图12为根据本申请实施例五的一种模型训练方法的步骤流程图;
图13为根据本申请实施例六的一种面部视频编码装置的结构框图;
图14为根据本申请实施例七的一种面部视频解码装置的结构框图;
图15为根据本申请实施例八的一种模型训练装置的结构框图;
图16为根据本申请实施例九的一种电子设备的结构示意图。
具体实施方式
为了使本领域的人员更好地理解本申请实施例中的技术方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请实施例一部分实施例,而不是全部的实施例。基于本申请实施例中的 实施例,本领域普通技术人员所获得的所有其他实施例,都应当属于本申请实施例保护的范围。
参见图1,图1为基于深度视频生成的编解码方法的框架示意图。该方法的主要原理是基于待编码帧的运动对参考帧进行形变,以得到待编码帧对应的重建帧。下面结合图1对基于深度视频生成的编解码方法的基本框架进行说明:
第一步,编码阶段,编码器采用关键点提取器提取待编码的目标面部视频帧的目标关键点信息,并对目标关键点信息编码;同时,采用传统的图像编码方法(如VVC、HEVC等)对参考面部视频帧进行编码。
第二步,解码阶段,解码器中的运动估计模块,通过关键点提取器提取参考面部视频帧的参考关键点信息;并基于参考关键点信息和目标关键点信息进行稠密运动估计,得到稠密运动估计图和遮挡图,其中,稠密运动估计图表征关键点信息表征的特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系;遮挡图表征目标面部视频帧中各像素点被遮挡的程度。
第三步,解码阶段,解码器中的生成模块基于稠密运动估计图对参考面部视频帧进行形变处理,得到形变处理结果,再将形变处理结果与遮挡图相乘,从而输出重建面部视频帧。
图1所示方法中,基于单个参考帧对后续视频帧进行编解码,以得到对应的重建面部视频帧。但是,这样得到的重建面部视频帧,其纹理质量和运动信息主要依赖于该单个参考面部视频帧,也就是说,在重建过程中,该单个参考面部视频帧限制了纹理信息和运动信息的重建,使得重建面部视频帧与待编码的原面部视频帧相比,其纹理质量较差,同时,运动描述的精度也较低。也就是说,得到的重建面部视频帧的质量较差。
本申请实施例中,基于多个参考面部视频帧得到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小,可以提高面部视频帧的重建质量。
下面结合本申请实施例附图进一步说明本申请实施例具体实现。
实施例一
参照图2,图2为根据本申请实施例一的一种面部视频编码方法的步骤流程图。具体地,本实施例提供的面部视频编码方法包括以下步骤:
步骤202,获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧。
本申请实施例中,对于初始参考面部视频帧的具体设定方式,以及数据均不做限定,可以根据实际情况选择。
例如:在视频会议或者视频直播等低延迟场景下,初始参考面部视频帧可以为面部 视频中任意的,时间戳早于目标面部视频帧的视频帧;也可以为按照预设的选择规则,从面部视频帧中选择的视频帧,等等。如:可以为面部视频中前预设数量帧的面部视频帧。
步骤204,分别对多个初始参考面部视频帧和目标面部视频帧进行编码,得到面部视频比特流。
具体地,就每个初始参考面部视频帧而言,可以采用相对较小的量化失真进行编码,编码过程保留该初始参考面部视频帧的完整数据,例如:可以采用通用视频编码(VVC)的方式,对初始参考面部视频帧进行编码。
针对目标面部视频帧而言,为了降低传输码率,可以先对目标面部视频帧进行特征提取,得到目标面部视频帧的目标紧凑特征;再对目标紧凑特征进行编码,其中,目标紧凑特征表征目标面部视频帧中的关键特征信息,例如:面部五官位置信息、姿态信息以及表情信息等等。进一步地:可以借助机器学习模型对目标面部视频帧进行特征提取,从而得到目标紧凑特征。
参见图3,图3为本申请实施例一对应的场景示意图,以下,将参考图3所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
分别获取目标面部视频帧a,以及,参考帧列表中的多个初始参考面部视频帧:A1、A2、A3、A4(本申请实施例中,对于初始参考面部视频帧的数量不做限定,图3中仅以4个为例进行说明,并不构成对本申请实施例的限定);分别对A1、A2、A3、A4进行编码,另外,也对a进行编码(例如,可以对a进行紧凑特征提取,从而对紧凑特征进行编码),从而得到面部视频比特流。
进一步的,在本申请一些实施例中,在得到面部视频比特流之后,方法还包括:
分别计算目标面部视频帧与各初始参考面部视频帧间的信息差异值;信息差异值表征目标面部视频帧中包含的信息与初始参考面部视频帧中包含的信息之间的差异程度;若存在大于预设阈值的信息差异值,则将目标面部视频帧作为新增参考面部视频帧添加至参考帧列表,以更新参考帧列表;所述新增参考面部视频帧用于对其它视频帧进行编码,得到第一面部视频比特流。
具体地,若存在大于预设阈值的信息差异值,则可以将目标面部视频帧作为新增参考面部视频帧添加至参考帧列表,同时将最大信息差异值对应的初始参考面部视频帧删除,从而得到更新后的参考帧列表,以对当前的目标面部视频帧之后的面部视频帧进行编码及解码操作,得到对应的重构面部视频帧。
本申请实施例中,对于计算上述信息差异值时所采用的具体计算方式不做限定。例如:可以基于目标面部视频帧中各个像素点的像素值,以及初始参考面部视频帧中各个像素点的像素值,进行均方误差计算,从而得到像素级的信息差异值;也可以先分别对目标面部视频帧和初始参考面部视频帧进行特征提取,得到能够表征目标面部视频帧的关键特征信息的特征,以及,能够表征初始参考面部视频帧的关键特征信息的特征,然 后基于上述两种特征,进行均方误差计算,从而得到特征域的信息差异值,等等。
若信息差异值大于预设阈值,则表明目标面部视频帧与初始参考面部视频帧之间的信息差异较大,此时,若还以初始参考面部视频帧作为参考帧,对目标面部视频帧进行编解码操作,得到的重构面部视频帧与目标面部视频帧之间的信息差异较大,面部视频帧的重建质量难以保证。
因此,本申请实施例中,在确定出目标面部视频帧与初始参考面部视频帧之间的信息差异较大的情况下,则将目标面部视频帧作为新增参考面部视频帧,以对其它视频帧进行编码,得到第一面部视频比特流。这样可以减少因初始参考面部视频帧与待编码的面部视频帧之间的信息差异较大时,基于初始参考面部视频帧对目标面部视频帧进行编解码操作而导致的面部视频重建质量差的问题,提升了面部视频重建的质量,进而得到了较高质量的重建面部视频帧。
具体地,可以通过如下方式,针对作为新增参考面部视频帧的目标面部视频帧,其可以通过如下方式编码:
将新增参考面部视频帧作为关键帧,对新增参考面部视频帧进行独立编码,以相对较小的量化失真对新增参考面部视频帧进行编码,编码过程保留新增参考面部视频帧的完整数据。对应地,解码时,仅需新增参考面部视频帧本身即可完成解码过程。
后续,在对当前面部视频之后的其他面部视频帧编码时,则可以基于上述新增参考面部视频帧(当前面部视频)进行,以提升编码质量,进而得到较高质量的重建面部视频帧。
本申请实施例,在对目标面部视频帧进行编码的过程中,并不是直接基于参考帧列表中的初始参考面部视频帧间对目标面部视频帧进行编解码操作,而是先计算目标面部视频帧与初始参考面部视频帧间之间的信息差异值,若信息差异值较大(大于预设阈值),则不再基于初始参考面部视频帧对目标面部视频帧进行编解码操作,而是,将目标面部视频帧作为新的(新增)参考面部视频帧,以对其他视频帧进行编码,得到第一面部视频比特流。本申请实施例中,是基于目标面部视频帧与初始参考面部视频帧间之间的信息差异程度,确定编解码过程中实际采用的参考面部视频帧的,当差异程度较大时,则将目标面部视频帧作为新的参考面部视频帧,而不再依赖于初始参考面部视频帧进行编解码,因此,可以提升面部视频编解码的质量,进而得到较高质量的重建面部视频帧。
本申请实施例中,是对多个参考面部视频帧以及目标面部视频帧进行编码得到面部视频比特流的,以便解码端在获取到面部视频比特流之后,可以基于多个参考面部视频帧,对目标面部视频帧进行解码,进而得到目标面部视频帧对应的融合重构面部视频帧的。对于现有的仅基于单个参考面部视频帧得到的重建面部视频帧,其纹理质量和运动信息主要依赖于该单个参考面部视频帧,也就是说,在重建过程中,该单个参考面部视频帧限制了纹理信息和运动信息的重建,而本申请实施例中基于多个参考面部视频帧得 到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小。
本申请实施例一提供的面部视频编码方法,可以由视频编码端(编码器)执行,用于对面部视频文件进行编码,以实现对面部视频文件的数字宽带进行压缩。其可以适用与多种不同的场景,如:常规的涉及面部的视频游戏的存储和流式传输,具体地:可以通过本申请实施例提供的面部视频编码方法对游戏视频帧进行编码,形成对应的视频码流,以在视频流服务或者其他类似的应用中存储和传输;又如:视频会议、视频直播等低延时场景,具体地:可以通过本申请实施例提供的面部视频编码方法对视频采集设备采集到的面部视频数据进行编码,形成对应的视频码流,并发送至会议终端,通过会议终端对视频码流进行解码从而得到对应的面部视频画面;还如:虚拟现实场景,可以通过本申请实施例提供的面部视频编码方法对视频采集设备采集到的面部视频数据进行编码,形成对应的视频码流,并发送至虚拟现实相关设备(如VR虚拟眼镜等),通过VR设备对视频码流进行解码从而得到对应的面部视频画面,并基于面部视频画面实现对应的VR功能,等等。
一些实施例中,参考帧列表中包含的参考面部视频帧的数量是恒定的(可以根据实际情况进行设定),因此,当将目标面部视频帧确定为新增参考面部视频帧时,则需要对参考帧列表进行更新,以得到更新后的参考帧列表,从而进行后续面部视频帧的编码及解码操作。
具体地,若初始参考面部视频帧的数量为1个,则参考帧列表的更新方法可以为:
将新增参考面部视频帧添加至参考帧列表,并删除初始参考面部视频帧,得到更新后参考帧列表。
若初始参考面部视频帧的数量为多个,则参考帧列表的更新方法可以为:将新增参考面部视频帧添加至参考帧列表,并从所有初始参考面部视频帧中删除一个初始参考面部视频帧,得到更新后参考帧列表。具体地,可以根据各初始参考面部视频与新增参考面部视频帧之间的信息差异值,进行初始参考面部视频帧的删除,例如:将信息差异值最大的初始参考面部视频删除,也可以根据各初始参考面部视频的时间戳,将时间戳最早的初始参考面部视频帧删除,等等。
若所有信息差异值中存在小于或者等于预设阈值的候选信息差异值,则基于初始参考面部视频帧,对目标面部视频帧进行编码得到第二面部视频比特流。
本申请实施例中,对于基于初始参考面部视频帧,对目标面部视频帧进行编码得到第二面部视频比特流,以及,基于初始参考面部视频帧对第二面部视频比特流进行解码的具体方式不做限定,可以采用任意现有的借助参考帧进行当前帧编解码的方法,此处不再赘述。
在本实施例中,在对目标面部视频帧进行编码的过程中,先计算目标面部视频帧与初始参考面部视频帧间之间的信息差异值,若信息差异值较大(大于预设阈值),则将 目标面部视频帧作为新的(新增)参考面部视频帧进行编解码操作,从而得到目标面部视频帧对应的重构面部视频帧;若信息差异值较小,则基于初始参考面部视频帧,对目标面部视频帧进行编解码操作,从而得到对应的重建面部视频帧。因此,可以避免当初始参考面部视频帧与目标面部视频帧之间的信息差异较大时,仍基于初始参考面部视频帧对目标面部视频帧进行编解码操作而导致的面部视频重建质量差的问题,提升了面部视频重建的质量,进而得到了较高质量的重建面部视频帧。
本实施例的面部视频编码方法可以由任意适当的具有数据能力的电子设备执行,包括但不限于:服务器、PC机等。
参照图4,图4为根据本申请实施例一的一种面部视频编码方法的步骤流程图。初始参考面部视频帧的数量可以预先根据实际需要进行设定,此处对于具体的数量值不做限定。本实施例提供的面部视频编码方法包括以下步骤:
步骤402,分别计算待编码的目标面部视频帧与参考帧列表中各初始参考面部视频帧间的信息差异值。若所有信息差异值中存在小于或者等于预设阈值的候选信息差异值,则执行步骤404;若所有信息差异值均大于预设阈值,则执行步骤408。
进一步地,可以采用如下两种不同的方式,计算目标面部视频帧与初始参考面部视频帧间的信息差异值:
第一种方式:基于待编码的目标面部视频帧中各像素点的像素值和所述初始参考面部视频帧中各像素点的像素值,进行均方误差(Mean-Square Error,MSE)计算,得到目标面部视频帧与初始参考面部视频帧间的信息差异值。
第二种方式:对待编码的目标面部视频帧进行特征提取,得到目标面部视频帧的紧凑特征;对所述参考帧列表中的初始参考面部视频帧进行特征提取,得到所述初始参考面部视频帧的紧凑特征;基于所述目标面部视频帧的紧凑特征和所述初始参考面部视频帧的紧凑特征,进行均方误差计算,得到目标面部视频帧与参考帧列表中的初始参考面部视频帧间的信息差异值。其中,所述目标面部视频帧的紧凑特征表征所述目标面部视频帧中的关键特征信息;所述初始参考面部视频帧的紧凑特征表征初始参考面部视频帧中的关键特征信息。
进一步地,在对面部视频帧进行特征提起时,可以基于深度学习模型进行,具体地:可以将所述待编码的目标面部视频帧输入预先训练完成的特征提取模型,以使所述特征提取模型输出所述目标面部视频帧的紧凑特征;并且,将所述参考帧列表中的初始参考面部视频帧输入所述特征提取模型,以使所述特征提取模型输出所述初始参考面部视频帧的紧凑特征。
本申请实施例中,对于特征提取模型的具体结构和参数不做限定,可以根据实际情况设定。
对比计算信息差异值的上述两种不同方式,第一种方式是从像素级角度出发,进行 信息差异值计算的,因此,计算结果的准确度更高。第二种方式是从提取到的特征的角度出发,进行信息差异值计算的,即从特征域的角度出发进行信息差异值计算的,由于特征是从面部视频帧中提取出来的,是对面部视频帧进行了下采样操作,因此,计算量较小,计算效率较高。
步骤404,从候选信息差异值中确定数值最小的信息差异值为目标信息差异值。
步骤406,基于目标信息差异值对应的初始参考面部视频帧,对目标面部视频帧进行编码得到第二面部视频比特流。至此,编码流程结束。
本申请实施例中,当小于或者等于预设阈值的候选信息差异值有多个时,为了提高面部视频编解码的质量,可以从多个初始参考面部视频帧中,选择最小信息差异值(目标信息差异值)对应的初始参考面部视频帧,作为实际的参考帧,进而基于该实际的参考帧,对目标面部视频帧进行编码得到第二面部视频比特流。
本申请实施例中,对于基于目标信息差异值对应的初始参考面部视频帧,对目标面部视频帧进行编码得到第二面部视频比特流,以及,基于目标信息差异值对应的初始参考面部视频帧对第二面部视频比特流进行解码的具体方式不做限定,可以采用任意现有的借助参考帧进行当前帧编解码的方法,此处不再赘述。
步骤408,将目标面部视频帧作为新增参考面部视频帧,新增参考面部视频帧用于对其它视频帧进行编码,得到第一面部视频比特流。
若所有信息差异值均大于预设阈值,则表明目标面部视频帧与各初始参考面部视频帧之间的信息差异均较大,此时,以任意一个初始参考面部视频帧作为参考帧,对目标面部视频帧进行编解码操作,得到的重建面部视频帧与目标面部视频帧之间的信息差异较大,面部视频帧的重建质量难以保证。
因此,本申请实施例中,在确定出目标面部视频帧与所有初始参考面部视频帧之间的信息差异均较大的情况下,则将目标面部视频帧作为新增参考面部视频帧以对其它视频帧编码得到第一面部视频比特流。这样可以提升了面部视频重建的质量,进而得到了较高质量的重建面部视频帧。
步骤410,将新增参考面部视频帧添加至参考帧列表,并删除时间戳最早的初始参考面部视频帧,得到更新后参考帧列表。
在对参考帧列表中的初始参考面部视频帧进行更新之后,在对当前面部视频之后的其它面部视频帧编码时,则可以基于上述更新后参考帧列表中的参考面部视频帧进行,以提升编码质量,进而得到较高质量的重建面部视频帧。
本申请实施例中,在对目标面部视频帧进行编码的过程中,先计算目标面部视频帧与各初始参考面部视频帧间之间的信息差异值,若信息差异值均较大(大于预设阈值),则将目标面部视频帧作为新的(新增)参考面部视频帧进行编解码操作;若存在较小的信息差异值,则基于最小差异值对应的初始参考面部视频帧,对目标面部视频帧进行编解码操作,从而得到对应的重建面部视频帧。因此,可以避免当初始参考面部视频帧与 目标面部视频帧之间的信息差异均较大时,仍基于初始参考面部视频帧对目标面部视频帧进行编解码操作而导致的面部视频重建质量差的问题;另外,当存在较小的信息差异值,是基于最小差异值对应的初始参考面部视频帧对目标面部视频帧进行编解码操作的,由于最小差异值对应的初始参考面部视频帧与前面部视频帧之间的差异最小,因此,基于该差异最小的初始参考面部视频帧进行后续的编解码操作,面部视频重建的质量较高,进而得到的重建面部视频帧的质量也较高。
实施例二
参照图5,图5为根据本申请实施例二的一种面部视频解码方法的步骤流程图。具体地,本实施例提供的面部视频解码方法包括以下步骤:
步骤502,获取面部视频比特流。
步骤504,解码面部视频比特流,获取待编码的目标面部视频帧的目标驱动信息和表征目标面部视频是否为新增参考面部视频帧的目标标识信息。
目标驱动信息可以为表征目标面部视频帧关键特征信息的信息,其中,关键特征信息可以包括:面部五官位置信息、姿态信息以及表情信息等等。本申请实施例中对于目标驱动信息的具体形式不做限定,例如:可以是通过面部关键点提取器提取到的显式表示的关键点信息,也可以是通过特征提取模型提取到的隐式表示的数据量更小,且表征信息更丰富的紧凑特征矩阵(向量),等等。
本步骤中的目标标识信息,为编码端在对目标面部视频帧进行编码的过程中生成并发送至解码端的。编码端可以基于目标面部视频帧和参考帧列表中的初始参考面部视频帧间的信息差异值确定目标面部视频帧是否为新增参考面部视频帧,具体地,若信息差异值均大于预设阈值,则确定目标面部视频帧为新增参考面部视频帧,该新增参考面部视频帧可用于对其它视频帧编码;若存在小于或者等于预设阈值的信息差异值,则确定目标面部视频帧为非新增参考面部视频帧,可以基于参考帧列表中的初始参考面部视频帧对目标面部视频帧进行编码,得到面部视频比特流并发送至解码端。
步骤506,若目标面部视频帧为非新增参考面部视频帧,分别获取参考帧列表中的多个参考面部视频帧的参考驱动信息。
参考驱动信息可以为表征参考面部视频帧关键特征信息的信息,其中,关键特征信息可以包括:面部五官位置信息、姿态信息以及表情信息等等。本申请实施例中对于参考驱动信息的具体形式不做限定,例如:可以是通过面部关键点提取器提取到的显式表示的关键点信息,也可以是通过特征提取模型提取到的隐式表示的数据量更小,且表征信息更丰富的紧凑特征矩阵(向量),等等。
当目标面部视频帧为非新增参考面部视频帧时,表明是基于参考帧列表中的初始参考面部视频帧对目标面部视频帧进行编码的,由于参考帧列表中包括多个初始参考面部视频帧,因此,需要从中确定实际的参考面部视频帧:目标参考面部视频帧。
另外,当目标面部视频帧为新增参考面部视频帧时,表明是对目标面部视频帧进行独立编码的,即以相对较小的量化失真对目标面部视频帧进行编码,编码过程保留目标面部视频帧的完整数据。对应地,解码时,仅需采用与编码方式对应的解码方式进行解码即可。
步骤508,基于目标驱动信息和各参考驱动信息,计算目标面部视频帧与各参考面部视频帧间的信息差异值。
其中,信息差异值表征目标面部视频帧中包含的信息与参考面部视频帧中包含的信息之间的差异程度。
进一步地,本申请实施例中,可以通过如下方式计算信息差异值:
基于目标驱动信息和各参考驱动信息,进行均方误差计算,得到目标面部视频帧与各参考面部视频帧间的信息差异值。
步骤510,将最小信息差异值对应的参考面部视频帧作为目标参考面部视频帧,基于目标参考面部视频帧和目标驱动信息,得到重建面部视频帧。
具体地,本申请实施例中,对于得到重建面部视频帧的具体解码形式不做限定,可以基于图4所示实施例中步骤406得到面部视频比特流时所采用的编码方式,采用对应的解码方式进行解码,从而得到重建面部视频帧。
参见图6,图6为本申请实施例二对应的场景示意图,以下,将参考图6所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
获取面部视频比特流;对面部视频比特流解码,得到目标面部视频帧的目标驱动信息D,以及目标标识信息,其中该目标标识信息表征目标面部视频帧为非新增参考面部视频帧;获取参考帧列表中的各参考面部视频帧的参考驱动信息:D1、D2以及D3(参考帧列表中包含的参考面部视频帧的数量可以为大于1的任意整数,图6中仅以3个参考面部视频帧进行举例说明);基于D、D1、D2以及D3,计算目标面部视频帧与各参考面部视频帧间的信息差异值d1(D与D1的差值)、d2(D与D2的差值)以及d3(D与D3的差值),将最小信息差异值d1对应的参考面部视频帧D1作为目标参考面部视频帧,基于目标参考面部视频帧和目标驱动信息,得到重建面部视频帧。
本申请实施例中,在解码的过程中,若确定目标面部视频帧为非新增参考面部视频帧,则基于目标面部视频帧的目标驱动信息和各参考面部视频帧的参考驱动信息,计算出目标面部视频帧与各参考面部视频帧间的信息差异值,进而从取参考帧列表的多个参考面部视频帧中,将最小信息差异值对应的参考面部视频帧作为目标参考面部视频帧,以得到重建面部视频帧。由于最终确定出的目标参考面部视频帧是与目标面部视频帧的信息差异最小的参考面部视频帧,因此,基于上述目标参考面部视频帧进行面部视频帧重建,得到的重建面部视频帧的质量也最高,提高了面部视频帧的重建质量。
实施例三、
参照图7,图7为根据本申请实施例三的一种面部视频解码方法的步骤流程图。具体地,本实施例提供的面部视频解码方法包括以下步骤:
步骤702,获取面部视频比特流,面部视频比特流包括:多个编码后参考面部视频帧和编码后紧凑特征信息。
其中,编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息。
本申请实施例中,编码后紧凑特征信息对应为对目标面部视频帧进行特征提取得到的用于表征关键特征信息的紧凑特征信息,也可以对应为:相邻的目标面部视频帧的目标紧凑特征之间的差值。
步骤704,分别解码多个编码后参考面部视频帧,得到多个参考面部视频帧。
步骤706,解码编码后紧凑特征信息,得到目标面部视频帧的目标紧凑特征。
步骤708,基于多个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧。
具体地,本步骤中,可以针对每个参考面部视频帧,结合目标紧凑特征,分别进行面部视频帧重建,从而得到每个参考面部视频帧对应的初始重建面部视频帧,再对各初始重建面部视频帧进行融合,得到最终的融合面部视频帧;也可以,同时基于多个参考面部视频帧和参考面部视频帧,得到最终的融合面部视频帧。
本申请实施例中,对于得到融合面部视频帧时所采用的具体方式不做限定。
见图8,图8为本申请实施例三对应的场景示意图,以下,将参考图8所示的示意图,以一个具体场景示例,对本申请实施例进行说明:
获取由多个编码后参考面部视频帧和编码后紧凑特征信息组成的面部视频比特流;对编码后参考面部视频帧进行解码,分别得到多个参考面部视频帧:a1、a2、a3、a4(本申请实施例中,对于参考面部视频帧的数量不做限定,图8中仅以4个为例进行说明,并不构成对本申请实施例的限定),以及,目标紧凑特征;最后,再基于多个参考面部视频帧a1、a2、a3、a4,以及,目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧a0。
进一步的,在本申请一些实施例中,基于多个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧,可以包括:
分别基于每个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与每个参考面部视频帧对应的初始重建面部视频帧;对各初始重建面部视频帧进行融合处理,得到与目标面部视频帧对应的融合面部视频帧。
具体地,本申请实施例中,对于得到初始重建面部视频帧的具体过程不做限定。
例如,可以为:对每个参考面部视频帧进行特征提取,得到参考紧凑特征;再基于参考紧凑特征和目标紧凑特征进行稀疏运动估计,得到稀疏运动估计图;基于稀疏运动估计图,对参考面部视频帧进行形变处理,从而得到目标面部视频帧对应的初始重构面部视频帧。其中,稀疏运动估计图表征在预设的稀疏特征域中,目标面部视频帧与参考 面部视频帧之间的相对运动关系。
另外,在对各初始重建面部视频帧进行融合处理时,也可以采用不同的融合方式:
在本申请一些实施例中,可以先获取各初始重建面部视频帧对应的权重值;再基于各权重值,对各初始重建面部视频帧进行线性加权处理,得到与目标面部视频帧对应的融合面部视频帧。具体地,可以是基于各权重值,对各初始重建面部视频帧对应像素点的像素值进行线性加权处理,从而得到与目标面部视频帧对应的融合面部视频帧。
在本申请一些实施例中,还可以借助机器学习模型进行融合处理,具体地:
将各初始重建面部视频帧输入融合模型,以使融合模型输出与目标面部视频帧对应的融合面部视频帧。
本申请实施例中,对于融合模型的具体结构和参考均不做限定,可以根据实际需要自行设定。例如:融合模型可以为基于卷积层和广义除法归一化层组合而成的U-Net网络;也可以为基于多个下采样模块和多个对应地上采样模块组成的沙漏模型,等等。
进一步的,在本申请一些实施例中,基于多个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧,可以包括:
针对每个参考面部视频帧,基于该参考面部视频帧和目标紧凑特征,得到该参考面部视频帧对应的驱动信息,驱动信息包括:该参考面部视频帧和目标面部视频帧之间的运动估计图;将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第一生成模型,得到与目标面部视频帧对应的融合面部视频帧。
其中,运动估计图表征参考面部视频帧和目标面部视频帧之间相对运动关系。运动估计图可以包括稀疏运动图和稠密运动图,其中,稀疏运动图表征在预设的稀疏特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系;稠密运动图表征在预设的稠密特征域中,目标面部视频帧与参考面部视频帧之间的相对运动关系。
进一步地,为了提高面部视频帧重建的质量,驱动信息还可以包括目标面部视频帧中各像素点被遮挡程度的遮挡图,等等。
参见图9,图9为根据本申请实施例三得到融合面部视频帧的一种流程示意图,该图中以2个参考面部视频帧进行举例,并不构成对本申请实施例的限制。具体的:基于参考面部视频帧1和目标紧凑特征,得到参考面部视频帧1对应的驱动信息1;基于参考面部视频帧2和目标紧凑特征,得到参考面部视频帧2对应的驱动信息2;再将参考面部视频帧1、2、驱动信息1、2,输入第一生成模型,从而得到融合面部视频帧。
进一步的,在本申请另一些实施例中,基于多个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧,还可以包括:
针对每个参考面部视频帧,基于该参考面部视频帧和目标紧凑特征,得到该参考面部视频帧对应的驱动信息;将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第二生成模型,得到分别与各参考面部视频帧对应的初始重建面部视频帧;将各初始重建面部视频帧输入融合模型,以使融合模型输出与目标面部视频帧对应的融合面部视 频帧。
参见图10,图10为根据本申请实施例三得到融合面部视频帧的另一种流程示意图,该图中还以2个参考面部视频帧进行举例。具体的:基于参考面部视频帧1和目标紧凑特征,得到参考面部视频帧1对应的驱动信息1;基于参考面部视频帧2和目标紧凑特征,得到参考面部视频帧2对应的驱动信息2;将参考面部视频帧1和驱动信息1输入第二生成模型,得到初始重建面部视频帧1,将参考面部视频帧2和驱动信息2输入第二生成模型,得到初始重建面部视频帧2;再将初始重建面部视频帧1和初始重建面部视频帧2输入融合模型,最终得到融合面部视频帧。
本申请实施例中,是基于多个参考面部视频帧,对待编码的目标面部视频帧进行编码和解码操作,从而得到与目标面部视频帧对应的融合重构面部视频帧的。对于现有的仅基于单个参考面部视频帧得到的重建面部视频帧,其纹理质量和运动信息主要依赖于该单个参考面部视频帧,也就是说,在重建过程中,该单个参考面部视频帧限制了纹理信息和运动信息的重建,而本申请实施例中基于多个参考面部视频帧得到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小。综上,本申请实施例提高了面部视频帧的重建质量。
本实施例的面部视频解码方法可以由任意适当的具有数据能力的电子设备执行,包括但不限于:服务器、PC机等。
实施例四
参照图11,图11为根据本申请实施例四的一种参考面部视频帧生成方法的步骤流程图。具体地,本实施例提供的参考面部视频帧生成方法包括以下步骤:
步骤802,获取目标面部视频帧和参考帧列表中的多个初始参考面部视频帧。
本申请实施例中,对于初始参考面部视频帧的具体设定方式,以及数据均不做限定,可以根据实际情况选择。
例如:在视频会议或者视频直播等低延迟场景下,初始参考面部视频帧可以为面部视频中任意的,时间戳早于目标面部视频帧的视频帧;也可以为按照预设的选择规则,从面部视频帧中选择的视频帧,等等。如:可以为面部视频中前预设数量帧的面部视频帧。
步骤804,计算目标面部视频帧与各初始参考面部视频帧间的信息差异值。
面部信息差异值表征目标面部视频帧中包含的信息与各初始参考面部视频帧中包含的信息之间的差异程度。
本申请实施例中,对于计算上述信息差异值时所采用的具体计算方式不做限定。例如:可以基于目标面部视频帧中各个像素点的像素值,以及初始参考面部视频帧中各个像素点的像素值,进行均方误差计算,从而得到像素级的信息差异值;也可以先分别对 目标面部视频帧和初始参考面部视频帧进行特征提取,得到能够表征目标面部视频帧的关键特征信息的特征,以及,能够表征初始参考面部视频帧的关键特征信息的特征,然后基于上述两种特征,进行均方误差计算,从而得到特征域的信息差异值,等等。
步骤806,若存在大于预设阈值的信息差异值,则将目标面部视频帧作为新增参考面部视频帧添加至面部参考帧列表。
具体地,面部参考帧列表中包含的参考面部视频帧的数量通常是恒定的(可以根据实际情况进行设定),因此,当将目标面部视频帧确定为新增参考面部视频帧时,则需要对参考帧列表进行更新,以得到更新后的参考帧列表,从而进行后续面部视频帧的编码及解码操作。
具体地,参考帧列表的更新方法可以为:将新增参考面部视频帧添加至参考帧列表,并从所有初始参考面部视频帧中删除一个初始参考面部视频帧,得到更新后参考帧列表。具体地,可以根据各初始参考面部视频与新增参考面部视频帧之间的信息差异值,进行初始参考面部视频帧的删除,例如:将信息差异值最大的初始参考面部视频删除,也可以根据各初始参考面部视频的时间戳,将时间戳最早的初始参考面部视频帧删除,等等。
本申请实施例提供的参考面部视频帧生成,根据目标面部视频帧与各初始参考面部视频帧间的信息差异值,当存在信息差异值较大(大于预设阈值)的初始参考面部视频帧时,则表明目标面部视频帧与某个或者某几个初始参考面部视频帧之间的信息差异较大,也就是说,随着时间变化,面部视频帧中的信息已经发生了较大迁移(改变),再基信息差异值较大的初始参考面部视频帧对后续面部视频帧进行编解码,得到的重建面部视频帧与目标面部视频帧间的信息差异可能较大。因此,本申请中,将目标面部视频帧作为新增参考面部视频帧参考帧列表中,得到了更新后的参考帧列表,以使得基于更新后的参考帧列表进行后续面部视频帧重建时,可以提升重建视频帧的质量。
实施例五
参照图12,图12为根据本申请实施例五的一种模型训练方法的步骤流程图。具体地,本实施例提供的模型训练方法包括以下步骤:
步骤902,分别对多个初始参考面部视频帧样本和目标面部视频帧样本进行编码,得到面部视频比特流样本。
步骤904,解码面部视频比特流样本,得到多个初始参考面部视频帧样本和目标面部视频帧样本的目标紧凑特征样本。
步骤906,基于每个参考面部视频帧样本和目标紧凑特征样本,得到每个参考面部视频帧样本对应的驱动信息样本。
步骤908,将各参考面部视频帧样本和对应的各驱动信息样本输入初始第二生成模型,得到分别与各参考面部视频帧样本对应的初始重建面部视频帧样本。
步骤910,将各初始重建面部视频帧样本输入初始融合模型,得到融合面部视频帧 样本。
步骤912,基于融合面部视频帧样本和目标面部视频帧样本构建损失函数,并基于损失函数进行训练,得到训练完成的第二生成模型和过渡融合模型。
步骤914,基于目标面部视频帧样本和多个初始参考面部视频帧样本,基于训练完成的第二生成模型,再次训练过渡融合模型,得到训练完成的融合模型。
本申请实施例,将模型训练分为两个阶段,其中,第一个阶段:先基于多个初始参考面部视频帧样本和目标面部视频帧样本,对第二生成模型和融合模型进行初步训练,得到训练完成的第二生成模型和过渡融合模型;第二个阶段:再基于多个初始参考面部视频帧样本、目标面部视频帧样本以及训练完成的第二生成模型,单独对过渡融合模型进行精调,以得到最终训练完成的融合模型。上述模型训练的好处在于:将整个训练任务分为两个不同的子任务,与同时训练第二生成模型和融合模型的方式相比,可以降低机器学习的难度,提高模型训练的效率。
本实施例的模型训练方法可以由任意适当的具有数据能力的电子设备执行,包括但不限于:服务器、PC机等。
实施例六
参见图13,图13为根据本申请实施例五的一种面部视频编码装置的结构框图。本申请实施例提供的面部视频编码装置包括:
面部视频帧获取模块1002,用于获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;
视频流得到模块1004,用于分别对多个初始参考面部视频帧和目标面部视频帧进行编码,得到面部视频比特流。
可选地,在其中一些实施例中,面部视频编码装置还包括:
差异值计算模块,用于在得到面部视频比特流之后,分别计算目标面部视频帧与各初始参考面部视频帧间的信息差异值;信息差异值表征目标面部视频帧中包含的信息与初始参考面部视频帧中包含的信息之间的差异程度;
参考帧列表更新模块,用于若存在大于预设阈值的信息差异值,则将目标面部视频帧作为新增参考面部视频帧添加至参考帧列表,以更新参考帧列表;所述新增参考面部视频帧用于对其它视频帧进行编码,得到第一面部视频比特流。
可选地,在其中一些实施例中,差异值计算模块,具体用于:
基于所述目标面部视频帧中各像素点的像素值和所述初始参考面部视频帧中各像素点的像素值,进行均方误差计算,得到所述目标面部视频帧与所述参考帧列表中的初始参考面部视频帧间的信息差异值。
可选地,在其中一些实施例中,差异值计算模块,具体用于:
对所述目标面部视频帧进行特征提取,得到所述目标面部视频帧的紧凑特征,所述 目标面部视频帧的紧凑特征表征所述目标面部视频帧中的关键特征信息;
对所述初始参考面部视频帧进行特征提取,得到所述初始参考面部视频帧的紧凑特征,所述初始参考面部视频帧的紧凑特征表征初始参考面部视频帧中的关键特征信息;
基于所述目标面部视频帧的紧凑特征和所述初始参考面部视频帧的紧凑特征,进行均方误差计算,得到所述目标面部视频帧与所述初始参考面部视频帧间的信息差异值。
可选地,在其中一些实施例中,视频流得到模块1004,具体用于:
若所有信息差异值中存在小于或者等于所述预设阈值的候选信息差异值,则基于所述候选信息差异值对应的初始参考面部视频帧,对所述目标面部视频帧进行编码得到第二面部视频比特流。
可选地,在其中一些实施例中,若所述候选信息差异值的数量为多个,视频流得到模块1004,具体用于:
从所述候选信息差异值中确定数值最小的信息差异值为目标信息差异值;基于所述目标信息差异值对应的初始参考面部视频帧,对所述目标面部视频帧进行编码得到第二面部视频比特流。
可选地,在其中一些实施例中,视频流得到模块1004,具体用于:
对目标面部视频帧进行特征提取,得到目标面部视频帧的目标紧凑特征;
分别对目标紧凑特征以及各初始参考面部视频帧进行编码,得到面部视频比特流。
本实施例的面部视频编码装置用于实现前述多个方法实施例中相应的面部视频编码方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的面部视频编码装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例七
参见图14,图14为根据本申请实施例七的一种面部视频解码装置的结构框图。本申请实施例提供的面部视频解码装置包括:
视频流获取模块1102,用于获取面部视频比特流,面部视频比特流包括:多个编码后参考面部视频帧和编码后紧凑特征信息;编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;
第一解码模块1104,用于分别解码多个编码后参考面部视频帧,得到多个参考面部视频帧;
第二解码模块1106,用于解码编码后紧凑特征信息,得到目标面部视频帧的目标紧凑特征;
融合面部视频帧得到模块1108,用于基于多个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与目标面部视频帧对应的融合面部视频帧。
可选地,在其中一些实施例中,融合面部视频帧得到模块1108,具体用于:
分别基于每个参考面部视频帧和目标紧凑特征,进行面部视频帧重建,得到与每个参考面部视频帧对应的初始重建面部视频帧;
对各初始重建面部视频帧进行融合处理,得到与目标面部视频帧对应的融合面部视频帧。
可选地,在其中一些实施例中,融合面部视频帧得到模块1108,在执行对各初始重建面部视频帧进行融合处理,得到与目标面部视频帧对应的融合面部视频帧的步骤时,具体用于:
获取各初始重建面部视频帧对应的权重值;
基于各权重值,对各初始重建面部视频帧进行线性加权处理,得到与目标面部视频帧对应的融合面部视频帧。
可选地,在其中一些实施例中,融合面部视频帧得到模块1108,在执行对各初始重建面部视频帧进行融合处理,得到与目标面部视频帧对应的融合面部视频帧的步骤时,具体用于:
将各初始重建面部视频帧输入融合模型,以使融合模型输出与目标面部视频帧对应的融合面部视频帧。
可选地,在其中一些实施例中,融合面部视频帧得到模块1108,具体用于:
针对每个参考面部视频帧,基于该参考面部视频帧和目标紧凑特征,得到该参考面部视频帧对应的驱动信息,驱动信息包括:该参考面部视频帧和目标面部视频帧之间的运动估计图;
将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第一生成模型,得到与目标面部视频帧对应的融合面部视频帧。
可选地,在其中一些实施例中,融合面部视频帧得到模块1108,具体用于:
针对每个参考面部视频帧,基于该参考面部视频帧和目标紧凑特征,得到该参考面部视频帧对应的驱动信息;
将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第二生成模型,得到分别与各参考面部视频帧对应的初始重建面部视频帧;
将各初始重建面部视频帧输入融合模型,以使融合模型输出与目标面部视频帧对应的融合面部视频帧。
本实施例的面部视频解码装置用于实现前述多个方法实施例中相应的面部视频解码方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的面部视频解码装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例八
参见图15,图15为根据本申请实施例八的一种模型训练装置的结构框图。本申请 实施例提供的模型训练装置包括:
视频流样本得到模块1202,用于分别对多个初始参考面部视频帧样本和目标面部视频帧样本进行编码,得到面部视频比特流样本;
视频流样本解码模块1204,用于解码编码后面部视频流样本,得到多个初始参考面部视频帧样本和目标面部视频帧样本的目标紧凑特征样本;
驱动信息样本得到模块1206,用于基于每个参考面部视频帧样本和目标紧凑特征样本,得到每个参考面部视频帧样本对应的驱动信息样本;
初始重建面部视频帧样本得到模块1208,用于将各参考面部视频帧样本和对应的各驱动信息样本输入初始第二生成模型,得到分别与各参考面部视频帧样本对应的初始重建面部视频帧样本;
融合面部视频帧样本得到模块1210,用于将各初始重建面部视频帧样本输入初始融合模型,得到融合面部视频帧样本;
第一训练模块1212,用于基于融合面部视频帧样本和目标面部视频帧样本构建损失函数,并基于损失函数进行训练,得到训练完成的第二生成模型和过渡融合模型;
第二训练模块1214,用于基于目标面部视频帧样本和多个初始参考面部视频帧样本,基于训练完成的第二生成模型,再次训练过渡融合模型,得到训练完成的融合模型。
上述各步骤的具体执行过程,可以参考上述实施例一或者实施例二中的对应步骤,此处不再赘述。
本实施例的模型训练装置用于实现前述多个方法实施例中相应的模型训练方法,并具有相应的方法实施例的有益效果,在此不再赘述。此外,本实施例的模型训练装置中的各个模块的功能实现均可参照前述方法实施例中的相应部分的描述,在此亦不再赘述。
实施例九
参照图16,示出了根据本申请实施例九的一种电子设备的结构示意图,本申请具体实施例并不对电子设备的具体实现做限定。
如图16所示,该会议终端可以包括:处理器(processor)1302、通信接口(Communications Interface)1304、存储器(memory)1306、以及通信总线1308。
其中:
处理器1302、通信接口1304、以及存储器1306通过通信总线1308完成相互间的通信。
通信接口1304,用于与其它电子设备或服务器进行通信。
处理器1302,用于执行程序1310,具体可以执行上述面部视频编码方法,或者,面部视频解码方法,或者,参考面部视频帧生成方法,或者,模型训练方法实施例中的相关步骤。
具体地,程序1310可以包括程序代码,该程序代码包括计算机操作指令。
处理器1302可能是CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本申请实施例的一个或多个集成电路。智能设备包括的一个或多个处理器,可以是同一类型的处理器,如一个或多个CPU;也可以是不同类型的处理器,如一个或多个CPU以及一个或多个ASIC。
存储器1306,用于存放程序1110。存储器1106可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
程序1310具体可以用于使得处理器1302执行以下操作:获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;分别对面部多个初始参考面部视频帧和面部目标面部视频帧进行编码,得到面部视频比特流。
或者,
程序1310具体可以用于使得处理器1302执行以下操作:获取面部视频比特流,面部面部视频比特流包括:多个编码后参考面部视频帧和编码后紧凑特征信息;面部编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;分别解码面部多个编码后参考面部视频帧,得到多个参考面部视频帧;解码面部编码后紧凑特征信息,得到面部目标面部视频帧的目标紧凑特征;基于面部多个参考面部视频帧和面部目标紧凑特征,进行面部视频帧重建,得到与面部目标面部视频帧对应的融合面部视频帧。
或者,
程序1310具体可以用于使得处理器1302执行以下操作:获取目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;计算目标面部视频帧与各初始参考面部视频帧间的信息差异值,面部信息差异值表征目标面部视频帧中包含的信息与各初始参考面部视频帧中包含的信息之间的差异程度;若存在大于预设阈值的信息差异值,则将最大信息差异值对应的初始参考面部视频帧从面部参考帧列表中删除,并将目标面部视频帧作为新增参考面部视频帧添加至面部参考帧列表。
或者,
程序1310具体可以用于使得处理器1302执行以下操作:分别对多个初始参考面部视频帧样本和目标面部视频帧样本进行编码,得到面部视频比特流样本;解码面部面部视频比特流样本,得到面部多个初始参考面部视频帧样本和面部目标面部视频帧样本的目标紧凑特征样本;基于每个参考面部视频帧样本和面部目标紧凑特征样本,得到每个参考面部视频帧样本对应的驱动信息样本;将各参考面部视频帧样本和对应的各驱动信息样本输入初始第二生成模型,得到分别与各参考面部视频帧样本对应的初始重建面部视频帧样本;将各初始重建面部视频帧样本输入初始融合模型,得到融合面部视频帧样本;基于面部融合面部视频帧样本和面部目标面部视频帧样本构建损失函数,并基于面部损失函数进行训练,得到训练完成的第二生成模型和过渡融合模型;基于面部目标面部视频帧样本和面部多个初始参考面部视频帧样本,基于训练完成的第二生成模型,再次训练面部过渡融合模型,得到训练完成的融合模型。
程序1310中各步骤的具体实现可以参见上述面部视频编码方法,或者,面部视频解码方法,或者,参考面部视频帧生成方法,或者,模型训练方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,是基于多个参考面部视频帧,对待编码的目标面部视频帧进行编码和解码操作,从而得到与目标面部视频帧对应的融合重构面部视频帧的。对于现有的仅基于单个参考面部视频帧得到的重建面部视频帧,其纹理质量和运动信息主要依赖于该单个参考面部视频帧,也就是说,在重建过程中,该单个参考面部视频帧限制了纹理信息和运动信息的重建,而本申请实施例中基于多个参考面部视频帧得到融合面部视频帧,其纹理质量和运动信息则同时参考了多个不同的参考面部视频帧,因此,重建得到的融合面部视频帧与目标面部视频帧之间的质量差异较小。综上,本申请实施例提高了面部视频帧的重建质量。
本申请实施例还提供了一种计算机程序产品,包括计算机指令,该计算机指令指示计算设备执行上述多个方法实施例中的任一方法对应的操作。
需要指出,根据实施的需要,可将本申请实施例中描述的各个部件/步骤拆分为更多部件/步骤,也可将两个或多个部件/步骤或者部件/步骤的部分操作组合成新的部件/步骤,以实现本申请实施例的目的。
上述根据本申请实施例的方法可在硬件、固件中实现,或者被实现为可存储在记录介质(诸如CD ROM、RAM、软盘、硬盘或磁光盘)中的软件或计算机代码,或者被实现通过网络下载的原始存储在远程记录介质或非暂时机器可读介质中并将被存储在本地记录介质中的计算机代码,从而在此描述的方法可被存储在使用通用计算机、专用处理器或者可编程或专用硬件(诸如ASIC或FPGA)的记录介质上的这样的软件处理。可以理解,计算机、处理器、微处理器控制器或可编程硬件包括可存储或接收软件或计算机代码的存储组件(例如,RAM、ROM、闪存等),当软件或计算机代码被计算机、处理器或硬件访问且执行时,实现在此描述的面部视频编码方法,或者,面部视频解码方法,或者,参考面部视频帧生成方法,或者,模型训练方法。此外,当通用计算机访问用于实现在此示出的面部视频编码方法,或者,面部视频解码方法,或者,参考面部视频帧生成方法,或者,模型训练方法的代码时,代码的执行将通用计算机转换为用于执行在此示出的面部视频编码方法,或者,面部视频解码方法,或者,参考面部视频帧生成方法,或者,模型训练方法的专用计算机。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及方法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术 人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
以上实施方式仅用于说明本申请实施例,而并非对本申请实施例的限制,有关技术领域的普通技术人员,在不脱离本申请实施例的精神和范围的情况下,还可以做出各种变化和变型,因此所有等同的技术方案也属于本申请实施例的范畴,本申请实施例的专利保护范围应由权利要求限定。

Claims (19)

  1. 一种面部视频编码方法,包括:
    获取待编码的目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;
    分别对所述多个初始参考面部视频帧和所述目标面部视频帧进行编码,得到面部视频比特流。
  2. 根据权利要求1所述的方法,其中,在所述得到面部视频比特流之后,所述方法还包括:
    分别计算所述目标面部视频帧与各初始参考面部视频帧间的信息差异值;所述信息差异值表征所述目标面部视频帧中包含的信息与所述初始参考面部视频帧中包含的信息之间的差异程度;
    若存在大于预设阈值的信息差异值,则将所述目标面部视频帧作为新增参考面部视频帧添加至所述参考帧列表,以更新所述参考帧列表;所述新增参考面部视频帧用于对其它视频帧进行编码,得到第一面部视频比特流。
  3. 根据权利要求2所述的方法,其中,所述计算目标面部视频帧与参初始参考面部视频帧间的信息差异值,包括:
    基于目标面部视频帧中各像素点的像素值和所述初始参考面部视频帧中各像素点的像素值,进行均方误差计算,得到目标面部视频帧与所述参考帧列表中的初始参考面部视频帧间的信息差异值。
  4. 根据权利要求2所述的方法,其中,所述计算目标面部视频帧与初始参考面部视频帧间的信息差异值,包括:
    对所述目标面部视频帧进行特征提取,得到所述目标面部视频帧的紧凑特征,所述目标面部视频帧的紧凑特征表征所述目标面部视频帧中的关键特征信息;
    对初始参考面部视频帧进行特征提取,得到所述初始参考面部视频帧的紧凑特征,所述初始参考面部视频帧的紧凑特征表征初始参考面部视频帧中的关键特征信息;
    基于所述目标面部视频帧的紧凑特征和所述初始参考面部视频帧的紧凑特征,进行均方误差计算,得到所述目标面部视频帧与所述初始参考面部视频帧间的信息差异值。
  5. 根据权利要求2所述的方法,其中,所述分别对所述多个初始参考面部视频帧和所述目标面部视频帧进行编码,得到面部视频比特流,包括:
    若所有信息差异值中存在小于或者等于所述预设阈值的候选信息差异值,则基于所述候选信息差异值对应的初始参考面部视频帧,对所述目标面部视频帧进行编码得到第二面部视频比特流。
  6. 根据权利要求5所述的方法,其中,若所述候选信息差异值的数量为多个,所述基于所述候选信息差异值对应的初始参考面部视频帧,对所述目标面部视频帧进行编码得到第二面部视频比特流,包括:
    从所述候选信息差异值中确定数值最小的信息差异值为目标信息差异值;
    基于所述目标信息差异值对应的初始参考面部视频帧,对所述目标面部视频帧进行编码得到第二面部视频比特流。
  7. 根据权利要求1所述的方法,其中,所述分别对所述多个初始参考面部视频帧和所述目标面部视频帧进行编码,得到面部视频比特流,包括:
    对所述目标面部视频帧进行特征提取,得到所述目标面部视频帧的目标紧凑特征;
    分别对所述目标紧凑特征以及各初始参考面部视频帧进行编码,得到面部视频比特流。
  8. 一种面部视频解码方法,包括:
    获取面部视频比特流,所述面部视频比特流包括:多个编码后参考面部视频帧和编码后紧凑特征信息;所述编码后紧凑特征信息表征待重建的目标面部视频帧的关键特征信息;
    分别解码所述多个编码后参考面部视频帧,得到多个参考面部视频帧;
    解码所述编码后紧凑特征信息,得到所述目标面部视频帧的目标紧凑特征;
    基于所述多个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与所述目标面部视频帧对应的融合面部视频帧。
  9. 根据权利要求8所述的方法,其中,所述基于所述多个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与所述目标面部视频帧对应的融合面部视频帧,包括:
    分别基于每个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与每个参考面部视频帧对应的初始重建面部视频帧;
    对各初始重建面部视频帧进行融合处理,得到与所述目标面部视频帧对应的融合面部视频帧。
  10. 根据权利要求9所述的方法,其中,所述对各初始重建面部视频帧进行融合处理,得到与所述目标面部视频帧对应的融合面部视频帧,包括:
    获取各初始重建面部视频帧对应的权重值;
    基于各权重值,对各初始重建面部视频帧进行线性加权处理,得到与所述目标面部视频帧对应的融合面部视频帧。
  11. 根据权利要求9所述的方法,其中,所述对各初始重建面部视频帧进行融合处理,得到与所述目标面部视频帧对应的融合面部视频帧,包括:
    将各初始重建面部视频帧输入融合模型,以使所述融合模型输出与所述目标面部视频帧对应的融合面部视频帧。
  12. 根据权利要求8所述的方法,其中,所述基于所述多个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与所述目标面部视频帧对应的融合面部视频帧,包括:
    针对每个参考面部视频帧,基于该参考面部视频帧和所述目标紧凑特征,得到该参 考面部视频帧对应的驱动信息,所述驱动信息包括:该参考面部视频帧和所述目标面部视频帧之间的运动估计图;
    将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第一生成模型,得到与所述目标面部视频帧对应的融合面部视频帧。
  13. 根据权利要求8所述的方法,其中,所述基于所述多个参考面部视频帧和所述目标紧凑特征,进行面部视频帧重建,得到与所述目标面部视频帧对应的融合面部视频帧,包括:
    针对每个参考面部视频帧,基于该参考面部视频帧和所述目标紧凑特征,得到该参考面部视频帧对应的驱动信息;
    将各参考面部视频帧和各参考面部视频帧对应的驱动信息输入第二生成模型,得到分别与各参考面部视频帧对应的初始重建面部视频帧;
    将各初始重建面部视频帧输入融合模型,以使所述融合模型输出与所述目标面部视频帧对应的融合面部视频帧。
  14. 一种参考面部视频帧生成方法,包括:
    获取目标面部视频帧和参考帧列表中的多个初始参考面部视频帧;
    计算目标面部视频帧与各初始参考面部视频帧间的信息差异值,所述信息差异值表征目标面部视频帧中包含的信息与各初始参考面部视频帧中包含的信息之间的差异程度;
    若存在大于预设阈值的信息差异值,则将目标面部视频帧作为新增参考面部视频帧添加至所述参考帧列表。
  15. 一种模型训练方法,包括:
    分别对多个初始参考面部视频帧样本和目标面部视频帧样本进行编码,得到面部视频比特流样本;
    解码所述面部视频比特流样本,得到所述多个初始参考面部视频帧样本和所述目标面部视频帧样本的目标紧凑特征样本;
    基于每个参考面部视频帧样本和所述目标紧凑特征样本,得到每个参考面部视频帧样本对应的驱动信息样本;
    将各参考面部视频帧样本和对应的各驱动信息样本输入初始第二生成模型,得到分别与各参考面部视频帧样本对应的初始重建面部视频帧样本;
    将各初始重建面部视频帧样本输入初始融合模型,得到融合面部视频帧样本;
    基于所述融合面部视频帧样本和所述目标面部视频帧样本构建损失函数,并基于所述损失函数进行训练,得到训练完成的第二生成模型和过渡融合模型;
    基于所述目标面部视频帧样本和所述多个初始参考面部视频帧样本,基于训练完成的第二生成模型,再次训练所述过渡融合模型,得到训练完成的融合模型。
  16. 一种面部视频编码装置,包括:
    面部视频帧获取模块,用于获取待编码的目标面部视频帧和参考帧列表中的多个初 始参考面部视频帧;
    编码面部视频帧得到模块,用于分别对所述多个初始参考面部视频帧和所述目标面部视频帧进行编码,得到面部视频比特流。
  17. 一种电子设备,包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;
    所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如权利要求1-7中任一项所述的面部视频编码方法对应的操作,或者,如权利要求8-13中任一项所述的面部视频解码方法对应的操作,或者,如权利要求14所述的参考面部视频帧生成方法对应的操作,或者,如权利要求15中所述的模型训练方法对应的操作。
  18. 一种计算机存储介质,其中,所述计算机存储介质上存储有计算机程序,当所述计算机程序被处理器执行时,实现如权利要求1-14中任一项所述的方法。
  19. 一种计算机程序产品,包括计算机指令,其中,所述计算机指令指示计算机设备执行如权利要求1-14中任一所述的方法对应的操作。
PCT/CN2023/073013 2022-01-25 2023-01-18 一种面部视频编码方法、解码方法及装置 WO2023143331A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210085777.7A CN114401406A (zh) 2022-01-25 2022-01-25 一种面部视频编码方法、解码方法及装置
CN202210085252.3A CN114205585A (zh) 2022-01-25 2022-01-25 面部视频编码方法、解码方法及装置
CN202210085777.7 2022-01-25
CN202210085252.3 2022-01-25

Publications (1)

Publication Number Publication Date
WO2023143331A1 true WO2023143331A1 (zh) 2023-08-03

Family

ID=87470763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073013 WO2023143331A1 (zh) 2022-01-25 2023-01-18 一种面部视频编码方法、解码方法及装置

Country Status (1)

Country Link
WO (1) WO2023143331A1 (zh)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060120457A1 (en) * 2004-12-06 2006-06-08 Park Seung W Method and apparatus for encoding and decoding video signal for preventing decoding error propagation
CN104768011A (zh) * 2015-03-31 2015-07-08 浙江大学 图像编解码方法和相关装置
CN110876065A (zh) * 2018-08-29 2020-03-10 华为技术有限公司 候选运动信息列表的构建方法、帧间预测方法及装置
US20210409685A1 (en) * 2019-09-27 2021-12-30 Tencent Technology (Shenzhen) Company Limited Video encoding method, video decoding method, and related apparatuses
CN110830808A (zh) * 2019-11-29 2020-02-21 合肥图鸭信息科技有限公司 一种视频帧重构方法、装置及终端设备
CN113132735A (zh) * 2019-12-30 2021-07-16 北京大学 一种基于视频帧生成的视频编码方法
CN113573063A (zh) * 2021-06-16 2021-10-29 百果园技术(新加坡)有限公司 视频编解码方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23746212

Country of ref document: EP

Kind code of ref document: A1