CN115546575A - Training method of driving model, driving method, device, readable medium and equipment - Google Patents

Training method of driving model, driving method, device, readable medium and equipment

Info

Publication number
CN115546575A
Authority
CN
China
Prior art keywords
model
mouth shape
preset
driving
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210709551.XA
Other languages
Chinese (zh)
Inventor
郑燊
庞昊洲
李宏亮
唐迪
蒋昊
温翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210709551.XA
Publication of CN115546575A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure relates to a training method for a driving model, a driving method, an apparatus, a readable medium and a device. For each video frame in a preset video sample, data of the video frame is input into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame; the first preset mouth shape driving model is trained according to the virtual image and the data of the video frame to obtain a first driving sub-model; a first preset mouth shape parameter is input into a preset image rendering model to obtain a first rendering image; a second preset mouth shape driving model is trained according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model; and a first mouth shape driving model is determined according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining, according to a preset virtual image and target audio to be recognized, a target mouth shape parameter corresponding to the target audio.

Description

Training method of driving model, driving method, device, readable medium and equipment
Technical Field
The present disclosure relates to the field of mouth shape driving, and in particular, to a training method for a driving model, a driving method, an apparatus, a readable medium, and a device.
Background
In current game production or in virtual-object application scenarios in virtual spaces, a corresponding mouth shape animation needs to be produced according to target audio, so that mouth shape driving can be performed on a preset virtual object according to the mouth shape animation.
In the related art, 3D face capture data can be collected through camera devices arranged in different orientations, a model is then trained end to end on the 3D face capture data, and mouth shape parameters corresponding to different audio are output by the trained model, so that mouth shape driving is performed on a preset virtual object according to the mouth shape parameters. However, a model trained on 3D face capture data has poor robustness and cannot meet the requirements of multi-language, multi-speaker scenarios.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a training method of a driving model, the method comprising:
for each video frame in a preset video sample, inputting data of the video frame into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame;
training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
acquiring a first preset mouth shape parameter, and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image;
training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
and determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
In a second aspect, the present disclosure provides a method of driving a virtual object, the method comprising:
acquiring target audio to be identified;
inputting a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is obtained by training with the method provided by the first aspect of the present disclosure.
In a third aspect, the present disclosure provides a training apparatus for driving a model, the apparatus comprising:
the first training module is used for inputting data of the video frames into a first preset mouth shape driving model aiming at each video frame in a preset video sample to obtain a virtual image corresponding to the video frame; training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
the second training module is used for acquiring a first preset mouth shape parameter and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image; training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
the first determining module is used for determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be identified so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
In a fourth aspect, the present disclosure provides an apparatus for driving a virtual object, the apparatus comprising:
the acquisition module is used for acquiring target audio to be identified;
the second determining module is used for inputting a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
the mouth shape driving module is used for carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is trained by the training apparatus provided in the third aspect of the present disclosure.
In a fifth aspect, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing device, performs the steps of the method of the first or second aspect of the disclosure.
In a sixth aspect, an electronic device is provided, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first or second aspect of the disclosure.
According to the technical solution, for each video frame in a preset video sample, data of the video frame is input into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame; the first preset mouth shape driving model is trained according to the virtual image and the data of the video frame to obtain a first driving sub-model; a first preset mouth shape parameter is acquired and input into a preset image rendering model to obtain a first rendering image; a second preset mouth shape driving model is trained according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model; and a first mouth shape driving model is determined according to the first driving sub-model and the second driving sub-model. The first mouth shape driving model is used for determining, according to a preset virtual image and target audio to be recognized, a target mouth shape parameter corresponding to the target audio, so as to perform mouth shape driving on a virtual object according to the target mouth shape parameter, the preset virtual image being the image of the virtual object to be driven. In this way, the first mouth shape driving model is obtained by performing model training on 2D video data.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method of training a driver model in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a process of training a first driver submodel in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a process for training a second driver submodel in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another method of training a driver model in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method for driving a virtual object in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating another method of driving a virtual object in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating a training apparatus for driving a model in accordance with an exemplary embodiment;
FIG. 8 is a block diagram of a training apparatus for a driving model according to the embodiment shown in FIG. 7;
FIG. 9 is a block diagram illustrating an apparatus for driving a virtual object in accordance with an exemplary embodiment;
FIG. 10 is a block diagram of a driving apparatus of a virtual object according to the embodiment shown in FIG. 9;
fig. 11 is a block diagram illustrating a configuration of an electronic device according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and examples of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It should be noted that references to "a" or "an" in this disclosure are illustrative rather than limiting, and those skilled in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
All actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
It is understood that before the technical solutions disclosed in the embodiments of the present disclosure are used, the type, the use range, the use scene, etc. of the personal information related to the present disclosure should be informed to the user and authorized by the user in a proper manner according to the relevant laws and regulations.
For example, in response to receiving an active request from a user, a prompt message is sent to the user to explicitly inform the user that the requested operation will require the acquisition and use of the user's personal information. The user can then decide, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving the user's active request, the prompt information may be sent to the user, for example, in a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It is to be understood that the above-described notification and user authorization process is merely exemplary and is not intended to limit the disclosed implementations, and that other ways of satisfying relevant legal regulations may be implemented in the disclosed implementations.
At the same time, it is understood that the data involved in the present disclosure (including but not limited to the data itself, the acquisition or use of the data) should comply with the requirements of the corresponding laws and regulations and related regulations.
The method is mainly applied to scenarios in which a preset virtual object (for example, a preset game character or a preset virtual character) is mouth-shape driven according to audio to be recognized. For example, in current game production, artists need to manually produce mouth shape animations frame by frame according to the pronunciation of each sentence, which is time-consuming: a 3-second piece of audio generally costs an artist about one day of work. Beyond the field of games, scenarios such as virtual-object live streaming are increasingly entering daily life, and these scenarios also require mouth shape driving of preset virtual objects.
To improve the efficiency of mouth shape driving, in the related art, artists can manually produce a mouth shape animation for each phoneme, an ASR (Automatic Speech Recognition) algorithm is then used to recognize the phoneme sequence of a sentence, and the art resources are spliced according to the phoneme sequence. However, producing the animation for each phoneme still takes artists a long time, ASR models corresponding to different languages need to be trained, the model training and deployment costs are high, and the splicing method has poor continuity, which makes the mouth shapes look monotonous.
In another related technology, 3D face capture data may be acquired by cameras arranged in different orientations, a model is then trained end to end on the 3D face capture data, and mouth shape parameters corresponding to different audio are output by the trained model so as to perform mouth shape driving on a preset virtual object according to the mouth shape parameters. However, acquiring 3D face capture data requires a large number of cameras preset in different orientations, the data acquisition cost is high, and the amount of acquired data is small. A model trained on a small amount of 3D face capture data has poor robustness, cannot meet the requirements of multi-language, multi-speaker scenarios, and the accuracy of mouth shape driving based on such a model needs to be improved.
To solve the existing problems, the present disclosure provides a training method for a driving model, a driving method, an apparatus, a readable medium, and a device, in which model training can be performed using video data from 2D preset video samples to obtain a first mouth shape driving model. Because 2D video data can be easily obtained from open-source video databases and is available in large quantities, model training based on large-scale 2D video data can learn richer emotion information and yields a more robust model that adapts to multi-language, multi-speaker scenarios, thereby improving the versatility of the model. Performing mouth shape driving based on a first mouth shape driving model with better versatility also improves the accuracy and efficiency of mouth shape driving.
Meanwhile, because the first mouth shape driving model is obtained by training on 2D video data and the target mouth shape driving parameters are then determined by the first mouth shape driving model to perform mouth shape driving, no manual participation is needed in the whole process and no ASR model is required, so the labor cost and the training and deployment costs of the model can be reduced.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart illustrating a method of training a driving model according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:
in step S101, for each video frame in a preset video sample, data of the video frame is input into a first preset mouth shape driving model, so as to obtain a virtual image corresponding to the video frame.
The preset video sample can be obtained from an open-source video database; for example, a large number of real-world videos of talking faces exist, and video data containing voice information can be extracted from them and used as the preset video sample for training the first mouth shape driving model. The data of the video frame includes the image frame and the audio data of the video frame, and the first preset mouth shape driving model is a preset deep learning model.
Exemplarily, fig. 2 is a schematic diagram illustrating a process of training to obtain a first driving sub-model according to an exemplary embodiment. As shown in fig. 2, the first preset mouth shape driving model includes an audio feature extraction model, an image feature extraction model and a picture generation model. Therefore, when the data of the video frame is input into the first preset mouth shape driving model to obtain the virtual image corresponding to the video frame, the audio data of the video frame may be input into the audio feature extraction model to extract an audio feature, the image frame of the video frame may be input into the image feature extraction model to extract an image feature, and then the audio feature and the image feature may be input into the picture generation model to obtain the virtual image corresponding to the video frame.
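By way of a non-limiting illustration, the following Python (PyTorch) sketch shows one possible way to organize the three sub-networks described above. The class name, layer choices, feature dimensions and tensor shapes are assumptions made for readability and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class FirstPresetMouthShapeDrivingModel(nn.Module):
    """Illustrative sketch only: an audio feature extraction model, an image
    feature extraction model and a picture generation model, fused to produce
    a virtual image. Layer sizes and shapes are assumptions."""

    def __init__(self, audio_dim=80, feat_dim=128):
        super().__init__()
        # Audio feature extraction model: one audio feature vector per frame.
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Image feature extraction model: downsampled feature map of the image frame.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Picture generation model: decode the fused features back to an image.
        self.generator = nn.Sequential(
            nn.ConvTranspose2d(64 + feat_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, audio, image_frame):
        # audio: (B, audio_dim), image_frame: (B, 3, H, W)
        a = self.audio_encoder(audio)                       # (B, feat_dim)
        v = self.image_encoder(image_frame)                 # (B, 64, H/4, W/4)
        a = a[:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        return self.generator(torch.cat([v, a], dim=1))     # virtual image (B, 3, H, W)

# Minimal usage with random data
model = FirstPresetMouthShapeDrivingModel()
virtual_image = model(torch.randn(2, 80), torch.rand(2, 3, 96, 96))
print(virtual_image.shape)  # torch.Size([2, 3, 96, 96])
```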
In step S102, the first preset mouth shape driving model is trained according to the virtual image and the data of the video frame, so as to obtain a first driving sub-model.
In this step, the virtual image and the audio data may be input into a preset lip language recognition model to obtain a lip language recognition result, and a lip language loss value is determined according to the lip language recognition result; inputting the virtual image and the image frame into a preset image quality identification model to obtain an image quality identification result corresponding to the video frame, and determining an image loss value according to the image quality identification result; and performing iterative training on the first preset mouth shape driving model according to the lip language loss value and the image loss value to obtain the first driving submodel.
Continuing with fig. 2, in the process of training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain the first driving sub-model, as shown in fig. 2, the virtual image and the audio data corresponding to the video frame may be input into a preset lip language recognition model to obtain a lip language recognition result, and a lip language loss value is determined according to the lip language recognition result; the virtual image and the image frame corresponding to the video frame are input into a preset image quality recognition model to obtain an image quality recognition result corresponding to the video frame, and an image loss value is determined according to the image quality recognition result; then, the first preset mouth shape driving model is iteratively trained according to the lip language loss value and the image loss value to obtain the first driving sub-model.
Specifically, in the process of performing iterative training on the first preset mouth shape driving model according to the lip language loss value and the image loss value, reference may be made to a specific implementation process of training the model based on the loss value described in a related document, which is not described herein again.
It should be noted that, as shown in fig. 2, the preset lip language recognition model and the preset image quality recognition model are used only in the model training stage of the first driving sub-model: the lip language loss value is determined by the preset lip language recognition model, the image loss value is determined by the preset image quality recognition model, and feedback training is performed on the first preset mouth shape driving model based on the two loss values. In the model application stage of the first driving sub-model, when the first driving sub-model outputs the virtual images corresponding to different video frames, the preset lip language recognition model and the preset image quality recognition model are no longer required.
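The following sketch illustrates, under the same assumptions as above, how the two loss values could be combined in one feedback-training iteration. The preset lip language recognition model and the preset image quality recognition model are passed in as opaque callables because the disclosure does not specify their interfaces or how their recognition results are converted into loss values; the reductions and the equal weighting below are placeholders.

```python
def train_first_submodel_step(driving_model, lip_recognition_model,
                              image_quality_model, audio, image_frame, optimizer):
    """One illustrative feedback-training iteration for the first preset mouth
    shape driving model. The recognition models are assumed to be pre-trained,
    frozen networks used only at training time."""
    virtual_image = driving_model(audio, image_frame)

    # Lip language loss value: derived from the lip language recognition result.
    lip_result = lip_recognition_model(virtual_image, audio)
    lip_loss = lip_result.mean()        # stand-in for the actual loss computation

    # Image loss value: derived from the image quality recognition result.
    quality_result = image_quality_model(virtual_image, image_frame)
    image_loss = quality_result.mean()  # stand-in for the actual loss computation

    # Iterative training on the combined losses (equal weighting is an assumption).
    loss = lip_loss + image_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```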
In step S103, a first preset mouth shape parameter is obtained, and the first preset mouth shape parameter is input into a preset image rendering model, so as to obtain a first rendering image.
The first preset mouth shape parameter may be any mouth shape parameter preset by a game developer, the preset image rendering model may be a PyTorch3D model, and the mouth shape in the first rendering image is the mouth shape corresponding to the first preset mouth shape parameter.
In step S104, a second preset mouth shape driving model is trained according to the first rendering image and the first preset mouth shape parameter, so as to obtain a second driving submodel.
Fig. 3 is a schematic diagram illustrating a process of training a second driving sub-model according to an exemplary embodiment, as shown in fig. 3, in this step, the first rendering image may be input into the second preset mouth shape driving model to obtain a first model output parameter; determining a first mouth shape parameter loss value according to the first preset mouth shape parameter and the first model output parameter; and performing iterative training on the second preset mouth shape driving model according to the first mouth shape parameter loss value to obtain the second driving sub-model, wherein the second preset mouth shape driving model is also a preset deep learning model.
Specifically, the dense motion feature of the mouth shape on the first rendered image may be extracted through the second preset mouth shape driving model, and then the corresponding mouth shape parameter (i.e. the first model output parameter) may be predicted according to the dense motion feature.
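A corresponding sketch of the second preset mouth shape driving model and one training iteration is given below. The CNN layers, the number of mouth shape parameters and the use of an L1 loss are illustrative assumptions, and `render_fn` stands in for the preset image rendering model (for example a PyTorch3D-based renderer) whose exact interface is not specified by the disclosure.

```python
import torch.nn as nn
import torch.nn.functional as F

class SecondPresetMouthShapeDrivingModel(nn.Module):
    """Illustrative sketch: rendered face image -> predicted mouth shape
    parameters. The dense-motion feature extraction is approximated here by a
    small CNN; the real network structure is not specified in this form."""

    def __init__(self, num_params=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_params)

    def forward(self, rendered_image):
        return self.head(self.features(rendered_image))

def train_second_submodel_step(model, render_fn, preset_mouth_params, optimizer):
    """One illustrative training iteration for the second driving sub-model."""
    first_rendering_image = render_fn(preset_mouth_params)      # first rendering image
    first_model_output = model(first_rendering_image)           # first model output parameter
    loss = F.l1_loss(first_model_output, preset_mouth_params)   # first mouth shape parameter loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```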
The first driving sub-model and the second driving sub-model can be obtained through pre-training in the above manner. The first driving sub-model is used to convert a real video containing the target audio to be recognized into a virtual video in which the virtual object speaks; the second driving sub-model extracts the dense motion features of the mouth shape in each frame of the target virtual image in the virtual video and then predicts the corresponding mouth shape parameters according to the dense motion features, so as to output the target mouth shape parameters corresponding to the target audio.
In step S105, a first mouth-shape driving model is determined according to the first driver submodel and the second driver submodel.
The first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
With this method, model training can be performed using the video data in 2D preset video samples to obtain the first mouth shape driving model. Because 2D video data can be easily obtained from open-source video databases and is available in large quantities, model training based on large-scale 2D video data can learn richer emotion information and yields a more robust model.
Considering that the first mouth shape driving model includes two models, namely the first driving sub-model and the second driving sub-model, its structure is relatively complex, and the response time for determining the target mouth shape parameters and performing mouth shape driving based on the first mouth shape driving model is relatively long, so it cannot meet the requirement for fast response to mouth shape driving in real-time scenarios. Therefore, in another possible implementation of the present disclosure, after the preset audio corresponding to each video frame in the preset video sample is input into the first mouth shape driving model, the mouth shape parameters output by the model are taken as a target training sample, and another preset deep learning model is trained end to end in a supervised manner based on the target training sample to obtain a second mouth shape driving model. In this way, the target audio to be recognized can be input into the second mouth shape driving model, which directly outputs the target mouth shape parameters for mouth shape driving. Since the second mouth shape driving model has a simpler structure than the first mouth shape driving model, the response time for determining the target mouth shape parameters based on the second mouth shape driving model is shorter, so mouth shape driving can be performed more efficiently and the real-time response requirement for mouth shape driving in real-time scenarios can be met. The following describes how the second mouth shape driving model is obtained:
FIG. 4 is a flow chart illustrating another method of driving model training, according to an exemplary embodiment, as shown in FIG. 4, the method further comprising the steps of:
In step S401, a target training sample is determined through the first mouth shape driving model.
The target training sample comprises preset audios corresponding to different video frames and second preset mouth shape parameters corresponding to the preset audios.
In this step, for each video frame in the preset video sample, the image frame and the audio data of the video frame may be input into the first mouth shape driving model to obtain a second model output parameter; and, for each video frame, the audio data in the video frame is taken as the preset audio, and the second model output parameter is taken as the second preset mouth shape parameter corresponding to the preset audio.
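The following sketch illustrates one possible way to build this target training sample; the data layout of `video_frames` and the interface of the first mouth shape driving model are assumptions carried over from the earlier sketches.

```python
import torch

def build_target_training_sample(first_mouth_shape_driving_model, video_frames):
    """Illustrative sketch of step S401: run the trained first mouth shape
    driving model over each (image frame, audio data) pair of the preset video
    sample and keep (preset audio, second preset mouth shape parameter) pairs."""
    target_training_sample = []
    for image_frame, audio_data in video_frames:
        with torch.no_grad():
            second_model_output = first_mouth_shape_driving_model(audio_data, image_frame)
        # The audio data is taken as the preset audio, and the model output is
        # taken as the second preset mouth shape parameter for that audio.
        target_training_sample.append((audio_data, second_model_output))
    return target_training_sample
```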
In step S402, a third preset mouth shape driving model is supervised-trained according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio, so as to obtain a second mouth shape driving model.
The third preset mouth shape driving model can comprise a preset deep learning model, and the second mouth shape driving model is used for determining a target mouth shape parameter corresponding to the target audio.
In this step, the preset audio may be input into the third preset mouth shape driving model to obtain a third model output parameter corresponding to the preset audio; determining a second mouth shape parameter loss value according to the second preset mouth shape parameter and the third model output parameter; and performing iterative training on the third preset mouth shape driving model according to the second mouth shape parameter loss value to obtain the second mouth shape driving model.
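A minimal sketch of the third preset mouth shape driving model and one supervised training iteration is shown below; the network layers and the choice of MSE for the second mouth shape parameter loss value are assumptions, since the disclosure only describes the model as a preset deep learning model.

```python
import torch.nn as nn
import torch.nn.functional as F

class ThirdPresetMouthShapeDrivingModel(nn.Module):
    """Illustrative audio -> mouth shape parameter network (layers assumed)."""

    def __init__(self, audio_dim=80, num_params=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, num_params),
        )

    def forward(self, preset_audio):
        return self.net(preset_audio)

def train_third_model_step(model, preset_audio, second_preset_params, optimizer):
    """One supervised iteration: the third model output parameter is compared
    with the second preset mouth shape parameter produced by the first mouth
    shape driving model (a teacher/student style setup)."""
    third_model_output = model(preset_audio)
    loss = F.mse_loss(third_model_output, second_preset_params)  # second mouth shape parameter loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```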
Using the method shown in fig. 4, the second mouth shape driving model can be obtained through training. The target audio to be recognized can then be input into the second mouth shape driving model, which directly outputs the target mouth shape parameters corresponding to the target audio, so that mouth shape driving can subsequently be performed according to the target mouth shape parameters. Because the second mouth shape driving model has a simpler structure than the first mouth shape driving model, the response time for determining the target mouth shape parameters based on the second mouth shape driving model is shorter, so mouth shape driving can be performed more efficiently and the real-time response requirement for mouth shape driving in real-time scenarios can be met. For example, after a 10-second piece of audio is input into the second mouth shape driving model, mouth shape driving can be performed with only about 1 second of response time, meeting the requirements of real-time mouth shape driving scenarios.
The following describes a specific manner of performing mouth shape driving on the virtual object based on the first mouth shape driving model or the second mouth shape driving model obtained through the above training:
fig. 5 is a flowchart illustrating a method of driving a virtual object, according to an exemplary embodiment, as shown in fig. 5, the method including the steps of:
in step S501, target audio to be recognized is acquired.
The target audio refers to audio corresponding to a virtual object to be mouth-shape driven. Taking an offline game production scenario as an example, the target audio is the preset game audio corresponding to different preset game characters. In a real-time game interaction scenario, a preset game character is required to perform real-time voice interaction with a user (a user of the game software); in this case, the target audio may be dialogue audio configured in advance for the preset game character. The virtual object may include a preset virtual character, for example, a preset game character.
In step S502, a preset virtual image and the target audio are input into a first mouth shape driving model trained in advance, so as to obtain a target mouth shape parameter corresponding to the target audio.
The preset virtual image (also referred to as an avatar image) is the image of the virtual object to be driven; for example, the preset virtual image includes a preset face image corresponding to a virtual object. The first mouth shape driving model includes a first driving sub-model and a second driving sub-model, both of which are preset deep learning models, and the output of the first driving sub-model serves as the input of the second driving sub-model. The first driving sub-model is used to convert a live video (which contains voice information) into an avatar video, in which the real object in the live video is replaced by the preset virtual object; the second driving sub-model is used to convert each picture frame in the avatar video into corresponding target mouth shape driving parameters, and the target mouth shape parameters include parameters respectively representing mouth shapes such as opening the lips, puckering the lips, moving the lips in each direction, smiling, puckering the lower lip and puckering the upper lip.
Therefore, in this step, the preset virtual image and the target audio may be input into the first driver sub-model, so as to obtain a target virtual image corresponding to the target audio; and inputting the target virtual image into the second driving sub-model to obtain the target mouth shape parameters corresponding to the target audio.
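The following sketch illustrates this two-stage inference path, with the sub-model interfaces assumed in the earlier sketches.

```python
import torch

def infer_target_mouth_params(first_driving_submodel, second_driving_submodel,
                              preset_virtual_image, target_audio):
    """Illustrative inference path for step S502: preset virtual image plus
    target audio -> target virtual image -> target mouth shape parameters."""
    with torch.no_grad():
        target_virtual_image = first_driving_submodel(target_audio, preset_virtual_image)
        target_mouth_params = second_driving_submodel(target_virtual_image)
    return target_mouth_params
```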
In step S503, the virtual object is mouth-driven based on the target mouth shape parameter.
In this step, the target mouth shape parameter may be input into a preset image rendering model, and a second rendering image corresponding to the target mouth shape parameter is obtained; the virtual object is then mouth-driven according to the second rendered image.
The preset image rendering model may be a PyTorch3D model. The PyTorch3D model may render the face image corresponding to the target mouth shape parameter to obtain the second rendered image, and the mouth shape displayed in the second rendered image is the target mouth shape corresponding to the target mouth shape parameter.
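As a hypothetical illustration of how such parameters might deform a face mesh before rendering, the sketch below uses a linear blendshape combination. This formula is a common convention for parameters such as "open lips" or "pucker lips" and is not prescribed by the disclosure; the deformed mesh would then be rendered, for example with PyTorch3D, to obtain the second rendered image.

```python
import torch

def apply_mouth_params(base_vertices, blendshape_deltas, target_mouth_params):
    """Hypothetical sketch: drive a face mesh with mouth shape parameters.
    base_vertices: (V, 3); blendshape_deltas: (P, V, 3); target_mouth_params: (P,)."""
    offset = torch.einsum('p,pvc->vc', target_mouth_params, blendshape_deltas)
    return base_vertices + offset
```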
With this method, model training is performed using 2D video data to obtain the first mouth shape driving model. Because 2D video data can be easily obtained from open-source video databases and is available in large quantities, model training based on large-scale 2D video data can learn richer emotion information and yields a more robust model that adapts to multi-language, multi-speaker scenarios, thereby improving the versatility of the model. Performing mouth shape driving based on a first mouth shape driving model with better versatility can also significantly improve the accuracy and efficiency of mouth shape driving.
Fig. 6 is a flowchart illustrating a method of driving a virtual object according to an exemplary embodiment, and as shown in fig. 6, the method further includes the steps of:
in step S504, the target audio to be recognized is input into the second mouth shape driving model, so as to obtain the target mouth shape parameters corresponding to the target audio.
The second mouth shape driving model is a deep learning model obtained in advance according to a target training sample, the target training sample is a sample determined by the first mouth shape driving model, the target training sample specifically includes preset audios corresponding to different video frames in a preset video sample, and a second preset mouth shape parameter corresponding to the preset audios, and the second preset mouth shape parameter is a mouth shape parameter (namely a second model output parameter) output by the model after the preset audios are input into the first mouth shape driving model.
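A minimal sketch of such a real-time loop is given below; processing the audio in chunks is an assumption, since the disclosure only characterizes the response time of the second mouth shape driving model.

```python
import torch

def drive_realtime(second_mouth_shape_driving_model, audio_chunks):
    """Illustrative real-time loop: feed the target audio to the second mouth
    shape driving model chunk by chunk and yield target mouth shape parameters
    as each chunk is processed."""
    for chunk in audio_chunks:
        with torch.no_grad():
            yield second_mouth_shape_driving_model(chunk)
```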
With this method, the target audio to be recognized can be input into the second mouth shape driving model, which directly outputs the target mouth shape driving parameters for mouth shape driving. Because the second mouth shape driving model has a simpler structure than the first mouth shape driving model, the response time for determining the target mouth shape parameters based on the second mouth shape driving model is shorter, so mouth shape driving can be performed more efficiently and the real-time response requirement for mouth shape driving in real-time scenarios can be met. For example, after a 10-second piece of audio is input into the second mouth shape driving model, mouth shape driving can usually be performed with only about 1 second of response time, meeting the requirements of real-time mouth shape driving scenarios.
FIG. 7 is a block diagram illustrating a training apparatus for a driving model according to an exemplary embodiment. As shown in FIG. 7, the apparatus includes:
the first training module 701 is configured to, for each video frame in a preset video sample, input data of the video frame into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame; training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
a second training module 702, configured to obtain a first preset mouth shape parameter, and input the first preset mouth shape parameter into a preset image rendering model to obtain a first rendered image; training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
a first determining module 703, configured to determine a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, where the first mouth shape driving model is configured to determine a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized, so as to perform mouth shape driving on a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
Optionally, the data of the video frame includes an image frame and audio data of the video frame, and the first training module 701 is configured to input the virtual image and the audio data into a preset lip language recognition model to obtain a lip language recognition result, and determine a lip language loss value according to the lip language recognition result; inputting the virtual image and the image frame into a preset image quality identification model to obtain an image quality identification result corresponding to the video frame, and determining an image loss value according to the image quality identification result; and performing iterative training on the first preset mouth shape driving model according to the lip language loss value and the image loss value to obtain the first driving submodel.
Optionally, the second training module 702 is configured to input the first rendered image into the second preset mouth shape driving model to obtain a first model output parameter; determine a first mouth shape parameter loss value according to the first preset mouth shape parameter and the first model output parameter; and perform iterative training on the second preset mouth shape driving model according to the first mouth shape parameter loss value to obtain the second driving sub-model.
Alternatively, fig. 8 is a block diagram of a training apparatus for a driving model according to the embodiment shown in fig. 7, and as shown in fig. 8, the apparatus further includes:
a determining training sample module 704, configured to determine a target training sample through the first mouth shape driving model, where the target training sample includes preset audio corresponding to different video frames and a second preset mouth shape parameter corresponding to the preset audio;
a third training module 705, configured to perform supervised training on a third preset mouth shape driving model according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio, to obtain a second mouth shape driving model, where the second mouth shape driving model is configured to determine the target mouth shape parameter corresponding to the target audio according to the target audio.
Optionally, the determining training sample module 704 is configured to, for each video frame in the preset video sample, input the image frame and audio data of the video frame into the first mouth shape driving model to obtain a second model output parameter; and, for each video frame, take the audio data in the video frame as the preset audio, and take the second model output parameter as the second preset mouth shape parameter corresponding to the preset audio.
Optionally, the third training module 705 is configured to input the preset audio into the third preset mouth shape driving model to obtain a third model output parameter corresponding to the preset audio; determining a second mouth shape parameter loss value according to the second preset mouth shape parameter and the third model output parameter; and performing iterative training on the third preset mouth shape driving model according to the second mouth shape parameter loss value to obtain the second mouth shape driving model.
Fig. 9 is a block diagram illustrating an apparatus for driving a virtual object according to an exemplary embodiment, the apparatus including, as shown in fig. 9:
an obtaining module 901, configured to obtain a target audio to be identified;
a second determining module 902, configured to input a preset virtual image and the target audio into a first mouth shape driving model trained in advance, to obtain a target mouth shape parameter corresponding to the target audio, where the preset virtual image is an image of a virtual object to be driven;
a mouth shape driving module 903, configured to perform mouth shape driving on the virtual object according to the target mouth shape parameter;
wherein the first mouth shape driving model is obtained by training through the training apparatus provided in the embodiment shown in fig. 7.
Optionally, the first mouth shape driving model includes a first driving sub-model and a second driving sub-model, and the second determining module 902 is configured to input the preset virtual image and the target audio into the first driving sub-model to obtain a target virtual image corresponding to the target audio; and input the target virtual image into the second driving sub-model to obtain the target mouth shape parameters corresponding to the target audio.
Alternatively, fig. 10 is a block diagram of a driving apparatus for a virtual object according to the embodiment shown in fig. 9, and as shown in fig. 10, the apparatus further includes:
a third determining module 904, configured to input the target audio to be identified into a second mouth shape driving model, so as to obtain the target mouth shape parameters corresponding to the target audio, where the second mouth shape driving model is a model obtained by pre-training according to the training apparatus provided in the embodiment shown in fig. 8.
Optionally, the mouth shape driving module 903 is configured to input the target mouth shape parameter into a preset image rendering model, so as to obtain a second rendering image corresponding to the target mouth shape parameter; and carrying out mouth shape driving on the virtual object according to the second rendered image.
With this apparatus, model training can be performed using 2D video data to obtain the first mouth shape driving model. Because 2D video data can be easily obtained from open-source video databases and is available in large quantities, model training based on large-scale 2D video data can learn richer emotional information and yields a more robust model that adapts to multi-language, multi-speaker scenarios, thereby improving the versatility of the model. Performing mouth shape driving based on a first mouth shape driving model with better versatility can also improve the accuracy and efficiency of mouth shape driving.
Referring now to FIG. 11, shown is a schematic diagram of an electronic device 1100 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the use range of the embodiment of the present disclosure.
As shown in fig. 11, the electronic device 1100 may include a processing means (e.g., central processing unit, graphics processor, etc.) 1101 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage means 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the electronic device 1100 are also stored. The processing device 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
Generally, the following devices may be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1107 including, for example, liquid Crystal Displays (LCDs), speakers, vibrators, and the like; storage devices 1108, including, for example, magnetic tape, hard disk, and the like; and a communication device 1109. The communication means 1109 may allow the electronic device 1100 to communicate wirelessly or wiredly with other devices to exchange data. While fig. 11 illustrates an electronic device 1100 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1109, or installed from the storage device 1108, or installed from the ROM 1102. The computer program, when executed by the processing device 1101, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium mentioned in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, clients may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: for each video frame in a preset video sample, input data of the video frame into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame; train the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model; acquire a first preset mouth shape parameter, and input the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image; train a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model; and determine a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized, so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target audio to be identified; input a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven; and carry out mouth shape driving on the virtual object according to the target mouth shape parameter; wherein the first mouth shape driving model is a first mouth shape driving model trained by the method of the first aspect of the present disclosure.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, an acquisition module may also be described as a "module that acquires target audio".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a method of training a driving model, comprising:
for each video frame in a preset video sample, inputting data of the video frame into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame;
training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
acquiring a first preset mouth shape parameter, and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image;
training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
and determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
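As a non-authoritative illustration of the structure described in example 1, the following sketch shows how the two driving sub-models could be composed into the first mouth shape driving model. PyTorch is assumed; the class names, feature dimensions, and layer choices are hypothetical placeholders and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class FirstDrivingSubModel(nn.Module):
    """Maps (image frame features, audio features) of a video frame to a virtual image."""
    def __init__(self, image_dim=512, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(image_dim + audio_dim, 512), nn.ReLU(),
                                 nn.Linear(512, image_dim))

    def forward(self, image_feat, audio_feat):
        return self.net(torch.cat([image_feat, audio_feat], dim=-1))

class SecondDrivingSubModel(nn.Module):
    """Maps a rendered (or virtual) image to mouth shape parameters."""
    def __init__(self, image_dim=512, num_params=52):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, num_params))

    def forward(self, image_feat):
        return self.net(image_feat)

class FirstMouthShapeDrivingModel(nn.Module):
    """Composition of the two trained sub-models: preset virtual image + target audio -> target mouth shape parameters."""
    def __init__(self, first_sub, second_sub):
        super().__init__()
        self.first_sub, self.second_sub = first_sub, second_sub

    def forward(self, preset_image_feat, target_audio_feat):
        virtual_image = self.first_sub(preset_image_feat, target_audio_feat)
        return self.second_sub(virtual_image)
```

The two sub-models would be trained separately, as sketched under examples 2 and 3 below, before being combined in this way.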
Example 2 provides the method of example 1, wherein the data of the video frame includes an image frame and audio data of the video frame, and the training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model includes:
inputting the virtual image and the audio data into a preset lip language recognition model to obtain a lip language recognition result, and determining a lip language loss value according to the lip language recognition result;
inputting the virtual image and the image frame into a preset image quality identification model to obtain an image quality identification result corresponding to the video frame, and determining an image loss value according to the image quality identification result;
and performing iterative training on the first preset mouth shape driving model according to the lip language loss value and the image loss value to obtain the first driving submodel.
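A rough sketch of one possible training loop for the first driving sub-model of example 2 follows. It assumes the lip language recognition model and the image quality recognition model are pretrained, frozen callables; their interfaces, the feature shapes, and the exact loss formulations are placeholders rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

def train_first_driving_sub_model(frames, first_sub, lip_reader, quality_scorer,
                                  lambda_img=1.0, lr=1e-4):
    opt = torch.optim.Adam(first_sub.parameters(), lr=lr)
    for image_feat, audio_feat, lip_targets in frames:          # one entry per video frame
        virtual_image = first_sub(image_feat, audio_feat)
        # Lip language loss: a frozen lip-reading model should recover the spoken
        # content from the generated virtual image (scored against labels derived from the audio).
        lip_logits = lip_reader(virtual_image, audio_feat)
        lip_loss = nn.functional.cross_entropy(lip_logits, lip_targets)
        # Image loss: an image quality model compares the virtual image with the real image frame.
        quality = quality_scorer(virtual_image, image_feat)      # higher score = closer to the real frame
        image_loss = (1.0 - quality).mean()
        loss = lip_loss + lambda_img * image_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return first_sub
```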
Example 3 provides the method of example 1, wherein training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model comprises:
inputting the first rendering image into the second preset mouth shape driving model to obtain a first model output parameter;
determining a first mouth shape parameter loss value according to the first preset mouth shape parameter and the first model output parameter;
and performing iterative training on the second preset mouth shape driving model according to the first mouth shape parameter loss value to obtain a second driving sub-model.
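The loop in example 3 can be sketched as follows, assuming the preset image rendering model is available as a callable that turns mouth shape parameters into a rendered image; the L1 distance used for the first mouth shape parameter loss is an assumption, not something stated in the disclosure.

```python
import torch
import torch.nn as nn

def train_second_driving_sub_model(preset_mouth_params, second_sub, renderer, lr=1e-4):
    opt = torch.optim.Adam(second_sub.parameters(), lr=lr)
    for first_preset_params in preset_mouth_params:
        first_rendered_image = renderer(first_preset_params)      # first rendering image
        first_model_output = second_sub(first_rendered_image)      # first model output parameter
        # First mouth shape parameter loss: distance between predicted and preset parameters.
        loss = nn.functional.l1_loss(first_model_output, first_preset_params)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return second_sub
```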
Example 4 provides, in accordance with one or more embodiments of the present disclosure, the method of example 1, further comprising:
determining a target training sample through the first mouth shape driving model, wherein the target training sample comprises preset audio corresponding to different video frames and second preset mouth shape parameters corresponding to the preset audio;
and carrying out supervised training on a third preset mouth shape driving model according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio to obtain a second mouth shape driving model, wherein the second mouth shape driving model is used for determining the target mouth shape parameter corresponding to the target audio according to the target audio.
Example 5 provides the method of example 4, wherein the determining, by the first mouth shape driving model, the target training sample comprises:
for each video frame in the preset video sample, inputting the image frame and audio data of the video frame into the first mouth shape driving model to obtain a second model output parameter;
and regarding each video frame, taking the audio data in the video frame as the preset audio, and taking the second model output parameter as the second preset mouth shape parameter corresponding to the preset audio.
Example 6 provides the method of example 4, wherein the performing supervised training on a third preset mouth shape driving model according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio to obtain a second mouth shape driving model comprises:
inputting the preset audio into the third preset mouth shape driving model to obtain a third model output parameter corresponding to the preset audio;
determining a second mouth shape parameter loss value according to the second preset mouth shape parameter and the third model output parameter;
and performing iterative training on the third preset mouth shape driving model according to the second mouth shape parameter loss value to obtain the second mouth shape driving model.
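Examples 4-6 amount to using the trained first mouth shape driving model to label audio with mouth shape parameters, and then training an audio-only model on those labels. A minimal sketch follows; the sample format, network layout, and L1 loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_target_training_samples(frames, first_sub, second_sub):
    """Label each video frame's audio with mouth shape parameters produced by the first driving model."""
    samples = []
    with torch.no_grad():
        for image_feat, audio_feat in frames:
            virtual_image = first_sub(image_feat, audio_feat)
            mouth_params = second_sub(virtual_image)          # second model output parameter
            samples.append((audio_feat, mouth_params))         # (preset audio, second preset mouth shape parameter)
    return samples

def train_second_mouth_shape_driving_model(samples, lr=1e-4):
    """Supervised training of the third preset model on the generated samples (examples 5 and 6)."""
    audio_dim = samples[0][0].shape[-1]
    param_dim = samples[0][1].shape[-1]
    third_model = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, param_dim))
    opt = torch.optim.Adam(third_model.parameters(), lr=lr)
    for audio_feat, target_params in samples:
        pred = third_model(audio_feat)                          # third model output parameter
        loss = nn.functional.l1_loss(pred, target_params)       # second mouth shape parameter loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return third_model                                          # serves as the second mouth shape driving model
```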
Example 7 provides, in accordance with one or more embodiments of the present disclosure, a method of driving a virtual object, the method comprising:
acquiring target audio to be identified;
inputting a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is a first mouth shape driving model trained by the method of any one of examples 1-3.
Example 8 provides the method of example 7, wherein the first mouth shape driving model comprises a first driving sub-model and a second driving sub-model, and inputting the preset virtual image and the target audio into the first mouth shape driving model trained in advance to obtain the target mouth shape parameter corresponding to the target audio comprises:
inputting the preset virtual image and the target audio into the first driving sub-model to obtain a target virtual image corresponding to the target audio;
and inputting the target virtual image into the second driving sub-model to obtain the target mouth shape parameter corresponding to the target audio.
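The inference path of examples 7 and 8 is a simple two-step pass through the trained sub-models. A hedged sketch, with feature extraction left out and tensors assumed to be precomputed:

```python
import torch

@torch.no_grad()
def infer_target_mouth_params(preset_image_feat, target_audio_feat, first_sub, second_sub):
    # First driving sub-model: preset virtual image + target audio -> target virtual image.
    target_virtual_image = first_sub(preset_image_feat, target_audio_feat)
    # Second driving sub-model: target virtual image -> target mouth shape parameters.
    return second_sub(target_virtual_image)
```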
Example 9 provides the method of example 7, further comprising, in accordance with one or more embodiments of the present disclosure:
inputting the target audio to be recognized into a second mouth shape driving model to obtain the target mouth shape parameters corresponding to the target audio, wherein the second mouth shape driving model is obtained by pre-training according to the method of any one of examples 4-6.
Example 10 provides the method of example 7, wherein the carrying out mouth shape driving on the virtual object according to the target mouth shape parameter comprises:
inputting the target mouth shape parameter into a preset image rendering model to obtain a second rendering image corresponding to the target mouth shape parameter;
and carrying out mouth shape driving on the virtual object according to the second rendering image.
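For completeness, example 10 can be read as one extra rendering step applied to the inferred parameters. In this sketch the renderer and the avatar-update call are placeholders standing in for the preset image rendering model and the driving step respectively:

```python
import torch

@torch.no_grad()
def drive_virtual_object(target_mouth_params, renderer, apply_to_avatar):
    # Preset image rendering model: target mouth shape parameters -> second rendering image.
    second_rendered_image = renderer(target_mouth_params)
    # Drive the virtual object's mouth shape with the rendered result (e.g. update its mouth region).
    apply_to_avatar(second_rendered_image)
```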
Example 11 provides, in accordance with one or more embodiments of the present disclosure, an apparatus for training a driving model, the apparatus comprising:
the first training module is used for, for each video frame in a preset video sample, inputting data of the video frame into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame, and training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
the second training module is used for acquiring a first preset mouth shape parameter and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image; training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
the first determining module is used for determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be identified, so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an apparatus for driving a virtual object, comprising:
the acquisition module is used for acquiring target audio to be identified;
the second determining module is used for inputting a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
the mouth shape driving module is used for carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is a first mouth shape driving model trained by the training apparatus of example 11.
Example 13 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any one of examples 1-6 or examples 7-10.
Example 14 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1-6 or examples 7-10.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Claims (14)

1. A method of training a driving model, the method comprising:
inputting data of each video frame in a preset video sample into a first preset mouth shape driving model to obtain a virtual image corresponding to the video frame;
training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
acquiring a first preset mouth shape parameter, and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image;
training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
and determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
2. The method of claim 1, wherein the data of the video frame comprises image frame and audio data of the video frame, and the training of the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model comprises:
inputting the virtual image and the audio data into a preset lip language recognition model to obtain a lip language recognition result, and determining a lip language loss value according to the lip language recognition result;
inputting the virtual image and the image frame into a preset image quality identification model to obtain an image quality identification result corresponding to the video frame, and determining an image loss value according to the image quality identification result;
and performing iterative training on the first preset mouth shape driving model according to the lip language loss value and the image loss value to obtain the first driving submodel.
3. The method of claim 1, wherein training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving submodel comprises:
inputting the first rendering image into the second preset mouth shape driving model to obtain a first model output parameter;
determining a first mouth shape parameter loss value according to the first preset mouth shape parameter and the first model output parameter;
and performing iterative training on the second preset mouth shape driving model according to the first mouth shape parameter loss value to obtain a second driving sub-model.
4. The method of claim 1, further comprising:
determining a target training sample through the first mouth shape driving model, wherein the target training sample comprises preset audio corresponding to different video frames and second preset mouth shape parameters corresponding to the preset audio;
and carrying out supervised training on a third preset mouth shape driving model according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio to obtain a second mouth shape driving model, wherein the second mouth shape driving model is used for determining the target mouth shape parameter corresponding to the target audio according to the target audio.
5. The method of claim 4, wherein the determining, by the first mouth shape driving model, the target training sample comprises:
for each video frame in the preset video sample, inputting the image frame and audio data of the video frame into the first mouth shape driving model to obtain a second model output parameter;
and regarding each video frame, taking the audio data in the video frame as the preset audio, and taking the second model output parameter as the second preset mouth shape parameter corresponding to the preset audio.
6. The method of claim 4, wherein the supervised training of a third preset mouth shape driving model according to the preset audio and the second preset mouth shape parameter corresponding to the preset audio comprises:
inputting the preset audio into the third preset mouth shape driving model to obtain a third model output parameter corresponding to the preset audio;
determining a second mouth shape parameter loss value according to the second preset mouth shape parameter and the third model output parameter;
and performing iterative training on the third preset mouth shape driving model according to the second mouth shape parameter loss value to obtain the second mouth shape driving model.
7. A method of driving a virtual object, the method comprising:
acquiring target audio to be identified;
inputting a preset virtual image and the target audio into a pre-trained first mouth shape driving model to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is a first mouth shape driving model trained by the method of any one of claims 1-3.
8. The method of claim 7, wherein the first mouth shape driving model comprises a first driving sub-model and a second driving sub-model, and the inputting the preset virtual image and the target audio into the pre-trained first mouth shape driving model to obtain the target mouth shape parameter corresponding to the target audio comprises:
inputting the preset virtual image and the target audio into the first driving sub-model to obtain a target virtual image corresponding to the target audio;
and inputting the target virtual image into the second driving sub-model to obtain the target mouth shape parameter corresponding to the target audio.
9. The method of claim 7, further comprising:
inputting the target audio to be recognized into a second mouth shape driving model to obtain the target mouth shape parameters corresponding to the target audio, wherein the second mouth shape driving model is obtained by pre-training according to the method of any one of claims 4-6.
10. The method of claim 7, wherein said mouth-driving the virtual object according to the target mouth-shape parameters comprises:
inputting the target mouth shape parameter into a preset image rendering model to obtain a second rendering image corresponding to the target mouth shape parameter;
and carrying out mouth shape driving on the virtual object according to the second rendering image.
11. An apparatus for training a driving model, the apparatus comprising:
the first training module is used for inputting data of the video frames into a first preset mouth shape driving model aiming at each video frame in a preset video sample to obtain a virtual image corresponding to the video frame; training the first preset mouth shape driving model according to the virtual image and the data of the video frame to obtain a first driving sub-model;
the second training module is used for acquiring a first preset mouth shape parameter and inputting the first preset mouth shape parameter into a preset image rendering model to obtain a first rendering image; training a second preset mouth shape driving model according to the first rendering image and the first preset mouth shape parameter to obtain a second driving sub-model;
the first determining module is used for determining a first mouth shape driving model according to the first driving sub-model and the second driving sub-model, wherein the first mouth shape driving model is used for determining a target mouth shape parameter corresponding to a target audio according to a preset virtual image and the target audio to be recognized, so as to drive a mouth shape of a virtual object according to the target mouth shape parameter, and the preset virtual image is an image of the virtual object to be driven.
12. An apparatus for driving a virtual object, the apparatus comprising:
the acquisition module is used for acquiring target audio to be identified;
the second determining module is used for inputting a preset virtual image and the target audio into a first mouth shape driving model trained in advance to obtain a target mouth shape parameter corresponding to the target audio, wherein the preset virtual image is an image of a virtual object to be driven;
the mouth shape driving module is used for carrying out mouth shape driving on the virtual object according to the target mouth shape parameters;
wherein the first mouth shape driving model is a first mouth shape driving model trained by the training apparatus of claim 11.
13. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processing means, is adapted to carry out the steps of the method of any one of claims 1-6 or 7-10.
14. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1-6 or 7-10.
CN202210709551.XA 2022-06-21 2022-06-21 Training method of driving model, driving method, device, readable medium and equipment Pending CN115546575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709551.XA CN115546575A (en) 2022-06-21 2022-06-21 Training method of driving model, driving method, device, readable medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210709551.XA CN115546575A (en) 2022-06-21 2022-06-21 Training method of driving model, driving method, device, readable medium and equipment

Publications (1)

Publication Number Publication Date
CN115546575A (en) 2022-12-30

Family

ID=84724636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709551.XA Pending CN115546575A (en) 2022-06-21 2022-06-21 Training method of driving model, driving method, device, readable medium and equipment

Country Status (1)

Country Link
CN (1) CN115546575A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166125A (en) * 2023-03-03 2023-05-26 北京百度网讯科技有限公司 Avatar construction method, apparatus, device and storage medium
CN116166125B (en) * 2023-03-03 2024-04-23 北京百度网讯科技有限公司 Avatar construction method, apparatus, device and storage medium
CN117998166A (en) * 2024-04-02 2024-05-07 腾讯科技(深圳)有限公司 Training method, training device, training equipment, training storage medium and training product for video generation model

Similar Documents

Publication Publication Date Title
US11308671B2 (en) Method and apparatus for controlling mouth shape changes of three-dimensional virtual portrait
CN112637517B (en) Video processing method and device, electronic equipment and storage medium
WO2020211573A1 (en) Method and device for processing image
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN109858614B (en) Neural network training method and device, electronic equipment and storage medium
CN110059623B (en) Method and apparatus for generating information
CN113313064A (en) Character recognition method and device, readable medium and electronic equipment
CN113259740A (en) Multimedia processing method, device, equipment and medium
CN112752118A (en) Video generation method, device, equipment and storage medium
CN114937192A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115546575A (en) Training method of driving model, driving method, device, readable medium and equipment
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN114630057A (en) Method and device for determining special effect video, electronic equipment and storage medium
CN114022668A (en) Method, device, equipment and medium for aligning text with voice
CN111626922B (en) Picture generation method and device, electronic equipment and computer readable storage medium
CN113033552B (en) Text recognition method and device and electronic equipment
CN114786069A (en) Video generation method, device, medium and electronic equipment
CN114757840A (en) Image processing method, image processing device, readable medium and electronic equipment
CN114640826B (en) Data processing method, device, readable medium and electronic equipment
US11792494B1 (en) Processing method and apparatus, electronic device and medium
CN112651909B (en) Image synthesis method, device, electronic equipment and computer readable storage medium
CN112418233B (en) Image processing method and device, readable medium and electronic equipment
CN116309989A (en) Method and device for generating animation curve, readable medium and electronic equipment
CN114882155A (en) Expression data generation method and device, readable medium and electronic equipment
CN110188712B (en) Method and apparatus for processing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination