CN117788654A - Three-dimensional face driving method based on voice, model training method and device

Publication number: CN117788654A
Application number: CN202311766861.6A
Authority: CN (China)
Prior art keywords: model, driving sequence, sequence, driving, face
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 杨少雄, 徐颖, 崔宪坤
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, which relate to the fields of computer vision, deep learning, augmented reality, virtual reality and the like in artificial intelligence technology, and can be applied to scenes such as the metaverse, digital humans and generative artificial intelligence. The method comprises the following steps: determining a to-be-processed driving sequence of the to-be-processed voice; performing style conversion on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating the three-dimensional facial action of a second object when outputting the to-be-processed voice, the style conversion model is trained according to a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial action of a first object when outputting a target voice, and the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice; and driving the three-dimensional face model corresponding to the second object to perform facial actions according to the target driving sequence.

Description

Three-dimensional face driving method based on voice, model training method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, augmented reality, virtual reality and the like, can be applied to scenes such as the metaverse, digital humans and generative artificial intelligence, and more particularly relates to a voice-based three-dimensional face driving method, a model training method and a device.
Background
Currently, with the continued development of artificial intelligence technology, a large number of avatars are created for information transfer. To ensure a good visual experience for the user, the facial motion of an avatar, particularly its lip motion, needs to be adaptively adjusted while the avatar is displayed so that it matches the voice to be played.
Disclosure of Invention
The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, so that the change of facial motion of a three-dimensional face model better conforms to the facial motion style of a speaking object when speaking.
According to a first aspect of the present disclosure, there is provided a voice-based three-dimensional face driving method, including:
determining a to-be-processed driving sequence of the to-be-processed voice; the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when outputting the to-be-processed voice;
Performing style conversion processing on the driving sequence to be processed according to a style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the voice to be processed; the style conversion model is obtained by training an initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice; the style conversion model is used for outputting a driving sequence which accords with the face style of the second object when speaking;
and driving the three-dimensional face model corresponding to the second object to perform facial action according to the target driving sequence.
According to a second aspect of the present disclosure, there is provided a training method of a style conversion model, including:
acquiring at least one training set; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice;
Training the initial model according to the at least one training set to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to a target style; the target style is a face style of the second subject when speaking.
According to a third aspect of the present disclosure, there is provided a three-dimensional face driving device based on voice, comprising:
a determining unit for determining a to-be-processed driving sequence of the to-be-processed voice; the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when outputting the to-be-processed voice;
the processing unit is used for carrying out style conversion processing on the driving sequence to be processed according to the style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the voice to be processed; the style conversion model is obtained by training an initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice; the style conversion model is used for outputting a driving sequence which accords with the face style of the second object when speaking;
And the driving unit is used for driving the three-dimensional facial model corresponding to the second object to perform facial actions according to the target driving sequence.
According to a fourth aspect of the present disclosure, there is provided a training apparatus of a style conversion model, including:
an acquisition unit for acquiring at least one group of training sets; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice;
the training unit is used for training the initial model according to the at least one group of training sets to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to a target style; the target style is a face style of the second subject when speaking.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or to enable the at least one processor to perform the method of the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect or for causing the computer to perform the method of the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising: a computer program stored in a readable storage medium, from which it can be read by at least one processor of an electronic device, the execution of which causes the electronic device to perform the method of the first aspect or the execution of which causes the electronic device to perform the method of the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a three-dimensional face driving method based on voice according to an embodiment of the disclosure;
fig. 2 is a flow chart of a second three-dimensional face driving method based on voice according to an embodiment of the disclosure;
fig. 3 is a flow chart of a training method of a style conversion model according to an embodiment of the disclosure;
FIG. 4 is a flowchart of a training method of a second style conversion model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a three-dimensional face driving device based on voice according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of a second voice-based three-dimensional face driving device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a second training device for a style conversion model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure;
fig. 10 is a block diagram of an electronic device used to implement a speech-based three-dimensional face driving method or model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, when the facial motion of an avatar is driven according to voice, how to control the facial motion of the three-dimensional face model corresponding to the avatar so that its facial motion style better conforms to the facial motion style of the user when speaking is a problem to be solved.
In one possible implementation, a driving sequence may be manually constructed for each phoneme that the user utters when speaking, where the driving sequence corresponding to a phoneme is used to indicate the facial action of the user when uttering that phoneme. Then, when the facial action of the avatar needs to be driven according to a voice, the phonemes contained in the voice can first be determined, and a set of driving sequences corresponding to the current voice is constructed based on the pre-constructed correspondence between phonemes and driving sequences. The facial actions of the avatar are then controlled sequentially according to the driving sequences included in this set. However, in this implementation, the facial actions of the avatar corresponding to different phonemes are prone to discontinuities, giving the user a poor viewing experience. It should be noted that the facial actions mentioned in the present disclosure include movements of various parts of the avatar's face, such as the lips, cheeks and forehead.
In another possible implementation, face scanning data of a person while speaking can be acquired, and the driving sequence corresponding to the face scanning data can be determined from it. End-to-end model training is then performed on the voice uttered by the user and the obtained driving sequence, so that the model can output, based on an input voice, a driving sequence that conforms to the facial style of the user when speaking. However, this model training approach requires a large number of data sets, resulting in high data acquisition costs.
To avoid at least one of the above technical problems, the inventors of the present disclosure arrived, through creative work, at the inventive concept of the present disclosure: when the three-dimensional face model corresponding to the second object needs to be driven, a to-be-processed driving sequence conforming to the facial style of the first object when speaking can be generated; style conversion processing is then performed on the to-be-processed driving sequence according to the style conversion model to obtain a target driving sequence conforming to the facial action style of the second object when speaking; and the three-dimensional face model is then driven according to the target driving sequence, so that the three-dimensional face model performs facial actions more smoothly and more closely resembles the facial action style of the second object when speaking.
The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, which are applied to the technical fields of computer vision, deep learning, augmented reality, virtual reality and the like in artificial intelligence technology, and can be applied to scenes such as the metaverse, digital humans and generative artificial intelligence, so that the motion of the three-dimensional face model is smoother and conforms to a specific style.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of users involved comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Fig. 1 is a schematic flow chart of a three-dimensional face driving method based on voice according to an embodiment of the disclosure, as shown in fig. 1, the method includes:
s101, determining a to-be-processed driving sequence of a to-be-processed voice; the to-be-processed driving sequence is used for indicating the three-dimensional facial action of the first object when the first object outputs the to-be-processed voice.
For example, the execution subject of this embodiment may be a voice-based three-dimensional face driving device, which may be a server (such as a local server or a cloud server), a computer, a processor, a chip, or the like; this embodiment is not limited thereto.
The first object and the second object in the embodiment may be a real person or a cartoon character, which is not particularly limited in the disclosure.
When the driving control is required to be performed on the three-dimensional face model corresponding to the second object according to the voice to be processed, a driving sequence to be processed can be generated according to the voice to be processed.
It should be noted that the to-be-processed driving sequence generated here is a driving sequence conforming to the facial action style of the first object when speaking, and is specifically used to indicate the facial action of the first object when outputting the to-be-processed voice.
In one example, the to-be-processed driving sequence may be generated by a model that generates driving sequences corresponding to the first object: the to-be-processed voice is input into the model to obtain the to-be-processed driving sequence. Specifically, the model in this example may be an end-to-end model mapping voice to a driving sequence, as described above.
S102, performing style conversion treatment on the driving sequence to be treated according to a style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action when the second object outputs the voice to be processed; the style conversion model is obtained by training the initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice; the style conversion model is used to output a driving sequence conforming to the facial style of the second subject when speaking.
In this embodiment, after the to-be-processed driving sequence conforming to the style of the first object is obtained, style conversion processing may be performed on it with the pre-trained style conversion model to obtain a target driving sequence conforming to the facial action style of the second object when speaking, where the target driving sequence is specifically used to indicate the facial action of the second object when speaking the to-be-processed voice.
It should be noted that, the style conversion model in this embodiment is used to convert the style of the driving sequence. Where a driving sequence may be understood as a series of parameters for indicating the action of the model face. Style conversion may be understood as a conversion from a facial action style when one subject speaks to a facial action style when another subject speaks.
In addition, at least one training set may be acquired in advance when training the style conversion model. A training set includes, for the same voice (i.e., the target voice), a first driving sequence corresponding to the first object and a second driving sequence corresponding to the second object. Specifically, the first driving sequence conforms to the facial style of the first object when speaking and is used to indicate the facial action of the first object when outputting the target voice. Likewise, the second driving sequence conforms to the facial style of the second object when speaking and is used to indicate the facial action of the second object when outputting the target voice. Different training sets may correspond to different target voices. The initial model is then trained with the training sets to obtain a style conversion model capable of converting a driving sequence conforming to the facial style of the first object when speaking into a driving sequence conforming to the facial style of the second object when speaking.
It should be noted that, in this embodiment, the model architecture of the style conversion model is not particularly limited.
And S103, driving the three-dimensional face model corresponding to the second object to perform face action according to the target driving sequence.
In this embodiment, after obtaining the target driving sequence that conforms to the face style of the second object when speaking, the face motion control may be performed on the three-dimensional face model corresponding to the second object according to the target driving sequence, so that the face motion of the three-dimensional face model may conform to the face motion style of the second object when speaking the voice to be processed.
It can be appreciated that in this embodiment, when the three-dimensional face model of the second object is driven according to voice, the to-be-processed driving sequence conforming to the facial style of the first object may be generated first, and style conversion processing is then performed on it with the style conversion model to obtain the target driving sequence conforming to the facial action style of the second object. Compared with generating the target driving sequence with an end-to-end model that maps voice directly to a driving sequence conforming to the facial style of the second object, the style conversion model for driving-sequence style conversion provided in this embodiment is easier to train, requires fewer training sets, and can improve the efficiency of three-dimensional face driving.
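For illustration only, the following is a minimal sketch of this inference flow in PyTorch. The StyleConversionModel stub, the parameter dimension of 32, the sequence length and the drive_face_model hook are all illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

PARAM_DIM = 32  # assumed size of one driving parameter set (e.g. blendshape weights)

class StyleConversionModel(nn.Module):
    """Placeholder style conversion network: maps a driving sequence in the first
    object's style to a same-shaped sequence in the second object's style."""
    def __init__(self, dim: int = PARAM_DIM):
        super().__init__()
        self.net = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # length-preserving

    def forward(self, seq: torch.Tensor) -> torch.Tensor:  # seq: (batch, N, dim)
        x = seq.transpose(1, 2)                             # (batch, dim, N) for Conv1d
        return self.net(x).transpose(1, 2)

def drive_face_model(target_sequence: torch.Tensor) -> None:
    """Stand-in for driving the second object's 3D face model frame by frame."""
    for t, params in enumerate(target_sequence):
        # apply `params` to the 3D face model at time frame t (renderer-specific)
        print(f"frame {t}: first 3 params {params[:3].tolist()}")

# to-be-processed driving sequence in the first object's style: 20 frames x PARAM_DIM
pending_sequence = torch.randn(1, 20, PARAM_DIM)

model = StyleConversionModel()   # in practice, a model trained as described below
model.eval()
with torch.no_grad():
    target_sequence = model(pending_sequence)[0]   # (N, PARAM_DIM), second object's style

drive_face_model(target_sequence)
```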
Fig. 2 is a flow chart of a second voice-based three-dimensional face driving method according to an embodiment of the disclosure, as shown in fig. 2, the method includes:
s201, determining phoneme information corresponding to the voice to be processed; wherein the phoneme information is a set of phonemes constituting a speech to be processed.
In this embodiment, when the three-dimensional face model corresponding to the second object needs to be controlled, the phoneme information corresponding to the to-be-processed voice may first be determined, so that the facial action of the second object when outputting the to-be-processed voice can be accurately simulated. The phoneme information may be understood as a set of sequentially arranged phonemes obtained by splitting the to-be-processed voice into phonemes.
S202, determining the to-be-processed driving sequence according to a mapping relation and the phoneme information; wherein the mapping relation is a correspondence between phonemes and preset parameter sets; a preset parameter set is used for indicating the three-dimensional facial action of the first object when uttering the corresponding phoneme; the to-be-processed driving sequence comprises at least one preset parameter set. The to-be-processed driving sequence is used for indicating the three-dimensional facial action of the first object when outputting the to-be-processed voice.
In this embodiment, after the phoneme information is obtained, the preset parameter sets corresponding to the phonemes contained in the phoneme information may be determined according to the preset correspondence between phonemes and preset parameter sets, and the sequence formed by these preset parameter sets is used as the to-be-processed driving sequence in this embodiment.
It should be noted that, the above preset parameter set may be understood as a face parameter corresponding to a face action of the first object when the first object emits a phoneme corresponding to the set.
It can be appreciated that in this embodiment, the to-be-processed driving sequence may be generated by splicing the preset parameter sets corresponding to the phonemes in the to-be-processed voice, which avoids the long training time that would be required to train an end-to-end model mapping voice to a driving sequence conforming to the facial style of the first object when speaking.
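As a sketch of steps S201 to S202, the snippet below builds a to-be-processed driving sequence by concatenating preset parameter sets looked up per phoneme. The phoneme symbols, the parameter dimension and the parameter values are hypothetical placeholders; a real mapping would hold the first object's measured facial parameters for each phoneme.

```python
import numpy as np

PARAM_DIM = 32  # assumed dimensionality of one preset parameter set

# Hypothetical mapping from phonemes to preset parameter sets describing the
# first object's facial action when uttering each phoneme (values illustrative).
phoneme_to_params = {
    "n":  np.full(PARAM_DIM, 0.10, dtype=np.float32),
    "i":  np.full(PARAM_DIM, 0.35, dtype=np.float32),
    "h":  np.full(PARAM_DIM, 0.20, dtype=np.float32),
    "ao": np.full(PARAM_DIM, 0.60, dtype=np.float32),
}

def build_pending_sequence(phoneme_info):
    """Concatenate the preset parameter sets of the phonemes that make up the
    to-be-processed voice into a to-be-processed driving sequence."""
    return np.stack([phoneme_to_params[p] for p in phoneme_info], axis=0)

# phoneme information for a to-be-processed voice (e.g. "ni hao"), in order
phoneme_info = ["n", "i", "h", "ao"]
pending_sequence = build_pending_sequence(phoneme_info)   # shape (4, PARAM_DIM)
print(pending_sequence.shape)
```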
S203, performing style conversion processing on the driving sequence to be processed according to the style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action when the second object outputs the voice to be processed; the style conversion model is obtained by training the initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice; the style conversion model is used to output a driving sequence conforming to the facial style of the second subject when speaking.
For example, the specific principle of step S203 may refer to step S102, which is not described herein. In addition, when the to-be-processed driving sequence is acquired in the manner of S201 to S202, the first driving sequence used in training the style conversion model may also be acquired in the manner of S201 to S202.
S204, driving the three-dimensional face model corresponding to the second object to perform face action according to the target driving sequence.
For example, the specific principle of step S204 may be referred to step S103, which is not described herein.
It can be understood that in this embodiment, style conversion processing is performed, through the style conversion model, on the to-be-processed driving sequence obtained by splicing the preset parameter sets corresponding to the phonemes, and facial action control is performed on the three-dimensional face model corresponding to the second object based on the converted target driving sequence. This improves the smoothness of the facial action of the three-dimensional face model and makes the facial action conform to the facial style of the second object when speaking, improving the user's viewing experience.
In one example, the first drive sequence includes N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object under the time frame corresponding to the first parameter set; n is a positive integer greater than 1; the style conversion model is obtained by carrying out parameter adjustment on the initial model according to the third driving sequence and the second driving sequence; the third driving sequence is output by the initial model according to the N first parameter sets; the third drive sequence includes N third parameter sets. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first face model and a second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to a third driving sequence; the second facial model is obtained by carrying out parameter adjustment on the preset facial model according to a second driving sequence. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first loss function and a second loss function; the first loss function is obtained according to the third driving sequence and the second driving sequence; wherein the second drive sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object under the time frame corresponding to the second parameter set; the second loss function is obtained according to the first face model and the second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to a third driving sequence; the second facial model is obtained by carrying out parameter adjustment on the preset facial model according to a second driving sequence. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
Fig. 3 is a flow chart of a training method of a style conversion model according to an embodiment of the present disclosure, as shown in fig. 3, where the method includes:
s301, acquiring at least one group of training sets; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice.
The execution body of the embodiment may be a training device of a style conversion model, and the training device of the style conversion model may be a server (such as a local server or a cloud server), may also be a computer, may also be a processor, may also be a chip, or the like, which is not limited in this embodiment. The training device of the style conversion model may be the same device as the three-dimensional face driving device based on voice, or may be a different device.
The technical principle of step S301 may be referred to step S102, and will not be described herein.
S302, training an initial model according to at least one group of training sets to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to the target style; the target style is a face style of the second subject when speaking.
In one example, the first driving sequence may include a plurality of driving parameter sets, each with a corresponding time frame. A driving parameter set in the first driving sequence may be understood as a parameter set describing the facial action of the first object under the corresponding time frame while the first object outputs the target voice. Likewise, the second driving sequence may include a plurality of driving parameter sets. Specifically, a driving parameter set in the second driving sequence may be understood as a parameter set describing the facial action of the second object under the corresponding time frame while the second object outputs the target voice.
When training the initial model according to a training set, the driving parameter set corresponding to a target time frame can be selected from the first driving sequence, and the driving parameter set corresponding to the same target time frame can be selected from the second driving sequence, for model training. That is, the input of the initial model is the parameter set of the first driving sequence at one time frame.
It can be appreciated that, in this embodiment, the style conversion model is obtained by training with the training set, so that the driving sequence can be subsequently style-converted based on the style conversion model, so as to quickly obtain the driving sequence conforming to the style of the speaking object, and control the three-dimensional facial model corresponding to the speaking object.
Fig. 4 is a flow chart of a training method of a second style conversion model according to an embodiment of the present disclosure, as shown in fig. 4, the method includes:
s401, acquiring at least one group of training sets; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice.
The execution body of the embodiment may be a training device of a style conversion model, and the training device of the style conversion model may be a server (such as a local server or a cloud server), may also be a computer, may also be a processor, may also be a chip, or the like, which is not limited in this embodiment. The training device of the style conversion model may be the same device as the three-dimensional face driving device based on voice, or may be a different device.
Wherein the first drive sequence comprises N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object under the time frame corresponding to the first parameter set; n is a positive integer greater than 1.
That is, in this embodiment, the first driving sequence is a sequence composed of N first parameter sets, each corresponding to a time frame in the target voice. A first parameter set may be understood as describing the three-dimensional facial action of the first object under the corresponding time frame when outputting the target voice.
In one example, step S401 includes the steps of: determining phoneme information corresponding to the target voice; wherein the phoneme information is a set of phonemes constituting the target speech; determining a first driving sequence according to the mapping relation and the phoneme information; the mapping relation is a corresponding relation between the phonemes and a preset parameter set; the preset parameter set is used for representing three-dimensional facial actions when phonemes are sent out; the first driving sequence comprises at least one preset parameter set; the second drive sequence is acquired, and the second drive sequence and the first drive sequence are used as a group of training sets.
The specific principle of the acquisition manner of the first driving sequence in this embodiment is similar to that of steps S201 to S202, and will not be described here again.
In addition, the second driving sequence may be obtained by using an end-to-end voice-to-driving-sequence model as mentioned in the related art, which is not described herein; alternatively, it may be generated from face images acquired while the second object is speaking.
It can be understood that in this embodiment, the style conversion model is trained by using, as the sequence conforming to the facial style of the first object, the first driving sequence spliced from the preset parameter sets corresponding to the phonemes. This improves the smoothness of the motion of the final three-dimensional face model, makes the training data simpler to obtain, and avoids the complexity of training a separate model to generate the corresponding driving sequence.
S402, inputting N first parameter sets into an initial model to obtain a third driving sequence; wherein the third drive sequence comprises N third parameter sets.
In this embodiment, when the initial model is trained, N first parameter sets corresponding to N time frames are input into the initial model at the same time, and style conversion processing is performed on the N first parameter sets by the initial model, so that the initial model can combine correlations between facial actions indicated by the first parameter sets of the time frames to obtain a driving sequence composed of N third parameter sets.
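A minimal sketch of how the N first parameter sets can be fed to the initial model in one pass is given below; the single length-preserving one-dimensional convolution stands in for the initial model, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

N, PARAM_DIM = 20, 32   # assumed sequence length and parameter-set size

# N first parameter sets (first object's style over N time frames), input together
first_sequence = torch.randn(N, PARAM_DIM)

# stand-in for the initial model; any length-preserving sequence model fits here
initial_model = nn.Conv1d(PARAM_DIM, PARAM_DIM, kernel_size=3, padding=1)

x = first_sequence.t().unsqueeze(0)          # (1, PARAM_DIM, N): all N frames at once
third_sequence = initial_model(x)[0].t()     # (N, PARAM_DIM): the N third parameter sets
print(third_sequence.shape)                  # torch.Size([20, 32])
```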
S403, according to the third driving sequence and the second driving sequence, carrying out parameter adjustment on the initial model to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to the target style; the target style is a face style of the second subject when speaking.
In this embodiment, after the third driving sequence output by the initial model is obtained, parameter adjustment may be performed on the initial model by combining the second driving sequence in the training set with the third driving sequence, until a preset training stop condition is met and the style conversion model is obtained. It should be noted that the stop condition of model training in this embodiment is set in a manner similar to the related art, which is not described here again.
In one example, when the initial model is parameter-adjusted based on the third driving sequence and the second driving sequence, the adjustment may be performed with a loss function obtained from the third driving sequence and the second driving sequence. This loss function may be built by combining various loss functions provided in the related art, such as the L1-norm loss function, the mean square error loss function and the cross entropy loss function, which is not particularly limited in this embodiment.
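A minimal sketch of such a first loss, here using the L1 norm between the converted sequence and the second driving sequence (random tensors stand in for real data, and the sizes are assumptions), is:

```python
import torch
import torch.nn.functional as F

N, PARAM_DIM = 20, 32
third_sequence = torch.randn(N, PARAM_DIM, requires_grad=True)  # output of the initial model
second_sequence = torch.randn(N, PARAM_DIM)                     # second object's driving sequence

# sequence-level difference; MSE or another regression loss could play the same role
first_loss = F.l1_loss(third_sequence, second_sequence)
first_loss.backward()
print(first_loss.item())
```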
It can be understood that when a user speaks, the change of the user's facial motion is temporally continuous; that is, the facial action at one moment is generally affected by the facial action at the previous moment. Therefore, when the initial model is trained, the first parameter sets of a plurality of consecutive time frames are input into the initial model at the same time, so that when performing style conversion on the input data, the initial model can fully exploit the correlation between facial actions under adjacent time frames, improving the accuracy of the style conversion result. Moreover, this avoids the problem that, when two different objects output the same piece of voice, timing differences in opening and closing the mouth would introduce large noise if training were performed frame by frame, making the final facial action unsmooth.
In one example, step S403 includes the steps of:
according to the third driving sequence, carrying out parameter adjustment on the preset face model to obtain a first face model; according to the second driving sequence, carrying out parameter adjustment on the preset face model to obtain a second face model; and carrying out parameter adjustment on the initial model according to the first face model and the second face model to obtain a style conversion model.
In this embodiment, when the initial model parameters are adjusted according to the third driving sequence and the second driving sequence, the same preset face model may be adjusted according to each of them: the facial actions in the preset face model are adjusted according to the third driving sequence to obtain the first face model, and the facial actions in the preset face model are adjusted according to the second driving sequence to obtain the second face model.
The preset face model may be understood as a three-dimensional face model constructed according to a face corresponding to a speaking object, specifically, the preset face model herein may select a three-dimensional face model corresponding to a second object, or may select three-dimensional face models corresponding to other speaking objects, which is not limited in this embodiment.
After the first and second facial models are obtained, parameters of the initial model may be adjusted based on differences between the first and second facial models. For example, the loss function may be constructed by extracting position information corresponding to at least one key point in the first face model and position information corresponding to at least one key point in the second face model, and then performing parameter adjustment on the initial model based on the obtained loss function.
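The sketch below illustrates one way such a key-point based loss could be computed, assuming the preset face model is a simple linear blendshape model; the vertex count, key-point indices and blendshape basis are all illustrative assumptions rather than the disclosed face model.

```python
import torch
import torch.nn.functional as F

N, PARAM_DIM, NUM_VERTS = 20, 32, 468   # assumed sizes; 468 is only an illustrative vertex count

# hypothetical preset face model: a neutral mesh plus a linear blendshape basis
neutral = torch.randn(NUM_VERTS, 3)
blendshapes = torch.randn(PARAM_DIM, NUM_VERTS, 3)
keypoint_ids = torch.tensor([1, 13, 14, 61, 291])   # illustrative lip/cheek key-point indices

def drive_preset_face(sequence: torch.Tensor) -> torch.Tensor:
    """Apply a driving sequence to the preset face model, returning per-frame vertices."""
    offsets = sequence @ blendshapes.reshape(PARAM_DIM, -1)      # (N, NUM_VERTS * 3)
    return neutral.unsqueeze(0) + offsets.reshape(-1, NUM_VERTS, 3)

third_sequence = torch.randn(N, PARAM_DIM, requires_grad=True)   # output of the initial model
second_sequence = torch.randn(N, PARAM_DIM)                      # second object's driving sequence

first_face = drive_preset_face(third_sequence)     # first face model, driven by the third sequence
second_face = drive_preset_face(second_sequence)   # second face model, driven by the second sequence

# second loss: distance between corresponding key points of the two driven face models
second_loss = F.mse_loss(first_face[:, keypoint_ids], second_face[:, keypoint_ids])
print(second_loss.item())
```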
It can be understood that in this embodiment, the preset face model is driven by the third driving sequence obtained by the model, and further facial motion verification is performed on the preset face model, so as to further ensure accuracy and rationality of the third driving sequence output by the model, avoid the phenomenon of over-fitting of the model, and improve the training efficiency of the model.
In one example, step S403 includes the steps of:
determining a first loss function according to the third driving sequence and the second driving sequence; wherein the second drive sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object under the time frame corresponding to the second parameter set; according to the third driving sequence, carrying out parameter adjustment on the preset face model to obtain a first face model; according to the second driving sequence, carrying out parameter adjustment on the preset face model to obtain a second face model; determining a second loss function according to the first face model and the second face model; and carrying out parameter adjustment on the initial model according to the first loss function and the second loss function to obtain a style conversion model.
In this embodiment, on the basis of the above example, the second driving sequence also includes N second parameter sets corresponding to time frames one by one, where the second parameter sets may be understood as facial parameters of the second object under one time frame in the process of sending out the target voice.
After the third driving sequence is obtained, a first loss function may be constructed based on the third driving sequence and the second driving sequence. In addition, a second loss function can be constructed according to the first face model and the second face model obtained by driving the preset face model with the third driving sequence and the second driving sequence, respectively. Then, parameter adjustment is performed on the initial model by combining the first loss function and the second loss function, so as to train and obtain the style conversion model.
It can be appreciated that in this embodiment, the initial model may be trained by combining the first loss function constructed by the differences between the driving sequences and the second loss function constructed by the differences between the face models, so that the model training efficiency may be improved, and the model over-fitting phenomenon and the phenomenon of unreasonable facial actions during face driving are avoided, so as to improve the accuracy of style conversion performed by the style conversion model.
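Putting the two losses together, a single parameter-adjustment step could look like the sketch below. The stand-in model, the simplified key-point extraction and the 0.5 weighting between the losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, PARAM_DIM = 20, 32
initial_model = nn.Conv1d(PARAM_DIM, PARAM_DIM, kernel_size=3, padding=1)  # stand-in model
optimizer = torch.optim.Adam(initial_model.parameters(), lr=1e-4)

def keypoints_from(sequence: torch.Tensor) -> torch.Tensor:
    """Stand-in for driving the preset face model and extracting key-point positions."""
    return sequence[:, :6].reshape(-1, 2, 3)   # pretend two 3D key points per frame

first_sequence = torch.randn(N, PARAM_DIM)    # first object's style (model input)
second_sequence = torch.randn(N, PARAM_DIM)   # second object's style (training target)

third_sequence = initial_model(first_sequence.t().unsqueeze(0))[0].t()

first_loss = F.l1_loss(third_sequence, second_sequence)        # difference between sequences
second_loss = F.mse_loss(keypoints_from(third_sequence),
                         keypoints_from(second_sequence))      # difference between face models
loss = first_loss + 0.5 * second_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```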
In one example, the style conversion model includes M one-dimensional convolution layers, where the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
Illustratively, the style conversion model in this embodiment specifically includes M one-dimensional convolution layers. When each one-dimensional convolution layer performs a convolution operation on the data input to it, the obtained convolution result has the same size as that input data. That is, the convolution operation of the one-dimensional convolution layers in this embodiment does not change the size of the input data, and this model construction ensures that the input and output of the style conversion model have consistent sizes.
For example, a full convolution network may be used as the model architecture of the style conversion model in practical applications, and further the step size of each convolution layer in the full convolution network may be set to 1 to ensure consistency of the input and output sizes of each convolution layer.
It can be understood that in this embodiment, the style conversion processing is performed on the input driving sequence by setting a plurality of one-dimensional convolution layers, so that the model structure is simple, and consistency of the input and output data sizes can be ensured, so as to ensure fluency of the facial motion during final facial driving.
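A sketch of such a model built from M length-preserving one-dimensional convolution layers (stride 1, padding chosen so the output length equals the input length) is given below; M, the channel width and the kernel size are illustrative choices.

```python
import torch
import torch.nn as nn

class ConvStyleConverter(nn.Module):
    """Sketch of a style conversion model built from M one-dimensional convolution
    layers whose outputs have the same size as their inputs (stride 1, 'same' padding).
    M, the channel width and the kernel size are illustrative choices."""
    def __init__(self, param_dim: int = 32, m_layers: int = 4, kernel_size: int = 3):
        super().__init__()
        layers = []
        for i in range(m_layers):
            layers.append(nn.Conv1d(param_dim, param_dim, kernel_size,
                                    stride=1, padding=kernel_size // 2))
            if i < m_layers - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:   # seq: (batch, N, param_dim)
        x = seq.transpose(1, 2)                              # (batch, param_dim, N)
        return self.net(x).transpose(1, 2)

model = ConvStyleConverter()
pending = torch.randn(2, 20, 32)    # two driving sequences of 20 frames each
out = model(pending)
print(out.shape)                    # torch.Size([2, 20, 32]) - same size as the input
```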
In one example, the style conversion model may also obtain a model output result consistent with the size of the input driving sequence after a plurality of downsampling processes and upsampling processes are sequentially performed on the input driving sequence by a plurality of sequentially connected convolution layers.
Fig. 5 is a schematic structural diagram of a three-dimensional voice-based face driving device according to an embodiment of the present disclosure, and as shown in fig. 5, a three-dimensional voice-based face driving device 500 includes:
a determining unit 501, configured to determine a to-be-processed driving sequence of a to-be-processed voice; the to-be-processed driving sequence is used for indicating the three-dimensional facial action of the first object when the first object outputs the to-be-processed voice.
The processing unit 502 is configured to perform style conversion processing on a driving sequence to be processed according to a style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action when the second object outputs the voice to be processed; the style conversion model is obtained by training the initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice; the style conversion model is used to output a driving sequence conforming to the facial style of the second subject when speaking.
And a driving unit 503, configured to drive the three-dimensional facial model corresponding to the second object to perform facial motion according to the target driving sequence.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 6 is a schematic structural diagram of a second three-dimensional voice-based face driving device according to an embodiment of the present disclosure, as shown in fig. 6, a three-dimensional voice-based face driving device 600, including:
a determining unit 601, configured to determine a to-be-processed driving sequence of a to-be-processed voice; the to-be-processed driving sequence is used for indicating the three-dimensional facial action of the first object when the first object outputs the to-be-processed voice.
The processing unit 602 is configured to perform style conversion processing on the driving sequence to be processed according to the style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action when the second object outputs the voice to be processed; the style conversion model is obtained by training the initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice; the style conversion model is used to output a driving sequence conforming to the facial style of the second subject when speaking.
The driving unit 603 is configured to drive the three-dimensional face model corresponding to the second object to perform a facial action according to the target driving sequence.
In one example, the determining unit 601 includes:
a first determining module 6011, configured to determine phoneme information corresponding to a voice to be processed; wherein the phoneme information is a set of phonemes constituting a speech to be processed;
a second determining module 6012, configured to determine the to-be-processed driving sequence according to the mapping relation and the phoneme information; wherein the mapping relation is a correspondence between phonemes and preset parameter sets; a preset parameter set is used for indicating the three-dimensional facial action of the first object when uttering the corresponding phoneme; the to-be-processed driving sequence comprises at least one preset parameter set.
In one example, the first drive sequence includes N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object under the time frame corresponding to the first parameter set; n is a positive integer greater than 1; the style conversion model is obtained by carrying out parameter adjustment on the initial model according to the third driving sequence and the second driving sequence; the third driving sequence is output by the initial model according to the N first parameter sets; the third drive sequence includes N third parameter sets.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first face model and a second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to a third driving sequence; the second facial model is obtained by carrying out parameter adjustment on the preset facial model according to a second driving sequence.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first loss function and a second loss function; the first loss function is obtained according to the third driving sequence and the second driving sequence; wherein the second drive sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object under the time frame corresponding to the second parameter set; the second loss function is obtained according to the first face model and the second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to a third driving sequence; the second facial model is obtained by carrying out parameter adjustment on the preset facial model according to a second driving sequence.
In one example, the style conversion model includes M one-dimensional convolution layers, where the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 7 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure, and as shown in fig. 7, a training device 700 for a style conversion model includes:
an acquisition unit 701, configured to acquire at least one set of training sets; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice.
Training unit 702, configured to train the initial model according to at least one training set to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to the target style; the target style is a face style of the second subject when speaking.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 8 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure, and as shown in fig. 8, a training device 800 for a style conversion model includes:
An acquisition unit 801 for acquiring at least one set of training sets; wherein the training set comprises a first drive sequence and a second drive sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating the three-dimensional facial action of the second object when outputting the target voice.
Training unit 802, configured to train the initial model according to at least one training set to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to the target style; the target style is a face style of the second subject when speaking.
In one example, the first drive sequence includes N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object under the time frame corresponding to the first parameter set; n is a positive integer greater than 1;
training unit 802, comprising:
the first obtaining module 8021 is configured to input N first parameter sets to the initial model, so as to obtain a third driving sequence; wherein the third driving sequence comprises N third parameter sets;
the adjusting module 8022 is configured to perform parameter adjustment on the initial model according to the third driving sequence and the second driving sequence, so as to obtain a style conversion model.
In one example, the adjustment module 8022 includes:
a first adjusting submodule 80221, configured to perform parameter adjustment on a preset face model according to a third driving sequence, so as to obtain a first face model;
the second adjusting submodule 80222 is configured to perform parameter adjustment on the preset face model according to a second driving sequence to obtain a second face model;
and the third adjusting submodule 80223 is used for carrying out parameter adjustment on the initial model according to the first face model and the second face model to obtain a style conversion model.
In another example, the adjusting module 8022 includes:
a first determining submodule, configured to determine a first loss function according to the third driving sequence and the second driving sequence, where the second driving sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object in the time frame corresponding to the second parameter set;
a fourth adjusting submodule, configured to perform parameter adjustment on a preset face model according to the third driving sequence to obtain a first face model;
a fifth adjusting submodule, configured to perform parameter adjustment on the preset face model according to the second driving sequence to obtain a second face model;
a second determining submodule, configured to determine a second loss function according to the first face model and the second face model;
a sixth adjusting submodule, configured to perform parameter adjustment on the initial model according to the first loss function and the second loss function to obtain the style conversion model.
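For this two-loss variant, a rough sketch follows: the first loss compares the third and second driving sequences in parameter space, while the second loss drives a preset face model with each sequence and compares the resulting first and second face models in vertex space. The linear blendshape face model, the vertex count, and the 1:1 weighting of the two losses are assumptions made only for illustration.

```python
# Hedged sketch: first loss (sequence space) plus second loss (vertex space of a preset face model).
import torch
import torch.nn.functional as F

D, N, V = 52, 120, 5023  # assumed parameter size, frame count, vertex count

third_sequence = torch.randn(1, D, N, requires_grad=True)  # stands in for the initial model's output
second_sequence = torch.randn(1, D, N)                     # second object's driving sequence

neutral_vertices = torch.randn(V, 3)      # assumed neutral geometry of the preset face model
blendshape_basis = torch.randn(D, V, 3)   # assumed per-parameter vertex offsets

def drive_face_model(sequence: torch.Tensor) -> torch.Tensor:
    """Parameter-adjust the preset face model with a (1, D, N) driving sequence,
    returning per-frame vertices of shape (N, V, 3)."""
    coeffs = sequence.squeeze(0).transpose(0, 1)                    # (N, D)
    offsets = torch.einsum("nd,dvc->nvc", coeffs, blendshape_basis)
    return neutral_vertices.unsqueeze(0) + offsets

loss_sequence = F.mse_loss(third_sequence, second_sequence)  # first loss

first_face = drive_face_model(third_sequence)    # first face model (from the third sequence)
second_face = drive_face_model(second_sequence)  # second face model (from the second sequence)
loss_face = F.mse_loss(first_face, second_face)  # second loss

total_loss = loss_sequence + loss_face  # assumed 1:1 weighting
total_loss.backward()  # in training, gradients would flow back into the initial model
```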
In one example, the acquisition unit 801 includes:
a third determining module 8011, configured to determine phoneme information corresponding to the target voice, where the phoneme information is a set of phonemes constituting the target voice;
a fourth determining module 8012, configured to determine the first driving sequence according to the mapping relationship and the phoneme information, where the mapping relationship is a correspondence between phonemes and preset parameter sets; the preset parameter set is used for representing the three-dimensional facial action when the corresponding phoneme is uttered; the first driving sequence comprises at least one preset parameter set;
a second obtaining module 8013, configured to obtain the second driving sequence;
a fifth determining module 8014, configured to take the second driving sequence and the first driving sequence as one training set.
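The snippet below sketches how a first driving sequence could be assembled from phoneme information through a phoneme-to-parameter-set mapping and then paired with a captured second driving sequence as one training set. The phoneme inventory, the parameter dimensionality, and the fixed number of frames per phoneme are illustrative assumptions rather than details specified by the disclosure.

```python
# Hedged sketch: building the first driving sequence from phoneme information.
from typing import Dict, List
import numpy as np

D = 52  # assumed size of one preset parameter set

# Assumed mapping relation: phoneme -> preset parameter set representing the
# three-dimensional facial action when that phoneme is uttered.
mapping: Dict[str, np.ndarray] = {
    "sil": np.zeros(D),
    "b": np.random.rand(D),
    "ai": np.random.rand(D),
    "d": np.random.rand(D),
    "u": np.random.rand(D),
}

def build_first_driving_sequence(phonemes: List[str], frames_per_phoneme: int = 4) -> np.ndarray:
    """Look up the preset parameter set for each phoneme and repeat it over the
    frames that phoneme spans, yielding a (num_frames, D) driving sequence."""
    rows = []
    for ph in phonemes:
        params = mapping.get(ph, mapping["sil"])  # fall back to silence for unseen phonemes
        rows.extend([params] * frames_per_phoneme)
    return np.stack(rows, axis=0)

phoneme_info = ["sil", "b", "ai", "d", "u", "sil"]           # assumed phonemes of the target voice
first_driving_sequence = build_first_driving_sequence(phoneme_info)
second_driving_sequence = np.random.rand(len(first_driving_sequence), D)  # placeholder capture
training_set = (first_driving_sequence, second_driving_sequence)          # one training set
print(first_driving_sequence.shape)  # (24, 52) under the assumptions above
```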
The apparatus provided in this embodiment is configured to implement the technical solution provided by the foregoing method; the implementation principle and technical effect are similar and are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
The present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method provided in any one of the embodiments described above.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, as shown in fig. 9, an electronic device 900 in the present disclosure may include: a processor 901 and a memory 902.
A memory 902 for storing a program. The memory 902 may include volatile memory, such as random-access memory (RAM), for example static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory 902 may also include non-volatile memory, such as flash memory. The memory 902 is used to store computer programs (e.g., application programs and functional modules implementing the methods described above), computer instructions, and data, which may be stored in one or more memories 902 in a partitioned manner and may be invoked by the processor 901.
A processor 901 for executing the computer program stored in the memory 902 to implement the steps of the methods in the above embodiments; reference may be made in particular to the description of the foregoing method embodiments.
The processor 901 and the memory 902 may be separate structures or may be integrated structures. When the processor 901 and the memory 902 are separate structures, the memory 902 and the processor 901 may be coupled by a bus 903.
The electronic device in this embodiment may execute the technical solution of the above method; the specific implementation process and technical principle are the same and are not described here again.
The present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by any one of the embodiments described above.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 performs the methods and processes described above, such as the voice-based three-dimensional face driving method and the model training method. For example, in some embodiments, the voice-based three-dimensional face driving method and the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the voice-based three-dimensional face driving method and the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the voice-based three-dimensional face driving method and the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. A three-dimensional face driving method based on voice, comprising:
determining a to-be-processed driving sequence of the to-be-processed voice; the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when outputting the to-be-processed voice;
performing style conversion processing on the driving sequence to be processed according to a style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the voice to be processed; the style conversion model is obtained by training an initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice; the style conversion model is used for outputting a driving sequence which accords with the face style of the second object when speaking;
and driving the three-dimensional face model corresponding to the second object to perform facial actions according to the target driving sequence.
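Purely as a hedged illustration of the flow recited in claim 1 (not a statement of how the claim must be implemented), the snippet below passes a driving sequence to be processed through a style conversion model to obtain a target driving sequence and then hands each frame's parameter set to the second object's face rig. The network architecture, the sequence shapes, and the render_frame callback are hypothetical placeholders.

```python
# Hedged sketch: style conversion followed by driving a 3D face model, frame by frame.
import torch
import torch.nn as nn

D, N = 52, 120  # assumed parameter-set size and number of time frames

# Assumed (already trained) style conversion model: same-length 1-D convolutions.
style_conversion_model = nn.Sequential(
    nn.Conv1d(D, 128, kernel_size=3, padding="same"),
    nn.ReLU(),
    nn.Conv1d(128, D, kernel_size=3, padding="same"),
)
style_conversion_model.eval()

to_be_processed_sequence = torch.randn(1, D, N)  # first object's facial actions for the speech

with torch.no_grad():
    target_driving_sequence = style_conversion_model(to_be_processed_sequence)

def render_frame(parameter_set: torch.Tensor) -> None:
    """Hypothetical callback: apply one D-dimensional parameter set to the
    second object's three-dimensional face model for a single frame."""
    pass

for frame in target_driving_sequence.squeeze(0).transpose(0, 1):  # iterate over N frames
    render_frame(frame)
```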
2. The method of claim 1, wherein determining a to-be-processed driving sequence of the to-be-processed voice comprises:
determining phoneme information corresponding to the voice to be processed; wherein the phoneme information is a set of phonemes constituting the voice to be processed;
determining a driving sequence to be processed according to the mapping relation and the phoneme information; the mapping relation is a corresponding relation between the phonemes and a preset parameter set; the preset parameter set is used for indicating a three-dimensional facial action when the first object utters a phoneme; the driving sequence to be processed comprises at least one preset parameter set.
3. The method of claim 1 or 2, wherein the first driving sequence comprises N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object in the time frame corresponding to the first parameter set; N is a positive integer greater than 1; the style conversion model is obtained by carrying out parameter adjustment on the initial model according to a third driving sequence and a second driving sequence;
the third driving sequence is output by the initial model according to the N first parameter sets; the third driving sequence includes N third parameter sets.
4. The method of claim 3, wherein the style conversion model is obtained by performing parameter adjustment on the initial model according to a first face model and a second face model;
the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to the third driving sequence; the second facial model is obtained by carrying out parameter adjustment on a preset facial model according to the second driving sequence.
5. The method of claim 3, wherein the style conversion model is obtained by parameter adjustment of the initial model according to a first loss function and a second loss function;
the first loss function is derived from the third driving sequence and the second driving sequence; wherein the second driving sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object in the time frame corresponding to the second parameter set;
the second loss function is obtained according to a first face model and a second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to the third driving sequence; the second facial model is obtained by carrying out parameter adjustment on a preset facial model according to the second driving sequence.
6. The method according to any one of claims 1-5, wherein the style conversion model comprises M one-dimensional convolution layers, wherein the one-dimensional convolution layers are used for carrying out convolution processing on input data to obtain a processing result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
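A minimal sketch of a style conversion model built from M one-dimensional convolution layers whose output size equals the input size, as recited above, is shown below; M, the channel width, and the kernel size are assumed values chosen only to demonstrate the size-preserving property.

```python
# Hedged sketch: M same-size 1-D convolution layers (stride 1, "same" padding).
import torch
import torch.nn as nn

class SameSizeConvStack(nn.Module):
    def __init__(self, channels: int = 52, hidden: int = 128, m_layers: int = 4):
        super().__init__()
        layers = []
        in_ch = channels
        for i in range(m_layers):
            out_ch = channels if i == m_layers - 1 else hidden
            # stride-1 convolution with "same" padding keeps the temporal size unchanged
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=3, padding="same"))
            if i != m_layers - 1:
                layers.append(nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, parameter dimension, frames); the output has the identical shape
        return self.net(x)

model = SameSizeConvStack()
x = torch.randn(2, 52, 120)
assert model(x).shape == x.shape  # the processing result has the same size as the input
```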
7. A method of training a style conversion model, comprising:
acquiring at least one training set; wherein the training set comprises a first driving sequence and a second driving sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice;
training the initial model according to the at least one training set to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to a target style; the target style is the face style of the second object when speaking.
8. The method of claim 7, wherein the first driving sequence comprises N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object in the time frame corresponding to the first parameter set; N is a positive integer greater than 1;
training the initial model according to the at least one training set to obtain the style conversion model comprises:
inputting the N first parameter sets into the initial model to obtain a third driving sequence; wherein the third driving sequence comprises N third parameter sets;
and according to the third driving sequence and the second driving sequence, carrying out parameter adjustment on the initial model to obtain a style conversion model.
9. The method of claim 8, wherein performing parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain a style conversion model comprises:
according to the third driving sequence, carrying out parameter adjustment on a preset face model to obtain a first face model;
according to the second driving sequence, carrying out parameter adjustment on the preset face model to obtain a second face model;
and according to the first facial model and the second facial model, carrying out parameter adjustment on the initial model to obtain a style conversion model.
10. The method of claim 8, wherein performing parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain a style conversion model comprises:
determining a first loss function according to the third driving sequence and the second driving sequence; wherein the second driving sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object in the time frame corresponding to the second parameter set;
according to the third driving sequence, carrying out parameter adjustment on a preset face model to obtain a first face model; according to the second driving sequence, carrying out parameter adjustment on the preset face model to obtain a second face model; determining a second loss function according to the first face model and the second face model;
and according to the first loss function and the second loss function, carrying out parameter adjustment on the initial model to obtain a style conversion model.
11. The method of any of claims 7-10, wherein acquiring at least one training set comprises:
determining phoneme information corresponding to the target voice; wherein the phoneme information is a set of phonemes constituting the target voice;
determining a first driving sequence according to the mapping relation and the phoneme information; the mapping relation is a corresponding relation between the phonemes and a preset parameter set; the preset parameter set is used for representing the three-dimensional facial action when the corresponding phoneme is uttered; the first driving sequence comprises at least one preset parameter set;
acquiring a second driving sequence, and taking the second driving sequence and the first driving sequence as one training set.
12. The method according to any one of claims 7-11, wherein the style conversion model comprises M one-dimensional convolution layers, wherein the one-dimensional convolution layers are used for carrying out convolution processing on input data to obtain a processing result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
13. A three-dimensional face driving apparatus based on voice, comprising:
a determining unit for determining a to-be-processed driving sequence of the to-be-processed voice; the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when outputting the to-be-processed voice;
the processing unit is used for carrying out style conversion processing on the driving sequence to be processed according to the style conversion model to obtain a target driving sequence; the target driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the voice to be processed; the style conversion model is obtained by training an initial model according to at least one group of first driving sequences and second driving sequences; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice; the style conversion model is used for outputting a driving sequence which accords with the face style of the second object when speaking;
and the driving unit is used for driving the three-dimensional face model corresponding to the second object to perform facial actions according to the target driving sequence.
14. The apparatus of claim 13, wherein the determining unit comprises:
the first determining module is used for determining phoneme information corresponding to the voice to be processed; wherein the phoneme information is a set of phonemes constituting the voice to be processed;
the second determining module is used for determining a driving sequence to be processed according to the mapping relation and the phoneme information; the mapping relation is a corresponding relation between the phonemes and a preset parameter set; the preset parameter set is used for indicating a three-dimensional facial action when the first object utters a phoneme; the driving sequence to be processed comprises at least one preset parameter set.
15. The apparatus of claim 13 or 14, wherein the first driving sequence comprises N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object in the time frame corresponding to the first parameter set; N is a positive integer greater than 1; the style conversion model is obtained by carrying out parameter adjustment on the initial model according to a third driving sequence and a second driving sequence;
the third driving sequence is output by the initial model according to the N first parameter sets; the third driving sequence includes N third parameter sets.
16. The apparatus of claim 15, wherein the style conversion model is a result of parameter adjustment of the initial model based on a first face model and a second face model;
the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to the third driving sequence; the second facial model is obtained by carrying out parameter adjustment on a preset facial model according to the second driving sequence.
17. The apparatus of claim 15, wherein the style conversion model is a result of parameter adjustment of the initial model according to a first loss function and a second loss function;
the first loss function is derived from the third driving sequence and the second driving sequence; wherein the second driving sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object in the time frame corresponding to the second parameter set;
the second loss function is obtained according to a first face model and a second face model; the first facial model is obtained by carrying out parameter adjustment on a preset facial model according to the third driving sequence; the second facial model is obtained by carrying out parameter adjustment on a preset facial model according to the second driving sequence.
18. The apparatus of any one of claims 13-17, wherein the style conversion model comprises M one-dimensional convolution layers, wherein the one-dimensional convolution layers are configured to convolve input data to obtain a processed result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
19. A training apparatus for a style conversion model, comprising:
an acquisition unit for acquiring at least one training set; wherein the training set comprises a first driving sequence and a second driving sequence; the first driving sequence is used for indicating a three-dimensional facial action of the first object when outputting target voice; the second driving sequence is used for indicating a three-dimensional facial action of a second object when outputting the target voice;
the training unit is used for training the initial model according to the at least one training set to obtain a style conversion model; the style conversion model is used for outputting a driving sequence conforming to a target style; the target style is the face style of the second object when speaking.
20. The apparatus of claim 19, wherein the first driving sequence comprises N first parameter sets; the first parameter set is used for indicating the three-dimensional facial action of the first object in the time frame corresponding to the first parameter set; N is a positive integer greater than 1;
the training unit comprises:
the first acquisition module is used for inputting the N first parameter sets into the initial model to obtain a third driving sequence; wherein the third driving sequence comprises N third parameter sets;
and the adjusting module is used for carrying out parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain a style conversion model.
21. The apparatus of claim 20, wherein the adjustment module comprises:
the first adjusting sub-module is used for carrying out parameter adjustment on a preset face model according to the third driving sequence to obtain a first face model;
the second adjusting sub-module is used for carrying out parameter adjustment on the preset face model according to the second driving sequence to obtain a second face model;
and the third adjustment sub-module is used for carrying out parameter adjustment on the initial model according to the first face model and the second face model to obtain a style conversion model.
22. The apparatus of claim 20, wherein the adjustment module comprises:
a first determining submodule for determining a first loss function according to the third driving sequence and the second driving sequence; wherein the second driving sequence comprises N second parameter sets; the second parameter set is used for indicating the three-dimensional facial action of the second object in the time frame corresponding to the second parameter set;
A fourth adjustment sub-module, configured to perform parameter adjustment on a preset face model according to the third driving sequence, so as to obtain a first face model;
a fifth adjustment sub-module, configured to perform parameter adjustment on the preset face model according to the second driving sequence, to obtain a second face model;
a second determination submodule for determining a second loss function from the first face model and the second face model;
and the sixth adjustment sub-module is used for carrying out parameter adjustment on the initial model according to the first loss function and the second loss function to obtain a style conversion model.
23. The apparatus according to any one of claims 19-22, wherein the acquisition unit comprises:
a third determining module, configured to determine phoneme information corresponding to the target voice; wherein the phoneme information is a set of phonemes constituting the target voice;
a fourth determining module, configured to determine a first driving sequence according to the mapping relationship and the phoneme information; the mapping relationship is a corresponding relation between the phonemes and a preset parameter set; the preset parameter set is used for representing the three-dimensional facial action when the corresponding phoneme is uttered; the first driving sequence comprises at least one preset parameter set;
the second acquisition module is used for acquiring a second driving sequence;
and a fifth determining module, configured to take the second driving sequence and the first driving sequence as one training set.
24. The apparatus of any one of claims 19-23, wherein the style conversion model comprises M one-dimensional convolution layers, wherein the one-dimensional convolution layers are configured to convolve input data to obtain a processed result; and the size of the processing result is the same as the size of the input data; m is a positive integer greater than 1.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-12.
Priority Applications (1)

Application Number: CN202311766861.6A; Priority Date: 2023-12-20; Filing Date: 2023-12-20; Title: Three-dimensional face driving method based on voice, model training method and device; Status: Pending

Publications (1)

Publication Number: CN117788654A; Publication Date: 2024-03-29

Family ID: 90379211

Country Status (1)

Country: CN; Publication: CN117788654A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination