CN115938352A - Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium - Google Patents


Info

Publication number
CN115938352A
Authority
CN
China
Prior art keywords
mouth shape
target
shape coefficient
voice
phoneme sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211288333.XA
Other languages
Chinese (zh)
Inventor
张智勐
丁彧
吕唐杰
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202211288333.XA priority Critical patent/CN115938352A/en
Publication of CN115938352A publication Critical patent/CN115938352A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses a model obtaining method, a mouth shape coefficient generating device, mouth shape coefficient generating equipment and a mouth shape coefficient generating medium, which are applied to avatar voice animation synthesis technology. The method comprises the following steps: obtaining a first voice with a first preset duration and a first mouth shape coefficient corresponding to the first voice; determining a first phoneme sequence corresponding to the first voice, and determining the first phoneme sequence and the first mouth shape coefficient as a first training sample; acquiring a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice, the second voice being a voice having a target style; determining a second phoneme sequence corresponding to the second voice, and determining the second phoneme sequence and the second mouth shape coefficient as a second training sample; and training a target mouth shape coefficient generation model by using the first training sample and the second training sample. The target mouth shape coefficient generation model is used for generating a target mouth shape coefficient with the target style corresponding to a target voice.

Description

Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium
Technical Field
The application relates to an avatar voice animation synthesis technology, in particular to a method, a device, equipment and a medium for obtaining a mouth shape coefficient generation model, and also relates to a method, a device, equipment and a medium for generating mouth shape coefficients.
Background
In the field of animation synthesis, in order to obtain a complete virtual character animation, synthesis from voice to a mouth shape coefficient needs to be realized, that is, a section of voice is input and a mouth shape coefficient corresponding to the voice is synthesized, so as to ensure that the mouth shape state of the virtual character in the animation is matched with the voice content. However, when the virtual character image is shaped, different virtual characters need to have different speaking styles. Therefore, it is also necessary to ensure that the mouth shape coefficients correspond to the speaking style of the virtual character when determining the mouth shape coefficients.
In the prior art, in order to obtain mouth shape coefficients with a speaking style, at least 30 minutes of voice data preserving that speaking style must be recorded for each different speaking style as training samples to train a deep neural network, and a large amount of labor cost must be spent to obtain the voice and motion capture data.
Therefore, how to reduce the work cost in the process of obtaining the mouth shape coefficient with the speaking style retained and improve the animation production efficiency becomes a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The application provides a method and a device for obtaining a mouth shape coefficient generation model, an electronic device and a storage medium, wherein a mapping layer in a target mouth shape coefficient generation model is trained by using a long-time non-style voice sample, and style coding vectors in the target mouth shape coefficient generation model are trained by using a short-time voice sample with speaking style, so that the working cost in the obtaining process of the mouth shape coefficient with speaking style is reduced, and the animation production efficiency is improved.
The embodiment of the present application provides, in a first aspect, a method for obtaining a mouth shape coefficient generation model, including:
the method comprises the steps of obtaining a first voice with a first preset duration and a first mouth shape coefficient corresponding to the first voice, wherein the first voice is a non-style voice.
And determining a first phoneme sequence corresponding to the first voice, and determining the first phoneme sequence and the first mouth-shape coefficient as a first training sample.
And acquiring a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice, wherein the second voice is a voice with a target style. The second preset duration is shorter than the first preset duration.
And determining a second phoneme sequence corresponding to the second voice, and determining the second phoneme sequence and the second mouth shape coefficient as a second training sample.
And training the target mouth shape coefficient generation model by using the first training sample and the second training sample. The target mouth shape coefficient generation model is used for generating a target mouth shape coefficient with a target style corresponding to the target voice.
Wherein the first training sample is used for training a mapping layer in the target mouth shape coefficient generation model. The mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient. The second training sample is used for training the style coding layer in the target mouth shape coefficient generation model. And the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
A second aspect of the embodiments of the present application provides a method for generating a mouth shape coefficient, including:
and obtaining a target phoneme sequence of the target voice, wherein the target phoneme sequence is used for representing semantic information of the target voice.
And inputting the target phoneme sequence into a target mouth shape coefficient generation model to obtain a target mouth shape coefficient output by the target mouth shape coefficient generation model. The target mouth shape coefficient is used for driving the mouth motion change of the target virtual character.
Wherein the target mouth shape coefficient generation model is generated according to the obtaining method of the mouth shape coefficient generation model provided by the first aspect.
A third aspect of the embodiments of the present application provides an apparatus for obtaining a mouth shape coefficient generation model, including:
the obtaining unit is used for obtaining a first voice with a first preset duration and a first mouth shape coefficient corresponding to the first voice, wherein the first voice is a non-style voice.
And the determining unit is used for determining a first phoneme sequence corresponding to the first voice and determining the first phoneme sequence and the first mouth-shape coefficient as a first training sample.
The obtaining unit is further configured to obtain a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice. The second voice is a voice with a target style, and the second preset time length is less than the first preset time length.
And the determining unit is further configured to determine a second phoneme sequence corresponding to the second speech, and determine the second phoneme sequence and the second mouth shape coefficient as a second training sample.
And the training unit is used for training the target mouth shape coefficient generation model by using the first training sample and the second training sample. The target mouth shape coefficient generation model is used for generating a target mouth shape coefficient with a target style corresponding to the target voice.
Wherein the first training sample is used for training a mapping layer in the target mouth shape coefficient generation model. The mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient. The second training sample is used for training the style coding layer in the target mouth shape coefficient generation model. And the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
A fourth aspect of the embodiments of the present application provides an apparatus for generating a mouth shape coefficient, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target phoneme sequence of target voice, and the target phoneme sequence is used for representing semantic information of the target voice.
And the processing unit is used for inputting the target phoneme sequence into the target mouth shape coefficient generation model and obtaining a target mouth shape coefficient output by the target mouth shape coefficient generation model, and the target mouth shape coefficient is used for driving the mouth action change of the target virtual character.
Wherein the target mouth shape coefficient generation model is obtained according to the apparatus provided in the third aspect.
A fifth aspect of an embodiment of the present application provides an electronic device, including:
a processor;
a memory for storing a program of a method, which when read and executed by the processor, performs any of the methods described above.
The present application also provides a computer storage medium storing a computer program that, when executed, implements any of the methods described above.
Compared with the prior art, the method has the following advantages:
According to the method for obtaining the mouth shape coefficient generation model, semantic information required for mouth shape coefficient synthesis is provided by the first phoneme sequence of the first voice with the first preset duration, and style information required for mouth shape coefficient synthesis is provided by the mouth shape coefficient of the second voice with the target style. Then, based on the second phoneme sequence and the second mouth shape coefficient of the second voice with the second preset duration, the mouth shape coefficients corresponding to the phonemes under the target style are obtained after style feature encoding. The mapping relation between the original semantic phonemes and the mouth shape coefficients is established through the first phoneme sequence and the first mouth shape coefficient of the style-free first voice; if mouth shape coefficients with the style of a target character are required, only a small number of voice samples of that character need to be collected, and the corresponding style coding vector is obtained from those samples. In this way, the style coding vector applies style encoding to each phoneme, and the target mouth shape coefficient corresponding to each style-encoded phoneme can then be obtained based on the original mapping relation. In this method, the mapping relation between the original semantic phonemes and the mouth shape coefficients is the foundation, and only one segment of style-free voice sample is needed to obtain it. If mouth shape coefficients of different characters are needed, only a short voice sample of each character needs to be collected to train a different style coding vector, which greatly reduces the labor cost in obtaining mouth shape coefficients that preserve a speaking style and improves animation production efficiency.
Drawings
Fig. 1 is a schematic flowchart of a method for obtaining a mouth shape coefficient generation model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a mouth shape coefficient generation model provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for generating a mouth shape coefficient according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for obtaining a mouth shape coefficient generation model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for generating a mouth shape coefficient according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The application provides a method and a device for obtaining a mouth shape coefficient generation model, an electronic device and a storage medium, wherein a mapping layer in a target mouth shape coefficient generation model is trained by using a long-time non-style voice sample, and a style coding vector in the target mouth shape coefficient generation model is trained by using a short-time voice sample with a speaking style, so that the working cost in the obtaining process of the mouth shape coefficient with the speaking style is reduced, and the animation production efficiency is improved.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is capable of embodiments in many different forms than those described herein and is therefore not limited to the specific embodiments disclosed below.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to limit the application. Terms used in this application and in the appended claims, such as "a," "an," "first," and "second," are not intended to limit number or order, but are used to distinguish one type of information from another.
In the field of animation synthesis, in order to obtain a complete virtual character animation, synthesis from voice to a mouth shape coefficient needs to be realized, that is, a section of voice is input and a mouth shape coefficient corresponding to the voice is synthesized, so as to ensure that the mouth shape state of the virtual character in the animation is matched with the voice content. However, when the virtual character image is shaped, different virtual characters need to have different speaking styles. Therefore, it is also necessary to ensure that the mouth shape coefficients correspond to the speaking style of the virtual character when determining the mouth shape coefficients.
In the prior art, in order to obtain a mouth shape coefficient with a speaking style, for each different speaking style, at least 30 minutes of voice data preserving that speaking style must be recorded as training samples, and a large number of such training samples are then required to train a deep neural network; in this way, a large amount of labor cost must be spent to obtain the voice and motion capture data.
In order to solve the above problems, the present application provides an obtaining method of a mouth shape coefficient generation model, so as to reduce the working cost in the mouth shape coefficient obtaining process of preserving the speaking style and improve the animation production efficiency.
The core of the method for obtaining the mouth shape coefficient generation model provided by the application is as follows:
in the sample collection process, the style information of the mouth shape can be obtained based on the mouth shape coefficient, and when the style information is obtained based on the mouth shape coefficient, only a small amount of voice collection is needed.
Therefore, in the process of training the mouth shape coefficient generation model, the present application uses the first phoneme sequence and the first mouth shape coefficient of the first voice with the first preset duration, together with the second mouth shape coefficient of the second voice with the target style, as training samples to train the mouth shape coefficient generation model.
The method provides semantic information required for mouth shape coefficient synthesis through the first phoneme sequence of the first voice with the first preset duration, provides style information required for mouth shape coefficient synthesis through the mouth shape coefficient of the second voice with the target style, and determines the mapping relations between different phonemes and different mouth shape coefficients based on the first phoneme sequence and the first mouth shape coefficient of the first voice.
The embodiment of the present application first provides a method for obtaining a mouth shape coefficient generation model, where an implementation subject of the method may be various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a personal digital assistant, a dedicated messaging device, a game console), or a combination of any two or more of these data processing devices, and may also be a server.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for obtaining a mouth shape coefficient generation model according to an embodiment of the present application, where the method includes the following steps S101 to S105:
step S101, a first voice with a first preset duration and a first model coefficient corresponding to the first voice are obtained.
The first voice corresponds to a first phoneme sequence and a first mouth shape coefficient; the first phoneme sequence is used for representing semantic information of the first voice.
In an optional implementation manner of the present application, the first voice of the first preset duration may be a segment of style-free voice data.
The first mouth shape coefficient may be understood as data for driving a change in the shape of the mouth of the virtual character when the virtual character plays the first voice. During a specific application, the first mouth shape coefficient can be obtained by collecting mouth shape change data of a designated virtual character.
Step S102, a first phoneme sequence corresponding to the first voice is determined.
A phoneme is the smallest voice unit divided according to the natural attributes of voice and, from the perspective of human physiology, is produced by a pronunciation action. For example, [ma] contains the two pronunciation actions [m] and [a], which are two different phonemes. In a specific application, all content in a language can be obtained by different combinations of phonemes; for example, for the phonemes involved in Chinese, the set of all phonemes is { 'ao', 'l', 'r', 'j', 'h', 'z', 'iao', 't', 'm', 'uo', 'ei', 'er', 'o', 'v', 's', 'ng', 'd', 'uai', 'w', 'ai', 'x', 'i', 'f', 'n', 'a', 'iu', 'u', 'ui', 'zh', 'ch', 'out', 'g', 'ie', 'k', 'sil', 'e', 'y', 'q', 'ua', 'u', 'p', 'c', 'ue' }. Further, the phoneme sequence of the first voice can be understood as content representing the semantic information of the voice, and may be obtained by removing content such as timbre information from the first voice.
In an alternative embodiment of the present application, the phoneme sequence of the first voice may be obtained through the following steps S1 to S3:
step S1, obtaining duration information of each phoneme in the first voice according to the pronunciation sequence of each phoneme in the first voice;
the duration information of each phoneme in the first speech can be understood as duration information occupied by a designated virtual character when the virtual character reads each phoneme in the first speech when playing the first speech,
for example, the duration information of each phoneme may be expressed as: { 'sil':0 ms-20 ms;
20 ms-50 ms for 'w'; …, where 'sil' and 'w' represent different phone classes, and 0ms to 20ms represents the time period of 0ms to 20ms when the virtual character starts to speak
The phoneme of 'sil' further indicates that the virtual character is reading the phoneme of 'w' in the time period of 20ms to 50ms when the character is reading the phoneme of 'sil' in the time period of 20ms to 50 ms.
Step S2, determining the mouth shape frame rate information of the target character;
In the image field, the frame rate refers to the number of frames displayed per second on the screen. The target character can be understood as the action object of the stylized mouth shape coefficients generated by the mouth shape coefficient generation model; in practical applications, the target character is also a virtual character. The mouth shape frame rate information of the target character can be understood as the frame rate of the animation when the target character plays a segment of voice in the form of animation.
Step S3, aligning each phoneme in the first voice according to the mouth shape frame rate information of the target character and the duration information of each phoneme, and taking the aligned phonemes in the first voice as the first phoneme sequence.
Aligning each phoneme in the first voice according to the mouth shape frame rate information of the target character and the duration information of each phoneme means determining, from the frame rate at which the target character plays the voice animation and the pronunciation duration of each phoneme, how many animation frames each phoneme occupies.
For example, assuming that the frame rate at which the target character plays the speech animation is 100 fps, for the phoneme duration information { 'sil': 0 ms–20 ms; 'w': 20 ms–50 ms; … }, the resulting first phoneme sequence may be expressed as { 'sil', 'sil', 'w', 'w', 'w', … }; that is, 'sil' lasts 20 ms and occupies 2 frames of the speech animation when the target character plays the speech animation, and 'w' lasts 30 ms and occupies 3 frames of the speech animation.
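For illustration only, the alignment of step S3 can be sketched as follows; the function name, the phoneme-duration input format, and the 100 fps frame rate are assumptions of this example and are not prescribed by the embodiment.

```python
# Illustrative sketch of the alignment in step S3: expand each phoneme into one
# entry per animation frame, based on its duration and the target frame rate.
# The input format and the function name are assumptions made for this example.
def align_phonemes(phoneme_durations, fps=100):
    """phoneme_durations: list of (phoneme, start_ms, end_ms) tuples."""
    frame_ms = 1000.0 / fps          # duration of one animation frame in ms
    sequence = []
    for phoneme, start_ms, end_ms in phoneme_durations:
        n_frames = round((end_ms - start_ms) / frame_ms)
        sequence.extend([phoneme] * n_frames)
    return sequence

# Using the example from the text: 'sil' for 0-20 ms and 'w' for 20-50 ms at 100 fps
print(align_phonemes([("sil", 0, 20), ("w", 20, 50)]))
# -> ['sil', 'sil', 'w', 'w', 'w']
```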
And S103, acquiring a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice.
The second voice is a voice with a target style, and the collected second voice is far shorter in duration than the first voice.
In the embodiment of the application, the aim is to train and obtain a mouth shape coefficient generation model capable of synthesizing mouth shape coefficients with style classes. Therefore, in the stage of collecting training samples, the collection of style class information of speech is essential.
In this embodiment of the application, the second voice is a carrier of the target style. For example, assuming that the target style represents a lively personality of a virtual character, the second voice may be a segment of voice data of another character with the same personality; for another example, assuming that the target style represents a dialect spoken by the virtual character, the second voice may be a segment of voice data of other characters speaking the same dialect.
In an alternative embodiment of the present application, the voice having the target style may be obtained by having relevant staff produce speech for the target character; as in step S101 above, the target character refers to the action object of the stylized mouth shape coefficients generated by the mouth shape coefficient generation model.
Furthermore, the second mouth shape coefficient is mainly used for determining style characteristics of the target style, wherein the style characteristics of the target style are obtained in the process of training an initial mouth shape coefficient generation model. And in order to obtain the style characteristics of the target style, only voice data of about 1 minute to 5 minutes needs to be collected.
And step S104, determining a second phoneme sequence corresponding to the second voice.
Similarly, this step may refer to the determination process of the first phoneme sequence corresponding to the first voice in step S102, which is not described herein again.
Step S105, using the first phoneme sequence and the first mouth shape coefficient as a first training sample, using the second phoneme sequence and the second mouth shape coefficient as a second training sample, and training an initial target mouth shape coefficient generation model by using the first training sample and the second training sample.
The first training sample is used for training a mapping layer in the target mouth shape coefficient generation model, and the mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient. And the second training sample is used for training a style coding layer in the target mouth shape coefficient generation model, and the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
In the specific application process, Machine Learning (ML) is adopted to train the initial mouth shape coefficient generation model. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like, and it specially studies how a computer simulates or realizes human learning behavior to acquire new knowledge or skills and reorganize the existing knowledge structure so as to continuously improve its performance. Machine learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like. Machine learning is a branch of Artificial Intelligence (AI) technology.
Before the initial mouth shape coefficient generation model is trained using the first phoneme sequence, the first mouth shape coefficient and the second mouth shape coefficient, it should be noted that the first phoneme sequence exists in the form of phoneme class features (for example, a phoneme sequence may be { 'sil', 'sil', 'w', 'w', … }, where 'sil' and 'w' each represent a different phoneme class), and within each phoneme sequence the different phonemes are discrete and unordered rather than continuous. Therefore, in an alternative embodiment of the present application, the first phoneme sequence needs to be further preprocessed.
Specifically, the process of preprocessing the first phoneme sequence specifically includes: and performing feature coding on the first phoneme sequence to obtain a phoneme feature vector corresponding to the first phoneme sequence.
In an alternative embodiment of the present application, the phoneme feature vector may be obtained in the form of one-hot encoding. In this embodiment of the present application, processing the first phoneme sequence in the one-hot encoding form refers to converting the discrete phoneme categories in each phoneme sequence into a vector form, so that the mapping relationship between different phonemes and different mouth shape coefficients can be learned in the process of training the initial mouth shape coefficient generation model.
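As a minimal illustration of this preprocessing, the sketch below one-hot encodes an aligned phoneme sequence; the reduced phoneme inventory, its ordering, and the function name are assumptions made for this example.

```python
import numpy as np

# Illustrative one-hot encoding of an aligned phoneme sequence.
# The phoneme inventory and its ordering are assumptions for this example;
# in practice the full Chinese phoneme set listed above would be used.
PHONEMES = ["sil", "w", "a", "m"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}

def one_hot_encode(phoneme_sequence):
    """Return a (sequence_length, num_phonemes) one-hot matrix."""
    vectors = np.zeros((len(phoneme_sequence), len(PHONEMES)), dtype=np.float32)
    for frame_idx, phoneme in enumerate(phoneme_sequence):
        vectors[frame_idx, PHONEME_TO_ID[phoneme]] = 1.0
    return vectors

print(one_hot_encode(["sil", "sil", "w", "w", "w"]).shape)  # (5, 4)
```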
In particular, the first mouth shape coefficient is used for determining the corresponding relation between different phonemes and different mouth shape coefficients in combination with the first phoneme sequence; the second mouth shape coefficient is used for determining the style characteristics of the target style of the second voice.
Further, in order to facilitate understanding of the method for obtaining the mouth shape coefficient generation model provided in the present application, a training process of the mouth shape coefficient generation model is specifically described below with reference to a specific structure of the mouth shape coefficient generation model.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a mouth shape coefficient generation model according to an embodiment of the present application.
The mouth shape coefficient generation model includes: semantic coding network 201, style coding layer 202, feature transmission module 203 and mapping layer 204.
The mouth shape coefficient generation model is mainly used for outputting mouth shape coefficients with style characteristics corresponding to voice information according to the input voice information, and the mouth shape coefficients are specifically used for driving the mouth shape state change of the virtual character to enable the mouth shape state of the virtual character during voice playing to be matched with the voice style characteristics of the virtual character.
In the specific application process, the mouth shape state change of the virtual character is mainly controlled by the semantic information of voice played by the virtual character and the style information of the virtual character. Therefore, when the initial mouth shape coefficient generation model is trained, the whole training process needs to be completed based on training sample data which simultaneously has semantic information and style information of a virtual character.
In this embodiment of the present application, the first phoneme sequence is mainly used to provide semantic information required for training the initial mouth shape coefficient generation model, and the second mouth shape coefficient is used to provide style information required for training the initial mouth shape coefficient generation model.
Specifically, the semantic information in the first phoneme sequence may be extracted by the semantic coding network 201, and the semantic coding network 201 may be understood as a convolutional neural network. In an optional embodiment of the present application, the semantic coding network 201 consists of six 1-dimensional convolutional layers, each with a kernel size of 3 and a stride of 1.
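A minimal sketch of such a network is given below, assuming PyTorch, one-hot phoneme vectors as input, and a 256-dimensional output width; the channel widths, padding, and ReLU activations are assumptions of this sketch, since the text only specifies six 1-D convolution layers with kernel size 3 and stride 1.

```python
import torch
import torch.nn as nn

# Illustrative semantic coding network: six 1-D convolution layers with
# kernel size 3 and stride 1, as described above. Channel widths, padding
# and the ReLU activations are assumptions made for this sketch.
class SemanticEncoder(nn.Module):
    def __init__(self, num_phonemes: int, hidden_dim: int = 256):
        super().__init__()
        layers = []
        in_channels = num_phonemes
        for _ in range(6):
            layers += [nn.Conv1d(in_channels, hidden_dim, kernel_size=3,
                                 stride=1, padding=1),
                       nn.ReLU()]
            in_channels = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, phoneme_vectors: torch.Tensor) -> torch.Tensor:
        # phoneme_vectors: (batch, num_phonemes, n_frames) one-hot features
        return self.net(phoneme_vectors)   # (batch, hidden_dim, n_frames)

features = SemanticEncoder(num_phonemes=43)(torch.zeros(1, 43, 5))
print(features.shape)  # torch.Size([1, 256, 5])
```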
Further, after the semantic coding network 201 acquires the semantic information in the first phoneme sequence, the semantic information and the first mouth shape coefficient are input into the mapping layer 204, so that the mapping layer learns the mapping relationship between different mouth shape coefficients and the semantic information.
Further, the style coding layer 202 is configured to extract style features of the target style from the second mouth shape coefficient, and store the style features of the target style. Specifically, the style coding layer 202 mainly comprises a style coding vector for storing style characteristics of the target style, where the style coding vector may be represented by T = { μ, σ }, where μ represents the target style translation coefficient and σ represents a scaling coefficient of the target style.
In a specific application process, the ability of the style coding layer 202 to extract style features is trained while the initial mouth shape coefficient generation model is trained. Specifically, training this style feature learning module requires acquiring a second phoneme sequence of the second voice with the target style, in the same manner as the first phoneme sequence; the second phoneme sequence is used to represent semantic information of the second voice.
Further, the second phoneme sequence may be obtained in a manner similar to the above steps S1 to S3, that is, firstly, according to the pronunciation sequence of each phoneme in the second speech, obtaining duration information of each phoneme in the second speech; secondly, determining the mouth shape frame rate information of the target role; and finally, according to the mouth frame rate information of the target character and the duration information of each phoneme, performing alignment processing on each phoneme in the second voice, and taking each phoneme in the second voice after the alignment processing as the second phoneme sequence.
Then, carrying out feature coding on the second phoneme sequence to obtain a phoneme feature vector corresponding to the second phoneme sequence; finally, the phoneme feature vector is input into a semantic coding network 201, and a semantic feature in the second phoneme sequence is extracted through the semantic coding network 201 (for convenience of description, the semantic feature is represented by a symbol F hereinafter).
After the semantic features of the second phoneme sequence are obtained, the semantic features are input into the style coding layer 202, and the style coding layer 202 performs an AdaIN operation on the feature F using the style coding vector T = {μ, σ}, so as to obtain the stylized target semantic feature F̂ ∈ R^(256×n), where 256 represents the dimension of the stylized target semantic feature and n represents the length of the stylized target semantic feature; finally, the stylized target semantic feature is input into the subsequent mapping layer 204.
The AdaIN operation is represented by the following formula (1):
F̂ = σ · (F − m(F)) / v(F) + μ        (1)
where m(F) represents the mean of the semantic feature F and v(F) represents the standard deviation of the semantic feature F.
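For clarity, a minimal sketch of this AdaIN-style operation is shown below, assuming PyTorch tensors of shape (batch, channels, frames); normalizing over the frame axis and the epsilon term are assumptions of this sketch.

```python
import torch

# Illustrative AdaIN-style stylization of semantic features, following
# formula (1) above: F_hat = sigma * (F - m(F)) / v(F) + mu.
# Normalizing over the frame axis and the eps term are assumptions here.
def adain(features: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    # features: (batch, 256, n_frames); mu, sigma: (256,) style vector T
    mean = features.mean(dim=-1, keepdim=True)       # m(F)
    std = features.std(dim=-1, keepdim=True) + eps   # v(F)
    normalized = (features - mean) / std
    return sigma.view(1, -1, 1) * normalized + mu.view(1, -1, 1)

styled = adain(torch.randn(1, 256, 5), torch.zeros(256), torch.ones(256))
print(styled.shape)  # torch.Size([1, 256, 5])
```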
Further, after learning different semantic information and target style information based on the first phoneme sequence and the second mouth shape coefficient, a mapping relationship between different semantics and different mouth shape coefficients needs to be further learned.
In the embodiment of the present application, the mapping relationship between the different semantics and the different mouth shape coefficients is obtained by the mapping layer 204.
Further, in order to determine the training degree of the mouth shape coefficient generation network, a loss function of the mouth shape coefficient generation model may be constructed as formula (2), which measures the difference between B and B̂, where B ∈ R^(m×n) represents the mouth shape coefficient of the sample voice input to the mouth shape coefficient generation model, B̂ ∈ R^(m×n) represents the mouth shape coefficient with the target style output by the mouth shape coefficient generation model, m represents the feature dimension of the mouth shape coefficient, and n represents the number of frames (sequence length) of the mouth shape coefficient.
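The exact form of formula (2) is not reproduced in this text; purely as an illustration, a common choice consistent with the quantities defined above is a mean squared reconstruction loss between B and B̂, sketched below under that assumption.

```python
import torch

# Illustrative reconstruction loss between the ground-truth mouth shape
# coefficients B and the model output B_hat, both of shape (m, n).
# Using a mean squared error is an assumption; the patent's formula (2)
# may use a different distance.
def mouth_coeff_loss(B: torch.Tensor, B_hat: torch.Tensor) -> torch.Tensor:
    return torch.mean((B - B_hat) ** 2)

B = torch.randn(52, 120)      # e.g. m = 52 feature dims, n = 120 frames (arbitrary)
B_hat = torch.randn(52, 120)
print(mouth_coeff_loss(B, B_hat))
```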
Based on the above description of the method for obtaining the mouth shape coefficient generation model, in the training samples, the second mouth shape coefficient is used to determine the style characteristics of the target style, and the first phoneme sequence and the first mouth shape coefficient are used to provide semantic information and the mapping relationships between different semantic information and different mouth shape coefficients.
Furthermore, for a mouth shape coefficient generation model that generates mouth shape coefficients of a different style, only speech carrying that style's characteristics needs to be collected again when gathering samples.
From the above description, when it is desired to obtain mouth shape coefficients of different styles, it is not necessary to retrain the mapping layer; it is only necessary to collect a short segment of voice in another style in addition to the first voice and train the style coding vector in the style coding layer. Therefore, in an alternative embodiment of the present application, the method for obtaining the mouth shape coefficient generation model further includes the following steps S106 to S108:
and step S106, obtaining a third voice with other target styles and a third mouth shape coefficient corresponding to the third voice.
And step S107, acquiring a third phoneme sequence corresponding to the third voice.
Step S108, after training the mapping layer of the mouth shape coefficient generation model by using the first phoneme sequence and the first mouth shape coefficient, training the style coding layer of the mouth shape coefficient generation model by using the third phoneme sequence and the third mouth shape coefficient to obtain a target mouth shape coefficient generation model for generating a target voice with other target style.
The third phoneme sequence can then be input into the mouth shape coefficient generation model for generating mouth shape coefficients with the other target style, and third mouth shape coefficients with the other target style output by the mouth shape coefficient generation model are obtained. The third voice with the other target style is similar to the second voice mentioned in the above steps, except that the style of the third voice differs from the style of the second voice.
That is to say, in the method for obtaining a mouth shape coefficient generation model provided in the embodiment of the present application, for mouth shape coefficients of different styles and types, only the speech corresponding to the style needs to be collected in a targeted manner to obtain the mouth shape coefficient of the speech and obtain corresponding style characteristics, and there is no need to repeatedly determine a mapping relationship between various semantic information and the mouth shape coefficient by using a large amount of speech information.
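To make the two training phases concrete, the sketch below fits the mapping on the long style-free sample first, then freezes the encoder and mapping and optimizes only the style coding vector on the short styled sample; it reuses the mouth_coeff_loss and adain helpers from the earlier sketches, and the module names, optimizer, and epoch count are assumptions of this illustration rather than part of the patent.

```python
import torch

# Illustrative two-phase training, assuming the encoder, mapping, adain and
# mouth_coeff_loss callables sketched above. mu and sigma are trainable style
# vector tensors created with requires_grad=True.
def train_two_phase(encoder, mapping, mu, sigma,
                    styleless_batch, styled_batch, epochs=100):
    phon_1, coeff_1 = styleless_batch   # first phoneme features / mouth coefficients
    phon_2, coeff_2 = styled_batch      # second phoneme features / mouth coefficients

    # Phase 1: learn the phoneme -> mouth shape coefficient mapping.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(mapping.parameters()))
    for _ in range(epochs):
        opt.zero_grad()
        loss = mouth_coeff_loss(coeff_1, mapping(encoder(phon_1)))
        loss.backward()
        opt.step()

    # Phase 2: freeze encoder and mapping, train only the style coding vector T.
    for p in list(encoder.parameters()) + list(mapping.parameters()):
        p.requires_grad_(False)
    opt = torch.optim.Adam([mu, sigma])
    for _ in range(epochs):
        opt.zero_grad()
        styled = adain(encoder(phon_2), mu, sigma)
        loss = mouth_coeff_loss(coeff_2, mapping(styled))
        loss.backward()
        opt.step()
```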
In summary, the method for obtaining a mouth shape coefficient generation model provided in the embodiment of the present application trains the mouth shape coefficient generation model by using the first phoneme sequence and the first mouth shape coefficient of the first voice with a preset duration and the second mouth shape coefficient of the second voice with a target style as training samples.
The method comprises the steps of providing semantic information required by synthesizing a mouth shape coefficient through a first phoneme sequence of first voice with a first preset duration, providing style information required by synthesizing the mouth shape coefficient through a mouth shape coefficient of second voice with a target style, and determining mapping relations between different phonemes and different mouth shape coefficients based on the first phoneme sequence and the first mouth shape coefficient of the first voice. The working cost in the process of obtaining the mouth shape coefficient with the speaking style reserved is reduced, and the animation production efficiency is improved.
Please refer to fig. 3, where fig. 3 is a flowchart of a mouth shape coefficient generation method according to another embodiment of the present application, and the embodiment of the method is basically similar to the above method for obtaining a mouth shape coefficient generation model, so that description is simple, and relevant parts can be obtained by referring to the related description of the method for obtaining a mouth shape coefficient generation model in the present application, and details are not repeated here.
The method comprises the following steps S301 and S302:
step S301 obtains a target phoneme sequence of the target speech.
Wherein the target phoneme sequence is used for representing semantic information of the target voice.
Step S302, inputting the target phoneme sequence into a target mouth shape coefficient generation model for generating mouth shape coefficients with a target style, and obtaining target mouth shape coefficients with a target style corresponding to the target voice.
In an alternative embodiment of the present application, the target phoneme sequence of the target speech may be obtained by:
acquiring duration information of each phoneme in the target voice according to the pronunciation sequence of each phoneme in the target voice;
determining the mouth shape frame rate information of a target role corresponding to the target voice;
and processing each phoneme in the target voice according to the mouth frame rate information of the target character and the duration information of each phoneme, and taking each phoneme in the target voice after being processed as a target phoneme sequence.
In an optional embodiment of the present application, the method further comprises:
and performing feature coding on the target phoneme sequence to obtain a phoneme feature vector corresponding to the target phoneme sequence.
The target mouth shape coefficient generation model is a dual-branch neural network. Therefore, when the target phoneme sequence is input into the target mouth shape coefficient generation model, the speaking style of the target virtual character corresponding to the target voice is determined first. If the target virtual character has no speaking style requirement, the target phoneme sequence corresponding to the target voice is directly input into the mapping layer of the target mouth shape coefficient generation model through the first branch. If the target virtual character has a speaking style, the target phoneme sequence is input into the style coding layer of the target mouth shape coefficient generation model through the second branch, and the intermediate output result of the style coding layer is then input into the mapping layer of the target mouth shape coefficient generation model to obtain the final target mouth shape coefficient.
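A minimal sketch of this two-branch inference path, reusing the encoder, adain and mapping callables assumed in the earlier sketches, is shown below; the function name and the way the style is selected are assumptions of this illustration.

```python
# Illustrative two-branch inference. Passing style=None selects the style-free
# branch; this calling convention is an assumption of the sketch.
def generate_mouth_coeffs(encoder, mapping, phoneme_features, style=None):
    semantic = encoder(phoneme_features)          # semantic features F
    if style is None:
        return mapping(semantic)                  # branch 1: no speaking style
    mu, sigma = style                             # branch 2: stylize, then map
    return mapping(adain(semantic, mu, sigma))
```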
Referring to fig. 4, fig. 4 is a schematic structural diagram of an apparatus for obtaining a mouth shape coefficient generation model according to an embodiment of the present application. Since this embodiment of the apparatus is substantially similar to the embodiment of the method for obtaining a mouth shape coefficient generation model described above, the description is relatively simple; for related details, reference is made to the description of the method embodiment above, and the following description of the apparatus embodiment is merely illustrative.
The device for obtaining the mouth shape coefficient generation model comprises:
the obtaining unit 401 is configured to obtain a first voice with a first preset duration and a first model coefficient corresponding to the first voice. The first voice is a no-style voice.
A determining unit 402, configured to determine a first phoneme sequence corresponding to the first voice, and determine the first phoneme sequence and the first mouth shape coefficient as a first training sample.
The obtaining unit 401 is further configured to obtain a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice. The second voice is a voice with a target style, and the second preset time length is less than the first preset time length.
The determining unit 402 is further configured to determine a second phoneme sequence corresponding to the second speech, and determine the second phoneme sequence and the second mouth shape coefficient as a second training sample.
A training unit 403, configured to train the target mouth shape coefficient generation model using the first training sample and the second training sample. The target mouth shape coefficient generation model is used for generating a target mouth shape coefficient with a target style corresponding to the target voice.
Wherein the first training sample is used for training a mapping layer in the target mouth shape coefficient generation model. The mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient. The second training sample is used for training the style coding layer in the target mouth shape coefficient generation model. And the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
In an alternative embodiment, the training unit 403 is specifically configured to directly input the first phoneme sequence in the first training sample into the mapping layer, and obtain a first output result corresponding to the first phoneme sequence. A first loss value is determined based on the first output result and the first mouth shape coefficient. And updating the network layer coefficient of the mapping layer according to the first loss value.
In an optional embodiment, the training unit 403 is specifically configured to, after the mapping layer is trained by using the first training sample, input a second phoneme sequence in the second training sample into the style coding layer, and obtain an intermediate output result corresponding to the second phoneme sequence. And inputting the intermediate output result into the mapping layer to obtain a second output result corresponding to the second phoneme sequence. And determining a second loss value according to the second output result and the second mouth shape coefficient. And updating the network layer coefficient of the style coding layer according to the second loss value.
In an alternative embodiment, the style encoding layer includes trainable style encoding vectors.
The training unit 403 is specifically configured to obtain a second phoneme feature vector corresponding to the second phoneme sequence. And extracting semantic features corresponding to the second phoneme sequence according to the phoneme feature vector. And coding the semantic features by utilizing the style coding vector to obtain the stylized target semantic features. And inputting the stylized target semantic features into a mapping layer.
In an alternative embodiment, the first phoneme sequence or the second phoneme sequence is obtained by:
and acquiring the pronunciation sequence of each phoneme in the first voice or the second voice and the pronunciation duration information of each phoneme.
And determining mouth shape frame rate information of the target virtual character, wherein the target virtual character corresponds to the target style.
And aligning each phoneme in the first voice or the second voice according to the mouth frame rate information of the target character and the duration information of each phoneme in the first voice or the second voice.
And determining the first phoneme sequence or the second phoneme sequence according to the arrangement sequence of the phonemes in the aligned first voice or second voice.
In an optional embodiment, the target mouth shape coefficient generation model further comprises a semantic coding network.
The obtaining unit 401 is further configured to perform feature coding on the first phoneme sequence by using a semantic coding network, and obtain a first phoneme feature vector corresponding to the first phoneme sequence. And performing feature coding on the second phoneme sequence by using a semantic coding network to obtain a second phoneme feature vector corresponding to the second phoneme sequence.
The training unit 403 is specifically configured to directly input the first phoneme feature vector into the mapping layer.
In an alternative embodiment, the obtaining unit 401 is specifically configured to cut the first phoneme sequence into a plurality of first phoneme sequence fragments according to a preset phoneme sequence length. And respectively carrying out feature coding on the plurality of first phoneme sequence fragments to obtain first phoneme feature vectors corresponding to the first phoneme sequence fragments.
In an alternative embodiment, the obtaining unit 401 is specifically configured to cut the second phoneme sequence into at least one second phoneme sequence fragment according to a preset phoneme sequence length. And respectively performing feature coding on the at least one second phoneme sequence fragment to obtain a second phoneme feature vector corresponding to each second phoneme sequence fragment.
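As an illustration, cutting an aligned phoneme sequence into fixed-length fragments before feature coding can be sketched as follows; the fragment length and the padding of the last fragment with 'sil' are assumptions of this example.

```python
# Illustrative cutting of an aligned phoneme sequence into fixed-length
# fragments prior to feature coding. The fragment length and the padding
# behavior are assumptions made for this example.
def cut_into_fragments(phoneme_sequence, fragment_length=64, pad_token="sil"):
    fragments = []
    for start in range(0, len(phoneme_sequence), fragment_length):
        fragment = phoneme_sequence[start:start + fragment_length]
        fragment += [pad_token] * (fragment_length - len(fragment))
        fragments.append(fragment)
    return fragments

print(len(cut_into_fragments(["sil"] * 150, fragment_length=64)))  # 3
```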
It should be noted that, the contents of information interaction, execution process, and the like between modules/units in the apparatus for obtaining a mouth shape coefficient generation model are based on the same concept as the method embodiments corresponding to fig. 1 to fig. 3 in the present application, and specific contents may refer to the description in the foregoing method embodiments in the present application, and are not described herein again.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a device for generating a mouth shape coefficient according to an embodiment of the present application. Since this embodiment of the device is substantially similar to the embodiment of the method for generating a mouth shape coefficient described above, the description is relatively simple; where relevant, reference is made to the above description of the method embodiment, which is described below only schematically.
The device for generating the mouth shape coefficient comprises:
an obtaining unit 501 is configured to obtain a target phoneme sequence of a target voice, where the target phoneme sequence is used to represent semantic information of the target voice.
And the processing unit 502 is configured to input the target phoneme sequence into the target mouth shape coefficient generation model, and obtain a target mouth shape coefficient output by the target mouth shape coefficient generation model, where the target mouth shape coefficient is used to drive the mouth motion change of the target virtual character.
Wherein, the target mouth shape coefficient generation model is obtained by the obtaining device of the mouth shape coefficient generation model provided by the embodiment shown in fig. 4.
In an optional embodiment, the generating device of the mouth shape coefficient further includes a determining unit 503.
A determining unit 503, configured to determine a style requirement of the target virtual character.
The processing unit 502 is specifically configured to, if the style requirement is no style, directly input the target phoneme sequence into the mapping layer of the target mouth shape coefficient generation model.
And if the style requirement is a target style, inputting the target phoneme sequence into a style coding layer of the target mouth shape coefficient generation model, and inputting an intermediate output result output by the style coding layer into a mapping layer.
It should be noted that, the contents of information interaction, execution process, and the like between the modules/units in the device for generating a mouth shape coefficient are based on the same concept as the method embodiments corresponding to fig. 1 to fig. 3 in the present application, and specific contents may refer to the description in the foregoing method embodiments in the present application, and are not described herein again.
An embodiment of the present application also provides an electronic device, please refer to fig. 6, and fig. 6 is a schematic structural diagram of the electronic device provided in the embodiment of the present application.
The electronic device includes:
at least one processor 601, at least one memory 603, at least one communication interface 602, and at least one communication bus 604;
optionally, the communication interface 602 may be an interface of a communication module, such as an interface of a GSM module;
the processor 601 may be a CPU, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application;
the memory 603 may comprise a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
The memory 603 stores a program of the method; when the program is read and executed by the processor, it performs the method provided by the method embodiments of the present application.
Another embodiment of the present application also provides a computer storage medium storing a computer program that, when executed, implements the method provided in the above method embodiment.
Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application, therefore, the scope of the present application should be determined by the claims that follow.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a system or an electronic device. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (14)

1. A method for obtaining a mouth shape coefficient generation model is characterized by comprising the following steps:
acquiring a first voice with a first preset duration and a first mouth shape coefficient corresponding to the first voice; the first voice is a style-free voice;
determining a first phoneme sequence corresponding to the first voice, and determining the first phoneme sequence and the first mouth shape coefficient as a first training sample;
acquiring a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice; the second voice is a voice with a target style; the second preset time length is less than the first preset time length;
determining a second phoneme sequence corresponding to the second voice, and determining the second phoneme sequence and the second mouth shape coefficient as a second training sample;
training a target mouth shape coefficient generation model by using the first training sample and the second training sample; the target mouth shape coefficient generation model is used for generating a target mouth shape coefficient which corresponds to the target voice and has the target style;
wherein the first training sample is used for training a mapping layer in the target mouth shape coefficient generation model; the mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient; the second training sample is used for training a style coding layer in the target mouth shape coefficient generation model; and the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
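The following minimal PyTorch sketch illustrates one possible reading of the two-branch structure in claim 1: a semantic encoder, a style coding layer, and a mapping layer that outputs mouth shape coefficients. The module choices, names, and dimensions (MouthShapeModel, feat_dim, num_coeffs, and so on) are illustrative assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

class MouthShapeModel(nn.Module):
    """Hypothetical target mouth shape coefficient generation model."""
    def __init__(self, num_phonemes=60, feat_dim=128, num_coeffs=32):
        super().__init__()
        # semantic coding network: phoneme ids -> phoneme feature vectors
        self.semantic_encoder = nn.Embedding(num_phonemes, feat_dim)
        # style coding layer: injects a learned style into the features
        self.style_layer = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # mapping layer: phoneme features -> mouth shape coefficients
        self.mapping_layer = nn.Linear(feat_dim, num_coeffs)

    def forward(self, phoneme_ids, use_style=False):
        feats = self.semantic_encoder(phoneme_ids)   # (batch, T, feat_dim)
        if use_style:
            feats = self.style_layer(feats)          # stylized semantic features
        return self.mapping_layer(feats)             # (batch, T, num_coeffs)
```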
2. The method of claim 1, wherein training a target mouth shape coefficient generation model using the first training sample and the second training sample comprises:
directly inputting the first phoneme sequence in the first training sample into the mapping layer to obtain a first output result corresponding to the first phoneme sequence;
determining a first loss value according to the first output result and the first mouth shape coefficient;
and updating the network layer coefficient of the mapping layer according to the first loss value.
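As a hedged illustration of claim 2, the first-stage loop below feeds the style-free first training sample directly into the mapping layer and updates only the mapping side of the MouthShapeModel sketch above; the data tensors are random stand-ins, and the optimizer and loss choices are assumptions.

```python
import torch
import torch.nn as nn

model = MouthShapeModel()                        # sketch defined after claim 1
optimizer = torch.optim.Adam(
    list(model.semantic_encoder.parameters()) + list(model.mapping_layer.parameters()),
    lr=1e-4)
criterion = nn.MSELoss()

first_phonemes = torch.randint(0, 60, (8, 100))  # stand-in first phoneme sequences
first_coeffs = torch.randn(8, 100, 32)           # stand-in first mouth shape coefficients

first_output = model(first_phonemes, use_style=False)  # bypass the style coding layer
first_loss = criterion(first_output, first_coeffs)     # first loss value
optimizer.zero_grad()
first_loss.backward()
optimizer.step()                                        # update the mapping layer coefficients
```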
3. The method of claim 2, wherein training a target mouth shape coefficient generation model using the first training sample and the second training sample comprises:
after the mapping layer is trained by using the first training sample, inputting the second phoneme sequence in the second training sample into the style coding layer to obtain an intermediate output result corresponding to the second phoneme sequence;
inputting the intermediate output result into the mapping layer to obtain a second output result corresponding to the second phoneme sequence;
determining a second loss value according to the second output result and the second mouth shape coefficient;
and updating the network layer coefficient of the style coding layer according to the second loss value.
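Continuing the same sketch, a possible second stage for claim 3 routes the styled second training sample through the style coding layer and then the already-trained mapping layer, updating only the style coding layer; freezing the mapping layer here is an assumption consistent with only the style layer's coefficients being updated.

```python
for p in model.mapping_layer.parameters():
    p.requires_grad = False                       # keep the learned phoneme-to-coefficient mapping fixed
style_optimizer = torch.optim.Adam(model.style_layer.parameters(), lr=1e-4)

second_phonemes = torch.randint(0, 60, (4, 100))  # stand-in second phoneme sequences (styled data)
second_coeffs = torch.randn(4, 100, 32)           # stand-in second mouth shape coefficients

second_output = model(second_phonemes, use_style=True)  # style coding layer -> mapping layer
second_loss = criterion(second_output, second_coeffs)   # second loss value
style_optimizer.zero_grad()
second_loss.backward()
style_optimizer.step()                                   # update the style coding layer coefficients
```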
4. The method of claim 3, wherein the style coding layer comprises a trainable style encoding vector; the inputting the second phoneme sequence in the second training sample into the style coding layer to obtain an intermediate output result corresponding to the second phoneme sequence includes:
acquiring a second phoneme feature vector corresponding to the second phoneme sequence;
extracting semantic features corresponding to the second phoneme sequence according to the second phoneme feature vector;
encoding the semantic features by using the style encoding vector to obtain stylized target semantic features;
the inputting the intermediate output result into the mapping layer includes:
inputting the stylized target semantic features into the mapping layer.
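One self-contained way to realize the trainable style encoding vector of claim 4 is sketched below: the vector is added to the semantic features and projected, yielding stylized target semantic features for the mapping layer. The additive conditioning, the projection, and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class StyleCodingLayer(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # trainable style encoding vector, learned from the second training sample
        self.style_vector = nn.Parameter(torch.zeros(feat_dim))
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, semantic_features):
        # encode the semantic features with the style vector to obtain
        # stylized target semantic features
        return self.proj(semantic_features + self.style_vector)

layer = StyleCodingLayer()
stylized = layer(torch.randn(2, 100, 128))   # then fed into the mapping layer
```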
5. The method according to any one of claims 1 to 4, characterized in that the first phoneme sequence or the second phoneme sequence is obtained by:
acquiring the pronunciation sequence of each phoneme in the first voice or the second voice and the pronunciation duration information of each phoneme;
determining mouth shape frame rate information of the target virtual role; the target virtual character corresponds to the target style;
aligning each phoneme in the first voice or the second voice according to the mouth shape frame rate information of the target virtual character and the pronunciation duration information of each phoneme in the first voice or the second voice;
and determining the first phoneme sequence or the second phoneme sequence according to the arrangement sequence of the phonemes in the aligned first voice or the aligned second voice.
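A small sketch of the alignment step of claim 5: each phoneme is repeated for as many animation frames as its pronunciation duration covers at the mouth shape frame rate of the target virtual character. The function name and the rounding policy are assumptions.

```python
def align_phonemes(phonemes, durations_sec, frame_rate_hz):
    """phonemes: phoneme labels in pronunciation order.
    durations_sec: pronunciation duration of each phoneme, in seconds.
    frame_rate_hz: mouth shape frame rate of the target virtual character."""
    aligned = []
    for phoneme, duration in zip(phonemes, durations_sec):
        num_frames = max(1, round(duration * frame_rate_hz))
        aligned.extend([phoneme] * num_frames)   # one entry per animation frame
    return aligned

# e.g. a 0.20 s phoneme at a 30 fps mouth shape frame rate occupies 6 frames
print(align_phonemes(["sh", "i"], [0.20, 0.15], 30))
```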
6. The method of claim 5, wherein the target mouth shape coefficient generation model further comprises a semantic coding network; the method further comprises the following steps:
performing feature coding on the first phoneme sequence by using the semantic coding network to obtain a first phoneme feature vector corresponding to the first phoneme sequence; performing feature coding on the second phoneme sequence by using the semantic coding network to obtain a second phoneme feature vector corresponding to the second phoneme sequence;
the directly inputting the first phoneme sequence in the first training sample into the mapping layer comprises:
inputting the first phoneme feature vector directly into the mapping layer.
7. The method of claim 6, wherein the performing feature coding on the first phoneme sequence by using the semantic coding network to obtain the first phoneme feature vector corresponding to the first phoneme sequence comprises:
according to a preset phoneme sequence length, cutting the first phoneme sequence into a plurality of first phoneme sequence fragments;
and respectively carrying out feature coding on the plurality of first phoneme sequence fragments to obtain the first phoneme feature vector corresponding to each first phoneme sequence fragment.
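An illustrative sketch of claims 6 and 7: the phoneme sequence is cut into fragments of a preset length before a semantic coding network encodes each fragment into phoneme feature vectors. Padding the last fragment, the embedding-based encoder, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

def cut_into_fragments(phoneme_ids, fragment_len, pad_id=0):
    fragments = []
    for start in range(0, len(phoneme_ids), fragment_len):
        frag = phoneme_ids[start:start + fragment_len]
        frag = frag + [pad_id] * (fragment_len - len(frag))   # pad the last fragment
        fragments.append(frag)
    return torch.tensor(fragments)                            # (num_fragments, fragment_len)

semantic_coding_network = nn.Embedding(60, 128)               # stand-in semantic coding network
phoneme_ids = [3, 17, 5, 42] * 30                             # stand-in phoneme id sequence
fragments = cut_into_fragments(phoneme_ids, fragment_len=50)
feature_vectors = semantic_coding_network(fragments)          # (num_fragments, 50, 128)
```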
8. The method of claim 7, wherein the performing feature coding on the second phoneme sequence by using the semantic coding network to obtain the second phoneme feature vector corresponding to the second phoneme sequence comprises:
cutting the second phoneme sequence into at least one second phoneme sequence fragment according to the preset phoneme sequence length;
and respectively performing feature coding on the at least one second phoneme sequence fragment to obtain the second phoneme feature vector corresponding to each second phoneme sequence fragment.
9. A method for generating a mouth shape coefficient, the method comprising:
obtaining a target phoneme sequence of a target voice, wherein the target phoneme sequence is used for representing semantic information of the target voice;
inputting the target phoneme sequence into a target mouth shape coefficient generation model to obtain a target mouth shape coefficient output by the target mouth shape coefficient generation model; the target mouth shape coefficient is used for driving the mouth motion change of the target virtual character;
wherein the target mouth shape coefficient generation model is generated according to the obtaining method of the mouth shape coefficient generation model of any one of claims 1-8.
10. The method of claim 9, further comprising:
determining style requirements of the target virtual character;
if the style requirement is no style, directly inputting the target phoneme sequence into a mapping layer of the target mouth shape coefficient generation model;
and if the style requirement is a target style, inputting the target phoneme sequence into the style coding layer of the target mouth shape coefficient generation model, and inputting an intermediate output result output by the style coding layer into the mapping layer.
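A hedged inference sketch for claims 9 and 10, reusing the MouthShapeModel sketch above: a style-free request goes straight to the mapping layer, while a target-style request first passes through the style coding layer. The helper name and the style_requirement flag are illustrative assumptions.

```python
import torch

def generate_mouth_coeffs(model, target_phoneme_ids, style_requirement):
    model.eval()
    with torch.no_grad():
        # use_style selects the branch of claim 10: style coding layer plus
        # mapping layer for a target style, mapping layer only for no style
        return model(target_phoneme_ids, use_style=(style_requirement == "target_style"))

target_phonemes = torch.randint(0, 60, (1, 80))        # stand-in target phoneme sequence
coeffs = generate_mouth_coeffs(model, target_phonemes, "target_style")
# coeffs drives the mouth motion change of the target virtual character frame by frame
```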
11. An apparatus for obtaining a mouth shape coefficient generation model, comprising:
an acquisition unit, configured to acquire a first voice with a first preset duration and a first mouth shape coefficient corresponding to the first voice; the first voice is a style-free voice;
a determining unit, configured to determine a first phoneme sequence corresponding to the first speech, and determine the first phoneme sequence and the first mouth shape coefficient as a first training sample;
the acquiring unit is further configured to acquire a second voice with a second preset duration and a second mouth shape coefficient corresponding to the second voice; the second voice is a voice with a target style; the second preset time length is less than the first preset time length;
the determining unit is further configured to determine a second phoneme sequence corresponding to the second speech, and determine the second phoneme sequence and the second mouth shape coefficient as a second training sample;
a training unit, configured to train a target mouth shape coefficient generation model using the first training sample and the second training sample; the target mouth shape coefficient generation model is used for generating a target mouth shape coefficient which corresponds to the target voice and has the target style;
wherein the first training sample is used for training a mapping layer in the target mouth shape coefficient generation model; the mapping layer is used for determining the mapping relation between each phoneme in the first phoneme sequence and each mouth shape coefficient in the first mouth shape coefficient; the second training sample is used for training a style coding layer in the target mouth shape coefficient generation model; and the style coding layer is used for extracting style characteristics corresponding to the second phoneme sequence according to the second mouth shape coefficient and the mapping relation.
12. An apparatus for generating a mouth shape coefficient, comprising:
an acquisition unit, configured to acquire a target phoneme sequence of a target voice, wherein the target phoneme sequence is used for representing semantic information of the target voice;
a processing unit, configured to input the target phoneme sequence into a target mouth shape coefficient generation model to obtain a target mouth shape coefficient output by the target mouth shape coefficient generation model; the target mouth shape coefficient is used for driving the mouth motion change of the target virtual character;
wherein the target mouth shape coefficient generation model is obtained according to the apparatus provided in claim 11.
13. An electronic device, comprising:
a processor;
a memory for storing a program of a method, which when read and executed by the processor performs the method of any one of claims 1-10.
14. A computer storage medium, characterized in that it stores a computer program which, when executed, implements the method of any one of claims 1-10.
CN202211288333.XA 2022-10-20 2022-10-20 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium Pending CN115938352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288333.XA CN115938352A (en) 2022-10-20 2022-10-20 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211288333.XA CN115938352A (en) 2022-10-20 2022-10-20 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium

Publications (1)

Publication Number Publication Date
CN115938352A true CN115938352A (en) 2023-04-07

Family

ID=86552994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288333.XA Pending CN115938352A (en) 2022-10-20 2022-10-20 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium

Country Status (1)

Country Link
CN (1) CN115938352A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116206621A (en) * 2023-05-04 2023-06-02 北京百度网讯科技有限公司 Method and device for training mouth-shaped driving model, electronic equipment and storage medium
CN116206621B (en) * 2023-05-04 2023-07-25 北京百度网讯科技有限公司 Method and device for training mouth-shaped driving model, electronic equipment and storage medium
CN116257762A (en) * 2023-05-16 2023-06-13 世优(北京)科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Similar Documents

Publication Publication Date Title
US11908451B2 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
US20230042654A1 (en) Action synchronization for target object
CN115938352A (en) Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN114895817B (en) Interactive information processing method, network model training method and device
CN113077537B (en) Video generation method, storage medium and device
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN115700772A (en) Face animation generation method and device
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN113421547A (en) Voice processing method and related equipment
Wang et al. Comic-guided speech synthesis
CN116051692B (en) Three-dimensional digital human face animation generation method based on voice driving
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN108109620A (en) A kind of intelligent robot exchange method and system
CN114359450A (en) Method and device for simulating virtual character speaking
CN117877509B (en) Digital human real-time interaction method and device, electronic equipment and storage medium
CN115171651B (en) Method and device for synthesizing infant voice, electronic equipment and storage medium
CN116312612B (en) Audio processing method and device based on deep learning
CN116580721B (en) Expression animation generation method and device and digital human platform
CN115797515A (en) Voice generation and expression driving method, client and server
CN116741145A (en) Stream type small sample data tone color conversion method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination