CN111260761B - Method and device for generating mouth shape of animation character - Google Patents

Method and device for generating mouth shape of animation character

Info

Publication number
CN111260761B
CN111260761B (application CN202010042300.1A)
Authority
CN
China
Prior art keywords
phoneme
audio
voice
candidate
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010042300.1A
Other languages
Chinese (zh)
Other versions
CN111260761A (en)
Inventor
程大治
夏龙
吴凡
卓邦声
高强
马楠
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202010042300.1A priority Critical patent/CN111260761B/en
Publication of CN111260761A publication Critical patent/CN111260761A/en
Application granted granted Critical
Publication of CN111260761B publication Critical patent/CN111260761B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The application provides a method and a device for generating the mouth shape of an animated character. The method comprises the following steps: receiving voice audio and voice text corresponding to the voice audio; acquiring the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence corresponding to the voice text; generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence; and searching and playing the corresponding animated character mouth shape in a preset animated character material library according to the phoneme set list. With this method, the mouth shape of an animated character can be matched to any voice audio, so that the generated mouth shapes better conform to real-world speaking and the generated animation is more natural and realistic.

Description

Method and device for generating mouth shape of animation character
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method and apparatus for generating an animated character mouth shape, a computing device, and a computer readable storage medium.
Background
With the rapid development of computer technology, animation has come into wide use; to attract children's attention, animated videos such as animated teaching material are often produced for entertainment or teaching.
In the prior art, a generated animated character cannot produce a mouth shape corresponding to the voice: when the animation is played, the mouth shape often fails to match the speech, and the character's mouth movements may even be greatly exaggerated, so the animation looks unnatural and unrealistic and the quality of the generated animation is poor.
How to solve the above problems is therefore an urgent issue for those skilled in the art.
Disclosure of Invention
In view of the foregoing, embodiments of the present application provide a method and apparatus for generating an animated character mouth shape, a computing device, and a computer-readable storage medium, so as to address the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present application, there is provided a method of generating an animated character mouth shape, comprising:
receiving voice audio and voice text corresponding to the voice audio;
acquiring a candidate phoneme probability in each audio frame of the voice audio and a phoneme sequence corresponding to the voice text;
generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence;
and searching and playing the corresponding animated character mouth shape in a preset animated character material library according to the phoneme set list.
Optionally, obtaining a candidate phoneme probability in each audio frame of the speech audio includes:
performing frame division processing on the voice audio to obtain a plurality of audio frames;
extracting acoustic features of each audio frame;
inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
Optionally, obtaining a phoneme sequence corresponding to the voice text includes:
word segmentation processing is carried out on the voice text to obtain a word set;
searching a corresponding phoneme in a preset dictionary according to each word in the word set;
and generating a phoneme sequence corresponding to the voice text according to the sequence of each word in the word set.
Optionally, generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence includes:
generating a candidate phoneme sequence probability for the first n+1 audio frames according to the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence, wherein n is a positive integer;
and acquiring a candidate phoneme sequence corresponding to the voice audio, and generating a phoneme set list according to the candidate phoneme sequence and a start frame and an end frame of each phoneme in the candidate phoneme sequence.
Optionally, before searching and playing the corresponding animated character mouth shape in the preset animated character material library according to the phoneme set list, the method further comprises:
and preprocessing the phonemes in the phoneme set list to obtain a processed phoneme set list.
Optionally, the phoneme set list includes a candidate phoneme sequence corresponding to the voice audio and a start frame and an end frame of each phoneme;
preprocessing phonemes in the phoneme set list to obtain a processed phoneme set list, wherein the preprocessing comprises the following steps:
acquiring a start frame and an end frame of each phoneme in the phoneme set list, and determining the continuous frames (the duration in frames) of each phoneme;
and when the continuous frames are fewer than a preset threshold, filtering the phoneme corresponding to those continuous frames, thereby obtaining a processed phoneme set list.
Optionally, filtering the phoneme corresponding to the continuous frames includes:
in the case that the phoneme is a consonant, replacing the phoneme with its previous phoneme;
in the case that the phoneme is a vowel, judging whether its previous phoneme or next phoneme is a vowel;
if not, leaving the phoneme unchanged;
if yes, replacing the phoneme with that previous or next phoneme.
According to a second aspect of embodiments of the present application, there is provided an apparatus for generating an animated character mouth shape, comprising:
the receiving module is configured to receive voice audio and voice text corresponding to the voice audio;
the acquisition module is configured to acquire candidate phoneme probabilities in each audio frame of the voice audio and a phoneme sequence corresponding to the voice text;
a generation module configured to generate a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence;
and a playing module configured to search and play the corresponding animated character mouth shape in a preset animated character material library according to the phoneme set list.
Optionally, the obtaining module is further configured to perform frame segmentation processing on the voice audio to obtain a plurality of audio frames; extracting acoustic features of each audio frame; inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
Optionally, the obtaining module is further configured to perform word segmentation processing on the voice text to obtain a word set; searching a corresponding phoneme in a preset dictionary according to each word in the word set; and generating a phoneme sequence corresponding to the voice text according to the sequence of each word in the word set.
Optionally, the generating module is further configured to generate a candidate phoneme sequence probability for the first n+1 audio frames according to the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence, wherein n is a positive integer; and to acquire a candidate phoneme sequence corresponding to the voice audio and generate a phoneme set list according to the candidate phoneme sequence and the start frame and end frame of each phoneme in the candidate phoneme sequence.
Optionally, the apparatus further includes:
and the preprocessing module is configured to preprocess the phonemes in the phoneme set list to obtain a processed phoneme set list.
Optionally, the phoneme set list includes a candidate phoneme sequence corresponding to the voice audio and a start frame and an end frame of each phoneme;
the preprocessing module is further configured to acquire a start frame and an end frame of each phoneme in the phoneme set list and determine a continuous frame of each phoneme; and when the continuous frame is smaller than a preset threshold value, filtering phonemes corresponding to the continuous frame, and further obtaining a processed phoneme set list.
Optionally, the preprocessing module is further configured to: in the case that the phoneme is a consonant, replace the phoneme with its previous phoneme; in the case that the phoneme is a vowel, judge whether its previous phoneme or next phoneme is a vowel; if not, leave the phoneme unchanged; if yes, replace the phoneme with that previous or next phoneme.
According to a third aspect of embodiments of the present application, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor executing the instructions to implement the steps of the method of generating an animated character mouth shape.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the method of generating an animated character mouth shape.
In the present application, the received voice audio and its corresponding voice text are processed separately to obtain the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence corresponding to the voice text, decomposing the voice audio into its most basic phonemes. The phoneme set list of the voice audio is determined with the help of the phoneme sequence obtained from the voice text, and the corresponding animated character mouth shapes are searched for in a preset material library and played according to the phonemes in the list. Because the whole voice audio is converted into basic phonemes and the mouth shapes are obtained from those phonemes, the generated mouth-shape animation is more natural and realistic, the animation quality is improved, the result better conforms to real-world speaking, and user experience and retention are improved.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method of generating an animated character mouth shape provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for obtaining candidate phoneme probabilities of voice audio provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for obtaining a phoneme sequence corresponding to a voice text according to an embodiment of the present application;
FIG. 5 is a flow chart of a method of generating an animated character mouth shape provided in another embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for generating an animated character mouth shape according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
The terminology used in one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of one or more embodiments of the application. As used in this application in one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, a first may also be referred to as a second, and similarly a second may be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, terms related to one or more embodiments of the present invention will be explained.
Phonemes: the smallest units of speech, divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action forms one phoneme. Phonemes are divided into two major classes, vowels and consonants. For example, the Chinese syllable ā (啊) has only one phoneme, while ài (爱, love) has two phonemes.
Audio frame: one of the segments of voice audio obtained after framing the voice audio.
Acoustic features: the extraction of acoustic features is an important element of speech recognition. Mel-frequency cepstral coefficients (MFCCs) are designed according to human auditory characteristics and are widely used in speech recognition; they are a linear transformation of the log energy spectrum on the nonlinear mel scale of sound frequency. The MFCC of each audio frame is extracted as the acoustic feature of that frame.
Voice text: the subtitle text corresponding to the voice audio.
Phoneme sequence: the sequence of phoneme combinations corresponding to the words of the voice text.
In the present application, a method and apparatus for generating an animated character mouth shape, a computing device, and a computer-readable storage medium are provided, and are described in detail in the following embodiments.
FIG. 1 illustrates a block diagram of a computing device 100, according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the method of generating an animated character mouth shape shown in FIG. 2. FIG. 2 shows a flowchart of a method of generating an animated character mouth shape according to an embodiment of the present application, including steps 202 to 208.
Step 202: receiving voice audio and voice text corresponding to the voice audio.
With the continuing development of internet technology, more and more animation is used for education, public awareness, and entertainment. In the process of producing an animation, the mouth shape of an animated character can be controlled to change along with the voice audio, and each piece of voice audio also has a corresponding voice text. The voice audio may be Chinese or a foreign language such as English, German, or French, and the voice text corresponding to the voice audio is in the same language.
In the embodiment provided herein, the Chinese voice audio "yibian" and the corresponding voice text "one side" are received.
In another embodiment provided herein, the English voice audio "I Love China" and the corresponding voice text "I Love China" are received.
Step 204: acquiring the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence corresponding to the voice text.
Optionally, referring to FIG. 3, obtaining the candidate phoneme probabilities in each audio frame of the voice audio includes steps 302 to 306 below.
Step 302: framing the voice audio to obtain a plurality of audio frames.
The voice audio is framed to generate a plurality of audio frames corresponding to it; each audio frame is a segment of the voice audio.
In the embodiment provided in the application, the received Chinese voice audio "yibian" is framed to obtain a plurality of audio frames, denoted F1, F2, F3, …, Fn.
Step 304: the acoustic features of each audio frame are extracted.
Experiments on human auditory perception show that it focuses only on certain specific frequency regions rather than the whole spectrum. Mel-frequency cepstral coefficients (MFCCs) are therefore designed according to human auditory characteristics and are widely used in speech recognition; they are a linear transformation of the log energy spectrum on the nonlinear mel scale of sound frequency. The MFCC of each audio frame is extracted as the acoustic feature of that frame.
In the embodiment provided herein, the acoustic features MF1, MF2, MF3, …, MFn of the audio frames are extracted.
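As an illustration of steps 302 and 304, the following minimal sketch frames an audio file and extracts per-frame MFCCs with the librosa library. The file name, the 16 kHz sample rate, and the 25 ms window with a 10 ms hop are assumptions of this example; the embodiment does not specify them.

```python
import librosa

# Assumed input file and sample rate; not specified in the embodiment.
audio, sr = librosa.load("yibian.wav", sr=16000)

# Step 302: framing. A 25 ms window with a 10 ms hop is a common
# choice; each hop corresponds to one audio frame F1 ... Fn.
frame_length = int(0.025 * sr)   # 400 samples
hop_length = int(0.010 * sr)     # 160 samples

# Step 304: extract the MFCC of each frame as its acoustic feature
# MF1 ... MFn (13 coefficients is a common choice).
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=frame_length, hop_length=hop_length,
)
print(mfcc.shape)  # (13, n_frames): one feature vector per audio frame
```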
Step 306: inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
The acoustic model is built on a deep neural network that is trained, by fitting the distribution of the training data with a machine-learning optimization algorithm, to predict the phoneme probabilities of a given audio frame from its acoustic features. In practical applications, the acoustic features are input into the pre-trained acoustic model, which predicts the candidate phoneme probabilities in each audio frame.
In the embodiment provided herein, the acoustic features MF1, MF2, MF3, …, MFn of the audio frames are input into the pre-trained acoustic model, and the acoustic model predicts the probability of each candidate phoneme in each audio frame.
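The embodiment does not disclose the network architecture, so the following is only a hypothetical sketch of step 306: a small feed-forward network in PyTorch that maps each frame's MFCC vector to a probability distribution over an assumed phoneme inventory. A trained model would be loaded in practice rather than initialized here.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 218  # assumed size of the phoneme inventory

# Hypothetical acoustic model; in practice it would be trained on
# labelled speech so that it fits the distribution of the training data.
acoustic_model = nn.Sequential(
    nn.Linear(13, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, NUM_PHONEMES),
)

def candidate_phoneme_probs(mfcc_frames: torch.Tensor) -> torch.Tensor:
    """Map acoustic features (n_frames, 13) to candidate phoneme
    probabilities (n_frames, NUM_PHONEMES), one row per audio frame."""
    with torch.no_grad():
        return torch.softmax(acoustic_model(mfcc_frames), dim=-1)
```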
Optionally, referring to FIG. 4, obtaining the phoneme sequence corresponding to the voice text includes the following steps 402 to 406.
Step 402: performing word segmentation on the voice text to obtain a word set.
After the voice text is obtained, it is segmented word by word to obtain the word set of the voice text.
In the embodiment provided in the application, word segmentation is performed on the voice text "one side", and the word set (one, side) is obtained.
Step 404: searching a preset dictionary for the phonemes corresponding to each word in the word set.
In the embodiment provided in the application, according to the word set (one, side), the preset dictionary is searched: the phonemes corresponding to "one" are "ii, i1", and the phonemes corresponding to "side" are "b, ian5". The tone of each word is attached to its vowel, and the digits 1, 2, 3, 4, and 5 denote the first tone, second tone, third tone, fourth tone, and neutral tone, respectively.
Step 406: generating the phoneme sequence corresponding to the voice text according to the order of the words in the word set.
In the embodiment provided herein, the corresponding phoneme sequence (ii, i1, b, ian5) is generated in the order of the words in the word set (one, side).
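A minimal sketch of steps 402 to 406 follows, with a two-entry toy dictionary standing in for the preset dictionary; the real pronunciation dictionary and the word segmenter are not disclosed in the embodiment.

```python
# Toy stand-in for the preset dictionary: word -> phonemes, with the
# tone digit (1-5, the fifth being the neutral tone) on the vowel.
PRONUNCIATION_DICT = {
    "one": ["ii", "i1"],
    "side": ["b", "ian5"],
}

def text_to_phoneme_sequence(word_set):
    """Steps 404-406: look up each word of the segmented voice text
    and concatenate the phonemes in word order."""
    sequence = []
    for word in word_set:
        sequence += PRONUNCIATION_DICT[word]
    return sequence

# Step 402 (word segmentation) is assumed to have produced (one, side).
print(text_to_phoneme_sequence(["one", "side"]))  # ['ii', 'i1', 'b', 'ian5']
```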
Step 206: generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence.
Optionally, a candidate phoneme sequence probability for the first n+1 audio frames is generated according to the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence, where n is a positive integer, and the phoneme set list is generated from the finally obtained candidate phoneme sequence corresponding to the voice audio and the start frame and end frame of each phoneme in the candidate phoneme sequence.
In the embodiment provided in the application, starting from the first audio frame of the voice audio, a dynamic programming algorithm obtains the candidate phoneme sequence probability for the first n+1 audio frames from the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence. Repeating this process yields the candidate phoneme sequence with the maximum probability, and the phoneme set list is generated from that candidate phoneme sequence and the start frame and end frame of each of its phonemes; see Table 1.
TABLE 1

Phoneme   Start frame   End frame
ii        47            59
i1        59            68
b         68            76
ian5      76            91
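The embodiment names a dynamic programming algorithm without giving its details; the following is a sketch of one standard realization (a Viterbi-style forced alignment). State (t, j) holds the best log-probability of a path through the first t frames that ends on the j-th phoneme of the sequence; at each frame the path either stays on the current phoneme or advances to the next. Log-probabilities and the exclusive end-frame convention of Table 1 are implementation choices of this sketch, and a leading silence phoneme, omitted here, would account for frames before the first phoneme (e.g. frames 0-47 in Table 1).

```python
import numpy as np

def align(frame_probs, phoneme_ids):
    """frame_probs: (n_frames, num_phonemes) candidate phoneme
    probabilities; phoneme_ids: index of each phoneme of the sequence.
    Returns [(start_frame, end_frame)] for each phoneme."""
    n_frames, k = frame_probs.shape[0], len(phoneme_ids)
    log_p = np.log(frame_probs + 1e-12)
    score = np.full((n_frames, k), -np.inf)
    advanced = np.zeros((n_frames, k), dtype=bool)

    score[0, 0] = log_p[0, phoneme_ids[0]]
    for t in range(1, n_frames):
        for j in range(k):
            stay = score[t - 1, j]
            adv = score[t - 1, j - 1] if j > 0 else -np.inf
            score[t, j] = max(stay, adv) + log_p[t, phoneme_ids[j]]
            advanced[t, j] = adv > stay

    # Backtrack from the last frame of the last phoneme to recover the
    # start frame and end frame of every phoneme in the sequence.
    bounds = [[0, n_frames] for _ in range(k)]
    j = k - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t, j]:
            bounds[j][0] = t      # phoneme j starts at frame t
            bounds[j - 1][1] = t  # so phoneme j-1 ends at frame t
            j -= 1
    return [tuple(b) for b in bounds]
```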
Step 208: searching and playing the corresponding animated character mouth shape in a preset animated character material library according to the phoneme set list.
For each animated character to be generated, an artist establishes a material library in advance, drawing a different mouth-shape image for each phoneme and naming each image after its phoneme. According to the phoneme set list, the corresponding mouth shapes are retrieved from the preset material library and applied to the corresponding animated character, finally producing a video of the animated character speaking with mouth shapes generated from the voice.
In the embodiment provided herein, the artist creates a material library for a monkey in advance; the monkey's mouth shapes corresponding to (ii, i1, b, ian5) in the phoneme set list are looked up and placed at the monkey's mouth position. Finally the animation is played, and the monkey speaks "yibian" with the corresponding mouth shapes.
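A sketch of step 208 under the naming convention described above, assuming the material library is a directory of mouth-shape images named after their phonemes (the "materials/monkey" path is hypothetical) and using the timings of Table 1:

```python
from pathlib import Path

MATERIAL_LIBRARY = Path("materials/monkey")  # hypothetical library path

# Phoneme set list from Table 1: (phoneme, start_frame, end_frame).
phoneme_set_list = [
    ("ii", 47, 59), ("i1", 59, 68), ("b", 68, 76), ("ian5", 76, 91),
]

def mouth_shape_track(phoneme_set_list):
    """Step 208: for every animation frame, choose the mouth-shape
    image whose phoneme covers that frame; the returned list gives the
    image to composite onto the character for each output frame."""
    track = []
    for phoneme, start, end in phoneme_set_list:
        track += [MATERIAL_LIBRARY / f"{phoneme}.png"] * (end - start)
    return track
```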
With the method for generating an animated character mouth shape provided above, voice audio and voice text can be obtained at any time, the voice is converted into its smallest units, phonemes, and the mouth shapes of the corresponding animated character are drawn for the phonemes in advance, so that in practical applications the mouth shape can be looked up from the phonemes. This solves the inability of the prior art to automatically generate realistic character mouth shapes from voice: the generated mouth shapes match the actual voice, look natural and real, better conform to real-world speaking, and improve the animation quality.
FIG. 5 illustrates a method for generating animated character mouth shapes, described using an animated talking zebra as an example, according to one embodiment of the present application, including steps 502 to 512.
Step 502: receiving voice audio and voice text corresponding to the voice audio.
In the embodiment provided in the application, the received voice audio is "woaizhongguo", and the corresponding voice text is "I love China".
Step 504: acquiring the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence corresponding to the voice text.
In the embodiment provided in the application, the candidate phoneme probabilities in each audio frame of the voice audio are obtained, and the phoneme sequence corresponding to the voice text is (w, o3, aa, ai4, zh, ong1, g, uo2).
Step 506: generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence.
In the embodiment of the present application, the phoneme set list corresponding to the voice audio, generated according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence, is shown in Table 2.
TABLE 2

Phoneme   Start frame   End frame
w         30            42
o3        42            48
aa        48            56
ai4       56            71
zh        71            84
ong1      84            104
g         104           109
uo2       109           128
Step 508: acquiring a start frame and an end frame of each phoneme in the phoneme set list, and determining the continuous frames of each phoneme.
The continuous frames (duration) of each phoneme can be determined directly from its start frame and end frame in the phoneme set list.
In the embodiment provided herein, w lasts 12 frames, o3 lasts 6 frames, aa lasts 8 frames, ai4 lasts 15 frames, zh lasts 13 frames, ong1 lasts 20 frames, g lasts 5 frames, uo2 lasts 19 frames.
Step 510: when the continuous frames of a phoneme are fewer than a preset threshold, filtering that phoneme, thereby obtaining a processed phoneme set list.
After the continuous frames of each phoneme are acquired, phonemes of very short duration are filtered out, which reduces unnatural, overly brief jumps in the output animation.
Optionally, in the case that the phoneme is a consonant, the phoneme is replaced with its previous phoneme; in the case that the phoneme is a vowel, it is judged whether its previous phoneme or next phoneme is a vowel; if not, no processing is done; if yes, the phoneme is replaced with that previous or next phoneme.
In the embodiment provided in the present application, the preset threshold is 7 frames; at the conventional output video frame rate of 30 frames/second, fewer than 7 frames means at most 0.2 seconds. The phonemes whose continuous frames fall below the preset threshold are "o3" and "g". For the consonant "g", its previous phoneme "ong1" is used instead. For the vowel "o3", of its neighbouring phonemes "w" and "aa", "aa" is a vowel, so "aa" replaces "o3". The phoneme set list after preprocessing is shown in Table 3.
TABLE 3

Phoneme   Start frame   End frame
w         30            42
aa        42            48
aa        48            56
ai4       56            71
zh        71            84
ong1      84            104
ong1      104           109
uo2       109           128
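A sketch of steps 508 and 510 with the stated 7-frame threshold follows. The vowel set is an assumption of this example, since the full phoneme inventory is not disclosed; applied to Table 2, the function replaces "o3" (6 frames) with the neighbouring vowel "aa" and the consonant "g" (5 frames) with its previous phoneme "ong1", reproducing Table 3.

```python
PRESET_THRESHOLD = 7  # frames; fewer than 7 frames is at most 0.2 s at 30 fps

# Assumed vowel inventory for this example only.
VOWELS = {"ii", "i1", "aa", "o3", "ai4", "ong1", "uo2", "ian5"}

def filter_short_phonemes(entries):
    """entries: [(phoneme, start, end)]. Steps 508-510: replace every
    phoneme whose continuous frames fall below the threshold."""
    result = [list(e) for e in entries]
    for i, (ph, start, end) in enumerate(entries):
        if end - start >= PRESET_THRESHOLD:
            continue
        prev_ph = entries[i - 1][0] if i > 0 else None
        next_ph = entries[i + 1][0] if i + 1 < len(entries) else None
        if ph not in VOWELS:            # consonant: take the previous phoneme
            if prev_ph is not None:
                result[i][0] = prev_ph
        elif next_ph in VOWELS:         # vowel: prefer a neighbouring vowel
            result[i][0] = next_ph
        elif prev_ph in VOWELS:
            result[i][0] = prev_ph
        # otherwise the vowel is left unchanged
    return [tuple(e) for e in result]
```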
Step 512: searching and playing the corresponding animated character mouth shape in the preset animated character material library according to the phoneme set list.
The method for generating animated character mouth shapes can be packaged as a plug-in for animation software, so that the animation software can use it directly.
In the embodiment provided by the application, the artist creates a phoneme mouth-shape library for the zebra in advance, searches the preset zebra library for the mouth shape corresponding to each phoneme in the phoneme set list of Table 3, and applies each mouth shape to the zebra, thereby producing and playing the animated video of the zebra speaking.
With the method for generating an animated character mouth shape provided above, voice audio and voice text can be obtained at any time, the voice is converted into its smallest units, phonemes, and the resulting phoneme set list is preprocessed so that phonemes of very short duration are filtered out, reducing unnatural, overly brief jumps in the output animation. Because the mouth shapes of the corresponding animated character are drawn for the phonemes in advance, the mouth shape can be looked up from the phonemes in practical applications. This solves the inability of the prior art to automatically generate realistic character mouth shapes from voice: the generated mouth shapes match the actual voice, look natural and real, better conform to real-world speaking, and improve the animation quality.
Corresponding to the above method embodiments, the present application further provides an embodiment of an apparatus for generating an animated character mouth shape; FIG. 6 shows a schematic structural diagram of an apparatus for generating an animated character mouth shape according to an embodiment of the present application.
As shown in fig. 6, the apparatus includes:
the receiving module 602 is configured to receive voice audio and voice text corresponding to the voice audio.
An obtaining module 604 is configured to obtain a candidate phoneme probability in each audio frame of the speech audio and a phoneme sequence corresponding to the speech text.
A generation module 606 is configured to generate a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence.
A playing module 608 is configured to search and play the corresponding animated character mouth shape in the preset animated character material library according to the phoneme set list.
Optionally, the obtaining module 604 is further configured to perform frame segmentation processing on the voice audio to obtain a plurality of audio frames; extracting acoustic features of each audio frame; inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
Optionally, the obtaining module 604 is further configured to perform word segmentation processing on the voice text to obtain a word set; searching a corresponding phoneme in a preset dictionary according to each word in the word set; and generating a phoneme sequence corresponding to the voice text according to the sequence of each word in the word set.
Optionally, the generating module 606 is further configured to generate a candidate phoneme sequence probability for the first n+1 audio frames according to the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence, where n is a positive integer; and to acquire a candidate phoneme sequence corresponding to the voice audio and generate a phoneme set list according to the candidate phoneme sequence and the start frame and end frame of each phoneme in the candidate phoneme sequence.
Optionally, the apparatus further includes:
and the preprocessing module is configured to preprocess the phonemes in the phoneme set list to obtain a processed phoneme set list.
Optionally, the phoneme set list includes a candidate phoneme sequence corresponding to the voice audio and a start frame and an end frame of each phoneme;
the preprocessing module is further configured to acquire a start frame and an end frame of each phoneme in the phoneme set list and determine a continuous frame of each phoneme; and when the continuous frame is smaller than a preset threshold value, filtering phonemes corresponding to the continuous frame, and further obtaining a processed phoneme set list.
Optionally, the preprocessing module is further configured to: in the case that the phoneme is a consonant, replace the phoneme with its previous phoneme; in the case that the phoneme is a vowel, judge whether its previous phoneme or next phoneme is a vowel; if not, leave the phoneme unchanged; if yes, replace the phoneme with that previous or next phoneme.
With the apparatus for generating an animated character mouth shape provided above, voice audio and voice text can be obtained at any time, the voice is converted into its smallest units, phonemes, and the resulting phoneme set list is preprocessed so that phonemes of very short duration are filtered out, reducing unnatural, overly brief jumps in the output animation. Because the mouth shapes of the corresponding animated character are drawn for the phonemes in advance, the mouth shape can be looked up from the phonemes in practical applications. This solves the inability of the prior art to automatically generate realistic character mouth shapes from voice: the generated mouth shapes match the actual voice, look natural and real, better conform to real-world speaking, and improve the animation quality.
An embodiment of the present application also provides a computing device including a memory, a processor, and computer instructions stored on the memory and executable on the processor, which when executed by the processor implement the steps of the method of generating an animated character shape.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the method of generating an animated character mouth shape described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the method for generating the mouth shape of the animated figure belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the method for generating the mouth shape of the animated figure.
The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be adjusted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all necessary for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The above-disclosed preferred embodiments of the present application are provided only as an aid to the elucidation of the present application. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of this application. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This application is to be limited only by the claims and the full scope and equivalents thereof.

Claims (10)

1. A method of generating an animated character mouth shape, comprising:
receiving voice audio and voice text corresponding to the voice audio;
acquiring a candidate phoneme probability in each audio frame of the voice audio and a phoneme sequence corresponding to the voice text;
generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence, wherein the phoneme set list comprises the candidate phoneme sequence corresponding to the voice audio and a start frame and an end frame of each phoneme;
acquiring a start frame and an end frame of each phoneme in the phoneme set list and determining the continuous frames of each phoneme; when the continuous frames are fewer than a preset threshold, in the case that the phoneme is a consonant, replacing the phoneme with its previous phoneme, and in the case that the phoneme is a vowel, judging whether its previous phoneme or next phoneme is a vowel, if not, performing no processing, and if yes, replacing the phoneme with that previous or next phoneme, thereby obtaining a processed phoneme set list;
and searching and playing the corresponding animated character mouth shape in a preset animated character material library according to the processed phoneme set list.
2. The method of generating an animated character mouth shape according to claim 1, wherein obtaining a candidate phoneme probability in each audio frame of the speech audio comprises:
performing frame division processing on the voice audio to obtain a plurality of audio frames;
extracting acoustic features of each audio frame;
inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
3. The method of generating an animated character mouth shape according to claim 1, wherein obtaining the phoneme sequence corresponding to the voice text comprises:
word segmentation processing is carried out on the voice text to obtain a word set;
searching a corresponding phoneme in a preset dictionary according to each word in the word set;
and generating a phoneme sequence corresponding to the voice text according to the sequence of each word in the word set.
4. The method of generating an animated character mouth shape according to claim 1, wherein generating a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence comprises:
generating a candidate phoneme sequence probability for the first n+1 audio frames according to the candidate phoneme sequence probability for the first n audio frames, the candidate phoneme probabilities in the (n+1)th audio frame, and the phoneme sequence, wherein n is a positive integer;
and acquiring a candidate phoneme sequence corresponding to the voice audio, and generating a phoneme set list according to the candidate phoneme sequence and a start frame and an end frame of each phoneme in the candidate phoneme sequence.
5. An apparatus for generating an animated character mouth shape, comprising:
the receiving module is configured to receive voice audio and voice text corresponding to the voice audio;
the acquisition module is configured to acquire candidate phoneme probabilities in each audio frame of the voice audio and a phoneme sequence corresponding to the voice text;
a generating module configured to generate a phoneme set list corresponding to the voice audio according to the candidate phoneme probabilities in each audio frame of the voice audio and the phoneme sequence, wherein the phoneme set list comprises the candidate phoneme sequence corresponding to the voice audio and a start frame and an end frame of each phoneme;
a preprocessing module configured to acquire a start frame and an end frame of each phoneme in the phoneme set list and determine the continuous frames of each phoneme, and, when the continuous frames are fewer than a preset threshold, in the case that the phoneme is a consonant, replace the phoneme with its previous phoneme, and in the case that the phoneme is a vowel, judge whether its previous phoneme or next phoneme is a vowel, if not, perform no processing, and if yes, replace the phoneme with that previous or next phoneme, so as to obtain a processed phoneme set list;
and a playing module configured to search and play the corresponding animated character mouth shape in a preset animated character material library according to the processed phoneme set list.
6. The apparatus for generating an animated character mouth shape as recited in claim 5, wherein,
the acquisition module is further configured to perform frame division processing on the voice audio to obtain a plurality of audio frames; extracting acoustic features of each audio frame; inputting the acoustic features into a pre-trained acoustic model, so that the acoustic model predicts candidate phoneme probabilities in each of the audio frames.
7. The apparatus for generating an animated character mouth shape as recited in claim 5, wherein,
the acquisition module is further configured to perform word segmentation processing on the voice text to obtain a word set; searching a corresponding phoneme in a preset dictionary according to each word in the word set; and generating a phoneme sequence corresponding to the voice text according to the sequence of each word in the word set.
8. The apparatus for generating an animated character mouth shape as recited in claim 5, wherein,
the generating module is further configured to generate a candidate phoneme sequence probability in the first n+1 audio frames according to the candidate phoneme sequence probability in the first n audio frames, the candidate phoneme probability in the n+1th audio frames and the phoneme sequence, wherein n is a positive integer; and acquiring a candidate phoneme sequence corresponding to the voice audio, and generating a phoneme set list according to the candidate phoneme sequence and a start frame and an end frame of each phoneme in the candidate phoneme sequence.
9. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any of claims 1-4.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-4.
CN202010042300.1A 2020-01-15 2020-01-15 Method and device for generating mouth shape of animation character Active CN111260761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010042300.1A CN111260761B (en) 2020-01-15 2020-01-15 Method and device for generating mouth shape of animation character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010042300.1A CN111260761B (en) 2020-01-15 2020-01-15 Method and device for generating mouth shape of animation character

Publications (2)

Publication Number Publication Date
CN111260761A CN111260761A (en) 2020-06-09
CN111260761B (en) 2023-05-09

Family

ID=70948835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010042300.1A Active CN111260761B (en) 2020-01-15 2020-01-15 Method and device for generating mouth shape of animation character

Country Status (1)

Country Link
CN (1) CN111260761B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915707B (en) * 2020-07-01 2024-01-09 天津洪恩完美未来教育科技有限公司 Mouth shape animation display method and device based on audio information and storage medium
CN112188304B (en) * 2020-09-28 2022-11-15 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN112667070A (en) * 2020-11-03 2021-04-16 北京大米科技有限公司 Online classroom interaction method and device, electronic equipment and storage medium
CN112837401B (en) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 Information processing method, device, computer equipment and storage medium
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113539240A (en) * 2021-07-19 2021-10-22 北京沃东天骏信息技术有限公司 Animation generation method and device, electronic equipment and storage medium
CN114359450A (en) * 2022-01-17 2022-04-15 小哆智能科技(北京)有限公司 Method and device for simulating virtual character speaking

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1214784A (en) * 1996-03-26 1999-04-21 英国电讯有限公司 Image synthesis
JPH11161297A (en) * 1997-11-25 1999-06-18 Toshiba Corp Method and device for voice synthesizer
CN101578659A (en) * 2007-05-14 2009-11-11 松下电器产业株式会社 Voice tone converting device and voice tone converting method
CN105161096A (en) * 2015-09-22 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition processing method and device based on garbage models
CN106297764A (en) * 2015-05-27 2017-01-04 科大讯飞股份有限公司 A kind of multilingual mixed Chinese language treatment method and system
JP2018066788A (en) * 2016-10-17 2018-04-26 シャープ株式会社 Phoneme string processing unit, control method of the same, control program, processing execution device, control method of processing execution device, control program and processing execution system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9542939B1 (en) * 2012-08-31 2017-01-10 Amazon Technologies, Inc. Duration ratio modeling for improved speech recognition
CN105931218B (en) * 2016-04-07 2019-05-17 武汉科技大学 The intelligent sorting method of modular mechanical arm
CN106653052B (en) * 2016-12-29 2020-10-16 Tcl科技集团股份有限公司 Virtual human face animation generation method and device
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN109447234B (en) * 2018-11-14 2022-10-21 腾讯科技(深圳)有限公司 Model training method, method for synthesizing speaking expression and related device
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110225340B (en) * 2019-05-31 2021-07-09 北京猿力未来科技有限公司 Control method and device for video coding, computing equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1214784A (en) * 1996-03-26 1999-04-21 英国电讯有限公司 Image synthesis
JPH11161297A (en) * 1997-11-25 1999-06-18 Toshiba Corp Method and device for voice synthesizer
CN101578659A (en) * 2007-05-14 2009-11-11 松下电器产业株式会社 Voice tone converting device and voice tone converting method
CN106297764A (en) * 2015-05-27 2017-01-04 科大讯飞股份有限公司 A kind of multilingual mixed Chinese language treatment method and system
CN105161096A (en) * 2015-09-22 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition processing method and device based on garbage models
JP2018066788A (en) * 2016-10-17 2018-04-26 シャープ株式会社 Phoneme string processing unit, control method of the same, control program, processing execution device, control method of processing execution device, control program and processing execution system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zerrin Yumak. Audio-driven emotional speech animation for interactive virtual characters. Special issue paper, 2019, 1-11. *
Zhao Erping. Research on speech recognition technology for Tibetan isolated words. Journal of Northwest Normal University (Natural Science Edition), 2015, Vol. 5, 50-54. *

Also Published As

Publication number Publication date
CN111260761A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111260761B (en) Method and device for generating mouth shape of animation character
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
Wu et al. Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis
Sinith et al. Emotion recognition from audio signals using Support Vector Machine
CN108231062B (en) Voice translation method and device
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
GB2516965A (en) Synthetic audiovisual storyteller
CN112735371B (en) Method and device for generating speaker video based on text information
CN111583964A (en) Natural speech emotion recognition method based on multi-mode deep feature learning
CN112185363B (en) Audio processing method and device
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN112331180A (en) Spoken language evaluation method and device
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN113538636A (en) Virtual object control method and device, electronic equipment and medium
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Amrouche et al. Dnn-based arabic speech synthesis
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
US20230148275A1 (en) Speech synthesis device and speech synthesis method
Adi et al. Interlanguage of Automatic Speech Recognition
Ullah et al. Speech emotion recognition using deep neural networks
CN112242134A (en) Speech synthesis method and device
CN112687257B (en) Sentence similarity judging method and device, electronic equipment and readable storage medium
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant