CN109377540B - Method and device for synthesizing facial animation, storage medium, processor and terminal - Google Patents

Method and device for synthesizing facial animation, storage medium, processor and terminal

Info

Publication number
CN109377540B
Authority
CN
China
Prior art keywords
expression
sequence
phoneme
mouth shape
animation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811156589.9A
Other languages
Chinese (zh)
Other versions
CN109377540A (en)
Inventor
陈晓威
万里红
张伟东
张民英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN201811156589.9A priority Critical patent/CN109377540B/en
Publication of CN109377540A publication Critical patent/CN109377540A/en
Application granted granted Critical
Publication of CN109377540B publication Critical patent/CN109377540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method and a device for synthesizing facial animation, a storage medium, a processor and a terminal. The method comprises the following steps: performing voice analysis on an audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, and each word corresponds to at least one phoneme; acquiring a mouth shape sequence corresponding to the phoneme time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file; acquiring an expression sequence corresponding to the expression time stamp file, wherein the expression sequence is used for describing expression information corresponding to the expression time stamp file; and synthesizing the mouth shape sequence and the expression sequence into a facial animation. The invention solves the technical problem that the voice parsing approaches provided in the related art tend to introduce large errors into the subsequently synthesized voice-driven animation and thereby degrade the user experience.

Description

Method and device for synthesizing facial animation, storage medium, processor and terminal
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for synthesizing facial animation, a storage medium, a processor, and a terminal.
Background
The face information of a person includes expression and mouth shape. In general, expression and mouth shape change independently: the mouth shape carries more high-frequency information, while the expression tends to vary at low frequency. For example, when a person speaks a sentence, the mouth shape changes frequently with the pronunciation. By contrast, the expression changes slowly, or may even show no significant change. On the whole, facial information can be seen as the fusion of two relatively independent parts: expression and mouth shape.
Regarding the fusion of expression and mouth shape, the technical solutions provided in the related art mainly fall into speech parsing, expression and mouth shape animation synthesis, and voice-driven facial animation. Speech parsing is mainly performed for Chinese and English speech. Expression and mouth shape animation synthesis is mainly realized by means of motion capture or by artists directly authoring skeletal animations.
Regarding Chinese speech phoneme parsing, one of the solutions provided by the related art can only take Chinese speech as input and output Chinese text, but cannot accurately obtain the time stamp and duration of each phoneme in the Chinese text. Another solution provided by the related art (e.g., the Watson service by IBM) can process Chinese speech and obtain the time stamp and duration of each word, but it cannot accurately locate the time stamp and duration of each phoneme, thereby introducing a large error into the subsequently generated voice-driven animation. Regarding expression animation synthesis and voice-driven facial animation, the motion capture method provided in the related art has high cost, poor flexibility and a large generated data volume, so it is difficult to apply to mobile terminals. In addition, expression animation synthesis and voice-driven facial animation produced manually by artists suffer from low efficiency, poor flexibility and excessive cost of repeated modification. Table 1 describes the state of the art of various speech parsing techniques provided in the related art, as shown in Table 1:
TABLE 1
Table 2 is a description of the state of the art for various facial animation synthesis techniques provided in the related art, as shown in table 2:
TABLE 2
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
At least some embodiments of the present invention provide a method, an apparatus, a storage medium, a processor, and a terminal for synthesizing facial animation, so as to at least solve the technical problem that the voice parsing methods provided in the related art tend to introduce large errors into the subsequently synthesized voice-driven animation and degrade the user experience.
According to one embodiment of the present invention, there is provided a method for synthesizing facial animation, including:
performing voice analysis on the audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme; acquiring a mouth shape sequence corresponding to the phoneme time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file; acquiring an expression sequence corresponding to the expression time stamp file, wherein the expression sequence is used for describing expression information corresponding to the expression time stamp file; and synthesizing the mouth shape sequence and the expression sequence into a facial animation.
Optionally, after synthesizing the mouth shape sequence and the expression sequence into the facial animation, the method further comprises: and synchronously playing the facial animation and the audio file.
Optionally, performing voice parsing on the audio file to obtain a phoneme timestamp file includes: converting the audio file into a text sequence; converting the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence, wherein each word corresponds to at least one phoneme; and carrying out time sequence modeling on the phoneme sequence to obtain a phoneme time stamp file.
Optionally, converting the audio file into the text sequence includes: converting the audio file into a text sequence using a connectionist temporal classification-recurrent neural network model.
Optionally, performing time sequence modeling on the phoneme sequence to obtain a phoneme timestamp file includes: and carrying out time sequence modeling on the phoneme sequence by adopting a hidden Markov model to obtain a phoneme time stamp file.
Optionally, acquiring the expression sequence corresponding to the expression timestamp file includes: extracting a spectrogram of the audio file in a preset time window; deducing an expression animation corresponding to the expression time stamp file and emotion categories corresponding to each expression according to the spectrogram to obtain an expression sequence.
Optionally, deriving the expression animation corresponding to the expression timestamp file and the emotion type corresponding to each expression according to the spectrogram to obtain the expression sequence includes: setting the spectrogram as an input item, and deriving the expression animation corresponding to the expression timestamp file and the emotion type corresponding to each expression through a convolutional neural network, to obtain the expression sequence.
Optionally, obtaining the mouth shape sequence corresponding to the phoneme timestamp file includes: determining a mouth shape type corresponding to each phoneme in the phoneme timestamp file according to a preset corresponding relation, wherein the preset corresponding relation is used for recording the mapping relation between different phonemes and the mouth shape type, and each mouth shape type corresponds to different mouth shape animations respectively; and mapping the time stamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
Optionally, synthesizing the mouth shape sequence and the expression sequence into a facial animation, and playing the facial animation and the audio file synchronously includes: a judging step, namely triggering and judging whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence or not at each preset time interval; if yes, fusing the current mouth shape animation with the last mouth shape animation to obtain mouth shape animation to be played, and then synthesizing the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the facial animation at the trigger time; and a playing step, wherein if the playing end time of the audio file is not reached, the facial animation at the triggering time is played, and the judgment step is returned until the playing end time is reached.
According to one embodiment of the present invention, there is also provided a facial animation synthesizing apparatus, including:
the analysis module is used for carrying out voice analysis on the audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme; the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a mouth shape sequence corresponding to a phoneme time stamp file and an expression sequence corresponding to an expression time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file, and the expression sequence is used for describing expression information corresponding to the expression time stamp file; and the synthesis module is used for synthesizing the mouth shape sequence and the expression sequence into facial animation.
Optionally, the apparatus further includes: and the playing module is used for synchronously playing the facial animation and the audio file.
Optionally, the parsing module includes: a first conversion unit for converting the audio file into a text sequence; the second conversion unit is used for converting the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence, wherein each word corresponds to at least one phoneme; and the first processing unit is used for carrying out time sequence modeling on the phoneme sequence to obtain a phoneme time stamp file.
Optionally, the first converting unit is configured to convert the audio file into a text sequence using a connectionist temporal classification-recurrent neural network model.
Optionally, the first processing unit is configured to perform time-sequence modeling on the phoneme sequence by using a hidden markov model, so as to obtain a phoneme timestamp file.
Optionally, the acquiring module includes: the extraction unit is used for extracting the spectrogram of the audio file in a preset time window; the first acquisition unit is used for deducing an expression animation corresponding to the expression time stamp file and an emotion type corresponding to each expression according to the sound spectrogram to obtain an expression sequence.
Optionally, the first obtaining unit is configured to set the spectrogram as an input item, and derive, through a convolutional neural network, an expression animation corresponding to the expression timestamp file and an emotion category corresponding to each expression, to obtain an expression sequence.
Optionally, the acquiring module includes: the determining unit is used for determining the mouth shape type corresponding to each phoneme in the phoneme time stamp file according to a preset corresponding relation, wherein the preset corresponding relation is used for recording the mapping relation between different phonemes and mouth shape types, and each mouth shape type corresponds to a different mouth shape animation; and the second acquisition unit is used for binding the time stamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
Optionally, the synthesis module includes: the judging unit is used for triggering and judging whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence or not at each preset time interval; the first processing unit is used for fusing the current mouth shape animation with the last mouth shape animation to obtain the mouth shape animation to be played when the output of the judging unit is yes, and then synthesizing the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the face animation at the trigger time; and the playing unit is used for playing the facial animation at the triggering moment if the playing end moment of the audio file is not reached, and returning to the judging step until the playing end moment is reached.
According to an embodiment of the present invention, there is further provided a storage medium including a stored program, wherein the device in which the storage medium is controlled to execute the above-described facial animation synthesis method when the program runs.
According to an embodiment of the present invention, there is further provided a processor for running a program, wherein the program executes the above-mentioned facial animation synthesis method.
According to one embodiment of the present invention, there is also provided a terminal including: the facial animation synthesis system comprises one or more processors, a memory, a display device and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are used for executing the facial animation synthesis method.
In at least some embodiments of the present invention, voice parsing is performed on an audio file to obtain a phoneme timestamp file and an expression timestamp file, where the phoneme timestamp file includes the timestamp and duration of each phoneme corresponding to each word obtained by converting the audio file. A mouth shape sequence corresponding to the phoneme timestamp file and an expression sequence corresponding to the expression timestamp file are then obtained, and the mouth shape sequence and the expression sequence are synthesized into a facial animation. By pre-establishing facial expression bases and mouth shape bases and analyzing the voice, the expression information and mouth shape information at different moments in the voice are obtained, so that a voice-driven animation can be synthesized using the phoneme timestamp file. Accurate phoneme sequence recognition is thus performed on the voice, and the animation process of the whole face (including the changes of mouth shape and expression) can be generated by inputting only the voice data and the corresponding mouth shape and expression text sequences. This solves the technical problem that the voice parsing approaches provided in the related art tend to introduce large errors into the subsequently synthesized voice-driven animation and degrade the user experience.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method of composing a facial animation according to one embodiment of the invention;
FIG. 2 is a schematic comparison diagram of mouth shape animation configurations based on Table 4, according to an alternative embodiment of the present invention;
FIG. 3 is a flow chart of a facial animation synthesis and playback process in accordance with an alternative embodiment of the present invention;
FIG. 4 is a block diagram of a composition apparatus for facial animation according to one embodiment of the present invention;
fig. 5 is a block diagram of a composition apparatus for facial animation according to an alternative embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one embodiment of the present invention, there is provided an embodiment of a method of composing a facial animation, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The method embodiments may be performed in a mobile terminal, a computer terminal, or a similar computing device. Taking running on a mobile terminal as an example, the mobile terminal may comprise one or more processors (which may include, but are not limited to, processing means such as a graphics processing unit (GPU), a microcontroller unit (MCU), or a field-programmable gate array (FPGA)) and a memory for storing data; optionally, the mobile terminal may further comprise a transmission device for communication functions and input-output devices. It will be appreciated by those skilled in the art that the above-described structure is merely illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than the above-described structure, or have a different configuration.
The memory may be used to store a computer program, for example, a software program of application software and a module, for example, a computer program corresponding to a method for synthesizing a facial animation in one embodiment of the present invention, and the processor executes the computer program stored in the memory, thereby performing various functional applications and data processing, that is, implementing the above-described method for synthesizing a facial animation. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the memory may further include memory remotely located with respect to the processor, the remote memory being connectable to the mobile terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission means comprises a network adapter (Network Interface Controller, simply referred to as NIC) that can be connected to other network devices via a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In the running environment of the mobile terminal, the present invention provides an audio-based facial animation generation system for scenarios such as virtual character dialogue, voice message visualization, and video post-production; the generated animation can run on various platforms at low cost. The facial animation generation system mainly has the following advantages:
1) Accurate phoneme sequence recognition: by combining the advantages of a connectionist temporal classification-recurrent neural network (Connectionist Temporal Classification-Recurrent Neural Network, CTC-RNN for short) and a deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN-HMM for short), accurate phoneme sequence recognition of the voice is realized and the corresponding phoneme time stamp file is output;
2) Only voice data and corresponding text need to be input, and the whole facial animation process can be generated, wherein the facial animation process comprises the changes of mouth shapes and expressions.
In this embodiment, a method for synthesizing a facial animation running on the mobile terminal is provided. Fig. 1 is a flowchart of a method of synthesizing a facial animation according to one embodiment of the present invention, as shown in fig. 1, the method comprising the steps of:
Step S12, carrying out voice analysis on the audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme;
step S14, a mouth shape sequence corresponding to the phoneme time stamp file and an expression sequence corresponding to the expression time stamp file are obtained, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file, and the expression sequence is used for describing expression information corresponding to the expression time stamp file;
and S16, synthesizing the mouth shape sequence and the expression sequence into facial animation.
Through the above steps, voice parsing is performed on the audio file to obtain a phoneme timestamp file and an expression timestamp file, where the phoneme timestamp file includes the timestamp and duration of each phoneme in the at least one phoneme corresponding to each word obtained by converting the audio file; the mouth shape sequence corresponding to the phoneme timestamp file and the expression sequence corresponding to the expression timestamp file are obtained, and the mouth shape sequence and the expression sequence are synthesized into a facial animation. By using pre-established facial expression bases and mouth shape bases and analyzing the voice, the expression information and mouth shape information at different moments in the voice are obtained, and a voice-driven animation is synthesized using the phoneme timestamp file. Accurate phoneme sequence recognition is performed on the voice, and the whole facial animation process (including the changes of mouth shape and expression) can be generated by inputting only the voice data and the corresponding mouth shape and expression text sequences, which solves the technical problem that the voice parsing approaches provided in the related art tend to introduce large errors into the subsequently synthesized voice-driven animation and affect the user experience.
After the mouth shape sequence and the expression sequence are synthesized into the facial animation, the facial animation and the audio file can be synchronously played.
Optionally, in step S12, performing speech parsing on the audio file to obtain a phoneme timestamp file may include the following steps:
step S121, converting the audio file into a text sequence;
step S122, converting the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence, wherein each word corresponds to at least one phoneme;
step S123, performing time sequence modeling on the phoneme sequence to obtain a phoneme time stamp file.
In an alternative embodiment, the audio file may be converted into a text sequence using a connectionist temporal classification-recurrent neural network model, and the phoneme sequence may be time-sequentially modeled using a hidden Markov model to obtain the phoneme timestamp file.
For voice parsing, an alternative embodiment of the present invention mainly solves the problem that the Chinese speech parsing methods provided in the related art cannot obtain an accurate pronunciation time stamp and duration for each character. The aim, therefore, is to align the speech recognition result with the phonemes of the text in time.
To make the models involved in this alternative embodiment more general, they need to accept varying inputs, e.g., different genders, ages, and emotions. When training a data-driven model, the input data are often the speech and facial animation data of a specific character, and the output is the set of parameters for reconstructing the facial animation. For example, a model trained using audio data of a sad elderly person can only produce the facial expressions of a sad elderly person. To decouple the relationship between input and output, the diversity between input and output needs to be handled by an intermediate representation layer. For this purpose, in an alternative embodiment of the invention, the intermediate layer employs phonemes. Phonemes are the smallest units in speech recognition; they are analyzed based on the pronunciation actions within a syllable, with one action constituting one phoneme. Therefore, regardless of the speaker, the corresponding phoneme information can be extracted from the audio. The biggest difference from the speech recognition approaches provided in the related art is that this alternative embodiment yields phonemes rather than text. In addition, since an animation sequence along the time axis needs to be produced, obtaining the time node of each phoneme is also important.
The speech recognition methods provided in the related art often adopt a GMM-HMM for modeling and training. The HMM handles the dynamic timing of phonemes well, and can divide each phoneme into several states used to represent the start, middle, end, and so on. The GMM models each state, assuming that the probability distribution of each state satisfies a GMM, and the GMM parameters are then learned from training samples. In recent years, with the rise of deep neural networks, the modeling of each state has changed from the GMM to a DNN model, while the HMM-related part remains unchanged. In general, the model framework of speech recognition has gradually shifted to the context-dependent deep neural network-hidden Markov model (CD-DNN-HMM). With the rapid development of neural network optimization techniques and the continuous improvement of graphics processing unit (GPU) computing power, the latest speech recognition technology can model through an RNN with CTC and realize an end-to-end acoustic model for speech recognition. CTC directly aligns the speech with the corresponding text and treats recognition as a sequence classification problem, thereby discarding the HMM structure. Because of the powerful modeling capability of neural networks, the end-to-end output labels do not need to be subdivided as in conventional frameworks. For example, for Chinese speech recognition the output of such a framework is not subdivided into states, phonemes, or initials; instead, Chinese characters are directly taken as the output.
In an alternative embodiment of the present invention, the phoneme sequence and its corresponding timing must be acquired accurately. HMMs have good modeling capability with respect to time. However, whether the model is a GMM-HMM or a DNN-HMM, its modeling capability is limited, so its recognition accuracy is difficult to push beyond a bottleneck. By contrast, CTC-RNN models have high recognition accuracy, but because they are end-to-end models they do not contain accurate phoneme time information; hence neither model alone meets the requirements of the present invention. Based on this, this alternative embodiment employs a model that combines both. First, a CTC-RNN model is used to obtain an accurate text sequence, i.e., to convert a piece of input audio into a series of characters (e.g., "hello"). Next, each character is converted into one or more phonemes (e.g., n, i, h, ao) based on its Chinese pinyin, so that the text sequence is converted into a phoneme sequence. Then, time sequence modeling is performed using an HMM, which mainly solves two problems: (1) likelihood estimation, i.e., the probability that the HMM generates a given phoneme sequence; and (2) decoding, i.e., given a phoneme sequence, searching for the most probable underlying HMM state sequence, mainly using the Viterbi algorithm. In this way, accurate phoneme time information (e.g., pronunciation time point and pronunciation duration) can be obtained while maintaining high recognition accuracy.
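The following is a minimal sketch of this text-to-phoneme step under stated assumptions: the CTC-RNN recognizer and the HMM forced aligner are treated as black boxes (the hypothetical stand-ins ctc_rnn_recognize and hmm_force_align), and only the splitting of pinyin syllables into initials and finals is shown concretely.

```python
# Sketch of the pinyin-to-phoneme conversion step; recognizer and aligner are assumed stand-ins.
from typing import List

# Standard pinyin initials (two-letter initials listed first so "zh" matches before "z").
_INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin: str) -> List[str]:
    """Split one pinyin syllable into initial and final, e.g. "hao" -> ["h", "ao"]."""
    for ini in _INITIALS:
        if pinyin.startswith(ini) and len(pinyin) > len(ini):
            return [ini, pinyin[len(ini):]]
    return [pinyin]          # syllables such as "an" consist of a final only

def text_to_phonemes(pinyin_syllables: List[str]) -> List[str]:
    """Flatten per-character pinyin (e.g. ["ni", "hao"]) into a phoneme sequence."""
    phonemes: List[str] = []
    for syl in pinyin_syllables:
        phonemes.extend(split_syllable(syl))
    return phonemes

# Usage under the stated assumptions:
#   text       = ctc_rnn_recognize("input.wav")           # CTC-RNN -> Chinese text
#   syllables  = ["ni", "hao"]                             # pinyin of that text
#   phonemes   = text_to_phonemes(syllables)               # ["n", "i", "h", "ao"]
#   timestamps = hmm_force_align("input.wav", phonemes)    # (phoneme, start, duration)
```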
Optionally, in step S14, acquiring the expression sequence corresponding to the expression timestamp file may include performing the steps of:
step S141, extracting a spectrogram of the audio file in a preset time window;
and step S142, deducing an expression animation corresponding to the expression time stamp file and emotion types corresponding to each expression according to the spectrogram to obtain an expression sequence.
In an optional implementation manner, the sound spectrogram can be set as an input item, and the expression animation corresponding to the expression timestamp file and the emotion type corresponding to each expression are deduced through the convolutional neural network to obtain the expression sequence.
The phoneme information extraction in speech recognition is mainly aimed at processing high-frequency information, i.e., the regions where the speech signal changes drastically. High-frequency information expresses the mouth shape animation well; by contrast, the facial expression information is mainly related to the emotion of the speaker. Thus, in an alternative embodiment of the invention, low-frequency information is further extracted from the audio to represent the emotional state of the face.
Speech emotion recognition is a very complex process. The emotion information is contained in the shape and contour information of the audio signal, so this alternative embodiment uses a deep convolutional neural network (CNN) structure to model the different emotion categories, mainly including a normal state (i.e., no facial expression) and the four emotions of joy, anger, sorrow, and happiness.
In this alternative embodiment, the processing flow for emotion is as follows: first, a fixed time window is slid over the input one-dimensional audio signal, and the spectrogram of the audio file within the specified time window is extracted; then, taking the spectrogram as input, the emotional state in the time window is gradually inferred by the multi-layer information processing mechanism of the CNN and output (normal, joy, anger, sorrow, happiness); finally, the facial expression is controlled and switched according to the output emotion category.
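As an illustration of the sliding-window spectrogram extraction described above, the sketch below uses plain numpy; the window length, hop size, and FFT frame size are assumptions chosen for illustration rather than values taken from the patent.

```python
# Sliding-window spectrogram extraction sketch; frame/hop/window sizes are assumed values.
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 512, hop: int = 128) -> np.ndarray:
    """Return a (freq_bins, time_frames) magnitude spectrogram of a 1-D audio signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

# Slide a fixed time window (assumed 1 s at 16 kHz) over the audio and extract one
# spectrogram per window; each spectrogram is later fed to the emotion classifier.
audio = np.random.randn(16000 * 3)     # placeholder 3-second mono signal
win_samples = 16000                    # assumed 1-second emotion window
windows = [spectrogram(audio[s:s + win_samples])
           for s in range(0, len(audio) - win_samples + 1, win_samples)]
```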
In addition, the expression animations for the normal, joy, anger, sorrow, and happiness states are usually designed in advance by artists. Through the spectrogram, the expression animations at different moments can be associated with the time stamp of each expression in the expression timestamp file to obtain the expression sequence, so that the expression animation of the corresponding moment is displayed as the audio file is played.
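For the CNN-based emotion classification itself, a minimal sketch assuming PyTorch is given below; the layer sizes and input dimensions are illustrative assumptions, not the patent's actual model.

```python
# Illustrative emotion-classification sketch, assuming PyTorch; architecture sizes are assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["normal", "joy", "anger", "sorrow", "happiness"]  # 5 output classes

class EmotionCNN(nn.Module):
    def __init__(self, n_classes: int = len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames) for one sliding window
        return self.classifier(self.features(spectrogram))

# Usage: classify each window, then attach the label to that window's timestamp
# to build the expression sequence described in the text.
model = EmotionCNN().eval()
window = torch.randn(1, 1, 128, 64)        # placeholder spectrogram window
with torch.no_grad():
    emotion = EMOTIONS[model(window).argmax(dim=1).item()]
```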
Alternatively, in step S14, acquiring the mouth shape sequence corresponding to the phoneme time stamp file may include the following performing steps:
step S143, determining a mouth shape type corresponding to each phoneme in the phoneme timestamp file according to a preset corresponding relation, wherein the preset corresponding relation is used for recording a mapping relation between different phonemes and mouth shape types, and each mouth shape type corresponds to different mouth shape animations respectively;
Step S144, binding the time stamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
Through the above speech recognition and alignment and speech emotion recognition, in an alternative embodiment of the present invention, the emotion information and the phoneme sequence can each be obtained from the input audio. On this basis, the facial animation is further synthesized. For example, the content of the input audio file (a 16k WAV file) is a Chinese sentence describing the legendary figure Jiang Shang (surnamed Jiang, given name Shang), who helped King Wen and King Wu establish the flourishing age of the Zhou dynasty and later assisted King Mu of Zhou. Table 3 is the phoneme sequence table obtained by speech recognition, as shown in Table 3:
TABLE 3 Table 3
Audio (wav) file label | Phoneme start time | Phoneme end time | Phoneme (pinyin initial or final)
A00000 0.000 0.030 sil
A00000 0.030 0.040 c
A00000 0.130 0.040 i
A00000 0.170 0.120 r
A00000 0.250 0.120 en
A00000 0.370 0.100 x
A00000 0.490 0.100 ing
A00000 0.590 0.130 j
A00000 0.670 0.130 iang
A00000 0.800 0.080 m
A00000 0.850 0.080 ing
A00000 0.930 0.220 sh
A00000 1.120 0.220 ang
A00000 1.340 0.650 sil
A00000 1.990 0.070 y
A00000 2.040 0.070 in
A00000 2.110 0.090 g
A00000 2.180 0.090 an
……
A00000 11.730 0.090 zh
A00000 11.820 0.090 ou
A00000 11.910 0.140 m
A00000 12.000 0.140 u
A00000 12.140 0.180 w
A00000 12.220 0.180 ang
A00000 12.400 0.030 sil
Here, sil is an empty token indicating silence (no sound).
As can be seen from the above phoneme sequence table, a single phoneme can be as short as 0.03 seconds, which is approximately one frame time in a mobile game; therefore, each phoneme corresponds to one mouth shape animation file. In addition, each phoneme is identified as a pinyin initial or final, mouth shape mapping tables for initials and finals are formulated, and an idle (standby) mouth shape is added, so that 11 basic mouth shapes are designed in total.
Table 4 is the mouth shape mapping table for initials, and FIG. 2 is a schematic comparison diagram of mouth shape animation configurations based on Table 4 according to an alternative embodiment of the present invention, as shown in Table 4 and FIG. 2:
TABLE 4 Table 4
Table 5 is a vowel mouth shape mapping table as shown in Table 5:
TABLE 5
Table 6 is a mouth shape mapping table for special phonemes as shown in table 6:
TABLE 6
Phoneme | Phoneme class | Mouth shape type mapping | Mouth shape animation configuration
sil | (none) | 0 (idle/standby mouth shape) | Mouth animation 11
According to the mapping relation between phonemes and mouth shape types, the mouth shape type corresponding to the phoneme at each time stamp is obtained and the configured mouth shape animation is determined; the phonemes are then bound to their mouth shape types, thereby generating the corresponding mouth shape array (i.e., the mouth shape sequence):
[[0.000,"0"],[0.030,"E"],[0.130,"E"],[0.170,"J"],[0.250,"H"],[0.370,"E"],[0.490,"E"],[0.590,"J"],[0.670,"A"],[0.800,"B"],[0.850,"E"],[0.930,"U"],[1.120,"A"],[1.340,"0"],[1.990,"E"],[2.040,"E"],[2.110,"H"],[2.180,"a"],[2.270,"U"],[2.370,"U"],[2.420,"J"],[2.530,"H"],[2.600,"J"],[2.660,"a"],[2.700,"H"],[2.880,"O"],[2.980,"H"],[3.090,"a"],[3.160,"0"],[3.750,"E"],[3.800,"A"],[3.890,"E"],[3.960,"a"],[4.040,"J"],[4.170,"E"],[4.320,"B"],[4.350,"E"],[4.510,"0"],[5.010,"E"],[5.090,"H"],[5.190,"U"],[5.350,"H"],[5.450,"H"],[5.600,"E"],[5.850,"0"],[6.020,"U"],[6.120,"U"],[6.230,"U"],[6.310,"O"],[6.480,"E"],[6.550,"A"],[6.770,"0"],[7.370,"U"],[7.470,"U"],[7.610,"U"],[7.660,"H"],[7.720,"U"],[7.770,"A"],[7.840,"U"],[8.020,"U"],[8.140,"U"],[8.170,"A"],[8.390,"H"],[8.600,"A"],[8.670,"J"],[8.830,"E"],[8.940,"J"],[9.000,"H"],[9.080,"J"],[9.170,"a"],[9.240,"U"],[9.400,"H"],[9.450,"U"],[9.640,"E"],[9.810,"0"],[10.170,"B"],[10.360,"E"],[10.450,"B"],[10.550,"a"],[10.630,"E"],[10.680,"U"],[10.710,"E"],[10.800,"E"],[10.880,"E"],[10.930,"A"],[11.040,"B"],[11.120,"E"],[11.290,"U"],[11.380,"U"],[11.650,"0"],[11.730,"U"],[11.820,"O"],[11.910,"B"],[12.000,"U"],[12.140,"U"],[12.220,"A"],[12.400,"0"]]。
optionally, in step S16, synthesizing the mouth shape sequence and the expression sequence into a facial animation, and playing the facial animation in synchronization with the audio file may include the following steps:
step S161, triggering and judging whether to synthesize mouth shape animation in a mouth shape sequence and expression animation in an expression sequence or not every preset time length;
step S162, if yes, fusing the current mouth shape animation with the last mouth shape animation to obtain mouth shape animation to be played, and then synthesizing the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the facial animation at the trigger time;
Step S163, if the playing end time of the audio file is not reached, playing the facial animation at the triggering time, and returning to step S161 until the playing end time is reached.
For mobile game platforms, in order to keep the game fluent and reduce power consumption, the frame rate is often locked at 30 fps. Since the time per frame is 1 second / 30 = 1.0/30.0 seconds, the mouth shape playback frequency must stay consistent with the frame rate of the mobile terminal. A mouth shape playback frequency higher than 30 Hz has little practical significance on a mobile game platform locked at 30 fps.
After the audio file starts to play, a trigger fires every 1/30 of a second (about 0.033 seconds) to judge whether the corresponding mouth shape animation and expression animation need to be synthesized into a facial animation and played at that time point. If so, the offline-generated mouth shape sequence and expression sequence are traversed within the playing time range of the audio file to obtain the mouth shape animation and expression animation corresponding to the trigger moment, and the two are then synthesized into the facial animation at that moment.
FIG. 3 is a flow chart of a facial animation synthesis and playback process according to an alternative embodiment of the present invention, as shown in FIG. 3, the flow may include the following processing steps:
Step S302, starting to play the audio file;
Step S304, start the successfully registered timer, with the interval set to 1.0/30.0 seconds; at each trigger (about every 0.033 seconds), determine whether the corresponding facial animation needs to be synthesized and played at that time point;
step S306, if the corresponding facial animation is determined to be synthesized and played, traversing the offline generated mouth shape sequence and expression sequence to obtain the mouth shape animation and the expression animation corresponding to the trigger time, and synthesizing the mouth shape animation and the expression animation obtained at the trigger time into the facial animation, wherein the mouth shape animation corresponding to the trigger time needs to carry out fusion processing on the current mouth shape animation and the last mouth shape animation;
step S308, judging whether the end playing time of the audio file is reached; if yes, go on to step S310; if not, continuously and synchronously playing the facial animation in the playing time range of the audio file, and then turning to step S304;
step S310, if the end playing time of the audio file is reached, the facial animation is played.
Through the above examples provided by the present invention, the following performance analysis data can be obtained:
(1) Computing performance of off-line generation of mouth shapes: taking the Mac Mini configuration of 2.6GHz Intel Core i5 as an example, the average time taken to convert speech to a phoneme timestamp is 6s when processing a 10 second wav file. In addition, the time taken to convert the phoneme timestamp into a mouth-form array is negligible.
(2) Computing performance at runtime: the system can run stably at 30 frames per second and above. Since the mouth shape array can be generated offline, the actual runtime performance overhead is small. If the frame rate falls below 30 fps, some mouth shapes may be skipped because individual phonemes are very short, but the overall impact is small.
(3) Runtime memory analysis: for a 10-second wav file, the corresponding mouth shape array saved as a txt text file is about 900 bytes. Each of the 11 produced mouth shape skeletal animations contains only about 10 frames; a single mouth shape action (gis) file is about 25 KB, all mouth shape skeletal animations together occupy about 275 KB, and they can be loaded on demand at runtime.
Producing realistic and natural facial animation is a very time-consuming and labor-intensive task, yet vivid facial animation can significantly improve how users receive information and how friendly the interaction feels, which is important for games, voice visualization, video post-processing, and other services. At least some embodiments of the present invention extract the phoneme sequence and the emotion information from the audio by developing a speech recognition algorithm and a speech emotion recognition algorithm. Meanwhile, the facial animation is decomposed into a small number of orthogonal expression bases and mouth shape bases, and realistic, natural facial animation is generated with an effective fusion algorithm according to the recognized phonemes and the labeled emotion results. Extensive experiments show that the generated facial animation is accurate and rich and has low running overhead, while greatly reducing the production cost for artists.
Therefore, in terms of facial animation synthesis, at least some embodiments of the present invention first acquire the time stamp and duration of the phonemes of each word through speech recognition, and then drive the corresponding pinyin-based mouth shape actions. Taking Chinese recognition as an example (the approach is applicable to other languages as well), superposition and fusion of basic actions, expressions, and mouth shape animations are supported. Meanwhile, offline preprocessing can be performed, so that the CPU occupation and memory overhead of the whole system at runtime are small, making it very suitable for mobile game platforms.
In an embodiment of the present invention, a device for synthesizing facial animation is further provided, and the device is used for implementing the above embodiment and a preferred implementation manner, and is not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a composition apparatus for facial animation according to an embodiment of the present invention, as shown in fig. 4, the apparatus comprising: the parsing module 10 is configured to perform voice parsing on the audio file to obtain a phoneme timestamp file and an expression timestamp file, where the phoneme timestamp file includes: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme; the obtaining module 20 is configured to obtain a mouth shape sequence corresponding to the phoneme timestamp file and obtain an expression sequence corresponding to the expression timestamp file, where the mouth shape sequence is used to describe mouth shape information corresponding to each phoneme in the phoneme timestamp file, and the expression sequence is used to describe expression information corresponding to the expression timestamp file; and a synthesizing module 30 for synthesizing the mouth shape sequence and the expression sequence into facial animation.
Optionally, FIG. 5 is a block diagram of a facial animation synthesizing apparatus according to an alternative embodiment of the present invention. As shown in FIG. 5, in addition to the modules above, the apparatus further comprises: a playing module 40, configured to play the facial animation and the audio file synchronously.
Optionally, the parsing module 10 includes: a first converting unit (not shown in the figure) for converting the audio file into a text sequence; a second converting unit (not shown in the figure) for converting the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence, wherein each word corresponds to at least one phoneme; a first processing unit (not shown in the figure) for performing time-series modeling on the phoneme sequence to obtain a phoneme time stamp file.
Optionally, a first conversion unit (not shown in the figure) is used for converting the audio file into a text sequence using a connectionist temporal classification-recurrent neural network model.
Optionally, a first processing unit (not shown in the figure) is configured to perform time-series modeling on the phoneme sequence by using a hidden markov model, so as to obtain a phoneme time stamp file.
Optionally, the acquiring module 20 includes: an extracting unit (not shown in the figure) for extracting a spectrogram of the audio file within a preset time window; the first obtaining unit (not shown in the figure) is used for deriving the expression animation corresponding to the expression time stamp file and the emotion type corresponding to each expression according to the sound spectrum image to obtain the expression sequence.
Optionally, a first obtaining unit (not shown in the figure) is configured to set the spectrogram as an input item, and derive, through a convolutional neural network, an expression animation corresponding to the expression timestamp file and an emotion category corresponding to each expression, to obtain an expression sequence.
Optionally, the acquiring module 20 includes: a determining unit (not shown in the figure) configured to determine a mouth shape type corresponding to each phoneme in the phoneme timestamp file according to a preset correspondence, where the preset correspondence is used to record a mapping relationship between different phonemes and mouth shape types, and each mouth shape type corresponds to a different mouth shape animation; a second obtaining unit (not shown in the figure) for binding the timestamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
Optionally, the synthesis module 30 includes: a judging unit (not shown in the figure) for triggering and judging whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence every preset time length; a first processing unit (not shown in the figure) for fusing the current mouth shape animation with the last mouth shape animation to obtain the mouth shape animation to be played when the output of the judging unit is yes, and then synthesizing the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the face animation at the trigger time; and a playing unit (not shown in the figure) for playing the facial animation at the trigger time if the playing end time of the audio file is not reached, and returning to the judging step until the playing end time is reached.
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
s1, carrying out voice analysis on an audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme;
s2, acquiring a mouth shape sequence corresponding to the phoneme time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file;
s3, acquiring an expression sequence corresponding to the expression time stamp file, wherein the expression sequence is used for describing expression information corresponding to the expression time stamp file;
s4, synthesizing the mouth shape sequence and the expression sequence into facial animation, and synchronously playing the facial animation and the audio file.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
An embodiment of the invention also provides a processor arranged to run a computer program to perform the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, carrying out voice analysis on an audio file to obtain a phoneme time stamp file and an expression time stamp file, wherein the phoneme time stamp file comprises: the time stamp and duration of each phoneme corresponding to each word obtained by converting the audio file, wherein each word corresponds to at least one phoneme;
s2, acquiring a mouth shape sequence corresponding to the phoneme time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file;
s3, acquiring an expression sequence corresponding to the expression time stamp file, wherein the expression sequence is used for describing expression information corresponding to the expression time stamp file;
s4, synthesizing the mouth shape sequence and the expression sequence into facial animation, and synchronously playing the facial animation and the audio file.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (17)

1. A method of synthesizing facial animation, comprising:
performing voice analysis on an audio file to obtain an expression timestamp file;
converting the audio file into a text sequence; converting the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence; and performing time-sequence modeling on the phoneme sequence to obtain a phoneme timestamp file, wherein the phoneme timestamp file comprises: the timestamp and duration of each phoneme corresponding to each word converted from the audio file, each word corresponds to at least one phoneme, and the type of the at least one phoneme comprises a final (vowel pinyin), or comprises an initial (consonant pinyin) and a final (vowel pinyin);
acquiring a mouth shape sequence corresponding to the phoneme time stamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme time stamp file;
acquiring an expression sequence corresponding to the expression timestamp file, wherein the expression sequence is used for describing expression information corresponding to the expression timestamp file;
synthesizing the mouth shape sequence and the expression sequence into facial animation, and synchronously playing the facial animation and the audio file;
wherein synthesizing the mouth shape sequence and the expression sequence into the facial animation and synchronously playing the facial animation and the audio file comprises: determining, at intervals of a preset duration, whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence, wherein the preset duration is determined by a game frame rate; if yes, fusing the current mouth shape animation with the previous mouth shape animation to obtain a mouth shape animation to be played, and then synthesizing the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the facial animation at the trigger time; and if the playing end time of the audio file has not been reached, playing the facial animation at the trigger time.
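As a rough illustration of the last limitation of claim 1 (triggering at intervals tied to the game frame rate, fusing the current mouth animation with the previous one, and composing the expression at the trigger time), the sketch below uses plain Python. The blendshape-weight representation, the 0.5 blending weight, and the printing stand-in for rendering are all assumptions, not details given by the patent.

```python
import time

def blend(prev, cur, weight=0.5):
    """Hypothetical fusion: weighted average of two blendshape-weight dicts."""
    return {k: weight * prev.get(k, 0.0) + (1 - weight) * cur.get(k, 0.0)
            for k in set(prev) | set(cur)}

def play_facial_animation(mouth_at, expression_at, audio_len_s, frame_rate=30.0):
    """Frame-rate-driven synthesis loop, assumed callables mouth_at/expression_at
    return the mouth shape and expression weights active at a given time."""
    interval = 1.0 / frame_rate          # preset duration derived from the game frame rate
    t, prev_mouth = 0.0, None
    while t < audio_len_s:               # stop at the playing end time of the audio file
        cur_mouth = mouth_at(t)
        if prev_mouth is not None:       # fuse current and previous mouth animations
            cur_mouth = blend(prev_mouth, cur_mouth)
        frame = {**cur_mouth, **expression_at(t)}   # add the expression at the trigger time
        print(f"{t:5.2f}s -> {frame}")   # stand-in for rendering the facial animation
        prev_mouth, t = cur_mouth, t + interval
        time.sleep(0)                    # a real game loop would wait for the next frame tick

# Toy usage: a constant mouth shape and a constant "smile" expression.
play_facial_animation(lambda t: {"jaw_open": 0.6},
                      lambda t: {"smile": 1.0},
                      audio_len_s=0.1)
```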
2. The method of claim 1, wherein converting the audio file into the text sequence comprises:
converting the audio file into the text sequence by using a connectionist temporal classification (CTC)-recurrent neural network model.
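Claim 2's CTC-recurrent neural network pairing can be sketched as follows in PyTorch; the bidirectional GRU, the 80-dimensional mel features, and the 4000-character vocabulary are illustrative assumptions, not the model actually trained for the patent.

```python
import torch
import torch.nn as nn

class SpeechToText(nn.Module):
    """Minimal CTC-RNN sketch for audio-to-text conversion."""
    def __init__(self, n_mels=80, hidden=256, vocab=4000):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab + 1)   # +1 for the CTC blank label

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        out, _ = self.rnn(mel)
        return self.fc(out).log_softmax(-1)  # (batch, frames, vocab+1)

# Training uses the CTC loss, which aligns frames to characters implicitly:
model = SpeechToText()
mel = torch.randn(2, 120, 80)                # two dummy utterances of 120 frames
log_probs = model(mel).transpose(0, 1)       # CTCLoss expects (frames, batch, classes)
targets = torch.randint(1, 4001, (2, 10))    # two dummy 10-character transcripts
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.tensor([120, 120]), torch.tensor([10, 10]))
```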
3. The method of claim 1, wherein time-sequential modeling the sequence of phonemes to obtain the phoneme timestamp file comprises:
performing time-sequence modeling on the phoneme sequence by using a hidden Markov model to obtain the phoneme timestamp file.
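Claim 3's hidden-Markov time-sequence modeling amounts to forced alignment: each phoneme in the sequence is a left-to-right state, and Viterbi decoding over frame-level acoustic scores yields the timestamp and duration of every phoneme. The sketch below assumes the per-frame log-likelihoods are already available (in practice they come from an acoustic model), a 10 ms frame step, and at least one frame per phoneme.

```python
import numpy as np

def forced_align(log_likelihood, frame_dur=0.01):
    """Viterbi alignment of a left-to-right phoneme chain.

    log_likelihood: (T, P) array of per-frame log-probabilities, one column
    per phoneme in order. Returns a list of (start_time, duration) pairs.
    """
    T, P = log_likelihood.shape
    NEG = -np.inf
    dp = np.full((T, P), NEG)
    back = np.zeros((T, P), dtype=int)       # 0 = stay in phoneme, 1 = advance
    dp[0, 0] = log_likelihood[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = dp[t - 1, p]
            adv = dp[t - 1, p - 1] if p > 0 else NEG
            if adv > stay:
                dp[t, p], back[t, p] = adv + log_likelihood[t, p], 1
            else:
                dp[t, p], back[t, p] = stay + log_likelihood[t, p], 0
    # Trace back from the last phoneme at the last frame.
    states = np.zeros(T, dtype=int)
    p = P - 1
    for t in range(T - 1, 0, -1):
        states[t] = p
        if back[t, p] == 1:
            p -= 1
    states[0] = p
    # Convert frame-level labels into (timestamp, duration) pairs.
    return [(np.where(states == ph)[0][0] * frame_dur,
             np.sum(states == ph) * frame_dur) for ph in range(P)]

# Toy example: 3 phonemes over 10 frames of hypothetical log-likelihoods.
ll = np.log(np.full((10, 3), 0.1))
ll[:3, 0] = np.log(0.9)    # frames 0-2 favour phoneme 0
ll[3:6, 1] = np.log(0.9)   # frames 3-5 favour phoneme 1
ll[6:, 2] = np.log(0.9)    # frames 6-9 favour phoneme 2
print(forced_align(ll))    # [(0.0, 0.03), (0.03, 0.03), (0.06, 0.04)]
```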
4. The method of claim 1, wherein obtaining the expression sequence corresponding to the expression timestamp file comprises:
extracting a spectrogram of the audio file in a preset time window;
and inferring, according to the spectrogram, the expression animation corresponding to the expression timestamp file and the emotion category corresponding to each expression, to obtain the expression sequence.
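One way to realize the "spectrogram of the audio file in a preset time window" of claim 4 is sketched below with librosa; the 3 s window, 1 s hop, 16 kHz sample rate, and 128 mel bands are assumptions, not parameters given in the patent.

```python
import librosa
import numpy as np

def spectrogram_windows(path, window_s=3.0, hop_s=1.0, sr=16000):
    """Cut the audio into fixed-length windows and compute a mel-spectrogram
    for each one, returning (window start time, spectrogram) pairs."""
    y, sr = librosa.load(path, sr=sr)
    win, hop = int(window_s * sr), int(hop_s * sr)
    specs = []
    for start in range(0, max(len(y) - win, 0) + 1, hop):
        chunk = y[start:start + win]
        mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
        specs.append((start / sr, librosa.power_to_db(mel, ref=np.max)))
    return specs
```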
5. The method of claim 4, wherein deriving the expression animation corresponding to the expression timestamp file and the emotion classification corresponding to each expression from the spectrogram, and obtaining the expression sequence comprises:
taking the spectrogram as an input item and inferring, through a convolutional neural network, the expression animation corresponding to the expression timestamp file and the emotion category corresponding to each expression, to obtain the expression sequence.
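A toy version of the convolutional network in claim 5, mapping one spectrogram window to emotion-category logits, might look as follows; the layer sizes and the six-emotion output are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Illustrative CNN from spectrogram window to emotion-category logits."""
    def __init__(self, n_emotions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, spec):                 # spec: (batch, 1, freq, time)
        x = self.features(spec).flatten(1)
        return self.classifier(x)

logits = EmotionCNN()(torch.randn(1, 1, 128, 94))  # one dummy ~3 s mel window
emotion_id = logits.argmax(-1)                     # index of the predicted emotion category
```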
6. The method of claim 1, wherein obtaining a mouth shape sequence corresponding to the phoneme timestamp file comprises:
determining a mouth shape type corresponding to each phoneme in the phoneme timestamp file according to a preset corresponding relation, wherein the preset corresponding relation is used for recording the mapping relation between different phonemes and the mouth shape type, and each mouth shape type corresponds to different mouth shape animations respectively;
binding the time stamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
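Claim 6's preset correspondence can be illustrated as a simple lookup table plus a binding step; the phoneme-to-mouth-shape entries below are invented for illustration and do not reproduce the patent's actual mapping.

```python
# Hypothetical phoneme -> mouth shape type table (the real preset correspondence
# is defined by the patent holder and is not disclosed here).
PHONEME_TO_MOUTH = {
    "a": "open_wide", "o": "round", "e": "half_open",
    "i": "spread", "u": "pucker", "b": "closed", "m": "closed",
}

def build_mouth_sequence(phoneme_stamps):
    """Bind each phoneme timestamp to its mouth shape type."""
    seq = []
    for p in phoneme_stamps:
        mouth = PHONEME_TO_MOUTH.get(p["phoneme"], "neutral")  # fall back for unmapped phonemes
        seq.append({"start": p["start"], "duration": p["duration"], "mouth_type": mouth})
    return seq

print(build_mouth_sequence([{"phoneme": "a", "start": 0.0, "duration": 0.2},
                            {"phoneme": "h", "start": 0.2, "duration": 0.1}]))
```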
7. The method of claim 1, wherein synthesizing the mouth shape sequence and the expression sequence into the facial animation and playing the facial animation in synchronization with the audio file further comprises:
and after the facial animation at the trigger time has been played, returning to the step of determining, at intervals of the preset duration, whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence, until the playing end time is reached.
8. A facial animation synthesis apparatus, comprising:
a parsing module, configured to perform voice analysis on an audio file to obtain an expression timestamp file;
wherein the parsing module comprises: a first conversion unit, configured to convert the audio file into a text sequence; a second conversion unit, configured to convert the text sequence into a phoneme sequence according to the Chinese pinyin of each word in the text sequence, wherein each word corresponds to at least one phoneme; and a first processing unit, configured to perform time-sequence modeling on the phoneme sequence to obtain a phoneme timestamp file, wherein the phoneme timestamp file comprises: the timestamp and duration of each phoneme corresponding to each word converted from the audio file, each word corresponds to at least one phoneme, and the type of the at least one phoneme comprises a final (vowel pinyin), or comprises an initial (consonant pinyin) and a final (vowel pinyin);
an acquisition module, configured to acquire a mouth shape sequence corresponding to the phoneme timestamp file and an expression sequence corresponding to the expression timestamp file, wherein the mouth shape sequence is used for describing mouth shape information corresponding to each phoneme in the phoneme timestamp file, and the expression sequence is used for describing expression information corresponding to the expression timestamp file;
a synthesis module, configured to synthesize the mouth shape sequence and the expression sequence into a facial animation and play the facial animation in synchronization with the audio file;
wherein the synthesis module comprises: a judging unit, configured to determine, at intervals of a preset duration, whether to synthesize the mouth shape animation in the mouth shape sequence and the expression animation in the expression sequence, wherein the preset duration is determined by a game frame rate; a first processing unit, configured to, when the output of the judging unit is yes, fuse the current mouth shape animation with the previous mouth shape animation to obtain a mouth shape animation to be played, and then synthesize the expression animation corresponding to the trigger time with the mouth shape animation to be played to obtain the facial animation at the trigger time; and a playing unit, configured to play the facial animation at the trigger time if the playing end time of the audio file has not been reached.
9. The apparatus of claim 8, wherein the first conversion unit is configured to convert the audio file into the text sequence by using a connectionist temporal classification (CTC)-recurrent neural network model.
10. The apparatus of claim 8, wherein the first processing unit is configured to perform time-sequence modeling on the phoneme sequence by using a hidden Markov model to obtain the phoneme timestamp file.
11. The apparatus of claim 8, wherein the acquisition module comprises:
the extraction unit is used for extracting the spectrogram of the audio file in a preset time window;
a first obtaining unit, configured to infer, according to the spectrogram, the expression animation corresponding to the expression timestamp file and the emotion category corresponding to each expression, to obtain the expression sequence.
12. The apparatus of claim 11, wherein the first obtaining unit is configured to take the spectrogram as an input item and infer, through a convolutional neural network, the expression animation corresponding to the expression timestamp file and the emotion category corresponding to each expression, to obtain the expression sequence.
13. The apparatus of claim 8, wherein the acquisition module comprises:
the determining unit is used for determining the mouth shape type corresponding to each phoneme in the phoneme time stamp file according to a preset corresponding relation, wherein the preset corresponding relation is used for recording the mapping relation between different phonemes and the mouth shape type, and each mouth shape type corresponds to different mouth shape animations respectively;
and the second acquisition unit is used for binding the time stamp of each phoneme with the corresponding mouth shape type to obtain the mouth shape sequence.
14. The apparatus of claim 8, wherein the playing unit is further configured to:
return to the judging unit after the facial animation at the trigger time has been played, until the playing end time is reached.
15. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of synthesizing facial animation according to any one of claims 1 to 7.
16. A processor configured to run a program, wherein the program, when run, performs the method of synthesizing facial animation according to any one of claims 1 to 7.
17. A terminal, comprising: one or more processors, a memory, a display device, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs for performing the facial animation synthesis method of any of claims 1-7.
CN201811156589.9A 2018-09-30 2018-09-30 Method and device for synthesizing facial animation, storage medium, processor and terminal Active CN109377540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811156589.9A CN109377540B (en) 2018-09-30 2018-09-30 Method and device for synthesizing facial animation, storage medium, processor and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811156589.9A CN109377540B (en) 2018-09-30 2018-09-30 Method and device for synthesizing facial animation, storage medium, processor and terminal

Publications (2)

Publication Number Publication Date
CN109377540A CN109377540A (en) 2019-02-22
CN109377540B true CN109377540B (en) 2023-12-19

Family

ID=65402715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811156589.9A Active CN109377540B (en) 2018-09-30 2018-09-30 Method and device for synthesizing facial animation, storage medium, processor and terminal

Country Status (1)

Country Link
CN (1) CN109377540B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110531860B (en) 2019-09-02 2020-07-24 腾讯科技(深圳)有限公司 Animation image driving method and device based on artificial intelligence
CN110753245A (en) * 2019-09-30 2020-02-04 深圳市嘀哒知经科技有限责任公司 Audio and animation synchronous coordinated playing method and system and terminal equipment
CN110751708B (en) * 2019-10-21 2021-03-19 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
CN111145322B (en) * 2019-12-26 2024-01-19 上海浦东发展银行股份有限公司 Method, apparatus, and computer-readable storage medium for driving avatar
CN111260761B (en) * 2020-01-15 2023-05-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium
CN112184859B (en) 2020-09-01 2023-10-03 魔珐(上海)信息科技有限公司 End-to-end virtual object animation generation method and device, storage medium and terminal
CN112541957B (en) * 2020-12-09 2024-05-21 北京百度网讯科技有限公司 Animation generation method, device, electronic equipment and computer readable medium
CN112750187A (en) * 2021-01-19 2021-05-04 腾讯科技(深圳)有限公司 Animation generation method, device and equipment and computer readable storage medium
CN112927712B (en) * 2021-01-25 2024-06-04 网易(杭州)网络有限公司 Video generation method and device and electronic equipment
CN112837401B (en) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 Information processing method, device, computer equipment and storage medium
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN112906650B (en) * 2021-03-24 2023-08-15 百度在线网络技术(北京)有限公司 Intelligent processing method, device, equipment and storage medium for teaching video
CN113706669B (en) * 2021-08-12 2022-09-27 北京百度网讯科技有限公司 Animation synthesis method and device, electronic equipment and storage medium
CN113744370B (en) * 2021-08-12 2022-07-01 北京百度网讯科技有限公司 Animation synthesis method, animation synthesis device, electronic device, and storage medium
CN113538636B (en) * 2021-09-15 2022-07-01 中国传媒大学 Virtual object control method and device, electronic equipment and medium
CN114267374B (en) * 2021-11-24 2022-10-18 北京百度网讯科技有限公司 Phoneme detection method and device, training method and device, equipment and medium
CN114359450A (en) * 2022-01-17 2022-04-15 小哆智能科技(北京)有限公司 Method and device for simulating virtual character speaking
CN115311731B (en) * 2022-10-10 2023-01-31 之江实验室 Expression generation method and device for sign language digital person
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium
CN116778040B (en) * 2023-08-17 2024-04-09 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN117115318B (en) * 2023-08-18 2024-05-28 蚂蚁区块链科技(上海)有限公司 Method and device for synthesizing mouth-shaped animation and electronic equipment

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003281567A (en) * 2002-03-20 2003-10-03 Oki Electric Ind Co Ltd Three-dimensional image generating device and method, and computer-readable storage medium with its image generating program stored therein
CN1694162A (en) * 2005-03-31 2005-11-09 金庆镐 Phonetic recognition analysing system and service method
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN102063923A (en) * 2009-11-18 2011-05-18 新奥特(北京)视频技术有限公司 Adaptive playing method and device of animation
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN104536748A (en) * 2014-12-22 2015-04-22 杭州短趣网络传媒技术有限公司 Method for adjusting animation duration of dynamic picture
CN105528427A (en) * 2015-12-08 2016-04-27 腾讯科技(深圳)有限公司 Media file processing method, and sharing method and device in social application
CN105551071A (en) * 2015-12-02 2016-05-04 中国科学院计算技术研究所 Method and system of face animation generation driven by text voice
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN107180634A (en) * 2017-06-22 2017-09-19 海信集团有限公司 A kind of scope of business method, device and the terminal device of interactive voice text
CN107301865A (en) * 2017-06-22 2017-10-27 海信集团有限公司 A kind of method and apparatus for being used in phonetic entry determine interaction text
CN107610720A (en) * 2017-09-28 2018-01-19 北京语言大学 Pronounce inclined error detection method, apparatus, storage medium and equipment
CN107705782A (en) * 2017-09-29 2018-02-16 百度在线网络技术(北京)有限公司 Method and apparatus for determining phoneme pronunciation duration
CN108234735A (en) * 2016-12-14 2018-06-29 中兴通讯股份有限公司 A kind of media display methods and terminal
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device


Also Published As

Publication number Publication date
CN109377540A (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN109377540B (en) Method and device for synthesizing facial animation, storage medium, processor and terminal
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN106653052B (en) Virtual human face animation generation method and device
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
Wang et al. Unsupervised analysis of human gestures
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN108492817A (en) A kind of song data processing method and performance interactive system based on virtual idol
CN113538641A (en) Animation generation method and device, storage medium and electronic equipment
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
Haag et al. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis
CN111915707A (en) Mouth shape animation display method and device based on audio information and storage medium
Bozkurt et al. Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN113538636B (en) Virtual object control method and device, electronic equipment and medium
Dahmani et al. Conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis
CN115700772A (en) Face animation generation method and device
Wang et al. Computer-assisted audiovisual language learning
Wang et al. Comic-guided speech synthesis
Youssef et al. Articulatory features for speech-driven head motion synthesis
Bozkurt et al. Affect-expressive hand gestures synthesis and animation
Ding et al. Speech-driven eyebrow motion synthesis with contextual markovian models
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Rojc et al. Multilingual and Multimodal Corpus-Based Text-to-Speech System–PLATTOS
Awad et al. A combined semantic and motion capture database for real-time sign language synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant