CN116363268A - Method and device for generating mouth shape animation, electronic equipment and storage medium - Google Patents

Method and device for generating mouth shape animation, electronic equipment and storage medium Download PDF

Info

Publication number
CN116363268A
CN116363268A (Application CN202310139936.1A)
Authority
CN
China
Prior art keywords
sequence
preset
animation
visual
mouth shape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310139936.1A
Other languages
Chinese (zh)
Inventor
杨建顺
陈军宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Black Mirror Technology Co ltd
Original Assignee
Xiamen Black Mirror Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Black Mirror Technology Co ltd filed Critical Xiamen Black Mirror Technology Co ltd
Priority to CN202310139936.1A priority Critical patent/CN116363268A/en
Publication of CN116363268A publication Critical patent/CN116363268A/en
Pending legal-status Critical Current

Classifications

    • G06T13/205: 3D [Three Dimensional] animation driven by audio data
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G10L15/26: Speech to text systems
    • G10L21/10: Transforming into visible information
    • G10L2021/105: Synthesis of the lips movements from speech, e.g. for talking heads
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, a device, an electronic device and a storage medium for generating mouth shape animation, wherein the method comprises the following steps: acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information; generating a phoneme sequence according to the pinyin information of the target text and the timestamp information; converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and rendering the mouth shape animation corresponding to the target voice based on the visual sequence, so that the pronunciation action of the mouth shape animation is accurately matched with the target voice, and more accurate generation of the mouth shape animation is realized.

Description

Method and device for generating mouth shape animation, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating a mouth shape animation, an electronic device, and a storage medium.
Background
Voice, as a natural form of communication, plays a prominent role in the field of human-computer interaction. However, generating realistic mouth shape animation during human-computer interaction is extremely complex.
In the prior art, mouth shape animation frame data conforming to a Gaussian distribution is generally generated from only a limited set of single key-frame animations. Such a scheme can hardly reproduce the mouth shape and facial muscle movements of a person speaking normally, so the finally generated mouth shape animation does not conform to normal speaking rules.
Therefore, how to generate the mouth shape animation more accurately is a technical problem to be solved at present.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the application discloses a method, a device, electronic equipment and a storage medium for generating mouth-shaped animation, which are used for generating the mouth-shaped animation more accurately.
In a first aspect, there is provided a method for generating a mouth shape animation, the method comprising: acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information; generating a phoneme sequence according to the pinyin information of the target text and the timestamp information; converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In a second aspect, there is provided an apparatus for generating a mouth shape animation, the apparatus comprising: the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information; the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information; the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In a third aspect, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of generating a mouth shape animation according to the first aspect via execution of the executable instructions.
In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a mouth-shape animation according to the first aspect.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements and the visual elements represent the mouth shape visual characteristics during pronunciation; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence, so that the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for generating a mouth shape animation according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for generating a mouth shape animation according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for generating a mouth shape animation according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a device for generating a mouth shape animation according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It is noted that other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise construction set forth herein below and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
A method of generating digital human video according to an exemplary embodiment of the present application is described below with reference to fig. 1 to 3. It should be noted that the following application scenario is only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in any way in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The embodiment of the application provides a method for generating mouth shape animation, as shown in fig. 1, the method comprises the following steps:
step S101, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. The electronic device may be a mobile device, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device (such as glasses, a watch, etc.), or a fixed device, such as a personal computer, an intelligent television, an intelligent home/appliance (such as an air conditioner, an electric cooker, etc.), which is not limited in the embodiments of the present invention. The electronic equipment can locally perform voice recognition based on a preset voice recognition algorithm to obtain a target text with time stamp information, and the electronic equipment can also send the target voice to the server, and the server performs voice recognition based on the preset voice recognition algorithm to obtain the target text with the time stamp information.
Alternatively, the preset speech recognition algorithm may be any one of algorithms including a Dynamic Time Warping (DTW) based algorithm, a non-parametric model based Vector Quantization (VQ) method, a parametric model based Hidden Markov Model (HMM) method, an Artificial Neural Network (ANN) based algorithm, and a support vector machine.
Optionally, after the target voice is obtained, the target voice passes through a preset filter to remove noise in the target voice, so that the accuracy of the target voice is further improved.
Step S102, generating a phoneme sequence according to the pinyin information of the target text and the time stamp information.
In this embodiment, the phonemes are the smallest phonetic units that make up a syllable, and any audio segment is composed of a limited number of phonemes. The target text has pinyin information and time stamp information, and based on the pinyin information and the time stamp information, a plurality of phonemes may be generated, each phoneme constituting a phoneme sequence.
Step S103, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the visual sequence includes a plurality of visual elements, the visual elements represent the mouth shape visual characteristics during pronunciation, the mouth shape change is generated by the change of each phoneme in the phoneme sequence, each phoneme in the phoneme sequence is converted into a plurality of visual elements according to a first preset conversion rule, and each visual element forms the visual sequence.
In a specific application scenario of the present application, the first preset conversion rule is shown in table 1.
TABLE 1
Visual element class / Phonemes
Class 1: (none)
Class 2: b, p, m
Class 3: f
Class 4: d, t, n
Class 5: l
Class 6: g, k, h
Class 7: j, q, x
Class 8: z, c, s
Class 9: zh, ch, sh, r
Class 10: a
Class 11: o
Class 12: e
Class 13: i, y
Class 14: u, w
Class 15: ü
Class 16: er
Class 17: -n, -ng
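For illustration only, the mapping of Table 1 can be sketched as a small Python lookup. The class numbering follows the table; the names below and the choice to map unknown phonemes to the silent class 1 are assumptions for the sketch and do not appear in the original application.

```python
# Sketch of the phoneme-to-visual-element mapping in Table 1.
VISUAL_ELEMENT_CLASSES = {
    1: [],                      # silence / no phoneme
    2: ["b", "p", "m"],
    3: ["f"],
    4: ["d", "t", "n"],
    5: ["l"],
    6: ["g", "k", "h"],
    7: ["j", "q", "x"],
    8: ["z", "c", "s"],
    9: ["zh", "ch", "sh", "r"],
    10: ["a"],
    11: ["o"],
    12: ["e"],
    13: ["i", "y"],
    14: ["u", "w"],
    15: ["ü", "v"],             # ü is often written as v in pinyin input
    16: ["er"],
    17: ["-n", "-ng"],
}

# Invert to a phoneme -> class lookup for fast conversion.
PHONEME_TO_CLASS = {p: c for c, ps in VISUAL_ELEMENT_CLASSES.items() for p in ps}

def phonemes_to_visual_elements(phonemes):
    """Map a phoneme sequence to a sequence of visual element classes (class 1 = silence)."""
    return [PHONEME_TO_CLASS.get(p, 1) for p in phonemes]
```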
In some embodiments of the present application, in order to improve the generating efficiency of the mouth shape animation, after converting the phoneme sequence into the visual sequence according to the first preset conversion rule, the method includes:
carrying out merging processing on the visual sequence according to a preset merging rule, wherein the merging processing includes the following steps:
1) Merging the visual elements corresponding to d, t, n and l;
2) Merging the visual element corresponding to the -i of zh, ch, sh, r with that of the -i of z, c, s;
3) Letting the single vowel e and the e in compound vowels share one visual element;
4) Letting j, q, x and z, c, s share one visual element;
5) Letting u and v (ü) share one visual element.
By merging the visual sequence in this way, phonemes with similar mouth shapes share a visual element, which reduces the number of distinct visual elements and improves the generating efficiency of the mouth shape animation.
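As an illustration, the class-level merges above (d/t/n with l, j/q/x with z/c/s, u with v) can be sketched as a small remapping table. The -i merge and the single/compound e merge depend on phoneme context and are omitted here; all names in the sketch are assumptions.

```python
# Illustrative merge table: each visual element class on the left is replaced by
# the class whose mouth shape it shares. Class numbers refer to Table 1.
MERGE_RULES = {
    5: 4,    # l shares the visual element of d, t, n
    7: 8,    # j, q, x share the visual element of z, c, s
    15: 14,  # v (ü) shares the visual element of u
}

def merge_visual_elements(viseme_classes):
    """Collapse similar mouth shapes so fewer distinct visual elements are rendered."""
    return [MERGE_RULES.get(c, c) for c in viseme_classes]

# Example: "lü" -> classes [5, 15] -> merged to [4, 14]
print(merge_visual_elements([5, 15]))
```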
In some embodiments of the present application, after performing the merging processing on the visual sequence according to the preset merging rule, the method further includes:
adding a tongue-tip (top-of-tongue) action feature to the visual element corresponding to d, t, n and l;
adding a curled-tongue action feature to the visual element corresponding to zh, ch, sh and r;
adding a curled-tongue action feature based on the pronunciation of e to the visual element corresponding to er.
By adding these tongue actions to the visual sequence, each visual element better conforms to normal pronunciation rules.
Alternatively, the visual element may take the form of text information, an image frame, or another data form. For example, if the visual element is text information, it may describe mouth states such as a closed mouth, a 50% mouth opening (half open), an "O"-shaped (rounded) mouth, or a 100% mouth opening (fully open); if the visual element is an image frame, the image frames corresponding to different visual elements show different mouth shapes.
And step S104, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed between every two animation frames in the animation frame sequence based on a preset interpolation algorithm. Specifically, interpolation may be performed on the fusion deformation parameters (i.e., blendshape parameters), the key point parameters, or the skeleton parameters of every two animation frames; transition frames are obtained from the calculation results, and each transition frame is then inserted into the animation frame sequence to obtain the mouth shape animation, further improving its smoothness.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
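As a rough sketch of the transition-frame idea, the snippet below linearly interpolates fusion deformation (blendshape) parameter dictionaries between consecutive animation frames. The frame representation, the number of transition frames per gap and the function names are assumptions; a Bezier curve could be substituted for the linear ramp.

```python
def make_transition_frames(frame_a, frame_b, steps=3):
    """Linearly interpolate blendshape parameter dicts between two animation
    frames, returning `steps` transition frames."""
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        frames.append({k: (1 - t) * frame_a[k] + t * frame_b[k] for k in frame_a})
    return frames

def smooth_animation(frames, steps=3):
    """Insert transition frames between every two consecutive animation frames."""
    if not frames:
        return []
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.extend(make_transition_frames(a, b, steps))
    out.append(frames[-1])
    return out

# Example with a single "jaw open" blendshape channel.
key_frames = [{"jaw_open": 0.0}, {"jaw_open": 1.0}]
print(smooth_animation(key_frames, steps=1))  # inserts {"jaw_open": 0.5} in between
```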
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements and the visual elements represent the mouth shape visual characteristics during pronunciation; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence, so that the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a method for generating the mouth shape animation, which is shown in fig. 2 and comprises the following steps:
step S201, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. And processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information. The person skilled in the art can adopt different preset voice recognition algorithms to perform voice recognition according to the needs, and the protection scope of the application is not affected.
Step S202, acquiring a plurality of syllables according to the pinyin information of the target text.
In this embodiment, the target text has pinyin information, the pinyin is composed of syllables, and a plurality of syllables can be obtained based on the pinyin information.
Step S203, converting each syllable into a plurality of phones according to a second preset conversion rule.
In this embodiment, each syllable is composed of a plurality of phones, and each syllable can be converted into a plurality of phones based on the second preset conversion rule.
In some embodiments of the present application, the second preset conversion rule includes:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
In this embodiment, syllables can be divided into initials and finals, which need to be processed separately; the first phoneme subset and the third phoneme subset are obtained after this processing. Syllables also include some whole-recognized syllables, which likewise need to be processed separately, and the second phoneme subset is obtained after processing. In addition, if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, the preset phoneme is converted into its corresponding target phoneme according to the preset deformation rule, so that each phoneme can be obtained more accurately.
Optionally, when separating the first phoneme subset belonging to the initials, zh, ch, sh are separated first, then b, p, m, f, d, t, n, l, g, k, h, j, q, x, z, c, s, r, y, w; for y and w, yu becomes v (ü), yi/y becomes i, and wu/w becomes u. When separating the second phoneme subset belonging to the whole-recognized syllables, zhi, chi, shi, ri, zi, ci, si are separated first, and their i is converted into the special -i. When separating the third phoneme subset belonging to the finals, single vowels, front vowels, rear vowels and middle vowels are separated, and the vowel heads, vowel bellies and vowel tails of the different types of finals are processed respectively: u following j, q, x is converted into v (ü), iu is converted into iou, ui into uei, un into uen, and a vowel tail of n or ng is judged separately.
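A very simplified sketch of this initial/final separation is given below in Python. It covers only the rewrites mentioned above (whole-recognized syllables, yu/yi/wu, and u after j/q/x), ignores tones and the finer vowel-head/belly/tail split, and all identifiers are assumptions.

```python
INITIALS = ["zh", "ch", "sh",  # two-letter initials checked first
            "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "z", "c", "s", "r"]
WHOLE_SYLLABLES = {"zhi", "chi", "shi", "ri", "zi", "ci", "si"}

def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final) with a few of the
    rewrites described above; the decomposition of the final into vowel head,
    belly and tail is not handled in this sketch."""
    if syllable in WHOLE_SYLLABLES:
        return syllable[:-1], "-i"             # whole-recognized syllables use the special -i
    # y/w rewrites: yu -> v (ü), yi/y -> i, wu/w -> u
    if syllable.startswith("yu"):
        syllable = "v" + syllable[2:]
    elif syllable.startswith("yi"):
        syllable = "i" + syllable[2:]
    elif syllable.startswith("y"):
        syllable = "i" + syllable[1:]
    elif syllable.startswith("wu"):
        syllable = "u" + syllable[2:]
    elif syllable.startswith("w"):
        syllable = "u" + syllable[1:]
    for ini in INITIALS:
        if syllable.startswith(ini):
            final = syllable[len(ini):]
            if ini in ("j", "q", "x") and final.startswith("u"):
                final = "v" + final[1:]        # u after j/q/x is actually ü
            return ini, final
    return "", syllable                        # zero-initial syllable

# Examples: "zhang" -> ("zh", "ang"), "qu" -> ("q", "v"), "shi" -> ("sh", "-i")
print(split_syllable("zhang"), split_syllable("qu"), split_syllable("shi"))
```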
Step S204, arranging each phoneme according to the timestamp information to obtain a phoneme sequence.
In this embodiment, each phoneme corresponds to a different time stamp, and the phoneme sequence is obtained after the phonemes are arranged according to the time stamp information.
In some embodiments of the present application, after arranging each phoneme according to the timestamp information to obtain the phoneme sequence, the method further includes:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
In this embodiment, blank positions may exist between syllables, and a mute frame with a preset duration is added at the blank positions, so that the corresponding syllable has time for pronunciation preparation or for extending the ending of the pronunciation, which makes the pronunciation more natural and smooth; the preset duration is 0.2 s.
The pronunciation types of the phonemes include initials, vowel heads, vowel bellies and vowel tails; each pronunciation type corresponds to a different preset duration distribution ratio, and the duration of each phoneme is adjusted based on its pronunciation type and the corresponding preset duration distribution ratio, so that the pronunciation of each phoneme better conforms to normal speaking rules.
Optionally, I0, F1, F2 and F3 are used to represent the initial, the vowel head, the vowel belly and the vowel tail, respectively. The preset duration distribution ratio of I0 is 0.3 and that of F1+F2+F3 is 0.7; within F1+F2+F3, the ratio of F1 is 0.2 and the ratio of F2 is 0.8, with F3 lying inside F2. If F3 is a vowel, the preset duration distribution ratio of F3 within F2 is 0.4; if F3 is a nasal consonant, the ratio of F3 within F2 is 0.2.
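To make the duration split concrete, the following is a minimal sketch, in Python, of distributing one syllable's duration over I0, F1, F2 and F3 according to the ratios above. The function name, the handling of syllables without an initial, and the "no tail" case are assumptions not specified in the text.

```python
def split_syllable_duration(total, has_initial=True, tail="vowel"):
    """Distribute a syllable's duration over initial (I0), vowel head (F1),
    vowel belly (F2) and vowel tail (F3) using the preset ratios above.
    `tail` is "vowel", "nasal" or None; F3 is carved out of F2's share."""
    i0 = 0.3 * total if has_initial else 0.0   # initial takes 0.3 of the syllable
    final = total - i0                         # F1 + F2 + F3 share (0.7 with an initial)
    f1 = 0.2 * final                           # vowel head
    f2 = 0.8 * final                           # vowel belly (F3 lies inside it)
    if tail == "vowel":
        f3 = 0.4 * f2
    elif tail == "nasal":
        f3 = 0.2 * f2
    else:
        f3 = 0.0
    f2 -= f3
    return {"I0": i0, "F1": f1, "F2": f2, "F3": f3}

# Example: a 0.5 s syllable with a nasal coda gives roughly
# {'I0': 0.15, 'F1': 0.07, 'F2': 0.224, 'F3': 0.056} (up to floating-point rounding).
print(split_syllable_duration(0.5, tail="nasal"))
```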
Step S205, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the visual sequence includes a plurality of visual elements, the visual elements represent the mouth shape visual characteristics during pronunciation, the mouth shape change is generated by the change of each phoneme in the phoneme sequence, each phoneme in the phoneme sequence is converted into a plurality of visual elements according to a first preset conversion rule, and each visual element forms the visual sequence. In a specific application scenario of the present application, the first preset conversion rule is shown in table 1 above.
And step S206, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed on every two animation frames in the animation frame sequence based on a preset interpolation algorithm, specifically, interpolation calculation may be performed on fusion deformation parameters, or key point parameters, or skeleton parameters of every two animation frames, transition frames are obtained according to calculation results, and then each transition frame is inserted into the animation frame sequence to obtain a mouth shape animation, so that smoothness of the mouth shape animation is further improved.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a plurality of syllables are acquired according to the pinyin information of the target text; each syllable is converted into a plurality of phonemes according to a second preset conversion rule; the phonemes are arranged according to the timestamp information to obtain a phoneme sequence; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence. The phoneme sequence is thus obtained more accurately, the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a method for generating the mouth shape animation, which is shown in fig. 3 and comprises the following steps:
step S301, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. And processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information. The person skilled in the art can adopt different preset voice recognition algorithms to perform voice recognition according to the needs, and the protection scope of the application is not affected.
Step S302, generating a phoneme sequence according to the pinyin information of the target text and the time stamp information.
Step S303, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the phonemes are the smallest phonetic units that make up a syllable, and any audio segment is composed of a limited number of phonemes. The target text has pinyin information and time stamp information, and based on the pinyin information and the time stamp information, a plurality of phonemes may be generated, each phoneme constituting a phoneme sequence.
Step S304, weights are respectively allocated to each of the visual elements in the visual element sequence based on a preset weight allocation list.
In this embodiment, since the pronunciation of each Chinese character is affected to different degrees by its adjacent phonemes, the mouth shape corresponding to the current phoneme when speaking is closely related to the phonemes immediately before and after it, and the rule of coarticulation needs to be satisfied. In order to make each visual element in the visual sequence conform to the coarticulation rule, a preset weight allocation list is generated in advance according to the association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme and the second mouth shape is the mouth shape of the phonemes adjacent to it before and after. After the visual sequence is obtained, weights are respectively assigned to the visual elements based on the preset weight allocation list.
In some embodiments of the present application, the preset weight allocation list is determined by a set of piecewise weight formulas. These formulas appear as equation images in the original publication and are not reproduced here; they give the visual weights of consonant and vowel phonemes in different cases (for example, depending on whether the adjacent vowels are separated by a consonant).
wherein W_C is the visual weight of a consonant phoneme; W_V is the visual weight of a vowel phoneme; R_C and R_V are the affected levels of consonants and of vowels, respectively, each quantized to between 0 and 1; V_1 is the vowel preceding V_2, and a consonant may lie between the two vowels; α and β are coefficients for controlling the weights, and β is 1 when V_1 and V_2 are separated by a consonant. The visual weights of different phonemes can be accurately obtained through these formulas, thereby generating a preset weight allocation list that better conforms to the coarticulation rules.
Step S305, adjusting the duration of each of the visual elements in the visual element sequence according to each of the weights.
In this embodiment, the duration of each visual element is adjusted according to its weight, so that the visual sequence better conforms to the coarticulation rules, further improving the accuracy of the mouth shape animation.
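Since the weight formulas themselves are given as equation images, the snippet below only illustrates, under assumed data structures, how weights looked up from a preset weight allocation list might be used to scale the duration of each visual element. The (previous, current, next) keying and the multiplicative scaling are assumptions, not the patent's exact formulas.

```python
def apply_coarticulation_weights(visual_elements, weight_list):
    """Scale each visual element's duration by a weight looked up from a preset
    weight allocation list keyed on (previous, current, next) viseme classes."""
    adjusted = []
    for i, elem in enumerate(visual_elements):
        prev_cls = visual_elements[i - 1]["cls"] if i > 0 else None
        next_cls = visual_elements[i + 1]["cls"] if i + 1 < len(visual_elements) else None
        w = weight_list.get((prev_cls, elem["cls"], next_cls), 1.0)
        adjusted.append({**elem, "duration": elem["duration"] * w, "weight": w})
    return adjusted

# Example: shorten the middle element when it sits between classes 2 and 10.
seq = [{"cls": 2, "duration": 0.1}, {"cls": 4, "duration": 0.1}, {"cls": 10, "duration": 0.1}]
print(apply_coarticulation_weights(seq, {(2, 4, 10): 0.8}))
```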
And step S306, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed on every two animation frames in the animation frame sequence based on a preset interpolation algorithm, specifically, interpolation calculation may be performed on fusion deformation parameters, or key point parameters, or skeleton parameters of every two animation frames, transition frames are obtained according to calculation results, and then each transition frame is inserted into the animation frame sequence to obtain a mouth shape animation, so that smoothness of the mouth shape animation is further improved.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule; weights are respectively assigned to the visual elements in the visual sequence based on a preset weight distribution list; the duration of each visual element in the visual sequence is adjusted according to its weight; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence. Because the weights make the visual sequence better conform to the coarticulation rules, the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a device for generating the mouth shape animation, as shown in fig. 4, the device comprises:
the voice recognition module 401 is configured to obtain a target voice, and process the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
a generating module 402, configured to generate a phoneme sequence according to the pinyin information of the target text and the timestamp information;
a conversion module 403, configured to convert the phoneme sequence into a visual sequence according to a first preset conversion rule, where the visual sequence includes a plurality of visual elements, and the visual elements characterize a mouth shape visual feature during pronunciation;
and a rendering module 404, configured to render a mouth shape animation corresponding to the target voice based on the visual sequence.
In a specific application scenario, the generating module 402 is specifically configured to:
acquiring a plurality of syllables according to the pinyin information;
converting each syllable into a plurality of phonemes according to a second preset conversion rule;
and arranging each phoneme according to the timestamp information to obtain the phoneme sequence.
In a specific application scenario, the second preset conversion rule includes:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
In a specific application scenario, the device further includes a first adjustment module, configured to:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
In a specific application scenario, the device further includes a second adjustment module, configured to:
respectively distributing weights to each of the visual elements in the visual element sequence based on a preset weight distribution list;
adjusting the duration of each of the visual elements in the sequence of visual elements according to each of the weights;
the preset weight distribution list is generated according to an association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme, and the second mouth shape is the mouth shape of the phonemes adjacent to the current phoneme before and after it.
In a specific application scenario, the rendering module 404 is specifically configured to:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In a specific application scenario, the rendering module 404 is further specifically configured to:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
By applying the technical scheme, the generating device of the mouth shape animation comprises: the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information; the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information; the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence, so that the pronunciation action of the mouth shape animation is accurately matched with the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the invention also provides an electronic device, as shown in fig. 5, which comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504,
a memory 503 for storing executable instructions of the processor;
a processor 501 configured to execute via execution of the executable instructions:
acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
generating a phoneme sequence according to the pinyin information of the target text and the timestamp information;
converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
The communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include RAM (Random Access Memory ) or may include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit ), NP (Network Processor, network processor), etc.; but also DSP (Digital Signal Processing, digital signal processor), ASIC (Application Specific Integrated Circuit ), FPGA (Field Programmable Gate Array, field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
In still another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of generating a mouth-shape animation as described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of generating a mouth-shape animation as described above.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method for generating a mouth shape animation, the method comprising:
acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
generating a phoneme sequence according to the pinyin information of the target text and the timestamp information;
converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
2. The method of claim 1, wherein the generating a phoneme sequence from the pinyin information and the time stamp information of the target text comprises:
acquiring a plurality of syllables according to the pinyin information;
converting each syllable into a plurality of phonemes according to a second preset conversion rule;
and arranging each phoneme according to the timestamp information to obtain the phoneme sequence.
3. The method of claim 2, wherein the second preset conversion rule comprises:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
4. The method of claim 2, wherein after arranging each phoneme according to the time stamp information to obtain the phoneme sequence, the method further comprises:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
5. The method of claim 1, wherein after converting the phoneme sequence into the visual sequence according to the first preset conversion rule, the method further comprises:
respectively distributing weights to each of the visual elements in the visual element sequence based on a preset weight distribution list;
adjusting the duration of each of the visual elements in the sequence of visual elements according to each of the weights;
the preset weight distribution list is generated according to an association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme, and the second mouth shape is the mouth shape of the phonemes adjacent to the current phoneme before and after it.
6. The method of claim 1, wherein the rendering a mouth-shape animation corresponding to the target speech based on the sequence of visual elements comprises:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
7. The method of claim 6, wherein the generating the mouth-shaped animation based on the sequence of animation frames comprises:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
8. A device for generating a mouth-shaped animation, the device comprising:
the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information;
the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information;
the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of generating a mouth shape animation according to any of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of generating a mouth-shape animation according to any one of claims 1-7.
CN202310139936.1A 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium Pending CN116363268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310139936.1A CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310139936.1A CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116363268A true CN116363268A (en) 2023-06-30

Family

ID=86931185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310139936.1A Pending CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116363268A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115318A (en) * 2023-08-18 2023-11-24 蚂蚁区块链科技(上海)有限公司 Method and device for synthesizing mouth-shaped animation and electronic equipment
CN117115318B (en) * 2023-08-18 2024-05-28 蚂蚁区块链科技(上海)有限公司 Method and device for synthesizing mouth-shaped animation and electronic equipment
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117275485B (en) * 2023-11-22 2024-03-12 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination