CN116363268A - Method and device for generating mouth shape animation, electronic equipment and storage medium - Google Patents

Method and device for generating mouth shape animation, electronic equipment and storage medium Download PDF

Info

Publication number
CN116363268A
CN116363268A (Application CN202310139936.1A)
Authority
CN
China
Prior art keywords
sequence
preset
animation
visual
mouth shape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310139936.1A
Other languages
Chinese (zh)
Inventor
杨建顺
陈军宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Black Mirror Technology Co ltd
Original Assignee
Xiamen Black Mirror Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Black Mirror Technology Co ltd filed Critical Xiamen Black Mirror Technology Co ltd
Priority to CN202310139936.1A priority Critical patent/CN116363268A/en
Publication of CN116363268A publication Critical patent/CN116363268A/en
Pending legal-status Critical Current

Classifications

    • G06T13/205: 3D [Three Dimensional] animation driven by audio data
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G10L15/26: Speech to text systems
    • G10L21/10: Transforming into visible information
    • G10L2021/105: Synthesis of the lips movements from speech, e.g. for talking heads
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, a device, an electronic device and a storage medium for generating mouth shape animation, wherein the method comprises the following steps: acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information; generating a phoneme sequence according to the pinyin information of the target text and the timestamp information; converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and rendering the mouth shape animation corresponding to the target voice based on the visual sequence, so that the pronunciation action of the mouth shape animation is accurately matched with the target voice, and more accurate generation of the mouth shape animation is realized.

Description

Method and device for generating mouth shape animation, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating a mouth shape animation, an electronic device, and a storage medium.
Background
Voice, as a natural form of communication, plays a prominent role in the field of human-computer interaction. However, generating realistic mouth shape animation during human-computer interaction is extremely complex.
In the prior art, mouth shape animation frame data conforming to a Gaussian distribution is generally generated from only a limited set of single key-frame animations. Such a scheme can hardly reproduce the mouth shape and facial muscle movements of a person speaking normally, so the finally generated mouth shape animation does not conform to normal speaking rules.
Therefore, how to generate the mouth shape animation more accurately is a technical problem to be solved at present.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the application discloses a method, a device, electronic equipment and a storage medium for generating mouth-shaped animation, which are used for generating the mouth-shaped animation more accurately.
In a first aspect, there is provided a method for generating a mouth shape animation, the method comprising: acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information; generating a phoneme sequence according to the pinyin information of the target text and the timestamp information; converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In a second aspect, there is provided an apparatus for generating a mouth shape animation, the apparatus comprising: the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information; the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information; the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In a third aspect, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of generating a mouth shape animation according to the first aspect via execution of the executable instructions.
In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a mouth-shape animation according to the first aspect.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements and the visual elements represent the mouth shape visual characteristics during pronunciation; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence, so that the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for generating a mouth shape animation according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for generating a mouth shape animation according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for generating a mouth shape animation according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a device for generating a mouth shape animation according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It is noted that other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise construction set forth herein below and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
A method of generating digital human video according to an exemplary embodiment of the present application is described below with reference to fig. 1 to 3. It should be noted that the following application scenario is only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in any way in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The embodiment of the application provides a method for generating mouth shape animation, as shown in fig. 1, the method comprises the following steps:
step S101, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. The electronic device may be a mobile device, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device (such as glasses, a watch, etc.), or a fixed device, such as a personal computer, an intelligent television, an intelligent home/appliance (such as an air conditioner, an electric cooker, etc.), which is not limited in the embodiments of the present invention. The electronic equipment can locally perform voice recognition based on a preset voice recognition algorithm to obtain a target text with time stamp information, and the electronic equipment can also send the target voice to the server, and the server performs voice recognition based on the preset voice recognition algorithm to obtain the target text with the time stamp information.
Alternatively, the preset speech recognition algorithm may be any one of algorithms including a Dynamic Time Warping (DTW) based algorithm, a non-parametric model based Vector Quantization (VQ) method, a parametric model based Hidden Markov Model (HMM) method, an Artificial Neural Network (ANN) based algorithm, and a support vector machine.
Optionally, after the target voice is obtained, the target voice passes through a preset filter to remove noise in the target voice, so that the accuracy of the target voice is further improved.
Step S102, generating a phoneme sequence according to the pinyin information of the target text and the time stamp information.
In this embodiment, the phonemes are the smallest phonetic units that make up a syllable, and any audio segment is composed of a limited number of phonemes. The target text has pinyin information and time stamp information, and based on the pinyin information and the time stamp information, a plurality of phonemes may be generated, each phoneme constituting a phoneme sequence.
Step S103, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the visual sequence includes a plurality of visual elements, the visual elements represent the mouth shape visual characteristics during pronunciation, the mouth shape change is generated by the change of each phoneme in the phoneme sequence, each phoneme in the phoneme sequence is converted into a plurality of visual elements according to a first preset conversion rule, and each visual element forms the visual sequence.
In a specific application scenario of the present application, the first preset conversion rule is shown in table 1.
TABLE 1
Visual element class / Phonemes
Class 1: (none)
Class 2: b, p, m
Class 3: f
Class 4: d, t, n
Class 5: l
Class 6: g, k, h
Class 7: j, q, x
Class 8: z, c, s
Class 9: zh, ch, sh, r
Class 10: a
Class 11: o
Class 12: e
Class 13: i, y
Class 14: u, w
Class 15: ü
Class 16: er
Class 17: -n, -ng
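For illustration only, the mapping of Table 1 can be sketched as a small Python lookup. The class numbering follows the table; the names below and the choice to map unknown phonemes to the silent class 1 are assumptions for the sketch and do not appear in the original application.

```python
# Sketch of the phoneme-to-visual-element mapping in Table 1.
VISUAL_ELEMENT_CLASSES = {
    1: [],                      # silence / no phoneme
    2: ["b", "p", "m"],
    3: ["f"],
    4: ["d", "t", "n"],
    5: ["l"],
    6: ["g", "k", "h"],
    7: ["j", "q", "x"],
    8: ["z", "c", "s"],
    9: ["zh", "ch", "sh", "r"],
    10: ["a"],
    11: ["o"],
    12: ["e"],
    13: ["i", "y"],
    14: ["u", "w"],
    15: ["ü", "v"],             # ü is often written as v in pinyin input
    16: ["er"],
    17: ["-n", "-ng"],
}

# Invert to a phoneme -> class lookup for fast conversion.
PHONEME_TO_CLASS = {p: c for c, ps in VISUAL_ELEMENT_CLASSES.items() for p in ps}

def phonemes_to_visual_elements(phonemes):
    """Map a phoneme sequence to a sequence of visual element classes (class 1 = silence)."""
    return [PHONEME_TO_CLASS.get(p, 1) for p in phonemes]
```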
In some embodiments of the present application, in order to improve the generating efficiency of the mouth shape animation, after converting the phoneme sequence into the visual sequence according to the first preset conversion rule, the method includes:
carrying out merging processing on the visual sequence according to a preset merging rule, wherein the merging processing includes the following steps:
1) Merging the visual elements corresponding to d, t, n and l;
2) Merging the visual element corresponding to the -i of zh, ch, sh, r with that of the -i of z, c, s;
3) Letting the single vowel e and the e in compound vowels share one visual element;
4) Letting j, q, x and z, c, s share one visual element;
5) Letting u and v (ü) share one visual element.
By merging the visual sequence in this way, phonemes with similar mouth shapes share a visual element, which reduces the number of distinct visual elements and improves the generating efficiency of the mouth shape animation.
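As an illustration, the class-level merges above (d/t/n with l, j/q/x with z/c/s, u with v) can be sketched as a small remapping table. The -i merge and the single/compound e merge depend on phoneme context and are omitted here; all names in the sketch are assumptions.

```python
# Illustrative merge table: each visual element class on the left is replaced by
# the class whose mouth shape it shares. Class numbers refer to Table 1.
MERGE_RULES = {
    5: 4,    # l shares the visual element of d, t, n
    7: 8,    # j, q, x share the visual element of z, c, s
    15: 14,  # v (ü) shares the visual element of u
}

def merge_visual_elements(viseme_classes):
    """Collapse similar mouth shapes so fewer distinct visual elements are rendered."""
    return [MERGE_RULES.get(c, c) for c in viseme_classes]

# Example: "lü" -> classes [5, 15] -> merged to [4, 14]
print(merge_visual_elements([5, 15]))
```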
In some embodiments of the present application, after performing the merging processing on the visual sequence according to the preset merging rule, the method further includes:
adding a tongue-tip (top-of-tongue) action feature to the visual element corresponding to d, t, n and l;
adding a curled-tongue action feature to the visual element corresponding to zh, ch, sh and r;
adding a curled-tongue action feature based on the pronunciation of e to the visual element corresponding to er.
By adding these tongue actions to the visual sequence, each visual element better conforms to normal pronunciation rules.
Alternatively, the visual element may take the form of text information, an image frame, or another data form. For example, if the visual element is text information, it may describe mouth states such as a closed mouth, a 50% mouth opening (half open), an "O"-shaped (rounded) mouth, or a 100% mouth opening (fully open); if the visual element is an image frame, the image frames corresponding to different visual elements show different mouth shapes.
And step S104, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed between every two animation frames in the animation frame sequence based on a preset interpolation algorithm. Specifically, interpolation may be performed on the fusion deformation parameters (i.e., blendshape parameters), the key point parameters, or the skeleton parameters of every two animation frames; transition frames are obtained from the calculation results, and each transition frame is then inserted into the animation frame sequence to obtain the mouth shape animation, further improving its smoothness.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
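As a rough sketch of the transition-frame idea, the snippet below linearly interpolates fusion deformation (blendshape) parameter dictionaries between consecutive animation frames. The frame representation, the number of transition frames per gap and the function names are assumptions; a Bezier curve could be substituted for the linear ramp.

```python
def make_transition_frames(frame_a, frame_b, steps=3):
    """Linearly interpolate blendshape parameter dicts between two animation
    frames, returning `steps` transition frames."""
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        frames.append({k: (1 - t) * frame_a[k] + t * frame_b[k] for k in frame_a})
    return frames

def smooth_animation(frames, steps=3):
    """Insert transition frames between every two consecutive animation frames."""
    if not frames:
        return []
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.extend(make_transition_frames(a, b, steps))
    out.append(frames[-1])
    return out

# Example with a single "jaw open" blendshape channel.
key_frames = [{"jaw_open": 0.0}, {"jaw_open": 1.0}]
print(smooth_animation(key_frames, steps=1))  # inserts {"jaw_open": 0.5} in between
```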
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements and the visual elements represent the mouth shape visual characteristics during pronunciation; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence, so that the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a method for generating the mouth shape animation, which is shown in fig. 2 and comprises the following steps:
step S201, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. And processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information. The person skilled in the art can adopt different preset voice recognition algorithms to perform voice recognition according to the needs, and the protection scope of the application is not affected.
Step S202, acquiring a plurality of syllables according to the pinyin information of the target text.
In this embodiment, the target text has pinyin information, the pinyin is composed of syllables, and a plurality of syllables can be obtained based on the pinyin information.
Step S203, converting each syllable into a plurality of phones according to a second preset conversion rule.
In this embodiment, each syllable is composed of a plurality of phones, and each syllable can be converted into a plurality of phones based on the second preset conversion rule.
In some embodiments of the present application, the second preset conversion rule includes:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
In this embodiment, syllables can be divided into initials and finals, which need to be processed separately; the first phoneme subset and the third phoneme subset are obtained after this processing. Syllables also include some whole-recognized syllables, which likewise need to be processed separately, and the second phoneme subset is obtained after processing. In addition, if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, the preset phoneme is converted into its corresponding target phoneme according to the preset deformation rule, so that each phoneme can be obtained more accurately.
Optionally, when separating the first phoneme subset belonging to the initials, zh, ch, sh are separated first, then b, p, m, f, d, t, n, l, g, k, h, j, q, x, z, c, s, r, y, w; for y and w, yu becomes v (ü), yi/y becomes i, and wu/w becomes u. When separating the second phoneme subset belonging to the whole-recognized syllables, zhi, chi, shi, ri, zi, ci, si are separated first, and their i is converted into the special -i. When separating the third phoneme subset belonging to the finals, single vowels, front vowels, rear vowels and middle vowels are separated, and the vowel heads, vowel bellies and vowel tails of the different types of finals are processed respectively: u following j, q, x is converted into v (ü), iu is converted into iou, ui into uei, un into uen, and a vowel tail of n or ng is judged separately.
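A very simplified sketch of this initial/final separation is given below in Python. It covers only the rewrites mentioned above (whole-recognized syllables, yu/yi/wu, and u after j/q/x), ignores tones and the finer vowel-head/belly/tail split, and all identifiers are assumptions.

```python
INITIALS = ["zh", "ch", "sh",  # two-letter initials checked first
            "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "z", "c", "s", "r"]
WHOLE_SYLLABLES = {"zhi", "chi", "shi", "ri", "zi", "ci", "si"}

def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final) with a few of the
    rewrites described above; the decomposition of the final into vowel head,
    belly and tail is not handled in this sketch."""
    if syllable in WHOLE_SYLLABLES:
        return syllable[:-1], "-i"             # whole-recognized syllables use the special -i
    # y/w rewrites: yu -> v (ü), yi/y -> i, wu/w -> u
    if syllable.startswith("yu"):
        syllable = "v" + syllable[2:]
    elif syllable.startswith("yi"):
        syllable = "i" + syllable[2:]
    elif syllable.startswith("y"):
        syllable = "i" + syllable[1:]
    elif syllable.startswith("wu"):
        syllable = "u" + syllable[2:]
    elif syllable.startswith("w"):
        syllable = "u" + syllable[1:]
    for ini in INITIALS:
        if syllable.startswith(ini):
            final = syllable[len(ini):]
            if ini in ("j", "q", "x") and final.startswith("u"):
                final = "v" + final[1:]        # u after j/q/x is actually ü
            return ini, final
    return "", syllable                        # zero-initial syllable

# Examples: "zhang" -> ("zh", "ang"), "qu" -> ("q", "v"), "shi" -> ("sh", "-i")
print(split_syllable("zhang"), split_syllable("qu"), split_syllable("shi"))
```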
Step S204, arranging each phoneme according to the timestamp information to obtain a phoneme sequence.
In this embodiment, each phoneme corresponds to a different time stamp, and the phoneme sequence is obtained after the phonemes are arranged according to the time stamp information.
In some embodiments of the present application, after arranging each phoneme according to the timestamp information to obtain the phoneme sequence, the method further includes:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
In this embodiment, blank positions may exist between syllables, and a mute frame with a preset duration is added at the blank positions, so that the corresponding syllable has time for pronunciation preparation or for extending the ending of the pronunciation, which makes the pronunciation more natural and smooth; the preset duration is 0.2 s.
The pronunciation types of the phonemes include initials, vowel heads, vowel bellies and vowel tails; each pronunciation type corresponds to a different preset duration distribution ratio, and the duration of each phoneme is adjusted based on its pronunciation type and the corresponding preset duration distribution ratio, so that the pronunciation of each phoneme better conforms to normal speaking rules.
Optionally, I0, F1, F2 and F3 are used to represent the initial, the vowel head, the vowel belly and the vowel tail, respectively. The preset duration distribution ratio of I0 is 0.3 and that of F1+F2+F3 is 0.7; within F1+F2+F3, the ratio of F1 is 0.2 and the ratio of F2 is 0.8, with F3 lying inside F2. If F3 is a vowel, the preset duration distribution ratio of F3 within F2 is 0.4; if F3 is a nasal consonant, the ratio of F3 within F2 is 0.2.
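To make the duration split concrete, the following is a minimal sketch, in Python, of distributing one syllable's duration over I0, F1, F2 and F3 according to the ratios above. The function name, the handling of syllables without an initial, and the "no tail" case are assumptions not specified in the text.

```python
def split_syllable_duration(total, has_initial=True, tail="vowel"):
    """Distribute a syllable's duration over initial (I0), vowel head (F1),
    vowel belly (F2) and vowel tail (F3) using the preset ratios above.
    `tail` is "vowel", "nasal" or None; F3 is carved out of F2's share."""
    i0 = 0.3 * total if has_initial else 0.0   # initial takes 0.3 of the syllable
    final = total - i0                         # F1 + F2 + F3 share (0.7 with an initial)
    f1 = 0.2 * final                           # vowel head
    f2 = 0.8 * final                           # vowel belly (F3 lies inside it)
    if tail == "vowel":
        f3 = 0.4 * f2
    elif tail == "nasal":
        f3 = 0.2 * f2
    else:
        f3 = 0.0
    f2 -= f3
    return {"I0": i0, "F1": f1, "F2": f2, "F3": f3}

# Example: a 0.5 s syllable with a nasal coda gives roughly
# {'I0': 0.15, 'F1': 0.07, 'F2': 0.224, 'F3': 0.056} (up to floating-point rounding).
print(split_syllable_duration(0.5, tail="nasal"))
```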
Step S205, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the visual sequence includes a plurality of visual elements, the visual elements represent the mouth shape visual characteristics during pronunciation, the mouth shape change is generated by the change of each phoneme in the phoneme sequence, each phoneme in the phoneme sequence is converted into a plurality of visual elements according to a first preset conversion rule, and each visual element forms the visual sequence. In a specific application scenario of the present application, the first preset conversion rule is shown in table 1 above.
And step S206, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed on every two animation frames in the animation frame sequence based on a preset interpolation algorithm, specifically, interpolation calculation may be performed on fusion deformation parameters, or key point parameters, or skeleton parameters of every two animation frames, transition frames are obtained according to calculation results, and then each transition frame is inserted into the animation frame sequence to obtain a mouth shape animation, so that smoothness of the mouth shape animation is further improved.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a plurality of syllables are acquired according to the pinyin information of the target text; each syllable is converted into a plurality of phonemes according to a second preset conversion rule; the phonemes are arranged according to the timestamp information to obtain a phoneme sequence; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence. The phoneme sequence is thus obtained more accurately, the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a method for generating the mouth shape animation, which is shown in fig. 3 and comprises the following steps:
step S301, target voice is obtained, and the target voice is processed based on a preset voice recognition algorithm to obtain a target text with time stamp information.
In this embodiment, the target voice may be real-time voice audio input by the user through an electronic device equipped with a sound card device such as a microphone, or may be pre-recorded or stored voice audio, or may be voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. And processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information. The person skilled in the art can adopt different preset voice recognition algorithms to perform voice recognition according to the needs, and the protection scope of the application is not affected.
Step S302, generating a phoneme sequence according to the pinyin information of the target text and the time stamp information.
Step S303, converting the phoneme sequence into a visual sequence according to a first preset conversion rule.
In this embodiment, the phonemes are the smallest phonetic units that make up a syllable, and any audio segment is composed of a limited number of phonemes. The target text has pinyin information and time stamp information, and based on the pinyin information and the time stamp information, a plurality of phonemes may be generated, each phoneme constituting a phoneme sequence.
Step S304, weights are respectively allocated to each of the visual elements in the visual element sequence based on a preset weight allocation list.
In this embodiment, since the pronunciation of each Chinese character is affected to different degrees by its adjacent phonemes, the mouth shape corresponding to the current phoneme when speaking is closely related to the phonemes immediately before and after it, and the rule of coarticulation needs to be satisfied. In order to make each visual element in the visual sequence conform to the coarticulation rule, a preset weight allocation list is generated in advance according to the association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme and the second mouth shape is the mouth shape of the phonemes adjacent to it before and after. After the visual sequence is obtained, weights are respectively assigned to the visual elements based on the preset weight allocation list.
In some embodiments of the present application, the preset weight allocation list is determined by a set of piecewise weight formulas. These formulas appear as equation images in the original publication and are not reproduced here; they give the visual weights of consonant and vowel phonemes in different cases (for example, depending on whether the adjacent vowels are separated by a consonant).
wherein W_C is the visual weight of a consonant phoneme; W_V is the visual weight of a vowel phoneme; R_C and R_V are the affected levels of consonants and of vowels, respectively, each quantized to between 0 and 1; V_1 is the vowel preceding V_2, and a consonant may lie between the two vowels; α and β are coefficients for controlling the weights, and β is 1 when V_1 and V_2 are separated by a consonant. The visual weights of different phonemes can be accurately obtained through these formulas, thereby generating a preset weight allocation list that better conforms to the coarticulation rules.
Step S305, adjusting the duration of each of the visual elements in the visual element sequence according to each of the weights.
In this embodiment, the duration of each visual element is adjusted according to its weight, so that the visual sequence better conforms to the coarticulation rules, further improving the accuracy of the mouth shape animation.
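Since the weight formulas themselves are given as equation images, the snippet below only illustrates, under assumed data structures, how weights looked up from a preset weight allocation list might be used to scale the duration of each visual element. The (previous, current, next) keying and the multiplicative scaling are assumptions, not the patent's exact formulas.

```python
def apply_coarticulation_weights(visual_elements, weight_list):
    """Scale each visual element's duration by a weight looked up from a preset
    weight allocation list keyed on (previous, current, next) viseme classes."""
    adjusted = []
    for i, elem in enumerate(visual_elements):
        prev_cls = visual_elements[i - 1]["cls"] if i > 0 else None
        next_cls = visual_elements[i + 1]["cls"] if i + 1 < len(visual_elements) else None
        w = weight_list.get((prev_cls, elem["cls"], next_cls), 1.0)
        adjusted.append({**elem, "duration": elem["duration"] * w, "weight": w})
    return adjusted

# Example: shorten the middle element when it sits between classes 2 and 10.
seq = [{"cls": 2, "duration": 0.1}, {"cls": 4, "duration": 0.1}, {"cls": 10, "duration": 0.1}]
print(apply_coarticulation_weights(seq, {(2, 4, 10): 0.8}))
```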
And step S306, rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
In this embodiment, the visual sequence characterizes the mouth shape visual characteristics of each phoneme corresponding to the target voice, and the mouth shape animation corresponding to the target voice can be rendered based on the visual sequence.
In some embodiments of the present application, the rendering, based on the sequence of visual elements, a mouth-shaped animation corresponding to the target voice includes:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In this embodiment, the preset rendering engine may be Unreal Engine or Unity. The visual sequence is input into the preset rendering engine, which renders a plurality of animation frames based on the visual sequence to obtain an animation frame sequence; the mouth shape animation is then generated based on the animation frame sequence, so the mouth shape animation is generated more efficiently.
In some embodiments of the present application, the generating the mouth shape animation based on the animation frame sequence includes:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
In this embodiment, interpolation calculation is performed on every two animation frames in the animation frame sequence based on a preset interpolation algorithm, specifically, interpolation calculation may be performed on fusion deformation parameters, or key point parameters, or skeleton parameters of every two animation frames, transition frames are obtained according to calculation results, and then each transition frame is inserted into the animation frame sequence to obtain a mouth shape animation, so that smoothness of the mouth shape animation is further improved.
Alternatively, the preset interpolation algorithm may be a bezier curve interpolation algorithm or a linear interpolation algorithm.
Optionally, each two animation frames in the animation frame sequence can be weighted according to a preset weight parameter, and a transition frame is generated according to the weighted average result of each two animation frames.
By applying the above technical scheme, the target voice is obtained and processed based on a preset voice recognition algorithm to obtain a target text with time stamp information; a phoneme sequence is generated according to the pinyin information of the target text and the timestamp information; the phoneme sequence is converted into a visual sequence according to a first preset conversion rule; weights are respectively assigned to the visual elements in the visual sequence based on a preset weight distribution list; the duration of each visual element in the visual sequence is adjusted according to its weight; and the mouth shape animation corresponding to the target voice is rendered based on the visual sequence. Because the weights make the visual sequence better conform to the coarticulation rules, the pronunciation action of the mouth shape animation accurately matches the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the application also provides a device for generating the mouth shape animation, as shown in fig. 4, the device comprises:
the voice recognition module 401 is configured to obtain a target voice, and process the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
a generating module 402, configured to generate a phoneme sequence according to the pinyin information of the target text and the timestamp information;
a conversion module 403, configured to convert the phoneme sequence into a visual sequence according to a first preset conversion rule, where the visual sequence includes a plurality of visual elements, and the visual elements characterize a mouth shape visual feature during pronunciation;
and a rendering module 404, configured to render a mouth shape animation corresponding to the target voice based on the visual sequence.
In a specific application scenario, the generating module 402 is specifically configured to:
acquiring a plurality of syllables according to the pinyin information;
converting each syllable into a plurality of phonemes according to a second preset conversion rule;
and arranging each phoneme according to the timestamp information to obtain the phoneme sequence.
In a specific application scenario, the second preset conversion rule includes:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
In a specific application scenario, the device further includes a first adjustment module, configured to:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
In a specific application scenario, the device further includes a second adjustment module, configured to:
respectively distributing weights to each of the visual elements in the visual element sequence based on a preset weight distribution list;
adjusting the duration of each of the visual elements in the sequence of visual elements according to each of the weights;
the preset weight distribution list is generated according to an association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme, and the second mouth shape is the mouth shape of the phonemes adjacent to the current phoneme before and after it.
In a specific application scenario, the rendering module 404 is specifically configured to:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
In a specific application scenario, the rendering module 404 is further specifically configured to:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
By applying the technical scheme, the generating device of the mouth shape animation comprises: the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information; the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information; the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation; and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence, so that the pronunciation action of the mouth shape animation is accurately matched with the target voice, and more accurate generation of the mouth shape animation is realized.
The embodiment of the invention also provides an electronic device, as shown in fig. 5, which comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504,
a memory 503 for storing executable instructions of the processor;
a processor 501 configured to execute via execution of the executable instructions:
acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
generating a phoneme sequence according to the pinyin information of the target text and the timestamp information;
converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
The communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include RAM (Random Access Memory ) or may include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit ), NP (Network Processor, network processor), etc.; but also DSP (Digital Signal Processing, digital signal processor), ASIC (Application Specific Integrated Circuit ), FPGA (Field Programmable Gate Array, field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
In still another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of generating a mouth-shape animation as described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of generating a mouth-shape animation as described above.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method for generating a mouth shape animation, the method comprising:
acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with timestamp information;
generating a phoneme sequence according to the pinyin information of the target text and the timestamp information;
converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
2. The method of claim 1, wherein the generating a phoneme sequence from the pinyin information and the time stamp information of the target text comprises:
acquiring a plurality of syllables according to the pinyin information;
converting each syllable into a plurality of phonemes according to a second preset conversion rule;
and arranging each phoneme according to the timestamp information to obtain the phoneme sequence.
3. The method of claim 2, wherein the second preset conversion rule comprises:
separating, from each syllable, a first phoneme subset belonging to the initials, a second phoneme subset belonging to the whole-recognized syllables, and a third phoneme subset belonging to the finals;
if a preset phoneme that needs to be deformed exists in the first, second or third phoneme subset, converting the preset phoneme into a target phoneme corresponding to the preset phoneme according to a preset deformation rule.
4. The method of claim 2, wherein after arranging each phoneme according to the time stamp information to obtain the phoneme sequence, the method further comprises:
determining blank positions among syllables;
adding a mute frame with preset duration at a position corresponding to the blank position in the phoneme sequence;
adjusting the duration of each phoneme according to the pronunciation type of the phoneme and the preset duration distribution proportion corresponding to the pronunciation type;
wherein the pronunciation types include initials, vowel heads, vowel bellies and vowel tails.
5. The method of claim 1, wherein after converting the phoneme sequence into the visual sequence according to the first preset conversion rule, the method further comprises:
respectively distributing weights to each of the visual elements in the visual element sequence based on a preset weight distribution list;
adjusting the duration of each of the visual elements in the sequence of visual elements according to each of the weights;
the preset weight distribution list is generated according to an association relation between a first mouth shape and a second mouth shape, wherein the first mouth shape is the mouth shape of the current phoneme, and the second mouth shape is the mouth shape of the phonemes adjacent to the current phoneme before and after it.
6. The method of claim 1, wherein the rendering a mouth-shape animation corresponding to the target speech based on the sequence of visual elements comprises:
inputting the visual sequence into a preset rendering engine, and obtaining an animation frame sequence according to the output of the preset rendering engine;
generating the mouth shape animation based on the animation frame sequence.
7. The method of claim 6, wherein the generating the mouth-shaped animation based on the sequence of animation frames comprises:
generating a transition frame between every two animation frames in the animation frame sequence based on a preset interpolation algorithm;
and inserting each transition frame into the animation frame sequence to obtain the mouth shape animation.
8. A device for generating a mouth-shaped animation, the device comprising:
the voice recognition module is used for acquiring target voice, and processing the target voice based on a preset voice recognition algorithm to obtain a target text with time stamp information;
the generation module is used for generating a phoneme sequence according to the pinyin information of the target text and the time stamp information;
the conversion module is used for converting the phoneme sequence into a visual sequence according to a first preset conversion rule, wherein the visual sequence comprises a plurality of visual elements, and the visual elements represent mouth-shaped visual characteristics during pronunciation;
and the rendering module is used for rendering the mouth shape animation corresponding to the target voice based on the visual sequence.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of generating a mouth shape animation according to any of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of generating a mouth-shape animation according to any one of claims 1-7.
CN202310139936.1A 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium Pending CN116363268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310139936.1A CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310139936.1A CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116363268A true CN116363268A (en) 2023-06-30

Family

ID=86931185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310139936.1A Pending CN116363268A (en) 2023-02-20 2023-02-20 Method and device for generating mouth shape animation, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116363268A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115318A (en) * 2023-08-18 2023-11-24 蚂蚁区块链科技(上海)有限公司 Method and device for synthesizing mouth-shaped animation and electronic equipment
CN117115318B (en) * 2023-08-18 2024-05-28 蚂蚁区块链科技(上海)有限公司 Method and device for synthesizing mouth-shaped animation and electronic equipment
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117275485B (en) * 2023-11-22 2024-03-12 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination