CN113538641A - Animation generation method and device, storage medium and electronic equipment - Google Patents

Animation generation method and device, storage medium and electronic equipment

Info

Publication number
CN113538641A
Authority
CN
China
Prior art keywords
animation
trigger
voice
information
viseme
Prior art date
Legal status
Pending
Application number
CN202110796787.7A
Other languages
Chinese (zh)
Inventor
杜峰
王海新
吴朝阳
杨超
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110796787.7A priority Critical patent/CN113538641A/en
Publication of CN113538641A publication Critical patent/CN113538641A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The disclosure belongs to the field of computer technology and relates to an animation generation method and device, a storage medium and an electronic device. The method comprises the following steps: acquiring a produced virtual digital person and acquiring the voice information to be broadcast by the virtual digital person; performing mouth shape animation generation processing on the voice information to obtain a mouth shape animation, and performing expression animation generation processing on the voice information to obtain an expression animation; and performing limb animation generation processing on the voice information to obtain a limb animation, and synchronously rendering the virtual digital person according to the mouth shape animation, the expression animation and the limb animation. On the one hand, the method reduces the degree of manual involvement in rendering the virtual digital person and improves the speed and efficiency with which the animation content of the virtual digital person is generated; on the other hand, the animation of the virtual digital person can be generated in real time, the algorithm does not need to be retrained for a virtual digital person with a new appearance, and the application scenarios of the virtual digital person are enriched.

Description

Animation generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an animation generation method, an animation generation apparatus, a computer-readable storage medium, and an electronic device.
Background
In fields such as e-commerce, games and animation, virtual digital people, especially 3D (three-dimensional) virtual digital people, are being applied more and more widely and their influence is gradually expanding. Virtual idols built on 3D virtual digital people also have large fan communities. In e-commerce, using a virtual digital person for live streaming greatly saves labor, allowing merchants to stream around the clock without a professional host; in customer service, having the virtual digital person communicate with users shortens the distance between them and attends to the users' emotions; in shopping guidance, having the virtual digital person explain the goods helps users grasp the selling points more quickly and promotes order conversion.
At present, content production based on virtual digital people falls roughly into three categories: first, professional artists produce the model and the animation based on the text content; second, professional actors wear motion-capture and facial-expression-capture devices, record content in line with the text, and apply the captured motions and expressions to the virtual digital person's model for post-rendering; third, an AI (Artificial Intelligence) algorithm intelligently analyzes the text content and generates the corresponding video.
However, the first two production methods both require professional personnel, which leads to high production cost and a long production cycle for virtual digital person content. The third method requires a large amount of training data, a virtual digital person with a new appearance usually has to be retrained, the performance requirements on the computing equipment are high, and the generated virtual digital person video can only be rendered and stored in advance before being played, which restricts the application scenarios of virtual digital people.
In view of the above, there is a need in the art to develop a new animation generation method and apparatus.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide an animation generation method, an animation generation apparatus, a computer-readable storage medium, and an electronic device, which overcome, at least to some extent, the technical problems of a long production cycle and high cost caused by the limitations of the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of embodiments of the present invention, there is provided an animation generation method, the method including: acquiring a manufactured virtual digital person, and acquiring voice information broadcasted by the virtual digital person;
carrying out mouth shape animation generation processing on the voice information to obtain mouth shape animation, and carrying out expression animation generation processing on the voice information to obtain expression animation;
and performing limb animation generation processing on the voice information to obtain limb animation, and synchronously rendering the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
In an exemplary embodiment of the present invention, the performing a mouth animation generation process on the voice information to obtain a mouth animation includes:
performing phoneme conversion processing on the voice information to obtain a phoneme aligned with time information, and performing viseme conversion processing on the phoneme to obtain a voice viseme aligned with the time information;
and carrying out linear interpolation processing on the voice visual position and the time information to obtain the mouth shape animation.
In an exemplary embodiment of the present invention, the performing a phoneme conversion process on the speech information to obtain phonemes aligned with time information includes:
performing feature extraction processing on the voice information to obtain acoustic features and single-frame duration, and determining an acoustic state corresponding to the acoustic features by using a trained acoustic model;
and combining the acoustic states to obtain phonemes, and aligning the phonemes by using the single-frame duration to obtain the phonemes aligned with the time information of the single-frame duration.
In an exemplary embodiment of the present invention, the trained acoustic model is trained as follows:
training a voice sample to obtain an original probability value between acoustic features and an acoustic state, and performing voice decoding processing on the original probability value to obtain an acoustic state network;
and performing path search processing on the acoustic state network to obtain a target probability value so as to obtain an acoustic model representing the mapping relation between the acoustic features and the acoustic states.
In an exemplary embodiment of the present invention, the performing the viseme conversion process on the phoneme to obtain the speech viseme aligned with the time information includes:
acquiring a first mapping relation between the phoneme and the voice viseme;
and performing viseme conversion processing on the phoneme based on the first mapping relation to obtain a voice viseme aligned with the time information.
In an exemplary embodiment of the present invention, the performing linear interpolation processing on the speech viseme and the time information to obtain a mouth shape animation includes:
determining a current voice viseme in the voice visemes, and representing the current voice viseme by using virtual grid weight;
and carrying out linear interpolation processing on the current voice visual position represented by the virtual grid weight and the time information to obtain the mouth shape animation.
In an exemplary embodiment of the present invention, the performing linear interpolation processing on the current speech viseme represented by the virtual grid weight and the time information to obtain a mouth shape animation includes:
representing a target voice viseme in the voice visemes by using virtual grid weights, and carrying out viseme calculation on the current voice viseme represented by the virtual grid weights and the target voice viseme represented by the virtual grid weights to obtain a viseme calculation result;
determining the single-frame duration, the ending time of the target voice viseme and the consumed duration corresponding to the current voice viseme according to the time information, and carrying out duration calculation on the single-frame duration, the ending time and the consumed duration to obtain a duration calculation result;
and performing weight calculation on the current voice viseme represented by the virtual grid weight, the viseme calculation result and the duration calculation result to obtain a next grid weight so as to determine the mouth shape animation to be rendered according to the next grid weight.
In an exemplary embodiment of the present invention, the acquiring the voice information broadcasted by the virtual digital person includes:
acquiring text information broadcasted by the virtual digital person;
and carrying out synthetic voice conversion processing on the text information to obtain voice information.
In an exemplary embodiment of the present invention, the performing expression animation generation processing on the voice information to obtain expression animation includes:
performing expression animation configuration on the voice information to obtain default expression animation and trigger expression animation;
setting a first default animation in the default expression animation at intervals of preset time, and performing first trigger configuration on the trigger expression animation to determine a first trigger animation in the trigger expression animation;
and replacing the first default animation with the first trigger animation based on the first trigger configuration to obtain the expression animation to be rendered.
In an exemplary embodiment of the invention, the determining of the first trigger animation in the trigger expression animations according to the first trigger configuration includes:
determining first text information in the text information, and performing text trigger configuration on the trigger expression animation by using the first text information to determine a first trigger animation in the trigger expression animation; and/or
performing text emotion calculation on the text information to obtain an emotion score, and performing score trigger configuration on the trigger expression animations by using the emotion score to determine the first trigger animation in the trigger expression animations.
In an exemplary embodiment of the present invention, the performing a body animation generation process on the voice information to obtain a body animation includes:
performing limb animation configuration on the voice information to obtain default limb animation and trigger limb animation;
setting a second default animation in the default limb animations at intervals of preset time, and performing second trigger configuration on the trigger limb animation to determine a second trigger animation in the trigger limb animation;
and replacing the second default animation with the second trigger animation based on the second trigger configuration to obtain the limb animation to be rendered.
In an exemplary embodiment of the invention, the determining of the second trigger animation in the trigger limb animations includes:
determining second text information in the text information, and performing label triggering configuration on the triggered limb animation by using the second text information to determine a second triggered animation in the triggered limb animation; and/or
performing semantic analysis processing on the text information to obtain semantic information, and performing semantic trigger configuration on the trigger limb animation by using the semantic information to determine the second trigger animation in the trigger limb animations.
According to a second aspect of embodiments of the present invention, there is provided an animation generation apparatus, including: the information acquisition module is configured to acquire the manufactured virtual digital person and acquire voice information broadcasted by the virtual digital person;
the animation generation module is configured to perform mouth shape animation generation processing on the voice information to obtain mouth shape animation, and perform expression animation generation processing on the voice information to obtain expression animation;
and the synchronous rendering module is configured to perform limb animation generation processing on the voice information to obtain limb animation, and perform synchronous rendering on the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the animation generation method of any of the above-described exemplary embodiments.
According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the animation generation method in any of the exemplary embodiments described above.
As can be seen from the foregoing technical solutions, the animation generation method, the animation generation apparatus, the computer storage medium, and the electronic device in the exemplary embodiments of the present invention have at least the following advantages and positive effects:
In the method and device provided by the exemplary embodiments of the present disclosure, the mouth shape animation, the expression animation and the limb animation used to render the virtual digital person are generated from the voice information, and the virtual digital person is rendered synchronously according to them. On the one hand, this reduces the degree of manual involvement in rendering the virtual digital person and improves the speed and efficiency of generating its animation content; on the other hand, the animation of the virtual digital person can be generated in real time, the algorithm does not need to be retrained for a virtual digital person with a new appearance, and the application scenarios of the virtual digital person are enriched.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically illustrates a flow diagram of a method of animation generation in an exemplary embodiment of the disclosure;
fig. 2 schematically illustrates a flow chart of a method of acquiring voice information in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method of generating a mouth-shape animation in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of phoneme conversion processing in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of training an acoustic model in an exemplary embodiment of the disclosure;
FIG. 6 schematically illustrates a flow chart of a method of viseme conversion processing in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of a method of linear interpolation processing in an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic illustration of an interface for representing a current phonetic viseme using virtual grid weights in an exemplary embodiment of the present disclosure;
fig. 9 schematically shows a flow chart of a method of further performing linear interpolation processing in an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow diagram of a method of generating an emoji animation in an exemplary embodiment of the disclosure;
fig. 11 schematically illustrates a flow chart of a method of a first trigger configuration in an exemplary embodiment of the disclosure;
FIG. 12 schematically illustrates a flow diagram of a method of generating a limb animation in an exemplary embodiment of the disclosure;
fig. 13 schematically illustrates a flow chart of a method of a second trigger configuration in an exemplary embodiment of the disclosure;
FIG. 14 is a flow chart diagram schematically illustrating an animation generation method in an application scenario according to an exemplary embodiment of the present disclosure;
FIG. 15 is a flow diagram that schematically illustrates a method of mouth-animation generation processing under an application scenario in an exemplary embodiment of the disclosure;
FIG. 16 is a flow chart schematically illustrating a method for phoneme alignment in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 17 is a schematic diagram illustrating the structure of an animation generation apparatus according to an exemplary embodiment of the present disclosure;
FIG. 18 schematically illustrates an electronic device for implementing an animation generation method in an exemplary embodiment of the present disclosure;
fig. 19 schematically illustrates a computer-readable storage medium for implementing an animation generation method in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
In view of the problems in the related art, the present disclosure provides an animation generation method. Fig. 1 shows a flow chart of an animation generation method, which, as shown in fig. 1, comprises at least the following steps:
and S110, acquiring the manufactured virtual digital person and acquiring voice information broadcasted by the virtual digital person.
And S120, carrying out mouth shape animation generation processing on the voice information to obtain mouth shape animation, and carrying out expression animation generation processing on the voice information to obtain expression animation.
And S130, performing limb animation generation processing on the voice information to obtain limb animation, and synchronously rendering the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
In an exemplary embodiment of the present disclosure, the mouth shape animation, the expression animation and the limb animation used to render the virtual digital person are generated from the voice information, and the virtual digital person is rendered synchronously according to them. On the one hand, this reduces the degree of manual involvement in rendering the virtual digital person and improves the speed and efficiency of generating its animation content; on the other hand, the animation of the virtual digital person can be generated in real time, the algorithm does not need to be retrained for a virtual digital person with a new appearance, and the application scenarios of the virtual digital person are enriched.
The following describes each step of the animation generation method in detail.
In step S110, the created virtual digital person is acquired, and the voice information broadcast by the virtual digital person is acquired.
In an exemplary embodiment of the present disclosure, the virtual digital person may be a 3D virtual digital person.
There are many ways to produce a 3D virtual digital person. For example, a professional modeler can design and build the model manually in professional modeling software such as Maya (three-dimensional modeling and animation software), 3D Max (three-dimensional modeling, rendering and animation software) or Blender (three-dimensional graphics software), which is generally used for cartoon or anthropomorphic characters; the model can also be obtained by scanning and reconstruction with a camera array, which is generally used to reproduce the appearance of a real person. In addition, automatic model-generation algorithms based on a single photograph have been studied in recent years.
It is worth noting that, whichever production method is used, the resulting 3D virtual digital human model needs to be rigged with facial blendshapes and a skeletal skin. The blendshapes are used to generate the facial expression and mouth shape animations, and the skeletal skin is used to generate the limb animations.
A blendshape defines a number of predefined shape meshes on top of a single base mesh, and deformation animation is achieved by blending several blendshapes together. In software such as Maya or 3D Max, a blendshape is also called a morph target.
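As a rough illustration (not part of the original disclosure) of how a blendshape-rigged mesh deforms, the sketch below assumes the common formulation in which each blendshape stores per-vertex offsets from a base mesh, and the deformed mesh is the base plus a weighted sum of those offsets:

```python
import numpy as np

def apply_blendshapes(base_vertices, shape_deltas, weights):
    """base_vertices: (V, 3) rest-pose vertex positions.
    shape_deltas:     (K, V, 3) per-blendshape vertex offsets from the base.
    weights:          (K,) blendshape weights, typically in [0, 1]."""
    # Deformed mesh = base mesh + weighted sum of blendshape offsets.
    return base_vertices + np.tensordot(weights, shape_deltas, axes=1)
```

Mouth shape and expression animation then reduce to driving this weight vector over time, which is what the later steps compute.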
After the manufactured 3D virtual digital person is obtained, voice information for broadcasting of the 3D virtual digital person can be obtained.
In an alternative embodiment, fig. 2 shows a flowchart of a method for acquiring voice information, and as shown in fig. 2, the method at least includes the following steps: in step S210, text information broadcasted by the virtual digital person is acquired.
The text information specifies the text content to be broadcast by the virtual digital person and is used to generate the voice information for broadcasting. The text can be edited by the user, or supplied through an interface to an intelligent dialogue system so that intelligent conversation with the virtual digital person can be realized.
In step S220, the text information is subjected to synthesized speech conversion processing to obtain speech information.
After the text information is obtained, it may be subjected to synthesized speech conversion processing by a speech synthesis algorithm, thereby generating the voice information for the text. The speech synthesis algorithm may be a tool preset by the system for synthesizing speech; specifically, an existing speech synthesis algorithm or platform may be adopted, such as an internet company's open platform, AI speech synthesis software or an audio editing tool.
Most speech synthesis algorithms support configurable timbre and speech rate as well as multilingual synthesis. Therefore, in a specific implementation, different speech synthesis algorithms can be supported, and the appropriate algorithm can be selected according to the requirements of the actual situation.
In the exemplary embodiment, the text information can be converted into the voice information through the synthetic voice conversion processing, so that the function of providing broadcast voice for the virtual digital person is realized, the situation that the voice information cannot be played simultaneously with animation rendering due to the fact that no voice information corresponding to the text information exists is avoided, and the intelligent degree and the automatic degree of the virtual digital person rendering are improved.
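A minimal sketch of this step, shown with the offline pyttsx3 engine purely as one possible backend; the disclosure itself is engine-agnostic, and any speech synthesis platform could be substituted:

```python
import pyttsx3

def text_to_speech(text: str, wav_path: str = "broadcast.wav", rate: int = 180) -> str:
    """Convert the broadcast text into a waveform file (one possible backend)."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)       # speech rate; voice/timbre can be set similarly
    engine.save_to_file(text, wav_path)    # queue synthesis of the text to a file
    engine.runAndWait()                    # run the queued synthesis
    return wav_path
```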
In step S120, mouth shape animation generation processing is performed on the voice information to obtain a mouth shape animation, and expression animation generation processing is performed on the voice information to obtain an expression animation.
In an exemplary embodiment of the present disclosure, after obtaining the voice information broadcasted by the virtual digital human, a mouth-shape animation generation process may be performed on the voice information to obtain a corresponding mouth-shape animation.
In an alternative embodiment, fig. 3 shows a flow diagram of a method of generating a mouth-shape animation, which, as shown in fig. 3, comprises at least the following steps: in step S310, a phoneme conversion process is performed on the speech information to obtain a phoneme aligned with the time information, and a viseme conversion process is performed on the phoneme to obtain a speech viseme aligned with the time information.
In an alternative embodiment, fig. 4 shows a flow chart of a method of phoneme conversion processing, as shown in fig. 4, the method at least comprises the following steps: in step S410, a feature extraction process is performed on the voice information to obtain an acoustic feature and a single frame duration, and an acoustic state corresponding to the acoustic feature is determined by using the trained acoustic model.
The original speech information is a time-domain waveform, and this input has little description capability, so that the waveform needs to be transformed, for example, to extract the MFCC (Mel Frequency Cepstrum Coefficient) characteristics. According to the characteristics of human ears, each frame of waveform is converted into a multi-dimensional vector, and the process is the feature extraction processing. Commonly used acoustic features may be, in addition to MFCC features, PLP (Perceptual Linear Prediction) features, PITCH features, and the like.
The Mel frequency is extracted based on the auditory characteristics of human ears, and the Mel frequency and the Hz (Hertz) frequency form a nonlinear corresponding relation. Mel-frequency cepstrum coefficients (MFCCs), which are Hz spectral features calculated by using the relationship between them, have been widely used in the field of speech recognition. Due to the nonlinear corresponding relation between the Mel frequency and the Hz frequency, the calculation accuracy of the MFCC is reduced along with the increase of the frequency. Therefore, only low frequency MFCCs are often used in applications, while medium and high frequency MFCCs are discarded.
Feature extraction is performed in units of frames: in speech processing, the signal is usually divided into frames according to a certain frame length and frame shift. The frame length is the length of each frame, and the frame shift is the time offset between the starts of two consecutive frames. If framing is performed with a frame length of 25 ms and a frame shift of 10 ms, every two adjacent frames overlap by 15 ms. The feature extraction process yields an M × N matrix, where M is the dimension of the acoustic features and N is the total number of frames.
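The sketch below illustrates this framing and MFCC extraction under the 25 ms / 10 ms figures quoted above; librosa is used only as one convenient library, not something prescribed by the disclosure:

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)      # mono waveform at 16 kHz
    frame_length = int(0.025 * sr)                # 25 ms frame length
    hop_length = int(0.010 * sr)                  # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    # mfcc is an M x N matrix: M feature dimensions, N frames at a 10 ms hop.
    return mfcc, hop_length / sr                  # features and single-frame duration
```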
Further, the trained acoustic model may be utilized to determine the acoustic states corresponding to the acoustic features.
In an alternative embodiment, fig. 5 shows a flow chart of a training method of an acoustic model, as shown in fig. 5, the method at least comprises the following steps: in step S510, a speech sample is trained to obtain an original probability value between the acoustic feature and the acoustic state, and speech decoding processing is performed on the original probability value to obtain an acoustic state network.
The acoustic model is one of the most important parts of a speech system. In speech recognition and similar processes, the speech signal is converted into acoustic features, the acoustic states corresponding to the acoustic features are determined by the acoustic model, and phonemes or characters are then obtained by combining the acoustic states.
Here, an acoustic state is a basic unit of a character's pronunciation and generally refers to a finer unit obtained by further subdividing a phoneme, a phoneme being the smallest unit of pronunciation.
Training the acoustic model means obtaining, from a large amount of corpus data, the probability P(o|s_i) linking frame features to states, i.e. the original probability values between the acoustic features and the acoustic states, where o is the feature of a frame and s_i is state i; the state corresponding to a given frame is the state with the highest probability. In practice, since each frame is very short, adjacent frames should in most cases share the same state.
To achieve this, HMMs (Hidden Markov models) are used to construct the acoustic state network. In addition, the acoustic state network may also be constructed by using other models, which is not particularly limited in the present exemplary embodiment.
In step S520, a path search process is performed on the acoustic state network to obtain a target probability value, so as to obtain an acoustic model representing a mapping relationship between the acoustic features and the acoustic states.
After the acoustic state network is obtained, the optimal path can be obtained by performing path search processing in the acoustic state network. The optimal path is the path with the maximum probability, the probability of the path is determined as a target probability value, an acoustic model is formed, and the mapping relation between the acoustic features and the acoustic states is represented by the acoustic model.
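As one concrete (assumed) realization of this path search, a Viterbi-style dynamic program over per-frame state probabilities finds the maximum-probability path through the HMM state network:

```python
import numpy as np

def viterbi(log_emission, log_transition, log_start):
    """log_emission: (N, S) per-frame log P(feature | state)
    log_transition: (S, S) log transition probabilities
    log_start:      (S,) log initial-state probabilities
    Returns the most probable state sequence and its log probability."""
    n_frames, n_states = log_emission.shape
    score = log_start + log_emission[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_transition           # cand[i, j]: arrive in state j from i
        back[t] = np.argmax(cand, axis=0)                # best predecessor for each state
        score = cand[back[t], np.arange(n_states)] + log_emission[t]
    path = [int(np.argmax(score))]                       # best final state
    for t in range(n_frames - 1, 0, -1):                 # trace back through predecessors
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(score))
```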
In the exemplary embodiment, the acoustic model is obtained through the speech decoding processing and the path search processing training, the training mode is simple and accurate, and a data basis is provided for determining the acoustic state corresponding to the acoustic feature.
After the acoustic model is trained, it can be used to convert the acoustic features into the corresponding acoustic states. Specifically, a speech vector composed of several unit speech signals (for example, 5) may be set to correspond to one acoustic state; that is, several frames of speech correspond to one acoustic state.
In step S420, the acoustic states are combined to obtain phonemes, and the phonemes are aligned by using the single frame duration to obtain phonemes aligned with the time information of the single frame duration.
Since several acoustic states correspond to one phoneme, every three acoustic states can be combined into one phoneme. The phonemes are then combined with the time information given by the single-frame duration to obtain phonemes aligned with the time information.
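A simplified sketch of this alignment step, assuming each frame has already been assigned an acoustic state and that a `state_to_phoneme` lookup (a hypothetical table) maps states back to their phoneme:

```python
def align_phonemes(state_path, frame_duration, state_to_phoneme):
    """state_path: per-frame state ids; frame_duration: single-frame duration in seconds.
    Returns a list of (phoneme, start_s, end_s) segments aligned to time."""
    phonemes = []
    for i, state in enumerate(state_path):
        phoneme = state_to_phoneme[state]
        start = i * frame_duration
        end = start + frame_duration
        if phonemes and phonemes[-1][0] == phoneme:
            # Consecutive frames of the same phoneme extend the current segment.
            phonemes[-1] = (phoneme, phonemes[-1][1], end)
        else:
            phonemes.append((phoneme, start, end))
    return phonemes
```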
In the exemplary embodiment, the voice information is converted into phonemes aligned with time information, which provides a phoneme basis for the subsequent generation of the mouth shape animation, reduces manual involvement in the phoneme generation step, and improves the generation speed and rendering efficiency of the virtual digital person.
After the phoneme aligned with the time information is obtained, the phoneme may be subjected to a viseme conversion process to obtain a speech viseme aligned with the time information.
In an alternative embodiment, fig. 6 shows a flow chart of a method of viseme conversion processing, as shown in fig. 6, the method at least includes the following steps: in step S610, a first mapping relationship between phonemes and speech visemes is obtained.
A speech viseme is the mouth shape corresponding to the pronunciation of a phoneme. Some phonemes are pronounced differently but share the same mouth shape; for example, b, p and m in Chinese Pinyin correspond to the same speech viseme. Therefore, a first mapping relationship between phonemes and speech visemes can be obtained in order to convert the phonemes into the corresponding speech visemes.
In step S620, based on the first mapping relationship, the phoneme is subjected to viseme conversion processing to obtain a speech viseme aligned with the time information.
The corresponding information of the speech viseme and the time axis can be obtained through the first mapping relation between the phoneme and the speech viseme. That is, for a segment of speech there are a series of corresponding speech visemes, each corresponding to a start time and an end time.
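A minimal sketch of the first mapping relation and the conversion it drives; the mapping table shown is a tiny illustrative excerpt rather than the full table used in practice:

```python
# Several phonemes that share a mouth shape map to one viseme.
PHONEME_TO_VISEME = {
    "b": "BMP", "p": "BMP", "m": "BMP",   # same lip-closure mouth shape
    "f": "FV",
    "a": "A", "o": "O", "e": "E",
}

def phonemes_to_visemes(aligned_phonemes):
    """aligned_phonemes: list of (phoneme, start_s, end_s);
    returns the corresponding list of (viseme, start_s, end_s)."""
    return [(PHONEME_TO_VISEME.get(p, "REST"), start, end)
            for p, start, end in aligned_phonemes]
```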
In the exemplary embodiment, the phonemes are subjected to viseme conversion to obtain time-aligned speech visemes. The conversion is simple and accurate, provides a viseme basis for the subsequent generation of the mouth shape animation, reduces manual involvement in the viseme generation step, and improves the generation speed and rendering efficiency of the virtual digital person.
In step S320, linear interpolation processing is performed on the speech viseme and the time information to obtain a mouth shape animation.
After the voice viseme is obtained, linear interpolation processing can be carried out on the voice viseme and the time information to obtain the corresponding mouth shape animation.
In an alternative embodiment, fig. 7 shows a flow chart of a method of linear interpolation processing, as shown in fig. 7, the method at least includes the following steps: in step S710, a current speech viseme is determined among the speech visemes, and the current speech viseme is represented by a virtual grid weight.
The virtual grid weights may be blendshape weights. A blendshape defines a number of predefined shape meshes based on mesh deformation, deformation animation is achieved by blending several blendshapes, and a blendshape is also called a morph target in software such as Maya or 3D Max.
FIG. 8 is a schematic diagram of an interface for representing the current speech viseme by virtual grid weights. As shown in FIG. 8, representing a speech viseme with virtual grid weights means that a mouth shape is defined for each speech viseme in advance, and the corresponding mouth shape is obtained by adjusting the weights of the blendshapes. Generally, assuming the virtual digital human model has k blendshapes in total, a speech viseme is represented by the vector of blendshape weights, denoted W = [w_0, w_1, …, w_k].
Specifically, when the mouth shape switches from its current shape to the next speech viseme, in order to obtain a smooth mouth shape animation, the current speech viseme is determined among the speech visemes, and the weight vector corresponding to the current speech viseme is denoted W_current.
In step S720, linear interpolation processing is performed on the current speech viseme and the time information represented by the virtual grid weight to obtain a mouth shape animation.
In an alternative embodiment, fig. 9 shows a flow chart of a method for further performing linear interpolation processing, and as shown in fig. 9, the method at least includes the following steps: in step S910, a target speech viseme in the speech visemes is represented by the virtual grid weight, and the current speech viseme represented by the virtual grid weight and the target speech viseme represented by the virtual grid weight are subjected to viseme calculation to obtain a viseme calculation result.
After the current speech viseme is determined, a target speech viseme may also be determined among the speech visemes. The target speech viseme is likewise represented by virtual grid weights, denoted W_target.
The current speech viseme and the target speech viseme represented by the virtual grid weights can then be subjected to viseme calculation, for example a difference calculation, to obtain the viseme calculation result.
In step S920, the single frame duration, the ending time of the target speech viseme, and the consumed duration corresponding to the current speech viseme are determined according to the time information, and duration calculation is performed on the single frame duration, the ending time, and the consumed duration to obtain a duration calculation result.
The single-frame duration Δt, the end time t_end of the target speech viseme, and the elapsed time t from the first frame to the current frame are determined from the time information. The single-frame duration, the end time and the elapsed time are then subjected to difference and division calculations to obtain the corresponding duration calculation result.
In step S930, the current speech viseme, the viseme calculation result, and the duration calculation result represented by the virtual grid weight are subjected to weight calculation to obtain a next grid weight, so as to determine the mouth-shaped animation to be rendered according to the next grid weight.
After the viseme calculation result and the duration calculation result are obtained, weight calculation may be performed on the current speech viseme represented by the virtual grid weights, the viseme calculation result and the duration calculation result, for example by summing and multiplying, to obtain the weight vector of the next frame, that is, the next grid weight. During rendering, the weight of each frame is applied to the corresponding blendshape to obtain a smooth mouth shape animation.
Specifically, the next grid weight can be obtained by referring to formula (1):

W_next = W_current + (W_target − W_current) × Δt / (t_end − t)    (1)

where (W_target − W_current) represents the viseme calculation result and Δt / (t_end − t) represents the duration calculation result.
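A per-frame sketch of applying formula (1); the clamp of the weights to [0, 1] and the guard against a zero denominator are added assumptions rather than part of the original description:

```python
import numpy as np

def next_mouth_weights(w_current, w_target, frame_dt, t_end, t_elapsed):
    """w_current, w_target: (K,) blendshape weight vectors
    frame_dt:  duration of a single frame in seconds
    t_end:     end time of the target viseme
    t_elapsed: time consumed up to the current frame"""
    remaining = max(t_end - t_elapsed, frame_dt)   # keep at least one frame remaining
    # Advance the weights one frame toward the target viseme (formula (1)).
    w_next = w_current + (w_target - w_current) * (frame_dt / remaining)
    return np.clip(w_next, 0.0, 1.0)
```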
In the exemplary embodiment, linear interpolation of the speech visemes yields a smooth mouth shape animation, ensuring the smoothness and vividness of the virtual digital person's rendering effect.
After the mouth-shape animation is generated, the matching of the expression animation is also required in order to obtain a more vivid virtual digital person. Therefore, the corresponding expression animation can be obtained by performing expression animation generation processing on the voice information.
In an alternative embodiment, fig. 10 shows a flow diagram of a method for generating an expression animation, as shown in fig. 10, the method at least comprises the following steps: in step S1010, the speech information is subjected to emotion animation configuration to obtain a default emotion animation and a trigger emotion animation.
When expression animation configuration is performed on the voice information, a default expression animation can be configured so that the virtual digital person always shows micro-expressions like a real person and does not appear too stiff. The default expression animations may include blinking, eyebrow raising and small eye movements, among others.
On the other hand, considering that the virtual digital person should make emotion-related expressions, such as happiness or surprise, that match the voice information, trigger expression animations can also be configured accordingly.
The required default expression animations and trigger expression animations can be produced and stored in advance, and rendered when triggered. It is worth noting that the two types of expression animation have different triggering logic.
In step S1020, a first default animation of the default emotions is set at intervals of a preset time, and a first trigger configuration is performed on the trigger emotions to determine a first trigger animation of the trigger emotions.
Specifically, the default expression animation may be triggered at random. At preset intervals, one or more default expression animations can be drawn at random from a library of default expression animations as the first default animation to be rendered. To approximate a real person's micro-expressions, the preset interval can be kept between 0.5 s and 4 s, in line with the frequency of a real person's blinking, eyebrow raising and small eye movements.
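A small sketch of this random triggering; the clip names and the uniform sampling of the interval are illustrative assumptions:

```python
import random

def schedule_default_expressions(total_duration, default_clips):
    """Returns (start_time, clip_name) pairs covering total_duration seconds."""
    schedule, t = [], 0.0
    while t < total_duration:
        schedule.append((t, random.choice(default_clips)))   # pick a default micro-expression
        t += random.uniform(0.5, 4.0)                        # preset interval of 0.5-4 s
    return schedule

# Example: schedule_default_expressions(30.0, ["blink", "raise_eyebrow", "eye_drift"])
```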
In addition, the first trigger configuration for the trigger expression animations can use two kinds of trigger logic: text triggering and score triggering.
In an alternative embodiment, fig. 11 shows a flowchart of a method of a first trigger configuration, as shown in fig. 11, the method at least includes the following steps: in step S1110, first text information is determined in the text information, and a text trigger configuration is performed on the trigger emoji animation using the first text information to determine a first trigger animation in the trigger emoji animation.
The first text information may be text information in which a user specifies that a certain word or phrase triggers a corresponding first trigger animation. Specifically, the first text information may be marked with a symbol such as # as a tag, and a corresponding first trigger animation may be inserted into a time information position corresponding to the first text information, so as to implement the first trigger configuration.
In step S1120, text emotion calculation is performed on the text information to obtain an emotion score, and score trigger configuration is performed on the trigger expression animations by using the emotion score to determine the first trigger animation in the trigger expression animations.
Text emotion calculation is performed through NLP (Natural Language Processing) to obtain the emotion score corresponding to the text information, so as to analyze the emotional information of the current text.
Further, the corresponding rendering of the first trigger animation is triggered through the emotion scores.
For example, the emotion score may lie between -1 and 1, where 0 represents neutral emotion, a score greater than 0 represents positive emotion, and a score less than 0 represents negative emotion. A positive first trigger animation, such as laughing or smiling, is then determined from the trigger expression animations corresponding to the positive emotion information.
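A sketch of score triggering along these lines; the threshold values and clip names are assumptions made for illustration:

```python
def pick_trigger_expression(emotion_score):
    """emotion_score in [-1, 1]: 0 neutral, > 0 positive, < 0 negative."""
    if emotion_score > 0.5:
        return "laugh"
    if emotion_score > 0.0:
        return "smile"
    if emotion_score < -0.5:
        return "sad"
    if emotion_score < 0.0:
        return "frown"
    return None   # neutral text keeps the default expression animation
```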
In the exemplary embodiment, the first trigger configuration provides two kinds of trigger logic for the trigger expression animations, which not only approximates the rendering effect of a real person but also allows the emotion animation corresponding to the text information to be rendered, giving a better rendering effect.
In step S1030, based on the first trigger configuration, the first default animation is replaced with the first trigger animation, so as to obtain an expression animation to be rendered.
Based on the trigger logic of the first trigger configuration and considering the expression reality of the digital virtual human, the first default animation of the position where the first trigger animation needs to be rendered can be replaced by the first trigger animation so as to obtain the corresponding expression animation to be rendered.
In the exemplary embodiment, the expression animation required by the virtual digital person can be obtained through the expression animation generation mode, the rendering dimensionality of the expression animation is enriched, the expression reality degree and richness of the virtual digital person are improved, and the rendering effect is better.
In step S130, a body animation generation process is performed on the voice information to obtain a body animation, and the virtual digital person is rendered synchronously according to the mouth animation, the expression animation, and the body animation.
In an exemplary embodiment of the present disclosure, fig. 12 shows a flowchart of a method of generating a limb animation, as shown in fig. 12, the method comprising at least the steps of: in step S1210, a body animation configuration is performed on the voice information to obtain a default body animation and a trigger body animation.
When limb animation configuration is performed on the voice information, a default limb animation can be configured so that the virtual digital person always shows body movements like a real person and therefore looks more realistic. The default limb animations may include breathing, slight body sway, and the like.
On the other hand, considering that the virtual digital person should perform text-related actions, such as waving, bowing and other gestures, in accordance with the voice information, trigger limb animations can also be configured accordingly.
Furthermore, according to the application scene of the virtual digital human, related default limb animations and trigger limb animations can be made in advance, classified and stored, and rendering is triggered when needed. It is worth noting that the triggering logic of the two types of limb animations is not the same.
In step S1220, a second default animation of the default limb animations is set at intervals of a preset time, and a second trigger configuration is performed on the trigger limb animation to determine a second trigger animation of the trigger limb animations.
Specifically, the default limb animation may be triggered randomly and played continuously in a loop. At preset time intervals, one or a plurality of default limb animations can be extracted from the material library of the default limb animations to serve as second default animations for rendering. The preset time interval may also be set according to the requirements of the actual situation, and this is not particularly limited in this exemplary embodiment.
In addition, the second trigger configuration for the trigger limb animations can use two kinds of trigger logic: tag (text) triggering and semantic triggering.
In an alternative embodiment, fig. 13 shows a flowchart of a method of the second trigger configuration, and as shown in fig. 13, the method at least includes the following steps: in step S1310, second text information is determined in the text information, and the tag trigger configuration for the triggered body animation using the second text information determines a second trigger animation in the triggered body animation.
The second text information may be text information in which the user specifies that a certain word or phrase triggers the corresponding second trigger animation. Specifically, the second text information may be marked with a symbol such as # as a tag, and the corresponding second trigger animation may be inserted at the time position corresponding to the second text information, so as to implement the second trigger configuration.
In step S1320, semantic analysis is performed on the text information to obtain semantic information, and semantic trigger configuration is performed on the triggered body animation by using the semantic information to determine a second triggered animation in the triggered body animation.
Semantic analysis processing is performed through NLP to obtain semantic information representing the semantics of the text information, so that the content of the current text information can be analyzed and the corresponding second trigger animation triggered.
For example, when the semantic information characterizes the semantics of "call", the corresponding "call" limb animation among the trigger limb animations may be determined as the second trigger animation.
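A combined sketch of the tag and semantic trigger routes; the `#...#` tag format, the clip names and the semantic labels are illustrative assumptions:

```python
import re

TAG_PATTERN = re.compile(r"#(\w+)#")                  # user-marked tags in the text
SEMANTIC_TO_CLIP = {"greeting": "wave", "thanks": "bow"}

def pick_trigger_body_animation(text, semantic_label=None):
    tags = TAG_PATTERN.findall(text)                   # tag-triggered clip, if any
    if tags:
        return tags[0]
    if semantic_label in SEMANTIC_TO_CLIP:             # semantics-triggered clip
        return SEMANTIC_TO_CLIP[semantic_label]
    return None                                        # keep the default limb animation
```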
In the exemplary embodiment, the second trigger configuration provides two kinds of trigger logic for the trigger limb animations, which not only approximates the rendering effect of a real person but also allows the action animation corresponding to the text information to be rendered, giving a better rendering effect.
In step S1230, the second default animation is replaced with the second trigger animation based on the second trigger configuration, so as to obtain the body animation to be rendered.
Based on the trigger logic of the second trigger configuration and considering the body reality of the digital virtual human, the second default animation of the position where the second trigger animation needs to be rendered can be replaced by the second trigger animation so as to obtain the corresponding body animation to be rendered.
In the exemplary embodiment, the limb animation required by the virtual digital person can be obtained through the limb animation generation mode, the rendering dimensionality of the limb animation is enriched, the limb reality degree and the richness of the virtual digital person are improved, and the rendering effect is better.
After the mouth shape animation, the expression animation and the limb animation are obtained, they can be rendered synchronously for the virtual digital person according to the corresponding time information, so that the complete virtual digital person animation is obtained. While broadcasting the text or voice information input by the user, the virtual digital person thus performs coordinated and appropriate expression and limb animations.
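A schematic rendering loop tying the three tracks together; the renderer methods and the per-track lookup callables are hypothetical placeholders for whatever engine actually drives the model:

```python
def render_virtual_human(renderer, duration, mouth_weights_at, expression_clip_at, body_clip_at, fps=30):
    """renderer: engine object exposing set_blendshape_weights / play_clip / draw_frame
    (hypothetical names); the *_at arguments map a timestamp in seconds to the data
    for that time, so all three tracks stay aligned with the voice playback clock."""
    frame_dt = 1.0 / fps
    t = 0.0
    while t < duration:
        renderer.set_blendshape_weights(mouth_weights_at(t))   # mouth shape track
        renderer.play_clip(expression_clip_at(t))              # expression track
        renderer.play_clip(body_clip_at(t))                    # limb track
        renderer.draw_frame()                                  # one synchronized frame
        t += frame_dt
```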
The animation generation method in the embodiment of the present disclosure is described in detail below with reference to an application scenario.
Fig. 14 is a flowchart illustrating an animation generation method in an application scenario, and as shown in fig. 14, in step S1410, text information is acquired.
The text information is information of text content broadcasted by a given virtual digital person and is used for generating voice information broadcasted by the virtual digital person.
In step S1420, a voice conversion process is synthesized.
After the text information is obtained, the text information may be subjected to synthesized speech conversion processing using a speech synthesis algorithm, thereby generating the speech information of the text information. The speech synthesis algorithm may be a tool preset by the system for synthesizing speech. Specifically, the speech synthesis algorithm may adopt an existing speech synthesis algorithm or platform, such as an Internet company's open platform, AI (Artificial Intelligence) speech synthesis software, or an audio editor such as Cool Edit.
Most speech synthesis algorithms support configuration of timbre and speech rate as well as multilingual synthesis. Therefore, in a specific implementation, different speech synthesis algorithms can be made compatible, and the corresponding speech synthesis algorithm can be selected according to the requirements of the actual situation.
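A minimal sketch of this synthesized speech conversion step, using the open-source pyttsx3 engine as a stand-in (the patent does not name this tool); speech rate is exposed as an engine property, and the output is written to a WAV file for the later alignment steps.

import pyttsx3

def text_to_speech(text, wav_path, rate=180):
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)        # speech rate (words per minute)
    engine.save_to_file(text, wav_path)     # write the synthesized speech to a WAV file
    engine.runAndWait()

text_to_speech("Welcome to the broadcast.", "broadcast.wav")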
In step S1430, the mouth shape animation is synthesized.
Fig. 15 is a flowchart showing a method of the mouth animation generation process in the application scene, and as shown in fig. 15, in step S1510, phonemes are aligned.
Fig. 16 is a flowchart illustrating a method for phoneme alignment in an application scenario, and as shown in fig. 16, in step S1610, features are extracted.
The original speech information is a time-domain waveform, and this input has limited descriptive power by itself, so the waveform needs to be transformed, for example, to extract MFCC (Mel Frequency Cepstrum Coefficient) features. According to the characteristics of the human ear, each frame of the waveform is converted into a multi-dimensional vector; this process is the feature extraction processing. In addition to MFCC features, commonly used acoustic features include PLP (Perceptual Linear Prediction) features, pitch features, and the like.
The Mel frequency is extracted based on the auditory characteristics of human ears, and the Mel frequency and the Hz (Hertz) frequency form a nonlinear corresponding relation. Mel-frequency cepstrum coefficients (MFCCs), which are Hz spectral features calculated by using the relationship between them, have been widely used in the field of speech recognition. Due to the nonlinear corresponding relation between the Mel frequency and the Hz frequency, the calculation accuracy of the MFCC is reduced along with the increase of the frequency. Therefore, only low frequency MFCCs are often used in applications, while medium and high frequency MFCCs are discarded.
The feature extraction processing is performed in units of frames; in speech processing, the signal is usually divided into frames with a certain frame length and frame shift. The frame length refers to the length of each frame, and the frame shift is the time offset between the starts of two adjacent frames. If framing is performed with a frame length of 25 ms and a frame shift of 10 ms, there is an overlap of 15 ms between every two adjacent frames. An M × N matrix can be obtained by the feature extraction processing, where M represents the dimension of the acoustic features and N represents the total number of frames.
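A feature extraction sketch matching the framing described above (25 ms frame length, 10 ms frame shift), using the librosa library as one possible implementation; the resulting matrix is M × N with M feature dimensions and N frames, and the parameter values are illustrative.

import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)
    frame_length = int(0.025 * sr)   # 25 ms frame length
    hop_length = int(0.010 * sr)     # 10 ms frame shift -> 15 ms overlap between adjacent frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    return mfcc                      # shape (M, N): M = n_mfcc dimensions, N = total frame count

features = extract_mfcc("broadcast.wav")
single_frame_duration = 0.010        # the frame shift, used later as the single-frame duration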
In step S1620, the acoustic model is trained.
Further, the trained acoustic model may be utilized to determine the acoustic states corresponding to the acoustic features.
Acoustic modeling is one of the most important parts of a speech system. In processes such as speech recognition, speech signals are converted into acoustic features, the acoustic states corresponding to the acoustic features are then determined by using the acoustic model, and phonemes or characters can be obtained by combining the acoustic states.
Here, an acoustic state is a basic unit constituting the pronunciation of a character, and generally refers to a smaller unit obtained by further dividing a phoneme; a phoneme is the smallest unit of pronunciation.
The training of the acoustic model is to obtain, through a large amount of linguistic data, the probability P(o|s_i) corresponding to the frame features and the states, that is, the original probability values between the acoustic features and the acoustic states, where o is the feature of a frame and s_i is state i. The state corresponding to a certain frame is the state with the highest probability. In practice, the states of adjacent frames should be the same in most cases, since the time of each frame is short.
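An illustrative sketch of how the trained acoustic model's output could be used: given a matrix of probabilities P(o|s_i) for each frame feature o and state s_i, the state assigned to each frame is the one with the highest probability. The probability matrix here is random placeholder data, not the output of a real model.

import numpy as np

rng = np.random.default_rng(0)
num_frames, num_states = 200, 48
frame_state_prob = rng.random((num_frames, num_states))         # stand-in for P(o|s_i)
frame_state_prob /= frame_state_prob.sum(axis=1, keepdims=True) # normalize each frame's row

frame_states = frame_state_prob.argmax(axis=1)   # most probable acoustic state per frame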
In step S1630, a speech decoding and search algorithm is applied.
To achieve this, HMMs (Hidden Markov models) are used to construct the acoustic state network. In addition, the acoustic state network may also be constructed by using other models, which is not particularly limited in the present exemplary embodiment.
After the acoustic state network is obtained, the optimal path can be found by performing path search processing in the acoustic state network. The optimal path is the path with the maximum probability; the probability of this path is determined as the target probability value, and an acoustic model representing the mapping relation between the acoustic features and the acoustic states is thereby formed.
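The path search over the HMM-based acoustic state network can be illustrated with a standard Viterbi algorithm; this is a common choice rather than the patent's prescribed method, and the emission, transition and initial matrices would come from the trained model (they are not constructed here).

import numpy as np

def viterbi(log_emission, log_transition, log_initial):
    """log_emission: (T, S) log P(o_t | s); log_transition: (S, S); log_initial: (S,)."""
    T, S = log_emission.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_initial + log_emission[0]
    for t in range(1, T):
        candidates = score[t - 1][:, None] + log_transition   # (S, S): prev state -> current state
        back[t] = candidates.argmax(axis=0)                   # best predecessor for each state
        score[t] = candidates.max(axis=0) + log_emission[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                             # backtrack the optimal path
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return path, float(score[-1].max())   # optimal state path and its log target probability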
In step S1640, phoneme and time axis information is obtained.
After the acoustic model is trained, the acoustic features can be converted into corresponding acoustic states by using the acoustic model. Specifically, a speech vector composed of a plurality of unit speech signals (e.g., 5) may be set to correspond to one acoustic state, that is, several frames of speech correspond to one acoustic state.
Since a plurality of acoustic states correspond to one phoneme, every three acoustic states can be combined into one phoneme. Furthermore, the phonemes are combined with the time information of the single-frame duration to obtain the result of aligning the phonemes with the time information.
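A hypothetical sketch of the grouping just described: consecutive identical frame states are collapsed, every three states form one phoneme, and the single-frame duration yields start and end times. The state-to-phoneme mapping is assumed to be supplied by the caller; all names are illustrative.

from itertools import groupby

def states_to_aligned_phonemes(frame_states, state_to_phoneme, frame_duration=0.010):
    # collapse consecutive identical frame states into (state, frame_count) runs
    runs = [(state, sum(1 for _ in grp)) for state, grp in groupby(frame_states)]
    phonemes, frame_cursor = [], 0
    for i in range(0, len(runs) - 2, 3):            # every three states -> one phoneme
        group = runs[i:i + 3]
        n_frames = sum(count for _, count in group)
        start = frame_cursor * frame_duration
        end = (frame_cursor + n_frames) * frame_duration
        phonemes.append((state_to_phoneme[group[0][0]], start, end))
        frame_cursor += n_frames
    return phonemes                                  # [(phoneme, start_time, end_time), ...]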
In addition, the generation of the mouth-shaped animation can also combine text information and voice information to obtain the result of aligning the phoneme with the time axis through a voice text alignment algorithm.
In step S1520, the visemes are mapped.
A speech viseme is the mouth shape corresponding to a phoneme when it is pronounced. Some phonemes are pronounced differently but correspond to the same mouth shape. For example, b, p and m in Chinese Pinyin correspond to the same viseme. Therefore, a first mapping relation between phonemes and speech visemes may be obtained to further convert the phonemes into corresponding speech visemes.
The corresponding information of the speech viseme and the time axis can be obtained through the first mapping relation between the phoneme and the speech viseme. That is, for a segment of speech there are a series of corresponding speech visemes, each corresponding to a start time and an end time.
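A sketch of applying the first mapping relation: the mapping table below is a tiny illustrative excerpt (b, p and m share one viseme, as noted above), and each phoneme's start and end times carry over to its viseme unchanged.

PHONEME_TO_VISEME = {
    "b": "BMP", "p": "BMP", "m": "BMP",   # same mouth shape for b, p, m
    "a": "AA",  "o": "OO",  "i": "EE",
}

def phonemes_to_visemes(aligned_phonemes):
    """aligned_phonemes: [(phoneme, start_time, end_time), ...] from the alignment step."""
    return [(PHONEME_TO_VISEME.get(p, "REST"), start, end)
            for p, start, end in aligned_phonemes]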
In step S1530, the blendshape weight is interpolated.
After the voice viseme is obtained, linear interpolation processing can be carried out on the voice viseme and the time information to obtain the corresponding mouth shape animation.
The virtual grid weight may be a blendshape weight. A blendshape implements a plurality of predefined shape meshes based on mesh deformation, and deformation animation is realized by combining a plurality of blendshapes; blendshapes are also called morph targets in software such as Maya or 3ds Max.
Expressing the speech visemes with virtual grid weights means that, for the mouth shape corresponding to each speech viseme, the blendshape weights are adjusted in advance to obtain that mouth shape. Generally, assuming that a virtual digital human model has k blendshapes in total, a speech viseme is represented by a vector composed of the weight of each blendshape, denoted as W = [w_0, w_1, …, w_k].
Specifically, when the mouth shape is switched from the current shape to the next speech viseme, in order to obtain a smooth mouth shape animation, the current speech viseme is determined among the speech visemes, and the weight vector corresponding to the current speech viseme is set as W_current.
After determining the current speech viseme, a target speech viseme may also be determined among the speech visemes. Further, the target speech viseme is represented by a virtual grid weight, for example W_target.
Further, viseme calculation, for example difference calculation, can be performed on the current speech viseme and the target speech viseme represented by the virtual grid weights to obtain a viseme calculation result.
The single-frame duration Δt, the end time t_end of the target speech viseme, and the elapsed time t from the first frame to the current frame are determined according to the time information. Further, difference calculation and division calculation are performed on the single-frame duration, the end time and the elapsed time to obtain a corresponding duration calculation result.
After the viseme calculation result and the duration calculation result are obtained, weight calculation, for example multiplication and summation, may be performed on the current speech viseme represented by the virtual grid weight, the viseme calculation result and the duration calculation result to obtain the weight vector corresponding to the next frame, that is, the next grid weight. During rendering, the weight of each frame is set to the corresponding blendshape to obtain a smooth mouth shape animation.
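One possible reading of this interpolation, assuming standard linear interpolation of blendshape weight vectors: the viseme calculation is the difference W_target − W_current, the duration calculation is Δt / (t_end − t), and the weight calculation adds the scaled difference to W_current. Function and parameter names (and the 52-dimensional weight vector in the example) are illustrative only.

import numpy as np

def next_frame_weights(w_current, w_target, t_elapsed, t_end, dt):
    """w_current, w_target: blendshape weight vectors of the current and target visemes;
    t_elapsed: time consumed from the first frame to the current frame;
    t_end: end time of the target speech viseme; dt: single-frame duration."""
    remaining = max(t_end - t_elapsed, dt)            # avoid division by zero at the boundary
    step = (w_target - w_current) * (dt / remaining)  # viseme difference scaled by the duration ratio
    return w_current + step                           # weight vector for the next frame

w_next = next_frame_weights(np.zeros(52), np.ones(52) * 0.6,
                            t_elapsed=0.30, t_end=0.40, dt=0.010)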
In step S1440, the expression animation is synthesized.
When expression animation configuration is performed on the voice information, considering that the virtual digital person should, like a real person, always show micro-expressions so as not to appear too stiff, a default expression animation can be configured correspondingly. The default expression animation may include blinking, eyebrow raising, small eye movements, and the like.
On the other hand, considering that the virtual digital person may make emotion-related expressions, such as happiness or surprise, in coordination with the voice information, a trigger expression animation can also be configured correspondingly.
And the required default expression animation and the required trigger expression animation can be manufactured and stored in advance, and the rendering can be triggered when required. It is worth noting that the triggering logic of the two types of expression animations is not the same.
Specifically, the default expression animation may be randomly triggered. At preset time intervals, one or more default expression animations can be randomly extracted from a material library of default expression animations as the first default animation to be rendered. In order to make the rendering effect approximate the micro-expressions of a real person, the preset time interval can be controlled between 0.5 s and 4 s according to the frequency with which a real person blinks, raises the eyebrows and makes slight eye movements.
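A sketch of this random trigger logic for default expression animations, assuming the material library is represented as a simple list; intervals are drawn uniformly from 0.5 s to 4 s, and the library contents are placeholders.

import random

DEFAULT_EXPRESSIONS = ["blink", "raise_eyebrows", "eye_drift"]

def schedule_default_expressions(total_duration):
    timeline, t = [], 0.0
    while t < total_duration:
        t += random.uniform(0.5, 4.0)                  # preset interval: 0.5 s to 4 s
        timeline.append((t, random.choice(DEFAULT_EXPRESSIONS)))
    return timeline                                    # [(trigger_time, animation_name), ...]

first_default_animations = schedule_default_expressions(total_duration=30.0)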
In addition, the first trigger configuration for the trigger expression animation can have two trigger logics: text trigger and score trigger.
First, first text information is determined in the text information, and text trigger configuration is performed on the trigger expression animation by using the first text information to determine the first trigger animation in the trigger expression animation.
The first text information may be text information in which a user specifies that a certain word or phrase triggers a corresponding first trigger animation. Specifically, the first text information may be marked with a symbol such as # as a tag, and a corresponding first trigger animation may be inserted into a time information position corresponding to the first text information, so as to implement the first trigger configuration.
And secondly, performing text emotion calculation on the text information to obtain emotion scores, and performing score trigger configuration on the trigger expression animations by using the emotion scores to determine a first trigger animation in the trigger expression animations.
Text emotion calculation is performed through the NLP in step S1450 to obtain an emotion score corresponding to the text information, so as to analyze the emotion information corresponding to the current text information.
Further, the corresponding rendering of the first trigger animation is triggered through the emotion scores.
For example, the emotion score may be a score between -1 and 1, where an emotion score of 0 represents neutral emotion information, an emotion score greater than 0 represents positive emotion information, and an emotion score less than 0 represents negative emotion information. Further, for positive emotion information, the corresponding positive trigger expression animation, such as a happy or smiling animation, is determined as the first trigger animation.
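A sketch of the score trigger configuration, assuming the emotion score in [-1, 1] comes from the NLP text emotion calculation; the animation names and the example score are illustrative.

def select_trigger_expression(emotion_score):
    if emotion_score > 0:
        return "smile"        # positive emotion information -> positive trigger animation
    if emotion_score < 0:
        return "frown"        # negative emotion information -> negative trigger animation
    return None               # neutral: keep the default expression animation

first_trigger_animation = select_trigger_expression(0.65)   # -> "smile"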
Based on the trigger logic of the first trigger configuration and considering the expression reality of the digital virtual human, the first default animation of the position where the first trigger animation needs to be rendered can be replaced by the first trigger animation so as to obtain the corresponding expression animation to be rendered.
Furthermore, the generation of the mouth shape animation and the expression animation can also be trained based on deep learning, so as to convert the voice information into a complete facial animation including the mouth shape and the expression.
In step S1460, a limb animation is synthesized.
When limb animation configuration is performed on the voice information, considering that the virtual digital person should, like a real person, always show subtle body movements so as to appear more realistic, a default limb animation can be configured correspondingly. The default limb animation may include breathing, small-amplitude body sway, and the like.
On the other hand, considering that the virtual digital person may perform text-related actions, such as waving, bowing or shifting the center of gravity, in coordination with the voice information, a trigger limb animation can also be configured correspondingly.
Furthermore, according to the application scene of the virtual digital human, related default limb animations and trigger limb animations can be made in advance, classified and stored, and rendering is triggered when needed. It is worth noting that the triggering logic of the two types of limb animations is not the same.
The default limb animation may be triggered randomly and continue to loop. At preset time intervals, one or a plurality of default limb animations can be extracted from the material library of the default limb animations to serve as second default animations for rendering. The preset time interval may also be set according to the requirements of the actual situation, and this is not particularly limited in this exemplary embodiment.
In addition, the second trigger configuration mode for triggering the limb animation can be a text trigger logic and a semantic trigger logic.
First, second text information is determined in the text information, and label triggering configuration is carried out on the triggered limb animation by using the second text information to determine a second triggered animation in the triggered limb animation.
The second text information may be text information in which the user specifies that a certain word or phrase triggers the corresponding second trigger animation. Specifically, the second text information may be marked with a symbol such as # as a tag, and the corresponding second trigger animation may be inserted at the time information position corresponding to the second text information, so as to implement the second trigger configuration.
Secondly, semantic analysis processing is performed through the NLP in step S1450 to obtain semantic information representing the semantics of the text information, so as to analyze the content corresponding to the current text information and thereby trigger the corresponding second trigger animation.
For example, when the semantic information characterizes the semantics of "call", the "call" limb animation among the trigger limb animations may be determined as the second trigger animation.
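A hypothetical sketch of the semantic trigger configuration: a semantic label produced by the NLP analysis selects a trigger limb animation; the label set and the mapping are illustrative only and do not appear in the patent.

SEMANTIC_TO_LIMB_ANIMATION = {
    "call": "wave",       # greeting-like semantics -> waving limb animation
    "farewell": "bow",
}

def select_trigger_limb_animation(semantic_label):
    # None means no trigger: the default limb animation keeps playing
    return SEMANTIC_TO_LIMB_ANIMATION.get(semantic_label)

second_trigger_animation = select_trigger_limb_animation("call")   # -> "wave"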
Based on the trigger logic of the second trigger configuration and considering the body reality of the digital virtual human, the second default animation of the position where the second trigger animation needs to be rendered can be replaced by the second trigger animation so as to obtain the corresponding body animation to be rendered.
In step S1470, the digital human model and animation are rendered.
After the mouth shape animation, the expression animation and the limb animation are obtained, the mouth shape animation, the expression animation and the limb animation of the virtual digital human can be rendered synchronously with the corresponding time information, so that the complete virtual digital human animation is obtained. The virtual digital person can coordinate and match appropriate expression animation and limb animation while broadcasting text information or voice information input by a user.
In the application scenario of the present disclosure, the mouth shape animation, the expression animation and the limb animation of the virtual digital person to be rendered are generated according to the voice information, and the virtual digital person is rendered synchronously according to the mouth shape animation, the expression animation and the limb animation. On one hand, the degree of manual participation in the process of rendering the virtual digital person is reduced, and the generation speed and efficiency of the animation content of the virtual digital person are improved; on the other hand, the animation of the virtual digital person can be generated immediately, the algorithm does not need to be retrained for a virtual digital person with a new image, and the application scenarios of the virtual digital person are enriched.
Further, in an exemplary embodiment of the present disclosure, an animation generation apparatus is also provided. Fig. 17 shows a schematic structural diagram of an animation generation apparatus, and as shown in fig. 17, the animation generation apparatus 1700 may include: an information acquisition module 1710, an animation generation module 1720, and a synchronized rendering module 1730.
Wherein:
an information obtaining module 1710, configured to obtain the manufactured virtual digital person, and obtain voice information broadcasted by the virtual digital person; the animation generation module 1720 is configured to perform mouth shape animation generation processing on the voice information to obtain mouth shape animation, and perform expression animation generation processing on the voice information to obtain expression animation; and the synchronous rendering module 1730 is configured to perform limb animation generation processing on the voice information to obtain limb animation, and perform synchronous rendering on the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
In an exemplary embodiment of the present invention, performing a mouth-shape animation generation process on voice information to obtain a mouth-shape animation includes:
performing phoneme conversion processing on the voice information to obtain a phoneme aligned with the time information, and performing viseme conversion processing on the phoneme to obtain a voice viseme aligned with the time information;
and carrying out linear interpolation processing on the voice visual position and the time information to obtain the mouth shape animation.
In an exemplary embodiment of the present invention, performing a phoneme conversion process on speech information to obtain phonemes aligned with time information includes:
performing feature extraction processing on the voice information to obtain acoustic features and single-frame duration, and determining an acoustic state corresponding to the acoustic features by using a trained acoustic model;
and combining the acoustic states to obtain phonemes, and aligning the phonemes by using the single-frame duration to obtain the phonemes aligned with the time information of the single-frame duration.
In an exemplary embodiment of the invention, the trained acoustic model is trained as follows:
training a voice sample to obtain an original probability value between acoustic characteristics and an acoustic state, and performing voice decoding processing on the original probability value to obtain an acoustic state network;
and performing path search processing on the acoustic state network to obtain a target probability value so as to obtain an acoustic model representing the mapping relation between the acoustic features and the acoustic states.
In an exemplary embodiment of the present invention, performing viseme conversion processing on phonemes to obtain a speech viseme aligned with time information includes:
acquiring a first mapping relation between phonemes and a voice viseme;
and based on the first mapping relation, performing viseme conversion processing on the phoneme to obtain the voice viseme aligned with the time information.
In an exemplary embodiment of the present invention, a linear interpolation process for the speech viseme and the time information to obtain the mouth shape animation includes:
determining the current voice viseme in the voice visemes, and representing the current voice viseme by using the virtual grid weight;
and carrying out linear interpolation processing on the current voice visual position and time information represented by the virtual grid weight to obtain the mouth shape animation.
In an exemplary embodiment of the present invention, a linear interpolation process is performed on the current speech viseme and the time information represented by the virtual grid weight to obtain a mouth-shape animation, which includes:
representing a target voice viseme in the voice visemes by using the virtual grid weight, and performing viseme calculation on the current voice viseme represented by the virtual grid weight and the target voice viseme represented by the virtual grid weight to obtain a viseme calculation result;
determining the single-frame duration, the end time of the target voice viseme and the consumed duration corresponding to the current voice viseme according to the time information, and carrying out duration calculation on the single-frame duration, the end time and the consumed duration to obtain a duration calculation result;
and performing weight calculation on the current voice viseme, the viseme calculation result and the duration calculation result represented by the virtual grid weight to obtain the next grid weight so as to determine the mouth-shaped animation to be rendered according to the next grid weight.
In an exemplary embodiment of the present invention, acquiring the voice information broadcasted by the virtual digital person includes:
acquiring text information broadcasted by a virtual digital person;
and carrying out synthesis voice conversion processing on the text information to obtain voice information.
In an exemplary embodiment of the present invention, performing an expression animation generation process on voice information to obtain an expression animation includes:
performing expression animation configuration on the voice information to obtain default expression animation and trigger expression animation;
setting a first default animation in the default expression animation at intervals of preset time, and performing first trigger configuration on the trigger expression animation to determine a first trigger animation in the trigger expression animation;
and replacing the first default animation with the first trigger animation based on the first trigger configuration to obtain the expression animation to be rendered.
In an exemplary embodiment of the invention, the performing first trigger configuration on the trigger expression animation to determine a first trigger animation in the trigger expression animation includes:
determining first text information in the text information, and performing text trigger configuration on the trigger expression animation by using the first text information to determine a first trigger animation in the trigger expression animation; and/or
And performing text emotion calculation on the text information to obtain emotion scores, and performing score trigger configuration on the trigger expression animations by using the emotion scores to determine a first trigger animation in the trigger expression animations.
In an exemplary embodiment of the present invention, performing a body animation generation process on the voice information to obtain a body animation includes:
performing limb animation configuration on the voice information to obtain default limb animation and trigger limb animation;
setting a second default animation in the default limb animations by taking preset time as an interval, and performing second trigger configuration on the trigger limb animation to determine to trigger a second trigger animation in the limb animations;
and replacing the second default animation with the second trigger animation based on the second trigger configuration to obtain the limb animation to be rendered.
In an exemplary embodiment of the invention, the second trigger configuration for triggering limb animation determines a second trigger animation in the triggering limb animation, comprising:
determining second text information in the text information, and performing label triggering configuration on the triggered limb animation by using the second text information to determine a second triggered animation in the triggered limb animation; and/or
And performing semantic analysis processing on the text information to obtain semantic information, and performing semantic trigger configuration on the triggered limb animation by using the semantic information to determine a second triggered animation in the triggered limb animation.
The details of the animation generation apparatus 1700 are described in detail in the corresponding animation generation method, and therefore are not described herein again.
It should be noted that although several modules or units of animation generation apparatus 1700 are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
An electronic device 1800 according to such an embodiment of the invention is described below with reference to fig. 18. The electronic device 1800 shown in fig. 18 is only an example, and should not bring any limitations to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 18, the electronic device 1800 is in the form of a general purpose computing device. Components of the electronic device 1800 may include, but are not limited to: the at least one processing unit 1810, the at least one memory unit 1820, the bus 1830 that connects the various system components (including the memory unit 1820 and the processing unit 1810), and the display unit 1840.
Wherein the storage unit stores program code, which can be executed by the processing unit 1810, so that the processing unit 1810 performs the steps according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification.
The storage unit 1820 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)1821 and/or a cache memory unit 1822, and may further include a read-only memory unit (ROM) 1823.
The storage unit 1820 may also include a program/utility 1824 having a set (at least one) of program modules 1825, such program modules 1825 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus 1830 may be any type of bus structure representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1800 may also communicate with one or more external devices 2000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1800 to communicate with one or more other computing devices. Such communication can occur through input/output (I/O) interface 1850. Also, the electronic device 1800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1860. As shown, the network adapter 1860 communicates with other modules of the electronic device 1800 via the bus 1830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 19, a program product 1900 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (15)

1. A method of animation generation, the method comprising:
acquiring a manufactured virtual digital person, and acquiring voice information broadcasted by the virtual digital person;
carrying out mouth shape animation generation processing on the voice information to obtain mouth shape animation, and carrying out expression animation generation processing on the voice information to obtain expression animation;
and performing limb animation generation processing on the voice information to obtain limb animation, and synchronously rendering the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
2. The animation generation method according to claim 1, wherein the performing of the mouth animation generation process on the voice information to obtain the mouth animation includes:
performing phoneme conversion processing on the voice information to obtain a phoneme aligned with time information, and performing viseme conversion processing on the phoneme to obtain a voice viseme aligned with the time information;
and carrying out linear interpolation processing on the voice visual position and the time information to obtain the mouth shape animation.
3. The animation generation method according to claim 2, wherein the converting the speech information into phonemes aligned with the time information includes:
performing feature extraction processing on the voice information to obtain acoustic features and single-frame duration, and determining an acoustic state corresponding to the acoustic features by using a trained acoustic model;
and combining the acoustic states to obtain phonemes, and aligning the phonemes by using the single-frame duration to obtain the phonemes aligned with the time information of the single-frame duration.
4. A method for generating animation as claimed in claim 3, wherein the trained acoustic model is trained as follows:
training a voice sample to obtain an original probability value between acoustic features and an acoustic state, and performing voice decoding processing on the original probability value to obtain an acoustic state network;
and performing path search processing on the acoustic state network to obtain a target probability value so as to obtain an acoustic model representing the mapping relation between the acoustic features and the acoustic states.
5. The animation generation method as claimed in claim 2, wherein the subjecting of the phoneme to viseme conversion to obtain the speech viseme aligned with the time information includes:
acquiring a first mapping relation between the phoneme and the voice viseme;
and performing viseme conversion processing on the phoneme based on the first mapping relation to obtain a voice viseme aligned with the time information.
6. The animation generation method according to claim 2, wherein the linear interpolation processing of the speech viseme and the time information to obtain a mouth-shaped animation includes:
determining a current voice viseme in the voice visemes, and representing the current voice viseme by using virtual grid weight;
and carrying out linear interpolation processing on the current voice visual position represented by the virtual grid weight and the time information to obtain the mouth shape animation.
7. The animation generation method according to claim 6, wherein the linear interpolation processing of the current phonetic viseme represented by the virtual mesh weight and the time information to obtain the mouth-shaped animation includes:
representing a target voice viseme in the voice visemes by using virtual grid weights, and carrying out viseme calculation on the current voice viseme represented by the virtual grid weights and the target voice viseme represented by the virtual grid weights to obtain a viseme calculation result;
determining the single-frame duration, the ending time of the target voice viseme and the consumed duration corresponding to the current voice viseme according to the time information, and carrying out duration calculation on the single-frame duration, the ending time and the consumed duration to obtain a duration calculation result;
and performing weight calculation on the current voice viseme represented by the virtual grid weight, the viseme calculation result and the duration calculation result to obtain a next grid weight so as to determine the mouth shape animation to be rendered according to the next grid weight.
8. The animation generation method according to claim 1, wherein the acquiring of the voice information broadcasted by the virtual digital person comprises:
acquiring text information broadcasted by the virtual digital person;
and carrying out synthetic voice conversion processing on the text information to obtain voice information.
9. The animation generation method according to claim 8, wherein the performing expression animation generation processing on the voice information to obtain an expression animation includes:
performing expression animation configuration on the voice information to obtain default expression animation and trigger expression animation;
setting a first default animation in the default expression animation at intervals of preset time, and performing first trigger configuration on the trigger expression animation to determine a first trigger animation in the trigger expression animation;
and replacing the first default animation with the first trigger animation based on the first trigger configuration to obtain the expression animation to be rendered.
10. The animation generation method of claim 9, wherein the performing first trigger configuration on the trigger expression animation to determine the first trigger animation in the trigger expression animation comprises:
determining first text information in the text information, and performing text trigger configuration on the trigger expression animation by using the first text information to determine a first trigger animation in the trigger expression animation; and/or
And performing text emotion calculation on the text information to obtain emotion scores, and performing score trigger configuration on the trigger expression animations by using the emotion scores to determine a first trigger animation in the trigger expression animations.
11. The animation generation method according to claim 8, wherein the subjecting of the voice information to the body animation generation processing to obtain the body animation includes:
performing limb animation configuration on the voice information to obtain default limb animation and trigger limb animation;
setting a second default animation in the default limb animations at intervals of preset time, and performing second trigger configuration on the trigger limb animation to determine a second trigger animation in the trigger limb animation;
and replacing the second default animation with the second trigger animation based on the second trigger configuration to obtain the limb animation to be rendered.
12. The animation generation method of claim 11, wherein the second trigger configuration for the triggered limb animation determines a second trigger animation of the triggered limb animations, comprising:
determining second text information in the text information, and performing label triggering configuration on the triggered limb animation by using the second text information to determine a second triggered animation in the triggered limb animation; and/or
And performing semantic analysis processing on the text information to obtain semantic information, and performing semantic trigger configuration on the trigger limb animation by using the semantic information to determine a second trigger animation in the trigger limb animation.
13. An animation generation device, comprising:
the information acquisition module is configured to acquire the manufactured virtual digital person and acquire voice information broadcasted by the virtual digital person;
the animation generation module is configured to perform mouth shape animation generation processing on the voice information to obtain mouth shape animation, and perform expression animation generation processing on the voice information to obtain expression animation;
and the synchronous rendering module is configured to perform limb animation generation processing on the voice information to obtain limb animation, and perform synchronous rendering on the virtual digital human according to the mouth shape animation, the expression animation and the limb animation.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the animation generation method as claimed in any one of claims 1 to 12.
15. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the animation generation method of any of claims 1-12 via execution of the executable instructions.
CN202110796787.7A 2021-07-14 2021-07-14 Animation generation method and device, storage medium and electronic equipment Pending CN113538641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110796787.7A CN113538641A (en) 2021-07-14 2021-07-14 Animation generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110796787.7A CN113538641A (en) 2021-07-14 2021-07-14 Animation generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113538641A true CN113538641A (en) 2021-10-22

Family

ID=78128026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110796787.7A Pending CN113538641A (en) 2021-07-14 2021-07-14 Animation generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113538641A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201043A (en) * 2021-12-09 2022-03-18 北京百度网讯科技有限公司 Content interaction method, device, equipment and medium
CN114245099A (en) * 2021-12-13 2022-03-25 北京百度网讯科技有限公司 Video generation method and device, electronic equipment and storage medium
CN115311731A (en) * 2022-10-10 2022-11-08 之江实验室 Expression generation method and device for sign language digital person
CN116168134A (en) * 2022-12-28 2023-05-26 北京百度网讯科技有限公司 Digital person control method, digital person control device, electronic equipment and storage medium
WO2023124933A1 (en) * 2021-12-31 2023-07-06 魔珐(上海)信息科技有限公司 Virtual digital person video generation method and device, storage medium, and terminal
CN116564338A (en) * 2023-07-12 2023-08-08 腾讯科技(深圳)有限公司 Voice animation generation method, device, electronic equipment and medium
CN117557698A (en) * 2024-01-11 2024-02-13 广州趣丸网络科技有限公司 Digital human limb animation generation method and device, storage medium and computer equipment
CN117557698B (en) * 2024-01-11 2024-04-26 广州趣丸网络科技有限公司 Digital human limb animation generation method and device, storage medium and computer equipment

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201043A (en) * 2021-12-09 2022-03-18 北京百度网讯科技有限公司 Content interaction method, device, equipment and medium
CN114245099A (en) * 2021-12-13 2022-03-25 北京百度网讯科技有限公司 Video generation method and device, electronic equipment and storage medium
CN114245099B (en) * 2021-12-13 2023-02-21 北京百度网讯科技有限公司 Video generation method and device, electronic equipment and storage medium
WO2023124933A1 (en) * 2021-12-31 2023-07-06 魔珐(上海)信息科技有限公司 Virtual digital person video generation method and device, storage medium, and terminal
CN115311731A (en) * 2022-10-10 2022-11-08 之江实验室 Expression generation method and device for sign language digital person
CN116168134A (en) * 2022-12-28 2023-05-26 北京百度网讯科技有限公司 Digital person control method, digital person control device, electronic equipment and storage medium
CN116168134B (en) * 2022-12-28 2024-01-02 北京百度网讯科技有限公司 Digital person control method, digital person control device, electronic equipment and storage medium
CN116564338A (en) * 2023-07-12 2023-08-08 腾讯科技(深圳)有限公司 Voice animation generation method, device, electronic equipment and medium
CN116564338B (en) * 2023-07-12 2023-09-08 腾讯科技(深圳)有限公司 Voice animation generation method, device, electronic equipment and medium
CN117557698A (en) * 2024-01-11 2024-02-13 广州趣丸网络科技有限公司 Digital human limb animation generation method and device, storage medium and computer equipment
CN117557698B (en) * 2024-01-11 2024-04-26 广州趣丸网络科技有限公司 Digital human limb animation generation method and device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN113538641A (en) Animation generation method and device, storage medium and electronic equipment
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
US9361722B2 (en) Synthetic audiovisual storyteller
CN109377540B (en) Method and device for synthesizing facial animation, storage medium, processor and terminal
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
US9959657B2 (en) Computer generated head
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN111915707A (en) Mouth shape animation display method and device based on audio information and storage medium
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
CN110880198A (en) Animation generation method and device
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
KR20110081364A (en) Method and system for providing a speech and expression of emotion in 3d charactor
CN115311731B (en) Expression generation method and device for sign language digital person
Brooke et al. Two-and three-dimensional audio-visual speech synthesis
Liu et al. Real-time speech-driven animation of expressive talking faces
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN112992116A (en) Automatic generation method and system of video content
D’alessandro et al. Reactive statistical mapping: Towards the sketching of performative control with data
CN112331184A (en) Voice mouth shape synchronization method and device, electronic equipment and storage medium
Melenchón et al. Emphatic visual speech synthesis
Chen et al. Text to avatar in multimodal human computer interface
Deena Visual speech synthesis by learning joint probabilistic models of audio and video
Ben Youssef et al. Head motion analysis and synthesis over different tasks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination