WO2023096275A1 - Method and system for generating a text-based avatar - Google Patents

Method and system for generating a text-based avatar

Info

Publication number
WO2023096275A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion
avatar
input text
information
animation
Prior art date
Application number
PCT/KR2022/018321
Other languages
English (en)
Korean (ko)
Inventor
김선태
이세윤
최기환
이호진
김동균
제갈수민
김병을
최성준
김재민
이수미
이주현
박소현
Original Assignee
네이버 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네이버 주식회사 filed Critical 네이버 주식회사
Publication of WO2023096275A1 publication Critical patent/WO2023096275A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/277 - Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302 - Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307 - Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen

Definitions

  • the present disclosure relates to a method and system for generating a text-based avatar, and more particularly, to a method and system for generating a full body animation of an avatar by estimating and synthesizing lip sync animation and motion of the avatar based on text.
  • avatars are characters that take the place of users in a virtual space.
  • many avatar service providers put considerable effort into implementing avatars that operate in a virtual 3D space similarly to a person in the real world, but in many cases the motion of the avatar is unnatural.
  • the present disclosure provides, to solve the above problems, a text-based avatar generation method, a computer-readable non-transitory storage medium on which instructions are recorded, and an apparatus (system).
  • the present disclosure may be implemented in various ways, including a method, an apparatus (system), or a computer-readable non-transitory storage medium on which instructions are recorded.
  • a method for generating a text-based avatar includes synthesizing an avatar's voice based on input text, generating a lip sync animation based on meta information of the synthesized voice, estimating motion of the avatar based on the input text, and generating a full-body animation of the avatar by synthesizing the lip sync animation and the estimated motion.
  • a computer-readable non-transitory storage medium recording instructions for executing a text-based avatar generation method in a computer according to an embodiment of the present disclosure is provided.
  • An information processing system according to an embodiment includes a memory and at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory, wherein the at least one program contains instructions for synthesizing the avatar's voice based on input text, generating a lip sync animation based on the meta information of the synthesized voice, estimating the avatar's motion based on the input text, and synthesizing the lip sync animation and the estimated motion to generate a full-body animation of the avatar.
  • natural motion in which the face and body are not treated separately can be implemented by including both facial expressions and body gestures/actions in the motions or motion units representing the avatar's movements.
  • the motion for various avatars or characters can be estimated or synthesized by changing only the resources, without retraining the model.
  • a more suitable motion for each emotion may be estimated by using a separate machine learning model trained for each emotion.
  • an avatar animation that plays naturally without interruption can be generated by estimating not only the motion (motion unit ID) but also the playback time information (and connection relationships) of each motion.
  • FIG. 1 shows the configuration of a text-based avatar creation system according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram showing the internal configuration of an information processing system according to an embodiment of the present disclosure.
  • FIG. 3 is a diagram showing an internal configuration of a processor of an information processing system according to an embodiment of the present disclosure.
  • FIG. 4 shows the configuration of a text-based avatar generation system according to another embodiment of the present disclosure.
  • FIG. 5 is a diagram illustrating an internal configuration of a motion estimation unit according to an embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating an internal configuration of a motion estimation unit according to another embodiment of the present disclosure.
  • FIG. 7 is a diagram illustrating a data format learned by a motion estimation unit to output a motion reproduction time together with a motion unit ID according to an embodiment of the present disclosure.
  • FIG. 8 is a diagram illustrating blendshape information used when a lip generator generates a lip sync animation according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart illustrating an example of a text-based avatar generation method according to an embodiment of the present disclosure.
  • a 'module' or 'unit' used in the specification means a software or hardware component, and the 'module' or 'unit' performs certain roles.
  • 'module' or 'unit' is not meant to be limited to software or hardware.
  • a 'module' or 'unit' may be configured to reside in an addressable storage medium and may be configured to execute on one or more processors.
  • a 'module' or 'unit' may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • a 'module' or 'unit' may be implemented with a processor and a memory.
  • 'Processor' should be interpreted broadly to include general-purpose processors, central processing units (CPUs), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, and the like.
  • 'processor' may refer to an application specific integrated circuit (ASIC), programmable logic device (PLD), field programmable gate array (FPGA), or the like.
  • 'Processor' may also refer to a combination of processing devices, such as, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Also, 'memory' should be interpreted broadly to include any electronic component capable of storing electronic information.
  • 'Memory' may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like.
  • a memory is said to be in electronic communication with the processor if the processor can read information from and/or write information to the memory.
  • Memory integrated with the processor is in electronic communication with the processor.
  • a 'system' may include at least one of a server device and a cloud device, but is not limited thereto.
  • a system may consist of one or more server devices.
  • a system may consist of one or more cloud devices.
  • the system may be operated by configuring a server device and a cloud device together.
  • 'avatar' may refer to a virtual two-dimensional or three-dimensional object to which a user of any service or application has given his or her appearance.
  • an avatar may refer to a 2D or 3D object or character that simulates a user's face or whole body or implements a face or whole body set by a user.
  • an avatar may be expressed through a voice synthesized from text that is input by a user through an application or automatically generated by a service platform (e.g., conversational text input by a user or automatically generated), a lip sync animation generated based on meta information associated with the synthesized voice and emotion information, a motion animation of the face and whole body, and the like.
  • a 'machine learning model' may include any model used to infer an answer to a given input.
  • the machine learning model may include an artificial neural network model including an input layer (layer), a plurality of hidden layers, and an output layer.
  • each layer may include a plurality of nodes.
  • some or all of the plurality of machine learning models described as separate models may be implemented as one model, and the machine learning model described as one model in the present disclosure may be implemented as a plurality of machine learning models.
  • 'motion' may include an animation representing an avatar's facial expressions and body gestures/actions, for example an animation of facial expressions and actions that are suitable for the avatar to perform.
  • motion may consist of a combination or connection of one or more motion units.
  • 'motion' may refer to each motion unit or a motion unit ID assigned to each motion unit.
  • the motion may also include a basic (idle) state in which the avatar makes no particular facial expression, movement, or gesture (no motion).
  • 'each of a plurality of A' may refer to each of all components included in the plurality of A, or to each of some components included in the plurality of A.
  • the text-based avatar creation system 100 may create a full-body animation 160 of the avatar by performing voice synthesis 120, lip sync animation generation 130, motion estimation 140, and video synthesis 150 based on the input text 110.
  • the text-based avatar generation system 100 may receive input text 110 (e.g., "Hello, nice to meet you") and perform speech synthesis 120 (TTS: Text-to-Speech) based on the input text 110 to generate a synthesized sound.
  • the synthesized sound may be a voice generated as a result of the avatar uttering the input text 110 (for example, a voice in which the avatar utters "Hello. Nice to meet you") or a natural conversational voice, but is not limited thereto.
  • the synthesized sound may be generated using an arbitrary speech synthesis model, such as a statistical model (e.g., an HMM model) or a deep learning-based model (e.g., an end-to-end model).
  • the text-based avatar generation system 100 receives user's voice characteristic information (eg, user's voice sample, user's vocalization information, etc.) and generates a synthesized sound imitating the user's voice.
  • a synthesized sound reflecting the avatar's voice characteristics may be generated using the avatar's voice characteristic information (eg, the avatar's voice sample or the avatar's vocalization information), but is not limited thereto.
  • the text-based avatar generation system 100 may generate a synthesized sound independent of a user or speaker.
  • the text-based avatar generation system 100 may use meta information about the synthesized voice.
  • the meta information may include, but is not limited to, phoneme information, playback time information, and speech intensity information of the synthesized voice, and the meta information may further include arbitrary information that can be extracted from the synthesized voice, such as emotion information.
  • Examples of meta information include information on the plurality of phonemes included in the synthesized voice, information on the playback start time and playback length of each phoneme, information on the total playback time of the synthesized voice, prosody information such as intonation and stress, information on utterance intensity, and the like.
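  • For illustration only (not taken from the published application), the meta information described above could be represented roughly as in the following sketch; all field and class names are assumptions made for this example:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhonemeInfo:
    phoneme: str          # e.g. "HH", "AH", "L", "OW"
    start_sec: float      # playback start time within the synthesized voice
    duration_sec: float   # playback length of this phoneme

@dataclass
class TTSMetaInfo:
    phonemes: List[PhonemeInfo] = field(default_factory=list)
    total_duration_sec: float = 0.0   # total playback time of the synthesized voice
    prosody: dict = field(default_factory=dict)   # e.g. intonation / stress markers
    utterance_intensity: float = 0.0  # overall speech intensity
    emotion: str = "neutral"          # optional emotion label extracted from the voice
```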
  • the text-based avatar generation system 100 may generate (130) a lip sync animation based on the meta information of the synthesized voice.
  • the lip sync animation may refer to an animation representing a change in the shape of an avatar's mouth according to the utterance of a synthesized sound.
  • a blendshape-based animation technique may be used to create lip sync animation (130), but is not limited thereto. Generation 130 of lip sync animation using a blendshape-based animation technique will be described later in detail with reference to FIG. 8 .
  • the text-based avatar generation system 100 may estimate 140 motion of the avatar based on the input text 110 .
  • the motion may be an animation representing a suitable facial expression, gesture, and/or motion taken by the avatar while uttering the input text 110 .
  • a motion may be composed of a combination or connection of one or more motion units, and each motion or each motion unit may include an animation representing an avatar's facial expression and body gesture/action.
  • a more natural motion can be implemented by including both facial expressions and body gestures/movements in a motion or motion unit, rather than implementing the motions of the face and the body separately.
  • a motion unit ID may be assigned to each motion unit, and the text-based avatar generation system 100 may estimate one or more motion unit IDs based on the input text 110 .
  • the motion estimation model used in motion estimation 140 does not need to be re-trained, and motion generation for various avatars or characters may be possible by changing only the resources.
  • a face part among motions or motion units may be created using a blendshape-based animation technique.
  • motion of a face part may be generated by a method of combining a plurality of blend shapes (or morphing targets) in a morphing method.
  • a plurality of blend shapes including a first blend shape with the left eye closed, a second blend shape with the right eye closed, a third blend shape with a left lip smiling, a fourth blend shape with a right lip smiling, and the like are defined.
  • a facial expression may be generated by assigning a weight value (e.g., 0 or more and 1 or less) to each blend shape and blending them.
  • a facial part of motion may be created by combining a plurality of facial expressions in a morphing method.
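  • As a minimal sketch of the weighted blendshape morphing described above (an illustration under assumed data shapes, not code from the application), a facial expression can be computed as the neutral mesh plus weighted deltas of each blend shape:

```python
import numpy as np

def blend_face(neutral: np.ndarray, blendshapes: list, weights: list) -> np.ndarray:
    """Combine blend shapes in a morphing fashion.

    neutral:     (V, 3) vertex positions of the neutral face
    blendshapes: list of (V, 3) target meshes (e.g. left eye closed, right lip smiling, ...)
    weights:     one weight in [0, 1] per blend shape
    """
    result = neutral.copy()
    for target, w in zip(blendshapes, weights):
        result += w * (target - neutral)   # add the weighted delta of each shape
    return result

# Example: 30% "left eye closed" + 80% "right lip smiling" on a tiny placeholder mesh
V = 4
neutral = np.zeros((V, 3))
eye_closed = np.random.rand(V, 3)
lip_smile = np.random.rand(V, 3)
expression = blend_face(neutral, [eye_closed, lip_smile], [0.3, 0.8])
```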
  • the text-based avatar generation system 100 may perform motion estimation 140 by considering additional information in addition to the input text 110, and a machine learning model may be used for motion estimation 140. This will be described in detail later with reference to FIGS. 3 to 7.
  • the text-based avatar generation system 100 may generate a full-body animation of the avatar by performing video synthesis 150 on the lip sync animation and the estimated motion.
  • the whole body animation of the avatar may be created through a process such as adding a blend shape of a lip sync animation to a blend shape format for expressing a facial part among motions of the avatar.
  • the generated whole-body animation may be an animation in which the avatar moves in detail by reflecting not only large muscles but also small muscles.
  • the whole body animation may be generated in the form of data applicable to various avatars, rather than in the form of an animation generated for one specific avatar, so the generated whole-body animation data can be applied to various avatars and characters (humans, animals, etc.).
  • synthetic sounds may be included in the full-body animation.
  • when the whole body animation is output on a display (e.g., a display of a user terminal), it may be output together with the synthesized sound.
  • the playback length of the full-body animation may be related to the playback length of the synthesized sound.
  • the reproduction length of the whole body animation may be the same as that of the synthesized sound, or the reproduction length of the whole body animation may be longer than that of the synthesized sound for a more natural presentation.
  • a wrap-up (closing) motion may be performed for a predetermined time (e.g., 1 to 2 seconds).
  • since a lip sync animation of the avatar uttering the input text 110 is generated (130), motion based on the input text 110 is estimated (140), and the lip sync animation and motion are synthesized to obtain the full-body animation 160 of the avatar, it is possible to create a natural animation in which the avatar appears to actually utter the input text 110 and move accordingly.
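  • The overall flow of FIG. 1 can be summarized by the following runnable sketch; every function name here is a hypothetical stand-in, not an API of the disclosed system:

```python
from typing import List, Tuple

# Hypothetical stand-ins for the four stages in FIG. 1; a real implementation would
# call a TTS engine, a lip-sync generator, a motion-estimation model, and a renderer.
def synthesize_speech(text: str) -> Tuple[bytes, dict]:
    return b"", {"phonemes": [], "total_duration_sec": 0.0}

def generate_lip_sync(meta_info: dict) -> List[dict]:
    return [{"time": 0.0, "mouth_shape": "closed"}]

def estimate_motion(text: str, meta_info: dict) -> List[dict]:
    return [{"motion_unit_id": 45, "start": 0.0, "duration": meta_info["total_duration_sec"]}]

def synthesize_animation(lip_sync: List[dict], motion: List[dict]) -> dict:
    return {"lip_sync": lip_sync, "motion": motion}

def generate_avatar_animation(input_text: str):
    voice, meta = synthesize_speech(input_text)           # 1. speech synthesis (120)
    lip_sync = generate_lip_sync(meta)                    # 2. lip sync animation (130)
    motion = estimate_motion(input_text, meta)            # 3. motion estimation (140)
    return synthesize_animation(lip_sync, motion), voice  # 4. video synthesis (150)

animation, voice = generate_avatar_animation("Hello, nice to meet you")
```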
  • the information processing system 200 may include a memory 210, a processor 220, a communication module 230 and an input/output interface 240.
  • the information processing system 200 may be configured to communicate information and/or data through a network using the communication module 230 .
  • Memory 210 may include any non-transitory computer readable recording medium.
  • the memory 210 may include a permanent (non-volatile) mass storage device such as random access memory (RAM), read-only memory (ROM), a disk drive, a solid state drive (SSD), flash memory, and the like.
  • a permanent mass storage device such as a ROM, an SSD, flash memory, or a disk drive may be included in the information processing system 200 as a separate permanent storage device distinct from the memory.
  • the memory 210 may store an operating system and at least one program code (eg, a code for synthesizing a voice installed and driven in the information processing system 200 and generating a full-body animation of an avatar).
  • a computer-readable recording medium separate from the memory, from which such software components may be loaded, may include a recording medium directly connectable to the information processing system 200, for example, a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like.
  • software components may be loaded into the memory 210 through the communication module 230 rather than a computer-readable recording medium.
  • at least one program may be loaded into the memory 210 based on a computer program (e.g., a program for voice synthesis and creation of a whole-body animation of an avatar) installed from files provided through the communication module 230 by developers or by a file distribution system that distributes application installation files.
  • the processor 220 may be configured to process commands of a computer program by performing basic arithmetic, logic, and input/output operations. Commands may be provided by the memory 210 or the communication module 230, for example commands received from a user terminal (not shown) or another external system.
  • the processor 220 synthesizes the avatar's voice based on the input text, creates a lip sync animation based on the meta information of the synthesized voice, estimates the avatar's motion based on the input text, and A full-body animation of the avatar may be generated by synthesizing the sync animation and the estimated motion.
  • the communication module 230 may provide a configuration or function for a user terminal (not shown) and the information processing system 200 to communicate with each other through a network, and may provide a configuration or function for the information processing system 200 to communicate with an external system (e.g., a separate cloud system).
  • control signals, commands, data, and the like provided under the control of the processor 220 of the information processing system 200 may be transmitted to a user terminal and/or an external system through the communication module 230 and the network, via the communication module of the user terminal and/or the external system.
  • the user terminal may receive a full-body animation (and synthesized sound) of the created avatar.
  • the input/output interface 240 of the information processing system 200 may be a means for interfacing with a device (not shown) for input or output that may be connected to, or included in, the information processing system 200.
  • the input/output interface 240 is shown as an element configured separately from the processor 220 , but is not limited thereto, and the input/output interface 240 may be included in the processor 220 .
  • the information processing system 200 may include more components than those of FIG. 2. However, most conventional components need not be explicitly illustrated.
  • the processor 220 of the information processing system 200 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems. According to an embodiment, the processor 220 may receive input text from a user terminal. Then, the avatar's voice may be synthesized based on the input text, a lip sync animation may be generated based on the meta information of the synthesized voice, the avatar's motion may be estimated based on the input text, and the lip sync animation and the estimated motion may be synthesized to generate the avatar's full-body animation.
  • FIG. 3 is a diagram showing an internal configuration of a processor 220 of an information processing system according to an embodiment of the present disclosure.
  • the processor 220 may include a voice synthesis unit 310, an emotion analysis unit 320, a lip generation unit 330, a motion estimation unit 340, and an image synthesis unit 350.
  • the internal configuration of the processor 220 of the information processing system shown in FIG. 3 is only an example and may be implemented differently. For example, at least some of the components of the processor 220 may be omitted or other components may be added, and at least some of the processes performed by the processor 220 may be performed by a processor of the user terminal.
  • the processor 220 may receive input text, and the input text may be provided to at least one of the speech synthesis unit 310 , the emotion analysis unit 320 , and the motion estimation unit 340 .
  • the input text may be received from a user terminal, an external system, or another application (eg, a conversation generating application), or may be generated by a conversation generator (not shown) of the processor 220 .
  • the voice synthesizer 310 may synthesize the avatar's voice based on the input text and generate a synthesized voice.
  • the synthesized voice may refer to a voice in which the avatar utters the input text.
  • the synthesized sound may be a natural conversational voice, but is not limited thereto.
  • the synthesized sound may be a voice for providing guidance.
  • the speech synthesis unit 310 may generate synthesized speech using an arbitrary speech synthesis model, such as a statistics-based model (eg, HMM model) or a deep learning-based model (eg, end to end model).
  • the voice synthesizer 310 may receive voice characteristic information of the user to generate a synthesized sound imitating the user's voice, or may receive voice characteristic information of an avatar to generate a synthesized sound reflecting the avatar's voice characteristics, but is not limited thereto.
  • the voice synthesizer 310 may generate a synthesized voice independent of a user or speaker.
  • the voice synthesizer 310 may further use the emotion information extracted by the emotion analyzer 320 to synthesize the avatar's voice. For example, when the emotion of 'joy' is extracted from the input text by the emotion analyzer 320, a synthesized sound reflecting the emotion information of 'joy' may be generated.
  • the voice synthesis unit 310 may provide meta information on the synthesized voice to at least one of the lip generation unit 330 and the motion estimation unit 340.
  • the meta information may include, but is not limited to, phoneme information, playback time information, and speech intensity information of the synthesized voice, and the meta information may further include arbitrary information that can be extracted from the synthesized voice, such as emotion information.
  • examples of meta information include information about the plurality of phonemes included in the synthesized voice, information about the playback start time and playback length of each phoneme, information about the total playback time of the synthesized voice, prosody information such as intonation and stress, information on utterance strength, and the like.
  • the emotion analyzer 320 may extract emotion information from input text.
  • the emotion analyzer 320 may infer emotion information estimated to be felt by a person uttering the input text (eg, joy, neutrality, sadness, anger, etc.).
  • the emotion analyzer 320 may extract emotion information from input text using a machine learning model trained to infer emotion information.
  • the emotion analyzer 320 may additionally receive not only the input text but also conversation contents before and after the input text, and extract emotion information for the input text in consideration of the overall conversation contents or context including the input text.
  • Emotion information extracted by the emotion analysis unit 320 may be provided to at least one of the voice synthesis unit 310 and the motion estimation unit 340 .
  • the lip generator 330 may generate a lip sync animation based on the meta information of the synthesized sound extracted by the voice synthesizer 310 .
  • the lip sync animation may refer to an animation representing a change in the shape of an avatar's mouth according to the utterance of a synthesized sound.
  • a blendshape-based animation technique may be used to create lip sync animation, but is not limited thereto. Generation of lip sync animation using blendshape-based animation techniques will be described later in detail with reference to FIG. 8 .
  • the motion estimation unit 340 may estimate the motion of the avatar based on the input text.
  • the motion of the avatar may be an animation representing suitable facial expressions, gestures, and/or motions taken by the avatar while uttering the input text.
  • a motion may be composed of a combination or connection of one or more motion units, and each motion or each motion unit may include an animation representing an avatar's facial expression and body gesture/action.
  • a motion unit ID may be assigned to each motion unit, and the motion estimation unit 340 may estimate one or more motion unit IDs based on input text.
  • the motion estimator 340 may further receive not only input text but also meta information of emotion information and/or synthesized voice, and estimate the motion of the avatar based on the received information.
  • the emotion information may be emotion information extracted from the input text by the emotion analyzer 320 or emotion information included in the meta information of the synthesized sound.
  • the motion estimator 340 may further estimate reproduction time information of each motion as well as motion (motion unit ID).
  • the motion estimator 340 may use a machine learning model to estimate motion (and playback time information of the motion) based on the input text (and emotion information). Motion estimation using a machine learning model will be described later in detail with reference to FIGS. 5 to 7 .
  • the image synthesis unit 350 extracts motion unit files corresponding to each motion unit ID estimated by the motion estimation unit 340 from the motion unit DB 360, and combines, connects, or synthesizes the extracted motion unit files appropriately. By doing so, the motion can be completed.
  • the video synthesis unit 350 may generate a full-body animation of the avatar by synthesizing the generated lip sync animation and motion of the avatar.
  • the video compositing unit 350 may create a full-body animation of the avatar by appropriately mixing the lip sync animation with the motion of the avatar, for example through a method of adding the blend shapes of the lip sync animation to the blend shape format for expressing the facial part of the avatar's motion, a method of synchronizing the lip sync animation and the motion, and the like.
  • in this process, arbitrary 3D graphics techniques such as blendshape, morphing, blending, and interpolation may be used.
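  • A minimal sketch of the "add the lip sync blend shapes to the face blend shape format of the motion" idea might look like the following; the channel names and the per-frame sampling scheme are assumptions made for this illustration:

```python
def merge_face_tracks(motion_face: dict, lip_sync: dict, clamp: float = 1.0) -> dict:
    """Add lip-sync blendshape weights on top of the motion's facial blendshape track.

    Both inputs map a blendshape channel name (e.g. "jawOpen", "mouthSmile_L")
    to a list of per-frame weights sampled at the same frame rate.
    """
    merged = {name: list(weights) for name, weights in motion_face.items()}
    for name, weights in lip_sync.items():
        base = merged.setdefault(name, [0.0] * len(weights))
        for i, w in enumerate(weights):
            if i < len(base):
                base[i] = min(clamp, base[i] + w)   # additive mix, clamped to a valid weight
    return merged

motion_face = {"mouthSmile_L": [0.2, 0.2, 0.2], "jawOpen": [0.0, 0.0, 0.0]}
lip_sync = {"jawOpen": [0.6, 0.9, 0.3]}
full_face = merge_face_tracks(motion_face, lip_sync)
# {'mouthSmile_L': [0.2, 0.2, 0.2], 'jawOpen': [0.6, 0.9, 0.3]}
```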
  • the text-based avatar generation system 400 may include a dialogue generator 410, a voice synthesizer 310, an emotion analyzer 320, a lip generator 330, a motion estimation unit 340, and an image synthesis unit 350.
  • In FIG. 4, configurations overlapping those of FIG. 3 will be described only briefly, based on the embodiment shown in FIG. 4.
  • the dialogue generator 410 may generate input text using an arbitrary language generation model.
  • input text generated by the dialogue generator 410 may be input text for generating a full-body animation.
  • the dialogue generator 410 may generate the second text based on the first text using an arbitrary language generation model. For example, a first text input by a user may be received from a user terminal, and based on this, a second text (eg, a response to the first text) may be generated.
  • the second text generated by the dialogue generator 410 may be input text for generating a full-body animation.
  • the dialogue generator 410 may further generate additional information associated with the input text as well as the input text.
  • the dialogue generating unit 410 may generate, together with the input text "I feel so good today", an instruction phrase such as "(Anna raises both arms and looks on happily)", for example in the form "I feel so good today (Anna raises both arms and looks on happily)".
  • the generated additional information may be provided to the voice synthesis unit 310, the emotion analysis unit 320, and/or the motion estimation unit 340, and used to generate a synthesized sound reflecting the additional information, to extract emotion information, or to estimate motion.
  • the dialogue generator 410 may be omitted.
  • in this case, text generated by another application (e.g., a language generating application) or text directly input by a user may be received as the input text.
  • the emotion analyzer 320 may extract emotion information from input text. For example, the emotion analyzer 320 may infer emotion information estimated to be felt by a person uttering the input text (eg, joy, neutrality, sadness, anger, etc.). Emotion information extracted by the emotion analysis unit 320 may be provided to the voice synthesis unit 310 and the motion estimation unit 340 .
  • the voice synthesis unit 310 may synthesize the avatar's voice based on the input text and emotion information. For example, the voice synthesis unit 310 may synthesize a voice such that an avatar utters an input text with an extracted emotion.
  • Text-to-Speech (TTS) meta information may be extracted from the speech synthesized by the voice synthesizer 310, and the extracted TTS meta information may be provided to the lip generator 330 and the motion estimation unit 340.
  • TTS Text-to-Speech
  • the lip generator 330 may generate a lip sync animation based on the TTS meta information.
  • the lip generation unit 330 may generate a lip sync animation representing a change in mouth shape (or change in lip movement, etc.) as the avatar utters a synthesized sound, based on the TTS meta information.
  • lip sync animation may be created using a blendshape based animation technique.
  • the motion estimation unit 340 may estimate the motion of the avatar based on the input text, TTS meta information, and emotion information. For example, the motion estimation unit 340 may estimate the motion unit IDs constituting the motion and the motion playback time information based on the input text, emotion information, and TTS meta information by using a machine learning model. An embodiment in which the motion estimation unit 340 estimates the motion unit IDs (and motion reproduction time information) constituting the motion based on the input text, emotion information, and/or TTS meta information using a machine learning model will be described in detail later with reference to FIGS. 5 to 7.
  • the image synthesis unit 350 extracts motion unit files corresponding to each motion unit ID estimated by the motion estimation unit 340 from the motion unit DB 360, and combines, connects, or synthesizes the extracted motion unit files appropriately. By doing so, the motion can be completed. Also, the video synthesis unit 350 may generate a full-body animation of the avatar by synthesizing the generated lip sync animation and motion of the avatar.
  • the synthesized sound and full-body animation generated by the text-based avatar generation system 400 may be transmitted to a user terminal, and the user terminal may output the received synthesized sound and whole-body animation through an output device.
  • the user terminal may display a whole-body animation on a display while outputting a synthesized sound through a speaker. Through this, the user may recognize the avatar as taking the facial expression, mouth shape, and motion included in the whole body animation while uttering the synthesized sound.
  • the motion estimator 340 may estimate the motion unit ID of the avatar based on the input text and emotion information, and the motion estimator 340 may include an emotion diverter 510 and machine learning models 520, 530, 540, and 550 for each emotion.
  • the emotion information received by the motion estimation unit 340 may be emotion information extracted by the emotion analyzer 320 based on the input text, emotion information included in the meta information of the synthesized sound, or emotion information input by the user.
  • the emotion-specific machine learning models 520, 530, 540, and 550 may include one or more machine learning models trained for each emotion.
  • the machine learning model (joy) 520 may be a model trained to receive input text associated with the emotion of 'joy' and infer a motion unit ID associated with the emotion of 'joy', and the machine learning model (sadness) 540 may be a model trained to receive input text associated with the emotion of 'sadness' and infer a motion unit ID associated with the emotion of 'sadness'.
  • the emotion diverter 510 may determine the machine learning model 520 , 530 , 540 , or 550 to be used according to the received emotion information. For example, when the received emotion information is 'joy', it may be determined to use the machine learning model (joy) 520 . Then, based on the input text, the motion unit ID to which the emotion information is reflected may be estimated by the machine learning models 520 , 530 , 540 , and 550 whose use is determined by the emotion diverter 510 .
  • the motion unit ID inferred by the machine learning models 520 , 530 , 540 , and 550 for each emotion may be a motion unit ID corresponding to a motion representing each emotion. According to an embodiment, even motions performing the same or similar motions may be implemented with various motions for each emotion. For example, as shown in FIG. 5 , various motions for each emotion may be implemented for the same or similar motions of putting one hand on the chest.
  • although the motion estimation unit 340 is described above as receiving input text and emotion information and estimating a motion unit ID, it is not limited thereto.
  • the motion estimation unit 340 may receive only input text without emotion information.
  • the emotion diverter 510 may extract emotion information from the input text similarly to the emotion analyzer 320 .
  • the emotion branching unit 510 may extract emotion information using a machine learning model trained to infer emotion information from input text. Thereafter, the emotion diverter 510 may determine the machine learning model 520 , 530 , 540 , or 550 to be used according to the extracted emotion information, and subsequent processes may be performed as described above.
  • the motion estimation unit 340 may further receive as an input not only input text (and emotion information) but also previous conversation or conversation context information.
  • the emotion diverter 510 may further consider the additional information to extract emotion information or to determine the machine learning model 520, 530, 540, or 550 to be used, and the received additional information may be input to the machine learning models 520, 530, 540, and 550 together with the input text (and emotion information) and used to estimate a motion unit ID.
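  • The per-emotion branching of FIG. 5 could be sketched as follows; the classifier and the per-emotion models below are simple placeholders (assumptions for this example), whereas the application describes trained machine learning models:

```python
from typing import Callable, Dict, List, Optional

# Placeholder per-emotion "models": each maps input text to a list of motion unit IDs.
def joy_model(text: str) -> List[int]:
    return [12, 3]

def sadness_model(text: str) -> List[int]:
    return [77]

def neutral_model(text: str) -> List[int]:
    return [45]

EMOTION_MODELS: Dict[str, Callable[[str], List[int]]] = {
    "joy": joy_model,
    "sadness": sadness_model,
    "neutral": neutral_model,
}

def classify_emotion(text: str) -> str:
    # Stand-in for the emotion analyzer / emotion diverter (510);
    # a trained classifier would be used in practice.
    return "joy" if "!" in text else "neutral"

def estimate_motion_units(text: str, emotion: Optional[str] = None) -> List[int]:
    emotion = emotion or classify_emotion(text)          # branch on given or inferred emotion
    model = EMOTION_MODELS.get(emotion, neutral_model)   # pick the per-emotion model
    return model(text)

print(estimate_motion_units("I feel so good today!"))    # routed to the 'joy' model
```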
  • the motion estimation unit 340 may estimate the motion unit ID of the avatar based on the input text and emotion information using the single machine learning model 610 .
  • the emotion information received by the motion estimation unit 340 may be emotion information extracted by the emotion analyzer 320 based on the input text, emotion information included in the meta information of the synthesized sound, or emotion information input by the user.
  • the single machine learning model 610 may be a model trained to estimate a motion unit ID by receiving input text and emotion information. That is, the single machine learning model 610 may estimate a motion that is suitable for the input text and that reflects the emotion information. Therefore, with the single machine learning model 610 that estimates the motion unit ID based on the input text and emotion information, different motion unit IDs (or motions) can be estimated for the same input text depending on the received emotion information.
  • although the motion estimation unit 340 is shown in FIG. 6 as estimating a motion unit ID by receiving input text and emotion information, it is not limited thereto.
  • the motion estimation unit 340 may receive only input text without emotion information.
  • a single machine learning model 610 can estimate the motion unit ID based only on the input text.
  • the single machine learning model 610 may estimate the motion unit ID by reflecting the emotion information recognized by the input text. According to the single machine learning model 610 that receives only input text and estimates a motion unit ID, when the input text is the same, the same motion unit ID (or motion) can be estimated.
  • the motion estimation unit 340 may further receive previous conversation or conversation context information as an input in addition to the input text (and emotion information), and the received additional information may be input into the single machine learning model 610 together with the input text (and emotion information) and used to estimate a motion unit ID.
  • since the motion estimation unit 340 estimates the motion unit ID assigned to each motion unit rather than the motion of a specific avatar or character itself, motion estimation for various avatars or characters is possible by changing only the resources, without retraining the machine learning model.
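  • Because only motion unit IDs are estimated, swapping characters amounts to swapping the resource table keyed by those IDs, as in this illustrative sketch (the character names and file paths are hypothetical):

```python
# Per-character resource tables: the same motion unit IDs resolve to different
# animation clips, so no model retraining is needed to support a new character.
MOTION_RESOURCES = {
    "anna": {1: "anna/greeting.anim", 45: "anna/neutral.anim", 90: "anna/idle.anim"},
    "rabbit": {1: "rabbit/greeting.anim", 45: "rabbit/neutral.anim", 90: "rabbit/idle.anim"},
}

def resolve_motion_units(character: str, unit_ids: list) -> list:
    table = MOTION_RESOURCES[character]
    return [table[uid] for uid in unit_ids]   # look up each estimated unit ID

print(resolve_motion_units("rabbit", [1, 45, 90]))
```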
  • the motion estimation unit may output a motion reproduction time as well as a motion unit ID using a machine learning model.
  • in the training data, text and TTS meta information (710, 730) as well as playback time information and motion unit IDs (720, 740) are all included.
  • the machine learning model may be trained to output playback time information and motion unit IDs 720 and 740 when text and TTS meta information 710 and 730 are input.
  • the text may include a plurality of sentences as well as a single sentence.
  • Training data 1 (710) includes text 1, "Hello. Nice to meet you. I'm Dubong.", and meta information of the synthesized sound for text 1 (e.g., syllable or phoneme information and playback time information for each syllable or phoneme).
  • the emotion information included in the training data 1 710 may be emotion information included in the meta information of the synthesized sound or emotion information extracted from the emotion analyzer based on the text.
  • Training data 2 (720) includes motion unit ID 1, "Unit 1 (greeting motion), Unit 45 (neutral motion), Unit 78 (tapping motion), Unit 90 (idle motion)", and motion reproduction time information 1 corresponding to those motion unit IDs.
  • Motion playback time information may include playback start time information and playback length information of each motion.
  • a connection motion may be performed between each motion to make the transitions between motions natural, and an idle motion that wraps up the motions may be added at the end of all motions for the text.
  • Playback time information of the motion estimated by the motion estimator may be estimated in consideration of the connection relationship between motions, so that each motion connects naturally to the next motion.
  • training data may include data including a motion connection relationship. For example, data in a form including a connection relationship of each motion, such as “1-Connection-45-Connection-78-Connection-90”, may be included in the training data.
  • Training data 3 (730), which is another piece of training data, includes, similarly to training data 1 (710), text 2 and meta information of the synthesized sound for text 2 (e.g., phoneme or syllable information and playback time information for each phoneme or syllable), and training data 4 (740) includes motion unit ID 2 and motion reproduction time information 2, similarly to training data 2 (720).
  • the machine learning model may be trained with training data in the form of pairs, such as training data 1 (710) with training data 2 (720), and training data 3 (730) with training data 4 (740), and the motion estimation unit may estimate the motion unit ID and playback time information (and connection relationship) of each motion based on the input text and the meta information of the synthesized sound, using the machine learning model trained in this way.
  • since the motion estimation unit estimates the motion unit IDs and the playback time information (and connection relationships) of each motion, and synthesizes the motions and generates the animation based on them, it is possible to generate an avatar animation that plays naturally without interruption.
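  • The paired training data of FIG. 7 could be laid out roughly as below; the field names, the example durations, and the connection notation mirror the description above but are otherwise assumptions made for this sketch:

```python
training_example = {
    # input side (training data 1 / 3): text plus TTS meta information
    "text": "Hello. Nice to meet you. I'm Dubong.",
    "tts_meta": {
        "total_duration_sec": 4.2,
        "phonemes": [("HH", 0.00, 0.08), ("AH", 0.08, 0.10)],  # (phoneme, start, length)
    },
    # target side (training data 2 / 4): motion unit IDs with playback times
    "motion_units": [
        {"id": 1,  "label": "greeting", "start_sec": 0.0, "length_sec": 1.2},
        {"id": 45, "label": "neutral",  "start_sec": 1.2, "length_sec": 1.5},
        {"id": 78, "label": "tapping",  "start_sec": 2.7, "length_sec": 1.0},
        {"id": 90, "label": "idle",     "start_sec": 3.7, "length_sec": 0.5},
    ],
    # connection relationship between consecutive units
    "connections": "1-Connection-45-Connection-78-Connection-90",
}
```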
  • FIG. 8 is a diagram illustrating blendshape information used when a lip generator generates a lip sync animation according to an embodiment of the present disclosure.
  • animation of a face part and/or lip sync animation among motions or motion units may be generated using a blendshape-based animation technique.
  • motion of a face part and/or lip sync animation may be generated by a method of combining a plurality of blend shapes (or morphing targets) using a morphing method.
  • a plurality of blend shapes, including a fourth blend shape with a smiling lip, may be designated, and a facial expression may be generated by assigning a weight value (e.g., 0 or more and 1 or less) to each blend shape and blending them.
  • a facial part of motion may be created by combining a plurality of facial expressions in a morphing method.
  • the lip sync animation may be generated in a manner different from the motion of the face part.
  • a plurality of mouth shape images 820 representing mouth shapes 830 when each phoneme 810 and 840 is pronounced may be designated.
  • the lip generating unit may extract mouth shape images 820 representing the mouth shapes corresponding to the phoneme information included in the meta information of the synthesized sound, and create an animation by combining the plurality of mouth shape images 820 so that the mouth shape image 820 representing the mouth shape 830 corresponding to a specific phoneme 810, 840 is reproduced at the time the phoneme 810, 840 is pronounced.
  • the lip generator may generate a natural lip sync animation using an arbitrary 3D graphic technique such as an interpolation technique or a morphing technique.
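  • One simple way to realize the phoneme-to-mouth-shape scheme above is to place one keyframe per phoneme and blend between them; the sketch below is an illustration only, and the mouth-shape labels and mapping table are assumptions:

```python
from bisect import bisect_right

# Hypothetical phoneme -> mouth shape (viseme) table
PHONEME_TO_MOUTH = {"HH": "open_small", "AH": "open_wide", "L": "tongue_up", "OW": "round"}

def mouth_keyframes(phonemes):
    """phonemes: list of (phoneme, start_sec) taken from the TTS meta information."""
    return [(start, PHONEME_TO_MOUTH.get(p, "closed")) for p, start in phonemes]

def mouth_shape_at(keyframes, t: float) -> str:
    """Pick the mouth shape active at time t (a renderer would interpolate/morph)."""
    times = [k[0] for k in keyframes]
    i = bisect_right(times, t) - 1
    return keyframes[max(i, 0)][1]

keys = mouth_keyframes([("HH", 0.00), ("AH", 0.08), ("L", 0.20), ("OW", 0.27)])
print(mouth_shape_at(keys, 0.22))   # -> "tongue_up"
```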
  • a predefined blendshape format or a blendshape application programming interface may be used to create a general-purpose motion or animation.
  • the method 900 may be initiated by the processor of the information processing system synthesizing the voice of the avatar based on the input text (S910).
  • meta information may be extracted from the synthesized voice, and the meta information may include at least one of reproduction time information, phoneme information, and speech strength information of the synthesized voice.
  • the processor may generate a lip sync animation based on the meta information of the synthesized voice (S920). For example, the processor may extract one or more mouth shape images for lip sync animation corresponding to phoneme information included in the meta information, and generate the lip sync animation by synthesizing the extracted one or more mouth shape images.
  • the lip sync animation may be an animation including a shape of an avatar's mouth changing based on meta information of a synthesized voice.
  • the processor may estimate the motion of the avatar based on the input text (S930). For example, the processor may extract one or more motion unit IDs corresponding to the input text and one or more motion units corresponding to each motion unit ID. Additionally, the processor may further consider the meta information as well as the input text and estimate the motion of the avatar based on the input text and the meta information. The processor may also estimate the playback time of the motion unit corresponding to each motion unit ID, in addition to the motion unit ID corresponding to the input text. In one embodiment, the playback time of a motion unit may be the playback time up to a point where the connection with the next motion unit is natural; when generating the motion animation, each motion unit is played only for that time and then connected to the next unit, which makes it possible to create more natural animations.
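  • As an illustration of the "play only up to the natural connection point" idea in S930 (the dictionary fields below are invented for this sketch), the estimated units can be laid end to end on a timeline, each trimmed to its estimated playback time:

```python
def build_motion_timeline(units):
    """units: list of dicts with 'id' and 'play_sec' estimated by the model.

    Each unit is trimmed to its estimated playback time so that it hands over
    to the next unit at a point where the transition looks natural.
    """
    timeline, t = [], 0.0
    for unit in units:
        timeline.append({"id": unit["id"], "start": t, "end": t + unit["play_sec"]})
        t += unit["play_sec"]
    return timeline

units = [{"id": 1, "play_sec": 1.2}, {"id": 45, "play_sec": 1.5}, {"id": 90, "play_sec": 0.5}]
print(build_motion_timeline(units))
```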
  • the processor may extract emotion information from input text and estimate motion of the avatar based on at least one of the input text and emotion information.
  • to estimate the motion of the avatar based on the input text (and/or emotion information), the processor may use a machine learning model trained to receive the input text (and/or emotion information) as input and estimate motion associated with it.
  • the input text used in steps S910 to S930 may be generated by a language generation model.
  • text generated by a language generation model as an answer to a specific text may be input text.
  • the processor may generate a full body animation of the avatar by synthesizing the lip sync animation and the estimated motion (S940).
  • the process of generating the full body animation of the avatar by synthesizing the lip sync animation and the estimated motion may include synchronizing the playback time of the lip sync animation with the playback time of the estimated motion.
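  • Synchronizing the two tracks can be as simple as resampling one of them to the common playback length, as in the sketch below; linear time-stretching is an assumption for this illustration, not the synchronization method claimed in the application:

```python
def stretch_track(frames, target_len: int):
    """Linearly resample a per-frame track so both animations share one timeline."""
    if target_len <= 1 or len(frames) <= 1:
        return list(frames[:1]) * max(target_len, 0)
    scale = (len(frames) - 1) / (target_len - 1)
    return [frames[round(i * scale)] for i in range(target_len)]

lip_frames = ["m0", "m1", "m2", "m3"]                  # lip sync animation frames
motion_frames = ["p0", "p1", "p2", "p3", "p4", "p5"]   # estimated motion frames
lip_synced = stretch_track(lip_frames, len(motion_frames))
# lip_synced now has 6 frames aligned with the motion's playback time
```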
  • the above method may be provided as a computer program stored in a computer readable recording medium to be executed on a computer.
  • the medium may continuously store programs executable by a computer or temporarily store them for execution or download.
  • the medium may be various recording means or storage means in the form of single or combined hardware, but is not limited to a medium directly connected to a certain computer system, and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and hardware devices configured to store program instructions, such as ROM, RAM, and flash memory.
  • examples of other media include recording media or storage media managed by an app store that distributes applications, a site that supplies or distributes various other software, and a server.
  • the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, a computer, or a combination thereof.
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, eg, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other configuration.
  • the techniques may also be implemented as instructions stored on a computer-readable medium, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), magnetic or optical data storage device, or the like. Instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.
  • Computer readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a computer.
  • such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.
  • for example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, whereas discs reproduce data optically using a laser. Combinations of the above should also be included within the scope of computer-readable media.
  • a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium can be coupled to the processor such that the processor can read information from or write information to the storage medium.
  • the storage medium may be integral to the processor.
  • the processor and storage medium may reside within an ASIC.
  • An ASIC may exist within a user terminal.
  • the processor and storage medium may exist as separate components in a user terminal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Quality & Reliability (AREA)

Abstract

The present disclosure relates to a method for generating a text-based avatar, performed by at least one processor of an information processing system. The avatar generation method comprises the steps of: synthesizing the voice of an avatar on the basis of input text; generating a lip sync animation on the basis of the meta information of the synthesized voice; estimating the motion of the avatar on the basis of the input text; and synthesizing the lip sync animation and the estimated motion so as to generate a full-body animation of the avatar.
PCT/KR2022/018321 2021-11-23 2022-11-18 Method and system for generating a text-based avatar WO2023096275A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210162774A KR20230075998A (ko) 2021-11-23 2021-11-23 Method and system for generating a text-based avatar
KR10-2021-0162774 2021-11-23

Publications (1)

Publication Number Publication Date
WO2023096275A1 true WO2023096275A1 (fr) 2023-06-01

Family

ID=86539942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/018321 WO2023096275A1 (fr) 2021-11-23 2022-11-18 Procédé et système de génération d'avatar textuel

Country Status (2)

Country Link
KR (2) KR20230075998A (fr)
WO (1) WO2023096275A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095672A (zh) * 2023-07-12 2023-11-21 支付宝(杭州)信息技术有限公司 一种数字人唇形生成方法及装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102593760B1 (ko) * 2023-05-23 2023-10-26 주식회사 스푼라디오 디지털 서비스 기반의 dj 별 맞춤형 가상 아바타 생성 서버, 방법 및 프로그램

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190111642A (ko) * 2018-03-23 2019-10-02 펄스나인 주식회사 실제 사진의 픽셀 기반의 토킹 헤드 애니메이션을 이용한 영상 처리 시스템 및 방법
KR102116315B1 (ko) * 2018-12-17 2020-05-28 주식회사 인공지능연구원 캐릭터의 음성과 모션 동기화 시스템
KR102116309B1 (ko) * 2018-12-17 2020-05-28 주식회사 인공지능연구원 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템
KR20200143659A (ko) * 2018-01-11 2020-12-24 네오사피엔스 주식회사 다중 언어 텍스트-음성 합성 방법
KR20210060196A (ko) * 2019-11-18 2021-05-26 주식회사 케이티 아바타 메시지 서비스를 제공하는 서버, 방법 및 사용자 단말

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200143659A (ko) * 2018-01-11 2020-12-24 네오사피엔스 주식회사 다중 언어 텍스트-음성 합성 방법
KR20190111642A (ko) * 2018-03-23 2019-10-02 펄스나인 주식회사 실제 사진의 픽셀 기반의 토킹 헤드 애니메이션을 이용한 영상 처리 시스템 및 방법
KR102116315B1 (ko) * 2018-12-17 2020-05-28 주식회사 인공지능연구원 캐릭터의 음성과 모션 동기화 시스템
KR102116309B1 (ko) * 2018-12-17 2020-05-28 주식회사 인공지능연구원 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템
KR20210060196A (ko) * 2019-11-18 2021-05-26 주식회사 케이티 아바타 메시지 서비스를 제공하는 서버, 방법 및 사용자 단말

Also Published As

Publication number Publication date
KR20240038941A (ko) 2024-03-26
KR20230075998A (ko) 2023-05-31

Similar Documents

Publication Publication Date Title
WO2023096275A1 (fr) Procédé et système de génération d'avatar textuel
WO2022048403A1 (fr) Procédé, appareil et système d'interaction multimodale sur la base de rôle virtuel, support de stockage et terminal
US8364488B2 (en) Voice models for document narration
US20200251089A1 (en) Contextually generated computer speech
KR20220008735A (ko) 애니메이션 인터랙션 방법, 장치, 기기 및 저장 매체
KR102116309B1 (ko) 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템
JP2001209820A (ja) 感情表出装置及びプログラムを記録した機械読み取り可能な記録媒体
US20160071302A1 (en) Systems and methods for cinematic direction and dynamic character control via natural language output
CN111145777A (zh) 一种虚拟形象展示方法、装置、电子设备及存储介质
WO2022170848A1 (fr) Procédé, appareil et système d'interaction humain-ordinateur, dispositif électronique et support informatique
WO2022196921A1 (fr) Procédé et dispositif de service d'interaction basé sur un avatar d'intelligence artificielle
JP2022530726A (ja) インタラクティブ対象駆動方法、装置、デバイス、及び記録媒体
KR20190109651A (ko) 인공지능 기반의 음성 모방 대화 서비스 제공 방법 및 시스템
WO2022242706A1 (fr) Production de réponse réactive à base multimodale
CN115497448A (zh) 语音动画的合成方法、装置、电子设备及存储介质
Yamamoto et al. Voice interaction system with 3D-CG virtual agent for stand-alone smartphones
WO2022196880A1 (fr) Procédé et dispositif de service d'interaction basé sur un avatar
JP2020529680A (ja) 通話中の感情を認識し、認識された感情を活用する方法およびシステム
CN116860924A (zh) 一种基于预设提示词数据生成具有模拟人格ai的处理方法
WO2019124850A1 (fr) Procédé et système de personnification et d'interaction avec un objet
CN110070869A (zh) 语音互动生成方法、装置、设备和介质
CN110166844B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
Altarawneh et al. Leveraging Cloud-based Tools to Talk with Robots.
KR102663162B1 (ko) 음성 합성 방법 및 시스템
WO2024144038A1 (fr) Synthèse de la parole et du mouvement d'un être humain virtuel, de bout en bout

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22898957

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE