WO2023068067A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2023068067A1 (Application: PCT/JP2022/037498)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- avatar
- information processing
- facial expression
- voice
- waveform
- Prior art date
Classifications
- G06F3/16 — Sound input; sound output
- G06F40/279 — Recognition of textual entities (natural language analysis)
- G06T13/40 — 3D animation of characters, e.g. humans, animals or virtual beings
- G06T19/00 — Manipulating 3D models or images for computer graphics
- G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L21/0272 — Voice signal separating (speech enhancement)
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state
Definitions
- The present disclosure relates to an information processing device, an information processing method, and a program.
- Non-verbal cues such as facial expressions, gestures, and hand movements play a major role in smooth communication. The same is true of communication using avatars.
- Smartphones already offer apps for creating avatars. These apps generate avatars based on information obtained through motion capture: the avatar's facial expression is generated by imitating the user's facial expression.
- In practice, however, the user's facial expression rarely fluctuates greatly; changes are mostly limited to small movements such as shifts in gaze and face orientation. It is therefore difficult to make such an avatar express rich emotions.
- The present disclosure proposes an information processing device, an information processing method, and a program that enable an avatar to express rich emotions.
- According to the present disclosure, an information processing device is provided that includes an emotion recognition unit that recognizes an emotion based on a voice waveform, a facial expression output unit that outputs a facial expression corresponding to the emotion, and an avatar synthesis unit that synthesizes an avatar showing the facial expression. The present disclosure also provides an information processing method in which the information processing of this device is executed by a computer, and a program for causing the computer to implement that information processing.
- FIG. 1 is a diagram showing an example of a communication support service.
- The communication support service supports communication between users U using avatars AB.
- The facial expressions and actions of an avatar AB are controlled based on the user U's emotion EM and degree of excitement (the magnitude of the emotion EM), both acquired by voice recognition.
- Compared to generating the avatar AB's facial expressions and actions with motion capture, this allows richer emotional expression, so information that is difficult to convey with words alone is better conveyed to the other party.
- The communication support service applies to both one-way and two-way communication.
- The communication support service is implemented by the information processing device 1 shown in FIG. 1.
- The information processing device 1 uses voice recognition to extract sound wave information about the voice waveform SD (see FIG. 8) and text information about the utterance content from the voice data SG (see FIG. 10).
- The information processing device 1 applies the sound wave features extracted from the voice waveform SD to the emotion database ED to estimate the user U's emotion EM.
- The information processing device 1 determines the degree of excitement at the time of the utterance based on the voice waveform SD and the utterance content.
- The information processing device 1 adjusts the facial expression indicating the emotion EM according to the degree of excitement, and determines the adjusted facial expression as the facial expression of the avatar AB.
- The information processing device 1 also collates the utterance content with the gesture database JD to infer what kind of gesture the user U is making at the time of the utterance.
- The information processing device 1 applies the gesture associated with the utterance content, together with the degree of excitement, to the action database AD (see FIG. 2), and thereby estimates an action AC that reflects the degree of excitement.
- The information processing device 1 controls the facial expression and action of the avatar AB based on the estimated facial expression and action AC of the user U.
- For example, suppose the voice "I did it!" is detected on the user U's terminal.
- The information processing device 1 extracts the voice waveform SD and the text "I did it!" from the voice data SG acquired from the user U's terminal, using known voice recognition technology.
- The information processing device 1 recognizes the emotion EM of joy from the voice waveform and determines the degree of excitement to be "high".
- The information processing device 1 determines a joyful facial expression for the avatar AB by adjusting the positions of the corners of the mouth according to the degree of excitement. It also infers, from the utterance "I did it!", a scene in which the user U celebrates with a fist pump or a banzai.
- The information processing device 1 selects a banzai pose according to the degree of excitement and outputs it as the action AC of the avatar AB.
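The flow above can be pictured as a lookup pipeline. The following is a minimal, hypothetical Python sketch of that pipeline; the database contents, feature labels, and function names are illustrative assumptions, not taken from the publication:

```python
# Toy stand-ins for the emotion database ED, gesture database JD,
# facial expression database FD, and action database AD.
EMOTION_DB = {"rising_pitch_loud": ("joy", "high")}      # sound features -> (emotion EM, excitement)
GESTURE_DB = {"I did it": "celebrate"}                   # utterance text -> gesture
EXPRESSION_DB = {("joy", "high"): "mouth_corners_up_strong"}
ACTION_DB = {("celebrate", "high"): "banzai_pose"}

def process_utterance(sound_feature: str, text: str):
    """Map one utterance to an avatar facial expression and action AC."""
    emotion, excitement = EMOTION_DB[sound_feature]      # estimate EM and excitement from the waveform
    gesture = GESTURE_DB[text]                           # collate utterance content with JD
    expression = EXPRESSION_DB[(emotion, excitement)]    # adjust expression by excitement via FD
    action = ACTION_DB[(gesture, excitement)]            # adjust action by excitement via AD
    return expression, action

print(process_utterance("rising_pitch_loud", "I did it"))
# -> ('mouth_corners_up_strong', 'banzai_pose')
```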
- The avatar AB is not necessarily human; a dog, a cat, or the like can also be used as the character of the avatar AB.
- Facial expressions and the body movements that make up gestures (actions AC) differ for each type of character (human, dog, cat, etc.).
- The information processing device 1 therefore has a separate facial expression database FD and action database AD for each type of character.
- FIG. 2 is a block diagram showing an example of the functional configuration of the information processing device 1.
- The information processing device 1 includes a voice input unit 10, a voice waveform recognition unit 11, a text recognition unit 12, an emotion recognition unit 13, a gesture recognition unit 14, a facial expression output unit 15, an action output unit 16, an avatar synthesis unit 17, a background synthesis unit 18, and a video output unit 19.
- The voice input unit 10 outputs the voice data SG acquired from the user U's terminal to the voice waveform recognition unit 11 and the text recognition unit 12.
- The voice waveform recognition unit 11 extracts the voice waveform SD (sound wave information) from the voice data SG.
- The text recognition unit 12 extracts text information (the utterance content) from the voice data SG using known speech recognition techniques.
- The emotion recognition unit 13 recognizes the user U's emotion and degree of excitement based on the voice waveform SD and the utterance content. Emotion and excitement are estimated mainly from the voice waveform SD (tone of voice, volume, and so on). The degree of excitement can also be estimated from characteristic phrases and wordings uttered when excited. The emotion recognition unit 13 detects the emotion and the degree of excitement by collating features extracted from the voice waveform SD and the utterance content with the emotion database ED.
- The gesture recognition unit 14 recognizes gestures based on the utterance content. Gestures include unconscious gestures and conscious gestures linked to speech. For example, making a fist pump when feeling joy or breaking down in tears when sad are unconscious gestures; miming eating a rice ball in conjunction with the utterance "I'm going to eat a rice ball now" is a conscious gesture.
- The gesture database JD defines the correspondence between utterances and gestures. The gesture recognition unit 14 estimates the user U's gesture at the time of the utterance by collating the utterance content with the gesture database JD.
- The facial expression output unit 15 outputs a facial expression according to the emotion EM.
- Humans have emotions such as joy, disgust, sadness, fear, and anger, and each emotion is assigned a standard facial expression.
- Joy is assigned an expression with raised cheeks, lowered eyebrows and eyelids, and wrinkles under the eyes.
- Disgust is assigned an expression with a protruding upper lip, lowered eyebrows, and wrinkles running from the base of the nostrils to the corners of the lips.
- Sadness is assigned an expression with a downward gaze and drooping upper eyelids.
- Fear is assigned an expression with raised upper eyelids, a lowered chin, and an open mouth.
- Anger is assigned an expression with furrowed brows and wide-open eyes.
- The facial expression output unit 15 adjusts the standard facial expression assigned to the emotion EM according to the degree of excitement. For example, when strong excitement is detected for the emotion EM of joy, the facial expression output unit 15 accentuates the raised cheeks, the lowered eyebrows and eyelids, the raised corners of the mouth, and so on. When strong excitement is detected for the emotion EM of sadness, it outputs a lamenting expression with the mouth open. The relationship between emotions, excitement levels, and facial expressions is defined in the facial expression database FD; the facial expression output unit 15 outputs a facial expression reflecting the emotion EM and the degree of excitement by collating them with the facial expression database FD.
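One way to realize this adjustment is to treat the standard facial expression as a set of weights and scale them by the degree of excitement. The sketch below is a hypothetical illustration; the blendshape-style weight names and the scaling rule are assumptions, not the publication's method:

```python
# Standard expressions as weight sets (0.0-1.0), loosely following the
# descriptions above; "jaw_open" starts near zero for sadness so that only
# strong excitement opens the mouth into a lamenting expression.
STANDARD_EXPRESSIONS = {
    "joy":     {"cheek_raise": 0.5, "mouth_corner_up": 0.5, "eyelid_lower": 0.3},
    "sadness": {"gaze_down": 0.6, "upper_eyelid_drop": 0.5, "jaw_open": 0.1},
}

def adjust_expression(emotion: str, excitement: float) -> dict:
    """Scale the standard weights for `emotion` by excitement;
    1.0 keeps the standard expression, >1.0 accentuates it."""
    base = STANDARD_EXPRESSIONS[emotion]
    return {name: min(1.0, w * excitement) for name, w in base.items()}

print(adjust_expression("sadness", 1.8))  # strong sadness: mouth opens slightly
```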
- The action output unit 16 outputs the action AC of the avatar AB corresponding to the gesture.
- The action output unit 16 adjusts the standard action AC corresponding to the gesture according to the degree of excitement. For example, when strong excitement (deep sadness) is detected for the emotion EM of sadness, the action output unit 16 outputs an action AC such as falling to one's knees and bowing one's head.
- The relationship between gestures, excitement levels, and actions AC is defined in the action database AD. The action output unit 16 collates the gesture and the degree of excitement with the action database AD to output an action AC reflecting both.
- The avatar synthesis unit 17 acquires 3D data of the character for the avatar AB. The character may be selected manually based on user input information, or automatically based on the voice data SG.
- The avatar synthesis unit 17 synthesizes the avatar AB showing the facial expression and action AC acquired from the facial expression output unit 15 and the action output unit 16, using the character's 3D data.
- The background synthesis unit 18 synthesizes a background BG (see FIG. 9) according to the scene estimated based on the voice waveform SD and the utterance content.
- For example, a rain background BG is set based on the sound of rain (in the voice waveform SD), and a background BG of an Italian city is set based on the utterance "I went on a trip to Italy."
- The video output unit 19 outputs a video VD (see FIG. 10) including the avatar AB and the background BG.
- The video output unit 19 determines, based on the mute setting, whether to include in the video VD the voice data SG acquired by the voice input unit 10. When the mute setting is OFF, the video output unit 19 outputs the video VD including the voice data SG from which the voice waveform SD was extracted; when the mute setting is ON, it outputs the video VD without the voice data SG.
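A sketch of this mute behaviour, with assumed interfaces: the voice data always drives avatar generation upstream, but it is attached to the distributed video only when the mute setting is OFF.

```python
def output_video(avatar_frames: list, voice_data: bytes, mute_on: bool) -> dict:
    """Assemble the video VD; include the voice data SG only when unmuted."""
    video = {"frames": avatar_frames}   # the avatar is always rendered
    if not mute_on:
        video["audio"] = voice_data     # attach the user's voice only when mute is OFF
    return video
```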
- FIG. 3 is a diagram showing an example of an emotion recognition method.
- The emotion recognition unit 13 recognizes the emotion EM based on the voice waveform SD. For example, the emotion recognition unit 13 uses the fundamental frequency (tone of voice), the volume, the speed of speech, and the turn-taking latency as voice parameters, and extracts from the voice waveform SD features relating to the values and temporal changes of these parameters as sound wave features.
- The emotion database ED defines the correspondence between sound wave features and emotions EM. The emotion recognition unit 13 detects the user U's emotion EM at the time of the utterance by collating the sound wave features extracted from the voice waveform SD with the emotion database ED.
- The emotion recognition unit 13 recognizes the user U's degree of excitement based on the voice waveform SD and the utterance content. For example, it extracts, as incidental features, the appearance frequency of specific words used when excited, the speed of speech, and features relating to temporal changes in the fundamental frequency.
- The emotion database ED also defines the correspondence between incidental features and excitement levels. The emotion recognition unit 13 detects the degree of excitement associated with the emotion EM by collating the incidental features extracted from the voice waveform SD and the utterance content with the emotion database ED.
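As a rough illustration of extracting two of the sound wave features named above (fundamental frequency and volume) from a mono waveform, the NumPy sketch below uses a naive autocorrelation pitch estimate; a real system would use a dedicated pitch tracker, and this feature set is only a subset of what the publication describes.

```python
import numpy as np

def rms_volume(x: np.ndarray) -> float:
    """Volume as root-mean-square signal strength."""
    return float(np.sqrt(np.mean(x ** 2)))

def fundamental_frequency(x: np.ndarray, sr: int,
                          fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Naive F0 estimate: strongest autocorrelation lag in the speech range."""
    x = x - np.mean(x)
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)               # plausible pitch periods
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
voice = 0.3 * np.sin(2 * np.pi * 220 * t)                 # synthetic 220 Hz "voice"
print(fundamental_frequency(voice, sr), rms_volume(voice))  # ~220 Hz, ~0.21
```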
- The analysis algorithm for the emotion EM and the degree of excitement may be based on specific thresholds, or on a trained machine-learning model.
- In the above description, the emotion EM is estimated based on the fundamental frequency, signal strength, speed of speech, and turn-taking latency, but the emotion estimation method is not limited to this. The emotion EM may also be estimated using a known emotion estimation technology such as ST (Sensibility Technology: AGI).
- FIG. 4 is a diagram showing an example of a recognition method for the action AC.
- FIG. 4 shows examples of actions AC recognized for utterances such as "Hi", "Bye-bye", "Welcome", "Understood", "Like", "Surprise!", "Sad", "Ahaha", and "Please".
- The actions AC shown in FIG. 4 are performed unconsciously, driven by the user U's emotions.
- "Hi" is associated with a gesture of raising a hand in greeting; "Bye-bye" is associated with a gesture of waving goodbye.
- The gesture database JD defines the correspondence between utterances and gestures. The gesture recognition unit 14 recognizes the user U's gesture at the time of the utterance by collating the utterance content with the gesture database JD.
- The action database AD defines a standard action (standard body movement) for each gesture.
- For utterances whose action AC does not change with the degree of excitement, such as "Hi", "Bye-bye", "Welcome", "Understood", and "Like", the action output unit 16 outputs the standard action assigned to the gesture as the action AC of the avatar AB.
- For utterances whose gesture changes according to the degree of excitement, such as "Surprise!", "Sad", "Ahaha", and "Please", the action output unit 16 adjusts the standard action assigned to the gesture according to the degree of excitement.
- For example, the word "sad" is assigned a gesture of hanging the head forward.
- When the degree of sadness (the degree of excitement) is at a standard level, a standard action in which the angle and speed of hanging the head are at their standard values is output as the action AC of the avatar AB.
- When the sadness is mild (the degree of excitement is low), an action in which the drop angle of the head or its speed is smaller than the standard value is output as the action AC of the avatar AB.
- When the sadness is great (the degree of excitement is high), an action in which the drop angle or speed of the head is greater than the standard value, or an action of breaking down in tears, is output as the action AC of the avatar AB.
- The action database AD defines the correspondence between gestures, excitement levels, and actions AC. The action output unit 16 collates the gesture and the degree of excitement with the action database AD and outputs an action AC reflecting both.
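Numerically, this can be pictured as scaling the standard action parameters by the degree of excitement, with a switch to a stronger action above a threshold. All values and names in this sketch are assumptions for illustration:

```python
STANDARD_SAD = {"drop_angle_deg": 30.0, "drop_speed_deg_s": 20.0}  # assumed standard values

def sad_action(excitement: float) -> dict:
    """Head-hanging action AC scaled by excitement; great sadness switches action."""
    if excitement > 1.5:                    # great sadness
        return {"action": "break_down_in_tears"}
    return {
        "action": "hang_head",
        "drop_angle_deg": STANDARD_SAD["drop_angle_deg"] * excitement,
        "drop_speed_deg_s": STANDARD_SAD["drop_speed_deg_s"] * excitement,
    }

print(sad_action(0.5))   # mild sadness: smaller, slower head drop
print(sad_action(2.0))   # great sadness: breaking down in tears
```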
- FIG. 5 shows an example of a gesture made consciously in conjunction with an utterance.
- While avatars AB-1 and AB-2 are having a conversation, the user U of avatar AB-1 says "rice balls are delicious". In response, the facial expression output unit 15 outputs a joyful facial expression, and the action output unit 16 outputs an action AC of eating a rice ball.
- FIG. 6 shows another control example of the action AC of the avatar AB.
- The action output unit 16 estimates scenes from the voice waveform SD, such as a scene in which an upbeat song is playing, a scene in which the sound is hard to hear, and a scene in which the volume suddenly increases, and outputs an action AC according to the estimated scene.
- For example, the action output unit 16 can output actions AC such as blinking or occasionally nodding. It is also possible to recognize speakers and add new avatars AB, or to remove avatars AB that are not speaking from the screen.
- The gesture database JD also stores gestures corresponding to backchannel responses. The action output unit 16 outputs the standard actions corresponding to the gestures for "Yes", "No", and "I see" as actions of the avatar AB.
- FIG. 7 shows an example of moving the position of an avatar AB by voice.
- The volume of a conversation differs according to the distance between avatars AB: the voice of a nearby avatar AB is heard loudly, and the voices of distant avatars AB are heard softly.
- When the user U wants to bring their avatar AB closer to the avatar AB of friend A, the user U calls out friend A's name or says "talk to friend A".
- The gesture recognition unit 14 recognizes a gesture indicating movement, such as walking or running, in response to such a call from the user U, and the action output unit 16 outputs the corresponding action AC.
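The distance-dependent volume and the name-call movement trigger can be sketched as follows; the inverse-distance attenuation is an assumption (the publication gives no formula), and the name matching is deliberately simplistic:

```python
from typing import Optional

def conversation_gain(distance: float, ref: float = 1.0) -> float:
    """Louder when avatars are near, softer when far; clamped to [0, 1]."""
    return min(1.0, ref / max(distance, 1e-6))

def movement_from_utterance(text: str, friend_names: list) -> Optional[str]:
    """Return a walking action when the user calls a friend's name."""
    for name in friend_names:
        if name in text:
            return f"walk_towards:{name}"
    return None

print(conversation_gain(4.0))                                     # distant voice: 0.25
print(movement_from_utterance("talk to friend A", ["friend A"]))  # 'walk_towards:friend A'
```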
- FIG. 8 is a diagram showing an example of a method of setting the character CH for an avatar AB.
- The character CH can be selected automatically based on the voice data SG: the avatar synthesis unit 17 estimates a character CH that matches the user's voice quality based on the voice waveform SD and uses the estimated character CH's data to generate the avatar AB.
- The avatar synthesis unit 17 applies the voice waveform SD to a character analysis model trained on the voice waveforms of anime characters (step SA2). The character analysis model is machine-learned so that, given a voice waveform SD, it outputs anime characters whose voice quality is similar to the voice waveform SD.
- The avatar synthesis unit 17 thus retrieves one or more anime characters with a similar voice quality as character candidates CHC, and uses the one character candidate CHC selected based on user input information as the character CH for the avatar AB (steps SA3 and SA4).
- For example, several anime characters played by a voice actor VA whose voice quality is similar to the user U's are presented as character candidates CHC. The user U can select a favorite from among the presented character candidates CHC.
- To make it easy to select a character CH suited to emotional expression, each character candidate CHC can be given a facial expression corresponding to the voice waveform SD: the avatar synthesis unit 17 generates, for each of the retrieved character candidates CHC, a facial expression according to the voice waveform SD, and presents the generated facial expressions as selection targets.
- The user U considers the role each character candidate CHC plays in its anime, and selects the one character candidate CHC for which there is no discrepancy between the emotion to be expressed and the character's role. The avatar synthesis unit 17 generates the avatar AB using the selected character CH.
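One plausible realization of this search, assuming the character analysis model produces voice embeddings, is to rank character voices by cosine similarity to the user's voice embedding and present the top matches; the embedding model itself is treated as given, and all names here are illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def character_candidates(user_vec: np.ndarray,
                         character_vecs: dict, top_k: int = 3) -> list:
    """Rank character voices by similarity to the user's voice embedding."""
    scored = sorted(character_vecs.items(),
                    key=lambda kv: cosine(user_vec, kv[1]), reverse=True)
    return [name for name, _ in scored[:top_k]]  # candidates CHC for the user to pick from

print(character_candidates(
    np.array([0.2, 0.9, 0.1]),
    {"character_a": np.array([0.1, 0.8, 0.2]),
     "character_b": np.array([0.9, 0.1, 0.3])}))
# -> ['character_a', 'character_b']
```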
- FIG. 9 is a diagram showing an example of a background BG setting method.
- The background BG can be selected automatically based on the voice data SG: the background synthesis unit 18 extracts a waveform component representing the environmental sound ES from the voice waveform SD and determines the background BG based on the extracted component.
- The background synthesis unit 18 acquires the voice waveform SD from the voice waveform recognition unit 11 (step SB1), then uses a known sound source separation technique to remove the user U's voice from the voice waveform SD and extract only the waveform component representing the environmental sound ES (step SB2).
- The background synthesis unit 18 applies the extracted environmental sound ES to an environment analysis model that has learned the correspondence between environmental sounds ES and environments (step SB3). The environment analysis model is machine-learned so that, given an environmental sound ES, it outputs the environment in which that sound was generated.
- The background synthesis unit 18 retrieves one or more backgrounds representing environments similar to the one in which the environmental sound ES was generated as background candidates BGC, and uses the one background candidate BGC selected based on user input information as the background BG for the avatar AB (step SB4).
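Treating the separation and classification models as opaque callables, steps SB1 to SB4 reduce to the following sketch; every name here is an illustrative assumption:

```python
def choose_background(voice_waveform, separate_sources, classify_environment,
                      background_index, pick_one):
    """Steps SB1-SB4: waveform -> environmental sound -> environment -> background BG."""
    ambient = separate_sources(voice_waveform)["environment"]  # SB2: remove the user's voice
    env = classify_environment(ambient)                        # SB3: e.g. "rain"
    candidates = background_index.get(env, [])                 # candidate backgrounds BGC
    return pick_one(candidates)                                # SB4: user selects one
```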
- FIGS. 10 and 11 are diagrams showing system configuration examples of the communication support service. FIG. 10 shows an example in which the communication support service is applied to one-way communication; FIG. 11 shows an example in which it is applied to two-way communication. The communication support service is applicable to both.
- In one-way communication (FIG. 10), the user U acting as sender T transmits voice data SG to the information processing device 1-A (a server) to control the facial expression and action of the avatar AB. The information processing device 1-A transmits the video VD including the sender T's avatar AB to the receiver R.
- In two-way communication (FIG. 11), each user U connected to the information processing device 1-B (a server) acts as both sender T and receiver R (sender/receiver TR).
- Each sender/receiver TR can prevent their own voice data SG from being transmitted to the other senders/receivers TR by using the mute setting: with the mute setting ON, the user U distributes a video VD that does not include the voice data SG. Even when the mute setting is ON, however, the user U's terminal still transmits the voice data SG acquired by the microphone to the information processing device 1-B.
- The information processing device 1-B controls the facial expression and action of the avatar AB of a muted user U based on the received voice data SG. As a result, the avatar AB can be controlled appropriately without distributing the voice data to others.
- For example, a user U who does not plan to speak may turn off the voice transmission function in the application (mute setting: ON) and simply listen to the conference, to prevent their voice from being included in the conference by mistake.
- Even in this state, the user U's terminal keeps the microphone on and transmits the captured voice to the information processing device 1-B. The information processing device 1-B generates the user U's avatar AB based on the received voice data SG and distributes the video VD, but does not distribute the voice data SG itself to the other senders/receivers TR. This makes it possible to control the avatar AB while preventing the voice from being transmitted by mistake.
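In two-way operation, this amounts to the server using every participant's voice to drive that participant's avatar while filtering whose voice is redistributed. A minimal sketch under the same assumptions as the earlier snippets:

```python
def synthesize_avatar(voice: bytes) -> dict:
    """Stand-in for the emotion/gesture-driven avatar synthesis described above."""
    return {"frames": f"avatar driven by {len(voice)} bytes of voice"}

def distribute(participants: dict) -> dict:
    """participants: name -> {"voice": bytes, "mute_on": bool}. Every voice
    drives its avatar; only unmuted voices are redistributed."""
    avatars = {n: synthesize_avatar(p["voice"]) for n, p in participants.items()}
    audio = {n: p["voice"] for n, p in participants.items() if not p["mute_on"]}
    return {"avatars": avatars, "audio": audio}
```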
- FIG. 12 is a diagram showing a hardware configuration example of the information processing device 1.
- The information processing device 1 is implemented by a computer 1000. The computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input/output interface 1600, all connected by a bus 1050.
- The CPU 1100 operates based on programs stored in the ROM 1300 or the HDD 1400 and controls each unit. For example, the CPU 1100 loads programs stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processing corresponding to the various programs.
- The ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts, as well as programs dependent on the hardware of the computer 1000.
- The HDD 1400 is a computer-readable recording medium that non-transitorily records programs executed by the CPU 1100 and the data (including the various databases) used by those programs. Specifically, the HDD 1400 records the information processing program according to the present disclosure as an example of program data 1450.
- The communication interface 1500 connects the computer 1000 to an external network 1550 (for example, the Internet). Via the communication interface 1500, the CPU 1100 receives data from other devices and transmits data it has generated to other devices.
- The input/output interface 1600 connects input/output devices 1650 to the computer 1000. Via the input/output interface 1600, the CPU 1100 receives data from input devices such as a keyboard and mouse, and transmits data to output devices such as a display, speaker, or printer.
- The input/output interface 1600 may also function as a media interface for reading a program or the like recorded on a predetermined recording medium. Media include, for example, optical recording media such as DVDs (Digital Versatile Discs) and PDs (Phase change rewritable Disks), magneto-optical recording media such as MOs (Magneto-Optical disks), tape media, magnetic recording media, and semiconductor memories.
- The CPU 1100 of the computer 1000 implements the functions described above by executing programs loaded into the RAM 1200. The HDD 1400 stores the program for causing the computer to function as the information processing device 1. The CPU 1100 reads the program data 1450 from the HDD 1400 and executes it; as another example, these programs may be obtained from another device via the external network 1550.
- As described above, the information processing device 1 has the emotion recognition unit 13, the facial expression output unit 15, and the avatar synthesis unit 17. The emotion recognition unit 13 recognizes the emotion EM based on the voice waveform SD; the facial expression output unit 15 outputs a facial expression according to the emotion EM; and the avatar synthesis unit 17 synthesizes an avatar AB showing the output facial expression. The processing of the information processing device 1 is executed by the computer 1000, and the program of the present disclosure causes the computer 1000 to implement that processing.
- With this configuration, the avatar AB can express richer emotions than when facial expressions are generated by motion capture.
- The emotion recognition unit 13 recognizes the degree of excitement based on the voice waveform SD and the utterance content, and the facial expression output unit 15 outputs a facial expression reflecting that degree of excitement.
- The information processing device 1 also has the gesture recognition unit 14 and the action output unit 16. The gesture recognition unit 14 recognizes a gesture based on the utterance content, and the action output unit 16 outputs the action AC of the avatar AB corresponding to the gesture, reflecting the degree of excitement.
- The action output unit 16 further outputs an action AC according to the scene estimated based on the voice waveform SD.
- The information processing device 1 has the background synthesis unit 18, which synthesizes a background BG according to the scene estimated based on the voice waveform SD or the utterance content. The video of the background BG can thus be changed by sound.
- The background synthesis unit 18 extracts the waveform component representing the environmental sound ES from the voice waveform SD and determines the background BG based on it. It retrieves one or more backgrounds representing environments similar to the one in which the environmental sound ES was generated as background candidates BGC, and uses the one background candidate BGC selected based on user input information as the background for the avatar AB. An appropriate background BG that reflects the user U's preferences is thereby selected.
- The avatar synthesis unit 17 generates the avatar AB using data of the character CH estimated based on the voice waveform SD, so an avatar AB that matches the user U's voice quality is provided.
- The avatar synthesis unit 17 retrieves one or more anime characters with a voice quality similar to the voice waveform SD as character candidates CHC and uses the one character candidate CHC selected based on user input information as the character CH for the avatar AB. A favorite anime character that matches the user U's voice quality can thus be used as the avatar AB.
- The avatar synthesis unit 17 generates, for each retrieved character candidate CHC, a facial expression according to the voice waveform SD, and presents the generated facial expressions as selection candidates.
- The information processing device 1 has the video output unit 19, which outputs a video VD including the avatar AB. When the mute setting is OFF, the video output unit 19 outputs the video VD including the voice data SG from which the voice waveform SD was extracted; when the mute setting is ON, it outputs the video VD without the voice data SG.
- The present technology can also have the following configurations.
- (1) An information processing device comprising: an emotion recognition unit that recognizes an emotion based on a voice waveform; a facial expression output unit that outputs a facial expression according to the emotion; and an avatar synthesis unit that synthesizes an avatar showing the facial expression.
- (2) The information processing device according to (1), wherein the emotion recognition unit recognizes a degree of excitement based on the voice waveform and utterance content, and the facial expression output unit outputs the facial expression reflecting the degree of excitement.
- (3) The information processing device according to (2), further comprising: a gesture recognition unit that recognizes a gesture based on the utterance content; and an action output unit that outputs an action of the avatar according to the gesture.
- (4) The information processing device according to (3), wherein the action output unit outputs the action reflecting the degree of excitement.
- (5) The information processing device according to (3) or (4), wherein the action output unit outputs the action according to a scene estimated based on the voice waveform.
- (6) The information processing device according to any one of (1) to (5), further comprising a background synthesis unit that synthesizes a background according to a scene estimated based on the voice waveform or utterance content.
- (7) The information processing device according to (6), wherein the background synthesis unit extracts a waveform component representing an environmental sound from the voice waveform and determines the background based on the extracted waveform component.
- (8) The information processing device according to (7), wherein the background synthesis unit retrieves, as background candidates, one or more backgrounds representing an environment similar to the environment in which the environmental sound was generated, and uses one background candidate selected based on user input information as the background for the avatar.
- (9) The information processing device according to any one of (1) to (8), wherein the avatar synthesis unit generates the avatar using data of a character estimated based on the voice waveform.
- (10) The information processing device according to (9), wherein the avatar synthesis unit retrieves, as character candidates, one or more anime characters having a voice quality similar to the voice waveform, and uses one character candidate selected based on user input information as the character for the avatar.
- (11) The information processing device according to (10), wherein the avatar synthesis unit generates, for each of the retrieved one or more character candidates, a facial expression according to the voice waveform, and presents the generated facial expressions of the one or more character candidates as selection candidates.
- (12) The information processing device according to any one of (1) to (11), further comprising a video output unit that outputs a video including the avatar, wherein the video output unit outputs the video including the voice data from which the voice waveform was extracted when a mute setting is OFF, and outputs the video without the voice data when the mute setting is ON.
- (13) A computer-implemented information processing method comprising: recognizing an emotion based on a voice waveform; outputting a facial expression according to the emotion; and synthesizing an avatar showing the facial expression.
- (14) A program for causing a computer to execute: recognizing an emotion based on a voice waveform; outputting a facial expression according to the emotion; and synthesizing an avatar showing the facial expression.
Description
[1. Overview of the communication support service]
[2. Configuration of the information processing device]
[3. Voice recognition processing]
[3-1. Emotion and action recognition]
[3-2. Facial expression and action output]
[4. Character setting]
[5. Background setting]
[6. System configuration example]
[7. Hardware configuration example]
[8. Effects]
13 Emotion recognition unit
14 Gesture recognition unit
15 Facial expression output unit
16 Action output unit
17 Avatar synthesis unit
18 Background synthesis unit
19 Video output unit
AB Avatar
AC Action
BG Background
BGC Background candidate
CH Character
CHC Character candidate
EM Emotion
ES Environmental sound
SD Voice waveform
Claims (14)
1. An information processing device comprising: an emotion recognition unit that recognizes an emotion based on a voice waveform; a facial expression output unit that outputs a facial expression according to the emotion; and an avatar synthesis unit that synthesizes an avatar showing the facial expression.
2. The information processing device according to claim 1, wherein the emotion recognition unit recognizes a degree of excitement based on the voice waveform and utterance content, and the facial expression output unit outputs the facial expression reflecting the degree of excitement.
3. The information processing device according to claim 2, further comprising: a gesture recognition unit that recognizes a gesture based on the utterance content; and an action output unit that outputs an action of the avatar according to the gesture.
4. The information processing device according to claim 3, wherein the action output unit outputs the action reflecting the degree of excitement.
5. The information processing device according to claim 3, wherein the action output unit outputs the action according to a scene estimated based on the voice waveform.
6. The information processing device according to claim 1, further comprising a background synthesis unit that synthesizes a background according to a scene estimated based on the voice waveform or utterance content.
7. The information processing device according to claim 6, wherein the background synthesis unit extracts a waveform component representing an environmental sound from the voice waveform and determines the background based on the extracted waveform component.
8. The information processing device according to claim 7, wherein the background synthesis unit retrieves, as background candidates, one or more backgrounds representing an environment similar to the environment in which the environmental sound was generated, and uses one background candidate selected based on user input information as the background for the avatar.
9. The information processing device according to claim 1, wherein the avatar synthesis unit generates the avatar using data of a character estimated based on the voice waveform.
10. The information processing device according to claim 9, wherein the avatar synthesis unit retrieves, as character candidates, one or more anime characters having a voice quality similar to the voice waveform, and uses one character candidate selected based on user input information as the character for the avatar.
11. The information processing device according to claim 10, wherein the avatar synthesis unit generates, for each of the retrieved one or more character candidates, a facial expression according to the voice waveform, and presents the generated facial expressions of the one or more character candidates as selection targets.
12. The information processing device according to claim 1, further comprising a video output unit that outputs a video including the avatar, wherein the video output unit outputs the video including the voice data from which the voice waveform was extracted when a mute setting is OFF, and outputs the video without the voice data when the mute setting is ON.
13. A computer-implemented information processing method comprising: recognizing an emotion based on a voice waveform; outputting a facial expression according to the emotion; and synthesizing an avatar showing the facial expression.
14. A program for causing a computer to execute: recognizing an emotion based on a voice waveform; outputting a facial expression according to the emotion; and synthesizing an avatar showing the facial expression.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280068600.4A (CN118103872A) | 2021-10-18 | 2022-10-06 | Information processing device, information processing method, and program
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021170366 | 2021-10-18 | ||
JP2021-170366 | 2021-10-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023068067A1 (ja) | 2023-04-27 |
Family
ID=86058119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/037498 (WO2023068067A1) | Information processing device, information processing method, and program | 2021-10-18 | 2022-10-06
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118103872A (ja) |
WO (1) | WO2023068067A1 (ja) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010533006A * | 2007-03-01 | 2010-10-21 | Sony Computer Entertainment America LLC | System and method for communicating with a virtual world |
WO2017175351A1 | 2016-04-07 | 2017-10-12 | Sony Interactive Entertainment Inc. | Information processing device |
US20200162799A1 (en) * | 2018-03-15 | 2020-05-21 | International Business Machines Corporation | Auto-curation and personalization of sports highlights |
US20210142782A1 (en) * | 2019-11-13 | 2021-05-13 | Facebook Technologies, Llc | Generating a voice model for a user |
- 2022-10-06: WO application PCT/JP2022/037498 (WO2023068067A1, ja) — active, Application Filing
- 2022-10-06: CN application CN202280068600.4A (CN118103872A, zh) — active, Pending
Also Published As
Publication number | Publication date |
---|---|
CN118103872A (zh) | 2024-05-28 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22883372; Country of ref document: EP; Kind code of ref document: A1
 | WWE | WIPO information: entry into national phase | Ref document number: 2023554474; Country of ref document: JP
 | WWE | WIPO information: entry into national phase | Ref document number: 2022883372; Country of ref document: EP
 | NENP | Non-entry into the national phase | Ref country code: DE
 | ENP | Entry into the national phase | Ref document number: 2022883372; Country of ref document: EP; Effective date: 2024-05-21