WO2019168834A1 - Voice effects based on facial expressions - Google Patents

Voice effects based on facial expressions

Info

Publication number
WO2019168834A1
WO2019168834A1 (international application PCT/US2019/019554)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
avatar
video
audio signal
user
Prior art date
Application number
PCT/US2019/019554
Other languages
English (en)
Inventor
Sean A. Ramprashad
Carlos M. Avendano
Aram M. Lindahl
Original Assignee
Apple Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/908,603 (published as US20180336716A1)
Priority claimed from US16/033,111 (published as US10861210B2)
Application filed by Apple Inc.
Priority to DE112019001058.1T (published as DE112019001058T5)
Priority to CN201980016107.6A (published as CN111787986B)
Priority to KR1020207022657A (published as KR102367143B1)
Publication of WO2019168834A1

Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/20 - Input arrangements for video game devices
    • A63F13/21 - Input arrangements for video game devices characterised by their sensors, purposes or types
    • A63F13/213 - Input arrangements comprising photodetecting means, e.g. cameras, photodiodes or infrared cells
    • A63F13/215 - Input arrangements comprising means for detecting acoustic signals, e.g. using a microphone
    • A63F13/40 - Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
    • A63F13/42 - Processing input control signals by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
    • A63F13/424 - Processing input control signals involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
    • A63F13/60 - Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor

Definitions

  • Multimedia content, such as emojis, can be sent as part of messaging.
  • The emojis can represent a variety of predefined people, objects, actions, and/or other things.
  • Some messaging applications allow users to select from a predefined library of emojis, which can be sent as part of a message that can contain other content (e.g., other multimedia and/or textual content).
  • Animojis are one type of this other multimedia content, where a user can select an avatar (e.g., a puppet) to represent themselves. The animoji can move and talk as if it were a video of the user.
  • Animojis enable users to create personalized versions of emojis in a fun and creative way.
  • Embodiments of the present disclosure can provide systems, methods, and computer-readable medium for implementing avatar video clip revision and playback techniques.
  • A computing device can present a user interface (UI) for tracking a user's face and presenting a virtual avatar representation (e.g., a puppet or video character version of the user's face).
  • the computing device can capture audio and video information, extract and detect context as well as facial feature characteristics and voice feature characteristics, revise the audio and/or video information based at least in part on the extracted/identified features, and present a video clip of the avatar using the revised audio and/or video information.
  • a computer-implemented method for implementing various audio and video effects techniques may be provided.
  • the method may include displaying a virtual avatar generation interface.
  • The method may also include displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot in a field of view of the camera and associated changes in the headshot's appearance.
  • the method may also include detecting an input in the virtual avatar generation interface while displaying the first preview content of the virtual avatar.
  • In response to detecting the input in the virtual avatar generation interface, the method may also include: capturing, via the camera, a video signal associated with the user headshot during a recording session, capturing, via the microphone, a user audio signal during the recording session, extracting audio feature characteristics from the captured user audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the recording session, the method may also include: generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics, generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal, and presenting the second preview content in the virtual avatar generation interface.
  • The method may also include storing facial feature metadata associated with the facial feature characteristics extracted from the video signal and generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics. Additionally, the second preview of the virtual avatar may be displayed further according to the adjusted facial metadata. In some examples, the first preview of the virtual avatar may be displayed according to preview facial feature characteristics identified according to the changes in the appearance of the face during a preview session.
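  • The recorded-session flow summarized above (capture audio and video, extract features from each, generate an adjusted audio signal, then present a second preview) can be outlined in pseudocode. The following Python sketch is purely illustrative and is not the claimed implementation; every name in it (RecordingSession, extract_audio_features, and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RecordingSession:
    """Hypothetical container for one avatar recording session."""
    audio_samples: List[float] = field(default_factory=list)  # captured via the microphone
    facial_frames: List[Dict[str, float]] = field(default_factory=list)  # per-frame facial metadata


def extract_audio_features(audio: List[float]) -> Dict[str, float]:
    # Placeholder feature: overall level of the captured audio.
    return {"level": max((abs(s) for s in audio), default=0.0)}


def extract_facial_features(frames: List[Dict[str, float]]) -> Dict[str, float]:
    # Placeholder feature: average mouth opening over the clip.
    opens = [f.get("mouth_open", 0.0) for f in frames]
    return {"mouth_open_avg": sum(opens) / len(opens) if opens else 0.0}


def generate_adjusted_audio(audio, audio_feats, facial_feats) -> List[float]:
    # Toy adjustment: boost the level when a wide-open mouth was detected.
    gain = 1.5 if facial_feats["mouth_open_avg"] > 0.6 else 1.0
    return [s * gain for s in audio]


def second_preview(session: RecordingSession) -> Tuple[List[Dict[str, float]], List[float]]:
    audio_feats = extract_audio_features(session.audio_samples)
    facial_feats = extract_facial_features(session.facial_frames)
    adjusted = generate_adjusted_audio(session.audio_samples, audio_feats, facial_feats)
    # The second preview renders the avatar from the facial frames while
    # playing the adjusted audio instead of the raw recording.
    return session.facial_frames, adjusted
```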
  • an electronic device for implementing various audio and video effects techniques may be provided.
  • The system may include a camera, a microphone, a library of pre-recorded/pre-determined audio, and one or more processors in communication with at least the camera and the microphone.
  • The processors may be configured to execute computer-executable instructions to perform operations.
  • the operations may include detecting an input in a virtual avatar generation interface while displaying a first preview of a virtual avatar.
  • The operations may also include initiating a capture session in response to detecting the input in the virtual avatar generation interface.
  • the capture session may include: capturing, via the camera, a video signal associated with a face in a field of view of the camera, capturing, via the microphone, an audio signal associated with the captured video signal, extracting audio feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal.
  • The operations may also include generating an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics and presenting the second preview content in the virtual avatar generation interface, at least in response to detecting expiration of the capture session.
  • The audio signal may be further adjusted based at least in part on a type of the virtual avatar. Additionally, the type of the virtual avatar may be received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface. In some instances, the type of the virtual avatar may include an animal type, and the adjusted audio signal may be generated based at least in part on a predetermined sound associated with the animal type. The use and timing of predetermined sounds may be based on audio features from the captured audio and/or facial features from the captured video. This predetermined sound may also itself be modified based on audio features from the captured audio and facial features from the captured video.
  • The one or more processors may be further configured to determine whether a portion of the audio signal corresponds to the face in the field of view. Additionally, in accordance with a determination that the portion of the audio signal corresponds to the face, the portion of the audio signal may be stored for use in generating the adjusted audio signal, and/or in accordance with a determination that the portion of the audio signal does not correspond to the face, at least the portion of the audio signal may be discarded and not considered for modification and/or playback. Additionally, the audio feature characteristics may comprise features of a voice associated with the face in the field of view. In some examples, the one or more processors may be further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal.
  • The one or more processors may be further configured to store audio feature metadata associated with the audio feature characteristics extracted from the audio signal. Further, the one or more processors may be configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics, and the second preview of the virtual avatar may be generated according to the adjusted facial metadata and the adjusted audio signal.
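  • The determination of whether a portion of the audio signal corresponds to the face in the field of view could, for example, compare a per-block speech/no-speech decision against mouth-motion metadata for the same time span, discarding speech blocks with no matching mouth motion. The sketch below is a hypothetical illustration of that filtering step; the input formats and names are assumptions, not part of the disclosure.

```python
from typing import List, Tuple


def attribute_audio_to_face(
    audio_blocks: List[Tuple[List[float], bool]],  # (samples, is_speech) per block; hypothetical VAD output
    mouth_moving_flags: List[bool],                # one flag per block, derived from facial metadata
) -> List[List[float]]:
    """Keep only audio blocks that plausibly come from the tracked face.

    Blocks containing speech while the tracked mouth is not moving are treated
    as not corresponding to the face (e.g., a background talker) and are
    discarded rather than adjusted or played back.
    """
    kept = []
    for (samples, is_speech), mouth_moving in zip(audio_blocks, mouth_moving_flags):
        if is_speech and not mouth_moving:
            continue  # speech that does not correspond to the face in view
        kept.append(samples)
    return kept
```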
  • a computer-readable medium may be provided.
  • the computer-readable medium may include computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform operations.
  • The operations may include performing the following actions in response to detecting a request to generate an avatar video clip of a virtual avatar: capturing, via a camera of an electronic device, a video signal associated with a face in a field of view of the camera, capturing, via a microphone of the electronic device, an audio signal, extracting voice feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal.
  • the operations may also include performing the following actions in response to detecting a request to preview the avatar video clip: generating an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics, and displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.
  • The audio signal may be adjusted based at least in part on a facial expression identified in the facial feature characteristics associated with the face. In some instances, the audio signal may be adjusted based at least in part on a level, pitch, duration, format, or change in a voice characteristic associated with the face. Further, in some embodiments, the one or more processors may be further configured to perform operations comprising transmitting the video clip of the virtual avatar to another electronic device.
  • FIG. 1 is a simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 2 is another simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 3 is another simplified block diagram illustrating hardware and software components for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 4 is a flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 5 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 6 is a simplified block diagram illustrating a user interface for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 7 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 8 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 9 is a simplified block diagram illustrating a computer architecture for providing audio and/or video effects techniques as described herein, according to at least one example.
  • Certain embodiments of the present disclosure relate to devices, computer-readable medium, and methods for implementing various techniques for providing voice effects (e.g., revised audio) based at least in part on facial expressions. Additionally, in some cases, the various techniques may also provide video effects based at least in part on audio characteristics of a recording. Even further, the various techniques may also provide voice effects and video effects (e.g., together) based at least in part on one or both of facial expressions and audio characteristics of a recording. In some examples, the voice effects and/or video effects may be presented in a user interface (UI) configured to display a cartoon representation of a user (e.g., an avatar or digital puppet). Such an avatar that represents a user may be considered an animoji, as it may look like an emoji character familiar to most smartphone users; however, it can be animated to mimic actual motions of the user.
  • a user of a computing device may be presented with a UI for generating an animoji video (e.g., a video clip).
  • The video clip can be limited to a predetermined amount of time (e.g., 10 seconds, 30 seconds, or the like), or the video clip can be unlimited.
  • A preview area may present the user with a real-time representation of their face, using an avatar character. Various avatar characters may be provided, and a user may even be able to generate or import their own avatars.
  • The preview area may be configured to provide an initial preview of the avatar and a preview of the recorded video clip.
  • The recorded video clip may be previewed in its original form (e.g., without any video or audio effects) or it may be previewed with audio and/or video effects.
  • the user may select an avatar after the initial video clip has been recorded.
  • The video clip preview may then change from one avatar to another, with the same or different video effects applied to it, as appropriate. For example, if the raw preview (e.g., original form, without effects) is being viewed, and the user switches avatar characters, the UI may be updated to display a rendering of the same video clip but with the newly selected avatar.
  • The facial features and audio (e.g., the user's voice) that were captured during the recording can be presented from any of the avatars (e.g., without any effects).
  • In the preview, it will appear as if the avatar character is moving the same way the user moved during the recording and speaking what the user said during the recording.
  • a user may select a first avatar (e.g., a unicorn head) via the UI, or a default avatar can be initially provided.
  • The UI will present the avatar (in this example, the head of a cartoon unicorn if selected by the user, or any other available puppet by default) in the preview area, and the device will begin capturing audio and/or video information (e.g., using one or more microphones and/or one or more cameras).
  • only video information is needed for the initial preview screen.
  • The video information can be analyzed, and facial features can be extracted. These extracted facial features can then be mapped to the unicorn face in real-time, such that the initial preview of the unicorn head appears to mirror the user's face.
  • The term real-time is used to indicate that the results of the extraction, mapping, rendering, and presentation are performed in response to each motion of the user and can be presented substantially immediately. To the user, it will appear as if they are looking in the mirror, except the image of their face is replaced with an avatar.
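  • One way to picture the mirror-like initial preview is a simple per-frame loop: read a frame, extract facial features, map them onto the avatar rig, and draw the result, without persisting anything. The sketch below is illustrative only; camera, face_tracker, avatar, and renderer are hypothetical objects standing in for the device's actual capture and rendering components.

```python
import time


def run_initial_preview(camera, face_tracker, avatar, renderer, should_stop):
    """Hypothetical real-time preview loop; nothing captured here is stored permanently.

    camera.read_frame()      -> one raw video frame
    face_tracker.extract(f)  -> dict of facial feature values for that frame
    avatar.apply(features)   -> poses the puppet from those values
    renderer.draw(avatar)    -> presents the posed puppet in the UI
    """
    while not should_stop():
        frame = camera.read_frame()
        features = face_tracker.extract(frame)  # e.g., mouth_open, brow_raise, smile
        avatar.apply(features)                  # map the user's motion onto the puppet
        renderer.draw(avatar)                   # to the user this reads like a mirror
        time.sleep(1 / 30)                      # roughly 30 fps; frames are only cached
```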
  • While the user's face is in the line of sight (e.g., the view) of a camera of the device, the UI will continue to present the initial preview.
  • the device may begin to capture video that has an audio component. In some examples, this includes a camera capturing frames and a microphone capturing audio information.
  • A special camera may be utilized that is capable of capturing 3-dimensional (3D) information as well.
  • any camera may be utilized that is capable of capturing video.
  • the video may be stored in its original form and/or metadata associated with the video may be stored. As such, capturing the video and/or audio information may be different from storing the information.
  • Capturing the information may include sensing the information and at least caching it such that it is available for processing.
  • the processed data can also be cached until it is determined whether to store or simply utilize the data.
  • In some examples, the video data (e.g., metadata associated with the data) may not be stored permanently at all, such that the initial preview is not reusable or recoverable.
  • the video data and the audio data may be stored more permanently.
  • the audio and video (A/V) data may be analyzed, processed, etc., in order to provide the audio and video effects described herein.
  • the video data may be processed to extract facial features (e.g., facial feature characteristics) and those facial features may be stored as metadata for the animoji video clip.
  • the set of metadata may be stored with an identifier (ID) that indicates the time, date, and user associated with the video clip.
  • the audio data may be stored with the same or other ID.
  • the system may extract audio feature characteristics from the audio data and facial feature characteristics from the video file. This information can be utilized to identify context, key words, intent, and/or emotions of the user, and video and audio effects can be introduced into audio and video data prior to rendering the puppet.
  • The audio signal can be adjusted to include different words, sounds, tones, pitches, timing, etc., based at least in part on the extracted features.
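  • As a toy example of such an adjustment, the sketch below shifts the captured voice up or down by naive resampling depending on two assumed feature values (a brow-furrow score and an estimated pitch). The thresholds, feature names, and the resampling approach are illustrative assumptions; a production voice engine would use a proper pitch-shifter that can change pitch and duration independently.

```python
from typing import Dict, List


def resample(samples: List[float], rate_ratio: float) -> List[float]:
    """Naive linear-interpolation resampler.  Resampling changes pitch and
    duration together; a real voice engine would control them independently."""
    out, n, i = [], len(samples), 0.0
    while i < n - 1:
        lo = int(i)
        frac = i - lo
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
        i += rate_ratio
    return out


def adjust_voice(samples: List[float], audio_feats: Dict, facial_feats: Dict) -> List[float]:
    """Toy mapping from extracted features to an audio adjustment (illustrative thresholds)."""
    if facial_feats.get("brow_furrow", 0.0) > 0.5 and audio_feats.get("pitch_hz", 200.0) < 140.0:
        return resample(samples, 0.9)   # play back slower/lower for a growlier delivery
    if facial_feats.get("smile", 0.0) > 0.5:
        return resample(samples, 1.1)   # play back faster/higher for a brighter delivery
    return samples
```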
  • audio features are extracted in real-time during the preview itself. These audio features may be avatar specific, generated only if the associated avatar is being previewed.
  • The audio features may be avatar agnostic, generated for all avatars.
  • the audio signal can also be adjusted in part based on these real-time audio feature extractions, and with the pre-stored extracted video features which are created during or after the recording process, but before previewing.
  • A second preview of the puppet can be rendered. This rendering may be performed for each possible puppet, such that as the user scrolls through and selects different puppets, the adjusted data is already rendered. Or the rendering can be performed after selection of each puppet. In any event, once the user selects a puppet, the second preview can be presented. The second preview will replay the video clip that was recorded by the user, but with the adjusted audio and/or video. Using the example from above, if the user recorded themselves with an angry tone (e.g., with a gruff voice and a furrowed brow), the context or intent of anger may be detected, and the audio file may be adjusted to include a growling sound.
  • the second preview would look like a unicorn saying the words that the user said; however, the voice of the user may be adjusted to sound like a growl, or to make the tone more baritone (e.g., lower).
  • The user could then save the second preview or select it for transmission to another user (e.g., through a messaging application or the like).
  • The animoji video clips described above and below can be shared as .mov files.
  • the described techniques can be used in real-time (e.g., with video messaging or the like).
  • FIG. 1 is a simplified block diagram illustrating example flow 100 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording.
  • In FIG. 1, there are two separate sessions: recording session 102 and playback session 104.
  • Device 106 may capture video having an audio component of user 108 at block 110.
  • the video and audio may be captured (e.g., collected) separately, using two different devices (e.g., a microphone and a camera).
  • the capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 108.
  • User 108 may say the word "hello" at block 112.
  • device 106 may continue to capture the video and/or audio components of the user’s actions.
  • Device 106 can continue capturing the video and audio components, and in this example, user 108 may say the word "bark."
  • device 106 may also extract spoken words from the audio information.
  • In some examples, the spoken word extraction (or any audio feature extraction) may actually take place during the preview block 124 in real-time. It is also possible for the extraction (e.g., analysis of the audio) to be done in real-time while recording session 102 is still in process.
  • The avatar process being executed by device 106 may identify through the extraction that the user said the word "bark" and may employ some logic to determine what audio effects to implement.
  • Recording session 102 may end when user 108 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 10 seconds, 30 seconds, or the like).
  • In some examples, this time period may be automatically predetermined, while in others, it may be user selected (e.g., selected from a list of options or entered in free form through a text entry interface).
  • User 108 may select a preview affordance, indicating that user 108 wishes to watch a preview of the recording.
  • One option could be to play the original recording without any visual or audio effects.
  • Another option could be to play a revised version of the video clip. Based at least in part on detection of the spoken word "bark," the avatar process may have revised the audio and/or video of the video clip.
  • Device 106 may present avatar 118 (also called a puppet and/or animoji).
  • Device 106 may also be configured with speaker 120 that can play audio associated with the video clip.
  • Block 116 corresponds to the same point in time as block 110, where user 108 may have had his mouth open, but was not yet speaking.
  • avatar 118 may be presented with his mouth open; however, no audio is presented from speaker 120 yet.
  • the avatar process can present avatar 118 with an avatar-specific voice.
  • A predefined dog voice may be used to say the word "hello" at block 122.
  • the dog-voice word “hello” can be presented by speaker 120.
  • Each avatar may be associated with a particular pre-defined voice that best fits that avatar.
  • The sound of a dog bark may be inserted into the audio data (e.g., in place of the word "bark") such that when it is played back during presentation of the video clip, a "woof" is presented by speaker 120.
  • In some examples, different avatar-specific words will be presented at 124 based at least in part on different avatar selections, and in other examples, the same avatar-specific word may be presented regardless of the avatar selections. For example, if user 108 said "bark," a "woof" could be presented when the dog avatar is selected. However, in this same case, if user 108 later selected the cat avatar for the same flow, there are a couple of options for revising the audio.
  • The process could convert the "bark" into a "woof" even though it wouldn't be appropriate for a cat to "woof."
  • The process could convert "bark" into a recorded or simulated "meow," based at least in part on the selection of the cat avatar.
  • The process could ignore the "bark" for avatars other than the dog avatar.
  • There may be a second level of audio feature analysis performed even after the extraction at 114.
  • Video and audio features may also influence processing on the avatar-specific utterances. For example, the level and pitch and intonation with which a user says "bark" may be detected as part of the audio feature extraction, and this may direct the system to select a specific "woof" sample or transform such a sample before and/or during the preview process.
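  • A minimal sketch of this kind of keyword-to-sound mapping is shown below: the spotted word "bark" is dispatched to an avatar-specific replacement sample (or ignored for avatars with no mapping), and the detected level of the spoken word influences how the sample is played back. The table contents, file names, and threshold are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical per-avatar replacement sounds for the spotted keyword "bark".
AVATAR_SOUNDS = {
    "dog": "woof.wav",
    "cat": "meow.wav",
    # avatars with no entry keep the user's original word untouched
}


def replace_keyword(avatar_type: str, spotted_word: str, word_level_db: float) -> Optional[Tuple[str, float]]:
    """Return (sample_name, gain) to splice in, or None to leave the audio alone."""
    if spotted_word != "bark":
        return None
    sample = AVATAR_SOUNDS.get(avatar_type)
    if sample is None:
        return None  # e.g., a robot avatar: the cue is simply ignored
    # A louder, more emphatic delivery of "bark" selects a louder replacement.
    gain = 1.6 if word_level_db > -20.0 else 1.0
    return sample, gain
```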
  • FIG. 2 is another simplified block diagram illustrating example flow 200 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user’s recording.
  • In example flow 200, much like in example flow 100 of FIG. 1, there are two separate sessions: recording session 202 and playback session 204.
  • Device 206 may capture video having an audio component of user 208 at block 210. The capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 208. In some examples, user 208 may say the word "hello" at block 212. Additionally, at block 212, device 206 may continue to capture the video and/or audio components of the user's actions. At block 214, device 206 can continue capturing the video and audio components, and in this example, user 208 may hold his mouth open, but not say anything. At block 214, device 206 may also extract facial expressions from the video.
  • the facial feature extraction may actually take place after recording session 202 is complete. Still, it is possible for the extraction (e.g., analysis of the video) to be done in real-time while recording session 202 is still in process. In either case, the avatar process being executed by device 206 may identify through the extraction that the user opened his mouth briefly (e.g., without saying anything) and may employ some logic to determine what audio and/or video effects to implement. In some examples, the determination that the user held their mouth open without saying anything may require extraction and analysis of both audio and video. For example, extraction of the facial feature characteristics (e.g., open mouth) may not be enough, and the process may also need to detect that user 208 did not say anything during the same time period of the recording.
  • Video and audio features may also influence processing on the avatar specific utterances.
  • The duration of the opening of the mouth, opening of eyes, etc. may direct the system to select a specific "woof" sample or transform such a sample before and/or during the preview process.
  • One such transformation is changing the level and/or duration of the woof to match the detected opening and closing of the user’s mouth.
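  • The open-mouth-without-speech effect described above could be planned as in the sketch below: stretches where the mouth is open and no speech was captured are located, and one or more bark insertions are scaled to fill each stretch. The interval format and the nominal bark length are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s)


def plan_silent_mouth_effect(
    mouth_open_intervals: List[Interval],
    speech_intervals: List[Interval],
    bark_len_s: float = 0.4,  # nominal length of the pre-recorded bark sample
) -> List[Dict[str, float]]:
    """Plan bark insertions for stretches where the mouth is open but nothing was said."""

    def overlaps_speech(start: float, end: float) -> bool:
        return any(not (end <= s or start >= e) for s, e in speech_intervals)

    plan = []
    for start, end in mouth_open_intervals:
        if overlaps_speech(start, end):
            continue  # the user was talking during this opening: no effect here
        duration = end - start
        count = max(1, int(duration // bark_len_s))   # a long opening yields several barks
        stretch = duration / (count * bark_len_s)     # stretch each bark to fill the opening
        plan.append({"at": start, "count": count, "time_stretch": stretch})
    return plan
```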
  • recording session 202 may end when user 208 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 20 seconds, 30 seconds, or the like).
  • User 208 may select a preview affordance, indicating that user 208 wishes to watch a preview of the recording.
  • One option could be to play the original recording without any visual or audio effects. However, another option could be to play a revised version of the recording.
  • the avatar process may have revised the audio and/or video of the video clip.
  • Device 206 may present avatar 218 (also called a puppet and/or animoji).
  • Device 206 may also be configured with speaker 220 that can play audio associated with the video clip.
  • block 216 corresponds to the same point in time as block 210, where user 208 may not have been speaking yet.
  • avatar 218 may be presented with his mouth open; however, no audio is presented from speaker 220 yet.
  • the avatar process can present avatar 218 with an avatar-specific voice (as described above).
  • the avatar process may replace the silence identified at block 214 with an avatar-specific word.
  • The sound of a dog bark (e.g., a recorded or simulated dog bark) may be inserted into the audio data (e.g., in place of the silence) such that a "woof" is presented by speaker 220 during playback.
  • different avatar-specific words will be presented at 224 based at least in part on different avatar selections, and in other examples, the same avatar- specific word may be presented regardless of the avatar selections.
  • Each avatar may have a predefined sound to be played when it is detected that user 208 has held his mouth open for an amount of time (e.g., a half second, a whole second, etc.) without speaking.
  • the process could ignore the detection of the open mouth for avatars that don’t have a predefined effect for that facial feature.
  • The process may also detect how many "woof" sounds to insert (e.g., if the user held his mouth open for double the length of time used to indicate a bark) or whether it is not possible to insert the number of barks requested (e.g., in the scenario of FIG. 1, where the user would speak "bark" to indicate that a "woof" sound should be inserted).
  • The user device can be configured with software for executing the avatar process (e.g., capturing the A/V information, extracting features, analyzing the data, implementing the logic, revising the audio and/or video files, and rendering the previews) as well as software for executing an application (e.g., an avatar application with its own UI) that enables the user to build the avatar messages and subsequently send them to other user devices.
  • FIG. 3 is a simplified block diagram 300 illustrating components (e.g., software modules) utilized by the avatar process described above and below. In some examples, more or fewer modules can be utilized to implement the providing of audio and/or video effects based at least in part on audio and/or video features detected in a user's recording.
  • Device 302 may be configured with camera 304, microphone 306, and a display screen for presenting a UI and the avatar previews (e.g., the initial preview before recording as well as the preview of the recording before sending).
  • the avatar process is configured with avatar engine 308 and voice engine 310.
  • Avatar engine 308 can manage the list of avatars, process the video features (e.g., facial feature characteristics), revise the video information, communicate with voice engine 310 when appropriate, and render video of avatar 312 when all processing is complete and effects have been implemented (or discarded).
  • Revising of the video information can include adjusting or otherwise editing the metadata associated with the video file. In this way, when the video metadata (adjusted or not) is used to render the puppet, the facial features can be mapped to the puppet.
  • voice engine 310 can store the audio information, perform the logic for determining what effects to implement, revise the audio information, and provide modified audio 314 when all processing is complete and effects have been implemented (or discarded).
  • video features 316 can be captured by camera 304 and audio features 318 can be captured by microphone 306. In some cases there may be as many as (or more than) fifty facial features to be detected within video features 316.
  • Example video features include, but are not limited to, duration of expressions, open mouth, frowns, smiles, eyebrows up or furrowed, etc.
  • Video features 316 may include only metadata that identifies each of the facial features (e.g., data points that indicate which locations on the user's face moved or were in what position). Further, video features 316 can be passed to avatar engine 308 and voice engine 310. At avatar engine 308, the metadata associated with video features 316 can be stored and analyzed. In some examples, avatar engine 308 may perform the feature extraction from the video file prior to storing the metadata. However, in other examples, the feature extraction may be performed prior to video features 316 being sent to avatar engine 308 (in which case, video features 316 would be the metadata itself). At voice engine 310, video features 316 may be compared with audio features 318 when it is helpful to match up which audio features correspond to which video features (e.g., to see if certain audio and video features occur at the same time).
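  • Matching audio features to video features by time could be as simple as pairing each facial-feature sample with the nearest audio-feature block, as in the hypothetical sketch below; the field names and tolerance are illustrative.

```python
from typing import Dict, List, Tuple


def pair_features_by_time(
    video_features: List[Dict[str, float]],  # each dict carries a "t" timestamp in seconds
    audio_features: List[Dict[str, float]],
    tolerance_s: float = 0.05,
) -> List[Tuple[Dict[str, float], Dict[str, float]]]:
    """Pair each facial-feature sample with the nearest-in-time audio-feature block."""
    pairs = []
    for vf in video_features:
        nearest = min(audio_features, key=lambda af: abs(af["t"] - vf["t"]), default=None)
        if nearest is not None and abs(nearest["t"] - vf["t"]) <= tolerance_s:
            pairs.append((vf, nearest))
    return pairs


# Example use: flag moments where the mouth is open while the audio is silent.
# silent_open_times = [v["t"] for v, a in pairs if v["mouth_open"] > 0.5 and a["level"] < 0.01]
```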
  • audio features are also passed to voice engine 310 for storage.
  • Example audio features include, but are not limited to, level, pitch, and dynamics (e.g., changes in level, pitch, voicing, formants, duration, etc.).
  • Raw audio 320 includes the unprocessed audio file as it’s captured.
  • Raw audio 320 can be passed to voice engine 310 for further processing and potential (e.g., eventual) revision, and it can also be stored separately so that the original audio can be used if desired.
  • Raw audio 320 can also be passed to voice recognition module 322.
  • Voice recognition module 322 can be used to word spot and identify a user's intent from their voice.
  • Voice recognition module 322 can determine when a user is angry, sad, happy, or the like. Additionally, when a user says a key word (e.g., "bark" as described above), voice recognition module 322 will detect this. Information detected and/or collected by voice recognition module 322 can then be passed to voice engine 310 for further logic and/or processing. As noted, in some examples, audio features are extracted in real-time during the preview itself. These audio features may be avatar specific, generated only if the associated avatar is being previewed. The audio features may be avatar agnostic, generated for all avatars.
  • the audio signal can also be adjusted in part based on these real-time audio feature extractions, and with the pre-stored extracted video features which are created during or after the recording process, but before previewing. Additionally, some feature extraction may be performed during rendering at 336 by voice engine 310.
  • Some pre-stored sounds 338 may be used by voice engine 310, as appropriate, to fill in the blanks or to replace other sounds that were extracted.
  • voice engine 310 will make the determination regarding what to do with the information extracted from voice recognition module 322.
  • voice engine 310 can pass the information from voice recognition module 322 to feature module 324 for determining which features correspond to the data extracted by voice recognition module 322.
  • Feature module 324 may indicate (e.g., based on a set of rules and/or logic) that a sad voice detected by voice recognition module 322 corresponds to a raising of the pitch of the voice, or the slowing down of the speed or cadence of the voice.
  • Feature module 324 can map the extracted audio features to particular voice features.
  • effect type module 326 can map the particular voice features to the desired effect.
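  • The two mapping stages (recognized cue to voice features, then voice features to a concrete effect type) can be pictured as small rule tables, as in the illustrative sketch below. The specific cues, values, and effect names are assumptions, not the disclosed rule set.

```python
from typing import List, Tuple

# Illustrative rule tables for the two mapping stages described above.
CUE_TO_VOICE_FEATURES = {
    "sad":   {"pitch_shift_semitones": +2, "speed": 0.85},  # raise pitch, slow the cadence
    "angry": {"pitch_shift_semitones": -3, "speed": 1.0},
    "bark":  {"insert_sample": "woof"},
}

VOICE_FEATURE_TO_EFFECT = {
    "pitch_shift_semitones": "apply_pitch_shift",
    "speed": "apply_time_stretch",
    "insert_sample": "splice_prerecorded_sound",
}


def effects_for_cue(cue: str) -> List[Tuple[str, object]]:
    """Map a recognized cue (keyword or detected emotion) to concrete effect operations."""
    voice_features = CUE_TO_VOICE_FEATURES.get(cue, {})
    return [(VOICE_FEATURE_TO_EFFECT[name], value) for name, value in voice_features.items()]


# effects_for_cue("sad") -> [("apply_pitch_shift", 2), ("apply_time_stretch", 0.85)]
```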
  • Voice engine 310 can also be responsible for storing each particular voice for each possible avatar. For example, there may be standard or hardcoded voices for each avatar. Without any other changes being made, if a user selects a particular avatar, voice engine 310 can select the appropriate standard voice for use with playback. In this case, modified audio 314 may just be raw audio 320 transformed to the appropriate avatar voice based on the selected avatar. As the user scrolls through the avatars and selects different ones, voice engine 310 can modify raw audio 320 on the fly to make it sound like the newly selected avatar. Thus, avatar type 328 needs to be provided to voice engine 310 to make this change.
  • voice engine 310 can revise raw audio file 320 and provide modified audio 314.
  • The user will be provided with an option to use the original audio file at on/off 330. If the user selects "off" (e.g., effects off), then raw audio 320 can be combined with video of avatar 312 (e.g., corresponding to the unchanged video) to make A/V output 332.
  • A/V output 332 can be provided to the avatar application presented on the UI of device 302.
  • Avatar engine 308 can be responsible for providing the initial avatar image based at least in part on the selection of avatar type 328.
  • Avatar engine 308 is responsible for mapping video features 316 to the appropriate facial markers of each avatar. For example, if video features 316 indicate that the user is smiling, the metadata that indicates a smile can be mapped to the mouth area of the selected avatar so that the avatar appears to be smiling in video of avatar 312. Additionally, avatar engine 308 can receive timing changes 334 from voice engine 310, as appropriate. For example, if voice engine 310 determines that the voice effect is to make the audio more of a whispering voice (e.g., based on feature module 324 and/or effect type 326 and/or the avatar type), and modifies the voice to be more of a whispered voice, this effect change may include slowing down the voice itself, in addition to a reduced level and other formant and pitch changes.
  • The voice engine may produce modified audio which is slower in playback speed relative to the original audio file for the audio clip.
  • In this case, voice engine 310 would need to instruct avatar engine 308 via timing changes 334, so that the video file can be slowed down appropriately; otherwise, the video and audio would not be synchronized.
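  • A toy version of this whisper-plus-timing handoff is sketched below: the audio is attenuated and slowed, and the facial feature metadata is retimed by the same factor so the puppet stays in sync. The slowdown method and field names are illustrative assumptions; a real voice engine would also adjust formants and pitch.

```python
from typing import Dict, List, Tuple


def apply_whisper_effect(
    audio_samples: List[float],
    facial_metadata: List[Dict[str, float]],  # each frame dict carries a "t" timestamp
    slow_factor: float = 1.25,
    level: float = 0.4,
) -> Tuple[List[float], List[Dict[str, float]]]:
    """Quieter, slower audio plus a matching retiming of the facial metadata.

    Slowing the audio without retiming the facial metadata would desynchronize
    the puppet's mouth from the speech, so every frame timestamp is stretched by
    the same factor (mirroring the timing-changes handoff described above).
    """
    quiet = [s * level for s in audio_samples]
    # Naive slowdown by sample repetition; a real voice engine would also alter
    # formants and pitch, which this sketch does not attempt.
    slowed, acc = [], 0.0
    for s in quiet:
        acc += slow_factor
        while acc >= 1.0:
            slowed.append(s)
            acc -= 1.0
    retimed = [dict(frame, t=frame["t"] * slow_factor) for frame in facial_metadata]
    return slowed, retimed
```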
  • Voice engine 310 may be configured to make children's voices sound more high pitched or, alternatively, determine not to make a child's voice more high pitched because it would sound inappropriate given that raw audio 320 for a child's voice might already be high pitched. Making this user-specific determination of an effect could be driven in part by the audio features extracted, and in this case such features could include pitch values and ranges throughout the recording.
  • Voice recognition module 322 may include a recognition engine, a word spotter, a pitch analyzer, and/or a formant analyzer. The analysis performed by voice recognition module 322 will be able to identify if the user is upset, angry, happy, etc.
  • voice recognition module 322 may be able to identify context and/or intonation of the user’s voice, as well as change the intention of wording and/or determine a profile (e.g., a virtual identity) of the user.
  • The avatar process 300 can be configured to package/render the video clip by combining video of avatar 312 and either modified audio 314 or raw audio 320 into A/V output 332.
  • voice engine 310 just needs to know an ID for the metadata associated with video of avatar 312 (e.g., it does not actually need video of avatar 312, it just needs the ID of the metadata).
  • When the user sends a message within a messaging application (e.g., the avatar application), the last video clip to be previewed can be sent.
  • The cat avatar video would be sent when the user selects "send."
  • The state of the last preview can be stored and used later. For example, if the last message (e.g., avatar video clip) sent used a particular effect, the first preview of the next message being generated can utilize that particular effect.
  • voice engine 310 and/or avatar engine 308 can check for certain cues and/or features, and then revise the audio and/or video files to implement the desired effect.
  • Some example feature/effect pairs include: detecting that the user has opened their mouth and paused for a moment. In this example, both facial feature characteristics (e.g., mouth open) and audio feature characteristics (e.g., silence) need to happen at the same time in order for the desired effect to be implemented. For this feature/effect pair, the desired effect is to revise the audio and video so that the avatar appears to make an avatar/animal-specific sound.
  • computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context.
  • a particular video and/or audio feature may be employed for this effect.
  • the voice file may be adjusted to sound more somber, or to be slowed down.
  • the avatar-specific voice might be replaced with a version of the original (e.g., raw) audio to convey the seriousness of the message.
  • The context may be animal noises (e.g., based on the user saying "bark" or "meow" or the like). In this case, the determined effect would be to replace the spoken word "bark" with the sound of a dog barking.
  • record/send video clip affordance 604 may be represented as a red circle (or a plain circle without the line shown in FIG. 6) prior to the recording session beginning. In this way, the affordance will look more like a standard record button.
  • the appearance of record/send video clip affordance 604 may be changed to look like a clock countdown or other representation of a timer (e.g., if the length of video clip recordings is limited).
  • A user may use avatar selection affordance 606 to select an avatar. This can be done before recording of the avatar video clip and/or after recording of the avatar video clip. When selected before recording, the initial preview of the user's motions and facial characteristics will be presented as the selected avatar. Additionally, the recording will be performed while presenting a live (e.g., real-time) preview of the recording, with the user's face being represented by the selected avatar. Once the recording is completed, a second preview (e.g., a replay of the actual recording) will be presented, again using the selected avatar. However, at this stage, the user can scroll through avatar selection affordance 606 to select a new avatar to view the recording preview.
  • the UI upon selection of a new avatar, the UI will begin to preview the recording using the selected avatar.
  • The new preview can be presented with the audio/video effects or as originally recorded.
  • The determination regarding whether to present the effected version or the original may be based at least in part on the last method of playback used. For example, if the last playback used effects, the first playback after a new avatar selection may use effects. However, if the last playback did not use effects, the first playback after a new avatar selection may not use effects.
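  • Remembering whether the last playback used effects, so that the next preview defaults to the same mode, amounts to a small piece of state, sketched below for illustration only.

```python
class PreviewState:
    """Remember whether the last playback used effects so the next preview
    (e.g., after switching avatars) can default to the same mode.  Illustrative only."""

    def __init__(self) -> None:
        self.last_used_effects = False

    def record_playback(self, used_effects: bool) -> None:
        self.last_used_effects = used_effects

    def mode_for_next_preview(self) -> str:
        return "effects" if self.last_used_effects else "original"
```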
  • The user can replay the video clip with effects by selecting effects preview affordance 608 or without effects by selecting original preview affordance 610.
  • the user can send the avatar video in a message to another computing device using record/send video clip affordance 604.
  • the video clip will be sent using the format corresponding to the last preview (e.g., with or without effects).
  • Delete video clip affordance 612 may be selected to delete the avatar video and either start over or exit the avatar and/or messaging applications.
  • computing device 106 may display first preview content of a virtual avatar.
  • the first preview content may be a real-time representation of the user’s face, including movement and facial expressions.
  • the first preview would provide an avatar (e.g., cartoon character, digital/virtual puppet) to represent the user’s face instead of an image of the user’s face.
  • This first preview may be video only, or at least a rendering of the avatar without sound. In some examples, this first preview is not recorded and can be utilized for as long as the user desires, without limitation other than battery power or memory space of computing device 106.
  • Computing device 106 may extract voice feature characteristics from the audio signal, and at block 810, computing device 106 may extract facial feature characteristics from the video signal.
  • Computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics and voice feature characteristics.
  • The audio file captured at block 806 may be revised (e.g., adjusted) to include new sounds, new words, etc., and/or to have the original pitch, tone, volume, etc., adjusted.
  • These adjustments can be made based at least in part on the context detected via analysis of the facial feature characteristics and voice feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or based on specific motions, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
  • FIG. 9 is a simplified block diagram illustrating example architecture 900 for implementing the features described herein, according to at least one embodiment.
  • Computing device 902 (e.g., computing device 106 of FIG. 1) having example architecture 900 may be configured to present relevant UIs, capture audio and video information, extract relevant data, perform logic, revise the audio and video information, and present animoji videos.
  • Computing device 902 may be any type of computing device such as, but not limited to, a mobile phone (e.g., a smartphone), a tablet computer, a personal digital assistant (PDA), a laptop computer, a desktop computer, a thin-client device, a smart watch, a wireless headset, or the like.
  • Computing device 902 may include at least one memory 914 and one or more processing units (or processor(s)) 916.
  • Processor(s) 916 may be implemented as appropriate in hardware, computer-executable instructions, or combinations thereof.
  • Computer-executable instruction or firmware implementations of processor(s) 916 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.
  • Memory 914 may store program instructions that are loadable and executable on processor(s) 916, as well as data generated during the execution of these programs.
  • Memory 914 may be volatile (such as random access memory (RAM)) and/or non-volatile (such as read-only memory (ROM), flash memory, etc.).
  • Computing device 902 may also include additional removable storage and/or non-removable storage 926 including, but not limited to, magnetic storage, optical disks, and/or tape storage.
  • the disk drives and their associated non-transitory computer-readable media may provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computing devices.
  • Memory 914 may include multiple different types of memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), or ROM. While the volatile memory described herein may be referred to as RAM, any volatile memory that would not maintain data stored therein once unplugged from a host and/or power would be appropriate.
  • Memory 914 and additional storage 926, both removable and non-removable, are all examples of non-transitory computer-readable storage media.
  • Non-transitory computer-readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
  • Memory 914 and additional storage 926 are both examples of non-transitory computer storage media.
  • Memory 914 may include operating system 932 and/or one or more application programs or services for implementing the features disclosed herein, including user interface module 934, avatar control module 936, avatar application module 938, and messaging module 940.
  • Memory 914 may also be configured to store one or more audio and video files to be used to produce audio and video output. In this way, computing device 902 can perform all of the operations described herein.
  • User interface module 934 may be configured to manage the user interface of computing device 902.
  • User interface module 934 may present any number of various UIs requested by computing device 902.
  • User interface module 934 may be configured to present UI 600 of FIG. 6, which enables implementation of the features described herein, including communication with avatar process 300 of FIG. 3, which is responsible for capturing video and audio information, extracting appropriate facial feature and voice feature information, and revising the video and audio information prior to presentation of the generated avatar video clips as described above.
  • Avatar control module 936 may generate an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics, and display a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.
  • The various embodiments further can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices which can be used to operate any of a number of applications.
  • User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols.
  • Such a system also can include a number of workstations running any of a variety of commercially available operating systems and other known applications for purposes such as development and database management.
  • These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems and other devices capable of communicating via a network.
  • The system and various devices also typically will include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or browser. It should be appreciated that alternate embodiments may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices may be employed.
  • Non-transitory storage media and computer-readable storage media for containing code, or portions of code, can include any appropriate media known or used in the art (except for transitory media like carrier waves or the like) such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device.
  • The terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (i.e., meaning "including, but not limited to,") unless otherwise noted.
  • The term "connected" is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present disclosure can provide techniques for adjusting audio and/or video information of a video clip based at least in part on facial feature characteristics and/or voice feature characteristics extracted from hardware components. For example, in response to detecting a request to generate an avatar video clip of a virtual avatar, a video signal associated with a face in a field of view of a camera and an audio signal can be captured. Voice feature characteristics and facial feature characteristics can be extracted from the audio signal and the video signal, respectively. In some examples, in response to detecting a request to preview the avatar video clip, an adjusted audio signal can be generated based at least in part on the facial feature characteristics and the voice feature characteristics, and a preview of the video clip of the virtual avatar using the adjusted audio signal can be displayed.
PCT/US2019/019554 2018-02-28 2019-02-26 Voice effects based on facial expressions WO2019168834A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
DE112019001058.1T DE112019001058T5 (de) 2018-02-28 2019-02-26 Voice effects based on facial expressions
CN201980016107.6A CN111787986B (zh) 2018-02-28 2019-02-26 Voice effects based on facial expressions
KR1020207022657A KR102367143B1 (ko) 2018-02-28 2019-02-26 Voice effects based on facial expressions

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US15/908,603 2018-02-28
US15/908,603 US20180336716A1 (en) 2017-05-16 2018-02-28 Voice effects based on facial expressions
US16/033,111 US10861210B2 (en) 2017-05-16 2018-07-11 Techniques for providing audio and video effects
US16/033,111 2018-07-11

Publications (1)

Publication Number Publication Date
WO2019168834A1 true WO2019168834A1 (fr) 2019-09-06

Family

ID=65812390

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2019/019546 WO2020013891A1 (fr) 2018-07-11 2019-02-26 Techniques for providing audio and video effects
PCT/US2019/019554 WO2019168834A1 (fr) 2018-02-28 2019-02-26 Voice effects based on facial expressions

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/US2019/019546 WO2020013891A1 (fr) 2018-07-11 2019-02-26 Techniques for providing audio and video effects

Country Status (4)

Country Link
KR (1) KR102367143B1 (fr)
CN (2) CN112512649B (fr)
DE (1) DE112019001058T5 (fr)
WO (2) WO2020013891A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111803936A (zh) * 2020-07-16 2020-10-23 网易(杭州)网络有限公司 Voice communication method and apparatus, electronic device, and storage medium
US20220229885A1 (en) * 2019-06-04 2022-07-21 Sony Group Corporation Image processing apparatus, image processing method, program, and imaging apparatus
CN116248811A (zh) * 2022-12-09 2023-06-09 北京生数科技有限公司 Video processing method, apparatus, and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113891151A (zh) * 2021-09-28 2022-01-04 北京字跳网络技术有限公司 Audio processing method and apparatus, electronic device, and storage medium
CN114581567B (zh) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Sound-driven virtual avatar lip-sync method, apparatus, and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013152453A1 (fr) * 2012-04-09 2013-10-17 Intel Corporation Communication using interactive avatars
WO2016154800A1 (fr) * 2015-03-27 2016-10-06 Intel Corporation Avatar animations driven by facial expressions and/or speech

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102004059051A1 (de) * 2004-12-07 2006-06-08 Deutsche Telekom Ag Method and model-based audio and video system for representing a virtual figure
US8766983B2 (en) * 2006-05-07 2014-07-01 Sony Computer Entertainment Inc. Methods and systems for processing an interchange of real time effects during video communication
CN101809651B (zh) * 2007-07-31 2012-11-07 寇平公司 Mobile wireless display providing speech-to-speech translation and an avatar simulating human attributes
CN106961621A (zh) * 2011-12-29 2017-07-18 英特尔公司 Communication using avatars
KR20130139074A (ko) * 2012-06-12 2013-12-20 삼성전자주식회사 Audio signal processing method and audio signal processing apparatus applying the same
US9936165B2 (en) * 2012-09-06 2018-04-03 Intel Corporation System and method for avatar creation and synchronization
EP2976749A4 (fr) * 2013-03-20 2016-10-26 Intel Corp Avatar-based transmission protocols, icon generation and doll animation
US20150031342A1 (en) * 2013-07-24 2015-01-29 Jose Elmer S. Lorenzo System and method for adaptive selection of context-based communication responses
US9607609B2 (en) * 2014-09-25 2017-03-28 Intel Corporation Method and apparatus to synthesize voice based on facial structures
KR102374446B1 (ko) * 2014-12-11 2022-03-15 인텔 코포레이션 Avatar selection mechanism
CN105797374A (zh) * 2014-12-31 2016-07-27 深圳市亿思达科技集团有限公司 Method and terminal for emitting corresponding speech in accordance with facial expressions
CN107742515A (zh) * 2017-09-11 2018-02-27 广东欧珀移动通信有限公司 Speech processing method and apparatus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013152453A1 (fr) * 2012-04-09 2013-10-17 Intel Corporation Communication using interactive avatars
WO2016154800A1 (fr) * 2015-03-27 2016-10-06 Intel Corporation Avatar animations driven by facial expressions and/or speech

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220229885A1 (en) * 2019-06-04 2022-07-21 Sony Group Corporation Image processing apparatus, image processing method, program, and imaging apparatus
CN111803936A (zh) * 2020-07-16 2020-10-23 网易(杭州)网络有限公司 Voice communication method and apparatus, electronic device, and storage medium
CN111803936B (zh) * 2020-07-16 2024-05-31 网易(杭州)网络有限公司 Voice communication method and apparatus, electronic device, and storage medium
CN116248811A (zh) * 2022-12-09 2023-06-09 北京生数科技有限公司 Video processing method, apparatus, and storage medium
CN116248811B (zh) * 2022-12-09 2023-12-05 北京生数科技有限公司 Video processing method, apparatus, and storage medium

Also Published As

Publication number Publication date
KR20200105700A (ko) 2020-09-08
CN111787986A (zh) 2020-10-16
CN112512649A (zh) 2021-03-16
CN112512649B (zh) 2024-05-24
KR102367143B1 (ko) 2022-02-23
DE112019001058T5 (de) 2020-11-05
CN111787986B (zh) 2024-08-13
WO2020013891A1 (fr) 2020-01-16

Similar Documents

Publication Publication Date Title
US20180336716A1 (en) Voice effects based on facial expressions
US10861210B2 (en) Techniques for providing audio and video effects
KR102367143B1 (ko) Voice effects based on facial expressions
WO2022048403A1 (fr) Multi-modal interaction method, apparatus and system based on a virtual character, storage medium, and terminal
CN111415677B (zh) Method, apparatus, device, and medium for generating video
US12069345B2 (en) Characterizing content for audio-video dubbing and other transformations
US10019825B2 (en) Karaoke avatar animation based on facial motion data
CN110634483A (zh) Human-computer interaction method and apparatus, electronic device, and storage medium
CN113392201A (zh) Information interaction method and apparatus, electronic device, medium, and program product
US20150287403A1 (en) Device, system, and method of automatically generating an animated content-item
US20140278403A1 (en) Systems and methods for interactive synthetic character dialogue
KR20070020252A (ko) Method and system for modifying a message
KR101628050B1 (ko) Animation system for reproducing text-based data as animation
US11653072B2 (en) Method and system for generating interactive media content
CN107403011B (zh) Language learning implementation method in a virtual reality environment and automatic recording control method
CN110148406B (zh) Data processing method and apparatus, and apparatus for data processing
WO2022242706A1 (fr) Multi-modal based reactive response generation
KR20230026344A (ko) Customization of text messages in modifiable videos of a multimedia messaging application
KR20240038941A (ko) Text-based avatar generation method and system
US10347299B2 (en) Method to automate media stream curation utilizing speech and non-speech audio cue analysis
JP4917920B2 (ja) Content generation device and content generation program
CN112492400B (zh) Interaction method, apparatus, and device, and communication method and photographing method
WO2023040633A1 (fr) Video generation method and apparatus, terminal device, and storage medium
WO2022041177A1 (fr) Communication message processing method, device, and instant messaging client
US20240320519A1 (en) Systems and methods for providing a digital human in a virtual environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19711463

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20207022657

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 19711463

Country of ref document: EP

Kind code of ref document: A1