US20180336716A1 - Voice effects based on facial expressions - Google Patents

Voice effects based on facial expressions

Info

Publication number
US20180336716A1
US20180336716A1 (application US15/908,603)
Authority
US
United States
Prior art keywords
audio
avatar
video
audio signal
virtual avatar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/908,603
Inventor
Sean A. Ramprashad
Carlos M. Avendano
Aram M. Lindahl
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apple Inc filed Critical Apple Inc
Priority to US15/908,603 priority Critical patent/US20180336716A1/en
Assigned to APPLE INC. reassignment APPLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AVENDANO, CARLOS M., LINDAHL, ARAM M., RAMPRASHAD, SEAN A.
Priority to US16/033,111 priority patent/US10861210B2/en
Publication of US20180336716A1 publication Critical patent/US20180336716A1/en
Priority to KR1020207022657A priority patent/KR102367143B1/en
Priority to CN201980016107.6A priority patent/CN111787986A/en
Priority to DE112019001058.1T priority patent/DE112019001058T5/en
Priority to PCT/US2019/019554 priority patent/WO2019168834A1/en

Classifications

    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 13/80: 2D [Two Dimensional] animation, e.g. using sprites
    • G06F 3/0304: Detection arrangements using opto-electronic means
    • G06F 3/04842: Selection of displayed objects or displayed text elements
    • G06F 18/00: Pattern recognition
    • G06F 3/012: Head tracking input arrangements
    • G06F 3/0484: Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06K 9/00315
    • G06V 20/20: Scenes; Scene-specific elements in augmented reality scenes
    • G06V 40/175: Facial expression recognition; Static expression
    • G06V 40/176: Facial expression recognition; Dynamic expression
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • H04L 51/08: Annexed information, e.g. attachments
    • H04L 51/10: Multimedia information
    • H04L 51/52: User-to-user messaging for supporting social networking services
    • H04M 1/72436: User interfaces specially adapted for cordless or mobile telephones, with interactive means for internal management of messages for text messaging, e.g. SMS or e-mail
    • G06F 3/04886: Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser, by partitioning the display area of the touch-screen or the surface of the digitising tablet into independently controllable areas, e.g. virtual keyboards or menus
    • H04L 51/04: Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H04L 51/58: Message adaptation for wireless communication
    • H04M 1/72439: User interfaces specially adapted for cordless or mobile telephones, with interactive means for internal management of messages for image or video messaging
    • H04M 2250/52: Details of telephonic subscriber devices including functional features of a camera
    • H04N 23/611: Control of cameras or camera modules based on recognised objects, where the recognised objects include parts of the human body
    • H04N 23/63: Control of cameras or camera modules by using electronic viewfinders
    • H04W 4/12: Messaging; Mailboxes; Announcements

Definitions

  • Multimedia content such as emojis can represent a variety of predefined people, objects, actions, and/or other things.
  • Some messaging applications allow users to select from a predefined library of emojis, which can be sent as part of a message that can contain other content (e.g., other multimedia and/or textual content).
  • Animojis are one type of this other multimedia content, where a user can select an avatar (e.g., a puppet) to represent themselves.
  • the animoji can move and talk as if it were a video of the user.
  • Animojis enable users to create personalized versions of emojis in a fun and creative way.
  • Embodiments of the present disclosure can provide systems, methods, and computer-readable medium for implementing avatar video clip revision and playback techniques.
  • a computing device can present a user interface (UI) for tracking a user's face and presenting a virtual avatar representation (e.g., a puppet or video character version of the user's face).
  • the computing device can capture audio and video information, extract and detect context as well as facial feature characteristics and voice feature characteristics, revise the audio and/or video information based at least in part on the extracted/identified features, and present a video clip of the avatar using the revised audio and/or video information.
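  • To make that pipeline concrete, the following minimal Python sketch models the capture, extract, revise, and present stages described above; the function names, feature fields, and threshold values are illustrative assumptions rather than details taken from the patent.

```python
def capture_av(duration_s):
    """Stand-in for camera + microphone capture during a recording session."""
    video_frames = [{"t": i / 30.0, "mouth_open": 0.2} for i in range(int(duration_s * 30))]
    audio_chunks = [{"t": i / 100.0, "level_db": -25.0, "words": []} for i in range(int(duration_s * 100))]
    return video_frames, audio_chunks

def extract_features(video_frames, audio_chunks):
    """Facial feature characteristics and voice feature characteristics."""
    facial = [{"t": f["t"], "mouth_open": f["mouth_open"]} for f in video_frames]
    voice = [{"t": c["t"], "level_db": c["level_db"], "words": c["words"]} for c in audio_chunks]
    return facial, voice

def revise(facial, voice):
    """Placeholder for the context/effect logic: produce audio adjustments
    (and possibly adjusted facial metadata) from the combined features."""
    effects = [{"t": v["t"], "effect": "insert_animal_sound"}
               for v in facial if v["mouth_open"] > 0.8]
    return facial, effects

def present_preview(facial, effects):
    """Stand-in for rendering the avatar clip with the adjusted audio."""
    return {"frames": len(facial), "audio_effects": effects}

video_frames, audio_chunks = capture_av(duration_s=1.0)
facial, voice = extract_features(video_frames, audio_chunks)
adjusted_facial, effects = revise(facial, voice)
print(present_preview(adjusted_facial, effects))
```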
  • a computer-implemented method for implementing various audio and video effects techniques may be provided.
  • the method may include displaying a virtual avatar generation interface.
  • the method may also include displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot in a field of view of the camera and associated changes in the appearance of the headshot.
  • the method may also include detecting an input in the virtual avatar generation interface while displaying the first preview content of the virtual avatar.
  • the method in response to detecting the input in the virtual avatar generation interface, may also include: capturing, via the camera, a video signal associated with the user headshot during a recording session, capturing, via the microphone, a user audio signal during the recording session, extracting audio feature characteristics from the captured user audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the recording session, the method may also include: generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics, generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal, and presenting the second preview content in the virtual avatar generation interface.
  • the method may also include storing facial feature metadata associated with the facial feature characteristics extracted from the video signal and generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics. Additionally, the second preview of the virtual avatar may be displayed further according to the adjusted facial metadata. In some examples, the first preview of the virtual avatar may be displayed according to preview facial feature characteristics identified according to the changes in the appearance of the face during a preview session.
  • an electronic device for implementing various audio and video effects techniques may be provided.
  • the system may include a camera, a microphone, a library of pre-recorded/pre-determined audio, and one or more processors in communication with the camera and the microphone.
  • the processors may be configured to execute computer-executable instructions to perform operations.
  • the operations may include detecting an input in a virtual avatar generation interface while displaying a first preview of a virtual avatar.
  • the operations may also include initiating a capture session in response to detecting the input in the virtual avatar generation interface.
  • the capture session may include: capturing, via the camera, a video signal associated with a face in a field of view of the camera, capturing, via the microphone, an audio signal associated with the captured video signal, extracting audio feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal.
  • the operations may also include generating an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics and presenting the second preview content in the virtual avatar generation interface, at least in response to detecting expiration of the capture session.
  • the audio signal may be further adjusted based at least in part on a type of the virtual avatar. Additionally, the type of the virtual avatar may be received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface. In some instances, the type of the virtual avatar may include an animal type, and the adjusted audio signal may be generated based at least in part on a predetermined sound associated with the animal type. The use and timing of predetermined sounds may be based on audio features from the captured audio and/or facial features from the captured video. This predetermined sound may also be itself modified based on audio features from the captured audio and facial features from the captured video. In some examples, the one or more processors may be further configured to determine whether a portion of the audio signal corresponds to the face in the field of view.
  • the portion of the audio signal may be stored for use in generating the adjusted audio signal and/or in accordance with a determination that the portion of the audio signal does not correspond to the face, at least the portion of the audio signal may be discarded and not considered for modification and/or playback.
  • the audio feature characteristics may comprise features of a voice associated with the face in the field of view.
  • the one or more processors may be further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal.
  • the one or more processors may be further configured to store audio feature metadata associated with the audio feature characteristics extracted from the audio signal.
  • the one or more processors may be further configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics, and the second preview of the virtual avatar may be generated according to the adjusted facial metadata and the adjusted audio signal.
  • a computer-readable medium may be provided.
  • the computer-readable medium may include computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform operations.
  • the operations may include performing the following actions in response to detecting a request to generate an avatar video clip of a virtual avatar: capturing, via a camera of an electronic device, a video signal associated with a face in a field of view of the camera, capturing, via a microphone of the electronic device, an audio signal, extracting voice feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal.
  • the operations may also include performing the following actions in response to detecting a request to preview the avatar video clip: generating an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics, and displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.
  • the audio signal may be adjusted based at least in part on a facial expression identified in the facial feature characteristics associated with the face. In some instances, the audio signal may be adjusted based at least in part on a level, pitch, duration, formant, or change in a voice characteristic associated with the face. Further, in some embodiments, the one or more processors may be further configured to perform the operations comprising transmitting the video clip of the virtual avatar to another electronic device.
  • FIG. 1 is a simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 2 is another simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 3 is another simplified block diagram illustrating hardware and software components for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 4 is a flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 5 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 6 is a simplified block diagram illustrating a user interface for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 7 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 8 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 9 is a simplified block diagram illustrating a computer architecture for providing audio and/or video effects techniques as described herein, according to at least one example.
  • Certain embodiments of the present disclosure relate to devices, computer-readable medium, and methods for implementing various techniques for providing voice effects (e.g., revised audio) based at least in part on facial expressions. Additionally, in some cases, the various techniques may also provide video effects based at least in part on audio characteristics of a recording. Even further, the various techniques may also provide voice effects and video effects (e.g., together) based at least in part on one or both of facial expressions and audio characteristics of a recording. In some examples, the voice effects and/or video effects may be presented in a user interface (UI) configured to display a cartoon representation of a user (e.g., an avatar or digital puppet). Such an avatar that represents a user may be considered an animoji, as it may look like an emoji character familiar to most smart phone users; however, it can be animated to mimic actual motions of the user.
  • a user of a computing device may be presented with a UI for generating an animoji video (e.g., a video clip).
  • the video clip can be limited to a predetermined amount of time (e.g., 10 seconds, 30 seconds, or the like), or the video clip can be unlimited.
  • a preview area may present the user with a real-time representation of their face, using an avatar character.
  • Multiple avatar characters may be provided, and a user may even be able to generate or import their own avatars.
  • the preview area may be configured to provide an initial preview of the avatar and a preview of the recorded video clip.
  • the recorded video clip may be previewed in its original form (e.g., without any video or audio effects) or it may be previewed with audio and/or video effects.
  • the user may select an avatar after the initial video clip has been recorded.
  • the video clip preview may then change from one avatar to another, with the same or different video effects applied to it, as appropriate.
  • If a new avatar is selected while the raw preview (e.g., original form, without effects) is being presented, the UI may be updated to display a rendering of the same video clip but with the newly selected avatar.
  • The facial features and audio (e.g., the user's voice) are mapped to the newly selected avatar, so that in the preview it will appear as if the avatar character is moving the same way the user moved during the recording, and speaking what the user said during the recording.
  • a user may select a first avatar (e.g., a unicorn head) via the UI, or a default avatar can be initially provided.
  • the UI will present the avatar (in this example, the head of a cartoon unicorn if selected by the user or any other available puppet by default) in the preview area, and the device will begin capturing audio and/or video information (e.g., using one or more microphones and/or one or more cameras).
  • only video information is needed for the initial preview screen.
  • the video information can be analyzed, and facial features can be extracted. These extracted facial features can then be mapped to the unicorn face in real-time, such that the initial preview of the unicorn head appears to mirror the user's face.
  • the term real-time is used to indicate that the results of the extraction, mapping, rendering, and presentation are performed in response to each motion of the user and can be presented substantially immediately. To the user, it will appear as if they are looking in the mirror, except that the image of their face is replaced with an avatar.
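  • A minimal sketch of that real-time mirroring loop is shown below; the feature names, the puppet interface, and the control mapping are assumptions made for illustration only.

```python
from typing import Dict

FrameFeatures = Dict[str, float]   # hypothetical per-frame facial features, normalized 0..1

class UnicornPuppet:
    """Toy avatar rig: applying controls just records them."""
    def __init__(self):
        self.controls: Dict[str, float] = {}

    def apply(self, controls: Dict[str, float]) -> None:
        self.controls = controls

def map_features_to_puppet(features: FrameFeatures) -> Dict[str, float]:
    # The user's extracted features drive the corresponding puppet controls,
    # so the avatar appears to mirror the user's face frame by frame.
    return {
        "jaw_open": features.get("mouth_open", 0.0),
        "brow_height": 1.0 - features.get("brow_furrow", 0.0),
        "smile": features.get("smile", 0.0),
    }

puppet = UnicornPuppet()
for frame in ({"mouth_open": 0.2, "smile": 0.7}, {"mouth_open": 0.6, "brow_furrow": 0.9}):
    puppet.apply(map_features_to_puppet(frame))   # one update per captured frame
    print(puppet.controls)
```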
  • While the user's face is in the line of sight (e.g., the view) of a camera of the device, the UI will continue to present the initial preview.
  • the device may begin to capture video that has an audio component. In some examples, this includes a camera capturing frames and a microphone capturing audio information.
  • a special camera may be utilized that is capable of capturing 3-dimensional (3D) information as well.
  • any camera may be utilized that is capable of capturing video.
  • the video may be stored in its original form and/or metadata associated with the video may be stored. As such, capturing the video and/or audio information may be different from storing the information.
  • capturing the information may include sensing the information and at least caching it such that it is available for processing.
  • the processed data can also be cached until it is determined whether to store or simply utilize the data.
  • For the initial preview, the video data (e.g., metadata associated with the data) may not be stored permanently at all, such that the initial preview is not reusable or recoverable.
  • the video data and the audio data may be stored more permanently.
  • the audio and video (A/V) data may be analyzed, processed, etc., in order to provide the audio and video effects described herein.
  • the video data may be processed to extract facial features (e.g., facial feature characteristics) and those facial features may be stored as metadata for the animoji video clip.
  • the set of metadata may be stored with an identifier (ID) that indicates the time, date, and user associated with the video clip.
  • The audio data may be stored with the same or another ID.
  • the system may extract audio feature characteristics from the audio data and facial feature characteristics from the video file. This information can be utilized to identify context, key words, intent, and/or emotions of the user, and video and audio effects can be introduced into audio and video data prior to rendering the puppet.
  • the audio signal can be adjusted to include different words, sounds, tones, pitches, timing, etc., based at least in part on the extracted features.
  • the video data (e.g., the metadata) can likewise be adjusted based at least in part on the extracted features.
  • audio features are extracted in real-time during the preview itself. These audio features may be avatar specific, generated only if the associated avatar is being previewed.
  • the audio features may be avatar agnostic, generated for all avatars.
  • the audio signal can also be adjusted in part based on these real-time audio feature extractions, and with the pre-stored extracted video features which are created during or after the recording process, but before previewing.
  • a second preview of the puppet can be rendered. This rendering may be performed for each possible puppet, such that as the user scrolls through and selects different puppets, the adjusted data is already rendered. Or the rendering can be performed after selection of each puppet. In any event, once the user selects a puppet, the second preview can be presented. The second preview will replay the video clip that was recorded by the user, but with the adjusted audio and/or video. Using the example from above, if the user recorded themselves with an angry tone (e.g., with a gruff voice and a furrowed brow), the context or intent of anger may be detected, and the audio file may be adjusted to include a growling sound.
  • the second preview would look like a unicorn saying the words that the user said; however, the voice of the user may be adjusted to sound like a growl, or to make the tone more baritone (e.g., lower).
  • the user could then save the second preview or select it for transmission to another user (e.g., through a messaging application or the like).
  • the animoji video clips described above and below can be shared as .mov files.
  • the described techniques can be used in real-time (e.g., with video messaging or the like).
  • FIG. 1 is a simplified block diagram illustrating example flow 100 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording.
  • In example flow 100, there are two separate sessions: recording session 102 and playback session 104.
  • device 106 may capture video having an audio component of user 108 at block 110 .
  • the video and audio may be captured (e.g., collected) separately, using two different devices (e.g., a microphone and a camera).
  • the capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 108 .
  • user 108 may say the word “hello” at block 112 .
  • device 106 may continue to capture the video and/or audio components of the user's actions.
  • device 106 can continue capturing the video and audio components, and in this example, user 108 may say the word “bark.”
  • device 106 may also extract spoken words from the audio information.
  • the spoken word extraction (or any audio feature extraction) may actually take place after recording session 102 is complete.
  • the spoken word extraction (or any audio feature extraction) may actually take place during the preview block 124 in real-time. It is also possible for the extraction (e.g., analysis of the audio) to be done in real-time while recording session 102 is still in process. In either case, the avatar process being executed by device 106 may identify through the extraction that the user said the word “bark” and may employ some logic to determine what audio effects to implement.
  • recording session 102 may end when user 108 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 10 seconds, 30 seconds, or the like). In some cases, this time period may be automatically predetermined, while in others, it may be user selected (e.g., selected from a list of options or entered in free form through a text entry interface).
  • user 108 may select a preview affordance, indicating that user 108 wishes to watch a preview of the recording.
  • One option could be to play the original recording without any visual or audio effects. However, another option could be to play a revised version of the video clip. Based at least in part on detection of the spoken word “bark,” the avatar process may have revised the audio and/or video of the video clip.
  • device 106 may present avatar (also called a puppet and/or animoji) 118 on a screen.
  • Device 106 may also be configured with speaker 120 that can play audio associated with the video clip.
  • block 116 corresponds to the same point in time as block 110 , where user 108 may have had his mouth open, but was not yet speaking.
  • avatar 118 may be presented with his mouth open; however, no audio is presented from speaker 120 yet.
  • the avatar process can present avatar 118 with an avatar-specific voice. In other words, a predefined dog voice may be used to say the word “hello” at block 122 .
  • the dog-voice word “hello” can be presented by speaker 120 .
  • each avatar may be associated with a particular pre-defined voice that best fits that avatar.
  • For example, a dog may have a dog voice, a cat may have a cat voice, a pig may have a pig voice, and a robot may have a robotic voice.
  • These avatar-specific voices may be pre-recorded or may be associated with particular frequency or audio transformations that can be applied by executing mathematical operations on the original sound, such that any user's voice can be transformed to sound like the dog voice.
  • each user's dog voice may sound different based at least in part on the particular audio transformation performed.
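  • As one deliberately simple illustration of executing mathematical operations on the original sound, the sketch below shifts pitch by naive resampling; a production voice transform would be more sophisticated (e.g., preserving formants and duration), so this is only a conceptual example.

```python
import numpy as np

def pitch_shift_by_resampling(samples: np.ndarray, ratio: float) -> np.ndarray:
    """Crude pitch shift: resample the waveform by `ratio`.

    ratio > 1.0 raises the pitch (and shortens the clip); ratio < 1.0 lowers it
    (and lengthens the clip). The duration side effect is why timing changes
    are later communicated to the avatar engine.
    """
    n_out = int(len(samples) / ratio)
    x_old = np.linspace(0.0, 1.0, num=len(samples))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, samples)

# Example: a 440 Hz test tone transformed toward a lower "dog" voice.
sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 440.0 * t)
dog_voice = pitch_shift_by_resampling(voice, ratio=0.8)   # roughly 20% lower
print(len(voice), len(dog_voice))
```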
  • the avatar process may replace the spoken word (e.g., “bark”) with an avatar-specific word.
  • For example, the sound of a dog bark (e.g., a recorded or simulated dog bark) can be inserted into the audio data (e.g., in place of the word "bark").
  • different avatar-specific words will be presented at 124 based at least in part on different avatar selections, and in other examples, the same avatar-specific word may be presented regardless of the avatar selections. For example, if user 108 said “bark,” a “woof” could be presented when the dog avatar is selected.
  • the process could convert the “bark” into a “woof” even though it wouldn't be appropriate for a cat to “woof.”
  • the process could convert “bark” into a recorded or simulated “meow,” based at least in part on the selection of the cat avatar.
  • the process could ignore the “bark” for avatars other than the dog avatar.
  • there may be a second level of audio feature analysis performed even after the extraction at 114 .
  • Video and audio features may also influence processing on the avatar-specific utterances. For example, the level, pitch, and intonation with which a user says "bark" may be detected as part of the audio feature extraction, and this may direct the system to select a specific "woof" sample or transform such a sample before and/or during the preview process.
  • FIG. 2 is another simplified block diagram illustrating example flow 200 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording.
  • example flow 200 much like in example flow 100 of FIG. 1 , there are two separate sessions: recording session 202 and playback session 204 .
  • recording session 202 device 206 may capture video having an audio component of user 208 at block 210 .
  • the capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 208 .
  • user 208 may say the word “hello” at block 212 .
  • device 206 may continue to capture the video and/or audio components of the user's actions.
  • device 206 can continue capturing the video and audio components, and in this example, user 208 may hold his mouth open, but not say anything.
  • device 206 may also extract facial expressions from the video. However, in other examples, the facial feature extraction (or any video feature extraction) may actually take place after recording session 202 is complete. Still, it is possible for the extraction (e.g., analysis of the video) to be done in real-time while recording session 202 is still in process. In either case, the avatar process being executed by device 206 may identify through the extraction that the user opened his mouth briefly (e.g., without saying anything) and may employ some logic to determine what audio and/or video effects to implement.
  • the determination that the user held their mouth open without saying anything may require extraction and analysis of both audio and video; for example, extraction of the facial feature characteristics (e.g., an open mouth) from the video may be paired with detection of silence in the audio.
  • Video and audio features may also influence processing on the avatar specific utterances.
  • the duration of the opening of the mouth, opening of eyes, etc. may direct the system to select a specific “woof” sample or transform such a sample before and/or during the preview process.
  • One such transformation is changing the level and/or duration of the woof to match the detected opening and closing of the user's mouth.
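  • A sketch of that level/duration matching is shown below; the naive resampling stretch and peak scaling are stand-ins for whatever transformation the system actually applies.

```python
import numpy as np

def fit_sample_to_gesture(sample: np.ndarray, sample_rate: int,
                          mouth_open_s: float, target_peak: float) -> np.ndarray:
    """Stretch a pre-recorded sound to the detected mouth-open duration and
    scale its level toward the captured voice level.

    The naive resampling stretch also shifts pitch slightly; a real system
    would more likely use a proper time-scale modification.
    """
    n_target = max(1, int(mouth_open_s * sample_rate))
    x_old = np.linspace(0.0, 1.0, num=len(sample))
    x_new = np.linspace(0.0, 1.0, num=n_target)
    stretched = np.interp(x_new, x_old, sample)
    peak = float(np.max(np.abs(stretched)))
    return stretched * (target_peak / peak) if peak > 0 else stretched

# A fake 0.3 s "woof" fitted to a detected 0.75 s mouth-open gesture.
sr = 16000
woof = np.sin(2 * np.pi * 120.0 * np.arange(int(0.3 * sr)) / sr)
fitted = fit_sample_to_gesture(woof, sr, mouth_open_s=0.75, target_peak=0.5)
print(round(len(fitted) / sr, 2), "seconds")
```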
  • recording session 202 may end when user 208 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 20 seconds, 30 seconds, or the like).
  • user 208 may select a preview affordance, indicating that user 208 wishes to watch a preview of the recording.
  • One option could be to play the original recording without any visual or audio effects.
  • another option could be to play a revised version of the recording.
  • the avatar process may have revised the audio and/or video of the video clip.
  • device 206 may present avatar (also called a puppet and/or animoji) 218 on a screen of device 206 .
  • Device 206 may also be configured with speaker 220 that can play audio associated with the video clip.
  • block 216 corresponds to the same point in time as block 210 , where user 208 may not have been speaking yet.
  • avatar 218 may be presented with his mouth open; however, no audio is presented from speaker 220 yet.
  • the avatar process can present avatar 218 with an avatar-specific voice (as described above).
  • the avatar process may replace the silence identified at block 214 with an avatar-specific word.
  • For example, the sound of a dog bark (e.g., a recorded or simulated dog bark) can be inserted into the audio data (e.g., in place of the silence).
  • different avatar-specific words will be presented at 224 based at least in part on different avatar selections, and in other examples, the same avatar-specific word may be presented regardless of the avatar selections.
  • each avatar may have a predefined sound to be played when it is detected that user 208 has held his mouth open for an amount of time (e.g., a half second, a whole second, etc.) without speaking.
  • the process could ignore the detection of the open mouth for avatars that don't have a predefined effect for that facial feature.
  • the process may also detect how many "woof" sounds to insert (e.g., if the user held his mouth open for double the length of time used to indicate a bark) or whether it is not possible to insert the number of barks requested (e.g., in the scenario of FIG. 1, where the user would speak "bark" to indicate that a "woof" sound should be inserted).
  • user 208 can control effects of the playback (e.g., the recorded avatar message) with their facial and voice expressions.
  • Further, while not shown explicitly in either FIG. 1 or FIG. 2, the user device can be configured with software for executing the avatar process (e.g., capturing the A/V information, extracting features, analyzing the data, implementing the logic, revising the audio and/or video files, and rendering the previews) as well as software for executing an application (e.g., an avatar application with its own UI) that enables the user to build the avatar messages and subsequently send them to other user devices.
  • FIG. 3 is a simplified block diagram 300 illustrating components (e.g., software modules) utilized by the avatar process described above and below. In some examples, more or fewer modules can be utilized to implement the providing of audio and/or video effects based at least in part on audio and/or video features detected in a user's recording.
  • device 302 may be configured with camera 304 , microphone 306 , and a display screen for presenting a UI and the avatar previews (e.g., the initial preview before recording as well as the preview of the recording before sending).
  • the avatar process is configured with avatar engine 308 and voice engine 310 .
  • Avatar engine 308 can manage the list of avatars, process the video features (e.g., facial feature characteristics), revise the video information, communicate with voice engine 310 when appropriate, and render video of the avatar 312 when all processing is complete and effects have been implemented (or discarded).
  • Revising of the video information can include adjusting or otherwise editing the metadata associated with the video file. In this way, when the video metadata (adjusted or not) is used to render the puppet, the facial features can be mapped to the puppet.
  • voice engine 310 can store the audio information, perform the logic for determining what effects to implement, revise the audio information, and provide modified audio 314 when all processing is complete and effects have been implemented (or discarded).
  • video features 316 can be captured by camera 304 and audio features 318 can be captured by microphone 306 . In some cases there may be as many as (or more than) fifty facial features to be detected within video features 316 .
  • Example video features include, but are not limited to, duration of expressions, open mouth, frowns, smiles, eyebrows up or furrowed, etc.
  • video features 316 may include only metadata that identifies each of the facial features (e.g., data points that indicate which locations on the user's face moved or were in which position). Further, video features 316 can be passed to avatar engine 308 and voice engine 310.
  • the metadata associated with video features 316 can be stored and analyzed.
  • avatar engine 308 may perform the feature extraction from the video file prior to storing the metadata.
  • the feature extraction may be performed prior to video features 316 being sent to avatar engine (in which case, video features 316 would be the metadata itself).
  • video features 316 may be compared with audio features 318 when it is helpful to match up what audio features correspond to which video features (e.g., to see if certain audio and video features occur at the same time).
  • audio features are also passed to voice engine 310 for storage.
  • Example audio features include, but are not limited to, level, pitch, dynamics (e.g., changes in level or pitch), voicing, formants, duration, etc.
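  • The per-stream feature records described above might be laid out as timestamped metadata along the lines of the following sketch; the field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoFeatureFrame:
    """Facial feature metadata for one captured frame; no image data is kept."""
    timestamp: float
    mouth_open: float
    smile: float
    brow_furrow: float

@dataclass
class AudioFeatureFrame:
    """Voice feature metadata for one short analysis window."""
    timestamp: float
    level_db: float
    pitch_hz: float
    voiced: bool
    spotted_words: List[str] = field(default_factory=list)

# Keeping both streams timestamped makes it easy for the engines to check
# whether an audio event and a facial event occur at (roughly) the same time.
video = [VideoFeatureFrame(0.00, 0.8, 0.0, 0.1), VideoFeatureFrame(0.03, 0.9, 0.0, 0.1)]
audio = [AudioFeatureFrame(0.00, -60.0, 0.0, False), AudioFeatureFrame(0.03, -58.0, 0.0, False)]
print(video[0])
print(audio[0])
```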
  • Raw audio 320 includes the unprocessed audio file as it's captured.
  • Raw audio 320 can be passed to voice engine 310 for further processing and potential (e.g., eventual) revision and it can also be stored separately so that the original audio can be used if desired.
  • Raw audio 320 can also be passed to voice recognition module 322 .
  • Voice recognition module 322 can be used to word spot and identify a user's intent from their voice. For example, voice recognition module 322 can determine when a user is angry, sad, happy, or the like.
  • When the user expresses such an emotion or speaks a spotted key word (e.g., "bark"), voice recognition module 322 will detect this. Information detected and/or collected by voice recognition module 322 can then be passed to voice engine 310 for further logic and/or processing.
  • audio features are extracted in real-time during the preview itself. These audio features may be avatar specific, generated only if the associated avatar is being previewed. The audio features may be avatar agnostic, generated for all avatars. The audio signal can also be adjusted in part based on these real-time audio feature extractions, and with the pre-stored extracted video features which are created during or after the recording process, but before previewing. Additionally, some feature extraction may be performed during rendering at 336 by voice engine 310 . Some pre-stored sounds 338 may be used by voice engine 310 , as appropriate, to fill in the blanks or to replace other sounds that were extracted.
  • voice engine 310 will make the determination regarding what to do with the information extracted from voice recognition module 322 .
  • voice engine 310 can pass the information from voice recognition module 322 to feature module 324 for determining which features correspond to the data extracted by voice recognition module 322 .
  • feature module 324 may indicate (e.g., based on a set of rules and/or logic) that a sad voice detected by voice recognition module 322 corresponds to a raising of the pitch of the voice, or the slowing down of the speed or cadence of the voice.
  • feature module 324 can map the extracted audio features to particular voice features.
  • effect type module 326 can map the particular voice features to the desired effect.
  • Voice engine 310 can also be responsible for storing each particular voice for each possible avatar. For example, there may be standard or hardcoded voices for each avatar. Without any other changes being made, if a user selects a particular avatar, voice engine 310 can select the appropriate standard voice for use with playback. In this case, modified audio 314 may just be raw audio 320 transformed to the appropriate avatar voice based on the selected avatar. As the user scrolls through the avatars and selects different ones, voice engine 310 can modify raw audio 320 on the fly to make it sound like the newly selected avatar. Thus, avatar type 328 needs to be provided to voice engine 310 to make this change.
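  • The on-the-fly avatar voice switching described above could be structured as in the sketch below, where the raw audio is kept untouched and the avatar voice is re-derived whenever a new avatar type is selected; the transform functions are placeholders, not the actual avatar voices.

```python
from typing import Callable, Dict, List

# Placeholder per-avatar "standard voice" transforms (names and math are invented).
def dog_voice(samples: List[float]) -> List[float]:
    return [s * 0.9 for s in samples]        # stand-in for a real voice transform

def robot_voice(samples: List[float]) -> List[float]:
    return [round(s, 1) for s in samples]    # stand-in "quantized" robot effect

AVATAR_VOICES: Dict[str, Callable[[List[float]], List[float]]] = {
    "dog": dog_voice,
    "robot": robot_voice,
}

class VoiceEngine:
    """Keeps the raw audio untouched and re-derives the avatar voice on demand,
    so switching avatars never modifies the stored original."""
    def __init__(self, raw_audio: List[float]):
        self.raw_audio = raw_audio

    def modified_audio(self, avatar_type: str) -> List[float]:
        transform = AVATAR_VOICES.get(avatar_type, lambda s: list(s))  # default: unchanged
        return transform(self.raw_audio)

engine = VoiceEngine([0.0, 0.25, -0.5, 0.75])
print(engine.modified_audio("dog"))
print(engine.modified_audio("robot"))
print(engine.modified_audio("unicorn"))   # no registered transform: raw audio passes through
```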
  • voice engine 310 can revise raw audio file 320 and provide modified audio 314 .
  • the user will be provided with an option to use the original audio file at on/off 330 . If the user selects “off” (e.g., effects off), then raw audio 320 can be combined with video of avatar 312 (e.g., corresponding to the unchanged video) to make A/V output 332 .
  • A/V output 332 can be provided to the avatar application presented on the UI of device 302 .
  • Avatar engine 308 can be responsible for providing the initial avatar image based at least in part on the selection of avatar type 328 . Additionally, avatar engine 308 is responsible for mapping video features 316 to the appropriate facial markers of each avatar. For example, if video features 316 indicate that the user is smiling, the metadata that indicates a smile can be mapped to the mouth area of the selected avatar so that the avatar appears to be smiling in video of avatar 312 . Additionally, avatar engine 308 can receive timing changes 334 from voice engine, as appropriate.
  • If voice engine 310 determines that the voice effect is to make the audio more of a whispering voice (e.g., based on feature module 324 and/or effect type 326 and/or the avatar type), and modifies the voice to be more of a whispered voice, this effect change may include slowing down the voice itself, in addition to a reduced level and other formant and pitch changes. Accordingly, the voice engine may produce modified audio which is slower in playback speed relative to the original audio file for the audio clip. In this scenario, voice engine 310 would need to instruct avatar engine 308 via timing changes 334, so that the video file can be slowed down appropriately; otherwise, the video and audio would not be synchronized.
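  • The timing-change handshake described above might look like the following sketch, in which the voice engine reports how much the whispered audio was slowed and the avatar engine rescales its frame timestamps by the same factor; the numbers and function names are illustrative.

```python
def whisper_effect(audio_duration_s: float, slowdown: float = 1.25):
    """Voice-engine side: the whispered audio comes out `slowdown` times longer,
    so the same factor is reported as a timing change for the avatar engine."""
    return {"new_audio_duration_s": audio_duration_s * slowdown,
            "timing_change": slowdown}

def apply_timing_change(frame_timestamps, timing_change):
    """Avatar-engine side: rescale the per-frame timestamps so the puppet's
    mouth movement still lines up with the slowed-down speech."""
    return [t * timing_change for t in frame_timestamps]

effect = whisper_effect(audio_duration_s=4.0)
frames = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
print(effect)
print(apply_timing_change(frames, effect["timing_change"]))
```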
  • a user may use the avatar application of device 302 to select different avatars.
  • the voice effect can change based at least in part on this selection.
  • the user may be given the opportunity to select a different voice for a given avatar (e.g., the cat voice for the dog avatar, etc.).
  • This type of free-form voice effect change can be executed by the user via selection on the UI or, in some cases, with voice activation or face motion.
  • a certain facial expression could trigger voice engine 310 to change the voice effect for a given avatar.
  • voice engine 310 may be configured to make children's voices sound more high pitched or, alternatively, determine not to make a child's voice more high pitched because it would sound inappropriate given that raw audio 320 for a child's voice might already be high pitched. Making this user specific determination of an effect could be driven in part by the audio features extracted, and in this case such features could include pitch values and ranges throughout the recording.
  • voice recognition module 322 may include a recognition engine, a word spotter, a pitch analyzer, and/or a formant analyzer. The analysis performed by voice recognition module 322 will be able to identify if the user is upset, angry, happy, etc. Additionally, voice recognition module 322 may be able to identify context and/or intonation of the user's voice, as well as change the intention of wording and/or determine a profile (e.g., a virtual identity) of the user.
  • the avatar process 300 can be configured to package/render the video clip by combining video of avatar 312 and either modified audio 314 or raw audio 320 into A/V output 332 .
  • voice engine 310 just needs to know an ID for the metadata associated with video of avatar 312 (e.g., it does not actually need video of avatar 312 , it just needs the ID of the metadata).
  • The user can then select to send a message within a messaging application (e.g., the avatar application), where the message includes A/V output 332.
  • the last video clip to be previewed can be sent.
  • For example, if the cat avatar was the last one previewed, the cat avatar video would be sent when the user selects "send."
  • the state of the last preview can be stored and used later. For example, if the last message (e.g., avatar video clip) sent used a particular effect, the first preview of the next message being generated can utilize that particular effect.
  • voice engine 310 and/or avatar engine 308 can check for certain cues and/or features, and then revise the audio and/or video files to implement the desired effect.
  • Some example feature/effect pairs include: detecting that the user has opened their mouth and paused for a moment. In this example, both facial feature characteristics (e.g., mouth open) and audio feature characteristics (e.g., silence) need to happen at the same time in order for the desired effect to be implemented. For this feature/effect pair, the desired effect is to revise the audio and video so that the avatar appears to make an avatar/animal-specific sound.
  • For example, a dog will make a bark sound, a cat will make a meow sound, and a monkey, horse, unicorn, etc., will make their respective sounds.
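  • Detecting that feature/effect pair amounts to finding spans where an open mouth and silence co-occur; the sketch below assumes the facial and audio features are sampled on a shared set of timestamps, which is a simplification.

```python
def detect_open_mouth_pauses(facial, voice, min_duration_s=0.5, silence_db=-50.0):
    """Find spans where the mouth is open AND the audio is silent at the same
    time, the cue for inserting an avatar-specific sound."""
    spans, start = [], None
    for f, v in zip(facial, voice):
        cue = f["mouth_open"] > 0.7 and v["level_db"] < silence_db
        if cue and start is None:
            start = f["t"]
        elif not cue and start is not None:
            if f["t"] - start >= min_duration_s:
                spans.append((start, f["t"]))
            start = None
    if start is not None and facial and facial[-1]["t"] - start >= min_duration_s:
        spans.append((start, facial[-1]["t"]))
    return spans

facial = [{"t": i * 0.1, "mouth_open": 0.9 if 5 <= i <= 12 else 0.1} for i in range(20)]
voice = [{"t": i * 0.1, "level_db": -60.0 if 5 <= i <= 12 else -20.0} for i in range(20)]
print(detect_open_mouth_pauses(facial, voice))   # roughly [(0.5, 1.3)]
```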
  • Other example feature/effect pairs include lowering the audio pitch and/or tone when a frown is detected.
  • this effect could be implemented based at least in part on voice recognition module 322 detecting sadness in the voice of the user.
  • video features 316 wouldn't be needed at all.
  • Other example feature/effect pairs include whispering, which can cause the audio and video speeds to be slowed down, levels to be toned down, and/or changes to be reduced.
  • video changes can lead to modifications of the audio while, in other cases, audio changes can lead to modifications of the video.
  • avatar engine 308 may act as the feature extractor, in which case video features 316 and audio features 318 may not exist prior to being sent to avatar engine 308 . Instead, raw audio 320 and metadata associated with the raw video may be passed into avatar engine 308 , where avatar engine 308 may extract the audio feature characteristics and the video (e.g., facial) feature characteristics. In other words, while not drawn this way in FIG. 3 , parts of avatar engine 308 may actually exist within camera 304 . Additionally, in some examples, metadata associated with video features 316 can be stored in a secure container, and when voice engine 310 is running, it can read the metadata from the container.
  • the audio and video information can be processed offline (e.g., not in real-time).
  • avatar engine 308 and voice engine 310 can read ahead in the audio and video information and make context decisions up front. Then, voice engine 310 can revise the audio file accordingly. This ability to read ahead and make decisions offline will greatly increase the efficiency of the system, especially for longer recordings. Additionally, this enables a second stage of analysis, where additional logic can be processed. Thus, the entire audio file can be analyzed before making any final decisions.
  • For example, if the user says "bark" twice in quick succession, voice engine 310 can take the information from voice recognition module 322 and determine to ignore the second "bark," because it won't be possible to include both "woof" sounds in the audio file.
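  • The offline read-ahead decision described above could be implemented as a single pass that drops cues whose inserted sounds would overlap, as in the following sketch; the sound duration and cue times are invented for illustration.

```python
def schedule_sound_insertions(cue_times_s, sound_duration_s=0.6):
    """Offline pass over the whole recording: keep only cues whose inserted
    sound would not overlap the previously scheduled one, dropping the rest
    (e.g., ignoring a second "bark" spoken too soon after the first)."""
    scheduled, last_end = [], float("-inf")
    for t in sorted(cue_times_s):
        if t >= last_end:
            scheduled.append(t)
            last_end = t + sound_duration_s
    return scheduled

# Two "bark" cues 0.3 s apart: only the first fits a 0.6 s "woof".
print(schedule_sound_insertions([1.0, 1.3, 4.2]))   # -> [1.0, 4.2]
```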
  • Even though the audio file and the video are packaged together to make A/V output 332, voice engine 310 does not actually need to access video of avatar 312.
  • the video file (e.g., a .mov format file, or the like) is created as the video is being played by accessing an array of features (e.g., floating-point values) that were written to the metadata file.
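In other words, playback can drive the selected puppet directly from the stored per-frame feature values rather than from saved camera frames. A minimal sketch of that idea, with hypothetical type names, might look like the following.

```swift
// Illustrative only: a per-frame record of facial feature values (e.g., values
// stored as floating-point numbers in the metadata file).
struct FacialFrame {
    let timestamp: Double
    let featureValues: [Float]   // e.g., jaw openness, brow position, smile amount
}

protocol PuppetRenderer {
    // Renders one frame of the selected avatar from feature values alone;
    // the raw camera video is never needed at this point.
    func renderFrame(features: [Float], at timestamp: Double)
}

// Drive the puppet from metadata as the clip plays back.
func playClip(frames: [FacialFrame], renderer: PuppetRenderer) {
    for frame in frames {
        renderer.renderFrame(features: frame.featureValues, at: frame.timestamp)
    }
}
```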
  • each modified video clip could be saved temporarily (e.g., cached), such that if the user reselects an avatar that's already been previewed, the processing to generate/render that particular preview does not need to be duplicated.
  • the above noted caching of rendered video clips would enable the realization of large savings in processor power and instructions per second (IPS), especially for longer recordings and/or recordings with a large number of effects.
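A cache along those lines could be as simple as a dictionary keyed by avatar and effect state, so a re-selected avatar is served from memory instead of being re-rendered. The sketch below is illustrative; the key fields and the Data payload are assumptions.

```swift
import Foundation

// A minimal cache of already-rendered previews, keyed by avatar and by whether
// effects were applied.
struct PreviewKey: Hashable {
    let avatarID: String
    let withEffects: Bool
}

final class PreviewCache {
    private var cache: [PreviewKey: Data] = [:]   // rendered clip bytes (e.g., a .mov)

    func clip(for key: PreviewKey, render: () -> Data) -> Data {
        if let cached = cache[key] {
            return cached                 // re-selection: no re-render needed
        }
        let rendered = render()           // first selection: render once, then cache
        cache[key] = rendered
        return rendered
    }

    func invalidateAll() { cache.removeAll() }   // e.g., after a new recording
}
```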
  • noise suppression algorithms can be employed for handling cases where the sound captured by microphone 306 includes sounds other than the user's voice. For example, when the user is in a windy area, or a loud room (e.g., a restaurant or bar). In these examples, a noise suppression algorithm could lower the decibel output of certain parts of the audio recording. Alternatively, or in addition, different voices could be separated and/or only audio coming from certain angles of view (e.g., the angle of the user's face) could be collected, and other voices could be ignored or suppressed. In other cases, if the avatar process 300 determines that the noise levels are too loud or will be difficult to process, the process 300 could disable the recording option.
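As a very rough illustration of lowering the decibel output of noisy portions, the sketch below attenuates short windows whose level falls below a threshold. Real noise suppression and direction-of-arrival filtering would be considerably more sophisticated; the window size and thresholds here are invented.

```swift
// A deliberately simple level-based gate: attenuate short windows whose RMS
// falls below a threshold (treated here as background noise rather than voice).
func gateNoise(samples: [Float],
               windowSize: Int = 1024,
               rmsThreshold: Float = 0.02,
               attenuation: Float = 0.1) -> [Float] {
    var output = samples
    var start = 0
    while start < samples.count {
        let end = min(start + windowSize, samples.count)
        let window = samples[start..<end]
        let meanSquare = window.reduce(0) { $0 + $1 * $1 } / Float(window.count)
        if meanSquare.squareRoot() < rmsThreshold {
            for i in start..<end { output[i] *= attenuation }
        }
        start = end
    }
    return output
}
```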
  • FIG. 4 illustrates an example flow diagram showing process 400 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments.
  • computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least avatar process 300 of FIG. 3 ) may perform the process 400 of FIG. 4 .
  • computing device 106 may capture video having an audio component.
  • the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone may capture the audio information).
  • a single hardware component may be configured to capture both audio and video.
  • the video and audio information may be associated with one another (e.g., by sharing an ID, timestamp, or the like).
  • the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).
  • computing device 106 may extract facial features and audio features from the captured video and audio information, respectively.
  • the facial feature information may be extracted via avatar engine 308 and stored as metadata.
  • the metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face.
  • the actual video file does not need to be stored, creating memory storage efficiency and significant savings.
  • a voice recognition algorithm can be utilized to extract different voice features; for example, words, phrases, pitch, speed, etc.
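The outputs of this extraction step might be organized roughly as follows; the field names are assumptions used only to make the shape of the data concrete (per-frame facial values stored as metadata, plus word, pitch, and speed information from the voice).

```swift
// Illustrative containers for what block 404 might produce; names are not
// taken from the patent.
struct FacialFeatureMetadata {
    let timestamp: Double
    let featureValues: [Float]       // e.g., eyebrow position, mouth openness
}

struct VoiceFeatures {
    let words: [(text: String, time: Double)]   // recognized words with timestamps
    let averagePitchHz: Double
    let speechRateWordsPerMinute: Double
}

struct ExtractedFeatures {
    let facialFrames: [FacialFeatureMetadata]   // stored instead of the raw video frames
    let voice: VoiceFeatures
}
```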
  • computing device 106 may detect context from the extracted features.
  • context may include a user's intent, mood, setting, location, background items, ideas, etc.
  • the context can be important when employing logic to determine what effects to apply.
  • the context can be combined with detected spoken words to determine whether and/or how to adjust the audio file and/or the video file.
  • a user may furrow his eyebrows and speak slowly. The furrowing of the eyebrows is a video feature that could have been extracted at block 404 and the slow speech is an audio feature that could have been extracted at block 404 .
  • Individually, those two features might mean something different; however, when combined, the avatar process can determine that the user is concerned about something.
  • the context of the message might be that a parent is speaking to a child, or a friend is speaking to another friend about a serious or concerning matter.
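A toy version of that combination logic is sketched below: neither cue alone is decisive, but together they map to a "concern" context. The thresholds are invented for illustration.

```swift
// Combining one video feature and one audio feature into a context, per the
// furrowed-brow plus slow-speech example above.
enum DetectedContext { case concern, excitement, neutral }

func detectContext(browFurrowAmount: Float,       // 0...1 from the facial features
                   wordsPerMinute: Double) -> DetectedContext {
    let browFurrowed = browFurrowAmount > 0.6
    let slowSpeech = wordsPerMinute < 100
    if browFurrowed && slowSpeech {
        // Individually ambiguous; together they suggest the user is concerned.
        return .concern
    }
    if !browFurrowed && wordsPerMinute > 180 {
        return .excitement
    }
    return .neutral
}
```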
  • computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context.
  • a particular video and/or audio feature may be employed for this effect.
  • the voice file may be adjusted to sound more somber, or to be slowed down.
  • the avatar-specific voice might be replaced with a version of the original (e.g., raw) audio to convey the seriousness of the message.
  • the context may be animal noises (e.g., based on the user saying "bark" or "meow" or the like). In this case, the determined effect would be to replace the spoken word "bark" with the sound of a dog barking.
  • computing device 106 may perform additional logic for additional effects. For example, if the user attempted to effectuate the bark effect by saying bark twice in a row, the additional logic may need to be utilized to determine whether the additional bark is technically feasible. As an example, if the audio clip of the bark that is used to replace the spoken word in the raw audio information is 0.5 seconds long, but the user says “bark” twice in a 0.7-second span, the additional logic can determine that two bark sounds cannot fit in the 0.7 seconds available. Thus, the audio and video file may need to be extended in order to fit both bark sounds, the bark sound may need to be shortened (e.g., by processing the stored bark sound), or the second spoken word bark may need to be ignored.
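That feasibility check can be framed as a small decision function over the available time span and the stored sound's duration, with the three fallbacks described above (extend the clip, shorten the sound, or ignore the second cue). The names and thresholds below are illustrative.

```swift
// Sketch of the block 410 feasibility check.
enum SoundFitDecision {
    case playAll                      // enough room for every replacement sound
    case shortenEach(to: Double)      // time-compress the stored sound to fit
    case extendClip(by: Double)       // lengthen the audio/video to make room
    case ignoreExtra                  // drop the extra spoken cue (e.g., second "bark")
}

func decideSecondSound(availableSpan: Double,        // e.g., 0.7 s
                       soundDuration: Double,        // e.g., 0.5 s stored bark
                       soundCount: Int,              // e.g., 2
                       canExtendClip: Bool,
                       minimumUsableDuration: Double) -> SoundFitDecision {
    let needed = soundDuration * Double(soundCount)
    if needed <= availableSpan { return .playAll }
    let perSound = availableSpan / Double(soundCount)
    if perSound >= minimumUsableDuration { return .shortenEach(to: perSound) }
    if canExtendClip { return .extendClip(by: needed - availableSpan) }
    return .ignoreExtra
}

// Patent example: two 0.5 s barks need 1.0 s of audio but only 0.7 s is available.
let choice = decideSecondSound(availableSpan: 0.7, soundDuration: 0.5, soundCount: 2,
                               canExtendClip: true, minimumUsableDuration: 0.4)
// With these invented thresholds the clip is extended by 0.3 s; with a lower
// minimum usable duration the stored bark would be shortened to 0.35 s instead.
```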
  • computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or additional effects.
  • the raw audio file may be adjusted (e.g., revised) to form a new audio file with additional sounds added and/or subtracted.
  • the spoken word “bark” will be removed from the audio file and a new sound that represents an actual dog barking will be inserted.
  • the new file can be saved with a different ID, or with an appended ID (e.g., the raw audio ID, with a .v2 identifier to indicate that it is not the original). Additionally, the raw audio file will be saved separately so that it can be reused for additional avatars and/or if the user decides not to use the determined effects.
  • computing device 106 may receive a selection of an avatar from the user.
  • the user may select one of a plurality of different avatars through a UI of the avatar application being executed by computing device 106 .
  • the avatars may be selected via a scroll wheel, drop down menu, or icon menu (e.g., where each avatar is visible on the screen in its own position).
  • computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar.
  • each adjusted video clip (e.g., a final clip for the avatar that has adjusted audio and/or adjusted video) may be generated for each respective avatar prior to selection of the avatar by the user. This way, the processing has already been completed, and the adjusted video clip is ready to be presented immediately upon selection of the avatar. While this might require additional IPS prior to avatar selection, it will speed up the presentation. Additionally, the processing of each adjusted video clip can be performed while the user is reviewing the first preview (e.g., the preview that corresponds to the first/default avatar presented in the UI).
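A sketch of that pre-rendering strategy follows: the default avatar's clip is produced first so its preview can play immediately, and the remaining avatars are rendered while the user watches it. The function and type names are assumptions, and a real implementation would likely render the non-default avatars on a background queue rather than sequentially.

```swift
import Foundation

// Pre-render an adjusted clip for every avatar so later selections play instantly.
struct AdjustedClip {
    let avatarID: String
    let media: Data               // rendered clip bytes (e.g., a .mov)
}

func prerenderAll(avatarIDs: [String],
                  defaultAvatarID: String,
                  render: (String) -> AdjustedClip) -> [String: AdjustedClip] {
    var ready: [String: AdjustedClip] = [:]
    // Render the default avatar first so its preview can be shown immediately.
    ready[defaultAvatarID] = render(defaultAvatarID)
    // Render the remaining avatars while the first preview is being reviewed.
    for id in avatarIDs where id != defaultAvatarID {
        ready[id] = render(id)
    }
    return ready
}
```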
  • FIG. 5 illustrates an example flow diagram showing process 500 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments.
  • computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least avatar process 300 of FIG. 3 ) may perform the process 500 of FIG. 5 .
  • computing device 106 may capture video having an audio component.
  • the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone may capture the audio information).
  • the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).
  • computing device 106 may extract facial features and audio features from the captured video and audio information, respectively.
  • the facial feature information may be extracted via avatar engine 308 and stored as metadata.
  • the metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face.
  • the actual video file does not need to be stored, creating memory storage efficiency and significant savings.
  • a voice recognition algorithm can be utilized to extract different voice features; for example, words, phrases, pitch, speed, etc.
  • avatar engine 308 and/or voice engine 310 may perform the audio feature extraction.
  • computing device 106 may detect context from the extracted features.
  • context may include a user's intent, mood, setting, location, ideas, identity, etc.
  • the context can be important when employing logic to determine what effects to apply.
  • the context can be combined with spoken words to determine whether and/or how to adjust the audio file and/or the video file.
  • a user's age may be detected as the context (e.g., child, adult, etc.) based at least in part on facial and/or voice features.
  • a child's face may have particular features that can be identified (e.g., large eyes, a small nose, and a relatively small head, etc.). As such, a child context may be detected.
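Purely as an illustration of such an age-related context, a heuristic might combine facial proportions with voice pitch, as in the sketch below. The ratios and thresholds are invented and carry no claim of accuracy.

```swift
// Invented heuristic for a "child" context; not part of the patent.
struct FaceProportions {
    let eyeToFaceRatio: Float      // relatively large eyes
    let noseToFaceRatio: Float     // relatively small nose
}

func isLikelyChild(face: FaceProportions, voicePitchHz: Double) -> Bool {
    let childlikeFace = face.eyeToFaceRatio > 0.15 && face.noseToFaceRatio < 0.05
    let childlikeVoice = voicePitchHz > 250    // children tend to have higher pitch
    return childlikeFace && childlikeVoice
}
```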
  • computing device 106 may receive a selection of an avatar from the user.
  • the user may select one of a plurality of different avatars through a UI of the avatar application being executed by computing device 106 .
  • the avatars may be selected via a scroll wheel, drop down menu, or icon menu (e.g., where each avatar is visible on the screen in its own position).
  • computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context and the selected avatar.
  • the effects for each avatar may be generated upon selection of each avatar, as opposed to all at once. In some instances, this will enable realization of significant processor and memory savings, because only one set of effects and avatar rendering will be performed at a time. These savings can be realized especially when the user does not select multiple avatars to preview.
  • computing device 106 may perform additional logic for additional effects, similar to that described above with respect to block 410 of FIG. 4 .
  • computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or additional effects for the selected avatar, similar to that described above with respect to block 412 of FIG. 4 .
  • computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar, similar to that described above with respect to block 416 of FIG. 4 .
  • the avatar process 300 may determine whether to perform flow 400 or flow 500 based at least in part on historical information. For example, if the user generally uses the same avatar every time, flow 500 will be more efficient. However, if the user regularly switches between avatars, and previews multiple different avatars per video clip, then following flow 400 may be more efficient.
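That choice could be driven by a simple statistic over past sessions, for example the average number of avatars previewed per clip. The sketch below is one way to express it; the threshold is arbitrary and the names are assumptions.

```swift
// Choose between the two flows based on usage history: flow 500 (render on
// selection) suits users who stick to one avatar; flow 400 (render everything
// up front) suits users who preview many avatars per clip.
enum RenderFlow {
    case preRenderAll        // corresponds to the flow of FIG. 4
    case renderOnSelection   // corresponds to the flow of FIG. 5
}

func chooseFlow(avatarsPreviewedPerClipHistory: [Int]) -> RenderFlow {
    guard !avatarsPreviewedPerClipHistory.isEmpty else { return .renderOnSelection }
    let average = Double(avatarsPreviewedPerClipHistory.reduce(0, +)) /
                  Double(avatarsPreviewedPerClipHistory.count)
    // Arbitrary threshold: frequent switching favors pre-rendering.
    return average > 2.0 ? .preRenderAll : .renderOnSelection
}
```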
  • FIG. 6 illustrates an example UI 600 for enabling a user to utilize the avatar application (e.g., corresponding to avatar application affordance 602 ).
  • UI 600 may look different (e.g., it may appear as a standard text (e.g., short messaging service (SMS)) messaging application) until avatar application affordance 602 is selected.
  • the avatar application can communicate with the avatar process (e.g., avatar process 300 of FIG. 3 ) to make requests for capturing, processing (e.g., extracting features, running logic, etc.), and adjusting audio and/or video.
  • the avatar application may make an application programming interface (API) call to the avatar process to begin capturing video and audio information using the appropriate hardware components.
  • record/send video clip affordance 604 may be represented as a red circle (or a plain circle without the line shown in FIG. 6 ) prior to the recording session beginning. In this way, the affordance will look more like a standard record button.
  • the appearance of record/send video clip affordance 604 may be changed to look like a clock countdown or other representation of a timer (e.g., if the length of video clip recordings is limited).
  • the record/send video clip affordance 604 may merely change colors to indicate that the avatar application is recording. If there is no timer, or limit on the length of the recording, the user may need to select record/send video clip affordance 604 again to terminate the recording.
  • a user may use avatar selection affordance 606 to select an avatar. This can be done before recording of the avatar video clip and/or after recording of the avatar video clip. When selected before recording, the initial preview of the user's motions and facial characteristics will be presented as the selected avatar. Additionally, the recording will be performed while presenting a live (e.g., real-time) preview of the recording, with the user's face being represented by the selected avatar. Once the recording is completed, a second preview (e.g., a replay of the actual recording) will be presented, again using the selected avatar. However, at this stage, the user can scroll through avatar selection affordance 606 to select a new avatar to view the recording preview.
  • Upon selection of a new avatar, the UI will begin to preview the recording using the selected avatar.
  • the new preview can be presented with the audio/video effects or as originally recorded.
  • the determination regarding whether to present the effected version or the original may be based at least in part on the last method of playback used. For example, if the last playback used effects, the first playback after a new avatar selection may use effects. However, if the last playback did not use effects, the first playback after a new avatar selection may not use effects.
  • the user can replay the video clip with effects by selecting effects preview affordance 608 or without effects by selecting original preview affordance 610.
  • the user can send the avatar video in a message to another computing device using record/send video clip affordance 604 .
  • the video clip will be sent using the format corresponding to the last preview (e.g., with or without effects).
  • delete video clip affordance 612 may be selected to delete the avatar video and either start over or exit the avatar and/or messaging applications.
  • FIG. 7 illustrates an example flow diagram showing process (e.g., a computer-implemented method) 700 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments.
  • computing device 106 of FIG. 1 or other similar user device e.g., utilizing at least an avatar application similar to that shown in FIG. 6 and avatar process 300 of FIG. 3 ) may perform the process 700 of FIG. 7 .
  • computing device 106 may display a virtual avatar generation interface.
  • the virtual avatar generation interface may look similar to the UI illustrated in FIG. 6 . However, any UI configured to enable the same features described herein can be used.
  • computing device 106 may display first preview content of a virtual avatar.
  • the first preview content may be a real-time representation of the user's face, including movement and facial expressions.
  • the first preview would provide an avatar (e.g., cartoon character, digital/virtual puppet) to represent the user's face instead of an image of the user's face.
  • This first preview may be video only, or at least a rendering of the avatar without sound. In some examples, this first preview is not recorded and can be utilized for as long as the user desires, without limitation other than battery power or memory space of computing device 106.
  • computing device 106 may detect selection of an input (e.g., record/send video clip affordance 604 of FIG. 6 ) in the virtual avatar generation interface. This selection may be made while the UI is displaying the first preview content.
  • computing device 106 may begin capturing video and audio signals based at least in part on the input detected at block 706 .
  • the video and audio signals may be captured by appropriate hardware components and can be captured by one or a combination of such components.
  • computing device 106 may extract audio feature characteristics and facial feature characteristics as described in detail above. As noted, the extraction may be performed by particular modules of avatar process 300 of FIG. 3 or by other extraction and/or analysis components of the avatar application and/or computing device 106 .
  • computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics and audio feature characteristics.
  • the audio file captured at block 708 may be permanently (or temporarily) revised (e.g., adjusted) to include new sounds, new words, etc., and/or to have the original pitch, tone, volume, etc., adjusted.
  • These adjustments can be made based at least in part on the context detected via analysis of the facial feature characteristics and audio feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or based on specific motions, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
  • computing device 106 may generate second preview content of the virtual avatar in the UI according to the adjusted audio signal.
  • the generated second preview content may be based at least in part on the currently selected avatar or some default avatar. Once the second preview content is generated, computing device 106 can present the second preview content in the UI at block 716 .
  • FIG. 8 illustrates an example flow diagram showing process (e.g., instructions stored on a computer-readable memory that can be executed) 800 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments.
  • computing device 106 of FIG. 1 or other similar user device e.g., utilizing at least an avatar application similar to that shown in FIG. 6 and avatar process 300 of FIG. 3 ) may perform the process 800 of FIG. 8 .
  • computing device 106 may detect a request to generate an avatar video clip of a virtual avatar.
  • the request may be based at least in part on a user's selection of send/record video clip affordance 604 of FIG. 6 .
  • computing device 106 may capture a video signal associated with a face in the field of view of the camera.
  • computing device 106 may capture an audio signal corresponding to the video signal (e.g., coming from the face being captured by the camera).
  • computing device 106 may extract voice feature characteristics from the audio signal and, at block 810, computing device 106 may extract facial feature characteristics from the video signal.
  • computing device 106 may detect a request to preview the avatar video clip. This request may be based at least in part on a user's selection of a new avatar via avatar selection affordance 606 of FIG. 6 or based at least in part on a user's selection of effects preview affordance 608 of FIG. 6 .
  • computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics and voice feature characteristics.
  • the audio file captured at block 806 may be revised (e.g., adjusted) to include new sounds, new words, etc., and/or to have the original pitch, tone, volume, etc., adjusted.
  • These adjustments can be made based at least in part on the context detected via analysis of the facial feature characteristics and voice feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or based on specific motions, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
  • computing device 106 may generate a preview of the virtual avatar in the UI according to the adjusted audio signal.
  • the generated preview may be based at least in part on the currently selected avatar or some default avatar.
  • computing device 106 can also present the preview in the UI at block 816.
  • FIG. 9 is a simplified block diagram illustrating example architecture 900 for implementing the features described herein, according to at least one embodiment.
  • computing device 902 (e.g., computing device 106 of FIG. 1 ) having example architecture 900 may be configured to present relevant UIs, capture audio and video information, extract relevant data, perform logic, revise the audio and video information, and present animoji videos.
  • Computing device 902 may be configured to execute or otherwise manage applications or instructions for performing the described techniques such as, but not limited to, providing a user interface (e.g., user interface 600 of FIG. 6 ) for recording, previewing, and/or sending virtual avatar video clips.
  • Computing device 902 may receive inputs (e.g., utilizing I/O device(s) 904 such as a touch screen) from a user at the user interface, capture information, process the information, and then present the video clips as previews, also utilizing I/O device(s) 904 (e.g., a speaker of computing device 902).
  • Computing device 902 may be configured to revise audio and/or video files based at least in part on facial features extracted from the captured video and/or voice features extracted from the captured audio.
  • Computing device 902 may be any type of computing device such as, but not limited to, a mobile phone (e.g., a smartphone), a tablet computer, a personal digital assistant (PDA), a laptop computer, a desktop computer, a thin-client device, a smart watch, a wireless headset, or the like.
  • computing device 902 may include at least one memory 914 and one or more processing units (or processor(s)) 916 .
  • Processor(s) 916 may be implemented as appropriate in hardware, computer-executable instructions, or combinations thereof.
  • Computer-executable instruction or firmware implementations of processor(s) 916 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.
  • Memory 914 may store program instructions that are loadable and executable on processor(s) 916 , as well as data generated during the execution of these programs.
  • memory 914 may be volatile (such as random access memory (RAM)) and/or non-volatile (such as read-only memory (ROM), flash memory, etc.).
  • Computing device 902 may also include additional removable storage and/or non-removable storage 926 including, but not limited to, magnetic storage, optical disks, and/or tape storage.
  • the disk drives and their associated non-transitory computer-readable media may provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computing devices.
  • memory 914 may include multiple different types of memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), or ROM. While the volatile memory described herein may be referred to as RAM, any volatile memory that would not maintain data stored therein once unplugged from a host and/or power would be appropriate.
  • Memory 914 and additional storage 926 are all examples of non-transitory computer-readable storage media.
  • non-transitory computer readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
  • Memory 914 and additional storage 926 are both examples of non-transitory computer storage media.
  • Additional types of computer storage media may include, but are not limited to, phase-change RAM (PRAM), SRAM, DRAM, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital video disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 902 . Combinations of any of the above should also be included within the scope of non-transitory computer-readable storage media.
  • computer-readable communication media may include computer-readable instructions, program modules, or other data transmitted within a data signal, such as a carrier wave, or other transmission.
  • computer-readable storage media does not include computer-readable communication media.
  • Computing device 902 may also contain communications connection(s) 928 that allow computing device 902 to communicate with a data store, another computing device or server, user terminals and/or other devices via one or more networks.
  • Such networks may include any one or a combination of many different types of networks, such as cable networks, the Internet, wireless networks, cellular networks, satellite networks, other private and/or public networks, or any combination thereof.
  • Computing device 902 may also include I/O device(s) 904 , such as a touch input device, a keyboard, a mouse, a pen, a voice input device, a display, a speaker, a printer, etc.
  • memory 914 may include operating system 932 and/or one or more application programs or services for implementing the features disclosed herein including user interface module 934 , avatar control module 936 , avatar application module 938 , and messaging module 940 .
  • Memory 914 may also be configured to store one or more audio and video files to be used to produce audio and video output. In this way, computing device 902 can perform all of the operations described herein.
  • user interface module 934 may be configured to manage the user interface of computing device 902 .
  • user interface module 934 may present any number of various UIs requested by computing device 902 .
  • user interface module 934 may be configured to present UI 600 of FIG. 6 , which enables implementation of the features described herein, including communication with avatar process 300 of FIG. 3 , which is responsible for capturing video and audio information, extracting appropriate facial feature and voice feature information, and revising the video and audio information prior to presentation of the generated avatar video clips as described above.
  • avatar control module 936 is configured to implement (e.g., execute instructions for implementing) avatar process 300 while avatar application module 938 is configured to implement the user facing application.
  • avatar application module 938 may utilize one or more APIs for requesting and/or providing information to avatar control module 936 .
  • messaging module 940 may implement any standalone or add-on messaging application that can communicate with avatar control module 936 and/or avatar application module 938 .
  • messaging module 940 may be fully integrated with avatar application module 938 (e.g., as seen in UI 600 of FIG. 6 ), where the avatar application appears to be part of the messaging application.
  • messaging module 940 may call to avatar application module 938 when a user requests to generate an avatar video clip, and avatar application module 938 may open up a new application altogether that is integrated with messaging module 940.
  • Computing device 902 may also be equipped with a camera and microphone, as shown in at least FIG. 3 , and processors 916 may be configured to execute instructions to display a first preview of a virtual avatar.
  • an input may be detected via a virtual avatar generation interface presented by user interface module 934 .
  • avatar control module 936 may initiate a capture session including: capturing, via the camera, a video signal associated with a face in a field of view of the camera, capturing, via the microphone, an audio signal associated with the captured video signal, extracting audio feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal.
  • avatar control module 936 may generate an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics, and display a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.
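Taken together, those operations suggest an interface roughly like the following sketch, in which the capture, extraction, adjustment, and preview steps run in the order described. All type and method names are assumptions rather than actual module APIs.

```swift
// Placeholder types standing in for whatever the real engines produce.
struct FacialFrameData { let featureValues: [Float] }
struct AudioFeatureSet { let words: [String]; let pitchHz: Double }
struct FacialFeatureSet { let frames: [FacialFrameData] }

// An interface-level sketch of the capture session described for avatar
// control module 936.
protocol AvatarControlling {
    func captureVideoSignal() -> [FacialFrameData]          // via the camera
    func captureAudioSignal() -> [Float]                     // via the microphone
    func extractAudioFeatures(from audio: [Float]) -> AudioFeatureSet
    func extractFacialFeatures(from frames: [FacialFrameData]) -> FacialFeatureSet
    func adjustedAudio(audio: [Float],
                       audioFeatures: AudioFeatureSet,
                       facialFeatures: FacialFeatureSet) -> [Float]
    func displaySecondPreview(facialFeatures: FacialFeatureSet, adjustedAudio: [Float])
}

// The overall session, in the order given above.
func runCaptureSession(using controller: AvatarControlling) {
    let video = controller.captureVideoSignal()
    let audio = controller.captureAudioSignal()
    let audioFeatures = controller.extractAudioFeatures(from: audio)
    let facialFeatures = controller.extractFacialFeatures(from: video)
    let adjusted = controller.adjustedAudio(audio: audio,
                                            audioFeatures: audioFeatures,
                                            facialFeatures: facialFeatures)
    controller.displaySecondPreview(facialFeatures: facialFeatures, adjustedAudio: adjusted)
}
```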
  • the various embodiments further can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices or processing devices which can be used to operate any of a number of applications.
  • User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols.
  • Such a system also can include a number of workstations running any of a variety of commercially-available operating systems and other known applications for purposes such as development and database management.
  • These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems and other devices capable of communicating via a network.
  • Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially-available protocols, such as TCP/IP, OSI, FTP, UPnP, NFS, CIFS, and AppleTalk.
  • the network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any combination thereof.
  • the network server can run any of a variety of server or mid-tier applications, including HTTP servers, FTP servers, CGI servers, data servers, Java servers, and business application servers.
  • the server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C# or C++, or any scripting language, such as Perl, Python or TCL, as well as combinations thereof.
  • the server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, and IBM®.
  • the environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers or other network devices may be stored locally and/or remotely, as appropriate.
  • each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen or keypad), and at least one output device (e.g., a display device, printer or speaker).
  • Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as RAM or ROM, as well as removable media devices, memory cards, flash cards, etc.
  • Such devices can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above.
  • the computer-readable storage media reader can be connected with, or configured to receive, a non-transitory computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information.
  • the system and various devices also typically will include a number of software applications, modules, services or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or browser.
  • Non-transitory storage media and computer-readable storage media for containing code, or portions of code can include any appropriate media known or used in the art (except for transitory media like carrier waves or the like) such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices or any other medium which can be used to store the desired information and which can be accessed by a system device.
  • Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase “at least one of X, Y, and Z,” unless specifically stated otherwise, should also be understood to mean X, Y, Z, or any combination thereof, including “X, Y, and/or Z.”

Abstract

Embodiments of the present disclosure can provide systems, methods, and computer-readable medium for adjusting audio and/or video information of a video clip based at least in part on facial feature and/or voice feature characteristics extracted from hardware components. For example, in response to detecting a request to generate an avatar video clip of a virtual avatar, a video signal associated with a face in a field of view of a camera and an audio signal may be captured. Voice feature characteristics and facial feature characteristics may be extracted from the audio signal and the video signal, respectively. In some examples, in response to detecting a request to preview the avatar video clip, an adjusted audio signal may be generated based at least in part on the facial feature characteristics and the voice feature characteristics, and a preview of the video clip of the virtual avatar using the adjusted audio signal may be displayed.

Description

    CROSS-REFERENCES TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Patent Application No. 62/507,177, entitled “Emoji Recording and Sending,” filed May 16, 2017, U.S. Provisional Patent Application No. 62/556,412, entitled “Emoji Recording and Sending,” filed Sep. 9, 2017, and U.S. Provisional Patent Application No. 62/557,121, entitled “Emoji Recording and Sending,” filed Sep. 11, 2017, the entire disclosures of each being herein incorporated by reference for all purposes.
  • BACKGROUND
  • Multimedia content, such as emojis, can be sent as part of messaging communications. Emojis can represent a variety of predefined people, objects, actions, and/or other things. Some messaging applications allow users to select from a predefined library of emojis, which can be sent as part of a message that can contain other content (e.g., other multimedia and/or textual content). Animojis are one type of this other multimedia content, where a user can select an avatar (e.g., a puppet) to represent themselves. The animoji can move and talk as if it were a video of the user. Animojis enable users to create personalized versions of emojis in a fun and creative way.
  • SUMMARY
  • Embodiments of the present disclosure can provide systems, methods, and computer-readable medium for implementing avatar video clip revision and playback techniques. In some examples, a computing device can present a user interface (UI) for tracking a user's face and presenting a virtual avatar representation (e.g., a puppet or video character version of the user's face). Upon identifying a request to record, the computing device can capture audio and video information, extract and detect context as well as facial feature characteristics and voice feature characteristics, revise the audio and/or video information based at least in part on the extracted/identified features, and present a video clip of the avatar using the revised audio and/or video information.
  • In some embodiments, a computer-implemented method for implementing various audio and video effects techniques may be provided. The method may include displaying a virtual avatar generation interface. The method may also include displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to realtime preview video frames of a user headshot in a field of view of the camera and associated headshot changes in an appearance. The method may also include detecting an input in the virtual avatar generation interface while displaying the first preview content of the virtual avatar. In some examples, in response to detecting the input in the virtual avatar generation interface, the method may also include: capturing, via the camera, a video signal associated with the user headshot during a recording session, capturing, via the microphone, a user audio signal during the recording session, extracting audio feature characteristics from the captured user audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the recording session, the method may also include: generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics, generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal, and presenting the second preview content in the virtual avatar generation interface.
  • In some embodiments, the method may also include storing facial feature metadata associated with the facial feature characteristics extracted from the video signal and generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics. Additionally, the second preview of the virtual avatar may be displayed further according to the adjusted facial metadata. In some examples, the first preview of the virtual avatar may be displayed according to preview facial feature characteristics identified according to the changes in the appearance of the face during a preview session.
  • In some embodiments, an electronic device for implementing various audio and video effects techniques may be provided. The system may include a camera, a microphone, a library of pre-recorded/pre-determined audio, and one or more processors in communication with the camera and the microphone. In some examples, the processors may be configured to execute computer-executable instructions to perform operations. The operations may include detecting an input in a virtual avatar generation interface while displaying a first preview of a virtual avatar. The operations may also include initiating a capture session in response to detecting the input in the virtual avatar generation interface. The capture session may include: capturing, via the camera, a video signal associated with a face in a field of view of the camera, capturing, via the microphone, an audio signal associated with the captured video signal, extracting audio feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. In some examples, the operations may also include generating an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics and presenting the second preview content in the virtual avatar generation interface, at least in response to detecting expiration of the capture session.
  • In some instances, the audio signal may be further adjusted based at least in part on a type of the virtual avatar. Additionally, the type of the virtual avatar may be received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface. In some instances, the type of the virtual avatar may include an animal type, and the adjusted audio signal may be generated based at least in part on a predetermined sound associated with the animal type. The use and timing of predetermined sounds may be based on audio features from the captured audio and/or facial features from the captured video. This predetermined sound may also be itself modified based on audio features from the captured audio and facial features from the captured video. In some examples, the one or more processors may be further configured to determine whether a portion of the audio signal corresponds to the face in the field of view. Additionally, in accordance with a determination that the portion of the audio signal corresponds to the face, the portion of the audio signal may be stored for use in generating the adjusted audio signal and/or in accordance with a determination that the portion of the audio signal does not correspond to the face, at least the portion of the audio signal may be discarded and not considered for modification and/or playback. Additionally, the audio feature characteristics may comprise features of a voice associated with the face in the field of view. In some examples, the one or more processors may be further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal. In some examples, the one or more processors may be further configured to store audio feature metadata associated with the audio feature characteristics extracted from the audio signal. Further, the one or more processors may be further configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics, and the second preview of the virtual avatar may be generated according to the adjusted facial metadata and the adjusted audio signal.
  • In some embodiments, a computer-readable medium may be provided. The computer-readable medium may include computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform operations. The operations may include performing the following actions in response to detecting a request to generate an avatar video clip of a virtual avatar: capturing, via a camera of an electronic device, a video signal associated with a face in a field of view of the camera, capturing, via a microphone of the electronic device, an audio signal, extracting voice feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. The operations may also include performing the following actions in response to detecting a request to preview the avatar video clip: generating an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics, and displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.
  • In some embodiments, the audio signal may be adjusted based at least in part on a facial expression identified in the facial feature characteristics associated with the face. In some instances, the audio signal may be adjusted based at least in part on a level, pitch, duration, format, or change in a voice characteristic associated with the face. Further, in some embodiments, the one or more processors may be further configured to perform the operations comprising transmitting the video clip of the virtual avatar to another electronic device.
  • The following detailed description together with the accompanying drawings will provide a better understanding of the nature and advantages of the present disclosure.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 2 is another simplified block diagram illustrating example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 3 is another simplified block diagram illustrating hardware and software components for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 4 is a flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 5 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 6 is a simplified block diagram illustrating a user interface for providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 7 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 8 is another flow diagram to illustrate providing audio and/or video effects techniques as described herein, according to at least one example.
  • FIG. 9 is a simplified block diagram illustrating a computer architecture for providing audio and/or video effects techniques as described herein, according to at least one example.
  • DETAILED DESCRIPTION
  • Certain embodiments of the present disclosure relate to devices, computer-readable medium, and methods for implementing various techniques for providing voice effects (e.g., revised audio) based at least in part on facial expressions. Additionally, in some cases, the various techniques may also provide video effects based at least in part on audio characteristics of a recording. Even further, the various techniques may also provide voice effects and video effects (e.g., together) based at least in part on one or both of facial expressions and audio characteristics of a recording. In some examples, the voice effects and/or video effects may be presented in a user interface (UI) configured to display a cartoon representation of a user (e.g., an avatar or digital puppet). Such an avatar that represents a user may be considered an animoji, as it may look like an emoji character familiar to most smart phone users; however, it can be animated to mimic actual motions of the user.
  • For example, a user of a computing device may be presented with a UI for generating an animoji video (e.g., a video clip). The video clip can be limited to a predetermined amount of time (e.g., 10 seconds, 30 seconds, or the like), or the video clip can be unlimited. In the UI, a preview area may present the user with a real-time representation of their face, using an avatar character. Various avatar characters may be provided, and a user may even be able to generate or import their own avatars. The preview area may be configured to provide an initial preview of the avatar and a preview of the recorded video clip. Additionally, the recorded video clip may be previewed in its original form (e.g., without any video or audio effects) or it may be previewed with audio and/or video effects. In some cases, the user may select an avatar after the initial video clip has been recorded. The video clip preview may then change from one avatar to another, with the same or different video effects applied to it, as appropriate. For example, if the raw preview (e.g., original form, without effects) is being viewed, and the user switches avatar characters, the UI may be updated to display a rendering of the same video clip but with the newly selected avatar. In other words, the facial features and audio (e.g., the user's voice) that were captured during the recording can be presented from any of the avatars (e.g., without any effects). In the preview, it will appear as if the avatar character is moving the same way the user moved during the recording, and speaking what the user said during the recording.
  • By way of example, a user may select a first avatar (e.g., a unicorn head) via the UI, or a default avatar can be initially provided. The UI will present the avatar (in this example, the head of a cartoon unicorn if selected by the user or any other available puppet by default) in the preview area, and the device will begin capturing audio and/or video information (e.g., using one or more microphones and/or one or more cameras). In some cases, only video information is needed for the initial preview screen. The video information can be analyzed, and facial features can be extracted. These extracted facial features can then be mapped to the unicorn face in real-time, such that the initial preview of the unicorn head appears to mirror that of the user's. In some cases, the term real-time is used to indicate that the results of the extraction, mapping, rendering, and presentation are performed in response to each motion of the user and can be presented substantially immediately. To the user, it will appear as if they are looking in the mirror, except the image of their face is replaced with an avatar.
  • While the user's face is in the line of sight (e.g., the view) of a camera of the device, the UI will continue to present the initial preview. Upon selection of a record affordance (e.g., a virtual button) on the UI, the device may begin to capture video that has an audio component. In some examples, this includes a camera capturing frames and a microphone capturing audio information. A special camera may be utilized that is capable of capturing 3-dimensional (3D) information as well. Additionally, in some examples, any camera may be utilized that is capable of capturing video. The video may be stored in its original form and/or metadata associated with the video may be stored. As such, capturing the video and/or audio information may be different from storing the information. For example, capturing the information may include sensing the information and at least caching it such that it is available for processing. The processed data can also be cached until it is determined whether to store or simply utilize the data. For example, during the initial preview, while the user's face is being presented as a puppet in real-time, the video data (e.g., metadata associated with the data) may be cached, while it is mapped to the puppet and presented. However, this data may not be stored permanently at all, such that the initial preview is not reusable or recoverable.
  • Alternatively, in some examples, once the user selects the record affordance of the UI, the video data and the audio data may be stored more permanently. In this way, the audio and video (A/V) data may be analyzed, processed, etc., in order to provide the audio and video effects described herein. In some examples, the video data may be processed to extract facial features (e.g., facial feature characteristics) and those facial features may be stored as metadata for the animoji video clip. The set of metadata may be stored with an identifier (ID) that indicates the time, date, and user associated with the video clip. Additionally, the audio data may be stored with the same or other ID. Once stored, or in some examples prior to storage, the system (e.g., processors of the device) may extract audio feature characteristics from the audio data and facial feature characteristics from the video file. This information can be utilized to identify context, key words, intent, and/or emotions of the user, and video and audio effects can be introduced into audio and video data prior to rendering the puppet. In some examples, the audio signal can be adjusted to include different words, sounds, tones, pitches, timing, etc., based at least in part on the extracted features. Additionally, in some examples, the video data (e.g., the metadata) can also be adjusted. In some examples, audio features are extracted in real-time during the preview itself. These audio features may be avatar specific, generated only if the associated avatar is being previewed. Alternatively, the audio features may be avatar agnostic, generated for all avatars. The audio signal can also be adjusted in part based on these real-time audio feature extractions, and with the pre-stored extracted video features which are created during or after the recording process, but before previewing.
  • Once the video and audio data have been adjusted based at least in part on the extracted characteristics, a second preview of the puppet can be rendered. This rendering may be performed for each possible puppet, such that, as the user scrolls through and selects different puppets, the adjusted data is already rendered. Or the rendering can be performed after selection of each puppet. In any event, once the user selects a puppet, the second preview can be presented. The second preview will replay the video clip that was recorded by the user, but with the adjusted audio and/or video. Using the example from above, if the user recorded themselves with an angry tone (e.g., with a gruff voice and a furrowed brow), the context or intent of anger may be detected, and the audio file may be adjusted to include a growling sound. Thus, the second preview would look like a unicorn saying the words that the user said; however, the voice of the user may be adjusted to sound like a growl, or to make the tone more baritone (e.g., lower). The user could then save the second preview or select it for transmission to another user (e.g., through a messaging application or the like). In some examples, the animoji video clips described above and below can be shared as .mov files. However, in other examples, the described techniques can be used in real-time (e.g., with video messaging or the like).
• FIG. 1 is a simplified block diagram illustrating example flow 100 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In example flow 100, there are two separate sessions: recording session 102 and playback session 104. In recording session 102, device 106 may capture video having an audio component of user 108 at block 110. In some examples, the video and audio may be captured (e.g., collected) separately, using two different devices (e.g., a microphone and a camera). The capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 108. In some examples, user 108 may say the word "hello" at block 112. Additionally, at block 112, device 106 may continue to capture the video and/or audio components of the user's actions. At block 114, device 106 can continue capturing the video and audio components, and in this example, user 108 may say the word "bark." At block 114, device 106 may also extract spoken words from the audio information. However, in other examples, the spoken word extraction (or any audio feature extraction) may actually take place after recording session 102 is complete. In other examples, the spoken word extraction (or any audio feature extraction) may actually take place during the preview block 124 in real-time. It is also possible for the extraction (e.g., analysis of the audio) to be done in real-time while recording session 102 is still in progress. In any of these cases, the avatar process being executed by device 106 may identify through the extraction that the user said the word "bark" and may employ some logic to determine what audio effects to implement.
  • By way of example, recording session 102 may end when user 108 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 10 seconds, 30 seconds, or the like). In some cases, this time period may be automatically predetermined, while in others, it may be user selected (e.g., selected from a list of options or entered in free form through a text entry interface). Once the recording has completed, user 108 may select a preview affordance, indicating that user 108 wishes to watch a preview of the recording. One option could be to play the original recording without any visual or audio effects. However, another option could be to play a revised version of the video clip. Based at least in part on detection of the spoken word “bark,” the avatar process may have revised the audio and/or video of the video clip.
• At block 116, device 106 may present avatar (also called a puppet and/or animoji) 118 on a screen. Device 106 may also be configured with speaker 120 that can play audio associated with the video clip. In this example, block 116 corresponds to the same point in time as block 110, where user 108 may have had his mouth open, but was not yet speaking. As such, avatar 118 may be presented with his mouth open; however, no audio is presented from speaker 120 yet. At block 122, corresponding to block 112 where user 108 said "hello," the avatar process can present avatar 118 with an avatar-specific voice. In other words, a predefined dog voice may be used to say the word "hello" at block 122. The dog-voice word "hello" can be presented by speaker 120. As will be described in further detail below, there are a variety of different animal (and other character) avatars available for selection by user 108. In some examples, each avatar may be associated with a particular pre-defined voice that best fits that avatar. For example, a dog may have a dog voice, a cat may have a cat voice, a pig may have a pig voice, and a robot may have a robotic voice. These avatar-specific voices may be pre-recorded or may be associated with particular frequency or audio transformations that can be applied by executing mathematical operations on the original sound, such that any user's voice can be transformed to sound like the dog voice. However, each user's dog voice may sound different based at least in part on the particular audio transformation performed.
• At block 124, the avatar process may replace the spoken word (e.g., "bark") with an avatar-specific word. In this example, the sound of a dog bark (e.g., a recorded or simulated dog bark) may be inserted into the audio data (e.g., in place of the word "bark") such that when it is played back during presentation of the video clip, a "woof" is presented by speaker 120. In some examples, different avatar-specific words will be presented at 124 based at least in part on different avatar selections, and in other examples, the same avatar-specific word may be presented regardless of the avatar selections. For example, if user 108 said "bark," a "woof" could be presented when the dog avatar is selected. However, in this same case, if user 108 later selected the cat avatar for the same flow, there are a couple of options for revising the audio. In one example, the process could convert the "bark" into a "woof" even though it wouldn't be appropriate for a cat to "woof." In a different example, the process could convert "bark" into a recorded or simulated "meow," based at least in part on the selection of the cat avatar. In yet another example, the process could ignore the "bark" for avatars other than the dog avatar. As such, there may be a second level of audio feature analysis performed even after the extraction at 114. Video and audio features may also influence processing on the avatar-specific utterances. For example, the level, pitch, and intonation with which a user says "bark" may be detected as part of the audio feature extraction, and this may direct the system to select a specific "woof" sample or transform such a sample before and/or during the preview process.
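A minimal sketch of this kind of keyword-to-sound mapping is shown below, assuming a simple policy of converting, substituting, or ignoring the keyword per avatar; the enum cases, sample names, and scaling rules are hypothetical placeholders rather than the actual logic.

```swift
import Foundation

// Hypothetical word-spotting policy: a spoken keyword ("bark") is mapped to an
// avatar-specific sound, converted for the selected avatar, or ignored when no
// mapping exists. All names and values are illustrative.
enum AvatarType { case dog, cat, robot, unicorn }

struct SoundEffect {
    let sampleName: String        // e.g., a prerecorded "woof" sample
    var playbackLevel: Float      // scaled from the detected level of the keyword
    var pitchShift: Float         // scaled from the detected pitch/intonation
}

func effect(forKeyword word: String,
            avatar: AvatarType,
            detectedLevel: Float,
            detectedPitch: Float) -> SoundEffect? {
    let sample: String?
    switch (word, avatar) {
    case ("bark", .dog): sample = "woof"
    case ("bark", .cat): sample = "meow"     // convert for the cat avatar
    case ("bark", _):    sample = nil        // ignore for avatars with no mapping
    default:             sample = nil
    }
    guard let name = sample else { return nil }
    // Level and pitch of the user's utterance shape how the sample is rendered,
    // e.g., shift the sample up for a higher-pitched utterance.
    return SoundEffect(sampleName: name,
                       playbackLevel: detectedLevel,
                       pitchShift: detectedPitch > 1.0 ? 2.0 : 0.0)
}
```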
  • FIG. 2 is another simplified block diagram illustrating example flow 200 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In example flow 200, much like in example flow 100 of FIG. 1, there are two separate sessions: recording session 202 and playback session 204. In recording session 202, device 206 may capture video having an audio component of user 208 at block 210. The capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 208. In some examples, user 208 may say the word “hello” at block 212. Additionally, at block 212, device 206 may continue to capture the video and/or audio components of the user's actions. At block 214, device 206 can continue capturing the video and audio components, and in this example, user 208 may hold his mouth open, but not say anything. At block 214, device 206 may also extract facial expressions from the video. However, in other examples, the facial feature extraction (or any video feature extraction) may actually take place after recording session 202 is complete. Still, it is possible for the extraction (e.g., analysis of the video) to be done in real-time while recording session 202 is still in process. In either case, the avatar process being executed by device 206 may identify through the extraction that the user opened his mouth briefly (e.g., without saying anything) and may employ some logic to determine what audio and/or video effects to implement. In some examples, the determination that the user held their mouth open without saying anything may require extraction and analysis of both audio and video. For example, extraction of the facial feature characteristics (e.g., open mouth) may not be enough, and the process may also need to detect that user 208 did not say anything during the same time period of the recording. Video and audio features may also influence processing on the avatar specific utterances. For example, the duration of the opening of the mouth, opening of eyes, etc. may direct the system to select a specific “woof” sample or transform such a sample before and/or during the preview process. One such transformation is changing the level and/or duration of the woof to match the detected opening and closing of the user's mouth.
  • By way of example, recording session 202 may end when user 208 selects the record affordance again (e.g., indicating a desire to end the recording), selects an end recording affordance (e.g., the record affordance may act as an end recording affordance while recording), or based at least in part on expiration of a time period (e.g., 20 seconds, 30 seconds, or the like). Once the recording has finished, user 208 may select a preview affordance, indicating that user 208 wishes to watch a preview of the recording. One option could be to play the original recording without any visual or audio effects. However, another option could be to play a revised version of the recording. Based at least in part on detection of the facial expression (e.g., the open mouth), the avatar process may have revised the audio and/or video of the video clip.
  • At block 216, device 206 may present avatar (also called a puppet and/or animoji) 218 on a screen of device 206. Device 206 may also be configured with speaker 220 that can play audio associated with the video clip. In this example, block 216 corresponds to the same point in time as block 210, where user 208 may not have been speaking yet. As such, avatar 218 may be presented with his mouth open; however, no audio is presented from speaker 220 yet. At block 222, corresponding to block 212 where user 208 said “hello,” the avatar process can present avatar 218 with an avatar-specific voice (as described above).
• At block 224, the avatar process may replace the silence identified at block 214 with an avatar-specific word. In this example, the sound of a dog bark (e.g., a recorded or simulated dog bark) may be inserted into the audio data (e.g., in place of the silence) such that when it is played back during presentation of the video clip, a "woof" is presented by speaker 220. In some examples, different avatar-specific words will be presented at 224 based at least in part on different avatar selections, and in other examples, the same avatar-specific word may be presented regardless of the avatar selections. For example, if user 208 held his mouth open, a "woof" could be presented when the dog avatar is selected, a "meow" sound could be presented for a cat avatar, etc. In some cases, each avatar may have a predefined sound to be played when it is detected that user 208 has held his mouth open for an amount of time (e.g., a half second, a whole second, etc.) without speaking. However, in some examples, the process could ignore the detection of the open mouth for avatars that don't have a predefined effect for that facial feature. Additionally, there may be a second level of audio feature analysis performed even after the extraction at 214. For example, if the process determines that a "woof" is to be inserted for a dog avatar (e.g., based on detection of the open mouth), the process may also detect how many "woof" sounds to insert (e.g., if the user held his mouth open for double the length of time used to indicate a bark) or whether it is not possible to insert the number of barks requested (e.g., in the scenario of FIG. 1, where the user would speak "bark" to indicate that a "woof" sound should be inserted). Thus, based on the above two examples, it should be evident that user 208 can control effects of the playback (e.g., the recorded avatar message) with their facial and voice expressions. Further, while not shown explicitly in either FIG. 1 or FIG. 2, the user device can be configured with software for executing the avatar process (e.g., capturing the A/V information, extracting features, analyzing the data, implementing the logic, revising the audio and/or video files, and rendering the previews) as well as software for executing an application (e.g., an avatar application with its own UI) that enables the user to build the avatar messages and subsequently send them to other user devices.
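The following sketch illustrates, under stated assumptions, how a video cue (mouth held open) and an audio cue (silence) might be required to overlap before an avatar-specific sound is inserted and stretched to the detected opening; the half-second threshold, type names, and return shape are illustrative only.

```swift
import Foundation

// Hypothetical combination of a video cue (mouth held open) with an audio cue
// (silence over the same interval) before inserting an avatar-specific sound,
// then stretching the sound to match the detected mouth opening.
struct Interval { let start: TimeInterval; let duration: TimeInterval }

func avatarSound(mouthOpen: Interval?,
                 silence: Interval?,
                 avatarSample: String,
                 sampleDuration: TimeInterval)
    -> (sample: String, start: TimeInterval, duration: TimeInterval)? {
    // Both cues must overlap: an open mouth while speaking should not trigger the effect.
    guard let mouth = mouthOpen, let quiet = silence else { return nil }
    let overlapStart = max(mouth.start, quiet.start)
    let overlapEnd = min(mouth.start + mouth.duration, quiet.start + quiet.duration)
    guard overlapEnd - overlapStart > 0.5 else { return nil }   // e.g., at least half a second

    // Stretch the sample so the "woof" tracks the mouth opening and closing.
    let target = overlapEnd - overlapStart
    return (avatarSample, overlapStart, max(sampleDuration, target))
}
```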
• FIG. 3 is a simplified block diagram 300 illustrating components (e.g., software modules) utilized by the avatar process described above and below. In some examples, more or fewer modules can be utilized to implement the providing of audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In some examples, device 302 may be configured with camera 304, microphone 306, and a display screen for presenting a UI and the avatar previews (e.g., the initial preview before recording as well as the preview of the recording before sending). In some examples, the avatar process is configured with avatar engine 308 and voice engine 310. Avatar engine 308 can manage the list of avatars, process the video features (e.g., facial feature characteristics), revise the video information, communicate with voice engine 310 when appropriate, and render video of the avatar 312 when all processing is complete and effects have been implemented (or discarded). Revising of the video information can include adjusting or otherwise editing the metadata associated with the video file. In this way, when the video metadata (adjusted or not) is used to render the puppet, the facial features can be mapped to the puppet. In some examples, voice engine 310 can store the audio information, perform the logic for determining what effects to implement, revise the audio information, and provide modified audio 314 when all processing is complete and effects have been implemented (or discarded).
• In some examples, once the user selects to record a new avatar video clip, video features 316 can be captured by camera 304 and audio features 318 can be captured by microphone 306. In some cases there may be as many as (or more than) fifty facial features to be detected within video features 316. Example video features include, but are not limited to, duration of expressions, open mouth, frowns, smiles, eyebrows up or furrowed, etc. Additionally, video features 316 may include only metadata that identifies each of the facial features (e.g., data points that indicate which locations on the user's face moved or what position they were in). Further, video features 316 can be passed to avatar engine 308 and voice engine 310. At avatar engine 308, the metadata associated with video features 316 can be stored and analyzed. In some examples, avatar engine 308 may perform the feature extraction from the video file prior to storing the metadata. However, in other examples, the feature extraction may be performed prior to video features 316 being sent to avatar engine 308 (in which case, video features 316 would be the metadata itself). At voice engine 310, video features 316 may be compared with audio features 318 when it is helpful to match up which audio features correspond to which video features (e.g., to see if certain audio and video features occur at the same time).
• In some instances, audio features are also passed to voice engine 310 for storage. Example audio features include, but are not limited to, level, pitch, and dynamics (e.g., changes in level, pitch, voicing, formants, duration, etc.). Raw audio 320 includes the unprocessed audio file as it is captured. Raw audio 320 can be passed to voice engine 310 for further processing and potential (e.g., eventual) revision, and it can also be stored separately so that the original audio can be used if desired. Raw audio 320 can also be passed to voice recognition module 322. Voice recognition module 322 can be used to spot key words and identify a user's intent from their voice. For example, voice recognition module 322 can determine when a user is angry, sad, happy, or the like. Additionally, when a user says a key word (e.g., "bark" as described above), voice recognition module 322 will detect this. Information detected and/or collected by voice recognition module 322 can then be passed to voice engine 310 for further logic and/or processing. As noted, in some examples, audio features are extracted in real-time during the preview itself. These audio features may be avatar specific (generated only if the associated avatar is being previewed) or avatar agnostic (generated for all avatars). The audio signal can also be adjusted in part based on these real-time audio feature extractions, along with the pre-stored extracted video features, which are created during or after the recording process but before previewing. Additionally, some feature extraction may be performed during rendering at 336 by voice engine 310. Some pre-stored sounds 338 may be used by voice engine 310, as appropriate, to fill in the blanks or to replace other sounds that were extracted.
• In some examples, voice engine 310 will make the determination regarding what to do with the information extracted from voice recognition module 322. In some examples, voice engine 310 can pass the information from voice recognition module 322 to feature module 324 for determining which features correspond to the data extracted by voice recognition module 322. For example, feature module 324 may indicate (e.g., based on a set of rules and/or logic) that a sad voice detected by voice recognition module 322 corresponds to a raising of the pitch of the voice, or the slowing down of the speed or cadence of the voice. In other words, feature module 324 can map the extracted audio features to particular voice features. Then, effect type module 326 can map the particular voice features to the desired effect. Voice engine 310 can also be responsible for storing each particular voice for each possible avatar. For example, there may be standard or hardcoded voices for each avatar. Without any other changes being made, if a user selects a particular avatar, voice engine 310 can select the appropriate standard voice for use with playback. In this case, modified audio 314 may just be raw audio 320 transformed to the appropriate avatar voice based on the selected avatar. As the user scrolls through the avatars and selects different ones, voice engine 310 can modify raw audio 320 on the fly to make it sound like the newly selected avatar. Thus, avatar type 328 needs to be provided to voice engine 310 to make this change. However, if an effect is to be provided (e.g., the pitch, tone, or actual words are to be changed within the audio file), voice engine 310 can revise raw audio file 320 and provide modified audio 314. In some examples, the user will be provided with an option to use the original audio file at on/off 330. If the user selects "off" (e.g., effects off), then raw audio 320 can be combined with video of avatar 312 (e.g., corresponding to the unchanged video) to make A/V output 332. A/V output 332 can be provided to the avatar application presented on the UI of device 302.
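A rough sketch of this chain, under a simplified model, is shown below: a recognized intent is mapped to voice features (standing in for feature module 324), those features select an adjustment (standing in for effect type module 326), and the avatar's standard voice transform is applied unless effects are switched off. The intent values, scaling factors, and the naive resampling are all illustrative assumptions, not the described implementation.

```swift
import Foundation

enum DetectedIntent { case sad, angry, happy, neutral }

struct VoiceFeatures { var pitchScale: Float; var speedScale: Float }

// Stand-in for feature module 324: map a recognized intent to voice features.
func featureModule(_ intent: DetectedIntent) -> VoiceFeatures {
    switch intent {
    case .sad:   return VoiceFeatures(pitchScale: 1.2, speedScale: 0.8) // raise pitch, slow cadence
    case .angry: return VoiceFeatures(pitchScale: 0.8, speedScale: 1.0) // lower toward a growl
    default:     return VoiceFeatures(pitchScale: 1.0, speedScale: 1.0)
    }
}

// Naive speed change by resampling (this also shifts pitch; a real effect
// chain would use a proper time-stretch and pitch-shift stage).
func applySpeed(_ samples: [Float], scale: Float) -> [Float] {
    guard scale > 0, !samples.isEmpty else { return samples }
    let outCount = Int(Float(samples.count) / scale)
    return (0..<outCount).map { samples[min(samples.count - 1, Int(Float($0) * scale))] }
}

func modifiedAudio(rawAudio: [Float],
                   intent: DetectedIntent,
                   avatarVoiceTransform: ([Float]) -> [Float],
                   effectsOn: Bool) -> [Float] {
    guard effectsOn else { return rawAudio }            // "off": keep the original audio
    let avatarVoice = avatarVoiceTransform(rawAudio)    // standard voice for the selected avatar
    let features = featureModule(intent)                // stand-in for effect type module 326
    return applySpeed(avatarVoice, scale: features.speedScale)
}
```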
• Avatar engine 308 can be responsible for providing the initial avatar image based at least in part on the selection of avatar type 328. Additionally, avatar engine 308 is responsible for mapping video features 316 to the appropriate facial markers of each avatar. For example, if video features 316 indicate that the user is smiling, the metadata that indicates a smile can be mapped to the mouth area of the selected avatar so that the avatar appears to be smiling in video of avatar 312. Additionally, avatar engine 308 can receive timing changes 334 from voice engine 310, as appropriate. For example, if voice engine 310 determines that the voice effect should make the audio more of a whispering voice (e.g., based on feature module 324 and/or effect type module 326 and/or the avatar type), and modifies the voice to be more of a whispered voice, this effect change may include slowing down the voice itself, in addition to a reduced level and other formant and pitch changes. Accordingly, the voice engine may produce modified audio that is slower in playback speed relative to the original audio file for the audio clip. In this scenario, voice engine 310 would need to instruct avatar engine 308 via timing changes 334, so that the video file can be slowed down appropriately; otherwise, the video and audio would not be synchronized.
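The handshake can be pictured as in the sketch below, where the voice engine reports a playback speed factor and the avatar engine retimes its facial-feature timeline by the same factor; the protocol name, the 0.8 speed factor, and the timestamp representation are hypothetical.

```swift
import Foundation

// Hypothetical timing-change handshake: when the audio is slowed (e.g., a
// whisper effect), the new playback speed is reported so the facial-feature
// metadata can be retimed to stay in sync.
protocol TimingChangeReceiver {
    func applyTimingChange(speedFactor: Double)   // e.g., 0.8 = 20% slower
}

final class AvatarEngineSketch: TimingChangeReceiver {
    var frameTimestamps: [TimeInterval] = []

    func applyTimingChange(speedFactor: Double) {
        // Stretch the video metadata timeline by the same factor as the audio.
        frameTimestamps = frameTimestamps.map { $0 / speedFactor }
    }
}

final class VoiceEngineSketch {
    var timingReceiver: TimingChangeReceiver?

    func applyWhisperEffect(audioDuration: TimeInterval) -> TimeInterval {
        let speedFactor = 0.8                      // slower, softer delivery
        timingReceiver?.applyTimingChange(speedFactor: speedFactor)
        return audioDuration / speedFactor         // modified audio is longer
    }
}
```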
• As noted, a user may use the avatar application of device 302 to select different avatars. In some examples, the voice effect can change based at least in part on this selection. However, in other examples, the user may be given the opportunity to select a different voice for a given avatar (e.g., the cat voice for the dog avatar, etc.). This type of free-form voice effect change can be executed by the user via selection on the UI or, in some cases, with voice activation or face motion. For example, a certain facial expression could trigger voice engine 310 to change the voice effect for a given avatar. Further, in some examples, voice engine 310 may be configured to make children's voices sound more high pitched or, alternatively, determine not to make a child's voice more high pitched because it would sound inappropriate given that raw audio 320 for a child's voice might already be high pitched. This user-specific determination of an effect could be driven in part by the extracted audio features, which in this case could include pitch values and ranges throughout the recording.
• In some examples, voice recognition module 322 may include a recognition engine, a word spotter, a pitch analyzer, and/or a formant analyzer. The analysis performed by voice recognition module 322 will be able to identify if the user is upset, angry, happy, etc. Additionally, voice recognition module 322 may be able to identify context and/or intonation of the user's voice, as well as changes in the intention of the wording, and/or determine a profile (e.g., a virtual identity) of the user.
  • In some examples, the avatar process 300 can be configured to package/render the video clip by combining video of avatar 312 and either modified audio 314 or raw audio 320 into A/V output 332. In order to package the two, voice engine 310 just needs to know an ID for the metadata associated with video of avatar 312 (e.g., it does not actually need video of avatar 312, it just needs the ID of the metadata). A message within a messaging application (e.g., the avatar application) can be transmitted to other computing devices, where the message includes A/V output 332. When a user selects a “send” affordance in the UI, the last video clip to be previewed can be sent. For example, if a user previews their video clip with the dog avatar, and then switches to the cat avatar for preview, the cat avatar video would be sent when the user selects “send.” Additionally, the state of the last preview can be stored and used later. For example, if the last message (e.g., avatar video clip) sent used a particular effect, the first preview of the next message being generated can utilize that particular effect.
• The logic implemented by voice engine 310 and/or avatar engine 308 can check for certain cues and/or features, and then revise the audio and/or video files to implement the desired effect. One example feature/effect pair is detecting that a user has opened their mouth and paused for a moment. In this example, both facial feature characteristics (e.g., mouth open) and audio feature characteristics (e.g., silence) need to happen at the same time in order for the desired effect to be implemented. For this feature/effect pair, the desired effect is to revise the audio and video so that the avatar appears to make an avatar/animal-specific sound. For example, a dog will make a bark sound, a cat will make a meow sound, and a monkey, horse, unicorn, etc., will make the appropriate sound for that character/animal. Another example feature/effect pair is lowering the audio pitch and/or tone when a frown is detected. In this example, only the video feature characteristics need to be detected. However, in some examples, this effect could be implemented based at least in part on voice recognition module 322 detecting sadness in the voice of the user; in this case, video features 316 would not be needed at all. Yet another example feature/effect pair is whispering, which causes the audio and video to be slowed down and toned down (e.g., reduced in level and variation). In some cases, video changes can lead to modifications of the audio while, in other cases, audio changes can lead to modifications of the video.
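These cue/effect pairings could be expressed as a small rule table, as in the sketch below; the cue flags and the co-occurrence requirement for the open-mouth case follow the description above, while everything else (type names, effect labels) is an illustrative assumption.

```swift
import Foundation

// Hypothetical rule table for the cue/effect pairs described above.
struct Cues {
    var mouthOpen = false
    var silent = false
    var frowning = false
    var whispering = false
    var sadVoice = false
}

enum Effect {
    case insertAvatarSound       // e.g., bark/meow for the selected avatar
    case lowerPitchAndTone
    case slowAudioAndVideo
}

func effects(for cues: Cues) -> [Effect] {
    var result: [Effect] = []
    // Both the video cue and the audio cue must co-occur for this pair.
    if cues.mouthOpen && cues.silent { result.append(.insertAvatarSound) }
    // Either a detected frown (video) or detected sadness (audio) is enough.
    if cues.frowning || cues.sadVoice { result.append(.lowerPitchAndTone) }
    // Whispering slows and tones down both the audio and the video.
    if cues.whispering { result.append(.slowAudioAndVideo) }
    return result
}
```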
  • As noted above, in some examples, avatar engine 308 may act as the feature extractor, in which case video features 316 and audio features 318 may not exist prior to being sent to avatar engine 308. Instead, raw audio 320 and metadata associated with the raw video may be passed into avatar engine 308, where avatar engine 308 may extract the audio feature characteristics and the video (e.g., facial) feature characteristics. In other words, while not drawn this way in FIG. 3, parts of avatar engine 308 may actually exist within camera 304. Additionally, in some examples, metadata associated with video features 316 can be stored in a secure container, and when voice engine 310 is running, it can read the metadata from the container.
• In some instances, because the preview video clip of the avatar is not displayed in real-time (e.g., it is rendered and displayed after the video is recorded and sometimes only in response to selection of a play affordance), the audio and video information can be processed offline (e.g., not in real-time). As such, avatar engine 308 and voice engine 310 can read ahead in the audio and video information and make context decisions up front. Then, voice engine 310 can revise the audio file accordingly. This ability to read ahead and make decisions offline will greatly increase the efficiency of the system, especially for longer recordings. Additionally, this enables a second stage of analysis, where additional logic can be processed. Thus, the entire audio file can be analyzed before making any final decisions. For example, if the user says "bark" two times in a row, but the two instances of "bark" were said too close together, the actual "woof" sound that was prerecorded might not be able to fit in the time it took the user to say "bark, bark." In this case, voice engine 310 can take the information from voice recognition module 322 and determine to ignore the second "bark," because it won't be possible to include both "woof" sounds in the audio file.
• As noted above, when the audio file and the video are packaged together to make A/V output 332, voice engine 310 does not actually need to access video of avatar 312. Instead, the video file (e.g., a .mov format file, or the like) is created as the video is being played by accessing an array of features (e.g., floating-point values) that were written to the metadata file. However, all permutations/adjustments to the audio and video files can be done in advance, and some can even be done in real-time as the audio and video are extracted. Additionally, in some examples, each modified video clip could be saved temporarily (e.g., cached), such that if the user reselects an avatar that's already been previewed, the processing to generate/render that particular preview does not need to be duplicated. As opposed to re-rendering the revised video clip each time the same avatar is selected during the preview session, the above-noted caching of rendered video clips can yield large savings in processor power and instructions per second (IPS), especially for longer recordings and/or recordings with a large number of effects.
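A minimal sketch of such a cache, assuming rendered clips are keyed by an avatar identifier and invalidated when a new recording is made, is shown below; the class and method names are hypothetical.

```swift
import Foundation

// Hypothetical cache of rendered previews per avatar, so re-selecting an
// avatar during the preview session does not repeat the rendering work.
final class PreviewCache {
    private var rendered: [String: URL] = [:]    // avatar identifier -> rendered clip

    func preview(for avatar: String, render: () -> URL) -> URL {
        if let cached = rendered[avatar] { return cached }
        let clip = render()                      // expensive: effects + rendering
        rendered[avatar] = clip
        return clip
    }

    func invalidate() { rendered.removeAll() }   // e.g., after a new recording
}
```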
• Additionally, in some examples, noise suppression algorithms can be employed for handling cases where the sound captured by microphone 306 includes sounds other than the user's voice, for example, when the user is in a windy area or a loud room (e.g., a restaurant or bar). In these examples, a noise suppression algorithm could lower the decibel output of certain parts of the audio recording. Alternatively, or in addition, different voices could be separated and/or only audio coming from certain angles of view (e.g., the angle of the user's face) could be collected, and other voices could be ignored or suppressed. In other cases, if the avatar process 300 determines that the noise levels are too loud or will be difficult to process, the process 300 could disable the recording option.
  • FIG. 4 illustrates an example flow diagram showing process 400 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least avatar process 300 of FIG. 3) may perform the process 400 of FIG. 4.
  • At block 402, computing device 106 may capture video having an audio component. In some examples, the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone may capture the audio information). However, in some instances, a single hardware component may be configured to capture both audio and video. In any event, the video and audio information may be associated with one another (e.g., by sharing an ID, timestamp, or the like). As such, the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).
• At block 404, computing device 106 may extract facial features and audio features from the captured video and audio information, respectively. In some cases, the facial feature information may be extracted via avatar engine 308 and stored as metadata. The metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, which provides significant memory savings. Regarding the audio feature extraction, a voice recognition algorithm can be utilized to extract different voice features; for example, words, phrases, pitch, speed, etc.
  • At block 406, computing device 106 may detect context from the extracted features. For example, context may include a user's intent, mood, setting, location, background items, ideas, etc. The context can be important when employing logic to determine what effects to apply. In some cases, the context can be combined with detected spoken words to determine whether and/or how to adjust the audio file and/or the video file. In one example, a user may furrow his eyebrows and speak slowly. The furrowing of the eyebrows is a video feature that could have been extracted at block 404 and the slow speech is an audio feature that could have been extracted at block 404. Individually, those two features might mean something different; however, when combined together, the avatar process can determine that the user is concerned about something. In this case, the context of the message might be that a parent is speaking to a child, or a friend is speaking to another friend about a serious or concerning matter.
• At block 408, computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context. As noted above, one context might be concern. As such, a particular video and/or audio feature may be employed for this effect. For example, the voice file may be adjusted to sound more somber, or to be slowed down. In other examples, the avatar-specific voice might be replaced with a version of the original (e.g., raw) audio to convey the seriousness of the message. Various other effects can be employed for various other contexts. In other examples, the context may be animal noises (e.g., based on the user saying "bark" or "meow" or the like). In this case, the determined effect would be to replace the spoken word "bark" with the sound of a dog barking.
  • At block 410, computing device 106 may perform additional logic for additional effects. For example, if the user attempted to effectuate the bark effect by saying bark twice in a row, the additional logic may need to be utilized to determine whether the additional bark is technically feasible. As an example, if the audio clip of the bark that is used to replace the spoken word in the raw audio information is 0.5 seconds long, but the user says “bark” twice in a 0.7-second span, the additional logic can determine that two bark sounds cannot fit in the 0.7 seconds available. Thus, the audio and video file may need to be extended in order to fit both bark sounds, the bark sound may need to be shortened (e.g., by processing the stored bark sound), or the second spoken word bark may need to be ignored.
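The timing check at block 410 can be illustrated with a small fitting routine, sketched below under the simplifying assumption that a second keyword is simply ignored when its onset falls within one sample length of the previously accepted one (extending the clip or shortening the sample, also mentioned above, are not shown); the function name and rule are illustrative.

```swift
import Foundation

// Hypothetical fitting check: given the onset times of the spoken keyword and
// the length of the replacement sound, keep only occurrences that leave room
// for the full sample.
func placements(forKeywordTimes times: [TimeInterval],
                sampleDuration: TimeInterval) -> [TimeInterval] {
    var accepted: [TimeInterval] = []
    for t in times {
        if let last = accepted.last, t - last < sampleDuration {
            continue   // the next "bark" arrived too soon: ignore it
        }
        accepted.append(t)
    }
    return accepted
}

// Example: with a 0.5 s sample, keywords at 1.0 s and 1.3 s keep only the
// first occurrence, while keywords at 1.0 s and 1.8 s keep both.
```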
  • At block 412, computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or additional effects. In some examples, only one set of effects may be used. However, in either case, the raw audio file may be adjusted (e.g., revised) to form a new audio file with additional sounds added and/or subtracted. For example, in the “bark” use case, the spoken word “bark” will be removed from the audio file and a new sound that represents an actual dog barking will be inserted. The new file can be saved with a different ID, or with an appended ID (e.g., the raw audio ID, with a .v2 identifier to indicate that it is not the original). Additionally, the raw audio file will be saved separately so that it can be reused for additional avatars and/or if the user decides not to use the determined effects.
  • At block 414, computing device 106 may receive a selection of an avatar from the user. The user may select one of a plurality of different avatars through a UI of the avatar application being executed by computing device 106. The avatars may be selected via a scroll wheel, drop down menu, or icon menu (e.g., where each avatar is visible on the screen in its own position).
• At block 416, computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar. In this example, each adjusted video clip (e.g., a final clip for the avatar that has adjusted audio and/or adjusted video) may be generated for each respective avatar prior to selection of the avatar by the user. This way, the processing has already been completed, and the adjusted video clip is ready to be presented immediately upon selection of the avatar. While this might require additional IPS prior to avatar selection, it will speed up the presentation. Additionally, the processing of each adjusted video clip can be performed while the user is reviewing the first preview (e.g., the preview that corresponds to the first/default avatar presented in the UI).
  • FIG. 5 illustrates an example flow diagram showing process 500 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least avatar process 300 of FIG. 3) may perform the process 500 of FIG. 5.
  • At block 502, computing device 106 may capture video having an audio component. Just like in block 402 of FIG. 4, the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone may capture the audio information). As noted, the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).
• At block 504, computing device 106 may extract facial features and audio features from the captured video and audio information, respectively. Just like above, the facial feature information may be extracted via avatar engine 308 and stored as metadata. The metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, which provides significant memory savings. Regarding the audio feature extraction, a voice recognition algorithm can be utilized to extract different voice features; for example, words, phrases, pitch, speed, etc. Additionally, in some examples, avatar engine 308 and/or voice engine 310 may perform the audio feature extraction.
  • At block 506, computing device 106 may detect context from the extracted features. For example, context may include a user's intent, mood, setting, location, ideas, identity, etc. The context can be important when employing logic to determine what effects to apply. In some cases, the context can be combined with spoken words to determine whether and/or how to adjust the audio file and/or the video file. In one example, a user's age may be detected as the context (e.g., child, adult, etc.) based at least in part on facial and/or voice features. For example, a child's face may have particular features that can be identified (e.g., large eyes, a small nose, and a relatively small head, etc.). As such, a child context may be detected.
  • At block 508, computing device 106 may receive a selection of an avatar from the user. The user may select one of a plurality of different avatars through a UI of the avatar application being executed by computing device 106. The avatars may be selected via a scroll wheel, drop down menu, or icon menu (e.g., where each avatar is visible on the screen in its own position).
  • At block 510, computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context and the selected avatar. In this example, the effects for each avatar may be generated upon selection of each avatar, as opposed to all at once. In some instances, this will enable realization of significant processor and memory savings, because only one set of effects and avatar rendering will be performed at a time. These savings can be realized especially when the user does not select multiple avatars to preview.
• At block 512, computing device 106 may perform additional logic for additional effects, similar to that described above with respect to block 410 of FIG. 4. At block 514, computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or additional effects for the selected avatar, similar to that described above with respect to block 412 of FIG. 4. At block 516, computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar, similar to that described above with respect to block 416 of FIG. 4.
  • In some examples, the avatar process 300 may determine whether to perform flow 400 or flow 500 based at least in part on historical information. For example, if the user generally uses the same avatar every time, flow 500 will be more efficient. However, if the user regularly switches between avatars, and previews multiple different avatars per video clip, then following flow 400 may be more efficient.
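One plausible way to make this choice, sketched below, is to track how many avatars the user previews per clip and pre-render for all avatars only when that history suggests frequent switching; the 1.5-preview threshold and function names are illustrative assumptions.

```swift
import Foundation

// Hypothetical history-based choice: pre-render effects for every avatar
// (flow 400) when the user tends to preview several avatars per clip,
// otherwise render on selection only (flow 500).
enum RenderStrategy { case preRenderAllAvatars, renderOnSelection }

func chooseStrategy(avatarsPreviewedPerClip history: [Int]) -> RenderStrategy {
    guard !history.isEmpty else { return .renderOnSelection }
    let average = Double(history.reduce(0, +)) / Double(history.count)
    // If the user usually previews more than one avatar, eager rendering pays off.
    return average > 1.5 ? .preRenderAllAvatars : .renderOnSelection
}
```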
• FIG. 6 illustrates an example UI 600 for enabling a user to utilize the avatar application (e.g., corresponding to avatar application affordance 602). In some examples, UI 600 may look different (e.g., it may appear as a standard text (e.g., short messaging service (SMS)) messaging application) until avatar application affordance 602 is selected. As noted, the avatar application can communicate with the avatar process (e.g., avatar process 300 of FIG. 3) to make requests for capturing, processing (e.g., extracting features, running logic, etc.), and adjusting audio and/or video. For example, when the user selects a record affordance (e.g., record/send video clip affordance 604), the avatar application may make an application programming interface (API) call to the avatar process to begin capturing video and audio information using the appropriate hardware components. In some examples, record/send video clip affordance 604 may be represented as a red circle (or a plain circle without the line shown in FIG. 6) prior to the recording session beginning. In this way, the affordance will look more like a standard record button. During the recording session, the appearance of record/send video clip affordance 604 may be changed to look like a clock countdown or other representation of a timer (e.g., if the length of video clip recordings is limited). However, in other examples, the record/send video clip affordance 604 may merely change colors to indicate that the avatar application is recording. If there is no timer, or limit on the length of the recording, the user may need to select record/send video clip affordance 604 again to terminate the recording.
• In some examples, a user may use avatar selection affordance 606 to select an avatar. This can be done before recording of the avatar video clip and/or after recording of the avatar video clip. When selected before recording, the initial preview of the user's motions and facial characteristics will be presented as the selected avatar. Additionally, the recording will be performed while presenting a live (e.g., real-time) preview of the recording, with the user's face being represented by the selected avatar. Once the recording is completed, a second preview (e.g., a replay of the actual recording) will be presented, again using the selected avatar. However, at this stage, the user can scroll through avatar selection affordance 606 to select a new avatar to view the recording preview. In some cases, upon selection of a new avatar, the UI will begin to preview the recording using the selected avatar. The new preview can be presented with the audio/video effects or as originally recorded. As noted, the determination regarding whether to present the effected version or the original may be based at least in part on the last method of playback used. For example, if the last playback used effects, the first playback after a new avatar selection may use effects. However, if the last playback did not use effects, the first playback after a new avatar selection may not use effects. In some examples, the user can replay the video clip with effects by selecting effects preview affordance 608 or without effects by selecting original preview affordance 610. Once satisfied with the video clip (e.g., the message), the user can send the avatar video in a message to another computing device using record/send video clip affordance 604. The video clip will be sent using the format corresponding to the last preview (e.g., with or without effects). At any time, if the user desires, delete video clip affordance 612 may be selected to delete the avatar video and either start over or exit the avatar and/or messaging applications.
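The affordance behavior described in the two paragraphs above can be summarized as a small state model, sketched below; the enum cases, labels, and the optional recording limit are hypothetical and only meant to capture the idle/recording/send-with-last-preview-format behavior.

```swift
import Foundation

// Hypothetical states for the record/send affordance: idle (plain or red
// circle), recording (countdown or color change), and ready-to-send once a
// preview exists, where the last preview format determines the send format.
enum RecordAffordanceState {
    case idle
    case recording(remaining: TimeInterval?)   // nil when no recording limit is set
    case readyToSend(withEffects: Bool)
}

func label(for state: RecordAffordanceState) -> String {
    switch state {
    case .idle:
        return "Record"
    case .recording(let remaining):
        return remaining.map { "Recording (\(Int($0)) s left)" } ?? "Recording (tap to stop)"
    case .readyToSend(let withEffects):
        return withEffects ? "Send (with effects)" : "Send (original)"
    }
}
```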
  • FIG. 7 illustrates an example flow diagram showing process (e.g., a computer-implemented method) 700 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least an avatar application similar to that shown in FIG. 6 and avatar process 300 of FIG. 3) may perform the process 700 of FIG. 7.
  • At block 702, computing device 106 may display a virtual avatar generation interface. The virtual avatar generation interface may look similar to the UI illustrated in FIG. 6. However, any UI configured to enable the same features described herein can be used.
• At block 704, computing device 106 may display first preview content of a virtual avatar. In some examples, the first preview content may be a real-time representation of the user's face, including movement and facial expressions. However, the first preview would provide an avatar (e.g., cartoon character, digital/virtual puppet) to represent the user's face instead of an image of the user's face. This first preview may be video only, or at least a rendering of the avatar without sound. In some examples, this first preview is not recorded and can be utilized for as long as the user desires, without limitation other than battery power or memory space of computing device 106.
  • At block 706, computing device 106 may detect selection of an input (e.g., record/send video clip affordance 604 of FIG. 6) in the virtual avatar generation interface. This selection may be made while the UI is displaying the first preview content.
  • At block 708, computing device 106 may begin capturing video and audio signals based at least in part on the input detected at block 706. As described, the video and audio signals may be captured by appropriate hardware components and can be captured by one or a combination of such components.
  • At block 710, computing device 106 may extract audio feature characteristics and facial feature characteristics as described in detail above. As noted, the extraction may be performed by particular modules of avatar process 300 of FIG. 3 or by other extraction and/or analysis components of the avatar application and/or computing device 106.
• At block 712, computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics and audio feature characteristics. For example, the audio file captured at block 708 may be permanently (or temporarily) revised (e.g., adjusted) to include new sounds, new words, etc., and/or to have the original pitch, tone, volume, etc., adjusted. These adjustments can be made based at least in part on the context detected via analysis of the facial feature characteristics and audio feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or based on specific motions, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
  • At block 714, computing device 106 may generate second preview content of the virtual avatar in the UI according to the adjusted audio signal. The generated second preview content may be based at least in part on the currently selected avatar or some default avatar. Once the second preview content is generated, computing device 106 can present the second preview content in the UI at block 716.
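An end-to-end sketch of process 700 under these assumptions is given below, with the capture, extraction, and adjustment stages passed in as placeholder closures; the types and signatures are illustrative and do not correspond to an actual device API.

```swift
import Foundation

// Hypothetical end-to-end shape of process 700: an input starts a capture
// session, features are extracted, the audio is adjusted from the combined
// audio and facial characteristics, and a second preview is produced.
struct CaptureResult { var videoMetadata: [String: Float]; var audio: [Float] }
struct Preview { let avatar: String; let audio: [Float] }

func runCaptureSession(capture: () -> CaptureResult,
                       extractAudioFeatures: ([Float]) -> [String: Float],
                       extractFacialFeatures: ([String: Float]) -> [String: Float],
                       adjustAudio: ([Float], [String: Float], [String: Float]) -> [Float],
                       selectedAvatar: String) -> Preview {
    let captured = capture()                                        // blocks 706-708
    let audioFeatures = extractAudioFeatures(captured.audio)        // block 710
    let facialFeatures = extractFacialFeatures(captured.videoMetadata)
    let adjusted = adjustAudio(captured.audio, audioFeatures, facialFeatures)  // block 712
    return Preview(avatar: selectedAvatar, audio: adjusted)         // blocks 714-716
}
```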
  • FIG. 8 illustrates an example flow diagram showing process (e.g., instructions stored on a computer-readable memory that can be executed) 800 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or other similar user device (e.g., utilizing at least an avatar application similar to that shown in FIG. 6 and avatar process 300 of FIG. 3) may perform the process 800 of FIG. 8.
• At block 802, computing device 106 may detect a request to generate an avatar video clip of a virtual avatar. In some examples, the request may be based at least in part on a user's selection of record/send video clip affordance 604 of FIG. 6.
  • At block 804, computing device 106 may capture a video signal associated with a face in the field of view of the camera. At block 806, computing device 106 may capture an audio signal corresponding to the video signal (e.g., coming from the face being captured by the camera).
• At block 808, computing device 106 may extract voice feature characteristics from the audio signal and at block 810, computing device 106 may extract facial feature characteristics from the video signal.
  • At block 812, computing device 106 may detect a request to preview the avatar video clip. This request may be based at least in part on a user's selection of a new avatar via avatar selection affordance 606 of FIG. 6 or based at least in part on a user's selection of effects preview affordance 608 of FIG. 6.
• At block 814, computing device 106 may generate an adjusted audio signal based at least in part on facial feature characteristics and voice feature characteristics. For example, the audio file captured at block 806 may be revised (e.g., adjusted) to include new sounds, new words, etc., and/or to have the original pitch, tone, volume, etc., adjusted. These adjustments can be made based at least in part on the context detected via analysis of the facial feature characteristics and voice feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or based on specific motions, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
• At block 816, computing device 106 may generate a preview of the virtual avatar in the UI according to the adjusted audio signal. The generated preview may be based at least in part on the currently selected avatar or some default avatar. Once the preview is generated, computing device 106 can also present the preview in the UI at block 816.
  • FIG. 9 is a simplified block diagram illustrating example architecture 900 for implementing the features described herein, according to at least one embodiment. In some examples, computing device 902 (e.g., computing device 106 of FIG. 1), having example architecture 900, may be configured to present relevant UIs, capture audio and video information, extract relevant data, perform logic, revise the audio and video information, and present animoji videos.
• Computing device 902 may be configured to execute or otherwise manage applications or instructions for performing the described techniques such as, but not limited to, providing a user interface (e.g., user interface 600 of FIG. 6) for recording, previewing, and/or sending virtual avatar video clips. Computing device 902 may receive inputs (e.g., utilizing I/O device(s) 904 such as a touch screen) from a user at the user interface, capture information, process the information, and then present the video clips as previews also utilizing I/O device(s) 904 (e.g., a speaker of computing device 902). Computing device 902 may be configured to revise audio and/or video files based at least in part on facial features extracted from the captured video and/or voice features extracted from the captured audio.
  • Computing device 902 may be any type of computing device such as, but not limited to, a mobile phone (e.g., a smartphone), a tablet computer, a personal digital assistant (PDA), a laptop computer, a desktop computer, a thin-client device, a smart watch, a wireless headset, or the like.
  • In one illustrative configuration, computing device 902 may include at least one memory 914 and one or more processing units (or processor(s)) 916. Processor(s) 916 may be implemented as appropriate in hardware, computer-executable instructions, or combinations thereof. Computer-executable instruction or firmware implementations of processor(s) 916 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.
  • Memory 914 may store program instructions that are loadable and executable on processor(s) 916, as well as data generated during the execution of these programs. Depending on the configuration and type of computing device 902, memory 914 may be volatile (such as random access memory (RAM)) and/or non-volatile (such as read-only memory (ROM), flash memory, etc.). Computing device 902 may also include additional removable storage and/or non-removable storage 926 including, but not limited to, magnetic storage, optical disks, and/or tape storage. The disk drives and their associated non-transitory computer-readable media may provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computing devices. In some implementations, memory 914 may include multiple different types of memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), or ROM. While the volatile memory described herein may be referred to as RAM, any volatile memory that would not maintain data stored therein once unplugged from a host and/or power would be appropriate.
  • Memory 914 and additional storage 926, both removable and non-removable, are all examples of non-transitory computer-readable storage media. For example, non-transitory computer readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Memory 914 and additional storage 926 are both examples of non-transitory computer storage media. Additional types of computer storage media that may be present in computing device 902 may include, but are not limited to, phase-change RAM (PRAM), SRAM, DRAM, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital video disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 902. Combinations of any of the above should also be included within the scope of non-transitory computer-readable storage media.
  • Alternatively, computer-readable communication media may include computer-readable instructions, program modules, or other data transmitted within a data signal, such as a carrier wave, or other transmission. However, as used herein, computer-readable storage media does not include computer-readable communication media.
  • Computing device 902 may also contain communications connection(s) 928 that allow computing device 902 to communicate with a data store, another computing device or server, user terminals and/or other devices via one or more networks. Such networks may include any one or a combination of many different types of networks, such as cable networks, the Internet, wireless networks, cellular networks, satellite networks, other private and/or public networks, or any combination thereof. Computing device 902 may also include I/O device(s) 904, such as a touch input device, a keyboard, a mouse, a pen, a voice input device, a display, a speaker, a printer, etc.
  • Turning to the contents of memory 914 in more detail, memory 914 may include operating system 932 and/or one or more application programs or services for implementing the features disclosed herein including user interface module 934, avatar control module 936, avatar application module 938, and messaging module 940. Memory 914 may also be configured to store one or more audio and video files to be used to produce audio and video output. In this way, computing device 902 can perform all of the operations described herein.
• In some examples, user interface module 934 may be configured to manage the user interface of computing device 902. For example, user interface module 934 may present any number of various UIs requested by computing device 902. In particular, user interface module 934 may be configured to present UI 600 of FIG. 6, which enables implementation of the features described herein, including communication with avatar process 300 of FIG. 3, which is responsible for capturing video and audio information, extracting appropriate facial feature and voice feature information, and revising the video and audio information prior to presentation of the generated avatar video clips as described above.
  • In some examples, avatar control module 936 is configured to implement (e.g., execute instructions for implementing) avatar process 300 while avatar application module 938 is configured to implement the user facing application. As noted above, avatar application module 938 may utilize one or more APIs for requesting and/or providing information to avatar control module 936.
• In some embodiments, messaging module 940 may implement any standalone or add-on messaging application that can communicate with avatar control module 936 and/or avatar application module 938. In some examples, messaging module 940 may be fully integrated with avatar application module 938 (e.g., as seen in UI 600 of FIG. 6), where the avatar application appears to be part of the messaging application. However, in other examples, messaging module 940 may call avatar application module 938 when a user requests to generate an avatar video clip, and avatar application module 938 may open up a new application altogether that is integrated with messaging module 940.
  • Computing device 902 may also be equipped with a camera and microphone, as shown in at least FIG. 3, and processors 916 may be configured to execute instructions to display a first preview of a virtual avatar. In some examples, while displaying the first preview of a virtual avatar, an input may be detected via a virtual avatar generation interface presented by user interface module 934. In some instances, in response to detecting the input in the virtual avatar generation interface, avatar control module 936 may initiate a capture session including: capturing, via the camera, a video signal associated with a face in a field of view of the camera, capturing, via the microphone, an audio signal associated with the captured video signal, extracting audio feature characteristics from the captured audio signal, and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the capture session, avatar control module 936 may generate an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics, and display a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.
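  • The capture-session flow just described can be summarized in a compact sketch. The following Python fragment is a minimal illustration under assumed inputs (a mono sample buffer and a list of per-frame smile scores); the specific mapping from facial expression to pitch and level is an invented example of adjusting audio based on facial feature characteristics, not the mapping used by the described embodiments, and a shipping implementation would use the device's real capture and rendering stacks.

```python
import numpy as np

def adjust_audio(samples: np.ndarray, smile_score: float, max_gain_db: float = 3.0) -> np.ndarray:
    """Toy adjusted audio signal: raise pitch and level as the smile score grows."""
    pitch_ratio = 1.0 + 0.3 * smile_score                  # up to +30% pitch for a full smile
    # Resample by fractional indexing; the shorter buffer plays back faster and higher.
    idx = np.arange(0, len(samples), pitch_ratio)
    shifted = np.interp(idx, np.arange(len(samples)), samples)
    gain = 10.0 ** ((max_gain_db * smile_score) / 20.0)
    return np.clip(shifted * gain, -1.0, 1.0)

def run_capture_session(audio: np.ndarray, smile_scores: list) -> dict:
    """On expiration: aggregate facial features, adjust the audio, build the second preview."""
    mean_smile = float(np.mean(smile_scores)) if smile_scores else 0.0
    adjusted = adjust_audio(audio, mean_smile)
    return {"facial_features": {"smile": mean_smile}, "adjusted_audio": adjusted}

if __name__ == "__main__":
    rate = 16_000
    t = np.arange(rate) / rate
    voice = 0.2 * np.sin(2 * np.pi * 180 * t)              # stand-in for the captured voice
    preview = run_capture_session(voice, smile_scores=[0.2, 0.6, 0.9])
    print(preview["facial_features"], preview["adjusted_audio"].shape)
```

  Played back at the original sample rate, the shortened buffer yields a crude pitch and playback-speed change of the sort enumerated in the claims below (level, pitch, duration, variable playback speed); a more faithful effect would use a proper time-scale or formant-preserving pitch modification.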
  • Illustrative methods, computer-readable media, and systems for providing various techniques for adjusting audio and/or video content based at least in part on voice and/or facial feature characteristics are described above. Some or all of these systems, media, and methods may, but need not, be implemented at least partially by architectures and flows such as those shown at least in FIGS. 1-9 above. While many of the embodiments are described above with reference to messaging applications, it should be understood that any of the above techniques can be used within any type of application, including real-time video playback or real-time video messaging applications. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the examples. However, it should also be apparent to one skilled in the art that the examples may be practiced without the specific details. Furthermore, well-known features were sometimes omitted or simplified in order not to obscure the example being described.
  • The various embodiments further can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices or processing devices which can be used to operate any of a number of applications. User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system also can include a number of workstations running any of a variety of commercially-available operating systems and other known applications for purposes such as development and database management. These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems and other devices capable of communicating via a network.
  • Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially-available protocols, such as TCP/IP, OSI, FTP, UPnP, NFS, CIFS, and AppleTalk. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any combination thereof.
  • In embodiments utilizing a network server, the network server can run any of a variety of server or mid-tier applications, including HTTP servers, FTP servers, CGI servers, data servers, Java servers, and business application servers. The server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C# or C++, or any scripting language, such as Perl, Python or TCL, as well as combinations thereof. The server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, and IBM®.
  • The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen or keypad), and at least one output device (e.g., a display device, printer or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as RAM or ROM, as well as removable media devices, memory cards, flash cards, etc.
  • Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a non-transitory computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices also typically will include a number of software applications, modules, services or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or browser. It should be appreciated that alternate embodiments may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets) or both. Further, connection to other computing devices such as network input/output devices may be employed.
  • Non-transitory storage media and computer-readable storage media for containing code, or portions of code, can include any appropriate media known or used in the art (except for transitory media like carrier waves or the like) such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices or any other medium which can be used to store the desired information and which can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments. However, as noted above, computer-readable storage media does not include transitory media such as carrier waves or the like.
  • The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.
  • Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the disclosure to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions and equivalents falling within the spirit and scope of the disclosure, as defined in the appended claims.
  • The use of the terms “a,” “an,” and “the,” and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims), is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected” is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. The phrase “based on” should be understood to be open-ended, and not limiting in any way, and is intended to be interpreted or otherwise be read as “based at least in part on,” where appropriate. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
  • Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase “at least one of X, Y, and Z,” unless specifically stated otherwise, should also be understood to mean X, Y, Z, or any combination thereof, including “X, Y, and/or Z.”
  • Preferred embodiments of this disclosure are described herein, including the best mode known to the inventors for carrying out the disclosure. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the disclosure to be practiced otherwise than as specifically described herein. Accordingly, this disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
  • All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

Claims (20)

What is claimed is:
1. A method, comprising:
at an electronic device having at least a camera and a microphone:
displaying a virtual avatar generation interface;
displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to realtime preview video frames of a user headshot in a field of view of the camera and associated headshot changes in an appearance;
while displaying the first preview content of the virtual avatar, detecting an input in the virtual avatar generation interface;
in response to detecting the input in the virtual avatar generation interface:
capturing, via the camera, a video signal associated with the user headshot during a recording session;
capturing, via the microphone, a user audio signal during the recording session;
extracting audio feature characteristics from the captured user audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting expiration of the recording session:
generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics;
generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal; and
presenting the second preview content in the virtual avatar generation interface.
2. The method of claim 1, further comprising storing facial feature metadata associated with the facial feature characteristics extracted from the video signal and storing audio metadata associated with the audio feature characteristics extracted from the audio signal.
3. The method of claim 2, further comprising generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics.
4. The method of claim 3, wherein the second preview of the virtual avatar is displayed further according to the adjusted facial feature metadata.
5. An electronic device, comprising:
a camera;
a microphone; and
one or more processors in communication with the camera and the microphone, the one or more processors configured to:
while displaying a first preview of a virtual avatar, detecting an input in a virtual avatar generation interface;
in response to detecting the input in the virtual avatar generation interface, initiating a capture session including:
capturing, via the camera, a video signal associated with a face in a field of view of the camera;
capturing, via the microphone, an audio signal associated with the captured video signal;
extracting audio feature characteristics from the captured audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting expiration of the capture session:
generating an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics; and
displaying a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.
6. The electronic device of claim 5, wherein the audio signal is further adjusted based at least in part on a type of the virtual avatar.
7. The electronic device of claim 6, wherein the type of the virtual avatar is received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface.
8. The electronic device of claim 6, wherein the type of the virtual avatar includes an animal type, and wherein the adjusted audio signal is generated based at least in part on a predetermined sound associated with the animal type.
9. The electronic device of claim 5, wherein the one or more processors are further configured to determine whether a portion of the audio signal corresponds to the face in the field of view.
10. The electronic device of claim 9, wherein the one or more processors are further configured to, in accordance with a determination that the portion of the audio signal corresponds to the face, store the portion of the audio signal for use in generating the adjusted audio signal.
11. The electronic device of claim 9, wherein the one or more processors are further configured to, in accordance with a determination that the portion of the audio signal does not correspond to the face, discard at least the portion of the audio signal.
12. The electronic device of claim 5, wherein the audio feature characteristics comprise features of a voice associated with the face in the field of view.
13. The electronic device of claim 5, wherein the one or more processors are further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal.
14. The electronic device of claim 13, wherein the one or more processors are further configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics.
15. The electronic device of claim 14, wherein the second preview of the virtual avatar is generated according to the adjusted facial metadata and the adjusted audio signal.
16. A computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, configure the one or more processors to perform operations comprising:
in response to detecting a request to generate an avatar video clip of a virtual avatar:
capturing, via a camera of an electronic device, a video signal associated with a face in a field of view of the camera;
capturing, via a microphone of the electronic device, an audio signal;
extracting voice feature characteristics from the captured audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting a request to preview the avatar video clip:
generating an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics; and
displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.
17. The computer-readable storage medium of claim 16, wherein the audio signal is adjusted based at least in part on a facial expression identified in the facial feature characteristics associated with the face.
18. The computer-readable storage medium of claim 16, wherein the adjusted audio signal is further adjusted by inserting one or more pre-stored audio samples.
19. The computer-readable storage medium of claim 16, wherein the audio signal is adjusted based at least in part on a level, pitch, duration, variable playback speed, speech spectral-formant positions, speech spectral-formant levels, instantaneous playback speed, or change in a voice associated with the face.
20. The computer-readable storage medium of claim 16, wherein the one or more processors are further configured to perform the operations comprising transmitting the video clip of the virtual avatar to another electronic device.
US15/908,603 2017-05-16 2018-02-28 Voice effects based on facial expressions Abandoned US20180336716A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US15/908,603 US20180336716A1 (en) 2017-05-16 2018-02-28 Voice effects based on facial expressions
US16/033,111 US10861210B2 (en) 2017-05-16 2018-07-11 Techniques for providing audio and video effects
KR1020207022657A KR102367143B1 (en) 2018-02-28 2019-02-26 Voice effects based on facial expressions
CN201980016107.6A CN111787986A (en) 2018-02-28 2019-02-26 Voice effects based on facial expressions
DE112019001058.1T DE112019001058T5 (en) 2018-02-28 2019-02-26 VOICE EFFECTS BASED ON FACIAL EXPRESSIONS
PCT/US2019/019554 WO2019168834A1 (en) 2018-02-28 2019-02-26 Voice effects based on facial expressions

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762507177P 2017-05-16 2017-05-16
US201762556412P 2017-09-09 2017-09-09
US201762557121P 2017-09-11 2017-09-11
US15/908,603 US20180336716A1 (en) 2017-05-16 2018-02-28 Voice effects based on facial expressions

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/033,111 Continuation-In-Part US10861210B2 (en) 2017-05-16 2018-07-11 Techniques for providing audio and video effects

Publications (1)

Publication Number Publication Date
US20180336716A1 (en) 2018-11-22

Family

ID=64269597

Family Applications (4)

Application Number Title Priority Date Filing Date
US15/870,195 Active US10521091B2 (en) 2017-05-16 2018-01-12 Emoji recording and sending
US15/908,603 Abandoned US20180336716A1 (en) 2017-05-16 2018-02-28 Voice effects based on facial expressions
US15/940,017 Active US10845968B2 (en) 2017-05-16 2018-03-29 Emoji recording and sending
US15/940,232 Active US10379719B2 (en) 2017-05-16 2018-03-29 Emoji recording and sending

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/870,195 Active US10521091B2 (en) 2017-05-16 2018-01-12 Emoji recording and sending

Family Applications After (2)

Application Number Title Priority Date Filing Date
US15/940,017 Active US10845968B2 (en) 2017-05-16 2018-03-29 Emoji recording and sending
US15/940,232 Active US10379719B2 (en) 2017-05-16 2018-03-29 Emoji recording and sending

Country Status (3)

Country Link
US (4) US10521091B2 (en)
CN (2) CN111563943B (en)
DK (2) DK179867B1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040179037A1 (en) * 2003-03-03 2004-09-16 Blattner Patrick D. Using avatars to communicate context out-of-band
US20040250210A1 (en) * 2001-11-27 2004-12-09 Ding Huang Method for customizing avatars and heightening online safety
US20070115349A1 (en) * 2005-11-03 2007-05-24 Currivan Bruce J Method and system of tracking and stabilizing an image transmitted using video telephony
US20070260984A1 (en) * 2006-05-07 2007-11-08 Sony Computer Entertainment Inc. Methods for interactive communications with real time effects and avatar environment interaction
US20140092130A1 (en) * 2012-09-28 2014-04-03 Glen J. Anderson Selectively augmenting communications transmitted by a communication device
US20140282000A1 (en) * 2013-03-15 2014-09-18 Tawfiq AlMaghlouth Animated character conversation generator
US20150156598A1 (en) * 2013-12-03 2015-06-04 Cisco Technology, Inc. Microphone mute/unmute notification
US20150379752A1 (en) * 2013-03-20 2015-12-31 Intel Corporation Avatar-based transfer protocols, icon generation and doll animation
US20160006987A1 (en) * 2012-09-06 2016-01-07 Wenlong Li System and method for avatar creation and synchronization
US20160110922A1 (en) * 2014-10-16 2016-04-21 Tal Michael HARING Method and system for enhancing communication by using augmented reality

Family Cites Families (208)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2918499B2 (en) 1996-09-17 1999-07-12 株式会社エイ・ティ・アール人間情報通信研究所 Face image information conversion method and face image information conversion device
US6173402B1 (en) 1998-03-04 2001-01-09 International Business Machines Corporation Technique for localizing keyphrase-based data encryption and decryption
DE69910757T2 (en) 1998-04-13 2004-06-17 Eyematic Interfaces, Inc., Santa Monica WAVELET-BASED FACIAL MOTION DETECTION FOR AVATAR ANIMATION
JP2001092783A (en) 1999-09-27 2001-04-06 Hitachi Software Eng Co Ltd Method and system for personal authentication, and recording medium
KR20010056965A (en) 1999-12-17 2001-07-04 박희완 Method for creating human characters by partial image synthesis
US20010047365A1 (en) 2000-04-19 2001-11-29 Hiawatha Island Software Co, Inc. System and method of packaging and unpackaging files into a markup language record for network search and archive services
US9064344B2 (en) 2009-03-01 2015-06-23 Facecake Technologies, Inc. Image transformation systems and methods
JP2003150550A (en) 2001-11-14 2003-05-23 Toshiba Corp Information processing system
US7227976B1 (en) 2002-07-08 2007-06-05 Videomining Corporation Method and system for real-time facial image enhancement
GB0220748D0 (en) 2002-09-06 2002-10-16 Saw You Com Ltd Improved communication using avatars
US7180524B1 (en) 2002-09-30 2007-02-20 Dale Axelrod Artists' color display system
US7908554B1 (en) 2003-03-03 2011-03-15 Aol Inc. Modifying avatar behavior based on user action or mood
JP2005115480A (en) 2003-10-03 2005-04-28 Toshiba Social Automation Systems Co Ltd Authentication system and computer readable storage medium
US7969447B2 (en) 2004-05-06 2011-06-28 Pixar Dynamic wrinkle mapping
JP4449723B2 (en) 2004-12-08 2010-04-14 ソニー株式会社 Image processing apparatus, image processing method, and program
KR100511210B1 (en) 2004-12-27 2005-08-30 주식회사지앤지커머스 Method for converting 2d image into pseudo 3d image and user-adapted total coordination method in use artificial intelligence, and service besiness method thereof
US8488023B2 (en) 2009-05-20 2013-07-16 DigitalOptics Corporation Europe Limited Identifying facial expressions in acquired digital images
US20060294465A1 (en) 2005-06-22 2006-12-28 Comverse, Inc. Method and system for creating and distributing mobile avatars
US8963926B2 (en) 2006-07-11 2015-02-24 Pandoodle Corporation User customized animated video and method for making the same
JP2007052770A (en) 2005-07-21 2007-03-01 Omron Corp Monitoring apparatus
JP2007036928A (en) 2005-07-29 2007-02-08 Sharp Corp Mobile information terminal device
US8370360B2 (en) 2005-12-31 2013-02-05 G & G Commerce Ltd. Merchandise recommending system and method thereof
US8620038B2 (en) 2006-05-05 2013-12-31 Parham Aarabi Method, system and computer program product for automatic and semi-automatic modification of digital images of faces
US20080052242A1 (en) 2006-08-23 2008-02-28 Gofigure! Llc Systems and methods for exchanging graphics between communication devices
WO2008064483A1 (en) 2006-11-30 2008-06-05 James Andrew Wanless A method and system for providing automated real-time contact information
US8705720B2 (en) 2007-02-08 2014-04-22 Avaya Inc. System, method and apparatus for clientless two factor authentication in VoIP networks
US20080242423A1 (en) 2007-03-27 2008-10-02 Shelford Securities, S.A. Real-money online multi-player trivia system, methods of operation, and storage medium
JP5219184B2 (en) 2007-04-24 2013-06-26 任天堂株式会社 Training program, training apparatus, training system, and training method
US20080300572A1 (en) 2007-06-01 2008-12-04 Medtronic Minimed, Inc. Wireless monitor for a personal medical device system
GB2450757A (en) 2007-07-06 2009-01-07 Sony Comp Entertainment Europe Avatar customisation, transmission and reception
US9596308B2 (en) 2007-07-25 2017-03-14 Yahoo! Inc. Display of person based information including person notes
US8726194B2 (en) 2007-07-27 2014-05-13 Qualcomm Incorporated Item selection using enhanced control
US8146005B2 (en) 2007-08-07 2012-03-27 International Business Machines Corporation Creating a customized avatar that reflects a user's distinguishable attributes
US20090055484A1 (en) 2007-08-20 2009-02-26 Thanh Vuong System and method for representation of electronic mail users using avatars
US20090135177A1 (en) 2007-11-20 2009-05-28 Big Stage Entertainment, Inc. Systems and methods for voice personalization of video content
CN101472158A (en) * 2007-12-27 2009-07-01 上海银晨智能识别科技有限公司 Network photographic device based on human face detection and image forming method
US8600120B2 (en) 2008-01-03 2013-12-03 Apple Inc. Personal computing device control using face detection and recognition
NZ587526A (en) 2008-01-31 2013-06-28 Univ Southern California Facial performance synthesis using deformation driven polynomial displacement maps
EP2263190A2 (en) 2008-02-13 2010-12-22 Ubisoft Entertainment S.A. Live-action image capture
US20160032887A1 (en) * 2008-02-25 2016-02-04 Roland Wayne Patton Method and apparatus for converting energy in a moving fluid mass to rotational energy drving a transmission
US8169438B1 (en) 2008-03-31 2012-05-01 Pixar Temporally coherent hair deformation
JP5383668B2 (en) 2008-04-30 2014-01-08 株式会社アクロディア Character display data generating apparatus and method
US20120081282A1 (en) 2008-05-17 2012-04-05 Chin David H Access of an application of an electronic device based on a facial gesture
US8401284B2 (en) 2008-05-28 2013-03-19 Apple Inc. Color correcting method and apparatus
CN102046249B (en) 2008-06-02 2015-12-16 耐克创新有限合伙公司 Create the system and method for incarnation
JP2010028404A (en) 2008-07-18 2010-02-04 Hitachi Ltd Recording and reproducing device
US10983665B2 (en) 2008-08-01 2021-04-20 Samsung Electronics Co., Ltd. Electronic apparatus and method for implementing user interface
US20100153847A1 (en) * 2008-12-17 2010-06-17 Sony Computer Entertainment America Inc. User deformation of movie character images
US20100169376A1 (en) 2008-12-29 2010-07-01 Yahoo! Inc. Visual search engine for personal dating
US8289130B2 (en) 2009-02-19 2012-10-16 Apple Inc. Systems and methods for identifying unauthorized users of an electronic device
CN101930284B (en) 2009-06-23 2014-04-09 腾讯科技(深圳)有限公司 Method, device and system for implementing interaction between video and virtual network scene
KR101651128B1 (en) * 2009-10-05 2016-08-25 엘지전자 주식회사 Mobile terminal and method for controlling application execution thereof
TWI439960B (en) 2010-04-07 2014-06-01 Apple Inc Avatar editing environment
US9542038B2 (en) 2010-04-07 2017-01-10 Apple Inc. Personalizing colors of user interfaces
US8694899B2 (en) 2010-06-01 2014-04-08 Apple Inc. Avatars reflecting user states
US20170098122A1 (en) 2010-06-07 2017-04-06 Affectiva, Inc. Analysis of image content with associated manipulation of expression presentation
US20110304629A1 (en) 2010-06-09 2011-12-15 Microsoft Corporation Real-time animation of facial expressions
KR20120013727A (en) 2010-08-06 2012-02-15 삼성전자주식회사 Display apparatus and control method thereof
US20120069028A1 (en) 2010-09-20 2012-03-22 Yahoo! Inc. Real-time animations of emoticons using facial recognition during a video chat
US8830226B2 (en) 2010-09-28 2014-09-09 Apple Inc. Systems, methods, and computer-readable media for integrating a three-dimensional asset with a three-dimensional model
US9519396B2 (en) 2010-09-28 2016-12-13 Apple Inc. Systems, methods, and computer-readable media for placing an asset on a three-dimensional model
US8558844B2 (en) 2010-09-28 2013-10-15 Apple Inc. Systems, methods, and computer-readable media for changing colors of displayed assets
EP3920465B1 (en) 2010-10-08 2023-12-06 Brian Lee Moffat Private data sharing system
CN102479388A (en) 2010-11-22 2012-05-30 北京盛开互动科技有限公司 Expression interaction method based on face tracking and analysis
KR20120059994A (en) * 2010-12-01 2012-06-11 삼성전자주식회사 Apparatus and method for control avatar using expression control point
US9563703B2 (en) 2011-03-10 2017-02-07 Cox Communications, Inc. System, method and device for sharing of playlists of authorized content with other users
US20120289290A1 (en) * 2011-05-12 2012-11-15 KT Corporation, KT TECH INC. Transferring objects between application windows displayed on mobile terminal
US9013489B2 (en) 2011-06-06 2015-04-21 Microsoft Technology Licensing, Llc Generation of avatar reflecting player appearance
US9082235B2 (en) 2011-07-12 2015-07-14 Microsoft Technology Licensing, Llc Using facial data for device authentication or subject identification
CN102999934A (en) * 2011-09-19 2013-03-27 上海威塔数字科技有限公司 Three-dimensional animation system of computer and animation method
US10262327B1 (en) 2011-09-22 2019-04-16 Glance Networks, Inc. Integrating screen sharing sessions with customer relationship management
US8867849B1 (en) 2011-10-05 2014-10-21 Google Inc. Suggesting profile images for a social network
CN103116902A (en) 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
US20130147933A1 (en) 2011-12-09 2013-06-13 Charles J. Kulas User image insertion into a text message
CN103164117A (en) 2011-12-09 2013-06-19 富泰华工业(深圳)有限公司 External operating device, electronic device and delayed screen locking method thereof
US9292195B2 (en) 2011-12-29 2016-03-22 Apple Inc. Device, method, and graphical user interface for configuring and implementing restricted interactions for applications
WO2013097139A1 (en) 2011-12-29 2013-07-04 Intel Corporation Communication using avatar
EP2811628B1 (en) * 2012-02-01 2015-11-18 Nissan Motor Co., Ltd. Method for manufacturing magnet pieces for forming field-pole magnets
US9747495B2 (en) * 2012-03-06 2017-08-29 Adobe Systems Incorporated Systems and methods for creating and distributing modifiable animated video messages
US9251360B2 (en) 2012-04-27 2016-02-02 Intralinks, Inc. Computerized method and system for managing secure mobile device content viewing in a networked secure collaborative exchange environment
WO2013152453A1 (en) 2012-04-09 2013-10-17 Intel Corporation Communication using interactive avatars
CN111275795A (en) 2012-04-09 2020-06-12 英特尔公司 System and method for avatar generation, rendering and animation
WO2013152454A1 (en) 2012-04-09 2013-10-17 Intel Corporation System and method for avatar management and selection
US8254647B1 (en) 2012-04-16 2012-08-28 Google Inc. Facial image quality assessment
US9104908B1 (en) 2012-05-22 2015-08-11 Image Metrics Limited Building systems for adaptive tracking of facial features across individuals and groups
US20130342672A1 (en) 2012-06-25 2013-12-26 Amazon Technologies, Inc. Using gaze determination with device input
US20140013422A1 (en) 2012-07-03 2014-01-09 Scott Janus Continuous Multi-factor Authentication
EP2682739A1 (en) * 2012-07-05 2014-01-08 Atlas Material Testing Technology GmbH Weathering test for various UV wavelengths of UV light emitting diodes
CN102799383B (en) 2012-07-18 2014-05-14 腾讯科技(深圳)有限公司 Screen sectional drawing method and screen sectional drawing device for mobile terminals
US20140078144A1 (en) 2012-09-14 2014-03-20 Squee, Inc. Systems and methods for avatar creation
US9826286B2 (en) * 2012-09-18 2017-11-21 Viacom International Inc. Video editing method and tool
US9314692B2 (en) * 2012-09-21 2016-04-19 Luxand, Inc. Method of creating avatar from user submitted image
KR102013443B1 (en) * 2012-09-25 2019-08-22 삼성전자주식회사 Method for transmitting for image and an electronic device thereof
KR102001913B1 (en) 2012-09-27 2019-07-19 엘지전자 주식회사 Mobile Terminal and Operating Method for the Same
JP5964190B2 (en) 2012-09-27 2016-08-03 京セラ株式会社 Terminal device
US9696898B2 (en) 2012-11-14 2017-07-04 Facebook, Inc. Scrolling through a series of content items
US20140157153A1 (en) 2012-12-05 2014-06-05 Jenny Yuen Select User Avatar on Detected Emotion
US9466142B2 (en) 2012-12-17 2016-10-11 Intel Corporation Facial movement based avatar animation
KR102049855B1 (en) * 2013-01-31 2019-11-28 엘지전자 주식회사 Mobile terminal and controlling method thereof
US9148489B2 (en) 2013-03-11 2015-09-29 Qualcomm Incorporated Exchanging a contact profile between client devices during a communication session
US9298361B2 (en) 2013-03-15 2016-03-29 Apple Inc. Analyzing applications for different access modes
US9747716B1 (en) * 2013-03-15 2017-08-29 Lucasfilm Entertainment Company Ltd. Facial animation models
US20140279062A1 (en) 2013-03-15 2014-09-18 Rodan & Fields, Llc Consultant tool for direct selling
KR102138512B1 (en) 2013-03-26 2020-07-28 엘지전자 주식회사 Display Device And Controlling Method Thereof
US9460541B2 (en) * 2013-03-29 2016-10-04 Intel Corporation Avatar animation, social networking and touch screen applications
WO2014169024A2 (en) * 2013-04-09 2014-10-16 Carepics, Llc Protecting patient information in virtual medical consulations
KR102080183B1 (en) 2013-04-18 2020-04-14 삼성전자주식회사 Electronic device and method for unlocking in the electronic device
IL226047A (en) * 2013-04-29 2017-12-31 Hershkovitz Reshef May Method and system for providing personal emoticons
GB2515266B (en) 2013-05-09 2018-02-28 Disney Entpr Inc Manufacturing Process for 3D Printed Objects
CN105190699B (en) 2013-06-05 2019-09-03 英特尔公司 Karaoke incarnation animation based on facial motion data
US9378576B2 (en) 2013-06-07 2016-06-28 Faceshift Ag Online modeling for real-time facial animation
US9626493B2 (en) 2013-06-08 2017-04-18 Microsoft Technology Licensing, Llc Continuous digital content protection
CN103346957B (en) 2013-07-02 2016-12-28 北京播思无线技术有限公司 A kind of system and method according to contact person's message alteration contact head image expression
CN103413072A (en) 2013-07-27 2013-11-27 金硕澳门离岸商业服务有限公司 Application program protection method and device
US9317954B2 (en) * 2013-09-23 2016-04-19 Lucasfilm Entertainment Company Ltd. Real-time performance capture with on-the-fly correctives
US9508197B2 (en) 2013-11-01 2016-11-29 Microsoft Technology Licensing, Llc Generating an avatar from real time image data
US20160313896A1 (en) 2013-11-05 2016-10-27 Telefonaktiebolaget L M Ericsson (Publ) Methods of processing electronic files including combined close and delete, and related systems and computer program products
US10877629B2 (en) 2016-10-13 2020-12-29 Tung Inc. Conversion and display of a user input
US10528219B2 (en) * 2015-08-10 2020-01-07 Tung Inc. Conversion and display of a user input
US9489760B2 (en) * 2013-11-14 2016-11-08 Intel Corporation Mechanism for facilitating dynamic simulation of avatars corresponding to changing user performances as detected at computing devices
US20150172238A1 (en) * 2013-12-18 2015-06-18 Lutebox Ltd. Sharing content on devices with reduced user actions
US9477878B2 (en) 2014-01-28 2016-10-25 Disney Enterprises, Inc. Rigid stabilization of facial expressions
EP3100672A1 (en) 2014-01-30 2016-12-07 Konica Minolta, Inc. Organ image capturing device
KR102201738B1 (en) 2014-02-05 2021-01-12 LG Electronics Inc. Display device and method for controlling the same
WO2015138320A1 (en) 2014-03-09 2015-09-17 Vishal Gupta Management of group-sourced contacts directories, systems and methods
CN106104633A (en) 2014-03-19 2016-11-09 Intel Corporation Facial expression and/or interaction driven avatar apparatus and method
CN104935497B (en) 2014-03-20 2020-08-14 Tencent Technology (Shenzhen) Co., Ltd. Communication session method and device
CN106165392B (en) * 2014-03-31 2019-01-22 Fujifilm Corporation Image processing apparatus, camera and image processing method
US10845982B2 (en) * 2014-04-28 2020-11-24 Facebook, Inc. Providing intelligent transcriptions of sound messages in a messaging application
US20170080346A1 (en) * 2014-05-01 2017-03-23 Mohamad Abbas Methods and systems relating to personalized evolving avatars
CN105099861A (en) 2014-05-19 2015-11-25 Alibaba Group Holding Ltd. User emotion-based display control method and display control device
US20160019195A1 (en) * 2014-05-20 2016-01-21 Jesse Kelly SULTANIK Method and system for posting comments on hosted web pages
US20150350141A1 (en) * 2014-05-31 2015-12-03 Apple Inc. Message user interfaces for capture and transmittal of media and location content
US9766702B2 (en) 2014-06-19 2017-09-19 Apple Inc. User detection by a computing device
CN104112091A (en) 2014-06-26 2014-10-22 Xiaomi Inc. File locking method and device
MX2016016624A (en) 2014-06-27 2017-04-27 Microsoft Technology Licensing Llc Data protection based on user and gesture recognition.
US20160191958A1 (en) * 2014-12-26 2016-06-30 Krush Technologies, Llc Systems and methods of providing contextual features for digital communication
KR102103939B1 (en) 2014-07-25 2020-04-24 Intel Corporation Avatar facial expression animations with head rotation
US20160134840A1 (en) * 2014-07-28 2016-05-12 Alexa Margaret McCulloch Avatar-Mediated Telepresence Systems with Enhanced Filtering
US9536228B2 (en) 2014-07-31 2017-01-03 Gretel, LLC Contact management systems
US9561444B2 (en) 2014-08-01 2017-02-07 Electronic Arts Inc. Resolving graphical conflicts between objects
CN105374055B (en) 2014-08-20 2018-07-03 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device
US20160057087A1 (en) 2014-08-21 2016-02-25 Facebook, Inc. Processing media messages based on the capabilities of the receiving device
US20160055370A1 (en) 2014-08-21 2016-02-25 Futurewei Technologies, Inc. System and Methods of Generating User Facial Expression Library for Messaging and Social Networking Applications
KR101540544B1 (en) 2014-09-05 2015-07-30 서용창 Message service method using character, user device for performing the method, message application comprising the method
CN105139438B (en) 2014-09-19 2018-01-12 University of Electronic Science and Technology of China Video face cartoon generation method
WO2016045015A1 (en) * 2014-09-24 2016-03-31 Intel Corporation Avatar audio communication systems and techniques
CN106575445B (en) 2014-09-24 2021-02-05 Intel Corporation Fur avatar animation
EP3198560A4 (en) 2014-09-24 2018-05-09 Intel Corporation User gesture driven avatar apparatus and method
CN106575446B (en) * 2014-09-24 2020-04-21 Intel Corporation Facial motion driven animation communication system
US10572103B2 (en) 2014-09-30 2020-02-25 Apple Inc. Timeline view of recently opened documents
US20160105388A1 (en) * 2014-10-09 2016-04-14 Footspot, Inc. System and method for digital media capture and related social networking
US9430696B2 (en) 2014-10-09 2016-08-30 Sensory, Incorporated Continuous enrollment for face verification
US9491258B2 (en) 2014-11-12 2016-11-08 Sorenson Communications, Inc. Systems, communication endpoints, and related methods for distributing images corresponding to communication endpoints
US9799133B2 (en) 2014-12-23 2017-10-24 Intel Corporation Facial gesture driven animation of non-facial features
WO2016101131A1 (en) 2014-12-23 2016-06-30 Intel Corporation Augmented facial animation
WO2016101124A1 (en) 2014-12-23 2016-06-30 Intel Corporation Sketch selection for rendering 3d model avatar
JP6152125B2 (en) 2015-01-23 2017-06-21 Nintendo Co., Ltd. Program, information processing apparatus, information processing system, and avatar image generation method
JP6461630B2 (en) 2015-02-05 2019-01-30 Nintendo Co., Ltd. Communication system, communication device, program, and display method
JP6511293B2 (en) 2015-02-26 2019-05-15 NTT Data Corporation User monitoring system
CN104753766B (en) 2015-03-02 2019-03-22 Xiaomi Inc. Expression sending method and device
EP3268096A4 (en) * 2015-03-09 2018-10-10 Ventana 3D LLC Avatar control system
US10812429B2 (en) * 2015-04-03 2020-10-20 Glu Mobile Inc. Systems and methods for message communication
KR102450865B1 (en) 2015-04-07 2022-10-06 Intel Corporation Avatar keyboard
US20170069124A1 (en) * 2015-04-07 2017-03-09 Intel Corporation Avatar generation and animations
US20160350957A1 (en) 2015-05-26 2016-12-01 Andrew Woods Multitrack Virtual Puppeteering
US20170018289A1 (en) * 2015-07-15 2017-01-19 String Theory, Inc. Emoji as facetracking video masks
US10171985B1 (en) 2015-07-22 2019-01-01 Ginko LLC Method and apparatus for data sharing
CN108140020A (en) 2015-07-30 2018-06-08 Intel Corporation Emotion-enhanced avatar animation
US20170046507A1 (en) 2015-08-10 2017-02-16 International Business Machines Corporation Continuous facial recognition for adaptive data restriction
US11048739B2 (en) 2015-08-14 2021-06-29 Nasdaq, Inc. Computer-implemented systems and methods for intelligently retrieving, analyzing, and synthesizing data from databases
US20170083086A1 (en) 2015-09-18 2017-03-23 Kai Mazur Human-Computer Interface
US20170083524A1 (en) * 2015-09-22 2017-03-23 Riffsy, Inc. Platform and dynamic interface for expression-based retrieval of expressive media content
US11138207B2 (en) * 2015-09-22 2021-10-05 Google Llc Integrated dynamic interface for expression-based retrieval of expressive media content
WO2017079731A1 (en) 2015-11-06 2017-05-11 Mursion, Inc. Control system for virtual characters
US10025972B2 (en) * 2015-11-16 2018-07-17 Facebook, Inc. Systems and methods for dynamically generating emojis based on image analysis of facial features
US10664741B2 (en) 2016-01-14 2020-05-26 Samsung Electronics Co., Ltd. Selecting a behavior of a virtual agent
CN105844101A (en) 2016-03-25 2016-08-10 Huizhou TCL Mobile Communication Co., Ltd. Smart-watch-based emotion data processing method and system, and the smart watch
US10366090B2 (en) 2016-03-30 2019-07-30 Facebook, Inc. Displaying temporary profile content items on communication networks
KR20170112497A (en) * 2016-03-31 2017-10-12 LG Electronics Inc. Mobile terminal and method for controlling the same
US11320982B2 (en) 2016-05-18 2022-05-03 Apple Inc. Devices, methods, and graphical user interfaces for messaging
US10009536B2 (en) 2016-06-12 2018-06-26 Apple Inc. Applying a simulated optical effect based on data received from multiple camera sensors
US10607386B2 (en) 2016-06-12 2020-03-31 Apple Inc. Customized avatars and associated framework
EP3264251B1 (en) 2016-06-29 2019-09-04 Dassault Systèmes Generation of a color of an object displayed on a gui
US10348662B2 (en) 2016-07-19 2019-07-09 Snap Inc. Generating customized electronic messaging graphics
US20180047200A1 (en) * 2016-08-11 2018-02-15 Jibjab Media Inc. Combining user images and computer-generated illustrations to produce personalized animated digital avatars
CN117193617A (en) 2016-09-23 2023-12-08 Apple Inc. Avatar creation and editing
DK179471B1 (en) 2016-09-23 2018-11-26 Apple Inc. Image data for enhanced user interactions
JP6266736B1 (en) 2016-12-07 2018-01-24 Colopl Inc. Method for communicating via virtual space, program for causing computer to execute the method, and information processing apparatus for executing the program
US10528801B2 (en) * 2016-12-07 2020-01-07 Keyterra LLC Method and system for incorporating contextual and emotional visualization into electronic communications
JP6240301B1 (en) 2016-12-26 2017-11-29 Colopl Inc. Method for communicating via virtual space, program for causing computer to execute the method, and information processing apparatus for executing the program
US20180225263A1 (en) * 2017-02-06 2018-08-09 Microsoft Technology Licensing, Llc Inline insertion viewport
EP3580686B1 (en) 2017-02-07 2023-03-15 InterDigital VC Holdings, Inc. System and method to prevent surveillance and preserve privacy in virtual reality
US10438393B2 (en) 2017-03-16 2019-10-08 Linden Research, Inc. Virtual reality presentation of body postures of avatars
KR20180120449A (en) 2017-04-27 2018-11-06 Samsung Electronics Co., Ltd. Method for sharing profile image and electronic device implementing the same
KR102435337B1 (en) 2017-05-16 2022-08-22 Apple Inc. Emoji recording and sending
DK179867B1 (en) 2017-05-16 2019-08-06 Apple Inc. RECORDING AND SENDING EMOJI
US10397391B1 (en) 2017-05-22 2019-08-27 Ginko LLC Two-way permission-based directory of contacts
US20190114037A1 (en) 2017-10-17 2019-04-18 Blend Systems, Inc. Systems and methods for distributing customized avatars responsive to events
CN111386553A (en) 2017-11-29 2020-07-07 Snap Inc. Graphics rendering for electronic messaging applications
CN111742351A (en) 2018-02-23 2020-10-02 Samsung Electronics Co., Ltd. Electronic device for generating an image in which a 3D avatar corresponding to a face reflects the face's motion, and method of operating the same
KR102565755B1 (en) 2018-02-23 2023-08-11 Samsung Electronics Co., Ltd. Electronic device for displaying an avatar that performs a motion according to the movement of a facial feature point, and method of operating the same
US11062494B2 (en) 2018-03-06 2021-07-13 Didimo, Inc. Electronic messaging utilizing animatable 3D models
WO2019183276A1 (en) 2018-03-20 2019-09-26 Wright Rocky Jerome Augmented reality and messaging
US10607065B2 (en) 2018-05-03 2020-03-31 Adobe Inc. Generation of parameterized avatars
DK180078B1 (en) 2018-05-07 2020-03-31 Apple Inc. USER INTERFACE FOR AVATAR CREATION
CN110457092A (en) 2018-05-07 2019-11-15 Apple Inc. Avatar creation user interface
DK201970530A1 (en) 2019-05-06 2021-01-28 Apple Inc Avatar integration with multiple applications

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040250210A1 (en) * 2001-11-27 2004-12-09 Ding Huang Method for customizing avatars and heightening online safety
US20040179037A1 (en) * 2003-03-03 2004-09-16 Blattner Patrick D. Using avatars to communicate context out-of-band
US20070115349A1 (en) * 2005-11-03 2007-05-24 Currivan Bruce J Method and system of tracking and stabilizing an image transmitted using video telephony
US20070260984A1 (en) * 2006-05-07 2007-11-08 Sony Computer Entertainment Inc. Methods for interactive communications with real time effects and avatar environment interaction
US20160006987A1 (en) * 2012-09-06 2016-01-07 Wenlong Li System and method for avatar creation and synchronization
US20140092130A1 (en) * 2012-09-28 2014-04-03 Glen J. Anderson Selectively augmenting communications transmitted by a communication device
US20140282000A1 (en) * 2013-03-15 2014-09-18 Tawfiq AlMaghlouth Animated character conversation generator
US20150379752A1 (en) * 2013-03-20 2015-12-31 Intel Corporation Avatar-based transfer protocols, icon generation and doll animation
US20150156598A1 (en) * 2013-12-03 2015-06-04 Cisco Technology, Inc. Microphone mute/unmute notification
US20160110922A1 (en) * 2014-10-16 2016-04-21 Tal Michael HARING Method and system for enhancing communication by using augmented reality

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10666920B2 (en) 2009-09-09 2020-05-26 Apple Inc. Audio alteration techniques
US10607386B2 (en) 2016-06-12 2020-03-31 Apple Inc. Customized avatars and associated framework
US11276217B1 (en) 2016-06-12 2022-03-15 Apple Inc. Customized avatars and associated framework
US11042616B2 (en) 2017-06-27 2021-06-22 Cirrus Logic, Inc. Detection of replay attack
US10770076B2 (en) 2017-06-28 2020-09-08 Cirrus Logic, Inc. Magnetic detection of replay attack
US11164588B2 (en) 2017-06-28 2021-11-02 Cirrus Logic, Inc. Magnetic detection of replay attack
US10853464B2 (en) 2017-06-28 2020-12-01 Cirrus Logic, Inc. Detection of replay attack
US11704397B2 (en) 2017-06-28 2023-07-18 Cirrus Logic, Inc. Detection of replay attack
US11755701B2 (en) 2017-07-07 2023-09-12 Cirrus Logic Inc. Methods, apparatus and systems for authentication
US11042617B2 (en) 2017-07-07 2021-06-22 Cirrus Logic, Inc. Methods, apparatus and systems for biometric processes
US11714888B2 (en) 2017-07-07 2023-08-01 Cirrus Logic Inc. Methods, apparatus and systems for biometric processes
US11042618B2 (en) 2017-07-07 2021-06-22 Cirrus Logic, Inc. Methods, apparatus and systems for biometric processes
US10984083B2 (en) 2017-07-07 2021-04-20 Cirrus Logic, Inc. Authentication of user using ear biometric data
US11829461B2 (en) 2017-07-07 2023-11-28 Cirrus Logic Inc. Methods, apparatus and systems for audio playback
US10839808B2 (en) 2017-10-13 2020-11-17 Cirrus Logic, Inc. Detection of replay attack
US11270707B2 (en) 2017-10-13 2022-03-08 Cirrus Logic, Inc. Analysing speech signals
US10847165B2 (en) 2017-10-13 2020-11-24 Cirrus Logic, Inc. Detection of liveness
US20190114497A1 (en) * 2017-10-13 2019-04-18 Cirrus Logic International Semiconductor Ltd. Detection of liveness
US11705135B2 (en) 2017-10-13 2023-07-18 Cirrus Logic, Inc. Detection of liveness
US11017252B2 (en) * 2017-10-13 2021-05-25 Cirrus Logic, Inc. Detection of liveness
US11023755B2 (en) * 2017-10-13 2021-06-01 Cirrus Logic, Inc. Detection of liveness
US20190114496A1 (en) * 2017-10-13 2019-04-18 Cirrus Logic International Semiconductor Ltd. Detection of liveness
US10832702B2 (en) 2017-10-13 2020-11-10 Cirrus Logic, Inc. Robustness of speech processing system against ultrasound and dolphin attacks
US11051117B2 (en) 2017-11-14 2021-06-29 Cirrus Logic, Inc. Detection of loudspeaker playback
US11276409B2 (en) 2017-11-14 2022-03-15 Cirrus Logic, Inc. Detection of replay attack
US11475899B2 (en) 2018-01-23 2022-10-18 Cirrus Logic, Inc. Speaker identification
US11735189B2 (en) 2018-01-23 2023-08-22 Cirrus Logic, Inc. Speaker identification
US11264037B2 (en) 2018-01-23 2022-03-01 Cirrus Logic, Inc. Speaker identification
US11694695B2 (en) 2018-01-23 2023-07-04 Cirrus Logic, Inc. Speaker identification
US20190304472A1 (en) * 2018-03-30 2019-10-03 Qualcomm Incorporated User authentication
US10733996B2 (en) * 2018-03-30 2020-08-04 Qualcomm Incorporated User authentication
US10720166B2 (en) * 2018-04-09 2020-07-21 Synaptics Incorporated Voice biometrics systems and methods
USD898761S1 (en) * 2018-04-26 2020-10-13 Lg Electronics Inc. Display screen with graphical user interface
US10818296B2 (en) * 2018-06-21 2020-10-27 Intel Corporation Method and system of robust speaker recognition activation
US11631402B2 (en) 2018-07-31 2023-04-18 Cirrus Logic, Inc. Detection of replay attack
US10915614B2 (en) 2018-08-31 2021-02-09 Cirrus Logic, Inc. Biometric authentication
US11748462B2 (en) 2018-08-31 2023-09-05 Cirrus Logic Inc. Biometric authentication
US11037574B2 (en) 2018-09-05 2021-06-15 Cirrus Logic, Inc. Speaker recognition and speaker change detection
US11527265B2 (en) * 2018-11-02 2022-12-13 BriefCam Ltd. Method and system for automatic object-aware video or audio redaction
US11189071B2 (en) 2019-02-07 2021-11-30 Samsung Electronics Co., Ltd. Electronic device for providing avatar animation and method thereof
US10921958B2 (en) 2019-02-19 2021-02-16 Samsung Electronics Co., Ltd. Electronic device supporting avatar recommendation and download
WO2020171385A1 (en) * 2019-02-19 2020-08-27 Samsung Electronics Co., Ltd. Electronic device supporting avatar recommendation and download
US11289067B2 (en) * 2019-06-25 2022-03-29 International Business Machines Corporation Voice generation based on characteristics of an avatar
US10922570B1 (en) * 2019-07-29 2021-02-16 NextVPU (Shanghai) Co., Ltd. Entering of human face information into database
US11131697B2 (en) * 2019-08-27 2021-09-28 Sean C. Butler System and method for combining a remote audio source with an animatronically controlled puppet
US11868592B2 (en) 2019-09-27 2024-01-09 Apple Inc. User interfaces for customizing graphical objects
US11609640B2 (en) * 2020-06-21 2023-03-21 Apple Inc. Emoji user interfaces
US11256402B1 (en) * 2020-08-12 2022-02-22 Facebook, Inc. Systems and methods for generating and broadcasting digital trails of visual media
USD960898S1 (en) 2020-08-12 2022-08-16 Meta Platforms, Inc. Display screen with a graphical user interface
USD960899S1 (en) 2020-08-12 2022-08-16 Meta Platforms, Inc. Display screen with a graphical user interface
USD1020798S1 (en) * 2020-12-21 2024-04-02 Samsung Electronics Co., Ltd. Display screen or portion thereof with graphical user interface
US20230135516A1 (en) * 2021-10-29 2023-05-04 Snap Inc. Voice notes with changing effects
US20230410396A1 (en) * 2022-06-17 2023-12-21 Lemon Inc. Audio or visual input interacting with video creation
US20240015262A1 (en) * 2022-07-07 2024-01-11 At&T Intellectual Property I, L.P. Facilitating avatar modifications for learning and other videotelephony sessions in advanced networks

Also Published As

Publication number Publication date
US10379719B2 (en) 2019-08-13
DK201770720A1 (en) 2019-01-29
DK179867B1 (en) 2019-08-06
US20180335927A1 (en) 2018-11-22
US10845968B2 (en) 2020-11-24
CN111563943A (en) 2020-08-21
CN111563943B (en) 2022-09-09
DK201770721A1 (en) 2019-01-29
DK179948B1 (en) 2019-10-22
US20180335930A1 (en) 2018-11-22
US20180335929A1 (en) 2018-11-22
CN115393485A (en) 2022-11-25
US10521091B2 (en) 2019-12-31

Similar Documents

Publication Publication Date Title
US20180336716A1 (en) Voice effects based on facial expressions
US10861210B2 (en) Techniques for providing audio and video effects
KR102367143B1 (en) Voice effects based on facial expressions
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
US10360716B1 (en) Enhanced avatar animation
US10019825B2 (en) Karaoke avatar animation based on facial motion data
US20210352380A1 (en) Characterizing content for audio-video dubbing and other transformations
US20140278403A1 (en) Systems and methods for interactive synthetic character dialogue
KR101492359B1 (en) Input support device, input support method, and recording medium
KR20070020252A (en) Method of and system for modifying messages
US11653072B2 (en) Method and system for generating interactive media content
JP2016511837A (en) Voice change for distributed story reading
US20200166670A1 (en) Personalizing weather forecast
CN113392201A (en) Information interaction method, information interaction device, electronic equipment, medium and program product
US10812430B2 (en) Method and system for creating a mercemoji
US20170169857A1 (en) Method and Electronic Device for Video Play
WO2022089224A1 (en) Video communication method and apparatus, electronic device, computer readable storage medium, and computer program product
WO2022242706A1 (en) Multimodal based reactive response generation
KR20240038941A (en) Method and system for generating avatar based on text
US10347299B2 (en) Method to automate media stream curation utilizing speech and non-speech audio cue analysis
KR102086780B1 (en) Method, apparatus and computer program for generating cartoon data
JP4917920B2 (en) Content generation apparatus and content generation program
KR102184053B1 (en) Method for generating a webtoon video that delivers each character's lines in a different converted voice
CN115393484A (en) Method and device for generating avatar animation, electronic device and storage medium
US10803114B2 (en) Systems and methods for generating audio or video presentation heat maps

Legal Events

Date Code Title Description
AS Assignment

Owner name: APPLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAMPRASHAD, SEAN A.;AVENDANO, CARLOS M.;LINDAHL, ARAM M.;SIGNING DATES FROM 20180227 TO 20180228;REEL/FRAME:045081/0404

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION