US20220230374A1 - User interface for generating expressive content - Google Patents
- Publication number
- US20220230374A1 (application US17/713,749)
- Authority
- US
- United States
- Prior art keywords
- expressive
- effect
- textual input
- user
- receiving
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/04817—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance using icons
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0236—Character input methods using selection techniques to select from displayed items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0237—Character input methods using prediction or retrieval techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0487—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
- G06F3/0488—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
- G06F3/04886—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures by partitioning the display area of the touch-screen or the surface of the digitising tablet into independently controllable areas, e.g. virtual keyboards or menus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/274—Converting codes to words; Guess-ahead of partial word inputs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2013/083—Special characters, e.g. punctuation marks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Definitions
- Alternative and augmentative communication includes forms of communication other than oral speech that are used to express thoughts, needs, wants, or ideas.
- An individual may rely on an AAC system as an aid to communicate when the individual is not able to communicate orally, for example, due to a speech disability.
- Some AAC systems are operative to synthesize speech from the individual's input.
- Conveying emotions, attitude, or tone through speech is oftentimes dependent on non-verbal communicative features, such as gestures and speech prosody; however, current speech-generating AAC systems do not support conveyance of non-verbal information, and generally only provide users with a text-to-speech engine and voice fonts that synthesize a single flat tone of speech that is mostly devoid of emotion and expressivity regardless of the input text that the AAC user is intending to convey.
- synthesized speech generated from an AAC user's input may sound robotic and lack volume and vocal inflection, which makes it difficult for the AAC user to effectively communicate in a way that represents the user's internal voice. As can be appreciated, this can negatively impact AAC users' quality of life, specifically in their interactions with other individuals.
- an AAC user will type and speak an additional explanatory phrase, such as “I am angry” before typing and speaking the phrase that the user intended to speak originally.
- this is inefficient and can present a significant burden to AAC users, particularly when using gaze-based text entry, for which AAC users have a typical text entry rate of between 10 and 20 words per minute.
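To make the burden concrete, the following minimal sketch works out how long an explanatory phrase takes to enter at the quoted text entry rates. The function name `extra_seconds` is illustrative only, and the calculation assumes a word is any whitespace-separated token.

```python
def extra_seconds(phrase: str, words_per_minute: float) -> float:
    """Seconds needed to enter `phrase` at the given text entry rate."""
    word_count = len(phrase.split())  # assumption: words are whitespace-separated
    return word_count * 60.0 / words_per_minute


phrase = "I am angry"
slow = extra_seconds(phrase, 10)  # lower end of the quoted 10 to 20 wpm range
fast = extra_seconds(phrase, 20)  # upper end of the quoted range
print(f"{phrase!r}: {fast:.0f} to {slow:.0f} extra seconds")
# prints "'I am angry': 9 to 18 extra seconds"
```

Under these assumptions, a three-word explanatory phrase adds roughly 9 to 18 seconds before the intended phrase can even be started.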
- aspects are directed to an automated system, method, and device for generating expressive content.
- an improved user experience is provided, where a user is enabled to efficiently and effectively compose expressive content, such as prosody-enhanced speech, sound effects, or visual effects, using voicesetting editing.
- An expressive synthesized speech system provides an expressive keyboard for enabling the user to input textual content and for selecting expressive operators, such as emoji objects or punctuation objects for applying predetermined prosody attributes, sound effects, or visual effects to the user's textual content.
- the user may selectively enter a voicesetting editor mode, where a voicesetting editor UI is displayed for enabling the user to author or adjust particular prosody attributes associated with the user's content.
- an active listening mode is provided. When the user selects to launch the active listening mode, a set of active listening mode effect options are displayed, wherein each effect option is associated with a particular sound effect and/or visual effect. In conversations, the user is enabled to easily and rapidly respond with expressive vocal sound effects or visual effects while listening to others speak.
- Examples are implemented as a computer process, a computing system, or as an article of manufacture such as a device, computer program product, or computer readable media.
- the computer program product is a computer storage media readable by a computer system and encoding a computer program of instructions for executing a computer process.
- FIG. 1 is a block diagram showing an example operating environment including components of an expressive synthesized speech system for generating expressive content;
- FIG. 2A is an illustration of an example user interface display generated by aspects of an expressive synthesized speech system showing an expressive keyboard including selectable punctuation objects;
- FIG. 2B is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a selection of an active listening mode (ALM);
- FIG. 2C is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a plurality of selectable ALM effect options;
- FIG. 2D is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing an example communication between a user and a conversation partner, wherein the user is enabled to use the ALM to provide feedback;
- FIGS. 2E-2I are illustrations of example user interface displays generated by aspects of the expressive synthesized speech system showing examples of visual effects/output corresponding to various ALM effect options;
- FIG. 2J is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing various example selectable emoji objects;
- FIGS. 2K-2P are illustrations of example user interface displays generated by aspects of the expressive synthesized speech system showing examples of visual effects/output corresponding to various emoji objects;
- FIG. 3A is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a selection to utilize the voicesetting editor;
- FIGS. 3B-3G are illustrations of example voicesetting editor user interface displays generated by aspects of the expressive synthesized speech system;
- FIG. 4A is a flow chart showing general stages involved in an example method for generating expressive content;
- FIG. 4B is a flow chart showing general stages involved in another example method for generating expressive content;
- FIG. 5 is a block diagram illustrating example physical components of a computing device;
- FIGS. 6A and 6B are block diagrams of a mobile computing device.
- FIG. 7 is a block diagram of a distributed computing system.
- aspects of the present disclosure are directed to a method, system, and computer storage media for providing an intuitive synthesized speech-authoring user interface for generating expressive content. While many of the examples described herein are directed to generating expressive content in an alternative and augmentative communication (AAC) system, as should be appreciated, aspects are equally applicable in a variety of alternative use cases and systems.
- synthesized speech-authoring user interfaces may also be used for authoring or marking up documents that are to be rendered to audio by automated or semi-automated means (e.g., marking up a print edition book to be rendered to an audio book, authoring a document to be rendered by a screen reader program, authoring content in a web-based communication service, and authoring content that is to be read aloud by learning tool systems).
- the expressive synthesized speech system 108 is operative to provide improved synthesized speech-authoring user interfaces via which a user 104 is enabled to efficiently and effectively author content for generating expressive output, such as prosody-enhanced speech, sound effects, and visual effects.
- prosody-enhanced speech includes speech that comprises variables of timing, phrasing, emphasis, and intonation that speakers use to help convey aspects of meaning and to make speech lively.
- the example operating environment 100 includes an electronic computing device 102.
- the computing device 102 shown in FIG. 1 is illustrated as a tablet computing device; however, as should be appreciated, the computing device 102 may be one of various types of computing devices (e.g., a tablet computing device, a desktop computer, a mobile communication device, a laptop computer, a laptop/tablet hybrid computing device, a large screen multi-touch display, a gaming device, a smart television, a wearable device, or other type of computing device) for executing applications for performing a variety of tasks.
- the hardware of these computing devices is discussed in greater detail in regard to FIGS. 5, 6A, 6B, and 7.
- the user 104 utilizes the computing device 102 for executing the expressive synthesized speech system 108, which, in association with a text-to-speech engine (i.e., speech generation engine 118), generates expressive synthesized speech from the user's input.
- the computing device 102 includes or is in communication with the expressive synthesized speech system 108.
- the computing device 102 includes an expressive synthesized speech application programming interface (API), operative to enable an application executing on the computing device to employ the systems and methods of the present disclosure via stored instructions.
- the synthesized speech system 108 is operative to receive input (e.g., text input, mode selections, on-screen object selections, and prosody cue input) from a user-controlled input device 106 via various input methods, such as those relying on mice, keyboards, and remote controls, as well as Natural User Interface (NUI) methods, which enable a user to interact with the computing device 102 in a “natural” manner, such as via technologies including touch sensitive displays, voice and speech recognition, intention and goal understanding, motion gesture detection using depth cameras, motion gesture detection using accelerometers/gyroscopes, facial recognition, 3D displays, head, eye, and gaze tracking, immersive augmented reality and virtual reality systems, all of which provide a more natural interface, as well as technologies for sensing brain activity using electric field sensing electrodes (EEG and related methods).
- the user 104 uses gaze-based input methods or head mouse input methods, which are typically used by individuals who have communication challenges stemming from paralysis or severe motor disabilities, such as people who have advanced amyotrophic lateral sclerosis (ALS).
- the expressive synthesized speech system 108 generates and provides a graphical user interface (GUI) that allows the user 104 to interact with functionality of the expressive synthesized speech system 108.
- the expressive synthesized speech system 108 comprises an expressive keyboard UI engine 110, illustrative of a software module, system, or device operative to generate a GUI display of an expressive keyboard.
- the expressive keyboard UI engine 110 provides a keyboard that extends an on-screen keyboard, which is used to input text for speech synthesis, by providing a set of selectable icons or emoji objects that can be selectively inserted into the user's text.
- Each emoji object illustrates a particular emotion (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry), and is associated with a predefined operation or operations that can change the tone of voice of the user's text to a specified emotional state (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry).
- certain prosodic attributes are applied to the user's text based on the selected emoji object.
- the expressive synthesized speech system 108 is operative to apply a tone and emotion of output speech corresponding to a selected emoji object at a sentence level.
- the expressive keyboard UI engine 110 is operative to intelligently display a set of emoji objects on the expressive keyboard, for example, based on linguistic properties of the user's textual input or based on a recognized emotional state of the user detected via a sensor (e.g., biometric sensors, facial expression sensors, body posture sensors, gesture sensors).
- the set of emoji objects may include emoji objects associated with an emotion corresponding to the user's recognized emotional state.
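The selection of which emoji objects to display can be illustrated with a small sketch. Everything here is hypothetical: the keyword cues, the scoring weights, and the function name `rank_emoji` are assumptions made for illustration, and the source does not specify how linguistic properties or a sensed emotional state are actually weighed.

```python
from typing import List, Optional

# Emotions named in the description, in a fixed display order.
EMOTIONS = ["sad", "calm", "happy", "funny", "sarcastic",
            "surprised", "irritated", "angry"]

# Hypothetical keyword cues standing in for "linguistic properties".
CUES = {
    "happy": {"great", "thanks", "love"},
    "angry": {"hate", "furious"},
    "sad": {"sorry", "miss"},
    "surprised": {"wow", "really"},
}


def rank_emoji(text: str, detected_emotion: Optional[str] = None,
               limit: int = 4) -> List[str]:
    """Return up to `limit` emotions to show on the keyboard, best match first."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    scores = {emotion: 0 for emotion in EMOTIONS}
    for emotion, cues in CUES.items():
        scores[emotion] += len(words & cues)
    # Assumed weighting: a sensor-detected emotional state counts as two cues.
    if detected_emotion in scores:
        scores[detected_emotion] += 2
    # sorted() is stable, so ties keep the EMOTIONS display order.
    return sorted(EMOTIONS, key=lambda e: -scores[e])[:limit]


print(rank_emoji("I love this, thanks"))
print(rank_emoji("hello", detected_emotion="angry"))
```

A real system would likely replace the keyword table with a trained classifier; the sketch only shows how text-derived and sensor-derived signals might be combined into one ranked set.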
- one or more emoji objects are associated with a vocal sound effect (e.g., laughter, sarcastic scoff, sharp breath in, grunt, sigh, or a disgusted “ugh” sound).
- the expressive keyboard UI engine 110 is operative to receive the user's text input and selection of an emoji object, and communicate the input and selection to a voicesetting engine 114 .
- the voicesetting engine 114 is illustrative of a software module, system, or device operative to receive user input, apply the predefined operation associated with the selected emoji object to apply speech or prosodic properties to the text input, and output a representation of the user's speech to a speech generation engine 118 for generating audible output 126 embodied as expressive synthesized speech.
- the voicesetting engine 114 comprises an index of emoji objects and their corresponding operation(s), which includes prosodic attributes and/or vocal sound effects.
- the voicesetting engine 114 is operative to reference the index for applying the appropriate prosodic attributes and/or vocal sound effect to the user's text.
- the voicesetting engine 114 specifies the text input and prosodic attributes and/or vocal sound effects via a markup language, such as Speech Synthesis Markup Language (SSML), which is output to a speech generation engine 118 .
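The index lookup and SSML output described above might be sketched as follows. This is a minimal illustration, not the patented implementation: the `Operation` class, the `EMOJI_INDEX` contents, the specific prosody values, and the audio file name are all assumptions. Only the SSML elements used (`speak`, `prosody`, `audio`) come from the SSML standard.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Operation:
    """Predefined operation tied to an emoji object (hypothetical shape)."""
    pitch: str = "medium"
    rate: str = "medium"
    volume: str = "medium"
    sound_effect: Optional[str] = None  # hypothetical audio clip URI


# Hypothetical index of emoji objects and their operations.
EMOJI_INDEX = {
    "sad": Operation(pitch="low", rate="slow", volume="soft"),
    "happy": Operation(pitch="high"),
    "angry": Operation(pitch="low", rate="fast", volume="loud"),
    "funny": Operation(pitch="high", sound_effect="laughter.wav"),
}


def to_ssml(sentence: str, emoji: Optional[str] = None) -> str:
    """Apply the emoji's operation at sentence level and emit SSML."""
    op = EMOJI_INDEX.get(emoji, Operation())  # unknown emoji: neutral prosody
    body = (
        f"<prosody pitch={quoteattr(op.pitch)} rate={quoteattr(op.rate)} "
        f"volume={quoteattr(op.volume)}>{escape(sentence)}</prosody>"
    )
    if op.sound_effect:
        body += f"<audio src={quoteattr(op.sound_effect)}/>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="en-US">{body}</speak>'
    )


print(to_ssml("I told you not to do that", "angry"))
```

Escaping the sentence text keeps characters such as `&` from breaking the markup before it reaches the speech generation engine.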
- the audible output 126 is played on an audio output device 128, which may be integrated with the user's computing device 102, or may be incorporated in another device utilized by a communication partner 120.
- the expressive synthesized speech system 108 works in association with a visualization generation engine 116 for generating expressive visual output 122 for display on a visual output device 124 from the user's input.
- the predefined operation or operations associated with each emoji object may include providing a visual feature, wherein each emoji object may be associated with one or more visual features (e.g., text, emoji, graphics, animations, video clips).
- the expressive keyboard UI engine 110 is operative to communicate the text input and the selected emoji object to the voicesetting engine 114 , wherein the voicesetting engine applies the predefined operation associated with the selected emoji object to apply visual features to the text input, and output a representation of the visual features to a visualization generation engine 116 for generating visual output 122 for display on a visual output device 124 .
- the visual output device 124 may be integrated with the user's computing device 102 , or may be incorporated in another device utilized by a communication partner 120 . In some examples, the visual output device 124 and the audio output device 128 are incorporated in a single device.
- the expressive keyboard UI engine 110 is further operative to provide a plurality of punctuation objects (e.g., a period, comma, question mark, exclamation point), which when selectively inserted into the user's text, change prosodic attributes (e.g., silent space, pitch, speed, emphasis) of surrounding words.
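One way punctuation objects could alter the prosody of surrounding words is sketched below using standard SSML `break`, `emphasis`, and `prosody` elements. The pause lengths and the specific rules (strong emphasis before an exclamation point, raised pitch before a question mark) are assumptions chosen for illustration; the source does not give concrete values.

```python
import re
from xml.sax.saxutils import escape

# Assumed silent-space length, in milliseconds, for each punctuation object.
PUNCTUATION_BREAK_MS = {",": 250, ".": 500, "?": 500, "!": 500}


def apply_punctuation(text: str) -> str:
    """Turn punctuation into SSML prosody changes on the preceding word."""
    parts = []
    last_word = None  # index in `parts` of the most recent unmarked word
    for token in re.findall(r"[^\s.,?!]+|[.,?!]", text):
        if token not in PUNCTUATION_BREAK_MS:
            parts.append(escape(token))
            last_word = len(parts) - 1
            continue
        if last_word is not None:
            if token == "!":
                parts[last_word] = (
                    f'<emphasis level="strong">{parts[last_word]}</emphasis>')
            elif token == "?":
                parts[last_word] = (
                    f'<prosody pitch="high">{parts[last_word]}</prosody>')
            last_word = None
        # The punctuation mark itself is replaced by a pause.
        parts.append(f'<break time="{PUNCTUATION_BREAK_MS[token]}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"


print(apply_punctuation("Wait, really?"))
```

Keeping the per-punctuation values in one table mirrors the idea of a settings menu in which these attributes could be customized.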
- a settings menu is provided for enabling the user to customize the prosodic attributes or vocal sound effects associated with emoji objects or to customize the prosodic attributes associated with punctuation objects.
- the expressive keyboard UI engine 110 is further operative to provide a selectable active listening mode (ALM) for enabling the user 104 to select vocal sound effects and/or visual effects for communicating information when a communication partner 120 is speaking. For example, akin to using gestures, such as nodding, or non-verbal vocalizations, such as laughing, in standard communication, the user 104 is enabled to use ALM effects to provide feedback to the user's communication partner 120 .
- the expressive keyboard UI engine 110 is operative to provide an ALM command, which when selected, causes the expressive keyboard UI engine to display a plurality of selectable ALM effect options, wherein each ALM effect option is associated with a particular sound effect and/or visual effect.
- the ALM effect options and associated sound or visual effects are customizable by the user 104.
- the user 104 is enabled to select specific ALM effect options to display on the keyboard.
- voice-banked recordings may be associated with sound effect options.
- a voice-banked phrase or other sound effect may be previously recorded by the user 104 or another individual and saved as a sound effect that can be selectively played or spoken by the expressive synthesized speech system 108 during a conversation with a communication partner 120 .
- an expression-banked reaction may be previously recorded by the user 104 or another individual and saved as a visual effect that can be selectively displayed by the expressive synthesized speech system 108 during a conversation with a communication partner 120 .
- the user 104 may save a static image or a video clip as a visual effect.
- the expressive keyboard UI engine 110 is operative to receive the user's ALM effect option selection, and communicate the selection to the voicesetting engine 114 for outputting a representation of the associated ALM effect to an audio output device 128 and/or to a visual output device 124 .
- the expressive keyboard UI engine 110 is operative to communicate the selected ALM effect option to the voicesetting engine 114, wherein the voicesetting engine applies the predefined operation associated with the selected ALM effect option to provide audible features or visual features associated with the selected ALM effect option as output to a visualization generation engine 116 for generating visual output 122 for display on a visual output device 124 or to a speech generation engine 118 for generating audible output 126 for playback on an audio output device 128.
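The routing of a selected ALM effect option to audio and/or visual output can be sketched as a simple dispatch. The `AlmEffect` class, the example option names, the file names, and the callback parameters `play_audio` and `show_visual` are hypothetical stand-ins for the engines and output devices named in the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class AlmEffect:
    """An ALM effect option with an optional sound and/or visual effect."""
    label: str
    sound: Optional[str] = None   # e.g., a voice-banked recording
    visual: Optional[str] = None  # e.g., a static image or video clip


# Hypothetical, user-customizable set of ALM effect options.
ALM_EFFECTS = {
    "laugh": AlmEffect("laugh", sound="laugh.wav", visual="laugh.gif"),
    "agree": AlmEffect("agree", sound="mm-hmm.wav"),
    "nod": AlmEffect("nod", visual="nod.mp4"),
}


def dispatch(option: str,
             play_audio: Callable[[str], None],
             show_visual: Callable[[str], None]) -> List[str]:
    """Send the selected effect to whichever outputs it defines."""
    effect = ALM_EFFECTS[option]
    routed = []
    if effect.sound is not None:
        play_audio(effect.sound)
        routed.append("audio")
    if effect.visual is not None:
        show_visual(effect.visual)
        routed.append("visual")
    return routed


played, shown = [], []
print(dispatch("laugh", played.append, shown.append))  # prints ['audio', 'visual']
```

Passing the outputs in as callbacks keeps the sketch independent of any particular audio or display device, which may be on the user's device or on one used by the communication partner.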
- aspects of the expressive keyboard UI engine 110 enable single-click input by the user 104 to quickly and easily specify the expressive nature of their speech and to rapidly respond with expressive vocal sound effects while listening to others speak. Aspects of the expressive keyboard including the emoji objects, punctuation objects, and the ALM effect options will be described in further detail below with reference to FIGS. 2A-2P .
- the expressive synthesized speech system 108 further comprises a voicesetting editor 112 illustrative of a software module, system, or device operative to provide a GUI that allows the user 104 to modify various prosodic properties associated with text input for asynchronously authoring expressivity of synthesized speech (e.g., outside of a face-to-face conversation).
- the text input includes text composed via interaction with the expressive keyboard.
- the text input includes text in an existing text file which the user 104 is enabled to upload into the expressive synthesized speech system 108 .
- the voicesetting editor 112 is operative to provide various controls for enabling the user 104 to edit various properties of the output speech via coarse input methods, such as via eye gaze or head mouse movement.
- the user 104 is enabled to add detailed pacing, emotional content, or prosodic content to text before the text is rendered as speech.
- the voicesetting editor 112 provides controls for editing pause length, pitch, speed, or emphasis of specific text, and for inserting sound effects and/or visual effects.
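A possible data model for such word-level edits is sketched below. The `SpanEdit` class and `render` function are hypothetical; the sketch assumes edits do not overlap and that each edit's `end` is not before its `start`, and it renders the result with standard SSML elements.

```python
from dataclasses import dataclass
from typing import List, Optional
from xml.sax.saxutils import escape


@dataclass
class SpanEdit:
    """Prosody edit applied to words[start..end], inclusive (hypothetical)."""
    start: int
    end: int
    pitch: Optional[str] = None
    rate: Optional[str] = None
    emphasis: Optional[str] = None
    pause_after_ms: int = 0


def render(words: List[str], edits: List[SpanEdit]) -> str:
    """Render words plus non-overlapping span edits as SSML."""
    by_start = {edit.start: edit for edit in edits}
    out = []
    i = 0
    while i < len(words):
        edit = by_start.get(i)
        if edit is None:
            out.append(escape(words[i]))
            i += 1
            continue
        chunk = escape(" ".join(words[edit.start:edit.end + 1]))
        if edit.emphasis:
            chunk = f'<emphasis level="{edit.emphasis}">{chunk}</emphasis>'
        attrs = "".join(
            f' {name}="{value}"'
            for name, value in (("pitch", edit.pitch), ("rate", edit.rate))
            if value
        )
        if attrs:
            chunk = f"<prosody{attrs}>{chunk}</prosody>"
        out.append(chunk)
        if edit.pause_after_ms:
            out.append(f'<break time="{edit.pause_after_ms}ms"/>')
        i = max(edit.end, i) + 1  # always advance, even for a malformed edit
    return "<speak>" + " ".join(out) + "</speak>"


words = ["I", "really", "mean", "it"]
print(render(words, [SpanEdit(1, 1, emphasis="strong", pause_after_ms=300)]))
```

Storing edits separately from the text lets a prepared phrase be saved and replayed with the same pacing and emphasis each time.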
- the user 104 may utilize the UI provided by the voicesetting editor 112 when the user wants to have a higher degree of control over specifics of how the user's text will be spoken when played by an audio output device 128 .
- voicesetting text via the voicesetting editor 112 typically entails additional time, and may be used by the user 104 in a situation where the user is preparing text to speak in advance (e.g., before a medical appointment, before giving a public speech, when preparing stored phrases for repeated use). Aspects of the voicesetting editor 112 will be described in further detail below with respect to FIGS. 3A-3G .
- the expressive synthesized speech system 108 is operative to provide intuitive synthesized speech-authoring user interfaces via which the user 104 is enabled to efficiently and effectively author content for generating expressive output, such as prosody-enhanced speech, sound effects, and visual effects. Examples of synthesized speech-authoring user interfaces provided by the expressive synthesized speech system 108 are described below with reference to FIGS. 2A-3G . With reference now to FIG. 2A , an illustration of an example UI displayed on a display 204 of a computing device 102 and generated by aspects of an expressive synthesized speech system 108 is shown.
- the expressive keyboard UI engine 110 is operative to generate an expressive synthesized speech system UI 202 including an expressive keyboard 206 for enabling the user 104 to provide text input 210 , prosodic cue selections, and sound and/or visual effect selections.
- the expressive keyboard 206 includes an alpha-numeric keyboard, which the user 104 is enabled to use to select letters and numbers to author textual content 210 for speech synthesis.
- the expressive keyboard 206 is a gaze-based on-screen keyboard, wherein eye tracking is used to determine the user's gaze position, which is utilized as a cursor. For example, to enter text 210 , the user 104 may gaze on letters, numbers, or other displayed keys of the on-screen keyboard.
- the expressive keyboard 206 enables input via a head mouse input device 106 , where the user's head movements are translated into mouse pointer movement. In other examples, other input device 106 types are used for inputting textual content and providing selections.
- Examples of the expressive keyboard 206 further include a plurality of selectable punctuation objects 208 a - n (collectively, 208 ), which the user 104 is enabled to include with textual input 210 to specify the expressive nature of the textual input.
- each punctuation object 208 has a predetermined operation associated with it that specifies how particular prosodic attributes are to be applied to surrounding text, thus changing the expressive nature of the speech that will be generated from the text.
- inclusion of a period, comma, or exclamation point punctuation object 208 may operate to insert a default or user-customizable amount of silent space between pronouncing words or sentences, thus allowing the user 104 to set the cadence of his/her speech.
- inclusion of a single question mark punctuation object 208 may operate to raise the pitch of a word located immediately prior to the question mark, and inclusion of two question mark punctuation objects may operate to raise the pitch of the two words located immediately prior to the question marks to emphasize that the textual content 210 is a question.
- an exclamation point punctuation object 208 may operate to increase the emotional tone, volume, or rate of speech of at least a portion of the textual content 210 , or to place emphasis on a specific word of the textual content.
- the punctuation objects 208 illustrated in the figures and described herein are non-limiting examples. Other punctuation objects 208 and other corresponding operations are possible and are within the scope of the present disclosure.
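The punctuation operations described above can be sketched as a small lookup plus one rule for question marks. This is a minimal illustration, not the disclosed implementation: the `Word` class, the pause durations, and the pitch shift amount are hypothetical values chosen for the example, since the description only states that such amounts are default or user-customizable.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    pitch_shift: float = 0.0  # relative pitch change; 0.0 means unchanged
    pause_after_ms: int = 0   # silence to synthesize after this word


# Hypothetical defaults (assumption): the description gives no concrete values.
PAUSE_MS = {",": 250, ".": 500, "!": 500}
QUESTION_PITCH_SHIFT = 0.2


def apply_punctuation(words, punctuation):
    """Return Word objects with the trailing punctuation's operation applied."""
    result = [Word(w) for w in words]
    if not result or not punctuation:
        return result
    if set(punctuation) == {"?"}:
        # N question marks raise the pitch of the last N words.
        for word in result[-len(punctuation):]:
            word.pitch_shift += QUESTION_PITCH_SHIFT
    elif punctuation in PAUSE_MS:
        # Period, comma, or exclamation point inserts silent space.
        result[-1].pause_after_ms = PAUSE_MS[punctuation]
    return result
```

For example, `apply_punctuation(["are", "you", "coming"], "??")` raises the pitch of "you" and "coming" and leaves "are" unchanged.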
- the user 104 may enter textual content 210 , select a punctuation object 208 , and then select to play the text with the punctuation object functionality applied.
- the user 104 may select a play command 212 , which when selected, causes the expressive keyboard UI engine 110 to pass the textual content 210 and the selected punctuation object 208 to the voicesetting engine 114 for application of the prosodic attributes or prosodic properties corresponding to the selected punctuation object to the textual content.
- the expressive keyboard 206 further includes a selectable ALM (active listening mode) command 214 , which when selected, causes the expressive synthesized speech system 108 to enter an active listening mode and the expressive keyboard UI engine 110 to display a plurality of selectable ALM effect options 216 a - n (collectively, 216 ), wherein each ALM effect option is associated with a particular sound effect and/or visual effect that can be selectively communicated to a communication partner 120 .
- selectable ALM effect options 216 are illustrated.
- the ALM effect options 216 provide the synthesized speech user 104 with single-click access to a variety of sound effects and/or visual effects, which in various examples, are customizable by the synthesized speech user.
- although the example ALM effect options 216 illustrated in FIG. 2C are displayed as text, other display options are possible, such as images or symbols.
- a user 104 and a communication partner 120 may have a conversation.
- the first step depicts the communication partner 120 talking, and the user 104 providing communication feedback to the communication partner 120 in the form of an “I'm listening” visual effect/output 122 that can be rendered on the communication partner's visual output device 124 in response to a selection of an ALM effect option 216 .
- the second step depicts the user 104 selecting another ALM effect option 216 for providing another visual effect and/or sound effect to communicate with the communication partner 120 . For example, as the communication partner 120 is telling the user 104 a funny story, the user selects a “laugh” ALM effect option 216 .
- a corresponding visual effect/output 122 is displayed on the communication partner's visual output device 124 , and a corresponding sound effect/output 126 is played on the communication partner's audio output device 128 .
- the user 104 is enabled to provide rapid expressive reactions during a conversation.
- ALM effect options 216 may be provided in the expressive keyboard 206 , and a variety of corresponding audio (sound) effects/output 126 and visual effects/output 122 may be provided responsive to a selection of an ALM effect option.
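One way to picture the single-click behavior of the ALM effect options is a table that maps each option to its visual and/or sound effect, with a selection routed to the matching output. The sketch below is an assumption-laden illustration: `AlmEffect`, `ALM_EFFECTS`, the effect identifiers, and the callback parameters are hypothetical names, standing in for the visualization generation engine 116 and the speech generation engine 118.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlmEffect:
    visual: Optional[str] = None  # identifier of a visual feature (text, icon, clip, ...)
    sound: Optional[str] = None   # identifier of a vocal sound effect


# Hypothetical option table; the labels follow examples from the description.
ALM_EFFECTS = {
    "I'm listening": AlmEffect(visual="listening_dots"),
    "laugh": AlmEffect(visual="laughing_face", sound="laugh"),
    "hold on": AlmEffect(visual="hold_on_text"),
}


def select_alm_option(label, show_visual, play_sound):
    """Route one selected option to the visual and/or audio output callbacks."""
    effect = ALM_EFFECTS[label]
    if effect.visual is not None:
        show_visual(effect.visual)
    if effect.sound is not None:
        play_sound(effect.sound)
    return effect
```

Because each option carries its complete effect, one selection is enough to produce both outputs, which matches the "laugh" example where a visual effect and a sound effect are provided together.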
- Various examples of visual effects/output 122 corresponding to various ALM effect options 216 are illustrated in FIGS. 2E-2I .
- an “I'm listening” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a visual output device 124 , such as the communication partner's visual output device.
- the visual effect may be embodied as one or more visual features representing the selected ALM effect option 216 , such as text 218 a saying “I'm listening,” an animated dot object 218 b indicating that the user 104 is listening, an icon or clipart 218 c illustrating that the user is listening, an emoticon 218 d portraying listening, an animated avatar 218 e , or a video clip 218 f that symbolizes listening.
- an “I'm talking” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected ALM effect option 216 , such as text 220 a saying “I'm talking,” an animated dot object 220 b indicating that the user 104 is talking, an icon or clipart 220 c illustrating that the user is talking, an emoticon 220 d portraying talking, an animated avatar 220 e , or a video clip 220 f that characterizes talking.
- an “I'm typing” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected ALM effect option 216 , such as text 220 a saying “I'm typing,” an animated dot object 220 b indicating that the user 104 is typing, an icon or clipart 220 c illustrating that the user is typing, an emoticon 220 d portraying typing, an animated avatar 220 e , or a video clip 220 f that characterizes typing or waiting for the user to finish typing.
- a “hold on” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected ALM effect option 216 , such as text 220 a saying “hold on,” an animated dot object 220 b indicating to wait or hold on, an icon or clipart 220 c illustrating for the communication partner 120 to wait, an emoticon 220 d portraying waiting or holding, an animated avatar 220 e , or a video clip 220 f that characterizes waiting for the user 104 .
- a “pardon me?” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected ALM effect option 216 , such as text 220 a saying “pardon me?,” an animated dot object 220 b indicating a question, an icon or clipart 220 c illustrating a question, an emoticon 220 d portraying a question, an animated avatar 220 e , or a video clip 220 f that characterizes questioning.
- the examples described above and illustrated in FIGS. 2E-2I are not meant to be limiting.
- Other ALM effect options 216 and other visual effects/output 122 are possible and are within the scope of the disclosure.
- the visual output 122 can fall anywhere on a spectrum from concrete to ambiguous, as well as on a spectrum from low to high resolution.
- visual output 122 may be generated and provided automatically. For example, the expressive synthesized speech system 108 may detect that the user 104 is typing, and automatically provide a visual effect, such as one or more of the “I'm typing” visual features or the “I'm talking” visual features described above and illustrated in FIGS. 2F and 2G .
- the communication partner's visual output device 124 could be embodied as a physical object that is configured to convey the other visual effects/output 122 .
- the visual output device 124 may include an animatronic robot configured to move appendages, change positions and/or change facial expressions to reflect the visual output 122 .
- the expressive keyboard UI engine 110 is operative to provide a set of selectable icons or emoji objects 228 a - n (collectively, 228 ) that can be selectively inserted into the user's text 210 .
- certain emoji objects 228 may be intelligently and selectively displayed, for example, based on linguistic properties of the user's textual input or based on recognized user affects detected via a sensor (e.g., biometric sensors, facial expression sensors, body posture sensors, gesture sensors).
- Each emoji object 228 illustrates a particular emotion (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry), and is associated with a predefined speech operation or operations that can add a sound effect (e.g., laughter, a sharp breath, a sarcastic scoff, grunt, sigh, a disgusted “ugh” sound, an angry “argh” sound), change the tone of voice of the user's text 210 to a specified emotional state (e.g., sad, calm, happy, funny, sarcastic, irritated, angry), or a combination of both.
- specific predetermined prosodic attribute settings may be associated with each emoji object 228 .
- the specific predetermined prosodic attribute settings associated with the particular emoji object are applied to the user's text 210 .
- the user 104 has entered text 210 and has selected an angry emoji object 228 i .
- the expressive synthesized speech system 108 is operative to insert an angry “argh” sound effect and modify the user's text 210 according to predefined prosodic attribute settings associated with the angry emoji object 228 i for expressing an angry tone when the text is played.
- the user's text 210 includes a punctuation object 208 , which in the illustrated example is a question mark. Accordingly, specific prosodic features may be applied to a portion of the user's text 210 according to the inserted punctuation object 208 . For example, the pitch of the last one or more words may be raised to emphasize that the sentence is a question.
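The emoji object behavior described above (a sound effect, a change of emotional tone, or both) can be sketched as a table of predefined operations applied to the user's text. The names `EmojiOperation`, `EMOJI_OPERATIONS`, and `build_utterance` are hypothetical, and placing the sound effect before the spoken text is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmojiOperation:
    tone: str                           # emotional tone applied to the text
    sound_effect: Optional[str] = None  # vocal sound effect to insert, if any


# Hypothetical table; tones and effects follow examples from the description.
EMOJI_OPERATIONS = {
    "angry": EmojiOperation(tone="angry", sound_effect="argh"),
    "funny": EmojiOperation(tone="funny", sound_effect="laugh"),
    "calm": EmojiOperation(tone="calm"),
}


def build_utterance(text, emoji):
    """Return the ordered segments to hand to a speech generation step."""
    operation = EMOJI_OPERATIONS[emoji]
    segments = []
    if operation.sound_effect:
        # Assumption: the sound effect is played before the text.
        segments.append({"type": "sound_effect", "name": operation.sound_effect})
    segments.append({"type": "speech", "text": text, "tone": operation.tone})
    return segments
```

With the angry emoji object selected, the result is an "argh" segment followed by the text marked with an angry tone; a punctuation operation could then adjust the pitch of the final words independently of the tone.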
- emoji objects 228 are further associated with predefined visual operations, wherein selection and insertion of an emoji object 228 in the user's text 210 causes the expressive synthesized speech system 108 to provide a particular visual feature for display on the communication partner's device. Examples of various visual features that may be displayed in response to the user's selection of emoji objects 228 are illustrated in FIGS. 2K-2P . With reference now to FIG. 2K , responsive to a selection of a happy emoji object 228 c , a visual effect/output 122 representing a happy expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 230 a saying “happy,” an animated dot object 230 b characterizing a smiley face, an icon or clipart 230 c illustrating a smiley face, an emoticon 230 d portraying a smiley face, an animated avatar 230 e personifying happiness, or a video clip 230 f that characterizes happiness.
- a visual effect/output 122 representing a sad expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 232 a saying “sad,” an animated dot object 232 b characterizing a sad face, an icon or clipart 232 c illustrating a sad face, an emoticon 232 d portraying a sad face, an animated avatar 232 e personifying sadness, or a video clip 232 f that characterizes sadness.
- a visual effect/output 122 representing an angry expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 234 a saying “angry,” an animated dot object 234 b characterizing an angry face, an icon or clipart 234 c illustrating an angry face, an emoticon 234 d portraying an angry face, an animated avatar 234 e personifying anger, or a video clip 234 f that characterizes anger.
- a visual effect/output 122 representing a funny expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 236 a saying “funny,” an animated dot object 236 b characterizing a laughing face, an icon or clipart 236 c illustrating a laughing face, an emoticon 236 d portraying a laughing face, an animated avatar 236 e personifying laughter, or a video clip 236 f that characterizes laughter.
- a visual effect/output 122 representing a sarcastic expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 238 a saying “sarcastic,” an animated dot object 238 b characterizing a sarcastic face, an icon or clipart 238 c illustrating a sarcastic face, an emoticon 238 d portraying a sarcastic face, an animated avatar 238 e personifying sarcasm, or a video clip 238 f that characterizes sarcasm.
- a visual effect/output 122 representing a loving expression may be provided for display on a communication partner's visual output device 124 .
- the visual effect may be embodied as one or more visual features representing the selected emoji object 228 , such as text 240 a saying “caring,” an animated dot object 240 b characterizing a heart, an icon or clipart 240 c illustrating a heart, an emoticon 240 d portraying a heart, an animated avatar 240 e personifying love, or a video clip 240 f that characterizes love.
- the above examples are not limiting.
- Other emoji objects 228 and visual features are possible and are within the scope of the present disclosure.
- the voicesetting editor 112 is operative to provide a GUI that allows the user 104 to modify various prosodic properties associated with individual words within input text 210 for explicitly authoring expressivity of synthesized speech.
- the user 104 is enabled to select to use the voicesetting editor 112 , for example, via a selection of a voicesetting editor command 302 .
- the voicesetting editor 112 is operative to display a voicesetting editor UI 304 for enabling the user 104 to author specific prosodic properties of the textual input 210 .
- the voicesetting editor 112 is operative to parse the input text 210 into three types of tokens 305 : words 308 , punctuation 310 , and vocal sound effects 306 (e.g., derived from selected emoji objects 228 in the input text 210 ).
- the tokens 305 are displayed in reading order.
- padding 312 is provided between the tokens for allowing sufficiently large gaze targets for gaze-based input.
- sound effect tokens 306 include a textual description of the vocal sound effect they represent (e.g., “laugh” for laughter, “argh” for angry effect).
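A minimal sketch of the three-way token parse follows. It assumes, purely for illustration, that an inserted emoji object appears in the input text as a bracketed name such as `[argh]`; the description does not specify how emoji objects are encoded, and `Token` and `parse_tokens` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # "word", "punctuation", or "sound_effect"
    text: str


# Assumed encoding: a sound effect derived from an emoji object is written "[name]".
TOKEN_PATTERN = re.compile(
    r"\[(?P<effect>\w+)\]"       # sound effect marker
    r"|(?P<word>\w+(?:'\w+)*)"   # word, allowing internal apostrophes
    r"|(?P<punct>[^\w\s])"       # any other non-space character
)


def parse_tokens(text: str) -> List[Token]:
    """Split input text into tokens, preserving reading order."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        if match.group("effect"):
            tokens.append(Token("sound_effect", match.group("effect")))
        elif match.group("word"):
            tokens.append(Token("word", match.group("word")))
        else:
            tokens.append(Token("punctuation", match.group("punct")))
    return tokens
```

The sound effect token keeps its textual description (for example "argh"), which is what the editor would display for it.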
- the voicesetting editor 112 allows for setting prosodic properties 314 on individual tokens 305 or over ranges of tokens.
- Selection of a single token opens a token editing interface, such as the example token editing interface 316 illustrated in FIG. 3C , for editing prosodic properties 314 associated with a single token.
- the token editing interface 316 includes the selected token 305 displayed in the center of the interface, wherein modifiable prosodic properties 314 are displayed in a radial menu surrounding the token.
- a radial menu design used for single token editing reduces the amount of distance the user's eye must travel during gaze-based input to change properties 314 of the token, thus enabling efficient user interaction and increasing user interaction performance.
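The geometric idea behind the radial menu can be illustrated by placing the property controls evenly on a circle around the token, so that every control is the same distance from the token at the center. The function name, the pixel units, and the choice to start at the top are assumptions for this sketch.

```python
import math


def radial_positions(center, radius, count):
    """Place `count` menu items evenly on a circle around `center`.

    Item 0 is placed at the top, assuming screen coordinates where y grows downward.
    """
    cx, cy = center
    positions = []
    for i in range(count):
        angle = 2 * math.pi * i / count - math.pi / 2
        positions.append((cx + radius * math.cos(angle), cy + radius * math.sin(angle)))
    return positions
```

Since each control lies exactly `radius` away from the token, the gaze travel needed to reach any property from the token is bounded by that radius, whatever the number of properties.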
- a range selector 318 is provided, which when selected, allows the user 104 to select a range of tokens 305 .
- the voicesetting editor 112 is operative to display a token editing interface, such as the example token editing interface 320 illustrated in FIG. 3D , for editing prosodic properties 314 associated with a range 322 of tokens 305 .
- a set of prosodic properties 314 is displayed that the user 104 is enabled to adjust. According to an aspect, only the properties 314 that can be adjusted for a selected token 305 are displayed. Some prosodic properties 314 can be applied to all three types of token 305 , for example, emotional tone, rate of speech, volume, and pitch. Other prosodic properties 314 are adjustable for particular token types. For example, word tokens 308 have an emphasis property 314 b that allows the user 104 to specify which words should be emphasized. As another example, punctuation tokens 310 have a pause property 314 g that allows the user 104 to specify an amount of silent time to synthesize between the pronunciations of words.
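The rule that only adjustable properties are shown for a selected token can be sketched as a lookup keyed by token type. The property names come from the description; the dictionary layout and the function name are hypothetical.

```python
# Properties that apply to all three token types.
COMMON_PROPERTIES = ("emotional tone", "rate of speech", "volume", "pitch")

# Properties that apply only to particular token types.
TYPE_SPECIFIC_PROPERTIES = {
    "word": ("emphasis",),
    "punctuation": ("pause",),
    "sound_effect": (),
}


def editable_properties(token_kind):
    """Return the properties to display for a token of the given type."""
    return COMMON_PROPERTIES + TYPE_SPECIFIC_PROPERTIES[token_kind]
```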
- Upon selection of a prosodic property 314 , the voicesetting editor 112 provides functionalities for enabling the user 104 to adjust the value of the property.
- a set of predetermined value ranges are provided for prosodic properties 314 .
- a meaningful label 324 is associated with each value.
- the emphasis property 314 b may have the labels 324 and values: “normal” (a default value that the user 104 is enabled to configure), “strong” (80%), and “very strong” (100%).
- Value adjusters 326 , 328 are provided for enabling the user 104 to adjust the value of the prosodic property up or down respectively.
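Labeled value steps with up and down adjusters can be modeled as an ordered list and an index. In this sketch the class name is hypothetical, the 0.5 used for "normal" is a placeholder for the user-configurable default, and clamping at the ends of the list is an assumption.

```python
class LabeledProperty:
    """A prosodic property restricted to a few labeled values."""

    def __init__(self, steps, index=0):
        self.steps = list(steps)  # ordered (label, value) pairs, lowest first
        self.index = index

    @property
    def label(self):
        return self.steps[self.index][0]

    @property
    def value(self):
        return self.steps[self.index][1]

    def up(self):
        # Assumption: the adjuster stops at the highest step instead of wrapping.
        self.index = min(self.index + 1, len(self.steps) - 1)

    def down(self):
        # Assumption: the adjuster stops at the lowest step instead of wrapping.
        self.index = max(self.index - 1, 0)


# "strong" (80%) and "very strong" (100%) follow the description; 0.5 is a placeholder.
emphasis = LabeledProperty([("normal", 0.5), ("strong", 0.8), ("very strong", 1.0)])
```

Restricting each property to a handful of labeled steps means a change needs only one or two coarse selections, which suits gaze or head mouse input.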
- prosodic property values are displayed in the voicesetting editor UI 304 using visual properties.
- emotional tone 314 f , volume 314 d , and rate of speech 314 e properties are displayed as lines over tokens 305 .
- the labels 324 associated with the properties are displayed on or near the lines.
- voice pitch 314 c is displayed using a corresponding baseline height 334 (e.g., the higher the baseline height, the higher the pitch, the lower the baseline height, the lower the pitch).
- emphasis 314 b is displayed using font boldness 336 (e.g., bolder font corresponds to heavier emphasis).
- pause length 314 g is displayed using an ellipse 332 , wherein the width of the ellipse 332 corresponds to the pause length according to inter-word or punctuation tokens.
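The visual encodings listed above can be summarized as a mapping from prosodic values to display attributes. The scaling constants, the value ranges, and the attribute names below are hypothetical; the 400 to 900 font weight range follows the common numeric font weight convention.

```python
def token_style(pitch=0.0, emphasis=0.0, pause_ms=0):
    """Map prosodic values for one token to display attributes.

    Assumes pitch is a relative shift, emphasis lies in [0, 1], and pause is in milliseconds.
    """
    return {
        "baseline_offset_px": round(pitch * 20),     # higher pitch -> raised baseline
        "font_weight": 400 + round(emphasis * 500),  # heavier emphasis -> bolder font
        "ellipse_width_px": round(pause_ms / 10),    # longer pause -> wider ellipse
    }
```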
- FIGS. 4A and 4B are flow charts showing general stages involved in example methods 400 , 420 for generating expressive content.
- the method 400 begins at start OPERATION 402 , and proceeds to OPERATION 404 , where the expressive keyboard 206 is displayed in the expressive synthesized speech system UI 202 .
- the method 400 proceeds to OPERATION 406 , where textual input 210 is received from the user 104 . Further, a selection of an expressive operator may be received. For example, the user 104 may select an emoji object 228 or a punctuation object 208 for insertion into the textual input.
- the method 400 proceeds to OPERATION 410 , where the voicesetting editor 112 parses the textual content 210 and any selected punctuation objects 208 or emoji objects 228 , and displays a voicesetting editor UI 304 for allowing the user 104 to adjust or refine prosodic properties 314 for crafting the rendering of the user's content by a synthetic voice.
- the method 400 proceeds to OPERATION 412 , where the user 104 makes one or more prosodic property 314 selections, and at OPERATION 414 , the prosodic attributes, vocal sound effects, or visual effects associated with selected emoji object 228 and/or punctuation object 208 are applied to the textual input 210 .
- the combined textual input 210 and prosodic attributes, vocal sound effects, or visual effects are output to a speech generation engine 118 for generating expressive audible output 126 for playback on an audio output device 128 or to a visualization generation engine 116 for generating expressive visual output 122 for display on a visual output device 124 .
- the method 400 ends at OPERATION 418 .
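The hand-off of text plus prosodic attributes to a speech generation step can be illustrated with Speech Synthesis Markup Language (SSML). The description does not name SSML; it is used here only as one widely supported way to carry pause, emphasis, and pitch information, and the token dictionary layout is hypothetical. The bare `<speak>` root is simplified and omits the version and namespace attributes a strict SSML document would carry.

```python
from xml.sax.saxutils import escape


def to_ssml(tokens):
    """Serialize tokens to a simplified SSML string.

    Each token is a dict with 'text' and optional 'emphasis', 'pitch', 'pause_ms'.
    """
    parts = []
    for token in tokens:
        fragment = escape(token["text"])
        if token.get("emphasis"):
            fragment = f'<emphasis level="{token["emphasis"]}">{fragment}</emphasis>'
        if token.get("pitch"):
            fragment = f'<prosody pitch="{token["pitch"]}">{fragment}</prosody>'
        parts.append(fragment)
        if token.get("pause_ms"):
            parts.append(f'<break time="{token["pause_ms"]}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"
```

Keeping the authored properties on the tokens and serializing only at the end means the same token data can also drive a visualization step.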
- the method 420 begins at start OPERATION 422 , and proceeds to OPERATION 424 , where the expressive keyboard 206 is displayed in the expressive synthesized speech system UI 202 .
- the method 420 proceeds to OPERATION 426 , where a selection to launch the active listening mode (ALM) is received.
- the user 104 may select an ALM command 214 displayed on the expressive keyboard 206 .
- the method 420 proceeds to OPERATION 428 , where responsive to the ALM command 214 selection, the expressive synthesized speech system 108 enters an active listening mode and the expressive keyboard UI engine 110 displays a plurality of selectable ALM effect options 216 , wherein each ALM effect option 216 is associated with a particular sound effect and/or visual effect that can be selectively communicated to a communication partner 120 .
- the method 420 proceeds to OPERATION 430 , where a selection of an ALM effect option 216 is received.
- the expressive synthesized speech system 108 identifies the vocal sound effect and/or visual effect corresponding to the selected ALM effect option 216 , and outputs the corresponding vocal sound effect and/or visual effect to a speech generation engine 118 or visualization generation engine 116 for generating audible output 126 /visual output 122 for playback/display on a communication partner's 120 device 128 / 124 .
- the method 420 ends at OPERATION 434 .
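The stages of method 420 can be read as a two-state flow: the keyboard is shown, the ALM command switches the system into active listening mode and reveals the effect options, and a selected option yields the effect to output. The class and attribute names are hypothetical, and rejecting an effect selection outside active listening mode is an assumption.

```python
class ExpressiveKeyboardSession:
    """Tracks whether the keyboard is in active listening mode (ALM)."""

    def __init__(self, effect_options):
        self.effect_options = dict(effect_options)  # option label -> effect identifier
        self.mode = "keyboard"
        self.visible_options = []

    def select_alm_command(self):
        # Entering ALM displays the selectable effect options.
        self.mode = "active_listening"
        self.visible_options = list(self.effect_options)

    def select_effect_option(self, label):
        # Assumption: effect options can only be chosen once ALM is active.
        if self.mode != "active_listening":
            raise RuntimeError("ALM effect options require active listening mode")
        return self.effect_options[label]
```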
- program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types.
- the aspects and functionalities described herein operate via a multitude of computing systems including, without limitation, desktop computer systems, wired and wireless computing systems, mobile computing systems (e.g., mobile telephones, netbooks, tablet or slate type computers, notebook computers, and laptop computers), hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, and mainframe computers.
- the aspects and functionalities described herein operate over distributed systems (e.g., cloud-based computing systems), where application functionality, memory, data storage and retrieval and various processing functions are operated remotely from each other over a distributed computing network, such as the Internet or an intranet.
- user interfaces and information of various types are displayed via on-board computing device displays or via remote display units associated with one or more computing devices. For example, user interfaces and information of various types are displayed and interacted with on a wall surface onto which user interfaces and information of various types are projected.
- Interaction with the multitude of computing systems with which implementations are practiced includes keystroke entry, touch screen entry, voice or other audio entry, gesture entry where an associated computing device is equipped with detection (e.g., camera) functionality for capturing and interpreting user gestures for controlling the functionality of the computing device, and the like.
- FIGS. 5-7 and the associated descriptions provide a discussion of a variety of operating environments in which examples are practiced.
- the devices and systems illustrated and discussed with respect to FIGS. 5-7 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that are utilized for practicing aspects described herein.
- FIG. 5 is a block diagram illustrating physical components (i.e., hardware) of a computing device 500 with which examples of the present disclosure may be practiced.
- the computing device 500 includes at least one processing unit 502 and a system memory 504 .
- the system memory 504 comprises, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.
- the system memory 504 includes an operating system 505 and one or more program modules 506 suitable for running software applications 550 .
- the system memory 504 includes the expressive synthesized speech system 108 .
- the operating system 505 is suitable for controlling the operation of the computing device 500 .
- aspects are practiced in conjunction with a graphics library, other operating systems, or any other application program, and are not limited to any particular application or system.
- This basic configuration is illustrated in FIG. 5 by those components within a dashed line 508 .
- the computing device 500 has additional features or functionality.
- the computing device 500 includes additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 5 by a removable storage device 509 and a non-removable storage device 510 .
- a number of program modules and data files are stored in the system memory 504 .
- the program modules 506 (e.g., expressive synthesized speech system 108 ) perform processes including, but not limited to, one or more of the stages of the methods 400 and 420 illustrated in FIGS. 4A and 4B .
- other program modules are used in accordance with examples and include applications 550 such as electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
- aspects are practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.
- aspects are practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in FIG. 5 are integrated onto a single integrated circuit.
- such an SOC device includes one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit.
- the functionality described herein is operated via application-specific logic integrated with other components of the computing device 500 on the single integrated circuit (chip).
- aspects of the present disclosure are practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
- aspects are practiced within a general purpose computer or in any other circuits or systems.
- the computing device 500 has one or more input device(s) 512 such as a keyboard, a mouse, a pen, a sound input device, a touch input device, etc.
- output device(s) 514 , such as a display, speakers, a printer, etc., are also included according to an aspect.
- the aforementioned devices are examples and others may be used.
- the computing device 500 includes one or more communication connections 516 allowing communications with other computing devices 518 . Examples of suitable communication connections 516 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
- Computer readable media include computer storage media.
- Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules.
- the system memory 504 , the removable storage device 509 , and the non-removable storage device 510 are all computer storage media examples (i.e., memory storage).
- computer storage media includes RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 500 .
- any such computer storage media is part of the computing device 500 .
- Computer storage media does not include a carrier wave or other propagated data signal.
- communication media is embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
- modulated data signal describes a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
- FIGS. 6A and 6B illustrate a mobile computing device 600 , for example, a mobile telephone, a smart phone, a tablet personal computer, a laptop computer, and the like, with which aspects may be practiced.
- a mobile computing device 600 for implementing the aspects is illustrated.
- the mobile computing device 600 is a handheld computer having both input elements and output elements.
- the mobile computing device 600 typically includes a display 605 and one or more input buttons 610 that allow the user to enter information into the mobile computing device 600 .
- the display 605 of the mobile computing device 600 functions as an input device (e.g., a touch screen display). If included, an optional side input element 615 allows further user input.
- the side input element 615 is a rotary switch, a button, or any other type of manual input element.
- the mobile computing device 600 incorporates more or fewer input elements.
- the display 605 may not be a touch screen in some examples.
- the mobile computing device 600 is a portable phone system, such as a cellular phone.
- the mobile computing device 600 includes an optional keypad 635 .
- the optional keypad 635 is a physical keypad.
- the optional keypad 635 is a “soft” keypad generated on the touch screen display.
- the output elements include the display 605 for showing a graphical user interface (GUI), a visual indicator 620 (e.g., a light emitting diode), and/or an audio transducer 625 (e.g., a speaker).
- the mobile computing device 600 incorporates a vibration transducer for providing the user with tactile feedback.
- the mobile computing device 600 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device.
- the mobile computing device 600 incorporates a peripheral device port 640 , such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device.
- FIG. 6B is a block diagram illustrating the architecture of one example of a mobile computing device. That is, the mobile computing device 600 incorporates a system (i.e., an architecture) 602 to implement some examples.
- the system 602 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players).
- the system 602 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
- one or more application programs 650 are loaded into the memory 662 and run on or in association with the operating system 664 .
- Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth.
- the expressive synthesized speech system 108 is loaded into memory 662 .
- the system 602 also includes a non-volatile storage area 668 within the memory 662 .
- the non-volatile storage area 668 is used to store persistent information that should not be lost if the system 602 is powered down.
- the application programs 650 may use and store information in the non-volatile storage area 668 , such as e-mail or other messages used by an e-mail application, and the like.
- a synchronization application (not shown) also resides on the system 602 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 668 synchronized with corresponding information stored at the host computer.
- other applications may be loaded into the memory 662 and run on the mobile computing device 600 .
- the system 602 has a power supply 670 , which is implemented as one or more batteries.
- the power supply 670 further includes an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
- the system 602 includes a radio 672 that performs the function of transmitting and receiving radio frequency communications.
- the radio 672 facilitates wireless connectivity between the system 602 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio 672 are conducted under control of the operating system 664 . In other words, communications received by the radio 672 may be disseminated to the application programs 650 via the operating system 664 , and vice versa.
- the visual indicator 620 is used to provide visual notifications and/or an audio interface 674 is used for producing audible notifications via the audio transducer 625 .
- the visual indicator 620 is a light emitting diode (LED) and the audio transducer 625 is a speaker.
- to indicate the powered-on status of the device, the LED may be programmed to remain on indefinitely until the user takes action.
- the audio interface 674 is used to provide audible signals to and receive audible signals from the user.
- the audio interface 674 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation.
- the system 602 further includes a video interface 676 that enables an operation of an on-board camera 630 to record still images, video stream, and the like.
- a mobile computing device 600 implementing the system 602 has additional features or functionality.
- the mobile computing device 600 includes additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape.
- additional storage is illustrated in FIG. 6B by the non-volatile storage area 668 .
- data/information generated or captured by the mobile computing device 600 and stored via the system 602 is stored locally on the mobile computing device 600 , as described above.
- the data is stored on any number of storage media that are accessible by the device via the radio 672 or via a wired connection between the mobile computing device 600 and a separate computing device associated with the mobile computing device 600 , for example, a server computer in a distributed computing network, such as the Internet.
- data/information is accessible via the mobile computing device 600 via the radio 672 or via a distributed computing network.
- such data/information is readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
- FIG. 7 illustrates one example of the architecture of a system for generating expressive content as described above.
- Content developed, interacted with, or edited in association with the expressive synthesized speech system 108 is enabled to be stored in different communication channels or other storage types.
- various documents may be stored using a directory service 722 , a web portal 724 , a mailbox service 726 , an instant messaging store 728 , or a social networking site 730 .
- the expressive synthesized speech system 108 is operative to use any of these types of systems or the like for generating expressive content, as described herein.
- a server 720 provides the expressive synthesized speech system 108 to clients 705a, 705b, and 705c.
- the server 720 is a web server providing the expressive synthesized speech system 108 over the web.
- the server 720 provides the expressive synthesized speech system 108 over the web to clients 705 through a network 740 .
- the client computing device is implemented and embodied in a personal computer 705a, a tablet computing device 705b, or a mobile computing device 705c (e.g., a smart phone), or other computing device. Any of these examples of the client computing device are operable to obtain content from the store 716 .
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
- This application is a continuation of U.S. patent application Ser. No. 15/347,653, filed Nov. 9, 2016, which is incorporated herewith by reference.
- Alternative and augmentative communication (AAC) includes forms of communication other than oral speech that are used to express thoughts, needs, wants, or ideas. An individual may rely on an AAC system as an aid to communicate when the individual is not able to communicate orally, for example, due to a speech disability. Some AAC systems are operative to synthesize speech from the individual's input.
- Conveying emotions, attitude, or tone through speech is oftentimes dependent on non-verbal communicative features, such as gestures and speech prosody; however, current speech-generating AAC systems do not support conveyance of non-verbal information, and generally only provide users with a text-to-speech engine and voice fonts that synthesize a single flat tone of speech that is mostly devoid of emotion and expressivity regardless of the input text that the AAC user is intending to convey. For example, synthesized speech generated from an AAC user's input may sound robotic and lack volume and vocal inflection, which makes it difficult for the AAC user to effectively communicate in a way that represents the user's internal voice. As can be appreciated, this can negatively impact AAC users' quality of life, specifically in their interactions with other individuals.
- Oftentimes, to try to convey emotion or expressivity, an AAC user will type and speak an additional explanatory phrase, such as “I am angry,” before typing and speaking the phrase that the user originally intended to speak. As can be appreciated, this is inefficient and can present a significant burden to AAC users, particularly when using gaze-based text entry, for which AAC users have a typical text entry rate of between 10 and 20 words per minute.
- This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.
- Aspects are directed to an automated system, method, and device for generating expressive content. By employing aspects of the present disclosure, an improved user experience is provided, where a user is enabled to efficiently and effectively compose expressive content, such as prosody-enhanced speech, sound effects, or visual effects, using voicesetting editing.
- An expressive synthesized speech system provides an expressive keyboard for enabling the user to input textual content and for selecting expressive operators, such as emoji objects or punctuation objects for applying predetermined prosody attributes, sound effects, or visual effects to the user's textual content. In some examples, the user may selectively enter a voicesetting editor mode, where a voicesetting editor UI is displayed for enabling the user to author or adjust particular prosody attributes associated with the user's content. In some examples, an active listening mode is provided. When the user selects to launch the active listening mode, a set of active listening mode effect options are displayed, wherein each effect option is associated with a particular sound effect and/or visual effect. In conversations, the user is enabled to easily and rapidly respond with expressive vocal sound effects or visual effects while listening to others speak. Because the user does not have to type and speak additional explanatory phrases to communicate emotions or expressivity, fewer processing resources are expended to provide input to the expressive synthesized speech system, and the functionality of the computing device used to provide the expressive synthesized speech system is thereby expanded and improved.
- Examples are implemented as a computer process, a computing system, or as an article of manufacture such as a device, computer program product, or computer readable media. According to an aspect, the computer program product is a computer storage media readable by a computer system and encoding a computer program of instructions for executing a computer process.
- The details of one or more aspects are set forth in the accompanying drawings and description below. Other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that the following detailed description is explanatory only and is not restrictive of the claims.
- The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various aspects. In the drawings:
- FIG. 1 is a block diagram showing an example operating environment including components of an expressive synthesized speech system for generating expressive content;
- FIG. 2A is an illustration of an example user interface display generated by aspects of an expressive synthesized speech system showing an expressive keyboard including selectable punctuation objects;
- FIG. 2B is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a selection of an active listening mode (ALM);
- FIG. 2C is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a plurality of selectable ALM effect options;
- FIG. 2D is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing an example communication between a user and a conversation partner, wherein the user is enabled to use the ALM to provide feedback;
- FIGS. 2E-2I are illustrations of example user interface displays generated by aspects of the expressive synthesized speech system showing examples of visual effects/output corresponding to various ALM effect options;
- FIG. 2J is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing various example selectable emoji objects;
- FIGS. 2K-2P are illustrations of example user interface displays generated by aspects of the expressive synthesized speech system showing examples of visual effects/output corresponding to various emoji objects;
- FIG. 3A is an illustration of an example user interface display generated by aspects of the expressive synthesized speech system showing a selection to utilize the voicesetting editor;
- FIGS. 3B-3G are illustrations of example voicesetting editor user interface displays generated by aspects of the expressive synthesized speech system;
- FIG. 4A is a flow chart showing general stages involved in an example method for generating expressive content;
- FIG. 4B is a flow chart showing general stages involved in another example method for generating expressive content;
- FIG. 5 is a block diagram illustrating example physical components of a computing device;
- FIGS. 6A and 6B are block diagrams of a mobile computing device; and
- FIG. 7 is a block diagram of a distributed computing system.
- The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description refers to the same or similar elements. While examples may be described, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the elements illustrated in the drawings, and the methods described herein may be modified by substituting, reordering, or adding stages to the disclosed methods. Accordingly, the following detailed description is not limiting, but instead, the proper scope is defined by the appended claims. Examples may take the form of a hardware implementation, or an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.
- Aspects of the present disclosure are directed to a method, system, and computer storage media for providing an intuitive synthesized speech-authoring user interface for generating expressive content. While many of the examples described herein are directed to generating expressive content in an alternative and augmentative communication (AAC) system, as should be appreciated, aspects are equally applicable in a variety of alternative use cases and systems. For example, in addition to providing synthesized speech-authoring user interfaces to users who are reliant on an AAC system to generate content, such as users who have communication challenges stemming from severe motor disabilities, synthesized speech-authoring user interfaces may also be used for authoring or marking up documents that are to be rendered to audio by automated or semi-automated means (e.g., marking up a print edition book to be rendered to an audio book, authoring a document to be rendered by a screen reader program, authoring content in a web-based communication service, and authoring content that is to be read aloud by learning tool systems).
- With reference now to FIG. 1, a block diagram of an example operating environment 100 illustrating aspects of an example expressive synthesized speech system 108 for generating expressive content is shown. The expressive synthesized speech system 108 is operative to provide improved synthesized speech-authoring user interfaces via which a user 104 is enabled to efficiently and effectively author content for generating expressive output, such as prosody-enhanced speech, sound effects, and visual effects. For example, prosody-enhanced speech includes speech that comprises variables of timing, phrasing, emphasis, and intonation that speakers use to help convey aspects of meaning and to make speech lively.
- The example operating environment 100 includes an electronic computing device 102. The computing device 102 shown in FIG. 1 is illustrated as a tablet computing device; however, as should be appreciated, the computing device 102 may be one of various types of computing devices (e.g., a tablet computing device, a desktop computer, a mobile communication device, a laptop computer, a laptop/tablet hybrid computing device, a large screen multi-touch display, a gaming device, a smart television, a wearable device, or other type of computing device) for executing applications for performing a variety of tasks. The hardware of these computing devices is discussed in greater detail in regard to FIGS. 5, 6A, 6B, and 7.
- According to aspects, the user 104 utilizes the computing device 102 for executing the expressive synthesized speech system 108, which, in association with a text-to-speech engine (i.e., speech generation engine 118), generates expressive synthesized speech from the user's input. The computing device 102 includes or is in communication with the expressive synthesized speech system 108. In one example, the computing device 102 includes an expressive synthesized speech application programming interface (API), operative to enable an application executing on the computing device to employ the systems and methods of the present disclosure via stored instructions.
- In examples, the synthesized speech system 108 is operative to receive input (e.g., text input, mode selections, on-screen object selections, and prosody cue input) from a user-controlled input device 106 via various input methods, such as those relying on mice, keyboards, and remote controls, as well as Natural User Interface (NUI) methods, which enable a user to interact with the computing device 102 in a “natural” manner, such as via technologies including touch sensitive displays, voice and speech recognition, intention and goal understanding, motion gesture detection using depth cameras, motion gesture detection using accelerometers/gyroscopes, facial recognition, 3D displays, head, eye, and gaze tracking, immersive augmented reality and virtual reality systems, all of which provide a more natural interface, as well as technologies for sensing brain activity using electric field sensing electrodes (EEG and related methods). In specific examples, the user 104 uses gaze-based input methods or head mouse input methods, which are typically used by individuals who have communication challenges stemming from paralysis or severe motor disabilities, such as people who have advanced amyotrophic lateral sclerosis (ALS).
- Aspects of the expressive synthesized speech system 108 generate and provide a graphical user interface (GUI) that allows the user 104 to interact with functionality of the expressive synthesized speech system 108. According to examples, the expressive synthesized speech system 108 comprises an expressive keyboard UI engine 110, illustrative of a software module, system, or device operative to generate a GUI display of an expressive keyboard. According to one aspect, the expressive keyboard UI engine 110 provides a keyboard that extends an on-screen keyboard, which is used to input text for speech synthesis, by providing a set of selectable icons or emoji objects that can be selectively inserted into the user's text. Each emoji object illustrates a particular emotion (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry), and is associated with a predefined operation or operations that can change the tone of voice of the user's text to a specified emotional state (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry). For example, certain prosodic attributes are applied to the user's text based on the selected emoji object. According to one example, the expressive synthesized speech system 108 is operative to apply a tone and emotion of output speech corresponding to a selected emoji object at a sentence level. According to an aspect, the expressive keyboard UI engine 110 is operative to intelligently display a set of emoji objects on the expressive keyboard, for example, based on linguistic properties of the user's textual input or based on a recognized emotional state of the user detected via a sensor (e.g., biometric sensors, facial expression sensors, body posture sensors, gesture sensors). For example, the set of emoji objects may include emoji objects associated with an emotion corresponding to the user's recognized emotional state.
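As a rough illustration of how an emoji object might carry a predefined sentence-level prosody operation, the sketch below maps emotion names to prosodic attributes and wraps a sentence in the SSML `<prosody>` element. It is only a sketch under stated assumptions: the names `Prosody`, `EMOJI_INDEX`, `suggest_emoji`, and `to_ssml`, and every attribute value, are hypothetical and do not come from the disclosure, and SSML is used here simply as one possible markup for the result.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Prosody:
    """Sentence-level prosodic attributes, expressed as SSML attribute values."""
    pitch: str
    rate: str
    volume: str


# Hypothetical index of emoji objects and their prosodic attributes.
# The emotion names follow the disclosure; the values are illustrative only.
EMOJI_INDEX = {
    "calm": Prosody(pitch="medium", rate="medium", volume="medium"),
    "happy": Prosody(pitch="+15%", rate="fast", volume="loud"),
    "sad": Prosody(pitch="-10%", rate="slow", volume="soft"),
    "angry": Prosody(pitch="+5%", rate="fast", volume="x-loud"),
}


def suggest_emoji(recognized_state=None):
    """Order the emoji set so a recognized emotional state (if any) comes first."""
    names = list(EMOJI_INDEX)
    if recognized_state in EMOJI_INDEX:
        names.remove(recognized_state)
        names.insert(0, recognized_state)
    return names


def to_ssml(sentence, emoji):
    """Apply the prosody associated with the selected emoji to a whole sentence."""
    p = EMOJI_INDEX[emoji]
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<prosody pitch="{p.pitch}" rate="{p.rate}" volume="{p.volume}">'
        f"{escape(sentence)}"  # escape &, <, > so the markup stays well-formed
        "</prosody></speak>"
    )


if __name__ == "__main__":
    print(suggest_emoji("sad"))
    print(to_ssml("I passed the exam", "happy"))
```

Keeping the mapping in a plain dictionary reflects the idea that the association between an emoji object and its operation is data that a user could customize, rather than logic fixed in the keyboard.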
In some examples, one or more emoji objects are associated with a vocal sound effect (e.g., laughter, sarcastic scoff, sharp breath in, grunt, sigh, or a disgusted “ugh” sound). When the user 104 selectively inserts an emoji object, a corresponding vocal sound effect is inserted at the beginning of, at the end of, or within the user's synthesized speech.
- According to an aspect, the expressive keyboard UI engine 110 is operative to receive the user's text input and selection of an emoji object, and communicate the input and selection to a voicesetting engine 114. The voicesetting engine 114 is illustrative of a software module, system, or device operative to receive user input, apply the predefined operation associated with the selected emoji object to apply speech or prosodic properties to the text input, and output a representation of the user's speech to a speech generation engine 118 for generating audible output 126 embodied as expressive synthesized speech. In some examples, the voicesetting engine 114 comprises an index of emoji objects and their corresponding operation(s), which includes prosodic attributes and/or vocal sound effects. The voicesetting engine 114 is operative to reference the index for applying the appropriate prosodic attributes and/or vocal sound effect to the user's text. According to an aspect, the voicesetting engine 114 specifies the text input and prosodic attributes and/or vocal sound effects via a markup language, such as Speech Synthesis Markup Language (SSML), which is output to a speech generation engine 118. The audible output 126 is played on an audio output device 128, which may be integrated with the user's computing device 102, or may be incorporated in another device utilized by a communication partner 120.
- In some examples, the expressive synthesized speech system 108 works in association with a visualization generation engine 116 for generating expressive visual output 122 for display on a visual output device 124 from the user's input. For example, the predefined operation or operations associated with each emoji object may include providing a visual feature, wherein each emoji object may be associated with one or more visual features (e.g., text, emoji, graphics, animations, video clips). When the user 104 selectively inserts an emoji object into the user's text input, the expressive keyboard UI engine 110 is operative to communicate the text input and the selected emoji object to the voicesetting engine 114, wherein the voicesetting engine applies the predefined operation associated with the selected emoji object to apply visual features to the text input, and outputs a representation of the visual features to a visualization generation engine 116 for generating visual output 122 for display on a visual output device 124. The visual output device 124 may be integrated with the user's computing device 102, or may be incorporated in another device utilized by a communication partner 120. In some examples, the visual output device 124 and the audio output device 128 are incorporated in a single device.
- According to another aspect, the expressive keyboard UI engine 110 is further operative to provide a plurality of punctuation objects (e.g., a period, comma, question mark, exclamation point), which when selectively inserted into the user's text, change prosodic attributes (e.g., silent space, pitch, speed, emphasis) of surrounding words. In some examples, a settings menu is provided for enabling the user to customize the prosodic attributes or vocal sound effects associated with emoji objects or to customize the prosodic attributes associated with punctuation objects.
- According to another aspect, the expressive keyboard UI engine 110 is further operative to provide a selectable active listening mode (ALM) for enabling the user 104 to select vocal sound effects and/or visual effects for communicating information when a communication partner 120 is speaking. For example, akin to using gestures, such as nodding, or non-verbal vocalizations, such as laughing, in standard communication, the user 104 is enabled to use ALM effects to provide feedback to the user's communication partner 120. According to examples, the expressive keyboard UI engine 110 is operative to provide an ALM command, which when selected, causes the expressive keyboard UI engine to display a plurality of selectable ALM effect options, wherein each ALM effect option is associated with a particular sound effect and/or visual effect. In some examples, the ALM effect options and associated sound or visual effects are customizable by the user 104. For example, the user 104 is enabled to select specific ALM effect options to display on the keyboard. In one example, voice-banked recordings may be associated with sound effect options. For example, a voice-banked phrase or other sound effect may be previously recorded by the user 104 or another individual and saved as a sound effect that can be selectively played or spoken by the expressive synthesized speech system 108 during a conversation with a communication partner 120. In another example, an expression-banked reaction may be previously recorded by the user 104 or another individual and saved as a visual effect that can be selectively displayed by the expressive synthesized speech system 108 during a conversation with a communication partner 120. Further, the user 104 may save a static image or a video clip as a visual effect.
- According to an aspect, the expressive keyboard UI engine 110 is operative to receive the user's ALM effect option selection, and communicate the selection to the voicesetting engine 114 for outputting a representation of the associated ALM effect to an audio output device 128 and/or to a visual output device 124. The expressive keyboard UI engine 110 is operative to communicate the selected ALM effect option to the voicesetting engine 114, wherein the voicesetting engine applies the predefined operation associated with the selected ALM effect option to provide audible features or visual features associated with the selected ALM effect option as output to a visualization generation engine 116 for generating visual output 122 for display on a visual output device 124 or to a speech generation engine 118 for generating audible output 126 for playback on an audio output device 128.
- Aspects of the expressive keyboard UI engine 110 enable single-click input by the user 104 to quickly and easily specify the expressive nature of their speech and to rapidly respond with expressive vocal sound effects while listening to others speak. Aspects of the expressive keyboard including the emoji objects, punctuation objects, and the ALM effect options will be described in further detail below with reference to FIGS. 2A-2P.
- With reference still to FIG. 1, the expressive synthesized speech system 108 further comprises a voicesetting editor 112 illustrative of a software module, system, or device operative to provide a GUI that allows the user 104 to modify various prosodic properties associated with text input for asynchronously authoring expressivity of synthesized speech (e.g., outside of a face-to-face conversation). In some examples, the text input includes text composed via interaction with the expressive keyboard. In other examples, the text input includes text in an existing text file which the user 104 is enabled to upload into the expressive synthesized speech system 108. The voicesetting editor 112 is operative to provide various controls for enabling the user 104 to edit various properties of the output speech via coarse input methods, such as via eye gaze or head mouse movement. According to aspects, the user 104 is enabled to add detailed pacing, emotional content, or prosodic content to text before the text is rendered as speech. For example, the voicesetting editor 112 provides controls for editing pause length, pitch, speed, or emphasis of specific text, and for inserting sound effects and/or visual effects. According to examples, the user 104 may utilize the UI provided by the voicesetting editor 112 when the user wants to have a higher degree of control over specifics of how the user's text will be spoken when played by an audio output device 128. For example, voicesetting text via the voicesetting editor 112 typically entails additional time, and may be used by the user 104 in a situation where the user is preparing text to speak in advance (e.g., before a medical appointment, before giving a public speech, when preparing stored phrases for repeated use). Aspects of the voicesetting editor 112 will be described in further detail below with respect to FIGS. 3A-3G.
- As described above, the expressive synthesized speech system 108 is operative to provide intuitive synthesized speech-authoring user interfaces via which the user 104 is enabled to efficiently and effectively author content for generating expressive output, such as prosody-enhanced speech, sound effects, and visual effects. Examples of synthesized speech-authoring user interfaces provided by the expressive synthesized speech system 108 are described below with reference to FIGS. 2A-3G. With reference now to FIG. 2A, an illustration of an example UI displayed on a display 204 of a computing device 102 and generated by aspects of an expressive synthesized speech system 108 is shown. According to various aspects, the expressive keyboard UI engine 110 is operative to generate an expressive synthesized speech system UI 202 including an expressive keyboard 206 for enabling the user 104 to provide text input 210, prosodic cue selections, and sound and/or visual effect selections. Examples of the expressive keyboard 206 include an alpha-numeric keyboard, which the user 104 is enabled to use to select letters and numbers to author textual content 210 for speech synthesis. In some examples, the expressive keyboard 206 is a gaze-based on-screen keyboard, wherein eye tracking is used to determine the user's gaze position, which is utilized as a cursor. For example, to enter text 210, the user 104 may gaze on letters, numbers, or other displayed keys of the on-screen keyboard. In other examples, the expressive keyboard 206 enables input via a head mouse input device 106, where the user's head movements are translated into mouse pointer movement. In other examples, other input device 106 types are used for inputting textual content and providing selections.
- Examples of the expressive keyboard 206 further include a plurality of selectable punctuation objects 208a-n (collectively, 208), which the user 104 is enabled to include with textual input 210 to specify the expressive nature of the textual input. According to aspects, each punctuation object 208 has a predetermined operation associated with it that specifies how particular prosodic attributes are to be applied to surrounding text, thus changing the expressive nature of the speech that will be generated from the text.
- For example, inclusion of a period, comma, or exclamation
point punctuation object 208 may operate to insert a default or user-customizable amount of silent space between pronouncing words or sentences, thus allowing the user 104 to set the cadence of his/her speech. As another example, inclusion of a single questionmark punctuation object 208 may operate to raise the pitch of a word located immediately prior to the question mark, and inclusion of two question mark punctuation objects may operate to raise the pitch of the two words located immediately prior to the question marks to emphasize that thetextual content 210 is a question. As can be appreciated, this can be useful in scenarios in which the user 104 asks a question that could be interpreted as a statement if not for the question mark(s) (e.g., “She is meeting us there.” vs “She is meeting us there?”). As another example, inclusion of an exclamationpoint punctuation object 208 may operate to increase the emotional tone, volume, or rate of speech of at least a portion of thetextual content 210, or to place emphasis on a specific word of the textual content. As should be appreciated, the punctuation objects 208 illustrated in the figures and described herein are non-limiting examples. Other punctuation objects 208 and other corresponding operations are possible and are within the scope of the present disclosure. - According to an aspect, the user 104 may enter
textual content 210, select apunctuation object 208, and then select to play the text with the punctuation object functionality applied. For example, the user 104 may select aplay command 212, which when selected, causes the expressivekeyboard UI engine 110 to pass thetextual content 210 and the selectedpunctuation object 208 to thevoicesetting engine 114 for application of the prosodic attributes or prosodic properties corresponding to the selected punctuation object to the textual content. - As illustrated in
FIG. 2B , theexpressive keyboard 206 further includes a selectable ALM (active listening mode)command 214, which when selected, causes the expressivesynthesized speech system 108 to enter an active listening mode and the expressivekeyboard UI engine 110 to display a plurality of selectableALM effect options 216 a-n (collectively, 216), wherein each ALM effect option is associated with a particular sound effect and/or visual effect that can be selectively communicated to acommunication partner 120. For example, selection of anALM effect option 216 enables the user 104 to provide rapid expressive reactions when thecommunication partner 120 is speaking. With reference now toFIG. 2C , a plurality of example selectableALM effect options 216 are illustrated. As described above, theALM effect options 216 provide the synthesized speech user 104 with single-click access to a variety of sound effects and/or visual effects, which in various examples, are customizable by the synthesized speech user. Although the exampleALM effect options 216 illustrated inFIG. 2C are displayed as text, other display options are possible, such as images or symbols. - According to an example and as illustrated in
FIG. 2D , a user 104 and acommunication partner 120 may have a conversation. The first step depicts thecommunication partner 120 talking, and the user 104 providing communication feedback to thecommunication partner 120 in the form of an “I'm listening” visual effect/output 122 that can be rendered on the communication partner'svisual output device 124 in response to a selection of anALM effect option 216. The second step depicts the user 104 selecting anotherALM effect option 216 for providing another visual effect and/or sound effect to communicate with thecommunication partner 120. For example, as thecommunication partner 120 is telling the user 104 a funny story, the user selects a “laugh”ALM effect option 216. Accordingly, as depicted in the third step, a corresponding visual effect/output 122 is displayed on the communication partner'svisual output device 124, and a corresponding sound effect/output 126 is played on the communication partner'saudio output device 128. As illustrated, the user 104 is enabled to provide rapid expressive reactions during a conversation. - According to aspects, a variety of
ALM effect options 216 may be provided in the expressive keyboard 206, and a variety of corresponding audio (sound) effects/output 126 and visual effects/output 122 may be provided responsive to a selection of an ALM effect option. Various examples of visual effects/output 122 corresponding to various ALM effect options 216 are illustrated in FIGS. 2E-2I. According to one example and with reference to FIG. 2E, an “I'm listening” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a visual output device 124, such as the communication partner's visual output device. The visual effect may be embodied as one or more visual features representing the selected ALM effect option 216, such as text 218 a saying “I'm listening,” an animated dot object 218 b indicating that the user 104 is listening, an icon or clipart 218 c illustrating that the user is listening, an emoticon 218 d portraying listening, an animated avatar 218 e, or a video clip 218 f that symbolizes listening.
- According to another example and with reference to FIG. 2F, an “I'm talking” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124. The visual effect may be embodied as one or more visual features representing the selected ALM effect option 216, such as text 220 a saying “I'm talking,” an animated dot object 220 b indicating that the user 104 is talking, an icon or clipart 220 c illustrating that the user is talking, an emoticon 220 d portraying talking, an animated avatar 220 e, or a video clip 220 f that characterizes talking.
- According to another example and with reference to FIG. 2G, an “I'm typing” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124. The visual effect may be embodied as one or more visual features representing the selected ALM effect option 216, such as text 220 a saying “I'm typing,” an animated dot object 220 b indicating that the user 104 is typing, an icon or clipart 220 c illustrating that the user is typing, an emoticon 220 d portraying typing, an animated avatar 220 e, or a video clip 220 f that characterizes typing or waiting for the user to finish typing.
- According to another example and with reference to FIG. 2H, a “hold on” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124. The visual effect may be embodied as one or more visual features representing the selected ALM effect option 216, such as text 220 a saying “hold on,” an animated dot object 220 b indicating to wait or hold on, an icon or clipart 220 c illustrating for the communication partner 120 to wait, an emoticon 220 d portraying waiting or holding, an animated avatar 220 e, or a video clip 220 f that characterizes waiting for the user 104.
- According to another example and with reference to FIG. 2I, a “pardon me?” ALM effect option 216 may be provided, which when selected may provide a visual effect/output 122 for display on a communication partner's visual output device 124. The visual effect may be embodied as one or more visual features representing the selected ALM effect option 216, such as text 220 a saying “pardon me?,” an animated dot object 220 b indicating a question, an icon or clipart 220 c illustrating a question, an emoticon 220 d portraying a question, an animated avatar 220 e, or a video clip 220 f that characterizes questioning.
- As should be appreciated, the examples described above and illustrated in FIGS. 2E-2I are not meant to be limiting. Other ALM effect options 216 and other visual effects/output 122 are possible and are within the scope of the disclosure. For example, the visual output 122 can range from concrete to ambiguous and from low to high resolution. According to an aspect, visual output 122 may be generated and provided automatically. For example, the expressive synthesized speech system 108 may detect that the user 104 is typing, and automatically provide a visual effect, such as one or more of the “I'm typing” visual features or the “I'm talking” visual features described above and illustrated in FIGS. 2F and 2G. Further, according to another aspect, the communication partner's visual output device 124 could be embodied as a physical object that is configured to convey the visual effects/output 122. For example, the visual output device 124 may include an animatronic robot configured to move appendages, change positions and/or change facial expressions to reflect the visual output 122.
- As described above and with reference now to
FIG. 2J, the expressive keyboard UI engine 110 is operative to provide a set of selectable icons or emoji objects 228 a-n (collectively, 228) that can be selectively inserted into the user's text 210. According to an example, certain emoji objects 228 may be intelligently and selectively displayed, for example, based on linguistic properties of the user's textual input or based on recognized user affects detected via a sensor (e.g., biometric sensors, facial expression sensors, body posture sensors, gesture sensors). Each emoji object 228 illustrates a particular emotion (e.g., sad, calm, happy, funny, sarcastic, surprised, irritated, angry), and is associated with a predefined speech operation or operations that can add a sound effect (e.g., laughter, a sharp breath, a sarcastic scoff, grunt, sigh, a disgusted “ugh” sound, an angry “argh” sound), change the tone of voice of the user's text 210 to a specified emotional state (e.g., sad, calm, happy, funny, sarcastic, irritated, angry), or a combination of both. For example, specific predetermined prosodic attribute settings may be associated with each emoji object 228. When an emoji object 228 is inserted, the specific predetermined prosodic attribute settings associated with the particular emoji object are applied to the user's text 210. As illustrated in FIG. 2J, the user 104 has entered text 210 and has selected an angry emoji object 228 i. Accordingly, responsive to the user's selection, the expressive synthesized speech system 108 is operative to insert an angry “argh” sound effect and modify the user's text 210 according to predefined prosodic attribute settings associated with the angry emoji object 228 i for expressing an angry tone when the text is played. Further, the user's text 210 includes a punctuation object 208, which in the illustrated example is a question mark. Accordingly, specific prosodic features may be applied to a portion of the user's text 210 according to the inserted punctuation object 208.
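As a rough illustration only, the way emoji objects and question marks might translate into prosodic settings can be sketched in Python. Everything named here is hypothetical and not taken from the disclosure: the `:angry:` style markers standing in for emoji objects 228, the `ProsodyPlan` structure, the `plan_prosody` function, and the `PITCH_RAISE` value. The sketch assumes one tone and one sound effect per emoji, and raises the pitch of as many trailing words as there are trailing question marks.

```python
from dataclasses import dataclass, field

# Hypothetical inline markers standing in for emoji objects 228; the tone
# names and sound effects follow the examples given in the description.
EMOJI_OPERATIONS = {
    ":angry:": ("angry", "argh"),
    ":funny:": ("funny", "laugh"),
}
PITCH_RAISE = 0.2  # assumed relative pitch increase, not a value from the text


@dataclass
class ProsodyPlan:
    words: list = field(default_factory=list)
    tone: str = "neutral"
    sound_effects: list = field(default_factory=list)
    pitch_raise: dict = field(default_factory=dict)  # word index -> raise


def plan_prosody(text: str) -> ProsodyPlan:
    """Split text into words and collect emoji and question-mark effects."""
    plan = ProsodyPlan()
    for piece in text.split():
        if piece in EMOJI_OPERATIONS:
            # An emoji sets the emotional tone and adds a vocal sound effect.
            tone, effect = EMOJI_OPERATIONS[piece]
            plan.tone = tone
            plan.sound_effects.append(effect)
        else:
            plan.words.append(piece)
    if plan.words:
        last = plan.words[-1]
        stripped = last.rstrip("?")
        marks = len(last) - len(stripped)  # number of trailing question marks
        if stripped:
            plan.words[-1] = stripped
        else:
            plan.words.pop()  # the last piece was only question marks
        # One question mark raises the last word, two raise the last two words.
        first = max(0, len(plan.words) - marks)
        for index in range(first, len(plan.words)):
            plan.pitch_raise[index] = PITCH_RAISE
    return plan
```

Under these assumptions, `plan_prosody("She is meeting us there?? :angry:")` yields an angry tone, an “argh” sound effect, and a raised pitch on the words “us” and “there”.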
For example, the pitch of the last one or more words may be raised to emphasize that the sentence is a question. - In some examples, emoji objects 228 are further associated with predefined visual operations, wherein selection and insertion of an emoji object 228 in the user's
text 210 causes the expressivesynthesized speech system 108 to provide a particular visual feature for display on the communication partner's device. Examples of various visual features that may be displayed in response to the user's selection of emoji objects 228 are illustrated inFIGS. 2K-2P . With reference now toFIG. 2K , responsive to a selection of ahappy emoji object 228 c, a visual effect/output 122 representing a happy expression may be provided for display on a communication partner'svisual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such astext 230 a saying “happy,” ananimated dot object 230 b characterizing a smiley face, an icon orclipart 230 c illustrating a smiley face, anemoticon 230 d portraying a smiley face, ananimated avatar 230 e personifying happiness, or avideo clip 230 f that characterizes happiness. - With reference now to
FIG. 2L , responsive to a selection of asad emoji object 228 a, a visual effect/output 122 representing a sad expression may be provided for display on a communication partner'svisual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such astext 232 a saying “sad,” ananimated dot object 232 b characterizing a sad face, an icon orclipart 232 c illustrating a sad face, anemoticon 232 d portraying a sad face, ananimated avatar 232 e personifying sadness, or avideo clip 232 f that characterizes sadness. - With reference now to
FIG. 2M, responsive to a selection of an angry emoji object 228 i, a visual effect/output 122 representing an angry expression may be provided for display on a communication partner's visual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such as text 234 a saying “angry,” an animated dot object 234 b characterizing an angry face, an icon or clipart 234 c illustrating an angry face, an emoticon 234 d portraying an angry face, an animated avatar 234 e personifying anger, or a video clip 234 f that characterizes anger.
- With reference now to
FIG. 2N , responsive to a selection of a funny emoji object 228 d, a visual effect/output 122 representing a funny expression may be provided for display on a communication partner'svisual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such astext 236 a saying “funny,” ananimated dot object 236 b characterizing a laughing face, an icon orclipart 236 c illustrating a laughing face, anemoticon 236 d portraying a laughing face, ananimated avatar 236 e personifying laughter, or avideo clip 236 f that characterizes laughter. - With reference now to
FIG. 2O , responsive to a selection of asarcastic emoji object 228 e, a visual effect/output 122 representing a sarcastic expression may be provided for display on a communication partner'svisual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such astext 238 a saying “sarcastic,” ananimated dot object 238 b characterizing a sarcastic face, an icon orclipart 238 c illustrating a sarcastic face, anemoticon 238 d portraying a sarcastic face, ananimated avatar 238 e personifying sarcasm, or avideo clip 238 f that characterizes sarcasm. - With reference now to
FIG. 2P , responsive to a selection of a love emoji object 228, a visual effect/output 122 representing a loving expression may be provided for display on a communication partner'svisual output device 124. The visual effect may be embodied as one or more visual features representing the selected emoji object 228, such astext 240 a saying “caring,” ananimated dot object 240 b characterizing a heart, an icon orclipart 240 c illustrating a heart, anemoticon 240 d portraying a heart, ananimated avatar 240 e personifying love, or avideo clip 240 f that characterizes love. As should be appreciated, the above examples are not limiting. Other emoji objects 228 and visual features are possible and are within the scope of the present disclosure. - As described above, the
voicesetting editor 112 is operative to provide a GUI that allows the user 104 to modify various prosodic properties associated with individual words withininput text 210 for explicitly authoring expressivity of synthesized speech. With reference now toFIG. 3A , the user 104 is enabled to select to use thevoicesetting editor 112, for example, via a selection of avoicesetting editor command 302. - In response and with reference now to
FIG. 3B , thevoicesetting editor 112 is operative to display avoicesetting editor UI 304 for enabling the user 104 to author specific prosodic properties of thetextual input 210. According to an aspect, thevoicesetting editor 112 is operative to parse theinput text 210 into three types of tokens 305: words 308,punctuation 310, and vocal sound effects 306 (e.g., derived from selected emoji objects 228 in the input text 210). As illustrated inFIG. 3B , thetokens 305 are displayed in reading order. According to an aspect, padding 312 is provided between the tokens for allowing sufficiently large gaze targets for gaze-based input. In some examples,sound effect tokens 306 include a textual description of the vocal sound effect they represent (e.g., “laugh” for laughter, “argh” for angry effect). According to aspects, thevoicesetting editor 112 allows for setting prosodic properties 314 onindividual tokens 305 or over ranges of tokens. - Selection of a single token, for example, by dwell-clicking on a
specific token 305, opens a token editing interface, such as the exampletoken editing interface 316 illustrated inFIG. 3C , for editing prosodic properties 314 associated with a single token. Thetoken editing interface 316 includes the selectedtoken 305 displayed in the center of the interface, wherein modifiable prosodic properties 314 are displayed in a radial menu surrounding the token. According to an aspect, a radial menu design used for single token editing reduces the amount of distance the user's eye must travel during gaze-based input to change properties 314 of the token, thus enabling efficient user interaction and increasing user interaction performance. - Aspects of the
voicesetting editor UI 304 allow for selecting a range oftokens 305 for editing. In one example and as illustrated inFIG. 3B , arange selector 318 is provided, which when selected, allows the user 104 to select a range oftokens 305. For example, after selecting therange selector 318, the user 104 may select thefirst token 305 in the desired range, followed by a selection of the last token in the desired range. Responsive to selecting a range oftokens 305, thevoicesetting editor 112 is operative to display a token editing interface, such as the exampletoken editing interface 320 illustrated inFIG. 3D , for editing prosodic properties 314 associated with arange 322 oftokens 305. - According to examples, in the
token editing interface, the prosodic properties 314 applicable to the selected token 305 are displayed. Some prosodic properties 314 can be applied to all three types of token 305, for example, emotional tone, rate of speech, volume, and pitch. Other prosodic properties 314 are adjustable only for particular token types. For example, word tokens 308 have an emphasis property 314 b that allows the user 104 to specify which words should be emphasized. As another example, punctuation tokens 310 have a pause property 314 that allows the user 104 to specify an amount of silent time to synthesize between the pronunciations of words.
- Upon selection of a prosodic property 314, the
voicesetting editor 112 provides functionalities for enabling the user 104 to adjust the value of the property. According to an aspect, a set of predetermined value ranges are provided for prosodic properties 314. In an example and as illustrated inFIG. 3E , ameaningful label 324 is associated with each value. For example, theemphasis property 314 b may have thelabels 324 and values: “normal” (a default value that the user 104 is enabled to configure), “strong” (80%), and “very strong” (100%).Value adjusters - With reference to
FIGS. 3F and 3G, prosodic property values are displayed in the voicesetting editor UI 304 using visual properties. In one example, emotional tone 314 f, volume 314 d, and rate of speech 314 e properties are displayed as lines over tokens 305. According to an example, the labels 324 associated with the properties are displayed on or near the lines. According to another example, voice pitch 314 c is displayed using a corresponding baseline height 334 (e.g., the higher the baseline height, the higher the pitch, the lower the baseline height, the lower the pitch). According to another example, emphasis 314 b is displayed using font boldness 336 (e.g., bolder font corresponds to heavier emphasis). According to another example, pause length 314 g is displayed using an ellipse 332, wherein the width of the ellipse 332 corresponds to the pause length according to inter-word or punctuation tokens.
- Having described an operating environment and various user interface display examples with respect to
FIGS. 1-3G, FIGS. 4A and 4B are flow charts showing general stages involved in example methods 400, 420. With reference now to FIG. 4A, the method 400 begins at start OPERATION 402, and proceeds to OPERATION 404, where the expressive keyboard 206 is displayed in the expressive synthesized speech system UI 202.
- The
method 400 proceeds to OPERATION 406, where textual input 210 is received from the user 104. Further, a selection of an expressive operator may be received. For example, the user 104 may select an emoji object 228 or a punctuation object 208 for insertion into the textual input.
- At DECISION OPERATION 408, a determination is made as to whether to launch the voicesetting editor 112 for editing prosodic properties 314 associated with the user's textual content 210. For example, the determination may be made based on whether the user 104 selects a voicesetting editor command 302. When a determination is made to launch the voicesetting editor 112, the method 400 proceeds to OPERATION 410, where the voicesetting editor 112 parses the textual content 210 and any selected punctuation objects 208 or emoji objects 228, and displays a voicesetting editor UI 304 for allowing the user 104 to adjust or refine prosodic properties 314 for crafting the rendering of the user's content by a synthetic voice.
- The method 400 proceeds to OPERATION 412, where the user 104 makes one or more prosodic property 314 selections, and at OPERATION 414, the prosodic attributes, vocal sound effects, or visual effects associated with the selected emoji object 228 and/or punctuation object 208 are applied to the textual input 210.
- At OPERATION 416, the combined textual input 210 and prosodic attributes, vocal sound effects, or visual effects are output to a speech generation engine 118 for generating expressive audible output 126 for playback on an audio output device 128 or to a visualization generation engine 116 for generating expressive visual output 122 for display on a visual output device 124. The method 400 ends at OPERATION 418.
- With reference now to FIG. 4B, the method 420 begins at start OPERATION 422, and proceeds to OPERATION 424, where the expressive keyboard 206 is displayed in the expressive synthesized speech system UI 202.
- The method 420 proceeds to OPERATION 426, where a selection to launch the active listening mode (ALM) is received. For example, the user 104 may select an ALM command 214 displayed on the expressive keyboard 206.
- The method 420 proceeds to OPERATION 428, where responsive to the ALM command 214 selection, the expressive synthesized speech system 108 enters an active listening mode and the expressive keyboard UI engine 110 displays a plurality of selectable ALM effect options 216, wherein each ALM effect option 216 is associated with a particular sound effect and/or visual effect that can be selectively communicated to a communication partner 120.
- The
method 420 proceeds to OPERATION 430, where a selection of an ALM effect option 216 is received. At OPERATION 432, the expressive synthesized speech system 108 identifies the vocal sound effect and/or visual effect corresponding to the selected ALM effect option 216, and outputs the corresponding vocal sound effect and/or visual effect to a speech generation engine 118 or visualization generation engine 116 for generating audible output 126/visual output 122 for playback/display on a communication partner's 120 device 128/124. The method 420 ends at OPERATION 434.
- While implementations have been described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a computer, those skilled in the art will recognize that aspects may also be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types.
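As one possible illustration of such program modules, the selection handling of method 420 (OPERATIONS 430 and 432) can be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `AlmEffect` structure, the `ALM_EFFECT_OPTIONS` table, the `handle_alm_selection` function, and the two callbacks standing in for the visualization generation engine 116 and the speech generation engine 118 are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AlmEffect:
    """Visual and/or sound effect tied to one ALM effect option (hypothetical)."""
    visual: Optional[str] = None
    sound: Optional[str] = None


# Assumed option table; the option labels follow the examples in the text,
# while the effect descriptions are placeholders.
ALM_EFFECT_OPTIONS = {
    "I'm listening": AlmEffect(visual="I'm listening"),
    "hold on": AlmEffect(visual="hold on"),
    "laugh": AlmEffect(visual="laughing face", sound="laugh"),
}


def handle_alm_selection(
    option: str,
    show_visual: Callable[[str], None],
    play_sound: Callable[[str], None],
) -> AlmEffect:
    """Identify the effect for the selected option and route it to outputs."""
    effect = ALM_EFFECT_OPTIONS[option]  # raises KeyError for unknown options
    if effect.visual is not None:
        show_visual(effect.visual)  # stands in for visualization engine 116
    if effect.sound is not None:
        play_sound(effect.sound)  # stands in for speech generation engine 118
    return effect
```

Passing the output paths in as callbacks keeps the lookup independent of any particular display or audio device, which fits the description's point that an option may carry a sound effect, a visual effect, or both.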
- The aspects and functionalities described herein may operate via a multitude of computing systems including, without limitation, desktop computer systems, wired and wireless computing systems, mobile computing systems (e.g., mobile telephones, netbooks, tablet or slate type computers, notebook computers, and laptop computers), hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, and mainframe computers.
- In addition, according to an aspect, the aspects and functionalities described herein operate over distributed systems (e.g., cloud-based computing systems), where application functionality, memory, data storage and retrieval and various processing functions are operated remotely from each other over a distributed computing network, such as the Internet or an intranet. According to an aspect, user interfaces and information of various types are displayed via on-board computing device displays or via remote display units associated with one or more computing devices. For example, user interfaces and information of various types are displayed and interacted with on a wall surface onto which user interfaces and information of various types are projected. Interaction with the multitude of computing systems with which implementations are practiced include, keystroke entry, touch screen entry, voice or other audio entry, gesture entry where an associated computing device is equipped with detection (e.g., camera) functionality for capturing and interpreting user gestures for controlling the functionality of the computing device, and the like.
-
FIGS. 5-7 and the associated descriptions provide a discussion of a variety of operating environments in which examples are practiced. However, the devices and systems illustrated and discussed with respect toFIGS. 5-7 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that are utilized for practicing aspects, described herein. -
FIG. 5 is a block diagram illustrating physical components (i.e., hardware) of acomputing device 500 with which examples of the present disclosure may be practiced. In a basic configuration, thecomputing device 500 includes at least oneprocessing unit 502 and asystem memory 504. According to an aspect, depending on the configuration and type of computing device, thesystem memory 504 comprises, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories. According to an aspect, thesystem memory 504 includes anoperating system 505 and one ormore program modules 506 suitable for running software applications 550. According to an aspect, thesystem memory 504 includes the expressivesynthesized speech system 108. Theoperating system 505, for example, is suitable for controlling the operation of thecomputing device 500. Furthermore, aspects are practiced in conjunction with a graphics library, other operating systems, or any other application program, and are not limited to any particular application or system. This basic configuration is illustrated inFIG. 5 by those components within a dashedline 508. According to an aspect, thecomputing device 500 has additional features or functionality. For example, according to an aspect, thecomputing device 500 includes additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated inFIG. 5 by aremovable storage device 509 and anon-removable storage device 510. - As stated above, according to an aspect, a number of program modules and data files are stored in the
system memory 504. While executing on the processing unit 502, the program modules 506 (e.g., expressive synthesized speech system 108) perform processes including, but not limited to, one or more of the stages of the methods 400, 420 illustrated in FIGS. 4A and 4B. According to an aspect, other program modules are used in accordance with examples and include applications 550 such as electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
- According to an aspect, aspects are practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, aspects are practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in
FIG. 5 are integrated onto a single integrated circuit. According to an aspect, such an SOC device includes one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit. When operating via an SOC, the functionality, described herein, is operated via application-specific logic integrated with other components of thecomputing device 500 on the single integrated circuit (chip). According to an aspect, aspects of the present disclosure are practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, aspects are practiced within a general purpose computer or in any other circuits or systems. - According to an aspect, the
computing device 500 has one or more input device(s) 512 such as a keyboard, a mouse, a pen, a sound input device, a touch input device, etc. The output device(s) 514 such as a display, speakers, a printer, etc. are also included according to an aspect. The aforementioned devices are examples and others may be used. According to an aspect, thecomputing device 500 includes one ormore communication connections 516 allowing communications withother computing devices 518. Examples ofsuitable communication connections 516 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports. - The term computer readable media as used herein include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The
system memory 504, theremovable storage device 509, and thenon-removable storage device 510 are all computer storage media examples (i.e., memory storage.) According to an aspect, computer storage media includes RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by thecomputing device 500. According to an aspect, any such computer storage media is part of thecomputing device 500. Computer storage media does not include a carrier wave or other propagated data signal. - According to an aspect, communication media is embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. According to an aspect, the term “modulated data signal” describes a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
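As an illustration of the term “modulated data signal” defined above, the following minimal Python sketch encodes bits by setting one characteristic of a carrier wave, namely its amplitude (on-off keying). The function names, the sample counts, and the choice of on-off keying are assumptions made for illustration only and are not taken from the disclosure.

```python
import math


def modulate(bits, samples_per_bit=8, cycles_per_bit=2):
    """Encode bits by switching a sine carrier on (1) or off (0).

    Hypothetical helper: the carrier amplitude is the characteristic
    that is "set or changed" to carry the information.
    """
    signal = []
    for bit in bits:
        for n in range(samples_per_bit):
            carrier = math.sin(2 * math.pi * cycles_per_bit * n / samples_per_bit)
            signal.append(bit * carrier)
    return signal


def demodulate(signal, samples_per_bit=8):
    """Recover bits by measuring the energy in each bit-length window."""
    # A full-amplitude sine window has energy of about samples_per_bit / 2,
    # so half of that value is used as the decision threshold.
    threshold = samples_per_bit / 4
    bits = []
    for start in range(0, len(signal), samples_per_bit):
        window = signal[start:start + samples_per_bit]
        energy = sum(sample * sample for sample in window)
        bits.append(1 if energy > threshold else 0)
    return bits


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0]
    print(demodulate(modulate(message)) == message)
```

Other characteristics of the carrier, such as frequency or phase, could be changed instead; amplitude is used here only because it keeps the sketch short.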
FIGS. 6A and 6B illustrate a mobile computing device 600, for example, a mobile telephone, a smart phone, a tablet personal computer, a laptop computer, and the like, with which aspects may be practiced. With reference to FIG. 6A, an example of a mobile computing device 600 for implementing the aspects is illustrated. In a basic configuration, the mobile computing device 600 is a handheld computer having both input elements and output elements. The mobile computing device 600 typically includes a display 605 and one or more input buttons 610 that allow the user to enter information into the mobile computing device 600. According to an aspect, the display 605 of the mobile computing device 600 functions as an input device (e.g., a touch screen display). If included, an optional side input element 615 allows further user input. According to an aspect, the side input element 615 is a rotary switch, a button, or any other type of manual input element. In alternative examples, the mobile computing device 600 incorporates more or fewer input elements. For example, the display 605 may not be a touch screen in some examples. In alternative examples, the mobile computing device 600 is a portable phone system, such as a cellular phone. According to an aspect, the mobile computing device 600 includes an optional keypad 635. According to an aspect, the optional keypad 635 is a physical keypad. According to another aspect, the optional keypad 635 is a “soft” keypad generated on the touch screen display. In various aspects, the output elements include the display 605 for showing a graphical user interface (GUI), a visual indicator 620 (e.g., a light emitting diode), and/or an audio transducer 625 (e.g., a speaker). In some examples, the mobile computing device 600 incorporates a vibration transducer for providing the user with tactile feedback. In yet another example, the mobile computing device 600 incorporates a peripheral device port 640, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port), for sending signals to or receiving signals from an external device.
FIG. 6B is a block diagram illustrating the architecture of one example of a mobile computing device. That is, the mobile computing device 600 incorporates a system (i.e., an architecture) 602 to implement some examples. In one example, the system 602 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players). In some examples, the system 602 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.

According to an aspect, one or more application programs 650 are loaded into the memory 662 and run on or in association with the operating system 664. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth. According to an aspect, the expressive synthesized speech system 108 is loaded into the memory 662. The system 602 also includes a non-volatile storage area 668 within the memory 662. The non-volatile storage area 668 is used to store persistent information that should not be lost if the system 602 is powered down. The application programs 650 may use and store information in the non-volatile storage area 668, such as e-mail or other messages used by an e-mail application, and the like. A synchronization application (not shown) also resides on the system 602 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 668 synchronized with corresponding information stored at the host computer. As should be appreciated, other applications may be loaded into the memory 662 and run on the mobile computing device 600.

According to an aspect, the system 602 has a power supply 670, which is implemented as one or more batteries. According to an aspect, the power supply 670 further includes an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.

According to an aspect, the system 602 includes a radio 672 that performs the function of transmitting and receiving radio frequency communications. The radio 672 facilitates wireless connectivity between the system 602 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio 672 are conducted under control of the operating system 664. In other words, communications received by the radio 672 may be disseminated to the application programs 650 via the operating system 664, and vice versa.

According to an aspect, the visual indicator 620 is used to provide visual notifications and/or an audio interface 674 is used for producing audible notifications via the audio transducer 625. In the illustrated example, the visual indicator 620 is a light emitting diode (LED) and the audio transducer 625 is a speaker. These devices may be directly coupled to the power supply 670 so that when activated, they remain on for a duration dictated by the notification mechanism even though the processor 660 and other components might shut down for conserving battery power. The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device. The audio interface 674 is used to provide audible signals to and receive audible signals from the user. For example, in addition to being coupled to the audio transducer 625, the audio interface 674 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation. According to an aspect, the system 602 further includes a video interface 676 that enables an operation of an on-board camera 630 to record still images, video stream, and the like.

According to an aspect, a mobile computing device 600 implementing the system 602 has additional features or functionality. For example, the mobile computing device 600 includes additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 6B by the non-volatile storage area 668.

According to an aspect, data/information generated or captured by the mobile computing device 600 and stored via the system 602 is stored locally on the mobile computing device 600, as described above. According to another aspect, the data is stored on any number of storage media that are accessible by the device via the radio 672 or via a wired connection between the mobile computing device 600 and a separate computing device associated with the mobile computing device 600, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated, such data/information is accessible via the mobile computing device 600 via the radio 672 or via a distributed computing network. Similarly, according to an aspect, such data/information is readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
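The storage options described above (data kept locally on the device, or on separate storage media reached via the radio or a wired connection) can be sketched as interchangeable backends behind one interface. This is a minimal sketch under stated assumptions: the class names `LocalStore`, `RemoteStore`, and `DeviceStorage` are hypothetical, a plain dictionary stands in for the server, and the "read locally first, then fall back to remote" policy is an illustrative choice that the disclosure does not specify.

```python
from typing import Dict, Optional, Protocol


class Store(Protocol):
    """Minimal interface shared by every storage backend (hypothetical)."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class LocalStore:
    """Stands in for storage on the mobile device itself."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._items.get(key)


class RemoteStore:
    """Stands in for a server reached over a radio or wired link.

    A dictionary replaces the network so the sketch stays runnable.
    """

    def __init__(self, server: Dict[str, bytes]) -> None:
        self._server = server

    def put(self, key: str, value: bytes) -> None:
        self._server[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._server.get(key)


class DeviceStorage:
    """Writes to the chosen backend; reads locally first, then remotely."""

    def __init__(self, local: Store, remote: Store) -> None:
        self.local = local
        self.remote = remote

    def save(self, key: str, value: bytes, remote: bool = False) -> None:
        target = self.remote if remote else self.local
        target.put(key, value)

    def load(self, key: str) -> Optional[bytes]:
        value = self.local.get(key)
        return value if value is not None else self.remote.get(key)
```

A caller would construct `DeviceStorage(LocalStore(), RemoteStore(server))` and use `save` and `load` without needing to know where a given item resides.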
FIG. 7 illustrates one example of the architecture of a system for generating expressive content as described above. Content developed, interacted with, or edited in association with the expressive synthesized speech system 108 is enabled to be stored in different communication channels or other storage types. For example, various documents may be stored using a directory service 722, a web portal 724, a mailbox service 726, an instant messaging store 728, or a social networking site 730. The expressive synthesized speech system 108 is operative to use any of these types of systems or the like for generating expressive content, as described herein. According to an aspect, a server 720 provides the expressive synthesized speech system 108 to clients 705a, 705b, and 705c. As one example, the server 720 is a web server providing the expressive synthesized speech system 108 over the web. The server 720 provides the expressive synthesized speech system 108 over the web to clients 705 through a network 740. By way of example, the client computing device is implemented and embodied in a personal computer 705a, a tablet computing device 705b, or a mobile computing device 705c (e.g., a smart phone), or other computing device. Any of these examples of the client computing device are operable to obtain content from the store 716.

Implementations, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to aspects. The functions/acts noted in the blocks may occur out of the order as shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
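The point that two blocks shown in succession may be executed substantially concurrently can be illustrated with a short Python sketch. The two block functions and their contents are hypothetical placeholders, and the sketch assumes that neither block depends on the other's output, which is what makes concurrent or reordered execution safe.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def block_a(text: str) -> str:
    """Hypothetical first flowchart block."""
    return text.upper()


def block_b(text: str) -> int:
    """Hypothetical second flowchart block, independent of block_a."""
    return len(text)


def run_in_order(text: str) -> Tuple[str, int]:
    """Execute the blocks one after the other, as drawn in a flowchart."""
    return block_a(text), block_b(text)


def run_concurrently(text: str) -> Tuple[str, int]:
    """Submit both blocks at once and collect their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(block_a, text)
        future_b = pool.submit(block_b, text)
        return future_a.result(), future_b.result()


if __name__ == "__main__":
    # Both strategies yield the same results because the blocks are independent.
    print(run_in_order("hello") == run_concurrently("hello"))
```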
The description and illustration of one or more examples provided in this application are not intended to limit or restrict the scope as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode. Implementations should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an example with a particular set of features. Having been provided with the description and illustration of the present application, one skilled in the art may envision variations, modifications, and alternate examples falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/713,749 US20220230374A1 (en) | 2016-11-09 | 2022-04-05 | User interface for generating expressive content |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/347,653 US11321890B2 (en) | 2016-11-09 | 2016-11-09 | User interface for generating expressive content |
US17/713,749 US20220230374A1 (en) | 2016-11-09 | 2022-04-05 | User interface for generating expressive content |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/347,653 Continuation US11321890B2 (en) | 2016-11-09 | 2016-11-09 | User interface for generating expressive content |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220230374A1 true US20220230374A1 (en) | 2022-07-21 |
Family
ID=62063961
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/347,653 Active US11321890B2 (en) | 2016-11-09 | 2016-11-09 | User interface for generating expressive content |
US17/713,749 Pending US20220230374A1 (en) | 2016-11-09 | 2022-04-05 | User interface for generating expressive content |
Country Status (1)
Country | Link |
---|---|
US (2) | US11321890B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024098117A1 (en) * | 2022-11-10 | 2024-05-16 | Jimple Pty Ltd | Communication aid, communication system, and associated methods |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150068609A (en) * | 2013-12-12 | 2015-06-22 | 삼성전자주식회사 | Method and apparatus for displaying image information |
US11064565B2 (en) * | 2014-02-14 | 2021-07-13 | ATOM, Inc. | Systems and methods for personifying interactive displays used in hotel guest rooms |
US20180124002A1 (en) * | 2016-11-01 | 2018-05-03 | Microsoft Technology Licensing, Llc | Enhanced is-typing indicator |
US11321890B2 (en) * | 2016-11-09 | 2022-05-03 | Microsoft Technology Licensing, Llc | User interface for generating expressive content |
JP6646001B2 (en) * | 2017-03-22 | 2020-02-14 | 株式会社東芝 | Audio processing device, audio processing method and program |
JP2018159759A (en) * | 2017-03-22 | 2018-10-11 | 株式会社東芝 | Voice processor, voice processing method and program |
EP3602539A4 (en) * | 2017-03-23 | 2021-08-11 | D&M Holdings, Inc. | System providing expressive and emotive text-to-speech |
WO2018173295A1 (en) * | 2017-03-24 | 2018-09-27 | ヤマハ株式会社 | User interface device, user interface method, and sound operation system |
US11402909B2 (en) | 2017-04-26 | 2022-08-02 | Cognixion | Brain computer interface for augmented reality |
US11237635B2 (en) | 2017-04-26 | 2022-02-01 | Cognixion | Nonverbal multi-input and feedback devices for user intended computer control and communication of text, graphics and audio |
US10755724B2 (en) * | 2017-05-04 | 2020-08-25 | Rovi Guides, Inc. | Systems and methods for adjusting dubbed speech based on context of a scene |
JP7082357B2 (en) * | 2018-01-11 | 2022-06-08 | ネオサピエンス株式会社 | Text-to-speech synthesis methods using machine learning, devices and computer-readable storage media |
US10726603B1 (en) | 2018-02-28 | 2020-07-28 | Snap Inc. | Animated expressive icon |
US11049299B2 (en) * | 2018-09-26 | 2021-06-29 | The Alchemy Of You, Llc | System and method for improved data structures and related interfaces |
JP7243106B2 (en) * | 2018-09-27 | 2023-03-22 | 富士通株式会社 | Correction candidate presentation method, correction candidate presentation program, and information processing apparatus |
JP6993314B2 (en) * | 2018-11-09 | 2022-01-13 | 株式会社日立製作所 | Dialogue systems, devices, and programs |
US10976991B2 (en) * | 2019-06-05 | 2021-04-13 | Facebook Technologies, Llc | Audio profile for personalized audio enhancement |
US11289067B2 (en) * | 2019-06-25 | 2022-03-29 | International Business Machines Corporation | Voice generation based on characteristics of an avatar |
CN110297928A (en) * | 2019-07-02 | 2019-10-01 | 百度在线网络技术(北京)有限公司 | Recommended method, device, equipment and the storage medium of expression picture |
US10762890B1 (en) * | 2019-08-19 | 2020-09-01 | Voicify, LLC | Development of voice and other interaction applications |
US11322135B2 (en) * | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
CN113051427A (en) | 2019-12-10 | 2021-06-29 | 华为技术有限公司 | Expression making method and device |
US11695758B2 (en) * | 2020-02-24 | 2023-07-04 | International Business Machines Corporation | Second factor authentication of electronic devices |
USD984457S1 (en) * | 2020-06-19 | 2023-04-25 | Airbnb, Inc. | Display screen of a programmed computer system with graphical user interface |
US11991013B2 (en) | 2020-06-19 | 2024-05-21 | Airbnb, Inc. | Incorporating individual audience member participation and feedback in large-scale electronic presentation |
USD985005S1 (en) * | 2020-06-19 | 2023-05-02 | Airbnb, Inc. | Display screen of a programmed computer system with graphical user interface |
US11825004B1 (en) | 2023-01-04 | 2023-11-21 | Mattel, Inc. | Communication device for children |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5796916A (en) * | 1993-01-21 | 1998-08-18 | Apple Computer, Inc. | Method and apparatus for prosody for synthetic speech prosody determination |
US6081780A (en) * | 1998-04-28 | 2000-06-27 | International Business Machines Corporation | TTS and prosody based authoring system |
US20070233494A1 (en) * | 2006-03-28 | 2007-10-04 | International Business Machines Corporation | Method and system for generating sound effects interactively |
US20100123724A1 (en) * | 2008-11-19 | 2010-05-20 | Bradford Allen Moore | Portable Touch Screen Device, Method, and Graphical User Interface for Using Emoji Characters |
US20100332224A1 (en) * | 2009-06-30 | 2010-12-30 | Nokia Corporation | Method and apparatus for converting text to audio and tactile output |
US20110040155A1 (en) * | 2009-08-13 | 2011-02-17 | International Business Machines Corporation | Multiple sensory channel approach for translating human emotions in a computing environment |
US7908554B1 (en) * | 2003-03-03 | 2011-03-15 | Aol Inc. | Modifying avatar behavior based on user action or mood |
US20120130717A1 (en) * | 2010-11-19 | 2012-05-24 | Microsoft Corporation | Real-time Animation for an Expressive Avatar |
US8521533B1 (en) * | 2000-11-03 | 2013-08-27 | At&T Intellectual Property Ii, L.P. | Method for sending multi-media messages with customized audio |
US20130247078A1 (en) * | 2012-03-19 | 2013-09-19 | Rawllin International Inc. | Emoticons for media |
US20140067397A1 (en) * | 2012-08-29 | 2014-03-06 | Nuance Communications, Inc. | Using emoticons for contextual text-to-speech expressivity |
US20140181229A1 (en) * | 2012-08-15 | 2014-06-26 | Imvu Inc. | System and method for increasing clarity and expressiveness in network communications |
US20160132292A1 (en) * | 2013-06-07 | 2016-05-12 | Openvacs Co., Ltd. | Method for Controlling Voice Emoticon in Portable Terminal |
US20160140952A1 (en) * | 2014-08-26 | 2016-05-19 | ClearOne Inc. | Method For Adding Realism To Synthetic Speech |
US20160259526A1 (en) * | 2015-03-03 | 2016-09-08 | Kakao Corp. | Display method of scenario emoticon using instant message service and user device therefor |
US20170083506A1 (en) * | 2015-09-21 | 2017-03-23 | International Business Machines Corporation | Suggesting emoji characters based on current contextual emotional state of user |
US11321890B2 (en) * | 2016-11-09 | 2022-05-03 | Microsoft Technology Licensing, Llc | User interface for generating expressive content |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7401020B2 (en) | 2002-11-29 | 2008-07-15 | International Business Machines Corporation | Application of emotion-based intonation and prosody to speech in text-to-speech systems |
US7382868B2 (en) * | 2002-04-02 | 2008-06-03 | Verizon Business Global Llc | Telephony services system with instant communications enhancements |
US20060257827A1 (en) | 2005-05-12 | 2006-11-16 | Blinktwice, Llc | Method and apparatus to individualize content in an augmentative and alternative communication device |
GB0702150D0 (en) | 2007-02-05 | 2007-03-14 | Amegoworld Ltd | A Communication Network and Devices |
US8886537B2 (en) | 2007-03-20 | 2014-11-11 | Nuance Communications, Inc. | Method and system for text-to-speech synthesis with personalized voice |
US9495450B2 (en) * | 2012-06-12 | 2016-11-15 | Nuance Communications, Inc. | Audio animation methods and apparatus utilizing a probability criterion for frame transitions |
US20150046164A1 (en) | 2013-08-07 | 2015-02-12 | Samsung Electronics Co., Ltd. | Method, apparatus, and recording medium for text-to-speech conversion |
US20150100537A1 (en) * | 2013-10-03 | 2015-04-09 | Microsoft Corporation | Emoji for Text Predictions |
US9824681B2 (en) | 2014-09-11 | 2017-11-21 | Microsoft Technology Licensing, Llc | Text-to-speech with emotional content |
Also Published As
Publication number | Publication date |
---|---|
US20180130459A1 (en) | 2018-05-10 |
US11321890B2 (en) | 2022-05-03 |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PARADISO, ANN M.;CAMPBELL, JONATHAN;CUTRELL, EDWARD BRYAN;AND OTHERS;SIGNING DATES FROM 20161110 TO 20161120;REEL/FRAME:059507/0828 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |