GB2565589A - Reactive speech synthesis - Google Patents
Reactive speech synthesis
- Publication number
- GB2565589A (Application GB1713273.9A; also published as GB201713273A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- interruption
- audio
- region
- response
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
A method of responding to an interruption in a speech synthesis system (e.g. conversational agents, AI therapists, virtual personal assistants). In response to an interruption event (e.g. when a user has tried to interrupt the dialogue), the system takes as input the reaction type and the earliest reaction time. New audio is then created which is suitable for splicing. The system then modifies the audio (e.g. by selecting different units, applying DSP modification, or stopping before the end of the text) based on the type of modification, introducing different speaking styles to provide a natural response at a speech interface. The interruption may be configured within a particular region such as a phonetic region, word boundary region, or phrase level region. Appropriate settings may also be chosen, such as switching to Lombard speech or tailing off in a polite manner.
Description
Reactive speech synthesis
FIELD OF THE INVENTION
Embodiments of the present invention relate to synthesising speech in real-time applications. In particular, the embodiments relate to synthesising speech where the synthesis output needs to be modified in response to real-time events.
BACKGROUND TO THE INVENTION
During natural conversation the content, style, and prosody of speech can change very suddenly. For example, if one of the interlocutors in a dialogue interrupts the other, the original speaker will not simply finish their sentence without regard to the interruption. Instead they might attempt to 'hold the floor', a term which means that they will continue to speak over the interruption and not allow the other participant to speak. Alternatively they may 'pass the floor', which is the opposite, i.e. allow the other interlocutor to speak instead. In either case there are changes to the prosody and style of speaking. For example, if the speaker is trying to hold the floor they will increase their volume, slow their speaking rate, and increase their pitch, a combination of effects known as 'Lombard speech'. Current speech synthesisers do not have the ability to perform the necessary changes to their output audio.
Requirements
The requirements of such a system have already been explored in academia. To summarise the results from the literature, these are:
1. A system needs to be responsive and flexible like human interlocutors.
2. It needs to process information incrementally and continuously rather than in large chunks; in other words, it must be able to start speaking before processing has finished.
3. It must know what has already been said.
4. It must be able to halt and then continue or break off, as well as be able to stop entirely.
5. It must be able to operate in real-time.
6. It must be able to make edits to as-yet unspoken parts of the utterance to change delivery parameters such as speaking rate or pitch.
Additionally we found:
1. Speakers only interrupt their speech at a syllable boundary or syllable nucleus; the syllable nucleus is the central part of the syllable, which is usually a vowel.
2. There are never changes on certain consonants such as "plosives", for example the sound of a 'p' or 'b' in English.
3. The reaction cannot be too fast, otherwise it will sound unnatural. It takes a speaker time, in the order of a few hundred milliseconds, to process that their interlocutor has done something that needs a reaction and to decide what the appropriate reaction would be; yet the reaction would be noticeable if it were absent.
Currently there are two approaches one could take based on existing speech synthesis engines:
Modifying the speech with additional libraries
In this approach an application developer who is using the speech library modifies audio generated by the speech synthesiser using Digital Signal Processing (DSP). The problems with this approach are:
1. It creates additional dependencies within end user applications.
2. The application developer needs a much deeper understanding of speech production to know what effects need to be applied to make the speech reaction sound natural.
3. DSP is prone to introducing errors into the audio, known as artefacts, if the speech is modified too much.
Resynthesising the audio
The alternative approach is to resynthesise the audio from scratch. To effect changes in the audio, most speech synthesisers support at least some parts of Speech Synthesis Markup Language (SSML). This approach is not suitable either, for the following reasons:
1. It does not meet the requirement that it processes information incrementally and continuously.
2. SSML does not normally give the option to change audio at a phoneme level.
3. It would be impossible to guarantee that the audio does not change before the point of interruption, as synthesisers take the surrounding context into account in order to sound natural. This in turn means it would be very challenging to splice in the audio at the correct time.
BRIEF DESCRIPTION OF THE INVENTION
The invention provides a simple interface for creating reactions in speech synthesis.
Using this interface the synthesis engine provides the following benefits:
• A simple call where audio is regenerated up to a user-specified point in time, without re-synthesis.
• Prior to the user chosen point the audio is guaranteed to be identical to what was previously synthesised, enabling easy splicing.
• The process for reacting can be started at the beginning of a phrase rather than at the start of the speech.
• The user may specify a type of reaction, for example stopping without modification, switching to Lombard speech, or tailing off politely.
• The exact changes in the speech such as pitch and speaking rate are appropriate for the chosen reaction and have been developed by the developers of the synthesis engine.
• The synthesiser will give options to pick different times for the reaction: importantly the points mentioned in the previous section, but also other options such as instantaneous reactions or word boundaries.
• The application can then continue with the original synthesis, insert some additional synthesised speech or audio before continuing, or discard the remaining audio.
This interface can be invoked by developers using special functions in the synthesis engine's Software Development Kit (SDK) or Application Programming Interface (API) or through extensions to other speech APIs such as Microsoft's Speech API (SAPI) depending on how they are interfacing with the engine. Alternatively the reactions can be included in SSML with the use of new SSML tags. This gives the option for both developers and end users who are only using SSML to interface with the engine to include reactions.
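As a rough illustration, the kind of engine call described above might look as follows in a hypothetical SDK. All names here (`Engine`, `synthesise`, `react`, the parameter names) are invented for illustration and are not the engine's actual API; audio is modelled as a plain list of numbers, with one "sample" per character standing in for real synthesis.

```python
# Minimal sketch of a hypothetical reaction call; the class, method,
# and parameter names are illustrative assumptions, not a real SDK.

class Engine:
    def __init__(self):
        self.audio = []          # previously synthesised samples

    def synthesise(self, text):
        # Stand-in for real synthesis: one "sample" per character.
        self.audio = [ord(c) for c in text]
        return self.audio

    def react(self, reaction_type, earliest_sample, boundary="natural"):
        """Regenerate audio from a cut point; samples before the cut
        are guaranteed identical to the original, enabling splicing."""
        head = self.audio[:earliest_sample]          # untouched prefix
        tail = self.audio[earliest_sample:]
        if reaction_type == "sudden_stop":
            tail = []                                # halt immediately
        elif reaction_type == "polite_stop":
            tail = [int(s * 0.5) for s in tail[:5]]  # quieter, shorter tail
        return head + tail

engine = Engine()
original = engine.synthesise("hello world")
reacted = engine.react("sudden_stop", earliest_sample=5)
# The prefix before the cut point is bit-identical to the original,
# which is the property that makes splicing trivial for the caller.
assert reacted == original[:5]
```

The key design point this sketch captures is the guarantee from the bullet list above: everything before the user-chosen point is returned unchanged, so the dialogue system never needs to re-buffer or crossfade.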
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1 to 4 - typical use case
Figure 1 - System inputs
Figure 2 - Modification start point
Figure 3 - Applied modifications
Figure 4 - Splicing
Figure 5 - Examples of interruption recorded from actual speakers
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
The embodiments of the present invention are implemented on a computer system. This system can be used in any dialogue system backed by a computer that is capable of running a synthesis engine and either playing the audio directly or streaming it to another playback device, and that is similarly connected to a microphone for detecting when reactions are appropriate. Examples of direct audio uses are, but are not limited to, robots, conversational agents, virtual personal assistants, and "AI therapists". An example where the audio would be streamed is a smart home, where a central server generates the audio and manages the dialogue but individual speakers and microphones are located on remote "dumb terminals".
A typical use case of the system is where a reaction is generated by a dialogue system after it has detected some action from an end user. A specific example, shown in the accompanying figures, is the audio being changed to acknowledge that the user has tried to interrupt the dialogue system, which will pass the floor while demonstrating irritation. The system takes as input the reaction type, the earliest time, and optionally a boundary type, see Figure 1. The system then creates new audio which is suitable for splicing. Importantly, it does not re-synthesise audio prior to the splice point. It then modifies the speech according to the reaction type, starting at an appropriate point after the splice point, see Figure 2. After the modification start point the system modifies the audio by selecting different units, and/or applying DSP modifications, and/or stopping before the end of the text, based on the type of modification, to introduce different speaking styles. In the example timing shown in Figure 3 the speech is modified to Lombard speech and stops three words later. The system then passes the audio back to the dialogue system, which splices in the audio at the splice point as in Figure 4. The dialogue system now has the option to include new speech that is appropriate for the reaction, and then either continue with the previously synthesised audio or stop entirely; Figure 4 shows how the dialogue system can modify future audio. Figure 5 shows an example of this sort of interaction from recorded human speech. The following sections describe the embodiment of the system in greater detail.
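The splice step in this flow amounts to concatenating the unchanged prefix with the newly generated reaction audio. A minimal sketch, treating audio as a plain sample list (sample-accurate alignment in a real engine is of course more involved, and the function name is invented for the example):

```python
def splice(original_audio, reaction_audio, splice_index):
    """Replace everything from splice_index onwards with the reaction.

    Because the engine guarantees the reaction audio is identical to
    the original up to the splice point, the caller only needs to keep
    the prefix and append the new tail.
    """
    return original_audio[:splice_index] + reaction_audio

# A tailing-off reaction replacing the last half of an utterance.
original = [0, 1, 2, 3, 4, 5]
reaction_tail = [9, 9]
assert splice(original, reaction_tail, 3) == [0, 1, 2, 9, 9]
```

After the splice the dialogue system is free to append further synthesised speech, resume the original audio, or discard the remainder, exactly as the paragraph above describes.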
Modification start point selection
To select a start point for the reaction the system takes as input the earliest acceptable start time and the phonetic transcription of the existing synthesised speech. The phonetic transcription is generated simultaneously with the synthesised speech and is returned to the user with the generated speech. The user is responsible for storing the transcription within their own system.
When synthesising the reaction the system first determines where in the phonetic transcription the earliest possible cut time falls. It then needs to pick a cut point based on what the user requested. In the case where the earliest time is during the speech, and not part of a leading or trailing pause, the following boundary types can be handled by the system:
1. Phrase ending. In this case the system simply removes any final pause and returns the original synthesised speech.
2. Word boundary. The system starts the reaction between the next two words. The system keeps track of word boundaries within the phonetic transcription.
3. Syllable boundary. The system also keeps track of syllable boundaries within the phonetic transcription, so the system simply starts the reaction at the next available syllable boundary. Word boundaries are also syllable boundaries.
4. Natural cut point. In this case the system checks each phoneme from the earliest possible time against a list of acceptable phonemes in which to place a boundary. The list is generated based on human speech production, so the system will only break on the same phonemes that a human would. Alternatively, if more appropriate, the system will select a syllable boundary as the start of the reaction.
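The boundary selection above can be sketched as a single scan over the timed phonetic transcription. This is an illustrative reconstruction under assumptions: the per-phone record is invented, and the set of acceptable natural-cut phonemes is a made-up subset (the patent's real list is derived from human production data); note it deliberately excludes plosives such as 'p' and 'b'.

```python
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str
    start: float              # start time in seconds
    word_boundary: bool       # a word boundary precedes this phone
    syllable_boundary: bool   # a syllable boundary precedes this phone

# Invented subset of phonemes where a natural cut is assumed acceptable.
NATURAL_CUT = {"a", "e", "i", "o", "u", "ou", "n", "m", "l", "s"}

def pick_cut_point(transcription, earliest, boundary_type):
    """Return the reaction start time for the requested boundary type,
    or None to signal falling back to the phrase ending."""
    for p in transcription:
        if p.start < earliest:
            continue                      # before the earliest allowed time
        if boundary_type == "word" and p.word_boundary:
            return p.start
        if boundary_type == "syllable" and p.syllable_boundary:
            return p.start                # word boundaries count here too
        if boundary_type == "natural" and (p.symbol in NATURAL_CUT
                                           or p.syllable_boundary):
            return p.start
    return None

phones = [
    Phone("h", 0.00, True, True),
    Phone("e", 0.05, False, False),
    Phone("l", 0.10, False, False),
    Phone("ou", 0.15, False, True),
    Phone("w", 0.25, True, True),
]
assert pick_cut_point(phones, 0.06, "word") == 0.25
assert pick_cut_point(phones, 0.06, "syllable") == 0.15
assert pick_cut_point(phones, 0.06, "natural") == 0.10
```

The three assertions illustrate the ordering in the list above: for the same earliest time, a natural cut can land earlier than a syllable boundary, which in turn can land earlier than a word boundary.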
Modification process
Once the start point of the reaction is chosen the system can make appropriate modifications. In the simplest case the system uses simple DSP techniques to change the speech. Alternatively the synthesiser can change targeting for unit selection, or input features in a parametric system.
Some specific use cases showing the type of modification to be made (see Figure 5 for the spectral effects):
1. Sudden stop. This case is more useful for debugging: instead of modifying the speech, the output is suddenly halted.
2. Angry stop. Targeting is changed to Lombard speech patterns and DSP tools are applied on a phoneme-by-phoneme basis to achieve Lombard speech. One or two words are kept so the difference can be heard, but most of the speech is dropped from the utterance.
3. Polite stop. Targeting is changed to lower pitch and amplitude and increased duration, creating the prosody of someone tailing off in their speech and allowing the other speaker to take the floor.
4. Speaking over. Lombard speech patterns are introduced from the point of interruption to the end of the utterance. No words are dropped; in this case the system is generating speech suitable for speaking over the interlocutor rather than passing the floor.
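The four cases above can be summarised as a mapping from reaction type to prosodic parameter changes. Every numeric value below is an invented placeholder for illustration only; a real engine would tune these per voice and per reaction.

```python
# Hypothetical reaction profiles; all values are illustrative guesses.
REACTION_PROFILES = {
    "sudden_stop": {"drop_all": True},
    "angry_stop":  {"pitch_semitones": +2.0, "gain_db": +6.0, "rate": 0.8,
                    "keep_words": 2},      # Lombard speech, then cut short
    "polite_stop": {"pitch_semitones": -2.0, "gain_db": -6.0, "rate": 0.7,
                    "keep_words": 3},      # tail off and pass the floor
    "speak_over":  {"pitch_semitones": +2.0, "gain_db": +6.0, "rate": 0.8,
                    "keep_words": None},   # Lombard to the end, nothing dropped
}

def words_to_keep(reaction_type, remaining_words):
    """How many of the as-yet unspoken words survive the modification."""
    profile = REACTION_PROFILES[reaction_type]
    if profile.get("drop_all"):
        return 0
    keep = profile.get("keep_words")
    return remaining_words if keep is None else min(keep, remaining_words)

assert words_to_keep("sudden_stop", 7) == 0
assert words_to_keep("angry_stop", 7) == 2
assert words_to_keep("speak_over", 7) == 7
```

Keeping the profiles as data rather than branching code reflects the benefit claimed earlier: the engine's developers, not the application developer, decide what pitch, rate, and amplitude changes make each reaction sound natural.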
Claims (6)
- Claim 1: A method for responding to an interruption or event in a speech synthesis system, whereby a speech synthesiser can dynamically regenerate and insert modified speech audio output in order to provide improved naturalness in a responsive speech interface.
- Claim 2: A method according to claim 1, wherein a speech synthesis system can be interrupted at a specific future point in time in response to an interruption or event, taking into account reaction and processing times.
- Claim 3: A method according to claim 1, whereby a set of appropriate speech synthesis settings can be chosen in response to an interruption, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
- Claim 4: A method according to claim 3, whereby a speech synthesiser can seamlessly switch from previously generated audio to audio content including an appropriate interruption response.
- Claim 5: A method according to claim 3, whereby a user can select an appropriate interruption response, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
- Claim 6: A method according to claim 2, whereby an interruption point may be configured within a particular region, including but not limited to a phonetic region, word boundary region, phrase level region, or an instantaneous reaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1713273.9A GB2565589A (en) | 2017-08-18 | 2017-08-18 | Reactive speech synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201713273D0 GB201713273D0 (en) | 2017-10-04 |
GB2565589A true GB2565589A (en) | 2019-02-20 |
Family
ID=59996624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1713273.9A Pending GB2565589A (en) | 2017-08-18 | 2017-08-18 | Reactive speech synthesis |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2565589A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5548681A (en) * | 1991-08-13 | 1996-08-20 | Kabushiki Kaisha Toshiba | Speech dialogue system for realizing improved communication between user and system |
US6240390B1 (en) * | 1998-05-18 | 2001-05-29 | Winbond Electronics Corp. | Multi-tasking speech synthesizer |
JP2008051883A (en) * | 2006-08-22 | 2008-03-06 | Canon Inc | Voice synthesis control method and apparatus |
EP2009620A1 (en) * | 2007-06-25 | 2008-12-31 | Fujitsu Limited | Phoneme length adjustment for speech synthesis |
US20130066632A1 (en) * | 2011-09-14 | 2013-03-14 | At&T Intellectual Property I, L.P. | System and method for enriching text-to-speech synthesis with automatic dialog act tags |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | ||
S28 | Restoration of ceased patents (sect. 28/pat. act 1977) |
Free format text: APPLICATION FILED |