GB2565589A - Reactive speech synthesis - Google Patents

Reactive speech synthesis

Info

Publication number
GB2565589A
GB2565589A
Authority
GB
United Kingdom
Prior art keywords
speech
interruption
audio
region
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB1713273.9A
Other versions
GB201713273D0 (en)
Inventor
Aylett Matthew
Potard Blaise
Braude David
Pidcock Christopher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cereproc Ltd
Original Assignee
Cereproc Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cereproc Ltd filed Critical Cereproc Ltd
Priority to GB1713273.9A
Publication of GB201713273D0
Publication of GB2565589A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A method of responding to an interruption in a speech synthesis system (e.g. conversational agents, AI therapists, virtual personal assistants). In response to an interruption event (e.g. when a user has tried to interrupt the dialogue), the system takes as input the reaction type and earliest time. New audio is then created which is suitable for splicing. The system then modifies the audio (e.g. selecting different units, applying DSP modification, stopping before the end of the text) based on the type of modification, introducing different speaking styles to provide a natural response at a speech interface. The interruption may be configured within a particular region such as a phonetic region, word boundary region, or phrase level region. Appropriate settings may also be chosen, such as switching to Lombard speech or tailing off in a polite manner.

Description

Reactive speech synthesis
FIELD OF THE INVENTION
Embodiments of the present invention relate to synthesising speech in real-time applications. In particular, the embodiments relate to synthesising speech where the synthesis output needs to be modified in response to real-time events.
BACKGROUND TO THE INVENTION
During natural conversation the content, style, and prosody of speech can change very suddenly. For example, if one of the interlocutors in a dialogue interrupts the other, the original speaker will not simply finish their sentence without regard to the interruption. Instead they might attempt to 'hold the floor', a term meaning that they will continue to speak over the interruption and not allow the other participant to speak. Alternatively they may 'pass the floor', which is the opposite, i.e. allow the other interlocutor to speak instead. In either case there are changes to the prosody and style of speaking. For example, if the speaker is trying to hold the floor they will increase their volume, slow their speaking rate, and increase their pitch, a combination of effects known as 'Lombard speech'. Current speech synthesisers do not have the ability to perform the necessary changes to their output audio.
Requirements
The requirements of such a system have already been explored in academia. Summarising the results from the literature, these are:
1. A system needs to be responsive and flexible like human interlocutors.
2. It needs to process information incrementally and continuously rather than in large chunks; in other words, it must be able to start speaking before processing has finished.
3. It must know what has already been said.
4. It must be able to halt and then continue or break off, as well as be able to stop.
5. It must be able to operate in real-time.
6. It must be able to make edits to as-yet unspoken parts of the utterance to change delivery parameters such as speaking rate or pitch.
Additionally we found:
1. Speakers only interrupt their speech at a syllable boundary or syllable nucleus; the syllable nucleus is the central part of the syllable and is usually a vowel.
2. Changes never occur on certain consonants such as "plosives", for example the sounds of 'p' or 'b' in English.
3. The reaction cannot be too fast, otherwise it will sound unnatural: it takes a speaker on the order of a few hundred milliseconds to register that their interlocutor has done something requiring a reaction and to decide what the appropriate reaction would be, yet the delay would be noticeable if it were absent. These constraints are illustrated in the brief sketch below.
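By way of illustration only, these timing and boundary constraints can be expressed as a small decision function. The following sketch is in Python; the phoneme labels, the 200 ms delay value, the field names and the function name are assumptions chosen purely for illustration and are not taken from any particular synthesis engine.

# Illustrative sketch: decide whether a candidate phone is an acceptable
# point at which to begin a spoken reaction. All names and values here are
# hypothetical and chosen only to illustrate the constraints listed above.

MIN_REACTION_DELAY_S = 0.2                        # assumed "few hundred milliseconds"
PLOSIVES = {"p", "b", "t", "d", "k", "g"}          # never change speech on these

def acceptable_reaction_point(interruption_time_s, candidate):
    """candidate describes one phone in the transcription, e.g.
    {"phone": "a", "start_s": 1.32, "syllable_boundary": True, "nucleus": True}."""
    # 1. Do not react faster than a human speaker plausibly could.
    if candidate["start_s"] - interruption_time_s < MIN_REACTION_DELAY_S:
        return False
    # 2. Never place a change on a plosive consonant.
    if candidate["phone"] in PLOSIVES:
        return False
    # 3. Only interrupt at a syllable boundary or syllable nucleus.
    return candidate["syllable_boundary"] or candidate.get("nucleus", False)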
Currently there are two approaches one could take based on existing speech synthesis engines:
Modifying the speech with additional libraries
In this approach an application developer who is using the speech library modifies audio generated by the speech synthesiser using Digital Signal Processing (DSP). The problems with this approach are:
1. It creates additional dependencies within end user applications.
2. The application developer needs a much deeper understanding of speech production to know what effects need to be applied to make the speech reaction sound natural.
3. DSP is prone to making errors in the audio - known as artefacts - if the speech is modified too much.
Resynthesising the audio
The alternative approach is to resynthesise the audio from scratch. To effect changes in the audio, most speech synthesisers support at least some parts of Speech Synthesis Markup Language (SSML). This approach is not suitable either, for the following reasons:
1. It does not meet the requirement that it processes information incrementally and continuously.
2. SSML does not normally give the option to change audio at a phoneme level.
3. It would be impossible to guarantee that the audio does not change before the point of interruption, because synthesisers take the surrounding context into account in order to sound natural. This in turn means it would be very challenging to splice in audio at the correct time.
BRIEF DESCRIPTION OF THE INVENTION
The invention provides a simple interface for creating reactions in speech synthesis.
Using this interface the synthesis engine provides the following benefits:
• A simple call where audio is regenerated up to a user-specified point in time, without re-synthesis.
• Prior to the user-chosen point the audio is guaranteed to be identical to what was previously synthesised, enabling easy splicing.
• The process for reacting can be started at the beginning of a phrase rather than at the start of the speech.
• The user may specify a type of reaction, for example stopping without modification, switching to Lombard speech, or tailing off politely.
• The exact changes in the speech such as pitch and speaking rate are appropriate for the chosen reaction and have been developed by the developers of the synthesis engine.
• The synthesiser gives options to pick different times for the reaction: most importantly the points mentioned in the previous section, but also others such as instantaneous reactions or word boundaries.
• The application can then continue with the original synthesis, insert some additional synthesised speech or audio before continuing, or discard the remaining audio.
This interface can be invoked by developers using special functions in the synthesis engine's Software Development Kit (SDK) or Application Programming Interface (API), or through extensions to other speech APIs such as Microsoft's Speech API (SAPI), depending on how they are interfacing with the engine. Alternatively, the reactions can be included in SSML with the use of new SSML tags. This gives developers, and end users who interface with the engine only through SSML, the option to include reactions.
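By way of illustration only, the sketch below shows how such a reaction call might look to a developer. It is written in Python; the class, method and parameter names and the SSML tag are hypothetical stand-ins, not the actual names used by any SDK, by SAPI, or by the SSML specification.

# Hypothetical sketch of how a reaction call might be exposed in an SDK.
# All names and return formats below are assumptions for illustration only.

class ReactiveSynthesiser:
    """Stand-in for a synthesis engine exposing a reaction interface."""

    def synthesise(self, text):
        # Would return audio samples plus a phonetic transcription.
        return {"text": text, "samples": [], "transcription": []}

    def react(self, utterance, reaction_type, earliest_time_s, boundary):
        # Would regenerate audio from a suitable point after earliest_time_s,
        # guaranteed identical before it, modified according to reaction_type.
        return {"splice_sample": int(earliest_time_s * 16000),  # assuming 16 kHz audio
                "samples": []}

engine = ReactiveSynthesiser()
utterance = engine.synthesise("I'm afraid the meeting has been moved to Thursday.")

# The dialogue system detects the user interrupting 1.8 s into playback and
# requests a polite tail-off starting at the next natural cut point.
reaction = engine.react(
    utterance,
    reaction_type="polite_stop",        # e.g. "sudden_stop", "angry_stop", "speak_over"
    earliest_time_s=1.8,
    boundary="natural_cut_point",
)

# The same request expressed through a hypothetical SSML extension tag:
ssml_request = '<reaction type="polite_stop" boundary="natural_cut_point" earliest="1.8s"/>'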
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1 to 4 - typical use case
Figure 1 - System inputs
Figure 2 - Modification start point
Figure 3 - Applied modifications
Figure 4 - Splicing
Figure 5 - Examples of interruption recorded from actual speakers
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
The embodiments of the present invention are implemented on a computer system. This system can be used in any dialogue system backed by a computer capable of running a synthesis engine, either playing the audio directly or streaming it to another playback device, and similarly connected to a microphone for detecting when reactions are appropriate. Examples of direct audio uses include, but are not limited to, robots, conversational agents, virtual personal assistants, and "AI therapists". An example where the audio would be streamed is a smart home in which a central server generates the audio and manages the dialogue, but individual speakers and microphones are located on remote "dumb terminals".
A typical use case of the system is where a reaction is generated by a dialogue system after it has detected some action from an end user. A specific example, shown in the accompanying figures, is the audio being changed to acknowledge that the user has tried to interrupt the dialogue system, which then passes the floor while demonstrating irritation. The system takes as input the reaction type, the earliest time, and optionally a boundary type, see Figure 1. The system then creates new audio which is suitable for splicing; importantly, it does not re-synthesise audio prior to the splice point. It then modifies the speech according to the reaction type, starting at an appropriate point after the splice point, see Figure 2. After the modification start point the system modifies the audio by selecting different units, and/or applying DSP modifications, and/or stopping before the end of the text, based on the type of modification, to introduce different speaking styles. In the example timing shown in Figure 3 the speech is modified to Lombard speech and stops three words later. The system then passes the audio back to the dialogue system, which splices in the audio at the splice point as in Figure 4. The dialogue system now has the option to include new speech that is appropriate for the reaction, and then either continue with the previously synthesised audio or stop entirely; Figure 4 shows how the dialogue system can modify future audio. Figure 5 shows an example of this sort of interaction from recorded human speech. The following sections describe the embodiment of the system in greater detail.
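A minimal sketch of the splice step from the dialogue system's side is given below, assuming audio is held as plain sample arrays; the field names "splice_sample" and "samples" are assumptions for illustration, not a real engine's output format.

# Illustrative sketch of the splice performed on the dialogue-system side.

def splice_reaction(original_samples, reaction):
    """Keep the original audio up to the splice point, then append the
    regenerated reaction audio. Because the engine does not re-synthesise
    anything before the splice point, the join is seamless."""
    return original_samples[:reaction["splice_sample"]] + reaction["samples"]

# Example with dummy data: splice a short reaction in at sample 16000
# (one second into 16 kHz audio).
original = [0.0] * 48000
reaction = {"splice_sample": 16000, "samples": [0.0] * 8000}
output = splice_reaction(original, reaction)
assert output[:16000] == original[:16000]   # unchanged before the splice point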
Modification start point selection
To select a start point for the reaction the system takes as input the earliest acceptable start time and the phonetic transcription of the existing synthesised speech. The phonetic transcription is generated simultaneously with the synthesised speech and is returned to the user with the generated speech. The user is responsible for storing the transcription within their own system.
When synthesising the reaction the system first determines where in the phonetic transcription the earliest possible cut time is. It then needs to pick a cut point based on what the user requested. In the case where the earliest time is during the speech and not part of a leading or trailing pause, the following boundary types can be handled by the system (an illustrative sketch of this selection is given after the list):
1. Phrase ending. In this case the system simply removes any final pause and returns the original synthesised speech.
2. Word boundary. The system will simply start the reaction between the next two words. The system keeps track of word boundaries within the phonetic transcription.
3. Syllable boundary. The system also keeps track of syllable boundaries within the phonetic transcription, so the system simply starts the reaction at the next available syllable boundary. Word boundaries are also syllable boundaries.
4. Natural cut point. In this case the system checks each phoneme from the earliest possible time against a list of acceptable phonemes in which to place a boundary. The list is generated based on human production, so the system will only break on the same phonemes that a human would. Alternatively, if more appropriate, the system will select a syllable boundary as the start of the reaction.
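The following is a minimal illustrative sketch of this selection logic, assuming the phonetic transcription is a time-ordered list of phone records; the record fields and the list of acceptable cut phonemes are assumptions made for illustration and are not the engine's actual data.

# Illustrative sketch of modification start point selection over a
# phonetic transcription.

ACCEPTABLE_CUT_PHONES = {"a", "e", "i", "o", "u", "m", "n", "l"}   # assumed list

def select_start_point(transcription, earliest_s, boundary_type):
    """transcription: list of dicts like
    {"phone": "k", "start_s": 0.42, "word_boundary": False, "syllable_boundary": False},
    ordered in time. Returns the chosen start time, or None for a phrase ending."""
    candidates = [p for p in transcription if p["start_s"] >= earliest_s]
    if boundary_type == "phrase_ending":
        return None                       # keep the speech, just drop any final pause
    for p in candidates:
        if boundary_type == "word_boundary" and p["word_boundary"]:
            return p["start_s"]
        if boundary_type == "syllable_boundary" and p["syllable_boundary"]:
            return p["start_s"]
        if boundary_type == "natural_cut_point" and (
                p["syllable_boundary"] or p["phone"] in ACCEPTABLE_CUT_PHONES):
            return p["start_s"]
    return candidates[-1]["start_s"] if candidates else earliest_s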
Modification process
Once the start point of the reaction is chosen, the system can make appropriate modifications. In the simplest case the system uses simple DSP techniques to change the speech. Alternatively the synthesiser can change targeting for unit selection, or input features in a parametric system.
Some specific use cases showing the type of modification to be made are listed below; see Figure 5 for the spectral effects, and the sketch after the list for an illustration:
1. Sudden stop. This case is mainly useful for debugging: instead of modifying the speech, the output is halted immediately.
2. Angry stop. Targeting is changed to Lombard speech patterns and DSP tools are applied on a phoneme-by-phoneme basis to achieve Lombard speech. One or two words are kept so that the difference can be heard, but most of the remaining speech is dropped from the utterance.
3. Polite stop. Targeting is changed to lower the pitch and amplitude and to increase the duration, creating the prosody of someone tailing off in their speech and allowing the other speaker to take the floor.
4. Speaking over. Lombard speech patterns are introduced from the point of interruption to the end of the utterance. No words are dropped; in this case the system is generating speech suitable for speaking over the interlocutor rather than passing the floor.
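By way of illustration, the sketch below expresses these use cases as modification targets applied from the chosen start point onwards. The parameter names and numeric values are assumptions for illustration, not tuned engine settings; an engine could equally realise the same targets by re-targeting unit selection or parametric input features rather than by DSP.

# Illustrative sketch of per-reaction modification targets.

REACTION_TARGETS = {
    "sudden_stop": {"halt": True},
    "angry_stop":  {"pitch_scale": 1.3,  "rate_scale": 0.8, "gain_db": 6.0,  "keep_words": 2},
    "polite_stop": {"pitch_scale": 0.85, "rate_scale": 0.7, "gain_db": -6.0, "keep_words": 2},
    "speak_over":  {"pitch_scale": 1.3,  "rate_scale": 0.8, "gain_db": 6.0,  "keep_words": None},
}

def plan_modifications(words_after_start, reaction_type):
    """Return (words to keep after the start point, per-phoneme DSP targets)."""
    target = REACTION_TARGETS[reaction_type]
    if target.get("halt"):
        return [], {}                          # drop everything after the cut point
    if target["keep_words"] is None:
        kept = list(words_after_start)         # speaking over: keep the whole utterance
    else:
        kept = words_after_start[:target["keep_words"]]
    dsp = {k: target[k] for k in ("pitch_scale", "rate_scale", "gain_db")}
    return kept, dsp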

Claims (6)

  1. Claim 1: A method for responding to an interruption or event in a speech synthesis system, whereby a speech synthesiser can dynamically regenerate and insert modified speech audio output in order to provide improved naturalness in a responsive speech interface.
  2. Claim 2: A method according to claim 1, wherein a speech synthesis system can be interrupted at a specific future point in time in response to an interruption or event, taking into account reaction and processing times.
  3. Claim 3: A method according to claim 1, whereby a set of appropriate speech synthesis settings can be chosen in response to an interruption, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
  4. Claim 4: A method according to claim 3, whereby a speech synthesiser can seamlessly switch from previously generated audio to audio content including an appropriate interruption response.
  5. Claim 5: A method according to claim 3, whereby a user can select an appropriate interruption response, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
  6. Claim 6: A method according to claim 2, whereby an interruption point may be configured within a particular region, including but not limited to a phonetic region, word boundary region, phrase level region, or instantaneous.
GB1713273.9A 2017-08-18 2017-08-18 Reactive speech synthesis Pending GB2565589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1713273.9A GB2565589A (en) 2017-08-18 2017-08-18 Reactive speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1713273.9A GB2565589A (en) 2017-08-18 2017-08-18 Reactive speech synthesis

Publications (2)

Publication Number Publication Date
GB201713273D0 GB201713273D0 (en) 2017-10-04
GB2565589A true GB2565589A (en) 2019-02-20

Family

ID=59996624

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1713273.9A Pending GB2565589A (en) 2017-08-18 2017-08-18 Reactive speech synthesis

Country Status (1)

Country Link
GB (1) GB2565589A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548681A (en) * 1991-08-13 1996-08-20 Kabushiki Kaisha Toshiba Speech dialogue system for realizing improved communication between user and system
US6240390B1 (en) * 1998-05-18 2001-05-29 Winbond Electronics Corp. Multi-tasking speech synthesizer
JP2008051883A (en) * 2006-08-22 2008-03-06 Canon Inc Voice synthesis control method and apparatus
EP2009620A1 (en) * 2007-06-25 2008-12-31 Fujitsu Limited Phoneme length adjustment for speech synthesis
US20130066632A1 (en) * 2011-09-14 2013-03-14 At&T Intellectual Property I, L.P. System and method for enriching text-to-speech synthesis with automatic dialog act tags

Also Published As

Publication number Publication date
GB201713273D0 (en) 2017-10-04

Similar Documents

Publication Publication Date Title
US11664011B2 (en) Clockwork hierarchal variational encoder
US7490042B2 (en) Methods and apparatus for adapting output speech in accordance with context of communication
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
US9922641B1 (en) Cross-lingual speaker adaptation for multi-lingual speech synthesis
US10199034B2 (en) System and method for unified normalization in text-to-speech and automatic speech recognition
Buschmeier et al. Combining incremental language generation and incremental speech synthesis for adaptive information presentation
EP3776531A1 (en) Clockwork hierarchical variational encoder
Betz et al. Micro-structure of disfluencies: Basics for conversational speech synthesis
US20130066632A1 (en) System and method for enriching text-to-speech synthesis with automatic dialog act tags
US9710552B2 (en) User driven audio content navigation
Yamagishi et al. Robustness of HMM-based speech synthesis
WO2018034169A1 (en) Dialogue control device and method
Cohn et al. Prosodic differences in human-and Alexa-directed speech, but similar local intelligibility adjustments
CN117642814A (en) Robust direct speech-to-speech translation
JP6712754B2 (en) Discourse function estimating device and computer program therefor
WO2023209632A1 (en) Voice attribute conversion using speech to speech
Cutler et al. Vowel devoicing and the perception of spoken Japanese words
GB2565589A (en) Reactive speech synthesis
JP4964695B2 (en) Speech synthesis apparatus, speech synthesis method, and program
US11948550B2 (en) Real-time accent conversion model
Wester et al. Real-Time Reactive Speech Synthesis: Incorporating Interruptions.
US20080077407A1 (en) Phonetically enriched labeling in unit selection speech synthesis
Valentini-Botinhao et al. Intelligibility analysis of fast synthesized speech
JP6424419B2 (en) Voice control device, voice control method and program
JP2011175304A (en) Voice interactive device and method

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)
S28 Restoration of ceased patents (sect. 28/pat. act 1977)

Free format text: APPLICATION FILED