GB2565589A - Reactive speech synthesis - Google Patents
Reactive speech synthesis
- Publication number
- GB2565589A (Application GB1713273.9A; also published as GB201713273A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- interruption
- audio
- region
- response
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
A method of responding to an interruption in a speech synthesis system (e.g. conversational agents, AI therapists, virtual personal assistants). In response to an interruption event (e.g. when a user has tried to interrupt the dialogue), the system takes as input the reaction type and the earliest reaction time. New audio is then created which is suitable for splicing. The system then modifies the audio (e.g. by selecting different units, applying DSP modification, or stopping before the end of the text) based on the type of modification, introducing different speaking styles to provide a natural response at a speech interface. The interruption may be configured within a particular region such as a phonetic region, word boundary region, or phrase level region. Appropriate settings may also be chosen, such as switching to Lombard speech or tailing off in a polite manner.
Description
Reactive speech synthesis
FIELD OF THE INVENTION
Embodiments of the present invention relate to synthesising speech in real-time applications. In particular, the embodiments relate to synthesising speech where the synthesis output needs to be modified in response to real-time events.
BACKGROUND TO THE INVENTION
During natural conversation the content, style, and prosody of speech can change very suddenly. For example, if one of the interlocutors in a dialogue interrupts the other, the original speaker will not simply finish their sentence without regard to the interruption. Instead they might attempt to 'hold the floor', a term which means that they will continue to speak over the interruption and not allow the other participant to speak. Alternatively they may 'pass the floor', which is the opposite, i.e. allow the other interlocutor to speak instead. In either case there are changes to the prosody and style of speaking. For example, if the speaker is trying to hold the floor they will increase their volume, slow their speaking rate, and increase their pitch, a combination of effects known as 'Lombard speech'. Current speech synthesisers do not have the ability to perform the necessary changes to their output audio.
Requirements
The requirements of such a system have already been explored in academia. To summarise the results from the literature, these are:
1. A system needs to be responsive and flexible like human interlocutors.
2. It needs to process information incrementally and continuously rather than in large chunks; in other words, it must be able to start speaking before processing has finished.
3. It must know what has already been said.
4. It must be able to halt and then continue or break off, as well as be able to stop entirely.
5. It must be able to operate in real-time.
6. It must be able to make edits to as-yet unspoken parts of the utterance to change delivery parameters such as speaking rate or pitch.
Additionally we found:
1. Speakers only interrupt their speech at a syllable boundary or syllable nucleus; the syllable nucleus is the central part of the syllable, which is usually a vowel.
2. There are never changes on certain consonants such as "plosives", for example the sound of a 'p' or 'b' in English.
3. The reaction cannot be too fast, otherwise it will sound unnatural. It takes a speaker time, in the order of a few hundred milliseconds, to process that their interlocutor has done something that needs a reaction and to decide what the appropriate reaction would be; yet the reaction would be noticeable if it were absent.
Currently there are two approaches one could take based on existing speech synthesis engines:
Modifying the speech with additional libraries
In this approach an application developer who is using the speech library modifies audio generated by the speech synthesiser using Digital Signal Processing (DSP). The problems with this approach are:
1. It creates additional dependencies within end user applications.
2. The application developer needs a much deeper understanding of speech production to know what effects need to be applied to make the speech reaction sound natural.
3. DSP is prone to introducing errors into the audio, known as artefacts, if the speech is modified too much.
Resynthesising the audio
The alternative approach is to resynthesise the audio from scratch. To effect changes in the audio, most speech synthesisers support at least some parts of Speech Synthesis Markup Language (SSML). This approach is not suitable either, for the following reasons:
1. It does not meet the requirement that it processes information incrementally and continuously.
2. SSML does not normally give the option to change audio at a phoneme level.
3. It would be impossible to guarantee that the audio does not change before the point of interruption, as synthesisers take the surrounding context into account in order to sound natural. This in turn means it would be very challenging to splice in the audio at the correct time.
BRIEF DESCRIPTION OF THE INVENTION
The invention provides a simple interface for creating reactions in speech synthesis.
Using this interface the synthesis engine provides the following benefits:
• A simple call where audio is regenerated up to a user-specified point in time, without re-synthesis.
• Prior to the user chosen point the audio is guaranteed to be identical to what was previously synthesised, enabling easy splicing.
• The process for reacting can be started at the beginning of a phrase rather than at the start of the speech.
• The user may specify a type of reaction, for example stopping without modification, switching to Lombard speech, or tailing off politely.
• The exact changes in the speech such as pitch and speaking rate are appropriate for the chosen reaction and have been developed by the developers of the synthesis engine.
• The synthesiser will give options to pick different times for the reaction: importantly the points mentioned in the previous section, but also other options such as instantaneous reactions or word boundaries.
• The application can then continue with the original synthesis, insert some additional synthesised speech or audio before continuing, or discard the remaining audio.
This interface can be invoked by developers using special functions in the synthesis engine's Software Development Kit (SDK) or Application Programming Interface (API) or through extensions to other speech APIs such as Microsoft's Speech API (SAPI) depending on how they are interfacing with the engine. Alternatively the reactions can be included in SSML with the use of new SSML tags. This gives the option for both developers and end users who are only using SSML to interface with the engine to include reactions.
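As a rough illustration, the kind of engine call described above might look as follows in a hypothetical SDK. All names here (`Engine`, `synthesise`, `react`, the parameter names) are invented for illustration and are not the engine's actual API; audio is modelled as a plain list of numbers, with one "sample" per character standing in for real synthesis.

```python
# Minimal sketch of a hypothetical reaction call; the class, method,
# and parameter names are illustrative assumptions, not a real SDK.

class Engine:
    def __init__(self):
        self.audio = []          # previously synthesised samples

    def synthesise(self, text):
        # Stand-in for real synthesis: one "sample" per character.
        self.audio = [ord(c) for c in text]
        return self.audio

    def react(self, reaction_type, earliest_sample, boundary="natural"):
        """Regenerate audio from a cut point; samples before the cut
        are guaranteed identical to the original, enabling splicing."""
        head = self.audio[:earliest_sample]          # untouched prefix
        tail = self.audio[earliest_sample:]
        if reaction_type == "sudden_stop":
            tail = []                                # halt immediately
        elif reaction_type == "polite_stop":
            tail = [int(s * 0.5) for s in tail[:5]]  # quieter, shorter tail
        return head + tail

engine = Engine()
original = engine.synthesise("hello world")
reacted = engine.react("sudden_stop", earliest_sample=5)
# The prefix before the cut point is bit-identical to the original,
# which is the property that makes splicing trivial for the caller.
assert reacted == original[:5]
```

The key design point this sketch captures is the guarantee from the bullet list above: everything before the user-chosen point is returned unchanged, so the dialogue system never needs to re-buffer or crossfade.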
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1 to 4 - typical use case
Figure 1 - System inputs
Figure 2 - Modification start point
Figure 3 - Applied modifications
Figure 4 - Splicing
Figure 5 - Examples of interruption recorded from actual speakers
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
The embodiments of the present invention are implemented on a computer system. This system can be used in any dialogue system backed by a computer that is capable of running a synthesis engine and either playing the audio directly or streaming it to another playback device, and that is similarly connected to a microphone for detecting when reactions are appropriate. Examples of direct audio uses are, but are not limited to, robots, conversational agents, virtual personal assistants, and "AI therapists". An example where the audio would be streamed is a smart home, where a central server generates the audio and manages the dialogue but individual speakers and microphones are located on remote "dumb terminals".
A typical use case of the system is where a reaction is generated by a dialogue system after it has detected some action from an end user. A specific example, shown in the accompanying figures, is the audio being changed to acknowledge that the user has tried to interrupt the dialogue system, which will pass the floor while demonstrating irritation. The system takes as input the reaction type, the earliest time, and optionally a boundary type, see Figure 1. The system then creates new audio which is suitable for splicing. Importantly, it does not re-synthesise audio prior to the splice point. It then modifies the speech according to the reaction type, starting at an appropriate point after the splice point, see Figure 2. After the modification start point the system modifies the audio by selecting different units, and/or applying DSP modifications, and/or stopping before the end of the text, based on the type of modification, to introduce different speaking styles. In the example timing shown in Figure 3 the speech is modified to Lombard speech and stops three words later. The system then passes the audio back to the dialogue system, which splices in the audio at the splice point as in Figure 4. The dialogue system now has the option to include new speech that is appropriate for the reaction, and then either continue with the previously synthesised audio or stop entirely; Figure 4 shows how the dialogue system can modify future audio. Figure 5 shows an example of this sort of interaction from recorded human speech. The following sections describe the embodiment of the system in greater detail.
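The splice step in this flow amounts to concatenating the unchanged prefix with the newly generated reaction audio. A minimal sketch, treating audio as a plain sample list (sample-accurate alignment in a real engine is of course more involved, and the function name is invented for the example):

```python
def splice(original_audio, reaction_audio, splice_index):
    """Replace everything from splice_index onwards with the reaction.

    Because the engine guarantees the reaction audio is identical to
    the original up to the splice point, the caller only needs to keep
    the prefix and append the new tail.
    """
    return original_audio[:splice_index] + reaction_audio

# A tailing-off reaction replacing the last half of an utterance.
original = [0, 1, 2, 3, 4, 5]
reaction_tail = [9, 9]
assert splice(original, reaction_tail, 3) == [0, 1, 2, 9, 9]
```

After the splice the dialogue system is free to append further synthesised speech, resume the original audio, or discard the remainder, exactly as the paragraph above describes.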
Modification start point selection
To select a start point for the reaction the system takes as input the earliest acceptable start time and the phonetic transcription of the existing synthesised speech. The phonetic transcription is generated simultaneously with the synthesised speech and is returned to the user with the generated speech. The user is responsible for storing the transcription within their own system.
When synthesising the reaction the system first determines where in the phonetic transcription the earliest possible cut time falls. It then needs to pick a cut point based on what the user requested. In the case where the earliest time is during the speech, and not part of a leading or trailing pause, the following boundary types can be handled by the system:
1. Phrase ending. In this case the system simply removes any final pause and returns the original synthesised speech.
2. Word boundary. The system starts the reaction between the next two words. The system keeps track of word boundaries within the phonetic transcription.
3. Syllable boundary. The system also keeps track of syllable boundaries within the phonetic transcription, so the system simply starts the reaction at the next available syllable boundary. Word boundaries are also syllable boundaries.
4. Natural cut point. In this case the system checks each phoneme from the earliest possible time against a list of acceptable phonemes in which to place a boundary. The list is generated based on human speech production, so the system will only break on the same phonemes that a human would. Alternatively, if more appropriate, the system will select a syllable boundary as the start of the reaction.
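The boundary selection above can be sketched as a single scan over the timed phonetic transcription. This is an illustrative reconstruction under assumptions: the per-phone record is invented, and the set of acceptable natural-cut phonemes is a made-up subset (the patent's real list is derived from human production data); note it deliberately excludes plosives such as 'p' and 'b'.

```python
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str
    start: float              # start time in seconds
    word_boundary: bool       # a word boundary precedes this phone
    syllable_boundary: bool   # a syllable boundary precedes this phone

# Invented subset of phonemes where a natural cut is assumed acceptable.
NATURAL_CUT = {"a", "e", "i", "o", "u", "ou", "n", "m", "l", "s"}

def pick_cut_point(transcription, earliest, boundary_type):
    """Return the reaction start time for the requested boundary type,
    or None to signal falling back to the phrase ending."""
    for p in transcription:
        if p.start < earliest:
            continue                      # before the earliest allowed time
        if boundary_type == "word" and p.word_boundary:
            return p.start
        if boundary_type == "syllable" and p.syllable_boundary:
            return p.start                # word boundaries count here too
        if boundary_type == "natural" and (p.symbol in NATURAL_CUT
                                           or p.syllable_boundary):
            return p.start
    return None

phones = [
    Phone("h", 0.00, True, True),
    Phone("e", 0.05, False, False),
    Phone("l", 0.10, False, False),
    Phone("ou", 0.15, False, True),
    Phone("w", 0.25, True, True),
]
assert pick_cut_point(phones, 0.06, "word") == 0.25
assert pick_cut_point(phones, 0.06, "syllable") == 0.15
assert pick_cut_point(phones, 0.06, "natural") == 0.10
```

The three assertions illustrate the ordering in the list above: for the same earliest time, a natural cut can land earlier than a syllable boundary, which in turn can land earlier than a word boundary.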
Modification process
Once the start point of the reaction is chosen the system can make appropriate modifications. In the simplest case the system uses simple DSP techniques to change the speech. Alternatively the synthesiser can change targeting for unit selection, or input features in a parametric system.
Some specific use cases showing the type of modification to be made (see Figure 5 for the spectral effects):
1. Sudden stop. This case is more useful for debugging: instead of modifying the speech, the output is suddenly halted.
2. Angry stop. Targeting is changed to Lombard speech patterns and DSP tools are applied on a phoneme-by-phoneme basis to achieve Lombard speech. One or two words are kept so the difference can be heard, but most of the speech is dropped from the utterance.
3. Polite stop. Targeting is changed to lower pitch and amplitude and increased duration, creating the prosody of someone tailing off in their speech and allowing the other speaker to take the floor.
4. Speaking over. Lombard speech patterns are introduced from the point of interruption to the end of the utterance. No words are dropped; in this case the system is generating speech suitable for speaking over the interlocutor rather than passing the floor.
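The four cases above can be summarised as a mapping from reaction type to prosodic parameter changes. Every numeric value below is an invented placeholder for illustration only; a real engine would tune these per voice and per reaction.

```python
# Hypothetical reaction profiles; all values are illustrative guesses.
REACTION_PROFILES = {
    "sudden_stop": {"drop_all": True},
    "angry_stop":  {"pitch_semitones": +2.0, "gain_db": +6.0, "rate": 0.8,
                    "keep_words": 2},      # Lombard speech, then cut short
    "polite_stop": {"pitch_semitones": -2.0, "gain_db": -6.0, "rate": 0.7,
                    "keep_words": 3},      # tail off and pass the floor
    "speak_over":  {"pitch_semitones": +2.0, "gain_db": +6.0, "rate": 0.8,
                    "keep_words": None},   # Lombard to the end, nothing dropped
}

def words_to_keep(reaction_type, remaining_words):
    """How many of the as-yet unspoken words survive the modification."""
    profile = REACTION_PROFILES[reaction_type]
    if profile.get("drop_all"):
        return 0
    keep = profile.get("keep_words")
    return remaining_words if keep is None else min(keep, remaining_words)

assert words_to_keep("sudden_stop", 7) == 0
assert words_to_keep("angry_stop", 7) == 2
assert words_to_keep("speak_over", 7) == 7
```

Keeping the profiles as data rather than branching code reflects the benefit claimed earlier: the engine's developers, not the application developer, decide what pitch, rate, and amplitude changes make each reaction sound natural.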
Claims (6)
- Claim 1: A method for responding to an interruption or event in a speech synthesis system, whereby a speech synthesiser can dynamically regenerate and insert modified speech audio output in order to provide improved naturalness in a responsive speech interface.
- Claim 2: A method according to claim 1, wherein a speech synthesis system can be interrupted at a specific future point in time in response to an interruption or event, taking into account reaction and processing times.
- Claim 3: A method according to claim 1, whereby a set of appropriate speech synthesis settings can be chosen in response to an interruption, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
- Claim 4: A method according to claim 3, whereby a speech synthesiser can seamlessly switch from previously generated audio to audio content including an appropriate interruption response.
- Claim 5: A method according to claim 3, whereby a user can select an appropriate interruption response, including but not limited to stopping without modification, switching to Lombard speech, or tailing off in a polite manner.
- Claim 6: A method according to claim 2, whereby an interruption point may be configured within a particular region, including but not limited to a phonetic region, word boundary region, phrase level region, or an instantaneous reaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1713273.9A GB2565589A (en) | 2017-08-18 | 2017-08-18 | Reactive speech synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201713273D0 GB201713273D0 (en) | 2017-10-04 |
GB2565589A true GB2565589A (en) | 2019-02-20 |
Family
ID=59996624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1713273.9A Pending GB2565589A (en) | 2017-08-18 | 2017-08-18 | Reactive speech synthesis |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2565589A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5548681A (en) * | 1991-08-13 | 1996-08-20 | Kabushiki Kaisha Toshiba | Speech dialogue system for realizing improved communication between user and system |
US6240390B1 (en) * | 1998-05-18 | 2001-05-29 | Winbond Electronics Corp. | Multi-tasking speech synthesizer |
JP2008051883A (en) * | 2006-08-22 | 2008-03-06 | Canon Inc | Voice synthesis control method and apparatus |
EP2009620A1 (en) * | 2007-06-25 | 2008-12-31 | Fujitsu Limited | Phoneme length adjustment for speech synthesis |
US20130066632A1 (en) * | 2011-09-14 | 2013-03-14 | At&T Intellectual Property I, L.P. | System and method for enriching text-to-speech synthesis with automatic dialog act tags |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | ||
S28 | Restoration of ceased patents (sect. 28/pat. act 1977) |
Free format text: APPLICATION FILED |