GB2511078A - System for recording speech prompts - Google Patents

System for recording speech prompts

Info

Publication number
GB2511078A
GB2511078A
Authority
GB
United Kingdom
Prior art keywords
text
speech
user
error
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201303152A
Other versions
GB201303152D0 (en)
Inventor
Matthew Peter Aylett
Christopher John Pidcock
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cereproc Ltd
Original Assignee
Cereproc Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cereproc Ltd filed Critical Cereproc Ltd
Priority to GB201303152A priority Critical patent/GB2511078A/en
Publication of GB201303152D0 publication Critical patent/GB201303152D0/en
Publication of GB2511078A publication Critical patent/GB2511078A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • G09B19/04Speaking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Speech errors are detected when a speaker reads a specified text aloud and fed back (e.g. visually) to the user. Detected errors include disfluencies, mispronounced or incorrect words, environmental noise, microphone malfunction or speech which is too loud, quiet, fast (fig. 5) or slow (fig. 4). Aspects of the invention include detecting errors via spectral analysis or amplitude detection 5, a language model 7 which guides speech recognition 6 and a display 2 for indicating the reader's progress along the text.

Description

TITLE
System for recording speech prompts
FIELD OF THE INVENTION
Embodiments of the present invention relate to recording speech audio data. In particular, embodiments of the invention relate to a method for efficiently recording speech audio from text prompts.
BACKGROUND TO THE INVENTION
The efficient recording of pre-specified text is a useful process in several fields. For example, if a professional voiceover artist is paid by the hour to record text material, the efficiency of the recording process is directly related to the cost. When a limited time is available to collect data, such as in a consumer-focused application, an efficient process is required to obtain as much data as possible. A precise match between the text and the read speech is often a requirement, for example when recording an audio book, or capturing training data for a speech recogniser or speech synthesiser.
Current methods of recording speech audio from text are characterised by two main problems: 1) slow rate of audio capture and/or 2) failure to detect errors. Voice recordings are often carried out in a professional recording studio. Text is presented on a paper recording script, with a dedicated engineer operating studio hardware.
Recordings are post-processed manually to edit and split the audio, and fix any errors. This process is time-consuming and expensive; often over 75% of the time is spent on tasks other than recording audio. For example, a 3 hour recording slot may result in only 30-45 minutes of recorded audio. Some of this time is spent on error correction, but a majority is spent readying and checking a voicing cue.
The process can be rendered more efficient by training a voiceover artist to record themselves; however, in this case speech errors and misreadings are often missed.
Automatic systems based on silence detection or an autocue-like system will also fail to detect errors or mismatches between the text and the speech.
It is therefore desirable to provide a technical solution that allows recording speech audio from a fixed text whilst automatically tracking the voiceover artist as they read, retaining an ability to detect and correct errors, and readying the text to the appropriate location for further reading. Furthermore, such a system would be able to automatically produce a 'clean recording' of the fixed text with no further manual intervention.
Automatic speech recognition (ASR) technology can play an important role in detecting errors automatically. However, current systems are designed either for command and control systems, with fixed lexicons and limited commands, or an open vocabulary, with a relatively high word error rate making it hard to distinguish between a speaker error and an ASR error.
ASR systems normally comprise two statistical models: a phonetic model, which is used to recognise speech sounds, and a language model, which is used to provide the expectations of what utterances a user might produce. Although significant previous work has looked at adapting the phonetic model to a current speaker, in standard ASR systems the language model remains static. This is because in most speech recognition problems the actual text that a user will utter is not known. The use of finite state transition networks (FSTs) to implement language models is now widespread. Although creating full FSTs from open domain language data can require very long processing time and large memory requirements, creating small FSTs from more limited language data is very rapid.
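Because the prompt text is known in advance, the "language model" reduces to a small chain of word states plus a garbage arc, which is cheap to build compared with a full open-domain FST. The following is a minimal illustrative sketch of that idea; `build_prompt_network` and `step` are hypothetical names, not part of any FST toolkit or the patented system.

```python
# Minimal sketch: a linear transition network built from known prompt text.
# Each word in the prompt becomes one state; a final state signals completion.
# Any recognised word that does not match the expected state is treated as
# taking the garbage arc and raises an error flag.

def build_prompt_network(prompt_text):
    """Return a list of expected words ending in a completion state."""
    words = prompt_text.lower().replace(",", "").replace(".", "").split()
    return words + ["<complete>"]

def step(network, state_index, recognised_word):
    """Advance through the network; return (new_index, error_flag)."""
    expected = network[state_index]
    if expected == "<complete>":
        return state_index, False              # already finished
    if recognised_word == expected:
        return state_index + 1, False          # matched: advance one state
    return state_index, True                   # mismatch: garbage arc, error

network = build_prompt_network("The cat sat on the mat.")
state, err = 0, False
for word in ["the", "cat", "sat"]:
    state, err = step(network, state, word)
# after three correct words: state == 3, err is False
```

Building such a network is linear in the length of the prompt, which is why small networks from limited language data can be generated on the fly for every prompt.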
Digital signal processing (DSP) techniques can also play an important role in error detection. For example, the amplitude of the user's speech over a window of time can be evaluated, and if it appears abnormally loud or soft the user can be notified.
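The windowed amplitude check described above can be sketched as follows. The window size and thresholds are illustrative assumptions for normalised samples in [-1, 1], not values taken from the patent.

```python
# Sketch of an amplitude check: measure RMS level over fixed-size windows
# and flag audio that is abnormally loud or soft. Thresholds are assumed.
import math

def rms(window):
    """Root-mean-square level of a window of audio samples."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def amplitude_error(samples, window_size=1024, low=0.01, high=0.9):
    """Return True if any full window falls outside the expected range."""
    for start in range(0, len(samples) - window_size + 1, window_size):
        level = rms(samples[start:start + window_size])
        if level < low or level > high:
            return True
    return False
```

A near-silent buffer would trip the `low` threshold (e.g. a muted microphone), while clipping-level audio would trip `high`; a real system would tune both against the recording chain.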
Non-speech sounds such as laughter, breaths, and coughs have different spectral characteristics from normal speech.
BRIEF DESCRIPTION OF THE INVENTION
The invention is a method for efficiently recording a database of speech that exactly matches specific text input.
When making a speech recording, the recorder displays text input to a user on a smartphone, tablet or computer screen. An instruction to start reading is presented to the user, for example by a flashing recording indicator, or a short audio instruction.
Preferably the desired speech rate is indicated by a progress indicator, for example a progress bar under the text, a bouncing ball akin to a karaoke display, or a moving highlight of the text. The audio is transmitted to a receiver, either: 1) the local smartphone, tablet, or computer, or 2) streamed to a remote server. The audio is monitored at the receiver. The receiver detects word errors, disfluencies, and unnatural or unexpected silences, using a combination of automatic speech recognition (ASR) and digital signal processing (DSP) techniques, and returns an error instruction to the recording apparatus. On receiving an error instruction, the recorder stops recording, indicates to the user that recording has stopped, and moves the progress indicator back to a previous known good location. The recorder then starts recording, as before.
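The record, detect, rewind loop just described can be sketched in a few lines. `record_prompt` and `read_chunk` are hypothetical names used only for illustration of the control flow, not the patented implementation.

```python
# Sketch of the recorder control loop: attempt each chunk in turn; on an
# error instruction from the receiver, stay at the same chunk (i.e. rewind
# to the previous known good location) and try again.

def record_prompt(chunks, read_chunk):
    """read_chunk(chunk) -> True on a clean read, False on an error."""
    captured = []
    i = 0
    while i < len(chunks):
        if read_chunk(chunks[i]):
            captured.append(chunks[i])   # known good location advances
            i += 1
        # on error: i is unchanged, so the chunk is re-read from its start
    return captured

# Scripted example: the second chunk fails once, then succeeds.
attempts = iter([True, False, True])
result = record_prompt(["chunk one,", "chunk two."],
                       lambda chunk: next(attempts))
# result holds both chunks; the failed attempt was simply retried
```

The key property is that nothing before the last completed chunk is ever re-recorded, which is what keeps time lost to corrections small.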
Preferably a training phase can be used to instruct the user in the recording process, perform initial checks on the recording environment, and help the system adapt to the style and speech of the user.
The method ensures that recording can continue uninterrupted to the maximum extent possible, with as little recording time as possible lost to corrections and the process of moving between sentences.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 schematically illustrates the recorder system. A human speaker reads prompt text presented on a display into a microphone. An analysis and control system, possibly located on the same computational device, or on a remote computational device, controls the display of the prompt text using a DSP analysis module and a customised ASR system using a language model based on the prompt text.
Figure 2 schematically illustrates the main recorder interface, including an example of a progress bar indicator and an indication of currently completed material.
Figure 3 schematically illustrates the main recorder interface showing an example of progress being displayed as the user reads out prompt text.
Figure 4 schematically illustrates the main recorder interface showing how the display indicates a user is reading material at a lower than desired speech rate.
Figure 5 schematically illustrates the main recorder interface showing how the display indicates a user is reading material at a higher than desired speech rate.
Figure 6 schematically illustrates an error message being displayed to the speaker.
Figure 7 shows an example of a customised language model created specifically to track the user reading a chunk of prompt text.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Some embodiments of the present invention involve a computer system with a display and microphone input. The display renders the fixed text that needs to be read aloud by a human speaker to produce an audio recording. The microphone captures the audio spoken. The computer system controls the display on the basis of the spoken audio that is captured. Either a local or remote system automatically keeps track of the user's progress as the text is read aloud. See Figure 1.
A display and audio capture device 1 comprises a display screen 2 and a microphone system 3 for recording audio spoken by the user. The analysis and control system 4 may be located on the same system as the audio capture device, or may be operated on a remote device with an internet connection allowing it to control the display and receive the audio from the microphone. A digital signal processing module 5 analyses the audio input, measuring amplitude and spectral features. The results of the analysis are compared to a set of thresholds and heuristics to ensure the audio is appropriate for the recording. If it is not, an error is sent to the display controller 9; otherwise the audio is passed to a speech recogniser 6. The speech recogniser uses a language model 7 built from the prompt text 8, which allows it to monitor a speaker's progress, and returns to the display controller 9 either a completion flag, when a punctuated chunk of text has been cleanly recorded, or an error flag, if the speech does not correspond to the expected text. The prompt text is used by the display controller to create a display of the text for the user to read out loud on the display 2.
The prompt text can be, but is not limited to, several sentences, possibly with internal punctuation, forming a sequence of material.
Some embodiments of the invention, in addition to displaying the text to be read aloud, also display the human speaker's current reading progress, both to the current word and to the last complete intonational phrase that was read without error, as well as indicating desired progress given a target speech rate. See Figure 2.
A screen 10 displays prompt text (11, 12, 14). In this Figure the user has successfully read the first chunk of text up to a comma, and paused. This is shown by highlighting the successfully read text in dark grey 11. The next chunk of text that the user needs to record is highlighted in light grey 12. Progress is displayed in the white balls above each word 13. They are white, which shows the reader has not begun to read the next chunk of text. 14 shows the remaining text for the user to read.
The system monitors the user's progress as they read out each word in a chunk of prompt text. Figure 3 shows a user in the process of reading out the second chunk of text. The words the user has read out are marked with a black ball 14; the words the user has yet to read out are still marked with a white ball 13.
Figure 4 shows a user who is reading the text more slowly than desired. The ball with a wider border 15 shows where the user would have progressed to if they had been reading at the desired speech rate. Figure 5 shows a user who is reading the text more quickly than desired. The ball filled in dark grey 16 shows where the user has read to, which is ahead of the desired point, marked with black balls 14.
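The position of the desired-progress marker follows directly from the elapsed time and the target speech rate. The sketch below illustrates one way to compute it; the function name, rate units and values are illustrative assumptions.

```python
# Sketch: which word should the reader have reached at the target rate?
# The marker is clamped to the last word of the current chunk.

def desired_word_index(elapsed_seconds, words_per_minute, chunk_length):
    """Index of the word the reader should be on at the target speech rate."""
    expected = int(elapsed_seconds * words_per_minute / 60.0)
    return min(expected, chunk_length - 1)

# At a target of 120 words per minute, 2.5 seconds into a 10-word chunk
# the marker sits on word index 5.
desired_word_index(2.5, 120, 10)
```

Comparing this index with the index of the last word actually recognised gives the "too slow" (Figure 4) or "too fast" (Figure 5) display states.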
Figure 6 shows the system reporting an error to a user in the process of reading out the second chunk of text 17. The error message explains what the problem is and automatically resets the system to the previous completed chunk, allowing the user to read the chunk again. In this example an error in the reading was found at the beginning of the word 'white' by the speech recognition system using the bespoke language model generated from the prompt text.
An error in the reading process could be, but is not limited to:
A disfluency
An incorrect word, or out of sequence words
Too slow or too fast speech rate
Laughter, coughs, sighs, groans and other non-speech noises
Unusually loud or soft speech
High background noise
A microphone failure, or microphone problem which leads to any of the above.
Tracking Reading Progress and Detecting Word Errors
In some embodiments of the present invention, in order to confidently keep track of the user's reading progress, a public domain ASR system is used to efficiently recognise and align the user's speech with input text in real time as the user reads out the text prompt.
In some embodiments of the present invention the ASR system alters the phonetic models to adapt to the user's vocal style during a brief training phase.
The language model for the ASR system, rather than being designed and built as part of the ASR system before it is used, is instead generated automatically from the prompt text that is to be read. The language model could be, but is not limited to, an FST or a transition network. The model represents all possible states that a reader can be in while reading the prompt text. There are four types of non-error states and two types of states which signal an error. Error-free states are:
N1. Having read some words of the initial punctuated chunk of words.
N2. Having cleanly read words up to a punctuated boundary.
N3. Having cleanly read some of the prompt text up to a punctuated boundary and having read some of the words in the subsequent utterance.
N4. Having cleanly read all the prompt text.
States which involve an error are:
E1. Having read incorrect words, disfluencies or sounds that do not match the initial punctuated chunk of words in the prompt.
E2. Having cleanly read some of the prompt text up to a punctuated boundary and having read some of the words in the subsequent utterance followed by incorrect words.
When the language model enters an error state, the system backtracks to the last complete punctuated utterance. Thus E1 requires a restart and E2 backtracks to an appropriate N2 state. N1, N2 and N3 are used to give the user feedback on their current reading progress, including their reading speed. N4 signals that the prompt has been captured and allows the system to move to a new prompt.
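The backtracking rule distinguishing the two error states can be sketched as follows. Punctuated chunks act as checkpoints; `backtrack_position` is a hypothetical helper illustrating only the resume logic.

```python
# Sketch of error recovery: an E1 error occurs before any checkpoint exists,
# so reading restarts from the beginning; an E2 error backtracks to the last
# cleanly completed punctuated chunk (the corresponding N2 state).

def backtrack_position(completed_chunks, error_in_first_chunk):
    """Return the chunk index to resume reading from after an error."""
    if error_in_first_chunk:
        return 0                  # E1: no checkpoint yet, restart from zero
    return completed_chunks       # E2: resume at the first unfinished chunk
```

In other words, completed chunks are never re-read: an error costs at most the time spent on the current chunk.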
A schematic of a language model in operation in the examples shown in Figures 2-6 is shown in Figure 7.
A start state 18 can transition through each word model 19, or can exit through a garbage model 20 to return an error 21. If all the word models are successfully applied, the completion state is entered 22 and no error is returned.
Detecting Errors
In some embodiments of the present invention, while the ASR module is tracking the user's speech, a series of DSP modules monitor the audio to detect other error types.
These could be, but are not limited to:
Too slow or too fast speech rate: The speech rate is calculated based on the result of the ASR analysis. If the speech rate is higher or lower than specified thresholds, an error is generated. If a pause is longer than a specified threshold, an error is generated.
Unusually loud or soft speech: Amplitude is measured over words aligned by the language model to audio data. If the amplitude is higher or lower than specified thresholds an error is generated.
High background noise: Amplitude is measured in silences aligned by the language model to audio data. If the amplitude is higher than a specified threshold an error is generated.
A microphone failure or microphone problem which leads to any of the above: If the global spectral envelope of the user's speech changes above a threshold, an error is generated.
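The threshold checks above can be combined into a single pass, as in the sketch below. The function name, feature inputs and all threshold values are illustrative assumptions; a real system would derive its features from the aligned audio and tune the thresholds per setup.

```python
# Sketch of the DSP error checks: speech rate, pause length, speech level
# and background-noise level are each compared against assumed thresholds,
# and every violated check contributes an error message.

def detect_errors(words_per_minute, pause_seconds, speech_rms, silence_rms,
                  rate_range=(100, 200), max_pause=2.0,
                  speech_range=(0.05, 0.8), max_noise=0.02):
    """Return a list of error descriptions; empty means the audio is clean."""
    errors = []
    rate_lo, rate_hi = rate_range
    if not rate_lo <= words_per_minute <= rate_hi:
        errors.append("speech rate out of range")
    if pause_seconds > max_pause:
        errors.append("pause too long")
    level_lo, level_hi = speech_range
    if not level_lo <= speech_rms <= level_hi:
        errors.append("speech too loud or too soft")
    if silence_rms > max_noise:
        errors.append("high background noise")
    return errors
```

Note that speech level is measured over recognised words while noise is measured in the aligned silences, which is why the two checks take separate RMS inputs.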
Error Notification
Errors consist of, but are not limited to:
Critical error: e.g. loss of microphone input. Terminate current prompt reading, notify the user and allow a restart once the problem is fixed.
Reading error: e.g. the user is disfluent or misreads a word. Notify the user and allow the user to restart from the last cleanly read point in the prompt.
Warning: e.g. the reading is slower than expected. Notify the user but do not take any action.
GB201303152A 2013-02-22 2013-02-22 System for recording speech prompts Withdrawn GB2511078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201303152A GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201303152A GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Publications (2)

Publication Number Publication Date
GB201303152D0 GB201303152D0 (en) 2013-04-10
GB2511078A true GB2511078A (en) 2014-08-27

Family

ID=48091926

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201303152A Withdrawn GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Country Status (1)

Country Link
GB (1) GB2511078A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133340A1 (en) * 2001-03-16 2002-09-19 International Business Machines Corporation Hierarchical transcription and display of input speech
WO2006031752A2 (en) * 2004-09-10 2006-03-23 Soliloquy Learning, Inc. Microphone setup and testing in voice recognition software
US20080101556A1 (en) * 2006-10-31 2008-05-01 Samsung Electronics Co., Ltd. Apparatus and method for reporting speech recognition failures
US20120219932A1 (en) * 2011-02-27 2012-08-30 Eyal Eshed System and method for automated speech instruction


Also Published As

Publication number Publication date
GB201303152D0 (en) 2013-04-10

Similar Documents

Publication Publication Date Title
US6792409B2 (en) Synchronous reproduction in a speech recognition system
CA2799892C (en) System and method for real-time multimedia reporting
US9774747B2 (en) Transcription system
US12039481B2 (en) Interactive test method, device and system
CN101739870B (en) Interactive language learning system and method
KR100312060B1 (en) Speech recognition enrollment for non-readers and displayless devices
US8311832B2 (en) Hybrid-captioning system
US8260617B2 (en) Automating input when testing voice-enabled applications
US8560327B2 (en) System and method for synchronizing sound and manually transcribed text
US20130035936A1 (en) Language transcription
US20140122081A1 (en) Automated text to speech voice development
US20200211565A1 (en) System and method for simultaneous multilingual dubbing of video-audio programs
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
CN104240718A (en) Transcription support device, method, and computer program product
US9472186B1 (en) Automated training of a user audio profile using transcribed medical record recordings
US20170076626A1 (en) System and Method for Dynamic Response to User Interaction
JP2015011348A (en) Training and evaluation method for foreign language speaking ability using voice recognition and device for the same
CN110503941B (en) Language ability evaluation method, device, system, computer equipment and storage medium
US7308407B2 (en) Method and system for generating natural sounding concatenative synthetic speech
JP2011242637A (en) Voice data editing device
GB2511078A (en) System for recording speech prompts
Proença et al. Children's reading aloud performance: a database and automatic detection of disfluencies
Kraljevski et al. Hyperarticulation of Corrections in Multilingual Dialogue Systems.
Amdal et al. FonDat1: A Speech Synthesis Corpus for Norwegian.

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)