GB2511078A - System for recording speech prompts - Google Patents

System for recording speech prompts

Info

Publication number
GB2511078A
GB2511078A
Authority
GB
United Kingdom
Prior art keywords
text
speech
user
error
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201303152A
Other versions
GB201303152D0 (en)
Inventor
Matthew Peter Aylett
Christopher John Pidcock
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cereproc Ltd
Original Assignee
Cereproc Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cereproc Ltd filed Critical Cereproc Ltd
Priority to GB201303152A priority Critical patent/GB2511078A/en
Publication of GB201303152D0 publication Critical patent/GB201303152D0/en
Publication of GB2511078A publication Critical patent/GB2511078A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • G09B19/04Speaking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Speech errors are detected when a speaker reads a specified text aloud and fed back (e.g. visually) to the user. Detected errors include disfluencies, mispronounced or incorrect words, environmental noise, microphone malfunction or speech which is too loud, quiet, fast (fig. 5) or slow (fig. 4). Aspects of the invention include detecting errors via spectral analysis or amplitude detection 5, a language model 7 which guides speech recognition 6 and a display 2 for indicating the reader's progress along the text.

Description

TITLE
System for recording speech prompts
FIELD OF THE INVENTION
Embodiments of the present invention relate to recording speech audio data. In particular, embodiments of the invention relate to a method for efficiently recording speech audio from text prompts.
BACKGROUND TO THE INVENTION
The efficient recording of pre-specified text is a useful process in several fields. For example, if a professional voiceover artist is paid by the hour to record text material, the efficiency of the recording process is directly related to the cost. When a limited time is available to collect data, such as in a consumer-focused application, an efficient process is required to obtain as much data as possible. A precise match between the text and the read speech is often a requirement, for example when recording an audio book, or capturing training data for a speech recogniser or speech synthesiser.
Current methods of recording speech audio from text are characterised by two main problems: 1) slow rate of audio capture and/or 2) failure to detect errors. Voice recordings are often carried out in a professional recording studio. Text is presented on a paper recording script, with a dedicated engineer operating studio hardware.
Recordings are post-processed manually to edit and split the audio, and fix any errors. This process is time-consuming and expensive; often over 75% of the time is spent on tasks other than recording audio. For example, a 3 hour recording slot may result in only 30-45 minutes of recorded audio. Some of this time is spent on error correction, but a majority is spent readying and checking a voicing cue.
The process can be rendered more efficient by training a voiceover artist to record themselves; however, in this case speech errors and misreadings are often missed.
Automatic systems based on silence detection or an autocue-like system will also fail to detect errors or mismatches between the text and the speech.
It is therefore desirable to provide a technical solution that allows recording speech audio from a fixed text whilst automatically tracking the voiceover artist as they read, retaining an ability to detect and correct errors, and readying the text to the appropriate location for further reading. Furthermore, such a system would be able to automatically produce a 'clean recording' of the fixed text with no further manual intervention.
Automatic speech recognition (ASR) technology can play an important role in detecting errors automatically. However, current systems are designed either for command and control systems, with fixed lexicons and limited commands, or an open vocabulary, with a relatively high word error rate making it hard to distinguish between a speaker error and an ASR error.
ASR systems normally comprise two statistical models: a phonetic model, which is used to recognise speech sounds, and a language model, which is used to provide the expectations of what utterances a user might produce. Although significant previous work has looked at adapting the phonetic model to a current speaker, in standard ASR systems the language model remains static. This is because in most speech recognition problems the actual text that a user will utter is not known. The use of finite state transition networks (FSTs) to implement language models is now widespread. Although creating full FSTs from open domain language data can require very long processing time and large memory requirements, creating small FSTs from more limited language data is very rapid.
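Because the prompt text is known in advance, the "language model" reduces to a small chain of word states plus a garbage arc, which is cheap to build compared with a full open-domain FST. The following is a minimal illustrative sketch of that idea; `build_prompt_network` and `step` are hypothetical names, not part of any FST toolkit or the patented system.

```python
# Minimal sketch: a linear transition network built from known prompt text.
# Each word in the prompt becomes one state; a final state signals completion.
# Any recognised word that does not match the expected state is treated as
# taking the garbage arc and raises an error flag.

def build_prompt_network(prompt_text):
    """Return a list of expected words ending in a completion state."""
    words = prompt_text.lower().replace(",", "").replace(".", "").split()
    return words + ["<complete>"]

def step(network, state_index, recognised_word):
    """Advance through the network; return (new_index, error_flag)."""
    expected = network[state_index]
    if expected == "<complete>":
        return state_index, False              # already finished
    if recognised_word == expected:
        return state_index + 1, False          # matched: advance one state
    return state_index, True                   # mismatch: garbage arc, error

network = build_prompt_network("The cat sat on the mat.")
state, err = 0, False
for word in ["the", "cat", "sat"]:
    state, err = step(network, state, word)
# after three correct words: state == 3, err is False
```

Building such a network is linear in the length of the prompt, which is why small networks from limited language data can be generated on the fly for every prompt.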
Digital signal processing (DSP) techniques can also play an important role in error detection. For example, the amplitude of the user's speech over a window of time can be evaluated, and if it appears abnormally loud or soft the user can be notified.
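The windowed amplitude check described above can be sketched as follows. The window size and thresholds are illustrative assumptions for normalised samples in [-1, 1], not values taken from the patent.

```python
# Sketch of an amplitude check: measure RMS level over fixed-size windows
# and flag audio that is abnormally loud or soft. Thresholds are assumed.
import math

def rms(window):
    """Root-mean-square level of a window of audio samples."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def amplitude_error(samples, window_size=1024, low=0.01, high=0.9):
    """Return True if any full window falls outside the expected range."""
    for start in range(0, len(samples) - window_size + 1, window_size):
        level = rms(samples[start:start + window_size])
        if level < low or level > high:
            return True
    return False
```

A near-silent buffer would trip the `low` threshold (e.g. a muted microphone), while clipping-level audio would trip `high`; a real system would tune both against the recording chain.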
Non-speech sounds such as laughter, breaths, and coughs have different spectral characteristics from normal speech.
BRIEF DESCRIPTION OF THE INVENTION
The invention is a method for efficiently recording a database of speech that exactly matches specific text input.
When making a speech recording, the recorder displays text input to a user on a smartphone, tablet or computer screen. An instruction to start reading is presented to the user, for example by a flashing recording indicator, or a short audio instruction.
Preferably the desired speech rate is indicated by a progress indicator, for example a progress bar under the text, a bouncing ball akin to a karaoke display, or a moving highlight of the text. The audio is transmitted to a receiver, either: 1) the local smartphone, tablet, or computer, or 2) streamed to a remote server. The audio is monitored at the receiver. The receiver detects word errors, disfluencies, and unnatural or unexpected silences, using a combination of automatic speech recognition (ASR) and digital signal processing (DSP) techniques, and returns an error instruction to the recording apparatus. On receiving an error instruction, the recorder stops recording, indicates to the user that recording has stopped, and moves the progress indicator back to a previous known good location. The recorder then starts recording, as before.
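The record, detect, rewind loop just described can be sketched in a few lines. `record_prompt` and `read_chunk` are hypothetical names used only for illustration of the control flow, not the patented implementation.

```python
# Sketch of the recorder control loop: attempt each chunk in turn; on an
# error instruction from the receiver, stay at the same chunk (i.e. rewind
# to the previous known good location) and try again.

def record_prompt(chunks, read_chunk):
    """read_chunk(chunk) -> True on a clean read, False on an error."""
    captured = []
    i = 0
    while i < len(chunks):
        if read_chunk(chunks[i]):
            captured.append(chunks[i])   # known good location advances
            i += 1
        # on error: i is unchanged, so the chunk is re-read from its start
    return captured

# Scripted example: the second chunk fails once, then succeeds.
attempts = iter([True, False, True])
result = record_prompt(["chunk one,", "chunk two."],
                       lambda chunk: next(attempts))
# result holds both chunks; the failed attempt was simply retried
```

The key property is that nothing before the last completed chunk is ever re-recorded, which is what keeps time lost to corrections small.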
Preferably a training phase can be used to instruct the user in the recording process, perform initial checks on the recording environment, and help the system adapt to the style and speech of the user.
The method ensures that recording can continue uninterrupted to the maximum extent possible, with as little recording time as possible lost to corrections and the process of moving between sentences.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 schematically illustrates the recorder system. A human speaker reads prompt text presented on a display into a microphone. An analysis and control system, possibly located on the same computational device, or on a remote computational device, controls the display of the prompt text using a DSP analysis module and a customised ASR system using a language model based on the prompt text.
Figure 2 schematically illustrates the main recorder interface, including an example of a progress bar indicator and an indication of currently completed material.
Figure 3 schematically illustrates the main recorder interface showing an example of progress being displayed as the user reads out prompt text.
Figure 4 schematically illustrates the main recorder interface showing how the display indicates a user is reading material at a lower than desired speech rate.
Figure 5 schematically illustrates the main recorder interface showing how the display indicates a user is reading material at a higher than desired speech rate.
Figure 6 schematically illustrates an error message being displayed to the speaker.
Figure 7 shows an example of a customised language model created specifically to track the user reading a chunk of prompt text.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Some embodiments of the present invention involve a computer system with a display and microphone input. The display renders the fixed text that needs to be read aloud by a human speaker to produce an audio recording. The microphone captures the audio spoken. The computer system controls the display on the basis of the spoken audio that is captured. Either a local or remote system automatically keeps track of the user's progress as the text is read aloud. See Figure 1.
A display and audio capture device 1 comprises a display screen 2 and a microphone system 3 for recording audio spoken by the user. The analysis and control system 4 may be located on the same system as the audio capture device, or may be operated on a remote device with an internet connection allowing it to control the display and receive the audio from the microphone. A digital signal processing module 5 analyses the audio input, measuring amplitude and spectral features. The results of the analysis are compared to a set of thresholds and heuristics to ensure the audio is appropriate for the recording. If it is not, an error is sent to the display controller 9; otherwise the audio is passed to a speech recogniser 6. The speech recogniser uses a language model 7 built from the prompt text 8, which allows it to monitor a speaker's progress, and returns to the display controller 9 either a completion flag, when a punctuated chunk of text has been cleanly recorded, or an error flag, if the speech does not correspond to the expected text. The prompt text is used by the display controller to create a display of the text for the user to read out loud on the display 2.
The prompt text can be, but is not limited to, several sentences, possibly with internal punctuation, forming a sequence of material.
Some embodiments of the invention, in addition to displaying the text to be read aloud, also display the human speaker's current reading progress, both to the current word and to the last complete intonational phrase that was read without error, as well as indicating desired progress given a target speech rate. See Figure 2.
A screen 10 displays prompt text (11, 12, 14). In this Figure the user has successfully read the first chunk of text up to a comma, and paused. This is shown by highlighting the successfully read text in dark grey 11. The next chunk of text that the user needs to record is highlighted in light grey 12. Progress is displayed in the white balls above each word 13. They are white, which shows the reader has not begun to read the next chunk of text. 14 shows the remaining text for the user to read.
The system monitors the user's progress as they read out each word in a chunk of prompt text. Figure 3 shows a user in the process of reading out the second chunk of text. The words the user has read out are marked with a black ball 14; the words the user has yet to read out are still marked with a white ball 13.
Figure 4 shows a user who is reading the text more slowly than desired. The ball with a wider border 15 shows where the user would have progressed to if they had been reading at the desired speech rate. Figure 5 shows a user who is reading the text more quickly than desired. The ball filled in dark grey 16 shows where the user has read to, which is ahead of the desired point, marked with black balls 14.
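The position of the desired-progress marker follows directly from the elapsed time and the target speech rate. The sketch below illustrates one way to compute it; the function name, rate units and values are illustrative assumptions.

```python
# Sketch: which word should the reader have reached at the target rate?
# The marker is clamped to the last word of the current chunk.

def desired_word_index(elapsed_seconds, words_per_minute, chunk_length):
    """Index of the word the reader should be on at the target speech rate."""
    expected = int(elapsed_seconds * words_per_minute / 60.0)
    return min(expected, chunk_length - 1)

# At a target of 120 words per minute, 2.5 seconds into a 10-word chunk
# the marker sits on word index 5.
desired_word_index(2.5, 120, 10)
```

Comparing this index with the index of the last word actually recognised gives the "too slow" (Figure 4) or "too fast" (Figure 5) display states.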
Figure 6 shows the system reporting an error to a user in the process of reading out the second chunk of text 17. The error message explains what the problem is and automatically resets the system to the previous completed chunk, allowing the user to read the chunk again. In this example an error in the reading was found at the beginning of the word 'white' by the speech recognition system using the bespoke language model generated from the prompt text.
An error in the reading process could be, but is not limited to:
A disfluency
An incorrect word, or out of sequence words
Too slow or too fast speech rate
Laughter, coughs, sighs, groans and other non-speech noises
Unusually loud or soft speech
High background noise
A microphone failure, or microphone problem which leads to any of the above.
Tracking Reading Progress and Detecting Word Errors
In some embodiments of the present invention, in order to confidently keep track of the user's reading progress, a public domain ASR system is used to efficiently recognise and align the user's speech with input text in real time as the user reads out the text prompt.
In some embodiments of the present invention the ASR system alters the phonetic models to adapt to the user's vocal style during a brief training phase.
The language model for the ASR system, rather than being designed and built as part of the ASR system before it is used, is instead generated automatically from the prompt text that is to be read. The language model could be, but is not limited to, an FST or a transition network. The model represents all possible states that a reader can be in while reading the prompt text. There are four types of non-error states and two types of states which signal an error. Error-free states are:
N1. Having read some words of the initial punctuated chunk of words.
N2. Having cleanly read words up to a punctuated boundary.
N3. Having cleanly read some of the prompt text up to a punctuated boundary and having read some of the words in the subsequent utterance.
N4. Having cleanly read all the prompt text.
States which involve an error are:
E1. Having read incorrect words, disfluencies or sounds that do not match the initial punctuated chunk of words in the prompt.
E2. Having cleanly read some of the prompt text up to a punctuated boundary and having read some of the words in the subsequent utterance followed by incorrect words.
When the language model enters an error state, the system backtracks to the last complete punctuated utterance. Thus E1 requires a restart and E2 backtracks to an appropriate N2 state. N1, N2 and N3 are used to give the user feedback on their current reading progress, including their reading speed. N4 signals that the prompt has been captured and allows the system to move to a new prompt.
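The backtracking rule distinguishing the two error states can be sketched as follows. Punctuated chunks act as checkpoints; `backtrack_position` is a hypothetical helper illustrating only the resume logic.

```python
# Sketch of error recovery: an E1 error occurs before any checkpoint exists,
# so reading restarts from the beginning; an E2 error backtracks to the last
# cleanly completed punctuated chunk (the corresponding N2 state).

def backtrack_position(completed_chunks, error_in_first_chunk):
    """Return the chunk index to resume reading from after an error."""
    if error_in_first_chunk:
        return 0                  # E1: no checkpoint yet, restart from zero
    return completed_chunks       # E2: resume at the first unfinished chunk
```

In other words, completed chunks are never re-read: an error costs at most the time spent on the current chunk.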
A schematic of a language model in operation in the examples shown in Figures 2-6 is shown in Figure 7.
A start state 18 can transition through each word model 19, or can exit through a garbage model 20 to return an error 21. If all the word models are successfully applied, the completion state is entered 22 and no error is returned.
Detecting Errors
In some embodiments of the present invention, while the ASR module is tracking the user's speech, a series of DSP modules monitor the audio to detect other error types.
These could be, but are not limited to:
Too slow or too fast speech rate: The speech rate is calculated based on the result of the ASR analysis. If the speech rate is higher or lower than specified thresholds, an error is generated. If a pause is longer than a specified threshold, an error is generated.
Unusually loud or soft speech: Amplitude is measured over words aligned by the language model to audio data. If the amplitude is higher or lower than specified thresholds an error is generated.
High background noise: Amplitude is measured in silences aligned by the language model to audio data. If the amplitude is higher than a specified threshold an error is generated.
A microphone failure or microphone problem which leads to any of the above: If the global spectral envelope of the user's speech changes above a threshold, an error is generated.
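The threshold checks above can be combined into a single pass, as in the sketch below. The function name, feature inputs and all threshold values are illustrative assumptions; a real system would derive its features from the aligned audio and tune the thresholds per setup.

```python
# Sketch of the DSP error checks: speech rate, pause length, speech level
# and background-noise level are each compared against assumed thresholds,
# and every violated check contributes an error message.

def detect_errors(words_per_minute, pause_seconds, speech_rms, silence_rms,
                  rate_range=(100, 200), max_pause=2.0,
                  speech_range=(0.05, 0.8), max_noise=0.02):
    """Return a list of error descriptions; empty means the audio is clean."""
    errors = []
    rate_lo, rate_hi = rate_range
    if not rate_lo <= words_per_minute <= rate_hi:
        errors.append("speech rate out of range")
    if pause_seconds > max_pause:
        errors.append("pause too long")
    level_lo, level_hi = speech_range
    if not level_lo <= speech_rms <= level_hi:
        errors.append("speech too loud or too soft")
    if silence_rms > max_noise:
        errors.append("high background noise")
    return errors
```

Note that speech level is measured over recognised words while noise is measured in the aligned silences, which is why the two checks take separate RMS inputs.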
Error Notification
Errors consist of, but are not limited to:
Critical error: e.g. loss of microphone input. Terminate current prompt reading, notify the user and allow a restart once the problem is fixed.
Reading error: e.g. the user is disfluent or misreads a word. Notify the user and allow the user to restart from the last cleanly read point in the prompt.
Warning: e.g. the reading is slower than expected. Notify the user but do not take any action.
GB201303152A 2013-02-22 2013-02-22 System for recording speech prompts Withdrawn GB2511078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201303152A GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201303152A GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Publications (2)

Publication Number Publication Date
GB201303152D0 GB201303152D0 (en) 2013-04-10
GB2511078A true GB2511078A (en) 2014-08-27

Family

ID=48091926

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201303152A Withdrawn GB2511078A (en) 2013-02-22 2013-02-22 System for recording speech prompts

Country Status (1)

Country Link
GB (1) GB2511078A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133340A1 (en) * 2001-03-16 2002-09-19 International Business Machines Corporation Hierarchical transcription and display of input speech
WO2006031752A2 (en) * 2004-09-10 2006-03-23 Soliloquy Learning, Inc. Microphone setup and testing in voice recognition software
US20080101556A1 (en) * 2006-10-31 2008-05-01 Samsung Electronics Co., Ltd. Apparatus and method for reporting speech recognition failures
US20120219932A1 (en) * 2011-02-27 2012-08-30 Eyal Eshed System and method for automated speech instruction


Also Published As

Publication number Publication date
GB201303152D0 (en) 2013-04-10

Similar Documents

Publication Publication Date Title
US6792409B2 (en) Synchronous reproduction in a speech recognition system
CA2799892C (en) System and method for real-time multimedia reporting
US9774747B2 (en) Transcription system
US12039481B2 (en) Interactive test method, device and system
CN101739870B (en) Interactive language learning system and method
KR100312060B1 (en) Speech recognition enrollment for non-readers and displayless devices
US8311832B2 (en) Hybrid-captioning system
US8260617B2 (en) Automating input when testing voice-enabled applications
US8560327B2 (en) System and method for synchronizing sound and manually transcribed text
US20130035936A1 (en) Language transcription
US20140122081A1 (en) Automated text to speech voice development
US20200211565A1 (en) System and method for simultaneous multilingual dubbing of video-audio programs
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
CN104240718A (en) Transcription support device, method, and computer program product
US9472186B1 (en) Automated training of a user audio profile using transcribed medical record recordings
US20170076626A1 (en) System and Method for Dynamic Response to User Interaction
JP2015011348A (en) Training and evaluation method for foreign language speaking ability using voice recognition and device for the same
CN110503941B (en) Language ability evaluation method, device, system, computer equipment and storage medium
US7308407B2 (en) Method and system for generating natural sounding concatenative synthetic speech
JP2011242637A (en) Voice data editing device
GB2511078A (en) System for recording speech prompts
Proença et al. Children's reading aloud performance: a database and automatic detection of disfluencies
Kraljevski et al. Hyperarticulation of Corrections in Multilingual Dialogue Systems.
Amdal et al. FonDat1: A Speech Synthesis Corpus for Norwegian.

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)