WO2005045803A1 - Error detection for speech to text transcription systems - Google Patents

Info

Publication number
WO2005045803A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
text
signal
speech signal
error
Prior art date
Application number
PCT/IB2004/052218
Other languages
French (fr)
Other versions
WO2005045803A8 (en)
Inventor
Hauke Schramm
Original Assignee
Philips Intellectual Property & Standards Gmbh
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Philips Intellectual Property & Standards Gmbh, Koninklijke Philips Electronics N.V.
Priority to US10/578,073 (patent US7617106B2)
Priority to CN200480032825.6A (patent CN1879146B)
Priority to EP04791820A (patent EP1702319B1)
Priority to DE602004018385T (patent DE602004018385D1)
Priority to JP2006537527A (patent JP4714694B2)
Publication of WO2005045803A1
Publication of WO2005045803A8

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch
    • G10L2021/0135 - Voice conversion or morphing

Definitions

  • the invention relates to the field of speech to text transcription systems and methods and more particularly to the detection of errors in speech to text transcription systems.
  • Speech transcription and speech recognition systems recognize speech, e.g. a spoken dictation, and transcribe the recognized speech to text.
  • Speech transcription systems are nowadays widely used, for example in the medical sector or in legal practices.
  • There exists a variety of speech transcription systems, such as Speech Magic™ of Philips Electronics NV and the Via Voice™ system of IBM Corporation, that are commercially available.
  • a text which is generated by a speech to text transcription system inevitably comprises erroneous text portions.
  • Such erroneous text portions arise due to many reasons, such as different environmental conditions like noise in which the speech has been recorded or different speakers to which the system is not properly adapted.
  • Spoken commands within the dictation that relate to punctuation, text formatting or type face have to be properly interpreted by a speech to text transcription system instead of being literally transcribed as words. Since speech to text transcription systems feature limited speech recognition capabilities as well as limited command interpretation capabilities, they inevitably produce errors in the transcribed text.
  • the generated text of a speech to text transcription system has to be checked for errors and erroneous text portions in a proofreading step.
  • the proofreading typically has to be performed by a human proofreader.
  • the proofreader compares the original speech signal of the dictation with the transcribed text generated by the speech to text transcription system.
  • Proofreading in the form of comparison is typically performed by listening to the original speech signal while simultaneously reading the transcribed text. Especially this kind of comparison is extremely exhausting for the proofreader since the text in form of visual information has to be compared with the speech signal which is provided in the form of acoustic information.
  • the comparison therefore requires high concentration of the proofreader for a time corresponding to the duration of the dictation.
  • proofreading is not necessary for major parts of the transcribed text. Nevertheless the original source of the text is only available as a speech signal which is only accessible in a sequential way by listening to it. Comparing a written text with an acoustic signal can only be performed by listening to the acoustic signal in its entirety. Therefore the proofreading may even be more time consuming than the transcription process itself.
  • the present invention aims to provide a method, a system and a computer program product for an efficient error detection within text generated by an automatic speech to text transcription system.
  • the present invention provides a method for error detection for speech to text transcription systems.
  • the speech to text transcription system receives a first speech signal and transcribes this first speech signal into text.
  • the transcribed text is re-transformed into a second, synthetic speech signal.
  • the proofreader only has to compare two acoustic signals of first and second speech signal instead of comparing a first speech signal with the transcribed text.
  • First and second speech signals are provided to the proofreader via a stereo headphone for example. In this way the proofreader listens simultaneously to the first and to the second speech signal and can easily detect potential deviations between the two speech signals indicating that an error has occurred in the speech to text transcription process.
  • the re-transformation of the transcribed text into a second speech signal is performed by a so called text to speech synthesizing system.
  • text to speech synthesizing systems are disclosed in e.g. EP 0363233 and EP 0706170.
  • Typical text to speech synthesizing systems are based on diphone synthesis techniques or unit selection synthesis techniques containing databases in which recorded parts of voices are stored.
  • a way of generating a synthetic second speech signal from the transcribed text which is synchronous to the first speech signal is to invert the speech recognition process. Instead of producing output text from input feature vectors (representing e.g. a 10 ms portion of the first speech signal) the speech recognition system is also applied to generate output feature vectors from input text.
  • This can be achieved by first transforming the text into a (context-dependent) phoneme sequence and successively transforming the phoneme sequence into a Hidden-Markov-Model sequence (HMMs). The concatenated HMMs in turn generate the output feature vector sequence according to a distinct HMM state sequence.
  • HMM state sequence for generating the second speech signal is the optimal (Viterbi) state sequence obtained in the previous speech recognition step, in which the first speech signal has been transformed to text.
  • This state sequence aligns each feature vector to a distinct Hidden-Markov-Model state and thus to a distinct part of the transcribed text.
  • the speed and/or the volume of the second speech signal which is extracted from the transcribed text of the first speech signal matches the speed and/or the volume of the first speech signal.
  • the synthesizing of the second speech signal from the transcribed text is therefore performed with respect to the speed and/or the volume of the first, natural speech signal. This is advantageous, since a comparison between two acoustic signals that are synchronized is much easier than a comparison between two acoustic signals that are not synchronized.
  • the synchronization of the transcribed text depends on the transcribed text corpus itself as well as on the speed and the dynamic range of the first, hence natural speech signal.
  • the first speech signal is also subject of a transformation.
  • a set of filter functions is applied to the first speech signal in order to transform the spectrum of the first speech signal.
  • the spectrum of the first speech signal is assimilated to the spectrum of the synthesized second speech signal.
  • an additional signal is generated by subtracting or superimposing the first and the second speech signal.
  • this kind of comparison signal is generated by subtracting the first and the second speech signal
  • the amplitude of this comparison signal indicates deviations between first and second speech signals.
  • Especially large deviations between first and second speech signal are an indication that the speech to text transcription system has generated an error. Therefore, the comparison signal gives a direct indication whether an error has occurred in the speech to text transcription process.
  • the comparison signal does not necessarily have to be generated by a subtraction of the two speech signals.
  • a comparison signal is provided to the proofreader acoustically and/or visually. In this way the generated comparison signal is provided to the proofreader.
  • the proofreader can more easily identify portions of the transcribed text that are erroneous. In particular when a comparison signal is provided visually in the transcribed text, the proofreader's attention is attracted to those text portions to which an appreciable comparison signal corresponds.
  • the method for error detection produces an error indication when the amplitude of the comparison signal is beyond a predefined range.
  • the comparison signal is generated by a subtraction of the first and second speech signal
  • an error indication is outputted to the proofreader when the amplitude of the comparison signal exceeds a predefined threshold. The outputting of the error indication can occur acoustically as well as visually. By means of this error indication the proofreader no longer has to observe or listen to an awkwardly sounding comparison signal.
  • the error indication may for example be realized by a distinct ringing tone.
  • the error indication is outputted visually within the transcribed text by means of a graphical user interface. In this way the proofreader no longer has to listen and to compare the two speech signals acoustically. Moreover the comparison between the first and the second speech signal is entirely represented by a comparison signal. Only in such cases when the comparison signal is beyond a predefined threshold value an error indication is outputted within the transcribed text. The proofreader's task then reduces to a manual control of those text portions that are assigned with an error indication. The proof reader may systematically select these text portions that are potentially erroneous.
  • In order to check whether the speech to text transcription system produced an error, the proofreader only listens to those clippings of the first and the second speech signals that correspond to the text portions that are assigned with an error indication.
  • the method therefore provides an efficient approach to filter only those text portions of a transcribed text that might be erroneous. A listening to the complete first speech signal and a reading of the entire transcribed text for proofreading purpose is therefore no longer needed.
  • the proofreading, that has to be performed by a human proofreader effectively reduces to those text portions that have been identified as potentially erroneous by the error detection system. In the same way as the time exposure of the proofreading process decreases, the overall efficiency of the proof reading is enhanced.
  • a pattern recognition is performed on the comparison signal in order to identify pre-defined patterns of the comparison signal being indicative of a distinct type of error in the text.
  • Errors produced by the speech to text transcription system are typically due to misinterpretations of portions of the first, natural speech signal. Such errors especially occur for ambiguous portions of the natural speech signal, such as similarly sounding words with a different meaning and hence different spelling.
  • the speech to text transcription system may produce nonsense words when for example a distinct spoken word is misrecognized as a similar sounding word. Such a confusion may occur several times during the transcription process.
  • When the transcribed text is re-transformed into a second speech signal and when first and second speech signals are compared by means of the above described comparison signal, such a confusion between two words may lead to a distinct pattern in the comparison signal.
  • By means of a pattern recognition applied to the comparison signal, a certain type of error produced by the transcription system may be directly identified.
  • the distinct patterns corresponding to certain types of errors produced by the speech to text transcription system are typically stored by some kind of storing means and provided to the error detection method in order to identify different types of errors.
  • a pattern in the comparison signal that does not match any of the known patterns indicating some type of error may be assigned to an error and a correction procedure manually performed by the proofreader.
  • the method for error detection may collect various patterns in the comparison signal being assigned to a distinct type of error.
  • a correction suggestion is provided with a detected type of error generated by the speech to text transcription system. Since a distinct type of error in the transcribed text is identified by means of a corresponding pattern of the comparison signal, the source of the error, the misrecognized portion of the speech signal can be resolved.
  • a correction suggestion is preferably provided visually by means of a graphical user interface. The proofreading that has to be performed by the human proofreader ideally reduces to the steps of accepting or rejecting correction suggestions provided by the error detection system.
  • the error detection system automatically replaces the erroneous text portion of the transcribed text with the generated correction suggestion.
  • When the proofreader rejects a correction suggestion provided by the error detection system, the proofreader has to correct the erroneous text portion of the transcribed text manually.
  • the described method and system for error detection within text generated by a speech to text transcription system provides an efficient and less time consuming approach for proofreading of the transcribed text.
  • the essential task of an indispensable human proofreader reduces to a minimum number of potentially misrecognized text portions within the transcribed text.
  • Fig. 1 is illustrative of a flow chart of the error detection method
  • Fig. 2 is illustrative of a flow chart of the error detection method
  • Fig. 3 is illustrative of a flow chart of the error detection method including pattern recognition of the comparison signal
  • Fig. 4 shows a block diagram of a speech to text transcription system with error detecting means.
  • Figure 1 shows a flow chart of the error detection method of the present invention.
  • text is generated from a first, natural speech signal by means of a conventional speech to text transcription system.
  • the transcribed text of step 100 is re-transformed into a second speech signal by means of a conventional text to speech synthesizing system.
  • the first natural speech signal and the second artificially generated speech signal are provided to a human proofreader.
  • the proofreader listens to both first and second speech signal simultaneously in step 106.
  • first and second speech signals are synchronized in order to facilitate the acoustic comparison performed by the proofreader.
  • the proofreader detects deviations between the first and the second speech signal.
  • step 100 in which the first, natural speech signal has been transcribed to text.
  • step 108 the correction of the detected error within the text has to be performed manually.
  • the proofreading, i.e. the comparison of the initial, natural speech signal and the transcribed text, is no longer based on a comparison of an acoustic and a visual signal.
  • the proofreader has only to listen to two different acoustic signals. Only in case that an error has been detected, the proofreader has to find the corresponding text portion within the transcribed text and perform the correction.
  • Figure 2 is illustrative of a flow chart of an error detection method according to a preferred embodiment of the invention.
  • a text is transcribed from a first speech signal by a conventional speech to text transcription system.
  • an artificial speech signal is synthesized by means of a text to speech synthesizing system.
  • a first, natural speech signal is applied to a set of filter functions in step 204 to approximate the spectrum of the natural speech signal to the spectrum of the second, artificially generated speech signal.
  • the method either proceeds with step 206 or with step 208.
  • the filtered, first, natural speech signal as well as the second artificially generated speech signal are acoustically provided to the proofreader.
  • step 208 the filtered, natural first speech signal and the second artificially generated speech signal are visually provided to the proofreader.
  • the method continues with step 210 in which the proofreader compares the first and the second speech signals either acoustically and/or visually.
  • the proofreader detects errors in the generated text either by means of listening to the two different speech signals and/or by means of a graphical representation of the two speech signals.
  • the detected errors are manually corrected by the proofreader.
  • In Figure 3 another flow chart illustrating an error detection method according to the present invention is shown.
  • a text is transcribed from a first, natural speech signal by means of a conventional speech to text transcription system.
  • the transcribed text is retransformed into a second speech signal by means of a text to speech synthesizing system.
  • the first, natural speech signal is applied to a set of filter functions in order to assimilate the sound and the spectrum of the first speech signal to the sound and to the spectrum of the artificially generated second speech signal.
  • a comparison signal between the first and second speech signal is generated by means of e.g. subtracting or superimposing the first and the second speech signal. Instead of providing the speech signals directly the method now restricts to provide the generated comparison signal.
  • the comparison signal is either provided acoustically in step 308 or visually in step 310.
  • Potential errors in the text can easily be detected in step 312 by means of the comparison signal.
  • When the comparison signal has been generated by subtracting the two speech signals, a potential error in the text can easily be detected when the amplitude of the comparison signal is above a predefined threshold.
  • the correction of detected errors can either be performed manually in step 318 or one can make use of alternative steps 314 and 316.
  • step 314 a pattern recognition is applied to the comparison signal.
  • FIG. 4 shows a block diagram of an error detection system for a speech to text transcription system.
  • a first speech signal 400 is inputted into an error detection module 402.
  • the error detection module 402 comprises means for a speech to text transcription and generates a text 412 which is outputted from the error detection module 402.
  • the error detection module 402 is connected to a graphical user interface 406 and to an acoustic user interface 404.
  • the error detection module 402 further comprises a speech synthesizing module 408, a speech to text transcription module 410, a text to speech transformation module 414 as well as a text 412, a first speech signal 418 and a second speech signal 416.
  • Natural speech signal 400 representing a dictation is inputted into the speech synthesizing module 408 and into the speech to text transcription module 410 of the error detection module 402.
  • the speech to text transcription module 410 transcribes the speech signal 400 into a text 412.
  • the generated text 412 is outputted as a transcribed text as well as being further processed within the error detection module 402.
  • the text 412 is therefore provided to the text to speech transformation module 414, which retransforms the transcribed text 412 to a second artificially generated speech signal 416.
  • the text to speech transformation module 414 is based on conventional techniques that are known from text to speech synthesizing systems.
  • the artificially generated speech signal 416 can now be compared with the initial, natural speech signal 400 entering the error detection module 402 by means of the acoustic user interface 404.
  • the acoustic user interface 404 can for example be implemented by a stereo headphone.
  • the natural speech signal 400 may be provided on the left channel of the stereo headphone whereas the artificially generated speech signal 416 may be provided on the right channel of the headphone.
  • a human proof reader listening to both speech signals simultaneously can thus easily detect deviations between the two speech signals 400 and 416 that are due to misinterpretations or errors performed by the speech to text transcription module 410. Since a comparison between a natural speech signal 400 and a machine generated speech signal 416 might be confusing or awkwardly sounding to the proof reader, the natural speech signal 400 can be filtered by the speech synthesizing module 408 applying a set of filter functions on the natural speech signal in order to assimilate the spectrum and the sound of the natural speech signal 400 to the synthesized speech signal 416.
  • the speech synthesizing module 408 transforms the natural speech signal 400 into a filtered speech signal 418. As described above, both speech signals, the filtered one 418 as well as the synthesized one 416, can be provided acoustically to the proofreader by means of the acoustic user interface 404. Additionally or alternatively the two generated speech signals can be provided in a graphical representation by means of the graphical user interface 406. With the help of the graphical representation of the speech signals 416 and 418, the proofreader may skip major parts of the transcribed text that have been transcribed correctly.
  • When the error detection module 402 provides a further processing of the two speech signals 416 and 418 by means of generating a comparison signal being indicative of large deviations of the two speech signals, the proofreading process and the detection and correction of errors produced by the speech to text transcription module 410 become more effective and less time consuming.
  • a further processing of the generated comparison signal by means of pattern recognition wherein distinct patterns can be assigned to particular types of errors is of further advantage in order to facilitate the detection and correction tasks to be performed by the human proofreader.
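
The pattern recognition on the comparison signal described in the list above might be sketched as follows. This is only an illustration under stated assumptions: the text does not specify a matching criterion, so the mean squared distance, the `tolerance` parameter, the function name `classify_pattern` and the example patterns and suggestions are all hypothetical.

```python
def classify_pattern(segment, known_patterns, tolerance):
    """Return (error_type, suggestion) of the closest stored pattern, or None.

    segment:        comparison signal amplitudes for one suspicious region
    known_patterns: (pattern, error_type, suggestion) tuples from storage
    tolerance:      largest mean squared distance accepted as a match
                    (an assumption; the source names no criterion)
    """
    best = None
    best_dist = tolerance
    for pattern, error_type, suggestion in known_patterns:
        if len(pattern) != len(segment):
            continue  # this sketch only compares patterns of equal length
        dist = sum((a - b) ** 2 for a, b in zip(segment, pattern)) / len(pattern)
        if dist <= best_dist:
            best, best_dist = (error_type, suggestion), dist
    return best


# Hypothetical stored patterns, each tied to an error type and a suggestion.
known_patterns = [
    ([0.0, 0.8, 0.8, 0.0], "word confusion", "their -> there"),
    ([0.9, 0.1, 0.1, 0.9], "missed punctuation command", "insert comma"),
]

match = classify_pattern([0.1, 0.7, 0.8, 0.0], known_patterns, tolerance=0.05)
no_match = classify_pattern([0.5, 0.5, 0.5, 0.5], known_patterns, tolerance=0.05)
```

A result of `None` corresponds to the case in which the pattern matches no known pattern and the correction is left to the proofreader.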

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to a method, a system and a computer program product for error detection within text generated by a speech to text transcription system. The transcribed text is re-transformed into an artificial speech signal by means of a text to speech synthesizing system. The original, natural speech signal and the artificially generated speech are provided to a proof reader for comparison of the two acoustic signals. Deviations between the original speech signal and the speech transformed from the transcribed text indicate that an error may have occurred in the speech to text transcription process, which has to be corrected manually. The speech signals to be compared can be provided acoustically and/or visually to the proof reader, preferably by making use of a comparison signal deduced from the two speech signals. Major, correctly transcribed parts of the text can be skipped during the proof reading process, saving time and enhancing the effectiveness of the entire proof reading process.

Description

Error detection for speech to text transcription systems
The invention relates to the field of speech to text transcription systems and methods and more particularly to the detection of errors in speech to text transcription systems. Speech transcription and speech recognition systems recognize speech, e.g. a spoken dictation, and transcribe the recognized speech to text. Speech transcription systems are nowadays widely used, for example in the medical sector or in legal practices. There exists a variety of speech transcription systems, such as Speech Magic™ of Philips Electronics NV and the Via Voice™ system of IBM Corporation, that are commercially available. Compared to a human transcriptionist, a speech transcription system saves time and costs, but it cannot provide as high an accuracy of speech understanding and command interpretation as a human transcriptionist.
A text which is generated by a speech to text transcription system inevitably comprises erroneous text portions. Such erroneous text portions arise due to many reasons, such as different environmental conditions like noise in which the speech has been recorded or different speakers to which the system is not properly adapted. Spoken commands within the dictation that relate to punctuation, text formatting or type face have to be properly interpreted by a speech to text transcription system instead of being literally transcribed as words. Since speech to text transcription systems feature limited speech recognition capabilities as well as limited command interpretation capabilities, they inevitably produce errors in the transcribed text.
In order to ensure that a dictation is properly transcribed into text, the generated text of a speech to text transcription system has to be checked for errors and erroneous text portions in a proofreading step. The proofreading typically has to be performed by a human proofreader.
The proofreader compares the original speech signal of the dictation with the transcribed text generated by the speech to text transcription system. Proofreading in the form of comparison is typically performed by listening to the original speech signal while simultaneously reading the transcribed text. Especially this kind of comparison is extremely exhausting for the proofreader since the text in form of visual information has to be compared with the speech signal which is provided in the form of acoustic information. The comparison therefore requires high concentration of the proofreader for a time corresponding to the duration of the dictation. Taking into account that the error rates of a speech to text transcription system can be beneath 20% and may even decrease in the near future, it is clear that proofreading is not necessary for major parts of the transcribed text. Nevertheless the original source of the text is only available as a speech signal which is only accessible in a sequential way by listening to it. Comparing a written text with an acoustic signal can only be performed by listening to the acoustic signal in its entirety. Therefore the proofreading may even be more time consuming than the transcription process itself.
The present invention aims to provide a method, a system and a computer program product for an efficient error detection within text generated by an automatic speech to text transcription system.
The present invention provides a method for error detection for speech to text transcription systems. The speech to text transcription system receives a first speech signal and transcribes this first speech signal into text. In order to facilitate a proofreading or correction procedure which has to be performed by a human proof reader, the transcribed text is re-transformed into a second, synthetic speech signal.
In this way the proofreader only has to compare two acoustic signals of first and second speech signal instead of comparing a first speech signal with the transcribed text. First and second speech signals are provided to the proofreader via a stereo headphone for example. In this way the proofreader listens simultaneously to the first and to the second speech signal and can easily detect potential deviations between the two speech signals indicating that an error has occurred in the speech to text transcription process.
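
The stereo presentation described above could look roughly like this. It is a minimal sketch, assuming both signals are mono sample sequences at the same sampling rate; the function name `to_stereo_frames` and the padding of the shorter signal with silence are assumptions, not taken from the text.

```python
from itertools import zip_longest


def to_stereo_frames(natural, synthetic):
    """Pair mono samples into (left, right) stereo frames.

    left  channel: first, natural speech signal
    right channel: second, synthetic speech signal
    The shorter signal is padded with silence (0.0), which is an assumption.
    """
    return list(zip_longest(natural, synthetic, fillvalue=0.0))


# Toy sample values for illustration only.
frames = to_stereo_frames([0.1, 0.2, 0.3], [0.1, 0.25])
```

Each resulting frame carries one sample per ear, so a deviation between the two signals is heard as a difference between the left and the right channel at the same instant.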
The re-transformation of the transcribed text into a second speech signal is performed by a so called text to speech synthesizing system. Examples of text to speech synthesizing systems are disclosed in e.g. EP 0363233 and EP 0706170. Typical text to speech synthesizing systems are based on diphone synthesis techniques or unit selection synthesis techniques containing databases in which recorded parts of voices are stored.
According to a preferred embodiment of the invention, a way of generating a synthetic second speech signal from the transcribed text which is synchronous to the first speech signal is to invert the speech recognition process. Instead of producing output text from input feature vectors (representing e.g. a 10 ms portion of the first speech signal) the speech recognition system is also applied to generate output feature vectors from input text. This can be achieved by first transforming the text into a (context-dependent) phoneme sequence and successively transforming the phoneme sequence into a Hidden-Markov-Model sequence (HMMs). The concatenated HMMs in turn generate the output feature vector sequence according to a distinct HMM state sequence. In order to support synchronization between first and second speech signal the HMM state sequence for generating the second speech signal is the optimal (Viterbi) state sequence obtained in the previous speech recognition step, in which the first speech signal has been transformed to text. This state sequence aligns each feature vector to a distinct Hidden-Markov-Model state and thus to a distinct part of the transcribed text.
According to a further preferred embodiment of the invention, the speed and/or the volume of the second speech signal which is extracted from the transcribed text of the first speech signal matches the speed and/or the volume of the first speech signal.
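
The inverted recognition step, in which the Viterbi state sequence drives the generation of output feature vectors, might be sketched as below. The assumption that each state simply emits its mean feature vector is a simplification chosen for illustration; the state names, the two-dimensional vectors and the function name `generate_features` are hypothetical.

```python
def generate_features(viterbi_states, state_means):
    """Emit one output feature vector per frame by following the state sequence.

    viterbi_states: HMM state id per input frame, as obtained during recognition
    state_means:    mean feature vector per HMM state (assumed emission model)
    """
    return [list(state_means[state]) for state in viterbi_states]


# Hypothetical toy model: three HMM states with 2-dimensional mean vectors.
state_means = {"a1": (1.0, 0.0), "a2": (0.5, 0.5), "b1": (0.0, 1.0)}

# Hypothetical Viterbi alignment: one state per 10 ms frame of the first signal.
viterbi_states = ["a1", "a1", "a2", "b1", "b1", "b1"]

features = generate_features(viterbi_states, state_means)
```

Because exactly one output vector is produced per input frame, the generated sequence has the same length and timing as the first speech signal, which is what makes the two signals synchronous.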
The synthesizing of the second speech signal from the transcribed text is therefore performed with respect to the speed and/or the volume of the first, natural speech signal. This is advantageous, since a comparison between two acoustic signals that are synchronized is much easier than a comparison between two acoustic signals that are not synchronized. Therefore the synchronization of the transcribed text depends on the transcribed text corpus itself as well as on the speed and the dynamic range of the first, hence natural speech signal.
According to a further preferred embodiment of the invention, the first speech signal is also subject of a transformation. Preferably a set of filter functions is applied to the first speech signal in order to transform the spectrum of the first speech signal. In this way the spectrum of the first speech signal is assimilated to the spectrum of the synthesized second speech signal. As a consequence the sound of the natural first speech signal and the sound of the synthesized second speech signal approach each other, which facilitates once more the comparison of the two speech signals to be performed by the human proof reader. Finally two artificially generated or artificially sounding acoustic signals have to be compared instead of one artificial and one natural acoustic signal.
According to a further preferred embodiment of the invention an additional signal is generated by subtracting or superimposing the first and the second speech signal. When this kind of comparison signal is generated by subtracting the first and the second speech signal, the amplitude of this comparison signal indicates deviations between first and second speech signals. Especially large deviations between first and second speech signal are an indication that the speech to text transcription system has generated an error. Therefore, the comparison signal gives a direct indication whether an error has occurred in the speech to text transcription process.
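
A comparison signal obtained by subtraction could be computed as in the following sketch. The text does not say whether the subtraction operates on waveform samples or on feature vectors; this example assumes synchronized, equally sampled waveforms and summarizes the difference as a mean absolute amplitude per frame. The function name `comparison_signal` and the `frame_len` parameter are assumptions.

```python
def comparison_signal(first, second, frame_len):
    """Mean absolute difference per frame of two synchronized signals."""
    n = min(len(first), len(second))
    # Sample-wise subtraction of the second signal from the first.
    diff = [first[i] - second[i] for i in range(n)]
    amplitudes = []
    for start in range(0, n, frame_len):
        frame = diff[start:start + frame_len]
        amplitudes.append(sum(abs(d) for d in frame) / len(frame))
    return amplitudes


# Toy signals: identical in the first frame, deviating in the second.
first = [0.0, 0.5, 0.5, 0.0]
second = [0.0, 0.5, -0.5, 0.0]
amplitudes = comparison_signal(first, second, frame_len=2)
```

Frames in which both signals agree yield an amplitude near zero, while a deviation shows up as a larger value in the corresponding frame.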
The comparison signal does not necessarily have to be generated by a subtraction of the two speech signals. In general a wide variety of methods leading to a comparison signal from the first and second speech signal is conceivable, e.g. by means of a superposition or a convolution of the speech signals.
According to a further preferred embodiment of the invention, the comparison signal is provided to the proofreader acoustically and/or visually. By making use of this comparison signal, the proofreader can more easily identify portions of the transcribed text that are erroneous. In particular when the comparison signal is provided visually in the transcribed text, the proofreader's attention is attracted to those text portions to which an appreciable comparison signal corresponds. Major parts of the correctly transcribed text associated with a comparison signal of low amplitude can be skipped in the proofreading process. Consequently the efficiency of the proofreader and the proofreading process is remarkably enhanced.
According to a further preferred embodiment of the invention, the method for error detection produces an error indication when the amplitude of the comparison signal is beyond a predefined range. When for example the comparison signal is generated by a subtraction of the first and second speech signal, an error indication is outputted to the proofreader when the amplitude of the comparison signal exceeds a predefined threshold. The outputting of the error indication can occur acoustically as well as visually. By means of this error indication the proofreader no longer has to observe or listen to an awkwardly sounding comparison signal. The error indication may for example be realized by a distinct ringing tone.
According to a further preferred embodiment of the invention, the error indication is outputted visually within the transcribed text by means of a graphical user interface.
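A threshold-based error indication within the transcribed text might look like the following sketch. It assumes that each transcribed word is aligned to a range of frames of the comparison signal; the alignment format, the bracket marking and all values are hypothetical choices made for this illustration only.

```python
from typing import List, Sequence, Tuple

# (word, first frame, one past the last frame) of an assumed word alignment.
WordSpan = Tuple[str, int, int]


def flag_words(comparison: Sequence[float],
               spans: Sequence[WordSpan],
               threshold: float) -> List[str]:
    """Return the words whose comparison-signal amplitude exceeds the threshold."""
    flagged = []
    for word, start, end in spans:
        peak = max((abs(value) for value in comparison[start:end]), default=0.0)
        if peak > threshold:
            flagged.append(word)
    return flagged


def mark_text(spans: Sequence[WordSpan], flagged: Sequence[str]) -> str:
    """Render the text with potentially erroneous words enclosed in brackets."""
    return " ".join(f"[{word}]" if word in flagged else word
                    for word, _, _ in spans)


# Hypothetical comparison signal and word alignment.
comparison = [0.0, 0.1, 0.0, 0.9, 0.8, 0.1, 0.0]
spans = [("the", 0, 2), ("patient", 2, 5), ("rested", 5, 7)]

flagged = flag_words(comparison, spans, threshold=0.5)
marked = mark_text(spans, flagged)
```

Only the word whose frames exceed the threshold is marked, so the remaining text can be skipped during proofreading.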
In this way the proofreader no longer has to listen to and compare the two speech signals acoustically. Moreover the comparison between the first and the second speech signal is entirely represented by the comparison signal. Only when the comparison signal is beyond a predefined threshold value is an error indication outputted within the transcribed text. The proofreader's task then reduces to a manual check of those text portions that are assigned an error indication. The proofreader may systematically select these text portions that are potentially erroneous. In order to check whether the speech to text transcription system produced an error, the proofreader only listens to those clippings of the first and the second speech signals that correspond to the text portions that are assigned an error indication. The method therefore provides an efficient approach to filter only those text portions of a transcribed text that might be erroneous. Listening to the complete first speech signal and reading the entire transcribed text for proofreading purposes is therefore no longer needed. The proofreading that has to be performed by a human proofreader effectively reduces to those text portions that have been identified as potentially erroneous by the error detection system. As the time spent on proofreading decreases, the overall efficiency of the proofreading is enhanced.
According to a further preferred embodiment of the invention, a pattern recognition is performed on the comparison signal in order to identify pre-defined patterns of the comparison signal being indicative of a distinct type of error in the text. Errors produced by the speech to text transcription system are typically due to misinterpretations of portions of the first, natural speech signal.
Such errors especially occur for ambiguous portions of the natural speech signal, such as similarly sounding words with a different meaning and hence different spelling. For example the speech to text transcription system may produce nonsense words when a distinct spoken word is misrecognized as a similar sounding word. Such a confusion may occur several times during the transcription process. When the transcribed text is in turn re-transformed into a second speech signal and the first and second speech signals are compared by means of the above described comparison signal, such a confusion between two words may lead to a distinct pattern in the comparison signal. By means of a pattern recognition applied to the comparison signal a certain type of error produced by the transcription system may be directly identified. The distinct patterns corresponding to certain types of errors produced by the speech to text transcription system are typically stored by some kind of storing means and provided to the error detection method in order to identify different types of errors. Furthermore a pattern in the comparison signal that does not match any of the known patterns indicating some type of error may be assigned to an error and to a correction procedure manually performed by the proofreader. In this way the method for error detection may collect various patterns in the comparison signal being assigned to a distinct type of error. Such a functionality could be interpreted as autonomous learning.
According to a further preferred embodiment of the invention, a correction suggestion is provided with a detected type of error generated by the speech to text transcription system. Since a distinct type of error in the transcribed text is identified by means of a corresponding pattern of the comparison signal, the source of the error, i.e. the misrecognized portion of the speech signal, can be resolved.
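The pattern recognition on the comparison signal is not specified further in the text. One simple possibility, shown here only as an assumption, is a nearest-pattern match using a squared distance together with a tolerance; patterns that match nothing can later be stored under a label assigned by the proofreader. The pattern store, the labels and the tolerance value are hypothetical.

```python
from typing import Dict, Optional, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equally long segments."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(segment: Sequence[float],
             patterns: Dict[str, Sequence[float]],
             max_distance: float) -> Optional[str]:
    """Return the label of the closest stored pattern, or None if none is close enough."""
    best_label, best_distance = None, max_distance
    for label, pattern in patterns.items():
        if len(pattern) != len(segment):
            continue  # this sketch only compares segments of equal length
        distance = squared_distance(segment, pattern)
        if distance <= best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical stored patterns, each assigned to an error type.
patterns = {
    "word confusion": [0.0, 1.0, 1.0, 0.0],
    "dropped word": [1.0, 1.0, 1.0, 1.0],
}

known = classify([0.0, 1.0, 0.5, 0.0], patterns, max_distance=0.5)
unknown_segment = [0.5, 0.0, 0.0, 0.5]
before_learning = classify(unknown_segment, patterns, max_distance=0.5)

# After a manual correction the proofreader's label is stored with the new pattern.
patterns["manually labelled error"] = unknown_segment
after_learning = classify(unknown_segment, patterns, max_distance=0.5)
```

Storing the unmatched segment after a manual correction corresponds to the collecting of patterns described above; a practical system would presumably use a more robust matching technique than this distance check.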
A correction suggestion is preferably provided visually by means of a graphical user interface. The proofreading that has to be performed by the human proofreader ideally reduces to the steps of accepting or rejecting correction suggestions provided by the error detection system. When the proofreader accepts an error correction the error detection system automatically replaces the erroneous text portion of the transcribed text with the generated correction suggestion. Given the other case that the proofreader rejects a correction suggestion provided by the error detection system, the proofreader has to correct the erroneous text portion of the transcribed text manually. The described method and system for error detection within text generated by a speech to text transcription system provides an efficient and less time consuming approach for proofreading of the transcribed text. The essential task of an indispensable human proofreader reduces to a minimum number of potentially misrecognized text portions within the transcribed text. In comparison to a conventional method of proof reading, the proofreader no longer has to listen to the entire natural speech signal that has been transcribed by the speech to text transcription system. In the following, preferred embodiments of the invention will be described in greater detail by making reference to the drawings in which:
Fig. 1 is illustrative of a flow chart of the error detection method,
Fig. 2 is illustrative of a flow chart of the error detection method,
Fig. 3 is illustrative of a flow chart of the error detection method including pattern recognition of the comparison signal,
Fig. 4 shows a block diagram of a speech to text transcription system with error detecting means.
Figure 1 shows a flow chart of the error detection method of the present invention. In a first step 100 text is generated from a first, natural speech signal by means of a conventional speech to text transcription system. In the next step 102 the transcribed text of step 100 is re-transformed into a second speech signal by means of a conventional text to speech synthesizing system. In the following step 104, the first natural speech signal and the second artificially generated speech signal are provided to a human proofreader. The proofreader listens to both first and second speech signal simultaneously in step 106. Typically first and second speech signals are synchronized in order to facilitate the acoustic comparison performed by the proofreader. In step 108 the proofreader detects deviations between the first and the second speech signal. Such deviations indicate that an error has occurred in step 100, in which the first, natural speech signal has been transcribed to text. When the proofreader has detected an error in step 108 the correction of the detected error within the text has to be performed manually. In this way the proofreading, i.e. the comparison of the initial, natural speech signal and the transcribed text, is no longer based on a comparison of an acoustic and a visual signal. Instead the proofreader only has to listen to two different acoustic signals. Only in case an error has been detected does the proofreader have to find the corresponding text portion within the transcribed text and perform the correction.
Figure 2 is illustrative of a flow chart of an error detection method according to a preferred embodiment of the invention. Similarly to figure 1, in a first step 200 a text is transcribed from a first speech signal by a conventional speech to text transcription system. Based on the transcribed text, in the next step 202 an artificial speech signal is synthesized by means of a text to speech synthesizing system.
In order to facilitate a comparison between the two speech signals the first, natural speech signal is applied to a set of filter functions in step 204 to approximate the spectrum of the natural speech signal to the spectrum of the second, artificially generated speech signal. After that, the method either proceeds with step 206 or with step 208. In step 206 the filtered, first, natural speech signal as well as the second artificially generated speech signal are acoustically provided to the proofreader. In contrast, in step 208 the filtered, natural first speech signal and the second artificially generated speech signal are visually provided to the proofreader. After the first and second speech signals have been provided to the proofreader the method continues with step 210 in which the proofreader compares the first and the second speech signals acoustically and/or visually. In a next step 212 the proofreader detects errors in the generated text either by means of listening to the two different speech signals and/or by means of a graphical representation of the two speech signals. In the final step 214 the detected errors are manually corrected by the proofreader.
In figure 3 another flow chart illustrating an error detection method according to the present invention is shown. Again in a first step 300 a text is transcribed from a first, natural speech signal by means of a conventional speech to text transcription system. In a next step 302 the transcribed text is retransformed into a second speech signal by means of a text to speech synthesizing system. Similarly to figure 2, in step 304 the first, natural speech signal is applied to a set of filter functions in order to assimilate the sound and the spectrum of the first speech signal to the sound and to the spectrum of the artificially generated second speech signal. In the following step 306, a comparison signal between the first and second speech signal is generated by means of e.g. subtracting or superimposing the first and the second speech signal. Instead of providing the speech signals directly, the method is now restricted to providing the generated comparison signal. The comparison signal is either provided acoustically in step 308 or visually in step 310. Potential errors in the text can easily be detected in step 312 by means of the comparison signal. When for example the comparison signal has been generated by subtracting the two speech signals, a potential error in the text can easily be detected when the amplitude of the comparison signal is above a predefined threshold. After the detection of potentially erroneous text portions in step 312, the correction of detected errors can either be performed manually in step 318 or one can make use of alternative steps 314 and 316. In step 314 a pattern recognition is applied to the comparison signal. When distinct portions of the comparison signal match characteristic patterns that are stored in the system, the corresponding text portion of the transcribed text is identified as potentially erroneous. In the following step 316 those potentially erroneous text portions are assigned to a distinct type of error. The error information gathered in this way may be further exploited in order to generate correction suggestions to eliminate these errors in the transcribed text.
Figure 4 shows a block diagram of an error detection system for a speech to text transcription system. A first speech signal 400 is inputted into an error detection module 402. The error detection module 402 comprises means for a speech to text transcription and generates a text 412 which is outputted from the error detection module 402. Furthermore the error detection module 402 is connected to a graphical user interface 406 and to an acoustic user interface 404.
The error detection module 402 further comprises a speech synthesizing module 408, a speech to text transcription module 410, a text to speech transformation module 414 as well as a text 412, a first speech signal 418 and a second speech signal 416. Natural speech signal 400 representing a dictation is inputted into the speech synthesizing module 408 and into the speech to text transcription module 410 of the error detection module 402. The speech to text transcription module 410 transcribes the speech signal 400 into a text 412. The generated text 412 is outputted as a transcribed text as well as being further processed within the error detection module 402. The text 412 is therefore provided to the text to speech transformation module 414, which retransforms the transcribed text 412 to a second artificially generated speech signal 416. The text to speech transformation module 414 is based on conventional techniques that are known from text to speech synthesizing systems. The artificially generated speech signal 416 can now be compared with the initial, natural speech signal 400 entering the error detection module 402 by means of the acoustic user interface 404. The acoustic user interface 404 can for example be implemented by a stereo headphone. The natural speech signal 400 may be provided on the left channel of the stereo headphone whereas the artificially generated speech signal 416 may be provided on the right channel of the headphone. A human proofreader listening to both speech signals simultaneously can thus easily detect deviations between the two speech signals 400 and 416 that are due to misinterpretations or errors performed by the speech to text transcription module 410.
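The stereo presentation described above can be sketched with the Python standard library. The example assumes both signals are available as equally long sequences of 16-bit integer samples; the sample rate, the sample values and the function names are hypothetical and only serve the illustration.

```python
import io
import struct
import wave
from typing import List, Sequence


def interleave_stereo(left: Sequence[int], right: Sequence[int]) -> List[int]:
    """Interleave two mono signals as left/right sample pairs."""
    if len(left) != len(right):
        raise ValueError("both channels must have the same number of samples")
    frames: List[int] = []
    for left_sample, right_sample in zip(left, right):
        frames.append(left_sample)
        frames.append(right_sample)
    return frames


def write_stereo_wav(left: Sequence[int], right: Sequence[int], sample_rate: int) -> bytes:
    """Encode the natural signal on the left and the synthetic one on the right channel."""
    samples = interleave_stereo(left, right)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)   # stereo
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


# Hypothetical toy samples for signal 400 (natural) and signal 416 (synthetic).
natural = [0, 1000, -1000, 500]
synthetic = [0, 900, -1100, 400]
wav_bytes = write_stereo_wav(natural, synthetic, 16000)
```

The resulting bytes form a stereo WAV stream that could be played back over a headphone, with one signal per ear.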
Since a comparison between a natural speech signal 400 and a machine generated speech signal 416 might be confusing or awkwardly sounding to the proofreader, the natural speech signal 400 can be filtered by the speech synthesizing module 408, which applies a set of filter functions to the natural speech signal in order to assimilate the spectrum and the sound of the natural speech signal 400 to the synthesized speech signal 416. In this way the speech synthesizing module 408 transforms the natural speech signal 400 into a filtered speech signal 418. As described above, both speech signals, the filtered one 418 as well as the synthesized one 416, can be provided acoustically to the proofreader by means of the acoustic user interface 404. Additionally or alternatively the two generated speech signals can be provided in a graphical representation by means of the graphical user interface 406. With the help of the graphical representation of the speech signals 416 and 418, the proofreader may skip major parts of the transcribed text that have been transcribed correctly. Especially when the error detection module 402 provides a further processing of the two speech signals 416 and 418 by generating a comparison signal being indicative of large deviations between the two speech signals, the proofreading process and the detection and correction of errors produced by the speech to text transcription module 410 become more effective and less time consuming. A further processing of the generated comparison signal by means of pattern recognition, wherein distinct patterns can be assigned to particular types of errors, is of further advantage in order to facilitate the detection and correction tasks to be performed by the human proofreader.
LIST OF REFERENCE NUMERALS
400 First Speech Signal
402 Error Detection Module
404 Acoustic User Interface
406 Graphical User Interface
408 Speech Synthesizing Module
410 Speech to Text Transcription Module
412 Text
414 Text to Speech Transformation Module
416 Second Speech Signal
418 Filtered Speech Signal

CLAIMS:
1. A method for error detection within text transcribed from a first speech signal by an automatic speech-to-text transcription system, comprising synthesizing a second speech signal from the transcribed text, providing first and second speech signal outputs for a comparison between first and second speech signals for an identification of potential errors in the text.
2. The method according to claim 1, wherein the speed and/or the volume of the second speech signal matches the speed and/or the volume of the first speech signal.
3. The method according to claim 1 or 2, wherein a set of filter functions is applied to the first speech signal to approximate the spectrum of the first speech signal to the spectrum of the second speech signal.
4. The method according to any one of the claims 1 to 3, wherein the second speech signal is generated by applying an inverse speech transcription process, generating a feature vector sequence from the text, using (a) statistical models of the speech-to-text transcription system and (b) a state sequence obtained in the process of transcription of the text from the first speech signal.
5. The method according to any one of the claims 1 to 4, wherein a comparison signal is generated by subtracting or superimposing first and second speech signals.
6. The method according to claim 5, wherein the comparison signal is provided acoustically and/or visually.
7. The method according to claim 5 or 6, wherein an error indication is outputted when the amplitude of the comparison signal is beyond a predefined range.
8. The method according to claim 7, wherein the error indication is outputted visually within the transcribed text on a graphical user interface.
9. The method according to any one of the claims 5 to 8, further comprising a pattern recognition of the comparison signal in order to identify a pre-trained pattern of the comparison signal being indicative of a type of error in the text.
10. The method according to claim 9, wherein a correction suggestion is provided with a detected type of error in the generated text.
11. An error detection system for a speech-to-text transcription system providing a transcribed text (412) from a first speech signal (400), the error detection system comprising: means for synthesizing a second speech signal (416) from the transcribed text (412), means for providing first (400, 418) and second (416) speech signals for comparison between first and second speech signals for an identification of potential errors in the text (412).
12. The detection system according to claim 11, wherein a comparison signal is generated by means of subtracting or superimposing first (400, 418) and second (416) speech signals.
13. The detection system according to claim 11 or 12, wherein the first (400, 418) and second (416) speech signal and/or the comparison signal is provided acoustically or visually for error detection purpose.
14. The detection system according to claim 12 or 13, wherein an error indication is outputted when the comparison signal is beyond a predefined range.
15. The detection system according to any one of the claims 12 to 14, wherein a distinct pattern in the comparison signal is assigned to a certain type of error in the transcribed text (412) and a correction suggestion being provided with a detected type of error in the transcribed text.
16. A computer program product for error detection for a speech-to-text transcription system providing a transcribed text from a first speech signal, the computer program product comprising program means for: synthesizing a second speech signal from the transcribed text, matching speed and/or volume of the second speech signal to the speed and/or the volume of the first speech signal, providing first and second speech signal outputs for a comparison between first and second speech signals.
17. The computer program product according to claim 16, the computer program product comprising means for generating a comparison signal by means of subtracting or superimposing first and second speech signals.
18. The computer program product according to claim 16 or 17, the computer program product comprising means for providing the first and second speech signals and/or the comparison signal acoustically or visually for error detection purpose.
19. The computer program product according to claim 17 or 18, the computer program product comprising means for outputting an error indication when the comparison signal is beyond a predefined range.
20. The computer program product according to any one of the claims 17 to 19, the computer program product comprising means for assigning a distinct pattern in the comparison signal to a certain type of error in the transcribed text and providing a correction suggestion with a detected type of error in the transcribed text.

Priority Applications (5)

Application Number Priority Date Filing Date Title
US10/578,073 US7617106B2 (en) 2003-11-05 2004-10-27 Error detection for speech to text transcription systems
CN200480032825.6A CN1879146B (en) 2003-11-05 2004-10-27 Error detection for speech to text transcription systems
EP04791820A EP1702319B1 (en) 2003-11-05 2004-10-27 Error detection for speech to text transcription systems
DE602004018385T DE602004018385D1 (en) 2003-11-05 2004-10-27 ERROR DETECTION FOR LANGUAGE TO TEXT TRANSCRIPTION SYSTEMS
JP2006537527A JP4714694B2 (en) 2003-11-05 2004-10-27 Error detection in speech-text transcription systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP03104078 2003-11-05
EP03104078.5 2003-11-05

Publications (2)

Publication Number Publication Date
WO2005045803A1 true WO2005045803A1 (en) 2005-05-19
WO2005045803A8 WO2005045803A8 (en) 2006-08-10

Family

ID=34560196

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2004/052218 WO2005045803A1 (en) 2003-11-05 2004-10-27 Error detection for speech to text transcription systems

Country Status (7)

Country Link
US (1) US7617106B2 (en)
EP (1) EP1702319B1 (en)
JP (1) JP4714694B2 (en)
CN (1) CN1879146B (en)
AT (1) ATE417347T1 (en)
DE (1) DE602004018385D1 (en)
WO (1) WO2005045803A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8285334B2 (en) 2007-08-08 2012-10-09 Lg Electronics Inc. Mobile communications terminal for broadcast reception

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6910481B2 (en) * 2003-03-28 2005-06-28 Ric Investments, Inc. Pressure support compliance monitoring system
US9520068B2 (en) * 2004-09-10 2016-12-13 Jtt Holdings, Inc. Sentence level analysis in a reading tutor
US8014650B1 (en) * 2006-01-24 2011-09-06 Adobe Systems Incorporated Feedback of out-of-range signals
FR2902542B1 (en) * 2006-06-16 2012-12-21 Gilles Vessiere Consultants SEMANTIC, SYNTAXIC AND / OR LEXICAL CORRECTION DEVICE, CORRECTION METHOD, RECORDING MEDIUM, AND COMPUTER PROGRAM FOR IMPLEMENTING SAID METHOD
US9280971B2 (en) * 2009-02-27 2016-03-08 Blackberry Limited Mobile wireless communications device with speech to text conversion and related methods
CN102163379B (en) * 2010-02-24 2013-03-13 英业达股份有限公司 System and method for locating and playing corrected voice of dictated passage
US20150279354A1 (en) * 2010-05-19 2015-10-01 Google Inc. Personalization and Latency Reduction for Voice-Activated Commands
US10522133B2 (en) * 2011-05-23 2019-12-31 Nuance Communications, Inc. Methods and apparatus for correcting recognition errors
CA2869530A1 (en) * 2012-04-27 2013-10-31 Aravind GANAPATHIRAJU Negative example (anti-word) based performance improvement for speech recognition
CN102665012B (en) * 2012-05-02 2015-07-08 江苏南大数码科技有限公司 Device for automatically inspecting remote call voice inquiry platform failure
US9135916B2 (en) * 2013-02-26 2015-09-15 Honeywell International Inc. System and method for correcting accent induced speech transmission problems
US9712666B2 (en) 2013-08-29 2017-07-18 Unify Gmbh & Co. Kg Maintaining audio communication in a congested communication channel
US10069965B2 (en) 2013-08-29 2018-09-04 Unify Gmbh & Co. Kg Maintaining audio communication in a congested communication channel
KR101808810B1 (en) * 2013-11-27 2017-12-14 한국전자통신연구원 Method and apparatus for detecting speech/non-speech section
CN105374356B (en) * 2014-08-29 2019-07-30 株式会社理光 Audio recognition method, speech assessment method, speech recognition system and speech assessment system
US20160379640A1 (en) * 2015-06-24 2016-12-29 Honeywell International Inc. System and method for aircraft voice-to-text communication with message validation
JP6605995B2 (en) * 2016-03-16 2019-11-13 株式会社東芝 Speech recognition error correction apparatus, method and program
EP3504709B1 (en) 2016-10-20 2020-01-22 Google LLC Determining phonetic relationships
US10446138B2 (en) * 2017-05-23 2019-10-15 Verbit Software Ltd. System and method for assessing audio files for transcription services
CN109949828B (en) * 2017-12-20 2022-05-24 苏州君林智能科技有限公司 Character checking method and device
CN112567456A (en) * 2018-07-16 2021-03-26 万卷智能有限公司 Learning aid
KR102615154B1 (en) * 2019-02-28 2023-12-18 삼성전자주식회사 Electronic apparatus and method for controlling thereof
US11410658B1 (en) * 2019-10-29 2022-08-09 Dialpad, Inc. Maintainable and scalable pipeline for automatic speech recognition language modeling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0200389A2 (en) * 1985-04-08 1986-11-05 Kabushiki Kaisha Toshiba Proofreading/correction apparatus
EP0962914A2 (en) 1998-05-30 1999-12-08 GRUNDIG Aktiengesellschaft Method and apparatus for determining a confidence measure for speech recognition
US6219638B1 (en) * 1998-11-03 2001-04-17 International Business Machines Corporation Telephone messaging and editing system
US20030200093A1 (en) * 1999-06-11 2003-10-23 International Business Machines Corporation Method and system for proofreading and correcting dictated text

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2585547B2 (en) * 1986-09-19 1997-02-26 株式会社日立製作所 Method for correcting input voice in voice input / output device
JPH0488399A (en) * 1990-08-01 1992-03-23 Clarion Co Ltd Voice recognizer
GB2302199B (en) * 1996-09-24 1997-05-14 Allvoice Computing Plc Data processing method and apparatus
US6088674A (en) * 1996-12-04 2000-07-11 Justsystem Corp. Synthesizing a voice by developing meter patterns in the direction of a time axis according to velocity and pitch of a voice
US5987405A (en) * 1997-06-24 1999-11-16 International Business Machines Corporation Speech compression by speech recognition
JP3519259B2 (en) * 1997-12-29 2004-04-12 京セラ株式会社 Voice recognition actuator
US6490563B2 (en) * 1998-08-17 2002-12-03 Microsoft Corporation Proofreading with text to speech feedback
US6064965A (en) * 1998-09-02 2000-05-16 International Business Machines Corporation Combined audio playback in speech recognition proofreader
US6338038B1 (en) * 1998-09-02 2002-01-08 International Business Machines Corp. Variable speed audio playback in speech recognition proofreader
DE19920501A1 (en) * 1999-05-05 2000-11-09 Nokia Mobile Phones Ltd Speech reproduction method for voice-controlled system with text-based speech synthesis has entered speech input compared with synthetic speech version of stored character chain for updating latter
US6370503B1 (en) * 1999-06-30 2002-04-09 International Business Machines Corp. Method and apparatus for improving speech recognition accuracy
US7010489B1 (en) * 2000-03-09 2006-03-07 International Business Mahcines Corporation Method for guiding text-to-speech output timing using speech recognition markers
DE10304229A1 (en) * 2003-01-28 2004-08-05 Deutsche Telekom Ag Communication system, communication terminal and device for recognizing faulty text messages

Also Published As

Publication number Publication date
EP1702319B1 (en) 2008-12-10
CN1879146B (en) 2011-06-08
WO2005045803A8 (en) 2006-08-10
JP4714694B2 (en) 2011-06-29
US7617106B2 (en) 2009-11-10
US20070027686A1 (en) 2007-02-01
DE602004018385D1 (en) 2009-01-22
CN1879146A (en) 2006-12-13
ATE417347T1 (en) 2008-12-15
JP2007510943A (en) 2007-04-26
EP1702319A1 (en) 2006-09-20

Similar Documents

Publication Publication Date Title
EP1702319B1 (en) Error detection for speech to text transcription systems
JP4241376B2 (en) Correction of text recognized by speech recognition through comparison of speech sequences in recognized text with speech transcription of manually entered correction words
EP1317750B1 (en) Speech recognition method with a replace command
JP3263392B2 (en) Text processing unit
JP6654611B2 (en) Growth type dialogue device
JP6716300B2 (en) Minutes generation device and minutes generation program
WO2007055233A1 (en) Speech-to-text system, speech-to-text method, and speech-to-text program
US20080154591A1 (en) Audio Recognition System For Generating Response Audio by Using Audio Data Extracted
JP2015014665A (en) Voice recognition device and method, and semiconductor integrated circuit device
JP4736478B2 (en) Voice transcription support device, method and program thereof
US6546369B1 (en) Text-based speech synthesis method containing synthetic speech comparisons and updates
JPH10254475A (en) Speech recognition method
US8990092B2 (en) Voice recognition device
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
JP2008256942A (en) Data comparison apparatus of speech synthesis database and data comparison method of speech synthesis database
JP2011242637A (en) Voice data editing device
JP4296290B2 (en) Speech recognition apparatus, speech recognition method and program
JP3277579B2 (en) Voice recognition method and apparatus
US6934680B2 (en) Method for generating a statistic for phone lengths and method for determining the length of individual phones for speech synthesis
CN111696530B (en) Target acoustic model obtaining method and device
JP2001134276A (en) Speech to character conversion error detecting device and recording medium
EP1422691B1 (en) Method for adapting a speech recognition system
JP2975808B2 (en) Voice recognition device
JP2005037423A (en) Speech output device
JP2002287781A (en) Voice recognition system

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase Ref document number: 200480032825.6; Country of ref document: CN
AK Designated states Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW
AL Designated countries for regional patents Kind code of ref document: A1; Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase Ref document number: 2004791820; Country of ref document: EP
WWE WIPO information: entry into national phase Ref document number: 2006537527; Country of ref document: JP
WWE WIPO information: entry into national phase Ref document number: 2007027686; Country of ref document: US; Ref document number: 10578073; Country of ref document: US
WWP WIPO information: published in national office Ref document number: 2004791820; Country of ref document: EP
WWP WIPO information: published in national office Ref document number: 10578073; Country of ref document: US