JP2000029492A

JP2000029492A - Speech interpretation apparatus, speech interpretation method, and speech recognition apparatus

Info

Publication number: JP2000029492A
Application number: JP10193959A
Authority: JP
Inventors: Hiroaki Kokubo; 浩明小窪
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-07-09
Filing date: 1998-07-09
Publication date: 2000-01-28

Abstract

PROBLEM TO BE SOLVED: To provide an efficient method of correcting speech recognition errors in a speech interpretation apparatus. SOLUTION: The apparatus is provided with a speech recognition section 1001 which recognizes input speech and converts it into a character string, a speech storage section 1002 which stores the input speech, a character string storage section 1003 which stores the character string produced by the speech recognition section 1001, a language analysis section 1004 which analyzes the character string stored in the character string storage section 1003, an interpretation section 1005 which translates into another language in accordance with the analysis result of the language analysis section 1004, a phrase extraction section 1007 which, when the analysis by the language analysis section 1004 fails, extracts from the speech storage section 1002 the phrase of the input speech corresponding to the portion where the analysis failed, and a speech reproduction section 1008 which reproduces the speech extracted by the phrase extraction section 1007. Because the speech interpretation apparatus identifies the phrase of the input speech corresponding to the portion where the analysis failed, the user does not have to utter the entire input again.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a speech recognition apparatus that recognizes uttered speech, and to a speech translation apparatus that recognizes uttered speech and translates the recognition result into another language.

[0002]

DESCRIPTION OF THE RELATED ART

In recent speech recognition, the introduction of statistical language models, in addition to more accurate acoustic models, has made continuous speech recognition possible. Speech recognition technology is described in detail in Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers (IEICE), 1988.

[0003] A speech translation apparatus recognizes an uttered sentence using continuous speech recognition and translates the recognition result into another language by machine translation. Literature on speech translation apparatuses includes, for example, Morimoto et al., "System Configuration and Performance Evaluation of a Speech Translation System (ASURA)," Transactions of the Information Processing Society of Japan, Vol. 37, No. 9, pp. 1726-1735, 1996.

[0004] One problem with speech recognition apparatuses and speech translation apparatuses is recognition errors. In a speech translation apparatus in particular, a speech recognition error makes syntactic analysis impossible during machine translation, so a correct translation cannot be obtained. One countermeasure is the partial translation method: when a recognition error makes it impossible to parse the whole sentence, only the partial phrases that were parsed successfully are translated. This method is disclosed in, for example, Wakita et al., "A Post-Processing Method for Identifying Correct Parts of Speech Recognition Results Using Semantic Similarity and Its Introduction into Speech Translation," Information Processing Society of Japan Research Report 97-SLP-17, pp. 19-26, 1997. Another method feeds the speech recognition result back to the speaker before machine translation and, if there is a recognition error, asks the speaker to utter the sentence again.

[0005] Japanese Patent Application Laid-Open No. 7-129594 discloses an automatic interpretation system that, when it judges that the speech recognition unit did not produce an adequate recognition result, outputs supplementary information about the recognition result together with the result itself. For example, suppose that in recognizing the utterance 「わたしはがくせいです」 (watashi wa gakusei desu, "I am a student"), the system cannot decide whether the 「わ」 (wa) in 「わたし」 (watashi) is 「わ」 (wa) or 「か」 (ka). The fact that this ambiguity exists is then the supplementary information. To ask the speaker which of 「わ」 and 「か」 is correct, the system synthesizes both the sentence 「わたしはがくせいです」 and the sentence 「かたしはがくせいです」 (katashi wa gakusei desu) and poses the question by voice; where necessary, it also displays the question on a display unit as character codes.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

With the partial translation method, however, there is no guarantee that the partially translated result reflects the intent of the utterance as a whole. In some cases the output may be meaningless or may mislead the listener.

[0007] In the method that feeds the speech recognition result back to the speaker before machine translation and requests a repeated utterance when there is a recognition error, correcting only the erroneous part would require the translation apparatus to identify which part is to be corrected. In practice, therefore, the speaker usually has to utter the whole sentence again once the feedback reveals an error. Because longer sentences are more likely to contain recognition errors, it is preferable to be able to correct efficiently only the part where the error occurred.

[0008] Likewise, in the automatic interpretation system disclosed in Japanese Patent Application Laid-Open No. 7-129594, the user has no choice but to input the whole sentence again when the recognition result differs completely from the sentence the user input.

[0009] Accordingly, an object of the present invention is to provide a speech translation apparatus and a speech translation method that, when a speech recognition error occurs, efficiently correct only the part where the error occurred, and to provide a speech recognition apparatus that likewise efficiently corrects only the part where a recognition error occurred.

[0010]

MEANS FOR SOLVING THE PROBLEMS

To solve the above problems, the speech translation apparatus of the present invention comprises: a speech recognition unit that recognizes input speech and converts it into a character string; a speech storage unit that stores the input speech; a character string storage unit that stores the character string produced by the speech recognition unit; a language analysis unit that analyzes the character string stored in the character string storage unit; a translation unit that translates into another language based on the analysis result of the language analysis unit; a phrase extraction unit that, when the language analysis unit fails in its analysis, extracts from the speech storage unit the phrase of the input speech corresponding to the part where the analysis failed; and a speech reproduction unit that reproduces the speech extracted by the phrase extraction unit.

[0011] The speech recognition unit also recognizes the phrase that the user repeats, that is, the phrase of the input speech corresponding to the part where the analysis failed, and converts it into a character string. The translation unit then translates the character string obtained by taking the character string for the original input speech and replacing the substring corresponding to the failed part with the character string recognized from the repeated phrase.
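As an illustration of the replacement described in this paragraph, the following minimal Python sketch splices the re-recognized phrase into the original recognition result. The function name `replace_failed_phrase`, the token-list representation, and the span indices are assumptions made for this example and do not appear in the specification; the sample tokens are taken from the misrecognition example given in the first embodiment.

```python
def replace_failed_phrase(tokens, failed_span, repeated_tokens):
    """Return a new token list in which tokens[start:end] is replaced.

    tokens          -- first-pass recognition result as a list of morphemes
    failed_span     -- (start, end) indices of the part whose analysis failed
    repeated_tokens -- recognition result for the phrase the user repeated
    """
    start, end = failed_span
    if not (0 <= start <= end <= len(tokens)):
        raise ValueError("failed_span lies outside the token list")
    return tokens[:start] + list(repeated_tokens) + tokens[end:]


# Hypothetical first-pass result containing a recognition error.
first_pass = ["ホテル", "臭う", "多摩", "で", "行き", "たい", "の", "です", "が"]
# Hypothetical recognition result for the repeated phrase.
repeated = ["ホテル・ニューオータニ", "まで"]

corrected = replace_failed_phrase(first_pass, (0, 4), repeated)
print("".join(corrected))
```

Only the failed span is re-recognized; the rest of the first-pass result is reused unchanged, which is the efficiency gain the specification aims for.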

[0012] The language analysis unit performs its analysis based on the semantic relationships between bunsetsu (Japanese phrasal units).

[0013] A prosody information estimation unit that extracts prosodic information from the input speech is also provided, and the phrase extraction unit uses the prosodic information to extract from the speech storage unit the phrase of the input speech corresponding to the part where the analysis failed.

[0014] The speech reproduction unit has a voice quality conversion function and uses it to convert the voice quality of the speech extracted by the phrase extraction unit before reproducing it.

[0015] The apparatus is further characterized by having a display unit that displays the character string corresponding to the input speech with emphasis on the part corresponding to the phrase of the input speech where the analysis failed.

[0016] To solve the above problems, the speech translation method of the present invention comprises the steps of: recognizing input speech and converting it into a character string; storing the input speech in a speech storage unit; storing the converted character string in a character string storage unit; analyzing the character string stored in the character string storage unit; when the analysis fails, extracting from the speech storage unit the phrase of the input speech corresponding to the part where the analysis failed; and reproducing the extracted speech.

[0017] The method may further comprise the steps of: recognizing the phrase that the user repeats, namely the phrase of the input speech corresponding to the part where the analysis failed, and converting it into a character string; and translating the character string obtained by replacing, within the character string corresponding to the input speech, the substring corresponding to the failed part with the character string recognized from the repeated phrase.

[0018] The method may further comprise the step of analyzing the character string stored in the character string storage unit based on the semantic relationships between bunsetsu (Japanese phrasal units).

[0019] The method may further comprise the steps of extracting prosodic information from the input speech and, based on the prosodic information, extracting from the speech storage unit the phrase of the input speech corresponding to the part where the analysis failed.

[0020] The method may further comprise the step of converting the voice quality of the extracted speech before reproducing it. It may also comprise the step of displaying, on a display unit, the character string corresponding to the input speech with emphasis on the part corresponding to the phrase of the input speech where the analysis failed.

[0021] To solve the above problems, the speech recognition apparatus of the present invention comprises: a speech recognition unit that recognizes input speech and converts it into a character string; a speech storage unit that stores the input speech; a character string storage unit that stores the character string produced by the speech recognition unit; a language analysis unit that analyzes the character string stored in the character string storage unit; a phrase extraction unit that, when the language analysis unit fails in its analysis, extracts from the speech storage unit the phrase of the input speech corresponding to the part where the analysis failed; and a speech reproduction unit that reproduces the speech extracted by the phrase extraction unit.

[0022] The speech recognition unit also recognizes the phrase that the user repeats, that is, the phrase of the input speech corresponding to the part where the analysis failed, and converts it into a character string. Within the character string stored in the character string storage unit for the input speech, it then replaces the substring corresponding to the failed part with the character string recognized from the repeated phrase.

[0023] The apparatus is further characterized by having a display unit that displays the character string corresponding to the input speech with emphasis on the part corresponding to the phrase of the input speech where the analysis failed.

[0024]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

(First Embodiment) An embodiment of the present invention is described below.

[0025] FIG. 1 is a block diagram illustrating this embodiment. The configuration of the speech translation apparatus and its processing flow are first outlined with reference to FIG. 1. The input speech is stored in the speech storage unit 1002 and, at the same time, recognized by the speech recognition unit 1001, which outputs the recognition result as a character string. The configuration of the speech recognition unit 1001 is shown in FIG. 2 and described in detail later. The recognition result is stored in the character string storage unit 1003 and simultaneously sent to the language analysis unit 1004 for syntactic analysis. The language analysis unit 1004 is described in detail later with reference to FIG. 3. The result of the syntactic analysis is sent to the translation unit 1005, where translation is performed. In machine translation, example-based methods are the mainstream; one such method is TDMT (Transfer-Driven Machine Translation). For TDMT, see, for example, Furuse et al., "Constituent Boundary Parsing for Example-Based Machine Translation," in Proc. of COLING 94, pp. 105-111, 1994. The translation produced by the translation unit 1005 is presented to the user by the translation presentation unit 1006. Specific examples of the translation presentation unit 1006 are a display that presents the translation as a character string and a speech synthesizer that outputs it as synthesized speech.

[0026] The speech storage unit 1002 and the character string storage unit 1003 may be implemented on the same storage medium or on separate storage media.

[0027] FIG. 2 is a block diagram illustrating the speech recognition unit. The speech detection unit 2001 detects speech intervals in the input signal. A common detection method computes the short-time power of the input signal and compares the time series of power values with a threshold. Some methods additionally use the zero-crossings of the signal waveform. Speech interval detection is described in detail in Furui, "Digital Speech Processing," Tokai University Press, 1985. The speech signal detected by the speech detection unit 2001 is divided into short intervals (typically ten to several tens of milliseconds) and sent to the feature analysis unit 2002, which extracts a feature vector from each short-time signal. The LPC cepstrum is often adopted as the feature vector for speech recognition; for how to compute it, see the book by Furui cited above. Based on the acoustic model 2004, the word dictionary 2005, and the language model 2006, the matching unit 2003 computes, for each sentence candidate that can be generated, the conditional probability of observing the time-series data of the input speech. The acoustic model 2004 represents the feature parameter sequences of speech segmented into phoneme units as hidden Markov models (HMMs). The word dictionary 2005 converts the vocabulary to be recognized into sequences of phoneme symbols. The language model 2006 models the co-occurrence probabilities between words as 2-grams and 3-grams. Many search strategies have been proposed for the matching unit 2003, including A* search, beam search, and Viterbi search. For A* search, see Kawahara et al., "Word Spotting in Conversational Speech Using a Heuristic Language Model," Trans. IEICE, Vol. J78-D-II, No. 7, pp. 1013-1020, Jul. 1995; for beam search, see Ney et al., "A Data-Driven Organization of the Dynamic Programming Beam Search for Continuous Speech Recognition," Proc. ICASSP 87, pp. 833-836, 1987. Among the probabilities computed by the matching unit 2003 for the sentence candidates, the candidate with the highest probability is taken as the recognition result for the input speech.
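The short-time power method of speech detection mentioned in this paragraph can be sketched as follows. This is an illustrative sketch only: the function names, the non-overlapping framing, and the threshold value are assumptions made for the example, and the zero-crossing refinement is omitted.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples, frame_len, threshold):
    """Return (start_frame, end_frame) pairs, end exclusive, whose power exceeds threshold."""
    powers = short_time_power(samples, frame_len)
    segments, start = [], None
    for idx, power in enumerate(powers):
        if power > threshold and start is None:
            start = idx                      # speech begins
        elif power <= threshold and start is not None:
            segments.append((start, idx))    # speech ends
            start = None
    if start is not None:                    # signal ended while still in speech
        segments.append((start, len(powers)))
    return segments


# Synthetic signal: silence, a burst of alternating samples, silence.
signal = [0.0] * 20 + [0.5, -0.5] * 10 + [0.0] * 20
speech_segments = detect_speech(signal, frame_len=10, threshold=0.01)
print(speech_segments)
```

A practical detector would typically use overlapping frames and some hangover time so that short pauses inside an utterance do not split it.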

[0028] FIG. 3 is a block diagram illustrating the language analysis unit 1004. The language analysis unit 1004 consists of a parser 3001, an intra-bunsetsu grammar 3002, and a dependency grammar 3003. In language processing, the smallest unit of language is called a morpheme, so the word sequence obtained as the speech recognition result is referred to here as a morpheme sequence.
It consists of. In the field of language processing, the smallest unit of language is called a morpheme. Therefore, the word sequence obtained as a result of the speech recognition will be referred to as a morpheme sequence here.

[0029] A Japanese sentence can be regarded as a concatenation of units called bunsetsu, each consisting of one or more content words, zero or more function words, and punctuation. The intra-bunsetsu grammar 3002 models the relationship between morpheme concatenations and bunsetsu. One example is the morpheme n-gram model, which formulates the chain of morphemes as a Markov model. A grammar expressed in this way, as a probabilistic model of morpheme chains, is called a probabilistic language model. The dependency grammar 3003, on the other hand, models the relationships between bunsetsu, called dependencies, using for example a context-free grammar (CFG). The parser 3001 uses both grammars, the intra-bunsetsu grammar 3002 and the dependency grammar 3003, to analyze the bunsetsu-level dependency structure of the morpheme sequence obtained from speech recognition. FIG. 4 shows an example of an analysis result. In this example, the intra-bunsetsu grammar 3002 groups the morpheme sequence ホテル・ニューオータニ, まで, 行き, たい, の, です, が into two bunsetsu, 「ホテル・ニューオータニまで」 ("to Hotel New Otani") and 「行きたいのですが」 ("I would like to go"), and the dependency grammar 3003 then defines the dependency relation between them. In this way the language analysis unit 1004 analyzes the input sentence, generates a structure tree with a dependency structure, and outputs it to the translation unit 1005.
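The grouping of morphemes into bunsetsu can be illustrated with a small sketch. The specification gives a probabilistic morpheme n-gram model as an example of the intra-bunsetsu grammar; the code below swaps in a much simpler rule-based stand-in that follows only the definition "one or more content words followed by zero or more function words." The function name and the content/function labels are assumptions made for illustration.

```python
def chunk_bunsetsu(morphemes):
    """Group (surface, kind) pairs into bunsetsu.

    kind is "content" or "function". A new bunsetsu starts whenever a
    content word follows a function word. A function word with no
    preceding content word cannot form a bunsetsu, so it raises ValueError.
    """
    chunks, current, seen_function = [], [], False
    for surface, kind in morphemes:
        if kind == "content" and current and seen_function:
            chunks.append(current)           # close the previous bunsetsu
            current, seen_function = [], False
        if kind == "function":
            if not current:
                raise ValueError(f"function word {surface!r} has no content word")
            seen_function = True
        current.append(surface)
    if current:
        chunks.append(current)
    return chunks


# Morpheme sequence from the FIG. 4 example, with assumed labels.
sentence = [
    ("ホテル・ニューオータニ", "content"), ("まで", "function"),
    ("行き", "content"), ("たい", "function"), ("の", "function"),
    ("です", "function"), ("が", "function"),
]
bunsetsu = ["".join(chunk) for chunk in chunk_bunsetsu(sentence)]
print(bunsetsu)
```

A dependency grammar would then be applied to the resulting bunsetsu to decide which one modifies which.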

[0030] The functions of the translation unit 1005 and the subsequent units have already been described with reference to FIG. 1.

[0031] The above is an outline of the speech translation processing in the speech translation apparatus.

[0032] Next, the problems that arise when a speech recognition error occurs, and the method this embodiment uses to correct such errors, are described.

[0033] The problem with a speech translation apparatus is that speech recognition does not always produce the correct result. However good the speech recognizer, recognition errors cannot be avoided entirely. When a recognition error occurs, the apparatus either produces an incorrect translation or fails in language analysis, so that translation cannot be carried out at all. Suppose, for example, that the input utterance 「ホテル・ニューオータニまで行きたいのですが」 ("I would like to go to Hotel New Otani") is misrecognized as 「ホテル臭う多摩で行きたいのですが」 (roughly "hotel / smells / in Tama / I would like to go"). In this case the language analysis unit 1004 fails in its analysis. FIG. 5 shows the analysis result for such a failure. No bunsetsu can be formed from the misrecognized morphemes ホテル, 臭う, 多摩, で, so no bunsetsu NP (destination) is found that can enter into a dependency relation with the bunsetsu 「行きたいのですが」. When the language analysis unit 1004 fails to generate a parse tree, it has no analysis result to pass to the translation unit 1005, and translation cannot be executed.

[0034] In such a case, the speech translation apparatus of this embodiment estimates the part that prevented language analysis from completing and asks the user to utter it again. The part that caused the language analysis to fail is hereinafter called the ineligible part. The ineligible part is estimated by extracting, for example, a part for which the intra-bunsetsu grammar could not form a bunsetsu, or a bunsetsu for which no dependency relation was found and which was therefore left out of the parse tree. Alternatively, the correct-part identification method described in the aforementioned paper by Wakita et al. may be applied, with any part not identified as correct treated as ineligible. As that paper describes, the use of semantic attributes makes it possible to estimate that 「ホタルまで」 ("to the firefly") is the ineligible part when the utterance 「ホテルまで行きたいのですが」 ("I would like to go to the hotel") is misrecognized as 「ホタルまで行きたいのですが」 ("I would like to go to the firefly"), even though the misrecognized sentence is syntactically correct.

[0035] The ineligible part identified by the language analysis unit 1004 is sent to the phrase extraction unit 1007. From the uttered speech stored in the speech storage unit 1002, the phrase extraction unit 1007 extracts the phrase corresponding to the ineligible part it has received. The extraction uses the matching information obtained from the speech recognition unit 1001. In speech recognition, each phoneme is modeled as a state transition model called an HMM, and among the state transition models obtained by concatenating HMMs according to phoneme sequences, the phoneme sequence with the highest probability is taken as the recognition result. In the Viterbi algorithm, which is widely used for score computation in speech recognition, let the feature vector sequence of the input speech be given by (Equation 1) and the HMM state transition sequence by (Equation 2); the probability is then obtained by (Equation 3).

[0036]

(Equation 1)

[0037]

(Equation 2)

[0038]

(Equation 3)
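Equations 1 to 3 are not reproduced in this text. As general background rather than a reconstruction of the patent's own notation, the following sketch implements the textbook Viterbi algorithm in the log domain: given initial, transition, and emission log-probabilities, it returns the probability of the single best state sequence together with that sequence. All names and the toy two-state model are assumptions made for illustration.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Best state path through an HMM.

    log_init[s]      -- log P(first state = s)
    log_trans[p][s]  -- log P(next state = s | current state = p)
    log_emit[t][s]   -- log P(observation at frame t | state = s)
    Returns (best_log_prob, path) with one state index per frame.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s] + log_emit[t][s])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):          # trace back-pointers to frame 0
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


NEG_INF = -math.inf
# Toy left-to-right model with two states and four frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
    [math.log(0.1), math.log(0.9)],
]
best_log_prob, path = viterbi(log_init, log_trans, log_emit)
print(path, math.exp(best_log_prob))
```

Because each frame is assigned exactly one state on the returned path, the path doubles as a time alignment between the audio frames and the recognized units.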

【００３９】（数３）によれば、任意の時間Ｔ＝ｔにお
いて、入力音声の特徴ベクトルｙｉは、唯一の状態遷移
に対応付けられることになる。この遷移系列をビタビパ
スと呼ぶ。図６は発声音声と音声認識結果との対応を説
明するための図である。縦軸は音声認識結果の音韻系列
に対応してＨＭＭを結合した状態遷移系列である。この
例のように、ビタビパス６００１を参照することによっ
て、不適格部分６００２に対応する発声音声のフレーズ
６００３を容易に抽出することができる。フレーズ抽出
部１００７で切り出された発声音声の部分データは、音
声再生部１００８に送られ音声として再生される。音声
再生部１００８は量子化データとして格納れている音声
データをアナログ信号に変換するＤ／Ａ変換器と、変換
されたアナログ信号を音として再生するためのアンプ、
スピーカにより構成される。この時、普段聞きなれてい
ない自分の音声を聴取することに対し、違和感を覚える
利用者も少なくない。そのような違和感を回避するため
には、音声再生部１００８の中に声質変換機能を追加す
れば良い。声質変換の手法は、音声認識における話者適
応と同様な戦略により、話者Ａと話者Ｂのそれぞれの音
声に対する特徴ベクトルの対応付けを求めた変換ベクト
ルを適用すれば良い。声質変換に関する詳細について
は、橋本，他，話者選択と移動ベクトル平滑化を用いた
声質変換のためのスペクトル写像，信学技報，ｓｐ９５
−１，ＰＰ．１−８，１９９５などを参考にされたい。According to (Equation 3), at any time T = t the feature vector yi of the input speech is associated with exactly one state transition. This transition sequence is called the Viterbi path. FIG. 6 illustrates the correspondence between the uttered speech and the speech recognition result. The vertical axis is the state transition sequence obtained by concatenating the HMMs corresponding to the phoneme sequence of the recognition result. By referring to the Viterbi path 6001, as in this example, the phrase 6003 of the uttered speech corresponding to the ineligible part 6002 can be extracted easily. The partial speech data cut out by the phrase extraction unit 1007 is sent to the speech reproduction unit 1008 and played back as speech. The speech reproduction unit 1008 comprises a D/A converter that converts the speech data, stored as quantized data, into an analog signal, and an amplifier and a speaker that reproduce the converted analog signal as sound. Not a few users feel uncomfortable hearing their own voice, which they are not used to hearing. To avoid this discomfort, a voice quality conversion function may be added to the speech reproduction unit 1008. Voice quality conversion can follow the same strategy as speaker adaptation in speech recognition: a conversion vector is applied that is obtained from the correspondence between the feature vectors of the speech of speaker A and of speaker B. For details on voice quality conversion, see, for example, Hashimoto et al., "Spectral Mapping for Voice Conversion Using Speaker Selection and Moving Vector Smoothing," IEICE Technical Report, SP95-1, pp. 1-8, 1995.
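
As an illustration of the phrase extraction described above, the following Python sketch maps an ineligible part onto a range of speech frames by way of a Viterbi path. The data layout is an assumption made for this example and is not specified in the patent: `viterbi_path` holds one HMM state index per frame, `state_to_unit` maps each state to the index of the recognized unit (for example a morpheme) it belongs to, and `bad_units` is the set of unit indices flagged by the language analysis. All function names are hypothetical.

```python
def phrase_frame_range(viterbi_path, state_to_unit, bad_units):
    """Return (first_frame, last_frame + 1) covered by the flagged units, or None."""
    frames = [t for t, state in enumerate(viterbi_path)
              if state_to_unit[state] in bad_units]
    if not frames:
        return None
    return frames[0], frames[-1] + 1


def frames_to_samples(frame_range, hop_size):
    """Convert a frame range to an approximate sample range (fixed frame shift assumed)."""
    start, end = frame_range
    return start * hop_size, end * hop_size


# Hypothetical alignment: 9 frames, 4 states; states 0-1 form unit 0, states 2-3 form unit 1.
path = [0, 0, 1, 1, 2, 2, 2, 3, 3]
state_to_unit = [0, 0, 1, 1]
span = phrase_frame_range(path, state_to_unit, {0})
```

The resulting sample range could then be used to cut the stored waveform before playback.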

[0040] The user is asked to repeat the phrase "ホテル・ニューオータニまで" ("to Hotel New Otani") played back by the speech reproduction unit 1008. This repetition may be established as a convention described in a manual or the like, or guidance such as "Please repeat" may be presented.

[0041] A display unit may be provided to display the character string (recognition result) output by the speech recognition unit 1001. When the speech reproduction unit 1008 plays back the phrase "ホテル・ニューオータニまで" ("to Hotel New Otani"), the display unit may show the recognition result with the ineligible part emphasized, for example as "×××行きたいのですが" (the ineligible part masked, followed by "I would like to go"). Emphasizing the ineligible part and making clear what has to be corrected helps the user repeat the phrase. Displaying the recognition result also lets the user judge whether the phrase that the speech translation apparatus identified as ineligible is appropriate, and whether there are recognition errors outside that phrase. If the user judges that the identified phrase is inappropriate, or that there are recognition errors elsewhere, the user can abandon correction by repetition and choose to redo the entire recognition.

[0042] If the translation result presentation unit 1006 is implemented as a display, that display may also serve as the display unit for the recognition result.

[0043] The phrase repeated by the user, "ホテル・ニューオータニまで" ("to Hotel New Otani"), is input to the speech recognition unit 1001 again, and its recognition result "ホテル・ニューオータニまで" is output to the character string storage unit 1003. At this point the character string storage unit holds the previous recognition result "ホテル臭う多摩で行きたいのですが" (a misrecognition, roughly "hotel smells, in Tama, I would like to go"). The history information of the language analysis unit 1004 shows that "ホテル臭う多摩で" is the ineligible part. The ineligible part "ホテル臭う多摩で" is therefore replaced with the newly input "ホテル・ニューオータニまで". The resulting morpheme sequence "ホテル・ニューオータニ, まで, 行き, たい, の, です, が" is output to the language analysis unit 1004. As described with reference to FIG. 4, the language analysis unit 1004 can analyze this new input correctly. The correctly analyzed result is sent to the translation unit 1005, which completes the speech translation.
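
The replacement step in this paragraph can be sketched in Python as a list splice. This is a minimal illustration, not the patent's implementation: the function name `replace_ineligible`, the representation of the recognition result as a list of morphemes, and the morpheme segmentation of the misrecognized sentence are all assumptions made for the example.

```python
def replace_ineligible(morphemes, span, repeated):
    """Return a new list with morphemes[span[0]:span[1]] replaced by `repeated`."""
    start, end = span
    return morphemes[:start] + repeated + morphemes[end:]


# Previous (misrecognized) result; the segmentation shown here is assumed.
previous = ["ホテル", "臭う", "多摩", "で", "行き", "たい", "の", "です", "が"]
ineligible_span = (0, 4)  # covers "ホテル臭う多摩で"
repeated = ["ホテル・ニューオータニ", "まで"]  # recognition result of the repeated phrase

corrected = replace_ineligible(previous, ineligible_span, repeated)
```

Returning a new list leaves the stored recognition result untouched, which keeps the earlier result available if the user decides to redo the whole recognition instead.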

[0044] FIG. 7 shows the sequence described above. The left side of the figure shows the dialogue sequence between the user and the translation apparatus, and the right side shows the internal operation of the speech translation apparatus.

[0045] Thus, when language analysis fails because a speech recognition error occurred in the user's utterance, the apparatus identifies the uttered phrase corresponding to the ineligible part and has the user repeat only that part. The part affected by the recognition error can therefore be corrected efficiently, and a correct translation result can be generated.

[0046] (Second Embodiment) In the first embodiment, the part that caused the analysis error in the language analysis unit was taken directly as the ineligible phrase. However, for a sentence containing a speech recognition error, there is no guarantee that phrases are formed in semantically coherent units that match the speaker's intended utterance. For example, suppose the utterance "ホテル・ニューオータニまで、行きたいのですが" ("I would like to go to Hotel New Otani") is misrecognized as "ホテルに大田まで行きたいのですが" (roughly "to the hotel, as far as Ota, I would like to go"). The phrases "ホテルに" ("to the hotel") and "大田まで" ("as far as Ota") are generated, and dependency analysis fails for "ホテルに". As a result, the user is guided to repeat only the fragment "ホテル・ニュー" ("Hotel New"), and the part "ホテル・ニューオータニまで" cannot be corrected properly. A second embodiment, which adds a function to prevent this kind of malfunction, is described next.

[0047] FIG. 8 is a block diagram of the second embodiment of the present invention. It differs from the first embodiment described with reference to FIG. 1 in that a prosody information estimation unit 8009 is added. The prosody information estimation unit 8009 analyzes the input speech signal and extracts prosody information. FIG. 9 shows one embodiment of the prosody information estimation unit 8009. In this example it comprises a power calculation unit 9001 that calculates the short-time power of the input signal, a pitch extraction unit 9002 that extracts the pitch of the speech, and a prosody information integration unit 9003 that integrates the power information and the pitch information to determine the prosodic structure of the sentence. Pitch extraction means estimating the fundamental frequency of vocal fold vibration. Many methods have been developed, such as extraction based on the zero-crossing count of the speech waveform and extraction from the autocorrelation coefficients of the residual signal of LPC analysis. For details, see the previously cited book by Furui, Digital Signal Processing.
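
The two measurements named in this paragraph can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the pitch estimator takes the autocorrelation of the waveform itself rather than of the LPC residual mentioned in the text, and the function names, the frame length, and the 60 to 400 Hz search range are choices made for this example, not values from the patent.

```python
import math


def short_time_power(samples, frame_len, hop):
    """Mean squared amplitude of each frame of `frame_len` samples, stepped by `hop`."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 as sample_rate / lag, using the lag with the largest autocorrelation."""
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # 0.0 signals that no positive autocorrelation peak was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic test tone: 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
powers = short_time_power(tone, frame_len=400, hop=400)
f0 = autocorr_pitch(tone[:400], sample_rate)
```

On real speech a voiced/unvoiced decision and some smoothing of the pitch track would normally be needed; both are omitted here to keep the sketch short.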

[0048] FIG. 10 shows example outputs of the power calculation unit 9001 and the pitch extraction unit 9002. The upper panel shows the power information and the middle panel the pitch information. In the power information, portions where the power drops are estimated to be pauses. The pitch information is determined by combining a speech tone component (the basic intonation component), in which the pitch frequency declines over time from the onset of voicing, with the pitch pattern specific to each word or phrase (the accent component). The prosody information integration unit 9003 takes the boundaries given by such pauses and the breaks in the pitch pattern as phrase units and determines a sentence structure based on the prosody information. The lower panel of FIG. 10 shows an example of a structure tree determined from this prosody information.
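
The pause-based segmentation described in this paragraph can be sketched in Python as follows. The sketch covers only the power criterion, not the pitch-pattern breaks, and the threshold, the minimum pause length, and the function name `segment_by_pauses` are assumptions made for this example.

```python
def segment_by_pauses(powers, threshold, min_pause_frames=2):
    """Split a per-frame power sequence into phrase segments.

    Frames with power below `threshold` count as quiet; a run of at least
    `min_pause_frames` quiet frames is treated as a pause that ends a phrase.
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    segments = []
    start = None  # first frame of the phrase currently being collected
    quiet = 0     # length of the current run of quiet frames inside a phrase
    for i, p in enumerate(powers):
        if p >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause_frames:
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Close the final phrase, dropping any trailing quiet frames.
        segments.append((start, len(powers) - quiet))
    return segments


example = segment_by_pauses([0, 0, 5, 6, 5, 0, 0, 0, 7, 8, 0], threshold=1)
```

Requiring a minimum pause length keeps a single low-power frame, such as a stop closure, from splitting a phrase.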

[0049] Based on the prosody information obtained from the prosody information estimation unit 8009, the phrase extraction unit 8007 takes as the playback phrase the subtree of the prosodic structure that contains the error part extracted by the language analysis unit 8004 (the part that caused language analysis to fail). For example, for the error part corresponding to "ホテル・ニュー" ("Hotel New"), the utterance "ホテルニューオータニまで" ("to Hotel New Otani"), which is the prosodic subtree containing that part, becomes the phrase that the phrase extraction unit 8007 extracts for repetition. The user then repeats "ホテル・ニューオータニまで", and "ホテル・ニューオータニまで" is obtained as the new recognition result. This new result replaces the corresponding part "ホテルに大田まで" of the first recognition result "ホテルに大田まで行きたいのですが" stored in the character string storage unit 8003. The recognition result corrected in this way is parsed by the language analysis unit 8004 and then translated into the other language by the translation unit 8005.
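
The subtree selection described in this paragraph can be sketched in Python as a search for the smallest node of the prosodic structure tree that fully contains the error part. The tree representation (`ProsodyNode` with frame-based `start` and `end`), the function name, and the frame numbers in the example are hypothetical; the patent does not specify a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProsodyNode:
    start: int  # first frame covered by this prosodic unit
    end: int    # one past the last frame covered
    children: List["ProsodyNode"] = field(default_factory=list)


def smallest_enclosing(node: ProsodyNode, err_start: int, err_end: int) -> Optional[ProsodyNode]:
    """Return the deepest node whose span contains [err_start, err_end), or None."""
    if not (node.start <= err_start and err_end <= node.end):
        return None
    for child in node.children:
        found = smallest_enclosing(child, err_start, err_end)
        if found is not None:
            return found
    return node


# Hypothetical tree: the whole utterance splits into two prosodic phrases,
# and the first phrase splits again into two smaller units.
tree = ProsodyNode(0, 100, [
    ProsodyNode(0, 45, [ProsodyNode(0, 20), ProsodyNode(20, 45)]),
    ProsodyNode(45, 100),
])
# An error part that straddles the two smaller units maps to the first phrase.
target = smallest_enclosing(tree, 5, 30)
```

Because the error span crosses the boundary between the two smaller units, the search falls back to their parent, which mirrors how a fragment such as "ホテル・ニュー" is widened to the full prosodic phrase.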

[0050] Thus, in the second embodiment of the present invention, as in the first, when language analysis fails because a speech recognition error occurred in the user's utterance, the apparatus identifies the uttered phrase corresponding to the ineligible part and has the user repeat only that part, so that the part affected by the recognition error is corrected efficiently and a correct translation result can be generated. Furthermore, even if a recognition error causes the parser to place a phrase boundary at an inappropriate position, the part to be repeated is determined using the structure obtained from the prosody information, so an appropriate phrase can be presented for repetition.

[0051] (Third Embodiment) The first and second embodiments described speech translation apparatuses. This embodiment describes a speech recognition apparatus.

[0052] The speech recognition apparatus of this embodiment comprises the speech recognition unit 1001, speech storage unit 1002, character string storage unit 1003, language analysis unit 1004, phrase extraction unit 1007, and speech reproduction unit 1008 of FIG. 1. The input speech is stored in the speech storage unit 1002, while the speech recognition unit 1001 recognizes it and outputs the recognition result as a character string. The recognition result is stored in the character string storage unit 1003 and at the same time sent to the language analysis unit 1004, where it is parsed. The ineligible part identified by the language analysis unit 1004 is sent to the phrase extraction unit 1007, which extracts, from the uttered speech stored in the speech storage unit 1002, the phrase corresponding to that ineligible part. The partial speech data cut out by the phrase extraction unit 1007 is sent to the speech reproduction unit 1008 and played back. The user is asked to repeat the speech played back by the speech reproduction unit 1008. The speech repeated by the user is input to the speech recognition unit 1001 again, and its recognition result is output to the character string storage unit 1003. The ineligible part is then replaced with the recognition result of the repeated speech. This processing makes it possible to efficiently correct the part where the speech recognition error occurred.
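
The processing flow of this embodiment can be summarized in a short Python sketch of a single correction round. Every name here is hypothetical: the components are passed in as plain callables (`recognize`, `analyze`, `extract`, `play`, `record`) standing in for the units of FIG. 1, and the recognition result is assumed to be a list of units with the ineligible part given as an index range.

```python
def correct_once(audio, recognize, analyze, extract, play, record):
    """Run one recognize / analyze / replay / repeat / replace round.

    recognize(audio) -> list of recognized units
    analyze(units)   -> (start, end) of the ineligible part, or None if analysis succeeds
    extract(audio, span) -> the portion of the stored speech for that part
    play(segment)    -> plays the extracted speech back to the user
    record()         -> the user's repeated utterance
    """
    units = recognize(audio)
    span = analyze(units)
    if span is None:
        return units  # analysis succeeded, nothing to correct
    play(extract(audio, span))
    repeated = recognize(record())
    start, end = span
    return units[:start] + repeated + units[end:]
```

Only one round is modeled because, after the replacement, the stored speech no longer lines up with the corrected character string; handling further rounds would need additional bookkeeping that the text does not describe.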

[0053]

[Effects of the Invention] According to the present invention, when language analysis fails in a speech translation apparatus because a speech recognition error occurred in the user's utterance, the uttered phrase corresponding to the ineligible part is identified, so the user need not re-utter the entire input speech. In addition, by having the user repeat the identified part, the part affected by the recognition error is corrected efficiently and a correct translation result can be generated.

[0054] In addition, the speech recognition apparatus of the present invention makes it possible to efficiently correct the part where a speech recognition error has occurred.

[Brief description of the drawings]

FIG. 1 is a block diagram illustrating an embodiment of the speech translation apparatus.

FIG. 2 is a block diagram illustrating the speech recognition unit.

FIG. 3 is a block diagram illustrating the language analysis unit.

FIG. 4 is a diagram illustrating an analysis result produced by the language analysis unit.

FIG. 5 is a diagram illustrating an example in which analysis by the language analysis unit fails.

FIG. 6 is a diagram illustrating the correspondence between the uttered speech and the speech recognition result.

FIG. 7 is a diagram illustrating the operation sequence of an embodiment of the speech translation apparatus.

FIG. 8 is a block diagram illustrating the second embodiment of the speech translation apparatus.

FIG. 9 is a block diagram illustrating the prosody information estimation unit.

FIG. 10 is a diagram illustrating the prosody information estimated by the prosody information estimation unit.

[Explanation of symbols]

1001: speech recognition unit, 1002: speech storage unit, 1003: character string storage unit, 1004: language analysis unit, 1005: translation unit, 1006: translation result presentation unit, 1007: phrase extraction unit, 1008: speech reproduction unit.

Claims

[Claims]

1. A speech translation apparatus comprising: a speech recognition unit that recognizes input speech and converts it into a character string; a speech storage unit that stores the input speech; a character string storage unit that stores the character string converted by the speech recognition unit; a language analysis unit that analyzes the character string stored in the character string storage unit; a translation unit that translates into another language based on the analysis result of the language analysis unit; a phrase extraction unit that, when analysis by the language analysis unit fails, extracts from the speech storage unit the phrase of the input speech corresponding to the part for which the analysis failed; and a speech reproduction unit that reproduces the speech extracted by the phrase extraction unit.

2. The speech translation apparatus according to claim 1, wherein the speech recognition unit recognizes a repeated utterance of the phrase of the input speech corresponding to the part for which the analysis failed and converts it into a character string, and the translation unit translates a character string obtained by replacing, within the character string corresponding to the input speech, the character string corresponding to the phrase for which the analysis failed with the character string that the speech recognition unit obtained from the repeated utterance.

3. The speech translation apparatus according to claim 1, wherein the language analysis unit performs analysis based on semantic relationships between phrases.

4. The speech translation apparatus according to claim 1, further comprising a prosody information estimation unit that extracts prosody information from the input speech, wherein the phrase extraction unit extracts from the speech storage unit, based on the prosody information, the phrase of the input speech corresponding to the part for which the analysis failed.

5. The speech translation apparatus according to claim 1, wherein the speech reproduction unit has a voice quality conversion function for converting voice quality, and reproduces the speech extracted by the phrase extraction unit after converting its voice quality with the voice quality conversion function.

6. The speech translation apparatus according to claim 5, further comprising a display unit that displays the character string corresponding to the input speech with emphasis on the portion corresponding to the phrase of the input speech for which the analysis failed.

7. A speech translation method comprising the steps of: recognizing input speech and converting it into a character string; storing the input speech in a speech storage unit; storing the converted character string in a character string storage unit; analyzing the character string stored in the character string storage unit; when the analysis fails, extracting from the speech storage unit the phrase of the input speech corresponding to the part for which the analysis failed; and reproducing the extracted speech.

8. The speech translation method according to claim 7, further comprising the steps of: recognizing a repeated utterance of the phrase of the input speech corresponding to the part for which the analysis failed and converting it into a character string; and translating a character string obtained by replacing, within the character string corresponding to the input speech, the character string corresponding to the phrase for which the analysis failed with the character string obtained from the repeated utterance.

9. The speech translation method according to claim 7, comprising a step of analyzing the character string stored in the character string storage unit based on semantic relationships between phrases.

10. The speech translation method according to any one of claims 7 to 9, comprising the steps of: extracting prosody information from the input speech; and extracting from the speech storage unit, based on the prosody information, the phrase of the input speech corresponding to the part for which the analysis failed.

11. The speech translation method according to claim 7, further comprising the step of converting the voice quality of the extracted speech and reproducing it.

12. The speech translation method according to any one of claims 7 to 11, further comprising a step of displaying the character string corresponding to the input speech with emphasis on the portion corresponding to the phrase of the input speech for which the analysis failed.

13. A speech recognition apparatus comprising: a speech recognition unit that recognizes input speech and converts it into a character string; a speech storage unit that stores the input speech; a character string storage unit that stores the character string converted by the speech recognition unit; a language analysis unit that analyzes the character string stored in the character string storage unit; a phrase extraction unit that, when analysis by the language analysis unit fails, extracts from the speech storage unit the phrase of the input speech corresponding to the part for which the analysis failed; and a speech reproduction unit that reproduces the speech extracted by the phrase extraction unit.

14. The speech recognition apparatus according to claim 13, wherein the speech recognition unit recognizes a repeated utterance of the phrase of the input speech corresponding to the part for which the analysis failed and converts it into a character string, the apparatus further comprising means for replacing, within the character string corresponding to the input speech stored in the character string storage unit, the character string corresponding to the phrase for which the analysis failed with the character string that the speech recognition unit obtained from the repeated utterance.

15. The speech recognition apparatus according to claim 14, further comprising a display unit that displays the character string corresponding to the input speech with emphasis on the portion corresponding to the phrase of the input speech for which the analysis failed.