JP2012014042A

JP2012014042A - Audio input interface device and audio input method

Info

Publication number: JP2012014042A
Application number: JP2010151808A
Authority: JP
Inventors: Hiroyuki Washino; 浩之鷲野; Hirotaka Goi; 啓恭伍井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-07-02
Filing date: 2010-07-02
Publication date: 2012-01-19
Anticipated expiration: 2030-07-02
Also published as: JP5538099B2

Abstract

PROBLEM TO BE SOLVED: To obtain a voice input interface device that, when a user corrects a misrecognized speech recognition result, can present the correct correction candidate without requiring the user to explicitly correct the word sections.

SOLUTION: Speech recognition means 103 outputs a word string as the recognition result for a voice input. Linked correction candidate generation means 109 matches the phrase obtained by concatenating all the words output from the speech recognition means 103 against the phrases registered in linked reading/syllable storage means 108, which holds information on phrases formed by concatenating a plurality of words, and generates phrase-level correction candidates. Difference extraction means 110 extracts the portion of the matching result that differs from the recognition result. Correction candidate display means 111 displays the difference extracted by the difference extraction means 110 as a correction candidate.

Description

The present invention relates to a voice input interface device and a voice input method, and in particular to a voice input interface device and a voice input method provided with correction means that allow a user to easily correct a speech recognition result when the recognizer outputs a result different from what the user intended.

Speech recognition has an inherent problem: the recognition rate varies with the usage environment and with individual differences between users, so misrecognition occurs. Methods of raising the recognition rate and suppressing misrecognition include tuning the recognition parameters for each user and narrowing the recognition vocabulary according to the situation. However, when the system is expected to be used by a large number of unspecified users and must also handle a large vocabulary, as in facility-name search for car navigation, these measures are not a fundamental solution. It is therefore extremely important to provide an interface with which a recognition result can be corrected easily and quickly when misrecognition occurs. Many interfaces for correcting speech recognition results have been proposed for this purpose.

For example, Patent Document 1 discloses a speech recognition device that displays a list of candidate replacement words together with the recognition result, so that the user can correct the result simply by selecting the desired word from the list. The technique described in Patent Document 1 uses a confusion network to divide the word graph obtained from the voice input into a plurality of word sections by acoustic clustering, and generates words with a high competing probability in each word section as correction candidates. However, when the division into word sections fails, the desired word does not appear in the list of correction candidates.
Non-Patent Document 1, for example, discloses a technique in which, when word segmentation fails, the user corrects the word sections so that the input is re-divided into the correct sections and the correction candidates are presented again. In this case, however, the user must correct the sections manually, which adds effort.

Patent Document 1: JP 2006-146008 A

Non-Patent Document 1: Endo and Terada, "Interactive candidate selection method for speech input", Proceedings of Interaction 2003, pp. 195-196, 2003.

In conventional speech recognition, as shown in Non-Patent Documents 2 to 4 cited later, using a word n-gram model as the language model allows recognition with a high degree of freedom in word concatenation. In general, however, limits on training data make it difficult to implement models with n of 3 or more, so long word chains are likely to be misrecognized. Moreover, when misrecognition extends to the boundaries between words, no existing correction interface allows the result to be corrected easily and quickly.

The present invention has been made to solve the above problems. Its object is to obtain a voice input interface device and a voice input method that, when a user corrects a misrecognized speech recognition result, can present the correct correction candidate without requiring the user to explicitly correct the word sections.

The voice input interface device according to the present invention comprises: word string output means for outputting a word string as the recognition result for a voice input; linked information storage means in which information on phrases formed by concatenating a plurality of words is registered; linked correction candidate generation means for matching the phrase obtained by concatenating all the words output by the word string output means against the phrases registered in the linked information storage means, and generating phrase-level correction candidates; and correction candidate output means for outputting the correction candidates generated by the linked correction candidate generation means.

Because the voice input interface device of the present invention matches the phrase obtained by concatenating all the words output by the word string output means against the phrases registered in the linked information storage means and generates phrase-level correction candidates, it can present the correct correction candidate when the user corrects a misrecognized result, without requiring the user to explicitly correct the word sections.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the voice input interface device of Embodiment 1 of the present invention.
FIG. 2 is an explanatory diagram showing a speech recognition result in the voice input interface device of Embodiment 1.
FIG. 3 is a flowchart showing the processing performed by the word correction candidate generation means in the voice input interface device of Embodiment 1.
FIG. 4 is an explanatory diagram showing the information held in the word reading/syllable storage means in the voice input interface device of Embodiment 1.
FIG. 5 is a flowchart showing the processing performed by the linked correction candidate generation means in the voice input interface device of Embodiment 1.
FIG. 6 is an explanatory diagram showing the information held in the linked reading/syllable storage means in the voice input interface device of Embodiment 1.
FIG. 7 is an explanatory diagram showing a speech recognition result, word correction candidates, and linked correction candidates in the voice input interface device of Embodiment 1.
FIG. 8 is an explanatory diagram of the case where the correction target word is replaced with a word correction candidate in the voice input interface device of Embodiment 1.
FIG. 9 is an explanatory diagram of the case where a linked correction candidate is selected in the voice input interface device of Embodiment 1.
FIG. 10 is a block diagram showing the voice input interface device of Embodiment 2 of the present invention.
FIG. 11 is a flowchart showing the narrowing-down process in the voice input interface device of Embodiment 2.

Embodiment 1.
FIG. 1 is a block diagram of the voice input interface device according to Embodiment 1 of the present invention.
The voice input means 101 consists of a voice input device such as a microphone and an A/D converter; when the user inputs speech, it converts the analog audio signal into a digital audio signal that a computer can process. The speech recognition dictionary storage means 102 is a storage device holding the recognition dictionary (language model) required for speech recognition. The speech recognition means (word string output means) 103 takes the digital audio signal produced by the voice input means 101 as input, recognizes the speech by referring to the speech recognition dictionary storage means 102, and outputs the speech recognition result. The speech recognition result display means 104 presents the recognition result to the user on a display device such as an LCD. The correction target word selection means 105 accepts an operation, made with an input device such as a mouse or touch panel, by which a user who wants to correct the recognition result selects the word to be corrected; when the user makes a selection, it outputs the correction target word.

The word reading/syllable storage means (word information storage means) 106 is a storage device holding the notation, reading information, and syllable information of each word to be recognized. The word correction candidate generation means 107 generates and outputs, as word correction candidates, words with high similarity to the correction target word selected by the user. It generates these candidates using the word-level reading information and syllable information held in the word reading/syllable storage means 106.

The linked reading/syllable storage means (linked information storage means) 108 is a storage device holding the reading information and syllable information of phrases formed by connecting one or more words to be recognized. The linked correction candidate generation means 109 generates and outputs, as linked correction candidates, phrases with high similarity to the phrase obtained by concatenating all the words in the speech recognition result. It generates these candidates using the reading information and syllable information held in the linked reading/syllable storage means 108. The difference extraction means 110 extracts the difference phrase between a generated linked correction candidate and the recognition result already presented to the user.

The correction candidate display means (correction candidate output means) 111 presents both the generated word correction candidates and the difference phrase to the user at the same time as correction candidates, on a display device such as an LCD. The correction candidate selection means 112 accepts an operation, made with an input device such as a mouse or touch panel, by which the user selects the intended correction candidate; when the user makes a selection, it outputs the selected correction candidate. The correction execution means 113 takes the correction candidate selected by the user as input, updates the recognition result already presented to the user, and presents the corrected result to the user again.

The processing flow of the voice input interface device configured as above is described below with a concrete example.
Suppose that the user, intending to input "三菱電機株式会社" (Mitsubishi Electric Corporation) by voice, utters "ミツビシデンキカブシキガイシャ" (mitsubishi denki kabushiki gaisha).
First, the voice input means 101 converts the uttered analog audio signal into a digital audio signal. Next, the speech recognition means 103 takes this digital audio signal as input, recognizes the speech by referring to the speech recognition dictionary held in the speech recognition dictionary storage means 102, and outputs the recognition result (word string output step). Any speech recognition technique may be used, including the known techniques described in Non-Patent Documents 2 to 4 below. One possibility is to convert the digital audio signal into acoustic features and search for recognition candidates based on an acoustic score for basic recognition units such as phonemes and a language score from the language model.

Non-Patent Document 2: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, "Speech Recognition System", Ohmsha, Ltd., May 15, 2001.
Non-Patent Document 3: Kenji Kita and Junichi Tsujii, "Probabilistic Language Models", University of Tokyo Press, November 25, 1999.
Non-Patent Document 4: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models", The Institute of Electronics, Information and Communication Engineers, July 1, 1988.

The speech recognition result display means 104 presents the recognition result on a display device such as an LCD, in a layout that lets the user see both the result and the word boundaries. For example, suppose that for the utterance "ミツビシデンキカブシキガイシャ" the recognition result is "三井／市電／機器／株式／会社" (Mitsui / city tram / equipment / stock / company, where ／ marks a word boundary). In this case it is desirable to use a layout such as that of FIG. 2, in which the divided words appear from left to right in utterance order.

If the recognition result shown by the speech recognition result display means 104 is what the user intended, the user can use it as is; otherwise the result must be corrected. For example, when the utterance "ミツビシデンキカブシキガイシャ" is recognized as "三井／市電／機器／株式／会社" as above, the output is not what the user intended, and both the word boundaries and the number of words are wrong. Specifically, "ミツビシデンキカブシキガイシャ" should be divided as "ミツビシ／デンキ／カブシキ／ガイシャ" (4 words), but the recognition result is divided as "ミツイ／シデン／キキ／カブシキ／ガイシャ" (5 words). The user therefore has to correct not only the words but also the word boundaries.

The correction target word selection means 105 therefore accepts an operation, made with an input device such as a mouse or touch panel, by which the user selects the word to be corrected. When the user selects a word, the correction target word selection means 105 detects the selection and outputs the selected correction target word. Here, suppose that "三井" is selected as the correction target word from the five words "三井／市電／機器／株式／会社".

Next, the processing performed by the word correction candidate generation means 107 is described in detail following the flowchart of FIG. 3.
Taking the correction target word output by the correction target word selection means 105 as input, the word correction candidate generation means 107 first obtains the reading and syllable information of the correction target word from the reading information and syllable information stored in the word reading/syllable storage means 106 (step ST101). The word reading/syllable storage means 106 preferably stores, in a form such as that of FIG. 4, the notations of the words obtained by dividing the recognition target phrases with a word segmentation technique such as morphological analysis, together with the reading and syllable information corresponding to each notation. The word correction candidate generation means 107 searches the word reading/syllable storage means 106 for the correction target word and obtains the corresponding reading and syllable information. For example, when the correction target word is "三井", it searches the notations in the word reading/syllable storage means 106 for "三井" and obtains its reading "ミツイ" and syllables "mi-cu-i".
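The lookup in step ST101 can be pictured with a small in-memory table. The sketch below is only illustrative: the `WordEntry` record, the `WORD_STORE` dictionary, and the `lookup` helper are hypothetical names, and the entries are example words taken from this description, not the actual contents of FIG. 4.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WordEntry:
    """One record of the word reading/syllable storage (hypothetical layout)."""
    notation: str               # written form, e.g. kanji
    reading: str                # katakana reading
    syllables: Tuple[str, ...]  # syllable symbols as written in the description


# Hypothetical store keyed by notation; entries come from the running example.
WORD_STORE = {
    entry.notation: entry
    for entry in (
        WordEntry("三井", "ミツイ", ("mi", "cu", "i")),
        WordEntry("市電", "シデン", ("si", "de", "n")),
        WordEntry("機器", "キキ", ("ki", "ki")),
    )
}


def lookup(notation: str) -> Optional[WordEntry]:
    """Return the stored entry for a notation, or None if it is not registered."""
    return WORD_STORE.get(notation)


entry = lookup("三井")
print(entry.reading, "-".join(entry.syllables))  # ミツイ mi-cu-i
```

Keying the table by notation matches the description, where the correction target word is searched for by its notation before its reading and syllables are read out.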

Next, an arbitrary word is selected from the word reading/syllable storage means 106 (step ST102), and the similarity between the reading of the correction target word and the reading of the word selected in step ST102 is calculated (step ST103). Any known method may be used to compute similarity from readings. One example is the edit distance (Levenshtein distance), which defines the distance between two words as the minimum number of operations (insertion, deletion, substitution) needed to edit one into the other. For example, editing "ミツイ" (mitsui) into "ミツビシ" (mitsubishi) proceeds as follows:
1. "ミツイ"
2. "ミツビ" (substitute ビ for イ)
3. "ミツビシ" (insert シ)
so at least two operations are needed, and the edit distance between "ミツイ" and "ミツビシ" is therefore 2. The smaller the edit distance, the greater the similarity of the readings can be taken to be, so the reciprocal of the edit distance can be used as the similarity between words. In what follows, similarity computed from reading information is called reading similarity.
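The edit-distance calculation can be sketched as follows. This is a minimal illustration, assuming one katakana character per edit unit; the function names are hypothetical, and the handling of identical readings (distance 0, where the reciprocal is undefined) is an assumption, since the description does not cover that case.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two readings, one character per symbol."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def reading_similarity(a: str, b: str) -> float:
    """Reciprocal of the edit distance.

    Assumption: identical readings (distance 0) are mapped to infinity.
    """
    distance = edit_distance(a, b)
    return float("inf") if distance == 0 else 1.0 / distance


print(edit_distance("ミツイ", "ミツビシ"))       # 2
print(reading_similarity("ミツイ", "ミツビシ"))  # 0.5
```

A practical implementation would need a finite value for identical readings before the normalization step, for example by capping the similarity.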

Next, the word correction candidate generation means 107 calculates the similarity between the syllables of the correction target word and the syllables of the word selected in step ST102 (step ST104). A known method can be used to compute similarity from syllables. One example is the technique described in Non-Patent Document 5 below, which computes the mutual confusion probability of each partial syllable string from statistical recognition error tendencies and obtains the similarity of the whole word as the logarithm of the product of the confusion probabilities of all partial syllable strings. In what follows, similarity computed from syllable information is called acoustic similarity.
Non-Patent Document 5: Abe et al., "Large-scale continuous speech recognition by a two-stage search method using a probabilistic model of recognition error tendencies", Journal of the IEICE, Vol. J83-D-II, No. 12, pp. 2545-2553, 2000.
Steps ST102 to ST104 are repeated for every word stored in the word reading/syllable storage means 106 (step ST105).
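The "logarithm of the product of confusion probabilities" can be illustrated with a heavily simplified sketch. It is not the method of Non-Patent Document 5: it assumes each partial syllable string is a single syllable, that the two syllable strings have equal length and are already aligned, and that the confusion probabilities in `CONFUSION` (all hypothetical values, as is the symbol "zu") are given.

```python
import math
from typing import Dict, Sequence, Tuple


def acoustic_similarity(
    recognized: Sequence[str],
    candidate: Sequence[str],
    confusion: Dict[Tuple[str, str], float],
    floor: float = 1e-6,
) -> float:
    """Log of the product of per-syllable confusion probabilities.

    Computed as a sum of logs. Unseen pairs fall back to `floor` (an assumption).
    """
    if len(recognized) != len(candidate):
        raise ValueError("this sketch only compares syllable strings of equal length")
    return sum(
        math.log(confusion.get(pair, floor))
        for pair in zip(recognized, candidate)
    )


# Hypothetical confusion probabilities P(candidate syllable | recognized syllable).
CONFUSION = {
    ("mi", "mi"): 0.9,
    ("cu", "cu"): 0.8,
    ("cu", "zu"): 0.1,
    ("i", "i"): 0.9,
}

same = acoustic_similarity(["mi", "cu", "i"], ["mi", "cu", "i"], CONFUSION)
near = acoustic_similarity(["mi", "cu", "i"], ["mi", "zu", "i"], CONFUSION)
print(same > near)  # True
```

Summing logarithms instead of multiplying probabilities gives the same value while avoiding underflow on long syllable strings.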

Once the reading similarity r_i and the acoustic similarity a_i have been obtained for every word i, the word correction candidate generation means 107 normalizes them: the reading similarity and acoustic similarity between the correction target word and each word stored in the word reading/syllable storage means 106 are divided by the sum of the reading similarities and the sum of the acoustic similarities, respectively (n being the total number of words). It then computes the weighted sum of the two normalized similarities as the inter-word similarity s_i, as in the following equation (step ST106).

$$ s_i = \alpha \frac{r_i}{\sum_{j=1}^{n} r_j} + (1 - \alpha) \frac{a_i}{\sum_{j=1}^{n} a_j} $$

In the above equation, α is a weight that determines how much emphasis is placed on the reading similarity versus the acoustic similarity when computing the word similarity. α can be set freely according to the environment in which speech recognition is used: with α = 0 only the acoustic similarity is used, and with α = 1 only the reading similarity is used.
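The normalization and weighting of step ST106 can be sketched as below, assuming the weighted sum takes the form α times the normalized reading similarity plus (1 − α) times the normalized acoustic similarity, which is consistent with the behavior described for α = 0 and α = 1. The function name and the input values are hypothetical.

```python
from typing import List, Sequence


def combine_similarities(
    reading: Sequence[float],
    acoustic: Sequence[float],
    alpha: float,
) -> List[float]:
    """Normalize each similarity by its sum and return the weighted sum per word."""
    if len(reading) != len(acoustic):
        raise ValueError("both similarity lists must cover the same words")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie between 0 and 1")
    reading_total = sum(reading)
    acoustic_total = sum(acoustic)
    return [
        alpha * r / reading_total + (1.0 - alpha) * a / acoustic_total
        for r, a in zip(reading, acoustic)
    ]


# Hypothetical similarities of three stored words to the correction target word.
reading_scores = [0.5, 1.0, 0.5]
acoustic_scores = [0.2, 0.6, 0.2]

print(combine_similarities(reading_scores, acoustic_scores, alpha=1.0))  # reading only
print(combine_similarities(reading_scores, acoustic_scores, alpha=0.0))  # acoustic only
```

Because both terms are normalized to sum to 1, the combined scores also sum to 1 for any α, which keeps the two measures on a comparable scale.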

In this way, the word correction candidate generation means 107 calculates the inter-word similarity between the correction target word selected by the user and every word stored in the word reading/syllable storage means 106, and finally sorts the words in descending order of inter-word similarity and generates the top n words as word correction candidates (step ST107). The number of candidates n is arbitrary (here n is the number of candidates, not the total word count used in the normalization). For example, with n = 3 and the correction target word "三井", words with high inter-word similarity such as "三ツ井" (ミツイ), "三津" (ミツ), and "水井" (ミズイ) are selected as word correction candidates.
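Selecting the top n words in step ST107 amounts to a partial sort on the inter-word similarity. In the sketch below, the scores are hypothetical, and "松井" and "三田" are invented low-scoring entries added only so that something is filtered out.

```python
import heapq
from typing import Dict, List


def top_candidates(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n notations with the highest inter-word similarity, best first."""
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical inter-word similarities to the correction target word "三井".
scores = {
    "三ツ井": 0.40,
    "三津": 0.25,
    "水井": 0.20,
    "松井": 0.10,  # invented entry for illustration
    "三田": 0.05,  # invented entry for illustration
}

print(top_candidates(scores, n=3))  # ['三ツ井', '三津', '水井']
```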

When the boundaries between words are correct, the candidate the user intends is likely to appear near the top of these word correction candidates. When the boundaries are wrong, however, it becomes less likely to appear near the top, because both the reading similarity and the acoustic similarity between the correction target word and the word the user intended become small.

To address this problem, the technique of Non-Patent Document 1 mentioned above has the user first correct the boundaries between words, after which word correction candidates are generated. With that technique, however, the user has the extra burden of correcting the word boundaries.
In the present embodiment, therefore, so that the candidate the user intends is generated even when the word boundaries are wrong, and without making the user correct them, all the words that appear in the speech recognition result are concatenated and similarity is computed for the concatenated phrase (linked correction candidate generation step, correction candidate output step). Specifically, the linked correction candidate generation means 109 generates, as linked correction candidates, those phrases stored in the linked reading/syllable storage means 108 that have high similarity to the phrase obtained by concatenating the words output as the recognition result by the speech recognition means 103. The processing of the linked correction candidate generation means 109 is described in detail following the flowchart of FIG. 5.

First, the readings and syllables of all the words that appear in the speech recognition result are obtained by referring to the word reading/syllable storage means 106 (step ST201). Next, a phrase is generated by concatenating those readings and syllables (step ST202). For example, when the recognition result is "三井／市電／機器／株式／会社", the readings "ミツイ／シデン／キキ／カブシキ／ガイシャ" are obtained and concatenated into the phrase "ミツイシデンキキカブシキガイシャ". Likewise, the syllables "mi-cu-i / si-de-n / ki-ki / ka-bu-si-ki / ga-i-sja" are obtained and concatenated into the syllable string "mi-cu-i-si-de-n-ki-ki-ka-bu-si-ki-ga-i-sja". Next, an arbitrary phrase is selected from the linked reading/syllable storage means 108 (step ST203), and the similarities between the reading and syllables of the phrase concatenated in step ST202 and the reading and syllables of the phrase selected in step ST203 are calculated (steps ST204 and ST205). The linked reading/syllable storage means 108 preferably stores the notations of the recognition target phrases in a form such as that of FIG. 6, together with the reading and syllable information corresponding to each notation.
As with the inter-word similarity, the similarity between phrases is computed as the weighted sum of the reading similarity and the acoustic similarity (step ST206).
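Steps ST201 and ST202 can be sketched as a simple concatenation over the recognized words. The `concatenate` helper is a hypothetical name; the readings and syllables are those given in the example above.

```python
from typing import List, Sequence, Tuple


def concatenate(
    entries: Sequence[Tuple[str, Sequence[str]]],
) -> Tuple[str, List[str]]:
    """Join per-word readings and syllables into one phrase-level pair."""
    reading = "".join(word_reading for word_reading, _ in entries)
    syllables = [s for _, word_syllables in entries for s in word_syllables]
    return reading, syllables


# (reading, syllables) of each word in the recognition result, in utterance order.
recognized = [
    ("ミツイ", ["mi", "cu", "i"]),
    ("シデン", ["si", "de", "n"]),
    ("キキ", ["ki", "ki"]),
    ("カブシキ", ["ka", "bu", "si", "ki"]),
    ("ガイシャ", ["ga", "i", "sja"]),
]

reading, syllables = concatenate(recognized)
print(reading)              # ミツイシデンキキカブシキガイシャ
print("-".join(syllables))  # mi-cu-i-si-de-n-ki-ki-ka-bu-si-ki-ga-i-sja
```

Discarding the word boundaries at this point is what lets the later matching recover from a wrong segmentation.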

Steps ST203 to ST206 are repeated for every phrase stored in the linked reading/syllable storage means 108. In this way, the linked correction candidate generation means 109 calculates the similarity between the phrase obtained by concatenating all the words of the recognition result and every phrase stored in the linked reading/syllable storage means 108 (step ST207), and finally sorts the phrases in descending order of similarity and generates the top one or more phrases as linked correction candidates (step ST208). For example, when the concatenated phrase is "ミツイシデンキキカブシキガイシャ", phrases with high similarity to it such as "三菱電機株式会社" (ミツビシデンキカブシキガイシャ) and "三井電気株式会社" (ミツイデンキカブシキガイシャ) are generated as linked correction candidates.
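The loop over stored phrases (steps ST203 to ST208) can be sketched with a pluggable similarity function. The example uses the ratio from Python's `difflib.SequenceMatcher` purely as a stand-in; it is not the weighted reading/acoustic similarity described above. `PHRASE_STORE`, the function names, and the "東京駅" entry are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the measure used in the description."""
    return SequenceMatcher(None, a, b).ratio()


def linked_candidates(
    query_reading: str,
    phrase_store: Dict[str, str],
    similarity: Callable[[str, str], float] = stand_in_similarity,
    top_k: int = 2,
) -> List[str]:
    """Rank every stored phrase against the concatenated reading; keep the best top_k."""
    ranked = sorted(
        phrase_store,
        key=lambda notation: similarity(query_reading, phrase_store[notation]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical linked reading store: notation -> reading.
PHRASE_STORE = {
    "三菱電機株式会社": "ミツビシデンキカブシキガイシャ",
    "三井電気株式会社": "ミツイデンキカブシキガイシャ",
    "東京駅": "トウキョウエキ",  # invented unrelated entry
}

candidates = linked_candidates("ミツイシデンキキカブシキガイシャ", PHRASE_STORE)
print(candidates)
```

With reading-based measures alone the two company names score very close to each other (both are two edits away from the query under edit distance), which suggests why the description also weighs in an acoustic similarity.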

Next, the difference extraction means 110 extracts the difference between each linked correction candidate generated by the linked correction candidate generation means 109 and the phrase already output as the recognition result. For example, as above, when the recognition result is "三井／市電／機器／株式／会社" and the linked correction candidate is "三菱電機株式会社", the words "株式" and "会社" are identical in both, so the difference phrase is the "三菱電機" part. In the present embodiment, because the final display form is the notation, the difference phrase is extracted by comparing notations rather than readings.
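One way to realize the notation-based difference extraction is to strip, from both ends of the candidate, the recognized words it shares with the recognition result. The sketch below assumes this prefix/suffix strategy; the text only states that the comparison is made between notations, and the function name is hypothetical.

```python
def extract_difference(recognized_words, candidate):
    """Return (difference, start, end); recognized_words[start:end] is the span it replaces."""
    start, end = 0, len(recognized_words)
    remaining = candidate
    # Drop recognized words that the candidate shares at its beginning.
    while start < end and remaining.startswith(recognized_words[start]):
        remaining = remaining[len(recognized_words[start]):]
        start += 1
    # Drop recognized words that the candidate shares at its end.
    while end > start and remaining.endswith(recognized_words[end - 1]):
        remaining = remaining[: len(remaining) - len(recognized_words[end - 1])]
        end -= 1
    return remaining, start, end


recognized = ["三井", "市電", "機器", "株式", "会社"]
difference, start, end = extract_difference(recognized, "三菱電機株式会社")
# difference is "三菱電機", replacing recognized[0:3], i.e. "三井／市電／機器"
```

Returning the word span together with the difference phrase lets a later correction step know which recognized words the phrase stands in for.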

As described above, once the word correction candidate generation means 107 has generated word correction candidates and the difference extraction means 110 has extracted the difference phrase from the linked correction candidates, the correction candidate display means 111 presents both the word correction candidates and the difference phrase to the user simultaneously as correction candidates, using a display device such as an LCD. The display layout is arbitrary, but as shown in FIG. 7, it is desirable to display the word correction candidates as a list below the correction target word selected by the user, and to display the difference phrase above the speech recognition result, laid out so that the difference can be seen at a glance.
Next, the correction candidate selection means 112 accepts the user's operation of selecting a correction candidate with an input device such as a mouse or touch panel. When the user selects a correction candidate, the correction candidate selection means 112 detects the selection operation and outputs the selected correction target word.

Finally, the correction execution means 113 takes the correction candidate selected by the user as input and corrects the speech recognition result already displayed. For example, in FIG. 7, when the word correction candidate "三ツ木" is selected for the correction target word "三井", "三井" is replaced with "三ツ木" and the result is displayed as shown in FIG. 8. On the other hand, when "三菱電機", the correction candidate given by the difference phrase against the linked correction candidate, is selected, the word string "三井／市電／機器" corresponding to the difference phrase is replaced with "三菱電機" and the result is displayed as shown in FIG. 9. In addition, when processing that uses the speech recognition result follows in a later stage, the correction execution means 113 notifies the appropriate destination that the recognition result has been corrected.
The above is the processing flow of the voice input interface device according to the present embodiment.
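The two kinds of replacement performed by the correction execution means can be treated uniformly as replacing a span of recognized words. The following is a sketch under that assumption: `apply_correction` and the `on_corrected` hook are hypothetical names, the hook standing in for the notification to later-stage processing.

```python
def apply_correction(words, start, end, replacement, on_corrected=None):
    """Replace words[start:end] with one replacement entry and return the new word list."""
    corrected = words[:start] + [replacement] + words[end:]
    if on_corrected is not None:
        # Hypothetical hook for later-stage processing that consumes the recognition result.
        on_corrected(corrected)
    return corrected


recognized = ["三井", "市電", "機器", "株式", "会社"]
# Word correction candidate: a span of exactly one word.
word_fix = apply_correction(recognized, 0, 1, "三ツ木")
# Difference phrase from a linked correction candidate: a span of three words.
phrase_fix = apply_correction(recognized, 0, 3, "三菱電機")
```

The input list is left unchanged, so the displayed result can be restored if the user cancels the correction.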

In the above example, because speech recognition segmented the words incorrectly, the correction candidate "三菱" intended by the user cannot be presented for the correction target word "三井". However, by concatenating the words of the recognition result "三井／市電／機器／株式／会社" and calculating the similarity, the intended correction candidate "三菱電機" can be generated.
In the present embodiment, the similarity is calculated on concatenated words to simplify the description, but the same effect is obtained with concatenated characters.

Thus, according to Embodiment 1, even when speech recognition outputs a result the user did not intend and the word segmentation is also wrong, correction candidates for each word of the recognition result and correction candidates for the phrase obtained by concatenating some or all of the output words can be presented simultaneously. A correction interface can therefore be provided that presents the correct correction candidate without requiring the user to correct the word segmentation.

As described above, the voice input interface device of Embodiment 1 comprises: word string output means for outputting a word string as the recognition result for a voice input; link information storage means in which information on phrases formed by concatenating a plurality of words is registered; linked correction candidate generation means for matching the phrase obtained by concatenating all words output by the word string output means against the phrases registered in the link information storage means and generating correction candidates in phrase units; and correction candidate output means for outputting the correction candidates generated by the linked correction candidate generation means. Consequently, when the user corrects a misrecognition result in speech recognition, the correct correction candidate can be presented without the user explicitly correcting the segment.

Further, the voice input interface device of Embodiment 1 comprises: word string output means for outputting a word string as the recognition result for a voice input; link information storage means in which information on phrases formed by concatenating a plurality of words is registered; linked correction candidate generation means for matching the phrase obtained by concatenating all words output by the word string output means against the phrases registered in the link information storage means and generating correction candidates in phrase units; difference extraction means for extracting the difference between a correction candidate generated by the linked correction candidate generation means and the phrase obtained by concatenating all words output by the word string output means; and correction candidate output means for outputting the difference extracted by the difference extraction means as a correction candidate. Consequently, correction candidates can be checked more easily.

Further, the voice input interface device of Embodiment 1 comprises word information storage means in which word-unit information is registered, and word correction candidate generation means for generating, for a word in the word string output by the word string output means, word-unit correction candidates from among the words stored in the word information storage means. Consequently, more accurate correction candidates can be presented.

Further, the voice input interface device of Embodiment 1 comprises correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means, and the word correction candidate generation means uses both the similarity between the syllables of the correction target word and those of a correction candidate and the similarity between their readings, taking the weighted sum of the two as the overall similarity used in generating correction candidates. Consequently, accurate correction candidate words can be presented.

Further, in the voice input interface device of Embodiment 1, the linked correction candidate generation means treats the phrase obtained by concatenating all words output by the word string output means as the correction target phrase, and uses both the similarity between the syllables of the correction target phrase and those of a correction candidate and the similarity between their readings, taking the weighted sum of the two as the overall similarity used in generating correction candidates. Consequently, accurate correction candidate phrases can be presented.

Further, the voice input interface device of Embodiment 1 comprises: recognition result display means for displaying the output of the word string output means; correction target word selection means for accepting designation of a correction target word in the recognition result displayed by the recognition result display means; correction candidate display means for displaying the linked correction candidates generated by the linked correction candidate generation means; correction candidate selection means for accepting the user's selection among the correction candidates displayed by the correction candidate display means; and correction execution means for correcting the output of the word string output means based on the selection accepted by the correction candidate selection means. Consequently, an interface can be provided in which voice input and the selection of correction candidates are easy.

Further, the voice input method of Embodiment 1 comprises: a word string output step of outputting a word string as the recognition result for a voice input; a linked correction candidate generation step of matching the phrase obtained by concatenating all words output in the word string output step against phrases, each formed by concatenating a plurality of words, registered in the link information storage means, and generating correction candidates in phrase units; and a correction candidate output step of outputting the correction candidates generated in the linked correction candidate generation step. Consequently, when the user corrects a misrecognition result in speech recognition, the correct correction candidate can be presented without the user explicitly correcting the segment.

Embodiment 2.
In Embodiment 1 above, linked correction candidates are generated for the phrase obtained by concatenating the words output as the speech recognition result, so that the correction candidate the user intends can be presented even when the segmentation is wrong. The linked correction candidates can be narrowed down further according to the temporal position of the correction target word selected by the user. That is, when the speech recognition result is laid out from left to right in utterance order as in FIG. 2, it is natural to assume that the user corrects it in order, starting from the leftmost word. Therefore, when the user selects the second or a later word from the left as the correction target word, the recognition results of the words that appear before it can be assumed correct and used to narrow down the linked correction candidates. For example, in FIG. 2, when the third word from the left, "機器", is selected, the word-level recognition results "三井" and "市電" are assumed correct, and the correction candidates can be narrowed down to only those word correction candidates and linked correction candidates that can connect to "三井" and "市電".

Embodiment 2 therefore describes a voice input interface device that, in addition to the configuration of the voice input interface device of Embodiment 1, has a function of narrowing down the word correction candidates and linked correction candidates: when the user selects the second or a later word of the speech recognition result as the correction target word, the recognition results of the words appearing before the correction target word are assumed to be correct.

FIG. 10 shows the configuration of the voice input interface device according to Embodiment 2 of the present invention. As shown in FIG. 10, it differs from Embodiment 1 only in that word correction candidate narrowing means 114 is inserted after the word correction candidate generation means 107 and linked correction candidate narrowing means 115 is inserted after the linked correction candidate generation means 109. The word correction candidate narrowing means 114 narrows down the word correction candidates generated by the word correction candidate generation means 107 according to the position, in the word string, of the correction target word accepted by the correction target word selection means 105. Likewise, the linked correction candidate narrowing means 115 narrows down the linked correction candidates generated by the linked correction candidate generation means 109 according to the position, in the word string, of the correction target word accepted by the correction target word selection means 105. Everything else is the same as in Embodiment 1, so the configuration and operation of the other parts are not described again.

First, the flow of processing performed by the word correction candidate narrowing means 114 is described with reference to the flowchart of FIG. 11. As shown in FIG. 11, when the correction target word selected by the user in the correction target word selection means 105 is the first word of the speech recognition result, the word correction candidate narrowing means 114 performs no processing (step ST301; NO). When the correction target word selected by the user is the second or a later word of the speech recognition result (step ST301; YES), the word correction candidate narrowing means 114 acquires the recognition result preceding the correction target word (step ST302). For example, suppose the user, intending to input "三菱電機株式会社" (Mitsubishi Electric Corporation) by voice, utters "ミツビシデンキカブシキガイシャ" and the speech recognition result is "三菱／電子／機器／株式会社". If the user then selects "電子" as the correction target word, the speech recognition result preceding it, "三菱", is acquired.

Next, the word correction candidate narrowing means 114 refers to the linked reading/syllable storage means 108 and narrows down the word correction candidates generated by the word correction candidate generation means 107 by selecting only those that can connect to the speech recognition result preceding the correction target word (step ST303). In the above example, it refers to the linked reading/syllable storage means 108 and, from the word correction candidates for the correction target word "電子", selects only those that can connect to the preceding word "三菱". Narrowing down the word correction candidates with the word information preceding the correction target word in this way raises the likelihood that the correction the user intends appears near the top of the word correction candidates.
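Steps ST301 to ST303 can be sketched as a filter over the word correction candidates. The text does not define "can connect" precisely; the sketch assumes a candidate is connectable when some registered phrase begins with the preceding recognized words followed by that candidate. The function name and the candidate list are hypothetical.

```python
def narrow_word_candidates(preceding_words, candidates, registered_phrases):
    """Keep only candidates that can follow the preceding words in some registered phrase."""
    if not preceding_words:
        # ST301 NO: the first word was selected, so nothing is narrowed.
        return list(candidates)
    # ST302: the recognition result before the correction target word.
    prefix = "".join(preceding_words)
    # ST303: assumed connectability test against the registered phrase notations.
    return [c for c in candidates
            if any(p.startswith(prefix + c) for p in registered_phrases)]


registered = ["三菱電機株式会社", "三井電気株式会社"]
# Hypothetical word correction candidates for the correction target word "電子".
candidates = ["電機", "電気", "伝記"]
narrowed = narrow_word_candidates(["三菱"], candidates, registered)
```

Here only "電機" survives, because "三菱電機" is the only combination that begins a registered phrase.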

The flow of processing performed by the linked correction candidate narrowing means 115 is identical to that of the word correction candidate narrowing means 114 in FIG. 11, with the word correction candidates replaced by the linked correction candidates.
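For linked correction candidates, the same narrowing reduces to a prefix test on the candidate notation. This is a sketch under the assumption that a linked candidate is kept when it begins with the recognized words preceding the correction target word; the function name is hypothetical.

```python
def narrow_linked_candidates(preceding_words, linked_candidates):
    """Keep linked candidates whose notation starts with the words assumed correct."""
    # An empty prefix (first word selected) matches every candidate, so nothing is removed.
    prefix = "".join(preceding_words)
    return [c for c in linked_candidates if c.startswith(prefix)]


linked = ["三菱電機株式会社", "三井電気株式会社"]
narrowed_linked = narrow_linked_candidates(["三菱"], linked)
```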

As described above, the voice input interface device of Embodiment 2 comprises correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means, and word correction candidate narrowing means for narrowing down the word correction candidates generated by the word correction candidate generation means according to the position of the correction target word in the word string. Consequently, the likelihood that the correction the user intends appears near the top of the correction candidates can be raised.

Further, the voice input interface device of Embodiment 2 comprises correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means, and linked correction candidate narrowing means for narrowing down the linked correction candidates generated by the linked correction candidate generation means according to the position, in the word string, of the correction target word accepted by the correction target word selection means. Consequently, the likelihood that the correction the user intends appears near the top of the linked correction candidates can be raised.

101 voice input means, 102 speech recognition dictionary storage means, 103 speech recognition means, 104 speech recognition result display means, 105 correction target word selection means, 106 word reading/syllable storage means, 107 word correction candidate generation means, 108 linked reading/syllable storage means, 109 linked correction candidate generation means, 110 difference extraction means, 111 correction candidate display means, 112 correction candidate selection means, 113 correction execution means, 114 word correction candidate narrowing means, 115 linked correction candidate narrowing means.

Claims

A voice input interface device comprising:
word string output means for outputting a word string as a recognition result for a voice input;
link information storage means in which information on phrases formed by concatenating a plurality of words is registered;
linked correction candidate generation means for matching a phrase obtained by concatenating all words output by the word string output means against the phrases registered in the link information storage means, and generating correction candidates in phrase units; and
correction candidate output means for outputting the correction candidates generated by the linked correction candidate generation means.

A voice input interface device comprising:
word string output means for outputting a word string as a recognition result for a voice input;
link information storage means in which information on phrases formed by concatenating a plurality of words is registered;
linked correction candidate generation means for matching a phrase obtained by concatenating all words output by the word string output means against the phrases registered in the link information storage means, and generating correction candidates in phrase units;
difference extraction means for extracting a difference between a correction candidate generated by the linked correction candidate generation means and the phrase obtained by concatenating all words output by the word string output means; and
correction candidate output means for outputting the difference extracted by the difference extraction means as a correction candidate.

The voice input interface device according to claim 1 or 2, further comprising:
word information storage means in which word-unit information is registered; and
word correction candidate generation means for generating, for a word in the word string output by the word string output means, word-unit correction candidates from among the words stored in the word information storage means.

The voice input interface device according to claim 3, further comprising correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means,
wherein the word correction candidate generation means uses both the similarity between the syllables of the correction target word and the syllables of a correction candidate and the similarity between the reading of the correction target word and the reading of the correction candidate, and uses the weighted sum of the two similarities as the overall similarity when generating correction candidates.

The voice input interface device according to claim 1 or 2, wherein the linked correction candidate generation means, taking the phrase obtained by concatenating all words output by the word string output means as a correction target phrase, uses both the similarity between the syllables of the correction target phrase and the syllables of a correction candidate and the similarity between the reading of the correction target phrase and the reading of the correction candidate, and uses the weighted sum of the two similarities as the overall similarity when generating correction candidates.

The voice input interface device according to claim 3, further comprising:
correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means; and
word correction candidate narrowing means for narrowing down the word correction candidates generated by the word correction candidate generation means according to the position of the correction target word in the word string.

The voice input interface device according to any one of claims 1, 2, and 5, further comprising:
correction target word selection means for accepting designation of a correction target word in the word string output by the word string output means; and
linked correction candidate narrowing means for narrowing down the linked correction candidates generated by the linked correction candidate generation means according to the position, in the word string, of the correction target word accepted by the correction target word selection means.

The voice input interface device according to claim 1, further comprising:
recognition result display means for displaying the output of the word string output means;
correction target word selection means for accepting designation of a correction target word in the recognition result displayed by the recognition result display means;
correction candidate display means for displaying the linked correction candidates generated by the linked correction candidate generation means;
correction candidate selection means for accepting a user's selection among the correction candidates displayed by the correction candidate display means; and
correction execution means for correcting the output of the word string output means based on the selection accepted by the correction candidate selection means.

A voice input method comprising:
a word string output step of outputting a word string as a recognition result for a voice input;
a linked correction candidate generation step of matching a phrase obtained by concatenating all words output in the word string output step against phrases, each formed by concatenating a plurality of words, registered in link information storage means, and generating correction candidates in phrase units; and
a correction candidate output step of outputting the correction candidates generated in the linked correction candidate generation step.