JP2002287792A

JP2002287792A - Voice recognition device

Info

Publication number: JP2002287792A
Application number: JP2001090373A
Authority: JP
Inventors: Kunio Yokoi (横井 邦雄); Norihide Kitaoka (北岡 教英)
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2001-03-27
Filing date: 2001-03-27
Publication date: 2002-10-04
Anticipated expiration: 2021-03-27
Also published as: JP4604377B2

Abstract

PROBLEM TO BE SOLVED: To eliminate the need to re-enter the entire voice input when correcting a voice input that has a hierarchical structure, and to prevent degradation of the recognition rate. SOLUTION: A voice information holding part 41 holds voice information input from a voice input part 27. When a voice is input for correction, a comparison and discrimination part 42 compares the currently input voice information with the preceding voice information held in the voice information holding part 41 to identify the parts that resemble each other. A dictionary control part 34 of a voice recognition part 30 also uses the recognition result stored in a preceding result storage part 33 to perform dictionary control so that only the part corresponding to the correction is taken as comparison object pattern candidates. For example, when 'Aichi Prefecture, Kariya City, Showa town' is erroneously recognized as 'Aichi Prefecture, Kariya City, Shoei town' and 'Showa town' is then spoken as a correction, only the town names following Aichi Prefecture, Kariya City (specifically Showa town, Shoei town, and the like) are taken as comparison object pattern candidates.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a voice recognition technique that allows, for example, a destination in a navigation system to be set by voice, and that is particularly suited to handling correction input after a misrecognition.

[0002]

[Description of the Related Art] Voice recognition devices that compare an input voice with a plurality of pre-stored comparison target pattern candidates and take the candidate with the highest degree of match as the recognition result are already in practical use. They are used, for example, to let a user speak the place name of a destination to be set in a navigation system. In particular, when a driver operates an in-vehicle navigation system, voice input requires neither button operation nor looking at the screen, so it is effective because it remains safe even while the vehicle is moving.

[0003] To fulfill this function, it must be easy to specify a sufficiently detailed location. Specifically, input must be possible not just at the prefecture or city level but down to the level of town names within a city, or of oaza (large sections) within towns and villages. Furthermore, if a user who wants to set, for example, "Aichi-ken Kariya-shi Showa-cho" (Showa-cho, Kariya City, Aichi Prefecture) had to pause between levels and utter "Aichi-ken", "Kariya-shi" and "Showa-cho" separately, this would be burdensome, so it is preferable to allow input in one continuous utterance (batch input).

[0004] With a configuration that supports batch input, however, there are situations in which usability actually worsens for the user: namely, when the user must enter the destination again. Current recognition technology cannot guarantee a fully accurate result, so a single voice input is not always recognized correctly. For example, when the user says "Aichi-ken Kariya-shi Showa-cho", the device may misrecognize it as "Aichi-ken Kariya-shi Shoei-cho". After such a misrecognition, the user has to say "Aichi-ken Kariya-shi Showa-cho" all over again.

[0005] In everyday conversation, however, the natural response to such a misrecognition is not to repeat "Aichi-ken Kariya-shi Showa-cho" but to correct only the misrecognized town name. In other words, it is preferable that the second utterance can be just "Showa-cho".

[0006] A voice recognition device addressing this problem is disclosed in Japanese Unexamined Patent Application Publication No. H11-38994. According to this technique, when a word or word group forming an upper hierarchy level of a comparison target pattern candidate (a candidate formed by hierarchically connecting a plurality of words) is set as omissible during recognition, the form with that upper-level word or word group omitted is also temporarily treated as a comparison target pattern candidate, and the input voice is compared against it. Consequently, if the user says "Aichi-ken Kariya-shi Showa-cho" but the device misrecognizes it as "Aichi-ken Kariya-shi Shoei-cho", the user need only say "Showa-cho" on the second attempt rather than the full "Aichi-ken Kariya-shi Showa-cho".

[0007]

[Problems to Be Solved by the Invention] Although this prior art can be expected to improve usability, there is concern that it lowers the recognition rate. To support restatement from any hierarchy level, the prior art treats every intermediate level from which a restatement might begin as a possible recognition start point, so the number of comparison target pattern candidates grows very large.

[0008] This problem is not limited to place names made up of prefecture, city, town and so on; it applies equally to anything that is set as a hierarchical connection of a plurality of words. An object of the present invention is therefore, where the target of voice input to a voice recognition device has a hierarchical structure, to relieve the user of having to speak everything from the top level again when re-entering voice to correct a misrecognition, thereby reducing the user's burden and improving usability, while also preventing a drop in the recognition rate.

[0009]

[Means for Solving the Problems and Effects of the Invention] According to the voice recognition device of claim 1, when the user inputs a voice through the voice input means, the recognition means compares the input voice with a plurality of comparison target pattern candidates stored in advance in the dictionary means and takes the candidate with the highest degree of match as the recognition result, and the notification means reports that result. If a predetermined confirmation instruction is given after the recognition result has been reported, the post-confirmation processing means treats the result as confirmed and executes predetermined post-confirmation processing. Here, at least one of the comparison target pattern candidates stored in the dictionary means is set as a hierarchical connection of a plurality of words.

[0010] On this premise, the voice information holding means holds the voice information input at the previous utterance. If, after the recognition result has been reported, a voice input is made without the predetermined confirmation instruction, the voice information input at the current utterance is compared with the held voice information of the previous utterance, it is determined which part of the previous utterance's voice information the current utterance's voice information is closest to, and that closest part is taken as the correction location. Further, if the part determined to be the correction location corresponds to some of the hierarchy levels of a comparison target pattern candidate formed by hierarchically connecting a plurality of words, the recognition means temporarily treats the constituent word or word group of those levels as a comparison target pattern candidate and then performs the comparison against the voice input at the current utterance.

[0011] In this way, for hierarchically structured comparison target pattern candidates, the constituent word or word group of the relevant partial levels is temporarily treated as a comparison target pattern candidate and used in recognition even though it is not one of the candidates set in the dictionary means in advance, so the user can restate only that partial-level portion as a correction. For example, if the user says "Aichi-ken Kariya-shi Showa-cho" but the device misrecognizes it as "Aichi-ken Kariya-shi Shoei-cho", the user need only say "Showa-cho" on the second attempt rather than the full "Aichi-ken Kariya-shi Showa-cho". When the device misrecognizes something, correcting only the misrecognized part (the town name "Showa-cho" in the example above) is entirely natural given the habits of everyday conversation. Forcing the user to take special care only when using a voice recognition device is undesirable from the standpoint of usability. Because the voice recognition device of the present invention accommodates this natural conversational habit of correcting only the misrecognized part, it relieves the user of having to speak from the top level even when omitting the upper levels would be more natural, and further improves usability.

[0012] Moreover, while supporting such restatement (correction), the invention can reduce misrecognition compared with the prior art described above. In the prior art, every intermediate level is treated as a possible recognition start point in order to support restatement from any level, so the number of comparison target pattern candidates becomes very large. In the present invention, by contrast, the correction location is identified beforehand and only the portion corresponding to it is used as comparison target pattern candidates, so relatively few candidates are needed. In the example above, nothing up to "Aichi-ken Kariya-shi" is corrected, so only the town-name portion at the level below (specifically Showa-cho, Shoei-cho, and so on) needs to be used as comparison target pattern candidates. That is, every candidate that does not have "Aichi-ken Kariya-shi" as its upper levels is excluded, so the number of candidates is comparatively very small. Fewer comparison target pattern candidates means a lower chance of misrecognition and also a shorter recognition processing time.
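The candidate narrowing described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the tuple-based dictionary layout, the entries other than the two example town names, and the function name `correction_candidates` are all hypothetical.

```python
# Hypothetical dictionary: each entry is one hierarchical candidate
# (prefecture, city, town). Entries other than the patent's example are made up.
DICTIONARY = [
    ("Aichi-ken", "Kariya-shi", "Showa-cho"),
    ("Aichi-ken", "Kariya-shi", "Shoei-cho"),
    ("Aichi-ken", "Nagoya-shi", "Showa-ku"),
    ("Gifu-ken", "Gifu-shi", "Showa-cho"),
]


def correction_candidates(dictionary, previous_result, first_corrected_level):
    """Return the partial candidates to match a correction utterance against.

    Levels above `first_corrected_level` are assumed to be correct, so only
    entries that share that prefix with the previous result are kept, and
    only their lower-level words are returned.
    """
    prefix = previous_result[:first_corrected_level]
    return sorted(
        {entry[first_corrected_level:] for entry in dictionary
         if entry[:first_corrected_level] == prefix}
    )


previous = ("Aichi-ken", "Kariya-shi", "Shoei-cho")  # misrecognized result
candidates = correction_candidates(DICTIONARY, previous, first_corrected_level=2)
print(candidates)
```

With the correction located at the town level, only the towns under the unchanged "Aichi-ken Kariya-shi" prefix remain, which is the reduction in candidates that the paragraph credits for the lower misrecognition rate.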

[0013] From the standpoint of reducing the user's burden and improving usability, it is undesirable to make the user restate everything when only part was misrecognized; from the standpoint of preventing misrecognition, however, matching becomes easier the more speech there is to match. The two benefits therefore trade off against each other, but if the user is made to re-enter everything from the misrecognized level downward, the trailing end is fixed, and improved recognition accuracy can be expected. In that case, as set out in claim 2, the comparison between the voice information of the current utterance and that of the previous utterance may be carried out starting from the end of the voice information.

[0014] In this case, the method is effective only if the user adheres to that correction method, so it is preferable, as set out in claim 3, to inform the user of the correction method. The notification could be given, for example, when the device is started. It may of course also be given periodically.

[0015] Telephone numbers, on the other hand, take the form area code - local exchange code - ****. These may be regarded as three levels, but for numeric input each digit can also be treated as its own level. In that case, if a single digit is misrecognized and only that one digit is re-entered as a correction, a new misrecognition is likely, because the same digit may also appear at other positions. It is therefore preferable to re-enter the digit together with its neighbors (for example, one digit on each side, for three digits in total). Entering at least about three digits in this way should make it easier to identify the corresponding portion.
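The ambiguity of a one-digit correction can be illustrated with a small text-level sketch. The patent compares voice information, not text; here the digit string `spoken_before` merely stands in for the held voice information of the previous utterance, exact substring matching stands in for acoustic similarity, and the phone number and function names are hypothetical.

```python
def find_all(haystack, needle):
    """Return every start index at which `needle` occurs in `haystack`."""
    return [i for i in range(len(haystack) - len(needle) + 1)
            if haystack[i:i + len(needle)] == needle]


def apply_correction(spoken_before, recognized, correction):
    """Overwrite the part of `recognized` that the correction refers to.

    The correction is located in what the user said before. If it matches
    at more than one position (or nowhere), the location is ambiguous and
    None is returned.
    """
    positions = find_all(spoken_before, correction)
    if len(positions) != 1:
        return None
    start = positions[0]
    return recognized[:start] + correction + recognized[start + len(correction):]


spoken = "0566251234"      # hypothetical number the user actually said
recognized = "0566291234"  # the sixth digit "5" was misrecognized as "9"

print(find_all(spoken, "5"))                        # one digit: two positions
print(apply_correction(spoken, recognized, "5"))    # ambiguous
print(apply_correction(spoken, recognized, "251"))  # digit plus its neighbors
```

Re-entering only "5" matches two positions in the earlier utterance, whereas "251" matches exactly one, so the three-digit correction can be placed unambiguously.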

[0016] Here too, the method is effective only if the user adheres to it, so it is preferable, as set out in claim 4, to inform the user of the correction method. The voice information of the current utterance and of the previous utterance compared by the correction location determination means may be voice waveform information, as set out in claim 5, or voice feature parameters, as set out in claim 6. The comparison may be performed using the DP matching method, as set out in claim 7, or using a hidden Markov model, as set out in claim 8.

[0017] As described above, when the part determined to be the correction location corresponds to some of the hierarchy levels of a comparison target pattern candidate, the constituent word or word group of those levels is temporarily treated as a comparison target pattern candidate, so that both "normal input" and "input for correction" can be handled. Alternatively, the arrangement of claim 9 may be used: when the correction action detection means detects a predetermined user action indicating that the current utterance is a correction of the previous utterance, recognition is performed using only the constituent words or word groups of the partial levels temporarily treated as comparison target pattern candidates. If the utterance is known to be a correction of the previous one, there is no need to handle "normal input", so the number of comparison target pattern candidates decreases accordingly and misrecognition can be reduced further.

[0018] Where the user can choose among several voice input methods for correction, more accurate recognition is possible if, as set out in claim 10, the correction location determination means determines the correction location according to the selected method. In terms of the examples so far, the methods include entering only the part to be corrected, entering from the part to be corrected through to the end, and entering the part to be corrected together with what precedes and follows it; the device is arranged to handle whichever of these the user selects.

[0019] It was explained that when a predetermined confirmation instruction is given after the recognition result has been reported, the result is treated as confirmed and processing moves on to predetermined post-confirmation processing. Where the device is used in a car navigation system, this "predetermined post-confirmation processing" may be, for example, the processing that sets the destination obtained as the recognition result, or processing that instructs the device performing destination setting to set that destination. The "predetermined confirmation instruction" after the result has been reported may likewise be given by voice (for example, by saying "yes") or by operating switches.

[0020] As for reporting the recognition result, this may be done, as set out in claim 11, by outputting the content of the recognition result as speech from a predetermined sound-generating device. When the device is used for in-vehicle equipment such as a car navigation system, speech output means the driver need not shift his or her gaze to the display device, which is advantageous for ensuring safe driving. Reporting is not limited to speech output, however: as set out in claim 11, the content of the recognition result may be shown as an image of characters or symbols on a display device capable of displaying characters or symbols on a screen, the result may be reported by both speech and image, or some other reporting method may be used. Although speech output was described as advantageous for in-vehicle use, there are of course times when the vehicle is not moving, so reporting by both speech and image lets the driver confirm the result both visually and aurally.

[0021] When the recognition result for a correction utterance is reported on a display, it is also preferable, for example as set out in claim 13, to display the portion that was corrected relative to the first or the previous recognition result in a different display style so that it can be distinguished from the rest, for example by changing its color or enlarging its characters. This makes clear which part of the whole was corrected.

[0022] One possible application of the voice recognition device described above is a navigation system. In that case the system comprises the voice recognition device and a navigation device; the voice input means of the voice recognition device is used by the user to speak at least the predetermined navigation-related data that must be specified for the navigation device to perform navigation processing; and the post-confirmation processing means is configured to output the recognition result from the recognition means to the navigation device. The destination is the typical example of such "predetermined navigation-related data", but it also includes other instructions that must be specified for navigation processing, such as the selection of route search conditions. In this case the navigation-related information obtained as the recognition result is reported to the user, and, as described above, usability improves: for example, if part of a spoken place name has been misrecognized, the user need only re-enter the part that is wrong.

[0023]

[Embodiments of the Invention] An embodiment to which the present invention is applied is described below with reference to the drawings. Needless to say, embodiments of the present invention are not limited in any way to the embodiment below and may take various forms as long as they fall within the technical scope of the invention.

[0024] FIG. 1 is a block diagram showing the configuration of a system that includes a control device 1 with a voice recognition function. The control device 1 of this embodiment is mounted in an automobile (vehicle) and controls a navigation device 15 mounted in that vehicle while conversing by voice with a vehicle occupant (mainly the driver) as the user.

[0025] As shown in FIG. 1, the control device 1 of this embodiment is connected to a switch device 3 with which the user enters various commands and data by external operation, a display device 5 for displaying images, a microphone 7 for voice input, a talk switch 9 operated during voice input, a speaker 11 for voice output, and a well-known navigation device 15 that detects the vehicle's current position and provides route guidance.

[0026] The navigation device 15 includes a well-known GPS device for detecting the vehicle's current position, a CD-ROM storing route guidance data such as map data, place name data and facility name data, a CD-ROM drive for reading data from the CD-ROM, and operation keys with which the user enters commands. When, for example, the user enters via the operation keys a destination and a command requesting route guidance to it, the navigation device 15 provides route guidance by displaying on the display device 5 a road map containing the vehicle's current position and the optimal route to the destination. Besides the road map for route guidance shown by the navigation device 15, the display device 5 also shows various other images, such as information search menus.

[0027] The control device 1 comprises: a control unit 50 built around a microcomputer consisting of a CPU, ROM, RAM and the like; an input unit 23 that passes commands and data from the switch device 3 to the control unit 50; a screen output unit 25 that converts image data output from the control unit 50 into an analog image signal and outputs it to the display device 5 so that an image appears on the screen; a voice input unit 27 that converts the voice signal input from the microphone 7 into digital data; a voice recognition unit 30 that recognizes and acquires, from the voice signal input via the voice input unit 27, the keyword spoken by the user (hereinafter also called the utterance keyword); a correction location determination unit 40 that determines the correction location from the user's previous and current utterances, likewise using the voice signal input via the voice input unit 27; a voice output unit 28 that converts text data output from the control unit 50 into an analog voice signal and outputs it to the speaker 11 to sound the speaker 11; and a device control interface (device control I/F) 29 that connects the navigation device 15 and the control unit 50 for data communication.

[0028] To analyze the features of the input voice, the voice input unit 27 cuts out frame signals of, for example, a few tens of milliseconds at fixed intervals and determines whether the input signal is a voice section containing speech or a noise section containing none. This determination is made because the signal input from the microphone 7 contains noise as well as the speech to be recognized. Many methods for this determination have been proposed; a commonly used one extracts the short-time power of the input signal at fixed intervals and classifies a section as voice or noise according to whether short-time power at or above a predetermined threshold has continued for at least a certain duration. When a section is determined to be a voice section, the input signal is output to the voice recognition unit 30.
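The short-time power method mentioned above can be sketched roughly as follows. The non-overlapping framing, the mean-square definition of power, the parameter values and the function names are assumptions chosen for illustration; the patent does not specify them.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def voice_sections(samples, frame_len, threshold, min_frames):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of at
    least `min_frames` consecutive frames with power >= `threshold`.

    `threshold` is assumed to be positive so that the zero-power sentinel
    appended below always closes a run that reaches the end of the signal.
    """
    sections, run_start = [], None
    powers = short_time_power(samples, frame_len)
    for idx, power in enumerate(powers + [0.0]):
        if power >= threshold and run_start is None:
            run_start = idx                      # a loud run begins
        elif power < threshold and run_start is not None:
            if idx - run_start >= min_frames:    # long enough to count as voice
                sections.append((run_start, idx))
            run_start = None
    return sections


# Toy signal: silence, a long loud burst, silence, a short loud burst, silence.
signal = [0.0] * 8 + [1.0] * 12 + [0.0] * 4 + [1.0] * 4 + [0.0] * 4
sections = voice_sections(signal, frame_len=4, threshold=0.5, min_frames=2)
print(sections)
```

The long burst spans three frames and is kept, while the one-frame burst is shorter than `min_frames` and is treated as noise.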

[0029] The configurations of the voice recognition unit 30, the correction location determination unit 40, and the control unit 50 will now be described in more detail with reference to FIG. 2. The voice recognition unit 30 includes a collation unit 31, a dictionary unit 32, a previous result storage unit 33, and a dictionary control unit 34. The dictionary unit 32 stores dictionary data consisting of an ID and a structure for each of a plurality of keywords (comparison target pattern candidates) that the user is expected to utter and that the control device 1 should recognize. The collation unit 31 collates (recognizes) the voice data input from the voice input unit 27 against the dictionary data in the dictionary unit 32 and outputs the ID of the keyword with the highest recognition likelihood to the control unit 50 as the recognition result. This recognition result is also stored in the previous result storage unit 33. The previous result storage unit 33 overwrites its contents each time the collation unit 31 produces a recognition result, so it holds only the most recent one. The dictionary control unit 34 controls the dictionary in the dictionary unit 32 based on the previous recognition result stored in the previous result storage unit 33 and the correction location determined by the correction location determination unit 40. The details of this dictionary control are described later.
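
The relationship between collation and the previous result storage described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the class name, the string patterns, and in particular `toy_likelihood` are hypothetical stand-ins for real acoustic scoring.

```python
def toy_likelihood(voice_data, pattern):
    """Hypothetical stand-in for an acoustic score: count of matching positions."""
    return sum(a == b for a, b in zip(voice_data, pattern))


class Recognizer:
    def __init__(self, dictionary):
        self.dictionary = dictionary   # keyword ID -> reference pattern
        self.previous_result = None    # holds only the most recent result

    def recognize(self, voice_data, likelihood=toy_likelihood):
        # Pick the keyword ID whose pattern scores the highest likelihood.
        best_id = max(
            self.dictionary,
            key=lambda kid: likelihood(voice_data, self.dictionary[kid]),
        )
        self.previous_result = best_id  # overwrite, as the storage unit 33 does
        return best_id
```

Each call overwrites `previous_result`, so only the latest recognition result is available to later dictionary control.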

[0030] The correction location determination unit 40 includes a voice information holding unit 41 and a comparison/determination unit 42. The voice information holding unit 41 holds the voice information input from the voice input unit 27. The comparison/determination unit 42 compares the voice information input this time from the voice input unit 27 with the previous voice information held in the voice information holding unit 41 and determines the portion where the two resemble each other. The comparison may be made on the voice waveforms themselves or on feature parameters of the voice, and it is performed using a DP matching method or a hidden Markov model. The similar portion is determined to be the correction location and is output to the dictionary control unit 34 of the voice recognition unit 30. Information on the correction location is also output to the control unit 50.
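
One way to picture the DP matching mentioned above is a subsequence alignment that finds the span of the previous utterance most similar to the current, shorter utterance. The sketch below assumes each frame is reduced to a single number and uses the absolute difference as the local distance; both are simplifications chosen for illustration, and the function name is hypothetical.

```python
def locate_similar_part(previous, current):
    """Subsequence DP matching. Returns (start, end), end exclusive, of the span
    of `previous` that aligns with `current` at the lowest accumulated distance."""
    n, m = len(current), len(previous)

    def dist(i, j):
        return abs(current[i] - previous[j])

    # The first frame of `current` may align with any frame of `previous`.
    cost = [[dist(0, j) for j in range(m)]]
    start = [list(range(m))]
    for i in range(1, n):
        row_cost, row_start = [], []
        for j in range(m):
            # Each candidate is (accumulated cost, start index of that path).
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
                candidates.append((row_cost[j - 1], row_start[j - 1]))
            best_cost, best_start = min(candidates)
            row_cost.append(best_cost + dist(i, j))
            row_start.append(best_start)
        cost.append(row_cost)
        start.append(row_start)
    # The alignment may end anywhere in `previous`; take the cheapest end point.
    end = min(range(m), key=lambda j: cost[-1][j])
    return start[-1][end], end + 1
```

In practice the frames would be feature vectors and the distance a vector distance, but the recurrence is the same.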

[0031] The control unit 50 includes a recognition result storage unit 51, a final recognition result determination unit 52, a post-processing unit 53, and the like. The recognition result storage unit 51 stores the recognition results output from the voice recognition unit 30 and retains them until they are deleted. The final recognition result determination unit 52 determines the final recognition result based on the one or more recognition results stored in the recognition result storage unit 51 and the correction location input from the correction location determination unit 40. The recognition results stored in the recognition result storage unit 51 are cleared (deleted) when a predetermined confirmation instruction is given.

[0032] The post-processing unit 53 executes, for example, a "post-confirmation process" that, when the predetermined confirmation instruction is given, sends data to the navigation device 15 via the device control I/F 29 and instructs it to perform predetermined processing, or a process that sends the recognition result output from the voice recognition unit 30 to the voice output unit 28 as text data and instructs it to be voiced from the speaker 11.

[0033] The recognition result sent from the voice recognition unit 30 to the control unit 50 may be all of the top-ranked comparison target patterns obtained as the final recognition result, or only the highest-ranked one. To keep the following description easy to follow, however, it is assumed that only the single highest-ranked result is sent unless otherwise stated.

[0034] In this embodiment, voice input becomes possible after the user presses the talk switch 9. Because the user may press the talk switch 9 and then not speak, the device returns to a state in which voice input is not accepted if a silent section continues for a predetermined time or longer after voice input has been enabled. The voice input unit 27 monitors when the talk switch 9 is pressed, and for it merely detecting the press is sufficient. The voice recognition unit 30 and the correction location determination unit 40, by contrast, monitor both when the talk switch 9 is pressed and how long it remains pressed, so that they can tell whether the operation was a click or a double-click. Specifically, if the talk switch 9 is turned off within a relatively short time (for example, within 0.5 seconds) after being turned on, the operation is regarded as a click. If two clicks occur in succession within a predetermined interval (for example, within 0.5 seconds), the operation is regarded as a double-click. In the system of this embodiment, the user clicks for normal voice input and double-clicks for voice input intended as a correction, so the voice recognition unit 30 and the correction location determination unit 40 detect which kind of input is being made and process it accordingly. Instead of a double-click, a so-called long press (for example, holding the switch for two seconds or longer) may be treated as indicating a correction input.
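
The click, double-click, and long-press rules above could be sketched as follows. The thresholds come from the examples in the text; the function names, the event format, and the choice to measure the double-click interval from the first release to the second press are assumptions made for this sketch.

```python
CLICK_MAX_HOLD = 0.5    # seconds the switch may stay on and still count as a click
DOUBLE_CLICK_GAP = 0.5  # assumed: max seconds between first release and second press
LONG_PRESS_MIN = 2.0    # seconds of continuous hold treated as a long press


def classify_presses(presses):
    """presses: chronological list of (on_time, off_time) pairs in seconds.
    Returns 'click', 'double_click', 'long_press', or 'unknown'."""
    if not presses:
        return "unknown"
    holds = [off - on for on, off in presses]
    if len(presses) == 1:
        if holds[0] <= CLICK_MAX_HOLD:
            return "click"
        if holds[0] >= LONG_PRESS_MIN:
            return "long_press"
        return "unknown"
    if len(presses) == 2 and all(h <= CLICK_MAX_HOLD for h in holds):
        gap = presses[1][0] - presses[0][1]
        if gap <= DOUBLE_CLICK_GAP:
            return "double_click"
    return "unknown"


def input_mode(kind):
    """Map an operation to the kind of voice input it announces."""
    modes = {"click": "normal", "double_click": "correction", "long_press": "correction"}
    return modes.get(kind)  # None for operations the sketch does not cover
```

A single short press selects normal input, while a double-click or a long press selects correction input.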

[0035] Next, the operation of the system of this embodiment will be described with reference to the flowcharts of FIGS. 3 and 4, taking as an example the voice input of a destination for a route search by the navigation device 15. In S10, the first step in FIG. 3, it is determined whether the talk switch 9 has been turned on (pressed). If it has (S10: YES), voice extraction processing is performed (S20). In this processing, the voice input unit 27 determines, based on the voice data input via the microphone 7, whether each section is a voice section or a noise section, extracts the data of the voice sections, and outputs it to the voice recognition unit 30 and the correction location determination unit 40.

[0036] Next, voice recognition processing is performed (S30). The details are described with reference to FIG. 4. As noted above, the voice recognition unit 30 and the correction location determination unit 40 monitor when the talk switch 9 is pressed and how long it remains pressed, so it is first determined whether the talk switch 9 was clicked (S31). If it was (S31: YES), the input is a normal voice input and the recognition processing is executed as is (S32). If it was not (S31: NO), for example because the switch was double-clicked or long-pressed, the input is a correction. In that case the correction location determination unit 40 determines the correction location (S34), and the voice recognition unit 30 controls the dictionary based on that location (S35). The recognition processing of S32 is then executed using the controlled dictionary.

[0037] The dictionary control of S35 will now be described, beginning with the dictionary data stored in the dictionary unit 32 of this embodiment. The dictionary data contains not only the vocabulary items that serve as comparison target pattern candidates but also, where such an item is formed by joining a plurality of words hierarchically, data indicating that hierarchical structure. Specifically, as shown in FIG. 5, the syllable data making up the vocabulary is assigned to the edges of a tree structure (indicated by arrows (→) in FIG. 5). In FIG. 5, a single circle (○) represents a vertex and a double circle (◎) represents an accepting vertex, that is, a vertex corresponding to a word. The vertex indicated by arrow A in FIG. 5 is the root, and a word is completed by following the syllables assigned to the edges from the root in preorder traversal. Preorder traversal means visiting the root and then traversing, in order, the subtrees rooted at each of its children (each of those traversals is itself a preorder traversal). Here, a "parent" is the immediately preceding vertex, a "child" is the next vertex, and "siblings" are vertices that share the same parent.

[0038] In the specific example shown in FIG. 5, following the edges in order from the root vertex (arrow A) yields "Aichiken" and reaches the accepting vertex indicated by arrow B. "Aichiken" (Aichi Prefecture) is therefore one recognition target word. Continuing through the accepting vertex at arrow B yields "Kariyashi" and reaches the accepting vertex indicated by arrow C, so "Aichiken Kariyashi" (Kariya City, Aichi Prefecture) is also one recognition target word. Child vertices follow the accepting vertex at arrow C as well. Although not shown in FIG. 5, the path can continue with, for example, "Showacho" to another accepting vertex, so "Aichiken Kariyashi Showacho" (Showa-cho, Kariya City, Aichi Prefecture) is likewise one recognition target word.
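
The tree-structured dictionary described above behaves like a trie whose edges carry syllables and whose accepting vertices mark the end of each hierarchical level. The following is a minimal sketch under the assumption that one kana character stands for one syllable; the class and function names are hypothetical.

```python
class Vertex:
    def __init__(self):
        self.children = {}      # syllable -> child Vertex (the edges of the tree)
        self.accepting = False  # True for a double-circle vertex (a word ends here)


def add_word(root, levels):
    """Insert a place name given as a list of level strings (prefecture, city,
    town), marking the end of every level as an accepting vertex."""
    node = root
    for level in levels:
        for syllable in level:
            node = node.children.setdefault(syllable, Vertex())
        node.accepting = True


def preorder_words(node, prefix=""):
    """Preorder traversal: visit the vertex itself, then each child's subtree
    in insertion order, collecting the word at every accepting vertex."""
    words = [prefix] if node.accepting else []
    for syllable, child in node.children.items():
        words.extend(preorder_words(child, prefix + syllable))
    return words
```

Registering 「あいちけん／かりやし／しょうわちょう」 and 「あいちけん／かりやし／しょうえいちょう」 makes the prefecture, the prefecture plus city, and both full names each a recognition target word, with the two town names sharing the path up to the accepting vertex for 「かりやし」.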

[0039] In this case, the single recognition target word "Aichiken Kariyashi Showacho" (Showa-cho, Kariya City, Aichi Prefecture) consists of three words joined hierarchically: "Aichiken" (Aichi Prefecture), "Kariyashi" (Kariya City), and "Showacho" (Showa-cho). That the word has three levels can be seen from the accepting vertices indicated by arrows B and C in FIG. 5. An accepting vertex indicates that the syllable data followed so far constitutes a word; conversely, if further syllable data exists downstream of an accepting vertex, everything upstream of that vertex belongs to the upper level and everything downstream belongs to the lower level. For the accepting vertex at arrow B in FIG. 5, for example, the upstream "Aichiken" (Aichi Prefecture) is the upper level and the downstream "Kariyashi (Kariya City) ..." is the lower level. In other words, the word denoting the prefecture is the upper level and the words denoting the city level and below are the lower level. Likewise, for the accepting vertex at arrow C in FIG. 5, the upstream "Aichiken Kariyashi" (Kariya City, Aichi Prefecture) is the upper level and the downstream portion not shown in FIG. 5, for example "Showacho" (Showa-cho), is the lower level.

[0040] The dictionary data stored in the dictionary unit 32 has been explained above using the specific example of Showa-cho, Kariya City, Aichi Prefecture. Data is set in the same way for other place names, basically with the prefecture as the top level, the city level as the second level, and the town level as the third level. The qualifier "basically" is used because in some place names a town or village, rather than a city, comes directly after the prefecture.

[0041] For such dictionary data, the following dictionary control is performed in S35 of FIG. 4. Because the correction location determination unit 40 identifies which part of the previously input voice information is being corrected, the dictionary control unit 34 of the voice recognition unit 30 also uses the recognition result stored in the previous result storage unit 33 and takes only the portion corresponding to the identified correction location as comparison target pattern candidates. Suppose, for example, that the user said "Aichi-ken Kariya-shi Showa-cho" (Showa-cho, Kariya City, Aichi Prefecture) but the voice recognition unit 30 misrecognized it as "Aichi-ken Kariya-shi Shoei-cho". When the user says only "Showa-cho" as a correction, the correction location determination unit 40 compares the previous voice information held in the voice information holding unit 41 (the user's utterance of "Aichiken Kariyashi Showacho") with the current voice information (the user's utterance of "Showacho") and determines which part of the previous voice information the current input corresponds to. Having received this correction location information, the dictionary control unit 34 of the voice recognition unit 30 also refers to the recognition result "Aichi-ken Kariya-shi Shoei-cho" stored in the previous result storage unit 33. It can thereby tell that, relative to the previous recognition result, nothing up to "Aichi-ken Kariya-shi" is being corrected and that the correction concerns the town name at the level below. It therefore takes only the town names that follow "Aichi-ken Kariya-shi" (specifically Showa-cho, Shoei-cho, and so on) as comparison target pattern candidates. This is the content of the dictionary control.
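
The effect of this dictionary control on the candidate set can be sketched as follows. The three-entry `PLACES` table is a toy dictionary built only from place names mentioned in this description, and the function names and the tuple representation are assumptions for illustration.

```python
# Toy dictionary: prefecture -> city -> towns (illustrative entries only).
PLACES = {
    "Aichi-ken": {
        "Kariya-shi": ["Showa-cho", "Shoei-cho"],
        "Okazaki-shi": ["Showa-cho"],
    },
}


def all_candidates(places):
    """Every full place name as a (prefecture, city, town) tuple."""
    return [
        (pref, city, town)
        for pref, cities in places.items()
        for city, towns in cities.items()
        for town in towns
    ]


def restricted_candidates(places, previous_result, corrected_level):
    """Keep only candidates that agree with the previous recognition result at
    every level above `corrected_level` (0 = prefecture, 1 = city, 2 = town)."""
    fixed = tuple(previous_result[:corrected_level])
    return [c for c in all_candidates(places) if c[:corrected_level] == fixed]
```

When the previous result is ("Aichi-ken", "Kariya-shi", "Shoei-cho") and the correction concerns level 2, only the towns under Kariya-shi remain as candidates, and the Okazaki-shi entry is excluded.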

[0042] Conversely, when the determination in S31 is affirmative, that is, for a normal input, the recognition processing (S32) is executed without this dictionary control. After the recognition processing, the voice recognition unit 30 stores the recognition result in the previous result storage unit 33 and outputs it to the control unit 50 (S33), and the process proceeds to S40 in FIG. 3.

[0043] Returning to the flowchart of FIG. 3, in S40 the recognition result is talked back and displayed. For the talkback, the control unit 50 controls the voice output unit 28 so that the recognized result is output as voice from the speaker 11, and controls the screen output unit 25 so that characters or the like indicating the recognized result are displayed on the display device 5. The talkback covers only the portion that the voice recognition unit 30 has just recognized. In the example above, the first voice input is talked back as "Aichi-ken Kariya-shi Shoei-cho, is that right?" and the correction input as "Showa-cho, is that right?". For the display, the corrected portion alone may likewise be shown in response to a correction input, but it is also possible, for example, to display "Aichi-ken Kariya-shi Showa-cho, is that right?" with only the "Showa-cho" portion in a different color.

[0044] It is then determined, based on an instruction from the user, whether the recognition was correct (S50). The instruction may be an operation of the switch device 3 by the user or a voice input from the microphone 7. For example, a voice input with affirmative content such as "yes" indicates that the recognition was correct, and a voice input with negative content such as "no" or "wrong" indicates that it was incorrect.

[0045] If the recognition was incorrect (S50: NO), it is determined whether the recognition result belongs to a predetermined category (S90). Because this embodiment assumes processing for setting a destination for route guidance, the predetermined category is the category of place names. If the result belongs to the predetermined category (S90: YES), the process proceeds to S100 and the recognition result is temporarily stored in the recognition result storage unit 51 of the control unit 50. A recognition result stored in this way remains in the recognition result storage unit 51 until the deletion processing of S80 is executed. In other words, when corrections are input repeatedly, S100 may be executed several times, and all of the resulting recognition results are kept.

[0046] Next, the correction method is announced (S110). Since a misrecognition has occurred and the user is expected to make another input to correct it, this step informs the user how to make the correction. One conceivable correction method is to re-enter everything from the misrecognized level downward. That is, if the user said "Aichi-ken Kariya-shi Showa-cho" and it was misrecognized as "Aichi-ken Okazaki-shi Showa-cho", the user is asked to say "Kariya-shi Showa-cho" rather than only the misrecognized part, "Kariya-shi". This is based on the finding that, from the viewpoint of preventing misrecognition, matching becomes easier the more there is to recognize. In addition, if everything from the misrecognized level downward is re-entered, the rear end of the corrected portion is known, so an improvement in recognition accuracy can be expected. Under this correction input rule, the correction location determination unit 40 can always compare the voice information of the current utterance with that of the previous utterance starting from the tail end of the voice information, and can therefore determine the correction location (the re-entered portion) more accurately. Because this works only if the user follows the correction method, the method is announced to the user in S110 of FIG. 3.

[0047] If the result does not belong to the predetermined category (S90: NO), the process proceeds to S120 and other processing is executed. After S110 or S120, the process returns to S10 and repeats. If, on the other hand, the determination in S50 is affirmative, that is, the recognition is judged to be correct, the final recognition result determination unit 52 of the control unit 50 determines the recognition result (S60). If there has been no correction input and only one recognition result is stored in the recognition result storage unit 51, that result is taken as the final recognition result. If a plurality of recognition results are stored, the final recognition result is determined by considering them together. In the example above, suppose the user said "Aichi-ken Kariya-shi Showa-cho", it was misrecognized as "Aichi-ken Kariya-shi Shoei-cho", and the user then said "Showa-cho" to correct only the misrecognized part, which was recognized correctly. The recognition result storage unit 51 then holds two recognition results, "Aichi-ken Kariya-shi Shoei-cho" and "Showa-cho". Because the information identifying the correction location is also output from the correction location determination unit 40 to the control unit 50, the "Aichi-ken Kariya-shi" portion of the first stored result is combined with the "Showa-cho" stored second, and "Aichi-ken Kariya-shi Showa-cho" is determined as the final recognition result.
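
The combination of stored results described above might be sketched as follows. The representation of each stored result as a list of levels paired with the level at which it starts is an assumption, as is the rule that a later correction simply overwrites the levels it covers; the text says only that the results are considered together.

```python
def determine_final_result(stored):
    """stored: chronological list of (levels, start_level) pairs. The first entry
    is the full utterance (start_level 0); each later entry is a correction that
    overwrites the levels it covers, beginning at start_level."""
    final = []
    for levels, start_level in stored:
        # Slice assignment replaces exactly the corrected levels and keeps the rest.
        final[start_level:start_level + len(levels)] = levels
    return final
```

For the example in the text, the first entry is `(["Aichi-ken", "Kariya-shi", "Shoei-cho"], 0)` and the correction is `(["Showa-cho"], 2)`, which yields Aichi-ken, Kariya-shi, Showa-cho. Further corrections are applied in the same way, in the order they were stored.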

[0048] Even when two or more correction inputs have been made and three or more recognition results are stored in the recognition result storage unit 51, the final recognition result can be determined in the same way. Corrections are not limited to one; they may be repeated any number of times until the input is finally recognized correctly. The user therefore only needs to keep making correction inputs as long as the recognition result talked back or displayed in S40 is incorrect.

[0049] Once the recognition result has been determined, predetermined post-confirmation processing is executed (S70). Here this processing includes outputting the data on the "destination for route guidance" obtained as the recognition result to the navigation device 15 via the device control I/F 29, and restoring the dictionary control in the voice recognition unit 30 to its original state. The dictionary control described above is purely a measure for handling corrections, so once the input has finally been recognized correctly it is better not to leave the dictionary restricted, and it is therefore restored.

[0050] The recognition results temporarily stored in the previous result storage unit 33 of the voice recognition unit 30 and in the recognition result storage unit 51 of the control unit 50 are then deleted (cleared) (S80). This temporary storage is likewise a measure for handling corrections and is no longer needed once the input has finally been recognized correctly. Moreover, leaving such recognition results in place would interfere with the recognition of a subsequent voice input with different content. After S80, the process returns to S10 and repeats.

[0051] The operation has been described above for the case of inputting a destination for route guidance by voice. To make the features and effects of the voice recognition of the present invention clearer, the description continues with the specific example already used in the explanation of the flowcharts, in which "Aichi-ken Kariya-shi Showa-cho" (Showa-cho, Kariya City, Aichi Prefecture) is specified as the destination.

[0052] Suppose the user says "Aichi-ken Kariya-shi Showa-cho" into the microphone 7. Unless the accuracy of voice recognition is 100%, the input may be misrecognized. If it is misrecognized as, for example, "Aichi-ken Kariya-shi Shoei-cho", that result is output as voice from the speaker 11.

[0053] The user thereby learns that the input has been misrecognized and must make another voice input to correct it. The user need not say "Aichi-ken Kariya-shi Showa-cho" again, however; saying only "Showa-cho" is enough. Correcting only the misrecognized part (the town name "Showa-cho" in this example) is entirely natural given the habits of everyday conversation, and forcing the user to take special care only when using a voice recognition device is undesirable from the standpoint of usability. Because this embodiment accommodates the natural conversational habit of correcting only the misrecognized part, it relieves the user of the burden of having to speak from the upper level even when omitting the upper level would be more natural, and usability is improved accordingly.

[0054] Furthermore, misrecognition can be reduced while such restatements (corrections) are accommodated. The conventional technique described above treats every intermediate level as a possible recognition start point so that a restatement from any level can be handled, which increases the number of comparison target pattern candidates. In this embodiment, by contrast, the correction location determination unit 40 identifies the correction location in advance and only the portion corresponding to that location is taken as comparison target pattern candidates, so the number of candidates is relatively small. In the example above, nothing up to "Aichi-ken Kariya-shi" is corrected, so only the town names at the level below (specifically Showa-cho, Shoei-cho, and so on) need to be taken as comparison target pattern candidates. All candidates that do not have "Aichi-ken Kariya-shi" as their upper level are excluded, leaving comparatively very few. Having fewer comparison target pattern candidates reduces the possibility of misrecognition and also shortens the recognition processing time.

[0055] In this embodiment, the microphone 7 and the voice input unit 27 correspond to the "voice input means", and the voice output unit 28, the speaker 11, the screen output unit 25, and the display device 5 correspond to the "notification means" and the "correction method notification means". The voice recognition unit 30 corresponds to the "recognition means", and the dictionary unit 32 within it corresponds to the "dictionary means". The control unit 50 corresponds to the "post-confirmation processing means". The voice information holding unit 41 in the correction location determination unit 40 corresponds to the "voice information holding means", and the comparison/determination unit 42 corresponds to the "correction location determination means". The talk switch 9, the voice input unit 27, the voice recognition unit 30, and the correction location determination unit 40 correspond to the "correction operation detection means" and the "correction method detection means".

[0056] [Other Embodiments] One embodiment has been described above with reference to FIGS. 1 to 5; several other embodiments are described below. (1) The above embodiment prioritizes preventing erroneous recognition and therefore adopts the rule that, for a correction input, everything from the misrecognized level downward is re-input. From the standpoint of reducing the user's burden and improving usability, however, it is preferable to re-input only the misrecognized portion. Which approach to adopt is a trade-off between their respective merits, and the user may, for example, be allowed to select either method. In that case, the process shown in FIG. 6 may be executed in place of the process shown in FIG. 4 as the voice recognition process in S30 of FIG. 3.

[0057] Here, two correction methods are permitted, and they are distinguished by how the talk switch 9 is operated. When the talk switch 9 is clicked (S231: YES), normal recognition processing is performed. When the operation is not a click (S231: NO), the correction input method is distinguished by whether the talk switch 9 was double-clicked. For a double click (S234: YES), normal correction location determination is executed (S235). Otherwise (S234: NO; for example, a triple click or a long press), the correction location is determined by "comparing from the rear end of the input voice information" as described in the above embodiment (S237).
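The branching on the talk switch operation can be summarized in a short sketch. The enum, the function name, and the returned labels are hypothetical; only the step numbers and the click, double-click, and other-operation distinction come from the text.

```python
from enum import Enum


class SwitchOp(Enum):
    """Hypothetical encoding of how the talk switch 9 was operated."""
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    OTHER = "other"  # e.g. triple click or long press


def select_processing(op):
    """Map the talk switch operation to the processing branch of FIG. 6."""
    if op is SwitchOp.CLICK:          # S231: YES
        return "normal recognition"
    if op is SwitchOp.DOUBLE_CLICK:   # S231: NO, S234: YES
        return "normal correction location determination (S235)"
    # S231: NO, S234: NO
    return "correction location determination from the rear end (S237)"
```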

[0058] Comparing from the rear end as in S237 gives relatively higher recognition accuracy, but the user must adhere to that correction method. In the case of S236, on the other hand, the correction input may consist of the correction location alone, so the user's burden is relatively lighter and usability is relatively better. Of course, because S236 allows a free correction method, portions other than the correction location may also be input again.
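To make the difference between a free comparison and a comparison anchored at the rear end concrete, the following is a minimal sketch using DP matching (dynamic time warping), which claim 7 names as one possible comparison method. It is an illustration under simplifying assumptions, not the patented implementation: each utterance is a list of one-dimensional feature values, the local distance is an absolute difference, and the function name and the `anchor_rear` flag are hypothetical.

```python
import math


def locate_correction(prev, cur, anchor_rear=False):
    """Find the span prev[start:end] most similar to cur.

    prev: feature sequence of the previous utterance
    cur:  feature sequence of the current (correction) utterance
    anchor_rear: if True, the span must end at the end of prev,
                 i.e. the comparison is tied to the rear end.
    Returns (start, end, accumulated_cost).
    """
    n, m = len(prev), len(cur)
    # cost[i][j]: best cost of aligning cur[:i] with a span of prev ending at j
    cost = [[math.inf] * (n + 1) for _ in range(m + 1)]
    # begin[i][j]: index in prev where that best span begins
    begin = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        cost[0][j] = 0.0   # the span may begin anywhere in prev
        begin[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(cur[i - 1] - prev[j - 1])
            # Choose the cheapest predecessor; ties go to the earlier start.
            best, src = min(
                (cost[i - 1][j - 1], begin[i - 1][j - 1]),
                (cost[i - 1][j], begin[i - 1][j]),
                (cost[i][j - 1], begin[i][j - 1]),
            )
            cost[i][j] = d + best
            begin[i][j] = src
    if anchor_rear:
        end = n
    else:
        end = min(range(1, n + 1), key=lambda j: cost[m][j])
    return begin[m][end], end, cost[m][end]


# Toy features: three "words" of three frames each in the previous utterance.
previous_utterance = [1, 2, 3, 7, 8, 9, 4, 5, 6]
start, end, total = locate_correction(previous_utterance, [4.2, 5.1, 5.9])
```

With the rear anchor enabled, the end of the matched span is fixed, so the search only has to decide where the corrected portion begins. That smaller search space is one plausible reading of why the rear-end comparison is described as more accurate, at the price of constraining how the user must speak the correction.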

[0059] The remaining steps are the same as in FIG. 4. That is, S232 and S233 of FIG. 6 are the same processing as S32 and S33 of FIG. 4, and S236 of FIG. 6 is the same processing as S35 of FIG. 4. In this case, the notification of the correction method in S110 of FIG. 3 informs the user that two correction methods are available and how to designate them.

[0060] (2) The above embodiment uses an address as an example, but the invention applies equally to anything that can be treated as a plurality of words connected hierarchically. A telephone number is one such case. A telephone number generally takes the form area code - local exchange code - ****, so it may be regarded as having three levels; for digit input, however, each digit may instead be treated as its own level. If a single digit is misrecognized and only that one digit is re-input, however, the same digit may also appear at other positions, which easily leads to new misrecognition. It is therefore preferable to re-input the digit together with its neighbors (for example, three digits: the digit plus one on each side). Inputting at least about three digits, for example, should make the corresponding portion easier to identify, so in this case too it is preferable to notify the user of the correction method.
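The ambiguity of a one-digit correction can be shown with a small symbolic sketch. It is only a stand-in for the acoustic comparison: exact string matching replaces similarity between voice information, and the telephone number and function name are made up for illustration.

```python
def matching_positions(previous_utterance, respoken):
    """Return every start index at which the re-spoken digits match
    the digits of the previous utterance."""
    k = len(respoken)
    return [
        i
        for i in range(len(previous_utterance) - k + 1)
        if previous_utterance[i:i + k] == respoken
    ]


# Hypothetical number as it was spoken the first time.
spoken = "0566259159"

# Re-speaking only the misrecognized digit "9" matches two positions,
# so the correction location cannot be determined uniquely.
single = matching_positions(spoken, "9")

# Re-speaking the digit with one neighbor on each side matches one position.
triple = matching_positions(spoken, "591")
```

Here `single` holds two candidate positions while `triple` holds exactly one, which is the reason the text recommends re-inputting about three digits.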

[0061] (3) In the above embodiment, the processing of S90 in FIG. 3 determines whether the recognition result belongs to a predetermined category, and that category was described as one relating to place names because destination setting is assumed. However, the gist of the present invention is not limited to place names. Put abstractly, the category covers information for which the recognition result must be output and confirmed by the user before it is formally finalized. In terms of the car navigation system described above, it means the designation of predetermined navigation-related information that must be specified for navigation processing. The destination is a representative example of this "predetermined navigation-related information", but it also includes other instructions that must be specified for navigation processing, such as the selection of route search conditions.

[0062] (4) The above embodiment describes the voice recognition device as applied to a car navigation system, but its application is not limited to the car navigation system 2 described above. For example, when the voice recognition device is used for an air conditioning system, the set temperature may be adjusted, the air conditioning mode (cooling, heating, or dry) selected, or the airflow direction mode selected by voice input. For the set temperature, for example, there can be multiple instructions that all concern the set temperature but differ in content, such as "make the set temperature 25 degrees" or "lower the set temperature by 5 degrees". With the present invention, if the user says "make the set temperature 25 degrees" but it is misrecognized as "make the set temperature 22 degrees", the user need only say "make it 25 degrees" in the second voice input, which again improves usability. The same applies to the air conditioning mode, the airflow direction mode, and so on.

[0063] The car navigation system and the air conditioning system need not be in-vehicle equipment; they may be, for example, a portable navigation device or an indoor air conditioner. However, as described so far, when the device is used for in-vehicle equipment the user may well be the driver, in which case driving itself is the top priority and operating other in-vehicle equipment should interfere with driving as little as possible. A voice recognition device intended for an in-vehicle car navigation system or air conditioning system therefore offers an even greater advantage. From this viewpoint, the invention can of course be used in the same way for in-vehicle equipment other than navigation and air conditioning systems. For example, it is effective for car audio equipment. It is also effective in a configuration where opening and closing the power windows, adjusting the mirror angle, and the like are instructed by voice.

[0064] (5) The voice recognition function described in the above embodiment (including the correction location determination function and the like) can be realized by executing a program provided in the control device 1. Such a program can be recorded on a recording medium and distributed, or provided via a network, and can be loaded into a computer from the recording medium or the network and executed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of the embodiment system.

FIG. 2 is a block diagram showing the configuration of the voice recognition unit, the correction location determination unit, and the control unit of the embodiment system.

FIG. 3 is a flowchart showing processing related to voice recognition and dialog control in the embodiment system.

FIG. 4 is a flowchart showing the voice recognition process executed within FIG. 3.

FIG. 5 is an explanatory diagram showing the dictionary data stored in the dictionary unit within the voice recognition unit.

FIG. 6 is a flowchart showing another embodiment of the voice recognition process.

[Description of Reference Numerals] 1: control device, 3: switch device, 5: display device, 7: microphone, 9: talk switch, 11: speaker, 15: navigation device, 23: input unit, 25: screen output unit, 27: voice input unit, 28: voice output unit, 29: device control I/F, 30: voice recognition unit, 31: collation unit, 32: dictionary unit, 33: previous result storage unit, 34: dictionary control unit, 40: correction location determination unit, 41: voice information holding unit, 42: comparison/determination unit, 50: control unit, 51: recognition result storage unit, 52: final recognition result determination unit, 53: post-processing unit

Continued from the front page: (51) Int. Cl.7: G10L 15/00, 15/28. FI: G10L 3/00 551Q, 561C, 561D.

Claims

[Claims]

1. A voice recognition device comprising: voice input means for inputting voice; recognition means for comparing the voice input via the voice input means with a plurality of comparison target pattern candidates stored in advance in dictionary means and taking a candidate with a high degree of matching as a recognition result; notification means for notifying the recognition result obtained by the recognition means; and post-confirmation processing means for, when a predetermined confirmation instruction is given after the recognition result has been notified by the notification means, treating the recognition result as confirmed and executing predetermined post-confirmation processing, at least one of the plurality of comparison target pattern candidates stored in the dictionary means being set as a candidate in which a plurality of words are hierarchically connected, the voice recognition device further comprising: voice information holding means for holding voice information input via the voice input means at the time of a previous utterance; and correction location determination means for, when no predetermined confirmation instruction is given after notification of the recognition result by the notification means and voice is input via the voice input means, comparing the voice information input via the voice input means at the time of the current utterance with the voice information of the previous utterance held in the voice information holding means, determining which portion of the voice information of the previous utterance is closest to the voice information of the current utterance, and determining that closest portion to be a correction location, wherein, when the portion determined to be the correction location by the correction location determination means corresponds to a partial hierarchy of a comparison target pattern candidate in which the plurality of words are hierarchically connected, the recognition means temporarily regards the constituent words or word groups of that partial hierarchy as the comparison target pattern candidates and then performs the comparison with the input voice.

2. The voice recognition device according to claim 1, wherein the correction location determination means performs the comparison between the voice information of the current utterance and the voice information of the previous utterance starting from the rear end of the voice information.

3. The voice recognition device according to claim 2, further comprising correction method notification means for notifying a user of the voice input method to be used for correction, wherein the correction method notification means notifies the user to input a portion that includes the location the user wishes to correct.

4. The voice recognition device according to claim 1, further comprising correction method notification means for notifying a user of the voice input method to be used for correction, wherein the correction method notification means notifies the user to input a portion that includes what precedes and follows the location the user wishes to correct.

5. The voice recognition device according to claim 1, wherein the voice information of the current utterance and the voice information of the previous utterance that are compared by the correction location determination means are voice waveform information.

6. The voice recognition device according to claim 1, wherein the voice information of the current utterance and the voice information of the previous utterance that are compared by the correction location determination means are feature parameters of the voice.

7. The voice recognition device according to claim 1, wherein the correction location determination means performs the comparison between the voice information of the current utterance and the voice information of the previous utterance using a DP matching method.

8. The voice recognition device according to claim 1, wherein the correction location determination means performs the comparison between the voice information of the current utterance and the voice information of the previous utterance using a hidden Markov model.

9. The voice recognition device according to claim 1, further comprising correction operation detection means for detecting a predetermined user operation indicating that the current utterance is a correction of the content of the previous utterance, wherein, when the predetermined operation is detected by the correction operation detection means, the recognition means performs the comparison with the input voice using only the constituent words or word groups of the partial hierarchy temporarily regarded as the comparison target pattern candidates.

10. The voice recognition device according to claim 1, further comprising correction method detection means for detecting a predetermined user operation indicating which of a plurality of voice input methods available to the user for correction is being used, wherein the correction location determination means determines the correction location in accordance with the correction method detected by the correction method detection means.

11. The voice recognition device according to claim 1, wherein the notification means performs the notification by outputting the content of the recognition result as voice.

12. The voice recognition device according to claim 1, wherein the notification means performs the notification by displaying the content of the recognition result as characters, symbols, or the like.

13. The voice recognition device according to claim 12, wherein, when displaying the recognition result obtained after voice input for the correction, the notification means displays the portion corrected from the first recognition result or the previous recognition result in a display mode different from that of the other portions so that it can be distinguished from them.