JP2021085974A

JP2021085974A - Speech recognition apparatus, speech recognition processing method and speech recognition processing program

Info

Publication number: JP2021085974A
Application number: JP2019214366A
Authority: JP
Inventors: Keisuke Motegi (茂木 啓輔); Yoshiyuki Suzuki (鈴木 美幸); Kentaro Yamamoto (山本 健太郎)
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2021-06-03

Abstract

To provide a speech recognition apparatus, a speech recognition processing method and a speech recognition processing program that can reduce false recognition. SOLUTION: An on-vehicle apparatus 1 comprises: a microphone 30 which picks up a user's utterance speech; a speech recognition dictionary in which a plurality of words to be selected are registered; a speech recognition processing part 52 which performs feature extraction on the utterance speech picked up by the microphone 30, and selects words higher in similarity than a predetermined threshold from the plurality of words registered in the speech recognition dictionary; and a recognition result selection part 53 which selects, when there are a plurality of words determined by the speech recognition processing part 52 to be high in similarity, the word having top priority among them. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition device, a voice recognition processing method, and a voice recognition processing program that perform voice recognition on speech uttered by a user.

An imaging device is known that sets priorities for the functions likely to be used according to the state of the camera and, during voice recognition, updates the weights of the commands belonging to high-priority functions (see, for example, Patent Document 1). For example, in recognizing the word "chiizu" ("cheese") assigned to the "shutter button" function, the device determines that the utterance indicates the shooting command "chiizu" even when it is recognized as "chiisu". In this way, the larger the weight assigned to a shooting command, the more readily it is determined to be the command input by the user, even when its similarity to the word obtained by voice recognition is low.

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-122978

In the voice recognition disclosed in Patent Document 1, a high-priority word is more readily obtained as the recognition result. However, when the vocabulary contains several similar words that are easily confused and the user utters one of them, erroneous recognition cannot be suppressed. For example, suppose that voice recognition is applied to a field (product) other than imaging devices, and that one of the ten words "channel 1" through "channel 10" included in a channel selection command is to be selected by voice. If the user actually utters "icchanneru" (channel 1) and similarity is judged to be highest for "channel 8", followed by "channel 1" and then "channel 10", the recognition result becomes "channel 8". Even if the weights of "channel 8", "channel 1", and "channel 10" are increased using the technique of Patent Document 1, such erroneous recognition cannot be avoided. If the erroneous recognition stems from the user's speaking habits, the same result is obtained no matter how many times the utterance is repeated.

The present invention was made in view of these points, and its object is to provide a voice recognition device, a voice recognition processing method, and a voice recognition processing program capable of reducing erroneous recognition.

To solve the above problem, the voice recognition device of the present invention includes: sound collecting means for collecting a user's uttered speech; a voice recognition dictionary in which a plurality of words to be selected are registered; voice recognition means for performing feature extraction on the uttered speech collected by the sound collecting means and selecting, from the plurality of words registered in the voice recognition dictionary, words whose similarity is higher than a predetermined threshold; and recognition result selection means for selecting, when there are a plurality of words determined by the voice recognition means to have high similarity, the word with the highest priority among them.

The voice recognition processing method of the present invention includes: a voice recognition step in which the voice recognition means performs feature extraction on the uttered speech collected by the sound collecting means and selects, from the plurality of words registered in the voice recognition dictionary, words whose similarity is higher than a predetermined threshold; and a recognition result selection step in which, when there are a plurality of words determined by the voice recognition means to have high similarity, the recognition result selection means selects the word with the highest priority among them.

Further, the voice recognition processing program of the present invention causes a computer to function as: voice recognition means for performing feature extraction on the uttered speech collected by the sound collecting means and selecting, from the plurality of words registered in the voice recognition dictionary, words whose similarity is higher than a predetermined threshold; and recognition result selection means for selecting, when there are a plurality of words determined by the voice recognition means to have high similarity, the word with the highest priority among them.

When a plurality of words are obtained as the voice recognition result, one of them is selected according to priorities set on the basis of the order in which the user is likely to select the words. This reduces erroneous recognition in voice recognition.

It is desirable that the recognition result selection means select the word with the highest priority when there are a plurality of words determined by the voice recognition means to have high similarity and the difference in similarity among these words is within a predetermined value. Erroneous recognition is more likely when the difference in similarity among the words is small, so selecting the word by priority in such cases reduces erroneous recognition in voice recognition.

It is desirable that the priorities be set based on the tastes and preferences of the user who speaks. Alternatively, it is desirable that the priorities be set based on the past history of voice recognition results. This makes it possible to set word priorities that reflect how likely the user is to utter each word.

It is also desirable that the priorities be set based on the user's situation at the time of voice recognition processing. This makes it possible to set word priorities that reflect how likely the user is to utter each word, given what the user is doing and the surrounding environment at that time.

It is also desirable that the priorities be those set at the time the uttered speech subject to voice recognition is collected by the sound collecting means. Using the latest priorities further reduces erroneous recognition in voice recognition.

FIG. 1 is a diagram showing the configuration of an in-vehicle device according to one embodiment.

FIG. 2 is a diagram showing the relationship between the threshold and the score in voice recognition processing.

FIG. 3 is a flowchart showing the operation procedure for performing voice recognition processing in the in-vehicle device while a menu screen corresponding to the various commands subject to voice recognition is displayed.

FIG. 4 is a diagram showing the relationship between the channel designation commands (words) used to designate a desired television broadcast (channel) with the TV tuner processing unit and the priority of each channel.

FIG. 5 is a diagram showing an example of a search result screen listing nearby convenience stores extracted by the navigation processing unit.

FIG. 6 is a diagram showing the relationship between the designation commands (words) used to designate the desired convenience store on the search result screen shown in FIG. 5 and the priority of each word.

Hereinafter, an in-vehicle device according to one embodiment to which the voice recognition device of the present invention is applied will be described with reference to the drawings.

FIG. 1 is a diagram showing the configuration of an in-vehicle device according to one embodiment. As shown in FIG. 1, the in-vehicle device 1 includes a navigation processing unit 10, a TV tuner processing unit 14, a radio tuner processing unit 16, an AV processing unit 18, an operation unit 20, an input control unit 22, a display processing unit 24, a display device 26, a microphone 30, an analog-to-digital converter (A/D) 32, a digital-to-analog converter (D/A) 40, a speaker 42, a control unit 50, and a hard disk device (HDD) 70.

The navigation processing unit 10 uses the map data stored in the hard disk device 70 to perform navigation operations that guide the travel of the vehicle in which the in-vehicle device 1 is mounted. It is used together with a GPS (Global Positioning System) device 12 that detects the vehicle position. The navigation operations include map display, route search and guidance, and searching for and displaying nearby facilities. The vehicle position may also be detected by combining the GPS device 12 with autonomous navigation sensors such as a gyro sensor and a vehicle speed sensor.

The TV tuner processing unit 14 receives broadcast signals such as terrestrial digital broadcasts and reproduces their video and audio. The radio tuner processing unit 16 receives radio broadcast signals and reproduces their audio. The AV processing unit 18 reads and plays back compressed music data and video data stored in the hard disk device 70. The music data and video data may be read from a CD or DVD using a disc reading device (not shown) or received via a network.

The operation unit 20 accepts various operations by the user and is provided with various switches, operation knobs, and the like. The input control unit 22 monitors the operation state of the operation unit 20 and detects what the user has input.

The display processing unit 24 outputs video signals for various operation screens, input screens, and the like, and displays these screens on the display device 26. It also outputs video signals for the video corresponding to the broadcast signal received by the TV tuner processing unit 14, the video played back by the AV processing unit 18, and the like, and displays these on the display device 26. The display device 26 is installed at the front center between the driver's seat and the passenger seat and is configured using, for example, a liquid crystal display (LCD).

The microphone 30 collects the uttered speech of the user (for example, the driver of the vehicle). The analog-to-digital converter 32 converts the audio signal collected by the microphone 30 into digital utterance data.

The digital-to-analog converter 40 converts the guidance voice and audio (digital data) generated by the navigation processing unit 10, the TV tuner processing unit 14, the radio tuner processing unit 16, and the AV processing unit 18 into an analog audio signal, which is output from the speaker 42. In practice, an amplifier that amplifies the signal is connected between the digital-to-analog converter 40 and the speaker 42, but it is omitted from FIG. 1. In addition, one pair consisting of a digital-to-analog converter 40 and a speaker 42 is provided for each playback channel, but only one pair is shown in FIG. 1.

The control unit 50 controls the in-vehicle device 1 as a whole and is realized by a CPU executing a predetermined program stored in ROM, RAM, or the like. The control unit 50 has a menu creation unit 51, a voice recognition processing unit 52, a recognition result selection unit 53, a viewing information acquisition unit 54, a viewing history creation unit 55, a viewing channel priority setting unit 56, a facility usage information acquisition unit 57, a facility usage history creation unit 58, and a facility priority setting unit 59.

The menu creation unit 51 creates menu screens containing the various commands subject to voice recognition (that is, the plurality of words subject to voice recognition). It also creates the various menu screens needed for the operation of the navigation processing unit 10 and the other units. The menu screens are hierarchical: when one word on the first menu screen is selected, the next menu screen corresponding to that word is displayed.

The voice recognition processing unit 52 performs feature extraction on the user's uttered speech collected by the microphone 30 and compares the result with each of the plurality of recognition words registered as recognition candidates in the voice recognition dictionary stored in the hard disk device 70, thereby performing voice recognition processing that selects words whose similarity is higher than a predetermined threshold. In this embodiment, the user's speech is collected continuously, and voice recognition processing is performed on the speech from the beginning of the collected speech up to the point at which the utterance ends.

For example, consider the case where the user utters "icchanneru" (channel 1) and this is recognized by voice recognition processing. When the user has uttered only "ic", the difference between that sound and the recognition word "channel 1" is large, so voice recognition fails. Likewise, at the points where "iccha", "icchan", and "icchanne" have been uttered, the difference from the recognition word "channel 1" is still large, so voice recognition fails. Once "icchanneru" has been uttered in full, the difference from the recognition word "channel 1" becomes small, and voice recognition succeeds.

To perform this processing, the concepts of "distance" and "score" are introduced. The distance indicates how far apart the speech being compared and a recognition word are in terms of similarity. For example, as shown in FIG. 2, the maximum distance, representing the least similar case, is set to 1000. The score S is the distance between the actual input speech and the recognition word "channel 1". Taking the case where the user utters "icchanneru" as an example, the score at the start of the utterance is 1000, and the score of the input speech gradually decreases as the utterance progresses. If the score of the input speech at the end of the utterance is below a predetermined threshold Th, the recognition result is that the content of the input speech matches the recognition word. In general, the same processing is performed in parallel for each of a plurality of recognition words, and one recognition word is finally extracted as the recognition result. If the score of the input speech at the end of the utterance does not fall below the threshold Th (that is, no recognition word is close to the input speech), voice recognition has failed, and processing such as prompting the user to speak again is performed.
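The relationship between distance, score, and threshold can be illustrated with a minimal Python sketch. It is not the recognizer of the embodiment: it substitutes a character-level edit distance on romanized text for the acoustic distance and scales it so that the largest value is 1000, as described for FIG. 2. The function names, the romanizations of the channel words, and the threshold value are all hypothetical.

```python
MAX_DISTANCE = 1000  # largest distance, as in the description of FIG. 2


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a stand-in for the acoustic distance (assumption)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(heard: str, word: str) -> int:
    """Distance scaled to 0..MAX_DISTANCE; a smaller score means more similar."""
    longest = max(len(heard), len(word))
    if longest == 0:
        return 0
    return round(MAX_DISTANCE * edit_distance(heard, word) / longest)


def words_below_threshold(heard, dictionary, threshold):
    """Return {word: score} for every dictionary word scoring below the threshold."""
    scores = {word: score(heard, word) for word in dictionary}
    return {word: s for word, s in scores.items() if s < threshold}


if __name__ == "__main__":
    target = "icchanneru"  # hypothetical romanization of "channel 1"
    # The score starts at 1000 and falls as more of the utterance is heard.
    for end in range(len(target) + 1):
        print(repr(target[:end]), score(target[:end], target))
    # Hypothetical romanizations of channels 1, 8, and 10.
    dictionary = ["icchanneru", "hacchanneru", "jucchanneru"]
    print(words_below_threshold(target, dictionary, threshold=300))
```

With these stand-in values, all three words fall below a threshold of 300 for the input "icchanneru", which is the multiple-candidate situation that the recognition result selection unit 53 is meant to resolve.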

Consider the case where the user utters speech with the same content as a recognition word stored in the voice recognition dictionary. The final score S should then be small, so the content of the utterance should be recognizable even if the threshold Th is set fairly low. In practice, however, road noise, wind noise, rain, audio playback, and the like are collected together with the user's speech, so the final score S does not become as small as it would in a quiet environment. For this reason, the threshold Th used to judge the success or failure of voice recognition cannot be set too low. Conversely, when voice recognition succeeds with a threshold Th set in this way, a small difference between the threshold Th and the score S indicates that the user is in an environment with loud road noise, wind noise, rain, or audio, and a large difference indicates a quiet environment in which such sounds are low.

Alternatively, the user may operate a talk switch or the like to indicate the start and end of the utterance, and voice recognition processing may be performed on the speech input during that interval.

When the voice recognition processing unit 52 detects a plurality of words determined to have high similarity (words whose scores are below the predetermined threshold), the recognition result selection unit 53 selects the word with the highest priority among them, based on the user's tastes and preferences and past history. The priorities are set on the basis of the order in which the user is likely to select the words. In general, a user's history is considered to reflect that user's own preferences. The priority-based selection may be limited to cases where the difference in similarity (score) among the detected words is within a predetermined value. The priorities used are the latest ones at the time of word selection (the time at which the uttered speech is collected by the microphone 30).

The viewing information acquisition unit 54 acquires information on the television broadcasts viewed using the TV tuner processing unit 14 (viewing information). The viewing information includes the channel number, genre, reception time, and the like of each received broadcast. Ideally, viewing information is acquired for each user. However, because one particular user (for example, the driver of the vehicle) often views far more frequently than the others, the information may instead be treated as the viewing information of that particular user, whose identity need not be known.

The viewing history creation unit 55 creates a viewing history for each user from the viewing information acquired by the viewing information acquisition unit 54. The created viewing history is stored in the hard disk device 70.

The viewing channel priority setting unit 56 sets the priorities that the recognition result selection unit 53 uses when selecting a word relating to the viewing channel, based on the viewing history created by the viewing history creation unit 55 and the like.

The facility usage information acquisition unit 57 acquires information on the facilities used in the navigation operations of the navigation processing unit 10 (facility usage information). The facility usage information includes information on facilities set as destinations for route search (store name, franchise chain name, genre, and the like) and information on facilities at which the vehicle actually stopped (parked), such as the store name.

The facility usage history creation unit 58 creates the user's facility usage history from the facility usage information acquired by the facility usage information acquisition unit 57. The created facility usage history is stored in the hard disk device 70.

The facility priority setting unit 59 sets the priorities that the recognition result selection unit 53 uses when selecting a word relating to facility input during navigation operations, based on the facility usage history created by the facility usage history creation unit 58 and the like.

The microphone 30 described above corresponds to the sound collecting means, the voice recognition processing unit 52 to the voice recognition means, and the recognition result selection unit 53 to the recognition result selection means.

The in-vehicle device 1 of this embodiment has the configuration described above. Its operation is described next. FIG. 3 is a flowchart showing the operation procedure for performing voice recognition processing in the in-vehicle device 1 while a menu screen corresponding to the various commands subject to voice recognition is displayed.

When the menu screen created by the menu creation unit 51 is displayed on the display device 26 (step 100), the voice recognition processing unit 52 performs voice recognition processing on the user's uttered speech collected by the microphone 30 (step 102) while determining whether voice recognition has succeeded (step 104). If the score for the uttered speech is larger than the predetermined value (that is, the similarity to every word in the voice recognition dictionary is low), the determination is negative, and the process returns to step 102 to continue voice recognition processing.

If the score falls below the predetermined value (that is, the similarity to a word in the voice recognition dictionary is high), the determination in step 104 is affirmative. Next, the recognition result selection unit 53 determines whether a plurality of words have been detected by the voice recognition processing (step 106). Ideally, only one word is obtained as the recognition result for the uttered speech. However, when several of the commands subject to voice recognition on the menu screen are pronounced similarly, a plurality of words with scores below the threshold may be detected at the same time. In that case, the determination is affirmative.

Next, the recognition result selection unit 53 determines whether the score difference among the detected words is within a predetermined value (step 108). A score difference within the predetermined value means that the scores of the words are close to one another, so erroneous recognition is likely whichever word is selected. In that case, the determination is affirmative. The recognition result selection unit 53 then selects the word with the highest priority among the plurality of words as the final recognition result (step 110). After that, the processing corresponding to this word is performed (step 112).

On the other hand, when only one word has a score below the predetermined value (negative determination in step 106), or when the score difference among the plurality of words is larger than the predetermined value (negative determination in step 108), the recognition result selection unit 53 selects the word with the highest score (step 114). After that, the processing corresponding to this word is performed (step 112).
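The selection logic of steps 104 to 114 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment. The function name `select_recognition_result`, the rank convention (1 is the highest priority), and the numeric values are hypothetical. The sketch also assumes that the "score difference" of step 108 is the spread between the smallest and largest candidate scores, and that the "highest score" of step 114 means the most similar word, that is, the smallest distance.

```python
from typing import Dict, Optional


def select_recognition_result(
    scores: Dict[str, int],
    priority_rank: Dict[str, int],
    threshold: int,
    max_score_gap: int,
) -> Optional[str]:
    """Sketch of steps 104 to 114; rank 1 is the highest priority (assumption)."""
    # Step 104: keep only words whose score (distance) is below the threshold.
    candidates = {w: s for w, s in scores.items() if s < threshold}
    if not candidates:
        return None  # recognition has not succeeded; the caller keeps listening

    # Most similar candidate, i.e. the smallest distance (assumed reading of step 114).
    best_by_score = min(candidates, key=candidates.get)

    # Step 106: with a single candidate there is nothing to arbitrate.
    if len(candidates) == 1:
        return best_by_score

    # Step 108: compare the spread of the candidate scores with the allowed gap.
    gap = max(candidates.values()) - min(candidates.values())
    if gap <= max_score_gap:
        # Step 110: the scores are too close to trust, so fall back on priority.
        return min(candidates, key=lambda w: priority_rank.get(w, float("inf")))

    # Step 114: the scores are clearly separated, so take the most similar word.
    return best_by_score


if __name__ == "__main__":
    scores = {"channel 1": 120, "channel 8": 100, "channel 10": 130, "channel 5": 700}
    ranks = {"channel 1": 1, "channel 8": 2, "channel 10": 3}
    print(select_recognition_result(scores, ranks, threshold=200, max_score_gap=50))
```

In the usage example, "channel 8" has the smallest distance, but the three candidates lie within 30 points of one another, so the priority ranking decides and "channel 1" is returned. Words missing from `priority_rank` are treated as lowest priority, which is a design choice of this sketch.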

Next, specific examples of the priorities used to select one word when a plurality of words are detected will be described.

FIG. 4 is a diagram showing the relationship between the channel designation commands (words) used to designate a desired television broadcast (channel) with the TV tuner processing unit 14 and the priority of each channel.

To select the channel designation command "channel 1", the user simply says "icchanneru". However, when the user says "icchanneru", the scores of three words with relatively similar acoustic features, "channel 1", "channel 8", and "channel 10", may all fall below the threshold, so that these three words are obtained as recognition results. In such a case, "channel 1", which has the highest priority among them, is adopted as the final recognition result in accordance with the priorities set at that time.

The priorities shown in FIG. 4 are set by the viewing channel priority setting unit 56 based on the viewing history created by the viewing history creation unit 55 and stored in the hard disk device 70. As examples, the following four cases can be considered.

(Case A1) The priority of channels with a high viewing frequency (long cumulative viewing time) is set high. For example, the viewing frequency over the most recent month is extracted from the viewing history and used to set the priorities.

(Case A2) The priority of channels broadcasting programs of a genre with a high viewing frequency (long cumulative viewing time) is set high. Information identifying the genre of the program being broadcast at that time can be obtained from the received broadcast signal. The viewing channel priority setting unit 56 sets the priority of each channel by comparing the genre of each channel corresponding to the three words "channel 1", "channel 8", and "channel 10" with the reception time for each genre contained in the viewing history.

(Case A3) The priority of the television channel being viewed at that time is set low. When the user is assumed to be switching channels, the channel currently being received is unlikely to be designated, so it is desirable to lower its priority.

(Case A4) When external information such as disaster information that is not directly related to the content of the television broadcast being received is received, the priority of channels broadcasting programs of a genre from which this external information is easily obtained is set high. For example, when disaster information is received, the priority of channels broadcasting a news program at that time is set high.

These four cases may be combined as appropriate. For example, Case A1 may be combined with Case A3, or Case A2 with Case A3.
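As an illustration of combining Case A1 with Case A3, the following Python sketch orders candidate channels by cumulative viewing time and moves the channel currently being viewed to the end. It is a minimal sketch under assumptions: the function name `rank_channels`, the dictionary of viewing minutes, and the rule that the current channel always ranks last are hypothetical choices, since the embodiment does not specify how the cases are weighted against each other.

```python
def rank_channels(viewing_minutes, candidates, current_channel=None):
    """Return candidate channels ordered from highest to lowest priority.

    viewing_minutes: dict mapping channel -> cumulative viewing time,
                     e.g. over the most recent month (Case A1).
    current_channel: channel being viewed now, demoted per Case A3.
    """
    def sort_key(channel):
        is_current = channel == current_channel    # Case A3: demote
        watched = viewing_minutes.get(channel, 0)  # Case A1: favor
        # False sorts before True, and longer viewing time sorts first.
        return (is_current, -watched)

    return sorted(candidates, key=sort_key)
```

For example, with hypothetical viewing times of 300, 500, and 20 minutes for channels 1, 8, and 10, channel 8 ranks first unless it is the channel currently being viewed, in which case it drops to last and channel 1 ranks first.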

FIG. 5 is a diagram showing an example of a search result screen obtained by searching for nearby convenience stores with the navigation processing unit 10. The search result screen lists the convenience stores around the vehicle position in order of increasing distance from the vehicle, together with their store names (franchise chain names).

FIG. 6 is a diagram showing the relationship between the designation commands (words) used to designate the desired convenience store on the search result screen shown in FIG. 5 and the priority of each word.

To select the facility "AA Mart" displayed at the top of the search result screen, the user simply says "hitotsume ni iku" ("go to the first"). However, when the user says this, the scores of four words with relatively similar acoustic features, "hitotsume ni iku" ("go to the first"), "futatsume ni iku" ("go to the second"), "mittsume ni iku" ("go to the third"), and "itsutsume ni iku" ("go to the fifth"), may all fall below the threshold, so that these four words are obtained as recognition results. In such a case, "mittsume ni iku" ("go to the third"), which has the highest priority, is adopted as the final recognition result in accordance with the priorities set at that time.

The priorities shown in FIG. 6 are set by the facility priority setting unit 59 based on the facility usage history created by the facility usage history creation unit 58 and stored in the hard disk device 70. As examples, the following cases can be considered.

(Case B1) The priority of convenience stores belonging to a frequently used franchise chain (one with a large number of visits) is set high. For example, the usage frequency over the most recent month is extracted from the facility usage history and used to set the priorities.

(Case B2) The priority of convenience stores close to the vehicle position is set high. Users usually know where the convenience stores around their home and similar places are, so a user who searches for convenience stores is presumably driving somewhere else. The user is therefore likely to be looking for any nearby convenience store to take care of an errand. These two cases may be combined as appropriate.
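One way Cases B1 and B2 might be combined is sketched below in Python. This is a hypothetical illustration only: the function name `rank_stores`, the tuple layout, the chain names other than "AA Mart", and the rule "chain visit count first, distance as tie-breaker" are assumptions, because the embodiment does not state how the two cases are weighted.

```python
def rank_stores(stores, chain_visits):
    """Order designation commands from highest to lowest priority.

    stores: list of (command_word, chain_name, distance_m) tuples taken
            from the search result screen.
    chain_visits: dict mapping chain_name -> number of visits, e.g. over
                  the most recent month (Case B1).
    """
    ordered = sorted(
        stores,
        # More visits to the chain first (Case B1), then nearer first (Case B2).
        key=lambda s: (-chain_visits.get(s[1], 0), s[2]),
    )
    return [command for command, _chain, _distance in ordered]
```

With this rule, a store of a frequently visited chain listed third on the screen can outrank the nearest store, which matches the situation in which "go to the third" has the highest priority.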

As described above, in the in-vehicle device 1 of the present embodiment, when a plurality of words are obtained as voice recognition results, one of them is selected according to priorities set on the basis of the order in which the user is most likely to select them, which makes it possible to reduce erroneous recognition in voice recognition.

In addition, erroneous recognition is more likely when the difference in similarity among the plurality of words obtained by the voice recognition processing is small; selecting a word on the basis of priority in such cases makes it possible to reduce erroneous recognition in voice recognition.

In addition, setting the priorities based on the hobbies and preferences of the user who speaks, or on the past history, makes it possible to set word priorities that reflect how likely the user is to utter each word.

In addition, setting the priorities based on the user's situation when the voice recognition processing is performed makes it possible to set word priorities that reflect how likely the user is to utter each word, given the user's current activity, the surrounding environment, and the like.

It is also desirable to set the priorities at the time when the utterance to be recognized is collected. Using the latest priorities makes it possible to further reduce erroneous recognition in voice recognition.

The present invention is not limited to the above embodiment, and various modifications are possible within the scope of the gist of the present invention. For example, the above embodiment gave specific examples for designating a channel while receiving a television broadcast (FIG. 4) and for designating one of the convenience stores found by a search in the navigation operation (FIGS. 5 and 6), but these are merely examples, and the present invention can be applied to other cases.

For example, suppose that the user designates by voice either an operation of combining images captured by cameras mounted on the front, rear, left, and right of the vehicle to generate and display a top-view image as seen from above, or an operation of generating and displaying a back-view image captured by the rear camera at the back of the vehicle, and that "top view" is uttered to select the former and "back view" to select the latter. When the user says "○○ view" and the scores of both words, "top view" and "back view", fall below the threshold, a priority setting unit (not shown) may set their priorities based on the steering angle. For example, when the steering angle is large, the driver may be turning the steering wheel while entering an empty space in a parking lot or the garage at home, so the priority of "back view" is set higher than that of "top view". When the steering angle is small, the driver may be reversing after such an entry maneuver while checking the clearance to walls and other vehicles on the left, right, and rear, so the priority of "top view" is set higher than that of "back view".
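The steering-angle rule for the two view commands can be sketched in Python as follows. The function name `view_priority` and the 90-degree boundary between a "large" and a "small" steering angle are hypothetical; the description gives no concrete value, so a real system would need a tuned threshold.

```python
def view_priority(steering_angle_deg, large_angle_deg=90.0):
    """Return the two view commands ordered from highest to lowest priority.

    steering_angle_deg: signed steering angle (left or right of center).
    large_angle_deg: assumed boundary for a "large" angle (hypothetical).
    """
    if abs(steering_angle_deg) >= large_angle_deg:
        # Large angle: the driver is probably turning into a parking
        # space, so the rear camera image is favored.
        return ["back view", "top view"]
    # Small angle: the driver is probably reversing straight while
    # checking clearances, so the overhead image is favored.
    return ["top view", "back view"]
```

The absolute value is used so that left and right turns are treated alike.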

In the above embodiment, a plurality of words whose scores are below the threshold are extracted by the voice recognition processing, and selection according to priority is performed when the score difference among them is within the predetermined value. A plurality of words with scores below the threshold are thought to be extracted mainly when there is a lot of ambient noise, such as road noise, wind noise, rain, and audio playback, and each score is close to the threshold. Therefore, the recognition result selection unit 53 may determine the ambient noise level and perform the word selection operation when the noise level exceeds a reference value.
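The noise-level check might look like the following Python sketch. It assumes that a noise-only segment of the microphone signal is available as samples normalized to the range -1 to 1 and that the level is measured as RMS in dB relative to full scale. The function names and the -30 dB reference value are hypothetical, and the description does not say how the noise level is measured.

```python
import math


def noise_level_db(samples):
    """RMS level of a noise-only segment in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def use_priority_selection(samples, reference_db=-30.0):
    """True when ambient noise exceeds the reference value, in which case
    the priority-based word selection would be performed."""
    return noise_level_db(samples) > reference_db
```

An empty or silent segment yields negative infinity, so the priority-based selection is skipped in that case.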

As described above, according to the present invention, when a plurality of words are obtained as voice recognition results, one of them is selected according to priorities set on the basis of the order in which the user is most likely to select them, which makes it possible to reduce erroneous recognition in voice recognition.

1 In-vehicle device
10 Navigation processing unit
14 TV tuner processing unit
16 Radio tuner processing unit
18 AV processing unit
30 Microphone
32 Analog-to-digital converter (A/D)
51 Menu creation unit
52 Voice recognition processing unit
53 Recognition result selection unit
54 Viewing information acquisition unit
55 Viewing history creation unit
56 Viewing channel priority setting unit
57 Facility usage information acquisition unit
58 Facility usage history creation unit
59 Facility priority setting unit
70 Hard disk device (HDD)

Claims

1. A voice recognition device comprising:
a sound collecting means for collecting a user's spoken voice;
a voice recognition dictionary in which a plurality of words to be selected are registered;
a voice recognition means for extracting features from the spoken voice collected by the sound collecting means and selecting, from the plurality of words registered in the voice recognition dictionary, a word whose degree of similarity is higher than a predetermined threshold; and
a recognition result selection means for selecting, when there are a plurality of words determined by the voice recognition means to have a high degree of similarity, the word having the highest priority among them.

2. The voice recognition device according to claim 1, wherein the recognition result selection means selects the word having the highest priority when there are a plurality of words determined by the voice recognition means to have a high degree of similarity and the difference in the degree of similarity among the plurality of words is within a predetermined value.

3. The voice recognition device according to claim 1 or 2, wherein the priority is set based on the hobbies and preferences of the user who speaks.

4. The voice recognition device according to any one of claims 1 to 3, wherein the priority is set based on a past history of voice recognition results.

5. The voice recognition device according to any one of claims 1 to 4, wherein the priority is set based on the situation of the user when the voice recognition processing is performed.

6. The voice recognition device according to any one of claims 1 to 5, wherein the priority is set at the time when the spoken voice to be recognized is collected by the sound collecting means.

7. A voice recognition processing method comprising:
a voice recognition step in which a voice recognition means extracts features from a spoken voice collected by a sound collecting means and selects, from a plurality of words registered in a voice recognition dictionary, a word whose degree of similarity is higher than a predetermined threshold; and
a recognition result selection step in which a recognition result selection means selects, when there are a plurality of words determined by the voice recognition means to have a high degree of similarity, the word having the highest priority among them.

8. The voice recognition processing method according to claim 7, wherein the recognition result selection means selects the word having the highest priority when there are a plurality of words determined by the voice recognition means to have a high degree of similarity and the difference in the degree of similarity among the plurality of words is within a predetermined value.

9. A voice recognition processing program that causes a computer to function as:
a voice recognition means for extracting features from a spoken voice collected by a sound collecting means and selecting, from a plurality of words registered in a voice recognition dictionary, a word whose degree of similarity is higher than a predetermined threshold; and
a recognition result selection means for selecting, when there are a plurality of words determined by the voice recognition means to have a high degree of similarity, the word having the highest priority among them.

10. The voice recognition processing program according to claim 9, wherein the recognition result selection means selects the word having the highest priority when there are a plurality of words determined by the voice recognition means to have a high degree of similarity and the difference in the degree of similarity among the plurality of words is within a predetermined value.