JP5037041B2

JP5037041B2 - On-vehicle voice recognition device and voice command registration method

Info

Publication number: JP5037041B2
Application number: JP2006173813A
Authority: JP
Inventors: 教明大谷
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2006-06-23
Filing date: 2006-06-23
Publication date: 2012-09-26
Anticipated expiration: 2026-06-23
Also published as: JP2008003371A

Description

The present invention relates to an in-vehicle voice recognition device and a voice command registration method. In particular, it relates to a technique for accurately recognizing words, phrases, and other utterances (hereinafter also called "voice tags") that a user in the vehicle cabin speaks as commands.

Modern vehicles carry a variety of equipment that provides services to the occupants. Two typical examples are:

- navigation devices with a route guidance function, which guides the driver along the roads to a set destination without wrong turns;
- audio/video (A/V) equipment, which provides entertainment as audio and video output from various sources (radio receiver, CD player, TV receiver, DVD player, and so on).

Each of these in-vehicle devices changes its operating state in response to instructions the user enters on a remote controller, an operation panel, or the like. Occupants hear the result through speakers installed in the cabin (or through wireless headphones in the rear seats). They see it on the screen of a display device such as an in-vehicle monitor.

As described above, the user can enter operation instructions for each in-vehicle device manually, for example with a remote controller. Recently, devices have also appeared with a voice recognition function, which lets the user control the device simply by speaking the instruction. This function is convenient for the user. For the driver in particular, it is very useful for safe driving.

Voice recognition requires a dictionary for recognizing voice commands. This recognition dictionary registers the utterances (words, phrases, etc.) that are subject to recognition. These are voice tags associated with operation instructions for the in-vehicle devices to be controlled by voice, or voice commands that contain such tags. Voice tags are used, for example, to call up entries in the address book of a navigation device. An example is shown in FIG. 9.

Registering a voice tag (see FIG. 9(a)) proceeds as follows:

1. With the navigation function active, the user displays the "Address Book" screen 61 by remote-control operation or by touching the screen.
2. The user touches the "Add Voice Tag" area 62 on screen 61. Guidance information (sub-screen) 63 for recording a voice tag appears.
3. The user touches the "Start" area 64 on screen 63 and speaks the desired voice tag (the phrase the entry will be called up by) within, for example, 2 seconds. Guidance information (sub-screen) 65 then shows that the voice tag is being recorded.
4. When the "Finished" area 66 on screen 65 lights up, the voice tag is registered in the recognition dictionary.

Assume, for example, that the registered voice tag is [McDonald].

Using the registered voice tag ([McDonald]) for voice recognition (see FIG. 9(b)) works as follows. The user first displays a map screen 67 around the current vehicle position, for example by touching the screen. If the user then says "Go to [McDonald]", voice recognition is performed on this voice command (PCM data). Screen 67 then shows information such as the guidance route to the destination. In the figure:

- CM is the vehicle position mark.
- GR is the guidance route.
- DS1 shows the distance, time, and direction to the destination.
- DS2 shows the distance to the next guidance point on the route.

When registering a voice tag as in FIG. 9(a), the recorded voice tag may turn out to be unusable for speech recognition. In that case, guidance information (warning screen) 68 prompting the user to register again is displayed, as shown in FIG. 9(c). To speak the voice tag again, the user touches the "Start" area on screen 68. To abandon re-registration, the user touches the "Cancel" area.

"Data unusable for speech recognition" here essentially means the following: the user spoke in a very noisy environment, so the recording, although nominally registered, does not reach the quality level the speech recognition engine needs. The warning screen 68 of FIG. 9(c) is not limited to this case. It is also displayed when no utterance is detected at all. In the example of FIG. 9(a), this happens when 2 seconds pass after the user touches the "Start" area 64 on screen 63 without anything being spoken.
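The two conditions that trigger warning screen 68 can be written as a simple check on each capture attempt. The sketch below is hypothetical. The 2-second window comes from the example above. The signal-to-noise measure and the 10 dB cut-off are assumptions standing in for whatever usability criterion the actual recognition engine applies.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence

LISTEN_WINDOW_S = 2.0  # listening window from the example in the text
MIN_SNR_DB = 10.0      # hypothetical usability threshold (assumption)


class CaptureResult(Enum):
    OK = "ok"                # usable: proceed with registration
    NO_SPEECH = "no_speech"  # nothing uttered within the listening window
    UNUSABLE = "unusable"    # captured, but too noisy for the recognizer


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 if empty)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB from a speech segment and a noise-only segment."""
    s, n = rms(speech), rms(noise)
    if n == 0.0:
        return math.inf
    if s == 0.0:
        return -math.inf
    return 20.0 * math.log10(s / n)


@dataclass
class Capture:
    speech_onset_s: Optional[float]  # seconds after "Start" was touched; None if no speech
    snr: float                       # measured signal-to-noise ratio in dB


def check_capture(c: Capture) -> CaptureResult:
    """Classify one capture attempt; anything other than OK shows warning screen 68."""
    if c.speech_onset_s is None or c.speech_onset_s > LISTEN_WINDOW_S:
        return CaptureResult.NO_SPEECH
    if c.snr < MIN_SNR_DB:
        return CaptureResult.UNUSABLE
    return CaptureResult.OK
```

Both failure cases lead to the same screen in the text. Keeping them as separate enum values still lets the UI word the prompt differently if desired.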

Patent Document 1, for example, describes a technique related to the above prior art. It is a speech recognition apparatus whose matching unit compares the speech pattern of input speech data against standard speech patterns generated in advance. This matching unit has two stages:

- A preliminary selection unit narrows down candidate words. It matches the input speech data against full-band dictionary data, which a dictionary generation unit obtained by analyzing speech data with a full-band filter and registered in the dictionary.
- A matching processing unit then compares the narrowed-down candidates against band-specific dictionary data, which the dictionary generation unit obtained by analyzing speech data with band-specific filters. It outputs as recognized words those candidates whose similarity exceeds a predetermined threshold.

Patent Document 1: JP-A-6-301399
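The two-stage structure of Patent Document 1 can be sketched as a coarse preselection followed by a finer re-scoring of the surviving candidates. This is an illustrative sketch only. The summary above does not say how similarity is computed, so the following are all assumptions:

- the feature vectors themselves;
- the use of cosine similarity;
- averaging the similarity over bands;
- the parameters `k` and `threshold`.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_match(
    full_input: Vector,
    band_input: Sequence[Vector],
    full_dict: Dict[str, Vector],
    band_dict: Dict[str, Sequence[Vector]],
    k: int = 3,
    threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    # Stage 1 (preliminary selection): keep the k words whose full-band
    # template is most similar to the full-band analysis of the input.
    candidates = sorted(
        full_dict, key=lambda w: cosine(full_input, full_dict[w]), reverse=True
    )[:k]
    # Stage 2 (matching processing): re-score only those candidates against
    # the band-by-band templates and keep those above the threshold.
    results = []
    for word in candidates:
        bands = band_dict[word]
        score = sum(cosine(bi, bw) for bi, bw in zip(band_input, bands)) / len(bands)
        if score > threshold:
            results.append((word, score))
    return sorted(results, key=lambda ws: ws[1], reverse=True)
```

The point of the split is cost: the more detailed band-by-band comparison runs only on the `k` survivors of the cheap full-band pass.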

As described above, the conventional technique can register voice tags spoken by the user in a recognition dictionary, so that they can be used to control a navigation device or the like. However, the conventional method performs voice recognition against every voice tag registered in the recognition dictionary. This causes the drawback described below.

The speech recognition engine calculates a degree of match (also called a "score") between the voice tag (command) spoken by the user and every command registered in the recognition dictionary. It then selects the highest-scoring command as the command the user spoke.

This works as long as a single best match can be identified. As the number of registered commands grows, however, so does the number of commands that sound alike when spoken. The engine then cannot always narrow the result down to one command, and a non-matching command may be misrecognized.

In other words, with the conventional technique, a user may register a voice tag identical or similar to one registered earlier. The recognition dictionary used to call up voice tags then contains several identical or similar voice entries. When voice recognition is later used to control a navigation device or the like, the misrecognition rate rises, that is, the recognition rate for voice commands drops.
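The selection rule described above, and the way near-identical registrations make it fragile, can be illustrated in a few lines. The scores are made-up numbers. The `margin` test for an ambiguous result is an assumption added for illustration: the text only says that the engine picks the highest-scoring command.

```python
from typing import Dict, Tuple


def pick_command(scores: Dict[str, float], margin: float = 0.05) -> Tuple[str, bool]:
    """Return the highest-scoring command and whether the runner-up is within `margin`.

    `scores` maps each registered command to its degree of match with the
    utterance. A near-tie signals that two registered entries sound alike,
    which is where misrecognition becomes likely.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    ambiguous = len(ranked) > 1 and best_score - ranked[1][1] < margin
    return best, ambiguous
```

Consider a user who has registered both "McDonald" and a similar-sounding hypothetical tag "MacDonald". Then `pick_command({"McDonald": 0.82, "MacDonald": 0.80, "Home": 0.31})` returns `("McDonald", True)`: the engine still answers, but a small scoring fluctuation would flip the result.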

The present invention was made in view of this problem. Its object is to provide an in-vehicle voice recognition device and a voice command registration method that improve the recognition rate for registered voice commands. This applies when voice commands are registered in a recognition dictionary and used to control in-vehicle equipment such as a navigation device.

To solve the above problem, one aspect of the present invention provides an in-vehicle voice recognition device comprising:

- voice input means for inputting a command spoken by a user in the vehicle cabin;
- storage means storing (a) a plurality of recognition dictionaries and (b) a discrimination dictionary. The recognition dictionaries are used for comparison against commands entered through the voice input means, and each registers the commands selectable according to the operating state of a controlled device. The discrimination dictionary is used to determine whether an entered command should be registered in a recognition dictionary, and is adapted to register the same commands as those registered in the plurality of recognition dictionaries;
- dictionary switching selection means for switching the valid recognition dictionary according to the operating state of the controlled device;
- voice recognition means for performing voice recognition by comparing a command entered through the voice input means with the commands registered in any of the dictionaries stored in the storage means.

When a command is entered through the voice input means, the voice recognition means performs voice recognition on it using only the discrimination dictionary. If the recognition score calculated from that recognition is lower than a predetermined threshold, the voice recognition means registers the command in two places: the recognition dictionary selected according to the operating state of the controlled device, and the discrimination dictionary.
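The registration rule in this aspect reduces to a small decision function. To keep the example self-contained, the sketch makes two assumptions:

- utterances are represented as text strings;
- `difflib.SequenceMatcher.ratio()` stands in for the acoustic recognition score.

The threshold value is also hypothetical. The structure follows the text: score against the discrimination dictionary only, and if the best score is below the threshold, register in both the selected recognition dictionary and the discrimination dictionary.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Set


def score(utterance: str, command: str) -> float:
    """Stand-in recognition score in [0, 1] (assumption: string similarity)."""
    return SequenceMatcher(None, utterance.lower(), command.lower()).ratio()


@dataclass
class Store:
    recognition: Dict[int, Set[str]]  # dictionary ID -> registered commands
    discrimination: Set[str]          # discrimination dictionary: mirrors the recognition dictionaries


def try_register(store: Store, utterance: str, selected_id: int, threshold: float = 0.8) -> bool:
    """Register `utterance` only if it is dissimilar to everything in the discrimination dictionary."""
    # Recognize using the discrimination dictionary only.
    best = max((score(utterance, c) for c in store.discrimination), default=0.0)
    if best >= threshold:
        return False  # identical or similar command already registered: reject
    store.recognition[selected_id].add(utterance)
    store.discrimination.add(utterance)
    return True
```

Because every accepted command is also added to the discrimination dictionary, that dictionary stays a superset of everything registered. A single lookup is therefore enough to detect a clash with any recognition dictionary, not just the one currently selected.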

With the in-vehicle voice recognition device according to the present invention, registering a command (including a voice tag) spoken by the user works as follows. Voice recognition is first performed on the command using only the discrimination dictionary. If the resulting recognition score is lower than the threshold, the command is judged to be dissimilar to every command registered so far. It is then registered in both the recognition dictionary and the discrimination dictionary.

In other words, the device determines whether the command to be registered (including a voice tag) is identical or similar to one already in the discrimination dictionary, or dissimilar. It registers the command only if it is dissimilar. A command identical or similar to an already registered command is never registered.

This eliminates the drawback of the prior art, namely the confusion during recognition caused by registering voice data identical or similar to previously registered commands (including voice tags). As a result, the recognition rate for registered voice commands improves.

Another aspect of the present invention provides a voice command registration method for an in-vehicle voice recognition device. The device comprises:

- voice input means for inputting a command spoken by a user in the vehicle cabin;
- storage means storing (a) a plurality of recognition dictionaries and (b) a discrimination dictionary. The recognition dictionaries are used for comparison against commands entered through the voice input means, and each registers the commands selectable according to the operating state of a controlled device. The discrimination dictionary is used to determine whether an entered command should be registered in a recognition dictionary, and is adapted to register the same commands as those registered in the plurality of recognition dictionaries;
- dictionary switching selection means for switching the valid recognition dictionary according to the operating state of the controlled device;
- voice recognition means for performing voice recognition by comparing a command entered through the voice input means with the commands registered in any of the dictionaries stored in the storage means.

In this method, when a command is entered through the voice input means, voice recognition is executed on it with only the discrimination dictionary enabled. If the recognition score calculated from that recognition is lower than a predetermined threshold, the command is registered in the recognition dictionary selected according to the operating state of the controlled device and in the discrimination dictionary.

Further structural features of the in-vehicle voice recognition device according to the present invention, and specific processing based on them, are described in detail below with reference to the embodiments of the invention.

Embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 shows the configuration of an in-vehicle audio/video (A/V) navigation system incorporating an in-vehicle voice recognition device according to an embodiment of the present invention.

As shown in the figure, the in-vehicle A/V navigation system 40 comprises:

- the in-vehicle voice recognition device 10 according to this embodiment;
- the devices controlled according to the content of the utterance (a command, possibly including a voice tag) as determined by voice recognition. In the illustrated example these are a radio receiver 1, a DVD/CD player 2, a TV receiver 4, and a navigation unit 5;
- a front-seat operation unit (head unit, H/U) 20, with which front-seat users make settings on each controlled device;
- a rear-seat operation unit 30, with which rear-seat users make settings on each controlled device except the navigation unit 5;
- a front-seat display unit 25;
- an amplifier unit 26;
- a speaker 27;
- a rear-seat display unit 31;
- wireless headphones 32.

The in-vehicle voice recognition device 10, the controlled devices 1 to 5, the front-seat operation unit 20, the display units 25 and 31, and the amplifier unit 26 are interconnected via a bus 6 that serves as the transmission path, such as an optical fiber.

Only one speaker 27 is shown, but in practice the required number is installed at predetermined positions in the cabin. With a single row of rear seats, for example, at least four speakers 27 are installed: two near the left and right of the rear seats and two near the left and right of the front seats. Likewise, only one rear-seat operation unit 30, display unit 31, and set of wireless headphones 32 is shown. In practice, as many are provided as there are rear-seat passengers.

The in-vehicle voice recognition device 10 according to this embodiment comprises a hard disk drive (HDD) 7 as a recording medium, a microphone 8, and a voice recognition unit 9. The microphone 8 is installed near the sun visor or rear-view mirror in front of the driver's seat. It picks up commands (including voice tags) spoken by the user and converts them into an analog voice signal corresponding to the sound pressure level.

A disk (not shown) driven by the HDD 7 stores, in separately allocated storage areas, the map data used by the navigation function and the data used by the voice recognition function. The map data is divided into longitude and latitude spans of a size appropriate to each scale level (1/12,500, 1/25,000, 1/50,000, etc.). It includes:

- road unit data needed for processing such as route search and map matching;
- intersection unit data describing intersections in detail;
- data on facilities such as convenience stores, gas stations, and supermarkets/discount stores (location, address, telephone number, genre, and other information).

The voice recognition data stored in part of the HDD 7's storage area is described later, together with the internal configuration of the voice recognition unit 9.
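The scale-dependent division of the map data can be pictured as a grid whose cell size depends on the scale level. The sketch below is hypothetical: the cell sizes (in arc-seconds) are invented for illustration. Coordinates are converted to integer arc-seconds so that tile boundaries are computed exactly, without floating-point rounding at cell edges.

```python
import math
from typing import Dict, Tuple

# Hypothetical cell sizes per scale denominator, in arc-seconds (assumption).
TILE_ARCSEC: Dict[int, int] = {12_500: 30, 25_000: 60, 50_000: 120}


def to_arcsec(deg: float) -> int:
    """Degrees -> whole arc-seconds, rounding toward negative infinity."""
    return math.floor(deg * 3600)


def tile_index(lat_deg: float, lon_deg: float, scale: int) -> Tuple[int, int]:
    """Return the (row, column) of the map unit containing the point at `scale`."""
    size = TILE_ARCSEC[scale]
    return to_arcsec(lat_deg) // size, to_arcsec(lon_deg) // size
```

For example, `tile_index(35.0, 139.0, 25_000)` returns `(2100, 8340)`, and the same point falls in cell `(1050, 4170)` at 1/50,000. Coarser scales use larger cells, so fewer units have to be read to cover the displayed area.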

The front-seat operation unit (H/U) 20 is installed as an "operation panel" on the center console between the driver's seat and the front passenger seat, so that both occupants can use it. Its display unit 25, for example an LCD monitor, sits above the operation panel. The screen of display unit 25 shows:

- video information output from the navigation unit 5, such as a map around the vehicle position and the guidance route to the destination, based on voice recognition;
- video output from sources such as the DVD/CD player 2 and the TV receiver 4.

The rear-seat operation unit 30 takes the form of a "remote controller" for ease of use from the rear seats. It communicates with the corresponding rear-seat display unit 31 by infrared. The rear-seat display unit 31 is mounted, for example, on the back of a front-seat headrest. Like the front display unit 25, it has an LCD monitor or similar screen for displaying video information. It communicates with the corresponding wireless headphones 32 by infrared and RF. Headphones with a jack may be used instead of the wireless headphones 32; in that case they are wired to the corresponding display unit 31 through the jack.

The controlled devices 1 to 5 all share the same basic operation:

1. Receive the data relating to an operation instruction (the "device control signal" described later) that the front-seat operation unit 20, the rear-seat operation unit 30, or the voice recognition device 10 has sent onto the bus 6.
2. Set or change their own operating state based on that data.
3. Send data indicating the result (the current operating state) onto the bus 6 as an audio/video signal.

For example, the radio receiver 1 responds to an operation instruction from operation unit 20 or 30 or from the voice recognition device 10 by receiving and demodulating an FM or AM broadcast signal to produce an audio signal. It converts this signal into digital audio data and sends it onto the bus 6. In response to a similar instruction, the DVD/CD player 2 reads the signal recorded on the recording surface of the DVD selected by the user and sends the reproduced video data onto the bus 6. The navigation unit 5 searches for a guidance route to the destination set by the user and sends the route data onto the bus 6.

The front-seat operation unit 20 comprises a control unit 21, an operation unit 22, a display unit 23, and a memory unit 24.

The operation unit 22 has operation keys for making settings on the controlled devices 1 to 5. Examples are a power key for switching power on and off and adjusting volume, selection keys for choosing a device, and shift and preset keys for invoking particular operations and functions.

The display unit 23 is an LCD or similar display on the operation panel (H/U). It shows information based on data output from the control unit 21. For the radio receiver 1 this includes the band (FM/AM) and the receiving frequency of the station. For the DVD/CD player 2 it includes the disc number and playback position (track number, elapsed time, etc.) during CD playback.

The memory unit 24 is a nonvolatile memory such as flash memory. It stores necessary information (data) under the control of the control unit 21. For example, output from the selected device may be stopped by an operation instruction from operation unit 20 or 30 or from the voice recognition device 10. The memory unit 24 then stores data indicating that device's operating state at the moment output stopped. This data is kept for reference when output next starts, and includes:

- a "source type", indicating which device (source) was in use;
- "volume/sound quality", indicating the volume and tone settings in effect while an audio source was playing;
- "device-specific details", indicating the detailed operating state of each device.
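The last-state record kept in the memory unit 24 maps naturally onto a small data structure. The field names and the JSON serialization below are assumptions for illustration; JSON stands in for writing to flash memory. The three kinds of content (source type, volume/sound quality, device-specific details) are those listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class LastState:
    source: str                   # "source type": which device was in use
    volume: int                   # "volume/sound quality": volume setting...
    tone: Dict[str, int]          # ...and tone adjustments (hypothetical keys)
    details: Dict[str, Any] = field(default_factory=dict)  # device-specific details


class StateMemory:
    """Stand-in for the nonvolatile memory unit: holds one serialized record."""

    def __init__(self) -> None:
        self._blob: Optional[str] = None

    def save(self, state: LastState) -> None:
        # Called when audio/video output from the selected device stops.
        self._blob = json.dumps(asdict(state))

    def load(self) -> Optional[LastState]:
        # Called when output next starts; None if nothing has been saved yet.
        return None if self._blob is None else LastState(**json.loads(self._blob))
```

Keeping the per-device part as an open dictionary lets each source (radio, CD, DVD) store whatever it needs to resume, such as disc and track for a CD, without changing the shared record.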

The control unit 21 consists of a microcomputer or the like and controls the system 40 as a whole. Acting on operation instructions from operation unit 20 or 30 or from the voice recognition device 10, it does two main things. It acquires the audio/video data sent from the selected device over the bus 6 and plays back the audio/video information. It also shows information such as the operation status and operating state on the display unit 23.

The control unit 21 sends acquired audio data over the bus 6 to the amplifier unit 26. There the data is D/A converted as appropriate, adjusted for volume, tone, and so on, amplified, and output through the speaker 27. The control unit 21 sends acquired video data over the bus 6 to the display unit 25, which shows it on its screen.

The rear-seat operation unit (remote controller) 30 is not shown in detail. It comprises an operation section with the same functions as the front operation unit 22, and an infrared transmission section that sends signals corresponding to the entered instructions to the display unit 31. The rear-seat display unit 31, also not shown in detail, comprises:

- an infrared communication section for exchanging control signals, data, and the like with the remote controller 30 and the wireless headphones 32;
- a control section that performs control equivalent to the front control unit 21;
- a display section consisting of an LCD monitor or the like, as in the front display unit 25;
- a memory section like the front memory unit 24.

Next, the configuration of the in-vehicle voice recognition device 10 according to this embodiment is described with reference to FIG. 2, which shows an example.

As shown in the figure, the in-vehicle voice recognition device 10 according to this embodiment comprises the HDD 7 (part of its storage area), the microphone 8, and the voice recognition unit 9. The voice recognition unit 9 has the following functional blocks:

- a voice input unit 11;
- a voice recognition processing unit 12;
- a dictionary switching selection unit 13;
- a voice reproduction processing unit 14;
- a device control signal generation unit 15.

The HDD 7 stores the following voice recognition data:

- command recognition dictionaries for recognizing commands (including voice tags) spoken by the user;
- a discrimination dictionary JD, which is a characteristic feature of the present invention;
- an acoustic model AM.

The acoustic model AM is well known to those skilled in the art. It can be created, for example, in three steps:

1. Generate a phoneme HMM (hidden Markov model) set consisting of one phoneme HMM per phoneme.
2. Combine the phoneme HMMs of that set to generate an initial phoneme-chain syllable HMM set, with one initial phoneme-chain syllable HMM per syllable.
3. Train that initial set.

The voice recognition processing unit 12 refers to the acoustic model AM as needed when performing voice recognition.
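Recognizers built on HMM acoustic models score an utterance by the probability that a model generated the observed feature sequence. The sketch below shows the textbook forward algorithm for a discrete-observation HMM. It is a generic illustration, not the acoustic model AM of this embodiment, which would use continuous acoustic features and phoneme/syllable models. The toy parameters are invented.

```python
from typing import Sequence


def forward_likelihood(
    pi: Sequence[float],
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> float:
    """P(obs | model) for a discrete HMM via the forward algorithm.

    pi[i]   initial probability of state i
    A[i][j] transition probability from state i to state j
    B[i][k] probability of emitting symbol k in state i
    """
    n = len(pi)
    # alpha[i] = P(o_1..o_t, state_t = i), initialised for t = 1
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        # Sum over all predecessor states, then emit the next symbol.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    return sum(alpha)
```

Take a two-state left-to-right model with `pi = [1, 0]`, `A = [[0.5, 0.5], [0, 1]]`, and `B = [[0.9, 0.1], [0.2, 0.8]]`. For the observation sequence `[0, 1]` it gives a likelihood of about 0.405. A recognizer evaluates one such model (or chain of models) per dictionary entry and ranks entries by likelihood. For long utterances, real systems work with log probabilities or scaling to avoid numerical underflow.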

The command recognition dictionary is divided into recognition dictionaries, each consisting of the commands selectable in connection with a given operating state of, or operation instruction for, the control target devices 1 to 5 (three dictionaries, D1 to D3, in the illustrated example), and each is stored under its dictionary identification number (ID = 1 to 3). In the illustrated example, the first command recognition dictionary D1 is assigned as the voice tag recognition dictionary.

The discrimination dictionary JD is used, as described later, to decide whether a voice tag spoken by the user should be registered in the voice tag recognition dictionary D1. At a minimum, every voice tag registered in the dictionary D1 is also registered in the discrimination dictionary JD. In addition, the commands that the navigation unit 5 supports as a product ("Menu", "Cancel", and "Map" in the example of FIG. 2) are registered in the discrimination dictionary JD as reserved words. Registering these reserved words in advance makes it possible, when a voice tag is recognized as described later, to tell whether the recognized command is one newly registered by the user or one originally provided as a navigation function.
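The dictionary layout described above can be sketched as a small data model. This is a minimal sketch under stated assumptions: the class `RecognitionDictionary` and the helper names are hypothetical, entries are plain text strings (the device itself works from recorded voice data on the HDD 7), and D2 and D3 are left empty because their contents are not given.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionDictionary:
    """One dictionary held in storage (the HDD 7); entries are text here for illustration only."""
    name: str
    entries: list = field(default_factory=list)
    enabled: bool = False


# Reserved words: commands the navigation unit supports as a product (example of FIG. 2).
RESERVED_WORDS = ["Menu", "Cancel", "Map"]


def build_initial_store() -> dict:
    """Initial state: no voice tags in D1 or JD; JD holds only the reserved words."""
    store = {name: RecognitionDictionary(name) for name in ("D1", "D2", "D3")}
    store["JD"] = RecognitionDictionary("JD", list(RESERVED_WORDS))  # copy, so the constant stays intact
    return store


def classify_entry(store: dict, word: str) -> str:
    """Tell a user-registered voice tag apart from a built-in (reserved) navigation command."""
    if word in store["D1"].entries:
        return "voice tag"        # every D1 tag is mirrored in JD, so check D1 first
    if word in store["JD"].entries:
        return "reserved word"    # in JD but not in D1: an original navigation command
    return "unregistered"
```

Checking D1 before JD is what lets the reserved words and the user's tags share one discrimination dictionary while remaining distinguishable.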

In the voice recognition unit 9, the voice input unit 11 amplifies as appropriate the voice command (analog voice signal) spoken by the user and picked up by the microphone 8, digitizes it, and outputs it to the voice recognition processing unit 12. Basically, the voice recognition processing unit 12 takes the recognition dictionary consisting of the commands selectable in the current operating state of the control target devices 1 to 5 (one of the command recognition dictionaries D1 to D3 in the illustrated example) and, referring to the acoustic model AM, compares the input voice command against each command in that dictionary. It calculates a degree of match (recognition score) for each command and takes the command with the highest recognition score as the voice command spoken by the user. Furthermore, as processing specific to the present invention, when a voice tag is being registered the voice recognition processing unit 12 performs recognition using the discrimination dictionary JD, as described later, and decides from the resulting recognition score whether that voice tag should be registered in the voice tag recognition dictionary D1.
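The best-score selection described above can be sketched as follows. This is only an illustration under a loud assumption: the actual unit scores audio against the acoustic model AM, whereas here `difflib.SequenceMatcher` text similarity (a value between 0 and 1) stands in for the recognition score, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def recognition_score(utterance: str, command: str) -> float:
    """Stand-in for the acoustic match score: case-insensitive text similarity in [0, 1]."""
    return SequenceMatcher(None, utterance.lower(), command.lower()).ratio()


def recognize(utterance: str, active_commands: list) -> tuple:
    """Compare the input against every command of the active dictionary and keep the best match."""
    if not active_commands:
        return None, 0.0
    scored = [(recognition_score(utterance, c), c) for c in active_commands]
    best_score, best_command = max(scored)  # highest score wins
    return best_command, best_score
```

For example, `recognize("cancel", ["Menu", "Cancel", "Map"])` returns `("Cancel", 1.0)`.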

The dictionary switching selection unit 13 is operatively connected to each of the control target devices 1 to 5; when it detects a change in their operating state, it selects the recognition dictionary consisting of the commands selectable in that operating state. The voice reproduction processing unit 14 synthesizes voice data for talkback playback of the recognition result of a command (including a voice tag) spoken by the user, and for announcing the result of a voice operation on any of the control target devices 1 to 5. The synthesized voice data is sent to the bus 6 (FIG. 1) and announced to the user from the speaker 27 via the amplifier unit 26. The device control signal generation unit 15 acquires the command determined by the voice recognition processing unit 12 and outputs a device control signal corresponding to the content of that command. The device control signal is sent to the bus 6 (FIG. 1), and the corresponding control target device changes its operating state based on it.
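The state-dependent selection performed by the dictionary switching selection unit 13 can be sketched as a lookup from operating state to dictionary IDs. The state names and the mapping below are hypothetical; the description says only that each dictionary holds the commands selectable in a given operating state.

```python
# Hypothetical mapping from an operating state of the controlled devices
# to the IDs of the recognition dictionaries selectable in that state.
STATE_TO_DICTIONARY_IDS = {
    "map_display": {1, 2},     # e.g. D1 (voice tags) + D2
    "audio_playback": {1, 3},  # e.g. D1 (voice tags) + D3
}


class DictionarySwitcher:
    """Keeps the set of enabled recognition-dictionary IDs in step with the operating state."""

    def __init__(self, mapping: dict):
        self.mapping = mapping
        self.enabled_ids = set()

    def on_state_change(self, state: str) -> set:
        # Enable exactly the dictionaries selectable in the new state; an unknown state enables none.
        self.enabled_ids = set(self.mapping.get(state, ()))
        return self.enabled_ids
```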

The processing for registering a voice tag based on switching between the command recognition dictionaries and the discrimination dictionary, performed in the in-vehicle speech recognition device 10 (FIG. 2) of the present embodiment, is described below with reference to FIG. 3, which shows an example of it. FIGS. 4 to 8 are also referred to for supplementary explanation.

As the initial state, it is assumed that no voice tag has been registered in either the voice tag recognition dictionary (command recognition dictionary D1) or the discrimination dictionary JD, and that the commands originally provided as navigation functions ("Menu", "Cancel", "Map") are registered in the discrimination dictionary JD as reserved words.

In this state, in the first step S1, the voice recognition unit 9 determines whether voice data for a voice tag has been detected from the microphone 8 via the voice input unit 11. If YES, the process proceeds to step S2; if NO, the determination is repeated until a voice tag is detected.

In the next step S2, in the voice recognition unit 9, the dictionary switching selection unit 13, in cooperation with the voice recognition processing unit 12, enables the selectable command recognition dictionaries (D1 to D3 in the example of FIG. 2) and the discrimination dictionary JD.

In the next step S3, the voice recognition processing unit 12 refers to the dictionaries stored in the HDD 7 (here, the voice tag recognition dictionary D1 and the discrimination dictionary JD) and determines whether any voice tag has been registered. If YES, the process proceeds to step S4; if NO, it proceeds to step S5.

In step S5 (no voice tag registered yet), the voice recognition processing unit 12 records the PCM data (voice data) of the detected voice tag and registers the voice tag in the recognition dictionary (voice tag recognition dictionary D1) and in the discrimination dictionary JD. FIG. 4 shows an example, in which the command "Best Buy" has been registered as a voice tag in the voice tag recognition dictionary D1 and the discrimination dictionary JD (see (b) and (c) in the figure). The screen 51 shown in FIG. 4(a) is equivalent to the "Address Book" screen 61 illustrated in FIG. 9. When registration of the voice tag is complete, the process returns to step S1 and the above processing is repeated.

In step S4, on the other hand, the dictionary switching selection unit 13, in cooperation with the voice recognition processing unit 12, saves the IDs of the currently enabled recognition dictionaries (ID = 1 to 3 in the example of FIG. 2) and then disables all the dictionaries (see FIG. 4(c), FIG. 5(b), and FIG. 7(b)).

In the next step S6, the dictionary switching selection unit 13, in cooperation with the voice recognition processing unit 12, enables only the discrimination dictionary JD among the disabled dictionaries (see FIG. 5(c) and FIG. 7(c)), and the voice recognition processing unit 12 performs speech recognition on the detected voice tag from the recorded PCM data, using only the discrimination dictionary JD. FIGS. 5(a) and 7(a) each show an example of the detected voice tag: "My Home" in FIG. 5(a) and "Best Buy" in FIG. 7(a). The screen 52 shown in part (a) of each figure is equivalent to the screen 65 illustrated in FIG. 9.
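Steps S4 and S6, together with the restoration in step S10, follow a save/disable/restore pattern. A minimal sketch, assuming the enabled dictionaries are tracked as a set of names; the context manager `only_discrimination_dictionary` is hypothetical and simply guarantees that the saved recognition dictionaries are re-enabled however recognition ends.

```python
from contextlib import contextmanager


@contextmanager
def only_discrimination_dictionary(enabled: set):
    """S4: save the enabled recognition dictionaries and disable all;
    S6: enable JD alone; S10: disable JD and re-enable the saved ones."""
    saved = enabled - {"JD"}   # S4: hold the IDs of the enabled recognition dictionaries
    enabled.clear()            # S4: disable every dictionary
    enabled.add("JD")          # S6: only the discrimination dictionary is active
    try:
        yield enabled
    finally:
        enabled.clear()        # S10: disable JD ...
        enabled.update(saved)  # ... and re-enable the dictionaries saved in S4
```

Because the restore runs in `finally`, both the register branch (S8) and the warning branch (S9) end in the same S10 state.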

In the next step S7, the voice recognition processing unit 12 determines whether the recognition score calculated from the speech recognition is lower than a predetermined threshold. If YES, the process proceeds to step S8; if NO, it proceeds to step S9. When the recognition score is below the threshold, the detected voice tag can be judged to be data not similar to any voice tag registered so far (dissimilar data). When the recognition score is above the threshold, the detected voice tag can be judged to be identical or similar to one of the voice tags registered so far (identical/similar data).
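The S7 branch reduces to a single comparison. In the sketch below the threshold value 0.8 is a hypothetical placeholder (no number is given), and a score exactly equal to the threshold, a case the description does not address, is treated as "similar" so that borderline tags are rejected rather than registered.

```python
DEFAULT_THRESHOLD = 0.8  # hypothetical; the description only says "a predetermined threshold"


def decide_registration(best_score: float, threshold: float = DEFAULT_THRESHOLD) -> str:
    """Map the best JD recognition score for a new voice tag to the S8/S9 branch."""
    if best_score < threshold:
        return "S8: register"           # dissimilar to every registered tag
    return "S9: warn, do not register"  # identical or similar to a registered tag
```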

In step S8 (the detected voice tag is dissimilar data), the voice recognition processing unit 12 records the PCM data (voice data) of the detected voice tag and registers the voice tag in the discrimination dictionary JD and the voice tag recognition dictionary D1. FIG. 6 shows an example, in which the command "My Home" has been newly registered as a voice tag (see the voice tag recognition dictionary D1' in (b) and the discrimination dictionary JD' in (c)). When registration of the voice tag is complete, the process proceeds to step S10.

In step S9, on the other hand (the detected voice tag is identical/similar data), under control of the voice recognition processing unit 12 and via the device control signal generation unit 15, guidance information (a warning screen) stating that the voice tag will not be registered in the dictionaries (the voice tag recognition dictionary D1 and the discrimination dictionary JD) is displayed on the screen of a display unit (here, the front seat display unit 25). FIG. 8(a) shows an example, in which guidance information (warning screen) 53 states that the voice tag will not be registered because it is very similar to a previously registered voice tag, and prompts the user to register again. After this warning screen is displayed, the process proceeds to step S10.

In the final step S10, the dictionary switching selection unit 13, in cooperation with the voice recognition processing unit 12, disables the discrimination dictionary JD (or JD') and re-enables the command recognition dictionaries whose IDs were saved (see FIGS. 6(b) and 6(c) and FIGS. 8(b) and 8(c)).
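Steps S3 to S10 for a single detected voice tag can be combined into one function. This is a hedged sketch, not the patented implementation: text strings replace PCM audio, `difflib.SequenceMatcher` similarity stands in for the acoustic recognition score, the threshold 0.8 is hypothetical, and `register_voice_tag` and its `warn` callback (standing in for the warning screen 53) are illustrative names. Steps S1 and S2 (detection, and enabling D1 to D3 plus JD) are left to the caller.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical value


def score(a: str, b: str) -> float:
    # Text similarity used in place of the acoustic recognition score (assumption).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def register_voice_tag(tag: str, d1: list, jd: list, enabled: set, warn=print) -> bool:
    """Run steps S3-S10 for one detected voice tag; return True if it was registered."""
    # S3: any voice tags registered yet? JD also holds reserved words, so look at D1.
    if not d1:
        d1.append(tag)         # S5: register directly in D1 and JD
        jd.append(tag)
        return True
    saved = enabled - {"JD"}   # S4: save enabled recognition dictionaries, disable all
    enabled.clear()
    enabled.add("JD")          # S6: recognize using JD only
    try:
        best = max(score(tag, entry) for entry in jd)
        if best < THRESHOLD:   # S7 -> S8: dissimilar, so register
            d1.append(tag)
            jd.append(tag)
            return True
        # S9: identical or similar to a registered entry, so warn instead of registering
        warn(f"'{tag}' is too similar to a registered command; please register another tag.")
        return False
    finally:
        enabled.clear()        # S10: disable JD and restore the saved dictionaries
        enabled.update(saved)
```

Starting from the initial state (`jd = ["Menu", "Cancel", "Map"]`), "Best Buy" is registered directly via S5 and "My Home" then passes the S7 check and is registered via S8, while a second "my home" or the reserved word "Map" is rejected via S9.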

Although not illustrated, since it is unrelated to the gist of the present invention, the voice tags registered in the command recognition dictionary (voice tag recognition dictionary D1) through the above processing (FIG. 3) can be used to control the control target devices 1 to 5. In this case, in the voice recognition unit 9, the device control signal generation unit 15 outputs a device control signal corresponding to the command recognized by the voice recognition processing unit 12, and the corresponding control is executed on the control target device. At that time, control corresponding to the command (screen display, changes to it, and so on) is applied to the display units 25 and 31 showing video related to the operating state of the control target device, and control corresponding to the command (changes to the audio, and so on) is applied to the speaker 27 (including the wireless headphones 32) outputting audio related to that operating state. For example, as illustrated in FIG. 9(b), when the user says "Go to [My Home]" into the microphone 8 while the map screen 67 around the vehicle position is displayed, the voice recognition unit 9 performs speech recognition on this voice command (PCM data) and outputs a device control signal according to the result. In response to this device control signal, the navigation unit 5 displays information such as a guidance route to the user's home (My Home) on the screen of the display unit 25.
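Once registered, a voice tag can fill a slot in a fixed command phrase, as in the "Go to [My Home]" example. The sketch below is hypothetical throughout: the phrase pattern, the tag-to-destination table, and the dictionary used as a "device control signal" are illustrative stand-ins, since the format of the signal sent over the bus 6 is not specified.

```python
import re
from typing import Optional


def to_control_signal(recognized: str, voice_tags: dict) -> Optional[dict]:
    """Turn a recognized 'Go to <voice tag>' phrase into a hypothetical navigation control signal."""
    match = re.fullmatch(r"go to\s+(.+)", recognized.strip(), flags=re.IGNORECASE)
    if match is None:
        return None                    # not a "Go to" command
    spoken_tag = match.group(1).strip().lower()
    for tag, destination in voice_tags.items():
        if tag.lower() == spoken_tag:  # the slot must name a registered voice tag
            return {"device": "navigation", "action": "route_guidance", "destination": destination}
    return None                        # unknown tag: no signal is issued
```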

As described above, when the in-vehicle speech recognition device 10 of the present embodiment registers a command (including a voice tag) spoken into the microphone 8 in the recognition dictionary D1 and the discrimination dictionary JD, it performs speech recognition on that command using only the discrimination dictionary JD. If the recognition score calculated from the result is lower than the threshold, the command is judged to be data not similar to any command registered so far (dissimilar data), and it is registered in both dictionaries D1 and JD.

If, on the other hand, the recognition score is higher than the threshold, the command is judged to be data identical or similar to one of the commands registered so far (identical/similar data), and guidance information (warning screen) 53 stating that the command will not be registered in the dictionaries D1 and JD is displayed.

In other words, the device determines whether the command to be registered (including a voice tag) is identical/similar or dissimilar to the commands already registered in the discrimination dictionary JD, and registers it only when it is dissimilar, never when it is identical or similar.

This eliminates a drawback of the prior art: confusion during recognition caused by registering voice data identical or similar to a previously registered command. As a result, the recognition rate for voice commands registered in the recognition dictionary can be improved.

The above embodiment has been described taking as an example the case in which the in-vehicle speech recognition device 10 is incorporated as part of the in-vehicle A/V navigation system 40. The gist of the present invention is as follows: when a command (voice tag) spoken by the user is to be registered in the command recognition dictionary (voice tag recognition dictionary D1), a discrimination dictionary for deciding whether it should be registered is prepared in advance, the input command is recognized using that discrimination dictionary, and, based on the result (recognition score), only commands that are neither identical nor similar to registered ones are registered in the recognition dictionary. As is clear from this, the device need not be incorporated into a system containing both A/V equipment and a navigation device.

In the above embodiment, the HDD 7 is used as the recording medium that stores the command recognition dictionaries, the discrimination dictionary, and so on together with the map data; another rewritable recording medium, such as flash memory, may be used instead.

FIG. 1 is a block diagram showing the configuration of an in-vehicle audio/video (A/V) navigation system incorporating an in-vehicle speech recognition device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the in-vehicle speech recognition device in FIG. 1.

FIG. 3 is a flowchart showing an example of the voice tag registration processing, based on switching between the command recognition dictionaries and the discrimination dictionary, performed in the in-vehicle speech recognition device of FIG. 2.

FIG. 4 is a supplementary explanatory diagram (part 1) of the processing flow of FIG. 3.

FIG. 5 is a supplementary explanatory diagram (part 2) of the processing flow of FIG. 3.

FIG. 6 is a supplementary explanatory diagram (part 3) of the processing flow of FIG. 3.

FIG. 7 is a supplementary explanatory diagram (part 4) of the processing flow of FIG. 3.

FIG. 8 is a supplementary explanatory diagram (part 5) of the processing flow of FIG. 3.

FIG. 9 is a diagram for explaining examples of voice tag use.

Explanation of symbols

1 to 5 ... Control target devices,
7 ... HDD (storage means),
8 ... Microphone (voice input means),
9 ... Voice recognition unit (voice recognition means),
10 ... In-vehicle speech recognition device,
12 ... Voice recognition processing unit,
13 ... Dictionary switching selection unit,
14 ... Voice reproduction processing unit,
15 ... Device control signal generation unit,
20, 30 ... Operation units,
25, 31 ... Display units (display means),
27 ... Speaker,
40 ... In-vehicle audio/video (A/V) navigation system,
53 ... Guidance information (warning screen) stating that a voice tag will not be registered,
D1, D1' ... Voice tag recognition dictionary (command recognition dictionary),
D2, D3 ... Command recognition dictionaries,
JD, JD' ... Discrimination dictionary.

Claims

An in-vehicle speech recognition apparatus comprising: voice input means for inputting a command spoken by a user in a vehicle passenger compartment;
storage means storing a plurality of recognition dictionaries, which are used for comparison and collation with commands input via the voice input means and in each of which the commands selectable according to an operating state of a control target device are registered, and a discrimination dictionary, which is used to determine whether an input command should be registered in a recognition dictionary and which is adapted so that the same commands as those registered in the plurality of recognition dictionaries are registered in it;
dictionary switching selection means for switching the recognition dictionary that is enabled according to the operating state of the control target device; and
voice recognition means for performing speech recognition based on comparison and collation between a command input via the voice input means and the commands registered in any of the dictionaries stored in the storage means;
wherein, when a command is input via the voice input means, the voice recognition means performs speech recognition on the input command using only the discrimination dictionary and, when the recognition score calculated from the speech recognition is lower than a predetermined threshold, registers the command in the recognition dictionary selected according to the operating state of the control target device and in the discrimination dictionary.

The in-vehicle speech recognition apparatus according to claim 1, further comprising display means,
wherein, when the recognition score calculated from the speech recognition is higher than the predetermined threshold, the voice recognition means causes the display means to display a warning screen stating that the command will not be registered in the recognition dictionary and the discrimination dictionary.

The in-vehicle speech recognition apparatus according to claim 1, wherein, when a command is input via the voice input means and no command has yet been registered in the discrimination dictionary, the voice recognition means registers the input command directly in the recognition dictionary and the discrimination dictionary.

The in-vehicle speech recognition apparatus according to claim 1, wherein commands that a navigation unit cooperating with the in-vehicle speech recognition apparatus supports as a product are registered in the discrimination dictionary as reserved words.

A voice command registration method in an in-vehicle speech recognition apparatus comprising: voice input means for inputting a command spoken by a user in a vehicle passenger compartment; storage means storing a plurality of recognition dictionaries, which are used for comparison and collation with commands input via the voice input means and in each of which the commands selectable according to an operating state of a control target device are registered, and a discrimination dictionary, which is used to determine whether an input command should be registered in a recognition dictionary and which is adapted so that the same commands as those registered in the plurality of recognition dictionaries are registered in it; dictionary switching selection means for switching the recognition dictionary that is enabled according to the operating state of the control target device; and voice recognition means for performing speech recognition based on comparison and collation between a command input via the voice input means and the commands registered in any of the dictionaries stored in the storage means, the method comprising:
when a command is input via the voice input means, enabling only the discrimination dictionary and performing speech recognition on the input command; and
when the recognition score calculated from the speech recognition is lower than a predetermined threshold, registering the command in the recognition dictionary selected according to the operating state of the control target device and in the discrimination dictionary.

The voice command registration method according to claim 5, wherein, when the recognition score calculated from the speech recognition is higher than the predetermined threshold, a warning screen stating that the command will not be registered in the recognition dictionary and the discrimination dictionary is displayed on display means.