JP2016143050A

JP2016143050A - Voice recognition device and voice recognition method

Info

Publication number: JP2016143050A
Application number: JP2015021658A
Authority: JP
Inventors: 貴裕土屋; Takahiro Tsuchiya
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2015-02-05
Filing date: 2015-02-05
Publication date: 2016-08-08

Abstract

PROBLEM TO BE SOLVED: To provide a technology that can recognize a driver's voice with high accuracy. SOLUTION: Recognition dictionary data corresponding to each driver condition are stored in advance. The driver's condition is detected, and voice recognition is performed using the recognition dictionary data that correspond to that condition. In recognizing a driver's voice, the driver's condition is thought to affect recognition accuracy more strongly than individual differences between drivers do. Voice recognition accuracy can therefore be improved by using the recognition dictionary data that correspond to the driver's condition. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique that is applied to a vehicle and recognizes the voice uttered by the driver.

Today's vehicles are equipped with various functions intended to reduce the driver's driving load, to make driving more comfortable, or to ensure safe driving. As a result, the number of items that the driver has to operate or set on the vehicle also tends to increase, and technologies that let the driver operate or set these items easily have been proposed to reduce this burden.
As one such technology, techniques for recognizing the driver's voice have been developed. If the vehicle can recognize the driver's voice, the driver can operate or configure the vehicle simply by speaking.

Voice recognition works on the following principle. First, the audio signal detected by a microphone is divided into short audio segments of a predetermined duration, called "phonemes", and each phoneme is analyzed by a predetermined method to extract a plurality of feature quantities. The sound that the phoneme represents can be identified from these feature quantities. In Japanese, for example, there are 5 vowels and 9 consonants, and about 40 sounds are said to exist when their variants are included. When a plurality of feature quantities is obtained for each of these sounds, the combination of feature quantities differs from sound to sound. These combinations of feature quantities are therefore stored as identification dictionary data. When a combination of feature quantities is obtained from a phoneme, the vowel or consonant that the phoneme represents can be identified by selecting the closest combination of feature quantities in the identification dictionary data. Applying this processing to the entire audio signal detected by the microphone makes it possible to recognize the speech.

For the driver to operate or set the various in-vehicle functions by voice, however, the voice must be recognized with sufficient accuracy. It has therefore been proposed to improve recognition accuracy by having the speaker read aloud preset content (words, sentences, and so on) and learning the speaker's voice (Patent Document 1, for example). The combinations of feature quantities obtained by analyzing vowels and consonants (the identification dictionary data) differ slightly from speaker to speaker, so using identification dictionary data obtained from the speaker whose voice is to be recognized is expected to give better recognition accuracy than using identification dictionary data obtained from a standard speaker.

Patent Document 1: JP 2003-162292 A

However, when the purpose is for the driver to operate or set various vehicle functions by voice, recognition accuracy may still be insufficient even after the driver's voice has been learned, and there has been a demand for a technology that can improve recognition accuracy further.

An object of the present invention is to provide a technology capable of recognizing a driver's voice with high recognition accuracy.

To solve the problem described above, the voice recognition device and voice recognition method of the present invention store identification dictionary data corresponding to each driver state in advance. At recognition time, the driver state is detected, and voice recognition is performed using the identification dictionary data corresponding to that driver state.
The detailed reasons are given later, but for recognizing a driver's voice, the driver state is considered to affect recognition accuracy more strongly than individual differences between drivers do. Performing voice recognition with the identification dictionary data corresponding to the driver state therefore makes it possible to improve recognition accuracy.

FIG. 1 is an explanatory diagram showing the vehicle 1 equipped with the voice recognition device 100 of the present embodiment.
FIG. 2 is a block diagram showing the rough internal structure of the voice recognition device 100.
FIG. 3 is an explanatory diagram giving an outline of voice recognition.
FIG. 4 is an explanatory diagram conceptually showing how voice recognition is performed using identification dictionary data.
FIG. 5 is an explanatory diagram illustrating a result of voice recognition.
FIG. 6 is a flowchart of the voice recognition process of the present embodiment.
FIG. 7 is an explanatory diagram illustrating driver states.
FIG. 8 is an explanatory diagram illustrating identification dictionary data set according to the driver state.
FIG. 9 is a block diagram showing the rough internal structure of the voice recognition device 200 of a modification.
FIG. 10 is a flowchart of the process in which the voice recognition device 200 of the modification generates identification dictionary data for the driver.
FIG. 11 is an explanatory diagram showing how identification dictionary data is generated for the driver.

An embodiment is described below to clarify the content of the invention described above.
A. Device configuration:
FIG. 1 shows a vehicle 1 equipped with a voice recognition device 100. As illustrated, the vehicle 1 carries a microphone 10 that acquires the audio signal of the voice uttered by the driver, and the voice recognition device 100, which analyzes the audio signal acquired by the microphone 10 and recognizes the voice. The vehicle 1 also carries an in-vehicle camera 20 that captures an image of the driver's face. By analyzing the face image obtained by the in-vehicle camera 20, the voice recognition device 100 can detect the driver state (for example, tired, sleepy, or excited).
In this embodiment, the driver state is detected by capturing a face image with the in-vehicle camera 20. This is because analyzing the driver's face image is a well-proven technique that can detect a variety of driver states with high accuracy. As long as the driver state can be detected, however, the method is not limited to capturing a face image with the in-vehicle camera 20. For example, the driver state may be detected from the driver's biological information (heart rate, blood pressure, amount of perspiration, and so on), or from the driver's driving behavior (the magnitude, speed, and frequency of steering, accelerator, and brake operations).

FIG. 2 shows the rough internal structure of the voice recognition device 100 of the present embodiment. As illustrated, the voice recognition device 100 includes a driver state detection unit 101, an identification dictionary data selection unit 102, an identification dictionary data storage unit 103, a voice acquisition unit 104, and a voice recognition unit 105.
The voice recognition device 100 is implemented by a microcomputer in which a CPU, memory, a timer, input/output peripherals, and the like are connected via a bus so that they can exchange data. These "units" are therefore only a convenient classification of the internal structure of the voice recognition device 100 by function, and do not mean that the device is physically divided into them. The units can be implemented as a computer program executed by the CPU, as electronic circuits including an LSI and memory, or as a combination of the two.

The driver state detection unit 101 detects the driver state by analyzing the driver's face image captured by the in-vehicle camera 20. What the driver state detection unit 101 detects as the driver state is described later.
When the identification dictionary data selection unit 102 receives the driver state from the driver state detection unit 101, it selects the identification dictionary data corresponding to that driver state from the identification dictionary data stored in the identification dictionary data storage unit 103. The identification dictionary data storage unit 103 stores several kinds of identification dictionary data, described later, one for each driver state.
The voice acquisition unit 104 acquires the driver's audio signal detected by the microphone 10.
The voice recognition unit 105 then performs voice recognition on the audio signal acquired by the voice acquisition unit 104, referring to the identification dictionary data selected by the identification dictionary data selection unit 102.
In this way, the voice recognition device 100 of the present embodiment improves recognition accuracy by referring to the identification dictionary data corresponding to the driver state. The reason is explained below, preceded by a brief outline of voice recognition.

Roughly speaking, voice recognition proceeds in three steps. In the first step, the audio signal is divided into a plurality of phonemes and feature quantities are extracted from each phoneme. In the second step, the extracted feature quantities are compared with the identification dictionary data to identify the sound of each phoneme. In the third step, the phonemes whose sounds have been identified are combined and recognized as speech.

FIG. 3 conceptually shows how an audio signal is divided into a plurality of phonemes and how feature quantities are extracted from each phoneme. Suppose an audio signal such as the one in FIG. 3(a) has been obtained. This signal is divided into phonemes of a predetermined width (for example, 25 ms). FIG. 3(b) illustrates the audio signal divided into phonemes (1) to (16).
FIG. 3(c) shows some of the phonemes enlarged. Phoneme (2), for example, has a waveform clearly different from those of phonemes (8), (10), and (16). Likewise, phoneme (16) has a waveform clearly different from those of phonemes (2), (8), and (10). Phonemes (8) and (10), by contrast, share common waveform features. Indeed, a human listener perceives phonemes (8) and (10) as the same sound, and perceives phonemes (2) and (16) as sounds different from the others. The waveform of every phoneme (1) to (16) is therefore analyzed to extract feature quantities. The feature quantities extracted here do not describe the loudness or pitch of the sound; they relate to the type of sound that a human distinguishes (that is, whether it is a vowel or a consonant, and which vowel or consonant it is). Several such feature quantities are known, formant frequencies being a typical example. In general, about 10 kinds of feature quantities are extracted from each phoneme.
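The first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a trailing partial segment is simply dropped, and the zero-crossing rate stands in for the formant-type feature quantities named in the text, which the source does not specify in detail.

```python
from typing import List, Sequence, Tuple


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 25.0) -> List[List[float]]:
    """Cut a signal into consecutive fixed-width segments (the text's "phonemes").

    A trailing segment shorter than the frame width is dropped (an assumption).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Stand-in feature: fraction of adjacent sample pairs whose sign differs."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def extract_features(frame: Sequence[float]) -> Tuple[float, ...]:
    """Return a feature tuple for one segment.

    A real system would return about ten values (formant frequencies and the
    like); this placeholder returns a single illustrative value.
    """
    return (zero_crossing_rate(frame),)
```

With a 16 kHz signal, a 25 ms width corresponds to 400 samples per segment, so 0.1 s of audio yields four segments.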

Next, the feature quantities extracted from a phoneme are compared with the identification dictionary data to identify the type of sound that the phoneme represents. The identification dictionary data is constructed as follows.
Suppose, for example, that many utterances of the sound "a" are collected and a plurality of feature quantities is extracted from each. If 100 utterances are collected, 100 sets of feature quantities are obtained. These 100 sets will not match exactly, but they should have broadly similar values. If five kinds of feature quantities (feature quantity 1 to feature quantity 5) are extracted from each phoneme and each set is represented as a point in a five-dimensional coordinate system, the points obtained from the sound "a" should be distributed in a cluster.
If the same is done for the sound "i", the points obtained from "i" should cluster at a position different from that of the "a" points. The same applies to the sounds "u", "e", and "o": the points for each sound cluster at their own position. FIG. 4 conceptually shows how the phonemes of each sound cluster at different positions in the coordinate space whose axes are the feature quantities (called the feature space).
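A minimal sketch of how such clusters could be turned into dictionary data follows. The source only says that each sound occupies a region of the feature space; modeling that region as a sphere (the mean of the samples as center, the largest sample distance as radius) is an assumption made here for illustration, and `build_dictionary` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Region = Tuple[Vector, float]  # (center, radius) -- an assumed representation


def build_dictionary(samples: Dict[str, List[Vector]]) -> Dict[str, Region]:
    """Summarize the feature vectors collected for each sound as one region."""
    dictionary: Dict[str, Region] = {}
    for sound, vectors in samples.items():
        if not vectors:
            raise ValueError(f"no samples for sound {sound!r}")
        dims = len(vectors[0])
        # Center of the cluster: per-dimension mean of all samples.
        center = tuple(sum(v[d] for v in vectors) / len(vectors)
                       for d in range(dims))
        # Radius: distance from the center to the farthest sample.
        radius = max(math.dist(center, v) for v in vectors)
        dictionary[sound] = (center, radius)
    return dictionary
```

Other region shapes (per-axis ranges, Gaussian models) would fit the description equally well; the sphere is chosen only because it keeps the example short.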

Now suppose that feature quantities are extracted from a phoneme and the resulting point is point A in FIG. 4. In this case, the sound that the phoneme represents can be taken to be "a". Similarly, if the point for the feature quantities extracted from another phoneme is point B in FIG. 4, the sound that the phoneme represents is taken to be "e".
The distribution range in the feature space is thus examined and stored in advance for each vowel and consonant. For an unknown phoneme, the type of sound (which vowel or consonant it corresponds to, or that it corresponds to none) can then be identified by determining which distribution range contains the feature quantities extracted from that phoneme. Identification dictionary data is a set of data describing the distribution range in the feature space for each vowel and consonant.
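The lookup described above can be sketched as follows, again assuming (this is not stated in the source) that each distribution range is stored as a center and a radius. The hypothetical `classify` function returns the sound whose range contains the point, preferring the nearest center if ranges overlap, and returns `None` when no range contains it.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Region = Tuple[Tuple[float, ...], float]  # (center, radius) -- assumed layout


def classify(features: Sequence[float],
             dictionary: Dict[str, Region]) -> Optional[str]:
    """Return the sound whose region contains the feature point, else None."""
    best_sound: Optional[str] = None
    best_distance = math.inf
    for sound, (center, radius) in dictionary.items():
        distance = math.dist(center, features)
        # Only regions that actually contain the point are candidates.
        if distance <= radius and distance < best_distance:
            best_sound, best_distance = sound, distance
    return best_sound


dictionary = {"a": ((1.0, 1.0), 0.5), "e": ((3.0, 3.0), 0.5)}
print(classify((1.1, 0.9), dictionary))    # point inside the "a" region
print(classify((10.0, 10.0), dictionary))  # outside every region
```

Returning `None` corresponds to the case in the text where the phoneme matches no vowel or consonant.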

The feature quantities extracted from each phoneme are compared with the identification dictionary data in this way. As a result, the type of sound can be identified for each of the phonemes shown in FIG. 3, as shown in FIG. 5. For example, phonemes (1) to (3) represent the consonant "J", phonemes (4) to (6) represent the vowel "i", and phonemes (7) to (10) represent the consonant "B".
Next, the phonemes are grouped. This is necessary because a phoneme is obtained by mechanically dividing the audio signal at a fixed width (for example, 25 ms), which is too short to be recognized as speech on its own. In the example of FIG. 5, phonemes (1) to (3) are grouped into the consonant "J" and phonemes (4) to (6) into the vowel "i". Similarly, phonemes (7) to (10) are grouped into the consonant "B", phonemes (11) to (13) into the vowel "u", and phonemes (14) to (16) into the consonant "N".
Combining these, the speech is recognized as "ji", "bu", "n".
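The grouping step for the FIG. 5 example can be sketched as below. The function names and the simple consonant-plus-vowel pairing rule are assumptions for illustration; note that collapsing runs this way cannot distinguish a long sound from two identical sounds in succession, which a real system would have to handle.

```python
from itertools import groupby
from typing import List, Sequence

VOWELS = {"a", "i", "u", "e", "o"}


def merge_runs(frame_labels: Sequence[str]) -> List[str]:
    """Collapse consecutive segments that carry the same label into one sound."""
    return [label for label, _ in groupby(frame_labels)]


def combine_sounds(sounds: Sequence[str]) -> List[str]:
    """Join each consonant with the vowel that follows it, if any."""
    units: List[str] = []
    i = 0
    while i < len(sounds):
        current = sounds[i]
        next_is_vowel = i + 1 < len(sounds) and sounds[i + 1] in VOWELS
        if current not in VOWELS and next_is_vowel:
            units.append(current + sounds[i + 1])
            i += 2
        else:
            units.append(current)
            i += 1
    return units


# Segments (1)-(16) of the example: J x3, i x3, B x4, u x3, N x3.
labels = ["J"] * 3 + ["i"] * 3 + ["B"] * 4 + ["u"] * 3 + ["N"] * 3
print(combine_sounds(merge_runs(labels)))  # ['Ji', 'Bu', 'N']
```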

As the description above makes clear, the accuracy of voice recognition depends heavily on the identification dictionary data used, so it is important to use appropriate identification dictionary data. And just as human voices differ between individuals, identification dictionary data also differs between individuals. For example, a person who opens the mouth wide and speaks clearly yields different identification dictionary data from a person who barely opens the mouth and speaks in a muffled voice. General-purpose identification dictionary data is therefore set so that it avoids misrecognition for any speaker while still giving a reasonable degree of recognition accuracy. This amounts to sacrificing recognition accuracy in order to avoid misrecognition, so identifying the speaker and using identification dictionary data dedicated to that speaker is expected to improve recognition accuracy.
Even when the speaker is identified and dedicated identification dictionary data is used, however, recognition accuracy may not be sufficient for the driver to operate or configure the vehicle by voice. The voice recognition device 100 of the present embodiment therefore improves recognition accuracy by taking the driver state into account, as described below.

B. Voice recognition process:
FIG. 6 shows a flowchart of the voice recognition process performed by the voice recognition device 100 of the present embodiment.
As illustrated, the voice recognition process first detects the driver state (S101). FIG. 7 illustrates the driver states that are detected. State A is the normal state, state B is tired, and state C is sleepy. State D is impatient, state E is excited, state F is happy, and state G is sad. These driver states are detected by analyzing the driver's face image captured by the in-vehicle camera 20. The driver state may of course also be detected from the driver's biological information or driving behavior.

Next, it is determined whether the driver state has changed since it was last detected (S102). For example, if the previously detected driver state was state A and the current driver state is state B, the driver state is judged to have changed (S102: yes). In this case, the identification dictionary data corresponding to the current driver state is read (S103).
As described above, identification dictionary data is data describing the distribution range of the feature quantities for each type of sound (the vowels and consonants to be identified). A memory (not shown) of the voice recognition device 100 stores identification dictionary data corresponding to each of states A to G.
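One way to picture the per-state storage is a mapping from driver state to dictionary data. The sketch below is illustrative only: the `DriverState` and `DictionaryStore` names are hypothetical, and the dictionary contents are left as opaque `dict` objects because the source does not define their internal format.

```python
from enum import Enum
from typing import Dict


class DriverState(Enum):
    """The seven driver states of FIG. 7."""
    NORMAL = "A"
    TIRED = "B"
    SLEEPY = "C"
    IMPATIENT = "D"
    EXCITED = "E"
    HAPPY = "F"
    SAD = "G"


class DictionaryStore:
    """Holds one set of identification dictionary data per driver state."""

    def __init__(self, dictionaries: Dict[DriverState, dict]) -> None:
        missing = [s.name for s in DriverState if s not in dictionaries]
        if missing:
            raise ValueError(f"no dictionary data for states: {missing}")
        self._dictionaries = dict(dictionaries)

    def read(self, state: DriverState) -> dict:
        """Return the dictionary data for the given state (step S103)."""
        return self._dictionaries[state]
```

Checking at construction time that every state has an entry is a design choice made here so that a lookup in S103 cannot fail later while the vehicle is in use.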

FIG. 8 shows, for the five vowels "a" to "o", the distribution ranges of the feature quantities set in the identification dictionary data for state A and those set in the identification dictionary data for state B. As shown, the identification dictionary data changes with the driver state. The reason is as follows.
When a person speaks, the type of sound is changed by varying how the mouth is opened, the shape of the inside of the mouth, and how strongly the breath is exhaled. The identification dictionary data used to identify the type of sound therefore reflects differences in these factors. Because the size and shape of the mouth, the shape of the inside of the mouth, and the strength of exhalation differ between individuals, individual differences also appear in the identification dictionary data. For this reason, using dedicated identification dictionary data for each speaker can improve recognition accuracy.

However, how the mouth is opened, the shape of the inside of the mouth, and the strength of exhalation also change considerably with the speaker's state. Comparing a tired speaker with an excited one, for example, it is easy to see that the mouth opening and the strength of the breath differ greatly. As for the shape of the inside of the mouth, the back of the palate tends to be raised when the speaker is excited and to hang down when the speaker is tired. The identification dictionary data therefore also differs considerably depending on the speaker's state.

It is fairly intuitive that recognition accuracy improves when different identification dictionary data is used for speaker A and for speaker B. But whether tired or excited, speaker A is still speaker A and speaker B is still speaker B, so common sense would not suggest that different identification dictionary data is needed for the tired case and the excited case.
As described above, however, the reason that using different identification dictionary data for speaker A and speaker B improves recognition accuracy is that the two differ in how they open the mouth, in the shape of the inside of the mouth, in the strength of exhalation, and so on. In terms of these factors, a difference in the speaker's state is not fundamentally different from a difference between speakers.
In addition, the speaker in this embodiment is a driver. Children and elderly people do not become speakers, so the range of speakers to be recognized is narrower than in general voice recognition. Furthermore, a driver at the wheel can be in a wide variety of physical and psychological states. This means that misrecognition caused by a difference in driver state is more likely than misrecognition caused by a difference between speakers. Put another way, when the speaker is a driver, switching the identification dictionary data according to the speaker's state (the driver state) is expected to improve recognition accuracy more than switching it according to the speaker.

For these reasons, the voice recognition device 100 of the present embodiment stores identification dictionary data for each of the driver states A to G shown in FIG. 7. In the voice recognition process of FIG. 6, when the driver state is judged to have changed (S102: yes), the identification dictionary data corresponding to the new driver state is read from the memory (S103).
When the driver state has not changed (S102: no), the identification dictionary data already read can continue to be used, so the reading step (S103) is skipped.
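The S102/S103 logic amounts to reloading the dictionary only when the state changes. A minimal sketch, with a hypothetical `ActiveDictionary` class and a caller-supplied reader function, could look like this. Treating the very first detection as a change is an assumption, since the source does not describe the initial pass.

```python
from typing import Callable, Hashable, Optional


class ActiveDictionary:
    """Keeps the dictionary data for the most recently detected driver state."""

    def __init__(self, read_dictionary: Callable[[Hashable], dict]) -> None:
        self._read = read_dictionary
        self._state: Optional[Hashable] = None
        self._dictionary: Optional[dict] = None

    def update(self, state: Hashable) -> bool:
        """Reload if the state changed (S102: yes -> S103); return True if reloaded."""
        if self._dictionary is not None and state == self._state:
            return False  # S102: no -> keep using the data already read
        self._dictionary = self._read(state)
        self._state = state
        return True

    @property
    def dictionary(self) -> Optional[dict]:
        return self._dictionary


reads = []


def read_from_memory(state):
    # Stand-in for reading stored dictionary data; records each read.
    reads.append(state)
    return {"state": state}


active = ActiveDictionary(read_from_memory)
for detected in ["A", "A", "B", "B"]:
    active.update(detected)
print(reads)  # ['A', 'B'] -- one read per state change
```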

Next, it is determined whether the driver has spoken (S104). The driver's voice is detected by the microphone 10, so whether the driver has spoken can be judged directly from the output of the microphone 10.
If the driver has not spoken (S104: no), the process returns to the beginning, detects the driver state again (S101), and then repeats the subsequent steps described above (S102 to S104).

If the driver has spoken (S104: yes), the driver's voice is acquired (S105).
The voice is then recognized using the identification dictionary data that has been read (S106). That is, the acquired audio signal is divided into a plurality of phonemes, and a plurality of feature quantities is extracted from each phoneme (see FIG. 3). The feature quantities extracted from each phoneme are compared with the identification dictionary data to identify the type of sound that the phoneme represents (which vowel or consonant it is; see FIG. 4). The phonemes are then combined to recognize the speech (see FIG. 5).

Once the voice has been recognized in this manner, the result is output to an external device (for example, a control device (not shown) of the vehicle 1) (S107), and it is then determined whether or not to end the voice recognition (S108).
If it is determined that the voice recognition is not to be ended (S108: no), the process returns to the beginning, the driver state is detected again (S101), and the subsequent series of steps described above (S102 to S108) is executed. When it is determined, while these steps are being repeated, that the voice recognition is to be ended (S108: yes), the voice recognition process of FIG. 6 ends.
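The control flow of S101 to S108 can be sketched as a loop that reloads the identification dictionary data only when the driver state changes. This is an illustrative sketch, not the implementation of the embodiment: the function `run_recognition`, the representation of each pass as a `(driver_state, utterance)` pair, and the `recognize` callback are all hypothetical, and the end condition of S108 is modeled simply as the input running out.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_recognition(
    passes: Iterable[Tuple[str, Optional[str]]],
    dictionaries: Dict[str, dict],
    recognize: Callable[[str, dict], str],
) -> Tuple[List[str], int]:
    """Run the loop of S101 to S108 and return (results, dictionary read count)."""
    results: List[str] = []
    loaded_state: Optional[str] = None
    loaded_dictionary: dict = {}
    reads = 0
    for state, utterance in passes:          # S101: detect the driver state
        if state != loaded_state:            # S102: has the state changed?
            loaded_dictionary = dictionaries[state]  # S103: read the dictionary
            loaded_state = state
            reads += 1
        if utterance is None:                # S104: no, the driver has not spoken
            continue                         # return to the beginning
        # S105 to S107: acquire the voice, recognize it, output the result
        results.append(recognize(utterance, loaded_dictionary))
    return results, reads                    # S108: yes, end of input


# Hypothetical data: two driver states whose dictionaries map the same
# utterance to different results.
example_dictionaries = {"A": {"x": "hello"}, "B": {"x": "HELLO"}}
example_passes = [("A", None), ("A", "x"), ("B", "x"), ("B", None)]
example_results, example_reads = run_recognition(
    example_passes, example_dictionaries, lambda u, d: d[u]
)
```

In this example the dictionary is read twice for four passes, which reflects the point of S102: the read in S103 is skipped whenever the driver state is unchanged.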

As described in detail above, the voice recognition process of the present embodiment selects the appropriate identification dictionary data according to the driver state, so the accuracy of voice recognition can be greatly improved.

C. Modification:
In the embodiment described above, the identification dictionary data corresponding to the driver state was described as being stored in the memory in advance. However, the identification dictionary data corresponding to the driver state may instead be generated. In this way, identification dictionary data corresponding to the driver state can be generated for each driver, so the accuracy of voice recognition can be improved even further.

FIG. 9 shows the general internal structure of a voice recognition device 200 according to a modification that generates identification dictionary data corresponding to the driver state for each driver. The illustrated voice recognition device 200 differs from the voice recognition device 100 of the present embodiment described above with reference to FIG. 2 mainly in that a voice learning unit 201, an identification dictionary data generation unit 202, and a standard identification dictionary data storage unit 203 are added.

Of these, the voice learning unit 201 learns the driver's voice based on the driver's voice signal acquired by the voice acquisition unit 104.
The identification dictionary data generation unit 202 generates identification dictionary data corresponding to the driver state based on the learned driver's voice. To generate the identification dictionary data, it first acquires, from the standard identification dictionary data storage unit 203, the identification dictionary data set for each driver state assuming a standard driver (standard identification dictionary data). It then corrects the acquired standard identification dictionary data by the deviation between the standard driver's voice and the learned driver's voice, thereby generating identification dictionary data for the learned driver corresponding to each driver state. The identification dictionary data generation unit 202 then stores the generated identification dictionary data in the identification dictionary data storage unit 103. The process of generating the identification dictionary data for the learned driver will be described in detail later.

In other respects, the device is the same as the voice recognition device 100 of the present embodiment described above with reference to FIG. 2. Briefly, the driver state detection unit 101 detects the driver state and outputs the result to the identification dictionary data selection unit 102. The identification dictionary data selection unit 102 then selects the identification dictionary data corresponding to the driver state from the identification dictionary data stored in the identification dictionary data storage unit 103. The voice recognition unit 105 performs voice recognition on the voice signal acquired by the voice acquisition unit 104 while referring to the identification dictionary data selected by the identification dictionary data selection unit 102.

FIG. 10 shows a flowchart of the identification dictionary data generation process in which the voice recognition device 200 of the modification generates identification dictionary data corresponding to the driver state for each driver.
In the identification dictionary data generation process, the driver state is detected first (S201). Here, the driver state is detected by analyzing the driver's face image captured by the in-vehicle camera 20, but the driver state may be detected by other methods.

Then, it is determined whether or not the driver state is the normal state (S202). If the driver state is not the normal state (S202: no), the situation is considered unsuitable for generating the identification dictionary data corresponding to the driver state. The process therefore returns to the beginning, the driver state is detected (S201), and it is determined again whether or not the driver state is the normal state (S202).
When the driver state is determined to be the normal state while this determination is being repeated (S202: yes), the driver's voice for learning is acquired, for example by having the driver read aloud preset words or sentences (S203).

Based on the driver's voice learned in this way, identification dictionary data for the normal state is generated (S204). That is, a distribution range in the feature amount space is generated for each of a plurality of vowels and consonants. Various methods are known for generating identification dictionary data from learned voice, and any of them may be used. Because the identification dictionary data generated here is based on the voice uttered while the driver is in the normal state, it serves as the identification dictionary data for the normal state.

Subsequently, the identification dictionary data for the normal state is acquired from the identification dictionary data set for each driver state assuming a standard driver (standard identification dictionary data) (S205). The standard identification dictionary data is stored in advance in the memory of the voice recognition device 200 of the modification.
Then, the identification dictionary data for the normal state generated in S204 is compared with the identification dictionary data for the normal state read in S205 (S206). This comparison yields the correction amount needed to generate the identification dictionary data for the learned driver from the identification dictionary data for the standard driver.
Thereafter, the identification dictionary data for the learned driver is generated by correcting the standard identification dictionary data by the correction amount thus obtained (S207).

FIG. 11 illustrates how the identification dictionary data for the learned driver is generated by correcting the standard identification dictionary data. As shown in FIG. 7, there are seven driver states here, states A to G, and the standard identification dictionary data therefore contains identification dictionary data for each of these seven driver states. To keep the illustration simple, however, FIG. 11 shows the identification dictionary data for only three driver states, states A to C. Likewise, although each set of identification dictionary data describes the distribution range in the feature amount space for each of a plurality of vowels and consonants, FIG. 11 shows the distribution range only for the vowel "a".

For example, suppose that the identification dictionary data for the driver learned in state A (the normal state) has the distribution indicated by the broken line in FIG. 11(a), and that the standard identification dictionary data has the distribution indicated by the solid line in the same figure. This difference in distribution is due to individual differences among drivers. The difference due to individual variation observed in state A (the normal state) is therefore considered to arise in the same way in the other driver states.
Based on this idea, the voice recognition device 200 of the modification generates identification dictionary data corresponding to each driver state. For example, for state B, if the standard identification dictionary data has the distribution indicated by the solid line in FIG. 11(b), the identification dictionary data for the learned driver is considered to have the distribution indicated by the broken line in the same figure. Similarly, for state C, if the standard identification dictionary data has the distribution indicated by the solid line in FIG. 11(c), the identification dictionary data for the learned driver is considered to have the distribution indicated by the broken line in the same figure.
In S207 of FIG. 10, identification dictionary data for the learned driver corresponding to each driver state is generated from the standard identification dictionary data in this way. The generated identification dictionary data for the driver is then stored in a memory (not shown) (S208), and the identification dictionary data generation process of FIG. 10 ends.
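The correction of S205 to S207 can be sketched as follows. This is a simplified illustration under assumptions not found in the specification: each distribution range is reduced to its center in the feature amount space, and the correction amount is taken to be the difference between the learned center and the standard center in the normal state. The function `adapt_dictionaries`, the state labels, and the numeric values are hypothetical.

```python
from typing import Dict, Tuple

Center = Tuple[float, ...]
# Assumption: one dictionary per driver state, mapping a sound type
# to the center of its distribution in the feature amount space.
StateDictionaries = Dict[str, Dict[str, Center]]


def adapt_dictionaries(
    standard: StateDictionaries,
    learned_normal: Dict[str, Center],
    normal_state: str = "A",
) -> StateDictionaries:
    """Shift every standard dictionary by the offset observed in the normal state."""
    # S206: compare learned and standard data for the normal state.
    correction = {
        sound: tuple(l - s for l, s in zip(center, standard[normal_state][sound]))
        for sound, center in learned_normal.items()
    }
    # S207: apply the same correction amount to every driver state.
    adapted: StateDictionaries = {}
    for state, sounds in standard.items():
        adapted[state] = {}
        for sound, center in sounds.items():
            offset = correction.get(sound, (0.0,) * len(center))
            adapted[state][sound] = tuple(c + o for c, o in zip(center, offset))
    return adapted


# Hypothetical standard data for the vowel "a" in two driver states.
example_standard = {"A": {"a": (1.0, 2.0)}, "B": {"a": (3.0, 5.0)}}
example_learned = {"a": (1.5, 2.5)}  # learned in the normal state (S204)
example_adapted = adapt_dictionaries(example_standard, example_learned)
```

In the example, the offset of (0.5, 0.5) measured in state A is carried over to state B, which corresponds to moving the solid-line distribution to the broken-line distribution in each panel of FIG. 11. Sound types that were not learned are left at their standard values.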

If identification dictionary data corresponding to the driver state is generated for each driver as described above, voice recognition can take into account not only the driver state but also individual differences among drivers, so the recognition accuracy can be improved even further.

Although the present embodiment and the modification have been described above, the present invention is not limited to the embodiment and modification described, and can be implemented in various forms without departing from its gist.

DESCRIPTION OF SYMBOLS
1 ... vehicle, 10 ... microphone, 20 ... in-vehicle camera,
100 ... voice recognition device, 101 ... driver state detection unit,
102 ... identification dictionary data selection unit, 103 ... identification dictionary data storage unit,
104 ... voice acquisition unit, 105 ... voice recognition unit, 200 ... voice recognition device,
201 ... voice learning unit, 202 ... identification dictionary data generation unit,
203 ... standard identification dictionary data storage unit.

Claims

A voice recognition device that acquires a driver's voice, divides the voice into phonemes of a predetermined time width, and recognizes the voice by comparing the phonemes with identification dictionary data, the voice recognition device comprising:
a driver state detection unit that detects a driver state of the driver;
an identification dictionary data storage unit that stores the identification dictionary data corresponding to the driver state;
an identification dictionary data selection unit that selects the identification dictionary data corresponding to the driver state from the stored identification dictionary data; and
a voice recognition unit that recognizes the voice using the selected identification dictionary data.

The voice recognition device according to claim 1, wherein
the driver state detection unit detects the driver state by analyzing a face image of the driver.

The voice recognition device according to claim 1 or 2, further comprising:
a standard identification dictionary data storage unit that stores standard identification dictionary data set assuming a standard driver; and
an identification dictionary data generation unit that generates the identification dictionary data of the driver from the standard identification dictionary data by learning the driver's voice and detecting a correction amount with respect to the standard driver's voice.

A voice recognition method for acquiring a driver's voice, dividing the voice into phonemes of a predetermined time width, and recognizing the voice by comparing the phonemes with identification dictionary data, the voice recognition method comprising:
detecting a driver state of the driver;
selecting the identification dictionary data corresponding to the driver state from the identification dictionary data stored for each driver state; and
recognizing the voice using the selected identification dictionary data.