JP2017090612A

JP2017090612A - Voice recognition control system

Info

Publication number: JP2017090612A
Application number: JP2015219113A
Authority: JP
Inventors: 真吾入方; Shingo Irikata; 宗義難波; Muneyoshi Nanba
Original assignee: Mitsubishi Motors Corp
Current assignee: Mitsubishi Motors Corp
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To improve voice recognition accuracy in a voice recognition control system. SOLUTION: A voice recognition control system includes: an identification unit 2 for identifying a person uttering a voice on the basis of at least the voice; a selection unit 3 for selecting an acoustic model corresponding to the person identified by the identification unit 2; and a recognition unit 4 for recognizing the voice by using the acoustic model selected by the selection unit 3. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice recognition control system that controls in-vehicle devices with the voice of a vehicle occupant.

Voice recognition control systems that allow in-vehicle devices to be controlled with voice commands have been developed. Such a system controls the operating state of various in-vehicle devices (car navigation device, air conditioner device, car audio device, and so on) using the voice of a vehicle occupant as an input signal. In general, uttered speech is analyzed based on an acoustic model (a model that defines the correspondence between sound waveform samples and phonemes), and its semantic content is recognized based on a language model (a model that defines how phonemes are strung together) (see Patent Document 1). Using such voice recognition technology makes it possible to operate in-vehicle devices hands-free.

Patent Document 1: JP 2002-189492 A

The acoustic model and language model described above are created from the voice of a standard speaker. Actual speech, however, contains acoustic variations (frequency variations, waveform variations) arising from the structure of the speaker's vocal organs, the manner of vocalization, and the social environment (dialect, intonation), and it does not necessarily share the characteristics of the standard speaker's voice. A standard model therefore may not give sufficient voice recognition accuracy. In a vehicle voice recognition control system in particular, the same person is not always aboard, so voice recognition accuracy and the controllability of the in-vehicle devices tend to deteriorate.

One object of the present case, conceived in view of the problems described above, is to provide a voice recognition control system with improved recognition accuracy for the voices of vehicle occupants. The present case is not limited to this object: achieving operational effects that derive from the configurations shown in the "Mode for Carrying Out the Invention" described later, and that cannot be obtained with conventional technology, can also be regarded as another object.

(1) The voice recognition control system disclosed here controls an in-vehicle device using the voice of a vehicle occupant as an input signal. The system includes a specifying unit that specifies, based on at least the voice, the person who uttered the voice. It also includes a selection unit that selects an acoustic model corresponding to the person specified by the specifying unit, and a recognition unit that recognizes the voice using the acoustic model selected by the selection unit.

Preferably, the recognition unit recognizes the person's voice using an acoustic model corresponding to the person specified by the specifying unit. Preferably, the specifying unit specifies the position where the voice was uttered (utterance position) and the person present at that position. The acoustic model preferably includes a language model.

(2) Preferably, the system includes an indoor camera that captures an image of the vehicle interior, and the specifying unit specifies the person who uttered the voice using the voice and the image.
(3) Preferably, the specifying unit specifies the person by comparing the movement of the person's lips with the timing of the voice.
(4) Preferably, the system includes a data updating unit that uses the voice to create an acoustic model corresponding to the person specified by the specifying unit.
(5) Preferably, the system includes a control unit that controls the control target when the person specified by the specifying unit and the control target recognized by the recognition unit correspond to each other.

Selecting an acoustic model corresponding to the person specified by the specifying unit improves voice recognition accuracy in a vehicle cabin that multiple occupants may board.

FIG. 1 is a schematic top view of a vehicle to which the voice recognition control system is applied.
FIG. 2 is a schematic diagram showing the configuration of the voice recognition control system.
FIG. 3 is a flowchart for explaining the control performed by the voice recognition control system.

A voice recognition control system according to an embodiment is described with reference to the drawings. The embodiment below is merely an example, and there is no intention to exclude various modifications or applications of technology not explicitly described in it. Each configuration of the embodiment can be modified in various ways without departing from its spirit, and configurations can be selected as needed or combined as appropriate.

[1. Device configuration]
The voice recognition control system of this embodiment is applied to the vehicle 10 shown in FIG. 1. A driver's seat 14 and a passenger seat 15 are provided in the cabin of the vehicle 10, and an instrument panel (dashboard) is arranged at the front of the cabin. On the cabin-facing side of the instrument panel, a steering device and instruments are arranged in front of the driver's seat 14, and a glove box is arranged in front of the passenger seat 15. A multi-communication display device 16, which consolidates user interfaces such as car navigation and AV functions, is mounted at the center of the instrument panel in the vehicle width direction. The display device 16 is diagonally forward left from the viewpoint of the driver in the driver's seat 14, and diagonally forward right from the viewpoint of the occupant in the passenger seat 15.

The display device 16 is an electronic device that includes a general-purpose video display (display screen) with a touch panel and an electronic control unit (computer) including a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and the like. The display device 16 is connected to in-vehicle devices such as the navigation device 11, the air conditioner device 12, the car audio device 13, and a multimedia system, and can function as an input/output device for them. For example, route information to a destination, map information, and traffic congestion information provided by the navigation device 11 can be displayed on the display screen of the display device 16. The display screen can also reproduce and display a variety of visual information, such as programs received by an in-vehicle digital terrestrial broadcast tuner, DVD video content, video captured by a rear-view camera, and operation interfaces for the air conditioner device 12 and the car audio device 13.

The vehicle 10 is also equipped with a voice recognition control device 1 that controls various in-vehicle devices using an occupant's voice as an input signal. The voice recognition control device 1 is an electronic device (ECU, electronic control unit) that integrates a processor such as a CPU or MPU (Micro Processing Unit) with ROM, RAM, nonvolatile memory, and the like. The processor here is a processing device that contains, for example, a control unit (control circuit), an arithmetic unit (arithmetic circuit), and cache memory (registers). The ROM, RAM, and nonvolatile memory are memory devices that store programs and working data. The control performed by the voice recognition control device 1 is recorded as firmware or application programs in the ROM, RAM, nonvolatile memory, or removable media. When a program runs, its contents are loaded into memory space in the RAM and executed by the processor.

As shown in FIG. 2, a microphone array 21 and an indoor camera 22 are connected to the voice recognition control device 1 as input devices. The microphone array 21 is a voice input device in which multiple microphones are arranged in a predetermined layout, and the indoor camera 22 is a wide-angle video camera capable of capturing the entire cabin. As output devices, the navigation device 11, the air conditioner device 12, the car audio device 13, the display device 16, and the like are connected. The voice recognition control device 1 controls the various in-vehicle devices based on the voice input from the microphone array 21 and the images captured by the indoor camera 22.

[2. Control configuration]
The voice recognition control device 1 identifies the person who uttered a voice and recognizes the content of the voice using an acoustic model suited to that person. For example, when someone boards the vehicle 10, the device recognizes who that person is and stores the position where the person is seated. When someone speaks in the cabin, the device specifies who spoke and recognizes the voice using the acoustic model corresponding to that person. Different acoustic models are applied to persons whose voices have different characteristics. The acoustic model referred to here is assumed to include a language model.

As elements for carrying out this control, the voice recognition control device 1 is provided with a specifying unit 2, a selection unit 3, a recognition unit 4, a database 5, and a control unit 6. These represent functions of part of the program executed by the voice recognition control device 1 and are assumed to be implemented in software. However, some or all of the functions may be implemented in hardware (electronic control circuits), or in a combination of software and hardware.

The database 5 is a storage device in which the various data related to voice recognition are recorded and stored. It holds a large number of acoustic models used in voice recognition, each associated with the person it corresponds to. One of the acoustic models in the database 5 is created in advance from the voice of a standard speaker. The other acoustic models are created, trained, and updated from the voices of persons other than the standard speaker, and each is suited (optimized) to recognizing the voice of an individual occupant of the vehicle 10. For example, when a feature of the input voice (for example, its acoustic spectrum) differs greatly from that of the standard speaker's voice, a new acoustic model corresponding to the input voice is created, and that model is trained and updated based on the recognition results for the voice.
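The rule for deciding when a new acoustic model is needed can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the text: the feature vector, the Euclidean distance measure, the threshold value, and the names `STANDARD_FEATURES`, `DISTANCE_THRESHOLD`, and `needs_dedicated_model` are all hypothetical.

```python
import math

# Hypothetical spectral features of the standard speaker (placeholder values).
STANDARD_FEATURES = [0.2, 0.5, 0.3]
# Hypothetical distance beyond which a voice counts as "far" from the standard speaker.
DISTANCE_THRESHOLD = 0.4


def feature_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def needs_dedicated_model(features):
    """Return True when the input voice differs strongly from the standard speaker."""
    return feature_distance(features, STANDARD_FEATURES) > DISTANCE_THRESHOLD
```

A real system would likely compare richer acoustic features; the sketch only shows the threshold decision itself.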

The specifying unit 2 specifies the person who uttered a voice based on at least the voice input from the microphone array 21. There are two approaches: specifying the person in real time at the moment the voice is detected, or recording the relationship between each person and their seating position when they board the vehicle 10 and then specifying the person from the sound source position of the detected voice.
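The second approach, looking up the speaker from the sound source position, might look like the sketch below. It assumes that the microphone array reports an azimuth angle and that positive angles point toward the driver's seat; neither detail is stated in the text, and the names `seat_occupants`, `seat_from_azimuth`, and `speaker_from_position` are hypothetical.

```python
# Hypothetical record made at boarding time: seat name -> recognized person.
seat_occupants = {"driver_seat": "user_A", "passenger_seat": "user_B"}


def seat_from_azimuth(azimuth_deg):
    """Map a sound-source direction to a seat.

    Assumption: the microphone array reports an azimuth in degrees, with
    non-negative values toward the driver's seat and negative values toward
    the passenger seat.
    """
    return "driver_seat" if azimuth_deg >= 0 else "passenger_seat"


def speaker_from_position(azimuth_deg, occupants):
    """Look up who sits where the voice came from; None if the seat is empty."""
    return occupants.get(seat_from_azimuth(azimuth_deg))
```

With this design the costly recognition of the person happens once at boarding, and each later utterance needs only a direction estimate and a table lookup.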

In the former case, the person can be specified from waveform patterns, frequency patterns, voiceprint patterns, and the like contained in the voice. Alternatively, human faces can be extracted from the image captured by the indoor camera 22, and the person whose lip movement matches the timing of the voice can be specified. In the latter case, the person may be specified by analyzing the image captured by the indoor camera 22 (for example, by face authentication), or by having the person utter some voice and applying the same techniques as in the former case. Information on the specified person is passed to the selection unit 3.
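Matching lip movement against voice timing can be illustrated with a simple frame-by-frame comparison. The sketch assumes that upstream processing has already reduced the audio and the camera image to per-frame boolean flags (voice active, lips moving); that preprocessing and the names `agreement` and `identify_speaker` are assumptions, not part of the text.

```python
def agreement(voice_active, lips_moving):
    """Fraction of frames where lip motion and voice activity coincide."""
    if not voice_active:
        raise ValueError("at least one frame is required")
    matches = sum(1 for v, m in zip(voice_active, lips_moving) if v == m)
    return matches / len(voice_active)


def identify_speaker(voice_active, lip_tracks):
    """Return the occupant whose lip motion best matches the voice timing.

    lip_tracks maps each occupant to a list of per-frame lip-motion flags
    of the same length as voice_active.
    """
    return max(lip_tracks, key=lambda person: agreement(voice_active, lip_tracks[person]))
```

For example, if the voice is active in frames 1 to 3 and only one occupant's lips move in those frames, that occupant is returned.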

The selection unit 3 selects, from the multiple acoustic models recorded and stored in the database 5, the acoustic model corresponding to the person specified by the specifying unit 2. For example, when user A, the owner of the vehicle 10, issues a voice command, a first acoustic model corresponding to user A is selected. When user B, a different person, issues a voice command, a second acoustic model corresponding to user B is selected. The selection result is passed to the recognition unit 4.
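In its simplest form, the selection step is a lookup keyed by the specified person. The sketch below assumes the models are held in a dictionary and that an unregistered person falls back to the standard-speaker model; the fallback and the names `STANDARD_MODEL`, `acoustic_models`, and `select_acoustic_model` are illustrative assumptions.

```python
STANDARD_MODEL = "standard_speaker_model"

# Hypothetical contents of database 5: person -> acoustic model identifier.
acoustic_models = {
    "user_A": "first_acoustic_model",
    "user_B": "second_acoustic_model",
}


def select_acoustic_model(person, models):
    """Return the model registered for the person, else the standard-speaker model."""
    return models.get(person, STANDARD_MODEL)
```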

The recognition unit 4 recognizes the voice using the acoustic model selected by the selection unit 3. Here the context of the utterance is analyzed, the type of in-vehicle device to be controlled is estimated, and the content of the voice command for that control target is recognized. Any specific voice recognition method may be used, including known voice recognition techniques. For example, the phonemes contained in the voice are analyzed based on the acoustic model, then words and phrases formed by sequences of phonemes are analyzed based on the language model, and the semantic content is recognized. The recognition result is passed to the control unit 6.
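Estimating the control target from recognized text could, in a very reduced form, be done by keyword matching. This is only a toy sketch: the text leaves the recognition method open, and the keyword table `DEVICE_KEYWORDS` and the function `estimate_device` are hypothetical.

```python
# Hypothetical keyword table: device name -> words that suggest it.
DEVICE_KEYWORDS = {
    "navigation": ("route", "destination", "map"),
    "air_conditioner": ("temperature", "cooler", "warmer"),
    "car_audio": ("volume", "song", "radio"),
}


def estimate_device(text):
    """Return the first device whose keywords appear in the recognized text, else None."""
    words = text.lower().split()
    for device, keywords in DEVICE_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return device
    return None
```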

The control unit 6 actually controls the control target based on the result recognized by the recognition unit 4. The control target is controlled only when the combination of the person specified by the specifying unit 2 and the control target is appropriate. For example, user A, the owner of the vehicle 10, is allowed to operate the navigation device 11, the air conditioner device 12, and the car audio device 13 by voice command. In contrast, user C, an acquaintance of user A, is not allowed to operate these in-vehicle devices by voice command. When the combination of person and control target is not appropriate, the control target is not controlled and the voice command is canceled.
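The permission check performed before a command is executed can be sketched as a table lookup. The table layout, the device names, and the function `execute_command` are assumptions for illustration; the text only states that inappropriate combinations of person and control target are canceled.

```python
# Hypothetical permission table: person -> devices they may control by voice.
permissions = {
    "user_A": {"navigation", "air_conditioner", "car_audio"},
    "user_C": set(),
}


def execute_command(person, device, action, table):
    """Run the command only if the person may control the device; otherwise cancel.

    Returns a description of the executed command, or None when it is canceled.
    """
    if device in table.get(person, set()):
        return f"{device}: {action}"
    return None  # combination not permitted, so the command is canceled
```

Unknown persons get an empty permission set here, so their commands are canceled as well; whether that default is desirable is a design choice the text does not address.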

[3. Flowchart]
FIG. 3 is an example flowchart for explaining the control performed by the voice recognition control device 1. First, voice information detected by the microphone array 21 and image information captured by the indoor camera 22 are input to the voice recognition control device 1 (step A1), and it is determined whether someone has boarded the vehicle 10 (step A2). When boarding is detected, the person is recognized and the person's seating position is specified (step A3). It is then determined whether a voice has been input (step A4).

If a voice has been input, the specifying unit 2 specifies the person who uttered it based on at least the voice information (step A5). It is then determined whether that person is a user already registered in the database 5 (step A6); if not, a new acoustic model is additionally registered in the database 5 (step A7). After that, the selection unit 3 selects the acoustic model corresponding to the person (step A8), and the recognition unit 4 performs voice recognition based on the selected acoustic model (step A9).
When voice recognition is complete, the person's acoustic model is trained and updated based on the recognition result (step A10). The control unit 6 determines whether the specified person and the control target correspond, and if they do, the control target is controlled. If they do not correspond, the voice command is canceled and no control is performed.
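Steps A5 to A10 can be summarized as a single function. This is a sketch under stated assumptions: the database is modeled as a dictionary, a model as a small dictionary with a sample counter, and `identify`, `recognize`, and `update` are hypothetical callables standing in for the specifying unit, the recognition unit, and the model update.

```python
def handle_utterance(audio, identify, database, recognize, update):
    """Process one utterance following steps A5 to A10.

    identify(audio) -> person
    recognize(audio, model) -> recognition result
    update(model, audio, result) -> updated model
    """
    person = identify(audio)                          # A5: specify the speaker
    if person not in database:                        # A6: registered user?
        database[person] = {"samples": 0}             # A7: register a new model
    model = database[person]                          # A8: select the person's model
    result = recognize(audio, model)                  # A9: recognize with that model
    database[person] = update(model, audio, result)   # A10: train and update the model
    return person, result
```

Passing the units in as callables keeps the flow testable without real audio; the permission check by the control unit would follow on the returned person and result.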

[4. Operation and effects]
(1) As described above, selecting an acoustic model corresponding to the person specified by the specifying unit 2 improves voice recognition accuracy in a vehicle cabin that multiple occupants may board. As a result, even when the structure of a speaker's vocal organs, manner of vocalization, or social environment differs greatly from that of the standard speaker, the content of a voice command can be recognized accurately, and the controllability of the various in-vehicle devices can be improved.

(2) Regarding the control in the specifying unit 2, using the image information from the indoor camera 22 together with the voice information acquired by the microphone array 21 makes it possible to specify the person more quickly and easily than with voice alone. Voice recognition accuracy and the controllability of the various in-vehicle devices can therefore be improved with a simple configuration.
(3) In addition, comparing the lip movement in the image with the timing of the voice makes it possible to reliably determine that the person is the speaker.

(4) Training and updating the acoustic model of the specified person with that person's voice further improves subsequent voice recognition accuracy. In addition, because a new acoustic model is additionally registered for a user who is not yet registered in the database 5, the extensibility of the database 5 is increased, and voice recognition accuracy and the controllability of the various in-vehicle devices can be improved with a simple configuration.

(5) Regarding the control in the control unit 6, determining the correspondence between the person specified by the specifying unit 2 and the control target makes it easy to set the authority to use the in-vehicle devices for each user. For example, when user A, the owner of the vehicle 10, has an acquaintance, user C, temporarily take over driving, restricting voice commands from user C prevents erroneous operation by user C, who is unfamiliar with voice commands, and improves convenience.

[5. Modifications]
In the embodiment described above, an acoustic model corresponding to each person specified by the specifying unit 2 is recorded and stored in the database 5, but an acoustic model does not necessarily have to be set for each individual user. For example, when the features of the input voice do not differ greatly from those of the standard speaker's voice, a dedicated acoustic model for that person is unnecessary. The features of the input voice may also be classified into several types (for example, six types: young men, young women, middle-aged men, middle-aged women, elderly men, and elderly women), with an acoustic model set for each type. In this case, voice recognition is performed using the acoustic model for the type associated with the specified person.
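Classifying a voice into one of a few types can be sketched as a nearest-centroid lookup. The text names the types but not how the classification is done, so the distance measure, the two-value feature vectors, and the placeholder centroids below are all assumptions.

```python
def nearest_category(features, centroids):
    """Return the category whose representative features are closest to the input."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda name: squared_distance(features, centroids[name]))


# Placeholder centroids for two of the six types (values are illustrative only).
example_centroids = {
    "young_men": (1.0, 0.0),
    "young_women": (0.0, 1.0),
}
```

The acoustic model for the returned type would then be used in place of a per-person model.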

In the embodiment described above, the entire process from voice recognition to control of the control target is managed by the voice recognition control device 1, but some or all of the functions of the voice recognition control device 1 may be moved outside the vehicle 10. For example, the voice recognition control device 1 may be made connectable to a network such as the Internet, a mobile phone wireless communication network, or another digital wireless communication network, and some or all of its functions may be implemented on a server on the network. This makes the database 5 easier to manage and update, and can improve voice recognition accuracy and gesture recognition accuracy.

DESCRIPTION OF SYMBOLS
1 Voice recognition control device
2 Specifying unit
3 Selection unit
4 Recognition unit
5 Database
6 Control unit
10 Vehicle
11 Navigation device
12 Air conditioner device
13 Car audio device
14 Driver's seat
15 Passenger seat
16 Display device
21 Microphone array
22 Indoor camera

Claims

1. A voice recognition control system that controls an in-vehicle device using the voice of a vehicle occupant as an input signal, the system comprising:
a specifying unit that specifies, based on at least the voice, the person who uttered the voice;
a selection unit that selects an acoustic model corresponding to the person specified by the specifying unit; and
a recognition unit that recognizes the voice using the acoustic model selected by the selection unit.

2. The voice recognition control system according to claim 1, further comprising an indoor camera that captures an image of the vehicle interior,
wherein the specifying unit specifies the person who uttered the voice using the voice and the image.

3. The voice recognition control system according to claim 2, wherein the specifying unit specifies the person by comparing the movement of the person's lips with the timing of the voice.

4. The voice recognition control system according to any one of claims 1 to 3, further comprising a data updating unit that uses the voice to create an acoustic model corresponding to the person specified by the specifying unit.

5. The voice recognition control system according to item 1, further comprising a control unit that controls the control target when the person specified by the specifying unit and the control target recognized by the recognition unit correspond to each other.