JP6604151B2

JP6604151B2 - Speech recognition control system

Info

Publication number: JP6604151B2
Application number: JP2015219116A
Authority: JP
Inventors: 真吾入方; 宗義難波
Original assignee: Mitsubishi Motors Corp
Current assignee: Mitsubishi Motors Corp
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2019-11-13
Anticipated expiration: 2035-11-09
Also published as: JP2017090615A

Description

The present invention relates to a voice recognition control system that controls in-vehicle devices by the voice of a vehicle occupant.

Information guidance devices have been proposed that provide a driver with various kinds of information, taking the driver's voice and line of sight as input operations. For example, a device is known that, when the driver says "What is that mountain?", answers with the name of the mountain in the driver's line of sight (see Patent Document 1). A technique is also known in which, when the driver says "What does that store recommend?", the map data of a car navigation system is used to search for the facility in the driver's line of sight, and a guidance message for that facility is output (see Patent Document 2). Combining voice input with line-of-sight input makes it possible to provide the desired information without reducing the driver's concentration.

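Patent Document 1: Japanese Unexamined Patent Application Publication No. 2003-329463 (JP 2003-329463 A)
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2009-031065 (JP 2009-031065)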

However, the type of information an occupant needs may differ from occupant to occupant. For example, when the vehicle's destination is a large commercial facility, the driver may want information such as the location of the parking lot within the facility, the size of the parking spaces, height limits, and parking fees. Passengers other than the driver, by contrast, may have no need for parking-related information and may instead want to know about the stores in the facility or its business hours. Existing techniques do not sufficiently evaluate how appropriate the provided information is for the occupant, and so leave room for improvement.

One object of the present invention, devised in view of the problem described above, is to provide a voice recognition control system that improves convenience by making it possible to provide appropriate information to each occupant. The invention is not limited to this object. Achieving operational effects that are derived from the configurations shown in the "Mode for Carrying Out the Invention" described later, and that cannot be obtained with conventional techniques, can also be regarded as another object of the present invention.

(1) The voice recognition control system disclosed here controls an in-vehicle device using the voice of a vehicle occupant as an input signal. The system includes a voice recognition unit that recognizes an utterance position and utterance content based at least on the voice, and a gesture detection unit that detects a gesture of the person at the utterance position based on an image of the vehicle interior captured by an interior camera. The system also includes a database that stores a plurality of pieces of information about the facility designated by the gesture. The system further includes a control unit that selects and outputs part of the plurality of pieces of information based on the utterance content and the utterance position.

(2) Preferably, the control unit outputs parking/stopping facility information for the facility when the utterance position is the driver's seat.
(3) Preferably, the control unit outputs business content information for the facility when the utterance position is other than the driver's seat.
(4) Preferably, the control unit outputs information that depends on the time at which the voice was input. For example, it preferably outputs business hours information when that time is outside the business hours of the facility.

A vehicle speed detection unit that detects the vehicle speed is preferably also provided. For example, the control unit preferably selects and outputs part of the plurality of pieces of information according to the vehicle speed detected by the vehicle speed detection unit. The vehicle speed may serve as a condition for outputting part of the plurality of pieces of information, or as a condition for selecting which part to output.

Information suited not only to the utterance content but also to the utterance position can be provided, which improves convenience.

FIG. 1 is a top view of a vehicle to which the voice recognition control system is applied.
FIG. 2 is a diagram showing the configuration of the voice recognition control system.
FIG. 3 is a diagram showing the relationship between the direction of a gesture and the object of guidance.
FIG. 4 is a table showing the relationship between the object of guidance, the utterance content, the utterance position, and the type of information.
FIG. 5 is a flowchart for explaining the control performed by the voice recognition control system.

A voice recognition control system according to an embodiment will be described with reference to the drawings. The embodiment described below is merely an example, and there is no intention to exclude modifications or applications of techniques that are not explicitly described in it. Each configuration of the embodiment can be modified in various ways without departing from its spirit. The configurations can also be selected as necessary or combined as appropriate.

[1. Device configuration]
The voice recognition control system of this embodiment is applied to the vehicle 10 shown in FIG. 1. A driver's seat 14 and a front passenger seat 15 are provided in the cabin of the vehicle 10, and an instrument panel (dashboard) is arranged at the front of the cabin. On the cabin-facing side of the instrument panel, a steering device and instruments are arranged in front of the driver's seat 14, and a glove box is arranged in front of the front passenger seat 15. At the center of the instrument panel in the vehicle width direction, a multi-communication display device 16 is provided that consolidates user interfaces such as the car navigation function and the audio-visual function. The display device 16 is diagonally forward left from the viewpoint of the driver in the driver's seat 14, and diagonally forward right from the viewpoint of the occupant (front passenger) in the front passenger seat 15.

The display device 16 is an electronic device that includes a general-purpose video display (display screen) with a touch panel, a speaker (audio device), and an electronic control unit (computer) containing a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and the like. The display device 16 is connected to in-vehicle devices such as the navigation device 11, the air conditioner device 12, the car audio device 13, and a multimedia system, and can function as an input/output device for them. For example, route information to the destination, map information, and traffic congestion information provided by the navigation device 11 can be shown on the display screen of the display device 16, and voice guidance can also be output. The display screen can also reproduce and display a variety of audio-visual information, such as programs received by an in-vehicle digital terrestrial broadcast tuner, video content on recording media, video captured by a rear-view camera, and the operating interfaces of the air conditioner device 12 and the car audio device 13.

The vehicle 10 is also equipped with a voice recognition control device 1 that controls the various in-vehicle devices using an occupant's voice as an input signal. The voice recognition control device 1 is an electronic device (ECU, electronic control unit) that integrates a processor such as a CPU or MPU (Micro Processing Unit) with ROM, RAM, nonvolatile memory, and the like. The processor here is a processing device that incorporates, for example, a control unit (control circuit), an arithmetic unit (arithmetic circuit), and cache memory (registers). The ROM, RAM, and nonvolatile memory are memory devices that store programs and working data. The control performed by the voice recognition control device 1 is recorded as firmware or application programs in the ROM, RAM, nonvolatile memory, or removable media. When a program is executed, its contents are loaded into a memory space in the RAM and executed by the processor.

As shown in FIG. 2, the input devices of the voice recognition control device 1 include a microphone array 21, an interior camera 22, and a vehicle speed sensor 23. The microphone array 21 is a voice input device in which a plurality of microphones are arranged in a predetermined layout, and the interior camera 22 is a wide-angle video camera capable of capturing the entire cabin. The microphone array 21 and the interior camera 22 are built into, for example, the ceiling at the center in the vehicle width direction. The vehicle speed sensor 23 outputs a pulse signal corresponding to the rotational speed of a wheel. The output devices (control targets) of the voice recognition control device 1 include the navigation device 11, the air conditioner device 12, the car audio device 13, and the display device 16. The voice recognition control device 1 controls the various in-vehicle devices based on the voice input from the microphone array 21, the image captured by the interior camera 22, and the pulse information detected by the vehicle speed sensor 23.

[2. Control configuration]
The voice recognition control device 1 has a function of providing, based on an occupant's voice and gesture, information about the facility designated by the gesture. A facility here means a building or structure located outside the vehicle 10, for example a building, equipment, a factory, a park, a stadium, or a station, and preferably includes the POIs (Points Of Interest) recorded in the map information built into the navigation device 11. For example, when some voice is input from the microphone array 21, the position from which it was uttered (the utterance position) and the utterance content are first recognized. Based on the image captured by the interior camera 22, the gesture made by the person at the utterance position is detected, and the facility that the gesture represents (the facility designated by the gesture) is detected. Specific examples of gestures include pointing at the facility with a finger and looking at the facility (directing the line of sight toward it). When the utterance content is a "request for guidance", information about the facility is provided to the occupant. The type of information provided is set according to the utterance position.

As elements for carrying out this control, the voice recognition control device 1 is provided with a vehicle speed detection unit 2, a voice recognition unit 3, a gesture detection unit 4, a database 5, and a control unit 6. These represent functions of the program executed by the voice recognition control device 1 and are assumed to be implemented in software. However, some or all of the functions may be implemented in hardware (electronic control circuits), or in a combination of software and hardware.
The vehicle speed detection unit 2 acquires (detects or calculates) the vehicle speed based on the pulse signal output by the vehicle speed sensor 23. The vehicle speed acquired here is passed to the gesture detection unit 4 and the control unit 6.
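As a rough illustration of how a wheel pulse signal can be turned into a vehicle speed, the following Python sketch counts pulses over a fixed interval. The number of pulses per wheel revolution and the tire circumference are assumed values chosen for the example; the embodiment does not specify them.

```python
def vehicle_speed_kmh(pulse_count, interval_s,
                      pulses_per_rev=48, tire_circumference_m=2.0):
    """Estimate vehicle speed from wheel pulses counted over an interval.

    pulses_per_rev and tire_circumference_m are illustrative assumptions,
    not values taken from the embodiment.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    revolutions_per_s = pulse_count / pulses_per_rev / interval_s
    metres_per_s = revolutions_per_s * tire_circumference_m
    return metres_per_s * 3.6  # convert m/s to km/h
```

With these assumed parameters, one full wheel revolution per second corresponds to 2 m/s.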

The voice recognition unit 3 recognizes the utterance position and the utterance content based at least on the voice input from the microphone array 21. Here, for example, it is determined whether the speaker is the person seated in the driver's seat 14 (the driver), the person seated in the front passenger seat 15 (the front passenger), or another occupant (a rear-seat occupant). The speaker's position can be identified from the magnitudes and delays of the voice signals picked up by the individual microphones of the microphone array 21. Alternatively, the utterance position can be identified by analyzing the image captured by the interior camera 22 and comparing the lip movements of the people in the image with the timing at which the voice was detected.
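The delay-based localization mentioned above can be sketched with a simple cross-correlation between two microphone channels. This is only an illustration under strong assumptions: two microphones (left and right), sampled signals given as plain lists, and a hypothetical min_lag threshold. It is not the recognition method of the embodiment, which leaves the technique open.

```python
def best_lag(left, right, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means the right channel is a delayed copy of the left,
    i.e. the sound reached the left microphone first.
    """
    def correlation(lag):
        return sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
    return max(range(-max_lag, max_lag + 1), key=correlation)


def speaker_side(left, right, max_lag=8, min_lag=1):
    """Classify the speaker as 'left', 'right', or 'center' of the cabin."""
    lag = best_lag(left, right, max_lag)
    if lag >= min_lag:
        return "left"    # sound arrived at the left microphone first
    if lag <= -min_lag:
        return "right"   # sound arrived at the right microphone first
    return "center"      # no clear delay between the channels
```

In the layout described for FIG. 1, where the display is to the driver's front left, a "right" result would correspond to the driver's side. Telling front seats from rear seats would need more microphones or the camera-based lip check.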

The utterance content is recognized by classifying it into one of three types: "request for guidance", "control command", or "other". For example, when the utterance content includes a voice command such as "What is that?", "What was that?", or "Explain?", it is judged to be a "request for guidance". When the utterance content includes a voice command such as "on", "off", "activate", or "stop", it is judged to be a "control command". When the utterance content includes none of these voice commands, it is judged to fall under "other". Any specific voice recognition method may be used, and known voice recognition techniques can be adopted. For example, the phonemes contained in the voice are analyzed based on an acoustic model, then the words and phrases formed by the sequence of phonemes are analyzed based on a language model, and their meaning is recognized. The utterance position and utterance content recognized here are passed to the gesture detection unit 4 and the control unit 6.
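The three-way classification can be sketched as keyword matching on already-transcribed text. The English keyword lists and the function name are assumptions made for illustration, and a real system would work on the output of the acoustic and language models. Control commands are checked first, which mirrors the order of the flowchart in section 3.

```python
import re

# Illustrative keyword lists based on the examples in the text (assumed).
CONTROL_WORDS = {"on", "off", "activate", "stop"}
GUIDANCE_PHRASES = ("what is that", "what was that", "explain")


def classify_utterance(text):
    """Classify text as 'control_command', 'guidance_request', or 'other'."""
    # Lowercase and replace punctuation with spaces before tokenizing.
    normalized = re.sub(r"[^a-z\s]", " ", text.lower())
    words = normalized.split()
    joined = " ".join(words)
    if any(word in CONTROL_WORDS for word in words):
        return "control_command"
    if any(phrase in joined for phrase in GUIDANCE_PHRASES):
        return "guidance_request"
    return "other"
```

Matching whole words for the control keywords avoids false hits such as the "on" inside "station".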

The gesture detection unit 4 detects, based on the image captured by the interior camera 22, the gesture made by the person at the utterance position (that is, the speaker) and the utterance target designated by that gesture. For a "request for guidance", the utterance target is a facility outside the vehicle 10. For a "control command", the utterance target may be one of the in-vehicle devices mounted on the vehicle 10 or its operation buttons, an indicator displayed on the instrument panel, an icon displayed on the display device 16, and so on.

The utterance target can be detected by estimating what lies at the end of a virtual line extended from the utterance position recognized by the voice recognition unit 3 in the direction indicated by the gesture. When a pointing gesture is detected, the position of the hand is estimated by image analysis, and a virtual line is extended in the direction of the finger from the position of the speaker's hand in the cabin, which allows the target to be detected accurately. When the line of sight is detected as the gesture, the position and orientation of the face are estimated by image analysis, and a virtual line is extended in the gaze direction from the position of the speaker's face in the cabin, which likewise allows the target to be detected accurately. The target detected here is passed to the control unit 6.

As shown in FIG. 3, when the utterance content is a "request for guidance", the gesture detection unit 4 of this embodiment identifies the facility that lies in the direction indicated by the gesture, taking the position of the vehicle 10 at the time of the utterance as the reference. A known method can be adopted for identifying the facility from the gesture. For example, the position of the vehicle 10 at the time of the utterance can be determined by the navigation device 11, and the facility can be identified from the map information built into the navigation device 11 (see Patent Documents 1 and 2).
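The virtual-line idea can be sketched in two dimensions: starting from a reference point, pick the nearest candidate whose bearing lies within a small angle of the indicated direction. The flat map coordinates, the 10-degree tolerance, and the function name are assumptions for illustration; the embodiment relies on known methods and does not prescribe this one.

```python
import math


def find_target(origin, direction, candidates, max_angle_deg=10.0):
    """Return the nearest candidate lying close to the pointing direction.

    origin and direction are (x, y) tuples; candidates maps a name to an
    (x, y) position. Returns None if nothing lies within max_angle_deg.
    """
    ox, oy = origin
    dx, dy = direction
    norm = math.hypot(dx, dy)
    if norm == 0:
        return None
    best = None  # (distance, name) of the best match so far
    for name, (cx, cy) in candidates.items():
        vx, vy = cx - ox, cy - oy
        dist = math.hypot(vx, vy)
        if dist == 0:
            continue
        cos_angle = (vx * dx + vy * dy) / (dist * norm)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle <= max_angle_deg and (best is None or dist < best[0]):
            best = (dist, name)
    return best[1] if best else None
```

Preferring the nearest candidate within the tolerance is one possible tie-break when several facilities line up along the same direction.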

The database 5 is a storage device in which the various data related to voice recognition are recorded and stored. The acoustic models and language models used in voice recognition are recorded and stored here. The acoustic model and the language model are created in advance based on the voice of a standard speaker. Specific acoustic and language models can be created based on known techniques (for example, Japanese Unexamined Patent Application Publication No. 2002-189492).

The database 5 also records and stores the relationship between the utterance content and utterance position and the type of information provided to the occupant when the utterance content is a "request for guidance". As shown in FIG. 4, the database 5 of this embodiment stores the relationship between each combination of target facility, utterance content, and utterance position and the type of information provided to the occupant. FIG. 4 shows, for example, that when guidance is requested for facility A and the utterance position is the driver's seat 14, the parking/stopping facility information for facility A is provided. It also shows that the business content information for facility A is provided if the utterance position is the front passenger seat 15, and that general information about facility A is provided if the utterance position is any other seat.
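The relationship in FIG. 4 can be sketched as a small lookup table keyed by utterance position. The key names and the facility texts below are made-up placeholders for illustration, not data from the patent.

```python
# Which type of information each utterance position receives (per FIG. 4).
INFO_TYPE_BY_POSITION = {
    "driver_seat": "parking",
    "passenger_seat": "business_content",
}
DEFAULT_INFO_TYPE = "general"  # any other seat

# Placeholder facility records (hypothetical contents).
FACILITY_INFO = {
    "facility_A": {
        "parking": "Parking information for facility A",
        "business_content": "Business content information for facility A",
        "general": "General information about facility A",
    },
}


def select_information(facility, utterance_content, utterance_position):
    """Return the information to present, or None if no guidance was asked."""
    if utterance_content != "guidance_request":
        return None
    info_type = INFO_TYPE_BY_POSITION.get(utterance_position,
                                          DEFAULT_INFO_TYPE)
    return FACILITY_INFO[facility][info_type]
```

Keeping the seat-to-type mapping separate from the facility records means the selection rule can be changed without touching the stored facility data.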

The control unit 6 controls the various in-vehicle devices using the relationships recorded and stored in the database 5, based on the utterance position and utterance content recognized by the voice recognition unit 3 and the utterance target detected by the gesture detection unit 4. The control unit 6 has two main functions.
The first function is to control a control target by voice (hands-free control function). When the utterance content recognized by the voice recognition unit 3 includes a voice command meaning a "control command", the control unit 6 controls the operating state of the utterance target. Any specific method may be used for the hands-free control function, and known methods can be adopted.

The second function is to output the information the occupant wants to know to the display device 16 (guidance function). When the utterance content recognized by the voice recognition unit 3 does not include a voice command meaning a "control command" but does include a voice command meaning a "request for guidance", the control unit 6 provides the occupant with information about the facility that is the utterance target. The information provided is selected, according to the utterance position, from among the many pieces of information about the facility. When the utterance content includes neither a voice command meaning a "request for guidance" nor one meaning a "control command", the utterance target is not controlled and the voice command is cancelled.

[3. Flowchart]
FIG. 5 is an example flowchart for explaining the control performed by the voice recognition control device 1. First, the voice information detected by the microphone array 21, the image information captured by the interior camera 22, and the pulse information from the vehicle speed sensor 23 are input to the voice recognition control device 1 (step A1), and it is determined whether a voice has been input (step A2). If some voice has been input, the voice recognition unit 3 recognizes the utterance position and the utterance content based at least on that voice (step A3). The gesture detection unit 4 detects the gesture of the person at the utterance position based on the image captured by the interior camera 22 (step A4), and identifies the utterance target designated by that gesture (step A5).

The control unit 6 determines whether the utterance content recognized by the voice recognition unit 3 includes a voice command meaning a "control command" (step A6). If this condition is satisfied, the operating state of the utterance target designated by the gesture (for example, the navigation device 11 or the air conditioner device 12) is controlled (step A7).

If the condition of step A6 is not satisfied, it is determined whether the utterance content recognized by the voice recognition unit 3 includes a voice command meaning a "request for guidance" (step A8). If this condition is satisfied and the utterance position is the driver's seat 14, the parking/stopping facility information intended for the driver is selected and output, and voice guidance and video guidance about the utterance target (for example, facility A or facility B) are provided from the display device 16 (steps A9 and A10). If the utterance position is the front passenger seat 15, the business content information intended for the front passenger is selected and output. If the utterance position is neither the driver's seat 14 nor the front passenger seat 15, general information about the facility is selected and output. If the condition of step A8 is also not satisfied, the utterance target is not controlled and the voice command is cancelled.
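The branching of steps A6 to A10 can be sketched as a single decision function. The function name, the tuple-shaped return values, and the seat labels are hypothetical and chosen only to make the order of the checks visible. Actual device control and output are left out.

```python
def decide_action(has_control_command, has_guidance_request,
                  utterance_position, target):
    """Mirror the branch order of the flowchart (steps A6 to A10)."""
    # Step A6: a control command takes priority over a guidance request.
    if has_control_command:
        return ("control", target)                # step A7
    # Step A8: otherwise check for a guidance request.
    if has_guidance_request:
        if utterance_position == "driver_seat":
            info_type = "parking"
        elif utterance_position == "passenger_seat":
            info_type = "business_content"
        else:
            info_type = "general"
        return ("guide", target, info_type)       # steps A9, A10
    # Neither kind of command: the voice input is cancelled.
    return ("cancel",)
```

An utterance that contains both kinds of command is therefore treated as a control command, which matches the condition stated for the guidance function.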

[4. Operation and effects]
(1) In the voice recognition control device 1 described above, when information is provided using both voice input and line-of-sight input, part of the plurality of pieces of information is selectively output according to the utterance position. Selecting the information provided to the occupant according to the utterance position makes it possible to provide information suited not only to the utterance content but also to the utterance position, which improves convenience.

(2) For example, when the driver seated in the driver's seat 14 requests guidance about facility A, the parking/stopping facility information for facility A is provided. This lets the driver check whether parking is possible before actually moving the vehicle 10 to the parking lot of facility A. The driver can also check the parking fees and opening hours. The driver can therefore obtain information useful for parking and stopping the vehicle, which improves convenience.

(3) When the front passenger seated in the front passenger seat 15 requests guidance about facility A, the business content information for facility A is provided. This lets the front passenger learn the business hours and any special offers before actually visiting facility A. The front passenger can therefore obtain information useful for using the facility, which improves convenience.

[5. Modifications]
In the embodiment described above, the type of information is selected based on the utterance content and the utterance position, as shown in FIG. 4, but the utterance time may also be taken into account. For example, when the utterance time is within the business hours of the facility, the business content information for the facility may be provided, and when the utterance time is outside the business hours, only the business hours information may be provided. Changing the type of information according to the time of day in this way makes it possible to provide information better suited to the occupant, which further improves convenience.
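The time-dependent selection can be sketched as a simple comparison against the facility's opening and closing times. The function and label names are assumptions, and the sketch assumes business hours that do not cross midnight.

```python
from datetime import time


def info_type_for_time(utterance_time, opens, closes):
    """Choose the information type from the time of the utterance.

    Assumes opens < closes, i.e. business hours within a single day.
    """
    is_open = opens <= utterance_time < closes
    return "business_content" if is_open else "business_hours"
```

For a facility open from 10:00 to 20:00, an utterance at 21:30 would yield only the business hours information.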

In the embodiment described above, every step from voice recognition to control of the control target is managed by the voice recognition control device 1, but some or all of the functions of the voice recognition control device 1 may be moved outside the vehicle 10. For example, the voice recognition control device 1 may be made connectable to a network such as the Internet, a mobile phone wireless network, or another digital wireless network, and some or all of its functions may be implemented on a server on that network. This makes the database 5 easier to manage and update, and allows voice recognition accuracy and gesture recognition accuracy to be improved.

In the control of the above-described embodiment, a vehicle speed condition may be added to the conditions for executing the guidance function. For example, the guidance function may be executed on the condition that the vehicle speed detected by the vehicle speed detection unit 2 is at or below a predetermined vehicle speed (for example, 10 km/h or less). This restricts activation of the guidance function while the vehicle 10 is traveling at medium to high speed, and more reliably prevents erroneous recognition of voice commands and erroneous detection of gestures.
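The vehicle speed condition can be sketched as a simple gate placed in front of the guidance function. The names `GUIDANCE_SPEED_LIMIT_KMH` and `guidance_enabled` are hypothetical, and the 10 km/h default is only the example value given in the text, not a required threshold.

```python
# Example threshold taken from the text ("for example, 10 km/h or less").
GUIDANCE_SPEED_LIMIT_KMH = 10.0


def guidance_enabled(vehicle_speed_kmh: float,
                     limit_kmh: float = GUIDANCE_SPEED_LIMIT_KMH) -> bool:
    """Return True when the guidance function may be executed.

    Hypothetical helper: vehicle_speed_kmh stands in for the value
    reported by the vehicle speed detection unit 2.
    """
    # The condition is "at or below the predetermined vehicle speed".
    return vehicle_speed_kmh <= limit_kmh
```

A caller would check this gate before processing a voice command and gesture, so that requests made at medium to high speed are not acted upon.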

Alternatively, the type of information provided to the occupant may be selected according to the vehicle speed. For example, when the driver requests guidance for the facility A and the vehicle speed is at or below a predetermined slow speed (for example, 20 km/h or less), it is determined that the driver intends to park or stop the vehicle 10 at the facility A, and parking/stopping facility information is provided. On the other hand, when the vehicle speed exceeds the slow speed, it is determined that there is no intention to park or stop, and the business content information of the facility A is provided. By selecting and changing the type of information according to the vehicle speed in this way, information suited to the traveling state of the vehicle 10 can be provided, and convenience can be improved.
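The speed-dependent selection for a request made by the driver can be sketched as follows. The function name `select_info_by_speed`, the return labels, and the constant `SLOW_SPEED_KMH` are assumptions for illustration; 20 km/h is the example value from the text.

```python
# Example slow-speed threshold from the text ("for example, 20 km/h or less").
SLOW_SPEED_KMH = 20.0


def select_info_by_speed(vehicle_speed_kmh: float,
                         slow_speed_kmh: float = SLOW_SPEED_KMH) -> str:
    """Pick the kind of facility information for a driver's guidance request.

    Hypothetical helper: the returned labels are illustrative only.
    """
    if vehicle_speed_kmh <= slow_speed_kmh:
        # At or below the slow speed: the driver is assumed to intend
        # to park or stop at the facility.
        return "parking_stopping_facility"
    # Above the slow speed: no intention to park or stop is assumed.
    return "business_content"
```

Under these assumptions, a request at 15 km/h would yield parking/stopping facility information and a request at 40 km/h would yield business content information.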

DESCRIPTION OF SYMBOLS
1 Voice recognition control device
2 Vehicle speed detection unit
3 Voice recognition unit
4 Gesture detection unit
5 Database
6 Control unit
10 Vehicle
11 Navigation device
12 Air conditioner device
13 Car audio device
14 Driver's seat
15 Passenger's seat
16 Display device
21 Microphone array
22 Indoor camera
23 Vehicle speed sensor

Claims

A voice recognition control system that controls an in-vehicle device using the voice of a vehicle occupant as an input signal, the system comprising:
a voice recognition unit that recognizes an utterance position and utterance content based on at least the voice;
a gesture detection unit that detects a gesture of a person at the utterance position based on an image of a passenger compartment captured by an indoor camera;
a database that stores a plurality of pieces of information related to a facility designated by the gesture; and
a control unit that selects and outputs a part of the plurality of pieces of information based on the utterance content and the utterance position.

The voice recognition control system according to claim 1, wherein the control unit outputs parking/stopping facility information of the facility when the utterance position is a driver's seat.

The voice recognition control system according to claim 1, wherein the control unit outputs business content information of the facility when the utterance position is other than the driver's seat.

The voice recognition control system according to any one of claims 1 to 3, wherein the control unit outputs the information according to the time at which the voice was input.