JP2017090613A - Voice recognition control system

Info

Publication number: JP2017090613A
Application number: JP2015219114A
Authority: JP
Inventors: 真吾入方; Shingo Irikata; 宗義難波; Muneyoshi Nanba
Original assignee: Mitsubishi Motors Corp
Current assignee: Mitsubishi Motors Corp
Priority date: 2015-11-09
Filing date: 2015-11-09
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition control system capable of readily providing guide information relating to a function unknown to an occupant, thereby improving convenience.

SOLUTION: A voice recognition controller 1 includes: a voice recognition unit 3 for recognizing a speaker and the content of an utterance on the basis of at least a voice input from a microphone array 21; a gesture detection unit 4 for detecting a gesture of the speaker indicating the target of the utterance on the basis of an image of the vehicle cabin captured by an indoor camera 22; and a guide unit 7 for outputting a voice guide or a video guide explaining how to use the target indicated by the gesture when the content of the utterance recognized by the voice recognition unit 3 contains a predetermined voice command.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice recognition control system that controls in-vehicle devices using the voice of a vehicle occupant.

Guidance systems that explain by voice how to operate the various in-vehicle devices mounted on a vehicle have been developed. Such systems provide, by voice, guide information that explains concrete usage to occupants who are unfamiliar with operating the devices. For example, a technique is known that outputs a voice guide on how to operate an automatic locking/unlocking device when a vehicle door is opened or closed (see Patent Document 1). As a result, the occupant is spared the trouble of checking the user manual and can easily become more proficient at the operation, which improves the convenience of the in-vehicle devices.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2008-255753 (JP2008-255753)

On the other hand, recent in-vehicle devices have become multifunctional to meet demands for convenience, comfort, and environmental compatibility, and occupants may not even know that a newly added function exists. Existing guidance systems have difficulty providing guide information about such unknown functions, and it has also been difficult for occupants to call up such guide information themselves.

One object of the present case, devised in view of the above problem, is to provide a voice recognition control system that can easily provide guide information about functions unknown to an occupant and can thereby improve convenience. The object is not limited to this. Achieving operational effects that are derived from the configurations shown in the "Mode for Carrying Out the Invention" described later, and that cannot be obtained with conventional technology, can also be regarded as another object of the present case.

(1) The voice recognition control system disclosed here controls in-vehicle devices using the voice of a vehicle occupant as an input signal. The system includes a voice recognition unit that recognizes the speaker and the utterance content based on at least the voice. It also includes a gesture detection unit that detects, from an image of the vehicle interior captured by an indoor camera, a gesture by the speaker that indicates the target of the utterance. It further includes a guide unit that outputs a voice guide or a video guide explaining how to use the target indicated by the gesture when the utterance content recognized by the voice recognition unit contains a predetermined voice command.
When the utterance content does not contain the predetermined voice command, the guide unit preferably controls the operating state of the target according to the utterance content.

(2) The gesture is preferably a gesture of pointing a finger at the target.
(3) The guide unit preferably changes the amount of information in the voice guide or the video guide according to the speaker.
For example, when the speaker is the driver, it is preferable to increase the amount of information and give specialized guidance. When the speaker is an occupant other than the driver, it is preferable to reduce the amount of information and give basic guidance.

(4) The guide unit preferably changes the amount of information according to the accumulated boarding time of the speaker.
(5) The guide unit preferably outputs the voice guide or the video guide on the condition that the speaker is the driver or the front passenger.
(6) The system preferably further includes a vehicle speed detection unit that detects the vehicle speed. In this case, the guide unit preferably outputs the voice guide or the video guide on the condition that the vehicle speed detected by the vehicle speed detection unit is equal to or lower than a predetermined vehicle speed.

By combining voice input and gesture input to output a voice guide or a video guide, the system can tell occupants how to use the in-vehicle devices and, in particular, can easily provide guide information about functions unknown to an occupant. The occupant only has to issue a predetermined voice command while making a gesture that indicates the target of the utterance in order to call up a voice guide or a video guide on how that target is operated and what it does. The convenience of the in-vehicle devices can therefore be improved.

FIG. 1 is a schematic top view of a vehicle to which the voice recognition control system is applied.
FIG. 2 is a schematic diagram showing the configuration of the voice recognition control system.
FIG. 3 is a flowchart for explaining the control performed by the voice recognition control system.
FIG. 4 is a diagram showing an occupant requesting a guide about the switches.

A voice recognition control system according to an embodiment will be described with reference to the drawings. The embodiment described below is merely an example, and there is no intention to exclude various modifications or applications of technology that are not explicitly described in it. Each configuration of the present embodiment can be modified in various ways without departing from its spirit. The configurations can also be selected as necessary or combined as appropriate.

[1. Device configuration]
The voice recognition control system of this embodiment is applied to the vehicle 10 shown in FIG. 1. A driver's seat 14 and a front passenger seat 15 are provided in the cabin of the vehicle 10, and an instrument panel (dashboard) is arranged at the front of the cabin. On the cabin-facing side of the instrument panel, a steering device and instruments are arranged in front of the driver's seat 14, and a glove box is arranged in front of the front passenger seat 15. At the center of the instrument panel in the vehicle width direction are a multi-communication type display device 16, which consolidates user interfaces such as the car navigation and audio-visual functions, and button-type switches 17. The display device 16 and the switches 17 are located diagonally forward left from the viewpoint of the driver sitting in the driver's seat 14, and diagonally forward right from the viewpoint of the occupant sitting in the front passenger seat 15 (the front passenger).

The display device 16 is an electronic device that includes a general-purpose video display device (display screen) with a touch panel, a speaker (audio device), and an electronic control unit (computer) containing a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and the like. The display device 16 is connected to in-vehicle devices such as the navigation device 11, the air conditioner 12, the car audio device 13, and a multimedia system, and can function as an input/output device for them. For example, route information to a destination, map information, and traffic information provided by the navigation device 11 can be shown on the display screen of the display device 16, and voice guidance can also be output. The display screen can also play back and show a wide range of audio-visual information, such as programs received by an in-vehicle digital terrestrial broadcast tuner, video content on recording media, video captured by a rear-view camera, and the operating interfaces of the air conditioner 12 and the car audio device 13.

The switches 17 are a row of multiple keys (buttons) with built-in light-emitting elements such as LEDs (Light Emitting Diodes) or organic EL (Organic Electro-Luminescence) elements. They are connected to in-vehicle devices such as the navigation device 11, the air conditioner 12, the car audio device 13, and a multimedia system, and can function as an input device for them. Any one function of the in-vehicle devices can be assigned to each button. For example, if the occupants of the vehicle 10 use the car audio device 13 frequently, a track selection function or a speaker volume function can be assigned to one of the buttons of the switches 17. If the navigation device 11 is used often, a function that starts the navigation device 11 can be assigned instead.

The vehicle 10 is also equipped with a voice recognition control device 1 that controls the in-vehicle devices using an occupant's voice as an input signal. The voice recognition control device 1 is an electronic device (ECU, electronic control unit) that integrates a processor such as a CPU or MPU (Micro Processing Unit) with ROM, RAM, nonvolatile memory, and the like. The processor here is, for example, a processing device containing a control unit (control circuit), an arithmetic unit (arithmetic circuit), and cache memory (registers). The ROM, RAM, and nonvolatile memory are memory devices that store programs and working data. The control performed by the voice recognition control device 1 is recorded as firmware or an application program in the ROM, RAM, nonvolatile memory, or removable media. When the program is executed, its contents are loaded into the memory space in the RAM and executed by the processor.

As shown in FIG. 2, the input devices of the voice recognition control device 1 include a microphone array 21, an indoor camera 22, and a vehicle speed sensor 23. The microphone array 21 is a voice input device in which multiple microphones are arranged in a predetermined layout, and the indoor camera 22 is a wide-angle video camera that can capture the entire cabin. The microphone array 21 and the indoor camera 22 are built into, for example, the ceiling at the center in the vehicle width direction. The vehicle speed sensor 23 outputs a pulse signal corresponding to the rotational speed of a wheel. The output devices of the voice recognition control device 1 include the navigation device 11, the air conditioner 12, the car audio device 13, the display device 16, and the switches 17. The voice recognition control device 1 controls the in-vehicle devices based on the voice input from the microphone array 21, the image captured by the indoor camera 22, and the pulse information detected by the vehicle speed sensor 23.

[2. Control configuration]
The voice recognition control device 1 has a function (a so-called "usage guide function") that outputs a voice guide or a video guide explaining how to use the in-vehicle devices, based on an occupant's voice and an image of the cabin. For example, when some voice is input from the microphone array 21, the speaker who produced it is recognized first, together with the utterance content. Based on the image captured by the indoor camera 22, a gesture made by the speaker at the same time as the utterance is detected, and the target indicated by that gesture (the object designated by the gesture) is detected. If the utterance content is a "guide request", a voice guide or a video guide explaining how to use the target indicated by the gesture is provided to the occupant. If the utterance content is a "control command", the target indicated by the gesture is controlled according to the utterance content.

Specific examples of the voice guide provided to the occupant are as follows.
・"This button is the car navigation start button."
・"This switch is the eco mode selection switch."
・"This lever is the steering wheel heater activation lever."
・"This display shows the menu screen for the car navigation, audio, and air conditioning systems."
Specific examples of the video guide provided to the occupant include showing the content of the voice guide as text, and playing back or displaying an illustrated guide such as those found in the user manual of the vehicle 10.

As elements for performing the control described above, the voice recognition control device 1 is provided with a vehicle speed detection unit 2, a voice recognition unit 3, a gesture detection unit 4, a database 5, a person identification unit 6, and a guide unit 7. These represent some of the functions of the program executed by the voice recognition control device 1 and are assumed to be implemented in software. However, some or all of the functions may be implemented in hardware (electronic control circuits), or in a combination of software and hardware.

The database 5 is a storage device in which the various data used for voice recognition are recorded and saved. The acoustic model and the language model used in voice recognition are stored here. Both models are created in advance based on the voice of a standard speaker. The specific configuration of the acoustic model and the language model can be adopted from known techniques (for example, Japanese Unexamined Patent Application Publication No. 2002-189492).
The vehicle speed detection unit 2 acquires (detects, calculates) the vehicle speed based on the pulse signal output by the vehicle speed sensor 23. The vehicle speed acquired here is passed to the guide unit 7.
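As an illustration only, the conversion from wheel pulses to vehicle speed might be sketched as follows. The publication does not give a formula; the function name, the number of pulses per wheel revolution, and the tire circumference are all hypothetical values chosen for the example.

```python
def vehicle_speed_kmh(pulse_count: int, interval_s: float,
                      pulses_per_rev: int = 48,
                      tire_circumference_m: float = 2.0) -> float:
    """Estimate vehicle speed from wheel pulses counted over a time window.

    pulses_per_rev and tire_circumference_m are assumed values, not taken
    from the publication.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    revolutions = pulse_count / pulses_per_rev
    distance_m = revolutions * tire_circumference_m
    return distance_m / interval_s * 3.6  # m/s to km/h
```

With these assumed constants, one wheel revolution per second corresponds to 2.0 m/s, so the result can be compared directly with a threshold expressed in km/h.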

The voice recognition unit 3 recognizes the speaker (utterance position) and the utterance content based on at least the voice input from the microphone array 21. Here it determines whether the speaker is the person sitting in the driver's seat 14 (the driver), the person sitting in the front passenger seat 15 (the front passenger), or some other occupant (a rear-seat occupant). The speaker's position can be determined from the amplitudes and delays of the multiple voice signals picked up by the microphone array 21. Alternatively, the speaker's position can be determined by analyzing the image captured by the indoor camera 22 and comparing the lip movements of the people in the image with the timing at which the voice was detected.
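One way to picture the delay-based localization is to compare the measured arrival delays with the delays each seat would be expected to produce and choose the closest match. The sketch below is only that: the two-microphone layout, the seat coordinates, and the function names are hypothetical, and a real system would first have to estimate the delays from the audio signals.

```python
import math

SPEED_OF_SOUND_M_S = 343.0

# Hypothetical layout in metres (x: toward the right side of the car, y: forward).
MICS = [(-0.10, 0.0), (0.10, 0.0)]
SEATS = {
    "driver": (0.40, 0.30),
    "front_passenger": (-0.40, 0.30),
    "rear": (0.0, -0.80),
}

def relative_delays(source):
    """Arrival delay at each microphone relative to the first one, in seconds."""
    times = [math.dist(source, mic) / SPEED_OF_SOUND_M_S for mic in MICS]
    return [t - times[0] for t in times]

def locate_speaker(measured_delays):
    """Return the seat whose predicted delays best match the measurement."""
    def mismatch(seat):
        predicted = relative_delays(SEATS[seat])
        return sum((m - p) ** 2 for m, p in zip(measured_delays, predicted))
    return min(SEATS, key=mismatch)
```

Matching against a small set of candidate seats avoids solving for an exact source position, which is enough when only the seat is needed.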

The utterance content is classified and recognized as one of three types: "guide request", "control command", or "other". For example, if the utterance contains a voice command such as "What is this?", "What was it again?", or "Can you explain?", it is judged to be a "guide request". If the utterance contains a voice command such as "on", "off", "activate", or "stop", it is judged to be a "control command". If the utterance contains none of these voice commands, it is judged to fall under "other". The specific voice recognition method is arbitrary, and known voice recognition techniques can be used. For example, the phonemes in the voice are analyzed based on the acoustic model, then the words and phrases formed by sequences of phonemes are analyzed based on the language model, and their meaning is recognized. The recognized speaker and utterance content are passed to the gesture detection unit 4 and the guide unit 7.
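The three-way classification can be sketched as simple keyword matching on already-transcribed text. This is a minimal sketch, assuming the utterance has been converted to English text; the phrase lists are illustrative renderings of the examples above, and the function name is hypothetical.

```python
import re

# Illustrative phrase lists; a real system would use its own command vocabulary.
GUIDE_PHRASES = ("what is this", "what was", "explain")
CONTROL_WORDS = {"on", "off", "activate", "stop"}

def classify_utterance(text: str) -> str:
    """Classify transcribed speech as guide_request, control_command, or other."""
    # Lowercase and replace punctuation with spaces so phrases match cleanly.
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    joined = " ".join(words)
    if any(phrase in joined for phrase in GUIDE_PHRASES):
        return "guide_request"
    # Whole-word match so that "on" does not fire inside "button".
    if CONTROL_WORDS.intersection(words):
        return "control_command"
    return "other"
```

Guide phrases are checked first, so an utterance containing both kinds of command is treated as a guide request.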

The gesture detection unit 4 detects a gesture by the person at the utterance position (that is, the speaker) based on the image captured by the indoor camera 22. Here, a gesture indicating the target of the utterance is detected as the speaker's gesture. Specific examples include pointing a finger at the target and looking at the target (directing the line of sight toward it). The "target of the utterance" indicated by the gesture includes the in-vehicle devices mounted on the vehicle 10 and their operation buttons, the switches 17, indicators shown on the instrument panel, and icons shown on the display device 16.

The target can be detected by estimating what lies at the end of a virtual line extended from the position of the speaker recognized by the voice recognition unit 3 in the direction indicated by the gesture. When a pointing gesture is detected, the position of the hand is estimated by image analysis, and a virtual line is extended in the direction of the finger from the position of the speaker's hand in the cabin, which allows the target to be detected accurately. When the line of sight is detected as the gesture, the position and orientation of the face are estimated by image analysis, and a virtual line is extended in the gaze direction from the position of the speaker's face in the cabin, which likewise allows the target to be detected accurately. The detected target is passed to the guide unit 7.
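The virtual-line idea can be sketched as a ray cast from the hand (or face) position: each candidate target is projected onto the ray, and the one lying closest to the ray is selected. The function name, the target coordinates, and the 5 cm tolerance below are hypothetical; the publication does not specify how the nearest target is chosen.

```python
import math

def pointed_target(origin, direction, targets, max_offset_m=0.05):
    """Return the name of the target closest to the ray, or None.

    origin: 3D position of the hand or face.
    direction: 3D pointing or gaze direction (need not be normalized).
    targets: mapping of target name to 3D position (hypothetical cabin coordinates).
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        return None
    unit = [c / norm for c in direction]
    best_name, best_offset = None, max_offset_m
    for name, position in targets.items():
        rel = [p - o for p, o in zip(position, origin)]
        along = sum(r * u for r, u in zip(rel, unit))
        if along <= 0:
            continue  # target is behind the speaker's hand
        closest = [o + along * u for o, u in zip(origin, unit)]
        offset = math.dist(position, closest)
        if offset <= best_offset:
            best_name, best_offset = name, offset
    return best_name
```

The tolerance stands in for measurement error in the estimated hand position and finger direction; closely spaced buttons would need a tighter value or a more careful tie-break.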

The person identification unit 6 identifies the person who produced the voice, based on at least the voice input from the microphone array 21. Here it detects who spoke and also measures the total time that person has spent aboard the vehicle 10 (the accumulated boarding time). There are two approaches to identifying the person. One is to identify the person in real time at the moment the voice is detected. The other is to record the relationship between each person and their seating position when they board the vehicle 10, and then to identify the person from the sound source position of the detected voice.

In the former case, the person can be identified from the waveform pattern, frequency pattern, voiceprint pattern, or the like contained in the voice. Alternatively, human faces can be extracted from the image captured by the indoor camera 22, and the person whose lip movements match the timing of the voice can be identified. In the latter case, the person may be identified by analyzing the image captured by the indoor camera 22 (for example, by face recognition), or the person may be asked to say something and then identified by the same technique as in the former case. The identified person is passed to the guide unit 7.

The guide unit 7 controls the in-vehicle devices based on the speaker and utterance content recognized by the voice recognition unit 3 and the gesture detected by the gesture detection unit 4. The guide unit 7 has two main functions.
The first function is to control by voice the target specified by the gesture (hands-free control function). The guide unit 7 controls the operating state of the target when the utterance content recognized by the voice recognition unit 3 does not contain a voice command meaning "guide request" but does contain a voice command meaning "control command". If the utterance content contains neither a voice command meaning "guide request" nor one meaning "control command", the voice command is cancelled without the target being controlled.

The second function is to output to the display device 16 a voice guide or a video guide explaining how to use the target specified by the gesture (guide function). The guide unit 7 provides the occupant with a voice guide or a video guide about the target when the utterance content recognized by the voice recognition unit 3 contains a voice command meaning "guide request".
The guide unit 7 of this embodiment activates the guide function on the condition that the speaker is the driver or the front passenger and that the vehicle speed detected by the vehicle speed detection unit 2 is equal to or lower than a predetermined vehicle speed (for example, 10 km/h or lower). In other words, the guide function is executable when a "guide request" from the driver or the front passenger is recognized together with a gesture. This prevents unnecessary guidance triggered by rear-seat occupants. In addition, the guide function is executable while the vehicle 10 is stopped. This prevents the guide function from being activated at a time the driver does not intend while the vehicle 10 is moving.

The guidance level (the amount and quality of guidance) provided by the guide unit 7 is changed according to the speaker. That is, the guide unit 7 of this embodiment has a function that changes the amount of information in the voice guide or the video guide according to the speaker. For example, if the target is the start button of the navigation device 11 and the speaker is the front passenger, the guide unit provides the name of the target and elementary, basic usage information, such as "This button is the car navigation start button." If the speaker is the driver, it provides more specialized and advanced usage information than it gives the front passenger, such as "This button is the car navigation start button. Pressing and holding it restarts the car navigation system."

The guide unit 7 of this embodiment also has a function that changes the guidance level according to the accumulated boarding time of the speaker. For example, if the driver who is speaking has a relatively short accumulated boarding time and little experience driving the vehicle 10, it provides a thorough guide such as "This button is the car navigation start button. Pressing and holding it restarts the car navigation system." If the driver's accumulated boarding time is relatively long and the driver is considered to be skilled at operating the in-vehicle devices, it omits the elementary, basic usage information and provides only the advanced usage information, such as "Long-press to reset."
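The selection of a guidance level by seat and accumulated boarding time could be sketched as below. The `GuideEntry` structure, the function name, and the 50-hour threshold are hypothetical; the publication only says the accumulated time is "relatively short" or "relatively long".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GuideEntry:
    basic: str     # name and elementary usage
    advanced: str  # additional specialized usage
    terse: str     # short form for experienced drivers

NAV_BUTTON = GuideEntry(
    basic="This button is the car navigation start button.",
    advanced="Pressing and holding it restarts the car navigation system.",
    terse="Long-press to reset.",
)

def compose_guide(entry: GuideEntry, seat: str, boarding_hours: float,
                  experienced_after_hours: float = 50.0) -> str:
    """Pick the guide text for the speaker; the threshold is an assumed value."""
    if seat != "driver":
        return entry.basic
    if boarding_hours >= experienced_after_hours:
        return entry.terse
    return f"{entry.basic} {entry.advanced}"
```

For example, `compose_guide(NAV_BUTTON, "driver", 5.0)` returns the basic sentence followed by the advanced one, while a front passenger receives only the basic sentence regardless of boarding time.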

[3. Flowchart]
FIG. 3 is an example flowchart for explaining the control performed by the voice recognition control device 1. First, the voice information detected by the microphone array 21, the image information captured by the indoor camera 22, and the pulse information from the vehicle speed sensor 23 are input to the voice recognition control device 1 (step A1), and it is determined whether a voice has been input (step A2). If some voice has been input, the voice recognition unit 3 recognizes the speaker and the utterance content based on at least that voice (step A3). The gesture detection unit 4 detects the speaker's gesture based on the image captured by the indoor camera 22 (step A4). Likewise, the person identification unit 6 identifies the person who spoke based on the input voice and image (step A5).

The guide unit 7 determines whether the speaker is either the driver or the front passenger (step A6) and whether the vehicle speed detected by the vehicle speed detection unit 2 is equal to or lower than the predetermined vehicle speed (step A7). It also determines whether the utterance content recognized by the voice recognition unit 3 contains a voice command meaning "guide request" (step A8). When all of these conditions are satisfied, the guide unit 7 sets a guidance level according to the speaker's seating position and accumulated boarding time (step A9), and a voice guide or a video guide explaining how to use the target specified by the gesture is output to the display device 16 (step A10).

On the other hand, if any of the conditions determined in steps A6, A7, and A8 is not satisfied, it is determined whether the utterance content recognized by the voice recognition unit 3 includes a voice command meaning “control command” (step A11). If this condition is satisfied, the operating state of the gesture target is controlled. If it is not satisfied, the voice command is cancelled without the target being controlled.
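The branching in steps A6 to A11 can be sketched in a few lines of Python. This is only an illustrative reading of the flowchart, not the implementation described in the publication: the names `Utterance`, `Command`, `Action`, and `decide`, the seat labels, and the 10 km/h value standing in for the "predetermined vehicle speed" are all assumptions, and each utterance is assumed to carry at most one recognized command.

```python
from dataclasses import dataclass
from enum import Enum


class Command(Enum):
    GUIDE_REQUEST = "guide_request"  # e.g. "What was this again?"
    CONTROL = "control"              # a command that operates the target
    NONE = "none"                    # no recognized command


class Action(Enum):
    OUTPUT_GUIDE = "output_guide"
    CONTROL_TARGET = "control_target"
    CANCEL = "cancel"


@dataclass
class Utterance:
    seat: str                 # hypothetical labels: "driver", "front_passenger", "rear"
    command: Command
    vehicle_speed_kmh: float


# Hypothetical value; the source only says "predetermined vehicle speed".
SPEED_LIMIT_KMH = 10.0


def decide(u: Utterance) -> Action:
    """Return the action chosen by steps A6 to A11 for one utterance."""
    front_seat = u.seat in ("driver", "front_passenger")   # step A6
    slow_enough = u.vehicle_speed_kmh <= SPEED_LIMIT_KMH   # step A7
    wants_guide = u.command is Command.GUIDE_REQUEST       # step A8
    if front_seat and slow_enough and wants_guide:
        return Action.OUTPUT_GUIDE                         # steps A9 and A10
    if u.command is Command.CONTROL:                       # step A11
        return Action.CONTROL_TARGET
    return Action.CANCEL                                   # command is cancelled
```

Under these assumptions, a guide request from a rear seat or above the speed limit falls through to step A11 and is cancelled, while a control command is executed regardless of the three guide conditions.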

[4. Operation and effects]
As shown in FIG. 4, when the driver points to one of the switches 17 while the vehicle is stopped and says “What was this again?”, the voice recognition control device 1 recognizes the speaker (utterance position) and the utterance content, and identifies the person who spoke. From the driver's gesture, it also recognizes the single switch being pointed at as the target of the utterance.
Because the driver's voice command “What was this again?” means a “request for guide”, the voice recognition control device 1 outputs a voice guide and a video guide explaining how to use that switch to the display device 16. The guidance level at this time corresponds to the driver's accumulated boarding time. As a result, information matched to the driver's proficiency with the switches 17 is provided, which improves convenience.

(1) By combining voice input and gesture input to output a voice guide and a video guide explaining how to use the gesture target, the occupant can be informed of how to use an in-vehicle device and its operation buttons. There is no need to actually operate the in-vehicle device, so guide information on functions unknown to the occupant can be provided easily. Furthermore, simply by issuing a predetermined voice command while making a gesture that indicates the target of the utterance, the occupant can readily call up a voice guide and a video guide on the operation method and functions of that target. The convenience of the vehicle and the in-vehicle device can therefore be improved.

(2) Using the gesture of pointing a finger at the target as the trigger for starting the guide makes it easy to teach the occupant how to use the in-vehicle device, which improves convenience. In addition, the target of the utterance is easier to identify than with gestures such as line of sight or face direction, so malfunction of the guide and erroneous detection of gestures are less likely to occur. The convenience of the vehicle and the in-vehicle device can therefore be improved.

(3) In the voice recognition control device 1 described above, the information amount of the guide is changed according to the speaker. For example, when the speaker is the driver, the information amount is increased and specialized guidance is given. When the speaker is an occupant other than the driver (the front passenger), the information amount is reduced and basic guidance is given. Increasing or decreasing the information amount according to the speaker (utterance position) in this way provides a guide function that matches the occupant's needs.

(4) In the voice recognition control device 1 described above, the information amount of the guide is changed according to the accumulated boarding time of the speaker. For example, a driver with little experience driving the vehicle 10 is given both basic guidance and specialized guidance, whereas a driver who is skilled at driving the vehicle 10 is given specialized guidance only. This provides a guide function that matches the speaker's skill level and knowledge.

(5) The guide function described above is executed on condition that the speaker is the driver or the front passenger. This prevents a person in a rear seat from activating the guide function, and prevents erroneous recognition of voice commands and erroneous detection of gestures.
(6) The guide function described above is also executed on condition that the vehicle speed is equal to or lower than a predetermined vehicle speed. This restricts activation of the guide function while the vehicle 10 is traveling at medium to high speed, and more reliably prevents erroneous recognition of voice commands and erroneous detection of gestures.
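Items (3) and (4) together suggest a simple rule for choosing the guide content. The sketch below is one possible reading, not the method of the publication: the function name `guide_sections`, the seat labels, the section labels "basic" and "specialized", and the 100-hour threshold separating an inexperienced driver from a skilled one are hypothetical, and the source does not state how accumulated boarding time affects the guide for a front passenger.

```python
# Hypothetical threshold; the source gives no numeric value for "skilled".
EXPERIENCED_HOURS = 100.0


def guide_sections(seat: str, accumulated_hours: float) -> list[str]:
    """Pick the guide sections to present, following items (3) and (4)."""
    if seat != "driver":
        # Item (3): an occupant other than the driver gets basic guidance.
        # Assumption: boarding time is ignored for the front passenger.
        return ["basic"]
    if accumulated_hours < EXPERIENCED_HOURS:
        # Item (4): an inexperienced driver gets basic and specialized guidance.
        return ["basic", "specialized"]
    # Item (4): a skilled driver gets specialized guidance only.
    return ["specialized"]
```

For example, `guide_sections("driver", 5.0)` selects both sections, while the same call with a large accumulated time selects only the specialized one.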

[5. Modifications]
The above embodiment describes in detail the case where the gesture of pointing a finger at the target is the gesture representing the target of the utterance, but the specific type of gesture is not limited to this. For example, the control may be configured so that guide information on a target is provided when the occupant says “What was this again?” while directing their gaze at the target.
The above embodiment also describes in detail the guide given when one of the switches 17 is the target. A light emitting element built into the target switch may be made to blink at the same time as the voice guide and the video guide are provided. By making the target being explained stand out, occupants other than the speaker can also see clearly which target the guide concerns, which further improves convenience.

Further, in the above embodiment, every process from voice recognition to control of the control target is managed centrally by the voice recognition control device 1, but some or all of the functions of the voice recognition control device 1 may be relocated outside the vehicle 10. For example, the voice recognition control device 1 may be made connectable to a network such as the Internet, a mobile phone wireless communication network, or another digital wireless communication network, and some or all of its functions may be implemented on a server on that network. This makes the database 5 easier to manage and update, and can improve voice recognition accuracy and gesture recognition accuracy.

DESCRIPTION OF SYMBOLS
1 Voice recognition control device
2 Vehicle speed detection unit
3 Voice recognition unit
4 Gesture detection unit
5 Database
6 Person specifying unit
7 Guide unit
10 Vehicle
11 Navigation device
12 Air conditioner device
13 Car audio device
14 Driver's seat
15 Passenger seat
16 Display device
17 Switches
21 Microphone array
22 Indoor camera
23 Vehicle speed sensor

Claims

A voice recognition control system that controls an in-vehicle device using the voice of a vehicle occupant as an input signal, the system comprising:
a voice recognition unit that recognizes a speaker and utterance content based on at least the voice;
a gesture detection unit that detects a gesture of the speaker representing a target of the utterance, based on an image of a vehicle cabin captured by an indoor camera; and
a guide unit that outputs a voice guide or a video guide explaining how to use the target represented by the gesture when the utterance content recognized by the voice recognition unit includes a predetermined voice command.

The voice recognition control system according to claim 1, wherein the gesture is a gesture of pointing a finger toward the object.

The voice recognition control system according to claim 1, wherein the guide unit changes an information amount of the voice guide or the video guide according to the speaker.

The voice recognition control system according to claim 3, wherein the guide unit changes the information amount according to the accumulated boarding time of the speaker.

The voice recognition control system according to any one of claims 1 to 4, wherein the guide unit outputs the voice guide or the video guide on condition that the speaker is a driver or a front passenger.

The voice recognition control system according to any one of claims 1 to 5, further comprising a vehicle speed detection unit that detects a vehicle speed, wherein the guide unit outputs the voice guide or the video guide on condition that the vehicle speed detected by the vehicle speed detection unit is equal to or lower than a predetermined vehicle speed.