JP5014662B2

JP5014662B2 - On-vehicle speech recognition apparatus and speech recognition method

Info

Publication number: JP5014662B2
Application number: JP2006110379A
Authority: JP
Inventors: 真人柴田
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2006-04-13
Filing date: 2006-04-13
Publication date: 2012-08-29
Anticipated expiration: 2026-04-13
Also published as: JP2007286136A

Description

The present invention relates to technology for controlling in-vehicle devices and the like using a speech recognition function, and more particularly to an in-vehicle speech recognition apparatus and a speech recognition method adapted to perform speech recognition corresponding to operation instructions spoken by an occupant in the vehicle interior.

Recent vehicles are equipped with devices and apparatuses that provide various services to the driver, the front passenger, rear-seat passengers and so on (hereinafter also called "users" for convenience). Typical in-vehicle devices include a navigation device with a route guidance function that guides the vehicle to a set destination without taking a wrong road; audio/video (A/V) equipment that provides entertainment such as audio and video information output from various sources (radio receiver, CD player, TV receiver, DVD player, etc.); and an air conditioner. The operating state of these in-vehicle devices is changed in response to instructions the user gives by operating a remote control, an operation panel or the like. The changed operating state can be heard through speakers installed in the vehicle interior (wireless headphones or the like for rear-seat users) and can be seen on the screen of a display device such as an in-vehicle monitor.

As described above, required operation instructions can be entered to each in-vehicle device by manual operation such as remote-control operation. Recently, however, apparatuses have appeared that include a function (speech recognition function) allowing a device to be controlled simply by speaking the operation instruction. Such a speech recognition function is advantageous for user convenience and is especially useful to the driver from the standpoint of safe driving.

Realizing this speech recognition function requires a speech recognition dictionary. Registered in advance in this dictionary is the vocabulary of words and phrases to be recognized, that is, vocabulary related to the operation instructions of the in-vehicle devices to be controlled on the basis of speech recognition (hereinafter also called "control target devices"). For a navigation device, for example, vocabulary such as "destination", "menu" and "nearby search" is registered; for A/V equipment, vocabulary such as "radio", "FM", "AM", "menu", "play" and "stop" is registered.

As technology related to the above prior art, there is, for example, the voice control apparatus described in Patent Document 1, which recognizes what the user has spoken and controls a control target device. By taking the operating state of the control target device into account when recognizing the user's utterance, it allows the device to be operated appropriately by voice.
Patent Document 1: JP 2004-86150 A

As described above, the prior art provides a function that allows a control target device to be controlled simply by speaking an operation instruction for it. However, because the conventional method performs speech recognition against all of the vocabulary registered in the speech recognition dictionary, the following inconvenience can arise as the number of registered vocabulary entries grows.

That is, the speech recognition engine calculates the degree of match between what the user spoke (the voice command) and every vocabulary entry (command) registered in the speech recognition dictionary, and determines the command with the highest degree of match to be the voice command the user issued (speech recognition). There is no problem if a single command with the highest degree of match can be identified. As the number of registered entries grows, however, so does the number of entries with similar pronunciation ("reading"), so the engine cannot always narrow the result to one entry and may, as a result, misrecognize an entry that does not actually match. In other words, the conventional speech recognition method has the problem that, as the number of entries registered in the dictionary in use grows, the rate of misrecognition rises accordingly and it becomes difficult to recognize the user's utterance accurately (the recognition rate for voice commands falls).
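The conventional matching step described above can be sketched as follows. This is only an illustration: the patent does not say how the degree of match is computed, so the sketch substitutes plain string similarity from Python's `difflib` for an acoustic score, and the function names `match_score`, `recognize` and `top_two_margin` are hypothetical.

```python
from difflib import SequenceMatcher


def match_score(heard: str, entry: str) -> float:
    """Stand-in for an acoustic degree of match: string similarity in [0, 1]."""
    return SequenceMatcher(None, heard, entry).ratio()


def recognize(heard: str, vocabulary: list) -> tuple:
    """Score every registered entry and return (best entry, best score)."""
    scored = [(entry, match_score(heard, entry)) for entry in vocabulary]
    return max(scored, key=lambda pair: pair[1])


def top_two_margin(heard: str, vocabulary: list) -> float:
    """Gap between the best and second-best scores; a small gap means ambiguity."""
    scores = sorted((match_score(heard, e) for e in vocabulary), reverse=True)
    return scores[0] - scores[1]


small_vocabulary = ["menu", "stop"]
large_vocabulary = ["menu", "stop", "men"]  # adds a similar-sounding entry
```

Comparing `top_two_margin("menu", small_vocabulary)` with `top_two_margin("menu", large_vocabulary)` shows the effect the text describes: adding a similar entry shrinks the gap between the best and second-best candidates, which is where misrecognition becomes likely.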

The present invention was made in view of these problems of the prior art. Its object is to provide an in-vehicle speech recognition apparatus and a speech recognition method that can raise the recognition rate for spoken content when that content is recognized in order to control an in-vehicle device.

To solve the above problems of the prior art, one aspect of the present invention provides an in-vehicle speech recognition apparatus comprising: voice input means for inputting, by voice, information instructed by a user in the vehicle interior; speaker identification means for identifying the user who has spoken via the voice input means; dictionary storage means storing one speech recognition dictionary in which vocabulary related to the operation instructions of each of a plurality of control target devices is registered, and also storing a first table that defines, for each control target device, the relationship between the vocabulary to be recognized and a preset weighting; memory means storing a second table that defines the relationship between the seat in which a user is seated and the control target device that is the source of the information the user in that seat is viewing or listening to; and control means operatively connected to the voice input means, the speaker identification means, the dictionary storage means and the memory means. When the control means has identified the speaker in cooperation with the speaker identification means, it refers to the speech recognition dictionary and the first and second tables, adds a predetermined weighting to the vocabulary corresponding to the control target device that is the source of the information the speaker is viewing or listening to, performs speech recognition on the speaker's utterance with reference to the weighted vocabulary, and controls that control target device in accordance with the recognized utterance.

With the in-vehicle speech recognition apparatus according to this aspect, when the user who has spoken in the vehicle interior (the speaker) is identified, the second table is consulted to identify the control target device that is the source of the information the speaker is viewing or listening to. Of the vocabulary registered in the speech recognition dictionary shared by the sources (control target devices), only the vocabulary corresponding to the identified control target device is given the predetermined weighting defined in the first table, and speech recognition is performed on the speaker's utterance with reference to the weighted vocabulary.

Consequently, when recognizing what the user spoke, only the vocabulary to which a weighting has been added by reference to the first table (that is, the vocabulary corresponding to the control target device identified by reference to the second table) needs to be recognized among the vocabulary registered in the speech recognition dictionary. Compared with the conventional case, in which speech recognition is performed against all the registered vocabulary, the rate at which non-matching vocabulary is misrecognized can be reduced. In other words, the recognition rate for what the user spoke (the voice command) can be improved.
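A minimal sketch of the two-table weighting described above, under stated assumptions: the table contents, the weight values and the rule for combining a weight with a match score (simple multiplication, with unweighted entries scoring zero) are all hypothetical, since the text only says that a predetermined weighting is added. String similarity again stands in for an acoustic score.

```python
from difflib import SequenceMatcher

# One dictionary shared by all control target devices.
SHARED_DICTIONARY = ["destination", "menu", "nearby search",
                     "play", "stop", "radio", "FM", "AM"]

# First table (assumed layout): per device, vocabulary to recognize and its weight.
FIRST_TABLE = {
    "navigation": {"destination": 1.0, "menu": 1.0, "nearby search": 1.0},
    "dvd": {"menu": 1.0, "play": 1.0, "stop": 1.0},
    "radio": {"radio": 1.0, "FM": 1.0, "AM": 1.0},
}

# Second table (assumed layout): seat -> device the occupant is watching or hearing.
SECOND_TABLE = {"driver": "navigation", "front_passenger": "radio", "rear": "dvd"}


def recognize_weighted(heard: str, seat: str) -> tuple:
    """Return (device, best entry) for an utterance from the given seat."""
    device = SECOND_TABLE[seat]
    weights = FIRST_TABLE[device]
    best_entry, best_score = None, 0.0
    for entry in SHARED_DICTIONARY:
        # Entries without a weight for this device score zero and never win.
        weight = weights.get(entry, 0.0)
        score = weight * SequenceMatcher(None, heard, entry).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    return device, best_entry
```

With these made-up tables, "stop" spoken from the rear seat resolves to the DVD player's "stop", while the same word from the driver's seat cannot resolve to "stop" because that entry carries no weight for the navigation unit.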

According to another aspect of the present invention, there is provided a speech recognition method for an in-vehicle speech recognition apparatus having a function of performing speech recognition corresponding to an operation instruction for a control target device spoken by a user in the vehicle interior. In this method, one speech recognition dictionary in which vocabulary related to the operation instructions of each of a plurality of control target devices is registered in advance is stored in storage means, together with a first table that defines, for each control target device, the relationship between the vocabulary to be recognized and a preset weighting, and a second table that defines the relationship between the seat in which a user is seated and the control target device that is the source of the information the user in that seat is viewing or listening to. When an utterance is detected, the speaker is identified; the speech recognition dictionary and the first and second tables are consulted; a predetermined weighting is added to the vocabulary corresponding to the control target device that is the source of the information the speaker is viewing or listening to; speech recognition is performed on the speaker's utterance with reference to the weighted vocabulary; and that control target device is controlled in accordance with the recognized utterance.

Other structural features of the in-vehicle speech recognition apparatus and speech recognition method according to the present invention, and the specific processing based on them, are described in detail with reference to the embodiments below.

Embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 shows the configuration of an in-vehicle audio/video (A/V) navigation system incorporating an in-vehicle speech recognition apparatus according to one embodiment of the present invention.

As illustrated, the in-vehicle A/V navigation system 40 comprises: the in-vehicle speech recognition apparatus AR, which characterizes the present invention; the target devices that are controlled in accordance with the spoken content (voice command) on the basis of the recognition result (in the illustrated example, a radio receiver 1, a CD player 2, a DVD player 3, a TV receiver 4, a navigation unit 5 and an air conditioner 6); a front seat operation unit (head unit (H/U)) 20 with which front-seat users perform various setting operations on the control target devices; a rear seat operation unit 30 with which rear-seat users perform various setting operations on the control target devices (excluding the navigation unit 5); a front seat display unit 25; an amplifier unit 26; a speaker 27; a rear seat display unit 31; and wireless headphones 32. The in-vehicle speech recognition apparatus AR, the sources (control target devices) 1 to 6, the front seat operation unit 20, the display units 25 and 31, and the amplifier unit 26 are interconnected via a bus 7, such as an optical fiber, that serves as the transmission path.

Although only one speaker 27 is shown in the illustrated example, in practice the required number of speakers is installed at predetermined locations in the vehicle interior. With a single row of rear seats, for example, a total of four speakers 27 are installed: at least two near the left and right of the rear seat and two near the left and right of the front seats. Likewise, only one rear seat operation unit 30, one display unit 31 and one set of wireless headphones 32 are shown, but in practice the required number is provided according to the number of rear-seat passengers. With a single row of rear seats, for example, two operation units 30, two display units 31 and two sets of wireless headphones 32 are provided, one each for the left and right rear passengers.

The in-vehicle speech recognition apparatus AR that characterizes the present invention comprises a hard disk drive (HDD) 8 as a storage medium, a microphone array 9 and a speech recognition unit 10. The disk (not shown) driven by the HDD 8 stores, in respectively allocated storage areas, the map data used to execute the navigation function and the data (speech recognition dictionary) used to execute the speech recognition function. The map data is divided into longitude and latitude widths of suitable size for each scale level (1/12,500, 1/25,000, 1/50,000, etc.). It includes road unit data needed for processing such as route search and map matching, intersection unit data representing the details of intersections, and data on various facilities such as convenience stores, gas stations, supermarkets and discount shops (location, address, telephone number, genre and similar information). The contents of the speech recognition dictionary stored in the HDD 8 are described later.

The microphone array 9 consists of a plurality of microphones arranged side by side in an array at predetermined intervals and is installed, for example, near the sun visor or the rear-view mirror in front of the driver's seat. The microphone array 9 (each microphone) detects instructions (speech) relating to the operation of a control target device spoken by a user (driver, front passenger or rear-seat passenger) and converts them into analog audio signals corresponding to the sound pressure level. As described later, the signals detected by the microphones are used to identify where in the vehicle interior the user who spoke (the speaker) is located, that is, to identify the speaker. The method of identifying the speaker is described later together with the internal configuration of the speech recognition unit 10.

The front seat operation unit (H/U) 20 is installed as an "operation panel" on the center console between the two front seats so that the driver and the front passenger can share it, and the corresponding display unit 25 is located above the operation panel (H/U). The display unit 25 consists of, for example, a dual-view LCD monitor (called a "dual display" for convenience), which can simultaneously show different images on the same screen depending on whether it is viewed from the right (driver's side) or from the left (front passenger's side). The screen of this dual display (display unit 25) shows various video information output from the navigation unit 5 (a map around the vehicle position, the guidance route from the vehicle position to the destination, guidance information such as facility searches based on speech recognition, etc.) and video information output from video sources such as the DVD player 3 and the TV receiver 4.

The rear seat operation unit 30, on the other hand, takes the form of a "remote control" so that rear-seat users can operate it easily, and is connected to the corresponding rear seat display unit 31 by infrared communication. The rear seat display unit 31 is installed, for example, on the back of the headrest of the seat in front and, like the front display unit 25, has an LCD monitor or the like that displays video information. The display unit 31 is also connected to the corresponding wireless headphones 32 by infrared communication and RF communication.

As their basic operation, the sources (control target devices) 1 to 6 receive operation instruction data sent onto the bus 7 from the front seat operation unit 20, operation instruction data sent onto the bus 7 from the rear seat operation unit (remote control) 30 via the display unit 31 by infrared communication, or operation instruction data sent onto the bus 7 from the speech recognition unit 10 (the "device control signal" described later). Each source sets or changes its own operating state on the basis of that data and sends data indicating the result (the current operating state) onto the bus 7 as an audio/video signal. The radio receiver 1, for example, responds to an operation instruction from operation unit 20 or 30 or from the speech recognition unit 10 by receiving and demodulating an FM or AM broadcast signal to generate an audio signal, converting it into digital audio data and sending it onto the bus 7. Similarly, the DVD player 3 responds to an operation instruction by reading the signal recorded on the recording surface of the DVD selected by the user and sending the reproduced video data onto the bus 7.

The front seat operation unit 20 comprises a control unit 21, an operation unit 22, a display unit 23 and a memory unit 24. The operation unit 22 has operation keys for performing various setting operations on the sources (control target devices) 1 to 6: for example, a power key for turning the power on and off and adjusting the volume, selection keys for selecting each source, numeric keys, preset keys for invoking predetermined functions, and shift keys marked with arrows (operating the arrow portion instructs operations such as FF/REW and seek up/down). The display unit 23 is arranged on the operation panel (H/U) in the form of an LCD or the like and displays various information on the basis of data output from the control unit 21. For the radio receiver 1, for example, it displays the FM/AM band and the reception frequency of the broadcast station; for the CD player 2, it displays the disc number and playback position (track number, elapsed time, etc.) during CD playback.

The memory unit 24 consists of a nonvolatile semiconductor memory such as a flash memory and stores necessary information (data) under the control of the control unit 21. The memory unit 24 stores data indicating the operating state of the selected source (control target device) at the moment its audio/video signal output is stopped in response to an operation instruction from operation unit 20 or 30 or from the speech recognition unit 10 (hereinafter "device operating state data"). This data is stored so that it can be referred to as needed when output next starts. The device operating state data includes, for example, a "source type" indicating which device (source) was in use; for an audio source, "volume/sound quality" indicating the volume and sound quality adjustment values in effect while listening; and "source-specific detailed information" indicating the detailed operating state of each source. The source-specific detailed information includes, for example, the FM/AM band and broadcast station (reception frequency) when the radio receiver 1 was in use. When the CD player 2 was in use, it includes the disc number indicating which of the loaded CDs was being played and playback position information indicating how far into which track playback had reached.
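The device operating state data described above could be modeled as a small record. This is a sketch only: the class name `DeviceOperatingState`, the field names and the example values are hypothetical, and the patent does not specify a storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceOperatingState:
    """Snapshot saved when a source stops output (all field names are assumed)."""
    source_type: str                                     # which device was in use
    volume: int                                          # volume adjustment value
    sound_quality: dict = field(default_factory=dict)    # e.g. tone settings
    source_details: dict = field(default_factory=dict)   # source-specific details


# Radio: band and reception frequency of the station.
radio_state = DeviceOperatingState(
    source_type="radio",
    volume=12,
    sound_quality={"bass": 2, "treble": 0},
    source_details={"band": "FM", "frequency_mhz": 80.0},
)

# CD player: which disc, which track, and how far into it.
cd_state = DeviceOperatingState(
    source_type="cd",
    volume=10,
    source_details={"disc_number": 3, "track": 5, "elapsed_seconds": 42},
)
```

Keeping the source-specific details in a free-form mapping mirrors the text, where the detail fields differ from one source to the next.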

The control unit 21 consists of a microcomputer or the like and controls the system 40 as a whole. Basically, on the basis of operation instructions from operation unit 20 or 30 or from the speech recognition unit 10, it controls operations such as acquiring the audio/video data sent from the selected source (control target device) via the bus 7 and reproducing the audio/video information, displaying information indicating the operation status and operating state on the display unit 23, and storing and reading the device operating state data. The acquired audio data is sent by the control unit 21 via the bus 7 to the amplifier unit 26, where it is D/A converted as appropriate, subjected to volume and sound quality control, amplified, and then output as sound through the speaker 27. The acquired video data is sent by the control unit 21 via the bus 7 to the display unit 25 and displayed on its screen as video information.

The rear seat operation unit (remote control) 30, although not specifically illustrated, comprises an operation unit with functions equivalent to those of the front operation unit 22 and an infrared transmitter that sends signals corresponding to the entered operation instructions to the display unit 31 by infrared communication. The rear seat display unit 31, although not specifically illustrated, comprises an infrared communication unit for exchanging control signals, data and the like with the remote control 30 and the wireless headphones 32; a control unit that performs control equivalent to that of the front control unit 21; a display unit consisting of an LCD monitor or the like similar to the front display unit 25; and a memory unit similar to the front memory unit 24.

<First Embodiment (see FIGS. 2 to 4)>
FIG. 2 schematically shows part of the configuration of the in-vehicle speech recognition apparatus according to the first embodiment.

The in-vehicle speech recognition apparatus AR according to this embodiment comprises, as illustrated, the HDD 8, the microphone array 9, a digital signal processor (DSP) 11, a CPU 12 and a memory unit 13 consisting of RAM or the like. Of these, the DSP 11, the CPU 12 and the memory unit 13 constitute the speech recognition unit 10 (FIG. 1). The DSP 11 has, as functional blocks, a voice input unit 11a, a beamforming unit 11b and a sound source direction identification unit 11c. The CPU 12 has, as functional blocks, a recognition dictionary selection unit 12a, a speech recognition processing unit 12b and a device control signal generation unit 12c.

The HDD 8 stores a plurality of dedicated recognition dictionaries, one for each source (control target device) to be controlled on the basis of speech recognition. Each dictionary has registered in it, in advance, dedicated vocabulary related to the operation instructions of that device (that is, vocabulary frequently spoken as voice commands to that device). In the illustrated example, three dictionaries are stored: a dedicated recognition dictionary D1 containing vocabulary related to the navigation unit 5 ("destination", "menu", "nearby search", "current location", etc.); a dedicated recognition dictionary D2 containing vocabulary related to the DVD player 3 ("menu", "play", "stop", etc.); and a dedicated recognition dictionary D3 containing vocabulary related to the radio receiver 1 ("radio", "FM", "AM", etc.).

The memory unit 13 stores information (a management table) indicating the relationship between the seat in which a user is seated (driver's seat, front passenger seat, rear seat) and the source (control target device) of the information the user in that seat is viewing or listening to. This management table is created through cooperation among the CPU 12, the control unit 21 in the front seat operation unit 20 and the control unit (not shown) in the rear seat display unit 31, on the basis of the "device operating state data" stored in the memory unit 24 of each unit. Accordingly, whenever the operating state of a source (control target device) changes, the contents of the management table are updated to match.
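As a rough illustration of the management table and its update rule, the sketch below keeps a seat-to-source mapping and rebuilds it when a source changes. The function names, seat labels and the use of `None` for "nothing playing" are assumptions made for the example; the patent does not describe the table's format.

```python
def build_management_table(unit_states: dict) -> dict:
    """Build seat -> source from each unit's current source (None = nothing playing)."""
    return {seat: source for seat, source in unit_states.items() if source is not None}


def on_state_change(table: dict, seat: str, new_source) -> dict:
    """Return an updated copy of the table after a seat's source changes."""
    updated = dict(table)
    if new_source is None:
        updated.pop(seat, None)   # the seat is no longer watching or hearing anything
    else:
        updated[seat] = new_source
    return updated


table = build_management_table(
    {"driver": "navigation", "front_passenger": None, "rear": "dvd"}
)
table_after_switch = on_state_change(table, "rear", "radio")
```

Returning a new dictionary rather than mutating in place is a design choice of this sketch; it keeps the previous table available for comparison.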

本実施形態では、マイクロホンアレイ９とその検出信号を処理するＤＳＰ１１とを用いて、音源の方向（この場合、発話者が着座している座席の方向）を特定している。複数のマイクロホンを用いて音源の方向を特定する方法は知られている。すなわち、個々のマイクロホンは無指向性であるが、複数のマイクロホンをアレイ状に配置して音源からの音を各マイクロホンで検出し、それぞれ検出したデータを加算処理することで指向性をもたせることができる。例えば、図３に示すように、マイクロホンアレイ９の真正面から音が入射する場合（図示の例では、リア席のユーザＰ３が発話している場合）、マイクロホンアレイ９の各マイクロホンに到達する音圧信号は位相的にほぼ同相となるため、これらを加算するとレベル的に大きな信号となる。これに対し、音が斜めから入射した場合（図示の例では、運転席のユーザＰ１、助手席のユーザＰ２が発話している場合）、各マイクロホンに到達する時間に差が生じ、位相的に正方向又は負方向にずれるため、これらを加算するとお互いに打ち消しあってレベル的に小さな信号となる。この原理を利用して、各マイクロホンで検出した信号のレベルと位相差に基づき、音の到来方向（すなわち、発話者の居る方向）を特定することができる。その特定に際し、本実施形態ではビームフォーミング法を用いている。 In this embodiment, the direction of the sound source (here, the direction of the seat in which the speaker is seated) is identified using the microphone array 9 and the DSP 11 that processes its detection signals. Methods for identifying the direction of a sound source with multiple microphones are known. Each individual microphone is omnidirectional, but when multiple microphones are arranged in an array, each detects the sound from the source, and the detected data are summed, the array as a whole acquires directivity. For example, as shown in FIG. 3, when sound arrives from directly in front of the microphone array 9 (in the illustrated example, when the rear seat user P3 is speaking), the sound pressure signals reaching the microphones are nearly in phase, so summing them yields a high-level signal. In contrast, when sound arrives obliquely (in the illustrated example, when the driver's seat user P1 or the front passenger seat user P2 is speaking), the arrival time differs from microphone to microphone and the phases are shifted in the positive or negative direction, so the summed signals cancel one another and yield a low-level signal. Using this principle, the direction of arrival of the sound (that is, the direction in which the speaker is located) can be identified from the levels and phase differences of the signals detected by the microphones. This embodiment uses the beamforming method for that identification.
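The in-phase and out-of-phase summation described here can be illustrated numerically. The sketch below is not the DSP 11 implementation; it is a simplified delay-and-sum model that assumes a single-frequency plane wave, a straight four-microphone array with 5 cm spacing, and hypothetical seat angles of -45, 0, and +45 degrees. All names are illustrative.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steered_level(mic_x, source_deg, steer_deg, freq_hz=2000.0):
    """Normalized magnitude of the delay-and-sum output for a plane-wave tone.

    mic_x: microphone positions along the array axis, in metres.
    Angles are measured from the array's front (broadside) direction.
    """
    total = 0j
    for x in mic_x:
        arrival = x * math.sin(math.radians(source_deg)) / SPEED_OF_SOUND
        compensation = x * math.sin(math.radians(steer_deg)) / SPEED_OF_SOUND
        # Each microphone contributes a phasor whose phase is the residual delay.
        total += cmath.exp(-2j * math.pi * freq_hz * (arrival - compensation))
    return abs(total) / len(mic_x)


def estimate_direction(mic_x, source_deg, candidates):
    """Return the candidate label whose steering angle gives the highest level."""
    return max(
        candidates,
        key=lambda name: steered_level(mic_x, source_deg, candidates[name]),
    )


MICS = [0.00, 0.05, 0.10, 0.15]                                   # assumed geometry
SEAT_ANGLES = {"driver": -45.0, "rear": 0.0, "passenger": 45.0}   # hypothetical

front_level = steered_level(MICS, 0.0, 0.0)     # sound from straight ahead
oblique_level = steered_level(MICS, 45.0, 0.0)  # sound from the side, no steering
speaker_seat = estimate_direction(MICS, -45.0, SEAT_ANGLES)
```

With no steering, the frontal source sums fully in phase while the oblique source partly cancels, which is the effect the paragraph describes. Steering the array toward each candidate seat and keeping the strongest response is one simple way to turn that effect into a direction estimate.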

すなわち、音声認識ユニット１０において、マイクロホンアレイ９の各マイクロホンで検出された信号（アナログ音声信号）は、ＤＳＰ１１の音声入力部１１ａを通して適宜増幅され、デジタル化された後、ビームフォーミング部１１ｂに入力されると共に、ＣＰＵ１２の音声認識処理部１２ｂに入力される。ビームフォーミング部１１ｂでは、入力された信号に基づき方向推定を行ってビーム信号を生成し（ビームフォーミング）、その生成されたビーム信号に基づいて音源方向特定部１１ｃにより、音圧レベルの大きい信号を受信している方向を音源の方向（発話者の居る方向）として特定する。 That is, in the voice recognition unit 10, signals (analog voice signals) detected by each microphone of the microphone array 9 are appropriately amplified through the voice input unit 11a of the DSP 11, digitized, and then input to the beam forming unit 11b. And input to the voice recognition processing unit 12b of the CPU 12. The beam forming unit 11b performs direction estimation based on the input signal to generate a beam signal (beam forming), and based on the generated beam signal, the sound source direction specifying unit 11c generates a signal having a high sound pressure level. The receiving direction is specified as the direction of the sound source (the direction in which the speaker is present).

ＣＰＵ１２では、認識辞書選択部１２ａにより、メモリ部１３に格納されている管理テーブルを参照して、ＨＤＤ８に格納されている複数の専用認識辞書Ｄ１〜Ｄ３の中から、その特定された発話者が視聴している情報のソース（制御対象機器）に対応した専用の音声認識辞書を選択する。次いで音声認識処理部１２ｂでは、その選択された専用認識辞書を使用して、その発話内容（音声コマンド）とその選択された専用認識辞書に含まれる各語彙（コマンド）とを比較照合し、それぞれ合致度を算出する。そして、その算出結果に基づき最も合致度の大きい「語彙」をユーザの発話したコマンドとして決定する。次いで機器制御信号発生部１２ｃでは、その決定されたコマンドを取得し、そのコマンドの内容に応じた機器制御信号を出力する。出力された機器制御信号は、当該制御対象機器に対する操作指示データとして、ＣＰＵ１２によりバス７に送出される。 The CPU 12 refers to the management table stored in the memory unit 13 by the recognition dictionary selection unit 12a and selects the identified speaker from the plurality of dedicated recognition dictionaries D1 to D3 stored in the HDD 8. A dedicated speech recognition dictionary corresponding to the source of information being viewed (control target device) is selected. Next, the speech recognition processing unit 12b uses the selected dedicated recognition dictionary to compare and collate the utterance content (speech command) with each vocabulary (command) included in the selected dedicated recognition dictionary, The degree of match is calculated. Then, based on the calculation result, the “vocabulary” having the highest degree of match is determined as the command spoken by the user. Next, the device control signal generator 12c acquires the determined command and outputs a device control signal corresponding to the content of the command. The output device control signal is sent to the bus 7 by the CPU 12 as operation instruction data for the control target device.
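The selection and matching sequence performed by units 12a and 12b can be sketched as follows. This is an illustration only: the dictionary contents and seat labels are hypothetical, and the text similarity from Python's `difflib.SequenceMatcher` merely stands in for the acoustic degree of match, which the patent does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical management table: seat -> source currently viewed.
MANAGEMENT_TABLE = {"driver": "navigation", "passenger": "radio", "rear": "dvd"}

# Hypothetical dedicated dictionaries, one per source.
DEDICATED_DICTIONARIES = {
    "navigation": ["go home", "zoom in", "zoom out"],
    "radio": ["next station", "volume up", "volume down"],
    "dvd": ["play", "stop", "next chapter"],
}


def match_score(heard: str, word: str) -> float:
    """Stand-in for the degree of match (0.0 to 1.0); not an acoustic score."""
    return SequenceMatcher(None, heard, word).ratio()


def recognize(seat: str, heard: str):
    """Pick the speaker's dictionary, then the best-matching command in it."""
    source = MANAGEMENT_TABLE[seat]                # which device the speaker uses
    vocabulary = DEDICATED_DICTIONARIES[source]    # dedicated dictionary only
    command = max(vocabulary, key=lambda w: match_score(heard, w))
    return source, command


result = recognize("rear", "stopp")  # a slightly garbled "stop" from the rear seat
```

Because only the selected dictionary is searched, words that belong to other devices can never be returned for this speaker, whatever their score would have been.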

以下、本実施形態に係る車載用音声認識装置ＡＲ（図２）においてＣＰＵ１２がＤＳＰ１１と協働して行う発話者の特定及びそれに基づく音声認識辞書の切替選択等に係る処理について、その一例を示す図４を参照しながら説明する。 Hereinafter, an example of processing related to the identification of the speaker and the switching selection of the speech recognition dictionary based on the identification performed by the CPU 12 in cooperation with the DSP 11 in the in-vehicle speech recognition apparatus AR (FIG. 2) according to the present embodiment will be shown. This will be described with reference to FIG.

先ず初期状態として、各座席（運転席、助手席、リア席）のユーザがそれぞれ所望のソース（制御対象機器）の情報を既に視聴しており、音声認識ユニット１０内のＣＰＵ１２により管理テーブル（ユーザが着座している座席と当該座席で視聴している情報のソースとの関係を示す情報）が作成され、メモリ部１３に格納されているものとする。 First, as the initial state, it is assumed that the user in each seat (driver's seat, front passenger seat, rear seat) is already viewing information from a desired source (control target device), and that the CPU 12 in the speech recognition unit 10 has created the management table (information indicating the relationship between each seat in which a user is seated and the source of the information being viewed in that seat) and stored it in the memory unit 13.

この状態で最初のステップＳ１では、ＣＰＵ１２において、マイクロホンアレイ９からＤＳＰ１１（音声入力部１１ａ）を介して発話を検出した（ＹＥＳ）か否（ＮＯ）かを判定する。判定結果がＹＥＳの場合には次のステップＳ２に進み、判定結果がＮＯの場合には発話を検出するまで判定処理を繰り返す。なお、ステップＳ１の処理内容において括弧書きで記載する「発話操作」については後で説明する。 In the first step S1 in this state, the CPU 12 determines whether an utterance has been detected from the microphone array 9 via the DSP 11 (voice input unit 11a) (YES) or not (NO). If the determination result is YES, the process proceeds to the next step S2, and if the determination result is NO, the determination process is repeated until an utterance is detected. The “speech operation” described in parentheses in the processing content of step S1 will be described later.

次のステップＳ２では、ＣＰＵ１２からの制御に基づきＤＳＰ１１において、マイクロホンアレイ９を用いたビームフォーミング法により、その発話を行ったユーザ（発話者）の居る方向（座席）を特定する。つまり、当該発話者を特定する。 In the next step S2, the direction (seat) in which the user (speaker) who made the utterance is present is specified in the DSP 11 by the beamforming method using the microphone array 9 based on the control from the CPU 12. That is, the speaker is specified.

次のステップＳ３では、ＣＰＵ１２において認識辞書選択部１２ａにより、メモリ部１３に格納されている管理テーブルを参照して、ＨＤＤ８に格納されている複数の専用の認識辞書Ｄ１〜Ｄ３の中から、その発話者が視聴している情報のソース（例えば、運転席であればナビゲーションユニット５、助手席であればラジオ受信機１、リア席であればＤＶＤプレーヤ３）に対応した専用の音声認識辞書を選択する。 In the next step S3, the recognition dictionary selection unit 12a in the CPU 12 refers to the management table stored in the memory unit 13, and from among a plurality of dedicated recognition dictionaries D1 to D3 stored in the HDD 8, A dedicated speech recognition dictionary corresponding to the source of information viewed by the speaker (for example, the navigation unit 5 for the driver's seat, the radio receiver 1 for the passenger seat, and the DVD player 3 for the rear seat). select.

次のステップＳ４では、ＣＰＵ１２において音声認識処理部１２ｂにより、その選択された専用認識辞書を使用して、当該発話者の発話内容（音声コマンド）に対する音声認識を実行する。 In the next step S4, the voice recognition processing unit 12b in the CPU 12 executes voice recognition for the utterance content (voice command) of the speaker using the selected dedicated recognition dictionary.

最後のステップＳ５では、ＣＰＵ１２において機器制御信号発生部１２ｃにより、その認識されたコマンド（発話内容）に応じた機器制御信号を出力し、これに対応する制御を当該制御対象機器に対して実行する。その際、ＣＰＵ１２からの制御に基づき、当該制御対象機器の動作状態に係る映像を表示している表示ユニット２５，３１に対して当該発話内容に応じた制御（画面の変更など）を行うと共に、当該制御対象機器の動作状態に係る音声を出力しているスピーカ２７（ワイヤレスヘッドホン３２を含む）に対して当該発話内容に応じた制御（音声の変更など）を行う。 In the final step S5, the device control signal generator 12c of the CPU 12 outputs a device control signal corresponding to the recognized command (utterance content), and the corresponding control is executed on the control target device. At this time, under control from the CPU 12, control according to the utterance content (such as changing the screen) is performed on the display unit 25 or 31 that is displaying video relating to the operating state of the control target device, and control according to the utterance content (such as changing the audio) is performed on the speaker 27 (including the wireless headphones 32) that is outputting audio relating to that operating state.
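Steps S1 to S5 form a short pipeline, which can be sketched as a single function that receives each stage as a callable. The function name, the callables, and the demo data below are hypothetical; the sketch only shows the order of the steps and the early return when no utterance is detected, after which the caller would repeat the check.

```python
def run_once(detect_utterance, locate_speaker, select_dictionary, recognize, send_control):
    """Run steps S1 to S5 once; return (seat, command) or None."""
    utterance = detect_utterance()               # S1: has an utterance been detected?
    if utterance is None:
        return None                              # NO: the caller repeats the check
    seat = locate_speaker(utterance)             # S2: identify the speaker's seat
    dictionary = select_dictionary(seat)         # S3: dictionary for that seat's source
    command = recognize(utterance, dictionary)   # S4: recognition with that dictionary
    if command is not None:
        send_control(seat, command)              # S5: control the target device
    return seat, command


sent = []  # records the control requests issued in S5
outcome = run_once(
    detect_utterance=lambda: {"seat_hint": "rear", "text": "stop"},
    locate_speaker=lambda u: u["seat_hint"],
    select_dictionary=lambda seat: {"rear": ["play", "stop"]}.get(seat, []),
    recognize=lambda u, words: u["text"] if u["text"] in words else None,
    send_control=lambda seat, cmd: sent.append((seat, cmd)),
)
```

Passing the stages in as callables keeps the ordering visible and lets the speaker identification stage be swapped without touching the rest of the flow.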

以上説明したように、第１の実施形態に係る車載用音声認識装置ＡＲによれば、マイクロホンアレイ９を用いたビームフォーミング法（ＤＳＰ１１）により、車室内で発話したユーザ（の居る方向）を特定し、ＣＰＵ１２により、ＨＤＤ８に格納されている複数の専用認識辞書Ｄ１〜Ｄ３の中から、その特定した発話者が視聴している情報のソース（制御対象機器）に対応した専用の音声認識辞書を選択するようにしている。つまり、その発話者が当該制御対象機器に対し音声コマンドとして発する頻度の高い語彙を登録した専用認識辞書を選択するようにしている。 As described above, in the in-vehicle speech recognition apparatus AR according to the first embodiment, the user who spoke in the vehicle interior (more precisely, the direction in which that user is located) is identified by the beamforming method (DSP 11) using the microphone array 9, and the CPU 12 selects, from among the dedicated recognition dictionaries D1 to D3 stored in the HDD 8, the dedicated speech recognition dictionary corresponding to the source of information (control target device) that the identified speaker is viewing. In other words, the apparatus selects a dedicated recognition dictionary in which the vocabulary that the speaker frequently utters as voice commands to that control target device is registered.

これにより、その発話者の発話内容（音声コマンド）を認識するに際し、その選択した専用認識辞書に登録されている語彙のみを認識すればよいので、従来のように音声認識辞書に登録されている全ての語彙に対して音声認識を行う場合と比べて、マッチングしない語彙を誤認識する割合を減らすことができる。つまり、その発話者に適した音声認識を行うことで、音声コマンドに対する認識率を高めることができる。 Thus, when recognizing the utterance content (voice command) of the speaker, only the vocabulary registered in the selected dedicated recognition dictionary needs to be recognized, so that it is registered in the speech recognition dictionary as in the past. Compared with the case where speech recognition is performed for all vocabularies, the rate of misrecognizing vocabulary that does not match can be reduced. That is, the recognition rate for the voice command can be increased by performing voice recognition suitable for the speaker.

例えば、発話者がリア席に着座していた場合、リア席用表示ユニット３１の画面上で再生されているＤＶＤ操作のみに対する音声認識辞書Ｄ２を使用することで、誤認識の割合を減らすことができる。この場合、フロント席用表示ユニット（デュアルディスプレイ）２５の運転席側の画面にナビゲーション情報が表示されていても、リア席での発話操作によりそのナビゲーションの動作に影響を与えることがない。また、発話者が助手席に着座していた場合も、同様である。 For example, when the speaker is seated in the rear seat, the rate of misrecognition can be reduced by using the speech recognition dictionary D2 for only the DVD operation reproduced on the screen of the rear seat display unit 31. it can. In this case, even if the navigation information is displayed on the screen on the driver's seat side of the front seat display unit (dual display) 25, the navigation operation is not affected by the speech operation at the rear seat. The same applies when the speaker is seated in the passenger seat.

＜第２の実施形態（図５参照）＞ <Second Embodiment (see FIG. 5)>
上述した第１の実施形態に係る車載用音声認識装置ＡＲ（図２）では、発話者を特定する手段としてマイクロホンアレイ９を用いたビームフォーミング法（ＤＳＰ１１）により音源の方向（発話者の居る方向）を特定する場合を例にとって説明したが、発話者を特定する手段がこれに限定されないことはもちろんである。例えば、操作指示を音声入力（発話）する際に何らかのスイッチ等を操作し（発話操作）、この発話操作をＣＰＵで検出してその発話者を特定するようにしてもよい。図５はその場合の実施形態に係る車載用音声認識装置の構成を示したものである。 In the in-vehicle speech recognition apparatus AR (FIG. 2) according to the first embodiment described above, the direction of the sound source (the direction in which the speaker is located) is identified by the beamforming method (DSP 11) using the microphone array 9 as the means for identifying the speaker. The means for identifying the speaker is, of course, not limited to this. For example, the user may operate a switch or the like when speaking an operation instruction (an utterance operation), and the CPU may detect this utterance operation to identify the speaker. FIG. 5 shows the configuration of an in-vehicle speech recognition apparatus according to such an embodiment.

この第２の実施形態に係る車載用音声認識装置ＡＲ１（図５）は、第１の実施形態に係る車載用音声認識装置ＡＲ（図２）と比べて、フロント席用及びリア席用の各操作ユニット２０，３０の操作部にそれぞれ発話スイッチ５０を設けた点、マイクロホンアレイ９に代えてマイクロホン９ａを設けた点、ＤＳＰ１１を省略した点、ＣＰＵ１２の代わりにＣＰＵ１４を有し、このＣＰＵ１４が音声入力部１４ａと、発話者特定部１４ｂと、認識辞書選択部１４ｃと、音声認識処理部１４ｄと、機器制御信号発生部１４ｅとを備えている点で相違する。他の構成及びその機能については、第１の実施形態の場合と同じであるのでその説明は省略する。 The in-vehicle speech recognition apparatus AR1 (FIG. 5) according to the second embodiment differs from the apparatus AR (FIG. 2) of the first embodiment in the following respects: an utterance switch 50 is provided on the operation section of each of the front seat and rear seat operation units 20 and 30; a microphone 9a is provided in place of the microphone array 9; the DSP 11 is omitted; and a CPU 14 is provided in place of the CPU 12, the CPU 14 comprising a voice input unit 14a, a speaker identification unit 14b, a recognition dictionary selection unit 14c, a speech recognition processing unit 14d, and a device control signal generator 14e. The other components and their functions are the same as in the first embodiment, and their description is omitted.

また、この第２の実施形態においてＣＰＵ１４が行う発話者の特定及びそれに基づく音声認識辞書の切替選択等に係る処理についても、第１の実施形態に係る処理（図４）と基本的に同じであるのでその説明は省略する。 The processing performed by the CPU 14 in the second embodiment to identify the speaker and to switch the speech recognition dictionary accordingly is also basically the same as the processing of the first embodiment (FIG. 4), and its description is omitted.

この第２の実施形態に係る車載用音声認識装置ＡＲ１においても、上述した第１の実施形態に係る車載用音声認識装置ＡＲにおいて得られた効果と同様の効果を得ることができる。さらに本実施形態では、発話スイッチ５０の操作を検出することで発話者を容易に特定することができるので、マイクロホンアレイ９とＤＳＰ１１を使用して発話者を特定する場合と比べて、構成の簡素化及びコストの低減化を図ることができる。 In the in-vehicle voice recognition device AR1 according to the second embodiment, the same effects as those obtained in the in-vehicle voice recognition device AR according to the first embodiment described above can be obtained. Furthermore, in this embodiment, since the speaker can be easily identified by detecting the operation of the speech switch 50, the configuration is simpler than the case where the speaker is identified using the microphone array 9 and the DSP 11. And cost reduction.
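Identifying the speaker from a switch press needs no signal processing, only a fixed association between switches and seats. The sketch below assumes one hypothetical switch identifier per seat; the patent states only that an utterance switch 50 is provided on the front and rear operation units, so the identifiers and class name here are illustrative.

```python
from typing import Optional

# Hypothetical wiring: which seat each utterance switch belongs to.
SWITCH_TO_SEAT = {"sw_driver": "driver", "sw_passenger": "passenger", "sw_rear": "rear"}


class SwitchSpeakerIdentifier:
    """Remembers the seat of the most recently pressed utterance switch."""

    def __init__(self) -> None:
        self._pending_seat: Optional[str] = None

    def on_switch_pressed(self, switch_id: str) -> None:
        """Called when an utterance switch is operated."""
        self._pending_seat = SWITCH_TO_SEAT.get(switch_id)

    def take_speaker(self) -> Optional[str]:
        """Return the seat for the next utterance and clear the pending state."""
        seat, self._pending_seat = self._pending_seat, None
        return seat


identifier = SwitchSpeakerIdentifier()
identifier.on_switch_pressed("sw_rear")
seat_for_utterance = identifier.take_speaker()
```

Clearing the pending seat after each utterance is a design choice of this sketch, made so that a later utterance without a switch press is not attributed to the previous speaker.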

＜第３の実施形態（図６参照）＞ <Third Embodiment (see FIG. 6)>
上述した第１、第２の実施形態に係る車載用音声認識装置ＡＲ，ＡＲ１（図２、図５）では、ＨＤＤ８に複数の専用の認識辞書Ｄ１〜Ｄ３を用意し、ＤＳＰ１１の機能又は発話スイッチ５０の操作に基づいて特定した座席の発話者が視聴している情報のソース（制御対象機器）に対応させていずれか１つの専用認識辞書を選択する場合を例にとって説明したが、認識辞書を変更する形態は必ずしもこれに限定されない。 In the in-vehicle speech recognition apparatuses AR and AR1 (FIGS. 2 and 5) according to the first and second embodiments described above, a plurality of dedicated recognition dictionaries D1 to D3 are prepared in the HDD 8, and one of them is selected to match the source of information (control target device) being viewed by the speaker in the seat identified by the function of the DSP 11 or the operation of the utterance switch 50. The manner of changing the recognition dictionary is, however, not necessarily limited to this.

上記のように複数の専用認識辞書の中から選択するのではなく、例えば、発話内容に対する音声認識を実行する際に、特定した発話者の視聴している情報のソース（制御対象機器）に応じて認識すべき単語を優先させる「重み付け」を付加し、その「重み付け」が付加された認識単語を当該発話者のコマンドとして認識するようにしてもよい。図６はその場合の音声認識方法の一例を示したものである。 Rather than selecting from a plurality of dedicated recognition dictionaries as described above, for example, when performing speech recognition on utterance content, depending on the source of information (control target device) that the specified speaker is viewing It is also possible to add “weighting” to prioritize the word to be recognized and recognize the recognized word to which the “weighting” is added as a command of the speaker. FIG. 6 shows an example of the speech recognition method in that case.

本実施形態に係る車載用音声認識装置は、特に図示はしないが、基本的に第１、第２の実施形態に係る車載用音声認識装置ＡＲ，ＡＲ１（図２、図５）と同等の構成を有している。構成上相違する点は、ＣＰＵ１２，１４において認識辞書選択部１２ａ，１４ｃに相当する機能ブロックを備えていない点（ただし、メモリ部１３に格納されている管理テーブルは利用する）、ＨＤＤ８に複数の専用の認識辞書Ｄ１〜Ｄ３を用意する代わりに、各ソース（制御対象機器）に共用される１つの音声認識辞書を用意すると共に、各ソース毎にそれぞれ認識すべき語彙（単語）とあらかじめ設定した重み付けとの関係を規定したテーブル（図６のＷＴ１，ＷＴ２）を用意している点である。 Although not specifically illustrated, the in-vehicle speech recognition apparatus according to this embodiment has basically the same configuration as the apparatuses AR and AR1 (FIGS. 2 and 5) of the first and second embodiments. It differs in two respects. First, the CPUs 12 and 14 have no functional block corresponding to the recognition dictionary selection units 12a and 14c (the management table stored in the memory unit 13 is still used). Second, instead of the dedicated recognition dictionaries D1 to D3, the HDD 8 holds a single speech recognition dictionary shared by all sources (control target devices), together with tables (WT1 and WT2 in FIG. 6) that define, for each source, the relationship between the vocabulary (words) to be recognized and preset weights.

この第３の実施形態では、ＣＰＵ１２（１４）において特定された発話者の発話内容に対する音声認識を実行する際に、メモリ部１３に格納されている管理テーブルを参照して当該発話者が視聴している情報のソース（制御対象機器）を特定し、上記のテーブルＷＴ１，ＷＴ２を参照して当該制御対象機器に対応した語彙のみに「重み付け」を付加する。例えば、運転席側と助手席側からマイクロホン９（９ａ）を介してナビゲーション関連の単語「会社」が発話された場合、ＣＰＵ１２（１４）では、図６に示すように運転席側の認識単語「会社」にのみ重み付け（＋１０）を付加することで、運転席側から発話された「会社」を音声コマンドとして認識し、その認識したコマンドに対応する制御をナビゲーションユニット５に対して実行する。また、運転席側と助手席側からマイクロホン９（９ａ）を介してオーディオ関連の単語「停止」が発話された場合には、助手席側の認識単語「停止」にのみ重み付け（＋１０）を付加することにより、助手席側から発話された「停止」を音声コマンドとして認識し、その認識したコマンドに対応する制御をオーディオ機器（ラジオ受信機１、ＤＶＤプレーヤ３など）に対して実行する。 In the third embodiment, when performing speech recognition on the utterance content of the utterer specified by the CPU 12 (14), the utterer views the utterance by referring to the management table stored in the memory unit 13. The information source (control target device) is identified, and “weighting” is added only to the vocabulary corresponding to the control target device with reference to the tables WT1 and WT2. For example, when a navigation-related word “company” is spoken from the driver side and the passenger side via the microphone 9 (9a), the CPU 12 (14) recognizes the recognition word “ By adding weight (+10) only to “company”, “company” spoken from the driver's seat side is recognized as a voice command, and control corresponding to the recognized command is executed on the navigation unit 5. When the audio-related word “stop” is uttered from the driver side and the passenger side via the microphone 9 (9a), only the recognition word “stop” on the passenger side is weighted (+10). Thus, “stop” uttered from the passenger seat side is recognized as a voice command, and control corresponding to the recognized command is executed on the audio device (radio receiver 1, DVD player 3, etc.).
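The weighting approach can be sketched as a score adjustment over one shared dictionary. In the sketch below, the +10 weight and the words "company" and "stop" come from the example in the text; the base scores, the remaining vocabulary, the table layout, and all names are hypothetical, and the model is simplified to one speaker per call.

```python
# One shared dictionary for all sources (contents partly hypothetical).
SHARED_DICTIONARY = ["company", "home", "stop", "play"]

# Hypothetical stand-ins for the weight tables WT1 and WT2: source -> word -> weight.
WEIGHT_TABLES = {
    "navigation": {"company": 10, "home": 10},
    "audio": {"stop": 10, "play": 10},
}

# Hypothetical management table: seat -> source currently viewed.
MANAGEMENT_TABLE = {"driver": "navigation", "passenger": "audio"}


def recognize_weighted(seat, base_scores):
    """Add the speaker's source-specific weights, then pick the top word.

    base_scores: word -> unweighted degree of match (assumed to be given).
    """
    source = MANAGEMENT_TABLE[seat]
    weights = WEIGHT_TABLES[source]
    weighted = {
        word: base_scores.get(word, 0) + weights.get(word, 0)
        for word in SHARED_DICTIONARY
    }
    best = max(weighted, key=weighted.get)
    return source, best, weighted[best]


# The same ambiguous acoustic scores resolve differently depending on the seat.
scores = {"company": 60, "stop": 65}
from_driver = recognize_weighted("driver", scores)
from_passenger = recognize_weighted("passenger", scores)
```

Unlike dictionary switching, this keeps every word recognizable from every seat and only biases the result toward the device the speaker is actually using.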

上述した各実施形態では、車載用音声認識装置ＡＲ（ＡＲ１）を車載Ａ／Ｖ・ナビゲーションシステム４０の一部として組み込んだ場合を例にとって説明したが、本発明の要旨（発話者を特定し、その発話者が視聴している情報のソース（制御対象機器）に対応させて認識辞書を変更（専用の音声認識辞書を選択、又は認識単語の重み付けを変更）し、その変更された辞書を使用して音声認識を実行し、その認識した発話内容に対応する制御を当該制御対象機器に対して行うこと）からも明らかなように、必ずしもＡ／Ｖ機器とナビゲーション装置の両方を含むシステムに組み込んで使用する必要がないことはもちろんである。 In each of the embodiments described above, the in-vehicle speech recognition apparatus AR (AR1) is incorporated as part of the in-vehicle A/V navigation system 40. However, as is clear from the gist of the present invention (identifying the speaker; changing the recognition dictionary to match the source of information (control target device) that the speaker is viewing, either by selecting a dedicated speech recognition dictionary or by changing the weighting of recognition words; performing speech recognition using the changed dictionary; and controlling the control target device according to the recognized utterance content), the apparatus need not, of course, be incorporated into a system that includes both A/V devices and a navigation device.

また、上述した各実施形態では、車室内でユーザが着座している座席と当該座席で視聴している情報のソース（制御対象機器）との関係を示す「管理テーブル」を音声認識ユニット１０内のメモリ部１３に格納する場合を例にとって説明したが、本発明の要旨からも明らかなように、必ずしも音声認識ユニット１０内に保有しておく必要がないことはもちろんである。例えば、その管理テーブルをＨ／Ｕ２０内のメモリ部２４に格納しておき、音声認識ユニット１０内のＣＰＵ１２（１４）が、必要な時にＨ／Ｕ２０内の制御部２１と協働して、メモリ部２４（管理テーブル）を参照するようにしてもよい。 In each of the embodiments described above, the "management table" indicating the relationship between the seat in which a user is seated and the source of information (control target device) being viewed in that seat is stored in the memory unit 13 in the speech recognition unit 10. As is clear from the gist of the present invention, however, the table need not be held in the speech recognition unit 10. For example, the management table may be stored in the memory unit 24 in the H/U 20, and the CPU 12 (14) in the speech recognition unit 10 may refer to the memory unit 24 (management table) when necessary, in cooperation with the control unit 21 in the H/U 20.

また、上述した各実施形態では、リア席用にワイヤレスヘッドホン３２を備えた場合を例にとって説明したが、かかる「ワイヤレス」タイプのものに限定されず、ジャック付きのヘッドホンを使用した場合にも本発明は同様に適用することができる。この場合、ヘッドホンは対応する表示ユニット３１とジャックを介して有線接続されることになる。 Further, in each of the above-described embodiments, the case where the wireless headphone 32 is provided for the rear seat has been described as an example. However, the present invention is not limited to such a “wireless” type, and this headphone is also used when a headphone with a jack is used. The invention is equally applicable. In this case, the headphones are connected to the corresponding display unit 31 via a jack.

また、上述した各実施形態では、地図データ及び音声認識辞書を格納する記憶媒体としてＨＤＤ８を使用しているが、これに代えて、ＤＶＤドライブ（ＤＶＤ−ＲＯＭ）やＣＤドライブ（ＣＤ−ＲＯＭ）等の他の記憶媒体を使用してもよい。 In each of the above-described embodiments, the HDD 8 is used as a storage medium for storing the map data and the voice recognition dictionary. Instead, a DVD drive (DVD-ROM), a CD drive (CD-ROM), or the like is used. Other storage media may be used.

本発明の一実施形態に係る車載用音声認識装置を組み込んだ車載オーディオ／ビデオ（Ａ／Ｖ）・ナビゲーションシステムの構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of an in-vehicle audio/video (A/V) navigation system incorporating an in-vehicle speech recognition apparatus according to an embodiment of the present invention.
第１の実施形態に係る車載用音声認識装置の構成を一部模式的に示すブロック図である。 FIG. 2 is a partly schematic block diagram showing the configuration of the in-vehicle speech recognition apparatus according to the first embodiment.
図２の車載用音声認識装置においてマイクロホンアレイを用いたビームフォーミング法により音源の方向（発話者の居る方向）を特定する方法を説明するための図である。 FIG. 3 is a diagram for explaining how the apparatus of FIG. 2 identifies the direction of a sound source (the direction in which the speaker is located) by the beamforming method using a microphone array.
図２の車載用音声認識装置において行う発話者の特定及びそれに基づく音声認識辞書の切替選択等に係る処理の一例を示すフロー図である。 FIG. 4 is a flowchart showing an example of the processing performed in the apparatus of FIG. 2 to identify the speaker and to switch the speech recognition dictionary accordingly.
第２の実施形態に係る車載用音声認識装置の構成を一部模式的に示すブロック図である。 FIG. 5 is a partly schematic block diagram showing the configuration of the in-vehicle speech recognition apparatus according to the second embodiment.
第３の実施形態に係る車載用音声認識装置において行う音声認識の方法を説明するための図である。 FIG. 6 is a diagram for explaining the speech recognition method performed in the in-vehicle speech recognition apparatus according to the third embodiment.

Explanation of symbols

１〜６…発話者が視聴している情報のソース（制御対象機器）、
８…ＨＤＤ（辞書格納手段）、
９…マイクロホンアレイ（音声入力手段）、
９ａ…マイクロホン（音声入力手段）、
１０…音声認識ユニット、
１１…ＤＳＰ（発話者特定手段）、
１２，１４…ＣＰＵ（制御手段）、
１３…メモリ部（テーブル格納手段）、
２０，３０…操作ユニット、
２５，３１…表示ユニット（表示手段）、
２７…スピーカ（音声出力手段）、
３２…ヘッドホン（音声出力手段）、
４０…車載オーディオ／ビデオ（Ａ／Ｖ）・ナビゲーションシステム、
５０…発話スイッチ（発話者特定手段）、
ＡＲ，ＡＲ１…車載用音声認識装置、
Ｄ１，Ｄ２，Ｄ３…（各制御対象機器に対応した）音声認識辞書、
Ｐ１，Ｐ２，Ｐ３…車室内の乗員（ユーザ）、
ＷＴ１，ＷＴ２…認識単語と重み付けとの関係を規定したテーブル。

1-6 ... Sources of information being viewed by the speaker (control target devices),
8 ... HDD (dictionary storage means),
9 ... Microphone array (voice input means),
9a ... Microphone (voice input means),
10 ... Speech recognition unit,
11 ... DSP (speaker identifying means),
12, 14 ... CPU (control means),
13 ... Memory unit (table storage means),
20, 30 ... Operation units,
25, 31 ... Display units (display means),
27 ... Speaker (audio output means),
32 ... Headphones (audio output means),
40 ... In-vehicle audio/video (A/V) navigation system,
50 ... Utterance switch (speaker identifying means),
AR, AR1 ... In-vehicle speech recognition apparatus,
D1, D2, D3 ... Speech recognition dictionaries (one for each control target device),
P1, P2, P3 ... Occupants (users) in the vehicle interior,
WT1, WT2 ... Tables defining the relationship between recognition words and weights.

Claims

An in-vehicle speech recognition apparatus comprising:
voice input means for inputting, by voice, information instructed by a user in a vehicle interior;
speaker identifying means for identifying the user who has spoken via the voice input means;
dictionary storage means storing one speech recognition dictionary in which vocabulary relating to the operation instructions of each of a plurality of control target devices is registered, and a first table that defines the relationship between the vocabulary to be recognized for each control target device and preset weights;
memory means storing a second table that defines the relationship between a seat in which a user is seated and the control target device that is the source of the information being viewed by the user in that seat; and
control means operably connected to the voice input means, the speaker identifying means, the dictionary storage means, and the memory means,
wherein, when the control means identifies a speaker in cooperation with the speaker identifying means, the control means refers to the speech recognition dictionary and the first and second tables, adds a predetermined weight to the vocabulary corresponding to the control target device that is the source of the information the speaker is viewing, performs speech recognition on the speaker's utterance content by referring to the weighted vocabulary, and performs control according to the recognized utterance content on the control target device.

The in-vehicle speech recognition apparatus according to claim 1, wherein the voice input means is a microphone array installed at a predetermined location in the vehicle interior, and
the speaker identifying means includes means for performing direction estimation based on the signals detected by the microphones of the microphone array to generate a beam signal, and means for identifying, based on the generated beam signal, the direction from which a signal with a high sound pressure level is being received as the direction in which the speaker is located.

The in-vehicle speech recognition apparatus according to claim 1, wherein the voice input means is a microphone installed at a predetermined location in the vehicle interior, and
the speaker identifying means is an utterance switch operated when a user speaks via the microphone.

The in-vehicle speech recognition apparatus according to claim 1, further comprising a plurality of display means provided corresponding to the respective seats in the vehicle interior,
wherein, when performing control according to the recognized utterance content on the control target device, the control means performs control according to the utterance content on the display means that is displaying video relating to the operating state of the control target device.

The in-vehicle speech recognition apparatus according to claim 4, further comprising a plurality of audio output means operably connected to the plurality of display means,
wherein, when performing control according to the recognized utterance content on the control target device, the control means performs control according to the utterance content on the audio output means that is outputting audio relating to the operating state of the control target device.

A speech recognition method for an in-vehicle speech recognition apparatus having a function of performing speech recognition on operation instructions for control target devices spoken by a user in a vehicle interior, the method comprising:
storing, in storage means, one speech recognition dictionary in which vocabulary relating to the operation instructions of each of a plurality of control target devices is registered, a first table that defines the relationship between the vocabulary to be recognized for each control target device and preset weights, and a second table that defines the relationship between a seat in which a user is seated and the control target device that is the source of the information being viewed by the user in that seat;
identifying the speaker when an utterance is detected;
referring to the speech recognition dictionary and the first and second tables, and adding a predetermined weight to the vocabulary corresponding to the control target device that is the source of the information the speaker is viewing;
performing speech recognition on the speaker's utterance content by referring to the weighted vocabulary; and
performing control according to the recognized utterance content on the control target device.

The speech recognition method according to claim 6, wherein a speech is detected by a microphone array, and when the speech is detected, a direction in which the speaker is present is specified by a beam forming method using the microphone array.

The speech recognition method according to claim 6, wherein an utterance is detected by a microphone and the speaker is specified based on an operation of an utterance switch.

The speech recognition method according to claim 6, wherein, when control according to the recognized utterance content is performed on the control target device, control according to the utterance content is performed on display means that is displaying video relating to the operating state of the control target device.

The speech recognition method according to claim 6, wherein, when control according to the recognized utterance content is performed on the control target device, control according to the utterance content is performed on audio output means that is outputting audio relating to the operating state of the control target device.