JP2006195302A

JP2006195302A - Speech recognition system and vehicle equipped with the speech recognition system

Info

Publication number: JP2006195302A
Application number: JP2005008534A
Authority: JP
Inventors: Yoichi Kitano; 陽一北野
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2005-01-17
Filing date: 2005-01-17
Publication date: 2006-07-27

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition system that enables robust speech recognition and improves processing efficiency even while the state of a moving body changes. SOLUTION: The speech recognition system 10 has first to fourth grammar dictionaries 62a to 62d and a language dictionary 64 for recognizing speech data input from a microphone 20. These dictionaries are used selectively based on the vehicle speed V from a vehicle speed sensor 42, the current position P from a GPS device 44, the yaw rate Y from a yaw rate sensor 46, and the heart rate H from a heart rate sensor 34. Consequently, whichever of the grammar dictionaries 62a to 62d or the language dictionary 64 suits the state of the vehicle 12 can be selected, which improves the processing efficiency of speech recognition while keeping it robust. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech recognition system to which speaker information input means, such as a microphone that inputs information on a speaker at the time of utterance, is connected, and to a vehicle equipped with the speech recognition system. More particularly, it relates to a speech recognition system that performs speech recognition robustly even while the state of a moving body changes, and to a vehicle equipped with that system.

In recent years, vehicles have come to carry many electronic devices, and the functions of these devices are becoming increasingly sophisticated. Against this background, speech recognition systems that allow remote operation by voice have been developed to make electronic devices mounted on a vehicle, such as a navigation system, easier to operate.

In speech recognition systems generally, not only those for vehicles, there is a demand to improve processing efficiency while reliably recognizing what the speaker says. To meet this demand, a speech recognition system has been proposed that selects the reference data to be searched (a speech recognition dictionary) according to the speaker's utterance time (see, for example, Patent Document 1).

Patent Document 1: JP 2000-99077 A

The above speech recognition system can be expected to recognize speech reproducibly as long as the environment in which the speaker talks remains constant. In a moving body such as a vehicle, however, the environment surrounding speech recognition changes markedly with the traveling state and the place being traveled. As a result, even when the same speaker utters the same words, the speed and clarity of the utterance vary. The above speech recognition system may therefore suffer reduced recognition performance depending on the state of the moving body.

The present invention has been made in view of this problem. Its object is to provide a speech recognition system that enables robust speech recognition and improves processing efficiency even while the state of a moving body changes, and a vehicle equipped with this speech recognition system.

A speech recognition system according to the present invention is connected to speaker information input means for inputting information on a speaker at the time of utterance and to state detection means for detecting the state of a moving body. The system comprises: a plurality of speech recognition dictionaries; input time measuring means for measuring the input time of the speaker information input to the speaker information input means; and speech recognition dictionary selecting means for selecting a speech recognition dictionary based on the state of the moving body detected by the state detection means and the input time measured by the input time measuring means.

According to this invention, the speech recognition dictionary is selected based on the state of the moving body detected by the state detection means and the input time of the speaker information measured by the input time measuring means. An appropriate speech recognition dictionary can therefore be selected for the current state of the moving body, improving the efficiency of speech recognition while keeping it robust.

Here, the speaker information input means is not particularly limited as long as it inputs information on the speaker at the time of utterance. Various configurations can be adopted, for example: voice information input means alone, such as a microphone that inputs the speaker's voice; image information input means alone, such as a camera that captures images of the speaker, for instance the movement of the occupant's mouth (so-called lip movement); a combination of voice and image information input means; a combination of several voice information input means; or a combination of several image information input means.

The moving body may be a vehicle, a ship, an amphibious vehicle, a pleasure boat, a helicopter, an airplane, or the like.

The state of the moving body means a state of the moving body that is relevant to speech recognition, that is, a state detected by sensor equipment mounted on the moving body as the state detection means, such as a vehicle speed sensor, a GPS antenna, a yaw rate sensor, or a heart rate sensor.

A speech recognition dictionary holds reference data used for speech recognition. Examples include a grammar dictionary, which holds reference data for grammar-based recognition, and a language dictionary, which holds reference data for free speech recognition. Grammar-based recognition performs speech recognition according to rules (a grammar) required by the system, such as single words (e.g. "gas station") and fixed phrases (e.g. "turn on the radio"). Free speech recognition is a method in which a computer recognizes free-form utterances (e.g. "Hmm, is there a good restaurant around here?") while conducting a dialogue, that is, responding appropriately to the speaker.

Another speech recognition system according to the present invention is likewise connected to speaker information input means for inputting information on a speaker at the time of utterance and to state detection means for detecting the state of a moving body. It comprises: a plurality of speech recognition dictionaries; input time measuring means for measuring the input time of the speaker information input to the speaker information input means; input time correcting means for correcting the input time measured by the input time measuring means based on the state of the moving body detected by the state detection means; and speech recognition dictionary selecting means for selecting a speech recognition dictionary based on the input time corrected by the input time correcting means.

According to this invention, the input time of the speaker information measured by the input time measuring means is corrected based on the state of the moving body detected by the state detection means. Even if the state of the moving body changes, the speech recognition dictionary can therefore be selected appropriately from the corrected input time, improving the efficiency of speech recognition while keeping it robust.

Here, the speech recognition system preferably switches the speech recognition method according to the input time, either before or after correction.

This allows the speech recognition method to be switched automatically, without any manual operation by the occupant. An appropriate method can therefore be selected even if the occupant does not know which speech recognition method to use, and an occupant who does know is spared the trouble of switching methods manually.

Grammar-based recognition and free speech recognition, described above, are examples of speech recognition methods that can be used.

Furthermore, the input time measured by the input time measuring means can be corrected by converting the state of the moving body into coefficients and multiplying the input time by these coefficients.

In this way, even if the state of the moving body changes, the speech input time can be corrected to compensate for that change. Speech recognition dictionaries then only need to be prepared for each speech input time under a reference state of the moving body, and the dictionary can be selected efficiently.

A vehicle according to the present invention includes any of the speech recognition systems described above.

According to this invention, the speech recognition dictionary is selected based on the state of the moving body detected by the state detection means and the input time of the speaker information measured by the input time measuring means. An appropriate speech recognition dictionary can therefore be selected for the state of the moving body, which improves the processing efficiency of speech recognition while keeping it robust.

Embodiments of the speech recognition system according to the present invention, and of a vehicle equipped with it, are described below with reference to the accompanying FIGS. 1 to 7.

As shown in FIG. 1, a speech recognition system 10 according to one embodiment of the present invention is mounted on a vehicle 12 (moving body).

The speech recognition system 10 is connected, via an in-vehicle communication network 18, to a microphone (speaker information input means) 20 that inputs the voice of an occupant (speaker) 14. The microphone 20 is located inside the cabin near the boundary between the roof and the windshield glass. It may instead be located elsewhere, for example on the underside of the roof, on the instrument panel, on a headrest, at the occupant's shoulder, or in a headset.

A navigation system 28, an engine controller 30, a panel operation unit 32, a heart rate sensor 34, and a perspiration sensor 36 are connected to the in-vehicle communication network 18, and these devices can exchange data with one another over it. As described later, the heart rate sensor 34 and the perspiration sensor 36 can serve as the state detection means of the present invention.

A GPS (Global Positioning System) device 44 is connected to the navigation system 28, which can thereby acquire the coordinates of the current position P of the vehicle 12.

The engine controller 30 controls an engine 40 and can detect the vehicle speed V of the vehicle 12 by means of a vehicle speed sensor 42. The engine controller 30 can also obtain the yaw rate Y of the vehicle 12 from a yaw rate sensor 46, scaled to lie in the range 0 to 100.

The panel operation unit 32 has switches and buttons operated by the occupant 14, from which the navigation system 28 can be controlled.

As shown in FIG. 2, the heart rate sensor 34 and the perspiration sensor 36 are mounted on the steering wheel 22 of the vehicle 12. The heart rate sensor 34 measures the heart rate of the occupant 14 through the occupant's hands and outputs the result to the speech recognition system 10 as a digital signal. The perspiration sensor 36 measures the amount of perspiration on the hands of the occupant 14 and likewise outputs the result to the speech recognition system 10 as a digital signal.

The heart rate sensor 34 and the perspiration sensor 36 need only be able to measure the heart rate and the perspiration amount of the occupant 14, respectively; their placement and performance can be changed according to design requirements.

As shown in FIG. 3, the speech recognition system 10 includes an A/D conversion unit 50, a temporary storage unit 52, a feature extraction unit 54, a speech pattern recognition unit 56, an acoustic dictionary 58, and a speech recognition dictionary storage unit 60.

The A/D conversion unit 50 converts the analog speech signal of the occupant 14 input from the microphone 20 into digital speech data, specifically by means of a quantization technique such as PCM (pulse code modulation).
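As an illustration of the quantization step only, the following is a minimal sketch of uniform linear PCM. It assumes the analog signal has already been sampled and normalized to the range -1.0 to 1.0; the function name `pcm_quantize` and the 16-bit default depth are illustrative choices, not details from the text.

```python
def pcm_quantize(samples, bits=16):
    """Uniform linear PCM: map samples normalized to [-1.0, 1.0] onto signed integer codes."""
    full_scale = (1 << (bits - 1)) - 1  # 32767 for 16-bit PCM
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))      # clip anything outside the normalized range
        codes.append(round(x * full_scale))  # nearest quantization level
    return codes


print(pcm_quantize([0.0, 0.25, -1.0, 1.2]))  # [0, 8192, -32767, 32767]
```

Clipping before rounding keeps an over-range input (1.2 above) at full scale instead of wrapping around.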

The temporary storage unit 52 temporarily holds the digital speech data converted by the A/D conversion unit 50 as temporarily stored data. A commercially available memory, for example, can serve as the temporary storage unit 52.

The feature extraction unit 54 extracts characteristic portions of the data held in the temporary storage unit 52 as feature extraction data, for example by analyzing the frequency content of the stored data with a fast Fourier transform (FFT).
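A minimal sketch of frame-wise frequency analysis follows. For readability it computes a direct discrete Fourier transform (DFT), which yields the same magnitudes an FFT would, only more slowly. The frame length, hop size, Hann window, and function names are assumptions for illustration, not parameters given in the text.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """|X[k]| for k = 0 .. N/2 via a direct DFT (an FFT computes the same values faster)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def extract_features(samples, frame_len=8, hop=4):
    """Split samples into overlapping Hann-windowed frames; return one spectrum per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1)) for t in range(frame_len)]
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        features.append(magnitude_spectrum(frame))
    return features
```

A cosine completing two cycles in an 8-sample frame produces its peak in bin 2, with magnitude N/2 = 4.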

The speech pattern recognition unit 56 recognizes speech patterns in the feature extraction data produced by the feature extraction unit 54, using a speech recognition dictionary in the speech recognition dictionary storage unit 60 together with the acoustic dictionary 58. Pattern matching or statistical methods, for example, can be used for this. One statistical method is the hidden Markov model (HMM), which models speech as a process over a finite set of probabilistic states; once its speech models have been trained, an HMM can recognize speech with high probability.
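The text names the HMM approach but gives no implementation details. The sketch below shows the textbook forward algorithm for a discrete-observation HMM, together with a hypothetical `recognize` helper that chooses the word model with the highest likelihood. It is not the patent's implementation: practical recognizers use continuous output densities and log probabilities, and the models in the usage example are made up.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """P(obs | model) by the forward algorithm for a discrete-observation HMM.

    start_p[i]    : probability of starting in state i
    trans_p[i][j] : probability of moving from state i to state j
    emit_p[i][o]  : probability that state i emits observation symbol o
    """
    n_states = len(start_p)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans_p[i][j] for i in range(n_states)) * emit_p[j][o]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, word_models):
    """Return the word whose (start, transition, emission) model best explains obs."""
    return max(word_models, key=lambda w: forward_likelihood(obs, *word_models[w]))


# Hypothetical single-state word models over a two-symbol alphabet
models = {"a": ([1.0], [[1.0]], [[0.9, 0.1]]), "b": ([1.0], [[1.0]], [[0.1, 0.9]])}
print(recognize([0, 0], models))  # a
```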

The acoustic dictionary 58 holds, as a set of reference acoustic patterns, the acoustic features of each small unit of human speech (phoneme). The speech pattern recognition unit 56 uses these reference acoustic patterns to compute the acoustic features of the feature extraction data.

The speech recognition dictionary storage unit 60 contains first to fourth grammar dictionaries 62a to 62d, which hold reference data for grammar-based recognition, and a language dictionary 64, which holds reference data for free speech recognition.

As shown in FIG. 4, the first to fourth grammar dictionaries 62a to 62d hold reference data for grammar-based recognition, organized by the speech input time of the occupant 14. For example, the speech pattern of each word or fixed phrase is stored as reference data according to its pronunciation time.

The language dictionary 64 holds reference data for free speech recognition. It can be built as a collection of speech recognition dictionaries, for example grammar dictionaries that, like the first to fourth grammar dictionaries 62a to 62d, hold many speech patterns for individual words and fixed phrases, together with a speech recognition dictionary (rejection dictionary) that holds speech patterns irrelevant to recognition, such as interjections (e.g. "um", "hmm"). There need not be only one language dictionary 64; several may be provided according to, for instance, the state of the vehicle 12.

Returning to FIG. 3, the speech recognition system 10 further includes an input time measuring unit 66 and a speech recognition control unit 68.

The input time measuring unit 66 measures the speech input time of the occupant 14 from the data held in the temporary storage unit 52. Specifically, a predetermined volume threshold is set, and the length of time for which the data continuously exceed this threshold is measured and taken as the speech input time.
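A minimal sketch of the threshold-based measurement follows. Raw speech samples cross zero constantly, so this sketch assumes the "volume" is the RMS level of short fixed-length frames (20 ms at 8 kHz by default) and reports the longest continuous run of frames above the threshold. The frame size, the use of RMS, and the function name are assumptions, not details given in the text.

```python
import math


def measure_input_time(samples, sample_rate, threshold, frame_len=160):
    """Longest continuous stretch, in seconds, of frames whose RMS level exceeds threshold."""
    longest = current = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a quiet frame ends the current run
    return longest * frame_len / sample_rate


# 0.2 s of silence, 0.9 s of "speech", 0.2 s of silence at 8 kHz
signal = [0.0] * 1600 + [0.5] * 7200 + [0.0] * 1600
print(measure_input_time(signal, 8000, threshold=0.1))  # 0.9
```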

The speech recognition control unit 68 carries out speech recognition by issuing various commands to the speech pattern recognition unit 56, the input time measuring unit 66, and the sensor devices (vehicle speed sensor 42, GPS device 44, yaw rate sensor 46, heart rate sensor 34, and so on). Its functions are described later.

Next, the procedure by which the speech recognition system 10 configured as above recognizes the speech of the occupant 14 is described.

In step S1 of FIG. 5, when the speech recognition control unit 68 detects a speech recognition start input from the occupant 14 to the navigation system 28, it turns on the microphone 20. When the occupant 14 then speaks, the microphone 20 detects the speech and outputs a corresponding analog speech signal. This signal is converted into digital speech data by the A/D conversion unit 50 and held temporarily in the temporary storage unit 52.

Next, in step S2, the speech recognition control unit 68 causes the input time measuring unit 66 to measure the input time of the data held in the temporary storage unit 52. The measured input time (the uncorrected input time T1) is output to the speech recognition control unit 68 as digital data.

In step S3, the speech recognition control unit 68 reads the signals from the sensor devices: the vehicle speed V from the vehicle speed sensor 42, the current position P from the GPS device 44, the yaw rate Y (scaled to the range 0 to 100) from the yaw rate sensor 46, and the heart rate H from the heart rate sensor 34. Based on these readings, it corrects the uncorrected input time T1 measured by the input time measuring unit 66 to obtain the corrected input time T2, using the following equation (1).

$$T_2 = T_1 \times C_v \times C_P \times C_Y \times C_H \qquad (1)$$

Here, $C_v$ is the time correction coefficient for the vehicle speed V, $C_P$ the coefficient for the current position P, $C_Y$ the coefficient for the yaw rate Y, and $C_H$ the coefficient for the heart rate H. Each coefficient is obtained from the relationships shown in FIG. 6.

For example, with an uncorrected input time T1 = 0.9 s, a vehicle speed V = 62 km/h, current position P = mountainous area, yaw rate Y = 20, and heart rate H = standard, equation (2) gives a corrected input time T2 ≈ 1.04 s.

$$T_2 = \underbrace{0.9}_{T_1} \times \underbrace{1.05}_{C_v} \times \underbrace{1.1}_{C_P} \times \underbrace{1.0}_{C_Y} \times \underbrace{1.0}_{C_H} \approx 1.04 \qquad (2)$$
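Equation (1) is a plain product, so the worked example can be checked in a few lines. The function name and the dictionary of coefficients are illustrative; the coefficient values are the ones used in equation (2).

```python
import math


def corrected_input_time(t1, coefficients):
    """Equation (1): T2 = T1 x C_v x C_P x C_Y x C_H."""
    return t1 * math.prod(coefficients.values())


coeffs = {"C_v": 1.05, "C_P": 1.1, "C_Y": 1.0, "C_H": 1.0}
t2 = corrected_input_time(0.9, coeffs)
print(f"T2 = {t2:.2f} s")  # T2 = 1.04 s
```

The unrounded product is 1.0395 s, which rounds to the 1.04 s stated in the text.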

In FIG. 6, the correction coefficient increases with the vehicle speed V. The reasoning is that the higher the vehicle speed V, the more tense the occupant 14 becomes and the faster the occupant 14 speaks.

The same applies to the current position P, the yaw rate Y, and the heart rate H: each correction coefficient is set to increase as the tension of the occupant 14 rises.

For the current position P in FIG. 6, whether the vehicle is in an urban area, in a mountainous area, or on an expressway is determined by comparing the position information (latitude, longitude, etc.) contained in the signals the GPS device 44 receives from satellites with map information stored in advance in the navigation system 28.

For the heart rate H, a standard value is set in advance in the speech recognition control unit 68, which compares the heart rate H from the heart rate sensor 34 with this standard value to decide how to apply the correction.
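The text gives only the direction of each FIG. 6 relationship and the four values used in the worked example. The sketch below shows one way a controller might turn sensor readings into coefficients, assuming step-shaped tables. Only 62 km/h → 1.05, mountainous area → 1.1, Y = 20 → 1.0, and standard heart rate → 1.0 come from the text. Every other breakpoint and coefficient, the 10 % heart-rate tolerance, and all names are hypothetical.

```python
import bisect

# Hypothetical step tables (FIG. 6 is not reproduced here); band i covers [breaks[i-1], breaks[i]).
SPEED_BREAKS, SPEED_COEFFS = [40, 80], [1.0, 1.05, 1.1]  # vehicle speed in km/h
YAW_BREAKS, YAW_COEFFS = [30, 60], [1.0, 1.05, 1.1]      # yaw rate scaled to 0-100
LOCATION_COEFFS = {"urban": 1.05, "mountainous": 1.1, "expressway": 1.15}


def step_lookup(value, breaks, coeffs):
    """Return the coefficient of the band that value falls into."""
    return coeffs[bisect.bisect_right(breaks, value)]


def heart_rate_coeff(h, standard, tolerance=0.1):
    """Compare H with the preset standard; a clearly elevated rate is treated as higher tension."""
    return 1.1 if h > standard * (1 + tolerance) else 1.0


def state_coefficients(v, location, y, h, h_standard):
    return {
        "C_v": step_lookup(v, SPEED_BREAKS, SPEED_COEFFS),
        "C_P": LOCATION_COEFFS[location],
        "C_Y": step_lookup(y, YAW_BREAKS, YAW_COEFFS),
        "C_H": heart_rate_coeff(h, h_standard),
    }


print(state_coefficients(62, "mountainous", 20, 70, 70))
```

The worked example's inputs (with an assumed standard heart rate of 70 bpm) reproduce the coefficients of equation (2).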

In step S4, the speech recognition control unit 68 determines the recognition method (recognition dictionary) from the corrected input time T2 calculated in step S3. As shown in FIG. 7, if T2 is less than 2 seconds, grammar-based recognition is performed using one of the first to fourth grammar dictionaries 62a to 62d; if T2 is 2 seconds or more, free speech recognition is performed using the language dictionary 64. In the above example, T2 ≈ 1.04 s, which is less than 2 seconds, so grammar-based recognition is selected. Of the first to fourth grammar dictionaries 62a to 62d, the third grammar dictionary 62c, which corresponds to the corrected input time T2, is selected as the speech recognition dictionary (see FIG. 4).
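The FIG. 7 rule can be sketched as follows. The 2-second boundary and the example result (T2 ≈ 1.04 s selects the third grammar dictionary 62c) come from the text. The band edges for the four grammar dictionaries (0.5 s, 1.0 s, 1.5 s) are hypothetical placeholders, chosen only to agree with that example, because FIG. 4 is not reproduced here.

```python
import bisect

FREE_SPEECH_LIMIT = 2.0          # seconds: at or above this, free speech recognition
GRAMMAR_EDGES = [0.5, 1.0, 1.5]  # hypothetical band edges for 62a..62d
GRAMMAR_DICTS = ["62a", "62b", "62c", "62d"]


def select_dictionary(t2):
    """Return (recognition method, dictionary reference numeral) for corrected input time T2."""
    if t2 >= FREE_SPEECH_LIMIT:
        return ("free speech", "64")
    return ("grammar", GRAMMAR_DICTS[bisect.bisect_right(GRAMMAR_EDGES, t2)])


print(select_dictionary(1.04))  # ('grammar', '62c')
```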

In step S5, the speech recognition control unit 68 notifies the speech pattern recognition unit 56 of the selected speech recognition dictionary (the third grammar dictionary 62c in the above example).

In step S6, the speech pattern recognition unit 56 recognizes the feature extraction data produced by the feature extraction unit 54, using the third grammar dictionary 62c notified by the speech recognition control unit 68. Recognition uses the methods described above, such as pattern matching or statistical methods.

In step S7, the speech pattern recognition unit 56 outputs the recognition result to the speech recognition control unit 68, which forwards it to the navigation system 28 via the in-vehicle communication network 18 and ends the speech recognition process. The navigation system 28 then performs navigation based on the received result.

As described above, the speech recognition system 10 according to this embodiment selects the speech recognition dictionary used to recognize speech input from the microphone 20 based on the state of the vehicle 12 (the vehicle speed V from the engine controller 30, the yaw rate Y from the yaw rate sensor 46, and so on). An appropriate speech recognition dictionary can therefore be selected for the state of the vehicle 12, improving the efficiency of speech recognition while keeping it robust.

The speech recognition system 10 also corrects the speech input time (uncorrected input time T1) based on the state of the vehicle 12. Even if the state of the vehicle 12 changes, the speech recognition dictionary can therefore be selected appropriately based on the corrected speech input time (corrected input time T2) and the state of the vehicle 12. This improves the efficiency of speech recognition while keeping it robust.

Furthermore, the speech recognition system 10 chooses between grammar-based recognition and free speech recognition according to the speech input time (in this embodiment, the corrected input time T2). An appropriate method can therefore be selected even if the occupant 14 does not know which speech recognition method to use, and an occupant who does know is spared the effort of switching manually.

In addition, the speech recognition system 10 corrects the speech input time by converting the state of the vehicle 12 into coefficients and multiplying the speech input time (uncorrected input time T1) by them. Even if the state of the vehicle 12 changes, the speech input time can thus be corrected to compensate for the change. Speech recognition dictionaries therefore only need to be prepared for each speech input time under a reference state of the vehicle 12, and the dictionary can be selected efficiently.

The present invention is not limited to the embodiment described above, and various configurations can of course be adopted based on the contents of this specification.

For example, although the vehicle 12 is used as the moving body in the above embodiment, the moving body is not limited to a vehicle. It may be a ship, an amphibious vehicle, a pleasure boat, a helicopter, an airplane, or the like.

In the above embodiment, the state of the vehicle 12 is used to correct the speech input time. However, any approach that selects the speech recognition dictionary based on both the state of the vehicle 12 and the speech input time may be used. For example, a correspondence table mapping each combination of vehicle state and speech input time to a speech recognition dictionary may be created and stored in advance, and the dictionary selected from that table. Alternatively, a default speech recognition dictionary may be assigned to each speech input time. Depending on the state of the vehicle 12, the selection is then shifted to a dictionary with higher recognition performance or to one with higher processing efficiency.
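
The correspondence-table alternative can be sketched as a lookup in a table prepared in advance. In this minimal illustration, the following are all hypothetical: the state buckets, the 5 km/h and 2-second boundaries, the table contents, and the names `CORRESPONDENCE_TABLE`, `state_bucket`, `time_bucket` and `lookup_dictionary`. The specification does not say how states or times are discretized, and only vehicle speed V is used as the state here.

```python
# Hypothetical correspondence table prepared in advance:
# (vehicle-state bucket, input-time bucket) -> speech recognition dictionary.
CORRESPONDENCE_TABLE = {
    ("stopped", "short"): "grammar_62a",
    ("stopped", "long"): "language_64",
    ("moving", "short"): "grammar_62b",
    ("moving", "long"): "grammar_62d",
}


def state_bucket(speed_kmh: float) -> str:
    """Discretize the vehicle state; only vehicle speed V is used (assumption)."""
    return "stopped" if speed_kmh < 5 else "moving"


def time_bucket(t1: float) -> str:
    """Discretize the uncorrected input time T1 (hypothetical 2 s boundary)."""
    return "short" if t1 <= 2.0 else "long"


def lookup_dictionary(speed_kmh: float, t1: float) -> str:
    """Select a dictionary directly from the state and input time, no correction."""
    return CORRESPONDENCE_TABLE[(state_bucket(speed_kmh), time_bucket(t1))]
```

Unlike the coefficient approach, no correction step is needed here. The vehicle state enters only as part of the table key, at the cost of preparing one table entry for every combination.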

In the above embodiment, the vehicle speed V, current position P, yaw rate Y and heart rate H are used as the state of the vehicle 12, but the invention is not limited to this combination. For example, the vehicle speed V alone may be used. Other states of the vehicle 12 may also be used, such as the average vibration frequency F or the perspiration amount S. The average vibration frequency F can be calculated from values such as the vehicle speed V and the engine speed. The perspiration amount S can be detected with a perspiration sensor 36.

Although the microphone 20 is used as the speaker information input means, any device that can acquire information for speech recognition may be used. One example is an image acquisition means, such as a CCD camera, that captures the movement of the passenger 14's mouth (so-called lip movement). A plurality of speaker information input means may also be used.

Although the first to fourth grammar dictionaries 62a to 62d and the language dictionary 64 are used as the speech recognition dictionaries, at least two speech recognition dictionaries are sufficient. Only one type, either grammar dictionaries or language dictionaries, may be used. Speech recognition dictionaries based on other recognition methods may also be employed.

Furthermore, in the above embodiment, the acoustic dictionary 58, which holds the acoustic features of each phoneme, is not among the speech recognition dictionaries subject to selection. By providing a plurality of acoustic dictionaries 58, however, they can also be made selectable. For example, acoustic dictionaries 58 may be provided for each combination of vehicle 12 state and sounding duration of the reference acoustic pattern. One of them is then selected based on the state of the vehicle 12 and the input time of the phoneme.

Although the speech recognition method is switched according to the post-correction input time T2, it may instead be switched according to the pre-correction input time T1.

Although the speech input time is corrected by multiplying it by a predetermined coefficient, it may instead be corrected by adding or subtracting an amount according to the state of the vehicle 12.
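
An additive variant replaces the coefficient with a state-dependent offset, T2 = T1 + Δ. In the minimal sketch below, the offsets, the thresholds, the clamp at zero and the function name `additive_correction` are hypothetical choices for illustration. The specification states only that adding or subtracting according to the vehicle state is possible.

```python
def additive_correction(t1: float, speed_kmh: float, yaw_rate_dps: float) -> float:
    """Post-correction input time T2 = T1 + offset (hypothetical offsets).

    Each condition contributes an assumed fixed offset in seconds; the result
    is clamped so that the corrected time never becomes negative.
    """
    offset = 0.0
    if speed_kmh > 80:            # assumed high-speed threshold for vehicle speed V
        offset -= 0.2
    if abs(yaw_rate_dps) > 10:    # assumed threshold for yaw rate Y
        offset -= 0.3
    return max(0.0, t1 + offset)
```

A fixed offset shifts short and long utterances by the same absolute amount, whereas a coefficient scales them proportionally. Which behaviour fits better depends on how the vehicle state actually affects utterance length, and the specification leaves that open.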

Although the above embodiment is a system that is self-contained within the vehicle, the invention is not limited to this. For example, the speech pattern recognition unit 56 and the speech recognition dictionary storage unit 60 may be located at an information center outside the vehicle, with speech recognition performed over wireless communication.

FIG. 1 is a block diagram of a vehicle equipped with a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a plan view of the steering wheel of the vehicle.
FIG. 3 is a block diagram of the speech recognition system.
FIG. 4 is a diagram showing the relationship between the grammar dictionaries in the speech recognition system and the passenger's speech input time.
FIG. 5 is a flowchart showing the procedure of speech recognition processing in the speech recognition system.
FIG. 6 is a diagram showing the relationship between the values of the sensor devices in the speech recognition system and the method of correcting the speech input time.
FIG. 7 is a diagram showing the relationship among the speech input time, the speech recognition method and the speech recognition dictionary in the speech recognition system.

Explanation of symbols

10 ... Speech recognition system
12 ... Vehicle (moving body)
14 ... Passenger (speaker)
20 ... Microphone (speaker information input means)
34 ... Heart rate sensor (state detection means)
42 ... Vehicle speed sensor (state detection means)
44 ... GPS device (state detection means)
46 ... Yaw rate sensor (state detection means)
62a to 62d ... Grammar dictionaries (speech recognition dictionaries)
64 ... Language dictionary (speech recognition dictionary)
66 ... Input time measuring unit (input time measuring means)
68 ... Speech recognition control unit (input time correcting means and speech recognition dictionary selecting means)

Claims

1. A speech recognition system to which speaker information input means for inputting information on a speaker at the time of utterance and state detection means for detecting the state of a moving body are connected, the speech recognition system comprising:
a plurality of speech recognition dictionaries;
input time measuring means for measuring the input time of the speaker information input to the speaker information input means; and
speech recognition dictionary selecting means for selecting one of the speech recognition dictionaries based on the state of the moving body detected by the state detection means and the input time measured by the input time measuring means.

2. A speech recognition system to which speaker information input means for inputting information on a speaker at the time of utterance and state detection means for detecting the state of a moving body are connected, the speech recognition system comprising:
a plurality of speech recognition dictionaries;
input time measuring means for measuring the input time of the speaker information input to the speaker information input means;
input time correcting means for correcting the input time measured by the input time measuring means based on the state of the moving body detected by the state detection means; and
speech recognition dictionary selecting means for selecting one of the speech recognition dictionaries based on the input time corrected by the input time correcting means.

3. The speech recognition system according to claim 1, wherein the moving body is a vehicle.

4. A vehicle comprising the speech recognition system according to claim 3.