JP2006215418A - Voice input device and voice input method

Info

Publication number: JP2006215418A
Application number: JP2005030020A
Authority: JP
Inventors: Daisuke Saito; 大介斎藤
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2005-02-07
Filing date: 2005-02-07
Publication date: 2006-08-17

Abstract

PROBLEM TO BE SOLVED: To provide a voice input device and a voice input method that reliably recognize a user's intentional utterances even in a very noisy environment and that are easy to operate.

SOLUTION: The voice input device comprises a sound input unit 103 that converts input voice into a voice signal and outputs it, a voice recognition unit 105 that converts the voice signal output from the sound input unit 103 into an information signal, and an utterance possibility determination unit 102 that determines a time interval in which the user is highly likely to perform voice input. In the time interval that the utterance possibility determination unit 102 judges to have a high possibility of voice input, the sound input unit 103 converts the user's voice into a voice signal and outputs it, and the voice recognition unit 105 converts the voice signal into an information signal in a recognition process.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice input device and a voice input method.

In recent years, some automobiles have been equipped with a voice input and recognition device that allows a navigation device, an audio device, an air conditioner, and the like to be operated by voice. Such a voice input device generally includes a microphone that captures the voice signal and is configured to input and recognize the user's speech. An example of means for recognizing such speech is described in Patent Document 1 below.

A typical voice input device is configured, for example, to wait for voice input at all times, detect the user's speech segment using the power of the input voice (the square of the input signal amplitude) or the like, and perform recognition. When such a configuration is used in a vehicle cabin, however, many kinds of noise are input together with the voice signal: the operating sound of the engine and air conditioner, audio playback, wind noise during driving, road noise from contact between the tires and the road surface, and so on. Recognition errors therefore increase.
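The power-based segment detection described above can be sketched as follows. This is a minimal illustration only: the frame length, the fixed threshold, and the function names `frame_power` and `detect_speech_segments` are assumptions and do not come from the publication.

```python
def frame_power(samples):
    """Mean of the squared amplitudes in one frame."""
    return sum(s * s for s in samples) / len(samples)


def detect_speech_segments(signal, frame_len, threshold):
    """Return (start, end) sample-index pairs whose frame power >= threshold.

    Hypothetical always-listening detector: every full frame is checked,
    so loud noise is reported as a segment just like speech.
    """
    segments = []
    start = None
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        active = frame_power(signal[i:i + frame_len]) >= threshold
        if active and start is None:
            start = i                      # a segment begins at this frame
        elif not active and start is not None:
            segments.append((start, i))    # the segment ended before this frame
            start = None
    if start is not None:
        # Signal ended while still active: close at the last full frame.
        segments.append((start, (len(signal) // frame_len) * frame_len))
    return segments
```

Because the detector looks only at power, steady cabin noise above the threshold produces a segment even when nobody speaks, which is the failure mode the paragraph above describes.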

For this reason, voice input devices mounted on vehicles often use a method in which the user sends an input start signal to the device immediately before speaking, so that the speech segment to be recognized can be identified accurately.

Known examples of this method include the PTT (Push-To-Talk) switch and the PTA (Push-To-Activate) switch. In the PTT method, the voice input during the time interval in which the button is held down is the target of recognition. In the PTA method, the target is the interval from the time the button is pressed until a pause (silent interval) has continued for at least a predetermined time. As a similar technique, using the input of a specific keyword as the equivalent of pressing the PTA switch has also been proposed.
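The difference between the two methods is only in how the end of the recognition interval is chosen, which the following sketch makes explicit. The per-frame voiced/unvoiced flags, the frame duration, and the function names are assumptions introduced for illustration.

```python
def ptt_window(press_time, release_time):
    """PTT: the recognition interval is exactly the time the button is held."""
    return (press_time, release_time)


def pta_window(press_time, voiced_frames, frame_sec, max_pause_sec):
    """PTA: the interval runs from the button press until silence has lasted
    max_pause_sec.

    voiced_frames is a hypothetical list of booleans, one per frame of
    frame_sec seconds, starting at press_time (True = voice present).
    """
    silent = 0.0
    for i, voiced in enumerate(voiced_frames):
        silent = 0.0 if voiced else silent + frame_sec
        if silent >= max_pause_sec:
            return (press_time, press_time + (i + 1) * frame_sec)
    # No sufficiently long pause was seen: the window covers all frames given.
    return (press_time, press_time + len(voiced_frames) * frame_sec)
```

For example, with 0.5-second frames and a 1.0-second pause limit, two consecutive silent frames close the PTA window, while a single silent frame between words does not.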

Hereinafter, a means for notifying the system that voice input is starting is referred to as a "speech switch".

Patent Document 1: JP-A-9-288493

In a voice input device that uses such a speech switch, the user must perform an operation such as pressing the switch every time voice input is performed, which is cumbersome.

In addition, because the user first decides to speak, then presses the speech switch, and only then begins speaking, the time required to complete an operation increases, which makes the device inconvenient to use.

In a device that also gives the user feedback after the speech switch is pressed to indicate when to speak, such as "Please say a command (beep)" or "Please speak after the beep (beep)" (where "beep" is a notification tone), the required time is likely to increase further.

If a method that always waits for input is adopted to solve these problems, misrecognition increases markedly, especially in a vehicle cabin, which is a severe noise environment. Noise may be interpreted as an operation command even though nothing was said, and a device may then be operated regardless of the user's intention.

The present invention has been made in view of the above problems. Its object is to provide an easy-to-use voice input device and voice input method that recognize the user's intentional utterances with high accuracy even in a severe noise environment.

A voice input device is configured in which utterance possibility determination means determines a time interval in which the user is highly likely to perform voice input, sound input means converts the voice uttered by the user in that time interval into a voice signal and outputs it, and voice recognition means converts the voice signal into an information signal.

The present invention makes it possible to provide an easy-to-use voice input device and voice input method that recognize the user's intentional utterances with high accuracy even in a severe noise environment.

Embodiments of the present invention are described below.

(Basic functions and means of implementation)
A voice input device according to the present invention includes sound input means that converts input voice into a voice signal and outputs it, voice recognition means that performs a recognition process of converting the voice signal output by the sound input means into an information signal, and utterance possibility determination means that determines a time interval in which the user is highly likely to speak. In a time interval in which the utterance possibility determination means has determined that the user is highly likely to speak, the sound input means converts the voice uttered by the user into a voice signal and outputs it, and the voice recognition means performs the recognition process of converting that voice signal into an information signal. With this configuration, the user can perform voice input without operating the speech switch.

In the voice input device according to the present invention, the utterance possibility determination unit determines the time interval in which the user is highly likely to perform voice input by, for example, detecting a change in the operating state of a device that is a target of voice operation. The sound input means converts the voice uttered by the user in that interval into a voice signal and outputs it, and the voice recognition means performs the recognition process of converting the voice signal into an information signal. With this configuration, the user can perform voice input to operate the device without operating the speech switch.

In addition to the basic configuration above, the efficiency and accuracy of recognition by the voice recognition means can be improved by adding two elements. The first is a recognition dictionary that includes, as a component, a language dictionary used for matching against the voice signal output by the sound input means. The second is language dictionary changing means that selects the vocabulary likely to be uttered in the time interval in which the utterance possibility determination means judges the user highly likely to speak, and changes the language dictionary based on that vocabulary. In this case, if the language dictionary changing means selects, as the vocabulary likely to be uttered, the voice operation vocabulary (the vocabulary spoken to operate a device) associated with the voice-operated device whose operating state change was detected, the efficiency and accuracy of recognition during voice input for operating that device can be improved.

Because the voice input device and voice input method according to the present invention are particularly effective in a vehicle cabin, the following description assumes that they are used in a vehicle cabin.

When a user who is an occupant operates the various devices installed in a vehicle cabin (navigation device, audio device, air conditioner, and so on), there are many situations in which a change in the operating state of a device prompts the operation. Examples include: track selection, fast-forward, rewind, and track skip operations at title or chapter boundaries on a CD, MD, DVD, or the like; channel selection at a change of radio or television program or when reception conditions change; operations for displaying, redisplaying, or reading out information or recalculating a route when VICS (road traffic information and communication system) data is received or after the information display ends; operations for displaying or reading out communication information (toll, usage history) when ETC communication is established or after it is completed; answering a call when one arrives on a mobile phone hands-free system; resetting the temperature when the operating state (airflow level) of the automatic air conditioner changes; and operating the television or radio when a time signal is received.

In other words, the user can be regarded as highly likely to operate a device during a predetermined time interval that includes the time at which the device's operating state changes.

Applying this tendency to device operation by voice input, a time interval that includes the time at which a device's operating state changes can be interpreted as an interval in which a voice utterance for operating that device is highly likely.

Focusing on this tendency, the present invention predicts the possibility of an utterance for device operation from changes in the operating state of on-board devices. It provides a voice input device and voice input method that allow operation by voice input, without operating the speech switch, during time intervals in which such an utterance is highly likely.

(Embodiment)
FIG. 1 shows the basic configuration of the voice input device according to the present invention, and FIG. 2 shows a basic embodiment of the device.

In FIG. 1, the in-vehicle device 101 is the group of operable devices installed in the vehicle cabin. In this embodiment, the devices to be operated within the in-vehicle device 101 are storage medium playback devices for CD, MD, DVD, and the like; broadcast receivers such as television and radio; a telematics system (including a navigation system, VICS system, ETC system, hands-free telephone, and so on); an automatic air conditioner; automatic lights (interior and exterior); automatic locks; automatic wipers; a clock; and so on.

The utterance possibility determination unit 102, which serves as the utterance possibility determination means, is composed of the arithmetic device 204, storage device 205, timer 206, and sensor 207 in FIG. 2. The arithmetic device 204 is a combination of general-purpose processing circuits such as a CPU, MPU, DSP, or FPGA. The storage device 205 is a general storage medium such as cache memory, main memory, HDD, CD, MD, DVD, optical disc, or FDD.

The utterance possibility determination unit 102 monitors changes in the operating state of the in-vehicle device 101. When it detects a change, it judges a predetermined time interval that includes the detection time to be an interval in which the user is highly likely to speak about operating that device. It then instructs the sound input unit 103 to convert the voice uttered by the user in this interval into a voice signal and send it to the voice recognition unit 105. Accordingly, during the predetermined utterable time interval measured by the timer 206 (the interval in which the user is judged highly likely to speak; voice uttered within it is converted into a voice signal and then into an information signal), the voice recognition unit 105 performs the recognition process of converting the voice signal into an information signal.
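The behavior of the utterance possibility determination unit 102 can be sketched as a small state machine. This is an illustration under stated assumptions: the class name, the dictionary of device states, and the choice to start the utterable interval at the detection time and keep it open for a fixed `window_sec` are all hypothetical; the publication only says the interval includes the detection time.

```python
class UtterancePossibilityJudge:
    """Opens an utterable time window when any monitored device state changes."""

    def __init__(self, window_sec):
        self.window_sec = window_sec    # assumed fixed window length
        self.last_states = None         # device states seen on the previous update
        self.window_end = None          # time at which the current window closes
        self.changed_device = None      # device whose change opened the window

    def update(self, now, device_states):
        """Compare device_states (name -> state) with the previous snapshot.

        Returns True while voice input should be enabled.
        """
        if self.last_states is not None:
            for name, state in device_states.items():
                if self.last_states.get(name) != state:
                    self.window_end = now + self.window_sec
                    self.changed_device = name
        self.last_states = dict(device_states)
        return self.is_open(now)

    def is_open(self, now):
        return self.window_end is not None and now <= self.window_end
```

A caller would enable the sound input and recognition stages only while `update` returns True, and could use `changed_device` to decide which vocabulary to favor.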

After this recognition process ends, the sound input unit 103 and the voice recognition unit 105 stop operating until a possibility of user speech arises again. Stopping only one of the two is sufficient, because the recognition process in the voice recognition unit 105 stops as a result. This configuration eliminates the possibility of misrecognition caused by noise or other input outside the utterable time interval.

The sound input unit 103, which serves as the sound input means, is composed of the microphone 201, amplification device 202, AD conversion device 203, and arithmetic device 204 in FIG. 2. Following the instruction from the utterance possibility determination unit 102, it converts the user's speech (arrow (a) in FIG. 1) into a voice signal and inputs it to the voice recognition unit 105. A general microphone can be used as the microphone 201. The sound input unit 103 enables input when instructed by the utterance possibility determination unit 102 and when the speech switch 108 is pressed, converting the input voice into a voice signal and sending it to the voice recognition unit 105. However, if the voice recognition unit 105 operates only in these cases, the sound input unit 103 may remain in operation.

Note that, in addition to the voice signal, the sound input unit 103 also outputs noise arising inside and outside the vehicle and sound signals output from the audio system, navigation system, and the like. It is therefore desirable to eliminate or attenuate noise and sound input that are not targets of recognition by measures such as the following, applied between the microphone 201 and the AD conversion device 203 or between the AD conversion device 203 and the arithmetic device 204 that performs voice recognition: providing a filter (anti-aliasing filter) that attenuates unwanted components in the sound signal; providing a gain adjustment mechanism that adjusts the amplification so that the power (gain) of the input signal is appropriate; providing a voice activity detection (VAD) mechanism that accurately extracts the intervals containing voice from the input signal based on changes in input signal power and the like; and providing an echo cancelling mechanism that cancels the sound signals output from the audio and navigation systems.

The dictionary changing unit 104, which serves as the language dictionary changing means, is composed of the arithmetic device 204 and storage device 205 in FIG. 2. Based on the determination result of the utterance possibility determination unit 102, it selects the vocabulary likely to be input as the next utterance and changes the language dictionary used for matching against the voice signal output by the sound input unit 103 so that this vocabulary is recognized more easily. Specific examples of how the dictionary is changed are given later. This language dictionary is a component of the recognition dictionary 106.

The voice recognition unit 105, which serves as the voice recognition means, is composed of the arithmetic device 204 and storage device 205 in FIG. 2. It compares and matches the voice signal output from the sound input unit 103 against the acoustic and linguistic features stored in the recognition dictionary 106, and obtains one or more plausible (high-likelihood) vocabulary items as the recognition result. As typical features, time-series vector data combining LPC cepstrum, LPC delta cepstrum, mel cepstrum, log power, and the like are used for the comparison and matching. In other words, the voice recognition unit 105 performs a recognition process that converts the voice signal output by the sound input unit 103 into vocabulary, which is the information signal.

The recognition dictionary 106 corresponds to the storage device 205 in FIG. 2 and consists of an acoustic dictionary and a language dictionary. The former records the acoustic features of the target language, to be matched against the time-series vector data described above, in a form such as a hidden Markov model (HMM). The latter records the vocabulary that the voice input system can accept, for example as connections between words (a word network). The dictionary changing unit 104, which serves as the language dictionary changing means, modifies this language dictionary so that vocabulary with a high likelihood of being uttered is recognized more easily.

The operation command issuing unit 107 is composed of the arithmetic device 204 and storage device 205 in FIG. 2. It interprets the vocabulary recognized by the voice recognition unit 105, converts it into operation commands for the various devices, and sends the commands to the device to be operated (arrow (b) in FIG. 1). The devices can also be operated by means other than voice (for example, buttons or touch panels provided on each device). Information from these operating means (such as button press information) is likewise converted into operation commands by the operation command issuing unit 107 and sent to the target devices.

The speech switch 108 is a switch that the user presses to tell the system that voice input is starting, and corresponds to the switch 208 in FIG. 2. Except when the utterance possibility determination unit 102 has determined that an utterance is highly likely, the user presses the speech switch 108 and then begins speaking.

(Dictionary structure and recognition example in the normal state)
FIG. 4 shows examples of the language dictionary in the recognition dictionary 106 of this embodiment.

The language dictionary shown in FIG. 4(a) is an example of a network-type language dictionary and forms a hierarchical network. Words in a lower level refine the word category of the level above, and words from upper and lower levels can be input one at a time or concatenated. For example, the dictionary of FIG. 4 can recognize "destination"; "destination, Kanazawa Ward, Yokohama City, Kanagawa Prefecture"; "television, △× Broadcasting"; and so on.

The language dictionary shown in FIG. 4(b) is also a network-type language dictionary, but it provides a separate dictionary for each vocabulary category, and these are matched independently or in parallel. The recognizable vocabulary is the same as in FIG. 4(a). With the dictionary of FIG. 4(b), a single dictionary among "menu dictionary 401" through "... dictionary 405" can be enabled on its own, or all dictionaries can be enabled so that the vocabulary in each is awaited in parallel. However, because the recognition rate falls as the vocabulary grows, it is desirable to choose an appropriate usage based on the relationship between vocabulary size and recognition performance. For example, the menu dictionary 401 (FIG. 4(b)) can be the standby dictionary in the initial state, and if the recognized word is "destination", the destination dictionary 402 (FIG. 4(b)) is then enabled.
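The staged use of per-category dictionaries can be illustrated with a toy model in which recognition is reduced to set membership. The dictionary contents, the class name `StagedDictionaries`, and the rule that a recognized category word replaces the active dictionary are assumptions for illustration; the actual contents of dictionaries 401 to 405 appear only in FIG. 4.

```python
# Hypothetical per-category dictionaries in the spirit of FIG. 4(b).
DICTIONARIES = {
    "menu": {"destination", "television"},
    "destination": {"Yokohama City, Kanagawa Prefecture"},
    "television": {"station A"},
}


class StagedDictionaries:
    """Keeps a set of enabled dictionaries and accepts only their vocabulary."""

    def __init__(self, dictionaries, initial):
        self.dictionaries = dictionaries
        self.active = {initial}          # names of the enabled dictionaries

    def vocabulary(self):
        words = set()
        for name in self.active:
            words |= self.dictionaries[name]
        return words

    def accept(self, word):
        """Return True if word is recognizable now; follow category words."""
        if word not in self.vocabulary():
            return False
        if word in self.dictionaries:
            # A category word such as "destination" enables that dictionary.
            self.active = {word}
        return True
```

Keeping only one small dictionary enabled at a time is one way to act on the trade-off noted above between vocabulary size and recognition rate.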

When input is not made all at once (that is, when it is entered a few words at a time), it is desirable to give the user feedback by voice, display, or the like as appropriate.

The following are examples of dialogue between the system and the user when the dictionary of FIG. 4(a) or (b) is used. These examples show operation in the state in which the possibility of an utterance has not been judged high (hereinafter the "normal state"). In this case, the user presses the speech switch 108 to put the voice recognition system into the state of waiting for voice input. Below, "U:" denotes a user utterance and "S:" a system response.

(Dialogue example 1: destination setting)
U: (presses the speech switch 108)
S: Please say a command.
U: Destination.
S: Please say the destination address.
U: Yokohama City, Kanagawa Prefecture.
S: Where in Yokohama City?
U: Mutsuura, Kanazawa Ward.
S: Setting Mutsuura, Kanazawa Ward, Yokohama City, Kanagawa Prefecture as the destination. Is this correct?
U: Yes.
S: The route uses National Route ○×. Please drive in accordance with traffic regulations.

(Dialogue example 2: television channel selection)
U: (presses the speech switch 108)
S: Please say a command.
U: Television.
S: Please select a channel.
U: △× Broadcasting.
S: Switching to △× Broadcasting.
That is, in the normal state, the user presses the speech switch 108 to enable the system's voice input and then begins voice input.

(Dictionary changes and recognition in the automatic input state)
In contrast, in a time interval in which the utterance possibility determination unit 102 has detected a change in a device's operating state and judged an utterance highly likely, voice input is enabled without the speech switch 108 being pressed (hereinafter the "automatic input state"). Specific examples of operating state changes and the method for determining the time interval in which voice input is enabled are described later.

In synchronization with this enabling process, the dictionary changing unit 104 selects the voice operation vocabulary (the vocabulary spoken to operate a device) associated with the device whose operating state changed as the vocabulary likely to be uttered in the interval in which the user is judged highly likely to speak (the next-utterance candidate vocabulary). Based on this next-utterance candidate vocabulary, it dynamically changes the language dictionary in the recognition dictionary 106 so that the vocabulary is recognized more easily.

Examples of such dynamic changes are shown in FIGS. 5 to 8.

FIG. 5 shows an example of moving vocabulary between levels: relative to the original dictionary in FIG. 4(a), the predicted next-utterance vocabulary (next track, ..., track information) has been moved to the top level (first level).

FIG. 6 shows an example of excluding everything other than the predicted vocabulary from the recognition vocabulary: relative to the original dictionary in FIG. 4(a), all vocabulary other than the next-utterance candidate vocabulary is disabled, so that only the next-utterance candidate vocabulary (next track, ..., track information) is a recognition target.

FIG. 7 shows an example of adding a bias to occurrence and transition probabilities: relative to the original dictionary in FIG. 4(a), a bonus is given to the occurrence and transition probabilities of the predicted next-utterance vocabulary (next track, ..., track information) so that its likelihood becomes higher.
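The exclusion approach of FIG. 6 and the bias approach of FIG. 7 can be contrasted in a few lines. This sketch assumes each vocabulary item already has a log-likelihood score and models the bonus as a constant added to that score; the scores, the bonus value, and the function names are hypothetical and are not taken from the publication.

```python
def restrict(scores, candidates):
    """FIG. 6 style: keep only the next-utterance candidate vocabulary."""
    return {word: s for word, s in scores.items() if word in candidates}


def add_bonus(scores, candidates, bonus):
    """FIG. 7 style: raise the score of candidate vocabulary, keep the rest."""
    return {word: s + bonus if word in candidates else s
            for word, s in scores.items()}


def best(scores):
    """Return the vocabulary item with the highest score."""
    return max(scores, key=scores.get)
```

With exclusion, a non-candidate word can never be returned; with a bonus, it can still win if its score exceeds the boosted candidates, so the bias approach is the more permissive of the two.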

FIG. 8 shows an example of changing the enabled dictionary. FIG. 8(a) shows the dictionary of FIG. 4(b) as used in the normal state: in the initial state only the "menu dictionary" is enabled, and only the vocabulary it contains can be recognized.

In contrast, when, for example, a change in the operating state of the television receiver is detected and the automatic input state is entered, the dictionary containing the next-utterance candidate vocabulary (the "television dictionary" in FIG. 8(b)) is enabled, and only the vocabulary it contains can be recognized.

Dynamically changing the state of the dictionary in this way makes the next-utterance candidate vocabulary easier to recognize.
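The dictionary changes of FIGS. 5 to 7 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the layered-set and word-probability representations, the function names `promote`, `restrict`, and `bias`, and the bonus factor are all assumptions introduced here.

```python
def promote(layers, predicted):
    """FIG. 5 style: move predicted words into the first layer.

    `layers` is a non-empty list of sets of words (hypothetical representation).
    """
    known = set().union(*layers) & set(predicted)
    top = set(layers[0]) | known
    lower = [set(layer) - known for layer in layers[1:]]
    return [top] + lower


def restrict(probs, predicted):
    """FIG. 6 style: keep only the predicted words as recognition targets."""
    return {w: p for w, p in probs.items() if w in predicted}


def bias(probs, predicted, bonus=2.0):
    """FIG. 7 style: boost predicted words, then renormalize to sum to 1.

    The multiplicative bonus is an assumed form of the "bonus" in the text.
    """
    raised = {w: p * (bonus if w in predicted else 1.0) for w, p in probs.items()}
    total = sum(raised.values())
    return {w: p / total for w, p in raised.items()}
```

All three functions return a new structure and leave the original dictionary untouched, so restoring the normal state amounts to discarding the modified copy.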

The following is an example dialogue when the dictionary is changed as shown in FIG. 8(b).

(Corresponds to dialogue example 2 above)
U: △× Broadcasting
S: Switching to △× Broadcasting.
As this dialogue shows, in the automatic input state the operation can be completed in a short time without pressing the utterance switch 108.

(Basic operation example (voice input enabling and dictionary change flow))
The main operation of this embodiment is described with reference to the flowchart of FIG. 3.
Step S101: Acquire the operation information of the in-vehicle device 101.
Step S102: Acquire the current time T_n.
Step S103: Monitor the device operation information for a change (a change since the previous time T_{n-1}). If no change is detected, proceed to step S104; if a change is detected, proceed to step S107. To monitor the change in the device operation information from the previous time T_{n-1} to the current time T_n, it suffices to buffer the device operation information for a predetermined length of time (that is, to store the data in temporary memory and read it out sequentially).

Step S104: If the utterance switch 108 has been pressed, proceed to step S105; if not, return to step S101.
Step S105: The voice recognition unit 105 (FIG. 1) performs voice recognition processing. In this case, recognition is performed in the normal state, in which the language dictionary has not been changed.
Step S106: Based on the recognition result, the operation command issuing unit 107 (FIG. 1) issues an operation command.

Step S107: Based on the type of device in which the change in operating state was detected, the dictionary changing unit 104 (FIG. 1) changes the language dictionary so that the vocabulary related to operating that device by voice (the voice operation vocabulary) is recognized more easily.
Step S108: Enable voice input through the sound input unit 103 (FIG. 1) and start input to the voice recognition unit 105 (FIG. 1).
Step S109: Initialize the timer 206 (FIG. 2) to 0 and start counting the time elapsed since voice input was enabled.
Step S110: If a press of the utterance switch 108 is detected, that is, if the utterance switch 108 is pressed while voice input is already automatically enabled, proceed to step S111; if the utterance switch 108 is not pressed, proceed to step S112.

Step S111: Cancel the change to the language dictionary, restore the original dictionary configuration (illustrated in FIG. 4), and proceed to step S105.

Step S112: Detect whether voice input is present. One possible detection method is to monitor the power of the input signal. If voice input is detected, proceed to step S114; if not, proceed to step S113.

Step S113: Compare the timer value, that is, the time elapsed since voice input was enabled, with a predetermined time β (corresponding to the "second time" in claims 3 and 13). If the timer value < β, return to step S110; if the timer value ≥ β, proceed to step S116.

Thus, taking the time at which a change in the operating state of the device to be operated by voice is detected as the reference time, the time interval from the reference time to a second time, which is later than the reference time by the predetermined second time (β), is judged to be an interval in which the user is highly likely to perform voice input. During that interval, a language dictionary whose vocabulary is likely to be used in recognition is prepared (in step S107), and the device remains in a state in which it recognizes the user's voice input.

Here, β determines for how many seconds voice input is enabled after a change in the operating state of the device is detected. The value of β is set in advance, before step S108. It can be determined, for example, by using a fixed value (for example, 5 seconds), by letting the user set it, or by learning the user's utterance timing from the usage history in the automatic input state and adjusting β to match that tendency.

Step S114: Perform voice recognition processing. In this case, recognition is performed in the automatic input state, in which the language dictionary has been changed based on the change in device operation. After recognition, proceed to step S115.
Step S115: Based on the recognition result, the operation command issuing unit 107 (FIG. 1) issues an operation command, and the process proceeds to step S116.
Step S116: Disable voice input and stop feeding the sound signal to the voice recognition unit 105. Instead of stopping the input of the sound signal to the voice recognition unit 105, the operation of the voice recognition unit 105 itself may be stopped. In other words, from the end of the recognition processing by the voice recognition unit 105 until the possibility of a user utterance arises again (including a press of the utterance switch 108), it suffices to stop the operation of at least one of the sound input unit 103 and the voice recognition unit 105. In this way, no voice recognition processing is performed in the interval between the end of recognition processing and the user's next attempt at voice input, which reduces the possibility of erroneous recognition caused by noise and the like.
Step S117: The dictionary changing unit 104 cancels the change to the language dictionary, returning the language dictionary that was changed in step S107 before the above time interval to its state before the change, that is, to the normal-state dictionary (illustrated in FIG. 4). This prepares for the case where the next utterance is made in the normal state.

Through this series of processes, in the normal state the device recognizes voice input after the utterance switch 108 is pressed, whereas when a change in the operating state of a device is detected it enters the automatic input state for a predetermined time interval (of length β), during which operation speech for that device can be input without pressing the utterance switch 108.
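The flow of steps S101 to S117 can be read as a small two-state machine. The sketch below is one possible rendering under stated assumptions: the class name `VoiceInputController`, the `step` method, the dictionary labels "normal" and "biased", and the return values are all hypothetical, and recognition itself is abstracted away (a non-`None` `speech` argument stands for "voice input detected and recognized").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceInputController:
    beta: float                          # length of the automatic input interval (s)
    auto_since: Optional[float] = None   # time voice input was auto-enabled; None = normal state
    dictionary: str = "normal"

    def _to_normal(self):
        # S116/S117: disable automatic input and restore the normal dictionary
        self.auto_since = None
        self.dictionary = "normal"

    def step(self, now, device_changed=False, switch_pressed=False, speech=None):
        if self.auto_since is None:                 # normal state
            if device_changed:                      # S103 -> S107, S108, S109
                self.dictionary = "biased"
                self.auto_since = now
                return "auto_input_started"
            if switch_pressed:                      # S104 -> S105, S106
                return ("command", speech, "normal")
            return "idle"                           # S104 -> S101
        if switch_pressed:                          # S110 -> S111 -> S105, S106
            self._to_normal()
            return ("command", speech, "normal")
        if speech is not None:                      # S112 -> S114, S115
            result = ("command", speech, self.dictionary)
            self._to_normal()                       # S116, S117
            return result
        if now - self.auto_since >= self.beta:      # S113 -> S116, S117
            self._to_normal()
            return "timeout"
        return "waiting"                            # S113 -> S110
```

Calling `step` once per polling cycle with the current time reproduces the two paths described above: a switch-triggered recognition with the normal dictionary, and a change-triggered window of length β with the modified dictionary.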

(Differences in use between experienced and inexperienced users)
In the above flow, when the utterance switch 108 is pressed in the automatic input state, the dictionary change is cancelled (the language dictionary is returned to the normal-state dictionary). This accommodates users who understand the operation of this embodiment. That is, because the user deliberately presses the utterance switch 108 in the automatic input state (a state in which operation speech for a specific device can be input without pressing the utterance switch 108), the user is regarded as intending to perform a normal operation. The user can also be regarded as intending an operation other than that of the specific device that changed, so the language dictionary may instead be changed so that vocabulary other than the operation vocabulary related to the device whose change was detected is enabled.

On the other hand, an inexperienced user, that is, a user who does not know about the automatic input function, may press the utterance button even in the automatic input state. To accommodate such users, it is desirable, for example, to perform recognition processing and command issuance as in the normal state and, if the recognition result turns out to be an operation of the device whose operating state changed, to explain the mechanism of the present invention afterwards through audio or video output. The user can then perform input accompanying a change in that device's operation more smoothly from the next time on.

The following examples show more specific methods of judging utterance possibility.

In the voice input device according to the present invention, when the device is installed in an automobile, changes in the operating state of devices provided inside and outside the vehicle interior are used to detect time intervals in which the possibility of utterance is high. Methods of detecting utterance possibility for specific in-vehicle devices are described below.

(Example 1)
This example shows a configuration in which the device to be operated by voice is a device that reproduces a sound signal or video signal stored in a storage medium; a change in the operating state of that device is detected and, based on the result, the time interval in which a user utterance is judged highly likely is determined. Here, a device that reproduces a sound signal or video signal stored in a storage medium refers to, for example, a device that reproduces sound or video signals stored on a storage medium such as a CD, MD, DVD, HDD, or flash memory.

In this case, the utterance possibility determination unit 102 judges the utterance possibility from the time at which the operating state of this device changes. The time at which a change in the operating state is detected is taken to be the time at which a change of the storage medium, or a break in the sound or video signal stored on the storage medium, is detected. A change of the storage medium corresponds, for example, to swapping CDs (storage media) when the device has a CD changer. A break in the stored sound or video signal corresponds to a change within the reproduced data, for example when playback of a song or video stored on the medium moves on to the next song or the next chapter, or when playback reaches the end of the stored signal.

Such changes can be detected, for example, as changes in power or rhythm in the sound data, or as changes in the difference between consecutive frames of the video signal. If accompanying information (called metadata or the like) is stored on the storage medium, it is desirable to use it to obtain the time at which the operation changes. Metadata includes the total recording time of the storage medium and the start and end times of songs and chapters, as well as song titles, song genres, release dates, artist names, chapter caption information, and so on.

The flowchart of FIG. 3 described above showed a configuration in which the utterance possibility is judged to be high for a predetermined time interval (of length β) starting at the change time. FIG. 9(a) illustrates this for the case where this example reproduces music.

The waveform at the top of FIG. 9 shows part of a sound signal; at change point A the music switches to the next piece. Such a switch, which is a break in the sound signal, is detected by the sound signal falling to zero level within a predetermined time. Letting T_c be the time at which such a change is detected (corresponding to the "reference time" in claims 3 and 13), the utterance possibility is judged to be high in the time interval from T_c to (T_c + β) (the interval from the reference time to the second time, shown cross-hatched in the figure).
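As a rough illustration of detecting such a break, the sketch below looks for a run of near-zero samples followed by the resumption of sound. It rests on assumptions not stated in the text: the break is interpreted as the signal staying at zero level for at least `min_run` samples, and the function name `find_break`, the tolerance `eps`, and the choice of reporting the index where sound resumes are all hypothetical.

```python
def find_break(samples, min_run, eps=1e-4):
    """Return the index where sound resumes after at least `min_run`
    consecutive near-zero samples, or None if no such break exists."""
    run = 0
    for i, s in enumerate(samples):
        if abs(s) <= eps:
            run += 1            # still inside a silent stretch
        else:
            if run >= min_run:
                return i        # silence was long enough: treat i as the change point
            run = 0             # silence too short: reset
    return None
```

Dividing the returned index by the sample rate would give a candidate for the change time T_c; where metadata is available, the text indicates it is the preferred source.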

In FIG. 9(b), by contrast, the interval is extended back from the change time T_c by a predetermined time α (corresponding to the "first time" in claims 3 and 13), and the utterance possibility is judged to be high in the time interval from (T_c − α) to (T_c + β) (the interval from the first time to the second time, shown cross-hatched in the figure). The change time T_c can be obtained in advance, for example by reading the metadata first. If the current time is already past T_c − α, utterances in the interval from (T_c − α) to (T_c + β) can still be captured by continuously buffering speech for a predetermined length of time. In this case the sound input unit 103 must keep operating, and the audio signal it outputs must be continuously buffered for the predetermined length of time. α and β may be fixed times, may be made changeable by the user, may be adjusted from the usage history (learning of utterance timing), or may be changed according to the content of the music (for example, lengthening the interval for songs with a long intro).
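The interval from (T_c − α) to (T_c + β) and the continuous buffering that makes its early part recoverable can be sketched as follows. This is an illustrative assumption, not the patent's design: `utterance_window`, `AudioHistory`, and the use of timestamped frames in a deque are names and choices introduced here.

```python
from collections import deque


def utterance_window(t_c, alpha, beta):
    """Return (start, end) of the high-possibility interval around change time t_c."""
    return (t_c - alpha, t_c + beta)


class AudioHistory:
    """Keeps only the most recent `max_age` seconds of timestamped audio frames."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.frames = deque()           # (timestamp, frame), oldest first

    def push(self, t, frame):
        self.frames.append((t, frame))
        # Drop frames older than max_age relative to the newest frame.
        while self.frames and self.frames[0][0] < t - self.max_age:
            self.frames.popleft()

    def since(self, t_start):
        """Frames recorded at or after t_start that are still buffered."""
        return [f for (t, f) in self.frames if t >= t_start]
```

With `max_age` chosen to be at least α, a change detected at T_c can still be served with the audio from T_c − α onward by calling `since(start)`.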

Alternatively, the time interval from T_c − α to T_c (the interval from the first time to the reference time) may be taken as the interval in which a user utterance is highly likely.

When such a change in operating state is detected, the language dictionary is changed at the same time as the transition to the automatic input state. For example, the language dictionary is changed so that vocabulary related to moving forward or backward between songs (chapters) (cueing and skipping), song information queries, next disc, song title search, artist name search, and the like is recognized preferentially.

It is further desirable to provide means for notifying the user of the automatic input state. FIG. 10 shows example display screens when the operating state changes during music playback. FIG. 10(a) shows the screen in the normal state, and FIG. 10(b) shows the screen in the automatic input state entered after a change in operating state is detected. FIG. 10(b) indicates that voice commands (operation of the device by voice) are available and shows the types of operation.

A notification sound may be played in synchronization with the screen transition to FIG. 10(b). If a notification sound is prepared for each device, so that the sound corresponds to the type of device for which automatic input is available, the indication becomes easier for the user to understand.

With the above configuration, the user can speak device operation commands without pressing the utterance switch 108.

(Example 2)
This example shows a configuration in which the device to be operated by voice is a broadcast wave receiving device, that is, a device that receives and outputs a sound signal or video signal transferred from outside; a change in the operating state of the broadcast wave receiving device is detected and, based on the result, the time interval in which a user utterance is judged highly likely is determined. Here, a broadcast wave receiving device refers to, for example, a television receiver or a radio receiver.

In this example, the changes in operating state used to judge the interval in which the user is highly likely to perform voice input are a break in the received sound or video signal (for example, a change of the received program) and a change in the reception intensity of the sound or video signal transferred from outside. That is, changes in the received program (program start or end, insertion of commercials, and so on) and changes in reception intensity are treated as changes in operating state, and the utterance possibility is judged from them.

A change in program information can be detected from a change in the power of the received sound signal, a change in the reception channel mode (stereo or monaural), a change in the difference between consecutive frames of the video signal, and so on. If information about the broadcast program (metadata) can be obtained, it is desirable to use it. Such metadata includes the start and end times of the program being broadcast, the start and end times of the music currently playing, and descriptive information about programs and music. Detection of utterance possibility based on a change in program information is the same as in Example 1.

Reception intensity, on the other hand, changes constantly in a broadcast wave receiving device mounted on a moving body (here, an automobile). A predetermined threshold is therefore set, and the operation is regarded as having changed when the reception intensity falls below that threshold.

FIG. 11 shows an example of detecting utterance possibility from changes in reception intensity. The line graph at the top of FIG. 11 shows the change in reception intensity over time. In this example, the reception intensity is below the threshold TH in the sections from time A to A', from time B to B', from time C to C', from time D to D', and from time E to E'. Each section is labeled with the symbol of its left end: A, B, C, D, and E. At time C', the reception intensity rises as a result of the user changing the channel. At time E', the user stops using the broadcast wave receiving device and nothing is received thereafter (this corresponds, for example, to the user switching to a device that reproduces a sound or video signal stored in a storage medium).

In FIG. 11(a), the utterance possibility is judged to be high in all of the sections A to E (shown cross-hatched in the figure). Accordingly, any time interval in which the reception intensity is below the threshold is always in the automatic input state, and operation speech for the broadcast wave receiving device can be input.

In FIG. 11(b), by contrast, a predetermined waiting time T (shown in gray in the figure) is set, and the utterance possibility is judged to be high only when the time elapsed since the reception intensity fell below the threshold TH exceeds the waiting time T (shown cross-hatched in the figure). In this case, sections A, B, and D are shorter than the waiting time, so no transition to the automatic input state occurs; for sections C and E, the automatic input state is entered between times C'' and C' and between times E'' and E'.
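Both variants of FIG. 11 can be expressed with one function in which the waiting time is a parameter: a waiting time of zero gives the behavior of FIG. 11(a), and a positive waiting time gives FIG. 11(b). The sketch below is illustrative only; the function name `auto_input_intervals`, the sampled `(time, strength)` input format, and the convention that a low section ends at the first sample at or above the threshold are assumptions.

```python
def auto_input_intervals(samples, th, wait=0.0):
    """samples: list of (time, strength) in time order.

    Returns a list of (start, end) automatic-input intervals: stretches where
    strength stays below `th` for longer than `wait`, starting `wait` after
    the strength first dropped below the threshold.
    """
    intervals = []
    below_since = None
    for t, s in samples:
        if s < th:
            if below_since is None:
                below_since = t                 # strength just dropped below TH
        elif below_since is not None:
            if t - below_since > wait:          # low section outlasted the wait
                intervals.append((below_since + wait, t))
            below_since = None
    # A low section still open at the end of the data.
    if below_since is not None and samples[-1][0] - below_since > wait:
        intervals.append((below_since + wait, samples[-1][0]))
    return intervals
```

Short dips in reception, like sections A, B, and D in the figure, produce no interval once `wait` exceeds their length.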

In synchronization with the transition to the automatic input state, the language dictionary is changed so that vocabulary related to operations such as channel selection, acquisition of program information, and turning the television or radio off is recognized more easily.

It is further desirable to provide means for notifying the user of the automatic input state. FIG. 12 shows an example of the screen transition in this example. FIG. 12(a) is the screen in the normal state, and FIG. 12(b) shows the screen after the transition to the automatic input state caused by low reception intensity (the screen shows static). The screen of FIG. 12(b) indicates that voice commands (operation of the device by voice) are available and shows the types of operation.

(Example 3)
This example shows a configuration in which the device to be operated by voice is an in-vehicle telematics device, that is, a communication device that sends and receives information by communication means; a change in the operating state of that device is detected and, based on the result, the time interval in which a user utterance is judged highly likely is determined. Here, a telematics device refers to, for example, a navigation system, a DSRC (dedicated short-range communication) system, or a hands-free system.

The time at which a change in the operating state of the communication device is detected is taken to be the time at which communication between the communication device and an external communication device is established, or the time at which information arrives from the external communication device. That is, the exchange of information through communication by the telematics device is treated as a change in operating state, and the utterance possibility is judged to rise at the time of the change. Specific changes in operating state include acquisition of traffic congestion information, establishment of ETC communication, and an incoming call to the hands-free system. When such a situation is detected, the utterance possibility is judged to be high for a predetermined time from the detection time, and the automatic input state is entered. Alternatively, for devices configured to display the information and output a notification sound immediately after the exchange, as with VICS information or ETC information, the automatic input state may be entered for a predetermined time interval starting when a certain time has elapsed since the information was output and the output has ended.

In synchronization with the transition to the automatic input state, the language dictionary is changed so that vocabulary related to, for example, displaying (redisplaying) or reading aloud traffic congestion information and ETC toll information, or answering a call on the hands-free system, is recognized more easily.

It is further desirable to provide means for notifying the user of the automatic input state. FIG. 13 shows an example of the screen transition when VICS information is received. FIG. 13(a) is the screen in the normal state immediately after VICS information is acquired, and FIG. 13(b) shows the screen after a certain time has elapsed since the display and the automatic input state has been entered. FIG. 13(b) indicates that voice commands (operation of the device by voice) are available and shows the types of operation.

(Example 4)
This example shows a configuration in which the device to be operated by voice is a device whose operating state is actively controlled based on input information from inside or outside the vehicle interior, that is, a device whose operating state is controlled automatically from information such as that of sensors provided inside and outside the vehicle; a change in the operating state of that device is detected and, based on the result, the time interval in which a user utterance is judged highly likely is determined. Such devices include, for example, an automatic air conditioner, automatic lights (interior and exterior), automatic door locks, automatic wipers, and a clock. For example, an automatic air conditioner receives information on the cabin temperature from a temperature sensor in the vehicle interior as input information from inside the vehicle interior, and actively controls its operating state based on that information. Likewise, automatic wipers receive information on rainfall from a sensor outside the vehicle interior as input information from outside the vehicle interior, and actively control their operating state based on that information.

For an automatic air conditioner, automatic door locks, automatic wipers, automatic lights, and the like, the utterance possibility is judged to be high when the device turning on or off (locking or unlocking), a change in air volume, a change in operating speed, or the like is detected. For the clock, it is desirable to use the user's operation history to learn the tendency of television and radio operations associated with time signals and, based on this information, to judge the utterance possibility to be high when a change in time is detected, for example when a particular hour and minute is reached. The automatic input state is entered for a predetermined time interval from the time at which the change in the device's operating state is detected. In synchronization with this, the language dictionary is changed so that vocabulary related to adjusting the air volume or set temperature of the automatic air conditioner, locking and unlocking the automatic door locks, turning the automatic wipers or automatic lights on and off, operating the television or radio in response to a time signal, and the like is recognized more easily.

Furthermore, it is desirable to provide means for notifying the user of the automatic input state. As such an example, FIG. 14 shows the screen transition when the operation state of the air conditioner changes. (a) of FIG. 14 is the screen in the normal state, which shows that the air conditioner air volume is 3. (b) of FIG. 14 shows the screen when the room temperature has dropped and the air conditioner air volume has changed to 1. The screen indicates that, following the change in the operation state of the air conditioner, the automatic input state has been entered and voice commands (operation of the device by voice) for adjusting the air conditioner are available.

(Common to Examples 1 to 4)
In the above examples, changes in the operation of several devices to be operated by voice may be detected at around the same time. FIG. 15 shows an operation example for such a case.

(a), (b), and (c) of FIG. 15 show the changes of devices A, B, and C, respectively, in time series. In this example, device A changes at times T₁ and T₆, device B at times T₂ and T₅, and device C at times T₃ and T₄. For device A, the possibility of utterance is judged to be high in the time interval from T−α to T+β₁, with the change time T (= T₁, T₆) as the reference. For device B it is judged to be high from the change time T (= T₂, T₅) to T+β₂, and for device C from the change time T (= T₃, T₄) to T+β₃.
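As a rough illustration of these per-device intervals, the following sketch builds the windows [T−α, T+β] from a list of change times and tests whether a given instant falls inside one of them. The names `WindowRule`, `utterance_windows`, and `is_utterance_likely` are hypothetical, the numeric values are invented, and the change times are simply taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowRule:
    alpha: float  # seconds before the change time T (0 for devices B and C)
    beta: float   # seconds after the change time T


def utterance_windows(change_times, rule):
    """Return one (start, end) pair per change time T: (T - alpha, T + beta)."""
    return [(t - rule.alpha, t + rule.beta) for t in change_times]


def is_utterance_likely(now, windows):
    """True if `now` lies inside any of the windows (bounds inclusive)."""
    return any(start <= now <= end for start, end in windows)


# Device A uses both alpha and beta; device B starts at the change time.
rule_a = WindowRule(alpha=2.0, beta=5.0)
rule_b = WindowRule(alpha=0.0, beta=3.0)
windows_a = utterance_windows([10.0, 60.0], rule_a)
windows_b = utterance_windows([20.0, 50.0], rule_b)
```

Setting `alpha` to 0 reproduces the intervals of devices B and C, which start at the change time itself.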

Operation examples under this situation are shown in (e), (f), and (g) of FIG. 15.
(e) A ∪ B ∪ C: The automatic input state allows operation of all devices for which the possibility of utterance is high.
(f) Newest(A, B, C): The automatic input state allows operation of only the device whose change was detected most recently.
(g) A > B > C: A priority order of the devices is determined (for example, priority: hands-free telephone > storage medium playback device > broadcast wave receiving device > automatic air conditioner), and the automatic input state allows operation of the highest-priority device.
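The three policies (e), (f), and (g) can be compared with a small sketch. It assumes that `active` maps each device whose high-possibility interval is currently open to the time of its latest detected change; the function names and device identifiers are hypothetical, and only the priority order is taken from the example in (g).

```python
# Priority order from the example in (g), highest priority first.
PRIORITY = [
    "hands_free_phone",
    "media_player",
    "broadcast_receiver",
    "air_conditioner",
]


def select_union(active):
    """(e) A ∪ B ∪ C: every device with an open interval is operable."""
    return set(active)


def select_newest(active):
    """(f) Newest(A, B, C): only the device with the latest change time."""
    if not active:
        return set()
    return {max(active, key=active.get)}


def select_priority(active, priority=PRIORITY):
    """(g) A > B > C: only the highest-priority device with an open interval."""
    for device in priority:
        if device in active:
            return {device}
    return set()


# Example: three devices changed recently; values are change times in seconds.
active = {"media_player": 12.0, "air_conditioner": 14.5, "broadcast_receiver": 13.0}
```

With this input, policy (e) enables all three devices, policy (f) enables only the air conditioner, and policy (g) enables only the media player.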

With the series of configurations described above, an utterance concerning the operation of a device for which a change in operation state has been detected can be input without using the utterance switch, which improves convenience for the user.

FIG. 1 is a diagram showing the basic configuration of the voice input device according to the present invention.
FIG. 2 is a diagram showing a basic embodiment of the voice input device according to the present invention.
FIG. 3 is a flowchart showing the basic operation of the voice input device according to the present invention.
FIG. 4 is a diagram showing an example of the language dictionary in the embodiment of the present invention.
FIG. 5 is a diagram showing an example of dynamic change of the language dictionary (movement between hierarchy levels).
FIG. 6 is a diagram showing an example of dynamic change of the language dictionary (vocabulary other than the predicted utterance vocabulary is excluded from the recognition vocabulary).
FIG. 7 is a diagram showing an example of dynamic change of the language dictionary (biasing of appearance and transition probabilities).
FIG. 8 is a diagram showing an example of dynamic change of the language dictionary (change of the activated dictionary).
FIG. 9 is a diagram explaining the utterance possibility detection method in a device that reproduces a sound signal stored in a storage medium.
FIG. 10 is a diagram showing the screen transition in Example 1.
FIG. 11 is a diagram explaining the utterance possibility detection method in a broadcast wave receiving device.
FIG. 12 is a diagram showing the screen transition in Example 2.
FIG. 13 is a diagram showing the screen transition in Example 3.
FIG. 14 is a diagram showing the screen transition in Example 4.
FIG. 15 is a diagram showing an operation example of each device when there are a plurality of devices to be operated by voice.

Explanation of symbols

101: in-vehicle device, 102: utterance possibility determination unit, 103: sound input unit, 104: dictionary changing unit, 105: voice recognition unit, 106: recognition dictionary, 107: operation command issuing unit, 108: utterance switch, 201: microphone, 202: amplifier, 203: AD converter, 204: arithmetic unit, 205: storage device, 206: timer, 207: sensor, 208: switch, 401: menu dictionary, 402: destination dictionary, 403: television dictionary, 404: CD dictionary, 405: ... dictionary.

Claims

A voice input device comprising: sound input means for converting input sound into a sound signal and outputting the sound signal; voice recognition means for converting the sound signal output from the sound input means into an information signal; and utterance possibility determination means for determining a time interval in which a user is highly likely to speak,
wherein, in the time interval determined by the utterance possibility determination means to be one in which the user is highly likely to speak, the sound input means converts voice input by the user into a voice signal and outputs the voice signal, and the voice recognition means performs recognition processing that converts the voice signal into an information signal.

The voice input device according to claim 1, further comprising: a recognition dictionary having a language dictionary used for collation with the voice signal output from the sound input means; and language dictionary changing means for selecting vocabulary that the user is highly likely to utter in the time interval in which the user is highly likely to speak, and changing the language dictionary based on that vocabulary.

The voice input device according to claim 1, wherein, taking as a reference time the time at which a change in the operation state of a device to be operated by voice is detected, the utterance possibility determination means determines, as the time interval in which the user is highly likely to speak, the time interval from a first time, which precedes the reference time by a predetermined first duration, to the reference time; the time interval from the reference time to a second time, which follows the reference time by a predetermined second duration; or the time interval from the first time to the second time.

The voice input device according to claim 3, wherein the language dictionary changing means selects, as the vocabulary that the user is highly likely to utter in the time interval in which the user is highly likely to speak, voice operation vocabulary related to the device to be operated by voice, and changes the language dictionary based on that voice operation vocabulary.

The voice input device according to claim 3 or 4, wherein the device to be operated by voice is a device that reproduces a sound signal or a video signal stored in a storage medium, and the time at which the change in the operation state of the device is detected is the time at which a change of the storage medium is detected, or the time at which a break in the sound signal or video signal stored in the storage medium is detected.

The voice input device according to claim 3, wherein the device to be operated by voice is a device that receives and outputs a sound signal or a video signal transmitted from outside, and the time at which the change in the operation state of the device is detected is the time at which a break in the received sound signal or video signal is detected, or the time at which a change in the reception intensity of the sound signal or video signal transmitted from outside is detected.

The voice input device according to claim 3, wherein the device to be operated by voice is a communication device that sends and receives information by communication means, and the time at which the change in the operation state of the communication device is detected is the time at which communication between the communication device and an external communication device other than the communication device is established, or the time at which information from the external communication device arrives.

The voice input device according to claim 3, wherein the device to be operated by voice is a device whose operation state is actively controlled based on input information from inside or outside the vehicle interior, and the time at which the change in the operation state of the device is detected is the time at which a change in the control state is detected.

The voice input device according to any one of claims 1 to 8, wherein, after the recognition processing by the voice recognition means is completed, the operation of at least one of the sound input means and the voice recognition means is stopped until the possibility that the user will speak arises again.

The voice input device according to any one of claims 2 to 9, wherein, after the recognition processing by the voice recognition means is completed, the language dictionary changing means returns the language dictionary that was changed before the recognition processing to its state before the change, before the possibility that the user will speak arises again.

A voice input method using: sound input means for converting input sound into a sound signal and outputting the sound signal; voice recognition means for converting the sound signal output from the sound input means into an information signal; and utterance possibility determination means for determining a time interval in which a user is highly likely to speak,
wherein, in the time interval determined by the utterance possibility determination means to be one in which the user is highly likely to speak, the sound input means converts voice input by the user into a voice signal and outputs the voice signal, and the voice recognition means converts the voice signal into an information signal.

The voice input method according to claim 11, further using: a recognition dictionary having a language dictionary used for collation with the voice signal output from the sound input means; and language dictionary changing means for selecting vocabulary that the user is highly likely to utter in the time interval in which the user is highly likely to speak, and changing the language dictionary based on that vocabulary.

The voice input method according to claim 11 or 12, wherein, taking as a reference time the time at which a change in the operation state of a device to be operated by voice is detected, the utterance possibility determination means determines, as the time interval in which the user is highly likely to speak, the time interval from a first time, which precedes the reference time by a predetermined first duration, to the reference time; the time interval from the reference time to a second time, which follows the reference time by a predetermined second duration; or the time interval from the first time to the second time.

The voice input method according to claim 13, wherein the language dictionary changing means selects, as the vocabulary that the user is highly likely to utter in the time interval in which the user is highly likely to speak, voice operation vocabulary related to the device to be operated by voice, and changes the language dictionary based on that voice operation vocabulary.

The voice input method according to claim 13 or 14, wherein the device to be operated by voice is a device that reproduces a sound signal or a video signal stored in a storage medium, and the time at which the change in the operation state of the device is detected is the time at which a change of the storage medium is detected, or the time at which a break in the sound signal or video signal stored in the storage medium is detected.

The voice input method according to claim 13 or 14, wherein the device to be operated by voice is a device that receives and outputs a sound signal or a video signal transmitted from outside, and the time at which the change in the operation state of the device is detected is the time at which a break in the received sound signal or video signal is detected, or the time at which a change in the reception intensity of the sound signal or video signal transmitted from outside is detected.

The voice input method according to claim 13, wherein the device to be operated by voice is a communication device that sends and receives information by communication means, and the time at which the change in the operation state of the communication device is detected is the time at which communication between the communication device and an external communication device other than the communication device is established, or the time at which information from the external communication device arrives.

The voice input method according to claim 13 or 14, wherein the device to be operated by voice is a device whose operation state is actively controlled based on input information from inside or outside the vehicle interior, and the time at which the change in the operation state of the device is detected is the time at which a change in the control state is detected.

The voice input method according to claim 11, wherein, after the recognition processing by the voice recognition means is completed, the operation of at least one of the sound input means and the voice recognition means is stopped until the possibility that the user will speak arises again.

The voice input method according to any one of claims 12 to 19, wherein, after the recognition processing by the voice recognition means is completed, the language dictionary changing means returns the language dictionary that was changed before the recognition processing to its state before the change, before the possibility that the user will speak arises again.