JP5154363B2

JP5154363B2 - Car interior voice dialogue system

Info

Publication number: JP5154363B2
Application number: JP2008274124A
Authority: JP
Inventors: 真人戸上; 浩明小窪; 康成大淵
Original assignee: Clarion Co Ltd
Current assignee: Faurecia Clarion Electronics Co Ltd
Priority date: 2008-10-24
Filing date: 2008-10-24
Publication date: 2013-02-27
Anticipated expiration: 2028-10-24
Also published as: JP2010102163A

Description

The technology disclosed in this specification relates to a voice dialogue apparatus mounted in a vehicle interior.

Car navigation systems with a voice dialogue function are widely used in automobiles (see, for example, Patent Document 1). The speech recognition and speech synthesis technologies needed to realize a voice dialogue function have been studied extensively. Speech recognition converts a speech waveform input through a microphone into text. Speech synthesis generates a speech waveform from text. Combining the two allows a car navigation system to converse with the user by voice. However, various noises are superimposed on a speech waveform recorded in the vehicle interior, so the speech recognition rate degrades significantly. To address this problem, sound source separation techniques that suppress noise with a microphone array having multiple microphone elements and extract only the desired speech have been studied extensively.
Patent Document 1: JP 2004-109323 A

Conventional car navigation systems are designed on the assumption that the driver speaks, so they read out the entire contents of the accepted speech recognition dictionary by speech synthesis. However, reading out all the options takes a long time, which lengthens the time needed to complete the voice dialogue. Alternatively, the options may be shown on a display instead of being read out. Displaying the options shortens the time needed to complete the voice dialogue, but it is not desirable when the driver is the one conducting the dialogue, because safety is compromised when a driver looks at the display while driving.

A representative invention disclosed in the present application comprises: an occupant detection unit that detects the seating positions of occupants in an automobile; a voice output unit that converts text information into speech and outputs it; an image display unit that converts the text information into an image and displays it; a plurality of microphones; and a switching unit that selects either the voice output unit or the image display unit based on the detection result of the occupant detection unit and causes the selected unit to output the text information. When an occupant other than the driver is on board, the apparatus estimates a first time required to display the text information on the image display unit and a second time required for the voice output unit to output speech that reads out the text information. If the first time is shorter than the second time, the apparatus selects output of the text information to the image display unit; if the second time is shorter than the first time, it selects output to the voice output unit. When an occupant other than the driver is on board, the apparatus further outputs, from the voice output unit, speech prompting the occupant other than the driver to respond. When the plurality of microphones receive speech after that prompt has been output, the apparatus identifies the sound source direction of the received speech and determines whether the identified direction is within a predetermined range that includes the direction from the microphones to the occupant other than the driver. If the identified direction is within the predetermined range, the received speech is processed as a response to the text information displayed on the image display unit. When no occupant other than the driver is on board, the apparatus selects output of the text information to the voice output unit.
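The selection rule of the switching unit can be illustrated with a minimal sketch. The two time estimators, their constants (`seconds_per_item`, `chars_per_second`, `pause_s`), and the function names are assumptions made for illustration; the source only states that the first and second times are estimated and compared.

```python
def estimate_display_time(options, seconds_per_item=0.5):
    """First time: assumed reading time when the options are shown on screen."""
    return seconds_per_item * len(options)


def estimate_speech_time(options, chars_per_second=8.0, pause_s=0.3):
    """Second time: assumed duration of reading every option aloud."""
    return sum(len(o) / chars_per_second + pause_s for o in options)


def choose_output(options, passenger_present):
    """Return "display" or "voice" following the rule described above."""
    if not passenger_present:
        # Driver only: always read the options aloud.
        return "voice"
    first_time = estimate_display_time(options)
    second_time = estimate_speech_time(options)
    if first_time < second_time:
        return "display"
    # Covers second_time < first_time; a tie is not addressed in the
    # source, so falling back to voice here is an assumption.
    return "voice"
```

With these assumed constants, a list of three station names is shown on the display when a passenger is present, while a single one-character option is read aloud.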

According to one embodiment of the present invention, an option presentation means determination unit determines whether only the driver is in the vehicle and selects the optimum presentation means according to the result. For example, when only the driver is present, the contents of the speech recognition dictionary are read out by speech synthesis. When a passenger other than the driver is present, the passenger is prompted to answer and the options are then shown on the display, which shortens the task time. This realizes a voice dialogue with a short task time without compromising safety and enables quick operation of the car navigation system.

FIG. 1 is a functional block diagram of the voice dialogue apparatus according to the first embodiment of the present invention.

The voice dialogue apparatus of this embodiment is assumed to be used, for example, as an application of a car navigation system mounted in an automobile. The embodiment is described below under this assumption. One or more users, including one driver, ride in the automobile.

FIG. 2 is a block diagram of the hardware configuration of the voice dialogue apparatus according to the first embodiment of the present invention.

The system of this embodiment includes a microphone array 1201 composed of at least two microphone elements. The microphone array 1201 measures the sound pressure level at the position of each microphone element.

The analog sound pressure values measured by the microphone array 1201 are sampled by the AD converter 1202 and converted into digital data. The AD converter 1202 may sample the sound pressure values after removing frequency components at or above 0.5 times the sampling rate with an analog low-pass filter (not shown) or the like.

The sampled digital sound pressure data is sent to the central processing unit 1203. The central processing unit 1203 executes programs for estimating the sound source direction in the digital sound pressure data, correcting phase differences, speech recognition, dialogue processing, and the like.

The programs executed by the central processing unit 1203 are stored as data in the storage medium 1205.

Temporary data needed during program execution may be stored in the volatile memory 1204 or the storage medium 1205. Other data needed in advance for program execution is stored beforehand in the storage medium 1205. The storage medium 1205 is a large-capacity nonvolatile storage medium such as a hard disk drive (HDD) or flash memory. The volatile memory 1204, by contrast, is a high-speed storage device such as dynamic random access memory (DRAM).

For example, the dialogue script database (DB) 114 and the recognition vocabulary DB 115 shown in FIG. 1 are stored in the storage medium 1205. Each unit shown in FIG. 1 other than these databases is realized by the central processing unit 1203 executing a program stored in the storage medium 1205. However, the voice input unit 101 may be realized by the AD converter 1202. The databases and programs are stored in the storage medium 1205, and all or part of them may be copied to the volatile memory 1204 as needed.

Analog speech picked up by the microphones is converted into digital speech by the voice input unit 101 and sent to the sound source separation unit 102. The sound source separation unit 102 removes the noise components contained in the digital speech and produces a signal in which the desired speech is emphasized. Whether speech is the desired speech is judged from sound source direction information. For example, when the desired speech is that of the driver, the direction of the driver relative to the microphones is set as the desired sound source direction, and speech arriving from that direction is judged to be the desired speech.

The desired sound source direction is set by the separation range setting unit 105, which sets it based on the output of the dialogue control unit 107.

The desired speech component extracted by the sound source separation unit 102 is sent to the speech recognition unit 103. The speech recognition unit 103 recognizes the utterance content of the received speech waveform and outputs it as a character string. The speech recognition processing executed by the speech recognition unit 103 may be of any kind, for example one based on hidden Markov models or one based on dynamic programming.

Speech recognition is a kind of pattern matching that determines which word in a given vocabulary is closest to the utterance content of the input speech waveform and outputs that word. The vocabulary held for speech recognition is created by the recognition dictionary generation unit 108 before speech recognition starts.

The speech recognition unit 103 also calculates the likelihood of the recognition result. The likelihood is a measure of how close the utterance content of the input speech waveform is to the word judged to be closest to it. The likelihood can be calculated by various known methods.

The utterance content converted into a character string by the speech recognition unit 103 (that is, the recognition result) and the likelihood of that result are sent to the recognition result rejection determination unit 104. The recognition result rejection determination unit 104 decides whether to accept or reject the recognition result based on the likelihood information. For example, the result may be accepted when the posterior probability derived from the likelihood information exceeds a threshold. When the posterior probability is at or below the threshold, a word outside the given vocabulary is presumed to have been spoken, so the result may be rejected.
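The accept/reject decision can be sketched as follows. The source does not say how the posterior probability is derived from the likelihoods, so normalizing the candidates' log-likelihoods with a softmax is an assumption, as are the function names and the threshold value 0.7.

```python
import math


def posterior_of_best(log_likelihoods):
    """Posterior of the best candidate, assuming a softmax over candidates."""
    if not log_likelihoods:
        raise ValueError("at least one candidate is required")
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    return 1.0 / sum(weights)  # weight of the best candidate is exp(0) = 1


def accept_result(log_likelihoods, threshold=0.7):
    """Accept when the posterior exceeds the threshold, otherwise reject."""
    return posterior_of_best(log_likelihoods) > threshold
```

A clear winner such as `[-10.0, -20.0, -25.0]` is accepted, whereas near-equal candidates such as `[-10.0, -10.1, -10.2]` are rejected because no single word dominates.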

A recognition result accepted by the recognition result rejection determination unit 104 is sent to the dialogue control unit 107. If the result is rejected, speech recognition may continue. If no result is accepted within a fixed time after speech recognition starts, the recognition result rejection determination unit 104 may notify the dialogue control unit 107 that there was no recognition result. The dialogue control unit 107 may decide its next action based on that information. For example, it may output guidance prompting the user to speak, such as "Please speak again," and then control each unit so that speech recognition is executed again.

FIG. 3 is a flowchart of the speech recognition processing executed in the first embodiment of the present invention.

Specifically, FIG. 3 shows the flow of the speech recognition processing executed by the sound source separation unit 102, the speech recognition unit 103, and the recognition result rejection determination unit 104.

In target sound range setting S901, the sound source separation unit 102 sets the range in which the target sound exists, based on the range set by the separation range setting unit 105. The range is set, for example, for both the azimuth angle and the elevation angle: the azimuth may be set from -30 degrees to +30 degrees and the elevation from -90 degrees to +90 degrees.

In sound source separation S902, the sound source separation unit 102 extracts the sound from the target sound direction based on the target sound range that was set.

In speech recognition S903, the speech recognition unit 103 recognizes the utterance content of the sound extracted in sound source separation S902, using the speech recognition dictionary and an acoustic model.

In reliability check S904, the recognition result rejection determination unit 104 determines whether a measure of the reliability of the speech recognition result (for example, a posterior probability calculated from the acoustic likelihood accompanying the result) exceeds a preset threshold. If it does, the recognized speech is presumed to have been uttered in order to input a command. For example, as described later, when options are presented to the user, the user utters one of them in response; such an utterance is one example of an utterance for inputting a command. In this case, processing proceeds to direction check S905.

If, on the other hand, the measure of reliability is below the threshold, the recognized speech is presumed not to have been uttered to input a command (for example, it is noise or other sound input unrelated to any command). In this case, processing proceeds to time limit check S907.

In direction check S905, the recognition result rejection determination unit 104 takes T as the time length of the recognized speech waveform, calculates the sound source direction of that waveform by Equation (1), and determines whether the direction is within the given target sound range.

θ(f, τ) is the sound source direction at frequency f and time τ; its calculation is described later. fmax is the maximum frequency component of the recognized speech waveform.

If the sound source direction is within the target sound range, the recognition result rejection determination unit 104 returns the recognition result and ends the processing. As described later, when the processing of FIG. 3 is executed after options have been presented to the user, the returned recognition result is treated as a response to that presentation (that is, as an utterance of one of the presented options).

If the sound source direction is outside the target sound range, processing proceeds to time limit check S907. In time limit check S907, the recognition result rejection determination unit 104 determines whether the time elapsed since speech recognition started is within a given time limit. If it is, processing returns to sound source separation S902 and the sound from the target sound direction is extracted again. If the elapsed time exceeds the time limit, the processing ends without returning a recognition result.
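The control flow of steps S902 to S907 can be summarized in a short sketch. The callables `separate`, `recognize`, and `direction_of` are hypothetical stand-ins for the units described above, and the one-dimensional `target_range` (azimuth only) is a simplification.

```python
import time


def recognize_with_direction_check(separate, recognize, direction_of,
                                   target_range, confidence_threshold,
                                   time_limit_s, clock=time.monotonic):
    """Return the recognized text, or None if the time limit expires."""
    low, high = target_range
    start = clock()
    while True:
        audio = separate(target_range)            # S902: extract target sound
        text, confidence = recognize(audio)       # S903: speech recognition
        if confidence > confidence_threshold:     # S904: reliability check
            if low <= direction_of(audio) <= high:  # S905: direction check
                return text
        if clock() - start > time_limit_s:        # S907: time limit check
            return None
```

Passing the clock as a parameter is only a convenience for testing the loop with a fake timer; the source does not describe how elapsed time is measured.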

Sound source separation S902 and speech recognition S903 operate on the real-time speech waveform acquired by the voice input unit 101. That is, the processing above is applied to new speech waveforms as they arrive moment by moment.

FIG. 4 is an explanatory diagram showing the detailed flow of processing executed by the sound source separation unit 102 according to the first embodiment of the present invention.

The processing of FIG. 4 is executed each time the voice input unit 101 acquires a fixed amount of speech data (for example, several tens of milliseconds).

The speech waveforms recorded by the microphone elements are subjected to a discrete Fourier transform in DFT 1001, one microphone element at a time. The sound pressure data of microphone element (i) at sampling time (t) is expressed by Equation (2).

A Hamming or Hanning window may be applied to the time-domain signal before the discrete Fourier transform. Multiplying by such a window function yields a more accurate time-frequency-domain signal.

The conversion into a time-frequency-domain signal by the discrete Fourier transform is performed by Equation (3). The converted signal is expressed by Equation (4).

Here, τ is called the frame index and equals the number of conversions into the time-frequency domain performed so far. w(n) is the Hanning or Hamming window function. The frame size is the one used for the Fourier transform.
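A minimal sketch of the windowed discrete Fourier transform for one microphone channel follows. Equations (2) to (4) are not reproduced in this text, so the symmetric Hann window formula, the hop size, and the function names are assumptions; the list index of the returned frames plays the role of the frame index τ.

```python
import cmath
import math


def hann(frame_size):
    """Symmetric Hann (Hanning) window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
            for n in range(frame_size)]


def dft_frame(samples, window):
    """Windowed DFT of one frame; returns one complex value per bin."""
    size = len(samples)
    xw = [s * w for s, w in zip(samples, window)]
    return [sum(xw[n] * cmath.exp(-2j * math.pi * k * n / size)
                for n in range(size))
            for k in range(size)]


def stft(signal, frame_size, hop):
    """List of frames; the list index corresponds to the frame index tau."""
    window = hann(frame_size)
    return [dft_frame(signal[start:start + frame_size], window)
            for start in range(0, len(signal) - frame_size + 1, hop)]
```

The direct O(N²) transform keeps the sketch dependency-free; a real-time system would use an FFT instead.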

In per-frequency vectorization 1002, the sound source separation unit 102 gathers the per-microphone signals that belong to the same time-frequency bin after conversion and generates the vector X(f, τ) defined by Equation (5). M is the number of microphone elements.

In sound source localization 1003, the sound source separation unit 102 calculates, for each time-frequency bin, the sound source direction θ that maximizes the inner product of X(f, τ) and the steering vector a(θ, f) defined by Equation (6), using Equation (7). c is the speed of sound.
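The localization step can be sketched as a search over candidate directions. Equations (6) and (7) are not reproduced in this text, so the steering vector below assumes a far-field source and a linear array with known element positions; that model, the 340 m/s sound speed, the candidate grid, and all names are illustrative assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # c in m/s (assumed value)


def steering_vector(theta_deg, freq_hz, mic_positions_m, c=SPEED_OF_SOUND):
    """Assumed far-field steering vector for a linear microphone array."""
    s = math.sin(math.radians(theta_deg))
    return [cmath.exp(-2j * math.pi * freq_hz * d * s / c)
            for d in mic_positions_m]


def localize(x, freq_hz, mic_positions_m, candidates_deg):
    """Direction whose steering vector has the largest |inner product| with x."""
    def score(theta_deg):
        a = steering_vector(theta_deg, freq_hz, mic_positions_m)
        return abs(sum(ai.conjugate() * xi for ai, xi in zip(a, x)))
    return max(candidates_deg, key=score)
```

For a four-element array with 5 cm spacing, an observation generated from 30 degrees at 1 kHz is localized back to 30 degrees on a 5-degree grid.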

FIG. 5 is a flowchart of the processing executed for each time-frequency bin in sound source separation 1004 according to the first embodiment of the present invention.

In target sound range check S1101, the sound source separation unit 102 determines whether the sound source direction θ of each time-frequency bin is within the predetermined target sound range. If θ is within the range, the sound source separation unit 102 sets the vector n(f, τ) to the zero vector and the vector s(f, τ) to X(f, τ), then proceeds to target sound covariance update S1102. If θ is outside the range, it sets the vector n(f, τ) to X(f, τ) and proceeds to noise covariance update S1103.

In target sound covariance update S1102, the sound source separation unit 102 updates the covariance matrix Rs(f) according to Equation (8) using the vector s(f, τ).

In noise covariance update S1103, the sound source separation unit 102 updates the covariance matrix R(f) according to Equation (9) using the vector n(f, τ). Here, α is a given update rate.
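The branch at S1101 and the two covariance updates can be sketched as below. Equations (8) and (9) are not reproduced in this text; the exponential-smoothing form R ← αR + (1 − α)·v·vᴴ is a common choice and is assumed here, as are the function names and the default α.

```python
def outer(v):
    """Outer product v v^H of a complex vector with itself."""
    return [[vi * vj.conjugate() for vj in v] for vi in v]


def smooth(R, v, alpha):
    """Assumed update rule: alpha * R + (1 - alpha) * v v^H."""
    vvh = outer(v)
    size = len(v)
    return [[alpha * R[i][j] + (1 - alpha) * vvh[i][j] for j in range(size)]
            for i in range(size)]


def update_covariances(Rs, Rn, x, theta_deg, target_range, alpha=0.9):
    """S1101 branch: update Rs for in-range bins, otherwise the noise matrix."""
    low, high = target_range
    if low <= theta_deg <= high:
        return smooth(Rs, x, alpha), Rn   # S1102: target sound covariance
    return Rs, smooth(Rn, x, alpha)       # S1103: noise covariance
```

Only one of the two matrices changes per time-frequency bin, which mirrors the either/or branch in the flowchart description.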

The sound source separation unit 102 obtains the sound source separation filter w(f, τ) by Equation (10) using the covariance matrices Rs(f) and R(f). eig_vector is a function that returns the eigenvector corresponding to the largest eigenvalue.

In filtering S1105, the sound source separation unit 102 obtains the noise-suppressed signal s(f, τ) from the input signal X(f, τ) and the sound source separation filter w(f, τ) by Equation (11).
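A sketch of the two building blocks named above follows: a function that returns the eigenvector of the largest eigenvalue, and the application of a filter to one observation vector. Power iteration is only one way to realize `eig_vector`, and the form s = wᴴX for the filtering step is assumed because Equations (10) and (11) are not reproduced in this text. How Rs(f) and R(f) are combined into the matrix passed to `eig_vector` is deliberately left open.

```python
import math


def mat_vec(A, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def eig_vector(A, iterations=200):
    """Dominant eigenvector by power iteration (unit length).

    Assumes the all-ones start vector is not orthogonal to the
    dominant eigenvector and that the largest eigenvalue is unique.
    """
    v = [complex(1.0)] * len(A)
    for _ in range(iterations):
        w = mat_vec(A, v)
        length = math.sqrt(sum(abs(c) ** 2 for c in w))
        if length == 0.0:
            return v
        v = [c / length for c in w]
    return v


def apply_filter(w, x):
    """Assumed filtering step: inner product w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))
```

For a diagonal matrix with entries 2 and 1, the iteration converges to the first coordinate axis, so the filter simply passes the first channel.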

In post-filtering S1106, the sound source separation unit 102 suppresses the residual noise component by applying a Wiener filter or spectral subtraction to the noise-suppressed signal s(f, τ). It then outputs the time-frequency signal with the residual noise suppressed and ends the processing.

In inverse DFT 1005, the sound source separation unit 102 applies an inverse discrete Fourier transform to the noise-suppressed signal obtained for each frequency, generates a time-domain signal, and outputs it.

The dialogue control unit 107 controls the voice dialogue with the user based on the dialogue script held in the dialogue script DB 114.

FIG. 6 is a flowchart of a dialogue based on the dialogue script in the first embodiment of the present invention.

In the example dialogue flow of FIG. 6, after the dialogue with the user starts, the dialogue control unit 107 first recognizes, in command name S301, the name of the command the user wants to execute. For recognition, the dialogue control unit 107 may output guidance that prompts the user to speak, such as "Please say a command name." The dialogue control unit 107 may also have the user choose the command from a command list displayed on the screen together with a comment such as "Please choose from the following," or it may generate speech that reads out the command list, for example with a speech synthesis system, and output that speech.

In this embodiment, whether the command list is displayed on the screen or read out is switched based on information such as where the users are seated. The switching may be performed in the same way as the selection of the option presentation method described later. For example, when only the driver is on board, the command list may be read out. When a passenger (that is, a user other than the driver) is in the automobile, guidance that prompts the passenger to look at the screen and answer, such as "Passenger, please choose from the following," may be played and the command list then displayed on the screen, in order to shorten the time required for the voice dialogue. This enables quick command input while keeping driving safe.

In FIG. 6, after command name S301 ends, processing branches according to the recognized command (that is, the command the user selected from the command list). For example, if the recognized command is destination setting, destination setting S302, which recognizes a specific destination, is executed next.

In destination setting S302, guidance prompting the user to say the destination, such as "Please say the destination," may be output. Alternatively, speech such as "Please choose a destination from the following" may be output, after which the list of destinations may be displayed on the screen or read out. An example of the destination list displayed on the screen is described later with reference to FIG. 12.

As in command name S301 above, depending on whether a passenger is present, the presentation means that minimizes the time required for destination setting is selected from among those that do not compromise safety.

If the command recognized in command name S301 is vehicle device operation, vehicle device operation S303 is executed. In vehicle device operation S303, the dialogue control unit 107 recognizes a user command for operating in-vehicle devices, such as turning the air conditioner on or off or controlling music. At this time, the dialogue control unit 107 may cause the voice output unit 112 to output guidance such as "Please say the name of the device you want to operate and the operation," or it may control the image display unit 113 to display a list of the commands that can be given by voice. As in destination setting S302 above, the presentation means that minimizes the time required for the in-vehicle device operation is selected.

If the command recognized in command name S301 is nearby facility search, nearby facility search S304 is executed. In nearby facility search S304, the dialogue control unit 107 uses the voice interface to search for the nearby facility the user wants and set the facility found as the destination. Specifically, the dialogue control unit 107 may cause the voice output unit 112 to output guidance prompting the user to speak, such as "Please say a nearby facility," or it may output guidance such as "Please choose from the following," then display a list of nearby facilities on the screen and prompt the user to speak while looking at it. As in vehicle device operation S303 above, the presentation means that minimizes the time required for the in-vehicle device operation is selected.

Ｓ３０２からＳ３０４までのいずれかの処理が終了した後、対話制御部１０７は対話を終了する。 After any processing from S302 to S304 is completed, the dialogue control unit 107 ends the dialogue.

以上は、カーナビゲーションシステムにおける対話スクリプトに基づく対話の一例である。対話スクリプトは、条件分岐とシステムの実行コマンドの情報とを保持する形式で記述可能な言語によって記述される限り、どのような形式で記述されてもよい。例えば、ＶｏｉｃｅＸＭＬのようなＸＭＬ形式で対話スクリプトが記述されてもよいし、プログラム言語の一種であるスクリプト言語でプログラムコードとして対話スクリプトが記述されてもよい。 The above is an example of a dialogue based on a dialogue script in the car navigation system. The dialogue script may be written in any format, as long as the language used can express conditional branches and information on the commands the system executes. For example, the dialogue script may be written in an XML format such as VoiceXML, or as program code in a scripting language, which is a kind of programming language.

対話制御部１０７は、対話スクリプトに基づき、音声認識部１０３、音源分離部１０２、音声出力部１１２及び画像表示部１１３の動作を制御する。例えば、前述のコマンドリストを認識する対話において、認識辞書生成部１０８は、音声認識を開始する前に、認識に用いる認識辞書を切り替える。認識辞書生成部１０８は有限オートマトン形式で記載された音声認識辞書を生成する。 The dialogue control unit 107 controls operations of the voice recognition unit 103, the sound source separation unit 102, the voice output unit 112, and the image display unit 113 based on the dialogue script. For example, in the dialog for recognizing the command list described above, the recognition dictionary generation unit 108 switches the recognition dictionary used for recognition before starting speech recognition. The recognition dictionary generation unit 108 generates a speech recognition dictionary described in a finite automaton format.

図７は、本発明の第１の実施形態において使用される音声認識辞書の一例を示す説明図である。 FIG. 7 is an explanatory diagram showing an example of a speech recognition dictionary used in the first embodiment of the present invention.

この例では、６つのノードと６つのアークがネットワーク化された形で認識辞書が表現されている。この例の認識辞書を用いた場合、ユーザ発話は「目的地」と「語尾」とが連続した文、又は、「電話番号」と「語尾」とが連続した文のいずれかであることが仮定される。「目的地」、「語尾」及び「電話番号」は、アークと呼ばれる、複数の単語をまとめて表現したラベルである。各アークは更に単語リストに展開される。 In this example, the recognition dictionary is expressed as a network of six nodes and six arcs. When the recognition dictionary of this example is used, the user utterance is assumed to be either a sentence in which a “destination” is followed by an “end of word,” or a sentence in which a “telephone number” is followed by an “end of word.” “Destination,” “end of word,” and “telephone number” are labels, called arcs, each of which collectively represents a plurality of words. Each arc is further expanded into a word list.

図８は、本発明の第１の実施形態におけるアークごとの単語リストの一例を示す説明図である。 FIG. 8 is an explanatory diagram illustrating an example of a word list for each arc according to the first embodiment of this invention.

図８は、「目的地」のアークが展開された単語リストの例を示す。図８の例では、目的地を示す単語のリストに、「中央研究所」及び「機械研究所」のような施設の固有名詞が含まれる。この他の目的地として、例えば、「レストラン」のような施設の一般名詞が含まれてもよいし（図１２参照）、「東京都」のような地名が含まれてもよい。 FIG. 8 shows an example of the word list into which the “destination” arc is expanded. In the example of FIG. 8, the list of words indicating destinations includes proper nouns of facilities, such as “Central Research Laboratory” and “Mechanical Research Laboratory.” Other destinations may include, for example, a common noun for a facility such as “restaurant” (see FIG. 12), or a place name such as “Tokyo.”

このように各アークは単語リストに展開される。これによって、例えば「目的地」と「語尾」が連続した文の数は、「目的地」の単語リストに含まれる単語の数に「語尾」の単語リストに含まれる単語の数を掛け合わせた数になる。単語リストは認識語彙ＤＢ１１５に蓄えられており、認識辞書に応じて必要な語彙が取り出される。 In this way, each arc is expanded into a word list. Consequently, for example, the number of sentences in which a “destination” is followed by an “end of word” equals the number of words in the “destination” word list multiplied by the number of words in the “end of word” word list. The word lists are stored in the recognition vocabulary DB 115, and the necessary vocabulary is retrieved according to the recognition dictionary.
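The expansion of arcs into word lists can be illustrated with a minimal sketch. The word lists, the arc names, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the recognition vocabulary DB 115 or from FIG. 7 and FIG. 8.

```python
from itertools import product

# Hypothetical word lists per arc (illustrative entries only).
WORD_LISTS = {
    "destination": ["Central Research Laboratory", "Mechanical Research Laboratory", "restaurant"],
    "telephone_number": ["0123456789"],
    "end_of_word": ["", "please"],
}

# Each accepted sentence follows one path of arc labels through the automaton.
PATHS = [("destination", "end_of_word"), ("telephone_number", "end_of_word")]


def expand(path, word_lists):
    """Return every sentence obtained by choosing one word for each arc on the path."""
    choices = (word_lists[arc] for arc in path)
    return [" ".join(w for w in words if w) for words in product(*choices)]


def count_sentences(path, word_lists):
    """Number of sentences on a path: the product of the word-list sizes."""
    n = 1
    for arc in path:
        n *= len(word_lists[arc])
    return n
```

With these assumed lists, the path "destination" then "end_of_word" yields 3 × 2 sentences, which matches the multiplication described above.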

音声認識部１０３は、各単語を音素毎又は音素片毎に分割し、各音素又は音素片に対応した音のモデルを並べたものをパターンとし、入力音声と最も近いパターンを出力する。音のモデルは、音素又は音素片ごとに、ＬＰＣケプストラム、ＭＦＣＣ、それらの差分値（Δ）、それらのΔΔ、又は、パワーの時間差分値など、を特徴量とした混合正規分布で表現される。また特徴量算出時に平均値の減算処理（ケプストラム平均値減算処理）などによって伝達系の歪みを補正してもよい。 The speech recognition unit 103 divides each word into phonemes or phoneme segments, forms a pattern by arranging the sound models corresponding to those phonemes or phoneme segments, and outputs the pattern closest to the input speech. For each phoneme or phoneme segment, the sound model is expressed as a Gaussian mixture distribution whose features are, for example, the LPC cepstrum, MFCC, their difference values (Δ), their ΔΔ values, or the time difference of power. When the features are calculated, distortion of the transmission system may be corrected by, for example, mean subtraction (cepstral mean subtraction).
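As a rough illustration of two of the feature operations mentioned above, the following sketch shows cepstral mean subtraction and a simple Δ (difference) feature on a list of feature frames. The function names and the particular two-frame difference formula are assumptions made for this example; the specification does not state which formula the speech recognition unit 103 uses.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over all frames from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]


def delta(frames):
    """Central difference between neighbouring frames (assumed formula).

    At the first and last frame, the missing neighbour is replaced by the
    frame itself.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out
```

Applying `delta` to the output of `delta` gives a ΔΔ feature under the same assumed formula.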

入力音声と最も近いパターンは、前向き・後ろ向きアルゴリズムに基づく最尤推定において、最大尤度を与える状態遷移パスのみを計算するように近似したビタビアルゴリズムによって算出される。尤度は、入力音声とパターンとの距離に基づいて定義される。音声認識部１０３は最大尤度を与える状態遷移パスを計算し、その状態遷移パスから単語系列を逆引きする。これによってある一つの文字列が得られ、その文字列が出力される。さらに、最大尤度そのもの、又は、最大尤度を加工することによって得られた事後確率ｐ（Ｏ｜Ｘ）が出力される。ここで、ｐ（Ｏ｜Ｘ）は、入力音声Ｘを条件とした認識結果Ｏの事後確率（すなわち、入力音声Ｘに対する認識結果Ｏが正しい結果である確率を示す値）である。 The pattern closest to the input speech is calculated by the Viterbi algorithm, which approximates maximum likelihood estimation based on the forward-backward algorithm by calculating only the state transition path that gives the maximum likelihood. The likelihood is defined based on the distance between the input speech and the pattern. The speech recognition unit 103 calculates the state transition path that gives the maximum likelihood and traces the word sequence back from that path. A single character string is thereby obtained and output. In addition, the maximum likelihood itself, or the posterior probability p(O|X) obtained by processing the maximum likelihood, is output. Here, p(O|X) is the posterior probability of the recognition result O given the input speech X (that is, a value indicating the probability that the recognition result O for the input speech X is correct).
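The Viterbi search described above can be sketched as follows for a small hidden Markov model in the log domain. This is a generic textbook form written for illustration; the argument layout (`log_init`, `log_trans`, `log_emit`) is an assumption and does not reflect the internal data structures of the speech recognition unit 103.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best log-likelihood, best state path).

    log_init[s]      : log probability of starting in state s
    log_trans[s][s2] : log probability of moving from state s to state s2
    log_emit[t][s]   : log likelihood of the frame at time t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s2] = best predecessor of state s2 at time t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s2 in range(n_states):
            best_prev = max(range(n_states), key=lambda s: score[s] + log_trans[s][s2])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s2] + log_emit[t][s2])
        score = new_score
        back.append(pointers)
    # Trace the best path back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

Only the single best path is kept at each step, which is the approximation relative to summing over all paths in the forward-backward computation.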

ユーザ発話は、必ずしも音声認識辞書で表現される文であるとは限らない。また、車室内では所望のユーザ発話以外の走行音などの雑音が存在するため、認識対象の音声が雑音であることも多い。このような場合でも、音声認識部１０３では入力音声に最も近いパターンを出力するため、出力されたすべての音声認識の結果を確信し受理することは望ましくない。 A user utterance is not necessarily a sentence expressed in the speech recognition dictionary. In addition, because noise other than the desired user utterance, such as driving noise, is present in the vehicle interior, the sound to be recognized is often noise. Even in such cases, the speech recognition unit 103 outputs the pattern closest to the input, so it is undesirable to trust and accept every speech recognition result that is output.

出力された音声認識結果の尤度又は事後確率が所定の閾値より小さい場合、認識対象の音声が雑音であったか又は音声認識辞書で表現される文以外の発話が成された可能性が高いため、そのような音声認識結果は棄却するべきである。認識結果棄却判定部１０４は、音声認識部１０３が出力する認識結果の尤度又は事後確率に基づき、認識結果を受理するか棄却するかを判定する（信頼度チェックＳ９０４）。さらに、認識結果棄却判定部１０４は、音声認識を行った波形の音源方向を推定し、その音源方向が所望の音源方向の範囲外（すなわち目的音範囲外）であった場合に、その認識結果を棄却してもよい。 If the likelihood or posterior probability of the output speech recognition result is smaller than a predetermined threshold, it is highly likely that the sound to be recognized was noise or that an utterance other than a sentence expressed in the speech recognition dictionary was made, so such a speech recognition result should be rejected. The recognition result rejection determination unit 104 determines whether to accept or reject the recognition result based on the likelihood or posterior probability of the recognition result output by the speech recognition unit 103 (reliability check S904). Furthermore, the recognition result rejection determination unit 104 may estimate the sound source direction of the waveform subjected to speech recognition and reject the recognition result if that direction is outside the range of desired sound source directions (that is, outside the target sound range).
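The accept/reject decision described above can be sketched as a small function. The function name, the use of degrees for the sound source direction, and the representation of the target sound range as a `(low, high)` pair are assumptions made for illustration.

```python
def accept_result(confidence, threshold, source_direction_deg=None, target_range_deg=None):
    """Return True if the recognition result should be accepted.

    confidence           : likelihood or posterior probability of the result
    threshold            : predetermined threshold for the reliability check
    source_direction_deg : estimated sound source direction (optional)
    target_range_deg     : (low, high) bounds of the target sound range (optional)
    """
    # Reliability check: reject results whose confidence is below the threshold.
    if confidence < threshold:
        return False
    # Optional direction check: reject sound arriving from outside the target range.
    if source_direction_deg is not None and target_range_deg is not None:
        low, high = target_range_deg
        if not (low <= source_direction_deg <= high):
            return False
    return True
```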

一般的に音声認識部１０３が受理可能な文（テキスト）に関する情報をユーザは事前に知らない。したがって、音声認識開始前に、いかなる文が受理可能であるかをユーザに提示する必要がある。選択肢を提示する方法としては、音声合成技術を用いて受理可能な文の一部を読み上げること、又は、受理可能な文の一部を、カーナビゲーションシステムが備えるディスプレイの画面上に表示すること、などが考えられる。 In general, the user does not know in advance which sentences (texts) the speech recognition unit 103 can accept. It is therefore necessary to present to the user, before speech recognition starts, which sentences are acceptable. Possible methods of presenting the options include reading out some of the acceptable sentences using speech synthesis, or displaying some of the acceptable sentences on the screen of the display provided in the car navigation system.

図９は、本発明の第１の実施形態の音声対話装置を含むカーナビゲーションシステムのハードウェア構成のブロック図である。 FIG. 9 is a block diagram of a hardware configuration of the car navigation system including the voice interactive apparatus according to the first embodiment of the present invention.

本実施形態のカーナビゲーションシステムは、中央演算装置１６０３と、その中央演算装置に接続されるスピーカ１６０１、記憶媒体１６０２、座席センサ１６０４、ディスプレイ１６０５、マイクロホン１６０６及び速度センサ１６０７と、を備える。 The car navigation system of this embodiment includes a central processing unit 1603, a speaker 1601, a storage medium 1602, a seat sensor 1604, a display 1605, a microphone 1606, and a speed sensor 1607 connected to the central processing unit.

中央演算装置１６０３は、音声認識及び音源分離などのソフトウェア処理を実行する。 The central processing unit 1603 executes software processing such as speech recognition and sound source separation.

記憶媒体１６０２には、認識辞書などの情報が保持される。 Information such as a recognition dictionary is held in the storage medium 1602.

ガイダンス音などの再生音は、スピーカ１６０１から出力される。スピーカ１６０１は超音波スピーカなどの超指向性スピーカであってもよい。 A reproduction sound such as a guidance sound is output from the speaker 1601. The speaker 1601 may be a super-directional speaker such as an ultrasonic speaker.

運転席、助手席、後部座席などの各座席に設置された座席センサ１６０４によって、同乗者が存在するか否かが判定される。座席センサ１６０４は、各座席にユーザが乗車しているか否かを示す情報を出力するものである限り、例えば、重量センサ又は各座席方向にビームを有する超音波センサ等、いかなる種類のものであってもよい。 The seat sensors 1604 installed in the seats, such as the driver's seat, the passenger seat, and the rear seats, are used to determine whether a passenger is present. The seat sensor 1604 may be of any type, for example a weight sensor or an ultrasonic sensor with a beam directed at each seat, as long as it outputs information indicating whether a user is sitting in each seat.

ディスプレイ１６０５には、コマンドリストなど認識語彙に関する情報、及び、地図などが表示される。 The display 1605 displays information related to the recognized vocabulary such as a command list, a map, and the like.

車載の速度センサ１６０７が取得した、自動車の速度を示す情報は、中央演算装置１６０３内に取り込まれ、走行状況（例えば自動車が走行中であるか否か）を判断するために使われる。 Information indicating the speed of the vehicle acquired by the in-vehicle speed sensor 1607 is taken into the central processing unit 1603 and used to determine a traveling state (for example, whether the vehicle is traveling).

マイクロホン１６０６は、ユーザ発話を収録するために用いられる。音声認識部１０３は、マイクロホン１６０６を通して収録した音声を認識する。マイクロホン１６０６の代わりに、複数のマイクロホン素子からなるマイクロホンアレイ（例えば、図２に示すマイクロホンアレイ１２０１）が用いられてもよい。 The microphone 1606 is used for recording user utterances. The voice recognition unit 103 recognizes the voice recorded through the microphone 1606. Instead of the microphone 1606, a microphone array including a plurality of microphone elements (for example, the microphone array 1201 shown in FIG. 2) may be used.

マイクロホンアレイを用いることで、単一のマイクロホンでは得ることが困難な音源方向に関する情報を得たり、目的話者の方向にビームを当てて、その方向の話者が発話した音声のみを抽出したりすることができる。音源分離部１０２は、マイクロホンアレイを用いて、特定方向の話者が発話した音声のみを抽出してもよい。 By using a microphone array, it is possible to obtain information on the direction of a sound source that is difficult to obtain with a single microphone, or to apply a beam to the direction of the target speaker and extract only the speech uttered by the speaker in that direction. can do. The sound source separation unit 102 may extract only the voice uttered by a speaker in a specific direction using a microphone array.

なお、図９に示すマイクロホン１６０６及び中央演算装置１６０３は、それぞれ、図２に示すマイクロホンアレイ１２０１及び中央演算装置１２０３に相当する。図９に示す記憶媒体１６０２は、図２に示す揮発性メモリ１２０４及び記憶媒体１２０５の少なくとも一方に相当する。 Note that the microphone 1606 and the central processing unit 1603 shown in FIG. 9 correspond to the microphone array 1201 and the central processing unit 1203 shown in FIG. 2, respectively. The storage medium 1602 shown in FIG. 9 corresponds to at least one of the volatile memory 1204 and the storage medium 1205 shown in FIG. 2.

選択肢提示手段判定部１０６は、カーナビゲーションシステムに付属のユーザ提示装置を用いた提示手段の中から、安全性を損なわないという条件下で、タスク終了時間（すなわち、ユーザ提示装置が選択肢をユーザに提示し、選択肢をユーザが理解し、選択肢のいずれかをユーザが発話するのに要する時間）が短い手段を選択する。図９に記載されたカーナビゲーションシステムは、ユーザ提示装置として、スピーカ１６０１及びディスプレイ１６０５を備える。この場合、ユーザ提示手段としては、音声合成を用いて選択肢を読み上げる音声をスピーカ１６０１から出力するという方法と、ディスプレイ１６０５上に選択肢を表示するという方法の二つが考えられる。 From among the presentation means that use the user presentation devices attached to the car navigation system, the option presenting means determination unit 106 selects a means with a short task end time (that is, the time required for the user presentation device to present the options to the user, for the user to understand the options, and for the user to utter one of them), under the condition that safety is not impaired. The car navigation system shown in FIG. 9 includes the speaker 1601 and the display 1605 as user presentation devices. In this case, two user presentation means are conceivable: outputting from the speaker 1601 a voice that reads out the options using speech synthesis, and displaying the options on the display 1605.

図１０は、本発明の第１の実施形態の選択肢提示手段判定部１０６が実行する処理を示すフローチャートである。 FIG. 10 is a flowchart illustrating processing executed by the option presenting means determination unit 106 according to the first embodiment of this invention.

同乗者判定Ｓ７０１において、選択肢提示手段判定部１０６は、ドライバ以外の同乗者が乗車しているか否かを判定する。例えば、助手席（又は後部座席）の座席センサ１６０４の出力に基づいて、同乗者が乗車しているか否かが判定されてもよい。 In passenger determination S701, the option presenting means determination unit 106 determines whether a passenger other than the driver is on board. For example, based on the output of the seat sensor 1604 for the passenger seat (or the rear seat), it may be determined whether or not the passenger is on board.

次のタスク終了時間判定Ｓ７０２において、選択肢提示手段判定部１０６は、各提示手段を用いてユーザに選択肢を提示した場合のタスク終了時間を推定する。タスク終了時間の推定値は、提示装置が選択肢を提示するのに要する時間に、予めプリセットされた平均音声認識終了時間を加算したものであってもよいし、各提示装置を使った音声対話装置を被験者に予め使用してもらった際に測定した平均タスク終了時間であってもよい。 In the next step, task end time determination S702, the option presenting means determination unit 106 estimates the task end time for the case where the options are presented to the user with each presentation means. The estimated task end time may be the time required for the presentation device to present the options plus a preset average speech recognition end time, or may be an average task end time measured in advance by having test subjects use the voice interactive apparatus with each presentation device.

最短終了時間手段選択Ｓ７０３において、選択肢提示手段判定部１０６は、同乗者判定Ｓ７０１の結果に基づいて判定された、車内環境の安全性を損なわない提示装置のうち、タスク終了時間判定Ｓ７０２において推定したタスク終了時間が最も短い提示装置を選択する。どの提示装置が選択されたかを示す情報が、選択肢提示手段判定部１０６から出力される。 In shortest end time means selection S703, the option presenting means determination unit 106 selects, from among the presentation devices determined on the basis of the result of passenger determination S701 not to impair the safety of the in-vehicle environment, the presentation device with the shortest task end time estimated in task end time determination S702. Information indicating which presentation device was selected is output from the option presenting means determination unit 106.

なお、安全な提示手段が一つしかない場合、タスク終了時間判定Ｓ７０２は実行されなくてもよい。その場合、最短終了時間手段選択Ｓ７０３において、その一つしかない安全な提示手段が選択される。 If there is only one safe presentation means, the task end time determination S702 may not be executed. In that case, in the shortest end time means selection S703, only one of the safe presentation means is selected.

同乗者判定の結果に基づく、車内環境の安全性を損なわない提示装置の選択について以下に具体的に示す。ここでは、図９に記載された二つの提示装置、すなわち、音声合成を用いて選択肢を読み上げるスピーカ１６０１、及び、選択肢を表示するディスプレイ１６０５を例として説明する。 The selection of the presentation device that does not impair the safety of the in-vehicle environment based on the passenger determination result will be specifically described below. Here, two presentation apparatuses described in FIG. 9, that is, a speaker 1601 that reads an option using speech synthesis and a display 1605 that displays the option will be described as an example.

ディスプレイ１６０５をドライバが見るという行為は、運転中のドライバの注意をそぐ可能性（すなわち、それによって安全性が損なわれる可能性）がある。このため、ドライバが選択肢を見ながら選択肢を選ぶという行為は、好ましくない。そのため、音声合成によって選択肢を読み上げることが、安全な提示方法の一つとして考えられる。 The act of the driver looking at the display 1605 may distract the driver's attention while driving (ie, it may compromise safety). For this reason, it is not preferable that the driver selects an option while viewing the option. Therefore, reading out the choices by speech synthesis is considered as one of the safe presentation methods.

このため、同乗者判定Ｓ７０１の結果、ドライバ以外の同乗者が存在しないと判定された場合、図９に示す提示装置を用いた提示方法のうち、ディスプレイ１６０５上に選択肢を表示するという方法は選択されずに、音声合成を用いて選択肢を読み上げるという方法が選択される。すなわち、使用されるべき提示装置として、ディスプレイ１６０５ではなく、スピーカ１６０１が選択される。 Therefore, as a result of the passenger determination S701, when it is determined that there is no passenger other than the driver, the method of displaying options on the display 1605 is selected from the presentation methods using the presentation device shown in FIG. Instead, the method of reading out the options using speech synthesis is selected. In other words, not the display 1605 but the speaker 1601 is selected as a presentation device to be used.

一方、同乗者判定Ｓ７０１の結果、ドライバ以外の同乗者が存在すると判定された場合、ドライバ以外の同乗者がディスプレイ１６０５を見ながら選択肢を選ぶという行為は安全性を損なわない。したがって、この場合、音声合成を用いて選択肢を読み上げるという方法と、ディスプレイ１６０５上に選択肢を表示するという方法のうち最も推定タスク終了時間が最も短い提示方法が選ばれる。 On the other hand, as a result of the passenger determination S701, when it is determined that there is a passenger other than the driver, the act of the passenger other than the driver selecting an option while viewing the display 1605 does not impair safety. Therefore, in this case, the presentation method with the shortest estimated task end time is selected from the method of reading out the choices using speech synthesis and the method of displaying the choices on the display 1605.
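The selection logic of S701 to S703 for the two presentation devices of FIG. 9 can be sketched as below. The function name, the string labels `"speaker"` and `"display"`, and the dictionary of estimated times are hypothetical names introduced for this example.

```python
def select_presentation(passenger_present, estimated_task_time):
    """Pick the presentation device following the S701-S703 flow.

    passenger_present   : result of passenger determination S701
    estimated_task_time : dict mapping device label to the task end time
                          estimated in S702 (seconds)
    """
    # Reading options aloud is treated as safe in every case.
    safe = ["speaker"]
    # The display is treated as safe only when a passenger other than the
    # driver can look at it.
    if passenger_present:
        safe.append("display")
    # With a single safe means, S702 can be skipped.
    if len(safe) == 1:
        return safe[0]
    # S703: shortest estimated task end time among the safe means.
    return min(safe, key=lambda device: estimated_task_time[device])
```

For example, with assumed estimates of 8 seconds for the speaker and 3 seconds for the display, the display is chosen only when a passenger is present.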

既に説明したように、タスク終了時間は、ユーザ提示装置が選択肢をユーザに提示するのに要する時間、及び、提示された選択肢をユーザが理解し、それらの選択肢のいずれかをユーザが発話するのに要する時間の合計である。 As described above, the task end time is the time required for the user presentation device to present the options to the user, and the user understands the presented options, and the user speaks one of those options. It is the total time required for.

ユーザ提示装置がスピーカ１６０１である場合、それが選択肢をユーザに提示するのに要する時間は、おおむね、選択肢を読み上げる音声の合成に要する時間、及び、合成された音声の出力に要する時間の合計に相当する。この時間の推定値は、例えば、中央演算装置１６０３等のハードウェアの処理性能、及び、選択肢として提示されるべきテキストの長さ等に基づいて算出することができる。 When the user presentation device is the speaker 1601, the time required for it to present the option to the user is approximately the sum of the time required for synthesizing the speech that reads out the option and the time required for outputting the synthesized speech. Equivalent to. This estimated time can be calculated based on, for example, the processing performance of hardware such as the central processing unit 1603 and the length of text to be presented as an option.

一方、ユーザ提示装置がディスプレイ１６０５である場合、それが選択肢をユーザに提示するのに要する時間は、おおむね、ディスプレイ１６０５に表示されるべき画像のデータを生成するのに要する時間、及び、生成された画像をディスプレイ１６０５に表示するのに要する時間の合計に相当する。この時間の推定値は、例えば、中央演算装置１６０３等のハードウェアの処理性能、及び、生成される画像のデータ量に基づいて算出することができる。 On the other hand, when the user presentation device is the display 1605, the time required for it to present the options to the user is approximately the time required to generate the data of the image to be displayed on the display 1605. This corresponds to the total time required to display the displayed image on the display 1605. This estimated time value can be calculated based on, for example, the processing performance of hardware such as the central processing unit 1603 and the data amount of the generated image.

提示された選択肢をユーザが理解し、それらの選択肢のいずれかをユーザが発話するのに要する時間として、あらかじめ所定の値が保持されていてもよい。例えば、被験者が各提示装置を使用した場合に要した時間をあらかじめ実際に計測し、その計測された時間を音声対話装置が保持してもよい。 A predetermined value may be held in advance as the time required for the user to understand the presented options and for the user to speak any of those options. For example, the time required when the subject uses each presentation device may be actually measured in advance, and the measured time may be held by the voice interaction device.
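One possible way to combine the two components of the task end time is sketched below. The per-character rates and the linear model for the speaker are illustrative assumptions only; the specification states merely that the estimate can be based on hardware performance and text length.

```python
def speaker_presentation_time(text, synth_s_per_char, speech_s_per_char):
    """Assumed linear model: synthesis time plus playback time per character."""
    return len(text) * (synth_s_per_char + speech_s_per_char)


def estimate_task_end_time(presentation_time_s, response_time_s):
    """Task end time = time to present the options + preset time for the
    user to understand the options and utter one of them."""
    return presentation_time_s + response_time_s
```

The `response_time_s` argument corresponds to the predetermined value held in advance, for example one measured with test subjects.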

ただし、実際には、タスク終了時間は、ユーザが選択肢を理解し、それらのいずれかを発話するのに要する時間より、むしろ、ユーザ提示装置が選択肢をユーザに提示するのに要する時間によって大きく左右されると考えられる。その場合、タスク終了時間として、ユーザ提示装置が選択肢をユーザに提示するのに要する時間のみが算出され、比較されてもよい。その場合、同乗者がいると判定されると、選択肢をディスプレイ１６０５上に表示するのに要する時間と、選択肢を読み上げるのに要する時間とが算出され、両者が比較される。その結果、例えば、選択肢をディスプレイ１６０５上に表示するのに要する時間が短いと判定された場合、提示装置としてディスプレイ１６０５が選択される。 However, in practice, the task end time depends largely on the time it takes for the user presentation device to present the options to the user rather than the time it takes for the user to understand the options and speak any of them. It is thought that it is done. In that case, only the time required for the user presentation device to present the option to the user may be calculated and compared as the task end time. In this case, if it is determined that there is a passenger, the time required for displaying the option on the display 1605 and the time required for reading the option are calculated and compared. As a result, for example, when it is determined that the time required for displaying the option on the display 1605 is short, the display 1605 is selected as the presentation device.

さらに、実際には、同一の選択肢が提示される場合、選択肢をディスプレイ１６０５上に表示するのに要する時間は、選択肢を読み上げるのに要する時間より短くなるのが一般的である。このため、上記の二つの提示方法がいずれも安全性を損なわないと判定された場合、常に（すなわち、推定タスク終了時間を算出することなく）、選択肢をディスプレイ１６０５上に表示するという提示方法が最も推定タスク終了時間が短い提示方法として選択されてもよい。 Furthermore, in practice, when the same options are presented, the time required to display them on the display 1605 is generally shorter than the time required to read them out. For this reason, when it is determined that neither of the two presentation methods impairs safety, the method of displaying the options on the display 1605 may always be selected as the presentation method with the shortest estimated task end time (that is, without calculating the estimated task end time).

ディスプレイ１６０５上に選択肢を表示する方法が選択された場合において、ディスプレイとして指向性ディスプレイが用いられる場合、指向性をドライバ以外の同乗者に向けるように設定されてもよい。また音声再生用スピーカ１６０１として指向性スピーカが用いられる場合、これから選択肢を入力するユーザ（すなわち、ドライバ以外の同乗者）の方向に指向性を向けてもよい。 When a method for displaying options on the display 1605 is selected, when a directional display is used as the display, the directivity may be set to be directed toward a passenger other than the driver. When a directional speaker is used as the audio reproduction speaker 1601, directivity may be directed toward the user who inputs an option from now on (ie, a passenger other than the driver).

図１０に示す処理は、対話制御部１０７がユーザと対話する方法を選択するために実行される。例えば、図１０に示す処理は、対話制御部１０７が図６に示す処理を開始する前に実行されてもよいし、対話制御部１０７が各コマンドを処理するたびに実行されてもよい。 The process shown in FIG. 10 is executed in order for the dialog control unit 107 to select a method for interacting with the user. For example, the process shown in FIG. 10 may be executed before the dialog control unit 107 starts the process shown in FIG. 6, or may be executed every time the dialog control unit 107 processes each command.

図１１は、本発明の第１の実施形態において出力されるガイダンス音声の例を示す説明図である。 FIG. 11 is an explanatory diagram showing an example of guidance voice output in the first embodiment of the present invention.

具体的には、図１１は、同乗者がいる場合といない場合のガイダンス音声の出力例を示す。同乗者がいない場合、音声合成で読み上げた選択肢の中から選ぶように誘導するガイダンス音声（図１１の例では、ガイダンス文「これから読み上げる施設名の中からお選びください」を読み上げる音声）が出力される。 Specifically, FIG. 11 shows examples of the guidance voice output when a passenger is present and when no passenger is present. When there is no passenger, a guidance voice is output that guides the user to choose from among the options read out by speech synthesis (in the example of FIG. 11, a voice reading out the guidance sentence “Please choose from the facility names that will now be read out”).

一方、同乗者がいる場合、ディスプレイ１６０５上に表示された選択肢の中から選ぶように誘導するガイダンス音声（図１１の例では、ガイダンス文「同乗者の方がお答えください。これから画面に表示される施設名一覧の中からお選びください」を読み上げる音声）が出力される。同乗者がいる場合であっても、ドライバがディスプレイ１６０５を見ることを誘導してしまうことは好ましくないため、同乗者が答えるように誘導することも必要である。 On the other hand, when there is a passenger, a guidance voice is output that guides the user to choose from among the options displayed on the display 1605 (in the example of FIG. 11, a voice reading out the guidance sentence “The passenger should answer. Please choose from the list of facility names that will now be displayed on the screen”). Even when a passenger is present, it is undesirable to induce the driver to look at the display 1605, so it is also necessary to guide the passenger to answer.

図１２は、本発明の第１の実施形態においてディスプレイ１６０５に表示される選択肢の例を示す説明図である。 FIG. 12 is an explanatory diagram illustrating an example of options displayed on the display 1605 according to the first embodiment of this invention.

具体的には、図１２は、ディスプレイ１６０５上に選択肢を表示するという方法が選択された場合に表示される画面の表示例を示す。図１２では、例として、設定したい目的地の選択肢（例えば、「レストラン」及び「自宅」等）が画面上に表示される。図１２のように選択肢を画面上に表示することで、音声合成を用いて選択肢を読み上げるという方法と比べると、ユーザが選択肢を把握するまでの時間を短縮することができる。 Specifically, FIG. 12 shows a display example of a screen displayed when a method of displaying options on the display 1605 is selected. In FIG. 12, as an example, destination options (for example, “restaurant” and “home”) to be set are displayed on the screen. By displaying the options on the screen as shown in FIG. 12, it is possible to shorten the time until the user grasps the options as compared to the method of reading the options using speech synthesis.

質問文生成部１１０は、選択肢提示手段判定部１０６の判定結果と同乗者の有無に基づき、質問文を生成する。質問文は、それを出力する必要が生じるたびにリアルタイムに生成されてもよいし、予め条件毎に文がプリセットされていてもよい。さらに、質問文中に同乗者の名前を含めることによって、ある特定の同乗者に回答を促してもよい。そのようにすることで、特定の同乗者の音声のみ抽出すればよくなるため、認識性能を向上可能となる。同乗者の名前を質問文に含めるためには、予めその同乗者の名前を登録する必要がある。 The question sentence generation unit 110 generates a question sentence based on the determination result of the option presenting means determination unit 106 and the presence or absence of a passenger. The question sentence may be generated in real time every time it is necessary to output it, or the sentence may be preset for each condition. Further, by including the passenger's name in the question text, a specific passenger may be prompted to answer. By doing so, it is only necessary to extract the voice of a specific passenger, so that the recognition performance can be improved. In order to include the passenger's name in the question sentence, it is necessary to register the passenger's name in advance.

図１３は、本発明の第１の実施形態において選択肢の提示のために実行される処理の一例を示すフローチャートである。 FIG. 13 is a flowchart illustrating an example of processing executed for presenting options in the first embodiment of the present invention.

最初に、助手席判定Ｓ１７０１において、選択肢提示手段判定部１０６は、助手席に同乗者が乗車しているか否かを判定する。この判定は、図１０の同乗者判定Ｓ７０１と同様の方法で実行される。 First, in the passenger seat determination S1701, the option presenting means determination unit 106 determines whether or not a passenger is in the passenger seat. This determination is executed by the same method as the passenger determination S701 in FIG.

助手席に同乗者が乗車していると判定された場合、ガイダンス出力１＿Ｓ１７０２において、選択肢提示手段判定部１０６は、同乗者がいる場合のガイダンス音声を出力する（図１１参照）。 When it is determined that a passenger is in the passenger seat, in the guidance output 1_S1702, the option presenting means determination unit 106 outputs a guidance voice when there is a passenger (see FIG. 11).

次に、画面に選択肢表示Ｓ１７０４において、選択肢提示手段判定部１０６は、選択肢をディスプレイ１６０５に表示する。具体的には、選択肢提示手段判定部１０６からの指示に基づいて、表示されるべき画像を表示画像生成部１１１が生成し、生成された画像を画像表示部１１３がディスプレイ１６０５に表示させる。 Next, in display options on screen S1704, the option presenting means determination unit 106 displays the options on the display 1605. Specifically, based on an instruction from the option presenting means determination unit 106, the display image generation unit 111 generates the image to be displayed, and the image display unit 113 causes the display 1605 to display the generated image.

一方、助手席に同乗者が乗車していないと判定された場合、ガイダンス出力２＿Ｓ１７０３において、選択肢提示手段判定部１０６は、同乗者がいない場合のガイダンス音声を出力する（図１１参照）。 On the other hand, when it is determined that the passenger is not in the passenger seat, in the guidance output 2_S1703, the option presenting means determination unit 106 outputs a guidance voice when there is no passenger (see FIG. 11).

次に、音声で選択肢読み上げＳ１７０５において、選択肢提示手段判定部１０６は、選択肢を読み上げる音声を出力する。具体的には、選択肢提示手段判定部１０６からの指示に基づいて、出力されるべき選択肢を含む質問文を質問文生成部１１０が生成し、生成された質問文を読み上げる音声を音声出力部１１２がスピーカ１６０１に出力させる。 Next, in read out options by voice S1705, the option presenting means determination unit 106 outputs a voice that reads out the options. Specifically, based on an instruction from the option presenting means determination unit 106, the question sentence generation unit 110 generates a question sentence including the options to be output, and the voice output unit 112 causes the speaker 1601 to output a voice reading out the generated question sentence.

画面に選択肢表示Ｓ１７０４又は音声で選択肢読み上げＳ１７０５が実行された後、音声認識Ｓ１７０６が実行される。具体的には、図３等を参照して説明したように、音声入力部１０１がユーザからの音声入力を受信し、その入力された音声を音源分離部１０２及び音声認識部１０３が処理することによって、入力された音声が認識される。 After display options on screen S1704 or read out options by voice S1705 is executed, speech recognition S1706 is executed. Specifically, as described with reference to FIG. 3 and other figures, the voice input unit 101 receives voice input from the user, and the sound source separation unit 102 and the speech recognition unit 103 process the input voice, whereby the input voice is recognized.

このとき、助手席判定Ｓ１７０１の結果に基づいて目的音の範囲が設定されてもよい（図３の目的音範囲設定Ｓ９０１）。例えば、同乗者がいると判定された場合、目的音範囲設定Ｓ９０１において、マイクロホン１６０６から同乗者（例えば助手席に着席しているユーザ）への方向を含む所定の範囲が目的音の範囲として設定されてもよい。その場合、目的音の範囲内からの音声（すなわち同乗者が発話した音声）は、受理される。受理された音声は、選択肢の提示に対する応答として処理される。一方、目的音の範囲外からの音声（例えば運転者が発話した音声）は、棄却されるため、選択肢の提示に対する応答として処理されない。 At this time, the target sound range may be set based on the result of passenger seat determination S1701 (target sound range setting S901 in FIG. 3). For example, when it is determined that there is a passenger, a predetermined range including the direction from the microphone 1606 to the passenger (for example, a user seated in the passenger seat) may be set as the target sound range in target sound range setting S901. In that case, voice from within the target sound range (that is, voice uttered by the passenger) is accepted. The accepted voice is processed as a response to the presentation of the options. On the other hand, voice from outside the target sound range (for example, voice uttered by the driver) is rejected and is therefore not processed as a response to the presentation of the options.
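The branch structure of FIG. 13, together with the optional target sound range setting, can be summarised in a short sketch. The function name, the step labels as strings, and the `"passenger"` marker for the target sound range are assumptions for illustration; the case without a passenger leaves the range unset here because the text does not specify it.

```python
def option_presentation_steps(passenger_in_passenger_seat):
    """Return (ordered step labels, target sound range marker) for FIG. 13."""
    if passenger_in_passenger_seat:
        # Guidance for the passenger, then options shown on the display.
        steps = ["S1702", "S1704"]
        target_range = "passenger"  # accept only speech from the passenger's direction
    else:
        # Guidance for the driver, then options read out by voice.
        steps = ["S1703", "S1705"]
        target_range = None  # not specified in the text for this case
    # Speech recognition follows either branch.
    steps.append("S1706")
    return steps, target_range
```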

なお、上記図１１及び図１３は、同乗者がいない場合の選択肢提示方法として選択肢を読み上げることが選択され、同乗者がいる場合の選択肢提示方法として選択肢を表示することが選択される場合を例として示した。しかし、例えば、助手席判定Ｓ１７０１において、図１０に示したものと同様の処理が実行されてもよい。その結果、助手席に同乗者が乗車している場合であっても、選択肢提示方法として選択肢を読み上げることが選択される場合もある。その場合、ガイダンス出力２＿Ｓ１７０３及び音声で選択肢読み上げＳ１７０５が実行される。 Note that FIGS. 11 and 13 above show, as an example, the case where reading out the options is selected as the option presentation method when there is no passenger, and displaying the options is selected when there is a passenger. However, for example, processing similar to that shown in FIG. 10 may be executed in passenger seat determination S1701. As a result, reading out the options may be selected as the option presentation method even when a passenger is in the passenger seat. In that case, guidance output 2_S1703 and read out options by voice S1705 are executed.

FIG. 14 is a flowchart showing the passenger name registration process according to the first embodiment of the present invention.

This process is executed immediately after the car navigation system starts up. In the passenger seat sensor check S801, whether a person is in the passenger seat is determined based on information from the seat sensor 1604.

If a person is in the passenger seat, the process proceeds to name recognition S802. In name recognition S802, the speech recognition dictionary is set to accept all syllable sequences, and speech recognition is then executed. This makes it possible to recognize any person's name. The recognized name is labeled as the passenger seat result and then stored in the storage medium 1602. The stored name recognition result is referred to when a question sentence is generated.

If no person is in the passenger seat, or after name recognition S802 has been executed, the process proceeds to the rear seat sensor check S803, which determines whether a person is in the rear seat. If so, the process proceeds to name recognition S804. In name recognition S804, the speech recognition dictionary is set to accept all syllable sequences, and speech recognition is then executed. The recognized name is labeled as the rear seat result and then stored in the storage medium 1602. The stored name recognition result is referred to when a question sentence is generated.

If no person is in the rear seat, the process ends. The process also ends after name recognition S804.
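The name registration flow of FIG. 14 can be sketched roughly as follows. This is only an illustrative sketch: the `NameStore` class, the seat labels, and the `seat_occupied` and `recognize_name` callables are hypothetical stand-ins for the storage medium 1602, the seat sensor 1604, and the speech recognizer, and none of them is specified in this form in the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class NameStore:
    """Hypothetical stand-in for the storage medium 1602."""
    names: Dict[str, str] = field(default_factory=dict)


def register_passenger_names(
    seat_occupied: Callable[[str], bool],
    recognize_name: Callable[[], str],
    store: NameStore,
) -> NameStore:
    """Check each seat in order and store a labeled name for occupied seats."""
    # S801 then S803: the passenger seat is checked first, then the rear seat.
    for seat in ("passenger_seat", "rear_seat"):
        if seat_occupied(seat):
            # S802 / S804: recognize a name and label it with the seat.
            store.names[seat] = recognize_name()
    return store
```

The stored labels would later let question sentence generation look up the name for a given seat.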

The voice output unit 112 converts the question sentence generated by the question sentence generation unit 110 into synthesized speech and outputs it from the speaker 1601. The display image generation unit 111 generates a display image when information needs to be displayed on the screen, based on the determination result of the option presenting means determination unit 106 and the presence or absence of a passenger. The generated image may display the options on the screen as text information, as illustrated in FIG. 12, or may display an image preset for each option. In the latter case, for example, images of the restaurants included in the options may be displayed on the screen. The image display unit 113 displays the image generated by the display image generation unit 111 on the display 1605.

The function control unit 109 controls in-vehicle devices based on the speech recognition result.

FIG. 15 is a flowchart showing the processing executed by the function control unit 109 according to the first embodiment of the present invention.

In the route setting command S1501, the function control unit 109 determines whether the speech recognition result relates to route setting. This determination is realized by setting in advance, for each speech recognition result candidate, a flag indicating whether that result relates to route setting.

If the speech recognition result relates to route setting, the function control unit 109 calls the route setting process of the car navigation system and ends the process.

If the speech recognition result does not relate to route setting, the function control unit 109 next determines, in the air conditioner operation command S1502, whether the speech recognition result relates to an air conditioner operation command. This determination is realized by setting in advance, for each speech recognition result candidate, a flag indicating whether that result relates to an air conditioner operation command.

If the speech recognition result relates to an air conditioner operation command, the function control unit 109 calls the air conditioner operation process on the car navigation system and ends the process.

If the speech recognition result does not relate to an air conditioner operation command, the function control unit 109 next determines, in the speaker control command S1503, whether the speech recognition result relates to speaker control. If it does, the function control unit 109 calls the speaker control process on the car navigation system and ends the process.
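The flag-based dispatch of FIG. 15 might look like the sketch below. The `COMMAND_FLAGS` table, its example phrases, and the handler names are assumptions made for illustration. The description only states that a flag is set in advance for each recognition candidate, so a single category label per candidate is used here in place of separate boolean flags.

```python
from typing import Callable, Dict, Optional

# Hypothetical flag table: each recognition candidate is tagged in advance.
COMMAND_FLAGS: Dict[str, str] = {
    "set route home": "route",
    "turn on the air conditioner": "air_conditioner",
    "turn up the volume": "speaker",
}


def dispatch(result: str, handlers: Dict[str, Callable[[str], None]]) -> Optional[str]:
    """Call the handler matching the recognition result; return its category."""
    category = COMMAND_FLAGS.get(result)
    # The checks run in the order S1501 -> S1502 -> S1503.
    for step in ("route", "air_conditioner", "speaker"):
        if category == step:
            handlers[step](result)
            return step
    # No flag matched: no in-vehicle device is controlled.
    return None
```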

As described above, according to the first embodiment of the present invention, whether a user other than the driver is in the automobile is determined, and the method of presenting information to the user is selected based on the result. This realizes a short voice dialogue without compromising safety.

Next, a second embodiment of the present invention will be described. The second embodiment is realized by the same hardware as the first embodiment (see FIGS. 2 and 9). In the second embodiment, the same processing as in the first embodiment is executed except for the differences described below.

FIG. 16 is a functional block diagram of the voice interactive apparatus according to the second embodiment of the present invention.

Among the units shown in FIG. 16, the voice input unit 201, the sound source separation unit 202, the speech recognition unit 203, the recognition result rejection determination unit 204, the separation range setting unit 205, the option presenting means determination unit 206, the dialogue control unit 207, the recognition dictionary generation unit 208, the function control unit 209, the recognition vocabulary DB 217, the dialogue script DB 216, the question sentence generation unit 210, the display image generation unit 211, the voice output unit 214, and the image display unit 215 have functions equivalent to those of, respectively, the voice input unit 101, the sound source separation unit 102, the speech recognition unit 103, the recognition result rejection determination unit 104, the separation range setting unit 105, the option presenting means determination unit 106, the dialogue control unit 107, the recognition dictionary generation unit 108, the function control unit 109, the recognition vocabulary DB 115, the dialogue script DB 114, the question sentence generation unit 110, the display image generation unit 111, the voice output unit 112, and the image display unit 113 in the first embodiment.

The voice interactive apparatus of the second embodiment further includes a question sentence output timing determination unit 212, an image display timing determination unit 213, and a traveling state determination unit 218. These are likewise realized by the central processing unit 1203 executing a program stored in the storage medium 1205.

The traveling state determination unit 218 determines the current traveling state, for example whether the vehicle is currently moving or stationary, based on vehicle speed information. The traveling state determination unit 218 may further use the map information of the car navigation system to determine, for example, that the current location is on an expressway or that the vehicle is currently stopped at an intersection.

Specifically, the car navigation system shown in FIG. 9 includes a positioning device (not shown) that acquires current position information of the automobile. By referring to the current position information acquired by the positioning device and the map information stored in the storage medium 1602, the traveling state determination unit 218 can determine, for example, whether the current location is on an expressway.
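As a rough illustration of the traveling state determination, the sketch below combines a speed reading with a map lookup. The `TravelState` type, the `road_type_at` callable (standing in for the map information in the storage medium 1602), the `"expressway"` label, and the speed threshold are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Position = Tuple[float, float]  # hypothetical (latitude, longitude) pair


@dataclass(frozen=True)
class TravelState:
    moving: bool
    on_expressway: bool


def determine_travel_state(
    speed_kmh: float,
    position: Position,
    road_type_at: Callable[[Position], str],
    moving_threshold_kmh: float = 0.0,
) -> TravelState:
    """Classify the vehicle as moving or stationary and look up the road type."""
    return TravelState(
        # Speed sensor: anything above the threshold counts as moving.
        moving=speed_kmh > moving_threshold_kmh,
        # Positioning device plus map information: is this an expressway?
        on_expressway=road_type_at(position) == "expressway",
    )
```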

Based on the determined traveling state, the option presenting means determination unit 206 selects, from among the presenting means that use the user presentation devices attached to the car navigation system, a means with a short task end time, on the condition that safety is not compromised. The task end time is the time required for the user presentation device to present the options to the user, for the user to understand them, and for the user to utter one of them. Looking at the display 1605 may distract the driver and thereby compromise safety. It is therefore undesirable for the driver to choose an option while looking at the options during driving.

For this reason, when the passenger determination S701 determines that there is no passenger other than the driver, the option presenting means is determined based on the determined traveling state. Specifically, when the vehicle is stationary (that is, stopped), the method of displaying the options on the display 1605 is selected. When the vehicle is moving, the selected method is to play guidance prompting the driver to stop the vehicle at a place where it can be stopped, confirm through the traveling state determination that the driver has stopped the vehicle, and then display the options on the display 1605.

The question sentence output timing determination unit 212 controls, according to the traveling state, the timing at which the voice reading out the question sentence is output. Specifically, when the vehicle is moving, the question sentence output timing determination unit 212 performs control so that guidance prompting the driver to stop the vehicle at a place where it can be stopped is played and then, at the timing when the vehicle becomes stationary, a question sentence prompting the user to choose from the options on the screen is output.

Similarly, the image display timing determination unit 213 controls, according to the traveling state, the timing at which the options are displayed on the display 1605. Specifically, the image display timing determination unit 213 performs control so that the options are displayed on the screen of the display 1605 at the timing when the vehicle becomes stationary.

FIG. 17 is a timing chart of the traveling state and the voice dialogue in the second embodiment of the present invention.

In the example of FIG. 17, the vehicle was traveling at 40 km/h when the voice dialogue started, so the question sentence output timing determination unit 212 controls the voice output unit 214 to play guidance prompting the driver to stop. The question sentence output timing determination unit 212 then confirms that the vehicle has stopped and controls the voice output unit 214 to output the question sentence.

After confirming that the vehicle has stopped, the image display timing determination unit 213 controls the image display unit 215 to display the options on the screen of the display 1605. Speech recognition by the speech recognition unit 203 and related units then starts.
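The ordering in the timing chart can be illustrated with a small sketch that turns a series of speed samples into dialogue events. The function name, the event labels, and the use of sample indices as timestamps are assumptions for this example; the description does not define a sampling scheme.

```python
from typing import List, Sequence, Tuple


def schedule_dialogue(speed_samples_kmh: Sequence[float]) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs in the order they would occur."""
    events: List[Tuple[int, str]] = []
    stop_guidance_played = False
    for t, speed in enumerate(speed_samples_kmh):
        if speed > 0:
            # Vehicle still moving: play the stop guidance once, then wait.
            if not stop_guidance_played:
                events.append((t, "play_stop_guidance"))
                stop_guidance_played = True
            continue
        # Vehicle stationary: question, on-screen options, then recognition.
        events.append((t, "output_question"))
        events.append((t, "display_options"))
        events.append((t, "start_recognition"))
        break
    return events
```

With samples such as `[40, 20, 0, 0]`, the stop guidance is issued at the first sample and the question, display, and recognition events are issued at the first sample with zero speed.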

FIG. 18 is a flowchart showing an example of the processing executed to present options in the second embodiment of the present invention.

The following describes an example in which the process of FIG. 18 is executed when no passenger other than the driver is on board and the vehicle is moving.

First, in the situation determination S1801, the traveling state determination unit 218 determines whether the vehicle can be stopped. For example, the traveling state determination unit 218 may determine that the vehicle cannot be stopped when the current location is on an expressway, and that the vehicle can be stopped when the current location is on an ordinary road.

If it is determined that the vehicle cannot be stopped, the option presenting means determination unit 206 outputs, in guidance output 1_S1802, the guidance voice for the case where the vehicle cannot be stopped. For example, the option presenting means determination unit 206 may control the question sentence generation unit 210 so that a voice reading out the question sentence for the no-passenger case shown in FIG. 11 is output.

Next, in option reading by voice S1804, the option presenting means determination unit 206 outputs a voice that reads out the options. Specifically, based on an instruction from the option presenting means determination unit 206, the question sentence generation unit 210 generates a question sentence containing the options to be output, and the voice output unit 214 causes the speaker 1601 to output a voice reading out the generated question sentence.

If, on the other hand, it is determined that the vehicle can be stopped, the option presenting means determination unit 206 outputs, in guidance output 2_S1803, the guidance voice for the case where the vehicle can be stopped. For example, the option presenting means determination unit 206 may control the question sentence generation unit 210 to output guidance prompting the driver to stop the vehicle at a place where it can be stopped.

Next, in stop determination S1805, the question sentence output timing determination unit 212 determines whether the vehicle has stopped. If it is determined that the vehicle has not yet stopped, the process returns to guidance output 2_S1803, and guidance prompting the driver to stop the vehicle is output again.

If it is determined that the vehicle has stopped, the question sentence output timing determination unit 212 outputs, in guidance output 3_S1806, the guidance voice for the case where the vehicle has stopped. For example, the question sentence output timing determination unit 212 may control the voice output unit 214 to output a voice reading out the question sentence for the case where a passenger is present (see, for example, FIG. 11), generated by the question sentence generation unit 210. Although there is actually no passenger in this example, the vehicle is stopped, so safety is not compromised even if the driver looks at the display 1605.

The image display timing determination unit 213 also determines, in stop determination S1805, whether the vehicle has stopped. If it is determined that the vehicle has not yet stopped, the process returns to guidance output 2_S1803, and guidance prompting the driver to stop the vehicle is output again.

If it is determined that the vehicle has stopped, the image display timing determination unit 213 displays the options on the screen of the display 1605 in guidance output 3_S1806. For example, the image display timing determination unit 213 may control the image display unit 215 to display the display image of the options generated by the display image generation unit 211 (see, for example, FIG. 12).

After option reading by voice S1804 or guidance output 3_S1806 has been executed, speech recognition S1807 is executed. This step is the same as speech recognition S1706 of the first embodiment.
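The branching of FIG. 18 can be traced with the sketch below, which returns the sequence of step labels visited. The function name and its parameters are hypothetical. In particular, `max_prompts` is a safeguard added only so that the example terminates; the flowchart as described repeats the stop guidance until the vehicle stops.

```python
from typing import Callable, List


def present_options(
    on_expressway: bool,
    is_stopped: Callable[[], bool],
    max_prompts: int = 3,
) -> List[str]:
    """Return the FIG. 18 step labels in the order they are executed."""
    steps = ["S1801"]  # situation determination: can the vehicle be stopped?
    if on_expressway:
        # Cannot stop: guidance output 1, then read the options aloud.
        steps += ["S1802", "S1804"]
    else:
        for _ in range(max_prompts):
            steps.append("S1803")  # guidance output 2: prompt the driver to stop
            steps.append("S1805")  # stop determination
            if is_stopped():
                steps.append("S1806")  # guidance output 3 and on-screen options
                break
        else:
            # Assumed safeguard: give up without starting recognition.
            return steps
    steps.append("S1807")  # speech recognition
    return steps
```

On an expressway the trace goes straight to reading the options aloud, while on an ordinary road the stop guidance and stop determination repeat until `is_stopped()` returns true.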

FIG. 18 has been described above for the case where no passenger other than the driver is on board and the vehicle is moving. In the present embodiment as well, for example, the passenger seat determination S1701 shown in FIG. 13 may be executed as in the first embodiment, and the process shown in FIG. 18 may be started when it is determined that there is no passenger. When it is determined that there is a passenger, guidance output 1_S1702 and option display on screen S1704 may be executed as in the first embodiment.

The process of FIG. 18 may also be executed when the vehicle is not moving. For example, the situation determination S1801 may first determine whether the vehicle is moving and, if it is determined to be moving, then determine whether the vehicle can be stopped. If it is determined that the vehicle is not moving, guidance output 2_S1803 and stop determination S1805 may be omitted and guidance output 3_S1806 may be executed.

As described above, according to the second embodiment of the present invention, even when only the driver is in the automobile, an information presentation method is selected that allows the voice dialogue to be completed in a short time without compromising safety.

The first and second embodiments of the present invention have been described using, as an example, processing in which text information indicating options is presented to the user and the user utters one of the options in response. However, these embodiments can be applied to the presentation of arbitrary text information and the processing of the user's response to it.

FIG. 1 is a functional block diagram of the voice interactive apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram of the hardware configuration of the voice interactive apparatus according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing the speech recognition process executed in the first embodiment of the present invention.
FIG. 4 is an explanatory diagram showing the detailed flow of processing executed by the sound source separation unit according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing the processing executed for each time-frequency in the sound source separation according to the first embodiment of the present invention.
FIG. 6 is a flowchart showing a dialogue based on the dialogue script in the first embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of the speech recognition dictionary used in the first embodiment of the present invention.
FIG. 8 is an explanatory diagram showing an example of the word list for each arc in the first embodiment of the present invention.
FIG. 9 is a block diagram of the hardware configuration of a car navigation system including the voice interactive apparatus according to the first embodiment of the present invention.
FIG. 10 is a flowchart showing the processing executed by the option presenting means determination unit according to the first embodiment of the present invention.
FIG. 11 is an explanatory diagram showing an example of the guidance voice output in the first embodiment of the present invention.
FIG. 12 is an explanatory diagram showing an example of the options displayed on the display in the first embodiment of the present invention.
FIG. 13 is a flowchart showing an example of the processing executed to present options in the first embodiment of the present invention.
FIG. 14 is a flowchart showing the passenger name registration process in the first embodiment of the present invention.
FIG. 15 is a flowchart showing the processing executed by the function control unit according to the first embodiment of the present invention.
FIG. 16 is a functional block diagram of the voice interactive apparatus according to the second embodiment of the present invention.
FIG. 17 is a timing chart of the traveling state and the voice dialogue in the second embodiment of the present invention.
FIG. 18 is a flowchart showing an example of the processing executed to present options in the second embodiment of the present invention.

Explanation of symbols

101, 201 Voice input unit
102, 202 Sound source separation unit
103, 203 Speech recognition unit
104, 204 Recognition result rejection determination unit
105, 205 Separation range setting unit
106, 206 Option presenting means determination unit
107, 207 Dialogue control unit
108, 208 Recognition dictionary generation unit
109, 209 Function control unit
110, 210 Question sentence generation unit
111, 211 Display image generation unit
112, 214 Voice output unit
113, 215 Image display unit
114, 216 Dialogue script DB
115, 217 Recognition vocabulary DB
212 Question sentence output timing determination unit
213 Image display timing determination unit
218 Traveling state determination unit
S301 Command name
S302 Destination setting
S303 Automobile equipment operation
S304 Nearby facility search
S305 Time-frequency decomposition
S701 Passenger determination
S702 Task end time determination
S703 Shortest end time means selection
S801 Passenger seat sensor check
S802 Name recognition
S803 Rear seat sensor check
S804 Name recognition
S901 Target sound range setting
S902 Sound source separation
1001 DFT
1002 Per-frequency vectorization
1003 Sound source localization
1004 Sound source separation
1005 Inverse DFT
S1101 Whether within target sound range
S1102 Target sound covariance update
S1103 Noise covariance update
S1104 Filter generation
S1105 Filtering
S1106 Post-filtering
1201 Microphone array
1202 AD converter
1203, 1603 Central processing unit
1204 Volatile memory
1205, 1602 Storage medium
S1501 Route setting command
S1502 Air conditioner operation command
S1503 Speaker control command
1601 Speaker
1604 Seat sensor
1605 Display
1606 Microphone
1607 Speed sensor

Claims

1. A voice interactive apparatus comprising:
an occupant detection unit that detects the position of an occupant in an automobile;
a voice output unit that converts text information into voice and outputs the voice;
an image display unit that converts the text information into an image and displays the image;
a plurality of microphones; and
a switching unit that selects either the voice output unit or the image display unit based on the detection result of the occupant detection unit and outputs the text information to the selected voice output unit or image display unit,
wherein the voice interactive apparatus:
when an occupant other than the driver is on board, estimates a first time required to display the text information on the image display unit and a second time required to output, from the voice output unit, a voice that reads out the text information;
selects outputting the text information to the image display unit when the first time is shorter than the second time, and selects outputting the text information to the voice output unit when the second time is shorter than the first time;
when an occupant other than the driver is on board, further outputs, from the voice output unit, a voice prompting the occupant other than the driver to respond;
when the plurality of microphones receive a voice after the voice prompting the occupant other than the driver to respond has been output, identifies the sound source direction of the received voice;
determines whether the identified sound source direction is within a predetermined range that includes the direction from the plurality of microphones to the occupant other than the driver;
when the identified sound source direction is within the predetermined range, processes the received voice as a response to the text information displayed on the image display unit; and
when no occupant other than the driver is on board, selects outputting the text information to the voice output unit.

2. The voice interactive apparatus according to claim 1, wherein
  the automobile further includes a speed sensor that outputs information indicating whether the automobile is running, and
  the voice interactive apparatus:
  determines, based on the output from the speed sensor, whether the automobile is running;
  when it is determined that the automobile is running, outputs from the voice output unit a voice prompting the driver to stop the automobile; and
  when it is determined that the automobile is stopped, displays an image including the text information on the image display unit.

3. The voice interactive apparatus according to claim 2, wherein the voice interactive apparatus determines whether the automobile is running when no occupant other than the driver is on board, and does not determine whether the automobile is running when an occupant other than the driver is on board.

4. The voice interactive apparatus according to claim 2, wherein
  the automobile further includes a positioning device that acquires current position information of the automobile, and
  the voice interactive apparatus:
  determines, based on the current position information of the automobile acquired by the positioning device, whether the automobile can be stopped;
  when it is determined that the automobile can be stopped, outputs from the voice output unit a voice prompting the driver to stop the automobile; and
  when it is determined that the automobile cannot be stopped, outputs from the voice output unit a voice that reads out the text information.

5. The voice interactive apparatus according to claim 4, wherein, when it is determined based on the current position information that the automobile is traveling on an expressway, the voice interactive apparatus determines that the automobile cannot be stopped.

6. A voice dialogue method performed by a voice dialogue device comprising: a voice output unit that converts text information into voice and outputs the voice; an image display unit that converts the text information into an image and outputs the image; and a plurality of microphones, the voice dialogue method comprising:
  an occupant detection procedure for detecting the position of an occupant in the automobile; and
  a switching procedure for selecting either the voice output unit or the image display unit based on the position of the occupant detected by the occupant detection procedure, and outputting the text information from the selected voice output unit or image display unit,
  wherein, in the switching procedure, when an occupant other than the driver is on board, a first time required to display the text information on the image display unit and a second time required to output, from the voice output unit, a voice reading out the text information are estimated; when the first time is shorter than the second time, output of the text information to the image display unit is selected; when the second time is shorter than the first time, output of the text information to the voice output unit is selected; and when no occupant other than the driver is on board, output of the text information to the voice output unit is selected, and
  the voice dialogue method further comprises:
  a procedure for outputting, from the voice output unit, a voice prompting the occupant other than the driver to respond when it is determined that an occupant other than the driver is on board;
  a procedure for specifying the sound source direction of a received voice when the plurality of microphones receive the voice after the voice prompting the occupant other than the driver to respond is output;
  a procedure for determining whether or not the specified sound source direction is within a predetermined range including the direction from the plurality of microphones to the occupant other than the driver; and
  a procedure for processing the received voice as a response to the text information displayed on the image display unit when the specified sound source direction is within the predetermined range.
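
The switching procedure and the sound source direction check recited in the method claim above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the time-estimation constants, the angle convention (degrees relative to the microphone array), and the fallback to voice when the two estimated times are equal are all hypothetical and do not appear in the claims.

```python
from enum import Enum

# Hypothetical constants: the claims do not say how the two times are estimated.
DISPLAY_OVERHEAD_S = 2.0    # assumed fixed time to look at the screen
DISPLAY_S_PER_CHAR = 0.05   # assumed reading time per character
SPEECH_S_PER_CHAR = 0.15    # assumed read-out time per character


class Output(Enum):
    VOICE = "voice"
    IMAGE = "image"


def select_output(text: str, passenger_on_board: bool) -> Output:
    """Choose the output unit for `text`, following the switching procedure."""
    if not passenger_on_board:
        # Driver alone: the text is output from the voice output unit.
        return Output.VOICE
    first_time = DISPLAY_OVERHEAD_S + len(text) * DISPLAY_S_PER_CHAR  # display
    second_time = len(text) * SPEECH_S_PER_CHAR                       # read-out
    # The claims leave the equal case open; using voice there is an assumption.
    return Output.IMAGE if first_time < second_time else Output.VOICE


def in_passenger_range(direction_deg: float,
                       passenger_deg: float,
                       half_width_deg: float) -> bool:
    """Return True if the sound source direction is within the predetermined
    range centered on the direction of the occupant other than the driver."""
    # Wrap the difference into [-180, 180) so 350 and 10 degrees are 20 apart.
    diff = (direction_deg - passenger_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_width_deg
```

With these assumed constants, a short message is read aloud and a long one is displayed even when a passenger is present, and a received voice is treated as the passenger's response only when `in_passenger_range` returns True.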

7. The voice dialogue method according to claim 6, wherein the automobile further includes a speed sensor that outputs information indicating whether or not the automobile is running, and
  the voice dialogue method further comprises:
  a procedure for determining whether or not the automobile is running based on an output from the speed sensor;
  a procedure for outputting, from the voice output unit, a voice prompting the driver to stop the automobile when it is determined that the automobile is running; and
  a procedure for displaying an image including the text information on the image display unit when it is determined that the automobile is stopped.

8. The voice dialogue method according to claim 7, wherein the procedure for determining whether or not the automobile is running is executed when no occupant other than the driver is on board, and is not executed when an occupant other than the driver is on board.
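
The speed-sensor procedures of the two claims above can be sketched as follows. This is an illustrative sketch only: `next_action`, the `read_speed_sensor` callback, the `(unit, content)` tuple, and the wording of `STOP_PROMPT` are hypothetical names introduced here, and returning `None` in the passenger case simply signals that the output choice is left to the switching procedure.

```python
from typing import Callable, Optional, Tuple

Action = Tuple[str, str]  # (output unit, content); a hypothetical representation

STOP_PROMPT = "Please stop the car to view the message."  # assumed wording


def next_action(text: str,
                passenger_on_board: bool,
                read_speed_sensor: Callable[[], bool]) -> Optional[Action]:
    """Decide what to output when only the driver is on board.

    `read_speed_sensor` returns True while the automobile is running.
    """
    if passenger_on_board:
        # The running determination is not executed when an occupant other
        # than the driver is on board, so the sensor is never read here.
        return None
    if read_speed_sensor():
        # Running: prompt the driver by voice to stop the automobile.
        return ("voice", STOP_PROMPT)
    # Stopped: display an image including the text information.
    return ("image", text)
```

Passing the sensor as a callback makes the conditional execution explicit: the sensor is read only on the driver-only path.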

9. The voice dialogue method according to claim 7, wherein the automobile further includes a positioning device that acquires current position information of the automobile, and
  the voice dialogue method further comprises:
  a procedure for determining whether or not the automobile can be stopped based on the current position information of the automobile acquired by the positioning device;
  a procedure for outputting, from the voice output unit, a message prompting the driver to stop the automobile when it is determined that the automobile can be stopped; and
  a procedure for outputting, from the voice output unit, a voice reading out the text information when it is determined that the automobile cannot be stopped.

10. The voice dialogue method according to claim 9, wherein the procedure for determining whether or not the automobile can be stopped includes a procedure for determining that the automobile cannot be stopped when it is determined, based on the current position information, that the automobile is traveling on an expressway.
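
The position-based decision of the two claims above can be sketched as follows. The sketch assumes that a map-matching step, which is not shown, has already converted the current position information into a road type string; the `road_type` values, the function names, and the wording of `STOP_PROMPT` are hypothetical. Treating every road type other than an expressway as one where the automobile can be stopped is also an assumption, since the claims only state the expressway case.

```python
from typing import Tuple

STOP_PROMPT = "Please stop the car to view the message."  # assumed wording


def can_stop(road_type: str) -> bool:
    """Return False when the automobile is traveling on an expressway."""
    # Only the expressway case comes from the claims; other values are assumed
    # to allow stopping.
    return road_type != "expressway"


def action_while_running(text: str, road_type: str) -> Tuple[str, str]:
    """Choose the voice output while the automobile is running."""
    if can_stop(road_type):
        # Stopping is possible: prompt the driver to stop the automobile.
        return ("voice", STOP_PROMPT)
    # Stopping is not possible: read the text information aloud instead.
    return ("voice", text)
```

Both branches use the voice output unit; only the content differs, so the driver is never asked to stop where stopping is not possible.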