JP2006317556A - Voice dialog apparatus

Info

Publication number: JP2006317556A
Application number: JP2005137803A
Authority: JP
Inventors: Takayuki Yamaguchi; 隆幸山口
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2005-05-10
Filing date: 2005-05-10
Publication date: 2006-11-24
Anticipated expiration: 2025-05-10
Also published as: JP4765394B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice dialog apparatus capable of effectively guiding a user's utterance volume to a desired level.

SOLUTION: A voice dialog apparatus that exchanges information with a user by voice comprises: a voice input means to which the user's utterance is input; an utterance volume calculation means for calculating the user's utterance volume based on utterance data input to the voice input means; a music output means for outputting music; and an output control means for controlling the music output means. To guide the user's utterance volume to the desired level, the output control means starts music output by the music output means, or maintains the music output state, and adjusts the volume of the music output by the music output means according to the calculated utterance volume.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a voice interaction device that exchanges information with a user by voice.

Conventionally, a voice interaction device is known that includes: a voice input unit that inputs voice; a voice recognition unit that recognizes the input voice; a control unit that changes at least the content displayed on a display unit based on the recognized voice; a speed detection unit that detects the speed of the vehicle; a gain changing unit that changes the gain of the voice input from the voice input unit based on the vehicle speed; and a notification means that, when the gain changing unit changes the gain of the input voice, gives guidance prompting the user to raise the utterance volume (see, for example, Patent Document 1).
Patent Document 1: JP 2003-344088 A

In this type of voice interaction device, when the user's utterance volume is insufficient, for example, the device may, as in the conventional technique described above, output a voice message such as "Please speak again in a louder (or quieter) voice" to ask the speaker to repeat the utterance, so as to secure the voice level (or S/N ratio) needed to achieve the intended recognition accuracy.

However, this way of requesting a re-utterance is not effective for users who are reluctant to speak in a loud voice, so it is difficult to guide such a user's utterance volume up to the desired level and resolve the shortfall.

Accordingly, an object of the present invention is to provide a voice interaction device that can effectively guide the user's utterance volume to a desired level while taking this user psychology into account.

To solve the above problem, one aspect of the present invention provides a voice interaction device that exchanges information with a user by voice, comprising:
voice input means to which an utterance from the user is input;
utterance volume calculation means for calculating the user's utterance volume based on utterance data input to the voice input means;
music output means for outputting music; and
output control means for controlling the music output means,
wherein, to guide the user's utterance volume to a desired level, the output control means starts music output by the music output means or maintains the music output state, and adjusts the volume of the music output by the music output means according to the calculated utterance volume.

Another aspect of the present invention provides a voice interaction device that exchanges information with a user by voice, comprising:
voice input means to which an utterance from the user is input;
utterance volume calculation means for calculating the user's utterance volume based on utterance data input to the voice input means;
music output means for outputting music;
response sound output means for outputting a response sound for voice dialogue with the user; and
output control means for controlling the music output means and the response sound output means,
wherein, to guide the user's utterance volume to a desired level, the output control means starts music output by the music output means or maintains the music output state, and adjusts the volume of the response sound output by the response sound output means according to the calculated utterance volume.

In each of the above aspects, when the user's utterance volume is lower than a predetermined volume, the volume of the music or response sound output by the music output means or response sound output means may be increased. When the user's utterance volume is higher than a predetermined volume, the volume of the music or response sound output by the music output means or response sound output means may be reduced. The output control means may adjust the volume of the music or response sound while the user is speaking. When the device is used in a vehicle cabin, the music output means may output, as the music, music recorded on a music CD, DVD, or the like that the user has loaded into the in-vehicle audio device, or predetermined music prepared in advance, through a speaker in the cabin.

According to the present invention, a voice interaction device that can effectively guide the user's utterance volume to a desired level can be obtained.

The best mode for carrying out the present invention is described below with reference to the drawings.

FIG. 1 is a system configuration diagram showing an embodiment of a voice interaction system incorporating a voice interaction device according to the present invention. The voice interaction device 10 includes a dialogue control ECU 20, an in-vehicle microphone 30 that picks up sound (voice) in the vehicle cabin, an amplifier 42, and a speaker 40. The amplifier 42 and the speaker 40 may be shared with the in-vehicle audio system.

The in-vehicle audio system is connected to the dialogue control ECU 20 via an appropriate bus such as CAN (controller area network). As described in detail later, the dialogue control ECU 20 controls, as necessary, the volume of the music that the in-vehicle audio system outputs through the speaker 40.

As its basic configuration, the dialogue control ECU 20 includes a CPU, a memory, and an A/D (analog-to-digital) converter connected via a bus. The memory stores the programs and data that implement the functions of the voice interaction device 10 described below.

Analog sound input to the in-vehicle microphone 30 undergoes predetermined processing such as amplification and noise removal in a microphone amplifier, is converted into a digital audio signal by the A/D converter, and is sent to the dialogue control ECU 20. The dialogue control ECU 20 extracts feature values from the audio signal and then obtains a recognition result by matching against given acoustic and language models. As necessary, the dialogue control ECU 20 has the amplifier 42 amplify a response sound (system voice) corresponding to the recognition result to a predetermined level and outputs it into the vehicle cabin through the speaker 40.

The present invention is not limited to any particular speech recognition method and is applicable to speech recognition processing using any software (speech recognition engine) on any hardware configuration.

FIG. 2 is a flowchart showing the main processing implemented by the voice interaction system of this embodiment.

In step 100, the dialogue control ECU 20 detects the stage just before a voice dialogue with the user begins and outputs a predetermined system voice (guidance voice). In a system where the voice dialogue starts when the user turns on a talk switch, this pre-dialogue stage may be the moment the talk switch is turned on; it may also be the moment the system needs to ask the user something. As an example of the latter, when the navigation system asks the user for a destination, the system voice "Where is your destination?" is output through the speaker 40.

When the pre-dialogue stage is detected in step 100, the dialogue control ECU 20 outputs, in step 110, system music (as opposed to system voice) at a predetermined volume L0 through the speaker 40. Material suitable as background music (BGM) is selected for the system music and prepared in advance in the memory of the dialogue control ECU 20. The user can thereby recognize by ear that the voice interaction device 10 is now waiting for an utterance, and because the system music plays in the background, the user can speak in a relaxed state of mind (as a result, an utterance volume that is easy to hear and recognize can be expected). The system music may be downloadable by the user, and the user may be able to select it from several types.

Regarding step 110, if the user is already listening to music on the in-vehicle audio system when the pre-dialogue stage is detected, the system music may be output in place of the music CD, DVD, or other playback by the in-vehicle audio system (that is, playback switches to the system music), or the system music may be withheld and the playback by the in-vehicle audio system continued. In the latter case, the volume of the music output from the in-vehicle audio system may be left unchanged, or it may be lowered or raised to a level equivalent to the initial output volume L0 of the system music. Hereinafter, music played by the in-vehicle audio system is called "audio music" to distinguish it from the system music.

While the system music or audio music is being output (that is, while the device is waiting for an utterance from the user), when the user speaks, for example by answering "Toyotashi Station" (step 120), the utterance volume is calculated and it is determined whether the utterance volume is at an appropriate level (step 130). At this point, the playback signal of the system music or audio music is subtracted from the analog sound input to the in-vehicle microphone 30 (or from the digital audio signal produced by the A/D converter); in other words, the playback signal is canceled by superimposing its inverted phase. This removes the system music or audio music component from the sound input to the in-vehicle microphone 30. The removal may be implemented with the same technique as a barge-in (interrupting utterance) function, which detects a user's utterance while system voice is being output.
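The subtraction of the playback signal can be sketched as follows. This is a minimal illustration only, assuming that the microphone signal and the playback reference are already time-aligned and sampled at the same rate, and that the acoustic path from speaker to microphone is a single known gain (`path_gain`, a hypothetical parameter). A practical system would normally estimate that path adaptively. The function and variable names are illustrative and do not come from the patent.

```python
def remove_playback(mic, reference, path_gain=1.0):
    """Subtract the known playback signal, scaled by an assumed path gain,
    from the microphone signal so that only the user's speech remains."""
    if len(mic) != len(reference):
        raise ValueError("signals must have the same length")
    return [m - path_gain * r for m, r in zip(mic, reference)]


# Toy example: the microphone picks up speech plus half-amplitude music.
speech = [0.0, 0.25, -0.25, 0.5]
music = [0.5, -0.5, 0.5, -0.5]
mic = [s + 0.5 * m for s, m in zip(speech, music)]
cleaned = remove_playback(mic, music, path_gain=0.5)
```

In this toy case `cleaned` equals `speech` because the assumed gain matches the mixing gain exactly; with a mismatched gain, some music would remain in the result.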

Next, the utterance section is detected from the audio signal from which the system music or audio music component has been removed. The utterance section is the section containing the speaker's voice, that is, the speech to be recognized; in the example above, it is the section corresponding to the utterance "Toyotashi". The utterance section may be detected by any method; for example, the method disclosed in JP 2004-271607 A may be used. In that case, the utterance section is detected from an audio signal in which noise has been removed or reduced. This works because, once a filter removes or reduces the noise, the amplitude of the non-speech portions becomes extremely small and only the amplitude of the speech portions remains, so the speech portions are emphasized relative to the non-speech portions.
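Since the text allows any detection method, one simple possibility is a frame-energy threshold, sketched below. This is not the method of the cited publication; it is a generic illustration, and the frame length and threshold are assumed tuning values.

```python
def detect_utterance(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of the utterance section, or None.

    A frame counts as speech when its mean power exceeds `threshold`
    (an assumed tuning value, not taken from the patent)."""
    speech_frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power > threshold:
            speech_frames.append(start)
    if not speech_frames:
        return None
    # The section spans from the first speech frame to the end of the last one.
    return speech_frames[0], speech_frames[-1] + frame_len


# Silence, then two frames of speech, then silence again.
signal = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] * 2 + [0.0] * 4
section = detect_utterance(signal)
```

The sketch relies on the property described above: after noise reduction, non-speech frames have very low power, so a fixed threshold separates them from speech frames.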

In step 130, it may be determined whether the voice level S [dB] within the utterance section has reached a predetermined volume. The voice level S may be calculated as the average volume over the entire utterance section. It may be calculated using any appropriate parameter representing the loudness or intensity of sound, such as power or sound pressure, and an integrated value, maximum value, minimum value, or the like may be used instead of the average.
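As one way to picture the check in step 130, the sketch below computes S as the mean power of the utterance section in dB relative to a full-scale amplitude of 1.0 and compares it with a required level. The reference level, the threshold `min_level_db`, and the function names are assumptions made for illustration.

```python
import math


def level_db(samples, floor=1e-12):
    """Mean power of the samples in dB relative to full scale (amplitude 1.0)."""
    if not samples:
        raise ValueError("no samples")
    power = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(power, floor))


def volume_is_adequate(utterance, min_level_db=-20.0):
    """Step 130 check: has the average level S reached the required level?"""
    return level_db(utterance) >= min_level_db


quiet = [0.01, -0.01] * 4  # mean power 1e-4, about -40 dB
loud = [0.5, -0.5] * 4     # mean power 0.25, about -6 dB
```

Replacing the mean with a maximum or an integrated value, as the text permits, would only change the body of `level_db`.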

Alternatively, in step 130, it may be determined whether the ratio of the voice level S [dB] within the utterance section to the noise level N [dB], that is, the S/N ratio, satisfies a predetermined criterion. The noise level N may be calculated as the average volume outside the utterance section (that is, in the non-speech portions). Like the voice level S, the noise level N may be calculated using any appropriate parameter representing the loudness or intensity of sound, such as power or sound pressure, and an integrated value, maximum value, minimum value, or the like may be used instead of the average.
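The S/N variant can be sketched in the same style: S is taken from inside the utterance section and N from outside it, both as mean power in dB, and the ratio is their difference. The section boundaries are assumed to be known already, and all names are illustrative.

```python
import math


def mean_power_db(samples, floor=1e-12):
    """Mean power of the samples in dB relative to full scale."""
    if not samples:
        raise ValueError("no samples")
    power = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(power, floor))


def snr_db(signal, section):
    """S/N ratio in dB: S from inside the utterance section, N from outside it."""
    start, end = section
    speech = signal[start:end]
    noise = signal[:start] + signal[end:]
    return mean_power_db(speech) - mean_power_db(noise)


# Low-level noise surrounding an utterance with ten times the amplitude.
signal = [0.01, -0.01] * 4 + [0.1, -0.1] * 4 + [0.01, -0.01] * 2
ratio = snr_db(signal, (8, 16))
```

Here the speech amplitude is ten times the noise amplitude, so the ratio comes out at about 20 dB; the predetermined criterion would then be a minimum value for `ratio`.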

If it is determined in step 130 that the user's utterance volume has reached an appropriate level, speech recognition processing is performed (step 140) and the speech recognition result is output (step 160). At this time, the dialogue control ECU 20 may, as necessary, output the system voice "Toyotashi Station, is that right?" through the speaker 40 to confirm the recognition. In that case the voice dialogue continues: when the user replies, for example by answering "Yes", processing resumes from step 120 for the utterance in that reply. The output of system music started in step 110 continues until the series of dialogues with the user ends. For example, once the dialogue control ECU 20 has confirmed in the above example that the user's destination is Toyotashi Station and the series of dialogues has ended, it stops the output of the system music (audio music output may be left as it is) and has the navigation system execute route guidance to Toyotashi Station.

On the other hand, if it is determined in step 130 that the user's utterance volume has not reached an appropriate level, processing proceeds to step 160.

In step 160, a system voice (guidance voice) requesting a re-utterance is output, and the volume of the system music or audio music is controlled so as to guide the user's utterance volume to the desired level. Specifically, the dialogue control ECU 20 increases the volume of the system music or audio music. The volume may be raised in one step to a target volume L_target corresponding to the appropriate level (a fixed value, or a variable value depending on the noise level N), or it may be raised toward L_target in stages, one stage each time step 160 is passed. In either case, the volume increase may be completed before the user's next utterance, or the volume may be raised gradually while the user is speaking. The volume increase of the system music or audio music is achieved by controlling the gain of the amplifier 42.
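The two ways of reaching the target volume, in one jump or in stages, can be sketched with a single helper. The volume values, the step size, and the function name are hypothetical; the text does not specify units or magnitudes.

```python
def next_music_volume(current, target, step=None):
    """Return the music volume after one pass through step 160.

    step=None jumps straight to the target; otherwise the volume rises
    by at most `step` per pass and never exceeds the target."""
    if current >= target:
        return current
    if step is None:
        return target
    return min(current + step, target)


volume = 50.0   # assumed initial volume L0
history = []
for _ in range(4):  # four consecutive re-utterance requests
    volume = next_music_volume(volume, target=60.0, step=4.0)
    history.append(volume)
```

With these assumed numbers the staged variant passes through 54, 58, and then holds at 60, while calling the helper without `step` goes to 60 immediately. The returned value would then be applied as the amplifier gain setting.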

Raising the background sound (system music or audio music) around the speaking user in this way guides the user to speak loudly enough not to be drowned out by the music. This relies on the Lombard effect, the tendency of people to speak louder when their surroundings are noisy. Even when the system music or audio music is made louder, its component is removed during speech recognition processing as described above, so the user's speech signal is not masked by the louder music signal and rendered unrecognizable.

In step 160, from the same standpoint, the volume of the system voice requesting the re-utterance may be increased as well. This system voice may be a generic message such as "Please speak again in a louder voice", but when there is a recognition candidate with a reasonable degree of match, it may instead be a question such as "Is it Toyohashi Station?" The volume increase of the system voice is achieved by controlling the gain of the amplifier 42.

When a re-utterance is requested in this way and the user speaks again, the processing from step 120 onward is repeated.
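The loop formed by steps 110 to 160 can be summarized in a small simulation. This sketch takes a list of already-measured utterance levels in place of real audio, and every numeric value in it is an assumption; it shows only the control flow described in the text.

```python
def run_dialogue(utterance_levels_db, min_level_db=-20.0,
                 start_volume=50.0, target_volume=60.0, step=5.0):
    """Simulate the described flow for a list of measured utterance levels.

    Returns a log of the actions taken. All numeric values are assumptions."""
    log = [("play_music", start_volume)]            # step 110
    volume = start_volume
    for level in utterance_levels_db:               # step 120: utterance arrives
        if level >= min_level_db:                   # step 130: level adequate?
            log.append(("recognize", level))        # step 140
            break
        volume = min(volume + step, target_volume)  # step 160: raise the music
        log.append(("ask_again", volume))           # step 160: request re-utterance
    return log


log = run_dialogue([-35.0, -28.0, -15.0])
```

In this run the first two utterances are too quiet, so the music is raised twice and a re-utterance is requested each time; the third utterance passes the check and is recognized.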

Thus, according to this embodiment, system music or audio music is played in the environment where the utterance for speech recognition is made (the vehicle cabin), and when the user's utterance volume is insufficient, the music volume is increased. This creates an environment in which even a user who is reluctant to speak loudly can naturally do so. The user's utterance volume can thereby be effectively guided to the desired level, speech recognition accuracy improves, and voice dialogue between the voice interaction device and the user (and the accompanying exchange of information) can be achieved reliably.

Likewise, for the system's response sound (for example, the system voice requesting a re-utterance), increasing the volume when the user's utterance volume is insufficient makes it possible, through the Lombard effect, to effectively guide even a user who is reluctant to speak loudly to speak in a louder voice.

In this embodiment, in contrast to ordinary systems, background sound such as audio music is not temporarily lowered for the sake of recognition accuracy while the device waits for the user's utterance. However, the background sound may be temporarily lowered to let the user recognize that the device is waiting for an utterance. Even in that case, it is desirable to raise the background sound to an appropriate level, as described above, before the user speaks.

In this embodiment, the volume of the system music or audio music and/or the system voice is controlled so as to guide the user's utterance volume up to the desired level when it is insufficient. Conversely, when the user's utterance volume is too high, the volume of the system music or audio music and/or the system voice may be lowered so as to guide the user's utterance volume down to the desired level.
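Combining both directions gives a simple rule with a lower and an upper bound on the utterance level. The sketch below is one possible reading; the bounds, step size, and volume limits are all hypothetical values chosen for illustration.

```python
def adjust_music_volume(volume, utterance_db, low_db=-20.0, high_db=-6.0,
                        step=3.0, min_volume=40.0, max_volume=70.0):
    """Raise the music when the user is too quiet, lower it when the user is
    too loud, and leave it unchanged when the utterance level is in range."""
    if utterance_db < low_db:
        return min(volume + step, max_volume)
    if utterance_db > high_db:
        return max(volume - step, min_volume)
    return volume


raised = adjust_music_volume(50.0, utterance_db=-30.0)
lowered = adjust_music_volume(50.0, utterance_db=-3.0)
unchanged = adjust_music_volume(50.0, utterance_db=-10.0)
```

The same rule could be applied to the system voice volume instead of, or in addition to, the music volume.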

Also in this embodiment, the dialogue control ECU 20 may output predetermined system music through the speaker 40 instead of, or in addition to, showing on a display that speech input through the in-vehicle microphone 30 is currently being recognized ("speech recognition in progress"). The user then knows from hearing that system music that recognition is in progress and no longer needs to look at the display to confirm it. Here too, the volume of the system music that signals "speech recognition in progress" may be controlled so as to guide the user's utterance volume to the desired level. From the same standpoint, different system music may be output through the speaker 40 depending on the state of the voice interaction device 10, such as waiting for voice input or performing recognition processing.

Preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to these embodiments, and various modifications and substitutions can be made to them without departing from the scope of the present invention.

For example, although the embodiments described above concern a dialogue application in a vehicle cabin, the present invention is not limited to this and is also applicable to dialogue applications in other environments.

FIG. 1 is a system configuration diagram showing an embodiment of a voice interaction system incorporating a voice interaction device according to the present invention.
FIG. 2 is a flowchart showing the main processing implemented by the voice interaction system of Embodiment 1.

Explanation of symbols

10 Voice interaction device
20 Dialogue control ECU
30 In-vehicle microphone
40 Speaker
42 Amplifier

Claims

A voice interactive apparatus that exchanges information with a user by voice, comprising:
voice input means to which an utterance from the user is input;
utterance volume calculation means for calculating the user's utterance volume based on utterance data input to the voice input means;
music output means for outputting music; and
output control means for controlling the music output means,
wherein, to guide the user's utterance volume to a desired level, the output control means starts music output by the music output means or maintains the music output state, and adjusts the volume of the music output by the music output means according to the calculated utterance volume.

A voice interactive apparatus that exchanges information with a user by voice, comprising:
voice input means to which an utterance from the user is input;
utterance volume calculation means for calculating the user's utterance volume based on utterance data input to the voice input means;
music output means for outputting music;
response sound output means for outputting a response sound for voice dialogue with the user; and
output control means for controlling the music output means and the response sound output means,
wherein, to guide the user's utterance volume to a desired level, the output control means starts music output by the music output means or maintains the music output state, and adjusts the volume of the response sound output by the response sound output means according to the calculated utterance volume.

The voice interactive apparatus according to claim 1 or 2, wherein when the user's utterance volume is lower than a predetermined volume, the volume of the music or response sound output by the music output means or response sound output means is increased.

The voice interactive apparatus according to claim 1 or 2, wherein when the user's utterance volume is larger than a predetermined volume, the volume of the music or response sound output by the music output means or response sound output means is reduced.

The voice interactive apparatus according to claim 3 or 4, wherein the volume of music or response sound by the output control means is adjusted during a user's speech.

The voice interactive apparatus according to any one of claims 1 to 5, used in a vehicle cabin, wherein
the music output means outputs, as the music, music recorded on a music CD, DVD, or the like that the user has loaded into the in-vehicle audio device, or predetermined music prepared in advance, through a speaker in the vehicle cabin.