JP2006215206A - Speech processor and control method therefor

Info

Publication number: JP2006215206A
Application number: JP2005026878A
Authority: JP
Inventors: Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-02-02
Filing date: 2005-02-02
Publication date: 2006-08-17

Abstract

PROBLEM TO BE SOLVED: To provide a technique, for equipment with a speech input and/or speech output function, for properly controlling the output of a masking signal so that the user's input speech and the equipment's output speech cannot be heard by a third party and do not cause the third party discomfort.

SOLUTION: At least one of the position of the third party and the noise at that position is measured, and a masking signal that masks, with respect to the third party, the user's speech input to an input means is determined according to the measurement result. The output of the determined masking signal is then controlled (S503, S504) according to the operating state of the input means (S501, S502).

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech processing apparatus that takes protection of user privacy into consideration, and to a control method therefor.

Voice recognition technology has so far been developed with a focus on improving recognition performance, and it is already in practical use in car navigation systems, interactive voice response (IVR) systems, and the like. On the other hand, it has hardly penetrated personal computers, office equipment such as copiers and facsimile machines, or home appliances such as televisions and telephones. One likely reason is that these devices can be operated and configured without voice, using other input means such as keyboards, remote controls, and buttons. However, these devices become more multifunctional and more integrated every year, and input and setting are expected to become even more complicated, particularly for devices other than personal computers, which generally have no keyboard. In other words, increasingly intuitive and easy-to-understand user interfaces will be required. Voice is a promising candidate, either as a user interface that meets this requirement on its own or as one modality in a multimodal user interface combined with existing interfaces. A voice user interface is intuitive and, if the speech is recognized correctly, allows faster input and setting than existing input means.

However, even if voice recognition performance close to 100% were achieved, one factor that may hinder the spread of voice user interfaces is that, when third parties are nearby, users feel embarrassed to have their speech to the device overheard, or do not want its content to be heard. In addition, particularly in settings such as offices, the voice directed at a device is verbal, so third parties near the user may be distracted by the user's utterances and unable to concentrate on their work.

This situation can arise not only with the voice uttered by the user but also with the voice output from the device. It is convenient to be able to confirm by voice output the settings entered with a remote control or buttons, but the user may not want a third party to hear that content, and the third party is distracted by the linguistic information in the same way as by the user's utterances.

Thus, a device with a voice user interface may need to provide a mechanism that keeps the user's input voice and the device's output voice from being heard by nearby third parties. In principle, voice could be kept from leaking by generating a waveform in antiphase to the voice input or voice output so that the two cancel. However, it is extremely difficult to cancel the voice at multiple points in arbitrary directions and at arbitrary distances, so this method cannot be used in practice. The problem must therefore be addressed in one of three ways: (1) by means of equipment, (2) by placing a burden on the user, or (3) by placing a burden on the third party.

As an equipment-based measure (1), a soundproof wall could, for example, be installed around the device, but providing such equipment in an office or home is impractical in terms of space and cost. As methods (2) that place a burden on the user, the user could whisper the voice input, or use earphones or a bone conduction speaker when checking the device output; however, imposing these burdens on the user undermines the advantages of a voice interface. As a method (3) that places a burden on the third party, a separate masking signal can be output from a speaker during voice input or voice output so that the third party cannot hear the user's utterance or the device's voice output. This method is far cheaper than method (1), and, if masking is performed appropriately, the user can speak normally, so no burden falls on the user as in (2). Even with this method, however, problems arise if the output of the masking signal cannot be controlled appropriately: the user's utterance or the device's voice output may be heard by the third party (the masking signal level is too low, or its duration is too short), or the masking signal may be more unpleasant to the third party than the overheard speech would have been (the masking signal level is too high, or its duration is too long).

As a method of generating a masking signal for a user's utterance, Japanese Patent Laid-Open No. 9-305196 (Patent Document 1) discloses a method of emitting a musical tone signal, composed mainly of the formant frequency components contained in the vowels of speech, at an average power greater than the average level of the speaker's voice.

Patent Document 1: JP-A-9-305196

However, the technique disclosed in Patent Document 1 has the following problems.

First, in Patent Document 1, the musical tone signal emitted as the masking signal is controlled so that its average power is greater than the average level of the speaker's voice. However, the average level of the voice observed at the microphone varies with the microphone gain, and the average power of the masking signal emitted from the speaker varies with the speaker gain. That is, the relative magnitudes of the average voice level captured by the microphone and the average power of the masking signal emitted from the speaker do not necessarily correspond to the relative magnitudes of the signal levels that a third party hears. This control method therefore gives no guarantee that masking is performed appropriately.

Patent Document 1 also assumes that the third party is located behind the user, and therefore provides only one speaker for outputting the masking signal. In an actual office or home, however, it is difficult to assume the location of a third party in advance, so a single speaker is not necessarily sufficient.

Patent Document 1 also gives no consideration to the ambient noise level, so the masking signal is output uniformly whether the environment is noisy or quiet. As a result, when many similar devices are nearby or the ambient noise level is high, the surroundings become very noisy, and third parties suffer considerable discomfort even if the user's input voice is masked.

Furthermore, regarding the timing of masking signal output, Patent Document 1 explains only that the musical tone is output when the user stands at a predetermined position in front of the voice input device; it does not describe when the output ends. In addition, with this method the musical tone is output continuously regardless of whether voice is being input, so the masking signal is output unnecessarily. Nor does the document suggest changing the masking signal level according to the content of the voice input.

In addition, Patent Document 1 makes no mention at all of masking for voice output.

The present invention has been made to solve at least one of the problems described above. Its object is to provide a technique for a device having a voice input and/or voice output function that appropriately controls the output of a masking signal so that the user's input voice and the device's output voice are not heard by nearby third parties and do not cause them discomfort.

To solve the above problems, a speech processing apparatus of the present invention has, for example, the following configuration. That is, it is a speech processing apparatus that processes speech information received from input means for inputting speech uttered by a user, and comprises: measurement means for measuring the surrounding environment; determination means for determining, based on the measurement result of the measurement means, a masking signal for masking, with respect to a third party, the user's speech input to the input means; and control means for controlling, based on the operating state of the input means, the output of the masking signal determined by the determination means.

According to the present invention, the output of the masking signal is appropriately controlled so that the user's input voice and the device's output voice are not heard by nearby third parties and do not cause them discomfort.

Preferred embodiments of the present invention will now be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the hardware configuration of a speech processing apparatus to which the present invention is applied.

Reference numeral 101 denotes a CPU that controls the entire apparatus, 102 denotes a ROM that stores various parameters, a boot program, and the like, and 103 denotes a RAM that provides a work area for the CPU 101 and functions as the main storage device. Reference numeral 104 denotes an external storage device such as a hard disk, CD-ROM, DVD-ROM, or memory card. It can hold a recording/playback program 111 for recording and playback, a voice recognition program 112 for voice recognition processing, a speech synthesis program 112 for speech synthesis, a masking signal 114, and a control program 110 for overall handling of these programs and data. When the external storage device 104 is a hard disk, it stores the various programs installed from a CD-ROM or the like. The programs stored in the external storage device 104 are loaded into the RAM 103 and executed by the CPU 101. Alternatively, these programs and data may be stored in the ROM 102 in advance instead of in the external storage device 104.

Reference numeral 105 denotes a microphone that captures the voice uttered by the user. During capture, the voice signal picked up by the microphone 105 is amplified by a microphone amplifier 105a. Reference numeral 106 denotes a voice output speaker, which outputs recorded voice, synthesized voice, and the like to the user via a first speaker amplifier 106a. Reference numeral 107 denotes a masking signal output speaker, which outputs a masking signal via a second speaker amplifier 107a to third parties around the voice input/output apparatus.

Reference numeral 108 denotes auxiliary input/output devices such as a display, buttons, a numeric keypad, a touch panel, a mouse, a keyboard, a microphone, a video camera, and an infrared sensor. These are used for pressing a button to start voice capture, collecting the ambient noise level, imaging third parties around the apparatus with the video camera or sensing them with the infrared sensor, displaying menus on the display, and so on.

Reference numeral 109 denotes a bus that connects the above units.

In the following, the first embodiment describes the case where the above speech processing apparatus functions as a voice input device. The second embodiment describes the case where it functions as a voice output device, and the third embodiment describes a modification of that case. The fourth embodiment describes the case where the voice input function and the voice output function cooperate so that the apparatus functions as a voice dialogue device. The following description uses a microphone as the user's voice input device and a speaker as the voice output device, but the present invention is not limited to these; for example, a handset with voice input and output functions may be used.

(First embodiment)
The first embodiment describes the case where the speech processing apparatus functions as a voice input device. The voice input function realized here is used, for example, for recording or for voice recognition.

FIG. 2 is a block diagram showing the functional configuration of the voice input device according to this embodiment.

Reference numeral 205 denotes a voice input unit that inputs the user's voice. Reference numeral 203 denotes a voice input status management unit that manages the operating state of the voice input unit 205; it monitors events issued in response to triggers such as operations related to voice input. Reference numeral 201 denotes a surrounding environment measurement unit that measures the noise level around the voice input device 105, the presence or absence of third parties around the voice input/output apparatus, and the positions of those third parties. Reference numeral 202 denotes a masking signal determination unit that, based on information from the voice input status management unit 203 and the surrounding environment measurement unit 201, determines a masking signal for masking the input voice with respect to third parties. Reference numeral 204 denotes a masking signal control unit that controls the output of the masking signal determined by the masking signal determination unit 202.

Next, an example of voice input processing in this embodiment will be described using the flowcharts of FIGS. 3 to 5. In this embodiment, the processing of the voice input unit 205 and the masking signal control unit 204 is described as event-driven processing. The relevant events are a "voice input start" event, which indicates that voice capture has started, and a "voice input end" event, which indicates that voice capture has ended.

FIG. 3 is a flowchart showing the processing flow of the surrounding environment measurement unit 201 and the masking signal determination unit 202.

First, in step S301, noise around the apparatus is captured using the microphone 105. Instead of the microphone 105, a microphone provided as an auxiliary input device 108 for measuring the surrounding environment may be used to capture the noise. As for timing, the system may capture the noise automatically at predetermined intervals, or in response to some event related to user operation. Alternatively, the noise may be captured in response to an operation instruction provided specifically for that purpose.

Next, in step S302, the level of the masking signal is determined. Various methods are conceivable; in one of the most suitable, the amplitude of the masking signal $x_M(t)$ is changed according to the following equation, using the average logarithmic power $P_E$ of the ambient noise signal $x_E(t)$ captured in step S301.

$$x'_M(t) = f(P_E) \cdot x_M(t) \tag{1}$$

Here, $f(\cdot)$ is a function that controls how much the masking signal $x_M(t)$ is amplified or attenuated according to the average logarithmic power $P_E$ of the ambient noise signal $x_E(t)$; any function may be used, for example the one shown in FIG. 6. The function $f$ shown in that figure makes the masking signal smaller when the average logarithmic power of the ambient noise signal is large and, conversely, makes it larger when the ambient noise signal is small. This function must be designed with the following taken into account: the average logarithmic power $P_M$ of the masking signal $x_M(t)$; the mounting position and direction of the microphone 105 and the microphone gain of the microphone amplifier 105a; the mounting position and direction of the masking signal output speaker 107 and the amplifier gain of the second speaker amplifier 107a; the directivity of the speaker (for example, directional or omnidirectional); the gain, mounting position, and direction of the microphone for measuring the surrounding environment; and the position and direction in which third parties are assumed to be present.
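The level control of equation (1) can be sketched as follows. This is only an illustration under stated assumptions: the text leaves f arbitrary and its shape in FIG. 6 is not reproduced here, so the piecewise-linear form, the dB thresholds, the gain limits, and the function names `average_log_power`, `masking_gain`, and `scale_masking_signal` are all hypothetical.

```python
import math

def average_log_power(samples):
    """Average logarithmic power of a sample block, in dB (assumed definition)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square + 1e-12)  # small floor avoids log(0)

def masking_gain(p_e_db, quiet_db=-60.0, loud_db=-20.0,
                 max_gain=1.0, min_gain=0.1):
    """Hypothetical f(P_E): large gain in quiet surroundings, small gain in noise."""
    if p_e_db <= quiet_db:
        return max_gain
    if p_e_db >= loud_db:
        return min_gain
    frac = (p_e_db - quiet_db) / (loud_db - quiet_db)
    return max_gain + frac * (min_gain - max_gain)

def scale_masking_signal(x_m, noise):
    """Apply x'_M(t) = f(P_E) * x_M(t) to one block of masking-signal samples."""
    gain = masking_gain(average_log_power(noise))
    return [gain * s for s in x_m]
```

In a real device the thresholds would have to reflect the microphone and amplifier gains and the speaker placement listed in the text, which this sketch ignores.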

Alternatively, when several masking signals are available, the masking signal to be output can be changed according to the surrounding environment. Specifically, it is possible to select the masking signal whose spectrum is closest to the spectrum obtained by subtracting the ambient noise spectrum from an average speech spectrum. Any masking signal may be used, but when the voice input device is used for voice recognition, a signal that is unlikely to degrade recognition performance is desirable, as is one that is pleasant for nearby third parties and the user.
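The selection rule described above can be sketched as a nearest-spectrum search. The text does not say how closeness is measured or how negative differences are handled, so the squared Euclidean distance, the clamping of the target spectrum at zero, and the name `select_masking_signal` are assumptions made for illustration.

```python
def select_masking_signal(avg_speech_spectrum, noise_spectrum, candidates):
    """Return the name of the candidate closest to (speech - noise).

    candidates maps a signal name to its magnitude spectrum; all spectra
    are lists over the same frequency bins.
    """
    # Target: the part of the average speech spectrum the noise does not cover.
    target = [max(s - n, 0.0)
              for s, n in zip(avg_speech_spectrum, noise_spectrum)]

    def distance(spectrum):
        # Squared Euclidean distance (an assumed closeness measure).
        return sum((a - b) ** 2 for a, b in zip(spectrum, target))

    return min(candidates, key=lambda name: distance(candidates[name]))
```

For example, if the ambient noise already covers the low-frequency bins, a candidate concentrated in the remaining bins is selected.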

FIG. 4 is a flowchart showing the processing flow of the voice input unit 205 and the voice input status management unit 203.

First, in step S401, the process waits for an event. When an event is detected, the process proceeds to step S402, where the type of the detected event is determined. If a "voice input start" event, indicating that voice capture has started, is detected, the process proceeds to step S403; if a "voice input end" event, indicating that voice capture has ended, is detected, the process proceeds to step S404.

In step S403, capture of the user's voice from the microphone 105 is started, and the process returns to step S401. Voice capture continues after the process has returned to step S401. In step S404, capture of the user's voice from the microphone 105 is ended, and the process terminates.

The "voice input start" and "voice input end" events may be given by the user, for example by pressing a button or by picking up or putting down a handset, or they may be generated by the apparatus at predetermined timings or as state-transition events in a voice interval detection method.

FIG. 5 is a flowchart showing the processing flow of the masking signal control unit 204.

Here, a masking signal is output from the masking signal output speaker 107 so that third parties cannot hear the voice that the user utters during the processing shown in FIG. 4. First, in step S501, the process waits for an event. When an event is detected, the process proceeds to step S502, where the type of the detected event is determined. If a "voice input start" event, indicating that voice capture has started, is detected, the process proceeds to step S503; if a "voice input end" event, indicating that voice capture has ended, is detected, the process proceeds to step S504.

In step S503, output of the masking signal determined in step S302 is started, and the process returns to step S501. Output of the masking signal continues after the process has returned to step S501. In step S504, output of the masking signal is ended, and the process terminates.
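The event-driven flow of steps S501 to S504 can be sketched as a small loop over an event queue. This is a minimal illustration, not the apparatus's actual software: the class name `MaskingController`, the event strings, and the `start_output` and `stop_output` callbacks (which stand in for driving the masking signal output speaker 107) are hypothetical.

```python
import queue

START = "voice input start"
END = "voice input end"

class MaskingController:
    """Starts and stops masking output in response to voice input events."""

    def __init__(self, events, start_output, stop_output):
        self.events = events              # queue.Queue of event strings
        self.start_output = start_output  # begins playing the masking signal
        self.stop_output = stop_output    # stops the masking signal

    def run(self):
        while True:
            event = self.events.get()     # S501: wait for an event
            if event == START:            # S502: check the event type
                self.start_output()       # S503: start output, keep waiting
            elif event == END:
                self.stop_output()        # S504: stop output and finish
                return
            # Any other event is ignored and the loop returns to S501.
```

The voice input flow of steps S401 to S404 has the same shape, with the callbacks starting and ending capture from the microphone instead.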

In the processing example above, the output level of the masking signal is determined in steps S301 and S302, before the user speaks. Because the user's utterance level is not known in advance, the masking signal level has to be determined from a predicted utterance level. If the user's actual utterance level is lower than predicted, the masking signal is output at an unnecessarily high level. Conversely, if the actual utterance level is higher than predicted, the masking signal is insufficient, and a third party may hear what the user says. To mitigate this problem, the masking signal level can be determined dynamically according to the input level of the user's utterance.

FIG. 7 is a flowchart showing a processing flow that dynamically determines the masking signal level using the voice input level.

Steps S701 to S704 are the same as steps S401 to S404, respectively, so their description is omitted. In step S705, after voice capture has started, the signal level of the captured voice is measured in predetermined time units and compared with the predicted utterance level of the user, and the masking signal level is changed adaptively. Specifically, let $P_S$ be the logarithmic power of the predicted utterance level of the user, and let $P'_S$ be the logarithmic power of the user's utterance level at time $t$ measured in step S705 (the logarithmic power of a predetermined short-time interval). Then $x'_M(t)$ obtained from equation (1) is changed according to the following equation.

$$x''_M(t) = g(P'_S / P_S) \cdot x'_M(t) \tag{2}$$

Here, $g(\cdot)$ is a function that gives the degree of amplification of the masking signal $x'_M(t)$ as a function of the ratio of the logarithmic powers; for example, a function such as that shown in FIG. 20 may be used. This function makes the masking signal larger when $P'_S / P_S > 1$, that is, when the logarithmic power measured in step S705 is greater than the predicted utterance level, and makes the masking signal smaller when it is less.
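The adaptive correction of equation (2) can be sketched as follows. The shape of g is not given in the text beyond "larger above a ratio of 1, smaller below", so the clamped identity used here, its limits, and the names `dynamic_gain` and `adapt_masking_block` are assumptions. The sketch also assumes the logarithmic powers are expressed on a positive scale (for example, dB above some floor), since a ratio of log powers is not meaningful otherwise.

```python
def dynamic_gain(p_measured, p_predicted, lo=0.5, hi=2.0):
    """Hypothetical g(P'_S / P_S): the ratio itself, clamped to [lo, hi]."""
    if p_predicted <= 0:
        raise ValueError("predicted log power must be positive in this sketch")
    ratio = p_measured / p_predicted
    return min(max(ratio, lo), hi)

def adapt_masking_block(x_prime_m, p_measured, p_predicted):
    """Apply x''_M(t) = g(P'_S / P_S) * x'_M(t) to one short-time block."""
    g = dynamic_gain(p_measured, p_predicted)
    return [g * s for s in x_prime_m]
```

Calling `adapt_masking_block` once per measurement interval in step S705 raises the masking level when the user speaks louder than predicted and lowers it when the user speaks more quietly; the clamp keeps a single noisy measurement from producing an extreme level.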

FIG. 8 is a diagram showing an example of the external configuration of the voice input device according to this embodiment.

Reference numeral 801 denotes the main body of the voice input device, which houses the hardware shown in FIG. 1. Reference numeral 803 denotes a masking signal output speaker for outputting a masking signal to third parties, and corresponds to the masking signal output speaker 107 of FIG. 1. When there is only one masking signal speaker, as in this example, or when the direction in which third parties are located is not known in advance, an omnidirectional speaker is desirable. Reference numeral 802 denotes a microphone used both for the user's voice input and for measuring the ambient noise environment, and corresponds to the microphone 105 of FIG. 1. So that voice input can be performed properly despite the signal output from the masking signal output speaker 803, the microphone 802 is preferably installed closer to the user than the masking signal output speaker 803.

FIG. 9 shows a modification of FIG. 8 that can output masking signals in predetermined directions.

Reference numeral 901 denotes the main body of the voice input device, which houses the hardware shown in FIG. 1. Reference numeral 902 denotes a microphone for the user's voice input, and corresponds to the microphone 105 of FIG. 1. Reference numerals 906, 907, and 908 denote masking signal output speakers for outputting masking signals to third parties, and correspond to the masking signal output speaker 107 of FIG. 1. Reference numerals 903, 904, and 905 denote microphones for measuring the noise level around the apparatus at three points. Because several microphones for measuring the surrounding environment and several speakers for outputting masking signals are provided, the level of the masking signal output from each of the speakers 906, 907, and 908 can be varied according to the individual noise levels, directions, and positions measured by the microphones 903, 904, and 905. Specifically, control is possible such that, for example, a larger masking signal is output in a direction where the noise level is low and a smaller masking signal is output in a direction where the noise level is high.
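The per-direction control described above can be sketched as a mapping from measured noise levels to speaker gains. The text does not state how the microphones 903 to 905 are paired with the speakers 906 to 908 or what level mapping is used, so the direction labels, the linear mapping, its thresholds, and the name `per_direction_gains` are hypothetical.

```python
def per_direction_gains(noise_levels_db, quiet_db=30.0, loud_db=70.0,
                        max_gain=1.0, min_gain=0.2):
    """Map each direction's measured noise level to a masking-speaker gain.

    noise_levels_db maps a direction label to the noise level measured by
    the microphone facing that direction. Quiet directions receive a larger
    masking signal and noisy directions a smaller one.
    """
    gains = {}
    for direction, level in noise_levels_db.items():
        # Position of this level between the quiet and loud thresholds, 0..1.
        frac = min(max((level - quiet_db) / (loud_db - quiet_db), 0.0), 1.0)
        gains[direction] = max_gain - frac * (max_gain - min_gain)
    return gains
```

Each resulting gain would scale the masking signal sent to the speaker facing the corresponding direction.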

When a plurality of microphones for measuring the surrounding environment and a plurality of masking signal speakers are provided, as in this example, it is desirable to use directional microphones and speakers. In addition, so that the signals output from the masking signal speakers 906, 907, and 908 do not interfere with voice input, the microphone 902 is preferably installed closer to the user than the masking signal output speakers 906, 907, and 908.

FIG. 10 is a flowchart showing an example of the processing flow of the surrounding environment measurement unit 201 and the masking signal determination unit 202.

In step S1001, at least one of the following is measured: whether a third party is present around the device, the direction of the third party, and the position of the third party. Various measurement methods are possible; an infrared sensor is one suitable choice. The positions and number of infrared sensors are set according to the required measurement accuracy and the expected number, directions, and positions of third parties. A video camera can be used instead of an infrared sensor. In that case, the surroundings of the device are captured by the video camera, and the presence, direction, and position of a third party can be measured either by performing person detection on the captured image or by using a difference image relative to an image captured when no third party is present. Any other method, such as a wireless IC tag or a contactless IC card, may be used as long as it can detect a third party near the device.
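The difference-image approach mentioned for step S1001 might be sketched as below. The function name `third_party_present`, the grayscale list-of-rows image format, and both thresholds are assumptions made for illustration; a practical system would more likely use a dedicated image-processing library.

```python
def third_party_present(frame, background,
                        pixel_threshold=30, min_changed_fraction=0.02):
    """Detect presence by comparing a frame with an empty-scene background.

    Both images are lists of rows of grayscale values (0-255) with the
    same dimensions. The thresholds are hypothetical tuning parameters.
    """
    changed = 0
    total = 0
    for frame_row, background_row in zip(frame, background):
        for pixel, reference in zip(frame_row, background_row):
            total += 1
            # Count pixels that differ noticeably from the background.
            if abs(pixel - reference) > pixel_threshold:
                changed += 1
    # Report presence when a large enough share of the image has changed.
    return total > 0 and changed / total >= min_changed_fraction


# A 10x10 empty background and a frame containing a bright 3x3 region.
background = [[0] * 10 for _ in range(10)]
frame = [row[:] for row in background]
for y in range(3, 6):
    for x in range(4, 7):
        frame[y][x] = 200

print(third_party_present(background, background))
print(third_party_present(frame, background))
```

The same changed-pixel map could also be used to estimate direction, for example from the horizontal position of the changed region, which this sketch does not attempt.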

Next, in step S1002, the level of the masking signal is determined according to the presence status of third parties. The presence status comprises whether a third party is present, the number of third parties, the distance from the device to a third party, and the direction of a third party relative to the device. When there are a plurality of masking signal output speakers, as shown in FIG. 9, the masking signal level is determined for each speaker according to the presence status of third parties in the direction that speaker faces. The masking signal can also be determined and controlled by considering both the noise conditions around the device and the presence status of third parties. As an example, suppose that a noise measurement microphone, an infrared sensor for measuring third-party position, and a masking signal output speaker are provided for each of four directions (A, B, C, and D), with the three components for a given direction all facing that direction. Suppose further that the noise level is high and no third party is present in direction A; the noise level is low and no third party is present in direction B; the noise level is high and a third party is present in direction C; and the noise level is low and a third party is present in direction D. In this case, the masking signals can be output appropriately by setting the levels of the masking signals output from the speakers so that A < B < C < D.
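One way to obtain the ordering A < B < C < D from the two measurements is sketched below. The names `DirectionState`, `masking_rank`, and `masking_level_db`, as well as the base level and step size, are hypothetical; the description specifies only the resulting order, not how the levels are computed.

```python
from dataclasses import dataclass


@dataclass
class DirectionState:
    noise_is_loud: bool        # result of the noise measurement microphone
    third_party_present: bool  # result of the infrared sensor


def masking_rank(state):
    """Rank a direction from 0 (least masking) to 3 (most masking).

    Third-party presence dominates; within the same presence status,
    a quiet direction needs more masking than a noisy one.
    """
    return 2 * int(state.third_party_present) + int(not state.noise_is_loud)


def masking_level_db(state, base_db=40.0, step_db=5.0):
    """Convert the rank into a level; base_db and step_db are assumptions."""
    return base_db + step_db * masking_rank(state)


# The situation described in the text for directions A, B, C, and D.
states = {
    "A": DirectionState(noise_is_loud=True, third_party_present=False),
    "B": DirectionState(noise_is_loud=False, third_party_present=False),
    "C": DirectionState(noise_is_loud=True, third_party_present=True),
    "D": DirectionState(noise_is_loud=False, third_party_present=True),
}
levels = {name: masking_level_db(state) for name, state in states.items()}
print(levels)
```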

As is clear from the above description, according to this embodiment, the voice input device can appropriately control the output of the masking signal so that the user's input voice is not overheard by nearby third parties and the third parties are not made uncomfortable.

(Second Embodiment)
The second embodiment describes a case in which the voice processing device of FIG. 1 functions as a voice output device. Voice output here includes not only playback of recorded speech but also output of speech synthesized by speech synthesis processing.

FIG. 11 is a block diagram showing the functional configuration of the voice output device according to this embodiment.

Reference numeral 1101 denotes a voice output unit that outputs voice. Reference numeral 1103 denotes a voice output status management unit that manages the operating state of the voice output unit 1101; it monitors events issued in response to triggers such as operations relating to voice output. Reference numeral 1102 denotes a masking signal determination unit that, based on information from the voice output status management unit 1103, determines a masking signal for masking the output voice from third parties. Reference numeral 1104 denotes a masking signal control unit that controls output of the masking signal determined by the masking signal determination unit 1102.

Next, an example of the voice output processing in this embodiment is described with reference to the flowcharts of FIGS. 12 and 13. In this embodiment, the processing of the voice output unit 1101 and the masking signal control unit 1104 is described as event-driven processing. The relevant events are a "voice output start" event, indicating that voice output has started, and a "voice output end" event, indicating that voice output has ended.

FIG. 12 is a flowchart showing the processing flow of the voice output unit 1101 and the voice output status management unit 1103.

First, in step S1201, the process waits for an event. When an event is detected, the process proceeds to step S1202, where the type of the detected event is determined. If a "voice output start" event is detected, the process proceeds to step S1203; if a "voice output end" event is detected, the process proceeds to step S1204.

In step S1203, voice output from the voice output speaker 106 is started, and the process returns to step S1201. Voice output continues after the process returns to step S1201. In step S1204, voice output from the voice output speaker 106 is stopped, and the process ends.

The "voice output start" and "voice output end" events may be given by the user, for example by pressing a button, or may be generated by the voice output device at predetermined timings.

FIG. 13 is a flowchart showing the processing flow of the masking signal control unit 1104.

Here, a masking signal that prevents the voice output in the processing of FIG. 12 from being overheard by third parties is output from the masking signal output speaker 107. First, in step S1301, the process waits for an event. When an event is detected, the process proceeds to step S1302, where the type of the detected event is determined. If a "voice output start" event is detected, the process proceeds to step S1303; if a "voice output end" event is detected, the process proceeds to step S1304.

In step S1303, output of a predetermined masking signal from the masking signal output speaker 107 is started, and the process returns to step S1301. Output of the masking signal continues after the process returns to step S1301. In step S1304, output of the masking signal is stopped, and the process ends.
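The event-driven flow of steps S1301 to S1304 can be sketched as a simple loop over an event queue. The `MaskingSpeaker` class is a hypothetical stand-in for the masking signal output speaker 107, and the use of a Python `queue.Queue` as the event source is an assumption for illustration only.

```python
import queue

VOICE_OUTPUT_START = "voice output start"
VOICE_OUTPUT_END = "voice output end"


class MaskingSpeaker:
    """Hypothetical stand-in for masking signal output speaker 107."""

    def __init__(self):
        self.playing = False
        self.history = []

    def start(self):
        self.playing = True
        self.history.append("start")

    def stop(self):
        self.playing = False
        self.history.append("stop")


def run_masking_control(events, speaker):
    """Process events until a "voice output end" event arrives."""
    while True:
        event = events.get()              # S1301: wait for an event
        if event == VOICE_OUTPUT_START:   # S1302: determine the event type
            speaker.start()               # S1303: start masking, keep waiting
        elif event == VOICE_OUTPUT_END:
            speaker.stop()                # S1304: stop masking and finish
            return


events = queue.Queue()
events.put(VOICE_OUTPUT_START)
events.put(VOICE_OUTPUT_END)
speaker = MaskingSpeaker()
run_masking_control(events, speaker)
print(speaker.history)
```

The voice output flow of FIG. 12 has the same shape, with the voice output speaker 106 started and stopped in place of the masking speaker.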

The level of the masking signal needs to be designed in consideration of factors such as the average logarithmic power P_M of the masking signal x_M(t); the amplifier gain of the first speaker amplifier 106a and the mounting position and direction of the voice output speaker 106; the amplifier gain of the second speaker amplifier 107a and the mounting position and direction of the masking signal output speaker 107; the directivity of the speakers (for example, directional or omnidirectional); and the positions and directions in which third parties are expected to be present.

Any masking signal may be used, but one that does not make the output voice hard for the user to hear and that is pleasant for nearby third parties and the user is desirable.

FIG. 14 is a diagram showing an example of the external configuration of the voice output device in this embodiment.

Reference numeral 1401 denotes the main body of the voice output device, which houses the hardware shown in FIG. 1. Reference numeral 1403 denotes a masking signal output speaker for outputting a masking signal toward third parties, and corresponds to the masking signal output speaker 107 in FIG. 1. When there is only one masking signal speaker, as in this example, or when the direction in which a third party is located is not known in advance, it is desirable to use an omnidirectional speaker. Reference numeral 1402 denotes a voice output speaker for outputting voice to the user, and corresponds to the voice output speaker 106 in FIG. 1. So that the signal output from the masking signal speaker 1403 does not make the output voice hard for the user to hear, the voice output speaker 1402 is preferably installed closer to the user's ears than the masking signal output speaker 1403.

As is clear from the above description, according to this embodiment, the voice output device can appropriately control the output of the masking signal so that the output voice is not overheard by nearby third parties and the third parties are not made uncomfortable.

(Third Embodiment)
In the second embodiment described above, the masking signal for the output voice is generated and output solely on the basis of events issued by operations relating to voice output. However, as described in the first embodiment for the voice input function, the surrounding environment can also be measured and that information used as well to generate and output the masking signal.

FIG. 15 is a block diagram showing the functional configuration of the voice output device according to this embodiment. It adds a surrounding environment measurement unit 1505 to the configuration of FIG. 11 of the second embodiment. The surrounding environment measurement unit 1505 measures the noise level, noise direction, and noise position around the voice output device, as well as the presence, direction, and position of third parties around the device. The other components are the same as in FIG. 11, so they are given the same reference numerals and their description is omitted.

FIG. 16 is a diagram showing an example of the external configuration of the voice output device in this embodiment.

Reference numeral 1601 denotes the main body of the voice output device, which houses the hardware shown in FIG. 1. Reference numeral 1602 denotes a voice output speaker for outputting voice to the user, and corresponds to the voice output speaker 106 in FIG. 1. Reference numerals 1606, 1607, and 1608 denote masking signal output speakers for outputting masking signals toward third parties, and correspond to the masking signal output speaker 107 in FIG. 1.

Reference numerals 1603, 1604, and 1605 denote microphones for measuring the noise level around the device at three points. Because a plurality of microphones for measuring the surrounding environment and a plurality of speakers for outputting masking signals are provided, the level of the masking signal output from each of the speakers 1606, 1607, and 1608 can be varied according to the individual noise levels measured by the microphones 1603, 1604, and 1605.

When a plurality of microphones for measuring the surrounding environment and a plurality of masking signal speakers are provided, as in this example, it is desirable to use directional microphones and speakers. In addition, so that the signals output from the masking signal speakers 1606, 1607, and 1608 do not make the voice output from the speaker 1602 hard for the user to hear, the masking signal output speakers 1606, 1607, and 1608 are preferably installed farther from the user than the speaker 1602.

In addition to noise, the measurement of the surrounding environment may cover people other than the user, such as the presence, direction, and position of third parties around the device, and these measurements may be combined. This can be realized by a method similar to the flowchart of FIG. 10 described in the first embodiment for the voice input function.

As is clear from the above description, according to this embodiment, the voice output device takes the measurement results for the surrounding environment into account and can thereby appropriately control the output of the masking signal so that the output voice is not overheard by nearby third parties and the third parties are not made uncomfortable.

(Fourth Embodiment)
The fourth embodiment describes a case in which the voice input function and the voice output function work together so that the voice processing device of FIG. 1 functions as a voice dialogue device.

FIG. 17 is a block diagram showing the functional configuration of the voice dialogue device according to this embodiment.

Reference numeral 1705 denotes a speech recognition unit that recognizes the user's input voice. Reference numeral 1703 denotes a voice input status management unit that manages the operating state of the speech recognition unit 1705; it monitors events relating to speech recognition. Reference numeral 1701 denotes a surrounding environment measurement unit that measures the noise level, noise direction, and noise position around the device, as well as the presence, direction, and position of third parties around the device. Reference numeral 1702 denotes a masking signal determination unit that, based on information from the voice input status management unit 1703 and the surrounding environment measurement unit 1701, determines a masking signal for masking the input voice or the output voice from third parties. Reference numeral 1704 denotes a masking signal control unit that controls output of the masking signal determined by the masking signal determination unit 1702. Reference numeral 1709 denotes a voice output unit that outputs synthesized speech. Reference numeral 1707 denotes a voice output status management unit that manages the operating state of the voice output unit 1709; it monitors events relating to speech synthesis. Reference numeral 1706 denotes a recognition result interpretation unit that interprets the result from the speech recognition unit 1705. Reference numeral 1708 denotes a dialogue management unit that manages the state of the dialogue with the user. Reference numeral 1710 denotes a response generation unit that generates the content to be output to the user as voice according to the dialogue state.

FIG. 18 is a diagram showing an example of the external configuration of the voice dialogue device in this embodiment.

Reference numeral 1801 denotes the main body of the voice dialogue device, which houses the hardware shown in FIG. 1. Reference numeral 1804 denotes a masking signal output speaker for outputting a masking signal toward third parties, and corresponds to the masking signal output speaker 107 in FIG. 1. When there is only one masking signal speaker, as in this example, or when the direction in which a third party is located is not known in advance, it is desirable to use an omnidirectional speaker. Reference numeral 1802 denotes a microphone for capturing the user's voice input and measuring the surrounding noise environment, and corresponds to the microphone 105 in FIG. 1. So that the masking signal output from the masking signal speaker 1804 does not interfere with voice input, the microphone 1802 is preferably installed closer to the user than the masking signal output speaker 1804.

Reference numeral 1803 denotes a voice output speaker for outputting voice to the user, and corresponds to the voice output speaker 106 in FIG. 1. So that the masking signal output from the masking signal speaker 1804 does not make the output voice hard for the user to hear, the voice output speaker 1803 is preferably installed closer to the user's ears than the masking signal output speaker 1804.

FIG. 19 shows an example of the timing of masking signal output in the voice dialogue device of this embodiment. The horizontal axis represents elapsed time, and the vertical axis shows, from top to bottom, the response of the voice dialogue device (System), the user's utterance (User), and the masking signal output (Mask). In this example, System first outputs the voice prompt "Now please enter your name" from time t1 to time t3; User then utters "Taro Yamada" from time t4 to time t5; and System then outputs the confirmation "You are Taro Yamada, correct?" from time t8 to time t9.

The masking signal is output at the following timings. First, System's initial voice output "Now please enter your name" from time t1 to t3 can be heard by third parties without harm, so no masking signal is output for it. Next, a masking signal needs to be output for User's utterance "Taro Yamada". Because the user may begin speaking without waiting for System's voice output to finish, the masking signal preferably starts at time t2, earlier than the end time t3 of System's voice output, as shown in the figure. The masking signal ends at time t6, obtained by adding a small margin to the end time t5 of User's utterance "Taro Yamada".

Next, System's voice output "You are Taro Yamada, correct?" is also considered information that should not be heard by third parties, so a masking signal is output from time t7 to time t10, which allow small margins before the start time t8 and after the end time t9 of that output.
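The relationship between the times t1 to t10 can be illustrated with a short calculation. All numeric values below, the function names, and the idea of computing the windows in advance are assumptions for illustration; in an actual device the end of the user's utterance is not known beforehand, so the end of masking would more naturally be triggered by an event.

```python
def user_utterance_window(prompt_end, utterance_end, early_start, tail_margin):
    """Mask from shortly before the prompt ends (t2) to after the utterance (t6)."""
    return (prompt_end - early_start, utterance_end + tail_margin)


def system_output_window(output_start, output_end, margin):
    """Mask a sensitive system output with a margin on both sides (t7 to t10)."""
    return (output_start - margin, output_end + margin)


def masking_on(t, windows):
    """Return True if time t falls inside any masking window."""
    return any(start <= t <= end for start, end in windows)


# Hypothetical times in seconds.
t1, t3 = 0.0, 2.0    # System: prompt for the name (not masked)
t4, t5 = 2.25, 3.25  # User: utterance of the name
t8, t9 = 4.0, 5.5    # System: confirmation that repeats the name

t2, t6 = user_utterance_window(t3, t5, early_start=0.5, tail_margin=0.25)
t7, t10 = system_output_window(t8, t9, margin=0.25)
windows = [(t2, t6), (t7, t10)]

print(windows)
print(masking_on(1.0, windows), masking_on(3.0, windows), masking_on(5.0, windows))
```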

Needless to say, the timing of masking signal output is not limited to this example. Alternatively, a masking signal may be output at all times while the user is using the device or while the device is powered on, and the level of the masking signal may be raised at the input and output timings that should not be heard by third parties. The level of the masking signal may also be varied according to the degree of privacy involved when the user is asked to speak. Needless to say, the mounting positions and number of microphones for measuring the surrounding environment, and the mounting positions and number of speakers for outputting masking signals, are not limited to those shown in FIG. 18.
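The variation in which masking is always on and is raised for sensitive exchanges might be sketched as follows. The function name, the integer privacy scale, and the decibel values are hypothetical; the description does not define how the degree of privacy is quantified.

```python
def masking_level_db(device_in_use, privacy, idle_db=30.0, step_db=10.0):
    """Choose a masking level from the usage state and a privacy degree.

    privacy is a hypothetical integer scale: 0 for exchanges that third
    parties may hear, larger values for more sensitive input or output.
    Returns None when no masking signal should be output at all.
    """
    if not device_in_use:
        return None
    # A constant background level while in use, raised for sensitive turns.
    return idle_db + step_db * privacy


print(masking_level_db(False, 2))  # device idle: no masking
print(masking_level_db(True, 0))   # prompt that anyone may hear
print(masking_level_db(True, 2))   # user states a name or similar
```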

As described above, information about third parties may also be measured as part of the surrounding environment using an infrared sensor, a video camera, or the like, and the masking signal may be determined and controlled using that information.

As is clear from the above description, the voice dialogue device of this embodiment can appropriately control the output of the masking signal so that the user's input voice and the device's output voice are not overheard by nearby third parties and the third parties are not made uncomfortable.

(Other Embodiments)
Embodiments of the present invention have been described in detail above. The present invention may be applied to a system made up of a plurality of devices, or to an apparatus consisting of a single device.

The present invention can also be achieved by supplying a software program that realizes the functions of the embodiments described above, directly or remotely, to a system or apparatus, and having a computer of that system or apparatus read and execute the supplied program code. In that case, as long as the functions of the program are provided, the form need not be a program.

Accordingly, the program code itself that is installed in a computer in order to realize the functional processing of the present invention on that computer, and a storage medium storing the program, also constitute the present invention. In other words, the claims of the present invention cover the computer program itself for realizing the functional processing of the present invention and a storage medium storing that program.

In that case, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS, as long as it has the functions of the program.

Examples of storage media for supplying the program include flexible disks, hard disks, optical disks, magneto-optical disks, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory cards, ROM, and DVD (DVD-ROM, DVD-R).

As another supply method, the program can be supplied by connecting to a website on the Internet using the browser of a client computer and downloading from that website, to a storage medium such as a hard disk, either the computer program of the present invention itself or a compressed file that includes an automatic installation function. It can also be realized by dividing the program code constituting the program of the present invention into a plurality of files and downloading each file from a different website. In other words, a WWW server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also covered by the claims of the present invention.

The program of the present invention may also be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users. Users who satisfy predetermined conditions are then allowed to download decryption key information from a website via the Internet, and by using that key information they can execute the encrypted program and install it on a computer.

The functions of the embodiments described above are realized when the computer executes the program it has read. They can also be realized when an OS or the like running on the computer performs part or all of the actual processing based on instructions from the program.

Furthermore, the functions of the embodiments described above are also realized when the program read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, and a CPU or the like provided on that function expansion board or in that function expansion unit then performs part or all of the actual processing based on instructions from the program.

FIG. 1 is a block diagram showing the hardware configuration of a voice processing device to which the present invention is applied.
FIG. 2 is a block diagram showing the functional configuration of the voice input device according to the first embodiment.
FIG. 3 is a flowchart showing an example of the processing flow of the surrounding environment measurement unit and the masking signal determination unit according to the first embodiment.
FIG. 4 is a flowchart showing an example of the processing flow of the voice input unit and the voice input status management unit according to the first embodiment.
FIG. 5 is a flowchart showing an example of the processing flow of the masking signal control unit according to the first embodiment.
FIG. 6 is a diagram showing an example of a function used to determine the level of the masking signal in the first embodiment.
FIG. 7 is a flowchart showing an example of a processing flow that dynamically determines the masking signal level using the voice input level in the first embodiment.
FIG. 8 is a diagram showing an example of the external configuration of the voice input device in the first embodiment.
FIG. 9 is a diagram showing another example of the external configuration of the voice input device in the first embodiment.
FIG. 10 is a flowchart showing an example of the processing flow of the surrounding environment measurement unit and the masking signal determination unit in the first embodiment.
FIG. 11 is a block diagram showing the functional configuration of the voice output device according to the second embodiment.
FIG. 12 is a flowchart showing an example of the processing flow of the voice output unit and the voice output status management unit in the second embodiment.
FIG. 13 is a flowchart showing an example of the processing flow of the masking signal control unit in the second embodiment.
FIG. 14 is a diagram showing an example of the external configuration of the voice output device in the second embodiment.
FIG. 15 is a block diagram showing the functional configuration of the voice output device according to the third embodiment.
FIG. 16 is a diagram showing an example of the external configuration of the voice output device in the third embodiment.
FIG. 17 is a block diagram showing the functional configuration of the voice dialogue device according to the fourth embodiment.
FIG. 18 is a diagram showing an example of the external configuration of the voice dialogue device in the fourth embodiment.
FIG. 19 is a diagram showing an example of the timing of masking signal output in the voice dialogue device of the fourth embodiment.
FIG. 20 is a diagram showing an example of a function used to determine the level of the masking signal in the first embodiment.

Claims

A speech processing apparatus for processing voice information received from input means for inputting voice uttered by a user, comprising:
measuring means for measuring the surrounding environment;
determining means for determining, based on the measurement result by the measuring means, a masking signal for masking the voice uttered by the user from a third party; and
control means for controlling the output of the masking signal determined by the determining means, based on the operating state of the input means.

The speech processing apparatus according to claim 1, wherein the measuring means measures at least one of a position of a third party and noise.

The speech processing apparatus according to claim 1, wherein the operating state of the input means is determined based on a start event and an end event of voice input.

The speech processing apparatus according to claim 1, wherein the measuring means includes an infrared sensor or a video camera for measuring a position of a third party.

A speech processing apparatus having output means for outputting speech, comprising:
measuring means for measuring the surrounding environment;
determining means for determining, based on the measurement result by the measuring means, a masking signal for masking the speech output from the output means from a third party; and
control means for controlling the output of the masking signal determined by the determining means, based on the operating state of the output means.

The speech processing apparatus according to claim 5, wherein the measuring means measures at least one of a position of a third party and noise.

The speech processing apparatus according to claim 5, wherein the operating state of the output means is determined based on a start event and an end event of speech output.

The speech processing apparatus according to claim 5, wherein the measuring means includes an infrared sensor or a video camera for measuring a position of a third party.

A speech processing apparatus having reception means for receiving voice information from input means for inputting voice uttered by a user, and output means for outputting speech, the apparatus comprising:
measuring means for measuring the surrounding environment;
determining means for determining, based on the measurement result by the measuring means, a masking signal for masking, from a third party, the user's voice input to the input means and the speech output from the output means; and
control means for controlling the output of the masking signal determined by the determining means, based on the operating states of the input means and the output means.

A control method for a speech processing apparatus that processes voice information received from input means for inputting voice uttered by a user, the method comprising:
a measurement step of measuring the surrounding environment;
a determination step of determining, based on the measurement result of the measurement step, a masking signal for masking the voice uttered by the user from a third party; and
a control step of controlling the output of the masking signal determined in the determination step, based on the operating state of the input means.

The method according to claim 10, wherein the measurement step measures at least one of a position of a third party and noise.

A control method for a speech processing apparatus having output means for outputting speech, the method comprising:
a measurement step of measuring the surrounding environment;
a determination step of determining, based on the measurement result of the measurement step, a masking signal for masking the speech output from the output means from a third party; and
a control step of controlling the output of the masking signal determined in the determination step, based on the operating state of the output means.

The method according to claim 12, wherein the measurement step measures at least one of a position of a third party and noise.

A control method for a speech processing apparatus having reception means for receiving voice information from input means for inputting voice uttered by a user, and output means for outputting speech, the method comprising:
a measurement step of measuring the surrounding environment;
a determination step of determining, based on the measurement result of the measurement step, a masking signal for masking, from a third party, the user's voice input to the input means and the speech output from the output means; and
a control step of controlling the output of the masking signal determined in the determination step, based on the operating states of the input means and the output means.

A program for causing a computer to execute the control method for a speech processing apparatus according to any one of claims 10 to 14.
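
For illustration only, the measurement, determination, and control steps recited in the method claims might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the embodiments: the names `Environment`, `determine_masking_level`, and `MaskingController`, the event strings, the 5 m range, the 60 dB ceiling, and the linear level function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Environment:
    """Result of the measurement step (field names are hypothetical)."""
    third_party_distance_m: Optional[float]  # None when no third party is detected
    noise_level_db: float


def determine_masking_level(env: Environment,
                            max_level_db: float = 60.0,
                            range_m: float = 5.0) -> float:
    """Determination step: map the measurement result to a masking level.

    The linear falloff with distance and the subtraction of ambient noise
    are assumptions made for this sketch, not the patent's function.
    """
    distance = env.third_party_distance_m
    if distance is None or distance >= range_m:
        return 0.0  # nobody close enough to overhear
    closeness = 1.0 - distance / range_m
    # Assumed: existing ambient noise already masks part of the speech.
    return max(0.0, max_level_db * closeness - env.noise_level_db)


class MaskingController:
    """Control step: gate the masking output on the input operating state."""

    def __init__(self) -> None:
        self.level_db = 0.0
        self.active = False

    def on_event(self, event: str) -> None:
        # Start and end events of voice input switch the masking on and off.
        if event == "input_start":
            self.active = self.level_db > 0.0
        elif event == "input_end":
            self.active = False

    def output_level(self) -> float:
        return self.level_db if self.active else 0.0


if __name__ == "__main__":
    controller = MaskingController()
    controller.level_db = determine_masking_level(Environment(2.5, 10.0))
    controller.on_event("input_start")
    print(controller.output_level())
    controller.on_event("input_end")
    print(controller.output_level())
```

In this sketch the masking level is fixed before input begins and is only switched on between the start and end events, which mirrors the ordering of the determination and control steps in the claims; a variant for the output means would use output start and end events in the same way.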