JP6194740B2

JP6194740B2 - Audio processing apparatus, audio processing method, and program

Info

Publication number: JP6194740B2
Application number: JP2013216002A
Authority: JP
Inventors: 純也藤本; 桂樹岡林
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2017-09-13
Anticipated expiration: 2033-10-17
Also published as: JP2015080087A

Description

The present invention relates to an audio processing apparatus, an audio processing method, and a program.

In recent years, providing information that assists a person according to their behavior and state while they go about normal activities has been considered. In such applications, taking out a terminal or looking at it every time information is provided is bothersome. A hands-free, eyes-free user interface that does not occupy the user's hands or eyes is therefore considered necessary.

Among the acoustic techniques that could be used for such an interface, methods of changing the localization position of the sound that conveys information are known. For example, a method of changing the localization direction of a sound image by varying the sound pressure levels and sounding times of a plurality of speakers, and a method of delaying the left and right channel signals individually, are known (see, for example, Patent Documents 1 to 3). A method of moving the localization position of the sound image of content audio to an arbitrary position when a conversation is detected, and a method of identifying the viewpoint of a virtual image and changing the sound according to the identified viewpoint, are also known (see, for example, Patent Documents 4 and 5). Methods for detecting a conversation are known as well (see, for example, Patent Document 6 and Non-Patent Document 1).

Patent Document 1: JP 2001-112083 A
Patent Document 2: JP H08-237790 A
Patent Document 3: International Publication No. WO 00/45619
Patent Document 4: JP 2011-97268 A
Patent Document 5: JP H10-137445 A
Patent Document 6: JP 2007-17620 A

Non-Patent Document 1: Tsubasa Onishi, Paul Dixon, Koji Iwano, Sadaoki Furui, "A study of a noise-robust speech recognition decoder using VAD reliability", Proceedings of the Acoustical Society of Japan, pp. 49-50 (September 2009)

In acoustic technology used for a user interface that does not occupy the user's hands or eyes, it may be desirable to adjust the volume balance between the sound of the provided information and the surrounding environmental sound according to changes in the environment and the state of the user. However, because the loudness of the environmental sound changes from moment to moment, adjusting the volume balance manually is difficult. In addition, conventional acoustic techniques such as those above, which change the localization position of the sound, may break the correspondence with the real-world situation, and they cannot adjust the volume balance between the sound of the provided information and the environmental sound.

According to one aspect, an object of the present invention is to make the volume balance between the environmental sound and the sound of the provided information controllable in response to changes in the surrounding sound environment.

An audio processing apparatus according to one aspect includes a sound collection unit, an audio acquisition unit, a superposition ratio calculation unit, a superposition processing unit, and an output unit. The sound collection unit collects environmental sound. The audio acquisition unit acquires an information sound, which is the sound of the information to be provided. The superposition ratio calculation unit calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of the superimposed sound obtained by superimposing the information sound and the environmental sound. The ratio is calculated so as to compensate for the gap between a first predetermined value and the difference between a first representative value, taken from time-series data of the sound pressure level of the information sound, and a second representative value, taken from time-series data of the sound pressure level of the environmental sound. The superposition processing unit superimposes the information sound and the environmental sound based on the superposition ratio. The output unit outputs the audio signal resulting from the superposition.

According to the audio processing apparatus, audio processing method, and program of the embodiments, the volume balance between the environmental sound and the sound of the provided information can be controlled in response to changes in the surrounding sound environment.

Brief description of the drawings (listed in the order given in the source):
A diagram showing an example of the hardware configuration of the audio processing system according to the first embodiment.
A block diagram showing an example of the functions of the audio processing apparatus according to the first embodiment.
A diagram explaining a method of calculating the representative value of the sound pressure level according to the first embodiment.
A diagram explaining a method of calculating the representative value of the sound pressure level according to the first embodiment.
A diagram explaining a method of calculating the representative value of the sound pressure level according to the first embodiment.
A flowchart showing the operation of the audio processing system according to the first embodiment.
A diagram conceptually showing an example of how the audio processing system according to the second embodiment is used.
A block diagram showing an example of the configuration of the audio processing system according to the second embodiment.
A block diagram showing an example of the functions of the audio processing apparatus according to the second embodiment.
A diagram explaining the gaze state according to the second embodiment.
A block diagram showing an example of the functions for detecting the gaze state according to the second embodiment.
A diagram showing an example of infrared information according to the second embodiment.
A diagram showing an example of gaze target information according to the second embodiment.
A diagram showing an example of the front list according to the second embodiment.
A flowchart showing the main operation of the audio processing system according to the second embodiment.
A flowchart showing the gaze detection processing performed by the audio processing system of the second embodiment.
A block diagram showing an example of the functions of the audio processing apparatus according to the third embodiment.
A flowchart showing the operation of the audio processing system according to the third embodiment.
A flowchart showing the operation of the audio processing system according to the third embodiment.
A diagram showing an example of the hardware configuration of an audio processing system according to a modification.
A diagram showing an example of the hardware configuration of a standard computer.

(First embodiment)
The audio processing system 1 according to the first embodiment is described below with reference to the drawings. FIG. 1 is a diagram showing an example of the hardware configuration of the audio processing system 1. The audio processing system 1 superimposes the sound of the provided information (hereinafter, the information sound) on the environmental sound, which is the surrounding sound, with a balance that is adjusted automatically in response to changes in the surrounding sound environment, and presents the result. As shown in FIG. 1, the audio processing system 1 includes an audio processing apparatus 2 and a microphone device 30.

The audio processing apparatus 2 acquires the environmental sound and outputs an audio signal in which the information sound is superimposed on the environmental sound. It includes an arithmetic processing device 3, a storage unit 5, an input unit 23, a display unit 25, and an audio input/output unit 15. The audio processing apparatus 2 can be, for example, a multifunction mobile phone, a tablet computer, or a music player.

The arithmetic processing device 3 is a processor that controls the operation of the audio processing apparatus 2. It does so by, for example, reading and executing a control program stored in advance in the storage unit 5.

The storage unit 5 is, for example, a semiconductor memory, and includes a Read Only Memory (ROM) 7, a Random Access Memory (RAM) 9, and the like. The storage unit 5 stores, for example, the control program that controls the operation of the audio processing apparatus 2, as well as various information and calculation results needed for that operation.

The input unit 23 is a device for entering information, such as a touch panel or a keyboard. The display unit 25 is a device for displaying information, such as a liquid crystal display. The audio input/output unit 15 outputs audio signals to an audio output device connected to the audio processing apparatus 2, such as a speaker or earphones, and accepts audio signals from an audio acquisition device connected to the audio processing apparatus 2, such as a microphone.

The microphone device 30 includes an earphone 32 and a microphone 34, and is connected to the audio processing apparatus 2 by wire or wirelessly to exchange audio with it. The microphone device 30 is, for example, a binaural microphone device. The earphone 32 can be worn on the ear and outputs, as sound, the audio signal generated by the audio processing apparatus 2. The microphone 34 is preferably formed integrally with the earphone 32; it collects the environmental sound and outputs it to the audio processing apparatus 2 as an audio signal.

FIG. 2 is a block diagram showing an example of the functions of the audio processing apparatus 2. As shown in FIG. 2, the audio processing apparatus 2 includes a sound collection unit 41, an audio acquisition unit 43, a superposition ratio calculation unit 45, a superposition processing unit 47, and an output unit 49. The sound collection unit 41 acquires the audio signal of the environmental sound input from, for example, the microphone device 30. The audio acquisition unit 43 acquires, from the storage unit 5 for example, the information sound that the audio processing apparatus 2 provides to the user via the microphone device 30. The superposition ratio calculation unit 45 calculates the sound pressure superposition ratio to use when superimposing the environmental sound and the information sound, such that the difference between the sound pressure level of the environmental sound acquired by the sound collection unit 41 and the sound pressure level of the information sound acquired by the audio acquisition unit 43 becomes a predetermined value. The superposition processing unit 47 generates an audio signal in which the environmental sound and the information sound are superimposed according to the superposition ratio calculated by the superposition ratio calculation unit 45 (the source text attributes this calculation to the audio acquisition unit 43, which appears to be an error). The output unit 49 outputs the audio signal generated by the superposition processing unit 47 to, for example, the microphone device 30.

The operation of the superposition ratio calculation unit 45 according to the first embodiment is described further with reference to FIGS. 3 to 5. FIGS. 3 to 5 each illustrate a different method of calculating a representative value of the sound pressure level from an audio signal. In each figure, the horizontal axis is time and the vertical axis is sound pressure level.

FIG. 3 shows an example of an audio signal such as one observed under constant white noise. When the audio signal contains signals of multiple channels, the sound pressure level may be taken as the maximum sound pressure level among the channels at the same time.

FIG. 3 plots the moving average, the weighted average, and the median against the original sound pressure level. The moving average is the mean of the time-series sound pressure level data within a fixed period preceding the time of calculation. The weighted average is the mean obtained by multiplying each value of the time-series sound pressure level data within that fixed period by a predetermined weight (coefficient) and summing; in this example, larger weights are given to levels closer to the time of calculation. The median is the middle value when the time-series sound pressure level data within that fixed period are sorted by magnitude.

In the example of FIG. 3, the original sound pressure level fluctuates widely over time, whereas the moving average, weighted average, and median each show reduced fluctuation. Under conditions such as constant white noise, every representative value suppresses the variation, but the moving average varies least and is the most stable.

FIG. 4 shows an example in which the sound pressure level increases from a certain time onward. When the environment changes partway through and the sound pressure level shifts in this way, the weighted average tracks the change fastest, followed by the moving average and then the median. To reflect such a trend in the representative value, adopting the weighted average is therefore considered preferable.

FIG. 5 shows an example in which the sound pressure level changes abruptly at a certain time. The median is least affected by the abrupt change, followed by the moving average and then the weighted average. When a loud sound arrives suddenly like this, the median barely changes, while the other two values are pulled along and change substantially. An abrupt change is a transient phenomenon, so a representative value that is unaffected by it is considered preferable, and in this case the median is the preferred choice.

As described above, several methods of calculating the representative value of the sound pressure level are possible, and the situations in which each is effective differ. For example, the representative values can be characterized as follows.
Moving average: effective under white noise or periodic noise
Weighted average: effective when a fast response to environmental changes is wanted
Median: effective in environments where very loud sounds occur suddenly

The representative value is therefore preferably calculated by a method suited to how the sound pressure level varies over time. In particular, when a situation such as that of FIG. 4 or FIG. 5 is anticipated, a calculation method suited to that situation can be set in advance.
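The three representative values compared above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the linearly increasing weights, and the sample level sequences are all assumptions introduced here; the text only states that newer samples receive larger weights.

```python
from statistics import median


def moving_average(levels):
    """Plain mean of the sound pressure levels (dB) in the window."""
    return sum(levels) / len(levels)


def weighted_average(levels):
    """Mean with weights 1, 2, ..., n so that newer samples count more.

    The linear weighting is an assumption made for this sketch.
    """
    weights = range(1, len(levels) + 1)
    return sum(w * x for w, x in zip(weights, levels)) / sum(weights)


# Hypothetical lookup table so a method can be selected by name in advance.
METHODS = {
    "moving_average": moving_average,
    "weighted_average": weighted_average,
    "median": median,
}


def representative_value(levels, method="moving_average"):
    """Representative value of a window of levels, oldest sample first."""
    return METHODS[method](levels)


# A level step (as in FIG. 4) and a single loud spike (as in FIG. 5).
step = [40.0] * 6 + [60.0] * 4
spike = [40.0] * 9 + [90.0]

for name in METHODS:
    print(name, representative_value(step, name), representative_value(spike, name))
```

On the step sequence the weighted average moves furthest toward the new level and the median has not moved yet, while on the spike sequence the median stays at the baseline, which matches the qualitative behavior described for FIGS. 4 and 5.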

In the present embodiment, the audio processing apparatus 2 uses the configured calculation method to calculate a representative value Lb of the time-series sound pressure level data of the environmental sound collected by the sound collection unit 41, and a representative value Lc of the time-series sound pressure level data of the information sound provided to the user.

The operation of the audio processing system 1 according to the first embodiment is described further with reference to FIG. 6, which is a flowchart of that operation. The processing is performed by the arithmetic processing device 3 reading and executing a control program stored in advance, but for convenience it is described here as being performed by the functional units shown in FIG. 2.

As shown in FIG. 6, the superposition ratio calculation unit 45 first determines a target value X for the sound pressure level difference (S61). The sound pressure level difference is the difference between the representative value Lb of the sound pressure level of the environmental sound (hereinafter, the environment representative value Lb) and the representative value Lc of the sound pressure level of the information sound (hereinafter, the information representative value Lc). The audio processing apparatus 2 sets the target value X to a desired value and controls the superposition so that the difference between the environment representative value Lb and the information representative value Lc always matches the target value X. As a result, when the environmental sound and the information sound are superimposed, the information sound can, for example, be made easy to hear in a noisy environment, or speech can be made audible in preference to the information sound.

The sound collection unit 41 acquires environmental sound information from the microphone device 30 (S62). The audio acquisition unit 43 stores the current sound pressure level of the information sound in a buffer (not shown), and the sound collection unit 41 stores the current level of the environmental sound in a buffer (not shown) (S63). The audio acquisition unit 43 then calculates the information representative value Lc, and the sound collection unit 41 calculates the environment representative value Lb (S64).

The time T that defines the interval over which each representative value is calculated is a tuning parameter. As an example, when the moving average is used as the representative value and the aim is to keep the superposition balance from changing abruptly when a sudden environmental sound occurs, T can be determined as follows.

The discrimination threshold for sound pressure level is said to be 0.5 dB to 1.0 dB. Therefore, for the case where the moving average of the environmental sound pressure level is L and a sudden sound whose level exceeds it by A (dB) continues for k seconds, the minimum time T for which the change in the moving average is 0.5 dB or less is obtained:
\[ \left| \frac{L(T-k) + (L+A)k}{T} - L \right| \le 0.5 \qquad \text{(Equation 1)} \]

Equation 2 below follows from Equation 1.
\[ \left| \frac{kA}{T} \right| \le 0.5 \qquad \text{(Equation 2)} \]

For example, with k = 1 (second) and A = 30 (dB), the minimum is T = 60 (seconds). In other words, with T = 60 seconds, even if a sound 30 dB louder than the average environmental sound pressure level occurs for 1 second, the average environmental sound pressure level rises by only 0.5 dB, and as a result the superposition ratio S hardly changes.
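The window-length calculation can be checked numerically. The sketch below solves Equation 2 for the smallest T and evaluates the left-hand side of Equation 1 directly; the function and parameter names are assumptions made for illustration.

```python
def min_window_seconds(k, a_db, max_shift_db=0.5):
    """Smallest T that satisfies |k * A / T| <= max_shift_db (Equation 2)."""
    return abs(k * a_db) / max_shift_db


def moving_average_shift(level, k, a_db, t):
    """Left-hand side of Equation 1: change in the moving average over a
    window of t seconds when a burst A dB above `level` lasts k seconds."""
    return abs((level * (t - k) + (level + a_db) * k) / t - level)


t = min_window_seconds(k=1, a_db=30)
print(t)                                    # window length in seconds
print(moving_average_shift(40.0, 1, 30, t)) # resulting shift in dB
```

The baseline level of 40 dB used in the second call is arbitrary: L cancels out of Equation 1, which is why Equation 2 no longer contains it.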

Next, the superposition ratio calculation unit 45 obtains the difference Y between the information representative value Lc and the environment representative value Lb (S65), and then the difference Z between the target value X and the difference Y (S66). It then obtains the superposition ratio S such that the decibel expression of the sound pressure distribution ratio between the information sound and the environmental sound equals the difference Z (S67).

Let the environment representative value Lb be the representative value of the time-series data, over the most recent T seconds, of the environmental sound pressure level measured by the microphone device 30, and let the information representative value Lc be the representative value of the time-series data of the information sound pressure level over the most recent T seconds. The sound pressure superposition ratio S is then obtained from Equation 3.
\[ X - (L_c - L_b) = 20 \log_{10} \frac{1 - S}{S} \qquad \text{(Equation 3)} \]

Equation 3 expresses the idea of comparing the target sound pressure level difference X with the current average sound pressure level difference (Lc − Lb) and making up the shortfall by adjusting the superposition ratio S.
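Equation 3 can be solved for S in closed form: writing Z = X − (Lc − Lb), it gives (1 − S)/S = 10^(Z/20), so S = 1 / (1 + 10^(Z/20)). The following sketch applies that rearrangement; the function names are assumptions, and the second helper is only a consistency check that the weighted levels differ by X after superposition.

```python
import math


def superposition_ratio(x_target, lc, lb):
    """Solve Equation 3 for S, the share of the environmental sound."""
    z = x_target - (lc - lb)               # shortfall to be compensated (dB)
    return 1.0 / (1.0 + 10.0 ** (z / 20.0))


def level_difference_after_mix(lc, lb, s):
    """Information level minus environmental level once the information
    sound is scaled by (1 - S) and the environmental sound by S."""
    return (lc + 20.0 * math.log10(1.0 - s)) - (lb + 20.0 * math.log10(s))


s_equal = superposition_ratio(20.0, 60.0, 40.0)  # difference already on target
s_boost = superposition_ratio(20.0, 50.0, 50.0)  # information sound must gain 20 dB
print(s_equal, s_boost)
print(level_difference_after_mix(50.0, 50.0, s_boost))
```

When the current difference already equals the target, Z is zero and the two sounds are mixed in equal proportion; a positive Z pushes S below 0.5 so that the information sound receives the larger share.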

The superposition processing unit 47 superimposes the environmental sound acquired from the microphone device 30 and the information sound acquired by the audio acquisition unit 43, using the superposition ratio S calculated as described above from their time-series data, and determines the output sound pressure (S68). The output unit 49 outputs an audio signal corresponding to the obtained output sound pressure, causing the microphone device 30 to reproduce the sound (step 70).

If the sound pressure of the environmental sound at a given instant is p_b and the sound pressure of the information sound is p_c, the output sound pressure p_o is calculated by Equation 4.
\[ p_o = S\,p_b + (1 - S)\,p_c \qquad \text{(Equation 4)} \]
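Equation 4 is a per-sample weighted sum. A minimal sketch follows, assuming both signals are available as equal-length sequences of instantaneous sound pressure samples; the function name and the length and range checks are additions made for illustration.

```python
def superimpose(env, info, s):
    """Apply Equation 4 sample by sample: p_o = S * p_b + (1 - S) * p_c."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("superposition ratio S must lie in [0, 1]")
    if len(env) != len(info):
        raise ValueError("environmental and information signals must have equal length")
    return [s * p_b + (1.0 - s) * p_c for p_b, p_c in zip(env, info)]


env = [1.0, -1.0]    # environmental sound samples (p_b)
info = [0.0, 0.5]    # information sound samples (p_c)
print(superimpose(env, info, 0.5))
```

With S = 1 only the environmental sound is passed through, and with S = 0 only the information sound, so S acts as a crossfade position between the two sources.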

Consider adjusting the sound pressure level difference so that the information sound, which is the content provided to the user, can be heard comfortably. For example, the sound pressure level of ordinary conversation is said to be 60 dB and that of a conference room 40 dB, so the target value X is set such that the information sound is X = 20 dB louder than the environmental sound.

If no termination instruction has been given, for example from the input unit 23 (S70: NO), the audio processing apparatus 2 repeats the processing from S62; if a termination instruction has been given (S70: YES), it ends the processing.
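The loop from S62 to S68 can be sketched end to end as follows. Several details here are assumptions rather than anything stated in the text: audio arrives as frames (lists of samples), the level of a frame is taken as its RMS value in dB relative to a reference of 1.0, the representative value is a moving average over the last `window` frames, and the loop ends when the input runs out instead of on a termination instruction.

```python
import math
from collections import deque


def level_db(frame, ref=1.0):
    """RMS level of one frame in dB relative to `ref` (assumed definition)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)


def run(env_frames, info_frames, x_target, window=3):
    """Mix environmental and information frames toward a target level gap."""
    env_levels = deque(maxlen=window)    # level buffers filled in S63
    info_levels = deque(maxlen=window)
    output = []
    for env, info in zip(env_frames, info_frames):        # S62: next frames
        env_levels.append(level_db(env))                  # S63: store levels
        info_levels.append(level_db(info))
        lb = sum(env_levels) / len(env_levels)            # S64: moving averages
        lc = sum(info_levels) / len(info_levels)
        y = lc - lb                                       # S65: current difference
        z = x_target - y                                  # S66: shortfall
        s = 1.0 / (1.0 + 10.0 ** (z / 20.0))              # S67: Equation 3 solved for S
        output.append([s * p_b + (1.0 - s) * p_c          # S68: Equation 4
                       for p_b, p_c in zip(env, info)])
    return output


print(run([[1.0, -1.0]], [[1.0, 1.0]], x_target=0.0))
```

In this sketch the two input frames have the same RMS level and the target gap is 0 dB, so the frames are mixed in equal proportion. A real-time version would replace the `zip` over prepared frames with reads from the microphone and the content source.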

As described above, in the audio processing system 1 according to the first embodiment, the sound collection unit 41 collects the environmental sound and the audio acquisition unit 43 acquires the information sound. The superposition ratio calculation unit 45 calculates the environment representative value Lb and the information representative value Lc by a predetermined calculation method, and calculates the superposition ratio S such that the difference between the environment representative value Lb and the information representative value Lc becomes the predetermined target value X. The superposition processing unit 47 synthesizes the audio based on the calculated superposition ratio S, and the output unit 49 outputs the synthesized audio.

The audio processing system 1 according to the first embodiment can thus automatically control the sound pressure level difference between the environmental sound and the information sound so that it matches a predetermined target value. When a device that blocks the ear, such as a canal-type earphone, is used, this prevents problems such as inattention to the surroundings or inability to hold a conversation that would arise if the user could no longer hear the surrounding environmental sound. Because the environmental sound is superimposed on the information sound and reproduced by the microphone device 30, the user can hear the environmental sound with its sense of direction and presence preserved.

Because the superposition ratio between the environmental sound and the information sound can be adjusted automatically, a volume balance suited to the situation can be achieved. Specifically, setting the target value X for the difference between the information representative value and the environment representative value to a desired value in advance prevents the environmental sound from being so loud that the user cannot concentrate on the information sound, and conversely prevents it from being so quiet that the user becomes inattentive to the surroundings. Since the adjustment is automatic, a volume balance suited to the constantly changing surrounding sound environment is maintained without the user adjusting it manually, which improves convenience. Adjusting the volume of both the information sound and the environmental sound also makes it possible to give priority to the environmental sound.

(Second Embodiment)
Next, the voice processing system 100 according to the second embodiment will be described. FIG. 7 is a diagram conceptually illustrating an example of how the voice processing system 100 is used. In the second embodiment, configurations and operations that are the same as in the first embodiment are denoted by the same reference numerals, and redundant description is omitted. The voice processing system 100 includes a head device 130, a voice processing device 20, and an infrared generator 125.

Like the voice processing system 1, the voice processing system 100 superimposes the environmental sound and the information sound and outputs the result. In the voice processing system 1, the microphone device 30 is connected to the voice processing device 2, whereas in the voice processing system 100 the head device 130 is connected to the voice processing device 20. The voice processing system 100 also includes the infrared generator 125, and the voice processing device 20 can measure its own position using infrared light.

The voice processing system 100 according to the second embodiment is assumed to be used in an area containing gaze target objects that the user 110 is expected to gaze at, such as the poster 111 and the poster 113. The infrared generator 125 therefore preferably irradiates regions such as the area in front of the poster 111 and the area in front of the poster 113. The infrared generator 125 may be installed at a location above the user 110. The voice processing device 20 thereby detects, as its own position, a position such as a region in front of the poster 111.

FIG. 8 is a block diagram illustrating an example of the configuration of the voice processing system 100. As shown in FIG. 8, in the voice processing system 100 the voice processing device 20 is connected to the head device 130 by wire or wirelessly. The hardware configuration of the voice processing device 20 can be the same as that of the voice processing device 2 according to the first embodiment. The voice processing device 20 acquires the position information of the infrared generator 125 from the infrared position information database (DB) 143. The infrared position information DB 143 may be stored in advance in the storage unit 5 of the voice processing device 20, or may be acquired, for example, via an information processing device that can be connected to the voice processing device 20 over a communication network.

The head device 130 includes the earphones 32, the microphone 34, a microcomputer 135, an acceleration sensor 137, a gyro sensor 139, and an infrared light receiving unit 141. As shown in FIG. 7, the user wears the head device 130 on the head, in the manner of headphones, and listens to sound through it. The head device 130 also picks up the environmental sound 121, the environmental sound 123, and the like with the microphone 34.

The acceleration sensor 137 detects the acceleration of the head device 130. The acceleration sensor 137 may be, for example, a three-dimensional acceleration sensor. The gyro sensor 139 measures the tilt of the head device 130. The infrared light receiving unit 141 receives infrared light from the infrared generator 125.

The microcomputer 135 is an integrated circuit that functions as an information processing device capable of executing a program that performs predetermined processing. For example, the microcomputer 135 splits the audio signal input from the voice processing device 20 and outputs it to the left and right earphones 32. The microcomputer 135 also outputs the sound acquired by the microphone 34 to the voice processing device 20. In addition, the microcomputer 135 outputs to the voice processing device 20 the acceleration detected by the acceleration sensor 137, the angle detected by the gyro sensor 139, the identification information of the infrared generator 125 that emitted the infrared light received by the infrared light receiving unit 141, and the like. The microcomputer 135 may perform predetermined processing at this point, such as analyzing the detection result of the infrared light receiving unit 141 to obtain the identification information of the infrared generator 125.

FIG. 9 is a block diagram illustrating an example of the functions of the voice processing device 20. As shown in FIG. 9, the voice processing device 20 includes, like the voice processing device 2, the sound collection unit 41, the sound acquisition unit 43, the superposition ratio calculation unit 45, the superposition processing unit 47, and the output unit 49. The voice processing device 20 further includes a stereophonic sound processing unit 151, a state measurement unit 153, a state detection unit 155, and a position and orientation estimation unit 157.

The state measurement unit 153 acquires detection results from, for example, the acceleration sensor 137, the gyro sensor 139, and the infrared light receiving unit 141 of the head device 130. The detection results are, for example, the acceleration of the head device 130 obtained from the acceleration sensor 137, the angle of the head device 130 obtained from the gyro sensor 139, and information corresponding to the position of the head device 130 obtained from the infrared light receiving unit 141.

The position and orientation estimation unit 157 estimates the position and orientation of the user from the detection results acquired by the state measurement unit 153. The position and orientation of the user are, for example, the position of the user 110, obtained as the position of the head device 130, and the front range 173 of the user 110, obtained from the direction of the head device 130. The position of the head device 130 is calculated, for example, by taking the position information obtained from the detection result of the infrared light receiving unit 141 and adding the change in position obtained by integrating the acceleration from the acceleration sensor 137. To obtain that position information, the position and orientation estimation unit 157 looks up the identification information of the infrared generator 125, acquired from the head device 130, in the infrared position information DB 143 and reads the corresponding position information. Details of the infrared position information DB 143 are given later. The direction of the head device 130 is calculated, for example, by integrating the angle information obtained from the gyro sensor 139.
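As a minimal sketch of the calculation described above, and not the implementation of the embodiment, the position could be reset from the infrared position information and then advanced by integrating sensor samples. The two-dimensional floor coordinates, the `Pose` type, the function names, and the assumption that the gyro sensor delivers an angular rate in radians per second are all introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    x: float        # position in metres (assumed 2-D floor coordinates)
    y: float
    vx: float       # velocity in m/s
    vy: float
    heading: float  # facing direction in radians


def reset_from_beacon(pose, beacon_xy):
    """Replace the position with the one registered for the received infrared ID."""
    return Pose(beacon_xy[0], beacon_xy[1], pose.vx, pose.vy, pose.heading)


def integrate(pose, ax, ay, yaw_rate, dt):
    """Advance the pose by one sample: acceleration -> velocity -> position, rate -> heading."""
    vx = pose.vx + ax * dt
    vy = pose.vy + ay * dt
    return Pose(pose.x + vx * dt, pose.y + vy * dt, vx, vy,
                pose.heading + yaw_rate * dt)
```

Pure integration of this kind accumulates drift, which is one reason a periodic reset from a known infrared position is useful.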

The stereophonic sound processing unit 151 performs stereophonic sound processing on the sound from, for example, the sound acquisition unit 43 or the sound collection unit 41, such as changing the number of channels or adjusting the reproduction times and frequency characteristics of the left and right sounds. The conventional acoustic processing described in any of Patent Documents 1 to 5, for example, can be used for this processing. Such processing makes it possible, for example, to set the virtual source positions of the information sound 115 and the information sound 117 to desired positions. The information sound 115 can thus be placed at the position of the poster 111 and the information sound 117 at the position of the poster 113.

The state detection unit 155 detects the state of the user from the information measured by the state measurement unit 153. For example, it detects whether the user is walking on the basis of the acceleration obtained from the acceleration sensor 137. A method commonly used in pedometers and the like can be used for this detection. As another example, the state detection unit 155 may detect, from the information measured by the microphone 34, whether a conversation is taking place around the user 110. Whether a conversation is taking place can be detected, for example, by the methods described in Patent Document 6 and Non-Patent Document 1. The state detection unit 155 outputs the detected user state to the superposition ratio calculation unit 45.

The superposition ratio calculation unit 45 calculates the superposition ratio S for the information sound processed by the stereophonic sound processing unit 151 and the environmental sound collected by the sound collection unit 41, according to the user state detected by the state detection unit 155. The superposition ratio S can be calculated by the same method as in the first embodiment.

In the present embodiment, a gaze state is additionally detected as one of the user states. FIG. 10 is a diagram for explaining the gaze state. As shown in FIG. 10, the gaze state is the state in which an object that is a gaze target candidate, such as the object outputting the information sound 115, is determined to have been within the gaze range 171 for a predetermined time or longer, where the gaze range 171 is based on the estimated front range 173 of the user 110.

Gaze state detection is described below with reference to FIGS. 11 to 14. FIG. 11 is a block diagram illustrating an example of the functions for detecting the gaze state. As shown in FIG. 11, the voice processing device 20 includes a head measurement unit 163 as the state measurement unit 153, and a target position acquisition unit 161 and a gaze state detection unit 165 as the state detection unit 155.

FIGS. 12 to 14 are diagrams illustrating examples of the data structures of the data used to detect the gaze state. FIG. 12 illustrates an example of the infrared information 175, FIG. 13 illustrates an example of the gaze target information 180, and FIG. 14 illustrates an example of the front list 185.

As shown in FIG. 12, the infrared information 175 is the content of the infrared position information DB 143 described above and has an infrared identification (ID) 177 and position information 178. The infrared ID 177 is the identification information of an infrared generator 125. The position information 178 indicates the position at which the head device 130 is located when it detects the infrared light output from the infrared generator 125 corresponding to the infrared ID 177.

As shown in FIG. 13, the gaze target information 180 has a gaze target ID 182 and position information 183. The gaze target ID 182 is the identification information of an object that may become a gaze target of the user 110. In the example of FIG. 7, such objects are the poster 111, the poster 113, and the like. The position information 183 indicates the position of the gaze target corresponding to the gaze target ID 182.

As shown in FIG. 14, the front list 185 has a gaze candidate ID 187 and a detection time 188. The gaze candidate ID 187 is the identification information of a gaze target that has been determined to be within the gaze range 171 of the user 110 and that the user 110 is therefore presumed to be gazing at. The detection time 188 is the time at which the gaze target with the gaze candidate ID 187 was detected to be within the gaze range 171.
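The three data structures of FIGS. 12 to 14 could be held, for example, as simple keyed tables. The following is only an illustrative sketch: the identifier strings, the coordinate values, and the helper `lookup_user_position` are hypothetical and do not appear in the embodiment.

```python
from typing import Dict, Optional, Tuple

Position = Tuple[float, float]  # assumed 2-D coordinates in metres

# Infrared information 175: infrared ID 177 -> position information 178
infrared_info: Dict[str, Position] = {"IR-01": (2.0, 1.0), "IR-02": (6.0, 1.0)}

# Gaze target information 180: gaze target ID 182 -> position information 183
gaze_target_info: Dict[str, Position] = {"poster111": (2.0, 0.0),
                                         "poster113": (6.0, 0.0)}

# Front list 185: gaze candidate ID 187 -> detection time 188 (seconds)
front_list: Dict[str, float] = {}


def lookup_user_position(infrared_id: str) -> Optional[Position]:
    """Return the position registered for an infrared ID, or None if it is unknown."""
    return infrared_info.get(infrared_id)
```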

Returning to FIG. 11, the head measurement unit 163 acquires information on acceleration, angle, and infrared light reception from the head device 130. The position and orientation estimation unit 157 estimates the position and orientation on the basis of the information from the head measurement unit 163. The position and orientation are, for example, the position of the user 110 and the gaze range 171 of the user 110. The position and orientation estimation unit 157 obtains the position of the user 110 by searching the infrared IDs 177 in the infrared information 175 for the identification information of the infrared generator 125 acquired by the head measurement unit 163 and reading the corresponding position information 178. On the basis of the acceleration and angle acquired from the head device 130, the position and orientation estimation unit 157 estimates, for example, the orientation of the head device 130, calculates the front range 173 of the user 110, and estimates the gaze range 171. The angular extent of the front range 173 can be determined in advance.

The target position acquisition unit 161 acquires the position information 183 of objects from the gaze target information 180. The gaze state detection unit 165 determines whether any position information 183 falls within the gaze range 171 estimated by the position and orientation estimation unit 157. If it does, the gaze state detection unit 165 stores the gaze target ID 182 corresponding to that position information 183 and the time of detection in the front list 185 as the gaze candidate ID 187 and the detection time 188. When the gaze state detection unit 165 detects that the object with the same gaze candidate ID 187 has been within the gaze range 171 for a certain time or longer, the user 110 is determined to be in the gaze state.
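The test of whether a target position lies within the gaze range could be sketched as follows. This is an assumption-laden illustration rather than the method of the embodiment: it treats the gaze range as a two-dimensional angular sector centred on the head direction, and the function name and the `half_angle` parameter are hypothetical.

```python
import math


def in_gaze_range(user_xy, heading, target_xy, half_angle):
    """Return True if the target lies within +/- half_angle of the heading.

    All angles are in radians; positions are (x, y) tuples.
    """
    dx = target_xy[0] - user_xy[0]
    dy = target_xy[1] - user_xy[1]
    bearing = math.atan2(dy, dx)
    # Wrap the angular difference into [-pi, pi) so that headings near
    # +pi and -pi compare correctly.
    diff = (bearing - heading + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= half_angle
```

A distance limit could be added in the same way if the gaze range should also be bounded in depth.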

Next, the operation of the voice processing system 100 according to the present embodiment is described with reference to FIGS. 15 and 16. FIG. 15 is a flowchart showing the main operation of the voice processing system 100. FIG. 16 is a flowchart showing the gaze detection processing performed by the voice processing system 100. The processing of the voice processing system 100 is performed by the arithmetic processing device 3 reading and executing a control program stored in advance, but for convenience it is described here as being performed by the functions shown in FIG. 9 or FIG. 11. Detailed description of processing that is the same as in the first embodiment is omitted.

As shown in FIG. 15, the superposition ratio calculation unit 45 first determines the target value X of the sound pressure level difference according to the user state (S191). In the present embodiment, the state detection unit 155 detects the state of the user and the surroundings. As described above, the following states, for example, can be detected.
State a) A conversation is taking place nearby. The target value in this case is Xa, and this state is referred to as the conversation state.
State b) The user 110 is walking. The target value in this case is Xb, and this state is referred to as the walking state.
State c) The user 110 is gazing at a gaze target. The target value in this case is Xc, and this state is referred to as the gaze state.
State d) None of states a) to c) is detected. The target value in this case is Xd, and this state is referred to as the normal state.

The magnitudes of the target values Xa to Xd may, for example, follow the ordering given by Formula 5 below.
Xa < Xb < Xd < Xc ... (Formula 5)

These target values may, for example, be stored in advance in the storage unit 5, and the target value may be changed whenever the state detection unit 155 detects one of the states. The initial value can be set, for example, to X = Xd.
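The selection of the target value in S191 could be sketched as a table lookup. The decibel values below are hypothetical and chosen only to satisfy Formula 5. The rule used when several states are detected at once, which takes the smallest target and so favours the environmental sound, is likewise an assumption, since the description does not state a priority.

```python
# Hypothetical target values in dB, ordered so that Xa < Xb < Xd < Xc (Formula 5).
TARGETS_DB = {
    "conversation": -6.0,  # Xa
    "walking": 0.0,        # Xb
    "normal": 6.0,         # Xd
    "gaze": 12.0,          # Xc
}


def select_target(detected_states):
    """Return the target value X for a set of detected state names.

    With no detected state the normal-state value Xd is used. With several,
    the smallest target is chosen (an assumed priority rule).
    """
    if not detected_states:
        return TARGETS_DB["normal"]
    return min(TARGETS_DB[state] for state in detected_states)
```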

The sound collection unit 41 acquires environmental sound information from the head device 130 (S192). The stereophonic sound processing unit 151 performs stereophonic sound processing on the sound from the sound acquisition unit 43, such as adjusting the reproduction times and frequency characteristics of the left and right sounds, and calculates the information sound to be output to the superposition ratio calculation unit 45 (S193). The stereophonic sound processing here may generate an information sound that virtually originates from a desired position, according to the position and orientation estimated by the position and orientation estimation unit 157, the user state detected by the state detection unit 155, and the like. The gaze state is one of the user states that may be taken into account here. Details of the gaze state detection processing are given later.

The sound acquisition unit 43 stores the current sound pressure level value of the information sound in a buffer (not shown). The sound collection unit 41 stores the current level value of the environmental sound in a buffer (not shown) (S194). The sound acquisition unit 43 then calculates the information representative value Lc, and the sound collection unit 41 calculates the environmental representative value Lb (S195). The time T that defines the interval over which each representative value is obtained is preferably determined in the same way as in the first embodiment.

Next, the superposition ratio calculation unit 45 obtains the difference Y between the information representative value Lc and the environmental representative value Lb (S196). The superposition ratio calculation unit 45 then obtains the difference Z between the target value X and the difference Y (S197). The superposition ratio calculation unit 45 further obtains the superposition ratio S such that the decibel representation of the sound pressure distribution ratio between the information sound and the environmental sound matches the difference Z (S198). The superposition ratio S is calculated by the same method as in the first embodiment.
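Steps S196 to S198 can be illustrated numerically. The sketch below is not Formula 4 or the definition of S from the first embodiment, which this section does not restate. It assumes the sign conventions Y = Lc - Lb and Z = X - Y, interprets the distribution ratio as an amplitude ratio of 10^(Z/20), and normalises the two gains so that they sum to one. The function name is hypothetical.

```python
import math


def mixing_gains(lc_db, lb_db, x_db):
    """Return (information gain, environmental gain) so that the level
    difference after mixing equals the target x_db.

    lc_db: information representative value Lc in dB
    lb_db: environmental representative value Lb in dB
    """
    y = lc_db - lb_db           # S196: current level difference (assumed sign)
    z = x_db - y                # S197: correction still needed
    ratio = 10.0 ** (z / 20.0)  # amplitude ratio, information : environment
    g_info = ratio / (1.0 + ratio)
    g_env = 1.0 / (1.0 + ratio)
    return g_info, g_env
```

Under these assumptions the information level after mixing minus the environmental level after mixing is Y + 20 log10(g_info / g_env) = Y + Z = X, which is the behaviour the steps describe.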

The superposition processing unit 47 superimposes the environmental sound and the information sound using the superposition ratio S, which is calculated from the time-series data of the environmental sound acquired by the head device 130 and of the information sound acquired by the sound acquisition unit 43, and determines the output sound pressure (S199). The output unit 49 causes the head device 130 to reproduce the sound by outputting an audio signal at the output sound pressure obtained, for example, by Formula 4 described above (S200).

If no end instruction has been received, for example from the input unit 23 (S201: NO), the voice processing device 20 repeats the processing from S191. If an end instruction has been received (S201: YES), it ends the processing.

Next, the gaze state detection processing is described with reference to FIG. 16. As shown in FIG. 16, the target position acquisition unit 161 acquires the position information of the gaze target candidates from, for example, the gaze target information 180 (S231). The head measurement unit 163 acquires the detection results from the head device 130. The position and orientation estimation unit 157 estimates the gaze range 171 from the head position and orientation of the user 110, on the basis of the detection results of the head measurement unit 163 (S232).

The gaze state detection unit 165 compares the gaze range 171 with the gaze target information 180 to detect gaze target candidates whose position information 183 falls within the gaze range 171 (S233). If no gaze target candidate is within the gaze range 171 (S233: NO), the gaze state detection unit 165 deletes the gaze candidate ID 187 and the detection time 188 from the front list 185 (S234).

If a gaze target candidate is within the gaze range 171 in S233 (S233: YES), the gaze state detection unit 165 determines whether that candidate is already included in the front list 185 (S235). If it is not (S235: NO), the gaze state detection unit 165 stores the current time and the identification information of the gaze target candidate in the front list 185 as the detection time 188 and the gaze candidate ID 187. If the candidate is already included in the front list 185 (S235: YES), the gaze state detection unit 165 compares the time recorded in the front list 185 with the current time and, if a certain time has elapsed, determines that the user 110 is in the gaze state (S237). The processing then returns to S193 in FIG. 15.
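The front list handling in S233 to S237 amounts to a dwell-time check, which might be sketched as follows. The function name, the `dwell` parameter, and the use of seconds are hypothetical. Dropping entries for candidates that have left the gaze range while others remain is an added assumption, since the description only states that the list is cleared when no candidate is in range.

```python
def update_front_list(front_list, candidates_in_range, now, dwell):
    """Update the front list (candidate ID -> first detection time) in place
    and return the set of candidate IDs judged to be gazed at."""
    # Remove candidates that are no longer in range. When nothing is in
    # range this empties the list, which corresponds to S234.
    for cid in list(front_list):
        if cid not in candidates_in_range:
            del front_list[cid]

    gazed = set()
    for cid in candidates_in_range:
        if cid not in front_list:
            # S235: NO -> record the candidate ID and the detection time.
            front_list[cid] = now
        elif now - front_list[cid] >= dwell:
            # S235: YES and the dwell time has elapsed -> gaze state (S237).
            gazed.add(cid)
    return gazed
```

Called once per pass of the main loop, the function reports a gaze only after the same candidate has stayed in range for at least `dwell` seconds.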

As described above, in the voice processing system 100 according to the second embodiment, the sound collection unit 41 collects the environmental sound and the sound acquisition unit 43 acquires the information sound. The state detection unit 155 detects, for example, the conversation state, the walking state, and the gaze state on the basis of the information detected by the head device 130. The superposition ratio calculation unit 45 calculates the superposition ratio S such that the difference between the environmental representative value Lb and the information representative value Lc becomes the target value, among Xa to Xd, that is predetermined for the detected state. The superposition ratio calculation unit 45 calculates the environmental representative value Lb and the information representative value Lc by a predetermined calculation method.

The superposition ratio calculation unit 45 calculates the superposition ratio S such that the difference between the environmental representative value Lb and the information representative value Lc becomes the predetermined target value X. The superposition processing unit 47 synthesizes the sound on the basis of the calculated superposition ratio S. The output unit 49 outputs the synthesized sound. In the gaze state, the stereophonic sound processing unit 151 preferably performs acoustic processing so that the information sound appears to originate from the object the user is presumed to be gazing at.

As described above, the voice processing system 100 according to the second embodiment can automatically control the sound pressure level difference between the environmental sound and the information sound so that it reaches a predetermined target value. This prevents the problems that arise when a device that blocks the ear, such as a canal-type earphone, keeps the user from hearing the surrounding environmental sound, namely insufficient attention to the surroundings and an inability to hold a conversation. Because the environmental sound and the information sound are superimposed and reproduced by the head device 130, the user can also hear the environmental sound with its sense of direction and presence preserved.

Furthermore, when the conversation state is detected, for example, the environmental sound can be made louder than the information sound so that the user actively hears the environmental sound and can hold a conversation. When the walking state is detected, safety can be taken into account by making the environmental sound louder than in the normal state, though quieter than in the conversation state. When the gaze state is detected, the information sound can be made louder than in the normal state so that information is provided actively.

Because the superposition ratio is adjusted automatically in this way, a volume balance suited to the constantly changing surrounding sound environment is maintained without the user adjusting it manually, which improves convenience for the user. Adjusting both the volume of the information sound and the volume of the environmental sound also makes controls such as giving priority to the environmental sound possible.

In addition, the stereophonic sound processing unit 151 can process the sound information output from the earphones 32 worn by the person according to the position and orientation of the user estimated by the position and orientation estimation unit 157, and can set a virtual sound source position so that the sound seems to come from any position or direction around the person. Detecting the position and orientation of the person's head in this way allows the sound to be adjusted in real time so that the sound source position stays fixed in the surrounding environment. This realizes voice augmented reality (AR), which makes the person feel as if the sound source exists in the real-world environment.

Voice AR makes it possible to provide information in a hands-free and eyes-free manner. As an application example of a user interface based on voice AR, acoustic processing can be performed according to the position information of each exhibit around the user at a venue such as the exhibition shown in FIG. 7. For example, the explanatory voice for an exhibit can be processed so that it seems to come from the direction of that exhibit. Such processing could also be used to provide guidance that makes it easier for the user to find exhibits of interest.

（第３の実施の形態） (Third embodiment)
以下、第３の実施の形態による音声処理システムについて説明する。第３の実施の形態において、第１または第２の実施の形態と同様の構成及び動作については同一番号を付し、重複説明を省略する。 Hereinafter, a voice processing system according to the third embodiment will be described. In the third embodiment, configurations and operations similar to those in the first or second embodiment are denoted by the same reference numerals, and redundant description is omitted.

第３の実施の形態による音声処理システムは、第２の実施の形態による音声処理システム２４０と同様のハードウエア構成とすることができる。第３の実施の形態による音声処理システムは、音声処理システム１００において、音声処理装置２０に代えて音声処理装置２５０を有しており、代表値算出方法の切替機能を有する例である。 The voice processing system according to the third embodiment can have the same hardware configuration as the voice processing system 240 according to the second embodiment. The speech processing system according to the third embodiment is an example in which the speech processing system 100 includes a speech processing device 250 instead of the speech processing device 20 and has a function of switching a representative value calculation method.

図１７は、音声処理装置２５０の機能の一例を示すブロック図である。図１７に示すように、音声処理装置２５０は、音声処理装置２０と同様に、収音部４１、音声取得部４３、重畳比算出部４５、重畳処理部４７、出力部４９、立体音響処理部１５１、状態計測部１５３、状態検出部１５５、位置姿勢推定部１５７を有している。音声処理装置２５０は、さらに、代表値切替部２５１を有している。 FIG. 17 is a block diagram illustrating an example of the functions of the voice processing device 250. As illustrated in FIG. 17, the voice processing device 250, like the voice processing device 20, includes the sound collection unit 41, the sound acquisition unit 43, the superposition ratio calculation unit 45, the superimposition processing unit 47, the output unit 49, the stereophonic sound processing unit 151, the state measurement unit 153, the state detection unit 155, and the position and orientation estimation unit 157. The voice processing device 250 further includes a representative value switching unit 251.

代表値切替部２５１は、音圧レベルの時系列データの代表値の算出方法を切替える。具体的には、第１の実施の形態において説明した移動平均、加重平均、中央値を、各代表値が有効な状況に応じて採用することが考えられる。 The representative value switching unit 251 switches the calculation method of the representative value of the time series data of the sound pressure level. Specifically, the moving average, the weighted average, and the median value described in the first embodiment may be adopted depending on the situation where each representative value is effective.

第１の実施の形態において、各代表値が有効な状況について以下のように説明した。 In the first embodiment, the situations in which each representative value is effective were described as follows.
移動平均：白色雑音や周期性のある雑音環境下で有効 Moving average: effective under white noise or periodic noise
加重平均：環境変化に対する反応を早くしたいときなどに有効 Weighted average: effective when, for example, a quick response to environmental changes is desired
中央値：突発的にとても大きな音が入るような環境下で有効 Median: effective in environments where very loud sounds occur suddenly

具体的には、各代表値が有効な状況の例として次のような状況が考えられる。 Specifically, the following are examples of situations in which each representative value is effective.
移動平均：データセンタのような、空調やファンの音が一定量のノイズになる場合等 Moving average: cases where air conditioning or fan sound forms a constant level of noise, such as in a data center
加重平均：工事現場など、騒音レベルが断続的に変化する場合等 Weighted average: cases where the noise level changes intermittently, such as at a construction site
中央値：オフィスでドア開閉音が大きい場合、スポーツで打撃音が大きい場合等 Median: cases where door opening and closing sounds are loud in an office, or where impact sounds are loud in sports
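The three representative values can be compared on a small window of sound pressure levels. The following is a minimal sketch; the function names and the linearly increasing weights of the weighted average (which favor the most recent samples) are assumptions for illustration, since this section does not state the weighting scheme.

```python
from statistics import mean, median


def moving_average(levels):
    """Plain mean of the sound pressure levels (dB) in the current window."""
    return mean(levels)


def weighted_average(levels):
    """Weighted mean that favors recent samples.

    Assumption: weights 1, 2, ..., n with the newest (last) sample weighted
    most heavily. The source text does not specify the weights.
    """
    weights = range(1, len(levels) + 1)
    return sum(w * x for w, x in zip(weights, levels)) / sum(weights)


def median_level(levels):
    """Median of the window; a single loud outlier barely moves it."""
    return median(levels)


# A window with one sudden loud sound (90 dB) among steady 50 dB samples.
window = [50, 50, 50, 90, 50]
print(moving_average(window), weighted_average(window), median_level(window))
```

With this window the median stays at the steady level, while both averages are pulled upward by the single loud sample, which matches the situations listed above.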

第１及び第２の実施の形態においては、代表値の算出方法は予め定めておいた算出方法を常に用いるとしたが、本実施の形態においては、代表値切替部２５１は、例えば、過去の環境音の分布を解析し、用いる代表値の算出方法を自動で切り替える。 In the first and second embodiments, a predetermined calculation method was always used for the representative value. In the present embodiment, the representative value switching unit 251, for example, analyzes the distribution of past environmental sound and automatically switches the calculation method used for the representative value.

図１８、図１９は、第３の実施の形態による音声処理システムの動作を示すフローチャートである。第３の実施の形態による音声処理システムによる処理は、予め記憶された制御プログラムを演算処理装置３が読み込んで実行することにより行われる処理であるが、ここでは便宜上、図１７に示した各機能が処理を行うとして説明する。また、第１または第２の実施の形態と同様の処理については、詳細な説明を省略する。 FIGS. 18 and 19 are flowcharts showing the operation of the voice processing system according to the third embodiment. The processing by the voice processing system according to the third embodiment is performed by the arithmetic processing device 3 reading and executing a control program stored in advance; here, for convenience, it is described as being performed by the functions shown in FIG. 17. Detailed description of processing that is the same as in the first or second embodiment is omitted.

図１８に示すように、まず重畳比算出部４５は、ユーザ状態に応じて音圧レベル差の目標値Ｘを決定する（Ｓ２８１）。本実施の形態では、第２の実施の形態と同様、状態検出部１５５がユーザや周囲の状態を検出している。ここでは、上述したように、目標値Ｘａ〜Ｘｄを切替えることが好ましい。これらの目標値は、例えば予め記憶部５に記憶しておき、状態検出部１５５で各状態が検出された場合に、目標値を変更することが好ましい。 As shown in FIG. 18, the superimposition ratio calculation unit 45 first determines a target value X of the sound pressure level difference according to the user state (S281). In the present embodiment, as in the second embodiment, the state detection unit 155 detects the user and the surrounding state. Here, as described above, it is preferable to switch the target values Xa to Xd. These target values are preferably stored in the storage unit 5 in advance, for example, and when each state is detected by the state detection unit 155, the target value is preferably changed.
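The target selection in S281 can be pictured as a lookup from the detected state to a stored target value. In the sketch below, the state names follow the four states named later in this description (conversation, walking, gaze, normal), and the ordering of the values follows the appended notes (conversation smaller than walking, walking smaller than the default, gaze larger than the default). The decibel numbers themselves and all identifiers are hypothetical and are not taken from the source.

```python
# Hypothetical target values for the sound pressure level difference (dB).
# Only the ordering conversation < walking < normal < gaze reflects the text;
# the numbers are placeholders chosen for illustration.
TARGET_DB = {
    "conversation": 0.0,
    "walking": 3.0,
    "normal": 6.0,
    "gaze": 10.0,
}


def select_target(state):
    """Return the target level difference X for a detected state.

    Unknown states fall back to the normal-state target (an assumption).
    """
    return TARGET_DB.get(state, TARGET_DB["normal"])
```

Keeping the values in a table mirrors the idea of storing the targets in advance and swapping them whenever the detected state changes.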

収音部４１は、環境音情報を頭部デバイス１３０から取得する（Ｓ２８２）。立体音響処理部１５１は、音声取得部４３からの音声に対し、頭部デバイス１３０で得られた頭部などの位置姿勢に応じて左右の音声の再生時刻や周波数特性を調整するなど、立体音響処理を行い、重畳比算出部４５に出力する情報音を算出する（Ｓ２８３）。このとき、立体音響処理は、位置姿勢推定部１５７で推定された位置姿勢及び状態検出部１５５で検出されたユーザの状態などに応じて、所望の位置から仮想的に発生する情報音を生成する処理としてもよい。このとき考慮されるユーザの状態の一つとして、第２の実施の形態において説明した注視検出を行うようにしてもよい。音声取得部４３は、情報音の現在音圧レベル値を不図示のバッファに格納する。収音部４１は、環境音の現在のレベル値を不図示のバッファに格納する（Ｓ２８４）。 The sound collection unit 41 acquires environmental sound information from the head device 130 (S282). The stereophonic sound processing unit 151 performs stereophonic sound processing on the sound from the sound acquisition unit 43, for example adjusting the playback timing and frequency characteristics of the left and right channels according to the position and orientation of the head obtained by the head device 130, and calculates the information sound to be output to the superposition ratio calculation unit 45 (S283). The stereophonic sound processing may generate an information sound that virtually originates from a desired position, according to the position and orientation estimated by the position and orientation estimation unit 157 and the user state detected by the state detection unit 155. The gaze detection described in the second embodiment may be performed as one of the user states considered here. The sound acquisition unit 43 stores the current sound pressure level of the information sound in a buffer (not shown). The sound collection unit 41 stores the current level of the environmental sound in a buffer (not shown) (S284).

ここで、各時刻で重畳バランスの算出を行う前に、代表値切替部２５１は、マイク３４で取得した過去一定時間の環境音の音圧レベルの時系列データを分析し、データの分布が正規分布に近いかどうかを判定する（Ｓ２８５）。判定方法としては、時系列データの歪度や尖度を用いるジャック−ベラ検定等の検定方法を用いる。正規分布に近いと判定された場合は（Ｓ２８５：ＹＥＳ）、代表値切替部２５１は、代表値に移動平均を用いる（Ｓ２８６）。正規分布に近いと判定されなかった場合には（Ｓ２８５：ＮＯ）、代表値切替部２５１は、代表値に中央値を用いる（Ｓ２８７）。 Here, before the superposition balance is calculated at each time, the representative value switching unit 251 analyzes the time-series data of the sound pressure level of the environmental sound acquired by the microphone 34 over a past fixed period and determines whether the distribution of the data is close to a normal distribution (S285). For this determination, a statistical test such as the Jarque-Bera test, which uses the skewness and kurtosis of the time-series data, is used. If the distribution is determined to be close to a normal distribution (S285: YES), the representative value switching unit 251 uses the moving average as the representative value (S286). If it is not determined to be close to a normal distribution (S285: NO), the representative value switching unit 251 uses the median as the representative value (S287).
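The switching in S285 to S287 can be sketched with a hand-written Jarque-Bera statistic, JB = n/6 * (skewness^2 + (kurtosis - 3)^2 / 4), compared against a chi-square distribution with two degrees of freedom, whose survival function is exp(-JB/2). The function names, the 5% significance level, and the handling of a constant signal are assumptions for this example; the description only says that a test of this kind is used. The chi-square approximation is asymptotic, so it is rough for short windows.

```python
import math
from statistics import mean, median


def jarque_bera(samples):
    """Return (JB statistic, asymptotic p-value) for a list of numbers."""
    n = len(samples)
    m = mean(samples)
    m2 = sum((x - m) ** 2 for x in samples) / n
    if m2 == 0:
        # Degenerate case (all samples equal): mean and median coincide,
        # so report "no evidence against normality" by convention here.
        return 0.0, 1.0
    m3 = sum((x - m) ** 3 for x in samples) / n
    m4 = sum((x - m) ** 4 for x in samples) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of chi-square with 2 degrees of freedom.
    return jb, math.exp(-jb / 2.0)


def choose_representative(env_levels, alpha=0.05):
    """Pick the moving average if normality is not rejected, else the median."""
    _, p_value = jarque_bera(env_levels)
    if p_value >= alpha:
        return "moving_average", mean(env_levels)
    return "median", median(env_levels)
```

A symmetric window such as `[48, 49, 49, 50, 50, 50, 50, 51, 51, 52]` is not rejected and yields the moving average, whereas nineteen samples at 50 dB plus one at 90 dB are strongly skewed and yield the median.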

音声取得部４３は、代表値切替部２５１で設定された算出方法に基づき、情報代表値Ｌｃ、および環境代表値Ｌｂを算出する（Ｓ２８８）。ここで、各代表値を求める区間を決める時間Ｔは、第１の実施の形態と同様に決定されることが好ましい。 The voice acquisition unit 43 calculates the information representative value Lc and the environment representative value Lb based on the calculation method set by the representative value switching unit 251 (S288). Here, it is preferable that the time T for determining the section for obtaining each representative value is determined in the same manner as in the first embodiment.

図１９に示すように、重畳比算出部４５は、情報代表値Ｌｃと環境代表値Ｌｂとの差分Ｙを求める（Ｓ２８９）。重畳比算出部４５は、目標値Ｘと差分Ｙとの差分Ｚを求める（Ｓ２９０）。さらに重畳比算出部４５は、情報音と環境音の配分比のデシベル表現が差分Ｚに一致するように、重畳比Ｓを求める（Ｓ２９１）。音圧の重畳比Ｓは、上述した式３から求められる。 As shown in FIG. 19, the superposition ratio calculation unit 45 obtains a difference Y between the information representative value Lc and the environment representative value Lb (S289). The superimposition ratio calculation unit 45 obtains a difference Z between the target value X and the difference Y (S290). Further, the superposition ratio calculation unit 45 obtains the superposition ratio S so that the decibel expression of the distribution ratio of the information sound and the environmental sound matches the difference Z (S291). The superposition ratio S of the sound pressures can be obtained from the above equation 3.
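Steps S289 to S291 can be traced numerically as follows. Equation 3 is not reproduced in this section, so the last step below is an assumption: it treats S as the share of the environmental sound in the superimposed sound, treats 1 - S as the share of the information sound, and chooses S so that the decibel value of (1 - S) / S equals Z. The function and variable names are illustrative.

```python
def superposition_ratio(l_c, l_b, x):
    """Sketch of S289-S291 under the stated assumption about S.

    l_c: information representative value Lc (dB)
    l_b: environment representative value Lb (dB)
    x:   target level difference X (dB)
    """
    y = l_c - l_b              # S289: current level difference Y
    z = x - y                  # S290: shortfall Z relative to the target
    g = 10.0 ** (z / 20.0)     # Z dB as an amplitude ratio info:environment
    # S291 (assumed form): (1 - S) / S = g, hence S = 1 / (1 + g).
    return 1.0 / (1.0 + g)
```

If the information sound already exceeds the environmental sound by exactly the target (Z = 0), this form gives S = 0.5, that is, equal weighting; a larger shortfall Z lowers S and so shifts weight toward the information sound.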

重畳処理部４７は、頭部デバイス１３０で取得した環境音と、音声取得部４３で取得した情報音との時系列データに基づき算出された重畳比Ｓにより環境音と情報音とを重畳し、出力音圧を決定する（Ｓ２９２）。出力部４９は、例えば上述した式４により求めた出力音圧により音声信号を出力することにより、頭部デバイス１３０により音声を再生させる（ステップ２９３）。 The superimposition processing unit 47 superimposes the environmental sound and the information sound using the superposition ratio S, which is calculated from the time-series data of the environmental sound acquired by the head device 130 and the information sound acquired by the sound acquisition unit 43, and determines the output sound pressure (S292). The output unit 49 causes the head device 130 to reproduce the sound by outputting an audio signal at the output sound pressure obtained, for example, by Equation 4 described above (S293).
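The superimposition in S292 can be pictured as a weighted sum of the two signals. Equation 4 is not reproduced in this section, so the sample-wise form below, with weight S on the environmental sound and 1 - S on the information sound, is an assumption for illustration only, as are the function and parameter names.

```python
def superimpose(env_samples, info_samples, s):
    """Mix two equally long sample sequences with ratio s (assumed form).

    s is the share given to the environmental sound, 1 - s the share
    given to the information sound.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("superposition ratio must lie in [0, 1]")
    return [s * e + (1.0 - s) * i for e, i in zip(env_samples, info_samples)]
```

For example, `superimpose([1.0, 0.0], [0.0, 1.0], 0.25)` weights the environmental sample by 0.25 and the information sample by 0.75.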

音声処理装置２５０では、例えば入力部２３などから終了指示がない場合には（Ｓ２９４:ＮＯ）、Ｓ２８１から処理を繰り返し、終了指示があった場合には（Ｓ２９４:ＹＥＳ）、処理を終了する。 In the audio processing device 250, for example, when there is no end instruction from the input unit 23 or the like (S294: NO), the process is repeated from S281, and when there is an end instruction (S294: YES), the process ends.

以上説明したように、第３の実施の形態による音声処理システムによれば、収音部４１は、環境音を収音し、音声取得部４３は、情報音を取得する。状態検出部１５５は、頭部デバイス１３０で検出された情報に基づき、例えば、会話状態、歩行状態、注視状態を検出する。重畳比算出部４５は、状態検出部１５５で検出された状態に応じて、目標値Ｘａ〜Ｘｄのいずれかを目標値として設定する。 As described above, according to the sound processing system according to the third embodiment, the sound collection unit 41 collects environmental sounds, and the sound acquisition unit 43 acquires information sounds. The state detection unit 155 detects, for example, a conversation state, a walking state, and a gaze state based on the information detected by the head device 130. The superimposition ratio calculation unit 45 sets any one of the target values Xa to Xd as the target value according to the state detected by the state detection unit 155.

本実施の形態では、代表値切替部２５１は、過去の環境音の時系列データを解析し、時系列データの分布が正規分布に近い場合には、代表値として移動平均を用いる。このとき、環境代表値Ｌｂを移動平均により求めるが、例えば情報代表値Ｌｃは、予め定められた方法で求めるようにしてもよい。 In the present embodiment, the representative value switching unit 251 analyzes time series data of past environmental sounds, and uses a moving average as a representative value when the distribution of the time series data is close to a normal distribution. At this time, the environmental representative value Lb is obtained by a moving average. For example, the information representative value Lc may be obtained by a predetermined method.

重畳比算出部４５は、環境代表値Ｌｂと、情報代表値Ｌｃとの差が、予め決められた目標値Ｘとなるように、重畳比Ｓを算出する。重畳処理部４７は、算出された重畳比Ｓに基づき音声を合成する。出力部４９は、合成された音声を出力する。このとき、目標値Ｘを、注視状態の場合には、立体音響処理部１５１により、注視していると推定される物体から情報音が発生しているように音響処理を行うこともできる。 The superposition ratio calculation unit 45 calculates the superposition ratio S so that the difference between the environment representative value Lb and the information representative value Lc becomes the predetermined target value X. The superimposition processing unit 47 synthesizes the sound based on the calculated superposition ratio S. The output unit 49 outputs the synthesized sound. In the gaze state, the stereophonic sound processing unit 151 can also perform acoustic processing so that the information sound appears to come from the object that the user is estimated to be gazing at.

以上のように、第３の実施の形態による音声処理システムでは、第２の実施の形態による音声処理システム１００が奏する効果に加え、周囲の音環境により適した方法に切替えて音圧レベルの代表値を算出することが可能になる。よって、例えば、通常は突発音に大きく左右されないように中央値を用いるが、突発音がほとんどなく雑音がホワイトノイズに近い環境に移動した際に、より安定的な移動平均に自動的に切り替える等、より柔軟な対応が可能になる。また、時々刻々と変わる環境に対応する一方で、突発的な環境音の変化に過敏に反応してバランスが大きく変更されることがないようにする効果がある。 As described above, in addition to the effects of the voice processing system 100 according to the second embodiment, the voice processing system according to the third embodiment can calculate the representative value of the sound pressure level by switching to the method better suited to the surrounding sound environment. For example, the median is normally used so that the result is not strongly affected by sudden sounds, but when the user moves to an environment with almost no sudden sounds and with noise close to white noise, the system can automatically switch to the more stable moving average, allowing a more flexible response. In addition, while following an environment that changes from moment to moment, the system avoids large changes in balance caused by overreacting to sudden changes in environmental sound.

（変形例） (Modification)
以下、変形例による音声処理システム２４０について説明する。変形例は、例えば、第１から第３の実施の形態による音声処理システムの変形例である。音声処理システム２４０は、音声処理装置２４２及びマイクデバイス３０を有している。音声処理装置２４２は、音声処理システム１の音声処理装置２に、音声処理システム１００の頭部デバイス１３０が有する一部の機能等を追加した例である。本変形例において、第１から第３の実施の形態と同様の構成及び動作については同一番号を付し、重複説明を省略する。 Hereinafter, the voice processing system 240 according to the modification will be described. The modification is, for example, a modification of the voice processing system according to the first to third embodiments. The voice processing system 240 includes a voice processing device 242 and a microphone device 30. The voice processing device 242 is an example in which some of the functions of the head device 130 of the voice processing system 100 are added to the voice processing device 2 of the voice processing system 1. In the present modification, the same configurations and operations as those in the first to third embodiments are denoted by the same reference numerals, and redundant description is omitted.

図２０は、音声処理システム２４０のハードウエア構成の一例を示す図である。音声処理装置２４２は、情報音を環境音と重畳して出力する装置であり、第１から第３の実施の形態による音声処理装置２、２０と同様に、演算処理装置３、記憶部５、入力部２３、表示部２５、音声入出力部１５を有している。 FIG. 20 is a diagram illustrating an example of a hardware configuration of the voice processing system 240. The sound processing device 242 is a device that outputs information sound superimposed on the environmental sound, and similarly to the sound processing devices 2 and 20 according to the first to third embodiments, the arithmetic processing device 3, the storage unit 5, An input unit 23, a display unit 25, and a voice input / output unit 15 are provided.

音声処理装置２４２は、さらに、通信部１１、アンテナ１３、加速度センサ２４５、ジャイロセンサ２４７を有している。通信部１１は、音声処理装置２４２の外部との情報の送受信の処理を行う。アンテナ１３は、無線により電磁波を送受信する。加速度センサ２４５は、音声処理装置２４２の加速度を検出する。加速度センサ２４５は、例えば３次元加速度センサとすることができる。ジャイロセンサ２４７は、音声処理装置２４２の角度を検出する。音声処理装置２４２の機能構成は、第２の実施の形態による音声処理装置２０または第３の実施の形態による音声処理装置２５０と同様とすることができる。 The voice processing device 242 further includes a communication unit 11, an antenna 13, an acceleration sensor 245, and a gyro sensor 247. The communication unit 11 performs transmission / reception processing of information with the outside of the audio processing device 242. The antenna 13 transmits and receives electromagnetic waves wirelessly. The acceleration sensor 245 detects the acceleration of the voice processing device 242. The acceleration sensor 245 can be, for example, a three-dimensional acceleration sensor. The gyro sensor 247 detects the angle of the sound processing device 242. The functional configuration of the voice processing device 242 can be the same as that of the voice processing device 20 according to the second embodiment or the voice processing device 250 according to the third embodiment.

本変形例では、加速度センサ２４５、ジャイロセンサ２４７による検出結果に基づき、音声処理装置２０と同様にユーザ１１０の歩行状態を検出することができる。また、マイクデバイス３０のマイク３４で収音された結果に基づき、音声処理装置２０または音声処理装置２５０と同様に会話状態を検出することができる。さらに、通信部１１、アンテナ１３を介して、ＧｌｏｂａｌＰｏｓｉｔｉｏｎｉｎｇＳｙｓｔｅｍ（ＧＰＳ）を利用して、自己の位置を取得することができる。さらに、加速度センサ２４５、ジャイロセンサ２４７による検出結果を利用することにより、第２または第３の実施の形態と同様に注視状態の判別も行うことができる。 In the present modification, the walking state of the user 110 can be detected based on the detection results of the acceleration sensor 245 and the gyro sensor 247 as in the case of the voice processing device 20. Further, the conversation state can be detected in the same manner as the voice processing device 20 or the voice processing device 250 based on the result of the sound picked up by the microphone 34 of the microphone device 30. Furthermore, it is possible to acquire its own position using the Global Positioning System (GPS) via the communication unit 11 and the antenna 13. Furthermore, by using the detection results obtained by the acceleration sensor 245 and the gyro sensor 247, the gaze state can be determined as in the second or third embodiment.

よって、第２または第３の実施の形態による音声処理システムと同様に、音声処理システム２４０は、ユーザ１１０の状態に適した重畳比で環境音と情報音とを重畳して出力することが可能である。 Therefore, like the voice processing system according to the second or third embodiment, the voice processing system 240 can superimpose and output the environmental sound and the information sound at a superposition ratio suited to the state of the user 110.

以上説明したように、変形例による音声処理システム２４０によれば、第２または第３の実施の形態による音声処理装置２０、または音声処理装置２５０と同様の効果を奏することができる。さらに、この構成を用いれば、赤外線発生装置１２５は不要となるので、ＧＰＳが利用可能な場所であれば、音声処理システム２４０を利用することができる。 As described above, according to the audio processing system 240 according to the modification, the same effects as those of the audio processing device 20 or the audio processing device 250 according to the second or third embodiment can be obtained. Furthermore, since this configuration eliminates the need for the infrared generator 125, the voice processing system 240 can be used wherever GPS is available.

ここで、上記第１から第３の実施の形態及び変形例による音声処理方法の動作をコンピュータに行わせるために共通に適用されるコンピュータの例について説明する。図２１は、標準的なコンピュータのハードウエア構成の一例を示すブロック図である。図２１に示すように、コンピュータ３００は、ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ（ＣＰＵ）３０２、メモリ３０４、入力装置３０６、出力装置３０８、外部記憶装置３１２、媒体駆動装置３１４、ネットワーク接続装置等がバス３１０を介して接続されている。 Here, an example of a computer that can be commonly used to perform the operations of the voice processing methods according to the first to third embodiments and the modification will be described. FIG. 21 is a block diagram illustrating an example of the hardware configuration of a standard computer. As shown in FIG. 21, in the computer 300, a Central Processing Unit (CPU) 302, a memory 304, an input device 306, an output device 308, an external storage device 312, a medium driving device 314, a network connection device, and the like are connected via a bus 310.

ＣＰＵ３０２は、コンピュータ３００全体の動作を制御する演算処理装置である。メモリ３０４は、コンピュータ３００の動作を制御するプログラムを予め記憶したり、プログラムを実行する際に必要に応じて作業領域として使用したりするための記憶部である。メモリ３０４は、例えばＲＡＭ、ＲＯＭ等である。入力装置３０６は、コンピュータの使用者により操作されると、その操作内容に対応付けられている使用者からの各種情報の入力を取得し、取得した入力情報をＣＰＵ３０２に送付する装置であり、例えばキーボード装置、マウス装置などである。出力装置３０８は、コンピュータ３００による処理結果を出力する装置であり、表示装置などが含まれる。例えば表示装置は、ＣＰＵ３０２により送付される表示データに応じてテキストや画像を表示する。 The CPU 302 is an arithmetic processing unit that controls the operation of the entire computer 300. The memory 304 is a storage unit that stores in advance a program for controlling the operation of the computer 300 and is used as a work area as needed when the program is executed. The memory 304 is, for example, a RAM or a ROM. The input device 306 is a device, such as a keyboard or a mouse, that, when operated by the user of the computer, acquires the various information input by the user that is associated with the operation and sends the acquired input information to the CPU 302. The output device 308 is a device that outputs the results of processing by the computer 300, and includes a display device and the like. For example, the display device displays text and images according to display data sent by the CPU 302.

外部記憶装置３１２は、例えば、ハードディスクなどの記憶装置であり、ＣＰＵ３０２により実行される各種制御プログラムや、取得したデータ等を記憶しておく装置である。媒体駆動装置３１４は、可搬記録媒体３１６に書き込みおよび読み出しを行うための装置である。ＣＰＵ３０２は、可搬記録媒体３１６に記録されている所定の制御プログラムを、媒体駆動装置３１４を介して読み出して実行することによって、各種の制御処理を行うようにすることもできる。可搬記録媒体３１６は、例えばＣｏｍｐａｃｔＤｉｓｃ（ＣＤ）−ＲＯＭ、ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ（ＤＶＤ）、ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ（ＵＳＢ）メモリ等である。ネットワーク接続装置３１８は、有線または無線により外部との間で行われる各種データの授受の管理を行うインタフェース装置である。バス３１０は、上記各装置等を互いに接続し、データのやり取りを行う通信経路である。 The external storage device 312 is a storage device such as a hard disk, and stores various control programs executed by the CPU 302, acquired data, and the like. The medium driving device 314 is a device for writing to and reading from the portable recording medium 316. The CPU 302 can perform various control processes by reading and executing a predetermined control program recorded on the portable recording medium 316 via the medium driving device 314. The portable recording medium 316 is, for example, a Compact Disc (CD) -ROM, a Digital Versatile Disc (DVD), a Universal Serial Bus (USB) memory, or the like. The network connection device 318 is an interface device that manages transmission / reception of various data performed between the outside by wired or wireless. A bus 310 is a communication path for connecting the above devices and the like to exchange data.

上記第１から第３の実施の形態及び変形例による音声処理方法をコンピュータに実行させるプログラムは、例えば外部記憶装置３１２に記憶させる。ＣＰＵ３０２は、外部記憶装置３１２からプログラムを読み出し、コンピュータ３００に音声処理の動作を行なわせる。このとき、まず、音声処理の処理をＣＰＵ３０２に行わせるための制御プログラムを作成して外部記憶装置３１２に記憶させておく。そして、入力装置３０６から所定の指示をＣＰＵ３０２に与えて、この制御プログラムを外部記憶装置３１２から読み出させて実行させるようにする。また、このプログラムは、可搬記録媒体３１６に記憶するようにしてもよい。 A program that causes a computer to execute the sound processing methods according to the first to third embodiments and the modifications is stored in, for example, the external storage device 312. The CPU 302 reads a program from the external storage device 312 and causes the computer 300 to perform an audio processing operation. At this time, first, a control program for causing the CPU 302 to perform voice processing is created and stored in the external storage device 312. Then, a predetermined instruction is given from the input device 306 to the CPU 302 so that the control program is read from the external storage device 312 and executed. The program may be stored in the portable recording medium 316.

なお、本発明は、以上に述べた実施の形態に限定されるものではなく、本発明の要旨を逸脱しない範囲内で種々の構成または実施形態を採ることができる。例えば、状態検出部１５５が検出するユーザの状態は、上記４つの状態（会話状態、歩行状態、注視状態、通常状態）に限定されない。また、４つの状態のうちのいくつかのみを検出可能な音声処理システムとしてもよい。ユーザの状態の検出方法も上記に限定されない。同様の状態が検出できれば、別の方法を採用することもできる。例えば、頭部デバイス１３０に地磁気センサを設置し、地磁気センサの検出結果に基づき、頭部デバイス１３０の姿勢を推定するようにしてもよい。 The present invention is not limited to the embodiments described above, and various configurations or embodiments can be adopted without departing from the gist of the present invention. For example, the user state detected by the state detection unit 155 is not limited to the above four states (a conversation state, a walking state, a gaze state, and a normal state). Moreover, it is good also as a speech processing system which can detect only some of four states. The user state detection method is not limited to the above. If a similar state can be detected, another method can be adopted. For example, a geomagnetic sensor may be installed in the head device 130, and the posture of the head device 130 may be estimated based on the detection result of the geomagnetic sensor.

情報音は、音声処理装置２等に予め記憶しておくようにしたが、音声処理装置２等と通信可能な別の情報処理装置から取得する等、変形は可能である。赤外線情報１７５、注視対象情報１８０などについても、別の情報処理装置から取得するようにしてもよい。また、ユーザ１１０が携帯可能な音声処理装置で、音声の再生や環境音、ユーザ状態の取得のみを行い、その他の処理を別の情報処理装置で行う、などの変形も可能である。 Although the information sound is described as being stored in advance in the voice processing device 2 or the like, modifications are possible, such as acquiring it from another information processing device that can communicate with the voice processing device 2 or the like. The infrared information 175, the gaze target information 180, and the like may also be acquired from another information processing device. A further possible modification is for the voice processing device carried by the user 110 to perform only sound reproduction and the acquisition of the environmental sound and the user state, with the other processing performed by another information processing device.

以上の実施形態に関し、さらに以下の付記を開示する。
（付記１）
環境音を収音する収音部と、
提供する情報の情報音を取得する音声取得部と、
前記情報音の音圧レベルの時系列データの第１の代表値と前記環境音の音圧レベルの時系列データの第２の代表値との差と、第１の所定値との差を補うような、前記情報音と前記環境音とを重畳させた重畳音の音圧に対する前記環境音の音圧の比を示す重畳比を算出する重畳比算出部と、
前記重畳比に基づき前記情報音と前記環境音とを重畳する処理を行なう重畳処理部と、
前記重畳する処理が行われた音声信号を出力する出力部と、
を有することを特徴とする音声処理装置。
（付記２）
前記第１の代表値及び前記第２の代表値は、それぞれの音圧レベルの時系列データの移動平均、加重平均、中央値のいずれかであることを特徴とする付記１に記載の音声処理装置。
（付記３）
ユーザの状態を検出する状態検出部
をさらに有し、
前記状態検出部は、前記環境音に会話が含まれているか否かを検出し、
前記重畳比算出部は、前記状態検出部が前記会話を検出した場合には、前記第１の所定値に代えて、前記第１の所定値よりも小さい第２の所定値に基づき前記重畳比を算出する
ことを特徴とする付記１または付記２に記載の音声処理装置。
（付記４）
ユーザの状態を検出する状態検出部
をさらに有し、
前記状態検出部は、前記ユーザが歩行状態であるか否かを検出し、
前記重畳比算出部は、前記状態検出部が前記歩行状態を検出した場合には、前記第１の所定値に代えて、前記第１の所定値よりも小さく、前記状態検出部が前記環境音に会話が含まれていることを検出した場合の第２の所定値よりも大きい第３の所定値に基づき前記重畳比を算出する
ことを特徴とする付記１または付記２に記載の音声処理装置。
（付記５）
ユーザの状態を検出する状態検出部と
前記情報音と関連する対象物の位置を取得する対象位置取得部、
をさらに有し、
前記状態検出部は、前記ユーザが前記対象物の位置を注視しているか否かを検出し、
前記重畳比算出部は、前記状態検出部が前記対象物の位置を注視している状態を検出した場合には、前記第１の所定値よりも大きい第４の所定値に基づき前記重畳比を算出する
ことを特徴とする付記１または付記２に記載の音声処理装置。
（付記６）
前記環境音の過去一定時間の分布が正規分布に近いと判別された場合には、前記第１の代表値及び前記第２の代表値は、それぞれの音圧レベルの時系列データの移動平均とし、そうでない場合には中央値とする代表値切替部
をさらに有することを特徴とする付記１から付記５のいずれかに記載の音声処理装置。
（付記７）
音声処理装置が、
環境音を収音し、
提供する情報の情報音を取得し、
前記情報音の音圧レベルの時系列データの第１の代表値と前記環境音の音圧レベルの時系列データの第２の代表値との差と、第１の所定値との差を補うような、前記情報音と前記環境音とを重畳させた重畳音の音圧に対する前記環境音の音圧の比を示す重畳比を算出し、
前記重畳比に基づき前記情報音と前記環境音とを重畳し、
前記重畳する処理が行われた音声信号を出力する、
ことを特徴とする音声処理方法。
（付記８）
前記第１の代表値及び前記第２の代表値は、それぞれの音圧レベルの時系列データの移動平均、加重平均、または中央値であることを特徴とする付記９に記載の音声処理方法。
（付記９）
会話を検出した場合には、前記第１の所定値に代えて、前記第１の所定値よりも小さい第２の所定値に基づき前記重畳比を算出する
ことを特徴とする付記７または付記８に記載の音声処理方法。
（付記１０）
歩行状態を検出した場合には、前記第１の所定値に代えて、前記第１の所定値よりも小さく、前記状態検出部が前記環境音に会話が含まれていることを検出した場合の第２の所定値よりも大きい第３の所定値に基づき前記重畳比を算出する
ことを特徴とする付記７または付記８に記載の音声処理方法。
（付記１１）
前記情報音と関連する対象物の位置を取得し、
ユーザが前記対象物の位置を注視しているか否かを検出し、
前記対象物の位置が注視されている状態を検出した場合には、前記第１の所定値よりも大きい第４の所定値に基づき前記重畳比を算出する
ことを特徴とする付記７または付記８に記載の音声処理装置。
（付記１２）
前記環境音の過去一定時間の分布が正規分布に近いと判別された場合には、前記第１の代表値及び前記第２の代表値は、それぞれの音圧レベルの時系列データの移動平均とし、そうでない場合には中央値とする
をさらに有することを特徴とする付記７または付記８に記載の音声処理装置。
（付記１３）
環境音を収音し、
提供する情報の情報音を取得し、
前記情報音の音圧レベルの時系列データの第１の代表値と前記環境音の音圧レベルの時系列データの第２の代表値との差と、第１の所定値との差を補うような、前記情報音と前記環境音とを重畳させた重畳音の音圧に対する前記環境音の音圧の比を示す重畳比を算出し、
前記重畳比に基づき前記情報音と前記環境音とを重畳し、
前記重畳する処理が行われた音声信号を出力する、
処理をコンピュータに実行させるプログラム。
（付記１４）
前記第１の代表値及び前記第２の代表値は、それぞれの音圧レベルの時系列データの移動平均、加重平均、または中央値であることを特徴とする付記１３に記載のプログラム。
（付記１５）
会話を検出した場合には、前記第１の所定値に代えて、前記第１の所定値よりも小さい第２の所定値に基づき前記重畳比を算出する
ことを特徴とする付記１３または付記１４に記載のプログラム。
Regarding the above embodiments, the following additional notes are further disclosed.
(Appendix 1)
A speech processing apparatus comprising:
a sound collection unit that collects environmental sound;
an audio acquisition unit that acquires an information sound of information to be provided;
a superposition ratio calculation unit that calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, so as to compensate for the difference between a first predetermined value and the difference between a first representative value of time-series data of the sound pressure level of the information sound and a second representative value of time-series data of the sound pressure level of the environmental sound;
a superimposition processing unit that superimposes the information sound and the environmental sound based on the superposition ratio; and
an output unit that outputs the audio signal resulting from the superimposition.
(Appendix 2)
The speech processing apparatus according to appendix 1, wherein each of the first representative value and the second representative value is one of a moving average, a weighted average, and a median of the time-series data of the respective sound pressure level.
(Appendix 3)
The speech processing apparatus according to appendix 1 or appendix 2, further comprising a state detection unit that detects a state of a user, wherein
the state detection unit detects whether the environmental sound includes conversation, and
when the state detection unit detects the conversation, the superposition ratio calculation unit calculates the superposition ratio based on a second predetermined value smaller than the first predetermined value, in place of the first predetermined value.
(Appendix 4)
The speech processing apparatus according to appendix 1 or appendix 2, further comprising a state detection unit that detects a state of a user, wherein
the state detection unit detects whether the user is in a walking state, and
when the state detection unit detects the walking state, the superposition ratio calculation unit calculates the superposition ratio based on a third predetermined value in place of the first predetermined value, the third predetermined value being smaller than the first predetermined value and larger than the second predetermined value used when the state detection unit detects that the environmental sound includes conversation.
(Appendix 5)
The speech processing apparatus according to appendix 1 or appendix 2, further comprising a state detection unit that detects a state of a user and a target position acquisition unit that acquires the position of an object related to the information sound, wherein
the state detection unit detects whether the user is gazing at the position of the object, and
when the state detection unit detects that the user is gazing at the position of the object, the superposition ratio calculation unit calculates the superposition ratio based on a fourth predetermined value larger than the first predetermined value.
(Appendix 6)
The speech processing apparatus according to any one of appendix 1 to appendix 5, further comprising a representative value switching unit that sets each of the first representative value and the second representative value to the moving average of the time-series data of the respective sound pressure level when the distribution of the environmental sound over a past fixed period is determined to be close to a normal distribution, and to the median otherwise.
(Appendix 7)
A speech processing method in which a speech processing apparatus:
collects environmental sound;
acquires an information sound of information to be provided;
calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, so as to compensate for the difference between a first predetermined value and the difference between a first representative value of time-series data of the sound pressure level of the information sound and a second representative value of time-series data of the sound pressure level of the environmental sound;
superimposes the information sound and the environmental sound based on the superposition ratio; and
outputs the audio signal resulting from the superimposition.
(Appendix 8)
The speech processing method according to appendix 9, wherein the first representative value and the second representative value are a moving average, a weighted average, or a median value of time series data of each sound pressure level.
(Appendix 9)
The speech processing method according to appendix 7 or appendix 8, wherein, when conversation is detected, the superposition ratio is calculated based on a second predetermined value smaller than the first predetermined value, in place of the first predetermined value.
(Appendix 10)
The speech processing method according to appendix 7 or appendix 8, wherein, when a walking state is detected, the superposition ratio is calculated based on a third predetermined value in place of the first predetermined value, the third predetermined value being smaller than the first predetermined value and larger than the second predetermined value used when the state detection unit detects that the environmental sound includes conversation.
(Appendix 11)
The voice processing method according to appendix 7 or appendix 8, further comprising:
obtaining the position of an object associated with the information sound; and
detecting whether the user is gazing at the position of the object,
wherein, when a state in which the user is gazing at the position of the object is detected, the superposition ratio is calculated based on a fourth predetermined value larger than the first predetermined value.
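The state-dependent predetermined values described in these appendices can be read as a simple lookup. The Python sketch below is only an illustration: the decibel defaults, the class `TargetDifferences`, the function `select_target`, and the priority applied when several states are detected at once are all assumptions. Only the ordering second < third < first < fourth comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetDifferences:
    """Target level differences; the numeric defaults are hypothetical."""
    first: float = 10.0    # default state
    second: float = 0.0    # conversation detected
    third: float = 5.0     # walking detected
    fourth: float = 15.0   # user gazing at the object

    def __post_init__(self):
        # Ordering stated in the text: second < third < first < fourth.
        if not (self.second < self.third < self.first < self.fourth):
            raise ValueError("expected second < third < first < fourth")

def select_target(t, conversation=False, walking=False, gazing=False):
    """Pick the predetermined value for the detected state.
    The priority among simultaneous states is an assumption of this sketch."""
    if conversation:
        return t.second
    if walking:
        return t.third
    if gazing:
        return t.fourth
    return t.first
```

The selected value would then replace the first predetermined value when the superposition ratio is calculated.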
(Appendix 12)
The voice processing method according to appendix 7 or appendix 8, wherein the first representative value and the second representative value are set to moving averages of the time series data of the respective sound pressure levels when the distribution of the environmental sound over a certain past period is determined to be close to a normal distribution, and to median values otherwise.
(Appendix 13)
A program that causes a computer to execute processing comprising:
collecting environmental sound;
acquiring an information sound of information to be provided;
calculating a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
superimposing the information sound and the environmental sound based on the superposition ratio; and
outputting an audio signal resulting from the superimposition.
(Appendix 14)
The program according to appendix 13, wherein the first representative value and the second representative value are a moving average, a weighted average, or a median value of time series data of each sound pressure level.
(Appendix 15)
The program according to appendix 13 or appendix 14, wherein, when a conversation is detected, the superposition ratio is calculated based on a second predetermined value smaller than the first predetermined value instead of the first predetermined value.

DESCRIPTION OF SYMBOLS
1 Voice processing system
2 Voice processing apparatus
3 Arithmetic processing unit
5 Storage unit
7 ROM
9 RAM
11 Communication unit
13 Antenna
15 Audio input/output unit
23 Input unit
25 Display unit
30 Microphone device
32 Earphone
34 Microphone

Claims

A voice processing apparatus comprising:
a sound collection unit that collects environmental sound;
an audio acquisition unit that acquires an information sound of information to be provided;
a state detection unit that detects whether or not the environmental sound includes a conversation;
a superposition ratio calculation unit that calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value, and that, when the state detection unit detects a state in which the environmental sound includes a conversation, calculates the superposition ratio based on a second predetermined value smaller than the first predetermined value instead of the first predetermined value;
a superimposition processing unit that superimposes the information sound and the environmental sound based on the superposition ratio; and
an output unit that outputs an audio signal resulting from the superimposition.

A voice processing apparatus comprising:
a sound collection unit that collects environmental sound;
an audio acquisition unit that acquires an information sound of information to be provided;
a state detection unit that detects a state in which the environmental sound includes a conversation and detects whether or not the user is in a walking state;
a superposition ratio calculation unit that calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value, and that, when the state detection unit detects the walking state, calculates the superposition ratio based on a third predetermined value instead of the first predetermined value, the third predetermined value being smaller than the first predetermined value and larger than a second predetermined value that is used when the state detection unit detects a state in which the environmental sound includes a conversation;
a superimposition processing unit that superimposes the information sound and the environmental sound based on the superposition ratio; and
an output unit that outputs an audio signal resulting from the superimposition.

A voice processing apparatus comprising:
a sound collection unit that collects environmental sound;
an audio acquisition unit that acquires an information sound of information to be provided;
a target position acquisition unit that acquires the position of an object related to the information sound;
a state detection unit that detects whether or not the user is gazing at the position of the object;
a superposition ratio calculation unit that calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value, and that, when the state detection unit detects a state in which the user is gazing at the position of the object, calculates the superposition ratio based on a fourth predetermined value larger than the first predetermined value instead of the first predetermined value;
a superimposition processing unit that superimposes the information sound and the environmental sound based on the superposition ratio; and
an output unit that outputs an audio signal resulting from the superimposition.

4. The voice processing apparatus according to any one of the preceding claims, wherein the first representative value and the second representative value are each any one of a moving average, a weighted average, and a median value of the time series data of the respective sound pressure level.

The voice processing apparatus according to any one of claims 1 to 4, further comprising a representative value switching unit that sets the first representative value and the second representative value to moving averages of the time series data of the respective sound pressure levels when the distribution of the environmental sound over a certain past period is determined to be close to a normal distribution, and to median values otherwise.

A voice processing method in which a voice processing apparatus:
collects environmental sound;
acquires an information sound of information to be provided;
detects whether or not the environmental sound includes a conversation;
calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when a state in which the environmental sound includes a conversation is detected, calculates the superposition ratio based on a second predetermined value smaller than the first predetermined value instead of the first predetermined value;
superimposes the information sound and the environmental sound based on the superposition ratio; and
outputs an audio signal resulting from the superimposition.

A voice processing method in which a voice processing apparatus:
collects environmental sound;
acquires an information sound of information to be provided;
detects a state in which the environmental sound includes a conversation;
detects whether or not the user is in a walking state;
calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when the walking state is detected, calculates the superposition ratio based on a third predetermined value instead of the first predetermined value, the third predetermined value being smaller than the first predetermined value and larger than a second predetermined value that is used when a state in which the environmental sound includes a conversation is detected;
superimposes the information sound and the environmental sound based on the superposition ratio; and
outputs an audio signal resulting from the superimposition.

A voice processing method in which a voice processing apparatus:
collects environmental sound;
acquires an information sound of information to be provided;
obtains the position of an object associated with the information sound;
detects whether or not the user is gazing at the position of the object;
calculates a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when a state in which the user is gazing at the position of the object is detected, calculates the superposition ratio based on a fourth predetermined value larger than the first predetermined value instead of the first predetermined value;
superimposes the information sound and the environmental sound based on the superposition ratio; and
outputs an audio signal resulting from the superimposition.

A program that causes a computer to execute processing comprising:
collecting environmental sound;
acquiring an information sound of information to be provided;
detecting whether or not the environmental sound includes a conversation;
calculating a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when a state in which the environmental sound includes a conversation is detected, calculating the superposition ratio based on a second predetermined value smaller than the first predetermined value instead of the first predetermined value;
superimposing the information sound and the environmental sound based on the superposition ratio; and
outputting an audio signal resulting from the superimposition.

A program that causes a computer to execute processing comprising:
collecting environmental sound;
acquiring an information sound of information to be provided;
detecting a state in which the environmental sound includes a conversation;
detecting whether or not the user is in a walking state;
calculating a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when the walking state is detected, calculating the superposition ratio based on a third predetermined value instead of the first predetermined value, the third predetermined value being smaller than the first predetermined value and larger than a second predetermined value that is used when a state in which the environmental sound includes a conversation is detected;
superimposing the information sound and the environmental sound based on the superposition ratio; and
outputting an audio signal resulting from the superimposition.

A program that causes a computer to execute processing comprising:
collecting environmental sound;
acquiring an information sound of information to be provided;
obtaining the position of an object associated with the information sound;
detecting whether or not the user is gazing at the position of the object;
calculating a superposition ratio, which indicates the ratio of the sound pressure of the environmental sound to the sound pressure of a superimposed sound obtained by superimposing the information sound and the environmental sound, such that the difference between a first representative value of time series data of the sound pressure level of the information sound and a second representative value of time series data of the sound pressure level of the environmental sound is corrected to a first predetermined value;
when a state in which the user is gazing at the position of the object is detected, calculating the superposition ratio based on a fourth predetermined value larger than the first predetermined value instead of the first predetermined value;
superimposing the information sound and the environmental sound based on the superposition ratio; and
outputting an audio signal resulting from the superimposition.