JP6569945B2

JP6569945B2 - Binaural sound generator, microphone array, binaural sound generation method, program

Info

Publication number: JP6569945B2
Application number: JP2016023347A
Authority: JP
Inventors: Kenta Niwa; Kazunori Kobayashi; Takanori Nishino
Original assignee: Nippon Telegraph and Telephone Corp; Mie University NUC
Current assignee: Nippon Telegraph and Telephone Corp; Mie University NUC
Priority date: 2016-02-10
Filing date: 2016-02-10
Publication date: 2019-09-04
Anticipated expiration: 2036-02-10
Also published as: JP2017143406A

Description

The present invention relates to binaural sound generation, and in particular to a technique for generating binaural sound from signals picked up by a microphone array having a predetermined three-dimensional shape.

With the recent spread of omnidirectional cameras, there has been active research on virtually generating sound that corresponds to the part of the video the user is looking at. One example is the omnidirectional video/audio viewing system (Non-Patent Document 1). Omnidirectional video is video captured with an omnidirectional camera; it lets the user view the scene as if present at the shooting location.

In the omnidirectional video/audio viewing system, binaural sound corresponding to the video the user is looking at is generated and output by convolving HRTFs (head-related transfer functions) with a group of local sound source signals estimated in a plurality of regions (specifically, regions partitioned by a given angular width). The user wears an HMD (head-mounted display) with a gyro sensor, so that the head direction is acquired in real time. The HRTF convolved with each local sound source signal is then switched according to the acquired head direction, which yields binaural sound matching the user's view in real time. The generated binaural sound is listened to through earphones or headphones.

The HMD may be a simple one, for example a single Fresnel lens combined with a smartphone. Building it around a smartphone makes it easy to view content distributed over a network.

The following describes sound generation in the omnidirectional video/audio viewing system, that is, a system for generating binaural sound corresponding to omnidirectional video.

Assume that an array of M microphones (M is an integer of 1 or more) is placed in a sound field containing K sound sources (K is an integer of 1 or more). Let S_{k,ω,τ} be the k-th source signal (1 ≤ k ≤ K), X_{m,ω,τ} the m-th observed signal (1 ≤ m ≤ M), and A_{m,k,ω} the transfer characteristic between them. The observed signal group x_{ω,τ} is then modeled by the following equation.

Here, ω and τ are the frequency index and the frame-time index, respectively. Also,

where T denotes transposition and N_{m,ω,τ} is the background noise contained in the m-th observed signal.
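Equation (1) and the accompanying vector definitions are not reproduced in this text. A form consistent with the symbols defined above is sketched below; it is offered as a reconstruction under that assumption, not as the published formula, and the vector names a_{k,ω} and n_{ω,τ} are introduced here for illustration.

```latex
\mathbf{x}_{\omega,\tau} = \sum_{k=1}^{K} \mathbf{a}_{k,\omega}\, S_{k,\omega,\tau} + \mathbf{n}_{\omega,\tau},
\qquad
\begin{aligned}
\mathbf{x}_{\omega,\tau} &= [X_{1,\omega,\tau}, \dots, X_{M,\omega,\tau}]^{T},\\
\mathbf{a}_{k,\omega} &= [A_{1,k,\omega}, \dots, A_{M,k,\omega}]^{T},\\
\mathbf{n}_{\omega,\tau} &= [N_{1,\omega,\tau}, \dots, N_{M,\omega,\tau}]^{T}.
\end{aligned}
```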

Next, the generation of binaural sound b_{ω,τ} = [B_{ω,τ}^{(Left)}, B_{ω,τ}^{(Right)}]^T corresponding to the video the user is looking at is described. The user's head direction at frame time τ, in polar coordinates, is written Ψ_τ = [Ψ_τ^{(Hor)}, Ψ_τ^{(Ver)}]^T. Assuming that source directivity and background noise can be ignored, b_{ω,τ} can be output by convolving each source signal with the HRTF between the user's head direction and that source. This is illustrated in FIG. 1.

Here, H_{k,Ψτ,ω}^{(Left)} and H_{k,Ψτ,ω}^{(Right)} are the HRTFs between the k-th sound source and the user's left ear and right ear, respectively.

Because the HRTF does not change drastically between closely spaced source positions, treating the sources within a local region as a single source signal (hereinafter, a local sound source signal) is not expected to affect the user's sound localization much. The omnidirectional video/audio viewing system therefore does not extract individual source signals. Instead it uses direction-wise sound collection, which estimates the local sound source signals in L regions, each having an angular width around a principal-axis direction Θ_j = [Θ_j^{(Hor)}, Θ_j^{(Ver)}]^T (j = 1, …, L) (hereinafter also called local region Θ_j for simplicity). This is illustrated in FIG. 2. For example, the local sound source signal Z_{Θ3,ω,τ} in FIG. 2 corresponds to the third source signal S_{3,ω,τ} and the fourth source signal S_{4,ω,τ} in FIG. 1. The specific method of direction-wise sound collection is described later.

Assuming that the sources arriving from the region with angular width around Θ_j = [Θ_j^{(Hor)}, Θ_j^{(Ver)}]^T have been separated from those arriving from the other regions, and that the local sound source signals Z_{Θj,ω,τ} (j = 1, …, L) have been estimated, the binaural sound b_{ω,τ} corresponding to the video the user is looking at is generated virtually by the following equations.

Here, H_{Θj,Ψτ,ω}^{(Left)} and H_{Θj,Ψτ,ω}^{(Right)} are the HRTFs between the principal-axis direction of the j-th region and the user's left ear and right ear, respectively. HRTFs are known to vary with the reverberation time of the sound field, with individual differences in the physical structure of the head and ears, and with the distance between source and listener; these effects are assumed negligible here, which is why the simplified notation is used. The simplified HRTFs are obtained by selecting the nearest-direction HRTF from a database recorded in advance with a HATS (head and torso simulator) placed in a low-reverberation environment and loudspeakers arranged at discrete positions.
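The resynthesis step described above (multiply each local sound source signal by the nearest-direction HRTF pair and sum over regions) can be sketched as follows for a single frequency bin and frame. This is an illustration under stated assumptions, not the patented implementation: the function names, the array layouts, and the use of a plain azimuth/elevation difference as the source direction relative to the head are all hypothetical.

```python
import numpy as np

def nearest_direction_index(db_dirs_deg, target_deg):
    """Index of the database direction closest to target (smallest great-circle angle)."""
    db = np.radians(np.asarray(db_dirs_deg, dtype=float))   # (P, 2): azimuth, elevation
    tg = np.radians(np.asarray(target_deg, dtype=float))    # (2,): azimuth, elevation
    # Cosine of the angle between two directions given as azimuth/elevation.
    cos_angle = (np.sin(db[:, 1]) * np.sin(tg[1])
                 + np.cos(db[:, 1]) * np.cos(tg[1]) * np.cos(db[:, 0] - tg[0]))
    return int(np.argmax(cos_angle))

def synthesize_binaural(z, region_dirs_deg, head_dir_deg, db_dirs_deg, db_hrtf):
    """Sum HRTF-weighted local source spectra at one (frequency, frame) bin.

    z:               (L,) complex local sound source signals Z_j
    region_dirs_deg: (L, 2) principal-axis directions of the regions
    head_dir_deg:    (2,) head direction from the HMD
    db_hrtf:         (P, 2) complex HRTFs (left, right) at this frequency bin
    Returns b = [B_left, B_right].
    """
    b = np.zeros(2, dtype=complex)
    for z_j, theta_j in zip(z, region_dirs_deg):
        # Assumption: relative direction is the simple angle difference.
        rel = np.asarray(theta_j, dtype=float) - np.asarray(head_dir_deg, dtype=float)
        rel[0] = (rel[0] + 180.0) % 360.0 - 180.0   # wrap azimuth to [-180, 180)
        idx = nearest_direction_index(db_dirs_deg, rel)
        b += db_hrtf[idx] * z_j
    return b
```

Switching the HRTF with head direction then amounts to calling `synthesize_binaural` with the latest `head_dir_deg` on every frame.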

FIG. 3 shows the overall processing flow for generating the binaural sound b_{ω,τ} from the source signal group s_{ω,τ}. The resynthesis process in FIG. 3 corresponds to binaural sound generation using Equations (9) and (10). The user's head direction acquired by the HMD is input at that stage (the user control in FIG. 3).

Next, direction-wise sound collection, which obtains the local sound source signal group z_{ω,τ} = [Z_{Θ1,ω,τ}, …, Z_{ΘL,ω,τ}]^T from the observed signal group x_{ω,τ}, is described. The omnidirectional video/audio viewing system uses direction-wise sound collection by a source enhancement method based on local PSD (power spectral density) estimation.

The reason the system uses direction-wise rather than source-wise sound collection is as follows. When the separated signals are localized and resynthesized to match the video the user is looking at, there is little need to force apart sources that are close together: the HRTFs between such sources and the listener do not differ much, so the listener's sound localization is not greatly affected. Moreover, if sources may move from moment to moment, it is preferable to generate local sound source signals for regions that are partitioned as uniformly as possible.

Let Y_{Θj,ω,τ} (j = 1, …, L) be the signal in which sound arriving from the region around direction Θ_j has been pre-enhanced, for example by applying beamforming to the observed signal group x_{ω,τ} or by picking up sound with a highly directional microphone such as a shotgun microphone. The pre-enhanced signal group is written y_{ω,τ} = [Y_{Θ1,ω,τ}, …, Y_{ΘL,ω,τ}]^T. Generating y_{ω,τ} is the directivity formation process in FIG. 3.

Assuming the source signals are mutually uncorrelated, the PSD φ_{YΘj,ω} of Y_{Θj,ω,τ} is modeled by the following equation.

Here, <·> denotes the expectation, D_{Θj,k,ω} is the average sensitivity of the j-th beamformer or directional pickup to the k-th source, and φ_{Sk,ω} is the PSD of the k-th source.

Assuming that the relationship in Equation (11) also holds between the local sound source signal group z_{ω,τ} and the pre-enhanced signal group y_{ω,τ}, φ_{YΘj,ω} is approximated by the following equation.

Here, D_{Θj,Θi,ω} is the average sensitivity of the j-th beamformer or directional pickup to the region around direction Θ_i, and φ_{SΘi,ω} is the PSD of the i-th local sound source signal (the local PSD). The relationship between the L values φ_{SΘi,ω} and φ_{YΘj,ω} is modeled by the following equation.

To estimate the L local PSDs φ_{SΘi,ω}, the inverse problem of Equation (13) is solved. Estimating the local PSDs frame by frame in order to improve noise suppression performance, the inverse problem is formulated as the following equation.

In practice, two issues arise: choosing the number L of local regions for which sparseness can be assumed, and controlling the stability of D_ω^{-1}. Because all elements of D_ω are positive, a stable solution may not be obtained depending on the singular values of D_ω. The stabilization therefore has to be tuned manually, for example by adding a predetermined value to the diagonal terms as follows.

Here, ε is a stabilization coefficient; the larger its value, the more stable the inverse matrix computation.
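The frame-wise local PSD estimation with diagonal stabilization can be sketched as below for one frequency bin. It assumes the linear model "PSD of pre-enhanced signals = D × local PSDs" stated in the text and takes the squared magnitude of each pre-enhanced signal as its per-frame PSD; the function name and the default value of `eps` are hypothetical, and the exact form of Equations (14) and (15) is not reproduced here.

```python
import numpy as np

def estimate_local_psd(y, D, eps=1e-3):
    """Estimate local PSDs at one (frequency, frame) bin.

    y:   (L,) complex pre-enhanced signals Y_j
    D:   (L, L) sensitivity matrix, D[j, i] = sensitivity of beam j to region i
    eps: stabilization value added to the diagonal (hypothetical default)
    """
    phi_y = np.abs(np.asarray(y)) ** 2                 # per-frame PSD of each Y_j
    D = np.asarray(D, dtype=float)
    D_reg = D + eps * np.eye(D.shape[0])               # add eps to the diagonal terms
    return np.linalg.solve(D_reg, phi_y)               # solves D_reg @ phi_s = phi_y
```

A larger `eps` lowers the condition number of the matrix being inverted, at the cost of biasing the estimate, which is the trade-off behind the manual tuning mentioned above.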

When only interference noise is mixed into the observed signals, the PSD of the target sound and the PSD of the noise can be obtained from the estimate Φ̂_{S,ω,τ} computed by Equation (14). These two PSDs are needed to build the source enhancement filter.

In practice, however, incoherent (diffuse) background noise is also present in the observed signals, as in Equation (1). In that case, estimating the PSD of the coherent (interference) noise and the PSD of the background noise separately should give a more accurate source enhancement filter. One way to estimate them separately is described below.

First, the background noise PSD is removed from Φ̂_{S,ω,τ} computed by Equation (14). Because background noise can be assumed uncorrelated with the target sound and the interference noise, additivity in the power spectral domain holds approximately. Take the sources in the local region of the i-th direction Θ_i as the target sound. The PSD φ_{BNTΘi,ω,τ} of the background noise contained in the local PSD φ_{SΘi,ω,τ} is subtracted from it, which gives the estimated target sound PSD φ_{TSΘi,ω,τ} with the influence of background noise removed.

If the target sound PSD φ_{TSΘi,ω,τ} is less than 0, it is set to 0. To compute the background noise PSD φ_{BNTΘi,ω,τ} in Equation (16), the background noise is assumed to be strongly stationary in time (that is, it does not change drastically over time). Time-smoothing φ_{SΘi,ω,τ} with a recursive update to remove transient components gives Equation (17).

Here, β_ω is a time-smoothing constant; it may be set, for example, so that past values are forgotten in about 150 ms. Holding the minimum of the smoothed PSD φ̄_{SΘi,ω,τ} over an interval Τ gives an estimate of the background noise PSD φ_{BNTΘi,ω,τ} in the target sound region (the local region of the i-th direction Θ_i).
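The background noise estimate described above (recursive smoothing, minimum over an interval, then subtraction with flooring at 0) can be sketched as follows for one region and one frequency bin. The first-order update used here is a common form of recursive smoothing and is an assumption, since Equations (17) and (18) are not reproduced in this text; the function names and the defaults for `beta` and `window` are hypothetical.

```python
import numpy as np

def track_background_psd(phi_s, beta=0.9, window=50):
    """Return (smoothed, background) PSD tracks for a sequence of frames.

    phi_s:  (T,) local PSD of one region at one frequency bin, per frame
    beta:   smoothing constant in [0, 1); larger means slower forgetting
    window: number of past frames over which the minimum is held
    """
    phi_s = np.asarray(phi_s, dtype=float)
    smoothed = np.empty_like(phi_s)
    background = np.empty_like(phi_s)
    prev = phi_s[0]
    for t, value in enumerate(phi_s):
        prev = beta * prev + (1.0 - beta) * value      # first-order recursive smoothing
        smoothed[t] = prev
        start = max(0, t - window + 1)
        background[t] = smoothed[start:t + 1].min()    # minimum over the last `window` frames
    return smoothed, background

def subtract_and_floor(phi_s, background):
    """Target PSD = local PSD minus background PSD, with negative values set to 0."""
    return np.maximum(np.asarray(phi_s, dtype=float) - background, 0.0)
```

Because the minimum is taken on the smoothed track, a short burst in `phi_s` raises `smoothed` but not `background`, so the burst survives the subtraction as target energy.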

Similarly, to estimate the PSD φ_{ISΘi,ω,τ} of the coherent noise in the regions other than the target sound region (the local region of the i-th direction Θ_i), the background noise PSD φ_{BNIΘi,ω,τ} is subtracted, as for the target sound.

Here, α_{1,ω} is a weighting coefficient whose optimum value depends on the content. The coherent noise PSD φ_{ISΘi,ω,τ} is likewise floored to 0 when it is less than 0. The background noise PSD φ_{BNIΘi,ω,τ} in Equation (19) is computed as follows.

A Wiener filter G_{Θj,ω,τ} for estimating the j-th local sound source signal Z_{Θj,ω,τ} is then generated.

Here, α_{2,ω} and α_{3,ω} are weighting coefficients.

The Wiener filter G_{Θj,ω,τ} computed with Equation (22) is then shaped as follows.

Here, α_{4,ω} is a weighting coefficient. G_{Θj,ω,τ} is then floored using α_{5,ω} (0 ≤ α_{5,ω} < 1) so that α_{5,ω} ≤ G_{Θj,ω,τ} ≤ 1. The local sound source signal Z_{Θj,ω,τ} is calculated by the following equation.

Generating the local sound source signal group z_{ω,τ} by Wiener-filtering the pre-enhanced signal group y_{ω,τ} is the direction-wise sound collection process in FIG. 3.
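A minimal sketch of the Wiener filtering step follows. It uses the generic gain "target PSD divided by the weighted sum of target, interference, and background PSDs", which is an assumption about the shape of Equation (22); the shaping with α_4 in Equation (23) is not reproduced in this text and is omitted. The parameters `a2`, `a3`, and `floor` stand in for α_2, α_3, and α_5, and their defaults are hypothetical.

```python
import numpy as np

def wiener_gain(phi_target, phi_interf, phi_bg, a2=1.0, a3=1.0, floor=0.1):
    """Per-bin Wiener gain clipped to [floor, 1].

    phi_target, phi_interf, phi_bg: arrays of non-negative PSDs with the same shape.
    """
    phi_target = np.asarray(phi_target, dtype=float)
    denom = (phi_target
             + a2 * np.asarray(phi_interf, dtype=float)
             + a3 * np.asarray(phi_bg, dtype=float))
    # Where the denominator is 0 the gain is left at 0 and then raised to the floor.
    gain = np.divide(phi_target, denom, out=np.zeros_like(denom), where=denom > 0)
    return np.clip(gain, floor, 1.0)
```

The local sound source signal is then obtained by multiplying the pre-enhanced signal by the gain, for example `z = wiener_gain(phi_ts, phi_is, phi_bn) * y`.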

Finally, the binaural sound generation system 900, which performs binaural sound generation in the omnidirectional video/audio viewing system, is described. FIG. 4 is a block diagram of its configuration. As shown in FIG. 4, the system 900 includes a sound collection device 905 and a resynthesis device 955. The sound collection device 905 includes M microphones 910-1 to 910-M, M frequency domain conversion units 920-1 to 920-M, L beamforming units 930-1 to 930-L, a local PSD estimation unit 940, and a Wiener filtering unit 950. The resynthesis device 955 includes an HRTF convolution unit 960.

The sound collection device 905 performs the process of generating the local sound source signal group from the time-domain observed signals (source separation). The microphones 910-1 to 910-M pick up the sound of the sound field containing the K sources and generate time-domain observed signals. The frequency domain conversion units 920-1 to 920-M convert the respective time-domain observed signals into observed signals X_{m,ω,τ} (1 ≤ m ≤ M).

The beamforming units 930-1 to 930-L generate the pre-enhanced signals Y_{Θj,ω,τ} (j = 1, …, L) from the M observed signals. Alternatively, with L = M, L directional microphones may be used instead of the microphones 910-1 to 910-M. In that case the signals picked up by the directional microphones can be used directly as the pre-enhanced signals Y_{Θj,ω,τ}, and the beamforming units 930-1 to 930-L are unnecessary.

The local PSD estimation unit 940 generates the target sound PSD, the interference noise PSD, and the background noise PSD from the pre-enhanced signals Y_{Θj,ω,τ} (j = 1, …, L), specifically by using Equations (14), (16), (19), and (18).

The Wiener filtering unit 950 generates L Wiener filters G_{Θj,ω,τ} (j = 1, …, L) from the target sound PSD, the interference noise PSD, and the background noise PSD, and applies them to the pre-enhanced signals Y_{Θj,ω,τ} to generate the local sound source signals Z_{Θj,ω,τ}, specifically by using Equations (22), (23), and (24).

The resynthesis device 955 performs the process of generating binaural sound from the local sound source signal group (resynthesis). The HRTF convolution unit 960 generates the binaural sound b_{ω,τ} from the local sound source signals Z_{Θj,ω,τ} (j = 1, …, L). Specifically, it uses Equations (9) and (10) to generate the left and right listening signals, which are the binaural signals for listening.

The binaural sound generation system 900 can also be configured by connecting the sound collection device 905 and the resynthesis device 955 over a network such as the Internet. In that case both devices naturally need the means required for network communication. To suit transmission, the sound collection device 905 may include an encoding unit that encodes the local sound source signal group, and the resynthesis device 955 a decoding unit that decodes the encoded data.

Because the omnidirectional video/audio viewing system generates binaural sound after source separation, it places no particular restriction on microphone placement or array shape. Separately, research is also under way on binaural recording that obtains binaural sound directly from the observed signals by placing the microphones in an array of a special shape. Binaural recording normally uses microphones with pinnae, such as a HATS or a dummy head. In contrast, Non-Patent Document 2 proposes a method for simple binaural recording for video captured in a fixed direction, without elaborate modeling of the pinna. It confirms that even a simple configuration, in which hemispherical recesses are formed in a spherical microphone array and microphones are placed in them, yields frequency-spatial characteristic patterns that can serve as cues for sound localization.

Non-Patent Document 1: Kenta Niwa, Yuma Koizumi, Kazunori Kobayashi, Hisashi Uematsu, "Study on direction-wise sound collection for generating binaural sound corresponding to omnidirectional video", IEICE Technical Report EA2015-7, IEICE, vol. 115, no. 126, pp. 33-38, July 2015.

Non-Patent Document 2: Taishi Nakagiri, Toshiki Yamamura, Takanori Nishino, Hiroshi Naruse, Kazuya Takeda, "Study of binaural recording using a spherical microphone baffle with recesses", Acoustical Society of Japan 2015 Spring Meeting, 2-P-42, pp. 889-890, March 2015.

In the binaural sound generation system 900, the optimum values of the array signal processing parameters needed to adjust, for example, the amount of noise suppression (the parameters of the sound collection device 905) differ from content to content, so the parameters had to be tuned. Tuning them to general-purpose values usable across varied content, rather than to the optimum for each content, is conceivable, but sound collection performance then degrades for some content.

An object of the present invention is therefore to provide a binaural sound generation device that generates binaural sound from observed signals without requiring adjustment of array signal processing parameters.

One aspect of the present invention is a binaural sound generation device that generates binaural sound from observed signals picked up by a microphone array having M recesses (M is an integer of 3 or more) in which microphones are placed. Let n and k be integers satisfying M = 2n + k (n ≥ 1, k = 0 or 1). The three-dimensional shape of the microphone array, viewed from above, is a figure having symmetry. The M recesses are provided in the side surface of the three-dimensional shape, and 2n of them are provided so as to form pairs 180 degrees apart when the shape is viewed from above. At least one microphone is placed in each recess. The device includes an interpolation synthesis unit that generates the binaural sound by interpolating and synthesizing the observed signals.

According to the present invention, binaural sound is generated by interpolation synthesis from observed signals picked up by a microphone array in which microphones are placed in recesses located so as to form pairs 180 degrees apart, viewed from above, in a three-dimensional shape that is symmetric when viewed from above. This makes it possible to generate binaural sound from the observed signals without tuning array signal processing parameters.

FIG. 1 illustrates the generation of binaural sound according to head direction using source-wise sound collection.
FIG. 2 illustrates the generation of binaural sound according to head direction using direction-wise sound collection.
FIG. 3 shows the binaural sound generation processing flow in the omnidirectional video/audio viewing system.
FIG. 4 is a block diagram showing the configuration of the binaural sound generation system 900.
FIG. 5 is a block diagram showing the configuration of the binaural sound generation device 400.
FIG. 6 is a flowchart showing the operation of the binaural sound generation device 400.
FIG. 7 shows an example of the three-dimensional shape of the microphone array 410.
FIG. 8 shows an example of the three-dimensional shape of the microphone array 410.
FIG. 9 shows the microphone array 410 with a built-in camera for generating omnidirectional video.
FIG. 10 shows an example of microphone positions in the microphone array 410.
FIG. 11 illustrates microphone selection in the horizontal plane.
FIG. 12 is a graph of the weighting coefficient of each microphone in the horizontal plane.
FIG. 13 is a graph of the weights in the horizontal and elevation directions.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate description is omitted.

Non-Patent Document 2 shows experimentally that, in place of an elaborate pinna model, simply placing microphones in hemispherical recesses formed in a spherical microphone array is enough for the received signals to contain information useful for sound localization. The following describes how this simple approach of placing microphones in recesses in a simple three-dimensional shape is extended to omnidirectional sound collection, and how binaural sound corresponding to omnidirectional video is generated virtually from the picked-up signals.

The binaural sound generation device 400 is described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing its configuration, and FIG. 6 is a flowchart showing its operation. As shown in FIG. 5, the device 400 includes a microphone array 410 and an interpolation synthesis unit 420. The microphone array 410 has a predetermined three-dimensional shape with recesses at predetermined positions, and the M microphones 910-1 to 910-M are placed in those recesses.

The microphone array 410 collects sound in a sound field containing K sound sources and generates M time-domain observation signals (S410). Examples of the three-dimensional shape of the microphone array 410 are a sphere and a cylinder. The shape need not be an exact sphere or cylinder; a shape close to one is also acceptable. Considering that the viewer turns the head left and right when watching omnidirectional video, the shape seen from above is generally better as a figure with symmetry, typically a point-symmetric figure such as a circle.

The depressions are formed by indenting the side surface of these three-dimensional shapes at 90-degree intervals. Each depression is a simplified model of the pinna, and a simple shape such as a hemisphere suffices. The depressions are not limited to 90-degree intervals. Considering that human ears are arranged symmetrically left and right when viewed from above, as long as left-right microphone pairs (pairs 180 degrees apart) can be placed symmetrically on the side surface, the depressions may be placed at any angular interval, for example 60 degrees or 30 degrees (in general, 180/n degrees, where n is an integer of 2 or more). Intervals narrower than 90 degrees give better sound collection performance. Also, as long as microphones can be placed symmetrically as left-right pairs, the spacing between depressions need not be strictly uniform, as it is with 180/n-degree intervals.

Considering the left-right head-turning motion and the arrangement of human ears, sound collection performance is best when left-right microphone pairs are placed symmetrically on a three-dimensional shape that is symmetric when viewed from above. With one microphone per depression, intervals of 90, 60, and 30 degrees give 4, 6, and 12 microphones, respectively. However, because the virtual binaural sound for listening can be generated by interpolation synthesis, the two microphones of a left-right pair do not necessarily have to be placed symmetrically. For example, of four depressions provided at 90-degree intervals, one microphone may be installed in each of three depressions. The remaining position need not actually be a depression.
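The microphone counts quoted above follow from dividing the full circle by the angular interval. The following is a minimal sketch of that arithmetic; the function name `microphone_count` and the `mics_per_depression` parameter are illustrative assumptions, not names from the original text, and the sketch assumes evenly spaced depressions.

```python
def microphone_count(interval_deg: int, mics_per_depression: int = 1) -> int:
    """Count side microphones for depressions spaced evenly around 360 degrees.

    Assumes uniform spacing; the text also allows non-uniform spacing,
    which this helper does not model.
    """
    if interval_deg <= 0 or 360 % interval_deg != 0:
        raise ValueError("interval must be a positive divisor of 360")
    depressions = 360 // interval_deg
    return depressions * mics_per_depression


if __name__ == "__main__":
    for interval in (90, 60, 30):
        print(interval, microphone_count(interval))
```

With two microphones per depression, the 90-degree case gives eight side microphones, which matches the Mic(1) to Mic(8) layout used later in the weight design.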

Taking the forward and backward orientations of the face (and thus of the ears) into account, two microphones may be installed in each depression (see FIG. 7(C)). This arrangement allows higher-quality binaural sound to be resynthesized.

Furthermore, considering the up-down nodding motion of the head, it is preferable to install microphones on the top and bottom surfaces of the three-dimensional shape. No depression is needed for microphones on the top or bottom surface, because there is no need to model the pinna there. Installing microphones in the elevation direction in this way also allows higher-quality binaural sound to be resynthesized.

Examples of three-dimensional shapes with such depressions are shown in FIGS. 7 and 8. FIGS. 7(A) and 8(A) show the three-dimensional shape of the microphone array viewed from above and below. FIGS. 7(B) and 8(B) show it viewed from the front (back) and the side. FIGS. 7(C) and 8(C) show the shape of the depressions and the sound receiving positions where microphones are installed. In FIGS. 7(A) to 7(C) and 8(A) to 8(C), a dashed semicircle or a solid circle indicates a depression, and a small black dot indicates a sound receiving position. The sound receiving position may be shifted by 30 degrees forward or backward, left or right, and so on in the horizontal plane, as in the left part of FIG. 1 of Non-Patent Document 2. FIG. 8(D) shows how the top and bottom surface shapes of the three-dimensional shape shown in FIGS. 8(A) and 8(B) are generated. The solid-line portion of FIG. 8(D) is the shape of the top and bottom surfaces.

As FIGS. 7 and 8 show, the figures are symmetric vertically and horizontally whether viewed from above or from the side. They also show that the three-dimensional shape of the microphone array is built by combining cylinders and spheres. The shapes in FIGS. 7 and 8 are based on a cylinder 12 cm in diameter; however, since the shape is meant to simulate a head, a shape close to a sphere of about 16 cm in diameter is preferable. The upper limit of the width of the shape as viewed from above (the diameter, in the case of a circle) is about 25 cm in view of transmission delay. The lower limit is about 5 cm in view of miniaturizing the microphone array. In other words, the width should be between 5 cm and 25 cm inclusive.

As can be seen in FIGS. 7(C) and 8(C), two microphones are installed in each depression. As noted above, this is so that binaural sound can be generated according to the forward or backward orientation of the face.

In addition, microphones may be installed on the top and bottom surfaces of the three-dimensional shape to provide localization in the elevation direction as well as in the horizontal direction. A single microphone on either the top or the bottom surface is sufficient. Of course, one microphone may also be installed on each of the top and bottom surfaces. In the shapes of FIGS. 7 and 8, three microphones are installed on each of the top and bottom surfaces. As noted above, no depression is needed for microphones on the top and bottom surfaces because there is no need to simulate the pinna.

The camera for generating omnidirectional video may be built into the microphone array 410, for example as shown in FIG. 9, or may be installed at a location separate from the microphone array 410.

The interpolation synthesis unit 420 interpolates and synthesizes the M time-domain observation signals x_{m,t} (1 ≦ m ≦ M) to generate the time-domain binaural sounds b_t^(Left) and b_t^(Right), which are virtual signals for listening (S420). Specifically, interpolation synthesis is performed using Expressions (25) and (26).

Here, w_{m,Ψτ}^(Left) and w_{m,Ψτ}^(Right) are weighting coefficients that vary with the head direction Ψτ and the microphone index m.
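Expressions (25) and (26) do not appear in this text. Given that the M observation signals are mixed down with weights that depend on the head direction and the microphone index, a weighted-sum form consistent with those definitions would be the following. This is an assumed reconstruction and may differ in detail from the original expressions.

```latex
b_t^{(\mathrm{Left})}  = \sum_{m=1}^{M} w_{m,\Psi_\tau}^{(\mathrm{Left})}\, x_{m,t},
\qquad
b_t^{(\mathrm{Right})} = \sum_{m=1}^{M} w_{m,\Psi_\tau}^{(\mathrm{Right})}\, x_{m,t}
```

Under this reading, the head direction Ψτ is held fixed within a frame τ, and each output sample is a linear combination of the M microphone samples at time t.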

The design of the weighting coefficients used for interpolation is described below with reference to FIGS. 10 to 13. The description uses a microphone array 410 that has four depressions at 90-degree intervals in the horizontal direction, each holding two microphones, and one microphone on each of the top and bottom surfaces. This microphone array 410 therefore receives sound with a total of 10 microphones.

FIG. 10 shows the microphone installation positions (sound receiving positions) labeled by microphone index m (hereinafter written Mic(m)). As shown in FIG. 10, the direction of the depression containing Mic(1) and Mic(2) (the thick semicircle in the figure) is taken as the front of the microphone array (Azimuth = 0°). The horizontal plane passing through the center of the depressions is taken as the reference horizontal plane (Elevation = 0°). The single microphones on the top and bottom surfaces are installed at Mic(10) and Mic(13). In other words, no microphone is installed at Mic(9), Mic(11), Mic(12), or Mic(14).

FIG. 11 shows how microphones are selected in the horizontal plane. The direction in which the eyes face (indicated by the arrow) corresponds to the user's head direction. The positions of Mic(1) to Mic(8) do not change. For example, in the leftmost diagram of the upper row, the eyes face the 0° direction, and the sound is observed using Mic(4) and Mic(7).

FIG. 12 shows graphs of the weighting coefficient of each microphone in the horizontal plane (sideways head movement). The weighting coefficient of each microphone is set according to these graphs. For example, the graph for Mic(1) (upper left) shows that the weighting coefficient (Weight in the graph) increases monotonically from 0 to 1 over −180° to −90°, decreases monotonically from 1 to 0 over −90° to 0°, and is zero over 0° to 180°.
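The Mic(1) weight described above can be sketched as a piecewise function of head azimuth. The text only states that the weight rises and falls monotonically; the linear (triangular) ramps below are an assumption made for illustration, and the function name `mic1_weight` is hypothetical.

```python
def mic1_weight(azimuth_deg: float) -> float:
    """Illustrative horizontal weight for Mic(1) versus head azimuth in degrees.

    Assumed shape: linear ramps (the source only says "monotonic").
      -180 .. -90 : rises from 0 to 1
       -90 ..   0 : falls from 1 to 0
         0 .. 180 : zero
    """
    if not -180.0 <= azimuth_deg <= 180.0:
        raise ValueError("azimuth must lie in [-180, 180] degrees")
    if azimuth_deg <= -90.0:
        return (azimuth_deg + 180.0) / 90.0
    if azimuth_deg <= 0.0:
        return (0.0 - azimuth_deg) / 90.0
    return 0.0


if __name__ == "__main__":
    for az in (-180, -135, -90, -45, 0, 90):
        print(az, mic1_weight(az))
```

The weights for the other microphones would follow the same pattern with their own peak angles, as read off the corresponding graphs in FIG. 12.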

FIG. 13 shows graphs of the weighting coefficients for the horizontal and elevation directions (the weights used in the weighted addition of the signals received by the top and bottom microphones). As in the horizontal case, each weighting coefficient is set according to the graphs. Taken together, the three graphs show that, over −90° to 90°, the weighting coefficient for the horizontal direction (Horizontal Signal) and the weighting coefficients for the elevation direction (Vertical Signal) of Mic(10) and Mic(13) are set so that they sum to 1.
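The sum-to-one property can be illustrated with a simple crossfade. The exact curves of FIG. 13 are not given in the text, so the linear shape, the sign convention (positive elevation means looking up), and the assignment of Mic(10) to the top surface and Mic(13) to the bottom surface are all assumptions of this sketch. The function name `elevation_weights` is hypothetical.

```python
from typing import Tuple


def elevation_weights(elevation_deg: float) -> Tuple[float, float, float]:
    """Return (horizontal, top, bottom) weights for a head elevation in degrees.

    Assumed linear crossfade: the top microphone (taken to be Mic(10)) gains
    weight as the head tilts up, the bottom microphone (taken to be Mic(13))
    as it tilts down, and the horizontal mix takes the remainder.
    """
    if not -90.0 <= elevation_deg <= 90.0:
        raise ValueError("elevation must lie in [-90, 90] degrees")
    top = max(elevation_deg, 0.0) / 90.0
    bottom = max(-elevation_deg, 0.0) / 90.0
    horizontal = 1.0 - top - bottom
    return horizontal, top, bottom


if __name__ == "__main__":
    for elev in (-90, -45, 0, 45, 90):
        print(elev, elevation_weights(elev))
```

Any other set of curves would serve equally well as long as the three weights stay non-negative and sum to 1 over the −90° to 90° range.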

The horizontal weighting coefficients (FIG. 12) and the elevation weighting coefficients that provide a sense of vertical localization (FIG. 13) do not need to be designed strictly. It suffices that the weighting coefficients are set so as to correspond appropriately to the head direction.

The method of interpolation synthesis using the horizontal and elevation weighting coefficients is as follows. First, the signals received by the eight microphones arranged in the horizontal direction are combined according to the preset weighting coefficients for the horizontal angle of the head direction. Next, the combined signal and the signals received by the two microphones in the vertical direction are weighted and added according to the elevation angle of the head direction, yielding the final virtual binaural sound.
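The two-stage procedure above can be sketched for one ear as follows. This is a minimal illustration under stated assumptions rather than the device's actual implementation: the function name `synthesize_ear`, its parameter names, and the use of plain Python lists for signals are all hypothetical, and the weights are assumed to have been chosen beforehand for the current head direction.

```python
from typing import List, Sequence


def synthesize_ear(
    horizontal_signals: Sequence[Sequence[float]],  # one sample list per side microphone
    horizontal_weights: Sequence[float],            # one weight per side microphone
    top_signal: Sequence[float],
    bottom_signal: Sequence[float],
    w_horizontal: float,
    w_top: float,
    w_bottom: float,
) -> List[float]:
    """Two-stage weighted mixdown for a single ear and a fixed head direction."""
    if len(horizontal_signals) != len(horizontal_weights):
        raise ValueError("need exactly one weight per horizontal microphone")
    n = len(top_signal)
    # Stage 1: combine the side microphones using the horizontal-angle weights.
    mixed = [
        sum(w * sig[t] for w, sig in zip(horizontal_weights, horizontal_signals))
        for t in range(n)
    ]
    # Stage 2: blend the horizontal mix with the top and bottom microphones
    # using the elevation-angle weights.
    return [
        w_horizontal * mixed[t] + w_top * top_signal[t] + w_bottom * bottom_signal[t]
        for t in range(n)
    ]


if __name__ == "__main__":
    side = [[float(m)] * 3 for m in range(1, 9)]  # 8 constant test signals
    weights = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    print(synthesize_ear(side, weights, [10.0] * 3, [20.0] * 3, 0.5, 0.5, 0.0))
```

The same function would be called once with the left-ear weights and once with the right-ear weights to obtain the two channels.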

Because the weights w_{m,Ψτ}^(Left) and w_{m,Ψτ}^(Right) for the m-th microphone when the head direction is Ψτ can be calculated as described above, virtual binaural sound can be generated without performing sound source separation.

In this embodiment, M-channel signals are observed with the microphone array 410 and mixed down (interpolation-synthesized) by the interpolation synthesis unit 420 to generate virtual binaural sound. This removes the need to tune the group of latent parameters in the sound source enhancement method used by the binaural sound generation system 900, whose optimal values change with the type of sound being recorded.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for processing by that program (the program is not limited to the external storage device and may, for example, be stored in ROM, which is a read-only storage device). Data obtained through processing by these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for its processing are read into memory as required and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the device executing them or as needed.

As already noted, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions the hardware entity should have is described by a program. Executing this program on the computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another way of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it; further, each time a program is transferred from the server computer to this computer, the computer may execute processing according to the received program in turn. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that define the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A binaural sound generating device that generates binaural sound corresponding to the video a user is looking at in an omnidirectional video/audio viewing system, from observation signals collected using a microphone array provided with M depressions (M is an integer of 3 or more) in which microphones are installed, wherein
n and k are integers satisfying M = 2n + k (n ≧ 1, k = 0 or 1),
the three-dimensional shape of the microphone array, viewed from above, is a figure having symmetry,
the M depressions are provided on the side surface of the three-dimensional shape, and 2n of them are provided so as to form pairs 180 degrees apart when the three-dimensional shape is viewed from above,
at least one microphone is installed in the depression, and
the binaural sound generating device includes an interpolation synthesis unit that generates the binaural sound by interpolating and synthesizing the observation signals, giving a weighting coefficient greater than 0 only to the observation signals collected by the microphones installed in a depression when there is a depression corresponding to an ear in the user's head direction, and giving a weighting coefficient greater than 0 to all the observation signals when there is no depression corresponding to an ear in the user's head direction.

A microphone array provided with M depressions (M is an integer of 3 or more) for installing microphones that collect observation signals used to generate binaural sound corresponding to the video a user is looking at in an omnidirectional video/audio viewing system, wherein
n and k are integers satisfying M = 2n + k (n ≧ 1, k = 0 or 1),
the three-dimensional shape of the microphone array, viewed from above, is a figure having symmetry,
the M depressions are provided on the side surface of the three-dimensional shape, and 2n of them are provided so as to form pairs 180 degrees apart when the three-dimensional shape is viewed from above,
at least one microphone is installed in the depression, and
the binaural sound is generated by interpolating and synthesizing the observation signals, giving a weighting coefficient greater than 0 only to the observation signals collected by the microphones installed in a depression when there is a depression corresponding to an ear in the user's head direction, and giving a weighting coefficient greater than 0 to all the observation signals when there is no depression corresponding to an ear in the user's head direction.

The microphone array according to claim 2, wherein
x_{m,t} (1 ≦ m ≦ M) is the observation signal, and b_t^(Left) and b_t^(Right) are the binaural sound, and
the binaural sound is generated by interpolation using the following equation
(where w_{m,Ψτ}^(Left) and w_{m,Ψτ}^(Right) are weighting coefficients that vary with the user's head direction Ψτ at frame time τ and the microphone index m).

The microphone array according to claim 3, wherein
the shape viewed from above is a circle.

The microphone array according to claim 3 or 4, wherein
k = 0, and
two microphones are installed in each of the 2n depressions.

The microphone array according to any one of claims 3 to 5, wherein
at least one microphone is installed on the top surface, on the bottom surface, or on each of the top and bottom surfaces of the three-dimensional shape.

A binaural sound generating method in a binaural sound generating device that generates binaural sound corresponding to the video a user is looking at in an omnidirectional video/audio viewing system, from observation signals collected using a microphone array provided with M depressions (M is an integer of 3 or more) in which microphones are installed, wherein
n and k are integers satisfying M = 2n + k (n ≧ 1, k = 0 or 1),
the three-dimensional shape of the microphone array, viewed from above, is a figure having symmetry,
the M depressions are provided on the side surface of the three-dimensional shape, and 2n of them are provided so as to form pairs 180 degrees apart when the three-dimensional shape is viewed from above,
at least one microphone is installed in the depression, and
the method includes an interpolation synthesis step of generating the binaural sound by interpolating and synthesizing the observation signals, giving a weighting coefficient greater than 0 only to the observation signals collected by the microphones installed in a depression when there is a depression corresponding to an ear in the user's head direction, and giving a weighting coefficient greater than 0 to all the observation signals when there is no depression corresponding to an ear in the user's head direction.

A program for causing a computer to function as the binaural sound generating device according to claim 1.