JP2013097273A

JP2013097273A - Sound source estimation device, method, and program and moving body

Info

Publication number: JP2013097273A
Application number: JP2011241610A
Authority: JP
Inventors: Tomoya Takatani; 智哉高谷; Jun Sato; 潤佐藤; Ryuji Funayama; 竜士船山
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2011-11-02
Filing date: 2011-11-02
Publication date: 2013-05-20
Anticipated expiration: 2031-11-02
Also published as: JP5692006B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound source estimation device, method, and program capable of estimating a sound source more accurately, and to provide a moving body using the same. SOLUTION: The sound source estimation device uses observation signals acquired by microphones 11 and 12 to estimate the direction of a sound source and includes: a noise estimator 23 which estimates noise components included in the observation signals; a mask generation unit 24 which generates a mask M(ω, t) on the basis of the noise components estimated by the noise estimator 23; a reliability generation unit 25 which calculates the reliability Reliability(t) of the mask generated by the mask generation unit 24; a CSP coefficient calculation unit 26 which calculates a CSP coefficient corrected by the mask and the reliability of the mask; and a direction estimation unit 28 which estimates the direction of the sound source on the basis of the CSP coefficient.

Description

The present invention relates to a sound source estimation device, a sound source estimation method, a sound source estimation program, and a moving body, and more particularly to a sound source estimation device, method, and program for estimating a sound source, and to a moving body using the sound source estimation device.

Non-Patent Document 1 discloses a technique using the CSP (Cross-power Spectrum Phase analysis, i.e., whitened cross-correlation) method. The CSP method is also called the GCC-PHAT (Generalized Cross-Correlation PHAse Transform) algorithm and is used for estimating the direction of a sound source.

Non-Patent Document 1: Frithjof Hummes, Junge Qi, Tim Fingscheid, "Robust Acoustic Speaker Localization with Distributed Microphones," 19th European Signal Processing Conference (EUSIPCO 2011), pp. 240-244.

The processing of the CSP method is described below. FIG. 4 is a block diagram showing the processing flow of the CSP method. The short-time DFT unit 121 performs short-time DFT (Discrete Fourier Transform) processing on the observation signals x_1(t) and x_2(t) observed by two microphones. The time-domain observation signals x_1(t) and x_2(t) are thereby converted into the time-frequency-domain observation signals X_1(ω, t) and X_2(ω, t), respectively.

The CSP coefficient calculation unit 126 calculates the CSP coefficient CSP(d, t) from the observation signals X_1(ω, t) and X_2(ω, t). The CSP coefficient is the cross-correlation function of the observation signals X_1(ω, t) and X_2(ω, t), normalized by their amplitudes. The time difference estimation unit 127 then estimates the arrival time difference τ based on the index d that maximizes the CSP coefficient. This arrival time difference τ corresponds to the difference between the times at which the sound is observed by the first microphone and by the second microphone. The direction estimation unit 128 estimates the direction θ based on the estimated time difference.
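
The conventional flow described above (DFT, amplitude normalization, inverse DFT, peak search) can be sketched as follows. This is a minimal illustration that assumes the standard GCC-PHAT formulation, not the exact equations of Non-Patent Document 1 or of this patent. The function names, the naive `dft`/`idft` helpers, the small constant `eps`, and the sign convention (X_1 multiplied by the conjugate of X_2) are all assumptions made for the example.

```python
import cmath
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * w * k / n) for k in range(n))
            for w in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[w] * cmath.exp(2j * cmath.pi * w * d / n) for w in range(n)) / n
            for d in range(n)]


def csp_coefficients(x1, x2, eps=1e-12):
    """CSP coefficients CSP(d) for one frame of two microphone signals."""
    X1, X2 = dft(x1), dft(x2)
    # Divide the cross-spectrum by its magnitude so that only phase remains.
    # Assumed convention: X1 * conj(X2), so a positive peak lag means that
    # x1 is a delayed copy of x2.
    whitened = [a * b.conjugate() / (abs(a) * abs(b) + eps) for a, b in zip(X1, X2)]
    return [c.real for c in idft(whitened)]


def peak_lag(csp):
    """Index d that maximizes CSP(d), mapped to a signed lag in samples."""
    n = len(csp)
    d = max(range(n), key=lambda i: csp[i])
    return d if d <= n // 2 else d - n


# Demo: x1 is x2 circularly delayed by 3 samples.
rng = random.Random(0)
x2 = [rng.gauss(0.0, 1.0) for _ in range(64)]
x1 = x2[-3:] + x2[:-3]
lag = peak_lag(csp_coefficients(x1, x2))
print(lag)
```

Because the amplitude is normalized away, the peak location depends only on the phase difference between the two channels; in the demo the recovered lag equals the 3-sample delay that was applied.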

Incidentally, the CSP algorithm assumes that there is only one sound source. Based on this assumption, the CSP method estimates the sound source direction using the full-band signal obtained by the discrete Fourier transform. First, the observation signals X_1(ω, t) and X_2(ω, t) can be expressed by the following equations (1) and (2).

Here, ω is the frequency and t is the time. X_1s is the observation signal obtained when the sound from the target sound source is acquired by the first microphone, and X_2s is the observation signal obtained when the sound from the target sound source is acquired by the second microphone. t_1s is the time corresponding to the distance between the target sound source and the first microphone, and t_2s is the time corresponding to the distance between the target sound source and the second microphone. τ(t) is the difference in sound arrival time between the first microphone and the second microphone. Assuming that there is only one sound source, the CSP coefficient can be expressed by equation (3).
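
For orientation, one common way of writing a single-source model and the CSP coefficient with the symbols defined above is sketched below. It follows the standard GCC-PHAT formulation and is an assumption; in particular, the sign of τ(t), the choice of which channel is conjugated, and the use of pure delay terms are conventions chosen for the illustration and may differ from equations (1) to (3) of the patent.

```latex
\begin{align}
X_1(\omega,t) &= X_{1s}(\omega,t)\,e^{-j\omega t_{1s}}, \\
X_2(\omega,t) &= X_{2s}(\omega,t)\,e^{-j\omega t_{2s}},
  \qquad t_{2s} = t_{1s} + \tau(t), \\
\mathrm{CSP}(d,t) &= \mathrm{IDFT}\!\left[
  \frac{X_1(\omega,t)\,X_2^{*}(\omega,t)}
       {\lvert X_1(\omega,t)\rvert\,\lvert X_2(\omega,t)\rvert}
  \right].
\end{align}
```

Under these assumptions the normalized cross-spectrum reduces to a pure phase term that depends only on τ(t), so its inverse transform peaks at the index d corresponding to the arrival time difference.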

In equation (3), * denotes the complex conjugate. In a real environment, however, there is not necessarily only one sound source; environmental noise and interfering sounds are mixed in. A mixed signal from a plurality of sound sources is therefore observed. This mixed signal can be expressed by equations (4) and (5).

In equations (4) and (5), n is the number of sound sources that act as noise (noise sources). X_1Nn is the observation signal obtained when the sound from the n-th noise source is observed by the first microphone, and t_1Nn is the time corresponding to the distance between the n-th noise source and the first microphone. Similarly, X_2Nn is the observation signal obtained when the sound from the n-th noise source is observed by the second microphone, and t_2Nn is the time corresponding to the distance between the n-th noise source and the second microphone. τ_s is the arrival time difference of the sound from the target sound source, and τ_Nn is the arrival time difference of the sound from the n-th noise source.

To simplify the description, the number of target sound sources and the number of noise sources are both assumed to be one in the following. In this case, the observation signals are expressed by the following equations (6) and (7).

The CSP coefficient is expanded as shown in the following equation (8).

In the case of a high SNR (Signal-to-Noise Ratio), that is, in a low-noise environment, the approximation of the following equation (9) holds.

Accordingly, the formula for the CSP coefficient expands as shown in the following equation (10), which makes it possible to estimate the direction of the target sound source.

In the case of a low SNR, that is, in a high-noise environment, the approximation of equation (9) does not hold. Therefore, as shown in equation (11), the arrival phase difference of the noise component and the like (the region term in equation (11)) affects the CSP coefficient sequence.

As shown in equations (12) and (13), the direction of the sound source is calculated by searching for the index d that maximizes the CSP coefficient and converting that index d.

As described above, the CSP method normalizes the amplitude information and performs the calculation using only the phase difference information. In addition, the method described in Non-Patent Document 1 uses a Wiener filter. Higher accuracy is desired in this kind of sound source direction estimation. For example, it is desirable to estimate the direction more accurately when a noise source or the target sound source is moving relative to the microphones.

The present invention has been made in view of the above problems, and an object thereof is to provide a sound source estimation device, a sound source estimation method, and a sound source estimation program capable of accurately estimating a sound source.

A sound source estimation device according to an aspect of the present invention estimates the direction of a sound source using observation signals acquired by at least two microphones, and includes: a noise estimation unit that estimates a noise component included in the observation signals; a mask generation unit that generates a mask based on the noise component estimated by the noise estimation unit; a reliability calculation unit that calculates the reliability of the mask generated by the mask generation unit; a CSP coefficient calculation unit that calculates a CSP coefficient corrected by the mask and by the reliability of the mask; and an estimation unit that estimates the direction of the sound source based on the corrected CSP coefficient. Introducing the mask and the reliability of the mask in this way allows the sound source to be estimated accurately.

In the above sound source estimation device, the difference in the arrival time of sound at the two microphones may be calculated based on the index that maximizes the corrected CSP coefficient, the estimation unit may estimate the direction of the sound source based on the arrival time difference, and whether or not a target sound source exists in the estimated direction may be determined according to the maximum value of the corrected CSP coefficient. This allows the sound source to be estimated more accurately.

In the above sound source estimation device, the mask may take discrete values according to frequency, and the corrected CSP coefficient may be calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

In the above sound source estimation device, the corrected CSP coefficient may be calculated by multiplying the inverse-Fourier-transformed value by the reliability.

A moving body according to an aspect of the present invention is a moving body equipped with the above sound source estimation device, in which the noise estimation unit performs noise estimation based on a vehicle signal corresponding to the operating state of the moving body. This allows the sound source to be estimated using a noise component estimated at an appropriate timing.

The above moving body may further include a mask storage unit that stores a mask in advance, and a mask selection unit that selects, based on a vehicle signal corresponding to the operating state of the moving body, either the mask stored in the mask storage unit or the mask generated by the mask generation unit. This allows the sound source to be estimated using an appropriate mask.

A sound source estimation method according to an aspect of the present invention estimates the direction of a sound source using observation signals acquired by at least two microphones, and includes the steps of: estimating a noise component included in the observation signals; generating a mask based on the noise component; calculating the reliability of the mask; calculating a CSP coefficient corrected by the mask and by the reliability of the mask; and estimating the direction of the sound source based on the corrected CSP coefficient. Introducing the mask and the reliability of the mask in this way allows the sound source to be estimated accurately.

In the above sound source estimation method, the difference in the arrival time of sound at the two microphones may be calculated based on the index that maximizes the corrected CSP coefficient, the direction of the sound source may be estimated based on the arrival time difference, and whether or not a target sound source exists in the estimated direction may be determined according to the maximum value of the corrected CSP coefficient. This allows the sound source to be estimated more accurately.

In the above sound source estimation method, the mask may take discrete values according to frequency, and the corrected CSP coefficient may be calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

In the above sound source estimation method, the corrected CSP coefficient may be calculated by multiplying the inverse-Fourier-transformed value by the reliability.

A sound source estimation program according to an aspect of the present invention estimates the direction of a sound source using observation signals acquired by at least two microphones, and causes a computer to execute the steps of: estimating a noise component included in the observation signals; generating a mask based on the noise component; calculating the reliability of the mask; calculating a CSP coefficient corrected by the mask and by the reliability of the mask; and estimating the direction of the sound source based on the corrected CSP coefficient. Introducing the mask and the reliability of the mask in this way allows the sound source to be estimated accurately.

The above sound source estimation program may cause the computer to calculate the difference in the arrival time of sound at the two microphones based on the index that maximizes the corrected CSP coefficient, to estimate the direction of the sound source based on the arrival time difference, and to determine whether or not a target sound source exists in the estimated direction according to the maximum value of the corrected CSP coefficient. This allows the sound source to be estimated more accurately.

In the above sound source estimation program, the mask may take discrete values according to frequency, and the corrected CSP coefficient may be calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

In the above sound source estimation program, the corrected CSP coefficient may be calculated by multiplying the inverse-Fourier-transformed value by the reliability.

According to the present invention, it is possible to provide a sound source estimation device, a sound source estimation method, and a sound source estimation program capable of accurately estimating a sound source, as well as a moving body using the same.

FIG. 1 is a block diagram showing the configuration of the sound source estimation device according to the embodiment.
FIG. 2 is a block diagram showing the processing flow in the sound source estimation device.
FIG. 3 is a block diagram showing an application example of the sound source estimation device according to the embodiment.
FIG. 4 is a diagram explaining sound source estimation by the CSP method.

Embodiments of the sound source estimation device according to the present invention are described in detail below with reference to the drawings. The present invention is not, however, limited to the following embodiments. For clarity of explanation, the following description and the drawings are simplified as appropriate.

First, the sound source estimation device according to the embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing the system configuration of the sound source estimation device. The sound source estimation device according to the present embodiment estimates the direction of a sound source. It further determines whether or not a target sound source exists in the estimated direction. For example, the sound source estimation device according to the present embodiment is mounted on a vehicle. It then detects the direction of another vehicle, which is the sound source, and whether or not that vehicle is nearby. The presence or absence of an approaching vehicle and its direction can thus be detected. This makes it possible to give an effective warning that a vehicle is approaching, which contributes to the prevention of traffic accidents.

As shown in FIG. 1, the sound source estimation device includes a microphone 11, a microphone 12, a microphone amplifier 13, a microphone amplifier 14, an A/D converter 15, and a CPU 16. Although FIG. 1 shows only the two microphones 11 and 12, the number of microphones is not particularly limited. Any plural number of microphones may be used, for example three or more. For example, a microphone array in which a plurality of microphones are arranged in an array can be used. The processing described below is then performed on two of those microphones, which makes it possible to estimate the direction of the sound source. Furthermore, the position of the sound source can also be identified by preparing a plurality of microphone pairs and performing the following processing on each pair.

The microphone 11 and the microphone 12 are arranged a distance D apart. Assume that the microphones 11 and 12 detect sound arriving from the direction θ(t). That is, in FIG. 1 the target sound source is assumed to be in the direction θ(t). The microphones 11 and 12 output observation signals corresponding to the detected sound.

The microphone amplifiers 13 and 14 amplify the observation signals from the microphones 11 and 12, respectively, and output them to the A/D converter 15. The A/D converter 15 performs A/D conversion on the input observation signals. The digital observation signals output from the A/D converter 15 are input to the CPU (Central Processing Unit) 16. The CPU 16 performs arithmetic processing on the observation signals from the A/D converter 15 to estimate the sound source direction. The CPU 16 performs this processing with reference to programs, parameters, and the like stored in a ROM (Read Only Memory) and a RAM (Random Access Memory), which are not shown.

Next, the configuration of the processing blocks in the CPU 16 is described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the CPU 16. The CPU 16 processes the observation signals from the A/D converter 15 according to these blocks. The CPU 16 includes a short-time DFT unit 21, a short-time DFT unit 22, a noise estimator 23, a mask generation unit 24, a reliability generation unit 25, a time-frequency correction type CSP coefficient calculation unit 26, a time difference estimation unit 27, and a direction estimation unit 28.

The signal observed by the microphone 11 is denoted as the observation signal x_1(t), and the signal observed by the microphone 12 is denoted as the observation signal x_2(t). The short-time DFT units 21 and 22 apply a short-time discrete Fourier transform to the observation signals x_1(t) and x_2(t). For example, the observation signal for a predetermined time is stored in a buffer or memory and divided into a plurality of frames. For example, the frames are divided with a half shift so that adjacent frames overlap by half in the time domain. A window function may also be used for the frame division. The frame-divided observation signals are then subjected to the discrete Fourier transform. In this way, the time-domain observation signals x_1(t) and x_2(t) are converted into the time-frequency-domain observation signals X_1(ω, t) and X_2(ω, t), respectively. The short-time DFT units 21 and 22 output the observation signals X_1(ω, t) and X_2(ω, t) to the noise estimator 23, the mask generation unit 24, and the CSP coefficient calculation unit 26.
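
The frame division with a half shift, the optional window function, and the per-frame discrete Fourier transform can be sketched as follows. This is only an illustration under stated assumptions: the Hann window, the frame length of 8 samples, the naive `dft` helper, and all function names are hypothetical choices that do not appear in the source.

```python
import cmath
import math


def split_frames(signal, frame_len):
    """Split a signal into frames shifted by half a frame (50% overlap)."""
    hop = frame_len // 2
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hann(n):
    """Periodic Hann window; the choice of window is an assumption."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * w * k / n) for k in range(n))
            for w in range(n)]


def short_time_dft(signal, frame_len):
    """Return X[t][w]: one windowed DFT per half-shifted frame."""
    window = hann(frame_len)
    return [dft([w * v for w, v in zip(window, frame)])
            for frame in split_frames(signal, frame_len)]


# Demo: a cosine with 2 cycles per 8-sample frame.
signal = [math.cos(2 * math.pi * 2 * k / 8) for k in range(16)]
X = short_time_dft(signal, 8)
print(len(X), len(X[0]))
```

With 16 samples and 8-sample frames, the half shift yields three frames starting at samples 0, 4, and 8, and each frame is represented by 8 frequency bins.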

The noise estimator 23 estimates the noise using the observation signals X_1(ω, t) and X_2(ω, t). For example, a time average over past frames or a microphone-array-based estimation method such as a null beamformer can be used. Specifically, the noise can be estimated using the following equation (14). In equation (14), S is the number of frame divisions.
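
As a rough illustration of the time-average option, the sketch below averages the spectral magnitude of the S frames that precede frame t, separately for each frequency bin. It assumes a single channel and a plain arithmetic mean of magnitudes; equation (14) may combine the two channels or weight the frames differently. The function name and the argument layout (`magnitudes[t][w]`) are hypothetical.

```python
def estimate_noise(magnitudes, t, S):
    """Average |X(w, t - s)| over the S frames before frame t, per frequency bin.

    magnitudes[t][w] holds the spectral magnitude of frame t at bin w.
    """
    past = magnitudes[max(0, t - S):t]
    if not past:
        raise ValueError("at least one past frame is required")
    n_bins = len(past[0])
    return [sum(frame[w] for frame in past) / len(past) for w in range(n_bins)]


# Demo: three frames with two frequency bins each.
magnitudes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noise = estimate_noise(magnitudes, t=2, S=2)
print(noise)
```

In the demo, the estimate for frame 2 is the mean of frames 0 and 1, so the current frame does not contribute to its own noise estimate.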

The noise estimator 23 outputs the estimated noise N(ω, t) to the mask generation unit 24. The mask generation unit 24 generates a mask M(ω, t) that masks the CSP coefficient according to frequency. The mask generation unit 24 calculates the mask M(ω, t) using the noise N(ω, t) and the observation signals X_1(ω, t) and X_2(ω, t). For example, whether the approximation in equation (9) holds is determined by the signal-to-noise ratio (SNR) of the arriving sound at each frequency. The SNR is therefore estimated to determine whether the approximation holds. Masking is applied to frequency bands in which the approximation does not hold, that is, bands with high noise. This makes it possible to calculate the CSP coefficient using only the bands in which the influence of the noise component is small. As a result, sound source direction estimation that operates robustly even in a low-SNR (high-noise) environment becomes possible.

For example, the noise N(ω, t) may be compared with a threshold value, and M(ω, t) may be set according to the result of the comparison. Specifically, M(ω, t) = 0 when the value of the noise N(ω, t) is larger than the threshold value, and M(ω, t) = 1 when it is smaller than the threshold value. The mask M(ω, t) thus takes discrete values according to frequency. The mask M(ω, t) generated by the mask generation unit 24 is input to the reliability generation unit 25 and to the time-frequency correction type CSP coefficient calculation unit 26.
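
The threshold comparison can be sketched as a binary mask in which a frequency bin is kept (M = 1) when its estimated noise does not exceed the threshold and is masked out (M = 0) otherwise, which matches the later statement that more bins with M = 1 indicate less noise. The function name, the threshold value, and the handling of a noise value exactly equal to the threshold (kept here) are assumptions.

```python
def make_mask(noise, threshold):
    """Binary mask M(w, t) for one frame: 0 for noisy bins, 1 otherwise."""
    return [0 if n > threshold else 1 for n in noise]


# Demo: per-bin noise estimates for one frame and a hypothetical threshold of 1.0.
noise = [0.1, 2.0, 0.5, 1.5]
mask = make_mask(noise, threshold=1.0)
print(mask)
```

Only the second and fourth bins exceed the threshold in the demo, so only those bins are excluded from the subsequent CSP calculation.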

The reliability generation unit 25 calculates Reliability(t), which indicates the reliability of the mask M(ω, t). As described above, the value of M(ω, t) changes according to the noise N(ω, t). Accordingly, the more frequencies there are at which M(ω, t) = 1, the less noise there is and the higher the reliability is considered to be. Conversely, the more frequencies there are at which M(ω, t) = 0, the more noise there is and the lower the reliability is considered to be. In the latter case, the observation signal contains few signal components from the target sound source, so the reliability of the estimated direction of the target sound source is low. Introducing Reliability(t), which indicates the reliability of the mask M(ω, t), therefore allows the direction of the sound source to be estimated more accurately. That is, by using the mask M(ω, t) and Reliability(t), both of which are based on the noise component and the signal component, a CSP coefficient with time-frequency correction can be calculated.

For example, Reliability(t) can be obtained using the following equation (15).

Here, Ω is the number of frequencies ω. That is, M(ω, t) is assumed to have been calculated for Ω values of ω. For example, suppose that Ω = 100, that is, 100 values of M(ω, t) are calculated at a given time, and that 10 of the 100 values of M(ω, t) are 1 and the remaining 90 are 0. Reliability(t) is then 0.1 (= 10/100). In this case, the reliability is low. On the other hand, suppose that Ω = 100 and that all 100 values of M(ω, t) are 1 and none are 0. Reliability(t) is then 1 (= 100/100). In this case, the reliability is high.
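
The two numerical examples above are consistent with Reliability(t) being the fraction of frequency bins whose mask value is 1. A minimal sketch under that reading follows; the function name is hypothetical, and the exact form of equation (15) is assumed from the worked examples.

```python
def reliability(mask):
    """Fraction of frequency bins with M(w, t) = 1 in one frame."""
    return sum(mask) / len(mask)


# The two cases from the text, with 100 frequency bins each.
low = reliability([1] * 10 + [0] * 90)
high = reliability([1] * 100)
print(low, high)
```

The first mask gives 10/100 and the second gives 100/100, reproducing the low-reliability and high-reliability cases described in the text.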

The reliability generation unit 25 outputs Reliability(t) to the time-frequency correction type CSP coefficient calculation unit 26. The observation signals X_1(ω, t) and X_2(ω, t) from the short-time DFT units 21 and 22 are also input to the CSP coefficient calculation unit 26.

The CSP coefficient calculation unit 26 calculates the CSP coefficient CSP(d, t) based on Reliability(t), the mask M(ω, t), and the observation signals X_1(ω, t) and X_2(ω, t). CSP(d, t) can be obtained using, for example, equation (16).

As shown in equation (16), the CSP coefficient calculation unit 26 applies an inverse discrete Fourier transform (IDFT) to the product of the mask M(ω, t) and the cross-correlation function of the observation signals X_1(ω, t) and X_2(ω, t) normalized by their amplitudes. The CSP coefficient calculation unit 26 then obtains the CSP coefficient by multiplying the inverse-transformed value by Reliability(t). In other words, Reliability(t) serves as a weight on the CSP coefficient. In this way, a CSP coefficient corrected with respect to both time and frequency can be obtained.
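
The order of operations described for equation (16), namely masking the normalized cross-spectrum, applying the inverse DFT, and then weighting by Reliability(t), can be sketched as follows. The conjugation convention (X_1 times the conjugate of X_2), the constant `eps`, the naive `idft` helper, the demo spectra, and all names are assumptions made for the illustration.

```python
import cmath


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[w] * cmath.exp(2j * cmath.pi * w * d / n) for w in range(n)) / n
            for d in range(n)]


def corrected_csp(X1, X2, mask, reliability, eps=1e-12):
    """Reliability(t) * IDFT[ M(w,t) * X1 * conj(X2) / (|X1| |X2|) ]."""
    weighted = [m * a * b.conjugate() / (abs(a) * abs(b) + eps)
                for m, a, b in zip(mask, X1, X2)]
    # The real part is taken because a mask that is not conjugate-symmetric
    # leaves a small imaginary component after the inverse transform.
    return [reliability * c.real for c in idft(weighted)]


# Demo spectra: X1 equals X2 with a 3-sample delay applied in the frequency domain.
n = 16
X2 = [cmath.exp(0.3j * w) for w in range(n)]
X1 = [X2[w] * cmath.exp(-2j * cmath.pi * w * 3 / n) for w in range(n)]
mask = [1] * 8 + [0] * 8          # the upper half of the bins is masked out
rel = sum(mask) / len(mask)       # 0.5
csp = corrected_csp(X1, X2, mask, rel)
d_peak = max(range(n), key=lambda d: csp[d])
print(d_peak, round(csp[d_peak], 3))
```

In the demo the peak stays at the true delay of 3 samples even though half of the bins are masked, while its height drops to 0.25: one factor of 0.5 comes from the masked bins and another from the reliability weight.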

The CSP coefficient calculation unit 26 outputs the calculated CSP coefficient to the time difference estimation unit 27. The time difference estimation unit 27 estimates the arrival time difference τ(t) from the CSP coefficient. The time difference between the sounds arriving at the two microphones 11 and 12 can thereby be obtained. For example, the arrival time difference τ(t) can be calculated using equation (17).

Here, "sampling frequency" in Expression (17) is the sampling frequency. Expression (17) finds the index d that maximizes the CSP coefficient CSP(d, t), and dividing this index d by the sampling frequency gives the arrival time difference τ(t). The arrival time difference τ(t) is thus calculated from the CSP coefficient, that is, from the cross-correlation function of the observation signals X₁(ω, t) and X₂(ω, t) normalized by amplitude. Because the CSP method normalizes the amplitude information and computes the CSP coefficient from the phase-difference spectrum, it is more robust to reverberation than other sound source direction estimation techniques.
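The peak search described for Expression (17) might look like the sketch below. The function name `arrival_time_difference` is hypothetical, and mapping indices in the upper half of the circular correlation to negative lags is an assumption that the text does not state.

```python
def arrival_time_difference(csp, sampling_frequency):
    """Return (tau, d): the index d maximizing CSP(d, t) and tau = d / fs."""
    n = len(csp)
    d = max(range(n), key=lambda i: csp[i])
    # Assumption: upper-half indices of a circular correlation are negative lags.
    if d > n // 2:
        d -= n
    return d / sampling_frequency, d
```

For instance, a peak at index 2 with a 16 kHz sampling frequency corresponds to a delay of 2 / 16000 s, or 125 microseconds.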

The direction estimation unit 28 estimates the azimuth θ(t) from which the sound arrived at the microphones 11 and 12, based on the arrival time difference τ(t). This makes it possible to estimate the direction of the sound source. For example, the azimuth θ(t) can be estimated using Expression (18), where C is the speed of sound.
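Expression (18) is not reproduced in this text, so the following sketch uses the common far-field relation θ = arcsin(C·τ / spacing) as a stand-in. That relation, the parameter `mic_spacing_m`, the default speed of sound of 340 m/s, and the function name are all assumptions, not values taken from the embodiment.

```python
import math

def azimuth_from_delay(tau, mic_spacing_m, speed_of_sound=340.0):
    """Azimuth in radians from the arrival time difference (far-field model)."""
    ratio = speed_of_sound * tau / mic_spacing_m
    # Clamp so that measurement noise cannot push the ratio outside asin's domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)
```

Under this model, a delay of zero corresponds to a source straight ahead of the microphone pair, and the largest physically possible delay corresponds to a source on the line through both microphones.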

The determination unit 29 determines, according to the value of the CSP coefficient, whether the target sound source is present at the azimuth θ(t). Suppose, for example, that the target sound source is another vehicle. In this case, whether another vehicle is present at the azimuth θ(t) is determined according to the value of the CSP coefficient at the index d at which the CSP coefficient is maximal. When the maximum value of the CSP coefficient is larger than a threshold, the noise component is low and the reliability is high, so it is determined that another vehicle is present in the direction θ(t). Conversely, when the maximum value is smaller than the threshold, the noise component is high and the reliability is low, so it is determined that no other vehicle is present in the direction θ(t).

In this way, comparing the CSP coefficient with a threshold makes it possible to estimate whether a sound source is present at the azimuth θ(t). The threshold may be set in advance by the user based on experimental results or the like. Because the presence of the target sound source at the azimuth θ(t) is detected according to the maximum value of the CSP coefficient, reliability can be improved.
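The threshold comparison performed by the determination unit 29 can be illustrated with a few lines. The function name `detect_target` and the returned tuple are hypothetical, and the threshold is left to the caller because the text only says that it is set from experimental results.

```python
def detect_target(csp, threshold):
    """Return (present, peak_index) by comparing the CSP peak with a threshold."""
    peak_index = max(range(len(csp)), key=lambda d: csp[d])
    # A peak above the threshold is treated as a target in that direction.
    return csp[peak_index] > threshold, peak_index
```

Because the reliability-weighted CSP coefficient is small in unreliable frames, those frames tend to fall below the threshold and are rejected by the same comparison.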

For such a determination based on the CSP coefficient, the method described in, for example, Hideki Banno (坂野秀樹) et al., "Construction of an approaching vehicle detection system using a microphone array corresponding to a plurality of vehicles," IEICE Technical Report, vol. 110, no. 471, pp. 13-16, March 18, 2011, can be used.

Using the sound source estimation method described above makes it possible to estimate the direction of the target sound source more accurately. Introducing the mask M(ω, t) reduces the influence of frequencies with a high noise component. In addition, introducing Reliability(t), which indicates the reliability of the mask M(ω, t), prevents the direction from being estimated at timings with low reliability. That is, estimation is performed at timings when the signal component is high. As a result, the direction of the target sound source can be estimated more accurately.

In the above description, the mask M(ω, t) is binary, that is, it takes one of the two values 0 and 1, but the mask M(ω, t) is not limited to these two values. The value of the mask M(ω, t) may be set in steps or continuously. For example, the noise N(ω, t) may be compared with multiple thresholds so that the mask M(ω, t) takes one of several levels between 0 and 1. The mask M(ω, t) may also be calculated as a continuous value between 0 and 1. Specifically, the mask M(ω, t) can be calculated using the Wiener filter given by Expression (19) or Expression (20) below.

Here, γ is a parameter that can be set in advance based on experimental results or the like, and can be 2 or a real number other than 2. This allows the mask to be generated with a pseudo-parametric Wiener filter. Similarly, β is a parameter that can be set in advance based on experimental results or the like, and can be 1 or a real number other than 1. In this way, a mask M(ω, t) that eliminates or suppresses the influence of frequencies with a high noise component can be introduced. Even when the mask M(ω, t) takes continuous values, Reliability(t) can be calculated using Expression (15) above.
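Expressions (19) and (20) are not reproduced in this text. As a rough illustration only, the sketch below uses one widely used parametric Wiener-type gain, 1 − β·|N|^γ / |X|^γ clipped to the range 0 to 1. This specific form, the function name `soft_mask`, and the `eps` guard are assumptions and may differ from the expressions in the original publication.

```python
def soft_mask(x_mag, n_mag, gamma=2.0, beta=1.0, eps=1e-12):
    """Continuous mask in [0, 1] from observed and estimated-noise magnitudes."""
    mask = []
    for x, n in zip(x_mag, n_mag):
        # Assumed parametric Wiener-type gain; gamma=2, beta=1 is the plain case.
        gain = 1.0 - beta * (n ** gamma) / (x ** gamma + eps)
        mask.append(min(1.0, max(0.0, gain)))
    return mask
```

Bins where the estimated noise is as large as the observation get a gain near 0, and bins with no estimated noise keep a gain of 1, which matches the role the text assigns to the mask.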

The sound source estimation device described above is well suited to being mounted on a moving body. A moving body such as an automobile, a mobile robot, or a motorcycle estimates the sound source direction while it is itself moving. Moreover, in an environment such as a public road where other moving bodies are traveling, the other moving bodies that act as sound sources are also moving. In such cases, the sound source estimation device estimates the sound source direction while the moving body moves relative to the target sound source; the sound source estimation processing described above is performed in an environment where the target sound source and the sound source estimation device are in relative motion. Because this processing uses the time-corrected CSP coefficient, the direction can be estimated more accurately. That is, because Reliability(t) is introduced and the sound source direction is estimated from observation signals at highly reliable timings, the estimation accuracy can be improved.

An example in which the sound source estimation device is mounted on a vehicle, which is a moving body, is described below with reference to FIG. 3. FIG. 3 is a block diagram showing the main part of a vehicle equipped with the sound source estimation device. The vehicle 30 includes a vehicle signal acquisition unit 31 and a noise estimator activation unit 32. In addition, a mask storage unit 41 and a mask selection unit 42 are added to the sound source estimation device shown in FIG. 2. The short-time DFT unit 21, the short-time DFT unit 22, the time difference estimation unit 27, the direction estimation unit 28, and the determination unit 29 shown in FIG. 2 perform the same processing as before and are therefore omitted from FIG. 3. In the configuration shown in FIG. 3, the mask M(ω, t) is generated dynamically, as described below.

The vehicle signal acquisition unit 31 acquires vehicle signals relating to the vehicle 30. For example, it acquires control signals and operation signals of the vehicle 30 as vehicle signals. Specifically, if the vehicle 30 is an automobile, the on/off state of the wipers and headlights of the vehicle 30 is acquired as a vehicle signal. The traveling speed of the vehicle 30, the depression amount of the brake pedal or accelerator pedal, map information, and position information from GPS may also be used as vehicle signals, as may recognition results from other sensors such as a camera or radar. Any information on the operating state of the vehicle 30 can serve as a vehicle signal. The vehicle signal acquisition unit 31 outputs the acquired vehicle signals to the noise estimator activation unit 32 and the mask selection unit 42.

The noise estimator activation unit 32 activates the noise estimator 23 based on vehicle signals that reflect the operating state of the vehicle 30. The noise estimator 23 starts noise estimation when instructed by the noise estimator activation unit 32. The noise estimator activation unit 32 activates the noise estimator 23 when the noise in the environment changes. For example, it may activate the noise estimator 23 at the moment the vehicle speed falls to or below a certain speed (for example, 20 km/h), so that noise estimation is performed at that timing. Alternatively, it may activate the noise estimator 23 based on the depression amount of the brake pedal or accelerator pedal, or based on map information and position information from GPS. Specifically, when the vehicle 30 approaches a location such as an intersection where traffic accidents are frequent, the mask may be generated from noise estimated immediately beforehand. The noise estimator 23 may also be activated based on recognition results from other sensors such as a camera or laser. In this way, the noise estimator activation unit 32 activates the noise estimator 23 so that noise estimation is performed when the environment around the vehicle 30 changes or when the operation of the vehicle 30 changes.
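The activation logic can be pictured as an edge-triggered check on successive vehicle signals. The sketch below is illustrative only: the `VehicleSignal` fields, the function name, and the choice to fire only on a transition (rather than continuously while the condition holds) are assumptions; the 20 km/h default comes from the example in the text.

```python
from dataclasses import dataclass

@dataclass
class VehicleSignal:
    """Hypothetical snapshot of the signals gathered by acquisition unit 31."""
    speed_kmh: float
    near_flagged_intersection: bool = False

def should_start_noise_estimation(prev, curr, speed_limit_kmh=20.0):
    """True when the vehicle state changes in a way that warrants re-estimation."""
    # Speed has just dropped to or below the limit.
    crossed_below = prev.speed_kmh > speed_limit_kmh >= curr.speed_kmh
    # Vehicle has just come close to a location flagged in the map data.
    entered_zone = (curr.near_flagged_intersection
                    and not prev.near_flagged_intersection)
    return crossed_below or entered_zone
```

Triggering on the transition keeps the noise estimator from being restarted on every frame while the vehicle stays below the speed limit.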

The mask storage unit 41 stores one or more preset masks M(ω, t). For example, masks are obtained through experiments or the like during product development and stored in the mask storage unit 41 in advance at the time of manufacture. The mask storage unit 41 also stores the mask M(ω, t) generated by the mask generation unit 24. As a specific example of a preset mask, the noise component while the wipers are operating is recorded in advance and a mask is generated from the recording. Alternatively, the engine sound of the vehicle traveling at a certain speed is recorded and a mask is generated from the recording. Such masks are stored in the mask storage unit 41 in advance.

The mask selection unit 42 selects one of the following (a) to (c) according to the situation.
(a) A mask generated on the spot
(b) A mask stored in advance in the mask storage unit 41 at the time of product manufacture
(c) No mask (that is, a mask in which all elements of M(ω, t) are always 1)

As described above, the mask of (a) is generated using the observation signals X₁(ω, t) and X₂(ω, t) acquired on the spot and the noise N(ω, t) estimated from them. The mask of (a) therefore reflects the current environment and the operating state of the vehicle 30. The mask storage unit 41, on the other hand, stores in advance masks that do not depend on the observation signals acquired on the spot.

The mask selection unit 42 selects one of the masks (a) to (c) above based on the vehicle signals. For example, it switches between a mask for rainy weather and a mask for fine weather depending on whether the wiper switch is on or off. Specifically, the mask of (b) can be used in rainy weather and the mask of (a) in fine weather. Likewise, it switches between a night mask and a daytime mask depending on whether the headlights are on or off. Masks may also be switched according to the characteristics of the location, such as urban or suburban areas, based on map information and position information from GPS. In this way, the mask selection unit 42 selects the mask best suited to the operating conditions of the vehicle 30.
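The wiper example above can be written as a small selection function. This is a sketch under assumptions: the function name `select_mask`, its parameters, and the fallback order (stored mask when the wipers are on, otherwise the freshly generated mask, otherwise an all-ones mask) are illustrative and not prescribed by the text.

```python
def select_mask(wiper_on, generated_mask, stored_rain_mask, n_bins):
    """Choose among masks (a), (b), and (c) from a single vehicle signal."""
    if wiper_on and stored_rain_mask is not None:
        return stored_rain_mask      # (b) mask stored at manufacture
    if generated_mask is not None:
        return generated_mask        # (a) mask generated on the spot
    return [1.0] * n_bins            # (c) no masking: every element is 1
```

Other vehicle signals mentioned in the text, such as the headlight state or the location type, could be handled by extending the same pattern with additional stored masks.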

As described above, the noise estimator 23 is activated based on vehicle information indicating the state of the vehicle 30. A noise model, that is, the mask M(ω, t), can therefore be generated dynamically in response to changes in the state of the vehicle 30. Even when the noise around the vehicle 30 changes from moment to moment, the mask M(ω, t) can be generated at an appropriate timing, so the direction of the sound source can be estimated accurately. Furthermore, the masks stored in the mask storage unit 41 and the mask generated on the spot are used selectively according to the vehicle signals, which allows the sound source to be estimated still more accurately. Mounting the sound source estimation device on the vehicle 30 makes it possible to recognize, at an intersection or the like, a vehicle approaching from a side road in a blind spot.

Although the above description deals with an example in which the sound source estimation device is mounted on the vehicle 30, which is an automobile, the moving body on which the device is mounted is not particularly limited. For example, the sound source estimation device may be mounted on a motorcycle, a mobile robot, or the like. Mounting it on a mobile robot allows the robot to turn toward the user's voice or to detect abnormal sounds.

The sound source estimation processing described above may be realized by causing a computer that includes a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), a CPU (Central Processing Unit), or a combination of these to execute a program.

In the above example, the program containing the instructions that cause a computer to perform the sound source estimation processing can be stored on various types of non-transitory computer readable media and supplied to the computer. Non-transitory computer readable media include various types of tangible storage media. Examples include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memories (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)). The program may also be supplied to the computer through various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the computer over a wired communication path, such as an electric wire or optical fiber, or over a wireless communication path.

DESCRIPTION OF SYMBOLS
11 Microphone
12 Microphone
13 Microphone amplifier
14 Microphone amplifier
15 A/D converter
16 CPU
21 Short-time DFT unit
22 Short-time DFT unit
23 Noise estimator
24 Mask generation unit
25 Reliability generation unit
26 CSP coefficient calculation unit
27 Time difference estimation unit
28 Direction estimation unit
29 Determination unit
30 Vehicle
31 Vehicle signal acquisition unit
32 Noise estimator activation unit
41 Mask storage unit
42 Mask selection unit

Claims

A sound source estimation apparatus that estimates the direction of a sound source using observation signals acquired by at least two microphones, the apparatus comprising:
a noise estimation unit that estimates a noise component included in the observation signals;
a mask generation unit that generates a mask based on the noise component estimated by the noise estimation unit;
a reliability calculation unit that calculates the reliability of the mask generated by the mask generation unit;
a CSP coefficient calculation unit that calculates a CSP coefficient corrected by the mask and the reliability of the mask; and
an estimation unit that estimates the direction of the sound source based on the corrected CSP coefficient.

The sound source estimation apparatus according to claim 1, wherein
an arrival time difference with which sound arrives at the two microphones is calculated based on an index that maximizes the corrected CSP coefficient,
the estimation unit estimates the direction of the sound source based on the arrival time difference, and
whether a target sound source exists in the estimated direction is determined according to the maximum value of the corrected CSP coefficient.

The sound source estimation apparatus according to claim 1 or 2, wherein
the mask takes discrete values according to frequency, and
the corrected CSP coefficient is calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

The sound source estimation apparatus according to claim 3, wherein the corrected CSP coefficient is calculated by multiplying the inverse Fourier transform value by the reliability.

A moving body equipped with the sound source estimation apparatus according to any one of claims 1 to 4,
wherein the noise estimation unit performs noise estimation based on a vehicle signal corresponding to an operating state of the moving body.

The moving body according to claim 5, further comprising:
a mask storage unit that stores a mask in advance; and
a mask selection unit that selects, based on a vehicle signal corresponding to an operating state of the moving body, one of the mask stored in the mask storage unit and the mask generated by the mask generation unit.

A sound source estimation method for estimating the direction of a sound source using observation signals acquired by at least two microphones, the method comprising:
estimating a noise component included in the observation signals;
generating a mask based on the noise component;
calculating a reliability of the mask;
calculating a CSP coefficient corrected by the mask and the reliability of the mask; and
estimating the direction of the sound source based on the corrected CSP coefficient.

The sound source estimation method according to claim 7, wherein
an arrival time difference with which sound arrives at the two microphones is calculated based on an index that maximizes the corrected CSP coefficient,
the estimation unit estimates the direction of the sound source based on the arrival time difference, and
whether a target sound source exists in the estimated direction is determined according to the maximum value of the corrected CSP coefficient.

The sound source estimation method according to claim 7, wherein
the mask takes discrete values according to frequency, and
the corrected CSP coefficient is calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

The sound source estimation method according to claim 9, wherein the corrected CSP coefficient is calculated by multiplying the inverse Fourier transform value by the reliability.

A sound source estimation program for estimating the direction of a sound source using observation signals acquired by at least two microphones, the program causing a computer to execute:
estimating a noise component included in the observation signals;
generating a mask based on the noise component;
calculating a reliability of the mask;
calculating a CSP coefficient corrected by the mask and the reliability of the mask; and
estimating the direction of the sound source based on the corrected CSP coefficient.

The sound source estimation program according to claim 11, causing the computer to operate such that
an arrival time difference with which sound arrives at the two microphones is calculated based on an index that maximizes the corrected CSP coefficient,
the estimation unit estimates the direction of the sound source based on the arrival time difference, and
whether a target sound source exists in the estimated direction is determined according to the maximum value of the corrected CSP coefficient.

The sound source estimation program according to claim 11 or 12, wherein
the mask takes discrete values according to frequency, and
the corrected CSP coefficient is calculated by applying an inverse Fourier transform to the product of the mask and the cross-correlation of the observation signals from the two microphones.

The sound source estimation program according to claim 13, wherein the corrected CSP coefficient is calculated by multiplying the inverse Fourier transform value by the reliability.