JP2016142914A

JP2016142914A - Wiener filter design device, speech enhancement device, wiener filter design method and program

Info

Publication number: JP2016142914A
Application number: JP2015018532A
Authority: JP
Inventors: 健太丹羽; Kenta Niwa; 和則小林; Kazunori Kobayashi; 悠馬小泉; Yuma Koizumi
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-02-02
Filing date: 2015-02-02
Publication date: 2016-08-08
Anticipated expiration: 2035-02-02
Also published as: JP6053202B2

Abstract

PROBLEM TO BE SOLVED: To provide a Wiener filter design device capable of designing a Wiener filter that realizes speech enhancement in which only a target sound in a narrow range is picked up. SOLUTION: The Wiener filter design device comprises: a sudden PSD estimation unit 203 that estimates the power spectral density of a sudden target sound on the basis of the temporal change of the local power spectral density of an observation signal; and a Wiener filtering unit 204 that designs a Wiener filter using the power spectral density of the sudden target sound. SELECTED DRAWING: Figure 4

Description

The present invention relates to a Wiener filter design device that designs a filter for clearly extracting a specific sound source signal, a speech enhancement device that clearly extracts a specific sound source signal, a Wiener filter design method, and a program.

Non-Patent Document 1, for example, describes a speech enhancement method that combines beamforming, which emphasizes an arbitrary direction, with a nonlinear Wiener filter. The method of Non-Patent Document 1 is referred to here as the speech enhancement method based on local PSD (power spectral density) estimation.

<Speech enhancement device 10 of Non-Patent Document 1 (speech enhancement method based on local PSD estimation)>
The speech enhancement device of Non-Patent Document 1, which realizes speech enhancement based on local PSD estimation, is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the speech enhancement device 10 of Non-Patent Document 1. FIG. 2 is a flowchart showing its operation. As shown in FIG. 1, the speech enhancement device 10 includes a microphone array 100 composed of M microphone elements (M ≧ 2, m = 1, ..., M), a frequency domain conversion unit 101, a beamforming unit 102, a local PSD estimation unit 103, a Wiener filtering unit 104, and a time domain conversion unit 105. The microphone array 100 observes M-channel observation signals from K sound sources (K ≧ 1, k = 1, ..., K) (S100). The frequency domain conversion unit 101 converts the observation signals from the microphone array 100 into the frequency domain by means such as an FFT (S101). When the transfer characteristic between the m-th microphone and the k-th sound source is denoted A_m,k(ω) and the k-th sound source signal is denoted S_k(ω,τ), the m-th received signal is modeled by the following equation.
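The equation itself is not present in this text. Given the definitions of A_m,k(ω) and S_k(ω,τ), a plausible form of the observation model, Equation (1), is a sum over the K sources. The symbol X_m(ω,τ) for the m-th received signal is an assumption (the text uses X for the received signal only in Embodiment 2), and any additional noise term in the original cannot be recovered here.

```latex
X_m(\omega,\tau) = \sum_{k=1}^{K} A_{m,k}(\omega)\, S_k(\omega,\tau), \qquad m = 1,\dots,M
```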

Here, ω denotes the frequency and τ denotes the frame. The beamforming unit 102 multiplies the observation signals converted into the frequency domain by linear filters and outputs speech enhanced for each direction of arrival (S102). The beamforming unit 102 includes not only a filter that emphasizes the target sound but a total of L filters (L ≧ 2, l = 1, ..., L), so that the PSD of the noise can be estimated, and it computes multiple filter outputs. Assuming that the l-th filter has a sensitivity characteristic that emphasizes the direction θ_l, and that the L filters, one per direction of arrival, are applied, the l-th output of the beamforming unit 102 is calculated by the following equation.
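The filter bank described above can be sketched for a single time-frequency bin as follows. This is a minimal illustration, assuming each output is the plain weighted sum Y_l = Σ_m W_l,m X_m (some beamforming conventions conjugate the weights instead); the function name `beamform` and the toy coefficients are hypothetical and do not come from the patent.

```python
def beamform(weights, x):
    """Apply L fixed beamformers to one time-frequency bin.

    weights: L rows of M complex coefficients W[l][m] (hypothetical values).
    x:       M complex observations X_m(omega, tau) for this bin.
    Returns the L outputs Y_l = sum_m W[l][m] * X[m].
    """
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each filter needs one coefficient per microphone")
    return [sum(w * xm for w, xm in zip(row, x)) for row in weights]


# Toy bin with M = 2 microphones and L = 2 beams.
W = [[0.5, 0.5],    # beam 1: in-phase sum
     [0.5, -0.5]]   # beam 2: difference
X = [1 + 1j, 1 - 1j]
Y = beamform(W, X)
```

In a full system this would run for every frequency bin and frame, with one weight matrix per frequency.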

W_l,m(ω) is the filter coefficient of the l-th beamformer, and D_l,k(ω) is the sensitivity of the l-th beamformer to the k-th sound source. Assuming that the sound source signals are mutually uncorrelated, the PSD of the l-th beamforming output is modeled by the following equation.

Here, <・> denotes the expectation operator and φ_Sk(ω) denotes the PSD of the k-th sound source signal. Assuming that the relationship of Equation (3) also holds between the groups of sound source signals arriving from local spaces (local sound source signals) and the beamforming outputs, φ_Yθl(ω) is expressed by Equation (4).

Here, D_l,θl(ω) denotes the average sensitivity of the beamformer to the local space centered on the direction θ_l, and φ_Sθl(ω) denotes the PSD of the l-th local sound source signal (local PSD). The relationship between the L values of φ_Sθl(ω) and the L values of φ_Yθl(ω) is expressed by the following equation.
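Equations (3) to (5) are not present in this text. The block below is a hedged reconstruction from the surrounding definitions, not the patent's own typesetting. It assumes that the sensitivities enter as squared magnitudes, that the second index of the average sensitivity runs over all L local spaces, and that the sensitivity matrix D(ω) collects these power sensitivities. The first line follows directly from the stated assumption that the sources are uncorrelated.

```latex
\begin{aligned}
\phi_{Y_{\theta_l}}(\omega) &= \left\langle |Y_{\theta_l}(\omega,\tau)|^2 \right\rangle
  = \sum_{k=1}^{K} |D_{l,k}(\omega)|^2\, \phi_{S_k}(\omega) && \text{(3)} \\
\phi_{Y_{\theta_l}}(\omega) &\approx \sum_{l'=1}^{L} |D_{l,\theta_{l'}}(\omega)|^2\, \phi_{S_{\theta_{l'}}}(\omega) && \text{(4)} \\
\begin{bmatrix} \phi_{Y_{\theta_1}}(\omega) \\ \vdots \\ \phi_{Y_{\theta_L}}(\omega) \end{bmatrix}
&= \underbrace{\begin{bmatrix}
|D_{1,\theta_1}(\omega)|^2 & \cdots & |D_{1,\theta_L}(\omega)|^2 \\
\vdots & \ddots & \vdots \\
|D_{L,\theta_1}(\omega)|^2 & \cdots & |D_{L,\theta_L}(\omega)|^2
\end{bmatrix}}_{D(\omega)}
\begin{bmatrix} \phi_{S_{\theta_1}}(\omega) \\ \vdots \\ \phi_{S_{\theta_L}}(\omega) \end{bmatrix} && \text{(5)}
\end{aligned}
```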

Hereinafter, the non-negative matrix D(ω) is referred to as the sensitivity matrix. The local PSD estimation unit 103 estimates the L local PSDs (one per direction of arrival) by solving the inverse problem of Equation (5) (S103). Specifically, to improve noise suppression performance, the local PSD estimation unit 103 estimates the local PSD Φ^_S(ω,τ) for every frame, as shown in Equation (6).

Here, the following equation holds.

Hereinafter, G(ω) is referred to as the PSD estimation matrix. The Wiener filtering unit 104 uses the estimated local PSDs to design, based on Equation (8), a Wiener filter H_l(ω,τ) for enhancing the l-th local sound source signal (S104).

Using the Wiener filter designed from the estimated local PSDs, the Wiener filtering unit 104 calculates the noise-suppressed output signal Z_θl(ω,τ) based on Equation (9) (S104).
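Steps S103 and S104 can be sketched for one bin as follows. This is an illustration under stated assumptions, not the patent's exact Equations (6) to (9), which are not present in this text. It assumes the sensitivity matrix is square (L beams, L local spaces) and invertible, solves the linear system directly instead of precomputing a PSD estimation matrix G(ω), clips negative estimates to zero, and uses the textbook Wiener gain with the other local PSDs as the noise term. All function names and numbers are hypothetical.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("sensitivity matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def local_psds(sens, y):
    """Estimate local PSDs for one bin: solve D * phi_S = |Y|^2 and clip at zero."""
    power = [abs(v) ** 2 for v in y]
    return [max(p, 0.0) for p in solve(sens, power)]


def wiener_gain(psds, target):
    """Gain phi_target / sum of all local PSDs; 0 when the bin carries no power."""
    total = sum(psds)
    return psds[target] / total if total > 0.0 else 0.0


# Toy sensitivity matrix for L = 2 beams (hypothetical values).
D = [[1.0, 0.25],
     [0.25, 1.0]]
Y = [2.0 + 0.5j, 1.0 + 1.0j]      # beamformer outputs for this bin
phi = local_psds(D, Y)            # per-frame local PSD estimates
H = wiener_gain(phi, target=0)    # gain for the first local space
Z = H * Y[0]                      # noise-suppressed output for this bin
```

Solving per frame keeps the sketch short; since D(ω) does not depend on the frame, a real implementation would more likely precompute its inverse or pseudo-inverse once per frequency.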

The time domain conversion unit 105 converts the frequency domain signal into the time domain by means such as an IFFT and outputs the converted signal (S105).

Non-Patent Document 1: K. Niwa, Y. Hioka, and K. Kobayashi, “Post-filter design for speech enhancement in various noisy environments,” in Proc. IWAENC 2014, pp. 36-40, 2014.

With reference to FIG. 3, the problem with the speech enhancement method based on local PSD estimation is considered. FIG. 3 is a diagram for considering that problem. Suppose the method is applied to signals observed with about 2 to 4 microphones. In this case, it is known that speech can be enhanced so as to separate a target local space with an angular width of about 60 to 80 degrees from the other local spaces. With this method, however, it is difficult to pick up a sound source arriving from a range narrower than the angular width of the target local space separately from the other sound sources. For example, as shown in FIG. 3, consider separating sounds on a soccer field (for example, a kick sound) from other noise (for example, the cheering of the surrounding supporters) using a microphone array 100 composed of about 2 to 4 microphones. With this configuration, the supporters' cheering 5b mixed into the other local spaces 8 (the densely dot-hatched regions) can be suppressed, but it is difficult to suppress the supporters' cheering 5f that is mixed at a high level into the target local space 7 (the horizontally hatched region) containing the target player. It is therefore difficult to extract only the target sound in the narrow local space 6 (the lightly dot-hatched region) that contains only the target player.

An object of the present invention is therefore to provide a Wiener filter design device capable of designing a Wiener filter that realizes speech enhancement in which only a target sound in a narrow range is clearly picked up.

The Wiener filter design device of the present invention includes a sudden PSD estimation unit and a Wiener filtering unit.

The sudden PSD estimation unit estimates the power spectral density of a target sound having suddenness based on the temporal change of the local power spectral density of the observation signal. The Wiener filtering unit designs a Wiener filter using the power spectral density of the target sound having suddenness.

According to the Wiener filter design device of the present invention, it is possible to design a Wiener filter that realizes speech enhancement in which only a target sound in a narrow range is clearly picked up.

FIG. 1 is a block diagram showing the configuration of the speech enhancement device of Non-Patent Document 1.
FIG. 2 is a flowchart showing the operation of the speech enhancement device of Non-Patent Document 1.
FIG. 3 is a diagram for considering the problem with the speech enhancement method based on local PSD estimation.
FIG. 4 is a block diagram showing the configuration of the speech enhancement device and the Wiener filter design device of Embodiment 1.
FIG. 5 is a flowchart showing the operation of the speech enhancement device and the Wiener filter design device of Embodiment 1.
FIG. 6 is a diagram explaining an example of the operation of the sudden PSD estimation unit of Embodiment 1.
FIG. 7 is a diagram conceptually explaining the difference between conventional local PSD estimation and the sudden PSD estimation of this embodiment.
FIG. 8 is a block diagram showing the configuration of the speech enhancement device and the Wiener filter design device of Embodiment 2.
FIG. 9 is a flowchart showing the operation of the speech enhancement device and the Wiener filter design device of Embodiment 2.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate description is omitted.

The speech enhancement device 20 and the Wiener filter design device 2 of Embodiment 1 are described below with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing the configuration of the speech enhancement device 20 and the Wiener filter design device 2 of this embodiment. FIG. 5 is a flowchart showing the operation of the speech enhancement device 20 and the Wiener filter design device 2 of this embodiment.

Conventional array signal processing has tried to separate the target sound from other noise using only spatial information. With this approach, unless the number of microphones is increased, it is difficult to improve the spatial resolution and to pick up a target sound in a narrow range clearly. The present invention focuses on the fact that the target sound is often a highly sudden sound (for example, the soccer kick sound in the example of FIG. 3). In this application, a "highly sudden sound" or a "sound having suddenness" refers to an acoustic signal that, compared in terms of power spectral density with the preceding acoustic signal observed over a certain time width before the sound, produces a change (an increase in level) of at least a certain threshold in an arbitrary frequency band, and that changes steeply in time so that this change decays within a time width no longer than that certain time width. For example, it refers to a sound for which the level difference in power spectral density between the previous time frame and the current time frame shows a change (increase and decrease) of 6 dB or more in an arbitrary frequency band. The larger the change, the "higher the suddenness" is said to be. This embodiment realizes a speech enhancement device 20 that can clearly pick up a (sudden) target sound in a narrow range based on temporal information without increasing the number of microphones, and a Wiener filter design device 2 that can design the Wiener filter used for such speech enhancement.

As shown in FIG. 4, the speech enhancement device 20 of this embodiment includes a microphone array 100 composed of M microphone elements (M ≧ 2, m = 1, ..., M), a frequency domain conversion unit 101, a beamforming unit 102, a local PSD estimation unit 103, a sudden PSD estimation unit 203, a Wiener filtering unit 204, and a time domain conversion unit 105. The components other than the sudden PSD estimation unit 203 and the Wiener filtering unit 204 are the same as those of the speech enhancement device 10 of Non-Patent Document 1. Among the components of the speech enhancement device 20, the sudden PSD estimation unit 203 and the Wiener filtering unit 204 may be configured as a separate device, the Wiener filter design device 2. In that case, the speech enhancement device 20 may be configured by connecting a first device that includes the microphone array 100, the frequency domain conversion unit 101, the beamforming unit 102, and the local PSD estimation unit 103, the Wiener filter design device 2 (second device), and a third device that includes the time domain conversion unit 105. Alternatively, the first device and the third device may be configured as a single device (fourth device), and the speech enhancement device 20 may be configured by connecting the fourth device and the Wiener filter design device 2. Only the components that differ from the speech enhancement device 10 of Non-Patent Document 1 are described below.
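As a worked example of the 6 dB criterion in the definition of suddenness above: assuming the level is measured as 10 log10 of the PSD (the usual convention for power quantities, which the text does not spell out), a 6 dB rise between consecutive frames corresponds to roughly a fourfold increase in power. The symbol φ(ω,τ) here stands for the PSD at frequency ω and frame τ.

```latex
10\log_{10}\frac{\phi(\omega,\tau)}{\phi(\omega,\tau-1)} \ge 6\ \text{dB}
\iff \frac{\phi(\omega,\tau)}{\phi(\omega,\tau-1)} \ge 10^{0.6} \approx 3.98
```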

<Sudden PSD estimation unit 203>
The sudden PSD estimation unit 203 estimates the PSD of a target sound having suddenness (sudden PSD) based on the temporal change of the local power spectral density for each direction of arrival of the observation signal (S203). More specifically, the sudden PSD estimation unit 203 splits the target sound PSD φ^_Sθl(ω,τ), estimated using only spatial information, into the sudden PSD φ^_SSθl(ω,τ) and the remaining local PSD φ^_SNθl(ω,τ), and outputs them. An example of the operation of the sudden PSD estimation unit 203 is described with reference to FIGS. 6 and 7. FIG. 6 is a diagram explaining an example of the operation of the sudden PSD estimation unit 203 of this embodiment. FIG. 7 is a diagram conceptually explaining the difference between conventional local PSD estimation and the sudden PSD estimation of this embodiment.

FIG. 6 is a graph of the local PSD output level observed during a soccer match, with time on the horizontal axis and the local PSD output level on the vertical axis. In the example of FIG. 6, the observed signal contains not only the (sudden) target sound but also noise arriving from other directions. Suppose that a target sound (a sudden sound, for example a kick sound) with output level p1 is observed at time t1 in FIG. 6. Let ts1 be the time interval extending from immediately before time t1 back over a predetermined number of frames, and let ps1 be the maximum output level in ts1. Similarly, suppose that a local maximum of the output level (for example, a peak in the cheering) is observed with output level p2 at time t2. Let ts2 be the time interval extending from immediately before time t2 back over a predetermined number of frames, and let ps2 be the maximum output level in ts2. As shown in FIG. 7, the local PSD estimation unit 103 can separately estimate the PSD of a local space with an angular width of 60 to 80 degrees (the horizontally hatched region in the left-hand diagram; hereinafter denoted φ^_Sθl(ω,τ)) and the PSD of the other spaces (the densely dot-hatched regions; hereinafter denoted Σ^L_i=1,i≠l φ^_Sθi(ω,τ)). However, it is difficult for it to estimate the PSD of the space containing the (sudden) target sound (the lightly dot-hatched region; hereinafter denoted φ^_SSθl(ω,τ)) and the PSD of the rest of the 60-to-80-degree local space (the horizontally hatched region in the right-hand diagram; hereinafter denoted φ^_SNθl(ω,τ)). The sudden PSD estimation unit 203 therefore determines from the temporal change of the signal that a sudden signal has arrived (as described below) and splits φ^_Sθl(ω,τ) into φ^_SSθl(ω,τ) and φ^_SNθl(ω,τ) (S203). As shown in FIG. 6, the sudden PSD estimation unit 203 calculates the ratio Λ(ω,τ) between the maximum local PSD (for example, ps1 or ps2) over a fixed interval (for example, ts1 or ts2) up to immediately before the target time and the local PSD at the target time (for example, p1 or p2), and determines that a sudden signal has arrived when Λ(ω,τ) exceeds a threshold Thr(ω) (S203). Λ(ω,τ) can be expressed as follows.

Here, Δ denotes the frame index within the fixed interval (for example, ts1 or ts2) up to immediately before the target time. An interval of roughly 3 to 5 seconds may be set, depending on the sound field. When comparing Λ(ω,τ) with Thr(ω), the sudden PSD estimation unit 203 preferably uses values averaged over a frequency band of some width, because this makes the decision more robust to noise. The band may be chosen as one in which large values of Λ(ω,τ) tend to appear. In what follows, the band-wise comparison is written simply in the form Λ(ω,τ)<Thr(ω).
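The detection rule above can be sketched as follows. This is an illustration under stated assumptions: Λ is taken as the current local PSD divided by the maximum over the preceding window (so that a sharp rise gives a large value, consistent with a sudden signal being declared when Λ is at or above the threshold), and the band-wise comparison uses a plain arithmetic mean of the per-bin ratios. The function names, the toy values, and the threshold of 4.0 are hypothetical.

```python
def burst_ratio(history, current, eps=1e-12):
    """Lambda for one bin: current local PSD over the max of the preceding window."""
    if not history:
        raise ValueError("history must contain at least one past frame")
    return current / max(max(history), eps)


def band_is_burst(histories, currents, threshold):
    """Average Lambda over the bins of a band and compare it with the threshold."""
    ratios = [burst_ratio(h, c) for h, c in zip(histories, currents)]
    return sum(ratios) / len(ratios) >= threshold


# Two bins of one band, four past frames each (toy numbers).
past = [[1.0, 2.0, 1.5, 1.0],
        [0.5, 0.5, 1.0, 0.8]]
kick = [8.0, 5.0]     # sharp rise above the recent maximum
cheer = [2.2, 0.9]    # level close to the recent maximum

kick_detected = band_is_burst(past, kick, threshold=4.0)
cheer_detected = band_is_burst(past, cheer, threshold=4.0)
```

With a window of 3 to 5 seconds, `history` would hold a few hundred frames at typical STFT hop sizes, so a running maximum would be a natural optimization.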

(i) When Λ(ω,τ)<Thr(ω)
In this case, the sudden PSD estimation unit 203 determines that no sudden sound source has been detected and sets the PSDs as follows.

(ii) When Λ(ω,τ)≧Thr(ω)
In this case, the sudden PSD estimation unit 203 determines that a sudden sound source has been detected and performs the split as follows.

Here, mean is the operator that computes the average value.
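The equations for the two cases are not present in this text, so the following is only one reading consistent with the description. It assumes that in case (i) the sudden PSD is zero and the whole local PSD is treated as non-sudden, and that in case (ii) the non-sudden part is the mean of the local PSD over the preceding window while the sudden part is the remainder. The function name `split_psd` and the numbers are hypothetical.

```python
def split_psd(history, current, threshold, eps=1e-12):
    """Split the target-direction local PSD into (sudden, non_sudden) parts.

    history: local PSDs of the preceding window for this bin.
    current: local PSD of the frame being processed.
    """
    ratio = current / max(max(history), eps)
    if ratio < threshold:
        # Case (i): no sudden source detected; everything counts as non-sudden.
        return 0.0, current
    # Case (ii): assume the non-sudden part continues at its recent mean level.
    background = min(sum(history) / len(history), current)
    return current - background, background


window = [1.0, 2.0, 1.5, 1.5]
kick_split = split_psd(window, 8.0, threshold=4.0)
cheer_split = split_psd(window, 2.2, threshold=4.0)
```

In both cases the two parts add up to the original local PSD, which keeps the split consistent with the spatial estimate it starts from.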

<Wiener filtering unit 204>
The Wiener filtering unit 204 designs a Wiener filter using the PSD of the target sound having suddenness (the sudden PSD φ^_SSθl(ω,τ)) (S204). More specifically, the Wiener filtering unit 204 designs the Wiener filter of the following equation using the three kinds of PSD described above (S204).

The Wiener filtering unit 204 applies the Wiener filter designed by Equation (11) to Y_θl(ω,τ), the output of the beamforming unit 102, to suppress noise or to emphasize the sudden target sound (S204).
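Equation (11) is not present in this text. A minimal sketch of a Wiener gain built from three PSDs is shown below, assuming the three are the sudden PSD, the non-sudden PSD of the target local space, and the summed PSD of the other local spaces, and assuming the textbook form in which the sudden PSD is the desired-signal term and the other two are noise terms. The function names and values are hypothetical.

```python
def sudden_wiener_gain(phi_ss, phi_sn, phi_other):
    """Gain phi_SS / (phi_SS + phi_SN + phi_other); 0 when the bin carries no power."""
    total = phi_ss + phi_sn + phi_other
    return phi_ss / total if total > 0.0 else 0.0


def enhance(y, phi_ss, phi_sn, phi_other):
    """Apply the gain to one beamformer output bin Y_theta_l(omega, tau)."""
    return sudden_wiener_gain(phi_ss, phi_sn, phi_other) * y


# Frame with a detected burst: most of the power is kept.
z_burst = enhance(2.0 + 1.0j, phi_ss=6.5, phi_sn=1.5, phi_other=2.0)
# Frame without a burst: the sudden PSD is zero, so the bin is muted.
z_quiet = enhance(2.0 + 1.0j, phi_ss=0.0, phi_sn=2.2, phi_other=2.0)
```

Under this form, frames without a detected burst are fully suppressed; whether the patent applies a gain floor or smoothing cannot be determined from the text.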

<Effects of the speech enhancement device 20 and the Wiener filter design device 2 of this embodiment>
Conventional speech enhancement devices using about 2 to 4 microphones have separated the target sound from other noise using only spatial information. It has therefore been difficult to pick up clearly only the target sound in a narrow range. According to the speech enhancement device 20 of this embodiment, exploiting a property of the target sound (its high suddenness) makes it possible to pick up clearly only the target sound in a narrow range without increasing the number of microphones. Likewise, according to the Wiener filter design device 2 of this embodiment, it is possible to design, without increasing the number of microphones, a Wiener filter that realizes speech enhancement in which only the target sound in a narrow range is clearly picked up.

The speech enhancement device 30 and the Wiener filter design device 3 of Embodiment 2 are described below with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the configuration of the speech enhancement device 30 and the Wiener filter design device 3 of this embodiment. FIG. 9 is a flowchart showing the operation of the speech enhancement device 30 and the Wiener filter design device 3 of this embodiment. As shown in FIG. 8, the speech enhancement device 30 of this embodiment includes a microphone 300, a frequency domain conversion unit 101, a sudden PSD estimation unit 203, a Wiener filtering unit 304, and a time domain conversion unit 105. The Wiener filter design device 3 of this embodiment includes the sudden PSD estimation unit 203 and the Wiener filtering unit 304. The speech enhancement device 30 of this embodiment is an example in which the microphone array 100 of the conventional example and Embodiment 1 is replaced with a single microphone 300, so that the beamforming unit 102 and the local PSD estimation unit 103 of the conventional example and Embodiment 1 are omitted. The frequency domain conversion unit 101 (S101), the sudden PSD estimation unit 203 (S203), and the time domain conversion unit 105 (S105) operate in the same way as in Embodiment 1, except that the signal has a single channel and the L filters described above are no longer present, so their description is omitted. Unlike in Embodiment 1, the Wiener filtering unit 304 multiplies the frequency-domain received signal X by the Wiener filter (S304).
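The single-microphone flow of Embodiment 2 can be sketched for one frequency bin as a frame-by-frame loop. This is an illustration only: the text does not give the single-channel filter, so the sketch assumes the PSD of each frame is |X|^2, reuses the ratio test and mean-based split in the form assumed for Embodiment 1, and uses the gain sudden / (sudden + non-sudden). The function name, window length, and threshold are hypothetical.

```python
from collections import deque


def enhance_bin(frames, window, threshold):
    """Process one frequency bin of a single-microphone STFT, frame by frame."""
    history = deque(maxlen=window)   # PSDs of the preceding frames
    out = []
    for x in frames:
        psd = abs(x) ** 2
        gain = 0.0
        # Only decide once the look-back window is full and non-silent.
        if len(history) == window and max(history) > 0.0:
            if psd / max(history) >= threshold:
                background = sum(history) / window      # non-sudden part
                sudden = max(psd - background, 0.0)     # sudden part
                gain = sudden / psd
        out.append(gain * x)
        history.append(psd)
    return out


# Amplitudes of one bin over five frames; the fourth frame is a burst.
frames = [1.0, 1.0, 1.0, 4.0, 1.0]
Z = enhance_bin(frames, window=3, threshold=4.0)
```

Note that the burst frame itself enters the look-back window afterwards and raises the maximum, so a second burst arriving within the window needs a correspondingly larger level to be detected.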

According to the speech enhancement device 30 of this embodiment, focusing on temporal information makes it possible to pick up clearly only the target sound in a narrow range (the sudden target sound) even when the sound is picked up with a single microphone. Likewise, according to the Wiener filter design device 3 of this embodiment, it is possible to design a Wiener filter that realizes speech enhancement in which only the target sound in a narrow range (the sudden target sound) is clearly picked up, even when the sound is picked up with a single microphone.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of that program (the program is not limited to the external storage device and may, for example, be stored in ROM, which is a read-only storage device). Data obtained by the processing of the program is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for its processing are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as required.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it; further, each time a program is transferred from the server computer to the computer, the computer may execute processing according to the received program in turn. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of this processing content may instead be realized in hardware.
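
As an illustration only, the processing that such a program would describe (estimating the PSD of a sudden target sound from the temporal change of the local PSD, then designing a Wiener filter from it) might be sketched in Python as follows. The estimator used here (the positive frame-to-frame increase of the local PSD) and the gain formula (the ratio of the estimated target PSD to the observed PSD, clipped to 1) are assumptions chosen for the sketch, not the formulas of the specification. The function names `estimate_sudden_psd` and `design_wiener_gain` and the parameter `eps` are likewise hypothetical.

```python
from typing import List

Spectrogram = List[List[float]]  # frames x frequency bins, non-negative PSD values


def estimate_sudden_psd(local_psd: Spectrogram) -> Spectrogram:
    """Estimate the PSD of a sudden target sound from the temporal change
    of the local PSD.

    Assumption for this sketch: the sudden component is taken to be the
    positive increase of the local PSD relative to the previous frame.
    The first frame has no predecessor, so its estimate is zero.
    """
    estimate: Spectrogram = []
    previous = None
    for frame in local_psd:
        if previous is None:
            estimate.append([0.0 for _ in frame])
        else:
            estimate.append(
                [max(cur - old, 0.0) for cur, old in zip(frame, previous)]
            )
        previous = frame
    return estimate


def design_wiener_gain(
    target_psd: Spectrogram, observed_psd: Spectrogram, eps: float = 1e-12
) -> Spectrogram:
    """Design a per-frame, per-bin Wiener gain.

    Assumption for this sketch: gain = target PSD / observed PSD,
    clipped to at most 1; eps guards against division by zero.
    """
    return [
        [min(t / (o + eps), 1.0) for t, o in zip(t_frame, o_frame)]
        for t_frame, o_frame in zip(target_psd, observed_psd)
    ]


if __name__ == "__main__":
    # Two frequency bins over three frames; bin 1 jumps in the second frame.
    local_psd = [[1.0, 1.0], [1.0, 5.0], [1.0, 5.0]]
    sudden = estimate_sudden_psd(local_psd)
    gain = design_wiener_gain(sudden, local_psd)
    for frame_gain in gain:
        print(frame_gain)
```

In this sketch the gain is nonzero only in the frame and bin where the local PSD rises abruptly, so stationary components are suppressed. A practical implementation would multiply the frequency-domain signal by this gain and convert the result back to the time domain.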

Claims

A Wiener filter design device comprising:
a sudden PSD estimation unit that estimates the power spectral density of a sudden target sound based on the temporal change of the local power spectral density of an observation signal; and
a Wiener filtering unit that designs a Wiener filter using the power spectral density of the sudden target sound.

A speech enhancement device comprising:
a microphone array composed of a plurality of microphone elements;
a sudden PSD estimation unit that estimates the power spectral density of a sudden target sound based on the temporal change of the local power spectral density of observation signals from the microphone array; and
a Wiener filtering unit that designs a Wiener filter using the power spectral density of the sudden target sound and enhances the sudden target sound using the designed Wiener filter.

The speech enhancement device according to claim 2, further comprising:
a frequency domain conversion unit that converts the observation signals from the microphone array into the frequency domain;
a beamforming unit that enhances the converted observation signals for each direction of arrival; and
a local PSD estimation unit that estimates the local power spectral density for each direction of arrival,
wherein the sudden PSD estimation unit estimates the power spectral density of the sudden target sound based on the temporal change of the local power spectral density for each direction of arrival.

A Wiener filter design method comprising:
a sudden PSD estimation step of estimating the power spectral density of a sudden target sound based on the temporal change of the local power spectral density of an observation signal; and
a Wiener filtering step of designing a Wiener filter using the power spectral density of the sudden target sound.

A program that causes a computer to function as the Wiener filter design device according to claim 1.

A program that causes a computer to function as the speech enhancement device according to claim 2 or 3.