JP5516169B2 - Sound processing apparatus and program

Info

Publication number
JP5516169B2
Authority
JP
Japan
Prior art keywords
noise
component
acoustic signal
matrix
coefficient
Prior art date
Legal status
Expired - Fee Related
Application number
JP2010159543A
Other languages
Japanese (ja)
Other versions
JP2012022120A (en)
Inventor
多伸 近藤
Current Assignee
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date
Filing date
Publication date
Application filed by Yamaha Corp
Priority to JP2010159543A
Publication of JP2012022120A
Application granted
Publication of JP5516169B2

Landscapes

  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)
  • Circuit For Audible Band Transducer (AREA)

Description

The present invention relates to a technique for suppressing a noise component contained in an acoustic signal.

Techniques for suppressing the noise component in an acoustic signal containing a mixture of a target sound component and a noise component have been proposed. For example, Patent Document 1 discloses a technique for generating a noise-suppressed signal in which wind noise is suppressed: for each of a plurality of acoustic signals, the component with the minimum intensity is selected from among the low-frequency component of each acoustic signal and the average of those low-frequency components, and the selected component is then combined with the high-frequency component of each acoustic signal.

Japanese Patent No. 4356670

However, in the technique of Patent Document 1, the component used to generate the noise-suppressed signal is selected on the basis of intensity alone, so the target sound component may be removed when, for example, its intensity is small compared with the wind noise. Moreover, when the average component of the plurality of acoustic signals is selected as the low-frequency component of the noise-suppressed signal, the waveform of the target sound component changes significantly in the course of generating the noise-suppressed signal, so the target sound component is not faithfully reproduced. In view of the above circumstances, an object of the present invention is to suppress the noise component of an acoustic signal with high accuracy.

The means employed by the present invention to solve the above problems will now be described. To facilitate understanding of the present invention, the following description indicates in parentheses the correspondence between each element of the present invention and the elements of the embodiments described later; this is not intended to limit the scope of the present invention to the illustrated embodiments.

The sound processing apparatus of the present invention comprises: matrix decomposition means (for example, matrix decomposition unit 44) that, for each of a first acoustic signal and a second acoustic signal collected in parallel, generates, by non-negative matrix factorization of an observation matrix (for example, observation matrix Vi) whose elements are the time series of component values for each frequency of that acoustic signal, a basis matrix (for example, basis matrix Wi) containing a plurality of bases (for example, bases Ci[1] to Ci[K]) each indicating the per-frequency component values of a different component of the acoustic signal, and a coefficient matrix (for example, coefficient matrix Hi) containing a plurality of weight sequences (for example, weight sequences Ei[1] to Ei[K]) each indicating the time series of weight values of the corresponding basis; noise identification means (for example, noise identification unit 46) that identifies, from among the plurality of bases of the basis matrix of the first acoustic signal, a basis having a high correlation with a basis of the basis matrix of the second acoustic signal as a noise basis (for example, noise basis Ci_noise) corresponding to the noise component of the first acoustic signal; target sound extraction means (for example, target sound extraction unit 52) that generates an estimated target sound component (for example, spectrum YTi of estimated target sound signal qTi), in which the noise component is suppressed from the first acoustic signal, using the bases of the basis matrix other than the noise basis and the weight sequences of the coefficient matrix corresponding to bases other than the noise basis; noise extraction means (for example, noise extraction unit 54) that generates an estimated noise component (for example, spectrum YNi of estimated noise signal qNi), in which the target sound component is suppressed from the first acoustic signal, using the noise basis and the weight sequence of the coefficient matrix corresponding to that noise basis (for example, weight sequence Ei_noise); harmonic component extraction means (for example, harmonic component extraction unit 64) that extracts, from the estimated noise component, a residual component (for example, spectrum Ri of the residual component) corresponding to the harmonic structure of the target sound component; and target sound synthesis means (for example, target sound synthesis unit 66) that synthesizes the estimated target sound component and the residual component.

In the above configuration, the observation matrix of each of the first acoustic signal and the second acoustic signal is decomposed into a basis matrix and a coefficient matrix, and the estimated target sound component is extracted after excluding, from among the plurality of bases of the basis matrix of the first acoustic signal, the noise basis that is highly correlated with a basis of the basis matrix of the second acoustic signal. Therefore, the noise component can be suppressed with high accuracy even when the intensity of the target sound component of the first acoustic signal is low compared with the noise component. Furthermore, since the residual component corresponding to the harmonic structure of the target sound component is extracted from the estimated noise component and combined with the estimated target sound component, loss of the target sound component can be effectively prevented even when part of the target sound component (the residual component) remains in the estimated noise component. Moreover, since the residual component is extracted from the estimated noise component using the harmonic structure as a clue, the residual component can be extracted with high accuracy even when its intensity is low relative to the noise component. The scope of application of the present invention is not limited to configurations that process two channels of acoustic signals. That is, even in a configuration that processes three or more channels of acoustic signals, a configuration that satisfies the requirements of the present invention when attention is paid to two particular channels is naturally included within the scope of the present invention.

In a preferred aspect of the present invention, the harmonic component extraction means includes: frequency estimation means (for example, frequency estimation unit 72) that estimates the fundamental frequency of the target sound component; harmonic coefficient sequence generation means (for example, harmonic coefficient sequence generation unit 74) that generates a harmonic coefficient sequence in which each coefficient value is set so that, within the estimated noise component, harmonic components at frequencies that are integer multiples of the fundamental frequency are emphasized; and harmonic extraction means (for example, harmonic extraction unit 78) that extracts the residual component by applying the harmonic coefficient sequence to the estimated noise component. In this aspect, the residual component is extracted from the estimated noise component by applying a harmonic coefficient sequence generated according to the fundamental frequency of the target sound component estimated by the frequency estimation means. An appropriate residual component can therefore be extracted according to the frequency characteristics (harmonic structure) of the acoustic signal.
In a further preferred aspect, the frequency estimation means estimates the fundamental frequency of the estimated target sound component generated by the target sound extraction means. In this aspect, since the fundamental frequency is estimated from the estimated target sound component in which the noise component has been suppressed, the fundamental frequency of the target sound component can be estimated with higher accuracy than when it is estimated while the noise component is still mixed in. However, a method of estimating the fundamental frequency of the target sound component from the first acoustic signal or the second acoustic signal, in which the target sound component and the noise component are mixed, may also be adopted. The method of estimating the fundamental frequency (for example, processing in the frequency domain or in the time domain) is arbitrary.
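By way of illustration only (this is not part of the patented disclosure), the following is a minimal NumPy sketch of one way such a harmonic coefficient sequence could be built and applied to a single frame of the estimated noise spectrum; the Gaussian peak shape, bandwidth, and number of harmonics are assumptions chosen for the example.

```python
import numpy as np

def harmonic_mask(freqs, f0, n_harmonics=10, bandwidth_hz=20.0):
    """Coefficient sequence that emphasizes bins near integer multiples of f0."""
    mask = np.zeros_like(freqs)
    for h in range(1, n_harmonics + 1):
        mask = np.maximum(mask, np.exp(-0.5 * ((freqs - h * f0) / bandwidth_hz) ** 2))
    return mask  # values in [0, 1], peaking at the harmonics

def extract_residual(noise_frame_power, freqs, f0):
    """Apply the harmonic coefficient sequence to one frame of the estimated noise."""
    return harmonic_mask(freqs, f0) * noise_frame_power  # residual component Ri
```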

A sound processing apparatus according to a preferred aspect of the present invention comprises: first frequency analysis means (for example, frequency analysis unit 42) that generates the time series of spectra of the first acoustic signal and the second acoustic signal as the observation matrices under a first analysis parameter (for example, window width ωA and hop size δA); second frequency analysis means (for example, frequency analysis unit 62) that sequentially generates the spectra of the estimated target sound component and the estimated noise component under a second analysis parameter (for example, window width ωB and hop size δB) that differs from the first analysis parameter; and coefficient sequence correction means (for example, coefficient sequence correction unit 76) that generates a corrected coefficient sequence (for example, corrected coefficient sequence GBi) in which a coefficient value is set for each analysis point (for example, analysis point p2) arranged on the time axis and the frequency axis at intervals corresponding to the second analysis parameter. The noise extraction means generates, from the noise basis and the weight sequence corresponding to that noise basis, a noise coefficient sequence (for example, noise coefficient sequence GNi) in which a coefficient value is set for each analysis point (for example, analysis point p1) arranged on the time axis and the frequency axis at intervals corresponding to the first analysis parameter, and generates the estimated noise component by applying the noise coefficient sequence to the observation matrix. The coefficient sequence correction means generates the corrected coefficient sequence from the noise coefficient sequence, and the harmonic coefficient sequence generation means generates the harmonic coefficient sequence by extracting, from the corrected coefficient sequence, the components at frequencies that are integer multiples of the fundamental frequency. In this aspect, since the noise coefficient sequence used to extract the estimated noise component is reused to generate the harmonic coefficient sequence, the processing load required to extract the residual component is reduced compared with a configuration that does not use the noise coefficient sequence for that purpose. Moreover, since the noise coefficient sequence corresponding to the first analysis parameter is corrected into the corrected coefficient sequence corresponding to the second analysis parameter before being applied to the generation of the harmonic coefficient sequence (and further to the extraction of the residual component), an appropriate harmonic coefficient sequence for extracting the residual component can be generated even when the first analysis parameter and the second analysis parameter differ. Therefore, for example, the first analysis parameter can be set to values optimal for non-negative matrix factorization, while the second analysis parameter can be set to values optimal for estimating the fundamental frequency and for synthesizing the residual component with the estimated target sound component.
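As an illustration, a minimal sketch of the coefficient-sequence correction as nearest-neighbour resampling of GNi from the stage-1 analysis grid (tA, fA) onto the stage-2 grid (tB, fB); the interpolation rule is an assumption, since the text only requires that the values be re-expressed on the second grid.

```python
import numpy as np

def correct_coefficients(GN, tA, fA, tB, fB):
    """Resample the noise coefficient sequence GN (len(fA) x len(tA)) onto the
    time/frequency grid (tB, fB) used by the second frequency analysis stage."""
    fi = np.abs(fB[:, None] - fA[None, :]).argmin(axis=1)   # nearest frequency bin
    ti = np.abs(tB[:, None] - tA[None, :]).argmin(axis=1)   # nearest time frame
    return GN[np.ix_(fi, ti)]                               # corrected sequence GBi
```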

A sound processing apparatus according to a preferred aspect of the present invention comprises phase difference calculation means (for example, phase difference calculation unit 582) that calculates a phase difference (for example, phase difference ΔP[nA]) between the first acoustic signal and the second acoustic signal, and the target sound extraction means generates, from the bases of the basis matrix other than the noise basis and the weight sequences of the coefficient matrix corresponding to bases other than the noise basis, a target sound coefficient sequence in which each coefficient value is variably set according to the phase difference calculated by the phase difference calculation means, and applies it to the observation matrix. For example, each coefficient value of the target sound coefficient sequence is set according to the phase difference so that the larger the phase difference between the first acoustic signal and the second acoustic signal (that is, the more dominant the noise component), the greater the noise-suppression effect of the target sound coefficient sequence. In this aspect, since the phase difference between the first acoustic signal and the second acoustic signal is reflected in the target sound coefficient sequence, an estimated target sound component in which the noise component is sufficiently suppressed can be generated compared with a configuration in which the phase difference is not reflected in the target sound coefficient sequence. A configuration in which the phase difference calculated by the phase difference calculation means is reflected in the noise coefficient sequence may also be adopted; that is, the noise extraction means generates, from the noise basis and the weight sequence corresponding to that noise basis, a noise coefficient sequence in which each coefficient value is variably set according to the phase difference between the first acoustic signal and the second acoustic signal, and applies it to the observation matrix.
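A minimal sketch (assumptions only, not the patented implementation) of how a frame-wise phase difference between the two channel STFTs S1 and S2 could be mapped to a 0-1 weight on the target sound coefficients; the specific weighting function is chosen merely to illustrate "larger phase difference, stronger suppression".

```python
import numpy as np

def phase_weight(S1, S2, strength=1.0):
    """Frame-wise mean phase difference |dP[nA]| between the two channels, mapped
    to a 0-1 weight that shrinks the target sound coefficients where it is large."""
    dphi = np.abs(np.angle(S1 * np.conj(S2)))   # per-bin phase difference in [0, pi]
    dP = dphi.mean(axis=0)                      # one value per frame (cf. ΔP[nA])
    return 1.0 - strength * dP / np.pi          # usage sketch: GT_adjusted = GT * w
```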

A sound processing apparatus according to a preferred aspect of the present invention comprises intensity difference calculation means (for example, intensity difference calculation unit 584) that calculates an intensity difference (for example, intensity difference ΔA[nA]) between the first acoustic signal and the second acoustic signal, and the target sound extraction means generates, from the bases of the basis matrix other than the noise basis and the weight sequences of the coefficient matrix corresponding to bases other than the noise basis, a target sound coefficient sequence in which each coefficient value is variably set according to the intensity difference (for example, an amplitude difference or a power difference) calculated by the intensity difference calculation means, and applies it to the observation matrix. For example, each coefficient value of the target sound coefficient sequence is set according to the intensity difference so that the larger the intensity difference between the first acoustic signal and the second acoustic signal (that is, the more dominant the noise component), the greater the noise-suppression effect of the target sound coefficient sequence. In this aspect, since the intensity difference between the first acoustic signal and the second acoustic signal is reflected in the target sound coefficient sequence, an estimated target sound component in which the noise component is sufficiently suppressed can be generated compared with a configuration in which the intensity difference is not reflected in the target sound coefficient sequence. A configuration in which the intensity difference calculated by the intensity difference calculation means is reflected in the noise coefficient sequence may also be adopted; that is, the noise extraction means generates, from the noise basis and the weight sequence corresponding to that noise basis, a noise coefficient sequence in which each coefficient value is variably set according to the intensity difference between the first acoustic signal and the second acoustic signal, and applies it to the observation matrix.
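The analogous sketch for the intensity difference, again with an assumed weighting function chosen only to illustrate the stated tendency.

```python
import numpy as np

def intensity_weight(S1, S2, strength=1.0, eps=1e-12):
    """Frame-wise level difference (in dB, cf. ΔA[nA]) between the two channels,
    mapped to a 0-1 weight; a larger difference gives stronger suppression."""
    a1 = (np.abs(S1) ** 2).mean(axis=0)
    a2 = (np.abs(S2) ** 2).mean(axis=0)
    dA = np.abs(10.0 * np.log10((a1 + eps) / (a2 + eps)))
    return 1.0 / (1.0 + strength * dA)
```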

The sound processing apparatus according to each of the above aspects is realized not only by hardware (electronic circuits) such as a DSP (Digital Signal Processor) dedicated to the processing of acoustic signals, but also by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. The program according to the present invention causes a computer to execute: a matrix decomposition process of generating, for each of a first acoustic signal and a second acoustic signal collected in parallel, by non-negative matrix factorization of an observation matrix whose elements are the time series of component values for each frequency of that acoustic signal, a basis matrix containing a plurality of bases each indicating the per-frequency component values of a different component of the acoustic signal and a coefficient matrix containing a plurality of weight sequences each indicating the time series of weight values of the corresponding basis; a noise identification process of identifying, from among the plurality of bases of the basis matrix of the first acoustic signal, a basis having a high correlation with a basis of the basis matrix of the second acoustic signal as a noise basis corresponding to the noise component of the first acoustic signal; a target sound extraction process of generating an estimated target sound component, in which the noise component is suppressed from the first acoustic signal, using the bases of the basis matrix other than the noise basis and the weight sequences of the coefficient matrix corresponding to bases other than the noise basis; a noise extraction process of generating an estimated noise component, in which the target sound component is suppressed from the first acoustic signal, using the noise basis and the weight sequence of the coefficient matrix corresponding to that noise basis; a harmonic component extraction process of extracting, from the estimated noise component, a residual component corresponding to the harmonic structure of the target sound component; and a target sound synthesis process of synthesizing the estimated target sound component and the residual component. This program achieves the same operation and effects as the sound processing apparatus according to the present invention. The program of the present invention is provided to users in a form stored on a computer-readable recording medium and installed on a computer, or is provided from a server device in the form of distribution over a communication network and installed on a computer.

FIG. 1 is a block diagram of a sound processing apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram of the first processing unit.
FIG. 3 is an explanatory diagram of an observation matrix.
FIG. 4 is an explanatory diagram of a basis matrix and a coefficient matrix.
FIG. 5 is a block diagram of a target sound extraction unit and a noise extraction unit.
FIG. 6 is a block diagram of a second processing unit.
FIG. 7 is an explanatory diagram of the analysis points assumed in the processing of the second processing unit.
FIG. 8 is a block diagram of a harmonic component extraction unit.
FIG. 9 is an explanatory diagram of the operation of the harmonic component extraction unit.
FIG. 10 is a block diagram of the first processing unit in a second embodiment.
FIG. 11 is a block diagram of the second processing unit in a third embodiment.
FIG. 12 is a block diagram of a harmonic component extraction unit in a fourth embodiment.

<A: First Embodiment>
FIG. 1 is a block diagram of a sound processing apparatus 100 according to the first embodiment of the present invention. As shown in FIG. 1, a signal supply device 12 and a sound emitting device 14 are connected to the sound processing apparatus 100. The signal supply device 12 supplies the sound processing apparatus 100 with a stereo-format acoustic signal s1 and acoustic signal s2 collected in parallel (simultaneously) at different positions. Each acoustic signal si (i = 1, 2) is a time-domain signal representing the sound pressure waveform of a mixed sound of a target sound component and a noise component. In FIG. 1, a plurality of sound collection devices 122 (for example, omnidirectional microphones) arranged apart from each other are illustrated as the signal supply device 12. However, a playback device that acquires each acoustic signal si from a portable or built-in recording medium and supplies it to the sound processing apparatus 100, or a communication device that receives each acoustic signal si over a communication network and supplies it to the sound processing apparatus 100, may also be employed as the signal supply device 12.

The sound processing apparatus 100 generates a stereo-format acoustic signal q1 and acoustic signal q2 from the acoustic signal s1 and the acoustic signal s2. Each acoustic signal qi is a time-domain signal in which the noise component is suppressed (the target sound component is emphasized) relative to the acoustic signal si. The sound emitting device 14 (for example, stereo speakers or stereo headphones) radiates sound waves corresponding to the acoustic signals q1 and q2 generated by the sound processing apparatus 100. An A/D converter that converts the acoustic signal si from analog to digital and a D/A converter that converts the acoustic signal qi from digital to analog are omitted from the figure for convenience.

As shown in FIG. 1, the sound processing apparatus 100 is realized as a computer system comprising an arithmetic processing device 22 and a storage device 24. The storage device 24 stores the program executed by the arithmetic processing device 22 and various data used by the arithmetic processing device 22. A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of plural types of recording media, may be arbitrarily employed as the storage device 24. A configuration in which the acoustic signal s1 and the acoustic signal s2 are stored in the storage device 24 (so that the signal supply device 12 may be omitted) is also suitable.

By executing the program stored in the storage device 24, the arithmetic processing device 22 realizes a plurality of functions (a first processing unit 31 and a second processing unit 32) for generating the acoustic signal qi from the acoustic signal si. A configuration in which the functions of the arithmetic processing device 22 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (DSP) realizes the functions, may also be employed.

The first processing unit 31 in FIG. 1 generates, from the acoustic signal s1 and the acoustic signal s2, stereo-format estimated target sound signals qT1 and qT2 (T: target) in which the target sound component is emphasized (the noise component is suppressed), and stereo-format estimated noise signals qN1 and qN2 (N: noise) in which the noise component is emphasized (the target sound component is suppressed). That is, the acoustic signal si is separated into a target sound component (estimated target sound signal qTi) and a noise component (estimated noise signal qNi). However, since complete separation of the target sound component and the noise component is difficult, part of the target sound component that should properly be assigned to the estimated target sound signal qTi (hereinafter referred to as the "residual component") may be mixed into the estimated noise signal qNi. The second processing unit 32 therefore generates the acoustic signal qi (q1, q2) by extracting the residual component from the estimated noise signal qNi and combining it with the estimated target sound signal qTi.

FIG. 2 is a block diagram of the first processing unit 31. As shown in FIG. 2, the first processing unit 31 comprises a frequency analysis unit 42, a matrix decomposition unit 44, a noise identification unit 46, a target sound extraction unit 52, a noise extraction unit 54, and a waveform synthesis unit 56.

As shown in FIG. 3, the frequency analysis unit 42 sequentially generates the spectrum Si (S1, S2) of each acoustic signal si for every unit period (frame) on the time axis. The spectrum Si of each unit period is a power spectrum in which a plurality of component values (powers) xi corresponding to different frequencies (f1, f2, ..., fMA, ...) on the frequency axis are arranged. That is, as shown in FIG. 3, a component value xi is calculated for each analysis point (grid point) p1 arranged in a matrix on the time-frequency plane, corresponding to the time points t (t1, t2, ...) arranged at intervals ΔtA on the time axis and the frequencies f (f1, f2, ...) arranged at intervals ΔfA on the frequency axis.

Each spectrum Si is generated by a short-time Fourier transform that uses the window width (frame length) ωA of the unit period and the hop size (shift amount on the time axis) δA as analysis parameters. The interval ΔtA on the time axis and the interval ΔfA on the frequency axis of the analysis points p1 are variably set according to the analysis parameters (window width ωA, hop size δA) of the frequency analysis performed by the frequency analysis unit 42.

As shown in FIG. 3, the spectrum Si of each acoustic signal si is divided into a spectrum Xi within a band BLa and a spectrum XHi within a band BHa. The band BLa is set so as to include the frequencies of the noise component. In this embodiment, wind noise is assumed as the noise component. Wind noise is a noise component generated when the air itself flows and collides directly with the diaphragm of the sound collection device 122. The frequency of the diaphragm vibration caused by such air collisions is lower than the frequency of sound waves that propagate to the diaphragm as air vibrations (sound pressure changes). Specifically, low-frequency components of, for example, 1 kHz or less are dominant in wind noise. In view of this tendency, the band BLa is set to a range of 1 kHz or less containing MA frequencies f1 to fMA (MA is a natural number). The band BHa is the band on the higher-frequency side of the band BLa (for example, 1 kHz and above).

As shown in FIG. 3, the time series (spectrogram) of the spectra Si generated by the frequency analysis unit 42 is divided on the time axis into analysis periods T0 each containing NA time points t1 to tNA. The analysis period T0 is set to a long duration of, for example, several tens of seconds. As shown in FIG. 3, an observation matrix Vi, in which the component values xi[1,1] to xi[MA,NA] of the analysis points p1 corresponding to the MA frequencies f1 to fMA within the band BLa and the NA time points t1 to tNA within the analysis period T0 are arranged in MA rows × NA columns, is defined for each analysis period T0 for each of the acoustic signal s1 and the acoustic signal s2. The component value xi[mA,nA] means the component value xi of the analysis point p1 corresponding to the mA-th (mA = 1 to MA) frequency fmA of the MA frequencies f1 to fMA within the band BLa and the nA-th (nA = 1 to NA) time point tnA of the NA time points t1 to tNA within the analysis period T0.

As understood from the above description, the nA-th column of the observation matrix Vi corresponds to the sequence of MA component values xi[1,nA] to xi[MA,nA] of the spectrum Xi at the nA-th time point tnA within the analysis period T0, and the mA-th row of the observation matrix Vi corresponds to the time series of component values xi[mA,1] to xi[mA,NA] of the frequency fmA over the NA time points t1 to tNA within the analysis period T0. Since the component values xi[mA,nA] of the spectrum Xi represent power (non-negative values), the observation matrix Vi is a non-negative matrix (a matrix containing no negative numbers). A configuration in which the spectrum Si (Xi) is an amplitude spectrum may also be employed.
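As a concrete illustration only (not the patented implementation), the following NumPy/SciPy sketch builds the observation matrix Vi for one channel; the power spectrogram and the 1 kHz band split follow the description above, while the window and hop values are placeholders taken from the parameters quoted later in the embodiment.

```python
import numpy as np
from scipy.signal import stft

def observation_matrix(si, fs, win_a=0.512, hop_a=0.064, band_edge_hz=1000.0):
    """Power spectrogram of one channel si, split at band_edge_hz into (Vi, XHi)."""
    nperseg = int(win_a * fs)
    noverlap = nperseg - int(hop_a * fs)
    f, t, S = stft(si, fs=fs, window='hann', nperseg=nperseg, noverlap=noverlap)
    P = np.abs(S) ** 2                 # power spectrogram (non-negative)
    low = f <= band_edge_hz            # band BLa, where wind noise is dominant
    Vi = P[low, :]                     # observation matrix: MA rows x NA columns
    XHi = P[~low, :]                   # high band BHa, passed through unprocessed
    return f, t, Vi, XHi
```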

The matrix decomposition unit 44 in FIG. 2 generates a basis matrix Wi (W1, W2) and a coefficient matrix Hi (H1, H2) by non-negative matrix factorization (NMF) of each observation matrix Vi. As shown in FIG. 4, the basis matrix Wi is a non-negative matrix of MA rows × K columns in which component values wi[1,1] to wi[MA,K] are arranged, and the coefficient matrix Hi is a non-negative matrix of K rows × NA columns in which weight values hi[1,1] to hi[K,NA] are arranged (K is a natural number). The basis matrix Wi and the coefficient matrix Hi are generated so that their product approximates the observation matrix Vi (Vi ≈ Wi·Hi). The analysis parameters (window width ωA, hop size δA) applied by the frequency analysis unit 42 are set to values with which the non-negative matrix factorization of the observation matrix Vi can be performed appropriately.

As shown in FIG. 4, the basis matrix Wi consists of K bases (codebooks) Ci[1] to Ci[K]. The basis Ci[k] in the k-th column (k = 1 to K) corresponds to a power spectrum in which the component values wi[1,k] to wi[MA,k] at the frequencies f1 to fMA are arranged for one of the K types of acoustic components constituting the acoustic signal si within the analysis period T0. The coefficient matrix Hi, as shown in FIG. 4, consists of K weight sequences (excitations) Ei[1] to Ei[K]. The weight sequence Ei[k] in the k-th row corresponds to the time series of weight values hi[k,1] to hi[k,NA] per unit period for the acoustic component indicated by the basis Ci[k] of the basis matrix Wi (that is, the temporal variation of the component values wi[mA,k] of the basis Ci[k]). As understood from these definitions, the spectrum Xi of the acoustic signal si at time point tnA is approximated by the weighted sum of the K bases Ci[1] to Ci[K] using the K weight values hi[1,nA] to hi[K,nA] of the coefficient matrix Hi corresponding to that time point tnA (Xi ≈ hi[1,nA]×Ci[1] + hi[2,nA]×Ci[2] + ... + hi[K,nA]×Ci[K]).

A known method may be arbitrarily employed for the non-negative matrix factorization of the observation matrix Vi. For example, a method of iteratively updating the basis matrix Wi and the coefficient matrix Hi so as to minimize the difference (for example, the distance) between the product of the basis matrix Wi and the coefficient matrix Hi and the observation matrix Vi is suitably employed. The initial values of the basis matrix Wi (the initial values of the component values wi[mA,k]) applied to the iterative computation are set, for example, to random numbers. A configuration in which the initial values of the MA component values wi[1,k] to wi[MA,k] of each basis Ci[k] are set so as to simulate the spectrum of wind noise (a frequency characteristic that attenuates toward higher frequencies) is also suitable.
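A minimal NumPy sketch of one concrete instance of the "known method" left open above: the standard multiplicative-update (Lee-Seung) NMF with Euclidean cost and random initialization. This is offered only for illustration, not as the patented procedure.

```python
import numpy as np

def nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Factor a non-negative matrix V (MA x NA) into W (MA x K) and H (K x NA)."""
    rng = np.random.default_rng(seed)
    MA, NA = V.shape
    W = rng.random((MA, K)) + eps               # basis matrix Wi (random init)
    H = rng.random((K, NA)) + eps               # coefficient matrix Hi
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)    # multiplicative updates that
        W *= (V @ H.T) / (W @ H @ H.T + eps)    # reduce ||V - W H||_F^2
    return W, H
```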

The noise identification unit 46 in FIG. 2 identifies, among the K bases Ci[1] to Ci[K] of each basis matrix Wi, one basis Ci[k] corresponding to the noise component (wind noise) (hereinafter denoted "noise basis Ci_noise"). Since wind noise is generated by the turbulent flow of air colliding with the sound collection devices 122, the instantaneous frequency characteristics of the wind noise contained in the acoustic signals s1 and s2 collected at different positions are statistically independent of each other. However, compared with speech and the like, the long-term frequency characteristics of wind noise tend to remain similar regardless of the position of sound collection. That is, the frequency characteristics of wind noise over a long period such as the analysis period T0 tend to be similar between the acoustic signal s1 and the acoustic signal s2.

In view of this tendency, the noise identification unit 46 identifies, as the noise bases Ci_noise, the bases Ci[k] (C1[k1], C2[k2]) that are highly correlated with each other between the basis matrix W1 of the acoustic signal s1 (the K bases C1[1] to C1[K]) and the basis matrix W2 of the acoustic signal s2 (the K bases C2[1] to C2[K]), one from each basis matrix. For example, for every combination of one basis C1[k] of the basis matrix W1 and one basis C2[k] of the basis matrix W2, an index indicating the degree of correlation between the basis C1[k] and the basis C2[k] (a correlation index) is computed, and the basis C1[k1] and the basis C2[k2] of the combination for which the degree of correlation indicated by the correlation index is maximal (whether the values of the variables k1 and k2 coincide is immaterial) are extracted as the noise bases Ci_noise (C1_noise, C2_noise). As the correlation index between the basis C1[k] and the basis C2[k], for example, a distance (Euclidean distance) or an inner product is suitably employed.
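A sketch of this cross-channel search, taking the correlation index to be the normalized inner product (cosine similarity) between basis spectra, which is one of the options (inner product or distance) named above.

```python
import numpy as np

def find_noise_bases(W1, W2, eps=1e-12):
    """Return (k1, k2): the pair of columns of W1 and W2 with maximum correlation."""
    U1 = W1 / (np.linalg.norm(W1, axis=0, keepdims=True) + eps)
    U2 = W2 / (np.linalg.norm(W2, axis=0, keepdims=True) + eps)
    corr = U1.T @ U2                 # K x K matrix of cosine similarities
    k1, k2 = np.unravel_index(np.argmax(corr), corr.shape)
    return k1, k2                    # indices of C1_noise and C2_noise
```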

The target sound extraction unit 52 in FIG. 2 sequentially generates the spectrum YTi (YT1, YT2) of the estimated target sound signal qTi obtained by extracting the target sound component from the acoustic signal si. The noise extraction unit 54 generates the spectrum YNi (YN1, YN2) of the estimated noise signal qNi obtained by extracting the noise component from the acoustic signal si. FIG. 5 is a block diagram of the target sound extraction unit 52 and the noise extraction unit 54.

As shown in FIG. 5, the target sound extraction unit 52 comprises a coefficient sequence generation unit 522 and an extraction processing unit 524. The coefficient sequence generation unit 522 generates a target sound coefficient sequence GTi (GT1, GT2) for each analysis period T0. The target sound coefficient sequence GTi is a matrix of MA rows × NA columns in which coefficient values gTi[1,1] to gTi[MA,NA] are arranged. The coefficient value gTi[mA,nA] located in the mA-th row and nA-th column of the target sound coefficient sequence GTi corresponds to a gain (spectral gain) applied to the component value xi[mA,nA] at the frequency fmA of the spectrum Xi at time point tnA, and is variably set within the range of 0 to 1 according to the characteristics of the acoustic signal si (the intensity of the wind noise). That is, the more dominant the wind noise is in the acoustic component at frequency fmA of the acoustic signal si at time point tnA, the smaller the value to which the coefficient value gTi[mA,nA] is set.

As shown in FIG. 4, the coefficient sequence generation unit 522 of the first embodiment generates the target sound coefficient sequence GTi (GT1, GT2) from a matrix WTi of MA rows × (K-1) columns obtained by excluding the noise basis Ci_noise from the basis matrix Wi of the acoustic signal si, and a matrix HTi of (K-1) rows × NA columns obtained by excluding the weight sequence Ei_noise corresponding to the noise basis Ci_noise from the coefficient matrix Hi.

Specifically, the coefficient sequence generation unit 522 first computes a matrix VTi by multiplying the matrix WTi, from which the noise basis Ci_noise has been excluded, by the matrix HTi, from which the weight sequence Ei_noise has been excluded. As shown in FIG. 4, the matrix VTi is a matrix in which element values vTi[1,1] to vTi[MA,NA] are arranged in MA rows × NA columns. As understood from the above description, the MA element values vTi[1,nA] to vTi[MA,nA] located in the nA-th column of the matrix VTi correspond to an estimate of the power spectrum obtained by suppressing the wind noise from the spectrum Xi at the time point tnA.

Second, the coefficient sequence generation unit 522 calculates the coefficient values gTi[mA,nA] of the target sound coefficient sequence GTi by the following equation (A). The symbol v[mA,nA] in equation (A) denotes the element in the mA-th row and nA-th column of the MA-row × NA-column matrix obtained by multiplying the basis matrix Wi and the coefficient matrix Hi (that is, an estimate of the component value xi[mA,nA] of the spectrum Xi). The element value vTi[mA,nA] is divided by the element value v[mA,nA] in order to normalize the coefficient value gTi[mA,nA] to a value between 0 and 1. Since the target sound coefficient sequence GTi is generated from the matrix VTi from which the wind-noise basis Ci_noise and weight sequence Ei_noise have been excluded, the coefficient value gTi[mA,nA] is set to a smaller value the more dominant the wind noise is.
gTi[mA,nA] = vTi[mA,nA] / v[mA,nA] ……(A)

The extraction processing unit 524 in FIG. 5 applies the target sound coefficient sequence GTi generated by the coefficient sequence generation unit 522 to the observation matrix Vi of the acoustic signal si, thereby sequentially generating, for each analysis period T0, the time series of NA spectra YTi (the spectrogram within the analysis period T0) corresponding to the NA time points t1 to tNA within the analysis period T0. The spectrum YTi at time point tnA is a power spectrum consisting of MA component values yTi[1,nA] to yTi[MA,nA]. Specifically, the component value yTi[mA,nA] is set to the product of the coefficient value gTi[mA,nA] of the target sound coefficient sequence GTi and the component value xi[mA,nA] of the observation matrix Vi (yTi[mA,nA] = gTi[mA,nA] × xi[mA,nA]). As described above, the more dominant the wind noise, the smaller the coefficient value gTi[mA,nA]; the spectrum YTi generated by the extraction processing unit 524 therefore corresponds to a spectrum obtained by suppressing the wind noise from the spectrum Xi of the acoustic signal si.
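A sketch of equation (A) and the masking step, assuming W, H from the NMF sketch above and k_noise from the basis search; the division implements the normalization to the 0-1 range described in the text.

```python
import numpy as np

def target_sound_mask(V, W, H, k_noise, eps=1e-12):
    """Equation (A): gTi = vTi / v, then YTi = GTi * Vi (element-wise)."""
    keep = np.arange(W.shape[1]) != k_noise
    VT = W[:, keep] @ H[keep, :]       # reconstruction without the noise basis
    Vfull = W @ H + eps                # full reconstruction v[mA, nA]
    GT = VT / Vfull                    # target sound coefficient sequence GTi
    YT = GT * V                        # spectrogram with the wind noise suppressed
    return GT, YT
```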

Similarly to the target sound extraction unit 52, the noise extraction unit 54 in FIG. 5 comprises a coefficient sequence generation unit 542 that generates, for each analysis period T0, a noise coefficient sequence GNi (GN1, GN2) of MA rows × NA columns consisting of coefficient values gNi[1,1] to gNi[MA,NA], and an extraction processing unit 544 that applies the noise coefficient sequence GNi to the observation matrix Vi to generate the time series of NA spectra YNi (the spectrogram within the analysis period T0).

As shown in FIG. 4, the coefficient sequence generation unit 542 first computes a matrix VNi, in which element values vNi[1,1] to vNi[MA,NA] are arranged in MA rows × NA columns, by multiplying the noise basis Ci_noise identified by the noise identification unit 46 by the weight sequence Ei_noise corresponding to that noise basis Ci_noise. The matrix VNi corresponds to the spectrogram of the noise component of the acoustic signal si within the analysis period T0. Second, the coefficient sequence generation unit 542 computes coefficient values gNi[mA,nA] between 0 and 1 by equation (B), which is analogous to equation (A) above. As understood from the above description, the more dominant the wind noise is in the acoustic component at frequency fmA of the acoustic signal si at time point tnA, the larger the value to which the coefficient value gNi[mA,nA] is set.
gNi[mA,nA] = vNi[mA,nA] / v[mA,nA] ……(B)

The extraction processing unit 544 applies the noise coefficient sequence GNi generated by the coefficient sequence generation unit 542 to the observation matrix Vi of the acoustic signal si, thereby sequentially generating, for each analysis period T0, the time series (spectrogram) of the NA spectra YNi within the analysis period T0. The spectrum YNi is a power spectrum consisting of MA component values yNi[1,nA] to yNi[MA,nA]. Specifically, the component value yNi[mA,nA] is set to the product of the coefficient value gNi[mA,nA] of the noise coefficient sequence GNi and the component value xi[mA,nA] of the observation matrix Vi (yNi[mA,nA] = gNi[mA,nA] × xi[mA,nA]). As described above, the more dominant the noise component (wind noise), the larger the coefficient value gNi[mA,nA]; the spectrum YNi generated by the extraction processing unit 544 therefore corresponds to a spectrum obtained by extracting the wind noise from the spectrum Xi of the acoustic signal si.
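The complementary sketch for equation (B), assuming the same names as in the previous example.

```python
import numpy as np

def noise_mask(V, W, H, k_noise, eps=1e-12):
    """Equation (B): gNi = vNi / v, then YNi = GNi * Vi (element-wise)."""
    VN = np.outer(W[:, k_noise], H[k_noise, :])   # Ci_noise times Ei_noise
    Vfull = W @ H + eps
    GN = VN / Vfull                               # noise coefficient sequence GNi
    YN = GN * V                                   # spectrogram of the extracted wind noise
    return GN, YN
```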

As described above, the target sound extraction unit 52 extracts the target sound component from the acoustic signal si, and the noise extraction unit 54 extracts the noise component from the acoustic signal si. That is, the target sound extraction unit 52 and the noise extraction unit 54 function as elements that separate the acoustic signal si into the target sound component (YT1, YT2) and the noise component (YN1, YN2).

The waveform synthesis unit 56 in FIG. 2 generates a time-domain estimated target sound signal qTi (qT1, qT2) from the spectrum YTi (band BLa) generated by the target sound extraction unit 52 for each unit period and the spectrum XHi (band BHa) generated by the frequency analysis unit 42. Specifically, the waveform synthesis unit 56 generates a time-domain signal by an inverse Fourier transform that applies the amplitude spectrum obtained by adding the spectrum YTi and the spectrum XHi, together with the phase spectrum of the acoustic signal si, and generates the estimated target sound signal qTi by connecting the results of successive unit periods. The waveform synthesis unit 56 also generates a time-domain estimated noise signal qNi (qN1, qN2) from the spectrum YNi generated by the noise extraction unit 54 for each unit period.
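A sketch of the resynthesis step using SciPy's inverse STFT: the low-band magnitude is replaced by the masked one while the original phase of si is kept, mirroring the description. The band split and window/hop values match the earlier observation-matrix sketch and are assumptions, as is the requirement that YT_low was computed from the same STFT grid.

```python
import numpy as np
from scipy.signal import stft, istft

def resynthesize(si, YT_low, fs, band_edge_hz=1000.0, win_a=0.512, hop_a=0.064):
    """Combine the masked low-band power YT_low with the untouched high band and
    the original phase of si, then invert the STFT to obtain the time-domain qTi."""
    nperseg = int(win_a * fs)
    noverlap = nperseg - int(hop_a * fs)
    f, t, S = stft(si, fs=fs, window='hann', nperseg=nperseg, noverlap=noverlap)
    power = np.abs(S) ** 2
    low = f <= band_edge_hz
    power[low, :] = YT_low                        # replace band BLa by the masked power
    S_out = np.sqrt(power) * np.exp(1j * np.angle(S))  # reuse the phase spectrum of si
    _, qT = istft(S_out, fs=fs, window='hann', nperseg=nperseg, noverlap=noverlap)
    return qT
```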

As described above, the second processing unit 32 in FIG. 1 extracts the residual component from the estimated noise signal qNi generated by the above procedure and combines it with the estimated target sound signal qTi. The first embodiment assumes that the target sound component is a sound having a harmonic structure (typically speech), and extracts from the estimated noise signal qNi, as the residual component, the harmonic components (the fundamental component and the overtone components) that constitute that harmonic structure. FIG. 6 is a block diagram of the second processing unit 32. As shown in FIG. 6, the second processing unit 32 comprises a frequency analysis unit 62, a harmonic component extraction unit 64, a target sound synthesis unit 66, and a waveform synthesis unit 68.

The frequency analysis unit 62 sequentially generates, for each unit period, the spectrum STi (ST1, ST2) of the estimated target sound signal qTi and the spectrum SNi (SN1, SN2) of the estimated noise signal qNi. The spectrum STi of the estimated target sound signal qTi is a power spectrum in which a plurality of component values (powers) sTi are arranged. Similarly, the spectrum SNi of the estimated noise signal qNi is a power spectrum in which a plurality of component values sNi are arranged. As shown in FIG. 7, a component value sTi and a component value sNi are computed for each analysis point p2 corresponding to the time points t arranged at intervals ΔtB on the time axis and the frequencies f arranged at intervals ΔfB on the frequency axis. As shown in FIG. 6, the spectrum STi of the estimated target sound signal qTi is divided into a spectrum Zi within a band BLb and a spectrum ZHi within a band BHb. The band BLb is set to a range containing MB frequencies f1 to fMB (MB is a natural number) (for example, the band from 0.1 kHz to 4.4 kHz), and the band BHb is set on the higher-frequency side of the band BLb (for example, the band of 4.4 kHz and above).

The spectrum STi and the spectrum SNi are calculated by a short-time Fourier transform whose analysis parameters are the window width ωB of each unit period and the movement amount (shift amount on the time axis) δB. Whereas the analysis parameters of the frequency analysis unit 42 (window width ωA, movement amount δA) are chosen as values suitable for non-negative matrix factorization, the analysis parameters of the frequency analysis unit 62 (window width ωB, movement amount δB) are chosen as values suitable for the extraction and synthesis of the residual component (harmonic components) in the second processing unit 32. Because of this difference, the analysis parameters of the frequency analysis unit 42 (ωA, δA) and those of the frequency analysis unit 62 (ωB, δB) differ from each other. That is, the interval ΔtA on the time axis between the analysis points p1 assumed by the first processing unit 31 differs from the interval ΔtB between the analysis points p2 assumed by the second processing unit 32, and the interval ΔfA on the frequency axis between the analysis points p1 differs from the interval ΔfB between the analysis points p2. Specifically, the analysis parameters are chosen so that the time resolution of the frequency analysis unit 62 exceeds that of the frequency analysis unit 42 (ΔtB < ΔtA) and the frequency resolution of the frequency analysis unit 42 exceeds that of the frequency analysis unit 62 (ΔfA < ΔfB). For example, the window width ωA of the frequency analysis unit 42 is set to 512 ms and its movement amount δA to 64 ms, whereas the window width ωB of the frequency analysis unit 62 is set to 25 ms and its movement amount δB to 5 ms.
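
The following sketch illustrates, under hypothetical names (s, q, sr) and a hypothetical sampling rate, how two power spectrograms with deliberately different window widths and hop sizes could be computed, mirroring the long-window analysis of the factorization stage and the short-window analysis of the harmonic-extraction stage.

```python
import numpy as np

def stft_power(signal, sr, win_ms, hop_ms):
    """Power spectrogram computed with a given window width and hop (both in ms)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(win)
    frames = [signal[k:k + win] * window
              for k in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape: (frames, bins)

# First stage (NMF): long window, fine frequency grid, coarse time grid.
#   V = stft_power(s, sr, win_ms=512, hop_ms=64)   # analysis parameters of unit 42
# Second stage (harmonic extraction): short window, fine time grid.
#   S = stft_power(q, sr, win_ms=25, hop_ms=5)     # analysis parameters of unit 62
```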

The harmonic component extraction unit 64 of FIG. 6 sequentially extracts spectra Ri of the residual component from the estimated noise signal qNi of the noise component. A time series of NB spectra Ri (a spectrogram of the residual component) is generated sequentially for each analysis period T0.

FIG. 8 is a block diagram of the harmonic component extraction unit 64. As shown in FIG. 8, the harmonic component extraction unit 64 includes a frequency estimation unit 72, a harmonic coefficient sequence generation unit 74, a coefficient sequence correction unit 76, and a harmonic extraction unit 78. The frequency estimation unit 72 estimates, by analyzing the spectrum Zi of the estimated target sound signal qTi, the fundamental frequency Fi[nB] (Fi[1] to Fi[NB]) of the target sound component of the acoustic signal si (the estimated target sound signal qTi) for each of the NB unit periods within each analysis period T0. A known technique (for example, analysis of the harmonic structure or calculation of the cepstrum) may be employed arbitrarily to estimate the fundamental frequency Fi[nB].
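
As one concrete illustration of the cepstrum option mentioned above, the sketch below picks the strongest cepstral peak inside an assumed pitch range; the range limits and the function name are hypothetical and not part of the disclosure.

```python
import numpy as np

def estimate_f0_cepstrum(power_spectrum, sr, fmin=70.0, fmax=500.0):
    """Pick the fundamental as the strongest cepstral peak inside an assumed
    pitch range.  power_spectrum is the one-sided power spectrum of one unit
    period; fmin and fmax bound the quefrency search range."""
    log_spec = np.log(power_spectrum + 1e-12)
    cepstrum = np.fft.irfft(log_spec)               # quefrency domain
    q_lo = max(int(sr / fmax), 1)                   # shortest lag of interest
    q_hi = min(int(sr / fmin), len(cepstrum) - 1)   # longest lag of interest
    peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return sr / peak
```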

The harmonic coefficient sequence generation unit 74 generates a harmonic coefficient sequence GHi (H: harmonics) using the fundamental frequency Fi[nB] estimated by the frequency estimation unit 72 and the noise coefficient sequence GNi generated by the coefficient sequence generation unit 542 of the first processing unit 31. As shown in FIG. 9, the harmonic coefficient sequence GHi is a matrix of MB rows × NB columns in which coefficient values gHi[1,1] to gHi[MB,NB] corresponding to the different analysis points p2 in the time-frequency plane are arranged. The MB coefficient values gHi[1,nB] to gHi[MB,nB] constituting the nB-th column of the harmonic coefficient sequence GHi form a coefficient sequence indicating the harmonic structure of the target sound component in the nB-th unit period (time point tnB) of the analysis period T0.

The noise coefficient sequence GNi is used by the harmonic coefficient sequence generation unit 74 to generate the harmonic coefficient sequence GHi, but the analysis points p1 corresponding to the coefficient values gNi[mA,nA] of the noise coefficient sequence GNi differ from the analysis points p2 corresponding to the coefficient values gHi[mB,nB] of the harmonic coefficient sequence GHi. To compensate for this difference, the coefficient sequence correction unit 76 generates, as shown in FIG. 9, a correction coefficient sequence GBi for each analysis period T0 by correcting the noise coefficient sequence GNi generated by the coefficient sequence generation unit 542 of the first processing unit 31. The correction coefficient sequence GBi is a matrix of MB rows × NB columns in which coefficient values gBi[1,1] to gBi[MB,NB] are arranged.

Specifically, the coefficient sequence correction unit 76 generates each coefficient value gBi[mB,nB] of the correction coefficient sequence GBi by interpolation or thinning of the coefficient values gNi[mA,nA] of the noise coefficient sequence GNi. For example, when the number of rows MA of the noise coefficient sequence GNi exceeds the target number of rows MB (MA > MB), the coefficient sequence correction unit 76 generates MB coefficient values gBi[1,nA] to gBi[MB,nA] by thinning the MA coefficient values gNi[1,nA] to gNi[MA,nA] constituting each column of the noise coefficient sequence GNi. When the number of columns NA of the noise coefficient sequence GNi falls below the target number of columns NB (NA < NB), the coefficient sequence correction unit 76 generates NB coefficient values gBi[mA,1] to gBi[mA,NB] by interpolating the NA coefficient values gNi[mA,1] to gNi[mA,NA] constituting each row of the noise coefficient sequence GNi. A known technique (for example, linear interpolation) may be employed arbitrarily for the interpolation and thinning.
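
A minimal sketch of such a grid conversion is given below, assuming linear interpolation along both axes; the function name and the use of linear interpolation in the thinning direction as well are illustrative assumptions.

```python
import numpy as np

def resample_coefficients(gn, mb, nb):
    """Map an MA x NA coefficient matrix onto an MB x NB grid by linear
    interpolation along frequency (rows) and along time (columns)."""
    ma, na = gn.shape
    src_f, dst_f = np.linspace(0, 1, ma), np.linspace(0, 1, mb)
    tmp = np.empty((mb, na))
    for j in range(na):                       # frequency direction
        tmp[:, j] = np.interp(dst_f, src_f, gn[:, j])
    src_t, dst_t = np.linspace(0, 1, na), np.linspace(0, 1, nb)
    gb = np.empty((mb, nb))
    for i in range(mb):                       # time direction
        gb[i, :] = np.interp(dst_t, src_t, tmp[i, :])
    return gb
```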

As shown in FIG. 8, the harmonic coefficient sequence generation unit 74 includes a harmonic structure specifying unit 742 and a coefficient sequence synthesis unit 744. The harmonic structure specifying unit 742 generates a harmonic coefficient sequence Di indicating the harmonic structure of the target sound component (residual component) of the acoustic signal si. As shown in FIG. 9, the harmonic coefficient sequence Di is a matrix of MB rows × NB columns in which coefficient values di[1,1] to di[MB,NB] are arranged. The sequence of MB coefficient values di[1,nB] to di[MB,nB] constituting the nB-th column of the harmonic coefficient sequence Di specifies the harmonic structure of the target sound component in the nB-th unit period of the analysis period T0. Specifically, as shown in FIG. 9, among the MB coefficient values di[1,nB] to di[MB,nB], each coefficient value di[mB,nB] corresponding to a frequency that is an integer multiple of the fundamental frequency Fi[nB] estimated by the frequency estimation unit 72 (Fi[nB], 2Fi[nB], 3Fi[nB], ...) is set to 1, and the other coefficient values di[mB,nB] are set to zero.
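
The sketch below builds one column of such a 0/1 harmonic coefficient sequence from an estimated fundamental. The optional tolerance of a few neighboring bins is an assumption added here to absorb grid quantization and is not stated in the embodiment.

```python
import numpy as np

def harmonic_mask(f0, freqs, tol_bins=0):
    """One column of Di: 1 at the bins whose centre frequency is an integer
    multiple of the estimated fundamental f0 (assumed > 0), 0 elsewhere.
    freqs is the vector of bin frequencies f1..fMB; tol_bins optionally widens
    each harmonic line by a few bins."""
    d = np.zeros(len(freqs))
    df = freqs[1] - freqs[0]
    k = 1
    while k * f0 <= freqs[-1]:
        idx = int(round((k * f0 - freqs[0]) / df))
        if 0 <= idx < len(freqs):
            d[max(idx - tol_bins, 0):min(idx + tol_bins + 1, len(freqs))] = 1.0
        k += 1
    return d
```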

The coefficient sequence synthesis unit 744 of FIG. 8 generates the harmonic coefficient sequence GHi by combining the correction coefficient sequence GBi generated by the coefficient sequence correction unit 76 with the harmonic coefficient sequence Di generated by the harmonic structure specifying unit 742. Specifically, as shown in FIG. 9, each coefficient value gHi[mB,nB] of the harmonic coefficient sequence GHi is set to the product of the coefficient value gBi[mB,nB] of the correction coefficient sequence GBi and the coefficient value di[mB,nB] of the harmonic coefficient sequence Di (gHi[mB,nB] = gBi[mB,nB] × di[mB,nB]). The coefficient value gHi[mB,nB] is therefore set to a larger value the more dominant the residual component is at frequencies that are integer multiples of the fundamental frequency Fi[nB].

The harmonic extraction unit 78 of FIG. 8 applies the harmonic coefficient sequence GHi generated by the coefficient sequence synthesis unit 744 to the spectrum SNi of the estimated noise signal qNi, thereby generating the time series of NB spectra Ri (the spectrogram of the residual component) within the analysis period T0. The spectrum Ri at time point tnB is a power spectrum composed of MB component values ri[1,nB] to ri[MB,nB] corresponding to the frequencies f1 to fMB. The harmonic extraction unit 78 calculates, as the component value ri[mB,nB] of the spectrum Ri, the product of the component value sNi[mB,nB] at the frequency fmB in the spectrum SNi of the estimated noise signal qNi at time point tnB and the coefficient value gHi[mB,nB] of the harmonic coefficient sequence GHi (ri[mB,nB] = gHi[mB,nB] × sNi[mB,nB]). The spectrum Ri therefore corresponds to an estimate of the spectrum of the residual component (the harmonic component whose fundamental frequency is Fi[nB]) mixed into the estimated noise signal qNi. The above is the configuration and operation of the harmonic component extraction unit 64.
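
Putting the element-wise steps together, a minimal sketch (with hypothetical variable names, assuming all matrices share the MB × NB grid) is:

```python
import numpy as np

def extract_residual(gb, d, sn):
    """gb: correction coefficient matrix GBi, d: harmonic coefficient matrix Di
    (columns built as sketched above), sn: power spectrogram of the estimated
    noise signal.  All operations are element-wise products."""
    gh = gb * d      # harmonic coefficient sequence GHi
    return gh * sn   # residual-component spectrogram Ri
```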

The target sound synthesis unit 66 of FIG. 6 sequentially generates, for each unit period, a spectrum ZRi by combining (spectrum addition) the spectrum Zi (band BLb) that the frequency analysis unit 62 generated from the estimated target sound signal qTi with the spectrum Ri generated by the harmonic component extraction unit 64. That is, the spectrum ZRi corresponds to the power spectrum of the mixture of the target sound component within the band BLb of the estimated target sound signal qTi and the residual component remaining in the estimated noise signal qNi.

The waveform synthesis unit 68 generates a time-domain acoustic signal qi (q1, q2) from the spectrum ZRi (band BLb) generated by the target sound synthesis unit 66 for each unit period and the spectrum ZHi (band BHb) generated by the frequency analysis unit 62. The acoustic signal qi is generated by the waveform synthesis unit 68 in the same manner as the estimated target sound signal qTi is generated by the waveform synthesis unit 56. As is understood from the above description, the reproduced sound of the acoustic signal qi corresponds to a mixture of the target sound component of the estimated target sound signal qTi and the residual component of the estimated noise signal qNi.

As described above, in the first embodiment the observation matrix Vi of the acoustic signal si is decomposed into the basis matrix Wi and the coefficient matrix Hi, and the target sound coefficient sequence GTi is generated using the basis matrix Wi with the noise basis Ci_noise excluded (matrix WTi) and the coefficient matrix Hi with the weight sequence Ei_noise excluded (matrix HTi). Therefore, even when the intensity of the target sound component of the acoustic signal si is low compared with the noise component, wind noise can be suppressed with high accuracy. Moreover, since each basis Ci[k] of the basis matrix Wi other than the noise basis Ci_noise and each weight sequence Ei[k] of the coefficient matrix Hi other than the weight sequence Ei_noise are preserved, there is also the advantage that an acoustic signal qi in which the waveform of the target sound component of the acoustic signal si is faithfully maintained can be generated.
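
For orientation, a generic multiplicative-update NMF with Euclidean cost is sketched below; the embodiment does not prescribe a particular update rule, rank or iteration count, so these choices and the function name are assumptions.

```python
import numpy as np

def nmf(v, k, n_iter=200, eps=1e-12):
    """Multiplicative-update NMF, Euclidean cost: V (M x N) ~ W (M x k) @ H (k x N)."""
    rng = np.random.default_rng(0)
    w = rng.random((v.shape[0], k)) + eps
    h = rng.random((k, v.shape[1])) + eps
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

# With the noise-basis column indices in noise_idx and the remaining indices in keep:
#   v_target = w[:, keep] @ h[keep, :]            # spectrogram with the noise suppressed
#   v_noise  = w[:, noise_idx] @ h[noise_idx, :]  # spectrogram of the noise component
```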

As a method of specifying the noise basis Ci_noise from the basis matrix Wi, a configuration may also be adopted in which, for example, a model created in advance so as to simulate the frequency characteristics of wind noise is compared with each basis Ci[k] of the basis matrix Wi. In a configuration that uses a wind-noise model, however, wind noise whose frequency characteristics differ from those of the model prepared in advance may not be suppressed sufficiently. In the first embodiment, on the other hand, each basis Ci[k] having a high correlation between the basis matrix W1 and the basis matrix W2 is specified as a noise basis Ci_noise, so that wind noise of diverse characteristics can be suppressed sufficiently compared with a configuration that uses a wind-noise model.
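
One plausible reading of this basis-correlation test, with an illustrative threshold and a hypothetical function name, is sketched below.

```python
import numpy as np

def find_noise_bases(w1, w2, threshold=0.9):
    """Return the column indices of W1 whose (Pearson) correlation with some
    column of W2 reaches the threshold; those columns are treated as noise bases."""
    def normalize(w):
        w = w - w.mean(axis=0, keepdims=True)
        return w / (np.linalg.norm(w, axis=0, keepdims=True) + 1e-12)
    corr = normalize(w1).T @ normalize(w2)      # K1 x K2 correlation matrix
    return np.where(corr.max(axis=1) >= threshold)[0]
```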

Furthermore, in the first embodiment, the target sound component (residual component) remaining in the estimated noise signal qNi is extracted and merged into the estimated target sound signal qTi (spectrum Zi), so that loss of the target sound component in the reproduced sound can be prevented compared with, for example, the case where the estimated target sound signal qTi is reproduced from the sound emitting device 14 as it is. Moreover, since the residual component is separated from the noise component through analysis of the harmonic structure, there is the advantage that the residual component can be extracted with high accuracy even when its intensity is low relative to the noise component.

In the first embodiment, the noise coefficient sequence GNi corresponding to the analysis points p1 is corrected so as to correspond to the analysis points p2 of the second processing unit 32 before being used for extraction of the residual component. Therefore, there is the advantage that the analysis parameters of the frequency analysis unit 42 of the first processing unit 31 (window width ωA, movement amount δA) and the analysis parameters of the frequency analysis unit 62 of the second processing unit 32 (window width ωB, movement amount δB) can be selected independently. Specifically, as described above, the analysis parameters of the frequency analysis unit 42 can be chosen as values appropriate for the non-negative matrix factorization performed by the matrix decomposition unit 44, while the analysis parameters of the frequency analysis unit 62 can be chosen as values appropriate for the extraction and synthesis of the residual component.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. In the following examples, elements whose operations and functions are equivalent to those of the first embodiment are denoted by the reference numerals used in the above description, and their detailed description is omitted as appropriate.

A target sound component arriving from the front with respect to the two sound collecting devices 122 of the signal supply device 12 reaches each sound collecting device 122 with substantially equal intensity (amplitude) and with hardly any phase difference. Wind noise, on the other hand, originates in air turbulence as described above, so it is unlikely to reach the sound collecting devices 122 with the same phase and the same amplitude. Accordingly, the more dominant the wind noise is in the acoustic signals s1 and s2, the larger the phase difference and the intensity difference between them tend to become. Taking this tendency into account, in the present embodiment each coefficient value gTi[mA,nA] of the target sound coefficient sequence GTi and each coefficient value gNi[mA,nA] of the noise coefficient sequence GNi are set variably according to the phase difference and the intensity difference between the acoustic signal s1 and the acoustic signal s2.

As shown in FIG. 10, the sound processing apparatus 100 of the second embodiment has a configuration in which a phase difference calculation unit 582 and an intensity difference calculation unit 584 are added to the first processing unit 31 of the first embodiment. The components of the acoustic signal s1 and the acoustic signal s2 within a band BM are supplied to the phase difference calculation unit 582 and the intensity difference calculation unit 584. The band BM is set so as to contain the frequencies of the wind noise and the frequencies of the main target sound component; for example, the band BM is set to the range of 4 kHz and below (that is, a band containing the band BLa).

The phase difference calculation unit 582 of FIG. 10 sequentially calculates the phase difference ΔP[nA] between the acoustic signal s1 and the acoustic signal s2 for each unit period (each time point tnA). The phase difference ΔP[nA] is, for example, a representative value (for example, the average) of the phase differences at the respective frequencies within the band BM. Similarly, the intensity difference calculation unit 584 sequentially calculates, for each unit period, the intensity difference ΔA[nA] (for example, an amplitude difference or a power difference) between the acoustic signal s1 and the acoustic signal s2.
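
A minimal sketch of computing per-frame representative differences from complex STFT frames is given below; taking plain means as the representative values is one of the options the text allows, and the function and variable names are hypothetical.

```python
import numpy as np

def frame_differences(x1, x2, band):
    """Per-frame representative phase and intensity differences of two channels.
    x1, x2: complex STFT frames of shape (n_frames, n_bins); band: boolean mask
    selecting the bins inside band BM."""
    cross = x1[:, band] * np.conj(x2[:, band])
    phase_diff = np.mean(np.abs(np.angle(cross)), axis=1)                 # ΔP[n]
    power_diff = np.abs(np.mean(np.abs(x1[:, band]) ** 2
                                - np.abs(x2[:, band]) ** 2, axis=1))      # ΔA[n]
    return phase_diff, power_diff
```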

The coefficient sequence generation unit 522 of the target sound extraction unit 52 sets each coefficient value gTi[mA,nA] of the target sound coefficient sequence GTi variably according to the phase difference ΔP[nA] calculated by the phase difference calculation unit 582 and the intensity difference ΔA[nA] calculated by the intensity difference calculation unit 584. Specifically, the larger the phase difference ΔP[nA] or the intensity difference ΔA[nA] (that is, the more dominant the wind noise at the time point tnA), the smaller the value to which the coefficient sequence generation unit 522 corrects the coefficient value gTi[mA,nA] calculated by equation (A) above. The second embodiment therefore has the advantage over the first embodiment that an estimated target sound signal qTi in which wind noise is suppressed more thoroughly can be generated.

Conversely, the coefficient sequence generation unit 542 of the noise extraction unit 54 sets each coefficient value gNi[mA,nA] of the noise coefficient sequence GNi variably according to the phase difference ΔP[nA] and the intensity difference ΔA[nA]. Specifically, the larger the phase difference ΔP[nA] or the intensity difference ΔA[nA] (that is, the more dominant the wind noise at the time point tnA), the larger the value to which the coefficient sequence generation unit 542 corrects the coefficient value gNi[mA,nA] calculated by equation (B) above. The second embodiment therefore has the advantage over the first embodiment that an estimated noise signal qNi in which wind noise is emphasized more thoroughly can be generated.
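
The embodiment does not fix the exact correction rule, so the following sketch shows only one monotone choice consistent with the description (smaller gTi and larger gNi as the differences grow); the scaling constants and function name are arbitrary assumptions.

```python
import numpy as np

def adjust_coefficients(g_target, g_noise, dp, da, alpha=1.0, beta=1.0):
    """The larger the per-frame phase difference dp[n] or intensity difference
    da[n] (i.e. the more the wind noise dominates frame n), the smaller the
    target-sound coefficients and the larger the noise coefficients become."""
    wind = 1.0 / (1.0 + alpha * dp + beta * da)   # in (0, 1]; small when noise dominates
    return g_target * wind[np.newaxis, :], g_noise / wind[np.newaxis, :]
```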

<C: Third Embodiment>
The sound processing apparatus 100 of the third embodiment has a configuration in which an adjustment unit 65 is added to the second processing unit 32, as shown in FIG. 11. The adjustment unit 65 is an amplifier that reduces the intensity (power) of the spectrum SNi of the estimated noise signal qNi (for example, a multiplier that multiplies it by a value less than 1). The target sound synthesis unit 66 sequentially generates the spectrum ZRi for each unit period by combining the spectrum Zi (band BLb) and the spectrum Ri, as in the first embodiment, with the spectrum SNi after processing (attenuation) by the adjustment unit 65. That is, the noise component of the acoustic signal si is added to the reproduced sound at a low volume.

The third embodiment achieves the same effects as the first embodiment. In the configuration of the first embodiment, in which the spectrum ZRi is generated by combining only the spectrum Zi of the estimated target sound signal qTi and the spectrum Ri of the residual component, the noise component can be excluded to a high degree, but the reproduced sound may give an audibly unnatural impression. In the third embodiment, the spectrum SNi of the estimated noise signal qNi is also applied to the synthesis of the spectrum ZRi, which has the advantage that a reproduced sound with an audibly natural impression can be generated.
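
A one-line sketch of this mixing step, with an illustrative attenuation factor and hypothetical names, is:

```python
def mix_with_noise_floor(zi, ri, sn, attenuation=0.1):
    """ZRi for one unit period: target spectrum of band BLb plus residual spectrum
    plus an attenuated copy of the estimated noise spectrum (same grid for all)."""
    return zi + ri + attenuation * sn
```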

<D: Fourth Embodiment>
In the first embodiment, the target sound extraction unit 52 generates the time series of the spectrum YTi by applying the target sound coefficient sequence GTi, generated from the basis matrix Wi and the coefficient matrix Hi, to the observation matrix Vi, and the noise extraction unit 54 generates the time series of the spectrum YNi by applying the noise coefficient sequence GNi to the observation matrix Vi. The fourth embodiment simplifies the operation of the target sound extraction unit 52 and the noise extraction unit 54.

As described above with reference to FIG. 4, the matrix VTi obtained by multiplying the matrix WTi, in which the noise basis Ci_noise is excluded from the basis matrix Wi, by the matrix HTi, in which the weight sequence Ei_noise is excluded from the coefficient matrix Hi, approximates the spectrogram that would be obtained if the noise component were suppressed from the acoustic signal si. The target sound extraction unit 52 of the fourth embodiment therefore sequentially generates, for each analysis period T0, the matrix VTi as the time series (spectrogram) of the spectrum YTi after suppression of the noise component. The sequence of MA element values vTi[1,nA] to vTi[MA,nA] located in the nA-th column of the matrix VTi is supplied to the waveform synthesis unit 56 as the spectrum YTi.

Likewise, the matrix VNi obtained by multiplying the noise basis Ci_noise by the weight sequence Ei_noise approximates the spectrogram that would be obtained if the target sound component were suppressed from the acoustic signal si. The noise extraction unit 54 of the fourth embodiment therefore sequentially generates, for each analysis period T0, the matrix VNi as the time series (spectrogram) of the spectrum YNi after suppression of the target sound component. The sequence of MA element values vNi[1,nA] to vNi[MA,nA] located in the nA-th column of the matrix VNi is used as the spectrum YNi.
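
A minimal sketch of this direct reconstruction, assuming the noise-basis column indices have already been identified, is:

```python
import numpy as np

def split_spectrograms(w, h, noise_idx):
    """Reconstruct the two spectrograms directly from the factor matrices."""
    keep = np.setdiff1d(np.arange(w.shape[1]), noise_idx)
    v_target = w[:, keep] @ h[keep, :]              # time series of spectra YTi (matrix VTi)
    v_noise = w[:, noise_idx] @ h[noise_idx, :]     # time series of spectra YNi (matrix VNi)
    return v_target, v_noise
```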

FIG. 12 is a block diagram of the harmonic component extraction unit 64 in the fourth embodiment. As shown in FIG. 12, the harmonic component extraction unit 64 of the fourth embodiment has a configuration in which the coefficient sequence synthesis unit 744 of the first embodiment is omitted. The matrix VNi generated by the noise extraction unit 54 is supplied to the coefficient sequence correction unit 76.

The matrix VNi is a matrix of MA rows × NA columns composed of element values vNi[1,1] to vNi[MA,NA] corresponding to the analysis points p1. The coefficient sequence correction unit 76 generates a matrix VBi by correcting (interpolating or thinning) the element values vNi[mA,nA] of the matrix VNi. The matrix VBi is composed of element values vBi[1,1] to vBi[MB,NB] corresponding to the analysis points p2 determined by the analysis parameters of the frequency analysis unit 62 (window width ωB, movement amount δB).

The harmonic extraction unit 78 applies the harmonic coefficient sequence Di generated by the harmonic structure specifying unit 742 to the matrix VBi corrected by the coefficient sequence correction unit 76, thereby generating the time series of NB spectra Ri (the spectrogram of the residual component) within the analysis period T0. Specifically, the harmonic extraction unit 78 calculates the product of the element value vBi[mB,nB] of the matrix VBi and the coefficient value di[mB,nB] of the harmonic coefficient sequence Di as the component value ri[mB,nB] of the spectrum Ri. Since the matrix VBi approximates the estimated spectrum of the noise component after separation by the first processing unit 31, the spectrum Ri corresponds to an estimate of the spectrum obtained by extracting, from the separated noise component, the residual component of the target sound (that is, the harmonic components at frequencies that are integer multiples of the fundamental frequency Fi[nB]). The fourth embodiment therefore achieves the same effects as the first embodiment.

<E: Modifications>
Each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more aspects selected arbitrarily from the following examples may be combined as appropriate.

(1) Modification 1
In each of the above embodiments, the fundamental frequency Fi[nB] of the target sound component of the acoustic signal si is estimated from the spectrum Zi of the estimated target sound signal qTi, but the method of estimating the fundamental frequency Fi[nB] is arbitrary. For example, the fundamental frequency Fi[nB] of the estimated target sound signal qTi may be estimated by time-domain processing (for example, a method using an autocorrelation function), as sketched below. A configuration in which the fundamental frequency Fi[nB] is estimated by analyzing the acoustic signal si (or the spectrum Xi) before extraction of the target sound component may also be adopted. However, since the accuracy of the estimation of the fundamental frequency Fi[nB] deteriorates while noise components are still mixed in, the configuration of the first embodiment, in which the fundamental frequency Fi[nB] is estimated after suppression of the noise component, is advantageous from the viewpoint of highly accurate estimation.
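
A minimal autocorrelation-based sketch of this time-domain alternative, with an assumed pitch search range and a hypothetical function name, is:

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=70.0, fmax=500.0):
    """Pick the lag with the strongest autocorrelation inside an assumed pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo = max(int(sr / fmax), 1)
    lag_hi = min(int(sr / fmin), len(ac) - 1)
    lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])
    return sr / lag
```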

Even when the fundamental frequency Fi[nB] is estimated from an acoustic signal si in which the target sound component and the noise component coexist, highly accurate estimation of the fundamental frequency Fi[nB] is possible if the noise component is attenuated by removing components below a cutoff frequency (low-cut filtering). However, assuming a situation in which the frequency of the noise component changes from moment to moment, it is very difficult to set the cutoff frequency to an optimum value. In each of the above embodiments, the noise component is effectively suppressed even when its frequency changes, and the fundamental frequency Fi[nB] is estimated from the estimated target sound signal qTi after suppression; there is therefore the advantage that the fundamental frequency Fi[nB] can be estimated with high accuracy without the problem of choosing a cutoff frequency as the frequency of the noise component changes.

(2) Modification 2
The configuration that limits the separation of the target sound component and the noise component (first processing unit 31) and the extraction of the residual component (second processing unit 32) to the low-frequency bands (BLa, BLb) may be omitted. For example, a configuration in which the entire band of the acoustic signal si is processed by the matrix decomposition unit 44 and the noise specifying unit 46 may also be adopted. However, since the intensity of wind noise decreases in the higher band (for example, the band BHa), a configuration that omits the band division of the acoustic signal si makes it difficult to extract an independent basis Ci[k] of the wind noise with high accuracy by non-negative matrix factorization. Therefore, when the frequency band of the noise component to be suppressed is known in advance, the above configuration, in which only the frequency band containing the noise component (band BLa) is processed by the matrix decomposition unit 44 and the noise specifying unit 46, is particularly suitable.

(3) Modification 3
In each of the above embodiments, the target sound coefficient sequence GTi and the noise coefficient sequence GNi are generated for each analysis period T0 of the acoustic signal si, but the division into analysis periods T0 may be omitted. For example, a configuration in which the time series of the spectrum Xi for each unit period over the entire acoustic signal si forms a single observation matrix Vi may also be adopted.

(4) Modification 4
In each of the above embodiments, the estimated target sound signal qTi is generated by multiplying each component value xi[mA,nA] of the acoustic signal si by the corresponding coefficient value gTi[mA,nA] of the target sound coefficient sequence GTi, but the manner of applying the target sound coefficient sequence GTi to the acoustic signal si may be changed as appropriate. For example, a configuration in which the coefficient value gTi[mA,nA] is added to each component value xi[mA,nA] of the acoustic signal si may be adopted. Contrary to the examples in the above embodiments, in a configuration in which the target sound coefficient sequence GTi is generated so that the coefficient value gTi[mA,nA] becomes larger the more dominant the wind noise is, the component value xi[mA,nA] may instead be divided by, or reduced by, the coefficient value gTi[mA,nA]. Similarly, for the noise coefficient sequence GNi used to emphasize the noise component, the manner of application to the acoustic signal si and its relationship to the dominance of the wind noise may be changed as appropriate.

(5) Modification 5
In each of the above embodiments, two channels of acoustic signals qi (q1, q2) are generated, but the above embodiments can be applied in the same way when only a single-channel (monaural) acoustic signal q1 is generated. For example, the separation of the target sound component and the noise component and the extraction and addition of the residual component are performed only for the single channel corresponding to the acoustic signal s1. In this configuration, the acoustic signal s2 is used to specify the noise basis C1_noise from the basis matrix W1 of the acoustic signal s1.

(6) Modification 6
The processing of the arithmetic processing device 22 may be executed in real time in parallel with the supply of the acoustic signal si, and the acoustic signal qi may be reproduced successively as each portion is processed. However, a configuration in which reproduction of the acoustic signal qi starts after processing of an acoustic signal si prepared in advance has been completed (batch processing) is also suitable.

DESCRIPTION OF REFERENCE NUMERALS: 100: sound processing apparatus; 12: signal supply device; 14: sound emitting device; 22: arithmetic processing device; 24: storage device; 31: first processing unit; 32: second processing unit; 42: frequency analysis unit; 44: matrix decomposition unit; 46: noise specifying unit; 52: target sound extraction unit; 54: noise extraction unit; 56: waveform synthesis unit; 522: coefficient sequence generation unit; 524: extraction processing unit; 542: coefficient sequence generation unit; 544: extraction processing unit; 582: phase difference calculation unit; 584: intensity difference calculation unit; 62: frequency analysis unit; 64: harmonic component extraction unit; 65: adjustment unit; 66: target sound synthesis unit; 68: waveform synthesis unit; 72: frequency estimation unit; 74: harmonic coefficient sequence generation unit; 742: harmonic structure specifying unit; 744: coefficient sequence synthesis unit; 76: coefficient sequence correction unit; 78: harmonic extraction unit.

Claims (6)

1. A sound processing apparatus comprising:
matrix decomposition means for generating, for each of a first acoustic signal and a second acoustic signal collected in parallel, by non-negative matrix factorization of an observation matrix whose elements are a time series of component values for each frequency of the acoustic signal, a basis matrix including a plurality of bases each indicating component values for each frequency of a different component of the acoustic signal, and a coefficient matrix including a plurality of weight sequences each indicating a time series of weight values of the respective bases;
noise specifying means for specifying, as a noise basis corresponding to a noise component of the first acoustic signal, a basis among the plurality of bases of the basis matrix of the first acoustic signal that has a high correlation with a basis of the basis matrix of the second acoustic signal;
target sound extraction means for generating an estimated target sound component in which the noise component is suppressed from the first acoustic signal, using each basis of the basis matrix other than the noise basis and each weight sequence of the coefficient matrix corresponding to a basis other than the noise basis;
noise extraction means for generating an estimated noise component in which the target sound component is suppressed from the first acoustic signal, using the noise basis and the weight sequence of the coefficient matrix corresponding to the noise basis;
harmonic component extraction means for extracting, from the estimated noise component, a residual component corresponding to a harmonic structure of the target sound component; and
target sound synthesis means for synthesizing the estimated target sound component and the residual component.
2. The sound processing apparatus according to claim 1, wherein the harmonic component extraction means includes:
frequency estimation means for estimating a fundamental frequency of the target sound component;
harmonic coefficient sequence generation means for generating a harmonic coefficient sequence in which each coefficient value is set so that harmonic components at frequencies that are integer multiples of the fundamental frequency are emphasized in the estimated noise component; and
harmonic extraction means for extracting the residual component by applying the harmonic coefficient sequence to the estimated noise component.
3. The sound processing apparatus according to claim 2, wherein the frequency estimation means estimates the fundamental frequency of the estimated target sound component generated by the target sound extraction means.
4. The sound processing apparatus according to claim 2 or claim 3, further comprising:
first frequency analysis means for generating, as the observation matrix, a time series of spectra for each unit section of the first acoustic signal and the second acoustic signal under a first analysis parameter including a window width and a movement amount of each unit section;
second frequency analysis means for sequentially generating, for each unit section, a spectrum of the estimated target sound component and a spectrum of the estimated noise component under a second analysis parameter including a window width and a movement amount different from those of the first analysis parameter; and
coefficient sequence correction means for generating a correction coefficient sequence in which a coefficient value is set for each analysis point arranged on the time axis and the frequency axis at intervals according to the second analysis parameter, wherein:
the noise extraction means generates, using the noise basis and the weight sequence corresponding to the noise basis, a noise coefficient sequence in which a coefficient value is set for each analysis point arranged on the time axis and the frequency axis at intervals according to the first analysis parameter, and generates the estimated noise component by applying the noise coefficient sequence to the observation matrix;
the coefficient sequence correction means generates the correction coefficient sequence from the noise coefficient sequence; and
the harmonic coefficient sequence generation means generates the harmonic coefficient sequence by extracting, from the correction coefficient sequence, components at frequencies that are integer multiples of the fundamental frequency.
5. The sound processing apparatus according to claim 4, further comprising:
phase difference calculation means for calculating a phase difference between the first acoustic signal and the second acoustic signal; and
intensity difference calculation means for calculating an intensity difference between the first acoustic signal and the second acoustic signal, wherein:
the target sound extraction means generates, from each basis of the basis matrix other than the noise basis and each weight sequence of the coefficient matrix corresponding to a basis other than the noise basis, a target sound coefficient sequence in which each coefficient value is variably set according to the phase difference and the intensity difference between the first acoustic signal and the second acoustic signal, and applies the target sound coefficient sequence to the observation matrix; and
the noise extraction means generates, from the noise basis and the weight sequence corresponding to the noise basis, a noise coefficient sequence in which each coefficient value is variably set according to the phase difference and the intensity difference between the first acoustic signal and the second acoustic signal, and applies the noise coefficient sequence to the observation matrix.
6. A program for causing a computer to execute:
a matrix decomposition process of generating, for each of a first acoustic signal and a second acoustic signal collected in parallel, by non-negative matrix factorization of an observation matrix whose elements are a time series of component values for each frequency of the acoustic signal, a basis matrix including a plurality of bases each indicating component values for each frequency of a different component of the acoustic signal, and a coefficient matrix including a plurality of weight sequences each indicating a time series of weight values of the respective bases;
a noise specifying process of specifying, as a noise basis corresponding to a noise component of the first acoustic signal, a basis among the plurality of bases of the basis matrix of the first acoustic signal that has a high correlation with a basis of the basis matrix of the second acoustic signal;
a target sound extraction process of generating an estimated target sound component in which the noise component is suppressed from the first acoustic signal, using each basis of the basis matrix other than the noise basis and each weight sequence of the coefficient matrix corresponding to a basis other than the noise basis;
a noise extraction process of generating an estimated noise component in which the target sound component is suppressed from the first acoustic signal, using the noise basis and the weight sequence of the coefficient matrix corresponding to the noise basis;
a harmonic component extraction process of extracting, from the estimated noise component, a residual component corresponding to a harmonic structure of the target sound component; and
a target sound synthesis process of synthesizing the estimated target sound component and the residual component.
JP2010159543A 2010-07-14 2010-07-14 Sound processing apparatus and program Expired - Fee Related JP5516169B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010159543A JP5516169B2 (en) 2010-07-14 2010-07-14 Sound processing apparatus and program


Publications (2)

Publication Number Publication Date
JP2012022120A (en) 2012-02-02
JP5516169B2 (en) 2014-06-11

Family

ID=45776455


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013160735A1 (en) * 2012-04-27 2013-10-31 Sony Mobile Communications Ab Noise suppression based on correlation of sound in a microphone array
JP6174856B2 (en) 2012-12-27 2017-08-02 キヤノン株式会社 Noise suppression device, control method thereof, and program
US9384553B2 (en) * 2013-04-03 2016-07-05 Mitsubishi Electric Research Laboratories, Inc. Method for factorizing images of a scene into basis images
JP2015118361A (en) * 2013-11-15 2015-06-25 キヤノン株式会社 Information processing apparatus, information processing method, and program
JP6371516B2 (en) 2013-11-15 2018-08-08 キヤノン株式会社 Acoustic signal processing apparatus and method
JP6482173B2 (en) * 2014-01-20 2019-03-13 キヤノン株式会社 Acoustic signal processing apparatus and method
JP6274872B2 (en) * 2014-01-21 2018-02-07 キヤノン株式会社 Sound processing apparatus and sound processing method
US10515650B2 (en) 2015-06-30 2019-12-24 Nec Corporation Signal processing apparatus, signal processing method, and signal processing program
JP7443823B2 (en) * 2020-02-28 2024-03-06 ヤマハ株式会社 Sound processing method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001124621A (en) * 1999-10-28 2001-05-11 Matsushita Electric Ind Co Ltd Noise measuring instrument capable of reducing wind noise
JP2006227152A (en) * 2005-02-16 2006-08-31 Nippon Telegr & Teleph Corp <Ntt> Computing device, and sound collecting device using the same
JP4356670B2 (en) * 2005-09-12 2009-11-04 ソニー株式会社 Noise reduction device, noise reduction method, noise reduction program, and sound collection device for electronic device
JP2008263483A (en) * 2007-04-13 2008-10-30 Sanyo Electric Co Ltd Wind noise reducing device, sound signal recorder, and imaging apparatus
US8015003B2 (en) * 2007-11-19 2011-09-06 Mitsubishi Electric Research Laboratories, Inc. Denoising acoustic signals using constrained non-negative matrix factorization
JP5454330B2 (en) * 2010-04-23 2014-03-26 ヤマハ株式会社 Sound processor



Legal Events

Date / Code / Description
2013-05-20 / A621 / Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2013-11-28 / A977 / Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007)
2013-12-17 / A131 / Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2014-02-07 / A521 / Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
TRDD / Decision of grant or rejection written
2014-03-04 / A01 / Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01)
2014-03-17 / A61 / First payment of annual fees during grant procedure (JAPANESE INTERMEDIATE CODE: A61)
R150 / Certificate of patent or registration of utility model (Ref document number: 5516169; Country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)
LAPS / Cancellation because of no payment of annual fees