JP2022065565A - Acoustic signal processing apparatus, acoustic signal processing method, and program

Info

Publication number: JP2022065565A
Application number: JP2020174241A
Authority: JP
Inventors: 順貴小野; Junki Ono; ロビンシャイブラー; Scheibler Robin; 佑幸若林; Hiroyuki Wakabayashi; 隆生河村; Takao Kawamura; 亮一宮崎; Ryoichi Miyazaki
Original assignee: Tokyo Metropolitan Public University Corp
Current assignee: Tokyo Metropolitan Public University Corp
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-04-27
Anticipated expiration: 2040-10-15
Also published as: JP7497040B2

Abstract

To provide a technique for removing noise from monaurally recorded audio signals. SOLUTION: An acoustic signal processing apparatus includes: a compensated spectrogram acquisition unit for acquiring a processing target spectrogram that satisfies an origin matching condition in which a start position of a removal target sound in a processing target sound time series indicating a monaurally recorded processing target sound matches a time origin in a reference time series indicating a pre-recorded removal target sound, and satisfies a sampling frequency matching condition in which a sampling frequency in the processing target sound time series matches the sampling frequency in the reference time series, and for acquiring a reference spectrogram that satisfies the origin matching condition and the sampling frequency matching condition; and a target complex spectrogram estimation unit for estimating a time series that maximizes likelihood for a predetermined probability distribution of a target complex spectrogram obtained as a difference in complex spectrum at the same time between the processing target spectrogram and the reference spectrogram multiplied by a frequency transfer function. SELECTED DRAWING: Figure 1

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested: Proceedings of the 2020 Spring Meeting of the Acoustical Society of Japan (at Saitama University), March 2, 2020 (Reiwa 2).

The present invention relates to an acoustic signal processing device, an acoustic signal processing method, and a program.

Various techniques have been proposed for generating, from a time series representing the sound captured when a target sound was recorded, a time series representing a sound in which the influence of sounds other than the target sound (that is, noise) is reduced (Non-Patent Documents 1 and 2).

Non-Patent Document 1: R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics," IEEE Trans. on SAP, vol. 9, no. 5, pp. 504-512, 2001.
Non-Patent Document 2: T. Gerkmann, et al., "Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay," IEEE/ACM Trans. on ASLP, vol. 20, no. 4, pp. 1383-1393, 2012.

These proposals include a technique for removing non-stationary noise from a monaural recording. However, that technique cannot be said to remove noise sufficiently. Thus, it is difficult to generate a time series of sound with reduced noise influence from a time series of sound recorded with a single microphone (that is, a monaural recording).

In view of the above circumstances, an object of the present invention is to provide a technique for generating, from a time series of monaurally recorded sound, a time series of sound in which the influence of noise is further reduced.

One aspect of the present invention is an acoustic signal processing device including: a compensated spectrogram acquisition unit that acquires a processing target spectrogram and a reference spectrogram, the processing target spectrogram being a time series of complex spectra of a processing target sound time series that satisfies an origin matching condition and a sampling frequency matching condition, and the reference spectrogram being a time series of complex spectra of a reference time series that satisfies the origin matching condition and the sampling frequency matching condition, where the processing target sound time series indicates a monaurally recorded processing target sound, the reference time series indicates a pre-recorded removal target sound, the origin matching condition is that the start position at which the removal target sound starts in the processing target sound time series matches the time origin of the reference time series, and the sampling frequency matching condition is that the sampling frequency of the processing target sound time series matches the sampling frequency of the reference time series; a target complex spectrogram estimation unit that estimates a time series that maximizes the likelihood, with respect to a predetermined probability distribution, of a target complex spectrum, which is a complex spectrum in a target complex spectrogram obtained as the difference, at the same time, between the complex spectrum of the processing target spectrogram and that of the reference spectrogram multiplied by a frequency transfer function; and a target sound time series generation unit that generates a time series of a sound having the target complex spectrogram estimated by the target complex spectrogram estimation unit.

According to the present invention, it is possible to generate, from a time series of monaurally recorded sound, a time series of sound in which the influence of noise is further reduced.

FIG. 1 is an explanatory diagram illustrating an outline of the acoustic signal processing device 1 of the embodiment.
FIG. 2 is a diagram showing an example of the functional configuration of the acoustic signal processing device 1 of the embodiment.
FIG. 3 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
FIG. 4 is a flowchart showing an example of the flow of processing executed by the acoustic signal processing device 1 of the embodiment.
FIG. 5 is an explanatory diagram illustrating the experimental environment of the evaluation experiment in the embodiment.
FIG. 6 is a diagram showing an example of the results of the evaluation experiment in the embodiment.

(Embodiment)
FIG. 1 is an explanatory diagram illustrating an outline of the acoustic signal processing device 1 of the embodiment. The acoustic signal processing device 1 uses a time series indicating the sound captured when a target sound was monaurally recorded (hereinafter, the captured sound is referred to as the "processing target sound") to generate a time series of a sound in which the influence of noise is reduced. Hereinafter, the time series indicating the processing target sound is referred to as the processing target sound time series. Monaural recording means recording with a single microphone.

Noise is any sound other than the target sound, whatever its nature. Noise includes not only random sounds but also, for example, classical music played from a speaker, as long as it is not the target sound. Reducing the influence of a sound specifically means reducing the amplitude of that sound.

Specifically, the acoustic signal processing device 1 uses a reference time series to generate a time series (hereinafter referred to as the "target sound time series") indicating a sound in which the influence of the removal target sound has been suppressed from the processing target sound time series. The removal target sound is a sound, among the sounds included in the processing target sound, that is not the target sound and is to be removed from the processing target sound. The reference time series is a time series indicating a reference sound. The reference sound is the removal target sound as pre-recorded on a sound recording medium such as a compact disc. Because music data in a predetermined format such as MP3 (MPEG-1 Audio Layer-3) also represents pre-recorded sound, the reference sound may be the removal target sound pre-recorded in the form of music data.

The processing target sound is, for example, a conversation between two speakers recorded monaurally while a classical piece recorded on a compact disc is playing from a speaker. In this case, the target sound is the conversation between the two speakers, the removal target sound is the sound of the classical piece playing from the speaker, and the reference sound is the sound recorded on the compact disc.

The processing target sound time series is a signal obtained by discretizing an analog signal representing the processing target sound at a predetermined sampling frequency. Accordingly, x(n) in FIG. 1 represents the amplitude x indicated by the n-th sample of the processing target sound time series.

The reference time series is a signal obtained by discretizing an analog signal representing the reference sound at a predetermined sampling frequency. Accordingly, o(n) in FIG. 1 represents the amplitude o indicated by the n-th sample of the reference time series.

The processing target sound time series and the reference time series are not necessarily discretized at the same sampling frequency. In fact, their sampling frequencies generally differ. This is because, even if the sampling frequency configured at recording time is the same, the actual sampling frequency deviates from the configured value owing to, for example, environment-dependent changes in the performance of the recording hardware, such as the oscillator. Thus, the processing target sound time series and the reference time series do not necessarily satisfy the sampling frequency matching condition. The sampling frequency matching condition is the condition that the sampling frequency of the processing target sound time series and the sampling frequency of the reference time series are the same.

In addition, the removal target sound does not necessarily begin at the moment recording of the processing target sound starts. Therefore, the position at which the removal target sound starts within the processing target sound time series (hereinafter referred to as the "removal target sound start position") does not necessarily coincide with the time origin (that is, n = 0) of the processing target sound time series. As a result, the processing target sound time series and the reference time series do not necessarily satisfy the origin matching condition. The origin matching condition is the condition that the removal target sound start position coincides with the time origin of the reference time series.

The flow of processing executed by the acoustic signal processing device 1 is described in more detail below. The acoustic signal processing device 1 includes a compensated spectrogram acquisition unit 110, a target complex spectrogram estimation unit 120, and a target sound time series generation unit 130.

The compensated spectrogram acquisition unit 110 acquires the processing target sound time series and the reference time series and, based on them, acquires the processing target spectrogram and the reference spectrogram. The processing target spectrogram is a time series of complex spectra of the processing target sound time series that satisfies the sampling frequency matching condition and the origin matching condition. The reference spectrogram is a time series of complex spectra of the reference time series that satisfies the sampling frequency matching condition and the origin matching condition.

The compensated spectrogram acquisition unit 110 acquires the processing target spectrogram and the reference spectrogram by executing, for example, a time origin compensation process, an amplitude frequency conversion process, and a sampling frequency compensation process.

The time origin compensation process aligns the removal target sound start position with the time origin of the reference time series. In the time origin compensation process, for example, the time origin of the reference time series is shifted so as to maximize the cross-correlation between the processing target sound time series and the reference time series, which aligns the time origin of the reference time series with the removal target sound start position (see Reference 1). In this case, the processing target sound time series is left unchanged and the reference time series is modified. Executing the time origin compensation process causes the origin matching condition to be satisfied.
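The cross-correlation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Reference 1: the function name `align_reference` and the zero-padding policy are hypothetical, and `numpy.correlate` is used for clarity even though an FFT-based correlation would be preferable for long recordings.

```python
import numpy as np

def align_reference(x, o):
    """Shift the reference o so that its time origin matches the position
    in x where the removal target sound starts (hypothetical helper).

    x: processing target sound time series, 1-D float array
    o: reference time series, 1-D float array
    Returns (aligned_reference, lag), where aligned_reference has len(x)
    samples and lag is the estimated start position in samples.
    """
    # Full cross-correlation: output index i corresponds to lag i - (len(o) - 1).
    corr = np.correlate(x, o, mode="full")
    lag = int(np.argmax(corr)) - (len(o) - 1)

    # Only the reference is moved; x is left unchanged, as in the text.
    aligned = np.zeros(len(x), dtype=float)
    src_start = max(0, -lag)   # samples of o that fall before n = 0 are dropped
    dst_start = max(0, lag)
    n = min(len(o) - src_start, len(x) - dst_start)
    if n > 0:
        aligned[dst_start:dst_start + n] = o[src_start:src_start + n]
    return aligned, lag
```

The returned `aligned` array has the same length as `x`, so the two series can be passed to the same framing step afterwards.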

Reference 1: Japanese Unexamined Patent Application Publication No. 2014-174393

The amplitude frequency conversion process includes a first sub-amplitude frequency conversion process and a second sub-amplitude frequency conversion process. The first sub-amplitude frequency conversion process acquires, for the processing target sound time series satisfying the origin matching condition, a complex spectrum for each of a plurality of first division periods. A first division period is a contiguous part of the entire period of the processing target sound time series. No first division period contains another first division period, and the union of all first division periods equals the entire period of the processing target sound time series. The process of acquiring a complex spectrum for each first division period is, for example, a short-time Fourier transform. By executing the first sub-amplitude frequency conversion process, the processing target sound time series satisfying the origin matching condition is converted into a time series of its complex spectra (hereinafter referred to as the "first spectrogram").

The second sub-amplitude frequency conversion process acquires, for the reference time series satisfying the origin matching condition, a complex spectrum for each of a plurality of second division periods. A second division period is a contiguous part of the entire period of the reference time series. No second division period contains another second division period, and the union of all second division periods equals the entire period of the reference time series. The process of acquiring a complex spectrum for each second division period is, for example, a short-time Fourier transform. By executing the second sub-amplitude frequency conversion process, the reference time series satisfying the origin matching condition is converted into a time series of its complex spectra (hereinafter referred to as the "second spectrogram").
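As a rough illustration of the two sub-processes, the short-time Fourier transform mentioned above can be sketched with NumPy as follows. The frame length, hop size, and Hann window are assumptions chosen for the example and are not specified in the source; the function name `stft` is likewise hypothetical. The same function would be applied once to the processing target sound time series and once to the aligned reference time series.

```python
import numpy as np

def stft(signal, frame_len=1024, hop=512):
    """Return a complex spectrogram of shape (n_freq, n_frames).

    Each frame is one division period: frames overlap but none contains
    another, and together they cover the whole (zero-padded) signal.
    """
    window = np.hanning(frame_len)
    # Number of frames needed to cover every sample of the signal.
    n_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / hop)))
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[:len(signal)] = signal
    frames = np.stack([
        padded[m * hop:m * hop + frame_len] * window
        for m in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.fft.rfft(frames, axis=1).T
```

Column `m` of the result holds the complex spectrum of the m-th time frame, and row `k` corresponds to one frequency bin.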

The sampling frequency compensation process acquires the processing target spectrogram and the reference spectrogram based on the first spectrogram and the second spectrogram. The sampling frequency compensation process is, for example, a process that applies blind synchronization (the blind compensation audio signal processing method described in Reference 1) to the first spectrogram and the second spectrogram. In blind synchronization, the sampling frequency of the reference time series is changed, and the sampling frequency of the processing target sound time series is left unchanged.

In this way, the processing target spectrogram and the reference spectrogram are obtained by executing the time origin compensation process, the amplitude frequency conversion process, and the sampling frequency compensation process, in this order, on the pair consisting of the processing target sound time series and the reference time series.

In FIG. 1, the symbol of the following equation (1) represents the processing target spectrogram.

The symbol of equation (1) represents the amplitude and phase of the frequency component at frequency ω in the complex spectrum represented by the m-th time frame of the processing target spectrogram (m is an integer of 1 or more).

In FIG. 1, the symbol of the following equation (2) represents the reference spectrogram.

The symbol of equation (2) represents the amplitude and phase of the frequency component at frequency ω in the complex spectrum represented by the m-th time frame of the reference spectrogram. The symbol ε_0 in equation (2) represents the difference in sampling frequency between the processing target sound time series and the reference time series.

Based on the processing target spectrogram and the reference spectrogram, the target complex spectrogram estimation unit 120 estimates the time series of target complex spectra that maximizes the likelihood with respect to a predetermined probability distribution (hereinafter referred to as the "reference probability distribution"). A target complex spectrum is the complex spectrum obtained as the difference, at the same time, between the complex spectrum of the processing target spectrogram and that of the reference spectrogram multiplied by the frequency transfer function. Hereinafter, the time series of target complex spectra is referred to as the target complex spectrogram.

The target complex spectrum represented by the m-th time frame of the target complex spectrogram is expressed by, for example, the following equation (3). The symbol on the left side of equation (3) represents the target complex spectrum.

H(ω) on the right side of equation (3) represents the frequency transfer function (that is, a function representing the frequency response of the environment in which the reference sound is played back and recorded).
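Following the verbal definition above (the difference, at the same time frame, between the processing target spectrogram and the reference spectrogram multiplied by the frequency transfer function), the relation of equation (3) can be written as below. S(ω, m) and H(ω) are the symbols used in the text; X(ω, m) and O(ω, m) are assumed names for the processing target spectrogram and the reference spectrogram, chosen here by analogy with x(n) and o(n), and may differ from the notation of the original equations (1) and (2).

```latex
S(\omega, m) = X(\omega, m) - H(\omega)\, O(\omega, m)
```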

The reference probability distribution is expressed by, for example, the following equation (4). Equation (4) is a zero-mean generalized complex normal distribution.

α and β are parameters that determine the shape of the reference probability distribution and take predetermined values. In particular, α represents the variance. β is an integer of 1 or more. When β = 2, equation (4) represents a normal distribution, and when β = 1, equation (4) represents a Laplace distribution. In equation (4), Γ denotes the gamma function.
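As a hedged illustration only, one common parametrization of a zero-mean, circularly symmetric generalized complex normal density that agrees with the description above (parameters α and β, a gamma-function normalizer, a normal distribution at β = 2 and a Laplace-type distribution at β = 1) is shown below. This is an assumption, not a reproduction of equation (4): the original may place α differently, and in this particular form α acts as a scale parameter whose square is the variance when β = 2.

```latex
p\bigl(S(\omega, m)\bigr)
  = \frac{\beta}{2\pi\,\alpha^{2}\,\Gamma(2/\beta)}
    \exp\!\left( - \frac{\lvert S(\omega, m) \rvert^{\beta}}{\alpha^{\beta}} \right)
```

The normalizer follows from integrating the exponential over the complex plane in polar coordinates, which gives 2πα²Γ(2/β)/β.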

Hereinafter, the distribution of the occurrence probability of the target complex spectra in the target complex spectrogram is referred to as the target complex spectrum occurrence probability distribution. Maximizing the likelihood with respect to the predetermined probability distribution means, for example when the reference probability distribution is expressed by equation (4), maximizing the log-likelihood function of the frequency transfer function expressed by the following equation (5).

The left side of equation (5) is the log-likelihood function of the frequency transfer function. The second term on the right side of equation (5) is a constant. Increasing the left side of equation (5) therefore means decreasing the magnitude of the first term on the right side, that is, the first term with its negative sign removed. That magnitude becoming smaller means that the average of the β-th power of the absolute value of S(ω, m) becomes smaller. Because S(ω, m) becomes smaller the more completely the removal target sound is removed from the processing target sound, a smaller magnitude of the first term on the right side of equation (5) means that the obtained time series is one in which the influence of the removal target sound is more strongly suppressed. Consequently, acquiring the target complex spectrum that maximizes the log-likelihood function of the frequency transfer function expressed by equation (5) amounts to acquiring a time series of the processing target sound in which the influence of the removal target sound is suppressed as far as possible.
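The argument above says that maximizing the log-likelihood over H(ω) is equivalent to minimizing, for each frequency, the sum over frames of |X(ω, m) − H(ω)O(ω, m)|^β. A minimal sketch of one way to do this is given below. It is not taken from the source: it uses iteratively reweighted least squares, a generic technique swapped in here for illustration, which reduces to an ordinary per-frequency least-squares solution when β = 2. The names `X`, `O`, and both function names are assumptions.

```python
import numpy as np

def estimate_transfer_function(X, O, beta=2.0, n_iter=20, eps=1e-8):
    """Estimate H(omega) minimizing sum_m |X - H * O|**beta per frequency.

    X, O: complex spectrograms of shape (n_freq, n_frames)
          (processing target and reference; assumed names).
    Returns H as a complex array of shape (n_freq,).
    """
    # beta = 2: closed-form least squares, also used as the starting point.
    num = np.sum(np.conj(O) * X, axis=1)
    den = np.sum(np.abs(O) ** 2, axis=1) + eps
    H = num / den
    if beta == 2.0:
        return H
    for _ in range(n_iter):
        S = X - H[:, None] * O
        # Reweighting: |S|**beta is treated as w * |S|**2 with w = |S|**(beta - 2).
        w = np.maximum(np.abs(S), eps) ** (beta - 2.0)
        num = np.sum(w * np.conj(O) * X, axis=1)
        den = np.sum(w * np.abs(O) ** 2, axis=1) + eps
        H = num / den
    return H

def estimate_target_spectrogram(X, O, beta=2.0):
    """Return (S, H), where S = X - H * O is the estimated target spectrogram."""
    H = estimate_transfer_function(X, O, beta=beta)
    return X - H[:, None] * O, H
```

With β below 2 the weights shrink the influence of frames where the residual is large, which is the usual motivation for choosing a heavier-tailed model when the target sound is sparse in time and frequency.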

Based on the target complex spectrogram estimated by the target complex spectrogram estimation unit 120, the target sound time series generation unit 130 generates, as the target sound time series, a time series of a sound having that target complex spectrogram.
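This passage does not name the method used to turn the estimated spectrogram back into a waveform. One plausible choice, shown here purely as an assumption, is an inverse short-time Fourier transform with weighted overlap-add. The sketch assumes the spectrogram was produced with a Hann analysis window, `frame_len` samples per frame, and a hop of `hop` samples; the function name `istft` and these defaults are hypothetical.

```python
import numpy as np

def istft(spec, frame_len=1024, hop=512, length=None):
    """Reconstruct a real time series from a complex spectrogram.

    spec: array of shape (n_freq, n_frames), as produced by an STFT that
          used np.hanning(frame_len) as the analysis window (assumption).
    """
    window = np.hanning(frame_len)
    n_frames = spec.shape[1]
    out_len = (n_frames - 1) * hop + frame_len
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    frames = np.fft.irfft(spec.T, n=frame_len, axis=1)
    for m in range(n_frames):
        start = m * hop
        # Weighted overlap-add: apply the window again and accumulate.
        y[start:start + frame_len] += frames[m] * window
        norm[start:start + frame_len] += window ** 2
    # Divide by the accumulated squared window to undo the double windowing.
    y = y / np.maximum(norm, 1e-12)
    return y if length is None else y[:length]
```

The first and last samples cannot be recovered because the Hann window is zero there; every other sample is reconstructed when the spectrogram is unmodified.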

The target sound time series is a time series generated using the processing target sound time series, in which the components of the reference time series have been suppressed from the processing target sound time series. Because the reference sound is the pre-recorded removal target sound, the target sound time series is a time series of a sound in which the components of the removal target sound have been suppressed from the processing target sound.

FIG. 2 is a diagram showing an example of the functional configuration of the acoustic signal processing device 1 of the embodiment. The acoustic signal processing device 1 includes a control unit 10, which includes a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the acoustic signal processing device 1 functions as a device including the control unit 10, an input unit 11, a communication unit 12, a storage unit 13, and an output unit 14. More specifically, the processor 91 reads the program stored in the storage unit 13 and stores it in the memory 92. When the processor 91 executes the program stored in the memory 92, the acoustic signal processing device 1 functions as a device including the control unit 10, the input unit 11, the communication unit 12, the storage unit 13, and the output unit 14.

The control unit 10 controls the operation of the various functional units of the acoustic signal processing device 1. The control unit 10 generates the target sound time series using, for example, the processing target sound time series and the reference time series.

The input unit 11 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 11 may instead be configured as an interface that connects such input devices to the device itself. The input unit 11 accepts input of various kinds of information to the device.

The communication unit 12 includes a communication interface for connecting the device to an external device. The communication unit 12 communicates with the connected external device by wire or wirelessly. The communication unit 12 acquires, for example, the reference time series from an external device. The communication unit 12 acquires, for example, the processing target sound time series from an external device. The external device is, for example, a computer that plays back the music data of the reference sound and transmits the time series data being played back to the acoustic signal processing device 1. The external device is, for example, a computer that transmits the processing target sound time series. The external device is, for example, a microphone that monaurally records the processing target sound time series.

The storage unit 13 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 13 stores various kinds of information relating to the acoustic signal processing device 1. The storage unit 13 stores in advance, for example, information indicating the reference probability distribution. The storage unit 13 stores in advance, for example, the frequency transfer function.

The output unit 14 outputs various kinds of information. The output unit 14 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 14 may be configured as an interface that connects such a display device to the device itself. The output unit 14 outputs, for example, information input to the input unit 11. The output unit 14 may include a sound output device such as a speaker. The output unit 14 may be configured as an interface that connects such a sound output device to the device itself.

FIG. 3 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment. The control unit 10 includes the compensated spectrogram acquisition unit 110, the target complex spectrogram estimation unit 120, the target sound time series generation unit 130, a processing target sound time series acquisition unit 140, a reference time series acquisition unit 150, a communication control unit 160, an output control unit 170, and a recording unit 180.

The processing target sound time series acquisition unit 140 acquires the processing target sound time series input to the communication unit 12. If the processing target sound time series is already stored in the storage unit 13, the processing target sound time series acquisition unit 140 may acquire it by reading it from the storage unit 13.

参照時系列取得部１５０は、通信部１２に入力された参照時系列を取得する。参照時系列取得部１５０は、参照時系列が予め記憶部１３に記憶済みの場合には、記憶部１３から参照時系列を読み出すことで参照時系列を取得してもよい。 The reference time series acquisition unit 150 acquires the reference time series input to the communication unit 12. When the reference time series is stored in the storage unit 13 in advance, the reference time series acquisition unit 150 may acquire the reference time series by reading the reference time series from the storage unit 13.

通信制御部１６０及は、通信部１２の動作を制御する。出力制御部１７０は出力部１４の動作を制御する。記録部１８０は、情報を記憶部１３に記録する。 The communication control unit 160 and the communication control unit 160 control the operation of the communication unit 12. The output control unit 170 controls the operation of the output unit 14. The recording unit 180 records information in the storage unit 13.

The compensated spectrogram acquisition unit 110 includes a time origin compensation unit 111, an amplitude frequency conversion unit 112, and a sampling frequency compensation unit 113.

The time origin compensation unit 111 executes the time origin compensation process on the processing target sound time series acquired by the processing target sound time series acquisition unit 140 and the reference time series acquired by the reference time series acquisition unit 150.

The amplitude frequency conversion unit 112 executes the amplitude frequency conversion process on the processing target sound time series that satisfies the origin matching condition and the reference time series that satisfies the origin matching condition. The amplitude frequency conversion process converts the former into the first spectrogram and the latter into the second spectrogram.

The sampling frequency compensation unit 113 executes the sampling frequency compensation process. The sampling frequency compensation process generates the processing target spectrogram and the reference spectrogram from the first spectrogram and the second spectrogram.

FIG. 4 is a flowchart showing an example of the flow of processing executed by the acoustic signal processing device 1 of the embodiment. The processing target sound time series is input to the communication unit 12, and the processing target sound time series acquisition unit 140 acquires it (step S101). Next, the reference time series is input to the communication unit 12, and the reference time series acquisition unit 150 acquires it (step S102). Next, the compensated spectrogram acquisition unit 110 receives the processing target sound time series and the reference time series and, based on them, acquires the processing target spectrogram and the reference spectrogram (step S103).

In step S103, the compensated spectrogram acquisition unit 110 acquires the processing target spectrogram and the reference spectrogram from the processing target sound time series and the reference time series by executing, for example, the following compensated spectrogram acquisition process. In this process, the time origin compensation unit 111 first executes the time origin compensation process on the processing target sound time series and the reference time series, so that the start position in the processing target sound time series matches the time origin of the reference time series. The time origin compensation unit 111 thereby obtains a processing target sound time series and a reference time series that both satisfy the origin matching condition.
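This passage does not say how the offset between the two time series is found. The following is a minimal sketch of one common approach, cross-correlation peak picking, and it is an assumption rather than the method of the embodiment. The function names `estimate_start_offset` and `align_time_origin` are hypothetical.

```python
import numpy as np

def estimate_start_offset(target, reference):
    """Estimate the sample index in `target` where `reference` begins.

    Assumption: the offset is taken at the peak of the cross-correlation.
    """
    corr = np.correlate(target, reference, mode="full")
    # In "full" mode, output index 0 corresponds to lag -(len(reference) - 1).
    return int(np.argmax(np.abs(corr))) - (len(reference) - 1)

def align_time_origin(target, reference):
    """Shift `reference` so that its time origin matches its start in `target`."""
    offset = estimate_start_offset(target, reference)
    if offset >= 0:
        shifted = np.concatenate([np.zeros(offset), reference])
    else:
        shifted = reference[-offset:]
    # Trim or zero-pad so the aligned reference has the same length as the target.
    shifted = shifted[:len(target)]
    shifted = np.pad(shifted, (0, len(target) - len(shifted)))
    return offset, shifted
```

A direct `np.correlate` call is quadratic in the signal length, so long recordings would normally call for an FFT-based correlation instead.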

In the compensated spectrogram acquisition process, the amplitude frequency conversion unit 112 next executes the amplitude frequency conversion process on the processing target sound time series and the reference time series that satisfy the origin matching condition. The amplitude frequency conversion process converts the processing target sound time series that satisfies the origin matching condition into the first spectrogram, and the reference time series that satisfies the origin matching condition into the second spectrogram.

In the compensated spectrogram acquisition process, the sampling frequency compensation unit 113 then executes the sampling frequency compensation process. The sampling frequency compensation process generates the processing target spectrogram and the reference spectrogram from the first spectrogram and the second spectrogram.

Following step S103, the target complex spectrogram estimation unit 120 estimates, based on the processing target spectrogram and the reference spectrogram, the target complex spectrogram that maximizes the likelihood with respect to the reference probability distribution (step S104).

Next, the target sound time series generation unit 130 generates, as the target sound time series, the time series of a sound having the target complex spectrogram estimated by the target complex spectrogram estimation unit 120 (step S105).

Next, the output control unit 170 controls the operation of the output unit 14 so that the output unit 14 outputs the sound represented by the target sound time series (step S106). The target sound time series generated in step S105 may be recorded in the storage unit 13 by the recording unit 180 at any point after step S105.

<Experimental results>
This section shows an example of the results of an experiment (hereinafter referred to as the "evaluation experiment") in which the acoustic signal processing device 1 of the embodiment was used to generate a time series of sound with reduced influence of noise.

FIG. 5 is an explanatory diagram illustrating the experimental environment of the evaluation experiment in the embodiment. The processing target sound and the reference sound for the evaluation experiment were recorded by a microphone 903 in a room measuring 4.1 m × 3.8 m × 2.8 m in which two speakers 901 and 902 and the microphone 903 were installed (hereinafter referred to as the "laboratory"). The sound absorption coefficient of the laboratory was 0.2. The speaker 901 was the sound source that output the target sound, and the speaker 902 was the sound source that output the reference sound.

The speaker 902 was installed 100 cm from one of the walls of the laboratory (hereinafter referred to as the "vertical reference wall"), and the speaker 901 was installed 140 cm from the vertical reference wall. The spacing between the speaker 901 and the speaker 902 in the direction perpendicular to the vertical reference wall was therefore 40 cm. The microphone 903 was installed 120 cm from the vertical reference wall.

The speaker 901 and the speaker 902 were each 120 cm from the horizontal reference wall. The horizontal reference wall is one of the walls of the laboratory and is orthogonal to the vertical reference wall. The microphone 903 was 290 cm from the horizontal reference wall. That is, the microphone 903 was installed 120 cm from the wall of the laboratory opposite the horizontal reference wall.

In the laboratory, the microphone 903 recorded while only the speaker 901 was operating and only the target sound was being output (hereinafter referred to as "target sound recording"). The sampling frequency of the microphone 903 was set to 16000 Hz during target sound recording. At a different time, the microphone 903 recorded while only the speaker 902 was operating and only the reference sound was being output (hereinafter referred to as "reference sound recording"). The sampling frequency of the microphone 903 was set to (16000 + 1) Hz during reference sound recording. In the evaluation experiment, a sampling frequency mismatch was thus simulated by offsetting the sampling frequencies of the target sound recording and the reference sound recording by 1 Hz.
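To give a sense of scale for the 1 Hz mismatch, the following small calculation expresses it as a relative error and as accumulated drift in samples. The 10-second duration is a hypothetical value chosen for illustration, since the recording lengths are not stated here.

```python
FS_TARGET = 16000.0     # Hz, setting during target sound recording
FS_REFERENCE = 16001.0  # Hz, setting during reference sound recording

def drift_in_samples(duration_s, fs_a=FS_TARGET, fs_b=FS_REFERENCE):
    """Difference in the number of samples taken over `duration_s` seconds."""
    return duration_s * (fs_b - fs_a)

# Relative mismatch in parts per million.
relative_mismatch_ppm = (FS_REFERENCE - FS_TARGET) / FS_TARGET * 1e6

# Hypothetical 10-second recording, used only for illustration.
drift_after_10_s = drift_in_samples(10.0)
```

The mismatch is 62.5 ppm, and the two recordings drift apart by one sample per second, so a fixed time alignment alone cannot keep them synchronized over a long recording.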

In the evaluation experiment, the sound recorded in the target sound recording and the sound recorded in the reference sound recording were mixed so that the input SNR (Signal to Noise Ratio) was -5, 0, 5, or 10 decibels, and each mixture was used as a processing target sound. In other words, the input SNR was varied over four conditions: -5 decibels, 0 decibels, 5 decibels, and 10 decibels.
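Mixing at a prescribed input SNR can be sketched as follows. This assumes that SNR is defined as the ratio of mean signal powers over the whole recording and that both signals have the same length; the text does not state how power was measured. The name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so that target-to-noise power ratio equals `snr_db`, then mix.

    Assumption: power is the mean squared amplitude over the whole signal.
    """
    target_power = np.mean(target ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(target_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return target + gain * noise, gain

# The four input SNR conditions used in the evaluation experiment.
INPUT_SNRS_DB = (-5.0, 0.0, 5.0, 10.0)
```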

In the evaluation experiment, the short-time Fourier transform was used as the amplitude frequency conversion process. The short-time Fourier transform used a window length of 4096 points and a shift length of half the window length, and each frame was zero-padded so that the transform was computed over 8192 points. A Hamming window was used as the window function.
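The stated analysis conditions can be sketched with NumPy as follows. The framing convention (no padding at the signal edges, one-sided spectrum) is an assumption, and the function name `stft` is hypothetical.

```python
import numpy as np

WIN_LEN = 4096      # window length in samples
HOP = WIN_LEN // 2  # shift length: half the window length
N_FFT = 8192        # transform size after zero-padding

def stft(x):
    """Return a complex spectrogram of shape (N_FFT // 2 + 1, n_frames).

    Assumes len(x) >= WIN_LEN; trailing samples that do not fill a frame are dropped.
    """
    window = np.hamming(WIN_LEN)
    n_frames = 1 + (len(x) - WIN_LEN) // HOP
    spec = np.empty((N_FFT // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = x[m * HOP : m * HOP + WIN_LEN] * window
        # rfft with n=N_FFT appends zeros to the frame before transforming.
        spec[:, m] = np.fft.rfft(frame, n=N_FFT)
    return spec
```

Zero-padding to 8192 points does not add information but samples the spectrum on a finer frequency grid.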

In the evaluation experiment, the acoustic signal processing device 1 was also evaluated using, in place of time series of sound output from the speakers 901 and 902 and recorded by the microphone 903, time series generated by numerical simulation (for example, the finite element method) of a model of an environment equivalent to the laboratory, including the input SNR conditions. Hereinafter, the part of the evaluation experiment that evaluates the acoustic signal processing device 1 using time series of sound output from the speakers 901 and 902 and recorded by the microphone 903 is referred to as the actual experiment. The part that evaluates the acoustic signal processing device 1 using time series generated by numerical simulation is referred to as the simulation experiment.

FIG. 6 is a diagram showing an example of the results of the evaluation experiment in the embodiment. Results R1 to R3 each show an example of the results of the simulation experiment. Results R4 to R6 each show an example of the results of the actual experiment. In results R1 to R6, the horizontal axis represents the input SNR and the vertical axis represents the output SNR. The output SNR was defined as 10 × log_10(σ_q^2 / σ_o^2) decibels. Here σ_q^2 is the variance of the target sound time series over the period in which the removal target sound was present, and σ_o^2 is the variance of the reference sound recorded by the microphone 903.
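The output SNR definition above can be written down literally as follows. The boolean mask `active`, marking the samples in which the removal target sound is present, is a hypothetical way of expressing "the period in which the removal target sound was present", and the function name is likewise an assumption.

```python
import numpy as np

def output_snr_db(target_series, recorded_reference, active):
    """10 * log10(sigma_q^2 / sigma_o^2), following the definition in the text.

    target_series      : target sound time series
    recorded_reference : reference sound recorded by the microphone
    active             : boolean mask, True where the removal target sound is present
    """
    sigma_q2 = np.var(target_series[active])
    sigma_o2 = np.var(recorded_reference)
    return 10.0 * np.log10(sigma_q2 / sigma_o2)
```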

In each of the graphs R1 to R6, the bars show, from left to right, the results for β = 0.2, 0.4, ..., 2.0. The filled bars (Sync. (off)) show the results without blind synchronization, and the unfilled bars (Sync. (on)) show the results with blind synchronization.

Results R1 to R6 show that compensating for the sampling frequency offset improved the output SNR. One reason is that, when a sampling frequency offset exists, the frequency response (frequency transfer function) is apparently no longer time-invariant, so the model expressed by equation (3) does not hold. Another reason is that the value of β that maximizes the output SNR differs depending on the type of processing target sound and on the input SNR. The acoustic signal processing device 1 therefore uses a model that is more flexible than one that assumes the probability distribution of the target complex spectrum to be a normal distribution or a Laplace distribution, and can obtain a higher output SNR.

<Method of obtaining the target complex spectrogram using equation (5)>
An example of a method for estimating the target complex spectrogram using equation (5) is described here. The target complex spectrogram is estimated by differentiating equation (5) with respect to H(ω) and taking, as the estimation result, the S(ω, m) that gives the local maximum. However, the derivative of a power of an absolute value cannot be obtained analytically.

Of course, the only difficulty is that no analytical solution exists, so the acoustic signal processing device 1 may estimate the target complex spectrogram by the straightforward approach of treating equation (5) as it is in a numerical calculation and obtaining an approximate value. In that case, however, the amount of calculation is large, so rounding errors and the like are likely to occur and the calculation error is likely to be large. The acoustic signal processing device 1 may therefore, instead of numerically evaluating equation (5) as it is, transform equation (5) into an equivalent expression that requires less calculation and use it to estimate the target complex spectrogram.

As a method of transforming equation (5) into an equivalent expression that requires less calculation, the acoustic signal processing device 1 may obtain the target complex spectrogram by applying, for example, the auxiliary function method described below. The auxiliary function method is also called the Majorization-Minimization Algorithm or MM Algorithm (see References 2 and 3).

Reference 2: David R. Hunter and Kenneth Lange, "A Tutorial on MM Algorithms", The American Statistician, 58:1, 30-37.
Reference 3: Ono, "Optimization Algorithm by Auxiliary Function Method and Its Application to Acoustic Signal Processing", Journal of the Acoustical Society of Japan, Vol. 68, No. 11, pp. 566-571, 2012.

The auxiliary function method requires finding an appropriate auxiliary function for the objective function. Here, the objective function is the function to be maximized or minimized, which is given for each optimization problem. An appropriate auxiliary function can be found, for example, by using the method described in Reference 4.

Reference 4: Nobutaka Ono and Shigeki Miyabe, "Auxiliary-Function-Based Independent Component Analysis for Super-Gaussian Sources", V. Vigneron et al. (Eds.): LVA/ICA 2010, LNCS 6365, pp. 165-172, Springer-Verlag Berlin Heidelberg, 2010.

Consider the mathematical properties of a function G(x) that is a continuous, differentiable function of an amplitude x and is even in x. For this purpose, the amplitude x is treated as a variable x that may also take negative values. If the function obtained by dividing the derivative of G(x) with respect to x by x is continuous within its domain, positive for x > 0, and monotonically decreasing, then the following equation (6) holds for any x.

The equality condition of equation (6) is expressed by the following equation (7).
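Equations (6) and (7) are not reproduced in this text. As a hedged reconstruction, the corresponding result in Reference 4 has the form below, where x_0 denotes the auxiliary variable; the symbol x_0 is notation chosen for this sketch and may differ from the original. The second line illustrates the bound for G(x) = |x|^β with 0 < β ≤ 2, an assumed form suggested by the β values used in the evaluation experiment.

```latex
% General bound and its equality condition (form taken from Reference 4)
G(x) \le \frac{G'(x_0)}{2 x_0}\, x^2 + \left( G(x_0) - \frac{x_0\, G'(x_0)}{2} \right),
\qquad \text{with equality if and only if } x = \pm x_0 .

% Illustration for G(x) = |x|^\beta, \; 0 < \beta \le 2 (assumed form)
|x|^{\beta} \le \frac{\beta}{2}\, |x_0|^{\beta - 2}\, x^2 + \left( 1 - \frac{\beta}{2} \right) |x_0|^{\beta} .
```

The right-hand side is quadratic in x, which is why minimizing it in place of the original function leads to a closed-form update.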

The first term of equation (5) satisfies the condition of equation (7). Therefore, an appropriate auxiliary function Q in the auxiliary function method, which is expressed by the following equation (8), can be written as the following equation (9).

H_0(ω) is an auxiliary variable. The second term on the right-hand side of equation (9) depends only on the auxiliary variable H_0(ω) and is therefore irrelevant to the optimization of H(ω). Equation (9) is a quadratic function of H(ω). Setting the derivative of the auxiliary function Q with respect to H(ω) to zero, substituting H_0(ω) = H(ω)^(k), and rearranging yields the following update equation (11), where k denotes the iteration number.

In equation (11), the asterisk denotes the complex conjugate. The target complex spectrogram is estimated by substituting the frequency transfer function estimated by equation (11) into equation (3). The auxiliary function application method estimates the target complex spectrogram by repeating this estimation using equation (11) and equation (3) until a predetermined termination condition is satisfied. The termination condition is, for example, that the estimation has been repeated a predetermined number of times. Any initial value may be used for the iteration as long as it does not make the denominator of equation (11) zero. For example, the initial value of H(ω) may be 1, provided that this does not make the denominator of equation (11) zero.
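Equations (3), (5), and (11) are not reproduced in this text, so the following is only a sketch of the iteration under stated assumptions. Equation (3) is taken to be S(ω, m) = X(ω, m) − H(ω) R(ω, m), with X the processing target spectrogram and R the reference spectrogram. The first term of equation (5) is taken to be the sum over m of |S(ω, m)|^β. The update is the standard weighted least-squares form that follows from a quadratic auxiliary function. The names `X`, `R`, `eps`, and `estimate_transfer_function` are hypothetical.

```python
import numpy as np

def estimate_transfer_function(X, R, beta=1.0, n_iter=50, eps=1e-12):
    """Iteratively estimate H(w) and the target complex spectrogram S = X - H * R.

    X, R : complex arrays of shape (n_freq, n_frames)
    beta : assumed shape parameter, 0 < beta <= 2
    eps  : small floor (an assumption of this sketch) that keeps weights finite
    """
    H = np.ones(X.shape[0], dtype=complex)  # initial value 1, as the text allows
    for _ in range(n_iter):                 # fixed iteration count as termination condition
        S = X - H[:, None] * R
        # Weights from the current residual; they play the role of the auxiliary variable.
        w = np.maximum(np.abs(S), eps) ** (beta - 2.0)
        numerator = np.sum(w * np.conj(R) * X, axis=1)
        denominator = np.sum(w * np.abs(R) ** 2, axis=1)
        H = numerator / np.maximum(denominator, eps)
    return H, X - H[:, None] * R
```

With β = 2 the weights are all 1 and the update reduces to an ordinary least-squares fit in a single step; smaller β values down-weight frames where the residual is large, which is where the target sound is likely to be active.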

The acoustic signal processing device 1 of the embodiment configured in this way estimates, based on the processing target spectrogram and the reference spectrogram, the target complex spectrogram that minimizes the difference between the distribution of the occurrence probability of the target complex spectra in the target complex spectrogram and the reference probability distribution. The target sound time series is a time series generated from the processing target sound time series in which the components of the reference time series are suppressed. The reference sound is the pre-recorded removal target sound. The acoustic signal processing device 1 can therefore generate, as the target sound time series, a time series of sound in which the components of the removal target sound are suppressed from the processing target sound. That is, the acoustic signal processing device 1 can generate, from a time series of monaurally recorded sound, a time series of sound in which the influence of noise is further reduced.

(Modification)
The acoustic signal processing device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the acoustic signal processing device 1 may be distributed across the plurality of information processing devices.

All or some of the functions of the acoustic signal processing device 1 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and includes designs and the like within a range that does not depart from the gist of the present invention.

1 ... acoustic signal processing device, 10 ... control unit, 11 ... input unit, 12 ... communication unit, 13 ... storage unit, 14 ... output unit, 110 ... compensated spectrogram acquisition unit, 111 ... time origin compensation unit, 112 ... amplitude frequency conversion unit, 113 ... sampling frequency compensation unit, 120 ... target complex spectrogram estimation unit, 130 ... target sound time series generation unit, 140 ... processing target sound time series acquisition unit, 150 ... reference time series acquisition unit, 160 ... communication control unit, 170 ... output control unit, 180 ... recording unit, 91 ... processor, 92 ... memory

Claims

An acoustic signal processing device comprising:
a compensated spectrogram acquisition unit that acquires a processing target spectrogram and a reference spectrogram, the processing target spectrogram being a time series of complex spectra of a processing target sound time series that satisfies an origin matching condition and a sampling frequency matching condition, the origin matching condition being that the start position of a removal target sound in the processing target sound time series, which represents a monaurally recorded processing target sound, matches the time origin of a reference time series, which represents the pre-recorded removal target sound, the sampling frequency matching condition being that the sampling frequency of the processing target sound time series matches the sampling frequency of the reference time series, and the reference spectrogram being a time series of complex spectra of the reference time series that satisfies the origin matching condition and the sampling frequency matching condition; and
a target complex spectrogram estimation unit that estimates the time series that maximizes the likelihood, with respect to a predetermined probability distribution, of target complex spectra, the target complex spectra being the complex spectra in a target complex spectrogram obtained as the difference, at each same time, between the complex spectrum of the processing target spectrogram and the complex spectrum of the reference spectrogram multiplied by a frequency transfer function.

The acoustic signal processing device according to claim 1, further comprising a target sound time series generation unit that generates a time series of a sound having the target complex spectrogram estimated by the target complex spectrogram estimation unit.

The acoustic signal processing device according to claim 1 or 2, wherein the compensated spectrogram acquisition unit, after matching the start position of the processing target sound time series with the time origin of the reference time series, acquires a first spectrogram, which is a time series of complex spectra of the processing target sound time series satisfying the origin matching condition, and a second spectrogram, which is a time series of complex spectra of the reference time series satisfying the origin matching condition, and acquires the processing target spectrogram and the reference spectrogram using the first spectrogram and the second spectrogram.

An acoustic signal processing method comprising:
a compensated spectrogram acquisition step of acquiring a processing target spectrogram and a reference spectrogram, the processing target spectrogram being a time series of complex spectra of a processing target sound time series that satisfies an origin matching condition and a sampling frequency matching condition, the origin matching condition being that the start position of a removal target sound in the processing target sound time series, which represents a monaurally recorded processing target sound, matches the time origin of a reference time series, which represents the pre-recorded removal target sound, the sampling frequency matching condition being that the sampling frequency of the processing target sound time series matches the sampling frequency of the reference time series, and the reference spectrogram being a time series of complex spectra of the reference time series that satisfies the origin matching condition and the sampling frequency matching condition; and
a target complex spectrogram estimation step of estimating the time series that maximizes the likelihood, with respect to a predetermined probability distribution, of target complex spectra, the target complex spectra being the complex spectra in a target complex spectrogram obtained as the difference, at each same time, between the complex spectrum of the processing target spectrogram and the complex spectrum of the reference spectrogram multiplied by a frequency transfer function.

A program for operating a computer as the acoustic signal processing device according to any one of claims 1 to 3.