WO2021100136A1 - Sound source signal estimation device, sound source signal estimation method, and program - Google Patents

Sound source signal estimation device, sound source signal estimation method, and program

Info

Publication number
WO2021100136A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
signal
separation filter
source signal
sound
Application number
PCT/JP2019/045392
Other languages
French (fr)
Japanese (ja)
Inventor
江村 暁 (Satoru Emura)
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2019/045392 priority Critical patent/WO2021100136A1/en
Publication of WO2021100136A1 publication Critical patent/WO2021100136A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source

Definitions

  • The program describing this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.
  • Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
  • The distribution of this program is carried out, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes the processing according to the read program. As other forms of executing the program, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or may sequentially execute the processing according to the received program each time the program is transferred from the server computer to this computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • The program in this embodiment includes information to be used for processing by a computer that is equivalent to a program (such as data that is not a direct command to the computer but has a property of defining the processing of the computer).
  • In this embodiment, the hardware entity is configured by executing a prescribed program on a computer, but at least a part of the processing content may instead be realized in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided is a sound source signal estimation technique that can improve estimation accuracy by using the sparseness of a sound source signal as an evaluation criterion. A sound source signal estimation device according to the present invention comprises: a whitening unit that generates a whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)) from a sound collection signal vector y(f, ω); a separation filter generation unit that generates a separation filter W(ω) by solving an optimization problem for a cost function F(W(ω)) of the separation filter W(ω) defined using the whitened sound collection signal vector u(f, ω); and a sound source separation unit that generates an estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω).

Description

Sound source signal estimation device, sound source signal estimation method, and program
 The present invention relates to a technique for estimating a sound source signal.
 In recent years, techniques for separating the signals from a plurality of sound sources (hereinafter referred to as sound source signals) contained in a multi-channel sound collection signal, acquired by installing a plurality of microphones in a sound field, into the individual sound source signals have been actively researched and developed. As an example of such a method, blind source separation (Blind Source Separation; BSS) based on independent component analysis (Independent Component Analysis; ICA) is well known.
 An example of BSS is described below. First, consider the case where M sensors are installed in a sound field containing M sound sources. Each of the M sound sources is called the mth sound source (m=1, …, M), and the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M) is denoted by s_m(k) (where k represents time). Each of the M sensors is called the nth sensor (n=1, …, M), and the signal obtained by the nth sensor picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M) is denoted by y_n(k) (where k represents time). Now consider a model (the instantaneous mixing model) in which the nth sound collection signal y_n(k) (n=1, …, M) is described by the following equation.
y_n(k) = Σ_{m=1}^{M} h_{n,m} s_m(k)
 Here, h_{n,m} is a mixing coefficient. Note that the mixing coefficients h_{n,m} are scalars.
 In BSS based on ICA, the signal from the mth sound source is separated by multiplying the nth sound collection signal y_n(k) by a separation coefficient w_{m,n} and taking the sum, as in the following equation, to obtain the mth estimated sound source signal ^s_m(k) (m=1, …, M).
^s_m(k) = Σ_{n=1}^{M} w_{m,n} y_n(k)
 Here, the separation coefficients w_{m,n} are updated so that the individual sound source signals become statistically more independent. The natural gradient method and FastICA are known as such update methods.
 Next, consider the case where microphones, instead of sensors, are installed in the sound field; that is, M microphones are installed in a sound field containing M sound sources. Each of the M microphones is called the nth microphone (n=1, …, M), and the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M) is denoted by y_n(k) (where k represents time). Now consider a model (the convolutive mixing model) in which the nth sound collection signal y_n(k) (n=1, …, M) is described by the following equation using convolution.
y_n(k) = Σ_{m=1}^{M} Σ_{p=0}^{P-1} h_{n,m}(p) s_m(k-p)
 Here, h_{n,m}(p) is the impulse response of the acoustic path from the mth sound source to the nth microphone, and P is the length of the impulse response of the acoustic path.
 In BSS, the signal from the mth sound source is separated by the following equation using FIR filters w_{m,n}(q), to obtain the mth estimated sound source signal ^s_m(k) (m=1, …, M).
^s_m(k) = Σ_{n=1}^{M} Σ_{q=0}^{Q-1} w_{m,n}(q) y_n(k-q)
 Here, Q is the filter length of the FIR filters.
 Since the length P of the impulse response of the acoustic path amounts to several thousand taps at 16 kHz sampling for a typical reverberation time of about T_60 = 400 ms, the filter length Q of the FIR filters is also several thousand. The computation of BSS under the convolutive mixing model is therefore far more difficult than that of BSS under the instantaneous mixing model.
 A frequency-domain processing approach is therefore usually applied to BSS under the convolutive mixing model. In this approach, a short-time Fourier transform (Short-Time Fourier Transform; STFT) is applied to the nth sound collection signal y_n(k) to convert it into the frequency domain. The convolutive mixing model is thereby converted into a collection of per-frequency instantaneous mixing models, as in the following equation.
Y_n(f, ω) = Σ_{m=1}^{M} H_{n,m}(ω)S_m(f, ω), i.e., y(f, ω) = H(ω)s(f, ω) with y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T and s(f, ω) = [S_1(f, ω), …, S_M(f, ω)]^T
 Here, f is the frame number when the signal is divided into frames by the STFT, ω is the frequency, S_m(f, ω) is the mth sound source signal obtained by converting s_m(k) into the frequency domain, H_{n,m}(ω) is the transfer characteristic of the acoustic path from the mth sound source to the nth microphone, obtained by converting h_{n,m}(p) into the frequency domain, and Y_n(f, ω) is the nth sound collection signal obtained by converting y_n(k) into the frequency domain. Also, ・^T represents transpose.
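 As a concrete illustration of this frequency-domain conversion, the following is a minimal sketch using SciPy's STFT; the sampling rate, frame length, and function name are illustrative assumptions, not values fixed by this description.

```python
import numpy as np
from scipy.signal import stft

def to_frequency_domain(y, fs=16000, nperseg=512):
    """y: (M, K) array of M time-domain sound collection signals y_n(k).
    Returns Y with Y[n, w, f] = Y_n(f, omega) for frequency bin w and frame f."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, axis=-1)
    return Y

# Example: two microphones, one second of signal.
y = np.random.randn(2, 16000)
Y = to_frequency_domain(y)
print(Y.shape)  # (2, 257, number of frames)
```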
 The separation filter W(ω) is then given by the following equation.
W(ω) = [w_{m,n}(ω)] (the M×M matrix of separation coefficients at frequency ω, applied as ^s(f, ω) = W(ω)y(f, ω))
 The separation filter W(ω) can be updated at each frequency by applying the natural gradient method or FastICA described above as is. This approach is therefore called frequency-domain ICA (Frequency-Domain ICA; FDICA).
 Because FDICA processes each frequency individually, two problems arise. The first, called the scaling problem, is that each sound source signal is estimated with a different gain at each frequency. The second, called the permutation problem, is that the sound sources are estimated in a different order at each frequency.
 The scaling problem has been solved by a method that recovers the sound source signal component at the position of a microphone, focusing on the transfer characteristic between the estimated sound source signal and the sound collection signal of the microphone, and the permutation problem has been solved by a method based on clustering of activity sequences obtained from the estimated sound source signals (see Non-Patent Document 1).
 However, although sound source separation by FDICA yields an estimated sound source signal in which the signal from a given sound source has been separated, the separation performance is often insufficient. This is because crosstalk components of the signals from other sound sources, that is, the signals from other sound sources and their reverberation, are mixed into the estimated sound source signal, and this effect becomes large when the reverberation time is not short. In other words, compared with an ideal signal containing only the signal from the sound source to be separated, the estimated sound source signal leaves room for being made more sparse.
 An object of the present invention is therefore to provide a sound source signal estimation technique capable of improving the estimation accuracy by using the sparseness of the sound source signal as an evaluation criterion.
 One aspect of the present invention comprises: where M is an integer of 2 or more, s_m(k) (where k represents time) is the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M), y_n(k) (where k represents time) is the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M), Y_n(f, ω) (n=1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the nth sound collection signal y_n(k) (hereinafter also referred to as the nth sound collection signal), and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the sound collection signal vector, a whitening unit that generates from the sound collection signal vector y(f, ω) a whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)); a separation filter generation unit that generates a separation filter W(ω) by solving the following optimization problem for the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened sound collection signal vector u(f, ω)
min_{W(ω)} F(W(ω)) = ||w_1^H(ω)[u(1, ω) … u(F, ω)]||_1 + … + ||w_M^H(ω)[u(1, ω) … u(F, ω)]||_1  subject to  W^H(ω)W(ω) = I
 (where F is an integer of 1 or more); and a sound source separation unit that generates an estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω).
 According to the present invention, the estimation accuracy can be improved by using the sparseness of the sound source signal as an evaluation criterion.
FIG. 1 is a block diagram showing the configuration of a sound source signal estimation device 100.
FIG. 2 is a flowchart showing the operation of the sound source signal estimation device 100.
FIG. 3 is a block diagram showing the configuration of a separation filter generation unit 130.
FIG. 4 is a flowchart showing the operation of the separation filter generation unit 130.
FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each device in the embodiments of the present invention.
 Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.
 Prior to the description of the embodiments, the notation used in this specification is explained.
 _ (underscore) represents a subscript. For example, x^y_z indicates that y_z is a superscript to x, and x_y_z indicates that y_z is a subscript to x.
 In addition, superscripts such as "^" and "~" in ^x and ~x should properly be written directly above the character "x", but due to the notational constraints of this specification they are written as ^x and ~x.
<Technical Background>
 The sparseness of a separated sound source signal can be evaluated by the L1 norm of a vector (see Reference Non-Patent Document 1).
(Reference Non-Patent Document 1: Traian E. Abrudan, Jan Eriksson, and Visa Koivunen, "Steepest Descent Algorithms for Optimization Under Unitary Matrix Constraint," IEEE Trans. on Signal Processing, vol. 56, Issue 3, pp. 1134-1147, March 2008.)
 However, a small L1 norm value can mean either that the signal itself is sparse or that the amplitude of the signal is small. Therefore, simply making the L1 norm of a vector related to the separated sound source signal small does not necessarily make the signal sparse.
 In the embodiments of the present invention, therefore, a separation filter is obtained that minimizes the L1 norm of the vector related to the separated sound source signal while keeping its L2 norm, which represents the signal power, constant. For this purpose, a method of generating the separation filter based on a Riemannian optimization technique is used. Here, the sound source signal separated from the whitened sound collection signal is used as the separated sound source signal.
 The sound source signal estimation procedure in the embodiments of the present invention is described below.
《Sound Source Signal Estimation Procedure》
(Step 1: STFT transform)
 The nth sound collection signal y_n(k) (n=1, …, M) is converted, using the STFT, into the nth sound collection signal Y_n(f, ω) (n=1, …, M), which is a signal in the frequency domain.
(Step 2: Whitening of the sound collection signal)
 Here, the nth sound collection signals Y_n(f, ω) (n=1, …, M) are whitened. The whitened sound collection signal vector u(f, ω) = [U_1(f, ω), …, U_M(f, ω)]^T, obtained by whitening the sound collection signal vector y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T whose nth element is the nth sound collection signal Y_n(f, ω), is the vector u(f, ω) that satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed M×M matrix T(ω). Here, I is the M×M identity matrix, and E[・] represents the expected value. Hereinafter, ω is omitted for simplicity.
 The matrix T can be obtained using the eigenvalue decomposition of the spatial correlation matrix of the sound collection signal vector y(f). Suppose the eigenvalue decomposition of the spatial correlation matrix E[y(f)y^H(f)] is given by the following equation.
E[y(f)y^H(f)] = CDC^H
 Here, D is a diagonal matrix, and ・^H represents the Hermitian transpose.
 The matrix T is then obtained by the following equation.
T = D^{-1/2}C^H
 Using this matrix T, the relationship between the sound collection signal vector y(f) and the whitened sound collection signal vector u(f) can be expressed by the following equation.
u(f) = Ty(f)
 Alternatively, this can be written as
y(f) = T^{-1}u(f)
where
T^{-1} = CD^{1/2}.
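 A minimal NumPy sketch of this whitening step follows, assuming the sample average over the available frames is used in place of the expectation E[・] and that the spatial correlation matrix is full rank; the function name is an illustrative assumption.

```python
import numpy as np

def whitening_matrix(Y):
    """Y: (M, F) matrix whose columns are the sound collection signal vectors y(f)
    at one frequency. Returns T = D^{-1/2} C^H so that u(f) = T y(f) satisfies
    E[u u^H] ~= I (sample estimate)."""
    R = Y @ Y.conj().T / Y.shape[1]      # sample spatial correlation E[y y^H]
    d, C = np.linalg.eigh(R)             # eigenvalue decomposition R = C diag(d) C^H
    return np.diag(d ** -0.5) @ C.conj().T

# Example at one frequency bin: 2 channels, 200 frames.
rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
T = whitening_matrix(Y)
U = T @ Y
print(np.round(U @ U.conj().T / U.shape[1], 2))  # approximately the identity matrix
```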
(Step 3: Generation of the separation filter)
 Attention is paid to a cost function that measures the sparseness of the separated sound source signal by the L1 norm, and the separation filter is generated so that its value is minimized. Here, as the vector related to the separated sound source signal, the vector v(f) of the following equation, generated from the whitened sound collection signal vector u(f) using the separation filter W, is used.
v(f) = W^Hu(f) = [w_1^Hu(f), …, w_M^Hu(f)]^T
 The generation of the separation filter W is formulated as the following constrained minimization problem (optimization problem).
min_W ||w_1^H[u(1) … u(F)]||_1 + … + ||w_M^H[u(1) … u(F)]||_1  subject to  W^HW = I
 Here, F is an integer of 1 or more representing the number of frames used for the optimization.
 In the following, for brevity, let F(W) = ||w_1^H[u(1) … u(F)]||_1 + … + ||w_M^H[u(1) … u(F)]||_1. F(W) is called the cost function.
 The above optimization problem can then be expressed as follows.
min_W F(W)  subject to  W^HW = I
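 For illustration, the cost function F(W) can be evaluated as in the following sketch; it uses the ε-smoothed L1 norm described at the end of this section so that the cost is also differentiable, and the value of EPS is an illustrative assumption.

```python
import numpy as np

EPS = 1e-8  # smoothing constant for the L1 norm (illustrative assumption)

def cost(W, U):
    """F(W) = sum_m ||w_m^H [u(1) ... u(F)]||_1 for W = [w_1 ... w_M] (M x M)
    and U = [u(1) ... u(F)] (M x F), with the epsilon-smoothed L1 norm."""
    V = W.conj().T @ U                        # V[m, f] = w_m^H u(f)
    return float(np.sum(np.sqrt(np.abs(V) ** 2 + EPS)))
```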
 Before describing the algorithm that solves this optimization problem, the theoretical background is explained first. Although W ∈ C^{M×M}, the more general case W ∈ C^{M×P} is considered here.
(Calculation of the Riemannian gradient)
 The total derivative DF_W(Z) of the cost function F in the direction Z (Z ∈ C^{M×P}) is given by the following equation.
Figure JPOXMLDOC01-appb-M000019
 where
Figure JPOXMLDOC01-appb-M000020
and G is the matrix consisting of the partial derivatives of the cost function F with respect to W.
 Using the Riemannian gradient U based on the canonical inner product, the total derivative DF_W(Z) can be expressed as follows.
Figure JPOXMLDOC01-appb-M000021
 The Riemannian gradient U then lies in the tangent space of the constraint W^HW = I, and it can be derived that
U = GW^H - WG^H
(see Reference Non-Patent Document 1).
 The Riemannian gradient U gives the direction that decreases the cost function F the most within the tangent space of the constraint W^HW = I.
(Search along a curve on the manifold based on the Riemannian gradient)
 Here, consider the following curve on the manifold corresponding to the constraint W^HW = I:
Y(τ) = (I + (τ/2)A)^{-1}(I - (τ/2)A)W
where the matrix A is a skew-Hermitian matrix (A^H = -A) and τ is a parameter along the curve.
 By varying τ, Y(τ) traces a curve on the manifold. This curve is also called the Cayley transform (see Reference Non-Patent Document 2).
(Reference Non-Patent Document 2: Z. Wen and W. Yin, "A feasible method for optimization with orthogonality constraints," Mathematical Programming, 142(1-2), pp. 397-434, 2013.)
 This curve Y(τ) has the following properties.
(1) The constraint W^HW = I is satisfied for any τ.
(2) The tangent vector Y'(0) at τ = 0 is given by Y'(0) = -AW.
 From the above properties, it can be seen that, by setting A = GW^H - WG^H, the cost function F decreases most efficiently along this curve Y(τ). The separation filter W can therefore be obtained by finding a parameter τ that reliably decreases the cost function F along the curve Y(τ) (see Reference Non-Patent Document 2).
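 A sketch of evaluating one point on this curve is shown below; solving a linear system avoids forming the matrix inverse explicitly, and the function name is an illustrative assumption.

```python
import numpy as np

def cayley_point(W, G, tau):
    """Y(tau) = (I + (tau/2) A)^{-1} (I - (tau/2) A) W with A = G W^H - W G^H.
    If W^H W = I, then Y(tau)^H Y(tau) = I for any tau, since A is skew-Hermitian."""
    A = G @ W.conj().T - W @ G.conj().T
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ W)
```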
 The algorithm for obtaining the separation filter W is described below.
(1) Let the initial value of the separation filter W be W^[1]. W^[1] is called the first update result of the separation filter W.
(2) Using W^[k], the kth update result of the separation filter W, obtain the following curve on the manifold.
Y(τ) = (I + (τ/2)A)^{-1}(I - (τ/2)A)W^[k]
(3) W^[k+1], the (k+1)th update result of the separation filter W, is obtained by the following search along this curve Y(τ).
(3-1) Set the parameter τ to a non-zero value.
(3-2) Set the two parameters ρ_1, ρ_2 that specify the Armijo condition, where the parameters ρ_1, ρ_2 satisfy 0 < ρ_1 < ρ_2 < 1.
(3-3) While the following two inequalities hold, keep halving the parameter τ.
Figure JPOXMLDOC01-appb-M000025
 Otherwise, that is, when either of the above two inequalities no longer holds, set W^[k+1] = Y(τ).
 where
Figure JPOXMLDOC01-appb-M000026
(4) When tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) becomes smaller than a preset value ε, it is judged that convergence has been reached, and W^[k+1] is taken as the solution; that is, W^[k+1] is taken as the separation filter W. Otherwise, when tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) is not smaller than the preset value ε, the process returns to (2). Note that tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) < ε is called the convergence condition.
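 The following sketch puts steps (1) to (4) together. It is a simplified variant under stated assumptions: the Euclidean gradient G is derived from the ε-smoothed L1 cost (see the approximation note at the end of this section), the initial value W^[1] is taken as the identity matrix, and the curvilinear search simply halves τ until the cost decreases instead of testing the two Armijo-type inequalities above.

```python
import numpy as np

EPS = 1e-8  # smoothing constant for the L1 norm (illustrative assumption)

def cost(W, U):
    V = W.conj().T @ U                                   # V[m, f] = w_m^H u(f)
    return float(np.sum(np.sqrt(np.abs(V) ** 2 + EPS)))

def grad(W, U):
    """Euclidean gradient G = dF/dW* of the smoothed cost: column m equals
    sum_f u(f) conj(v_m(f)) / (2 sqrt(|v_m(f)|^2 + eps)) with v_m(f) = w_m^H u(f)."""
    V = W.conj().T @ U
    S = V.conj() / (2.0 * np.sqrt(np.abs(V) ** 2 + EPS))
    return U @ S.T

def cayley_point(W, G, tau):
    A = G @ W.conj().T - W @ G.conj().T                  # skew-Hermitian generator
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ W)

def separation_filter(U, tau0=1.0, eps_conv=1e-8, max_iter=200):
    """U: (M, F) whitened frames [u(1) ... u(F)] at one frequency."""
    W = np.eye(U.shape[0], dtype=complex)                # W^[1]: identity (assumption)
    for _ in range(max_iter):
        G, tau = grad(W, U), tau0
        W_new = cayley_point(W, G, tau)
        while cost(W_new, U) >= cost(W, U) and tau > 1e-12:
            tau /= 2                                     # backtrack along the curve Y(tau)
            W_new = cayley_point(W, G, tau)
        # convergence condition of step (4): tr((W^[k+1]-W^[k])^H (W^[k+1]-W^[k])) < eps
        if np.trace((W_new - W).conj().T @ (W_new - W)).real < eps_conv:
            return W_new
        W = W_new
    return W
```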
(Step 4: Sound source separation)
 Using the separation filter W obtained in Step 3, the estimated sound source signal vector ^s(f) = [^S_1(f), …, ^S_M(f)]^T is generated from the sound collection signal vector y(f). Specifically, after obtaining the sound source signal vector s'(f) separated by s'(f) = Wy(f), the estimated sound source signal vector ^s(f) is generated as the sound source signal vector obtained by solving the scaling problem and the permutation problem for the sound source signal vector s'(f) using the method described in Non-Patent Document 1. Solving the scaling problem means adjusting the scaling of each component at each frequency, and solving the permutation problem means fixing the order of the sound source components estimated at each frequency.
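 A sketch of applying the separation filter across all frequency bins follows; the resolution of the scaling and permutation problems by the method of Non-Patent Document 1 is outside the scope of this sketch and is not implemented.

```python
import numpy as np

def separate(W_all, Y_all):
    """W_all: (n_freq, M, M) separation filters W(omega);
    Y_all: (M, n_freq, n_frames) sound collection signals Y_n(f, omega).
    Returns s'(f, omega) = W(omega) y(f, omega) stacked as (M, n_freq, n_frames)."""
    # For each frequency w: S[m, w, f] = sum_n W_all[w, m, n] * Y_all[n, w, f]
    return np.einsum('wmn,nwf->mwf', W_all, Y_all)
```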
(Step 5: Inverse STFT transform)
 Let the mth element of the estimated sound source signal vector ^s(f) be the mth estimated sound source signal ^S_m(f) (m=1, …, M); the mth estimated sound source signal ^S_m(f) is converted, using the inverse STFT, into the mth estimated sound source signal ^s_m(k) (1 ≤ m ≤ M), which is a signal in the time domain.
 Note that, when computing the derivative of the cost function F, it is necessary to take the derivative of the L1 norm of the vector v = [V_1 … V_L]. This derivative becomes infinite when the length of the vector v is 0. To avoid this, using a small constant ε, the L1 norm of the vector v may be approximated as
||v||_1 ≈ Σ_{l=1}^{L} √(|V_l|^2 + ε)
(see Non-Patent Document 1). This avoids the occurrence of infinity in the numerical computation and makes it possible to solve the optimization problem.
<First Embodiment>
 The sound source signal estimation device 100 is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the sound source signal estimation device 100. FIG. 2 is a flowchart showing the operation of the sound source signal estimation device 100. As shown in FIG. 1, the sound source signal estimation device 100 includes a frequency domain conversion unit 110, a whitening unit 120, a separation filter generation unit 130, a sound source separation unit 140, a time domain conversion unit 150, and a recording unit 190. The recording unit 190 is a component that appropriately records information necessary for the processing of the sound source signal estimation device 100.
 The sound source signal estimation device 100 takes as input the signals picked up by M microphones installed in a sound field containing M sound sources (where M is an integer of 2 or more), and estimates and outputs the signals from the M sound sources. In the following, s_m(k) (where k represents time) denotes the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M), and y_n(k) (where k represents time) denotes the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M).
 The operation of the sound source signal estimation device 100 is described with reference to FIG. 2.
 In S110, the frequency domain conversion unit 110 takes the nth sound collection signals y_n(k) (n=1, …, M) as input, generates from them, by a prescribed frequency domain conversion, the nth sound collection signals Y_n(f, ω) (n=1, …, M) (where f represents the frame number and ω the frequency), which are signals in the frequency domain, and outputs them. For the frequency domain conversion, for example, the STFT can be used.
 In S120, the whitening unit 120 takes the nth sound collection signals Y_n(f, ω) (n=1, …, M) generated in S110 as input, generates the whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)) from the sound collection signal vector y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T, and outputs it. The whitened sound collection signal vector u(f, ω) can be generated, for example, by the method described in <Technical Background>; specifically, as follows. First, the whitening unit 120 computes the eigenvalue decomposition E[y(f, ω)y^H(f, ω)] = C(ω)D(ω)C^H(ω) (where D(ω) is a diagonal matrix) of the spatial correlation matrix of the sound collection signal vector y(f, ω). Next, the whitening unit 120 computes the matrix T(ω) by the following equation.
T(ω) = D^{-1/2}(ω)C^H(ω)
 Finally, the whitening unit 120 computes the whitened sound collection signal vector u(f, ω) by u(f, ω) = T(ω)y(f, ω).
 In S130, the separation filter generation unit 130 takes the whitened sound collection signal vector u(f, ω) generated in S120 as input, and solves the optimization problem, shown below, for the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened sound collection signal vector u(f, ω).
min_{W(ω)} F(W(ω)) = ||w_1^H(ω)[u(1, ω) … u(F, ω)]||_1 + … + ||w_M^H(ω)[u(1, ω) … u(F, ω)]||_1  subject to  W^H(ω)W(ω) = I
 (where F is an integer of 1 or more). By solving this problem, the separation filter W(ω) is generated and output.
 The separation filter generation unit 130 is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the separation filter generation unit 130. FIG. 4 is a flowchart showing the operation of the separation filter generation unit 130. As shown in FIG. 3, the separation filter generation unit 130 includes a separation filter initialization unit 131, a parameter setting unit 132, a separation filter update unit 133, a counter update unit 134, and a convergence condition determination unit 135.
 The operation of the separation filter generation unit 130 is described with reference to FIG. 4.
 In S131, the separation filter initialization unit 131 sets the counter k to 1 and sets W^[1], the first update result of the separation filter W(ω).
 In S132, the parameter setting unit 132 sets the parameter τ_k (where τ_k is non-zero) and the parameters ρ_{k,1}, ρ_{k,2} (where ρ_{k,1} and ρ_{k,2} are parameters that specify the Armijo condition and satisfy 0 < ρ_{k,1} < ρ_{k,2} < 1).
 In S133, the separation filter update unit 133 keeps halving the parameter τ_k while the two inequalities (a) and (b) below, defined using W^[k], the kth update result of the separation filter W(ω), hold for the parameter τ_k,
Figure JPOXMLDOC01-appb-M000030
 and, when either of the inequalities (a), (b) no longer holds, generates W^[k+1], the (k+1)th update result of the separation filter W(ω), as W^[k+1] = Y^[k](τ_k). Halving the parameter τ_k specifically means setting τ_k ← τ_k/2. Also,
Y^[k](τ) = (I + (τ/2)A^[k])^{-1}(I - (τ/2)A^[k])W^[k], A^[k] = G^[k](W^[k])^H - W^[k](G^[k])^H.
 In S134, the counter update unit 134 increments the counter k by 1; specifically, k ← k+1.
 In S135, when a prescribed convergence condition defined using W^[k], the kth update result of the separation filter W(ω), and W^[k+1], the (k+1)th update result of the separation filter W(ω), is satisfied, the convergence condition determination unit 135 generates the separation filter W(ω) as W(ω) = W^[k+1] and ends the processing. Here, the prescribed convergence condition is tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) < ε (where ε is a preset value). Otherwise, the process returns to S132; that is, the separation filter generation unit 130 repeats the computations of S132 to S134.
 In S140, the sound source separation unit 140 takes as input the sound collection signal vector y(f, ω) generated in S120 and the separation filter W(ω) generated in S130, generates the estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω), and outputs it. The estimated sound source signal vector ^s(f, ω) can be generated, for example, by the method described in <Technical Background>; specifically, as follows. First, the sound source separation unit 140 computes the estimated sound source vector s'(f, ω) from the sound collection signal vector y(f, ω) and the separation filter W(ω) by s'(f, ω) = W(ω)y(f, ω). Next, the sound source separation unit 140 generates the estimated sound source signal vector ^s(f, ω) by solving the scaling problem and the permutation problem of the estimated sound source vector s'(f, ω) using the method described in Non-Patent Document 1.
 In S150, the time domain conversion unit 150 takes the estimated sound source signal vector ^s(f, ω) generated in S140 as input, generates, by a prescribed time domain conversion, from the mth estimated sound source signal ^S_m(f, ω) (m=1, …, M), the mth element of the estimated sound source signal vector ^s(f, ω), the mth estimated sound source signal ^s_m(k) (m=1, …, M), which is a signal in the time domain, and outputs it. For the time domain conversion, for example, the inverse STFT can be used.
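 A minimal sketch of this inverse conversion with SciPy's inverse STFT follows; the parameters are assumed to match those of the forward STFT sketch given earlier.

```python
import numpy as np
from scipy.signal import istft

def to_time_domain(S, fs=16000, nperseg=512):
    """S: (M, n_freq, n_frames) estimated spectra ^S_m(f, omega).
    Returns (M, K) time-domain estimated sound source signals ^s_m(k)."""
    _, s = istft(S, fs=fs, nperseg=nperseg)
    return s
```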
 According to the invention of the present embodiment, the estimation accuracy can be improved by using the sparseness of the sound source signal as an evaluation criterion.
<Supplementary Note>
 FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each of the devices described above. The processing in each of the devices described above can be carried out by loading into a recording unit 2020 a program for causing the computer to function as each of the devices described above, and operating a control unit 2010, an input unit 2030, an output unit 2040, and the like.
 The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus connecting the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading and writing a recording medium such as a CD-ROM. A physical entity equipped with such hardware resources is, for example, a general-purpose computer.
 The external storage device of the hardware entity stores the programs required to realize the functions described above and the data required for processing of those programs (the storage is not limited to the external storage device; a program may, for example, be stored in ROM, a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, and so on.
 In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for its processing are read into memory as needed and interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as "...unit", "...means", and so on).
 The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from its spirit. The processes described in the embodiment may be executed not only in chronological order as described, but also in parallel or individually, depending on the processing capacity of the device executing them or as otherwise required.
 As noted above, when the processing functions of the hardware entity (the device of the present invention) described in the embodiment are realized by a computer, the processing content of the functions the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
 The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.
 The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
 A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to it. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it; further, each time the program is transferred to the computer from the server computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions solely through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
 In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.
 The foregoing description of the embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings. The embodiments were chosen and described to provide the best illustration of the principles of the invention and to enable those skilled in the art to utilize the invention in various embodiments and with various modifications suited to the particular use contemplated. All such modifications and variations are within the scope of the invention as determined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

Claims (4)

  1.  A sound source signal estimation device, wherein M is an integer greater than or equal to 2; s_m(k) (where k represents time) is the signal from the m-th sound source (hereinafter, the m-th source signal) (m = 1, …, M); y_n(k) (where k represents time) is the signal obtained by picking up the first source signal s_1(k), …, the M-th source signal s_M(k) with the n-th microphone (hereinafter, the n-th collected signal) (n = 1, …, M); Y_n(f, ω) (n = 1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the n-th collected signal y_n(k) (hereinafter also referred to as the n-th collected signal); and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the collected-signal vector, the device comprising:
     a whitening unit that generates, from the collected-signal vector y(f, ω), a whitened collected-signal vector u(f, ω) (where u(f, ω) satisfies, for a predetermined matrix T(ω), u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I);
     a separation filter generation unit that generates a separation filter W(ω) by solving the optimization problem of the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened collected-signal vector u(f, ω):
     [Equation: Figure JPOXMLDOC01-appb-M000001]
     (where F is an integer greater than or equal to 1); and
     a sound source separation unit that generates an estimated source signal vector ^s(f, ω) from the collected-signal vector y(f, ω) using the separation filter W(ω).
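     Claim 1 leaves T(ω) as any predetermined matrix satisfying E[u(f, ω)u^H(f, ω)] = I. As an illustrative sketch only (not the claimed device itself), one common choice is built from the eigendecomposition of the sample spatial covariance:

```python
import numpy as np

def whitening_matrix(Y_omega):
    """Y_omega: (F, M) array; row f is the collected-signal vector
    y(f, omega) at one frequency bin omega, for frames f = 1..F.
    Returns one valid T(omega) with E[u u^H] = I for u = T y
    (assumes the covariance is full rank)."""
    F = Y_omega.shape[0]
    # Sample spatial covariance R(omega) = E[y(f, omega) y(f, omega)^H].
    R = Y_omega.T @ Y_omega.conj() / F
    lam, E = np.linalg.eigh(R)        # R = E diag(lam) E^H, lam > 0
    T = (E / np.sqrt(lam)).conj().T   # T = diag(lam)^{-1/2} E^H
    return T                          # whiten frames via u(f) = T @ y(f)
```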
  2.  The sound source signal estimation device according to claim 1, wherein the separation filter generation unit includes:
     a separation filter initialization unit that sets a counter k to 1 and sets W[1], the first update result of the separation filter W(ω);
     a parameter setting unit that sets a parameter τ_k (where τ_k is nonzero) and parameters ρ_{k,1} and ρ_{k,2} (where ρ_{k,1} and ρ_{k,2} are parameters specifying the Armijo conditions and satisfy 0 < ρ_{k,1} < ρ_{k,2} < 1);
     a separation filter update unit that halves the parameter τ_k while the two inequalities (a) and (b), defined using W[k], the k-th update result of the separation filter W(ω), hold for τ_k:
     [Equation: Figure JPOXMLDOC01-appb-M000002]
     and, when either of the inequalities (a) and (b) ceases to hold, generates W[k+1], the (k+1)-th update result of the separation filter W(ω), as W[k+1] = Y[k](τ_k);
     a counter update unit that increments the counter k by 1; and
     a convergence condition determination unit that, when a predetermined convergence condition defined using W[k], the k-th update result of the separation filter W(ω), and W[k+1], the (k+1)-th update result of the separation filter W(ω), is satisfied, generates the separation filter W(ω) as W(ω) = W[k+1].
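     The inequalities (a) and (b) are given only as an equation image in the publication, so the control-flow sketch below models them as hypothetical placeholder predicates `holds_a` and `holds_b`, with `Y_update` standing for the update curve Y[k](τ); only the halving loop and the final assignment W[k+1] = Y[k](τ_k) follow the claim:

```python
def separation_filter_update(W_k, tau_k, holds_a, holds_b, Y_update):
    """W_k: current filter W[k]; tau_k: nonzero step size.
    holds_a, holds_b: placeholder predicates for inequalities (a), (b).
    Y_update: callable implementing the update curve Y^[k](tau)."""
    # Halve tau_k while both inequalities (a) and (b) hold
    # (the separation filter update unit of claim 2).
    while holds_a(W_k, tau_k) and holds_b(W_k, tau_k):
        tau_k = tau_k / 2.0
    # Once either inequality fails, take W[k+1] = Y^[k](tau_k).
    return Y_update(W_k, tau_k)
```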
  3.  A sound source signal estimation method, wherein M is an integer greater than or equal to 2; s_m(k) (where k represents time) is the signal from the m-th sound source (hereinafter, the m-th source signal) (m = 1, …, M); y_n(k) (where k represents time) is the signal obtained by picking up the first source signal s_1(k), …, the M-th source signal s_M(k) with the n-th microphone (hereinafter, the n-th collected signal) (n = 1, …, M); Y_n(f, ω) (n = 1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the n-th collected signal y_n(k) (hereinafter also referred to as the n-th collected signal); and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the collected-signal vector, the method comprising:
     a whitening step in which a sound source signal estimation device generates, from the collected-signal vector y(f, ω), a whitened collected-signal vector u(f, ω) (where u(f, ω) satisfies, for a predetermined matrix T(ω), u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I);
     a separation filter generation step in which the sound source signal estimation device generates a separation filter W(ω) by solving the optimization problem of the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened collected-signal vector u(f, ω):
     [Equation: Figure JPOXMLDOC01-appb-M000003]
     (where F is an integer greater than or equal to 1); and
     a sound source separation step in which the sound source signal estimation device generates an estimated source signal vector ^s(f, ω) from the collected-signal vector y(f, ω) using the separation filter W(ω).
  4.  A program for causing a computer to function as the sound source signal estimation device according to claim 1 or 2.

Non-Patent Citations (1)

HYVÄRINEN, AAPO ET AL.: "Independent Component Analysis", first edition, Tokyo Denki University Press, vol. 20, 1 February 2005, pages 404-423. * Cited by examiner


Legal Events

121: the EPO has been informed by WIPO that EP was designated in this application (ref document number 19953357; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: DE)
122: PCT application non-entry in European phase (ref document number 19953357; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: JP)