JP6961545B2 - Sound signal processor, sound signal processing method, and program

Info

Publication number: JP6961545B2
Application number: JP2018125779A
Authority: JP
Inventors: 岳彦籠嶋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2018-07-02
Filing date: 2018-07-02
Publication date: 2021-11-05
Anticipated expiration: 2038-07-02
Also published as: JP2020003751A; CN110675890A; CN110675890B

Description

Embodiments of the present invention relate to a sound signal processing device, a sound signal processing method, and a program.

Techniques are known for emphasizing a target sound signal included in a sound signal of sounds emitted from a plurality of sound sources. For example, a technique has been disclosed in which an SN-ratio-maximizing beamformer, calculated from a feature amount of the sound signal observed by microphones, is used as a filter for emphasizing the target sound signal included in the sound signal. A vector representing the speaker direction or the difference in voice arrival time between the microphones is used as the feature amount.

Conventionally, a feature amount is extracted from the observed sound signal, and the filter for emphasizing the target sound signal is calculated from that feature amount. With this approach, it is sometimes difficult to emphasize the target sound signal with high accuracy.

Japanese Patent No. 4891801
Japanese Patent No. 5044581

The problem to be solved by the present invention is to provide a sound signal processing device, a sound signal processing method, and a program capable of emphasizing a target sound signal with high accuracy.

The sound signal processing device of the embodiment includes a coefficient derivation unit, a detection unit, and a correlation derivation unit. The coefficient derivation unit derives, based on an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a first sound signal. The detection unit detects a target sound section based on the emphasized sound signal. The correlation derivation unit derives, based on the target sound section and the first sound signal, a first spatial correlation matrix of the target sound section in the first sound signal and a second spatial correlation matrix of a non-target sound section, other than the target sound section, in the first sound signal. The coefficient derivation unit derives the spatial filter coefficient based on the first spatial correlation matrix and the second spatial correlation matrix. The detection unit detects the target sound section based on the emphasized sound signal and a second sound signal in which the ratio of the power of the non-target sound signal to the power of the target sound signal is larger than in the first sound signal.

Schematic diagram of a sound signal processing system.
Schematic diagram of the functional configuration of a sound signal processing unit.
Flowchart of sound signal processing.
Schematic diagram of a sound signal processing system.
Schematic diagram of the functional configuration of a sound signal processing unit.
Flowchart of sound signal processing.
Schematic diagram of a sound signal processing system.
Explanatory diagram of a hardware configuration.

The present embodiment will be described in detail below with reference to the accompanying drawings.

(First Embodiment)
FIG. 1 is a schematic diagram showing an example of the sound signal processing system 1 of the present embodiment.

The sound signal processing system 1 includes a sound signal processing device 10, a first microphone 14, and a second microphone 16. The sound signal processing device 10 is connected to the first microphone 14 and the second microphone 16 so that data and signals can be exchanged.

The sound signal processing device 10 processes sound signals of sounds emitted from one or more sound sources 12.

The sound source 12 is a source that generates sound. The sound source 12 is, for example, a living thing such as a person or a non-human animal, or a non-living thing such as a musical instrument, but is not limited to these. In the present embodiment, the case where the sound source 12 is a person is described as an example. Accordingly, the case where the sound is voice is described as an example, although the type of sound is not limited. In the following, a person may be referred to as a speaker.

In the present embodiment, the sound signal processing device 10 processes a sound signal including sounds emitted from a plurality of sound sources 12 and emphasizes a target sound signal included in the sound signal. The plurality of sound sources 12 are classified into a target sound source 12A and a non-target sound source 12B. The target sound source 12A is a sound source 12 that emits a target sound. The target sound is the sound to be emphasized. The target sound signal is a signal indicating the target sound and is represented by, for example, a spectrum. The non-target sound source 12B is a sound source 12 that emits a non-target sound. A non-target sound is any sound other than the target sound.

The present embodiment assumes an environment in which two speakers, the target sound source 12A and the non-target sound source 12B, converse across a table T. For example, the non-target sound source 12B is a store clerk and the target sound source 12A is a customer, and the target sound signal of one speaker, the target sound source 12A, is emphasized from the sound signal of their conversation. The number and arrangement of the sound sources 12 are not limited to these, and the assumed environment is not limited to this one.

The first microphone 14 and the second microphone 16 collect sound. In the present embodiment, the first microphone 14 and the second microphone 16 collect the sound emitted from the sound sources 12 and output sound signals to the sound signal processing device 10.

The first microphone 14 is a microphone for collecting sound that includes at least the target sound. In other words, the first microphone 14 is a microphone for collecting at least the target sound emitted from the target sound source 12A.

The first microphone 14 outputs a third sound signal to the sound signal processing device 10 as the sound signal indicating the collected sound. The third sound signal is a sound signal including a non-target sound signal and the target sound signal. The non-target sound signal is a signal indicating the non-target sound and is represented by, for example, a spectrum. The first microphone 14 is placed in advance at a position where it can collect the sound emitted from the sound sources 12 (the target sound source 12A and the non-target sound source 12B) and output the third sound signal to the sound signal processing device 10. In the present embodiment, the first microphone 14 is assumed to be placed on the table T.

In the present embodiment, the sound signal processing system 1 includes a plurality of first microphones 14 (first microphones 14A to 14D). Therefore, a plurality of third sound signals are output from the plurality of first microphones 14 to the sound signal processing device 10. In the following description, the sound signal obtained by combining the plurality of third sound signals into one is referred to as the first sound signal.

The number of first microphones 14 need only be equal to or greater than the number of sound sources 12 whose sound is to be collected. As described above, the present embodiment assumes a total of two sound sources 12: one target sound source 12A and one non-target sound source 12B. In this case, two or more first microphones 14 suffice. In the present embodiment, the case where the sound signal processing system 1 includes four first microphones 14 (first microphones 14A to 14D) is described as an example.

The plurality of first microphones 14 have mutually different sound arrival time differences from each of the plurality of sound sources 12. That is, the positions of the first microphones 14 are adjusted in advance so that these sound arrival time differences differ from one another.

The second microphone 16 is a microphone for collecting at least the non-target sound. In other words, the second microphone 16 is a microphone for collecting at least the non-target sound emitted from the non-target sound source 12B.

The second microphone 16 outputs a second sound signal to the sound signal processing device 10 as the sound signal indicating the collected sound. The second sound signal is a sound signal in which the ratio of the power of the non-target sound signal to the power of the target sound signal is larger than in the first sound signal (third sound signal). Preferably, in addition to satisfying this condition, the second sound signal is a sound signal in which the power of the non-target sound signal is larger than the power of the target sound signal.

In the present embodiment, the second microphone 16 is placed closer to the non-target sound source 12B than the first microphones 14 are. For example, the second microphone 16 is a headset microphone or a lavalier (pin) microphone. In the present embodiment, the second microphone 16 is worn by the speaker who is the non-target sound source 12B so that it can collect voice at that speaker's mouth.

The sound signal processing device 10 includes an AD conversion unit 18, a sound signal processing unit 20, and an output unit 22. The sound signal processing device 10 need only include at least the sound signal processing unit 20; at least one of the AD conversion unit 18 and the output unit 22 may be configured as a separate body.

The AD conversion unit 18 receives the plurality of third sound signals from the plurality of first microphones 14. The AD conversion unit 18 also receives the second sound signal from the second microphone 16. The AD conversion unit 18 converts each of the third sound signals and the second sound signal into a digital signal and outputs them to the sound signal processing unit 20.

The sound signal processing unit 20 uses the plurality of third sound signals and the second sound signal received from the AD conversion unit 18 to emphasize the target sound signal included in the first sound signal, which combines the plurality of third sound signals into one, and outputs the resulting emphasized sound signal to the output unit 22.

The output unit 22 is a device that outputs the emphasized sound signal received from the sound signal processing unit 20. The output unit 22 is, for example, a speaker, a communication device, a display device, a sound recording device, or a text recording device. The speaker outputs the sound represented by the emphasized sound signal. The communication device transmits the emphasized sound signal to an external device or the like via a network or the like. The display device displays information indicating the emphasized sound signal. The sound recording device stores the emphasized sound signal and is, for example, an IC recorder or a personal computer. The text recording device converts the sound indicated by the emphasized sound signal into text by a known method and records it. The output unit 22 may convert the emphasized sound signal received from the sound signal processing unit 20 into an analog signal before outputting, transmitting, storing, or recording it.

Next, the sound signal processing unit 20 will be described in detail.

FIG. 2 is a schematic diagram showing an example of the functional configuration of the sound signal processing unit 20.

The sound signal processing unit 20 includes a conversion unit 20A, a conversion unit 20B, a detection unit 20C, a correlation derivation unit 20D, a first correlation storage unit 20E, a second correlation storage unit 20F, a coefficient derivation unit 20G, a generation unit 20H, and an inverse conversion unit 20I.

The conversion unit 20A, the conversion unit 20B, the detection unit 20C, the correlation derivation unit 20D, the coefficient derivation unit 20G, the generation unit 20H, and the inverse conversion unit 20I are realized by, for example, one or more processors. For example, each of these units may be realized by causing a processor such as a CPU (Central Processing Unit) to execute a program, that is, by software. Each unit may instead be realized by a processor such as a dedicated IC (Integrated Circuit), that is, by hardware, or by a combination of software and hardware. When a plurality of processors are used, each processor may realize one of the units or two or more of them.

The first correlation storage unit 20E and the second correlation storage unit 20F store various types of information. They can be implemented with any commonly used storage medium, such as an HDD (Hard Disk Drive), an optical disk, a memory card, or a RAM (Random Access Memory). The first correlation storage unit 20E and the second correlation storage unit 20F may be physically different storage media, or may be realized as different storage areas of the same physical storage medium. Furthermore, each of them may be realized by a plurality of physically different storage media.

The conversion unit 20A applies a short-time Fourier transform (STFT) to the second sound signal received from the second microphone 16 via the AD conversion unit 18, and outputs the second sound signal, represented by the frequency spectrum X_1(f, n), to the detection unit 20C. Here, f denotes the frequency bin number and n denotes the frame number.

For example, suppose the sampling frequency is set to 16 kHz, the frame length to 256 samples, and the frame shift to 128 samples. In this case, the conversion unit 20A applies a 256-sample Hanning window to the second sound signal and then performs a fast Fourier transform (FFT) to convert the second sound signal into a frequency spectrum. Taking into account the symmetry between the low and high halves of this frequency spectrum, the 129 complex values in the range where f is 0 or more and 128 or less are calculated as the frequency spectrum X_1(f, n) of the n-th frame of the second sound signal. The conversion unit 20A then outputs the second sound signal represented by the frequency spectrum X_1(f, n) to the detection unit 20C.
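The framing described above can be illustrated with a minimal sketch using only the Python standard library. It is not taken from the patent: the function names are hypothetical, the periodic form of the Hanning window is an assumption, and a direct DFT is substituted for the FFT purely to keep the example self-contained (the result is the same, only slower).

```python
import cmath
import math

FRAME_LEN = 256    # samples per frame (from the text)
FRAME_SHIFT = 128  # samples between frame starts (from the text)
NUM_BINS = FRAME_LEN // 2 + 1  # 129 non-redundant bins, f = 0..128


def hann_window(length):
    """Periodic Hanning window; the exact window variant is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]


def frame_spectrum(frame, window):
    """Window one frame and return bins 0..128 of its DFT (direct DFT, not FFT)."""
    windowed = [s * w for s, w in zip(frame, window)]
    spectrum = []
    for f in range(NUM_BINS):
        acc = 0j
        for i, value in enumerate(windowed):
            acc += value * cmath.exp(-2j * math.pi * f * i / FRAME_LEN)
        spectrum.append(acc)
    return spectrum


def stft(signal):
    """Return X[n][f]: one 129-bin complex spectrum per frame n."""
    window = hann_window(FRAME_LEN)
    frames = []
    start = 0
    while start + FRAME_LEN <= len(signal):
        frames.append(frame_spectrum(signal[start:start + FRAME_LEN], window))
        start += FRAME_SHIFT
    return frames
```

At a 16 kHz sampling frequency, each bin spans 16000 / 256 = 62.5 Hz, so bin 128 corresponds to 8 kHz.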

The conversion unit 20B applies a short-time Fourier transform (STFT) to each of the plurality of third sound signals received from the plurality of first microphones 14 (first microphones 14A to 14D) via the AD conversion unit 18, and generates a plurality of third sound signals represented by the frequency spectra X_{2,1}(f, n), X_{2,2}(f, n), X_{2,3}(f, n), and X_{2,4}(f, n), respectively.

The frequency spectrum X_{2,1}(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14A. The frequency spectrum X_{2,2}(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14B. The frequency spectrum X_{2,3}(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14C. The frequency spectrum X_{2,4}(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14D.

In the following, the multidimensional vector (a four-dimensional vector in the present embodiment) that collects the frequency spectra of the individual third sound signals is referred to as the frequency spectrum X_2(f, n) indicating the first sound signal. In other words, the first sound signal is represented by the frequency spectrum X_2(f, n). The frequency spectrum X_2(f, n) indicating the first sound signal is represented by the following equation (1).
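Equation (1) is not reproduced in this text. From the surrounding definitions, it presumably stacks the four per-microphone spectra into one vector, as in the sketch below; whether the original writes it as a row or a column vector cannot be recovered here.

```latex
X_2(f, n) = \left[\, X_{2,1}(f, n),\; X_{2,2}(f, n),\; X_{2,3}(f, n),\; X_{2,4}(f, n) \,\right] \tag{1}
```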

The conversion unit 20B outputs the frequency spectrum X_2(f, n) indicating the first sound signal to the correlation derivation unit 20D and the generation unit 20H.

The first correlation storage unit 20E stores the first spatial correlation matrix φ_xx(f, n). The first spatial correlation matrix φ_xx is the spatial correlation matrix of the target sound section in the first sound signal. The target sound section is the section of the first sound signal that includes the target sound. A section is a specific period along the time axis.

As described above, in the present embodiment the first sound signal is represented by the frequency spectrum X_2(f, n), which is a four-dimensional vector. Therefore, the first spatial correlation matrix φ_xx(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.

In the initial state, the first correlation storage unit 20E stores the first spatial correlation matrix φ_xx(f, n) initialized to the zero matrix (φ_xx(f, 0) = 0). The first spatial correlation matrix φ_xx(f, n) is updated by the correlation derivation unit 20D described later.

The second correlation storage unit 20F stores the second spatial correlation matrix φ_NN(f, n). The second spatial correlation matrix φ_NN is the spatial correlation matrix of the non-target sound section in the first sound signal. The non-target sound section is the section of the first sound signal other than the target sound section.

Like the first spatial correlation matrix φ_xx(f, n), the second spatial correlation matrix φ_NN(f, n) in the present embodiment is represented by a 4 × 4 complex matrix for each frequency bin.

In the initial state, the second correlation storage unit 20F stores the second spatial correlation matrix φ_NN(f, n) initialized to the zero matrix (φ_NN(f, 0) = 0). The second spatial correlation matrix φ_NN(f, n) is updated by the processing of the correlation derivation unit 20D described later.
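As a rough illustration of the initial state of the two storage units, the sketch below allocates one 4 × 4 complex zero matrix per frequency bin for each spatial correlation matrix. The nested-list layout and the names `zero_correlation_store`, `phi_xx`, and `phi_nn` are assumptions made for the example; the 129 bins follow from the STFT settings given earlier.

```python
NUM_BINS = 129  # frequency bins f = 0..128 (from the STFT settings in the text)
NUM_MICS = 4    # first microphones 14A to 14D


def zero_correlation_store(num_bins=NUM_BINS, num_mics=NUM_MICS):
    """Return one num_mics x num_mics complex zero matrix per frequency bin."""
    return [
        [[0j for _ in range(num_mics)] for _ in range(num_mics)]
        for _ in range(num_bins)
    ]


# Separate stores, mirroring the first and second correlation storage units.
phi_xx = zero_correlation_store()  # target sound sections
phi_nn = zero_correlation_store()  # non-target sound sections
```

Building each row with its own comprehension avoids the aliasing that `[[0j] * 4] * 4` would introduce, so updating one bin never touches another.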

Next, the detection unit 20C, the correlation derivation unit 20D, the coefficient derivation unit 20G, the generation unit 20H, and the inverse conversion unit 20I will be described. In the present embodiment, the sound signal processing unit 20 performs initial processing when sound signal processing starts and then executes steady processing. The correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H execute different processing in the initial processing and in the steady processing.

First, the functions of the correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H in the initial processing will be described.

The initial processing is the processing that the sound signal processing unit 20 executes when sound signal processing starts. In the initial processing, the sound signal processing unit 20 updates the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n), which are stored in the first correlation storage unit 20E and the second correlation storage unit 20F and have been initialized to zero matrices, thereby setting initial values in these spatial correlation matrices.

The coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the first sound signal. The coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) based on the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n).

As described above, in the present embodiment the first sound signal is represented by the frequency spectrum X_2(f, n), which is a four-dimensional vector. Therefore, the coefficient derivation unit 20G calculates the spatial filter coefficient F(f, n), a four-dimensional complex vector, based on the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n). The spatial filter coefficient F(f, n) is represented by the following equation (2).

In the initial processing, however, the coefficient derivation unit 20G derives F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).

The generation unit 20H uses the spatial filter coefficient F(f, n) derived by the coefficient derivation unit 20G to generate an emphasized sound signal in which the target sound signal included in the first sound signal, represented by the frequency spectrum X_2(f, n), is emphasized.

Specifically, the generation unit 20H uses the following equation (3) to generate the emphasized sound signal represented by the output spectrum Y(f, n).

That is, the generation unit 20H generates, as the output spectrum Y(f, n) indicating the emphasized sound signal, the product of the frequency spectrum X_2(f, n) and the Hermitian transpose of the spatial filter coefficient F(f, n). In the initial processing, the generation unit 20H outputs an emphasized sound signal for which Y(f, n) = X_{2,4}(f, n). In other words, in the initial processing the generation unit 20H outputs the frequency spectrum of the third sound signal collected by the first microphone 14D as the emphasized sound signal. The first microphone 14 used for the emphasized sound signal in the initial processing may be any one of the plurality of first microphones 14 (first microphones 14A to 14D) and is not limited to the first microphone 14D.
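The product described here can be sketched for a single frequency bin and frame as follows. This is only an illustration of Y = F^H X_2 under the stated description; the function name and the sample values are made up for the example.

```python
def apply_spatial_filter(filter_coeffs, mic_spectra):
    """Compute Y = F^H X for one frequency bin and one frame.

    filter_coeffs: F(f, n), one complex coefficient per first microphone.
    mic_spectra:   X_2(f, n), one complex spectrum value per first microphone.
    """
    if len(filter_coeffs) != len(mic_spectra):
        raise ValueError("filter and spectrum must have the same dimension")
    # Hermitian transpose of a vector: conjugate each coefficient, then take
    # the inner product with the spectrum vector.
    return sum(c.conjugate() * x for c, x in zip(filter_coeffs, mic_spectra))


# Initial processing: F = [0, 0, 0, 1] passes the fourth microphone (14D) through.
initial_filter = [0j, 0j, 0j, 1 + 0j]
x2 = [1 + 2j, -0.5j, 3 + 0j, 0.25 - 1j]  # made-up values for illustration
y = apply_spatial_filter(initial_filter, x2)
```

With the initial filter, `y` equals `x2[3]`, which matches the statement that the initial emphasized sound signal is simply the spectrum from the first microphone 14D.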

The generation unit 20H outputs the emphasized sound signal represented by the output spectrum Y(f, n) to the inverse conversion unit 20I and the detection unit 20C.

The detection unit 20C detects the target sound section based on the emphasized sound signal. In the present embodiment, the detection unit 20C detects the target sound section based on the second sound signal and the emphasized sound signal.

Specifically, the detection unit 20C detects the target sound section based on the second sound signal represented by the frequency spectrum X_1(f, n) and the emphasized sound signal represented by the output spectrum Y(f, n) received from the generation unit 20H.

The target sound section is represented by a function u_2(n) that indicates, for each frame number, whether or not the target sound source 12A is emitting sound.

u_2(n) = 1 indicates that the target sound source 12A is emitting sound in the n-th frame, and u_2(n) = 0 indicates that the target sound source 12A is not emitting sound in the n-th frame.

Specifically, the function u_2(n) is represented by the following equation (4).

In equation (4), p_Y(n) and p_X(n) are given by the following equations (5) and (6). That is, p_Y(n) depends on the power of the emphasized sound signal represented by the output spectrum Y(f, n), and p_X(n) depends on the power of the second sound signal represented by the frequency spectrum X_1(f, n).

At the initial processing stage, p_Y(n) includes spectral components from the sounds of both the target sound source 12A and the non-target sound source 12B. Therefore, the threshold t_1 in equation (4) is set in advance so that the relationship t_1 < p_Y(n) is satisfied when sound is being emitted from the target sound source 12A or the non-target sound source 12B.

When, of the target sound source 12A and the non-target sound source 12B, the non-target sound source 12B is emitting sound, p_X(n) becomes large relative to p_Y(n). The threshold t_2 in equation (4) is therefore set in advance so that p_X(n) - p_Y(n) ≥ t_2 holds when the non-target sound source 12B is emitting sound.

With these settings, the function u_2(n) takes the value "1" in an n-th frame in which only the target sound source 12A is emitting sound, and the value "0" in an n-th frame in which the target sound source 12A is not emitting sound.

The detection unit 20C therefore detects the section where u_2(n) = 1 as the target sound section and the section where u_2(n) = 0 as the non-target sound section.
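The threshold logic described above can be sketched as follows. Equation (4) is not reproduced in this text, so the rule below is inferred from the roles given to t_1 and t_2 and should be read as an assumption. The function name, the threshold values, and the frame powers (treated here as log powers) are hypothetical.

```python
def detect_target_frame(p_y: float, p_x: float, t1: float, t2: float) -> int:
    """Return u_2(n) for one frame during initial processing.

    p_y: power of the emphasized sound signal Y(f, n) in this frame
    p_x: power of the second sound signal X_1(f, n) in this frame
    Assumed rule: the target source is active when some source is audible
    (p_y > t1) and the non-target source is not active (p_x - p_y < t2).
    """
    some_source_active = p_y > t1
    non_target_active = (p_x - p_y) >= t2
    return 1 if some_source_active and not non_target_active else 0


# Hypothetical thresholds and frame powers, chosen only for illustration.
T1, T2 = -40.0, 6.0
frames = [(-60.0, -60.0),   # silence
          (-20.0, -25.0),   # only the target source is active
          (-30.0, -10.0)]   # the non-target source is active
u2 = [detect_target_frame(p_y, p_x, T1, T2) for p_y, p_x in frames]
print(u2)  # prints [0, 1, 0]
```

In practice the two thresholds would have to be tuned to the microphone arrangement, since they depend on how much louder the non-target source is at the second microphone 16.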

The correlation derivation unit 20D derives the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) based on the target sound section detected by the detection unit 20C and the first sound signal received from the first microphones 14 via the AD conversion unit 18 and the conversion unit 20B. The correlation derivation unit 20D then updates the first spatial correlation matrix φ_xx(f, n) by storing the derived matrix in the first correlation storage unit 20E. Similarly, it updates the second spatial correlation matrix φ_NN(f, n) by storing the derived matrix in the second correlation storage unit 20F.

Specifically, for a section (n-th frame) where u_2(n) = 1, the correlation derivation unit 20D derives and updates the first spatial correlation matrix φ_xx(f, n) using equation (7) below, and does not update the second spatial correlation matrix φ_NN(f, n).

For a section (n-th frame) where u_2(n) = 0, on the other hand, the correlation derivation unit 20D derives and updates the second spatial correlation matrix φ_NN(f, n) using equation (8) below, and does not update the first spatial correlation matrix φ_xx(f, n).

In equations (7) and (8), α is a value of 0 or more and less than 1. The closer α is to 1, the greater the weight given to the previously derived first spatial correlation matrix φ_xx(f, n) relative to the latest first spatial correlation matrix φ_xx(f, n). The value of α may be set in advance, for example to 0.95.

That is, for the first sound signal in the target sound section, the correlation derivation unit 20D derives a new first spatial correlation matrix φ_xx(f, n) by correcting the previously derived first spatial correlation matrix φ_xx(f, n) with the latest first spatial correlation matrix φ_xx(f, n), which is given by the product of the first sound signal and its Hermitian transpose. Here, the first sound signal in the target sound section means the portion of the first sound signal that falls within the target sound section.

The correlation derivation unit 20D may use the first spatial correlation matrix φ_xx(f, n) already stored in the first correlation storage unit 20E as the previously derived first spatial correlation matrix φ_xx(f, n). The first correlation storage unit 20E stores only one first spatial correlation matrix φ_xx(f, n), which is updated successively by the correlation derivation unit 20D.

Likewise, for the first sound signal in the non-target sound section, the correlation derivation unit 20D derives a new second spatial correlation matrix φ_NN(f, n) by correcting the previously derived second spatial correlation matrix φ_NN(f, n) with the latest second spatial correlation matrix φ_NN(f, n), which is given by the product of the first sound signal and its Hermitian transpose. Here, the first sound signal in the non-target sound section means the portion of the first sound signal that falls within the non-target sound section.

The correlation derivation unit 20D may use the second spatial correlation matrix φ_NN(f, n) already stored in the second correlation storage unit 20F as the previously derived second spatial correlation matrix φ_NN(f, n). The second correlation storage unit 20F stores only one second spatial correlation matrix φ_NN(f, n), which is updated successively by the correlation derivation unit 20D.
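The recursive update described above can be sketched for a single frequency bin as follows. Equations (7) and (8) are not reproduced in this text, so the exponential-smoothing form below, including the (1 - α) weight on the latest term, is an assumption consistent with the description of α. The function name and the example input vector are hypothetical, and four channels are assumed to match the length of the initial filter [0, 0, 0, 1].

```python
import numpy as np


def update_correlations(phi_xx, phi_nn, x, u2, alpha=0.95):
    """Update one frequency bin's spatial correlation matrices for one frame.

    x  : complex vector of shape (M,), the first sound signal X_2(f, n)
    u2 : 1 for a target sound frame, 0 for a non-target sound frame
    Assumed form of equations (7)/(8): exponential smoothing of x x^H.
    """
    instant = np.outer(x, np.conj(x))           # x x^H, the latest estimate
    if u2 == 1:
        phi_xx = alpha * phi_xx + (1.0 - alpha) * instant   # phi_NN untouched
    else:
        phi_nn = alpha * phi_nn + (1.0 - alpha) * instant   # phi_xx untouched
    return phi_xx, phi_nn


M = 4                                            # assumed microphone count
phi_xx = np.zeros((M, M), dtype=complex)         # zero-initialized matrices
phi_nn = np.zeros((M, M), dtype=complex)
x = np.array([1.0, 1.0j, -1.0, -1.0j])           # hypothetical frame data
phi_xx, phi_nn = update_correlations(phi_xx, phi_nn, x, u2=1)
```

Because only one matrix per storage unit is kept, the function returns the updated pair and the caller overwrites the stored values.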

Next, the functions of the correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H in steady-state processing are described. Steady-state processing is the processing executed after the initial processing described above.

The sound signal processing unit 20 may shift to steady-state processing after executing the initial processing for a predetermined time, or when the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) have been updated a predetermined number of times.

First, the function of the coefficient derivation unit 20G in steady-state processing is described. In the initial processing, the coefficient derivation unit 20G derived F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).

In steady-state processing, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal contained in the first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized.

As described above, the first sound signal consists of the plurality of third sound signals acquired from the plurality of first microphones 14. The coefficient derivation unit 20G therefore derives the spatial filter coefficient F(f, n) based on an emphasized sound signal in which the target sound signal contained in that first sound signal is emphasized.

Specifically, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) based on the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) updated by the correlation derivation unit 20D.

The coefficient derivation unit 20G may read the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) from the first correlation storage unit 20E and the second correlation storage unit 20F and use them to derive the spatial filter coefficient F(f, n).

At the steady-state processing stage, the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) stored in the first correlation storage unit 20E and the second correlation storage unit 20F have already been updated by the correlation derivation unit 20D. That is, they are spatial correlation matrices that the correlation derivation unit 20D updated using the target sound section detected from the emphasized sound signal. The coefficient derivation unit 20G thus derives the spatial filter coefficient F(f, n) based on the emphasized sound signal.

Specifically, the coefficient derivation unit 20G derives the eigenvector F_SNR(f, n) corresponding to the largest eigenvalue of the matrix given by the product of the first spatial correlation matrix φ_xx(f, n) and the inverse of the second spatial correlation matrix φ_NN(f, n). It then uses this eigenvector as the spatial filter coefficient, that is, F(f, n) = F_SNR(f, n).

The eigenvector F_SNR(f, n) constitutes a MAX-SNR (Maximum Signal-to-Noise Ratio) beamformer, which maximizes the power ratio of the target sound to the non-target sound.
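A minimal sketch of the eigenvector computation for one frequency bin is given below. It assumes the product is formed as inv(φ_NN) φ_xx, the usual form for a MAX-SNR beamformer, and adds a small diagonal loading term so that φ_NN is invertible. The loading term, the function name, and the toy data are assumptions and do not come from the text.

```python
import numpy as np


def max_snr_filter(phi_xx, phi_nn, loading=1e-6):
    """Return the unit-norm principal eigenvector of inv(phi_nn) @ phi_xx.

    loading: small diagonal term added to phi_nn so that it is invertible
             (an assumption; the text does not discuss regularization).
    """
    m = phi_nn.shape[0]
    regularized = phi_nn + loading * np.eye(m)
    product = np.linalg.solve(regularized, phi_xx)    # inv(phi_nn) @ phi_xx
    eigenvalues, eigenvectors = np.linalg.eig(product)
    principal = eigenvectors[:, np.argmax(eigenvalues.real)]
    return principal / np.linalg.norm(principal)


# Toy check: a target arriving along a steering vector a, in white noise.
a = np.array([1.0, 1.0j, -1.0, -1.0j]) / 2.0
phi_xx = np.outer(a, a.conj()) + 0.01 * np.eye(4)
phi_nn = 0.01 * np.eye(4)
f_snr = max_snr_filter(phi_xx, phi_nn)
alignment = abs(np.vdot(a, f_snr))    # approaches 1 when f_snr is aligned with a
```

An eigenvector is defined only up to a complex scale factor, which is one reason a per-bin gain correction such as a post filter is often applied afterwards.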

The coefficient derivation unit 20G may also add a post filter w(f, n), which improves sound quality by adjusting the power of each frequency bin, and derive the spatial filter coefficient F(f, n) using equation (9) below.

The post filter w(f, n) is expressed by equation (10) below.

Next, the generation unit 20H is described. In steady-state processing, as in the initial processing, the generation unit 20H uses the spatial filter coefficient F(f, n) derived by the coefficient derivation unit 20G to generate an emphasized sound signal in which the target sound signal contained in the first sound signal, represented by the frequency spectrum X_2(f, n), is emphasized. That is, the generation unit 20H generates the emphasized sound signal represented by the output spectrum Y(f, n) using equation (3) above.

The generation unit 20H outputs the generated emphasized sound signal to the inverse conversion unit 20I and the detection unit 20C.

The inverse conversion unit 20I applies an inverse short-time Fourier transform (ISTFT: Inverse Short-Time Fourier Transform) to the emphasized sound signal received from the generation unit 20H and outputs the result to the output unit 22.

That is, the inverse conversion unit 20I converts the emphasized sound signal, in which the target sound signal of the target sound emitted from the target sound source 12A is emphasized and the non-target sound signal is suppressed, into a time-domain sound waveform.

Specifically, the inverse conversion unit 20I uses the symmetry of the output spectrum Y(f, n) representing the emphasized sound signal to generate a 256-point spectrum from Y(f, n), and applies an inverse Fourier transform. It then applies a synthesis window function and overlap-adds the result onto the output waveform of the previous frame, offset by the frame shift, to generate the sound waveform.
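The inverse transform and overlap-add steps above can be sketched as follows. The 256-point length comes from the text. The frame shift of 128 samples, the square-root Hann synthesis window, and the one-sided input of 129 bins per frame are assumptions made for illustration, and the function and constant names are hypothetical.

```python
import numpy as np

N_FFT = 256          # spectrum length stated in the text
HOP = 128            # frame shift: assumed value
# Periodic square-root Hann synthesis window (assumed window choice).
SYNTH_WINDOW = np.sqrt(np.hanning(N_FFT + 1)[:-1])


def overlap_add(spectra):
    """Convert one-sided spectra Y(f, n), shape (frames, N_FFT // 2 + 1), to a waveform."""
    spectra = np.asarray(spectra)
    n_frames = spectra.shape[0]
    out = np.zeros(HOP * (n_frames - 1) + N_FFT)
    for n in range(n_frames):
        # irfft rebuilds the 256-point conjugate-symmetric spectrum internally
        # and returns the real time-domain frame.
        frame = np.fft.irfft(spectra[n], n=N_FFT)
        start = n * HOP
        # Apply the synthesis window, then add onto the previous frames' output.
        out[start:start + N_FFT] += SYNTH_WINDOW * frame
    return out
```

With a matching square-root Hann analysis window and a 50% frame shift, the overlapping windows sum to one, so an unmodified spectrum is reconstructed exactly away from the first and last half frames.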

Next, the detection unit 20C is described. In the initial processing, the detection unit 20C detected the target sound section.

In steady-state processing, the detection unit 20C detects the target sound section and an overlapping section based on the emphasized sound signal and the second sound signal. The overlapping section is a section in which sound is emitted from both the target sound source 12A and the non-target sound source 12B, for example a section in which a plurality of speakers are speaking at once.

Specifically, the detection unit 20C obtains a function u_1(n) in addition to the function u_2(n).

The function u_1(n) represents the second non-target sound section, which is the section in which the non-target sound source 12B is emitting sound. Specifically, u_1(n) indicates, for each frame number, whether or not the non-target sound source 12B is emitting sound.

At the steady-state processing stage, the power of the non-target sound emitted from the non-target sound source 12B is suppressed in the emphasized sound signal represented by the output spectrum Y(f, n). Consequently, p_Y(n) given by equation (5) above can be regarded approximately as the power of the target sound emitted from the target sound source 12A. At this stage, the second non-target sound section represented by u_1(n) and the target sound section represented by u_2(n) are therefore expressed by equations (11) and (12) below.

Here, u_2(n) = 1 indicates that the target sound source 12A is emitting sound in the n-th frame, and u_2(n) = 0 indicates that it is not. Likewise, u_1(n) = 1 indicates that the non-target sound source 12B is emitting sound in the n-th frame, and u_1(n) = 0 indicates that it is not.

Accordingly, the thresholds t_3 and t_4 in equations (11) and (12) may be set in advance so that u_1(n) and u_2(n) satisfy the above conditions.

The detection unit 20C detects a section where u_2(n) = 1 and u_1(n) = 0 as the target sound section, and a section where u_2(n) = 0 as the non-target sound section. It detects a section where u_2(n) = 1 and u_1(n) = 1 as the overlapping section, in which sound is emitted from both the target sound source 12A and the non-target sound source 12B. The detection unit 20C then outputs the detection result to the correlation derivation unit 20D. In the present embodiment, it outputs u_1(n) and u_2(n) as the detection result.

The correlation derivation unit 20D derives the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) based on the target sound section and the overlapping section detected by the detection unit 20C and on the first sound signal.

The correlation derivation unit 20D treats a section where u_2(n) = 1 and u_1(n) = 0 as the target sound section, and for that section derives and updates the first spatial correlation matrix φ_xx(f, n) using equation (13) below. For this target sound section, the correlation derivation unit 20D does not derive or update the second spatial correlation matrix φ_NN(f, n).

The correlation derivation unit 20D treats a section where u_2(n) = 0 as the non-target sound section, and for that section derives and updates the second spatial correlation matrix φ_NN(f, n) using equation (14) below. For a section where u_2(n) = 0, the correlation derivation unit 20D does not derive or update the first spatial correlation matrix φ_xx(f, n).

For a section where u_2(n) = 1 and u_1(n) = 1, the correlation derivation unit 20D derives and updates neither the first spatial correlation matrix φ_xx(f, n) nor the second spatial correlation matrix φ_NN(f, n). As described above, such a section is an overlapping section in which sound is emitted from both the target sound source 12A and the non-target sound source 12B.

That is, in steady-state processing, the correlation derivation unit 20D leaves both the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) unchanged during an overlapping section.

In this way, neither the first spatial correlation matrix φ_xx(f, n) nor the second spatial correlation matrix φ_NN(f, n) is updated for an overlapping section in which the target sound source 12A and the non-target sound source 12B emit sound at the same time. This configuration suppresses the loss of emphasis accuracy for the target sound signal that would result from using such overlapping sections.
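The steady-state update policy above reduces to a small decision table on (u_1(n), u_2(n)). The sketch below encodes that table. The function names and the string labels are hypothetical and are chosen only for illustration.

```python
def classify_frame(u1: int, u2: int) -> str:
    """Map (u_1(n), u_2(n)) to the section type used in steady-state processing."""
    if u2 == 0:
        return "non_target"        # target source silent: update phi_NN only
    if u1 == 0:
        return "target"            # only the target source active: update phi_xx only
    return "overlap"               # both sources active: update neither matrix


def matrices_to_update(u1: int, u2: int) -> tuple:
    """Return the names of the spatial correlation matrices to update for this frame."""
    section = classify_frame(u1, u2)
    return {"target": ("phi_xx",), "non_target": ("phi_nn",), "overlap": ()}[section]


print(matrices_to_update(u1=1, u2=1))  # prints ()
```

Note that a frame with u_2(n) = 0 is treated as a non-target sound section regardless of u_1(n), as stated in the text.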

Next, the procedure of the sound signal processing executed by the sound signal processing device 10 of the present embodiment is described.

FIG. 3 is a flowchart showing an example of the procedure of the sound signal processing executed by the sound signal processing device 10 of the present embodiment.

The conversion unit 20B applies a short-time Fourier transform to the third sound signals received from the plurality of first microphones 14 and acquires the first sound signal represented by the frequency spectrum X_2(f, n) (step S100). The conversion unit 20B outputs the acquired first sound signal to the correlation derivation unit 20D and the generation unit 20H (step S102).

Next, the conversion unit 20A applies a short-time Fourier transform to the second sound signal received from the second microphone 16 and acquires the second sound signal represented by the frequency spectrum X_1(f, n) (step S104). The conversion unit 20A outputs the acquired second sound signal to the detection unit 20C (step S106).

The processing of steps S100 to S106 may be executed in parallel by the conversion unit 20A and the conversion unit 20B, and is not limited to the order shown in FIG. 3. The processing of steps S100 to S106 is repeated continuously until the sound signal processing ends.

The sound signal processing device 10 then executes the initial processing (steps S108 to S120).

Specifically, the coefficient derivation unit 20G first reads the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) from the first correlation storage unit 20E and the second correlation storage unit 20F (step S108). As described above, in the initial state the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) are initialized to zero matrices.

Next, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) using the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) read in step S108 (step S110). As described above, in the initial state the coefficient derivation unit 20G derives F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).

Next, the generation unit 20H uses the spatial filter coefficient F(f, n) derived in step S110 to generate an emphasized sound signal in which the target sound signal contained in the first sound signal, represented by the frequency spectrum X_2(f, n) acquired in step S100, is emphasized (step S112).

Next, the inverse conversion unit 20I applies an inverse short-time Fourier transform to the emphasized sound signal represented by the output spectrum Y(f, n) generated in step S112, and outputs the result to the output unit 22 (step S114).

Next, the detection unit 20C detects the target sound section represented by the function u_2(n), using the emphasized sound signal generated in step S112 and the second sound signal (step S116).

Next, the correlation derivation unit 20D derives the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) using the target sound section detected in step S116 and the first sound signal (step S118).

Next, the correlation derivation unit 20D updates these spatial correlation matrices by storing the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) derived in step S118 in the first correlation storage unit 20E and the second correlation storage unit 20F, respectively (step S120).

Next, the sound signal processing unit 20 determines whether or not to shift from the initial processing to steady-state processing (step S122). For example, the sound signal processing unit 20 makes this determination by checking whether the initial processing has been executed for a predetermined time. Alternatively, it may make the determination by checking whether the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) have been updated a predetermined number of times.

If the determination in step S122 is negative (step S122: No), the processing returns to step S108. If the determination in step S122 is affirmative (step S122: Yes), the sound signal processing unit 20 executes steady-state processing (steps S124 to S138).
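The two alternative criteria for step S122 can be sketched as a single predicate. The function name, the parameter names, and the reading that each of the two matrices must reach the update count are assumptions; the text only states that a predetermined time or a predetermined number of updates is used.

```python
from typing import Optional


def ready_for_steady_state(elapsed_s: float,
                           updates_xx: int,
                           updates_nn: int,
                           min_elapsed_s: Optional[float] = None,
                           min_updates: Optional[int] = None) -> bool:
    """Step S122 sketch: decide whether to leave the initial processing.

    One of the two criteria is expected to be configured:
    - min_elapsed_s: the initial processing has run for a predetermined time
    - min_updates:   both matrices have been updated a predetermined number of times
    If both are given, the time criterion takes precedence in this sketch.
    """
    if min_elapsed_s is not None:
        return elapsed_s >= min_elapsed_s
    if min_updates is not None:
        return updates_xx >= min_updates and updates_nn >= min_updates
    raise ValueError("configure min_elapsed_s or min_updates")
```

A count-based criterion has the advantage that steady-state processing does not start before both the target and the non-target statistics have actually been observed.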

In steady-state processing, the coefficient derivation unit 20G reads the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) from the first correlation storage unit 20E and the second correlation storage unit 20F (step S124). That is, the coefficient derivation unit 20G reads the latest first spatial correlation matrix φ_xx(f, n) and second spatial correlation matrix φ_NN(f, n) as updated by the correlation derivation unit 20D.

Next, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) based on the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) read in step S124 (step S126).

Next, the generation unit 20H uses the spatial filter coefficient F(f, n) derived in step S126 to emphasize the target sound signal contained in the first sound signal received from the conversion unit 20B, and generates the emphasized sound signal (step S128).

Next, the inverse conversion unit 20I applies an inverse short-time Fourier transform to the emphasized sound signal generated in step S128 and outputs the result to the output unit 22 (step S130).

Next, the detection unit 20C detects the target sound section and the overlapping section using the second sound signal and the emphasized sound signal generated in step S128 (step S132).

Next, the correlation derivation unit 20D derives the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) based on the target sound section and the overlapping section detected in step S132 and on the first sound signal received from the first microphones 14 via the AD conversion unit 18 and the conversion unit 20B (step S134). The correlation derivation unit 20D then updates these spatial correlation matrices by storing the derived first spatial correlation matrix φ_xx(f, n) and second spatial correlation matrix φ_NN(f, n) in the first correlation storage unit 20E and the second correlation storage unit 20F, respectively (step S136).

Next, the sound signal processing unit 20 determines whether or not to end the sound signal processing (step S138). If the determination in step S138 is negative (step S138: No), the processing returns to step S124. If the determination in step S138 is affirmative (step S138: Yes), this routine ends.

As described above, the sound signal processing device 10 of the present embodiment includes the coefficient derivation unit 20G. The coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal contained in the first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized. Generating the emphasized sound signal with the spatial filter coefficient F(f, n) derived in this way makes it possible to emphasize the target sound signal with high accuracy.

Conventionally, the emphasis accuracy for the target sound could decrease when a plurality of speakers spoke at the same time. For example, a conventional method is known that uses a vector representing the speaker direction or the arrival time difference between microphones as a feature of the sound signal and, based on that feature, generates a filter for emphasizing the target sound signal contained in the sound signal.

With such a conventional method, however, when a plurality of speakers speak at the same time, the resulting feature distribution differs from the features of each individual speaker, so the accuracy of the filter for emphasizing the target sound signal may decrease. Even when the speakers take turns, sections of simultaneous speech arise, for example from backchannel responses, so the accuracy of the filter may likewise decrease.

一方、本実施の形態の音信号処理装置１０では、目的音信号を強調した強調音信号に基づいて、第１音信号に含まれる目的音信号を強調するための空間フィルタ係数Ｆ（ｆ，ｎ）を導出する。このため、導出した空間フィルタ係数Ｆ（ｆ，ｎ）を第１音信号に適用することで、目的音信号を強調した強調音信号を生成することによって、高精度に目的音信号を強調することができる。 On the other hand, in the sound signal processing device 10 of the present embodiment, the spatial filter coefficient F (f, n) for emphasizing the target sound signal included in the first sound signal is based on the emphasized sound signal emphasizing the target sound signal. ) Is derived. Therefore, by applying the derived spatial filter coefficient F (f, n) to the first sound signal, the target sound signal is emphasized with high accuracy by generating an emphasized sound signal that emphasizes the target sound signal. Can be done.

従って、音信号処理装置１０は、高精度に目的音信号を強調することができる。 Therefore, the sound signal processing device 10 can emphasize the target sound signal with high accuracy.

Further, in the sound signal processing device 10 of the present embodiment, the detection unit 20C detects the target sound section based on the emphasized sound signal and on a second sound signal in which the ratio of the power of the non-target sound signal to that of the target sound signal is larger than in the first sound signal. The detection unit 20C can therefore detect the target sound section with high accuracy. The coefficient derivation unit 20G then derives the spatial filter coefficient F(f, n) based on the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n), which are derived from the accurately detected target sound section and the first sound signal.

Therefore, the sound signal processing device 10 can emphasize the target sound signal with even higher accuracy.

Further, in the present embodiment, the detection unit 20C detects the target sound section based on the second sound signal and the emphasized sound signal. The sound signal processing device 10 of the present embodiment can therefore derive the spatial filter coefficient F(f, n) so as to suppress the non-target sound of the non-target sound source 12B and emphasize the target sound signal of the target sound source 12A, regardless of the positions of the target sound source 12A and the non-target sound source 12B. As a result, the sound signal processing device 10 can derive a spatial filter coefficient F(f, n) that emphasizes the target sound signal included in the first sound signal with higher accuracy.

Further, in the present embodiment, the detection unit 20C detects, based on the emphasized sound signal, the target sound section and an overlapping section in which the target sound and the non-target sound overlap. The correlation derivation unit 20D then derives the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) based on the target sound section, the overlapping section, and the first sound signal.

For the overlapping section, the correlation derivation unit 20D neither derives nor updates the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n). Consequently, the coefficient derivation unit 20G does not derive the spatial filter coefficient F(f, n) for the overlapping section. The sound signal processing device 10 of the present embodiment can therefore emphasize the target sound signal with high accuracy even in sections of the first sound signal in which sounds are emitted simultaneously from a plurality of sound sources 12.

<Modification Example 1>
In the above description, the detection unit 20C detects the target sound section and the overlapping section based on the power of the emphasized sound signal represented by the output spectrum Y(f, n) and of the second sound signal represented by the frequency spectrum X_1(f, n).

However, the detection unit 20C may detect the target sound section and the overlapping section by another method using the output spectrum Y(f, n) and the frequency spectrum X_1(f, n).

For example, a model that takes the output spectrum Y(f, n) and the frequency spectrum X_1(f, n) as inputs and estimates the function u_1(n) and the function u_2(n) may be trained using a decision tree, the k-nearest neighbor method, a support vector machine, a neural network, or the like.

As an example, training of a model using a neural network will be described.

In this case, the detection unit 20C collects training data for training the model. For example, the sound signal processing unit 20 of the present embodiment is implemented in a learning device and the above processing is executed with it, so that a plurality of training data items are recorded, each including a frequency spectrum X_1(f, n) and the output spectrum Y(f, n) derived from that frequency spectrum X_1(f, n). At the same time, the target sound of the target sound source 12A is collected and recorded by the first microphone 14D. The sound source 12 emitting sound in each frame is then determined, for example by a user listening to the target sound or visually inspecting its waveform, and the correct answer data c_1(n) and c_2(n) for the function u_1(n) and the function u_2(n) are created.

Further, the detection unit 20C uses the vector v(n) represented by the following equation (15) as the input feature amount.

The vector v(n) represented by equation (15) is a 516-dimensional vector obtained by concatenating the logarithms of the absolute values of the spectra of the current frame and the immediately preceding frame. Detection of the target sound section and the overlapping section can therefore be formulated as estimating, from the vector v(n), the two-dimensional vector c(n) = [c_1(n), c_2(n)] representing the correct answer data.
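
The construction of v(n) can be sketched as follows. This is only an illustration under stated assumptions: equation (15) is not reproduced in this text, so the number of frequency bins (129, chosen because 2 spectra × 2 frames × 129 bins = 516), the ordering of the concatenated parts, and the small constant `eps` are all hypothetical.

```python
import numpy as np

N_BINS = 129  # assumed number of frequency bins: 2 spectra * 2 frames * 129 = 516


def feature_vector(Y, X1, n, eps=1e-10):
    """Build v(n) from complex spectra Y and X1 of shape (N_BINS, n_frames).

    The part ordering and eps (which avoids log(0)) are assumptions.
    """
    if n < 1:
        raise ValueError("frame n needs an immediately preceding frame")
    parts = [Y[:, n], X1[:, n], Y[:, n - 1], X1[:, n - 1]]
    # Logarithm of the absolute value of each spectrum, concatenated.
    return np.concatenate([np.log(np.abs(p) + eps) for p in parts])


# Random spectra stand in for real STFT output.
rng = np.random.default_rng(0)
Y = rng.standard_normal((N_BINS, 5)) + 1j * rng.standard_normal((N_BINS, 5))
X1 = rng.standard_normal((N_BINS, 5)) + 1j * rng.standard_normal((N_BINS, 5))
v = feature_vector(Y, X1, n=1)
```

With these assumed sizes, `v` has 516 real-valued elements, matching the dimension stated above.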

Here, the configuration of the neural network model is defined by the following equations (16) to (20).

When the number of nodes in each intermediate layer is 100, the sizes of the matrix W_i and the matrix W_o are 100 × 516 and 2 × 100, respectively. Likewise, the matrices W_1, W_2, and W_3 are each 100 × 100.

The function sigmoid() in equations (16) to (20) represents an operation that applies the sigmoid function represented by the following equation (21) to each element of a vector.
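
A forward pass consistent with the matrix sizes above might look like the following sketch. Equations (16) to (21) are not reproduced in this text, so the layer ordering (W_i, then W_1 to W_3, then W_o), the absence of bias terms, the use of the standard logistic function for sigmoid(), and the random initialization are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, n_in=516, n_hidden=100, n_out=2, scale=0.1):
    """Random parameters with the matrix sizes given in the text."""
    return {
        "W_i": scale * rng.standard_normal((n_hidden, n_in)),      # 100 x 516
        "W_1": scale * rng.standard_normal((n_hidden, n_hidden)),  # 100 x 100
        "W_2": scale * rng.standard_normal((n_hidden, n_hidden)),  # 100 x 100
        "W_3": scale * rng.standard_normal((n_hidden, n_hidden)),  # 100 x 100
        "W_o": scale * rng.standard_normal((n_out, n_hidden)),     # 2 x 100
    }


def forward(params, v):
    """Map a feature vector v(n) to estimates [u_1(n), u_2(n)]."""
    h = sigmoid(params["W_i"] @ v)
    for name in ("W_1", "W_2", "W_3"):
        h = sigmoid(params[name] @ h)
    return sigmoid(params["W_o"] @ h)


rng = np.random.default_rng(0)
params = init_params(rng)
u = forward(params, rng.standard_normal(516))
```

Because the output layer also passes through the sigmoid, both elements of `u` lie strictly between 0 and 1.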

The objective function L is defined as the cross entropy represented by the following equation (22).

The detection unit 20C then obtains, by training, the parameters W_i, W_o, W_1, W_2, and W_3 that minimize the objective function L.

An existing method such as stochastic gradient descent may be used for training. The function u_1(n) and the function u_2(n) derived with this model take continuous values between 0 and 1. They may therefore be binarized, for example with a threshold of 0.5, to 1 (target sound section) for values at or above the threshold and 0 (non-target sound section) for values below it.
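
The binarization step can be illustrated with a short sketch. The function name `binarize` and the sample values are hypothetical. The threshold of 0.5 and the "at or above" rule follow the description above.

```python
import numpy as np


def binarize(u, threshold=0.5):
    """Return 1 where u >= threshold and 0 otherwise."""
    return (np.asarray(u) >= threshold).astype(int)


# Each row holds the model outputs [u_1(n), u_2(n)] for one frame.
u = np.array([[0.91, 0.12],
              [0.50, 0.49]])
labels = binarize(u)  # [[1, 0], [1, 0]]
```

A value exactly equal to the threshold is assigned 1.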

In this way, the detection unit 20C may detect the target sound section and the overlapping section using the output spectrum Y(f, n) and the frequency spectrum X_1(f, n) by a method different from that of the first embodiment.

(Second Embodiment)
In the present embodiment, a mode will be described in which sound signal processing is performed using the first sound signal acquired from the first microphones 14, without using the second sound signal acquired from the second microphone 16.

FIG. 4 is a schematic diagram showing an example of the sound signal processing system 2 of the present embodiment.

The sound signal processing system 2 includes a sound signal processing device 11 and a plurality of first microphones 14. The sound signal processing device 11 and the plurality of first microphones 14 are connected so as to be able to exchange data and signals.

That is, the sound signal processing system 2 is the same as the sound signal processing system 1 of the first embodiment, except that it includes the sound signal processing device 11 in place of the sound signal processing device 10 and does not include the second microphone 16.

In the present embodiment, the sound signal processing system 2 assumes a plurality of target sound sources 12A as the sound sources 12. FIG. 4 shows, as an example of the plurality of target sound sources 12A, three speakers: target sound sources 12A1 to 12A3. Each target sound source 12A is, for example, a person (speaker). The present embodiment assumes an environment in which one speaker (target sound source 12A1, 12A2, or 12A3) sits at each of three sides of a rectangular table T and the speakers converse. The present embodiment also assumes that the positions of these target sound sources 12A do not move significantly during sound signal processing by the sound signal processing device 11. The number of target sound sources 12A is not limited to three, and may be two, or four or more.

As in the first embodiment, the sound signal processing system 2 includes a plurality of first microphones 14. In the present embodiment, four first microphones 14, namely first microphones 14A to 14D, are shown as an example.

As in the first embodiment, the plurality of first microphones 14 differ from one another in the sound arrival time differences from each of the plurality of target sound sources 12A. That is, the positions of the plurality of first microphones 14 are adjusted in advance so that these sound arrival time differences differ from one another.

The number of first microphones 14 provided in the sound signal processing system 2 need only be equal to or greater than the number of sound sources 12 of the present embodiment. In the present embodiment, therefore, the number of first microphones 14 need only be three or more. The larger the number of first microphones 14, the more the accuracy of emphasizing the target sound can be improved.

As an example, a mode in which the sound signal processing system 2 includes four first microphones 14 (first microphones 14A to 14D) will be described.

As in the first embodiment, a third sound signal is output from each of the plurality of first microphones 14, so that a plurality of third sound signals are output to the sound signal processing device 11. As in the first embodiment, the sound signal obtained by combining the plurality of third sound signals into one is referred to as the first sound signal.

The sound signal processing device 11 includes an AD conversion unit 18, a sound signal processing unit 30, and an output unit 22. The AD conversion unit 18 and the output unit 22 are the same as in the first embodiment. The sound signal processing device 11 is the same as the sound signal processing device 10 of the first embodiment, except that it includes the sound signal processing unit 30 in place of the sound signal processing unit 20. The sound signal processing device 11 need only include at least the sound signal processing unit 30, and at least one of the AD conversion unit 18 and the output unit 22 may be configured as a separate body.

The sound signal processing unit 30 receives the plurality of third sound signals via the AD conversion unit 18. The sound signal processing unit 30 emphasizes the target sound signal included in the first sound signal, which combines the received third sound signals into one, and outputs the emphasized sound signal to the output unit 22.

The sound signal processing unit 30 will now be described in detail.

FIG. 5 is a schematic diagram showing an example of the functional configuration of the sound signal processing unit 30.

The sound signal processing unit 30 includes a conversion unit 30B, a separation unit 30J, a detection unit 30C, a correlation derivation unit 30D, a plurality of third correlation storage units 30E, a fourth correlation storage unit 30F, a plurality of addition units 30K, a plurality of coefficient derivation units 30G, a plurality of generation units 30H, and a plurality of inverse conversion units 30I.

The conversion unit 30B, the separation unit 30J, the detection unit 30C, the correlation derivation unit 30D, the plurality of coefficient derivation units 30G, the plurality of addition units 30K, the plurality of generation units 30H, and the plurality of inverse conversion units 30I are realized by, for example, one or more processors. For example, each of the above units may be realized by causing a processor such as a CPU to execute a program, that is, by software. Each of the above units may be realized by a processor such as a dedicated IC, that is, by hardware. Each of the above units may also be realized by a combination of software and hardware. When a plurality of processors are used, each processor may realize one of the units or two or more of the units.

The third correlation storage units 30E and the fourth correlation storage unit 30F store various kinds of information. They can be configured by any commonly used storage medium such as an HDD, an optical disc, a memory card, or a RAM. The third correlation storage units 30E and the fourth correlation storage unit 30F may be physically different storage media, or may be realized as different storage areas of the same physical storage medium. Furthermore, each of them may be realized by a plurality of physically different storage media.

The sound signal processing unit 30 is provided with a third correlation storage unit 30E, a coefficient derivation unit 30G, an addition unit 30K, a generation unit 30H, and an inverse conversion unit 30I for each of the plurality of target sound sources 12A. As described above, the present embodiment assumes three target sound sources 12A (target sound sources 12A1 to 12A3).

Accordingly, in the present embodiment, the sound signal processing unit 30 is provided with three third correlation storage units 30E (30E1 to 30E3), three coefficient derivation units 30G (30G1 to 30G3), three addition units 30K (30K1 to 30K3), three generation units 30H (30H1 to 30H3), and three inverse conversion units 30I (30I1 to 30I3).

The number of target sound sources 12A assumed by the sound signal processing system 2 is not limited to three. For example, it may be one, two, or four or more. The sound signal processing unit 30 then need only include as many third correlation storage units 30E, coefficient derivation units 30G, addition units 30K, generation units 30H, and inverse conversion units 30I as there are target sound sources 12A.

Like the conversion unit 20B of the first embodiment, the conversion unit 30B applies a short-time Fourier transform (STFT) to each of the plurality of third sound signals received from the plurality of first microphones 14 (first microphones 14A to 14D) via the AD conversion unit 18, and generates a plurality of third sound signals represented by the frequency spectra X_1(f, n), X_2(f, n), X_3(f, n), and X_4(f, n), respectively.

The frequency spectrum X_1(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14A. The frequency spectrum X_2(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14B. The frequency spectrum X_3(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14C. The frequency spectrum X_4(f, n) is the short-time Fourier transform of the third sound signal received from the first microphone 14D.

In the present embodiment, the multidimensional vector (a four-dimensional vector in the present embodiment) that collects the frequency spectra of the plurality of third sound signals is referred to as the frequency spectrum X(f, n) of the first sound signal. In other words, in the present embodiment, the first sound signal is represented by the frequency spectrum X(f, n), which is given by the following equation (23).
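
Equation (23) itself is not reproduced in this text. From the description, it presumably stacks the four per-microphone spectra into one vector. The column-vector (transpose) convention below is an assumption.

```latex
\[
X(f, n) = \bigl[\, X_1(f, n),\; X_2(f, n),\; X_3(f, n),\; X_4(f, n) \,\bigr]^{\mathsf{T}}
\]
```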

The conversion unit 30B outputs the frequency spectrum X(f, n) of the first sound signal to the separation unit 30J and to each of the plurality of generation units 30H (30H1 to 30H3).

Each third correlation storage unit 30E stores a third spatial correlation matrix. The third spatial correlation matrix is the spatial correlation matrix of a target sound component in the first sound signal.

As described above, the sound signal processing unit 30 includes three third correlation storage units 30E (30E1 to 30E3), one for each of the plurality of target sound sources 12A.

The third correlation storage unit 30E1 is the third correlation storage unit 30E corresponding to the target sound source 12A1. It stores the third spatial correlation matrix φ_xxa(f, n), which is the spatial correlation matrix of the target sound component of the target sound source 12A1 in the first sound signal. The target sound component of the target sound source 12A1 is the component (that is, the spectrum) of the target sound emitted from the target sound source 12A1 that is included in the first sound signal. The target sound component is separated from the first sound signal by the separation unit 30J (described in detail later).

As described above, in the present embodiment the first sound signal is represented by the frequency spectrum X(f, n), which is a four-dimensional vector. The third spatial correlation matrix φ_xxa(f, n) is therefore represented by a 4 × 4 complex matrix for each frequency bin.

Similarly, the third correlation storage unit 30E2 is the third correlation storage unit 30E corresponding to the target sound source 12A2. It stores the third spatial correlation matrix φ_xxb(f, n), which is the spatial correlation matrix of the target sound component of the target sound source 12A2 in the first sound signal. The target sound component of the target sound source 12A2 is the component of the target sound emitted from the target sound source 12A2 that is included in the first sound signal. Like the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxb(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.

The third correlation storage unit 30E3 is the third correlation storage unit 30E corresponding to the target sound source 12A3. It stores the third spatial correlation matrix φ_xxc(f, n), which is the spatial correlation matrix of the target sound component of the target sound source 12A3 in the first sound signal. The target sound component of the target sound source 12A3 is the component of the target sound emitted from the target sound source 12A3 that is included in the first sound signal. The third spatial correlation matrix φ_xxc(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.

The fourth correlation storage unit 30F stores the fourth spatial correlation matrix φ_NN(f, n), which is the spatial correlation matrix of the non-target sound component in the first sound signal. The non-target sound component is the component of the first sound signal other than the components of the target sounds emitted from the target sound sources 12A (12A1 to 12A3). The non-target sound component is separated from the first sound signal by the separation unit 30J (described in detail later).

In the initial state, the fourth correlation storage unit 30F stores in advance, as an initial value, the fourth spatial correlation matrix φ_NN(f, n) initialized to a zero matrix (φ_NN(f, 0) = 0).

In the initial state, on the other hand, the third correlation storage units 30E1, 30E2, and 30E3 store in advance, as initial values, the third spatial correlation matrices φ_xxa(f, n), φ_xxb(f, n), and φ_xxc(f, n), which represent the spatial correlation matrices of target sounds emitted at the positions of the target sound sources 12A1, 12A2, and 12A3, respectively.

These initial values of the third spatial correlation matrices φ_xxa(f, n), φ_xxb(f, n), and φ_xxc(f, n) may be obtained in advance by simulation from the arrangement of the plurality of first microphones 14 (14A to 14D) and the positions of the plurality of target sound sources 12A (12A1 to 12A3). Alternatively, the initial values may be derived in advance from target sound signals obtained by collecting, with the plurality of first microphones 14 (14A to 14D), target sounds emitted by the target sound sources 12A1 to 12A3 at their respective positions.

Specifically, the sound signal processing unit 30 may derive the initial value of each third spatial correlation matrix in advance from target sound signals obtained by collecting, with the plurality of first microphones 14 (14A to 14D) on the table T, target sounds emitted from the respective positions of the target sound sources 12A1 to 12A3.

For example, suppose that a loudspeaker is placed at each of the positions of the target sound sources 12A1 to 12A3 and plays white noise, and that the four-dimensional vectors representing the spectra of the sound collected by the plurality of first microphones 14 (14A to 14D) are denoted Na(f, n), Nb(f, n), and Nc(f, n). In this case, the sound signal processing unit 30 may derive the initial value of each third spatial correlation matrix in advance using the following equations (24) to (26), and store the initial values in advance in the third correlation storage units 30E1 to 30E3, respectively.
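
One way such an initial value could be computed is sketched below. Equations (24) to (26) are not reproduced in this text, so the form used here, a sum over frames of the outer product N(f, n) N(f, n)^H for each frequency bin, is an assumption based on the usual definition of a spatial correlation matrix. The array layout and the random stand-in data are also hypothetical.

```python
import numpy as np


def spatial_correlation(N):
    """Sum over frames of N(f, n) N(f, n)^H for every frequency bin.

    N: complex array of shape (n_bins, n_frames, n_mics).
    Returns an array of shape (n_bins, n_mics, n_mics).
    """
    return np.einsum("fnm,fnk->fmk", N, N.conj())


# Stand-in for the white-noise recording Na(f, n): 129 bins, 50 frames, 4 mics.
rng = np.random.default_rng(0)
Na = rng.standard_normal((129, 50, 4)) + 1j * rng.standard_normal((129, 50, 4))
phi_xxa_init = spatial_correlation(Na)
```

The result is a 4 × 4 Hermitian, positive semi-definite matrix per frequency bin. The same function would be applied to Nb(f, n) and Nc(f, n) for the other two sources.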

Next, the addition unit 30K1, the coefficient derivation unit 30G1, the generation unit 30H1, and the inverse conversion unit 30I1, which correspond to the target sound source 12A1, will be described.

The addition unit 30K1 is the addition unit 30K corresponding to the target sound source 12A1. The addition unit 30K1 adds the third spatial correlation matrices of the target sound sources 12A other than the corresponding target sound source 12A1 (the third spatial correlation matrices φ_xxb(f, n) and φ_xxc(f, n) of the target sound sources 12A2 and 12A3) to the fourth spatial correlation matrix φ_NN(f, n), and outputs the result to the coefficient derivation unit 30G1. Specifically, the addition unit 30K1 derives the sum of the spatial correlation matrices, φ_SS(f, n), by the following equation (27) and outputs it to the coefficient derivation unit 30G1.
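
Equation (27) is not reproduced in this text. Based on the description of the addition unit 30K1, it presumably has the following form. The symbol φ_SS(f, n) is the name the description uses for this sum, and the order of the terms is an assumption.

```latex
\[
\phi_{SS}(f, n) = \phi_{xxb}(f, n) + \phi_{xxc}(f, n) + \phi_{NN}(f, n)
\]
```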

係数導出部３０Ｇ１は、目的音源１２Ａ１に対応する係数導出部３０Ｇである。係数導出部３０Ｇ１は、第１音信号に含まれる、対応する目的音源１２Ａ１の目的音信号を強調するための空間フィルタ係数Ｆ_ａ（ｆ，ｎ）を導出する。詳細には、係数導出部３０Ｇ１は、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）および第４空間相関行列φ_ＮＮ（ｆ，ｎ）に基づいて、空間フィルタ係数Ｆ_ａ（ｆ，ｎ）を導出する。 The coefficient derivation unit 30G1 is the coefficient derivation unit 30G corresponding to the target sound source 12A1. The coefficient derivation unit 30G1 derives the spatial filter coefficient F_a(f, n) for emphasizing the target sound signal of the corresponding target sound source 12A1 included in the first sound signal. Specifically, the coefficient derivation unit 30G1 derives the spatial filter coefficient F_a(f, n) based on the third spatial correlation matrix φ_xxa(f, n) and the fourth spatial correlation matrix φ_NN(f, n).

具体的には、係数導出部３０Ｇ１は、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）と空間相関行列の和φ_ＳＳ（ｆ，ｎ）の逆行列との積によって表される行列の、最大固有値に対応する固有ベクトルＦ_ＳＮＲ（ｆ，ｎ）を導出する。 More specifically, the coefficient derivation unit 30G1 derives the eigenvector F_SNR(f, n) corresponding to the maximum eigenvalue of the matrix represented by the product of the third spatial correlation matrix φ_xxa(f, n) and the inverse of the sum φ_SS(f, n) of the spatial correlation matrices.

そして、係数導出部３０Ｇ１は、この固有ベクトルＦ_ＳＮＲ（ｆ，ｎ）を、目的音源１２Ａに対応する空間フィルタ係数Ｆ_ａ（ｆ，ｎ）として導出する（Ｆ_ａ（ｆ，ｎ）＝Ｆ_ＳＮＲ（ｆ，ｎ））。なお、係数導出部３０Ｇ１は、第１の実施の形態と同様に、ポストフィルタｗ（ｆ，ｎ）を追加し、空間フィルタ係数Ｆ_ａ（ｆ，ｎ）を導出してもよい。 The coefficient derivation unit 30G1 then derives this eigenvector F_SNR(f, n) as the spatial filter coefficient F_a(f, n) corresponding to the target sound source 12A (F_a(f, n) = F_SNR(f, n)). As in the first embodiment, the coefficient derivation unit 30G1 may also add the post filter w(f, n) when deriving the spatial filter coefficient F_a(f, n).
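
The eigenvector step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `max_snr_filter` is hypothetical, the product is taken in the order inverse(φ_SS) times φ_xxa (the text does not fix the order), the post filter is omitted, and the toy matrices are invented for the example.

```python
import numpy as np

def max_snr_filter(phi_target, phi_ss):
    """Eigenvector for the largest eigenvalue of inv(phi_ss) @ phi_target."""
    # Solving the linear system avoids forming the inverse explicitly.
    product = np.linalg.solve(phi_ss, phi_target)
    eigvals, eigvecs = np.linalg.eig(product)
    # The eigenvalues are real in exact arithmetic; compare real parts only.
    return eigvecs[:, np.argmax(eigvals.real)]

# Toy data: a rank-one target matrix s s^H and an identity interference sum.
s = np.array([1.0, 1.0j, -1.0, 0.5])
phi_xxa = np.outer(s, s.conj())
phi_ss = np.eye(4, dtype=complex)

f_a = max_snr_filter(phi_xxa, phi_ss)
```

With an identity interference sum, the returned filter is parallel to `s` up to a complex scale factor, which is the expected behaviour of a filter that maximizes the ratio of target power to interference power.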

生成部３０Ｈ１は、目的音源１２Ａ１に対応する生成部３０Ｈである。生成部３０Ｈ１は、係数導出部３０Ｇ１で導出された空間フィルタ係数Ｆ_ａ（ｆ，ｎ）を用いて、周波数スペクトルＸ（ｆ，ｎ）によって表される第１音信号に含まれる、目的音源１２Ａ１の目的音信号を強調した強調音信号を生成する。 The generation unit 30H1 is the generation unit 30H corresponding to the target sound source 12A1. Using the spatial filter coefficient F_a(f, n) derived by the coefficient derivation unit 30G1, the generation unit 30H1 generates an emphasized sound signal in which the target sound signal of the target sound source 12A1, included in the first sound signal represented by the frequency spectrum X(f, n), is emphasized.

詳細には、生成部３０Ｈ１は、下記式（２８）を用いて、出力スペクトルＹ_ａ（ｆ，ｎ）によって表される強調音信号を生成する。出力スペクトルＹ_ａ（ｆ，ｎ）によって表される強調音信号は、第１音信号における、目的音源１２Ａの目的音信号を強調した音信号である。 Specifically, the generation unit 30H1 uses the following equation (28) to generate the emphasized sound signal represented by the output spectrum Y_a(f, n). The emphasized sound signal represented by the output spectrum Y_a(f, n) is a sound signal in which the target sound signal of the target sound source 12A in the first sound signal is emphasized.

すなわち、生成部３０Ｈ１は、周波数スペクトルＸ（ｆ，ｎ）と、空間フィルタ係数Ｆ_ａ（ｆ，ｎ）をエルミート転置した転置行列と、の積を、強調音信号を示す出力スペクトルＹ_ａ（ｆ，ｎ）として生成する。 That is, the generation unit 30H1 generates the product of the frequency spectrum X(f, n) and the Hermitian transpose of the spatial filter coefficient F_a(f, n) as the output spectrum Y_a(f, n) representing the emphasized sound signal.
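
The product of the Hermitian-transposed filter and the frequency spectrum can be sketched for all frequency bins of one frame as follows. The function name `apply_spatial_filter`, the array layout (frequency bins by microphones), and the sample values are assumptions for illustration.

```python
import numpy as np

def apply_spatial_filter(F, X):
    """Compute Y(f) = F(f)^H X(f) for each frequency bin f.

    F and X both have shape (n_freq, n_mics); the result has shape (n_freq,).
    """
    return np.einsum("fm,fm->f", F.conj(), X)

# Two frequency bins, four microphones (hypothetical values).
F = np.array([[1, 0, 0, 0], [1j, 0, 0, 0]], dtype=complex)
X = np.array([[2 + 1j, 5, 5, 5], [2 + 1j, 5, 5, 5]], dtype=complex)

Y = apply_spatial_filter(F, X)
```

The conjugation of `F` is what distinguishes the Hermitian transpose from a plain transpose; without it the phase alignment across microphones would be reversed.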

生成部３０Ｈ１は、出力スペクトルＹ_ａ（ｆ，ｎ）によって表される強調音信号を、逆変換部３０Ｉ１および検出部３０Ｃへ出力する。すなわち、生成部３０Ｈ１は、目的音源１２Ａの目的音信号の強調された強調音信号を、逆変換部３０Ｉ１および検出部３０Ｃへ出力する。 The generation unit 30H1 outputs the emphasized sound signal represented by the output spectrum Y_a(f, n) to the inverse conversion unit 30I1 and the detection unit 30C. That is, the generation unit 30H1 outputs the emphasized sound signal, in which the target sound signal of the target sound source 12A is emphasized, to the inverse conversion unit 30I1 and the detection unit 30C.

逆変換部３０Ｉ１は、目的音源１２Ａ１に対応する逆変換部３０Ｉである。逆変換部３０Ｉは、第１の実施の形態の逆変換部２０Ｉと同様に、強調信号を示す出力スペクトルＹ_ａ（ｆ，ｎ）の対称性を用いて、出力スペクトルＹ_ａ（ｆ，ｎ）から２５６点のスペクトルを生成し、逆フーリエ変換を行う。次に、逆変換部３０Ｉ１は、合成窓関数を適用し、前フレームの出力波形とフレームシフト分ずらして重畳することにより、音波形を生成すればよい。そして、逆変換部３０Ｉ１は、生成した音波形によって表される、目的音源１２Ａの強調音信号を、出力部２２へ出力する。 The inverse conversion unit 30I1 is the inverse conversion unit 30I corresponding to the target sound source 12A1. Like the inverse conversion unit 20I of the first embodiment, the inverse conversion unit 30I uses the symmetry of the output spectrum Y_a(f, n) representing the emphasized sound signal to generate a 256-point spectrum from the output spectrum Y_a(f, n), and performs an inverse Fourier transform. Next, the inverse conversion unit 30I1 may generate a sound waveform by applying a synthesis window function and overlap-adding the result onto the output waveform of the previous frame, offset by the frame shift. The inverse conversion unit 30I1 then outputs the emphasized sound signal of the target sound source 12A, represented by the generated sound waveform, to the output unit 22.
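
A minimal Python sketch of this inverse conversion is given here. The 256-point length comes from the text; the 129-bin half spectrum, the frame shift of 128 samples, the Hann synthesis window, and the function names are assumptions chosen only to make the example concrete.

```python
import numpy as np

FRAME_LEN = 256    # 256-point spectrum, as stated in the text
FRAME_SHIFT = 128  # assumption: the frame shift is not given in this passage

def full_spectrum(half):
    """Extend bins 0..128 to 256 bins using conjugate symmetry."""
    return np.concatenate([half, np.conj(half[-2:0:-1])])

def overlap_add(half_spectra, window):
    """Inverse-transform each frame, window it, and overlap-add the frames."""
    out = np.zeros(FRAME_SHIFT * (len(half_spectra) - 1) + FRAME_LEN)
    for n, half in enumerate(half_spectra):
        frame = np.fft.ifft(full_spectrum(half)).real * window
        start = n * FRAME_SHIFT
        out[start:start + FRAME_LEN] += frame
    return out

window = np.hanning(FRAME_LEN)            # assumed synthesis window
half = np.fft.rfft(np.ones(FRAME_LEN))    # toy spectrum of a constant frame
wave = overlap_add([half, half], window)
```

In practice the synthesis window would be chosen together with the analysis window of the conversion unit so that overlapped frames reconstruct the signal without amplitude modulation.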

次に、目的音源１２Ａ２に対応する、加算部３０Ｋ２、係数導出部３０Ｇ２、生成部３０Ｈ２、および逆変換部３０Ｉ２について説明する。また、目的音源１２Ａ３に対応する、加算部３０Ｋ３、係数導出部３０Ｇ３、生成部３０Ｈ３、および逆変換部３０Ｉ３について説明する。 Next, the addition unit 30K2, the coefficient derivation unit 30G2, the generation unit 30H2, and the inverse conversion unit 30I2 corresponding to the target sound source 12A2 will be described. Further, the addition unit 30K3, the coefficient derivation unit 30G3, the generation unit 30H3, and the inverse conversion unit 30I3 corresponding to the target sound source 12A3 will be described.

加算部３０Ｋ２、加算部３０Ｋ３、係数導出部３０Ｇ２、係数導出部３０Ｇ３、生成部３０Ｈ２、生成部３０Ｈ３、逆変換部３０Ｉ２、および逆変換部３０Ｉ３は、対応する目的音源１２Ａに応じた情報が異なる点以外は、加算部３０Ｋ１、係数導出部３０Ｇ１、生成部３０Ｈ１、および逆変換部３０Ｉ１と同様の処理を行う。 The points that the information corresponding to the corresponding target sound source 12A is different between the addition unit 30K2, the addition unit 30K3, the coefficient derivation unit 30G2, the coefficient derivation unit 30G3, the generation unit 30H2, the generation unit 30H3, the inverse conversion unit 30I2, and the inverse conversion unit 30I3. Except for the above, the same processing as that of the addition unit 30K1, the coefficient derivation unit 30G1, the generation unit 30H1, and the inverse conversion unit 30I1 is performed.

詳細には、加算部３０Ｋ２は、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）と、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）と、第４空間相関行列φ_ＮＮ（ｆ，ｎ）と、の空間相関行列の和φ_ＳＳ（ｆ，ｎ）を導出し、係数導出部３０Ｇ２へ出力する。この和φ_ＳＳ（ｆ，ｎ）は、下記式（２９）で表される。 Specifically, the addition unit 30K2 derives the sum φ_SS(f, n) of the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxc(f, n), and the fourth spatial correlation matrix φ_NN(f, n), and outputs it to the coefficient derivation unit 30G2. This sum φ_SS(f, n) is expressed by the following equation (29).

そして、係数導出部３０Ｇ２は、φ_ｘｘｂ（ｆ，ｎ）と、式（２９）によって表されるφ_ＳＳ（ｆ，ｎ）と、に基づいて、空間フィルタ係数Ｆ_ｂ（ｆ，ｎ）を導出する。このため、生成部３０Ｈ２は、目的音源１２Ａ２の目的音信号の強調された強調音信号（出力スペクトルＹ_ｂ（ｆ，ｎ））を、逆変換部３０Ｉ２および検出部３０Ｃへ出力する。 Then, the coefficient derivation unit 30G2 derives the spatial filter coefficient F_b(f, n) based on φ_xxb(f, n) and φ_SS(f, n) expressed by equation (29). The generation unit 30H2 therefore outputs the emphasized sound signal (output spectrum Y_b(f, n)), in which the target sound signal of the target sound source 12A2 is emphasized, to the inverse conversion unit 30I2 and the detection unit 30C.

加算部３０Ｋ３は、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）と、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）と、第４空間相関行列φ_ＮＮ（ｆ，ｎ）と、の空間相関行列の和φ_ＳＳ（ｆ，ｎ）を導出し、係数導出部３０Ｇ３へ出力する。この和φ_ＳＳ（ｆ，ｎ）は、下記式（３０）で表される。 The addition unit 30K3 derives the sum φ_SS(f, n) of the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxb(f, n), and the fourth spatial correlation matrix φ_NN(f, n), and outputs it to the coefficient derivation unit 30G3. This sum φ_SS(f, n) is expressed by the following equation (30).

そして、係数導出部３０Ｇ３は、φ_ｘｘｃ（ｆ，ｎ）と、式（３０）によって表されるφ_ＳＳ（ｆ，ｎ）と、に基づいて、空間フィルタ係数Ｆ_ｃ（ｆ，ｎ）を導出する。このため、生成部３０Ｈ３は、目的音源１２Ａ３の目的音信号の強調された強調音信号（出力スペクトルＹ_ｃ（ｆ，ｎ））を、逆変換部３０Ｉ３および検出部３０Ｃへ出力する。 Then, the coefficient derivation unit 30G3 derives the spatial filter coefficient F_c(f, n) based on φ_xxc(f, n) and φ_SS(f, n) expressed by equation (30). The generation unit 30H3 therefore outputs the emphasized sound signal (output spectrum Y_c(f, n)), in which the target sound signal of the target sound source 12A3 is emphasized, to the inverse conversion unit 30I3 and the detection unit 30C.

次に、検出部３０Ｃについて説明する。検出部３０Ｃは、強調音信号に基づいて、目的音区間を検出する。本実施の形態では、検出部３０Ｃは、複数の目的音源１２Ａ（目的音源１２Ａ１〜目的音源１２Ａ３）にそれぞれ対応する複数の強調音信号を用いて、複数の目的音源１２Ａの各々から発せられた目的音の、目的音区間を検出する。 Next, the detection unit 30C will be described. The detection unit 30C detects a target sound section based on the emphasized sound signal. In the present embodiment, the detection unit 30C uses the plurality of emphasized sound signals corresponding to the plurality of target sound sources 12A (target sound source 12A1 to target sound source 12A3) to detect the target sound section of the target sound emitted from each of the plurality of target sound sources 12A.

詳細には、検出部３０Ｃは、生成部３０Ｈ１から、出力スペクトルＹ_ａ（ｆ，ｎ）によって表される、目的音源１２Ａ１の目的音信号を強調した強調音信号を受付ける。また、検出部３０Ｃは、生成部３０Ｈ２から、出力スペクトルＹ_ｂ（ｆ，ｎ）によって表される、目的音源１２Ａ２の目的音信号を強調した強調音信号を受付ける。また、検出部３０Ｃは、生成部３０Ｈ３から、出力スペクトルＹ_ｃ（ｆ，ｎ）によって表される、目的音源１２Ａ３の目的音信号を強調した強調音信号を受付ける。 Specifically, the detection unit 30C receives, from the generation unit 30H1, the emphasized sound signal that is represented by the output spectrum Y_a(f, n) and in which the target sound signal of the target sound source 12A1 is emphasized. The detection unit 30C also receives, from the generation unit 30H2, the emphasized sound signal that is represented by the output spectrum Y_b(f, n) and in which the target sound signal of the target sound source 12A2 is emphasized. The detection unit 30C further receives, from the generation unit 30H3, the emphasized sound signal that is represented by the output spectrum Y_c(f, n) and in which the target sound signal of the target sound source 12A3 is emphasized.

そして、検出部３０Ｃは、これらの強調音信号（出力スペクトルＹ_ａ（ｆ，ｎ），出力スペクトルＹ_ｂ（ｆ，ｎ），出力スペクトルＹ_ｃ（ｆ，ｎ））に基づいて、目的音源１２Ａ１〜目的音源１２Ａ３の各々の目的音区間を検出する。 Then, based on these emphasized sound signals (output spectrum Y_a(f, n), output spectrum Y_b(f, n), and output spectrum Y_c(f, n)), the detection unit 30C detects the target sound section of each of the target sound sources 12A1 to 12A3.

第１の実施の形態と同様に、目的音区間は、目的音源１２Ａが音を発しているか否かをフレーム番号毎に示す関数ｕ（ｎ）によって表される。本実施の形態では、目的音源１２Ａ１〜目的音源１２Ａ３の各々の目的音の目的音区間を、関数ｕ_ａ（ｎ）、関数ｕ_ｂ（ｎ）、関数ｕ_ｃ（ｎ）で表す。なお、これらの関数は、値“１”を示す場合、第ｎフレームで目的音源１２Ａが音を発している事を示す。また、値“０”を示す場合、第ｎフレームで目的音源１２Ａが音を発していない事を示す。 As in the first embodiment, the target sound section is represented by a function u(n) that indicates, for each frame number, whether or not the target sound source 12A is emitting sound. In the present embodiment, the target sound sections of the target sounds of the target sound sources 12A1 to 12A3 are represented by the functions u_a(n), u_b(n), and u_c(n), respectively. A value of "1" for one of these functions indicates that the target sound source 12A is emitting sound in the n-th frame, and a value of "0" indicates that the target sound source 12A is not emitting sound in the n-th frame.

検出部３０Ｃは、これらの関数ｕ_ａ（ｎ）、関数ｕ_ｂ（ｎ）、関数ｕ_ｃ（ｎ）を用いて、下記式（３１）〜式（３３）によって表される閾値処理を行うことで、各々の目的音源１２Ａの目的音の目的音区間を検出する。 Using these functions u_a(n), u_b(n), and u_c(n), the detection unit 30C detects the target sound section of the target sound of each target sound source 12A by performing the threshold processing expressed by the following equations (31) to (33).

上記式（３１）〜式（３３）中、ｔは、目的音と非目的音との境界のパワーを表す閾値である。また、式（３１）〜式（３３）中、Ｐ_ａ、Ｐ_ｂ、Ｐ_ｃは、各々、下記式（３４）〜式（３６）で表される。 In the above equations (31) to (33), t is a threshold value representing the power at the boundary between the target sound and the non-target sound. In equations (31) to (33), P_a, P_b, and P_c are expressed by the following equations (34) to (36), respectively.
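
The threshold processing can be illustrated with the following Python sketch. Equations (31) to (36) are not reproduced in this text, so the sketch assumes that the power P of a frame is the sum of the squared magnitudes of the output spectrum over all frequency bins and that a frame is a target sound section when P exceeds t. The function name `detect_sections` and the sample values are hypothetical.

```python
import numpy as np

def detect_sections(Y, t):
    """Return u(n) in {0, 1} for each frame of an output spectrum Y.

    Y has shape (n_freq, n_frames). Assumption: the frame power is the
    sum of |Y(f, n)|^2 over f, compared against the threshold t.
    """
    power = np.sum(np.abs(Y) ** 2, axis=0)
    return (power > t).astype(int)

# Two frequency bins, three frames (hypothetical values).
Y_a = np.array([[1.0, 0.1, 2.0],
                [1.0j, 0.1, 0.0]])
u_a = detect_sections(Y_a, t=1.0)
```

The same function would be applied to Y_b(f, n) and Y_c(f, n) to obtain u_b(n) and u_c(n).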

検出部３０Ｃは、複数の目的音源１２Ａ（目的音源１２Ａ１〜目的音源１２Ａ３）の各々の目的音の目的音区間の検出結果を、相関導出部３０Ｄへ出力する。 The detection unit 30C outputs the detection result of the target sound section of each target sound of the plurality of target sound sources 12A (target sound source 12A1 to target sound source 12A3) to the correlation derivation unit 30D.

次に、分離部３０Ｊについて説明する。分離部３０Ｊは、第１音信号を、目的音成分と非目的音成分に分離する。 Next, the separation unit 30J will be described. The separation unit 30J separates the first sound signal into a target sound component and a non-target sound component.

分離部３０Ｊは、第１音信号を示す周波数スペクトルＸ（ｆ，ｎ）を、変換部３０Ｂから受付ける。上述したように、本実施の形態では、第１音信号を示す周波数スペクトルＸ（ｆ，ｎ）は、上記式（２３）で表される。また、本実施の形態では、周波数スペクトルＸ（ｆ，ｎ）は、４つの第１マイク１４（第１マイク１４Ａ〜第１マイク１４Ｄ）の各々から受付けた４つの第３音信号の各々を示す周波数スペクトルをまとめた、４次元ベクトルによって表される。 The separation unit 30J receives the frequency spectrum X(f, n) representing the first sound signal from the conversion unit 30B. As described above, in the present embodiment, the frequency spectrum X(f, n) representing the first sound signal is expressed by the above equation (23). In the present embodiment, the frequency spectrum X(f, n) is represented by a four-dimensional vector that collects the frequency spectra of the four third sound signals received from the four first microphones 14 (first microphone 14A to first microphone 14D).

分離部３０Ｊは、周波数スペクトルＸ（ｆ，ｎ）によって表される第１音信号を、目的音成分Ｓ（ｆ，ｎ）と非目的音成分Ｎ（ｆ，ｎ）に分離する。目的音成分Ｓ（ｆ，ｎ）は、下記式（３７）によって表される。非目的音成分Ｎ_ｉ（ｆ，ｎ）は、下記式（３８）によって表される。 The separation unit 30J separates the first sound signal represented by the frequency spectrum X(f, n) into the target sound component S(f, n) and the non-target sound component N(f, n). The target sound component S(f, n) is expressed by the following equation (37). The non-target sound component N_i(f, n) is expressed by the following equation (38).

そして、分離部３０Ｊは、公知の音区間検出技術を用いて、全ての周波数ｆに対して、第ｎフレームが目的音区間である場合、Ｓ（ｆ，ｎ）＝Ｘ（ｆ，ｎ），Ｎ（ｆ，ｎ）＝［０，０，０，０］を算出する。また、分離部３０Ｊは、全ての周波数ｆに対して、第ｎフレームが非目的音区間である場合、Ｓ（ｆ，ｎ）＝［０，０，０，０］，Ｎ（ｆ，ｎ）＝Ｘ（ｆ，ｎ）とすればよい。 Then, using a known sound section detection technique, the separation unit 30J sets, for all frequencies f, S(f, n) = X(f, n) and N(f, n) = [0, 0, 0, 0] when the n-th frame is a target sound section. When the n-th frame is a non-target sound section, the separation unit 30J may set, for all frequencies f, S(f, n) = [0, 0, 0, 0] and N(f, n) = X(f, n).
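
The frame-wise separation rule can be sketched in Python as follows. The per-frame decision `is_target` stands in for the output of the known sound section detection technique, which the text does not specify; the function name `separate_by_frame` and the array layout (frequency bins, frames, microphones) are likewise assumptions.

```python
import numpy as np

def separate_by_frame(X, is_target):
    """Split X into target and non-target components, frame by frame.

    X has shape (n_freq, n_frames, n_mics); is_target holds one boolean
    per frame. A target frame goes entirely to S, any other frame to N.
    """
    mask = np.asarray(is_target, dtype=bool)[None, :, None]
    S = np.where(mask, X, 0)
    N = np.where(mask, 0, X)
    return S, N

# Two frequency bins, three frames, four microphones (hypothetical values).
X = np.arange(2 * 3 * 4).reshape(2, 3, 4).astype(complex) + 1
S, N = separate_by_frame(X, [True, False, True])
```

Because each frame is assigned wholly to one component, S and N always add up to X.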

そして、分離部３０Ｊは、第１音信号から分離した、目的音成分Ｓ（ｆ，ｎ）と非目的音成分Ｎ（ｆ，ｎ）を、相関導出部３０Ｄへ出力する。 Then, the separation unit 30J outputs the target sound component S (f, n) and the non-target sound component N (f, n) separated from the first sound signal to the correlation derivation unit 30D.

相関導出部３０Ｄは、目的音区間と、目的音成分と、非目的音成分と、に基づいて、第１音信号における目的音成分の第３空間相関行列と、第１音信号における非目的音成分の第４空間相関行列と、を導出する。 Based on the target sound section, the target sound component, and the non-target sound component, the correlation derivation unit 30D derives the third spatial correlation matrix of the target sound component in the first sound signal and the fourth spatial correlation matrix of the non-target sound component in the first sound signal.

詳細には、相関導出部３０Ｄは、目的音成分Ｓ（ｆ，ｎ）および非目的音成分Ｎ（ｆ，ｎ）を分離部３０Ｊから受付ける。また、相関導出部３０Ｄは、検出部３０Ｃから、目的音区間を示す関数として、関数ｕ_ａ（ｎ）、関数ｕ_ｂ（ｎ）、関数ｕ_ｃ（ｎ）を受付ける。 Specifically, the correlation derivation unit 30D receives the target sound component S(f, n) and the non-target sound component N(f, n) from the separation unit 30J. The correlation derivation unit 30D also receives, from the detection unit 30C, the functions u_a(n), u_b(n), and u_c(n) as functions indicating the target sound sections.

そして、相関導出部３０Ｄは、目的音成分Ｓ（ｆ，ｎ）、非目的音成分Ｎ（ｆ，ｎ）、関数ｕ_ａ（ｎ）、関数ｕ_ｂ（ｎ）、関数ｕ_ｃ（ｎ）に基づいて、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）、および第４空間相関行列φ_ＮＮ（ｆ，ｎ）を導出する。そして、相関導出部３０Ｄは、導出した第３空間相関行列φ_ｘｘａ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）、および第４空間相関行列φ_ＮＮ（ｆ，ｎ）を、各々、第３相関記憶部３０Ｅ１、第３相関記憶部３０Ｅ２、第３相関記憶部３０Ｅ３、および第４相関記憶部３０Ｆへ記憶することで、これらの空間相関行列を更新する。 Then, based on the target sound component S(f, n), the non-target sound component N(f, n), and the functions u_a(n), u_b(n), and u_c(n), the correlation derivation unit 30D derives the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxb(f, n), the third spatial correlation matrix φ_xxc(f, n), and the fourth spatial correlation matrix φ_NN(f, n). The correlation derivation unit 30D then updates these spatial correlation matrices by storing the derived third spatial correlation matrix φ_xxa(f, n), third spatial correlation matrix φ_xxb(f, n), third spatial correlation matrix φ_xxc(f, n), and fourth spatial correlation matrix φ_NN(f, n) in the third correlation storage unit 30E1, the third correlation storage unit 30E2, the third correlation storage unit 30E3, and the fourth correlation storage unit 30F, respectively.

相関導出部３０Ｄは、ｕ_ａ（ｎ）＝１、且つｕ_ｂ（ｎ）＝０、且つｕ_ｃ（ｎ）＝０の区間（第ｎフレーム）については、下記式（３９）により第３空間相関行列φ_ｘｘａ（ｆ，ｎ）を導出して更新し、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）および第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）を更新しない。 For a section (n-th frame) in which u_a(n) = 1, u_b(n) = 0, and u_c(n) = 0, the correlation derivation unit 30D derives and updates the third spatial correlation matrix φ_xxa(f, n) by the following equation (39), and does not update the third spatial correlation matrix φ_xxb(f, n) or the third spatial correlation matrix φ_xxc(f, n).

また、相関導出部３０Ｄは、ｕ_ａ（ｎ）＝０、且つｕ_ｂ（ｎ）＝１、且つｕ_ｃ（ｎ）＝０の区間（第ｎフレーム）については、下記式（４０）により第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）を導出および更新し、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）および第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）を更新しない。 For a section (n-th frame) in which u_a(n) = 0, u_b(n) = 1, and u_c(n) = 0, the correlation derivation unit 30D derives and updates the third spatial correlation matrix φ_xxb(f, n) by the following equation (40), and does not update the third spatial correlation matrix φ_xxa(f, n) or the third spatial correlation matrix φ_xxc(f, n).

また、相関導出部３０Ｄは、ｕ_ａ（ｎ）＝０、且つｕ_ｂ（ｎ）＝０、且つｕ_ｃ（ｎ）＝１の区間（第ｎフレーム）については、下記式（４１）により第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）を導出および更新し、第３空間相関行列φ_ｘｘａ（ｆ，ｎ）および第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）を更新しない。 For a section (n-th frame) in which u_a(n) = 0, u_b(n) = 0, and u_c(n) = 1, the correlation derivation unit 30D derives and updates the third spatial correlation matrix φ_xxc(f, n) by the following equation (41), and does not update the third spatial correlation matrix φ_xxa(f, n) or the third spatial correlation matrix φ_xxb(f, n).

また、相関導出部３０Ｄは、第４空間相関行列φ_ＮＮ（ｆ，ｎ）を下記式（４２）により導出および更新する。 The correlation derivation unit 30D also derives and updates the fourth spatial correlation matrix φ_NN(f, n) by the following equation (42).

式（３９）〜式（４２）中、αは、０以上１未満の値である。αの値が１に近い値であるほど、過去に導出した空間相関行列の重みが、最新の空間相関行列に比べて大きい事を意味する。αの値は、予め設定すればよい。αは、例えば、０．９５などとすればよい。 In equations (39) to (42), α is a value of 0 or more and less than 1. The closer α is to 1, the larger the weight of the previously derived spatial correlation matrix relative to the latest spatial correlation matrix. The value of α may be set in advance, for example to 0.95.

すなわち、相関導出部３０Ｄは、過去に導出した第３空間相関行列を、目的音成分Ｓ（ｆ，ｎ）と目的音成分Ｓ（ｆ，ｎ）をエルミート転置した転置成分との乗算値によって表される最新の第３空間相関行列で補正することによって、新たな第３空間相関行列を導出する。 That is, the correlation derivation unit 30D derives a new third spatial correlation matrix by correcting the previously derived third spatial correlation matrix with the latest third spatial correlation matrix, which is represented by the product of the target sound component S(f, n) and its Hermitian transpose.

なお、相関導出部３０Ｄは、第３相関記憶部３０Ｅ（第３相関記憶部３０Ｅ１〜第３相関記憶部３０Ｅ３）に記憶済の第３空間相関行列φ_ｘｘａ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）を、過去に導出した第３空間相関行列として用いればよい。また、これらの第３相関記憶部３０Ｅには、各々、１つの第３空間相関行列のみが記憶され、順次、相関導出部３０Ｄによって更新される。 The correlation derivation unit 30D may use the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxb(f, n), and the third spatial correlation matrix φ_xxc(f, n) already stored in the third correlation storage units 30E (third correlation storage unit 30E1 to third correlation storage unit 30E3) as the previously derived third spatial correlation matrices. Each of these third correlation storage units 30E stores only one third spatial correlation matrix, which is updated sequentially by the correlation derivation unit 30D.

また、相関導出部３０Ｄは、過去に導出した第４空間相関行列φ_ＮＮ（ｆ，ｎ）を、非目的音成分Ｎ（ｆ，ｎ）と非目的音成分Ｎ（ｆ，ｎ）をエルミート転置した転置成分との乗算値によって表される最新の第４空間相関行列φ_ＮＮ（ｆ，ｎ）で補正することによって、新たな第４空間相関行列φ_ＮＮ（ｆ，ｎ）を導出する。 The correlation derivation unit 30D also derives a new fourth spatial correlation matrix φ_NN(f, n) by correcting the previously derived fourth spatial correlation matrix φ_NN(f, n) with the latest fourth spatial correlation matrix φ_NN(f, n), which is represented by the product of the non-target sound component N(f, n) and its Hermitian transpose.

なお、相関導出部３０Ｄは、第４相関記憶部３０Ｆに記憶済の第４空間相関行列φ_ＮＮ（ｆ，ｎ）を、過去に導出した第４空間相関行列φ_ＮＮ（ｆ，ｎ）として用いればよい。また、第４相関記憶部３０Ｆには、１つの第４空間相関行列φ_ＮＮ（ｆ，ｎ）のみが記憶され、順次、相関導出部３０Ｄによって更新される。 The correlation derivation unit 30D may use the fourth spatial correlation matrix φ_NN(f, n) already stored in the fourth correlation storage unit 30F as the previously derived fourth spatial correlation matrix φ_NN(f, n). The fourth correlation storage unit 30F stores only one fourth spatial correlation matrix φ_NN(f, n), which is updated sequentially by the correlation derivation unit 30D.
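
The update of the third spatial correlation matrices can be sketched in Python as follows. Equations (39) to (42) are not reproduced in this text, so the sketch assumes a first-order recursive average, α times the stored matrix plus (1 − α) times the newest outer product, which is one form consistent with the description of α. The function names, the dictionary keys, and the choice to leave all matrices untouched when zero or several sources are active are assumptions.

```python
import numpy as np

def smooth(phi_prev, v, alpha=0.95):
    """Blend the stored matrix with the newest outer product v v^H."""
    return alpha * phi_prev + (1.0 - alpha) * np.outer(v, v.conj())

def update_third_matrices(phi_xx, S, u, alpha=0.95):
    """Update only the matrix of the single active source.

    phi_xx maps a source key to its stored matrix for one frequency bin,
    S is the target sound component for that bin, and u maps each source
    key to its section flag (0 or 1) for the current frame.
    """
    active = [key for key, flag in u.items() if flag == 1]
    if len(active) != 1:
        return phi_xx  # assumption: no update unless exactly one source is active
    key = active[0]
    return {**phi_xx, key: smooth(phi_xx[key], S, alpha)}

zeros = np.zeros((4, 4), dtype=complex)
phi_xx = {"a": zeros, "b": zeros, "c": zeros}
S = np.array([1.0, 1.0j, 0.0, 0.0])
updated = update_third_matrices(phi_xx, S, {"a": 1, "b": 0, "c": 0}, alpha=0.5)
```

The fourth spatial correlation matrix could be updated with the same `smooth` function by passing the non-target sound component N(f, n) in place of S(f, n).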

次に、本実施の形態の音信号処理装置１１が実行する音信号処理の手順を説明する。 Next, the procedure of sound signal processing executed by the sound signal processing device 11 of the present embodiment will be described.

図６は、本実施の形態の音信号処理装置１１が実行する音信号処理の手順の一例を示す、フローチャートである。 FIG. 6 is a flowchart showing an example of a sound signal processing procedure executed by the sound signal processing device 11 of the present embodiment.

変換部３０Ｂが、ＡＤ変換部１８を介して複数の第１マイク１４から受付けた第３信号を短時間フーリエ変換し、周波数スペクトルＸ（ｆ，ｎ）によって表される第１音信号を取得する（ステップＳ２００）。変換部３０Ｂは、取得した第１音信号を、分離部３０Ｊおよび生成部３０Ｈ（生成部３０Ｈ１〜生成部３０Ｈ３）の各々へ出力する（ステップＳ２０２）。 The conversion unit 30B performs a short-time Fourier transform on the third signal received from the plurality of first microphones 14 via the AD conversion unit 18 to acquire the first sound signal represented by the frequency spectrum X (f, n). (Step S200). The conversion unit 30B outputs the acquired first sound signal to each of the separation unit 30J and the generation unit 30H (generation unit 30H1 to generation unit 30H3) (step S202).

次に、分離部３０Ｊが、第１音信号を、目的音成分Ｓ_ｉ（ｆ，ｎ）と非目的音成分Ｎ_ｉ（ｆ，ｎ）に分離する（ステップＳ２０４）。そして、分離部３０Ｊは、目的音成分Ｓ_ｉ（ｆ，ｎ）と非目的音成分Ｎ_ｉ（ｆ，ｎ）を、相関導出部３０Ｄへ出力する。 Next, the separation unit 30J separates the first sound signal into the target sound component S_i(f, n) and the non-target sound component N_i(f, n) (step S204). The separation unit 30J then outputs the target sound component S_i(f, n) and the non-target sound component N_i(f, n) to the correlation derivation unit 30D.

次に、音信号処理部３０では、目的音源１２Ａ１〜目的音源１２Ａ３の各々に対応する、加算部３０Ｋ、係数導出部３０Ｇ、生成部３０Ｈ、および逆変換部３０Ｉが、ステップＳ２０６〜ステップＳ２１２の処理を実行する。なお、ステップＳ２０６〜ステップＳ２１２の処理は、複数の目的音源１２Ａ（目的音源１２Ａ１〜目的音源１２Ａ３）の各々に対応する機能間で、並列して実行される。 Next, in the sound signal processing unit 30, the addition unit 30K, the coefficient derivation unit 30G, the generation unit 30H, and the inverse conversion unit 30I corresponding to each of the target sound sources 12A1 to 12A3 execute the processing of steps S206 to S212. The processing of steps S206 to S212 is executed in parallel among the functions corresponding to each of the plurality of target sound sources 12A (target sound source 12A1 to target sound source 12A3).

まず、加算部３０Ｋが、対応する目的音源１２Ａ以外の目的音源１２Ａの第３空間相関行列と、第４空間相関行列φ_ＮＮ（ｆ，ｎ）と、を加算し、対応する目的音源１２Ａの係数導出部３０Ｇへ出力する（ステップＳ２０６）。 First, the addition unit 30K adds the third spatial correlation matrices of the target sound sources 12A other than the corresponding target sound source 12A and the fourth spatial correlation matrix φ_NN(f, n), and outputs the result to the coefficient derivation unit 30G of the corresponding target sound source 12A (step S206).

係数導出部３０Ｇは、対応する目的音源１２Ａの第３空間相関行列と、第４空間相関行列φ_ＮＮ（ｆ，ｎ）と、を第３相関記憶部３０Ｅおよび第４相関記憶部３０Ｆから読取る（ステップＳ２０８）。 The coefficient derivation unit 30G reads the third spatial correlation matrix of the corresponding target sound source 12A and the fourth spatial correlation matrix φ_NN(f, n) from the third correlation storage unit 30E and the fourth correlation storage unit 30F (step S208).

そして、係数導出部３０Ｇは、ステップＳ２０８で読取った第３空間相関行列および第４空間相関行列φ_ＮＮ（ｆ，ｎ）に基づいて、空間フィルタ係数を導出する（ステップＳ２１０）。 Then, the coefficient derivation unit 30G derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix φ_NN(f, n) read in step S208 (step S210).

次に、生成部３０Ｈが、ステップＳ２１０で導出した空間フィルタ係数を用いて、第１音信号に含まれる、対応する目的音源１２Ａの目的音信号を強調した強調音信号を生成する（ステップＳ２１２）。 Next, the generation unit 30H uses the spatial filter coefficient derived in step S210 to generate an emphasis sound signal that emphasizes the target sound signal of the corresponding target sound source 12A included in the first sound signal (step S212). ..

そして、逆変換部３０Ｉは、ステップＳ２１２で生成された強調音信号を出力部２２へ出力する（ステップＳ２１４）。 Then, the inverse conversion unit 30I outputs the emphasized sound signal generated in step S212 to the output unit 22 (step S214).

目的音源１２Ａ１〜目的音源１２Ａ３の各々に対応する、加算部３０Ｋ、係数導出部３０Ｇ、生成部３０Ｈ、および逆変換部３０Ｉが、ステップＳ２０６〜ステップＳ２１２の処理を実行することによって、目的音源１２Ａ１から発せられた目的音信号を強調した強調音信号と、目的音源１２Ａ２から発せられた目的音信号を強調した強調音信号と、目的音源１２Ａ３から発せられた目的音信号を強調した強調音信号と、が検出部３０Ｃおよび逆変換部３０Ｉへ出力される。 When the addition unit 30K, the coefficient derivation unit 30G, the generation unit 30H, and the inverse conversion unit 30I corresponding to each of the target sound sources 12A1 to 12A3 execute the processing of steps S206 to S212, an emphasized sound signal in which the target sound signal emitted from the target sound source 12A1 is emphasized, an emphasized sound signal in which the target sound signal emitted from the target sound source 12A2 is emphasized, and an emphasized sound signal in which the target sound signal emitted from the target sound source 12A3 is emphasized are output to the detection unit 30C and the inverse conversion unit 30I.

このため、逆変換部３０Ｉ１〜逆変換部３０Ｉ３の各々から受付けた強調音信号を出力する出力部２２は、複数の目的音源１２Ａの各々の目的音をそれぞれ強調した、複数の強調音信号を出力することができる。 Therefore, the output unit 22 that outputs the emphasis sound signals received from each of the inverse conversion units 30I1 to the inverse conversion unit 30I3 outputs a plurality of emphasis sound signals that emphasize each target sound of the plurality of target sound sources 12A. can do.

次に、検出部３０Ｃは、生成部３０Ｈ（生成部３０Ｈ１〜生成部３０Ｈ３）から受付けた複数の強調音信号を用いて、複数の目的音源１２Ａの各々の目的音の、目的音区間を検出する（ステップＳ２１６）。 Next, the detection unit 30C detects a target sound section of each target sound of the plurality of target sound sources 12A by using a plurality of emphasized sound signals received from the generation unit 30H (generation unit 30H1 to generation unit 30H3). (Step S216).

次に、相関導出部３０Ｄは、ステップＳ２０４で分離された目的音成分Ｓ（ｆ，ｎ）および非目的音成分Ｎ（ｆ，ｎ）と、複数の目的音源１２Ａの各々の目的音の目的音区間を示す関数（ｕ_ａ（ｎ），ｕ_ｂ（ｎ），ｕ_ｃ（ｎ））に基づいて、複数の目的音源１２Ａの各々に対応する第３空間相関行列（第３空間相関行列φ_ｘｘａ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ））、および第４空間相関行列φ_ＮＮ（ｆ，ｎ）を導出する（ステップＳ２１８）。 Next, based on the target sound component S(f, n) and the non-target sound component N(f, n) separated in step S204 and on the functions (u_a(n), u_b(n), u_c(n)) indicating the target sound sections of the target sounds of the plurality of target sound sources 12A, the correlation derivation unit 30D derives the third spatial correlation matrices corresponding to the plurality of target sound sources 12A (the third spatial correlation matrix φ_xxa(f, n), the third spatial correlation matrix φ_xxb(f, n), and the third spatial correlation matrix φ_xxc(f, n)) and the fourth spatial correlation matrix φ_NN(f, n) (step S218).

そして、相関導出部３０Ｄは、導出した第３空間相関行列φ_ｘｘａ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｂ（ｆ，ｎ）、第３空間相関行列φ_ｘｘｃ（ｆ，ｎ）、および第４空間相関行列φ_ＮＮ（ｆ，ｎ）を、各々、第３相関記憶部３０Ｅ１、第３相関記憶部３０Ｅ２、第３相関記憶部３０Ｅ３、および第４相関記憶部３０Ｆへ記憶することで、これらの空間相関行列を更新する（ステップＳ２２０）。 Then, the correlation derivation unit 30D updates these spatial correlation matrices by storing the derived third spatial correlation matrix φ_xxa(f, n), third spatial correlation matrix φ_xxb(f, n), third spatial correlation matrix φ_xxc(f, n), and fourth spatial correlation matrix φ_NN(f, n) in the third correlation storage unit 30E1, the third correlation storage unit 30E2, the third correlation storage unit 30E3, and the fourth correlation storage unit 30F, respectively (step S220).

次に、音信号処理部３０が、音信号処理を終了するか否かを判断する（ステップＳ２２２）。ステップＳ２２２で否定判断すると（ステップＳ２２２：Ｎｏ）、上記ステップＳ２００へ戻る。一方、ステップＳ２２２で肯定判断すると（ステップＳ２２２：Ｙｅｓ）、本ルーチンを終了する。 Next, the sound signal processing unit 30 determines whether or not to end the sound signal processing (step S222). If a negative determination is made in step S222 (step S222: No), the process returns to step S200. On the other hand, if an affirmative judgment is made in step S222 (step S222: Yes), this routine ends.

以上説明したように、本実施の形態の音信号処理装置１１は、分離部３０Ｊが、第１音信号を目的音成分と非目的音成分に分離する。検出部３０Ｃは、強調音信号に基づいて、目的音区間を検出する。相関導出部３０Ｄは、目的音区間と、目的音成分と、非目的音成分と、に基づいて、第１音信号における目的音成分の第３空間相関行列と、第１音信号における非目的音成分の第４空間相関行列と、を導出する。そして、係数導出部３０Ｇは、第３空間相関行列および第４空間相関行列に基づいて、空間フィルタ係数を導出する。 As described above, in the sound signal processing device 11 of the present embodiment, the separation unit 30J separates the first sound signal into a target sound component and a non-target sound component. The detection unit 30C detects a target sound section based on the emphasized sound signal. Based on the target sound section, the target sound component, and the non-target sound component, the correlation derivation unit 30D derives the third spatial correlation matrix of the target sound component in the first sound signal and the fourth spatial correlation matrix of the non-target sound component in the first sound signal. The coefficient derivation unit 30G then derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix.

このように、本実施の形態の音信号処理装置１１では、第２マイク１６から取得した第２音声信号を用いずに、第１マイク１４から取得した第１音声信号を用いて、空間フィルタ係数を導出する。このため、本実施の形態では、目的音源１２Ａ以外の非目的音源１２Ｂの音を集音するための第２マイク１６を用意することなく、目的音源１２Ａから発せられた目的音信号を高精度に強調することができる。 As described above, in the sound signal processing device 11 of the present embodiment, the spatial filter coefficient is used by using the first audio signal acquired from the first microphone 14 without using the second audio signal acquired from the second microphone 16. Is derived. Therefore, in the present embodiment, the target sound signal emitted from the target sound source 12A is highly accurate without preparing the second microphone 16 for collecting the sounds of the non-purpose sound source 12B other than the target sound source 12A. Can be emphasized.

また、本実施の形態の音信号処理装置１１では、複数の目的音源１２Ａの各々の目的音の目的音信号を、分離して強調することができる。 Further, in the sound signal processing device 11 of the present embodiment, the target sound signals of the target sounds of the plurality of target sound sources 12A can be separated and emphasized.

また、本実施の形態の音信号処理装置１１では、相関導出部３０Ｄが、順次、第３空間相関行列および第４空間相関行列を更新する。このため、第３相関記憶部３０Ｅに初期値として記憶されていた第３空間相関行列が想定する、目的音源１２Ａと第１マイク１４との位置関係にずれが生じた場合であっても、実際の位置関係に応じた空間相関行列に次第に収束するように更新されていくこととなる。 In the sound signal processing device 11 of the present embodiment, the correlation derivation unit 30D sequentially updates the third spatial correlation matrix and the fourth spatial correlation matrix. Therefore, even if the positional relationship between the target sound source 12A and the first microphones 14 assumed by the third spatial correlation matrix stored as the initial value in the third correlation storage unit 30E deviates from the actual one, the matrix is updated so that it gradually converges to a spatial correlation matrix that reflects the actual positional relationship.

このため、本実施の形態の音信号処理装置１１は、効果的に目的音源１２Ａから発せられた目的音信号を強調し、非目的音信号を抑圧することができる。 Therefore, the sound signal processing device 11 of the present embodiment can effectively emphasize the target sound signal emitted from the target sound source 12A and suppress the non-target sound signal.

また、本実施の形態の音信号処理装置１１は、第１音信号を目的音成分と非目的音成分に分離し、空間相関行列の導出に用いる。このため、音信号処理装置１１は、雑音などの非目的音を効果的に抑圧した強調音信号を生成することができる。よって、音信号処理装置１１は、高精度な強調音信号を提供することができる。 Further, the sound signal processing device 11 of the present embodiment separates the first sound signal into a target sound component and a non-target sound component, and uses them for deriving a spatial correlation matrix. Therefore, the sound signal processing device 11 can generate an emphasis sound signal that effectively suppresses non-purpose sounds such as noise. Therefore, the sound signal processing device 11 can provide a highly accurate emphasized sound signal.

<Modification 2>
The separation unit 30J may separate the first sound signal into a target sound component and a non-target sound component by a method different from the method described in the second embodiment.

For example, the separation unit 30J may determine, for each frequency bin, whether the sound is target sound or non-target sound, and may use the determination result to separate the first sound signal into a target sound component and a non-target sound component.

For example, the separation unit 30J separates the first sound signal into a target sound component and a non-target sound component using a neural network.

In this case, the separation unit 30J uses the neural network to estimate, for each frame and each frequency bin, a speech mask M_S(f, n) and a non-speech mask M_N(f, n), each taking the value "0" or the value "1". The separation unit 30J then derives the target sound component S_i(f, n) and the non-target sound component N_i(f, n) using the following equations (43) and (44).

The separation unit 30J uses the frequency spectrum X_i(f, n) of a single channel as the input of the neural network. The separation unit 30J then estimates the speech mask M_S(f, n) and the non-speech mask M_N(f, n) for the input of each channel.

The separation unit 30J may then estimate a speech mask M_S(f, n) and a non-speech mask M_N(f, n) common to all channels, for example by a majority vote over the estimation results of all channels.
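
As a non-limiting illustration, the majority vote over channels and the subsequent separation can be sketched as follows. Equations (43) and (44) are not reproduced in this text, so the sketch assumes that they multiply each channel spectrum element-wise by the common mask. The function names, the handling of ties (a tie yields 0), and the toy masks are likewise assumptions.

```python
import numpy as np

def majority_vote(masks):
    """masks: (channels, bins, frames) array of 0/1 values.

    Returns a (bins, frames) mask that is 1 where more than half of the
    channels voted 1 (a tie gives 0, which is an assumption of this sketch).
    """
    masks = np.asarray(masks)
    return (2 * masks.sum(axis=0) > masks.shape[0]).astype(int)

def apply_masks(X, M_S, M_N):
    """X: (channels, bins, frames) complex spectra; masks broadcast over channels."""
    return M_S * X, M_N * X

# Toy per-channel estimates: 3 channels, 1 frequency bin, 2 frames.
per_channel_S = np.array([[[1, 0]], [[1, 1]], [[0, 0]]])
per_channel_N = 1 - per_channel_S
M_S = majority_vote(per_channel_S)
M_N = majority_vote(per_channel_N)
X = np.full((3, 1, 2), 1.0 + 1.0j)
S, N = apply_masks(X, M_S, M_N)
```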

The separation unit 30J may generate the training data of the neural network in advance by a simulation using a clean target sound signal that contains no noise and a non-target sound signal that contains no target sound.

Let S_t(f, n) denote the spectrum of the clean target sound signal and N_t(f, n) the spectrum of the non-target sound signal that contains no target sound. Then the spectrum X_t(f, n) of the sound on which non-target sound such as noise is superimposed, the ground-truth data M_tS(f, n) of the speech mask, and the ground-truth data M_tN(f, n) of the non-speech mask are derived by the following equations (45) to (47).

In equations (46) and (47), t_S and t_N denote thresholds on the power ratio between the target sound and the non-target sound.
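
As a non-limiting illustration, the generation of one training example can be sketched as follows. Equations (45) to (47) are not reproduced in this text, so the sketch assumes that the mixture is the sum of the two spectra, that the speech mask is 1 where the target-to-non-target power ratio exceeds t_S, and that the non-speech mask is 1 where that ratio falls below t_N. The function name, the default thresholds, and the small constant eps are hypothetical.

```python
import numpy as np

def make_training_targets(S_t, N_t, t_S=1.0, t_N=1.0, eps=1e-12):
    """Return the mixture spectrum and the two ground-truth masks.

    The inequality directions and default thresholds are assumptions.
    """
    X_t = S_t + N_t                                        # superimposed spectrum
    ratio = np.abs(S_t) ** 2 / (np.abs(N_t) ** 2 + eps)    # power ratio per bin
    M_tS = (ratio > t_S).astype(int)
    M_tN = (ratio < t_N).astype(int)
    return X_t, M_tS, M_tN

# Toy spectra for two frequency bins of one frame.
S_t = np.array([2.0 + 0j, 0.1 + 0j])
N_t = np.array([0.5 + 0j, 1.0 + 0j])
X_t, M_tS, M_tN = make_training_targets(S_t, N_t)
```

With distinct thresholds t_S and t_N, a bin whose ratio lies between them would be assigned to neither mask under these assumptions.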

The vector v(n) represented by the following equation (48) is used as the input feature.

The vector v(n) represented by equation (48) is a 516-dimensional vector obtained by concatenating the logarithms of the absolute values of the spectra of the current frame and the immediately preceding frame. The estimation of the speech mask M_S(f, n) and the non-speech mask M_N(f, n) can then be formulated, as in the following equation (49), as the problem of estimating a 258-dimensional vector c(n) representing the ground-truth data from the vector v(n) representing the input feature.
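
As a non-limiting illustration, the construction of the input feature can be sketched as follows. The sketch assumes 258 spectral values per frame, which is what the 516-dimensional total implies for two frames. The order of concatenation, the function name, and the small constant eps that guards against the logarithm of zero are assumptions.

```python
import numpy as np

def input_feature(X_curr, X_prev, eps=1e-10):
    """Concatenate log-magnitude spectra of the current and preceding frames."""
    log_curr = np.log(np.abs(X_curr) + eps)
    log_prev = np.log(np.abs(X_prev) + eps)
    return np.concatenate([log_curr, log_prev])   # order is an assumption

# Toy complex spectra for two consecutive frames.
rng = np.random.default_rng(0)
X_prev = rng.standard_normal(258) + 1j * rng.standard_normal(258)
X_curr = rng.standard_normal(258) + 1j * rng.standard_normal(258)
v = input_feature(X_curr, X_prev)
```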

Accordingly, the configuration of the neural network model can be defined by the following equations (50) to (54).

Here, the number of nodes in each intermediate layer is assumed to be 200. Then the matrix W_i has a size of 200 × 516, and the matrix W_o has a size of 258 × 200. Accordingly, the matrices W_1, W_2, and W_3 each have a size of 200 × 200.
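
As a non-limiting illustration, a forward pass through matrices of the stated sizes can be sketched as follows. Equations (50) to (54) are not reproduced in this text, so the sketch assumes a chain of matrix products without bias terms, each followed by a sigmoid activation. The placement of the activations, the random initialization, and the function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, n_in=516, n_hidden=200, n_out=258, scale=0.01):
    """Random weights with the matrix sizes given in the description."""
    return {
        "W_i": scale * rng.standard_normal((n_hidden, n_in)),
        "W_1": scale * rng.standard_normal((n_hidden, n_hidden)),
        "W_2": scale * rng.standard_normal((n_hidden, n_hidden)),
        "W_3": scale * rng.standard_normal((n_hidden, n_hidden)),
        "W_o": scale * rng.standard_normal((n_out, n_hidden)),
    }

def forward(params, v):
    """Map a 516-dimensional feature to 258 values between 0 and 1."""
    h = sigmoid(params["W_i"] @ v)
    for name in ("W_1", "W_2", "W_3"):
        h = sigmoid(params[name] @ h)
    return sigmoid(params["W_o"] @ h)

rng = np.random.default_rng(0)
params = init_params(rng)
m = forward(params, rng.standard_normal(516))
```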

Here, the objective function L is defined as the cross entropy represented by the following equation (55).

The separation unit 30J then derives, by learning, the parameters W_i, W_o, W_1, W_2, and W_3 that maximize the objective function L. A known method such as stochastic gradient descent may be used as the learning method.
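
As a non-limiting illustration, an objective of this kind can be sketched as follows. Equation (55) is not reproduced in this text, so the sketch assumes the sign convention under which the cross-entropy objective is to be maximized, that is, the sum over i of c_i log m_i + (1 - c_i) log(1 - m_i). The function name, the clipping constant eps, and the toy values are hypothetical.

```python
import numpy as np

def objective(c, m, eps=1e-12):
    """Cross-entropy objective in the form that is maximized (assumed sign)."""
    c = np.asarray(c, dtype=float)
    m = np.clip(np.asarray(m, dtype=float), eps, 1.0 - eps)   # keep log finite
    return float(np.sum(c * np.log(m) + (1.0 - c) * np.log(1.0 - m)))

# Ground-truth values and two candidate network outputs.
c = [1, 0, 1]
L_good = objective(c, [0.9, 0.1, 0.8])   # outputs close to the ground truth
L_bad = objective(c, [0.4, 0.6, 0.5])    # outputs far from the ground truth
```

Under this convention the objective is never positive and increases as the outputs approach the ground-truth data.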

The values m_i(n) (i = 1, ..., 258) in equation (55) above, which the separation unit 30J estimates using the model generated by the above method, are continuous values between 0 and 1. The separation unit 30J therefore binarizes them, for example with "0.5" as a threshold, to the value "1" if the value is equal to or greater than the threshold and to the value "0" if the value is less than the threshold, and thereby estimates the speech mask M_S(f, n) and the non-speech mask M_N(f, n) for each frame and each frequency bin. The separation unit 30J may then derive the target sound component S_i(f, n) and the non-target sound component N_i(f, n) using equations (43) and (44) above.
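
As a non-limiting illustration, the binarization of the continuous outputs can be sketched as follows. The function name binarize and the toy values are hypothetical; the threshold of 0.5 and the treatment of values equal to the threshold follow the description above.

```python
import numpy as np

def binarize(m, threshold=0.5):
    """Return 1 where the output is at or above the threshold, else 0."""
    return (np.asarray(m) >= threshold).astype(int)

mask = binarize([0.2, 0.5, 0.73])
```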

In Modification 1 and Modification 2 above, the case where the neural network is a fully connected network having three intermediate layers has been described as an example. However, the configuration of the neural network is not limited to this.

For example, when sufficient training data can be prepared, the accuracy may be improved by further increasing the number of intermediate layers or the number of nodes. Bias terms may also be used. As the activation function, various functions other than the sigmoid, such as ReLU and tanh, can be used. In addition to fully connected layers, various configurations such as convolutional neural networks and recurrent neural networks can be used. Although the FFT power spectrum has been described as the feature input to the neural network, various other features such as mel filter bank outputs and mel cepstra, as well as combinations thereof, can also be used.

(Third Embodiment)
Note that the sound signal processing device 10 and the sound signal processing device 11 may be configured to include a recognition unit instead of the sound signal processing system 2.

FIG. 7 is a schematic diagram showing an example of the sound signal processing system 3 of the present embodiment.

The sound signal processing system 3 includes a sound signal processing device 13 and a plurality of first microphones 14. The sound signal processing device 13 and the plurality of first microphones 14 are connected so as to be able to exchange data and signals. That is, the sound signal processing system 3 includes the sound signal processing device 13 in place of the sound signal processing device 10.

The sound signal processing device 13 includes an AD conversion unit 18, a sound signal processing unit 20, and a recognition unit 24. The AD conversion unit 18 and the sound signal processing unit 20 are the same as those in the first embodiment. That is, the sound signal processing device 13 is the same as the sound signal processing device 10 except that it includes the recognition unit 24 in place of the output unit 22.

The recognition unit 24 recognizes the emphasized sound signal received from the sound signal processing unit 20.

Specifically, the recognition unit 24 is a device that analyzes the emphasized sound signal. For example, the recognition unit 24 recognizes the emphasized sound signal represented by the output spectrum Y(f, n) by a known analysis method and outputs the recognition result. This output may be text data or symbolized information such as the IDs of recognized words. A known recognition device may be used as the recognition unit 24.

(Scope of application)
The sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 described in the above embodiments and modifications can be applied to various devices and systems that emphasize a target sound signal.

Specifically, the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 can be applied to various systems and devices that collect and process sound in an environment in which one or more speakers utter speech.

For example, the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 can be applied to a conference system, a lecture system, a customer service system, a smart speaker, an in-vehicle system, and the like.

A conference system is a system that processes sound collected by microphones arranged in a space in which one or more speakers speak. A lecture system is a system that processes sound collected by microphones arranged in a space in which at least one of a lecturer and a student speaks. A customer service system is a system that processes sound collected by microphones arranged in a space in which a clerk and a customer speak in a conversational manner. A smart speaker is a speaker that can use an AI (Artificial Intelligence) assistant supporting interactive voice operation. An in-vehicle system is a system that collects and processes sound uttered by an occupant in a vehicle and uses the processing result for, for example, drive control of the vehicle.

Next, the hardware configuration of the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications will be described.

FIG. 8 is an explanatory diagram showing an example of the hardware configuration of the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications.

The sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications each include a control device such as a CPU (Central Processing Unit) 51, storage devices such as a ROM (Read Only Memory) 52 and a RAM (Random Access Memory) 53, a communication I/F 54 that connects to a network and performs communication, and a bus 61 that connects these units.

The program executed by the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications is provided by being incorporated in advance in the ROM 52 or the like.

The program executed by the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications may be configured to be recorded, as a file in an installable or executable format, on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk), and provided as a computer program product.

Furthermore, the program executed by the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications may be configured to be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program may also be configured to be provided or distributed via a network such as the Internet.

The program executed by the sound signal processing device 10, the sound signal processing device 11, and the sound signal processing device 13 of the above embodiments and modifications can cause a computer to function as each unit of these devices. In this computer, the CPU 51 can read the program from a computer-readable storage medium onto the main storage device and execute it.

Although some embodiments and modifications of the present invention have been described, these embodiments and modifications are presented as examples and are not intended to limit the scope of the invention. These novel embodiments and modifications can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

10, 11, 13 Sound signal processing device
12 Sound source
12A, 12A1, 12A2, 12A3 Target sound source
12B Non-target sound source
14, 14A, 14B, 14C, 14D First microphone
16 Second microphone
20, 30 Sound signal processing unit
20C Detection unit
20D Correlation derivation unit
20G Coefficient derivation unit
20H Generation unit
24 Recognition unit
30C Detection unit
30D Correlation derivation unit
30G, 30G1, 30G2, 30G3 Coefficient derivation unit
30H, 30H1, 30H2, 30H3 Generation unit
30J Separation unit

Claims

A sound signal processing device comprising:
a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection unit that detects a target sound section based on the emphasized sound signal; and
a correlation derivation unit that derives, based on the target sound section and the first sound signal, a first spatial correlation matrix of the target sound section in the first sound signal and a second spatial correlation matrix of a non-target sound section other than the target sound section in the first sound signal,
wherein the coefficient derivation unit derives the spatial filter coefficient based on the first spatial correlation matrix and the second spatial correlation matrix, and
the detection unit detects the target sound section based on the emphasized sound signal and a second sound signal in which the power ratio of a non-target sound signal to the target sound signal is larger than in the first sound signal.

The sound signal processing device according to claim 1,
wherein the coefficient derivation unit derives the spatial filter coefficient based on the emphasized sound signal in which the target sound signal included in the first sound signal acquired from a plurality of microphones is emphasized.

The sound signal processing device according to claim 1 or 2,
wherein the detection unit detects, based on the emphasized sound signal, the target sound section and an overlapping section in which the target sound signal and the non-target sound signal overlap, and
the correlation derivation unit derives the first spatial correlation matrix and the second spatial correlation matrix based on the target sound section, the overlapping section, and the first sound signal.

The sound signal processing device according to any one of claims 1 to 3,
wherein the correlation derivation unit,
for the first sound signal in the target sound section, derives a new first spatial correlation matrix by correcting the first spatial correlation matrix derived in the past with a latest first spatial correlation matrix represented by the product of the first sound signal and a transposed signal obtained by Hermitian transposition of the first sound signal, and,
for the first sound signal in the non-target sound section, derives a new second spatial correlation matrix by correcting the second spatial correlation matrix derived in the past with a latest second spatial correlation matrix represented by the product of the first sound signal and a transposed signal obtained by Hermitian transposition of the first sound signal.

The sound signal processing device according to claim 3 or 4,
wherein the coefficient derivation unit derives, as the spatial filter coefficient, the eigenvector corresponding to the maximum eigenvalue of the product of the first spatial correlation matrix and the inverse matrix of the second spatial correlation matrix.

A sound signal processing device comprising:
a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection unit that detects a target sound section based on the emphasized sound signal;
a separation unit that separates the first sound signal into a target sound component and a non-target sound component; and
a correlation derivation unit that derives, based on the target sound section, the target sound component, and the non-target sound component, a third spatial correlation matrix of the target sound component in the first sound signal and a fourth spatial correlation matrix of the non-target sound component in the first sound signal,
wherein the coefficient derivation unit derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix.

The sound signal processing device according to claim 6,
wherein the correlation derivation unit
derives a new third spatial correlation matrix by correcting the third spatial correlation matrix derived in the past with a latest third spatial correlation matrix represented by the product of the target sound component and a transposed component obtained by Hermitian transposition of the target sound component, and
derives a new fourth spatial correlation matrix by correcting the fourth spatial correlation matrix derived in the past with a latest fourth spatial correlation matrix represented by the product of the non-target sound component and a transposed component obtained by Hermitian transposition of the non-target sound component.

The sound signal processing device according to claim 7,
wherein the coefficient derivation unit derives, as the spatial filter coefficient, the eigenvector corresponding to the maximum eigenvalue of the product of the third spatial correlation matrix and the inverse matrix of the fourth spatial correlation matrix.

A sound signal processing method comprising:
a coefficient derivation step of deriving a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection step of detecting a target sound section based on the emphasized sound signal; and
a correlation derivation step of deriving, based on the target sound section and the first sound signal, a first spatial correlation matrix of the target sound section in the first sound signal and a second spatial correlation matrix of a non-target sound section other than the target sound section in the first sound signal,
wherein the coefficient derivation step derives the spatial filter coefficient based on the first spatial correlation matrix and the second spatial correlation matrix, and
the detection step detects the target sound section based on the emphasized sound signal and a second sound signal in which the power ratio of a non-target sound signal to the target sound signal is larger than in the first sound signal.

A program for causing a computer to execute:
a coefficient derivation step of deriving a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection step of detecting a target sound section based on the emphasized sound signal; and
a correlation derivation step of deriving, based on the target sound section and the first sound signal, a first spatial correlation matrix of the target sound section in the first sound signal and a second spatial correlation matrix of a non-target sound section other than the target sound section in the first sound signal,
wherein the coefficient derivation step derives the spatial filter coefficient based on the first spatial correlation matrix and the second spatial correlation matrix, and
the detection step detects the target sound section based on the emphasized sound signal and a second sound signal in which the power ratio of a non-target sound signal to the target sound signal is larger than in the first sound signal.

A sound signal processing device comprising:
a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a generation unit that uses the spatial filter coefficient to generate the emphasized sound signal in which the target sound signal included in the first sound signal is emphasized;
a recognition unit that recognizes the emphasized sound signal;
a detection unit that detects a target sound section based on the emphasized sound signal; and
a correlation derivation unit that derives, based on the target sound section and the first sound signal, a first spatial correlation matrix of the target sound section in the first sound signal and a second spatial correlation matrix of a non-target sound section other than the target sound section in the first sound signal,
wherein the coefficient derivation unit derives the spatial filter coefficient based on the first spatial correlation matrix and the second spatial correlation matrix, and
the detection unit detects the target sound section based on the emphasized sound signal and a second sound signal in which the power ratio of a non-target sound signal to the target sound signal is larger than in the first sound signal.

A sound signal processing method comprising:
a coefficient derivation step of deriving a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection step of detecting a target sound section based on the emphasized sound signal;
a separation step of separating the first sound signal into a target sound component and a non-target sound component; and
a correlation derivation step of deriving, based on the target sound section, the target sound component, and the non-target sound component, a third spatial correlation matrix of the target sound component in the first sound signal and a fourth spatial correlation matrix of the non-target sound component in the first sound signal,
wherein the coefficient derivation step derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix.

A program for causing a computer to execute:
a coefficient derivation step of deriving a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a detection step of detecting a target sound section based on the emphasized sound signal;
a separation step of separating the first sound signal into a target sound component and a non-target sound component; and
a correlation derivation step of deriving, based on the target sound section, the target sound component, and the non-target sound component, a third spatial correlation matrix of the target sound component in the first sound signal and a fourth spatial correlation matrix of the non-target sound component in the first sound signal,
wherein the coefficient derivation step derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix.

A sound signal processing device comprising:
a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a first sound signal, based on an emphasized sound signal in which the target sound signal is emphasized;
a generation unit that uses the spatial filter coefficient to generate the emphasized sound signal in which the target sound signal included in the first sound signal is emphasized;
a recognition unit that recognizes the emphasized sound signal;
a detection unit that detects a target sound section based on the emphasized sound signal;
a separation unit that separates the first sound signal into a target sound component and a non-target sound component; and
a correlation derivation unit that derives, based on the target sound section, the target sound component, and the non-target sound component, a third spatial correlation matrix of the target sound component in the first sound signal and a fourth spatial correlation matrix of the non-target sound component in the first sound signal,
wherein the coefficient derivation unit derives the spatial filter coefficient based on the third spatial correlation matrix and the fourth spatial correlation matrix.