JP2021167850A

JP2021167850A - Signal processor, signal processing method, signal processing program, learning device, learning method and learning program

Info

Publication number: JP2021167850A
Application number: JP2020070081A
Authority: JP
Inventors: Marc Delcroix; Tsubasa Ochiai; Keisuke Kinoshita; Naohiro Tawara; Tomohiro Nakatani; Akiko Araki; Morikoba Katerina
Original assignee: Brno Univ Of Technology; Nippon Telegraph and Telephone Corp
Current assignee: Brno Univ Of Technology; Nippon Telegraph and Telephone Corp
Priority date: 2020-04-08
Filing date: 2020-04-08
Publication date: 2021-10-21
Anticipated expiration: 2040-04-08
Also published as: JP7293162B2

Abstract

To precisely extract a voice signal of a target speaker. SOLUTION: A first conversion part 111 converts a voice signal of a time domain, which is obtained from utterance of a target speaker, into an adaptation feature amount. A second conversion part 121 converts a mixture voice signal of a multichannel time domain, which is obtained by recording voices of a plurality of sound sources by a plurality of microphones, into a pre-adaptation feature amount by a neural network. A third conversion part 123 converts a post-adaptation feature amount, which is obtained by adapting the pre-adaptation feature amount to the target speaker by using the adaptation feature amount, into information for output by the neural network. SELECTED DRAWING: Figure 1

Description

The present invention relates to a signal processing device, a signal processing method, a signal processing program, a learning device, a learning method, and a learning program.

SpeakerBeam is known as a technique for extracting the voice of a target speaker from a mixed voice signal obtained from the voices of a plurality of speakers (see, for example, Non-Patent Document 1). The method described in Non-Patent Document 1 has a main NN (neural network) that converts the mixed voice signal into the frequency domain and extracts the voice of the target speaker from the frequency-domain mixed voice signal, and an auxiliary NN that extracts a feature amount from the voice signal of the target speaker. By inputting the output of the auxiliary NN to an adaptation layer provided in the middle of the main NN, the method estimates and outputs the voice signal of the target speaker contained in the frequency-domain mixed voice signal.

K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Proc. of Interspeech’17, 2017, pp. 2655-2659.

However, the conventional method has a problem in that it may fail to accurately extract the voice signal of the target speaker from the mixed voice signal. For example, when the voice signals contained in the mixed voice signal have similar characteristics, the method described in Non-Patent Document 1 may not achieve sufficient accuracy. Voice signals obtained from the voices of multiple speakers of the same gender, for instance, may have characteristics that resemble each other.

In order to solve the above problem and achieve the object, a signal processing device includes: a first conversion unit that converts a time-domain voice signal obtained from an utterance of a target speaker into an adaptation feature amount; a second conversion unit that converts, by a neural network, a multichannel time-domain mixed voice signal obtained by recording the voices of a plurality of sound sources with a plurality of microphones into a pre-adaptation feature amount; and a third conversion unit that converts, by a neural network having one or more layers, a post-adaptation feature amount, obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount, into information for output.

According to the present invention, the voice signal of the target speaker can be accurately extracted from the mixed voice signal.

FIG. 1 is a diagram showing a configuration example of a signal processing device according to the first embodiment.
FIG. 2 is a diagram showing an example of the arrangement of microphones and speakers.
FIG. 3 is a flowchart showing the processing flow of the signal processing device according to the first embodiment.
FIG. 4 is a flowchart showing the processing flow of the first auxiliary NN.
FIG. 5 is a flowchart showing the processing flow of the main NN.
FIG. 6 is a diagram showing a configuration example of a signal processing device according to the second embodiment.
FIG. 7 is a flowchart showing the processing flow of the signal processing device according to the second embodiment.
FIG. 8 is a flowchart showing the processing flow of the second auxiliary NN.
FIG. 9 is a flowchart showing the processing flow of the main NN.
FIG. 10 is a diagram showing a configuration example of a learning device according to the third embodiment.
FIG. 11 is a flowchart showing the processing flow of the learning device according to the third embodiment.
FIG. 12 is a diagram showing experimental data.
FIG. 13 is a diagram showing experimental results.
FIG. 14 is a diagram showing experimental results.
FIG. 15 is a diagram showing experimental results.
FIG. 16 is a diagram showing experimental results.
FIG. 17 is a diagram showing an example of a computer that executes a program.

Hereinafter, embodiments of a signal processing device, a signal processing method, a signal processing program, a learning device, a learning method, and a learning program according to the present application will be described in detail with reference to the drawings. The present invention is not limited to the embodiments described below.

[First Embodiment]
FIG. 1 is a diagram showing a configuration example of a signal processing device according to the first embodiment. As shown in FIG. 1, the signal processing device 10 has a first conversion unit 111 and an integration unit 112 as processing units for executing a first auxiliary NN 11. The signal processing device 10 also has a second conversion unit 121, an adaptation unit 122, and a third conversion unit 123 as processing units for executing a main NN 12. In addition, the signal processing device 10 stores parameters such as the weights and biases of each neural network as model information 15. The specific parameter values stored as the model information 15 may be those obtained by training in advance with the learning device or learning method described later.

Here, the main NN 12 is a neural network for extracting the voice signal of the target speaker from the mixed voice signal. The first auxiliary NN 11 is a neural network for obtaining the adaptation feature amount used to adapt the main NN 12 to the target speaker.

Here, a convolution block is a set of layers that perform one-dimensional convolution, normalization, and the like. An encoder is a neural network that maps a voice signal to a predetermined feature space, that is, converts a voice signal into a feature vector. Conversely, a decoder is a neural network that maps a feature amount in the predetermined feature space to the space of voice signals, that is, converts a feature vector into a voice signal. The encoder and the decoder may have the same configuration as the convolution block.

The configurations of the convolution block (1-D Conv), the encoder, and the decoder may be the same as those described in Reference 1 (Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Trans. ASLP, vol. 27, no. 8, pp. 1256-1266, 2019.). The time-domain voice signal may also be obtained by the method described in Reference 1. In the following description, each feature amount is assumed to be represented by a vector.

The first conversion unit 111 converts the time-domain voice signal obtained from the utterance of the target speaker into an adaptation feature amount. That is, the first conversion unit 111 receives a time-domain voice signal as input and outputs the adaptation feature amount. In the example of FIG. 1, the first conversion unit 111 is realized by a neural network, which converts the time-domain voice signal into the adaptation feature amount. In the following description, the neural network used in the first conversion unit 111 is referred to as the first neural network. The first neural network is a part of the first auxiliary NN 11. In the example of FIG. 1, the first neural network includes an encoder and a convolution block. The adaptation feature amount can be regarded as an embedding vector of the target speaker.

Note that the first conversion unit 111 is not limited to a neural network as in the configuration example of FIG. 1. For example, it may use a well-known method for extracting a speaker embedding vector, such as an i-vector or an x-vector.

The integration unit 112 averages the adaptation feature amount over a plurality of time frames. If the target speaker's voice signal given as input is a short utterance, the average may be taken over all time frames. If it is a long voice signal, such as one containing multiple utterances, the average may be taken over a part of it, as long as that time interval is longer than the time frame that is the processing unit of the first conversion unit. That is, the integration unit 112 receives the adaptation feature amount before averaging as input and outputs the averaged adaptation feature amount. The integration unit 112 may also be composed of a plurality of fully connected layers.
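The averaging performed by the integration unit can be sketched as follows. This is a minimal illustration, assuming the per-frame adaptation feature amounts are stored as a NumPy array of shape (frames, dimensions); the function name and the `start`/`stop` arguments are hypothetical and are not taken from the patent.

```python
import numpy as np

def integrate_adaptation_features(frame_features, start=None, stop=None):
    """Average per-frame adaptation feature amounts over time.

    frame_features: array of shape (num_frames, feature_dim).
    start, stop: optional frame range (hypothetical arguments) used to
    average over only part of a long recording.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    segment = frame_features[start:stop]
    if segment.shape[0] == 0:
        raise ValueError("the selected time interval contains no frames")
    # One vector per utterance: the mean over the time axis.
    return segment.mean(axis=0)

frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
embedding = integrate_adaptation_features(frames)          # all frames
partial = integrate_adaptation_features(frames, start=1)   # frames 1 and 2 only
```

The result has no time axis, so the same vector can be applied to every frame of the mixed voice signal.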

The second conversion unit 121 converts, by a neural network, a multichannel time-domain mixed voice signal obtained by recording the voices of a plurality of sound sources with a plurality of microphones into a pre-adaptation feature amount. That is, the second conversion unit 121 receives a multichannel time-domain voice signal as input and outputs the pre-adaptation feature amount. In the following description, the neural network used in the second conversion unit 121 is referred to as the second neural network. The second neural network is a part of the main NN 12. In the example of FIG. 1, the second neural network includes an encoder and a convolution block.
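To make the idea of encoding a multichannel time-domain signal concrete, the following sketch applies a learned 1-D convolution across all channels, followed by a ReLU, in the style of the Conv-TasNet encoder of Reference 1. It is only an illustration under assumptions: the filter count, window length, hop size, and random weights are hypothetical, and the actual second neural network also contains a convolution block that is not shown here.

```python
import numpy as np

def encode(mixture, basis, hop):
    """Map a multichannel waveform to non-negative frame-level features.

    mixture: array of shape (channels, samples).
    basis:   array of shape (num_filters, channels, window), standing in
             for learned 1-D convolution weights (random in this sketch).
    hop:     stride of the convolution in samples.
    """
    num_filters, channels, window = basis.shape
    if mixture.shape[0] != channels:
        raise ValueError("channel count of mixture and basis must match")
    num_frames = 1 + (mixture.shape[1] - window) // hop
    features = np.empty((num_frames, num_filters))
    for t in range(num_frames):
        frame = mixture[:, t * hop : t * hop + window]  # (channels, window)
        # Each filter sees all channels at once, so inter-channel
        # differences can influence the resulting feature.
        projected = np.tensordot(basis, frame, axes=([1, 2], [0, 1]))
        features[t] = np.maximum(0.0, projected)  # ReLU
    return features

rng = np.random.default_rng(0)
mixture = rng.standard_normal((4, 160))   # 4 microphones, 160 samples
basis = rng.standard_normal((8, 4, 16))   # 8 filters, window of 16 samples
features = encode(mixture, basis, hop=8)
```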

The adaptation unit 122 uses the averaged adaptation feature amount to convert the pre-adaptation feature amount into a post-adaptation feature amount, which is a feature amount adapted to the target speaker. That is, it receives the adaptation feature amount and the pre-adaptation feature amount as input and outputs the post-adaptation feature amount. The adaptation unit 122 can perform the adaptation to the target speaker in the same manner as the conventional SpeakerBeam. For example, the adaptation unit 122 can obtain the post-adaptation feature amount by computing the element-wise product of the adaptation feature amount and the pre-adaptation feature amount, both of which are vectors with the same number of dimensions.
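The element-wise product described above can be sketched in a few lines. The sketch assumes the pre-adaptation feature amount is held as one vector per time frame and that the averaged adaptation feature amount is a single vector of the same dimension; the function name is hypothetical.

```python
import numpy as np

def adapt(pre_adaptation, adaptation_feature):
    """Element-wise product of each frame with the speaker embedding.

    pre_adaptation:     array of shape (num_frames, dim).
    adaptation_feature: array of shape (dim,), the averaged embedding.
    """
    pre = np.asarray(pre_adaptation, dtype=float)
    emb = np.asarray(adaptation_feature, dtype=float)
    if pre.shape[-1] != emb.shape[-1]:
        raise ValueError("both feature amounts must have the same dimension")
    # Broadcasting applies the same embedding to every time frame.
    return pre * emb

pre = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
emb = np.array([2.0, 0.0, 1.0])
post = adapt(pre, emb)
```

Dimensions where the embedding is close to zero are suppressed in every frame, which is how the embedding steers the main NN toward the target speaker.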

Here, the adaptation unit 122 is realized as a layer of the neural network, namely an adaptation layer. As shown in FIG. 1, in the main NN 12 as a whole, the adaptation layer is inserted between the first convolution block following the encoder and the second convolution block.

The third conversion unit 123 converts the post-adaptation feature amount into information for output by a neural network. That is, the third conversion unit 123 receives the post-adaptation feature amount as input and outputs the information for output as the estimation result. The information for output corresponds to the voice signal of the target speaker in the input mixed voice. It may be the voice signal itself, or data in a predetermined format from which the voice signal can be derived. In the following description, the neural network used in the third conversion unit 123 is referred to as the third neural network. The third neural network is a part of the main NN 12. In the example of FIG. 1, the third neural network includes one or more convolution blocks and a decoder.

Here, the third conversion unit 123 can obtain the estimation result from the intermediate feature amount output by the encoder of the second conversion unit 121 and the intermediate feature amount output by the convolution block of the third conversion unit 123. Moreover, because adaptation to the target speaker is performed, the third conversion unit 123 can not only separate the mixed voice signal by speaker but also extract and output the voice signal of the target speaker.

A method of recording the mixed voice from which the mixed voice signal is obtained will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the arrangement of microphones and speakers. The microphone array 30 has a microphone 301, a microphone 302, a microphone 303, and a microphone 304. The speaker 41 is the target speaker, and the speaker 42 is a non-target speaker.

Each microphone of the microphone array 30 records the voices of both the speaker 41 and the speaker 42. As a result, the voice signal of the voice recorded by the microphone array 30 can be handled as a voice signal for each channel corresponding to each microphone. In the first embodiment, a mixed voice signal obtained from voice recorded by a microphone array with at least two microphones is used. Note that the mixed voice signal may contain not only the voices produced by the speakers' utterances but also background noise and the like.

On the other hand, the voice signal of the target speaker is obtained by recording only the voice of the speaker 41, who is the target speaker. In that case, a single microphone may be used. That is, the voice signal of the target speaker may be single-channel. Furthermore, the position of the speaker 41 may differ between the recording made to obtain the mixed voice signal and the recording made to obtain the voice signal of the target speaker.

FIG. 3 is a flowchart showing the processing flow of the signal processing device according to the first embodiment. As shown in FIG. 3, the signal processing device 10 receives the voice signal of the target speaker and the mixed voice signal as input (step S11).

The signal processing device 10 executes the first auxiliary NN 11 (step S12). The signal processing device 10 also executes the main NN 12 (step S13). Here, the signal processing device 10 may execute the first auxiliary NN 11 and the main NN 12 in parallel. However, because the main NN 12 uses the output of the first auxiliary NN 11, execution of the main NN 12 cannot complete until execution of the first auxiliary NN 11 has completed.

FIG. 4 is a flowchart showing the processing flow of the first auxiliary NN. As shown in FIG. 4, the first conversion unit 111 converts the input time-domain voice signal of the target speaker into the adaptation feature amount (step S121). Next, the integration unit 112 integrates the adaptation feature amount over time frames and outputs it (step S122).

FIG. 5 is a flowchart showing the processing flow of the main NN. As shown in FIG. 5, the second conversion unit 121 first converts the input time-domain mixed voice signal into a mixed voice feature amount (step S131). The adaptation unit 122 uses the integrated adaptation feature amount to obtain the post-adaptation feature amount, in which the mixed voice feature amount is adapted to the target speaker (step S132). The third conversion unit 123 converts the post-adaptation feature amount into a voice signal and outputs it (step S133).

As described above, the first conversion unit 111 converts the time-domain voice signal obtained from the utterance of the target speaker into the adaptation feature amount. The second conversion unit 121 converts, by the second neural network, the multichannel time-domain mixed voice signal obtained by recording the voices of a plurality of sound sources with a plurality of microphones into the pre-adaptation feature amount. The third conversion unit 123 converts, by the third neural network having one or more layers, the post-adaptation feature amount, obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount, into the information for output. In this way, the mixed voice signal input to the signal processing device 10 is multichannel, and therefore contains information about the space in which the voice was recorded. As a result, according to the first embodiment, the voice signal of the target speaker can be extracted more accurately than when a single-channel mixed voice signal is input.

[Second Embodiment]
In the second embodiment, information about the space is additionally obtained by using an IPD (inter-microphone phase difference) feature amount. In the description of the following embodiments, parts having the same functions as those in the embodiments already described are given the same reference numerals, and their description is omitted as appropriate.

The IPD feature amount is an example of information on the phase difference between the microphones corresponding to the channels of the mixed voice signal. The angle Φ used to calculate the elements of the IPD feature amount is calculated as in Equation (1).

Here, Y_{i,t,f} is the coefficient of the STFT (short-time Fourier transform) of the mixed voice signal corresponding to microphone i, at time index t and frequency index f. The IPD feature amount is then calculated as in Equation (2), where F is the number of frequency bins and ∠ denotes the phase of a complex number.
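Equations (1) and (2) themselves are not reproduced in this text. One commonly used form that is consistent with the symbols defined above is sketched below. The choice of the microphone pair (i, j) and the stacking of cosines and sines are assumptions made for illustration, not the confirmed equations of the patent.

```latex
% Assumed form of Equation (1): phase difference between microphones i and j
\Phi_{t,f} = \angle Y_{i,t,f} - \angle Y_{j,t,f}

% Assumed form of Equation (2): IPD feature vector for time frame t
\mathrm{IPD}_{t} = \left[ \cos\Phi_{t,1}, \ldots, \cos\Phi_{t,F},\;
                          \sin\Phi_{t,1}, \ldots, \sin\Phi_{t,F} \right]
```

Using the cosine and sine of the angle rather than the angle itself avoids the discontinuity at ±π caused by phase wrapping.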

Note that the window size and shift width of the STFT used to obtain the IPD feature amount may be determined according to the encoder of the main NN 12.
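The following sketch computes an IPD feature amount from two channels of a mixed voice signal with NumPy. It assumes the IPD feature is the cosine and sine of the STFT phase difference between one microphone pair, which is a common choice but an assumption here. The Hann window, the function names, and the `ref`/`other` arguments are likewise hypothetical; `window_size` and `shift` are the values that would be chosen to suit the encoder.

```python
import numpy as np

def stft(signal, window_size, shift):
    """Short-time Fourier transform, returning shape (num_frames, F)."""
    window = np.hanning(window_size)
    num_frames = 1 + (len(signal) - window_size) // shift
    frames = np.stack([
        signal[t * shift : t * shift + window_size] * window
        for t in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def ipd_features(mixture, window_size, shift, ref=0, other=1):
    """Cosine and sine of the phase difference between two microphones.

    mixture: array of shape (channels, samples).
    Returns an array of shape (num_frames, 2 * F).
    """
    y_ref = stft(mixture[ref], window_size, shift)
    y_other = stft(mixture[other], window_size, shift)
    phi = np.angle(y_ref) - np.angle(y_other)
    return np.concatenate([np.cos(phi), np.sin(phi)], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(320)
same = np.stack([x, x])            # identical channels: no phase difference
ipd = ipd_features(same, window_size=64, shift=32)
```

With identical channels the phase difference is zero everywhere, so the cosine half of the feature is 1 and the sine half is 0, which is a quick sanity check for the implementation.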

FIG. 6 is a diagram showing a configuration example of the signal processing device according to the second embodiment. As shown in FIG. 6, the signal processing device 10a has a fourth conversion unit 132 for executing a second auxiliary NN 13. The signal processing device 10a also has a combining unit 122a. In addition, the signal processing device 10a stores parameters such as the weights and biases of each neural network as model information 15a.

The fourth conversion unit 132 converts spatial information on the phase difference between the microphones corresponding to the channels of the mixed voice signal into a spatial information feature amount. That is, the fourth conversion unit 132 receives the spatial information as input and outputs the spatial information feature amount. For example, the spatial information is the IPD feature amount. The neural network constituting the fourth conversion unit 132 includes convolution blocks and a layer for upsampling. The spatial information feature amount can be regarded as a feature amount obtained by upsampling the feature amount produced by a convolution block and then converting it further with a convolution block.

The combining unit 122a combines the post-adaptation feature amount output by the adaptation unit 122 with the spatial information feature amount. The combining unit 122a may simply concatenate the two so that the elements of the spatial information feature amount, which is a vector, follow the elements of the post-adaptation feature amount, which is also a vector. The third conversion unit 123 converts the post-adaptation feature amount with which the spatial information feature amount has been combined into the information for output.
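The concatenation can be sketched as follows. The sketch also includes a simple upsampling step so that both feature amounts have the same number of time frames. Nearest-neighbor repetition is used only as a stand-in for the upsampling layer of the fourth conversion unit, and the function names and the `factor` argument are hypothetical.

```python
import numpy as np

def upsample_frames(features, factor):
    """Repeat each time frame `factor` times (nearest-neighbor upsampling)."""
    return np.repeat(features, factor, axis=0)

def combine(post_adaptation, spatial, factor):
    """Append spatial elements after the post-adaptation elements per frame.

    post_adaptation: array of shape (num_frames, dim_a).
    spatial:         array of shape (num_spatial_frames, dim_s).
    """
    num_frames = post_adaptation.shape[0]
    spatial_up = upsample_frames(spatial, factor)[:num_frames]
    if spatial_up.shape[0] != num_frames:
        raise ValueError("spatial features do not cover every time frame")
    return np.concatenate([post_adaptation, spatial_up], axis=1)

post = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames, 3 dimensions
spatial = np.array([[0.1, 0.2], [0.3, 0.4]])      # 2 frames, 2 dimensions
combined = combine(post, spatial, factor=2)
```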

FIG. 7 is a flowchart showing the processing flow of the signal processing device according to the second embodiment. As shown in FIG. 7, the signal processing device 10a receives the voice signal of the target speaker, the mixed voice signal, and the spatial information as input (step S21).

The signal processing device 10a executes the first auxiliary NN 11 (step S22). The signal processing device 10a also executes the second auxiliary NN 13 (step S23) and the main NN 12 (step S24). The processing flow in which the signal processing device 10a executes the first auxiliary NN 11 is the same as that described with reference to FIG. 4.

FIG. 8 is a flowchart showing the processing flow of the second auxiliary NN. As shown in FIG. 8, the fourth conversion unit 132 converts the input spatial information into the spatial information feature amount (step S231). The fourth conversion unit 132 then upsamples the spatial information feature amount and outputs it (step S232).

FIG. 9 is a flowchart showing the processing flow of the main NN. As shown in FIG. 9, the second conversion unit 121 first converts the input time-domain mixed voice signal into a mixed voice feature amount (step S241). The adaptation unit 122 uses the integrated adaptation feature amount to obtain the post-adaptation feature amount, in which the mixed voice feature amount is adapted to the target speaker (step S242).

Here, the combining unit 122a combines the spatial information feature amount with the post-adaptation feature amount (step S243). The third conversion unit 123 converts the post-adaptation feature amount, with which the spatial information feature amount has been combined, into a voice signal and outputs it (step S244).

As described above, the fourth conversion unit 132 converts the spatial information on the phase difference between the microphones corresponding to the channels of the mixed voice signal into the spatial information feature amount. The third conversion unit 123 converts the post-adaptation feature amount with which the spatial information feature amount has been combined into the information for output. In this way, the signal processing device 10a can extract the voice of the target speaker by using a feature amount that makes the spatial information more explicit. As a result, according to the second embodiment, the voice signal of the target speaker can be extracted even more accurately.

In the adaptation layer realized by the adaptation unit 122, the feature amount of the target speaker's voice is extracted from the feature amount of the mixed voice signal, using the feature amount obtained from the target speaker's voice signal as a clue. Furthermore, in the second embodiment, the layers on the output side of the adaptation layer can use the spatial information to make corrections that take into account the direction of each voice in the mixed voice. That is, when the post-adaptation feature amount contains voice from a direction that is not actually needed, removing the features of that voice is considered to yield a feature amount of a voice signal with higher separation performance.

In the example of FIG. 6, the spatial information feature amount is input to a layer on the output side of the adaptation layer. An embodiment in which the spatial information feature amount is input to a layer on the input side of the adaptation layer is also conceivable. However, because the adaptation layer selects the speaker based on spectral information, a spatial information feature amount input on the input side of the adaptation layer may adversely affect the speaker selection. This is also reflected in the experimental results presented later.

［第３の実施形態］ [Third Embodiment]
第３の実施形態では、第１の実施形態の信号処理装置１０のモデル情報１５を生成するための学習処理を行う学習装置について説明する。図１０は、第３の実施形態に係る学習装置の構成例を示す図である。 In the third embodiment, a learning device that performs learning processing for generating the model information 15 of the signal processing device 10 of the first embodiment will be described. FIG. 10 is a diagram showing a configuration example of the learning device according to the third embodiment.

図１０に示すように、学習装置２０は、第１の実施形態の信号処理装置１０と同様に、学習用データに対し、第１補助NN１１及びメインNN１２を実行する。例えば、学習用データは、混合音声信号及び当該混合音声信号に含まれる目的話者の音声信号を正解として含むデータである。 As shown in FIG. 10, the learning device 20 executes the first auxiliary NN 11 and the main NN 12 on the learning data, similarly to the signal processing device 10 of the first embodiment. For example, the learning data is data including the mixed audio signal and the audio signal of the target speaker included in the mixed audio signal as a correct answer.

第１変換部１１１、第２変換部１２１及び第３変換部１２３は、第１の実施形態と同様の処理を行う。また、更新部２４は、混合音声信号に含まれる目的話者の音声信号と出力用の情報とを基に計算される損失が最適化されるように、第１のニューラルネットワーク、第２のニューラルネットワーク及び第３のニューラルネットワークのパラメータを更新する。例えば、更新部２４は、誤差逆伝播法によりパラメータを更新する。更新部２４は、各ニューラルネットワークのパラメータであるモデル情報２５を更新していく。 The first conversion unit 111, the second conversion unit 121, and the third conversion unit 123 perform the same processing as in the first embodiment. Further, the update unit 24 updates the parameters of the first neural network, the second neural network, and the third neural network so that the loss calculated based on the audio signal of the target speaker included in the mixed audio signal and the information for output is optimized. For example, the update unit 24 updates the parameters by the backpropagation method. The update unit 24 keeps updating the model information 25, which holds the parameters of each neural network.

ここで、更新部２４は、出力用の情報によって示される目的話者の音声信号の推定結果と、目的話者の音声信号の正解との信号雑音比が最適化されるように、かつ、適応用特徴量による目的話者の音声信号の識別能力が向上するように、パラメータを更新することができる。この場合、更新部２４は、（３）式のように計算される損失が最適化されるようにパラメータの更新を行う。言い換えると、学習装置２０は、音声認識と話者識別という２つのタスクの両方が正解に近づくようにマルチタスク学習を行う。 Here, the update unit 24 can update the parameters so that the signal-to-noise ratio between the estimation result of the target speaker's audio signal indicated by the information for output and the correct answer of the target speaker's audio signal is optimized, and so that the ability of the adaptation feature amount to identify the target speaker's audio signal is improved. In this case, the update unit 24 updates the parameters so that the loss calculated as in Eq. (3) is optimized. In other words, the learning device 20 performs multi-task learning so that both of the two tasks, voice recognition and speaker identification, approach the correct answer.

（３）式に示すように、損失関数は、メインNN１２の出力に関する損失と、第１補助NN１１の出力に関する損失との重みづけ和である。メインNN１２の出力に関する損失は、例えば、（３）式の第１項に示すように、メインNNから出力される推定結果の音声信号^x_s（xの直上に^）と、学習データに含まれる目的話者の音声信号の正解x_sとの信号雑音比（signal-to-noise ratio:SiSNR）である。また、第１補助NN１１の出力に関する損失は、「入力された音声信号の話者が目的話者のものであるか否か」を識別する話者識別のタスクにおける識別能力を用いて表される。例えば、（３）式の第２項は、話者ラベルl^sと目的話者の特徴量e^s（第１補助NN１１の出力）を行列Wにより変換し、ソフトマックス関数σ(・)を適用した結果とのクロスエントロピー（CE）に重み（スケーリングパラメータ）αを掛けたものにより、第１補助NN１１の出力に関する損失を表現している。 As shown in Eq. (3), the loss function is the weighted sum of the loss related to the output of the main NN 12 and the loss related to the output of the first auxiliary NN 11. The loss related to the output of the main NN 12 is, for example, as shown in the first term of Eq. (3), the signal-to-noise ratio (SiSNR) between the estimated audio signal ^x_s (x with ^ directly above it) output from the main NN and the correct audio signal x_s of the target speaker included in the training data. The loss related to the output of the first auxiliary NN 11 is expressed using the discrimination ability in the speaker identification task of determining "whether or not the speaker of the input audio signal is the target speaker". For example, the second term of Eq. (3) expresses the loss related to the output of the first auxiliary NN 11 as the cross entropy (CE) between the speaker label l^s and the result of transforming the target speaker's feature amount e^s (the output of the first auxiliary NN 11) by the matrix W and applying the softmax function σ(・), multiplied by the weight (scaling parameter) α.
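
The weighted-sum loss described above can be sketched in Python with NumPy. Equation (3) itself is not reproduced in this text, so the following is a hedged reconstruction: reading SiSNR as the scale-invariant signal-to-noise ratio, negating it so that minimizing the loss raises the SiSNR, and the names `si_snr`, `multitask_loss`, `speaker_index`, and the default `alpha` are assumptions for illustration, not the formula as filed.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10.0 * np.log10(
        (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    )

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def multitask_loss(estimate, target, embedding, W, speaker_index, alpha=0.5):
    """Extraction loss plus alpha times the speaker-identification loss.

    embedding plays the role of e^s (output of the first auxiliary NN),
    W maps it to speaker logits, and speaker_index encodes the label l^s.
    """
    extraction_loss = -si_snr(estimate, target)      # assumed sign convention
    probs = softmax(W @ embedding)
    cross_entropy = -np.log(probs[speaker_index] + 1e-12)
    return extraction_loss + alpha * cross_entropy
```

Setting `alpha` to 0 in this sketch reduces the loss to the extraction term alone, which corresponds to training without the loss related to the first auxiliary NN 11.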

図１１は、第３の実施形態に係る学習装置の処理の流れを示すフローチャートである。図１１に示すように、学習装置２０は、目的話者の音声信号及び混合音声信号の入力を受け付ける（ステップＳ３１）。学習装置２０に入力される各音声信号は、正解が既知の学習用のデータである。 FIG. 11 is a flowchart showing a processing flow of the learning device according to the third embodiment. As shown in FIG. 11, the learning device 20 receives the input of the audio signal of the target speaker and the mixed audio signal (step S31). Each audio signal input to the learning device 20 is learning data whose correct answer is known.

学習装置２０は、第１補助NN１１を実行する（ステップＳ３２）。また、学習装置２０は、メインNN１２を実行する（ステップＳ３３）。ここで、更新部２４は、損失が最適化されるようにモデル情報２５を更新する（ステップＳ３４）。 The learning device 20 executes the first auxiliary NN 11 (step S32). Further, the learning device 20 executes the main NN 12 (step S33). Here, the update unit 24 updates the model information 25 so that the loss is optimized (step S34).

学習装置２０は、所定の条件が満たされている場合、収束したと判定し（ステップＳ３５、Yes）処理を終了する。一方、学習装置２０は、所定の条件が満たされていない場合、収束していないと判定し（ステップＳ３５、No）ステップＳ３２に戻り処理を繰り返す。例えば、条件は、所定の繰り返し回数に到達したこと、損失関数値が所定の閾値以下となったこと、パラメータの更新量（損失関数値の微分値等）が所定の閾値以下となったこと等である。 When a predetermined condition is satisfied, the learning device 20 determines that learning has converged (step S35, Yes) and ends the process. On the other hand, when the predetermined condition is not satisfied, the learning device 20 determines that learning has not converged (step S35, No), returns to step S32, and repeats the process. For example, the condition is that a predetermined number of repetitions has been reached, that the loss function value has fallen to or below a predetermined threshold, or that the amount of parameter update (the derivative of the loss function value, etc.) has fallen to or below a predetermined threshold.
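
The loop over steps S32 to S35 and its stopping conditions can be sketched as follows. This is a minimal illustration under stated assumptions: `step_fn` is a hypothetical callable that performs one pass of steps S32 to S34 and returns the current loss value, the change in loss between iterations is used as a stand-in for the parameter update amount, and the threshold values are arbitrary.

```python
def train_until_converged(step_fn, max_iters=1000,
                          loss_threshold=1e-3, delta_threshold=1e-6):
    """Repeat step_fn until one of three stopping conditions holds.

    Returns (iterations_run, reason), where reason names the condition
    that ended training.
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        loss = step_fn()  # one pass of S32 to S34 (hypothetical callable)
        # Condition: loss function value at or below the threshold.
        if loss <= loss_threshold:
            return iteration, "loss_threshold"
        # Condition: update amount at or below the threshold
        # (approximated here by the change in loss).
        if previous is not None and abs(previous - loss) <= delta_threshold:
            return iteration, "small_update"
        previous = loss
    # Condition: predetermined number of repetitions reached.
    return max_iters, "max_iters"
```

For example, passing a `step_fn` that always returns the same loss stops on the second iteration with the reason `"small_update"`.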

これまで説明してきたように、第１変換部１１１は、目的話者の発話から得られた時間領域の音声信号を、１つ以上の層を備えた第１のニューラルネットワークにより、適応用特徴量に変換する。第２変換部１２１は、複数の音源の音声を複数のマイクロホンで録音して得られたマルチチャネルの時間領域の混合音声信号を、第１のニューラルネットワークに含まれる層の数と同じ数の層を備えた第２のニューラルネットワークにより、適応前特徴量に変換する。第３変換部１２３は、適応用特徴量を用いて適応前特徴量を目的話者に適応させた適応後特徴量を、１つ以上の層を備えた第３のニューラルネットワークにより、出力用の情報に変換する。更新部２４は、混合音声信号に含まれる目的話者の音声信号と出力用の情報とを基に計算される損失が最適化されるように、第１のニューラルネットワーク、第２のニューラルネットワーク及び第３のニューラルネットワークのパラメータを更新する。この結果、第３の実施形態によれば、目的話者の音声信号を抽出する精度を向上させることができる。 As described above, the first conversion unit 111 converts the time-domain audio signal obtained from the utterance of the target speaker into the adaptation feature amount by the first neural network having one or more layers. The second conversion unit 121 converts the multi-channel time-domain mixed audio signal, obtained by recording the audio of a plurality of sound sources with a plurality of microphones, into the pre-adaptation feature amount by the second neural network, which has the same number of layers as the first neural network. The third conversion unit 123 converts the post-adaptation feature amount, obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount, into the information for output by the third neural network having one or more layers. The update unit 24 updates the parameters of the first neural network, the second neural network, and the third neural network so that the loss calculated based on the audio signal of the target speaker included in the mixed audio signal and the information for output is optimized. As a result, according to the third embodiment, the accuracy of extracting the audio signal of the target speaker can be improved.

更新部２４は、出力用の情報によって示される目的話者の音声信号の推定結果と、目的話者の音声信号の正解との信号雑音比が最適化されるように、かつ、適応用特徴量による目的話者の音声信号の識別能力が向上するように、パラメータを更新する。これにより、音声抽出のためのNNだけでなく、目的話者へ適応のためのNNの精度が向上する。その結果、第３の実施形態によれば、目的話者の音声信号を抽出する精度を向上させることができる。 The update unit 24 updates the parameters so that the signal-to-noise ratio between the estimation result of the target speaker's audio signal indicated by the information for output and the correct answer of the target speaker's audio signal is optimized, and so that the ability of the adaptation feature amount to identify the target speaker's audio signal is improved. This improves the accuracy of not only the NN for voice extraction but also the NN for adaptation to the target speaker. As a result, according to the third embodiment, the accuracy of extracting the audio signal of the target speaker can be improved.

［実験結果］ [Experimental Results]
ここで、実施形態と従来の手法とを比較するために行った実験の結果を説明する。図１２は、実験用のデータを示す図である。図１３から図１６は、実験結果を示す図である。実験では、図１２に示すマルチチャネルの２種類の混合音声WSJ（MC-WSJ0-2 mix）及びCSJ（CSJ-2mix）を使用した。なお、#Spksは話者の数、#Fは女性の話者の数、#Mは男性の話者の数、#Mixtureは混合発話の数である。 Here, the results of an experiment conducted to compare the embodiments with the conventional method will be described. FIG. 12 is a diagram showing the experimental data. FIGS. 13 to 16 are diagrams showing the experimental results. In the experiment, the two types of multi-channel mixed audio shown in FIG. 12, WSJ (MC-WSJ0-2mix) and CSJ (CSJ-2mix), were used. #Spks is the number of speakers, #F is the number of female speakers, #M is the number of male speakers, and #Mixture is the number of mixed utterances.

図１３及び図１４は、各手法によって抽出した目的話者の音声信号をSDR（signal-to-distortionratio）によって評価した結果である。図１３の（７）は、第１の実施形態の推定方法に相当する。また、図１３の（９）は、第２の実施形態の推定方法に相当する。また、図１３の（８）は、第２の実施形態の推定方法において、空間情報特徴量を、適応層より入力側の層に入力した場合に相当する。また、FFは、女性の音声同士の混合音声を示している。また、MMは、男性の音声同士の混合音声を示している。また、FMは、女性と男性の音声の混合音声を示している。図１３に示すように、実施形態は、特に話者の性別が異なる場合の混合音声に対して高い精度を示している。また、（９）の手法は、話者の性別が同一である場合にさらに精度が向上する。 FIGS. 13 and 14 show the results of evaluating the audio signal of the target speaker extracted by each method in terms of SDR (signal-to-distortion ratio). Row (7) in FIG. 13 corresponds to the estimation method of the first embodiment. Row (9) in FIG. 13 corresponds to the estimation method of the second embodiment. Row (8) in FIG. 13 corresponds to the estimation method of the second embodiment with the spatial information feature amount input to a layer on the input side of the adaptation layer. FF denotes a mixture of two female voices, MM a mixture of two male voices, and FM a mixture of a female and a male voice. As shown in FIG. 13, the embodiments show high accuracy especially for mixed audio in which the speakers are of different genders. Further, method (9) improves the accuracy further when the speakers are of the same gender.

図１４の（５）は、第１の実施形態の推定方法に相当し、さらに第３の実施形態による学習時の損失関数に第１補助NN１１の出力に関する損失を含まない場合の結果である。一方、（６）は、学習時の損失関数に第１補助NN１１の出力に関する損失（SI-loss）を含む場合の結果である。図１４に示すように、第１の実施形態は、従来の手法に比べて高い精度を示しており、SI-lossを導入することでさらに精度が向上する。特に、SI-lossを導入することで、FFの場合の精度が大きく向上した。 FIG. 14 (5) corresponds to the estimation method of the first embodiment, and is the result when the loss function at the time of learning according to the third embodiment does not include the loss related to the output of the first auxiliary NN 11. On the other hand, (6) is a result when the loss function at the time of learning includes the loss (SI-loss) related to the output of the first auxiliary NN11. As shown in FIG. 14, the first embodiment shows higher accuracy than the conventional method, and the accuracy is further improved by introducing SI-loss. In particular, the introduction of SI-loss has greatly improved the accuracy in the case of FF.

図１５は、FF、MM、FMの各ケースにおけるSDRの向上度合いを示している。図１５に示すように、実施形態の手法（TD-SpkBeam、TD-SpkBeam+SI-loss）によれば、SDRが0を超えることが多く、精度が向上する。図１６は、学習用データの話者数に応じたSDRを示している。図１６に示すように、実施形態の手法（TD-SpkBeam、TD-SpkBeam+SI-loss）によれば、特に話者数が100を超える場合にSDRが大きく向上する。 FIG. 15 shows the degree of improvement in SDR in each of the FF, MM, and FM cases. As shown in FIG. 15, according to the method of the embodiment (TD-SpkBeam, TD-SpkBeam + SI-loss), the SDR often exceeds 0, and the accuracy is improved. FIG. 16 shows the SDR according to the number of speakers of the learning data. As shown in FIG. 16, according to the method of the embodiment (TD-SpkBeam, TD-SpkBeam + SI-loss), the SDR is greatly improved especially when the number of speakers exceeds 100.

［システム構成等］ [System Configuration, etc.]
また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示のように構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的又は物理的に分散・統合して構成することができる。さらに、各装置にて行われる各処理機能は、その全部又は任意の一部が、CPU（Central Processing Unit）及び当該CPUにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。 Further, each component of each illustrated device is functional and conceptual, and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figures, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, all or any part of each processing function performed by each device can be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or can be realized as hardware by wired logic.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部又は一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Further, among the processes described in the present embodiment, all or part of the processes described as being automatically performed can be manually performed, or the processes described as being manually performed can be performed. All or part of it can be done automatically by a known method. In addition, the processing procedure, control procedure, specific name, and information including various data and parameters shown in the above document and drawings can be arbitrarily changed unless otherwise specified.

［プログラム］ [Program]
一実施形態として、信号処理装置１０は、パッケージソフトウェアやオンラインソフトウェアとして上記の音声信号の抽出処理を実行する信号処理プログラムを所望のコンピュータにインストールさせることによって実装できる。例えば、上記の信号処理プログラムを情報処理装置に実行させることにより、情報処理装置を信号処理装置１０として機能させることができる。ここで言う情報処理装置には、デスクトップ型又はノート型のパーソナルコンピュータが含まれる。また、その他にも、情報処理装置にはスマートフォン、携帯電話機やPHS（Personal Handyphone System）等の移動体通信端末、さらには、PDA（Personal Digital Assistant）等のスレート端末等がその範疇に含まれる。 As one embodiment, the signal processing device 10 can be implemented by installing a signal processing program that executes the above-mentioned audio signal extraction processing as package software or online software on a desired computer. For example, by causing the information processing device to execute the above signal processing program, the information processing device can function as the signal processing device 10. The information processing device referred to here includes a desktop type or notebook type personal computer. In addition, information processing devices include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System), and slate terminals such as PDAs (Personal Digital Assistants).

また、信号処理装置１０は、ユーザが使用する端末装置をクライアントとし、当該クライアントに上記の信号処理に関するサービスを提供する信号処理サーバ装置として実装することもできる。例えば、信号処理サーバ装置は、混合音声信号を入力とし、目的話者の音声信号を抽出する信号処理サービスを提供するサーバ装置として実装される。この場合、信号処理サーバ装置は、Ｗｅｂサーバとして実装することとしてもよいし、アウトソーシングによって上記の信号処理に関するサービスを提供するクラウドとして実装することとしてもかまわない。 Further, the signal processing device 10 can be implemented as a signal processing server device in which the terminal device used by the user is a client and the service related to the above signal processing is provided to the client. For example, the signal processing server device is implemented as a server device that provides a signal processing service that takes a mixed audio signal as an input and extracts an audio signal of a target speaker. In this case, the signal processing server device may be implemented as a Web server, or may be implemented as a cloud that provides the above-mentioned signal processing services by outsourcing.

図１７は、プログラムを実行するコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０、CPU１０２０を有する。また、コンピュータ１０００は、ハードディスクドライブインタフェース１０３０、ディスクドライブインタフェース１０４０、シリアルポートインタフェース１０５０、ビデオアダプタ１０６０、ネットワークインタフェース１０７０を有する。これらの各部は、バス１０８０によって接続される。 FIG. 17 is a diagram showing an example of a computer that executes a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.

メモリ１０１０は、ROM（Read Only Memory）１０１１及びRAM１０１２を含む。ROM１０１１は、例えば、BIOS（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ１１００に挿入される。シリアルポートインタフェース１０５０は、例えばマウス１１１０、キーボード１１２０に接続される。ビデオアダプタ１０６０は、例えばディスプレイ１１３０に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.

ハードディスクドライブ１０９０は、例えば、OS１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３、プログラムデータ１０９４を記憶する。すなわち、信号処理装置１０の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール１０９３として実装される。プログラムモジュール１０９３は、例えばハードディスクドライブ１０９０に記憶される。例えば、信号処理装置１０における機能構成と同様の処理を実行するためのプログラムモジュール１０９３が、ハードディスクドライブ１０９０に記憶される。なお、ハードディスクドライブ１０９０は、SSDにより代替されてもよい。 The hard disk drive 1090 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration in the signal processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

また、上述した実施形態の処理で用いられる設定データは、プログラムデータ１０９４として、例えばメモリ１０１０やハードディスクドライブ１０９０に記憶される。そして、CPU１０２０が、メモリ１０１０やハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてRAM１０１２に読み出して実行する。 Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 as needed, and executes the program.

なお、プログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ１１００等を介してCPU１０２０によって読み出されてもよい。あるいは、プログラムモジュール１０９３及びプログラムデータ１０９４は、ネットワーク（LAN（Local Area Network）、WAN（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール１０９３及びプログラムデータ１０９４は、他のコンピュータから、ネットワークインタフェース１０７０を介してCPU１０２０によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to the case where they are stored in the hard disk drive 1090, and may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.

１０、１０ａ 信号処理装置 10, 10a Signal processing device
１１ 第１補助NN 11 First auxiliary NN
１２ メインNN 12 Main NN
１５、１５ａ、２５ モデル情報 15, 15a, 25 Model information
２０ 学習装置 20 Learning device
２４ 更新部 24 Update unit
３０ マイクロホンアレイ 30 Microphone array
４１、４２ 話者 41, 42 Speaker
１１１ 第１変換部 111 First conversion unit
１１２ 統合部 112 Integration unit
１２１ 第２変換部 121 Second conversion unit
１２２ 適応部 122 Adaptation unit
１２２ａ 結合部 122a Coupling unit
１２３ 第３変換部 123 Third conversion unit
１３２ 第４変換部 132 Fourth conversion unit
３０１、３０２、３０３、３０４ マイクロホン 301, 302, 303, 304 Microphone

Claims

The first conversion unit that converts the audio signal in the time domain obtained from the utterance of the target speaker into adaptive features, and
A second conversion unit that converts the mixed audio signal in the multi-channel time domain obtained by recording the audio of multiple sound sources with multiple microphones into pre-adaptation features by a neural network.
A third conversion unit that converts the post-adaptation feature amount obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount into information for output by a neural network.
A signal processing device characterized by having.

It further has a fourth conversion unit that converts information on the phase difference between microphones corresponding to each channel of the mixed audio signal into spatial information features.
The signal processing device according to claim 1, wherein the third conversion unit converts the post-adaptation feature amount, which is a combination of the spatial information feature amounts, into the information for output.

A signal processing method performed by a signal processing device.
The first conversion step of converting the audio signal in the time domain obtained from the utterance of the target speaker into the adaptive feature amount, and
A second conversion step of converting the mixed audio signal in the multi-channel time domain obtained by recording the audio of a plurality of sound sources with a plurality of microphones into a pre-adaptation feature amount by a neural network.
A third conversion step of converting the post-adaptation feature amount obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount into information for output by a neural network.
A signal processing method comprising.

A signal processing program for causing a computer to function as the signal processing device according to claim 1 or 2.

The first conversion unit that converts the audio signal in the time domain obtained from the utterance of the target speaker into adaptive features, and
A second conversion unit that converts the mixed audio signal in the multi-channel time domain obtained by recording the audio of multiple sound sources with multiple microphones into pre-adaptation features by a neural network.
A third conversion unit that converts the post-adaptation feature amount obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount into information for output by a neural network.
An update unit that updates the parameters of the neural network used in the second conversion unit and the neural network used in the third conversion unit so that the loss calculated based on the audio signal of the target speaker included in the mixed audio signal and the information for output is optimized.
A learning device characterized by having.

The learning device according to claim 5, wherein the update unit updates the parameters so that the signal-to-noise ratio between the estimation result of the audio signal of the target speaker indicated by the information for output and the correct answer of the audio signal of the target speaker is optimized, and so that the ability to discriminate the audio signal of the target speaker by the adaptation feature amount is improved.

A learning method performed by a computer
The first conversion step of converting the audio signal in the time domain obtained from the utterance of the target speaker into the adaptive feature amount, and
A second conversion step of converting the mixed audio signal in the multi-channel time domain obtained by recording the audio of a plurality of sound sources with a plurality of microphones into a pre-adaptation feature amount by a neural network.
A third conversion step of converting the post-adaptation feature amount obtained by adapting the pre-adaptation feature amount to the target speaker using the adaptation feature amount into information for output by a neural network.
An update step of updating the parameters of the neural network used in the second conversion step and the neural network used in the third conversion step so that the loss calculated based on the audio signal of the target speaker included in the mixed audio signal and the information for output is optimized.
A learning method characterized by including.

A learning program for operating a computer as the learning device according to claim 5 or 6.