JP7138824B2

JP7138824B2 - Sound source separation model learning device, sound source separation device, program, sound source separation model learning method, and sound source separation method

Info

Publication number: JP7138824B2
Application number: JP2022532167A
Authority: JP
Inventors: 祥幹三井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2020-06-25
Filing date: 2020-06-25
Publication date: 2022-09-16
Anticipated expiration: 2040-06-25
Also published as: WO2021260868A1; JPWO2021260868A1

Description

The present disclosure relates to a sound source separation model learning device, a sound source separation device, a program, a sound source separation model learning method, and a sound source separation method.

In recent years, techniques based on neural networks (hereinafter, NN) have been used to separate only a desired sound source signal from a mixed signal composed of a plurality of sound sources. In Non-Patent Document 1, sound source separation is achieved by passing a mixed signal, in which a plurality of sounds are mixed, through a sound source separation device that uses an NN.

Non-Patent Document 1: Z. Q. Wang et al., "Alternative Objective Functions for Deep Clustering," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

As in the conventional technique, NN-based sound source separation methods generate input features for the NN from the acquired sound source signal and apply them to the NN.

Meanwhile, other kinds of signal processing exist for separating a desired sound source or suppressing signals arriving from unwanted sound sources. Examples include beamforming using a microphone array, spectral subtraction for suppressing noise, and adaptive filtering for noise cancellation and the like.
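As an illustration of one of the signal processing steps named above, the following is a minimal sketch of basic magnitude spectral subtraction. It assumes that a per-bin noise magnitude estimate is already available; the function name `spectral_subtract` and the parameters `over_subtraction` and `floor` are hypothetical and do not come from the disclosure.

```python
def spectral_subtract(noisy_mag, noise_mag, over_subtraction=1.0, floor=0.0):
    """Subtract an estimated noise magnitude from each frequency bin.

    noisy_mag: magnitude spectrum of one frame of the noisy signal.
    noise_mag: estimated noise magnitude per bin (assumed to be given).
    Results are clamped at `floor` so that magnitudes never go negative.
    """
    return [
        max(y - over_subtraction * n, floor)
        for y, n in zip(noisy_mag, noise_mag)
    ]


# One frame with three frequency bins and a flat noise estimate.
cleaned = spectral_subtract([1.0, 0.2, 3.0], [0.5, 0.5, 0.5])
```

Processing of this kind changes the level and spectral shape of every source in the mixture, which is the sort of change in acoustic characteristics the disclosure is concerned with.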

The learning stage of Non-Patent Document 1 does not assume that signal processing such as the above is performed between acquisition of the sound source signal and generation of the input features. Consequently, even if a mixed signal that has undergone such signal processing is input to the NN at separation time, the NN cannot cope with the changes in acoustic characteristics caused by the signal processing, and sufficient sound source separation performance cannot be obtained. Here, changes in acoustic characteristics are assumed to include, for example, changes in the scale, delay, reverberation, or frequency characteristics of the signal.
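The effect of such changes can be seen in a small sketch. Here a hypothetical front end, `apply_gain_and_delay`, stands in for the signal processing and alters only the scale and delay of a clean target sound; the gain of 0.5 and the 3-sample delay are arbitrary assumptions chosen for illustration.

```python
import math


def apply_gain_and_delay(signal, gain, delay):
    """Hypothetical stand-in for front-end processing: delay, then scale."""
    delayed = [0.0] * delay + list(signal[:len(signal) - delay])
    return [gain * x for x in delayed]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Clean target sound: five cycles of a sine wave over 100 samples.
target = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]

# The same sound as it appears after the front-end processing.
processed = apply_gain_and_delay(target, gain=0.5, delay=3)

# A model trained to reproduce `target` is being asked for the wrong waveform:
# the sound actually present in its input is `processed`.
mismatch = mean_squared_error(target, processed)
```

Even though `processed` carries exactly the same sound, `mismatch` is clearly non-zero, so a training reference built from the unprocessed target does not describe what the NN receives.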

Accordingly, an object of one or more aspects of the present disclosure is to enable sound source separation by machine learning to function effectively even when acoustic characteristics change.

A sound source separation model learning device according to a first aspect of the present disclosure comprises: a learning-side signal processing unit that performs predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a learning-side model inference unit that extracts sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; a signal transformation unit that performs, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and a model updating unit that updates the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.
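The role of the signal transformation unit can be sketched as follows. This passage does not say how the transformation is computed, so the example assumes one simple possibility: search over a small range of delays and, for each delay, fit a least-squares gain so that the target sound matches the processed target sound as closely as possible. The names `transform_target` and `max_delay` are hypothetical, and the model update itself is reduced to comparing the loss that each reference would produce.

```python
import math


def shift(signal, delay):
    """Delay a signal by `delay` samples, zero-padding the start."""
    return [0.0] * delay + list(signal[:len(signal) - delay])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transform_target(target, processed_target, max_delay=8):
    """Bring `target` closer to `processed_target` (assumed delay + gain fit)."""
    best = None
    for delay in range(max_delay + 1):
        shifted = shift(target, delay)
        energy = dot(shifted, shifted)
        if energy == 0.0:
            continue
        # Least-squares optimal scalar gain for this delay.
        gain = dot(processed_target, shifted) / energy
        candidate = [gain * x for x in shifted]
        error = squared_error(processed_target, candidate)
        if best is None or error < best[0]:
            best = (error, gain, delay, candidate)
    if best is None:  # silent target: nothing to fit
        return list(target), 1.0, 0
    _, gain, delay, transformed = best
    return transformed, gain, delay


target = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
processed_target = [0.5 * x for x in shift(target, 3)]
extracted = list(processed_target)  # pretend the model output is already ideal

transformed, gain, delay = transform_target(target, processed_target)
loss_vs_raw = squared_error(extracted, target)
loss_vs_transformed = squared_error(extracted, transformed)
```

With the raw target as reference, an ideal extraction is still penalized (`loss_vs_raw` is large); with the transformed target, the loss is essentially zero, so the update pushes the model toward the sounds that are actually present after processing.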

A sound source separation model learning device according to a second aspect of the present disclosure comprises: a learning-side signal processing unit that performs predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a learning-side feature extraction unit that extracts, from the processed learning mixed signal, learning acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic features; a learning-side model inference unit that generates, using a learning-side sound source separation model that indicates a weight for each of the plurality of components for extracting the plurality of processed target sounds, a plurality of learning masks each for extracting a respective one of the plurality of processed target sounds from the learning feature data; a learning-side signal extraction unit that extracts sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; a signal transformation unit that performs, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and a model updating unit that updates the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.
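The mask-based extraction in the second aspect can be sketched as an element-wise weighting of the feature data. The example below assumes that the feature data is a time-by-frequency grid of magnitudes and that magnitudes of the sources add up in the mixture. Because no trained model is available here, a ratio mask computed from known source magnitudes (`ratio_masks`, a hypothetical helper) stands in for the masks that the learning-side model inference unit would output.

```python
def apply_mask(features, mask):
    """Weight each time-frequency component of `features` by `mask`."""
    return [
        [m * f for m, f in zip(mask_row, feat_row)]
        for mask_row, feat_row in zip(mask, features)
    ]


def ratio_masks(source_mags, eps=1e-12):
    """Stand-in for model output: each source's share of every component."""
    n_frames = len(source_mags[0])
    n_bins = len(source_mags[0][0])
    totals = [
        [sum(src[t][k] for src in source_mags) + eps for k in range(n_bins)]
        for t in range(n_frames)
    ]
    return [
        [[src[t][k] / totals[t][k] for k in range(n_bins)] for t in range(n_frames)]
        for src in source_mags
    ]


# Two processed target sounds, 2 frames x 3 frequency bins each.
source_a = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
source_b = [[0.0, 3.0, 2.0], [0.5, 1.5, 4.0]]

# Feature data of the processed mixture (additive magnitudes assumed).
mixture = [
    [a + b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(source_a, source_b)
]

masks = ratio_masks([source_a, source_b])
extracted = [apply_mask(mixture, mask) for mask in masks]
```

Each mask holds one weight per component, and applying the masks to the same feature data yields one extraction signal per processed target sound.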

A sound source separation device according to a first aspect of the present disclosure comprises: a utilization-side signal processing unit that performs predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; and a utilization-side model inference unit that extracts sounds from the processed target mixed signal using a utilization-side sound source separation model for extracting the plurality of processed target sounds represented by the processed target mixed signal, thereby generating a plurality of utilization extraction signals that represent the sounds extracted from the processed target mixed signal and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal. The utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of those processed target sounds; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.
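At utilization time the structure is a two-stage pipeline: the same kind of predetermined processing used during learning is applied to the target mixed signal, and the trained model then extracts the sounds. The sketch below shows only that structure. The `separate` function, the toy front end, and the toy sign-based "model" are all hypothetical placeholders and bear no relation to an actual trained NN.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def separate(
    target_mixture: Sequence[float],
    front_end: Callable[[Sequence[float]], Signal],
    model: Callable[[Signal], List[Signal]],
) -> List[Signal]:
    """Apply the predetermined processing, then the separation model."""
    processed_mixture = front_end(target_mixture)
    return model(processed_mixture)


def toy_front_end(signal: Sequence[float]) -> Signal:
    # Placeholder for the predetermined processing: a fixed gain of 0.5.
    return [0.5 * x for x in signal]


def toy_model(signal: Signal) -> List[Signal]:
    # Placeholder for the trained model: split by sign into two outputs.
    positive = [max(x, 0.0) for x in signal]
    negative = [min(x, 0.0) for x in signal]
    return [positive, negative]


outputs = separate([1.0, -2.0, 3.0, -4.0], toy_front_end, toy_model)
```

The point of the structure is that the model only ever sees processed signals, both when it is trained and when it is used, so the two stages stay matched.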

A sound source separation device according to a second aspect of the present disclosure comprises: a utilization-side signal processing unit that performs predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a utilization-side feature extraction unit that extracts, from the processed target mixed signal, utilization acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating utilization feature data, which is time-series data of the extracted utilization acoustic features; a utilization-side model inference unit that generates, using a utilization-side sound source separation model that indicates a weight for each of the plurality of components of the utilization feature data for extracting the plurality of processed target sounds represented by the processed target mixed signal, a plurality of utilization masks each for extracting a respective one of those processed target sounds from the utilization feature data; and a utilization-side signal extraction unit that extracts sounds from the utilization feature data using the plurality of utilization masks, thereby generating a plurality of utilization extraction signals that represent at least the sounds extracted from the utilization feature data and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal. The utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting, from the processed learning mixed signal, learning acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic features; generating, using a learning-side sound source separation model that indicates a weight for each of the plurality of components of the learning feature data for extracting the plurality of processed target sounds represented by the processed learning mixed signal, a plurality of learning masks each for extracting a respective one of those processed target sounds from the learning feature data; extracting sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds represented by the processed learning mixed signal; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A program according to a first aspect of the present disclosure causes a computer to function as: a learning-side signal processing unit that performs predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a learning-side model inference unit that extracts sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; a signal transformation unit that performs, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and a model updating unit that updates the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A program according to a second aspect of the present disclosure causes a computer to function as: a learning-side signal processing unit that performs predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a learning-side feature extraction unit that extracts, from the processed learning mixed signal, learning acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic features; a learning-side model inference unit that generates, using a learning-side sound source separation model that indicates a weight for each of the plurality of components for extracting the plurality of processed target sounds, a plurality of learning masks each for extracting a respective one of the plurality of processed target sounds from the learning feature data; a learning-side signal extraction unit that extracts sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; a signal transformation unit that performs, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and a model updating unit that updates the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A program according to a third aspect of the present disclosure causes a computer to function as: a utilization-side signal processing unit that performs predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; and a utilization-side model inference unit that extracts sounds from the processed target mixed signal using a utilization-side sound source separation model for extracting the plurality of processed target sounds represented by the processed target mixed signal, thereby generating a plurality of utilization extraction signals that represent the sounds extracted from the processed target mixed signal and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal. The utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of those processed target sounds; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A program according to a fourth aspect of the present disclosure causes a computer to function as: a utilization-side signal processing unit that performs predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; a utilization-side feature extraction unit that extracts, from the processed target mixed signal, utilization acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating utilization feature data, which is time-series data of the extracted utilization acoustic features; a utilization-side model inference unit that generates, using a utilization-side sound source separation model that indicates a weight for each of the plurality of components of the utilization feature data for extracting the plurality of processed target sounds represented by the processed target mixed signal, a plurality of utilization masks each for extracting a respective one of those processed target sounds from the utilization feature data; and a utilization-side signal extraction unit that extracts sounds from the utilization feature data using the plurality of utilization masks, thereby generating a plurality of utilization extraction signals that represent at least the sounds extracted from the utilization feature data and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal. The utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting, from the processed learning mixed signal, learning acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic features; generating, using a learning-side sound source separation model that indicates a weight for each of the plurality of components of the learning feature data for extracting the plurality of processed target sounds represented by the processed learning mixed signal, a plurality of learning masks each for extracting a respective one of those processed target sounds from the learning feature data; extracting sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds represented by the processed learning mixed signal; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A sound source separation model learning method according to a first aspect of the present disclosure comprises: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; extracting sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; performing, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A sound source separation model learning method according to a second aspect of the present disclosure comprises: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; extracting, from the processed learning mixed signal, learning acoustic features, which are predetermined acoustic features, for a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic features; generating, using a learning-side sound source separation model that indicates a weight for each of the plurality of components for extracting the plurality of processed target sounds, a plurality of learning masks each for extracting a respective one of the plurality of processed target sounds from the learning feature data; extracting sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds; performing, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that target sound closer to the corresponding processed target sound among the plurality of processed target sounds, thereby generating a plurality of transformed target sound signals each representing a respective one of a plurality of transformed target sounds, each derived from a respective one of the plurality of target sounds; and updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of transformed target sound signals, so that the extracted sounds approach the plurality of transformed target sounds.

A sound source separation method according to a first aspect of the present disclosure is characterized by: performing predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; and extracting sounds from the processed target mixed signal by using a utilization-side sound source separation model for extracting the plurality of processed target sounds represented by the processed target mixed signal, thereby generating a plurality of utilization extraction signals that represent the sounds extracted from the processed target mixed signal and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal, wherein the utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting sounds from the processed learning mixed signal by using a learning-side sound source separation model for extracting the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds represented by the processed learning mixed signal; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing the one target sound closer to the one processed target sound that corresponds to it among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of modified target sound signals each representing a respective one of a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model by using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

A sound source separation method according to a second aspect of the present disclosure is characterized by: performing predetermined processing on a target mixed signal representing at least a plurality of target sounds, thereby generating a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds; extracting, from the processed target mixed signal, utilization acoustic feature quantities, which are predetermined acoustic feature quantities, in a plurality of components, thereby generating utilization feature data, which is time-series data of the extracted utilization acoustic feature quantities; generating, from the utilization feature data, a plurality of utilization masks each for extracting a respective one of the plurality of processed target sounds represented by the processed target mixed signal, by using a utilization-side sound source separation model that indicates a weight for each of the plurality of components in the utilization feature data in order to extract those processed target sounds; and extracting sounds from the utilization feature data by using the plurality of utilization masks, thereby generating a plurality of utilization extraction signals that represent at least the sounds extracted from the utilization feature data and each correspond to a respective one of the plurality of processed target sounds represented by the processed target mixed signal, wherein the utilization-side sound source separation model is generated by: performing predetermined processing on a learning mixed signal representing at least a plurality of target sounds, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds represented by the learning mixed signal; extracting, from the processed learning mixed signal, learning acoustic feature quantities, which are predetermined acoustic feature quantities, in a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic feature quantities; generating, from the learning feature data, a plurality of learning masks each for extracting a respective one of the plurality of processed target sounds represented by the processed learning mixed signal, by using a learning-side sound source separation model that indicates a weight for each of the plurality of components in the learning feature data in order to extract those processed target sounds; extracting sounds from the learning feature data by using the plurality of learning masks, thereby generating a plurality of learning extraction signals that represent the extracted sounds and each correspond to a respective one of the plurality of processed target sounds represented by the processed learning mixed signal; performing, on a signal representing one target sound among the plurality of target sounds represented by the learning mixed signal, transformation processing for bringing the one target sound closer to the one processed target sound that corresponds to it among the plurality of processed target sounds represented by the processed learning mixed signal, thereby generating a plurality of modified target sound signals each representing a respective one of a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds represented by the learning mixed signal; and updating the learning-side sound source separation model by using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

According to one or more aspects of the present disclosure, sound source separation by machine learning can function effectively even when acoustic characteristics fluctuate.

FIG. 1 is a block diagram schematically showing the configuration of a sound source separation system.
FIG. 2 is a block diagram schematically showing the configuration of a sound source separation model learning device.
FIG. 3 is a block diagram schematically showing the configuration of a signal transformation unit in Embodiment 1.
FIG. 4 is a block diagram schematically showing the hardware configuration of the sound source separation model learning device.
FIG. 5 is a block diagram schematically showing the configuration of a sound source separation device.
FIG. 6 is a block diagram schematically showing the hardware configuration of the sound source separation device.
FIG. 7 is a flowchart showing the operation of the sound source separation model learning device.
FIG. 8 is a flowchart showing the operation of the signal transformation unit in Embodiment 1.
FIG. 9 is a flowchart showing the operation of the sound source separation device.
FIG. 10 is a conceptual diagram showing the operation of the sound source separation model learning device.
FIGS. 11(A) and 11(B) are schematic diagrams for explaining an operation example of the sound source separation device.
FIG. 12 is a schematic diagram showing a usage example of the sound source separation device.
FIG. 13 is a block diagram schematically showing the configuration of a signal transformation unit in Embodiment 2.
FIG. 14 is a flowchart showing the operation of the signal transformation unit in Embodiment 2.

Embodiment 1.
FIG. 1 is a block diagram schematically showing the configuration of a sound source separation system 100 according to Embodiment 1.
The sound source separation system 100 includes a sound source separation model learning device 110 that generates a sound source separation model from learning signals, and a sound source separation device 130 that uses the sound source separation model to separate the target sounds emitted from the respective sound sources and contained in a target mixed signal, and outputs those target sounds.

Here, a target sound is a sound that is to be separated and taken out using the sound source separation device 130, and a non-target sound is a sound that does not need to be taken out using the sound source separation device 130. In other words, a target sound is a sound that the sound source separation device 130 should extract, and a non-target sound is a sound that the sound source separation device 130 should not extract.

The sound source separation model learning device 110 and the sound source separation device 130 are able to exchange data. For example, although not shown, the sound source separation model learning device 110 and the sound source separation device 130 are connected to a network.

The sound source separation model learning device 110 generates a sound source separation model based on the learning signals. The generated sound source separation model is provided to the sound source separation device 130.
The sound source separation device 130 uses the sound source separation model to extract a plurality of target sounds from a mixed signal containing the plurality of target sounds emitted from a plurality of sound sources.

The sound source separation model is a learned model of the NN that is used when the sound source separation device 130 separates sound sources. The sound source separation model includes, for example, information defining the connection structure of the NN and parameters storing the weight of each connection in the NN. The connection structure of the sound source separation model may be, for example, a fully connected NN, a convolutional NN (CNN), a recurrent NN (RNN), a long short-term memory (LSTM), a gated recurrent unit (GRU), or a combination of these.

FIG. 2 is a block diagram schematically showing the configuration of the sound source separation model learning device 110.
The sound source separation model learning device 110 includes a learning-side input unit 111, a mixed signal generation unit 112, a learning-side signal processing unit 113, a learning-side feature quantity extraction unit 114, a learning-side sound source separation model storage unit 115, a learning-side model inference unit 116, a learning-side signal extraction unit 117, a signal transformation unit 118, a model update unit 119, and a learning-side communication unit 120.

The learning-side input unit 111 receives input of learning signals. The input learning signals are provided to the mixed signal generation unit 112 and the signal transformation unit 118.
The learning signals include, for example, signals of recorded data of target sounds and non-target sounds, such as voices uttered individually by a plurality of speakers, music played individually on a plurality of instruments, or noise emitted individually from a plurality of noise sources.

The mixed signal generation unit 112 acquires the signals of the target sounds and the non-target sounds as the learning signals and, for example by adding them together, generates a learning mixed signal, which is a mixed signal in which a plurality of target sounds and non-target sounds are mixed. The learning mixed signal is provided to the learning-side signal processing unit 113.
Here, the learning mixed signal contains two or more target sounds. The learning mixed signal may or may not contain one or more non-target sounds. The learning mixed signal may be, for example, a signal obtained by simply adding two or more signals acquired as learning signals. In other words, the learning mixed signal is a signal representing at least a plurality of target sounds.

The mixed signal generation unit 112 may include, for example, processing that simulates the target mixed signal, which is the mixed signal input to the sound source separation device 130. For example, when the target mixed signal is a multichannel signal recorded by a microphone array, the mixed signal generation unit 112 may include processing that simulates observation by the microphone array by convolving the impulse responses of the microphone array.
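The mixing step described for the mixed signal generation unit 112 can be sketched as follows. This is a minimal illustration, not the implementation in this disclosure: the function name `make_learning_mixture`, its arguments, and the assumption that every source has one impulse response per microphone are all hypothetical.

```python
import numpy as np

def make_learning_mixture(targets, non_targets=(), impulse_responses=None):
    """Sum source signals into a learning mixed signal (hypothetical helper).

    targets, non_targets: sequences of 1-D arrays of equal length, one per source.
    impulse_responses: optional array of shape (n_sources, n_mics, ir_len),
        an assumed impulse response from each source to each microphone.
    Returns shape (n_samples,) without impulse responses,
    otherwise shape (n_mics, n_samples).
    """
    if len(targets) < 2:
        raise ValueError("the learning mixed signal needs two or more target sounds")
    sources = [np.asarray(s, dtype=float) for s in list(targets) + list(non_targets)]
    n = len(sources[0])
    if impulse_responses is None:
        # Simple addition of all acquired learning signals.
        return np.sum(sources, axis=0)
    irs = np.asarray(impulse_responses, dtype=float)
    mix = np.zeros((irs.shape[1], n))
    for source, source_irs in zip(sources, irs):
        for m, ir in enumerate(source_irs):
            # Convolving with the impulse response simulates what microphone m observes.
            mix[m] += np.convolve(source, ir)[:n]
    return mix
```

Passing `impulse_responses=None` gives the plain sum mentioned in the text, while supplying impulse responses yields a multichannel mixture resembling a microphone-array recording.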

The learning-side signal processing unit 113 performs predetermined processing on the learning mixed signal, thereby generating a processed learning mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds. The processed learning mixed signal is provided to the learning-side feature quantity extraction unit 114 and the signal transformation unit 118.
For example, the learning-side signal processing unit 113 generates the processed learning mixed signal by applying various kinds of signal processing to the learning mixed signal provided from the mixed signal generation unit 112 in order to make the target sounds easier to take out.
Specifically, the predetermined processing may be processing other than machine learning, or may be processing that uses machine learning.
It is also desirable that the predetermined processing be processing that makes the plurality of target sounds easier to extract.
Furthermore, it is desirable that the predetermined processing be processing that emphasizes the plurality of target sounds.

The learning-side signal processing unit 113 performs the same processing as that performed in the sound source separation device 130. For example, classical signal processing, processing using machine learning, or unknown signal processing is performed. The unknown signal processing may include classical signal processing or processing using machine learning.

Specifically, the processing performed by the learning-side signal processing unit 113 may include beamforming processing that suppresses, in the input learning mixed signal, noise signals or signals representing sounds other than the target sounds. The processing performed by the learning-side signal processing unit 113 may also include processing for suppressing reverberation. Furthermore, when a reference signal of a non-target sound present in the learning mixed signal is given, the processing performed by the learning-side signal processing unit 113 may include processing, typified by an echo canceller, that adaptively transforms the reference signal of the non-target sound into the form in which it is contained in the learning mixed signal and subtracts it from the learning mixed signal, thereby removing the components derived from the non-target sound from the learning mixed signal.
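As one illustration of the echo-canceller style of processing mentioned above, the sketch below adapts a filter on the reference signal and subtracts the result from the mixture. The text does not name an adaptation algorithm; the normalized LMS (NLMS) update used here, along with the function name `nlms_cancel` and its parameters, is an assumption chosen because it is a common textbook method.

```python
import numpy as np

def nlms_cancel(mixture, reference, n_taps=16, mu=0.5, eps=1e-8):
    """Remove the reference-derived component from the mixture (NLMS, assumed).

    mixture: 1-D array containing target sounds plus a filtered copy of reference.
    reference: 1-D array, the known non-target reference signal.
    Returns (cleaned signal, final filter weights).
    """
    w = np.zeros(n_taps)      # adaptive filter estimating the echo path
    buf = np.zeros(n_taps)    # most recent reference samples, newest first
    out = np.zeros(len(mixture))
    for n in range(len(mixture)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        # Subtract the adaptively transformed reference from the mixture.
        e = mixture[n] - w @ buf
        # Normalized LMS weight update.
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out, w
```

Because the filter keeps adapting, the behaviour of this kind of processing changes over time, which is one reason the separation model that follows it needs to tolerate varying acoustic characteristics.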

Note that the content of the processing performed by the learning-side signal processing unit 113 may change over time. The learning mixed signal input to the learning-side signal processing unit 113 is, for example, a multichannel signal recorded by a microphone array, and the processed learning mixed signal that is output is, for example, a single-channel signal, but the requirements on the number of channels are not limited to this.

The learning-side feature quantity extraction unit 114 extracts acoustic feature quantities from the processed learning mixed signal provided from the learning-side signal processing unit 113 and generates learning feature data, which is time-series data of the extracted acoustic feature quantities.
For example, the learning-side feature quantity extraction unit 114 extracts, from the processed learning mixed signal, learning acoustic feature quantities, which are predetermined acoustic feature quantities, in a plurality of components, thereby generating learning feature data, which is time-series data of the extracted learning acoustic feature quantities.
Here, the acoustic feature quantity is, for example, a complex spectrum obtained by applying fast Fourier transform (FFT) processing to the processed learning mixed signal. The learning feature data is provided to the learning-side model inference unit 116 and the learning-side signal extraction unit 117.
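A complex-spectrum feature of the kind described here can be sketched with a framed FFT. The frame length, hop size, and Hann window below are illustrative assumptions; the text only states that FFT processing yields a complex spectrum, and `extract_features` is a hypothetical name.

```python
import numpy as np

def extract_features(signal, frame_len=512, hop=256):
    """Return time-series complex spectra of shape (n_frames, frame_len // 2 + 1).

    The signal must contain at least frame_len samples.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)  # assumed analysis window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; each frequency bin is one "component" of the feature.
    return np.fft.rfft(frames, axis=1)
```

Each row of the result is the feature at one time step, so the array as a whole plays the role of the time-series feature data passed on to mask estimation and signal extraction.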

The learning-side sound source separation model storage unit 115 stores a learning-side sound source separation model, which is the sound source separation model used in the sound source separation model learning device 110. The learning-side sound source separation model indicates, for example, a weight parameter for each component in the learning feature data.

The learning-side model inference unit 116 uses the learning-side sound source separation model to extract, from the learning feature data provided from the learning-side feature quantity extraction unit 114, learning separation feature quantities, which are the separation feature quantities needed to perform sound source separation. The time-series data of the learning separation feature quantities extracted by the learning-side model inference unit 116 is, for example, time-series data called a "mask". A mask is a filter for taking out only the component of each sound source from the acoustic feature quantities extracted by the learning-side feature quantity extraction unit 114. A mask is given, for example, by determining, for each component of the acoustic feature quantities extracted by the learning-side feature quantity extraction unit 114, the proportion accounted for by the component from the sound source to be separated and taken out. The mask generated here is provided to the learning-side signal extraction unit 117 as a learning mask.
That is, in order to extract the plurality of processed target sounds, the learning-side model inference unit 116 uses the learning-side sound source separation model, which indicates a weight for each of the plurality of components constituting the learning feature data, to generate, for each target sound, a learning mask for extracting one processed target sound from the learning feature data. Here, since the learning mixed signal contains a plurality of target sounds, a plurality of learning masks are generated.
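To make "the proportion accounted for by the component from the sound source" concrete, the sketch below computes such a ratio directly from known per-source spectra. This is an idealized illustration only: in the device described here the masks are produced by the learning-side model inference unit 116 from the mixture features, and the function name `ratio_masks` and the magnitude-ratio definition are assumptions.

```python
import numpy as np

def ratio_masks(source_specs, eps=1e-12):
    """Per-source masks from known source spectra (idealized, for illustration).

    source_specs: complex array of shape (n_sources, n_frames, n_bins).
    Returns real masks of the same shape with values in [0, 1]; for every
    time-frequency component the masks over all sources sum to about 1.
    """
    mags = np.abs(np.asarray(source_specs))
    # Share of each source's magnitude in the total magnitude of that component.
    return mags / (mags.sum(axis=0, keepdims=True) + eps)
```

A component dominated by one source gets a mask value near 1 for that source and near 0 for the others, which is why multiplying the mixture features by a mask acts as a filter for that source.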

The learning-side signal extraction unit 117 extracts the acoustic signals to be taken out by using the learning feature data, which is the time-series data of the acoustic feature quantities extracted by the learning-side feature quantity extraction unit 114, and the learning masks, which are the time-series data of the learning separation feature quantities estimated by the learning-side model inference unit 116.
For example, the learning-side signal extraction unit 117 extracts a sound from the learning feature data by using each of the plurality of learning masks provided from the learning-side model inference unit 116, thereby generating a learning extraction signal representing at least the extracted sound.

Specifically, the learning-side signal extraction unit 117 multiplies the learning separation feature quantity and the learning acoustic feature quantity component by component and then applies inverse fast Fourier transform (IFFT) processing, thereby restoring a learning extraction signal, which is a signal in which the target sound to be taken out has been extracted. Here, since a plurality of learning masks are used, a plurality of learning extraction signals, each corresponding to a respective one of the plurality of learning masks, are restored.
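The component-wise product followed by an inverse FFT can be sketched as below. The framing parameters, the periodic Hann window, and the overlap-add resynthesis are assumptions made so that the example is self-contained; `stft` and `apply_mask_and_restore` are hypothetical names.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Framed FFT with a periodic Hann window (sums to 1 at 50% overlap)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def apply_mask_and_restore(features, mask, frame_len=512, hop=256):
    """Multiply features by a mask per component, then restore a waveform."""
    masked = features * mask                          # component-wise product
    frames = np.fft.irfft(masked, n=frame_len, axis=1)  # inverse FFT per frame
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame   # overlap-add
    return out
```

Calling `apply_mask_and_restore` once per mask yields one extraction signal per target sound. With an all-ones mask the interior of the input waveform is reproduced, which is a convenient sanity check for the framing.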

Using the learning signals provided from the learning-side input unit 111 and the processed learning mixed signal provided from the learning-side signal processing unit 113, the signal transformation unit 118 generates modified target sound signals by performing transformation processing that brings each of the plurality of target sounds contained in the learning signals closer to the sound corresponding to that target sound in the processed learning mixed signal. The generated modified target sound signals are provided to the model update unit 119.
For example, the signal transformation unit 118 performs, on a signal representing one target sound among the plurality of target sounds, transformation processing for bringing that one target sound closer to the corresponding one processed target sound, thereby generating, for each target sound, a modified target sound signal representing one modified target sound derived from that one target sound. Here, since there are a plurality of target sounds, a plurality of modified target sound signals, each corresponding to a respective one of the plurality of target sounds, are generated.

Specifically, when the learning signals contain three components, namely a first target sound, a second target sound, and a non-target sound, the signal transformation unit 118 sets a transformation f1 for transforming the signal representing the first target sound and a transformation f2 for transforming the signal representing the second target sound. The signal transformation unit 118 then determines the transformations f1 and f2 so as to minimize the difference between the processed learning mixed signal provided from the learning-side signal processing unit 113 and the signal obtained by adding the signal representing the first target sound and the signal representing the second target sound, which makes it possible to bring each of the first and second target sounds closer to the sound derived from the respective target sound contained in the processed learning mixed signal. As a result, applying the transformation f1 to the signal representing the first target sound generates the modified target sound signal corresponding to the first target sound, and applying the transformation f2 to the signal representing the second target sound generates the modified target sound signal corresponding to the second target sound.

Here, it is assumed that the first target sound, the second target sound, and the non-target sound have statistically different properties, in other words, that they are uncorrelated. Therefore, for example, by calculating a squared error as the difference between the processed learning mixed signal provided from the learning-side signal processing unit 113 and the signal obtained by adding the signal representing the first target sound and the signal representing the second target sound, each of the first and second target sounds can be brought closer to the sound derived from the respective target sound contained in the processed learning mixed signal. The specific structure of the signal transformation unit 118 will be described later.
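The determination of f1 and f2 by minimizing a squared error can be sketched as a joint least-squares problem. Several assumptions are made here that the text does not state: f1 and f2 are modelled as short FIR filters, the difference is taken after applying them to the target signals, and the whole signal is handled at once rather than block by block. The names `conv_matrix` and `estimate_transforms` are hypothetical.

```python
import numpy as np

def conv_matrix(s, n_taps):
    """Matrix whose column k is s delayed by k samples.

    conv_matrix(s, L) @ h equals np.convolve(s, h)[:len(s)] for len(h) == L.
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    cols = [np.concatenate([np.zeros(k), s[: n - k]]) for k in range(n_taps)]
    return np.stack(cols, axis=1)

def estimate_transforms(processed_mix, s1, s2, n_taps=8):
    """Find FIR filters h1, h2 minimizing ||processed_mix - (s1*h1 + s2*h2)||^2.

    Returns (modified target 1, modified target 2, h1, h2).
    """
    a1 = conv_matrix(s1, n_taps)
    a2 = conv_matrix(s2, n_taps)
    # Solve for both filters jointly in the least-squares sense.
    h, *_ = np.linalg.lstsq(np.hstack([a1, a2]), processed_mix, rcond=None)
    h1, h2 = h[:n_taps], h[n_taps:]
    return a1 @ h1, a2 @ h2, h1, h2
```

Because the non-target sound is assumed uncorrelated with both target sounds, its residue in the processed mixture acts like noise in the least-squares fit and does not bias the estimated filters on average, which is what lets a single squared-error criterion recover a separate transformation for each target sound.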

The model update unit 119 uses the plurality of learning extraction signals provided from the learning-side signal extraction unit 117 and the plurality of modified target sound signals provided from the signal transformation unit 118 to update the weight parameters contained in the learning-side sound source separation model stored in the learning-side sound source separation model storage unit 115.
For example, the model update unit 119 uses the plurality of learning extraction signals and the plurality of modified target sound signals to update the learning-side sound source separation model so that each sound extracted by the learning-side signal extraction unit 117 approaches the one modified target sound corresponding to the one target sound to be extracted.
Specifically, the model update unit 119 updates the learning-side sound source separation model so that the difference between the plurality of learning extraction signals and the plurality of modified target sound signals becomes smaller.

To update the weight parameters, for example, the result of calculating the difference between the output of the signal transformation unit 118 and the output of the learning-side signal extraction unit 117 is used together with a known optimization method such as stochastic gradient descent (SGD) or the Adam method.
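The update rule can be illustrated with a deliberately tiny stand-in for the NN. In the sketch below the "model" is just one sigmoid-activated weight per source and frequency bin, operating on magnitudes; this toy model, the learning rate, and the names `loss_and_grad` and `sgd_train` are assumptions for illustration, and only the use of a squared difference minimized by stochastic gradient descent follows the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(theta, mix_mag, targets):
    """Mean squared difference between extracted and modified target signals.

    theta: (n_src, n_bins) weights; mix_mag: (n_frames, n_bins);
    targets: (n_src, n_frames, n_bins) modified target magnitudes.
    """
    mask = sigmoid(theta)[:, None, :]          # one mask per source
    err = mask * mix_mag[None] - targets       # extracted minus modified target
    loss = np.mean(err ** 2)
    # Chain rule through the sigmoid, summed over frames.
    grad = np.sum(2.0 * err * mix_mag[None] * mask * (1.0 - mask), axis=1) / err.size
    return loss, grad

def sgd_train(theta, mix_mag, targets, lr=5.0, steps=300, batch=16, seed=0):
    """Plain SGD: each step uses a random mini-batch of frames."""
    rng = np.random.default_rng(seed)
    theta = theta.copy()
    for _ in range(steps):
        idx = rng.choice(mix_mag.shape[0], size=batch, replace=False)
        _, grad = loss_and_grad(theta, mix_mag[idx], targets[:, idx])
        theta -= lr * grad                     # move against the gradient
    return theta
```

In practice the gradient would be obtained by backpropagation through the actual network, and Adam could replace the plain SGD step without changing the loss being minimized.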

The learning-side communication unit 120 sends the learning-side sound source separation model stored in the learning-side sound source separation model storage unit 115 to the sound source separation device 130 as a utilization-side sound source separation model, which is the sound source separation model used in the sound source separation device 130.

Note that a configuration that includes neither the learning-side feature quantity extraction unit 114 nor the learning-side signal extraction unit 117 is also possible.
In this case, the learning-side model inference unit 116 extracts sounds from the processed learning mixed signal by using a learning-side sound source separation model for extracting the plurality of processed target sounds contained in the processed learning mixed signal provided from the learning-side signal processing unit 113, thereby generating learning extraction signals representing the extracted sounds.
In addition, the signal transformation unit 118 performs, on a signal representing the one target sound that corresponds to one processed target sound among the plurality of target sounds represented by the learning signals, transformation processing for bringing that one target sound closer to that one processed target sound, thereby generating, for each target sound, a modified target sound signal representing one modified target sound derived from that one target sound.
The model update unit 119 then uses the plurality of learning extraction signals and the plurality of modified target sound signals to update the learning-side sound source separation model so that each of the plurality of sounds extracted by the learning-side model inference unit 116 approaches the corresponding modified target sound among the plurality of modified target sounds.

FIG. 3 is a block diagram schematically showing the configuration of the signal transformation unit 118 in Embodiment 1.
The signal transformation unit 118 includes a mixed signal block division unit 118a, a learning signal block division unit 118b, a filter estimation unit 118c, a filter application unit 118d, and a block combination unit 118e.

The mixed signal block division unit 118a is a first block division unit that generates mixed block signals, which are signals obtained by dividing the processed learning mixed signal provided from the learning-side signal processing unit 113 into blocks, each block being a suitable section.
For example, the mixed signal block division unit 118a generates a plurality of mixed block signals by dividing the processed learning mixed signal into a plurality of blocks.
The mixed block signals are provided to the filter estimation unit 118c.

The division into blocks may be performed, for example, at regular time intervals.
The signal may also be divided so that adjacent blocks overlap.
However, the length of each block, expressed as a number of samples, must be set to exceed the length required for deriving the filter in the filter estimation unit 118c.
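The division rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_blocks` and the parameters `block_len`, `hop` and `min_len` are hypothetical, and dropping a trailing partial block is a choice made only for this sketch.

```python
def split_into_blocks(signal, block_len, hop, min_len):
    """Split a sample sequence into blocks of block_len samples.

    hop == block_len gives back-to-back blocks at regular intervals;
    hop < block_len gives overlapping blocks. min_len stands in for the
    length the filter estimation needs (hypothetical parameter).
    """
    if block_len <= min_len:
        raise ValueError("block_len must exceed the length needed for filter estimation")
    if not 0 < hop <= block_len:
        raise ValueError("hop must be between 1 and block_len")
    blocks = []
    start = 0
    # A trailing partial block is dropped in this sketch.
    while start + block_len <= len(signal):
        blocks.append(list(signal[start:start + block_len]))
        start += hop
    return blocks


samples = list(range(10))
non_overlapping = split_into_blocks(samples, block_len=4, hop=4, min_len=2)
overlapping = split_into_blocks(samples, block_len=4, hop=2, min_len=2)
```

Applying the same function with the same arguments to the learning signal yields target sound blocks aligned with the mixed blocks.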

The learning signal block division unit 118b is a second block division unit that extracts the signal of a target sound from the learning signal provided from the learning-side input unit 111 and generates target sound block signals, which are signals obtained by dividing that target sound signal into suitable sections.
For example, the learning signal block division unit 118b generates a plurality of target sound block signals by dividing the signal indicating one target sound into a plurality of blocks.
The target sound block signals are provided to the filter estimation unit 118c and the filter application unit 118d. The method of division into blocks is the same as that used in the mixed signal block division unit 118a.

The filter estimation unit 118c estimates a plurality of filters by estimating, for each of the plurality of target sound block signals, a filter that brings the sound indicated by that target sound block signal closer to the sound, among those indicated by the plurality of mixed block signals, that corresponds to the one target sound to be extracted.
For example, from the mixed block signals divided into blocks by the mixed signal block division unit 118a and the target sound block signals divided into blocks by the learning signal block division unit 118b, the filter estimation unit 118c generates, for each block and for each target sound, transformation parameters, which are the parameters of a filter that approximates the conversion from the sound indicated by the target sound block signal to the sound indicated by the mixed block signal. The filter may be, for example, an FIR (Finite Impulse Response) filter, an IIR (Infinite Impulse Response) filter, or a frequency-domain filter using the FFT.
Note that the transformation parameters may differ from block to block.

The filter application unit 118d generates a plurality of modified block signals by applying each of the plurality of filters estimated by the filter estimation unit 118c to the corresponding one of the plurality of target sound block signals.
For example, the filter application unit 118d generates, as a modified block signal, the signal obtained by applying to the target sound block signal provided from the learning signal block division unit 118b the transformation parameters that the filter estimation unit 118c estimated for that target sound block signal. The modified block signals are provided to the block combination unit 118e.

The block combination unit 118e generates a modified target sound signal, which is the signal obtained by combining the modified block signals provided from the filter application unit 118d. The modified target sound signal is provided to the model update unit 119 shown in FIG. 2.
Note that when the mixed signal block division unit 118a and the learning signal block division unit 118b divide the signals so that adjacent blocks overlap, the block combination unit 118e may resolve the overlap by, for example, computing a weighted sum.
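One way to realize the weighted-sum combination is a normalized overlap-add, sketched below. The function name `combine_blocks` and its parameters are hypothetical, and normalizing by the summed weights (so that overlapping samples become a weighted average) is an assumption of this sketch rather than something the text prescribes.

```python
def combine_blocks(blocks, hop, weights=None):
    """Recombine equal-length blocks spaced hop samples apart.

    Samples covered by more than one block become a weighted average.
    weights must be positive; uniform weights give a plain average.
    """
    block_len = len(blocks[0])
    if not 0 < hop <= block_len:
        raise ValueError("hop must be between 1 and the block length")
    if weights is None:
        weights = [1.0] * block_len
    total = hop * (len(blocks) - 1) + block_len
    acc = [0.0] * total    # weighted sum of samples
    norm = [0.0] * total   # sum of weights at each position
    for b, block in enumerate(blocks):
        for k, value in enumerate(block):
            acc[b * hop + k] += weights[k] * value
            norm[b * hop + k] += weights[k]
    return [a / w for a, w in zip(acc, norm)]


combined = combine_blocks([[1, 1, 1, 1], [3, 3, 3, 3]], hop=2)
```

A tapered window (for example triangular weights) would smooth the transition between blocks whose filters differ.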

The mixed signal block division unit 118a, the learning signal block division unit 118b, and the block combination unit 118e may be omitted. That is, the entire signal may be treated as a single block.
In such a case, the filter estimation unit 118c estimates a filter that brings one target sound indicated by the learning signal closer to the processed target sound that corresponds to it among the plurality of processed target sounds indicated by the processed learning mixed signal.
The filter application unit 118d then generates the modified target sound signal by applying the filter estimated by the filter estimation unit 118c to the signal indicating that target sound in the learning signal.

FIG. 4 is a block diagram schematically showing the hardware configuration of the sound source separation model learning device 110.
The sound source separation model learning device 110 can be configured by a computer 150 that includes a storage device 151, a memory 152, a processor 153, and a communication interface (hereinafter, communication I/F) 154.

The storage device 151 stores the programs and data necessary for the processing performed by the sound source separation model learning device 110.
The memory 152 provides a work area in which the processor 153 operates.
The processor 153 loads the programs and data stored in the storage device 151 into the memory 152 and executes the processing.
The communication I/F 154 communicates with the sound source separation device 130.

For example, the mixed signal generation unit 112, the learning-side signal processing unit 113, the learning-side feature quantity extraction unit 114, the learning-side model inference unit 116, the learning-side signal extraction unit 117, the signal transformation unit 118, and the model update unit 119 can be realized by the processor 153 loading the programs and data stored in the storage device 151 into the memory 152 and executing the programs.
The learning-side sound source separation model storage unit 115 can be realized by the storage device 151.
The learning-side input unit 111 and the learning-side communication unit 120 can be realized by the communication I/F 154.

Such programs may be provided over a network or may be provided recorded on a recording medium. That is, such programs may be provided, for example, as a program product.
The sound source separation model learning device 110 may be realized by programs as described above, or may be realized by configuring a circuit for each function executed by the sound source separation model learning device 110 and connecting those circuits.
In other words, the sound source separation model learning device 110 can also be realized by processing circuitry.

FIG. 5 is a block diagram schematically showing the configuration of the sound source separation device 130.
The sound source separation device 130 includes a utilization-side communication unit 131, a utilization-side sound source separation model storage unit 132, a utilization-side input unit 133, a utilization-side signal processing unit 134, a utilization-side feature quantity extraction unit 135, a utilization-side model inference unit 136, a utilization-side signal extraction unit 137, and a utilization-side output unit 138.

The utilization-side communication unit 131 communicates with the sound source separation model learning device 110. For example, the utilization-side communication unit 131 receives the utilization-side sound source separation model from the sound source separation model learning device 110 and stores it in the utilization-side sound source separation model storage unit 132.

The utilization-side sound source separation model storage unit 132 stores the utilization-side sound source separation model.
The utilization-side input unit 133 receives input of the target mixed signal. The input target mixed signal is provided to the utilization-side signal processing unit 134.
The target mixed signal may be stored in advance in the sound source separation device 130, may be acquired by an acoustic device such as a microphone, described later, or may be acquired from a telephone line or the like via a communication I/F. In such cases, the utilization-side input unit 133 can be omitted.

The utilization-side signal processing unit 134 performs predetermined processing on the target mixed signal, which indicates at least a plurality of target sounds, thereby generating a processed target mixed signal that indicates at least a plurality of processed target sounds derived from the plurality of target sounds.
For example, the utilization-side signal processing unit 134 generates the processed target mixed signal by applying various kinds of signal processing to the target mixed signal provided from the utilization-side input unit 133 so that the target sounds become easier to extract. The processing performed here is the same as that performed by the learning-side signal processing unit 113 of the sound source separation model learning device 110. The processed target mixed signal is provided to the utilization-side feature quantity extraction unit 135.

The utilization-side feature quantity extraction unit 135 extracts acoustic feature quantities from the processed target mixed signal provided from the utilization-side signal processing unit 134 and generates utilization feature data, which is time-series data of the extracted acoustic feature quantities.
For example, the utilization-side feature quantity extraction unit 135 extracts from the processed target mixed signal, in a plurality of components, utilization acoustic feature quantities, which are predetermined acoustic feature quantities, thereby generating the utilization feature data as time-series data of the extracted utilization acoustic feature quantities.
The processing performed here is the same as that performed by the learning-side feature quantity extraction unit 114 of the sound source separation model learning device 110. The utilization feature data is provided to the utilization-side model inference unit 136.

The utilization-side model inference unit 136 uses the utilization-side sound source separation model to extract, from the utilization feature data provided from the utilization-side feature quantity extraction unit 135, the utilization separation feature quantities, which are the separation feature quantities needed to perform sound source separation. The processing performed here is the same as that performed by the learning-side model inference unit 116 of the sound source separation model learning device 110.
The utilization-side model inference unit 136 then provides a mask, which is time-series data of the extracted utilization separation feature quantities, to the utilization-side signal extraction unit 137 as a utilization mask.
In other words, in order to extract the plurality of processed target sounds, the utilization-side model inference unit 136 uses the utilization-side sound source separation model, which indicates a weight for each of the plurality of components of the utilization feature data, to generate, for each target sound, a utilization mask for extracting one processed target sound from the utilization feature data. As a result, a plurality of utilization masks are generated, one for each of the plurality of target sounds.

The utilization-side signal extraction unit 137 extracts the desired acoustic signal using the utilization feature data, which is the time-series data of the acoustic feature quantities extracted by the utilization-side feature quantity extraction unit 135, and the utilization mask, which is the time-series data of the utilization separation feature quantities estimated by the utilization-side model inference unit 136.
For example, the utilization-side signal extraction unit 137 uses the utilization mask to extract a sound from the utilization feature data, thereby generating a utilization extraction signal indicating at least the extracted sound.
The processing performed here is the same as that performed by the learning-side signal extraction unit 117 of the sound source separation model learning device 110. The utilization-side signal extraction unit 137 then provides the utilization extraction signal, which is the extracted acoustic signal, to the utilization-side output unit 138 as an output signal.

The utilization-side output unit 138 outputs the output signal provided from the utilization-side signal extraction unit 137.
Note that one or both of the utilization-side feature quantity extraction unit 135 and the utilization-side signal extraction unit 137 may be omitted. For example, when neither the utilization-side feature quantity extraction unit 135 nor the utilization-side signal extraction unit 137 is included, the utilization-side model inference unit 136 processes the processed target mixed signal output from the utilization-side signal processing unit 134 and directly outputs the separated sound signals. In other words, the utilization-side model inference unit 136 uses a utilization-side sound source separation model for extracting the plurality of processed target sounds indicated by the processed target mixed signal provided from the utilization-side signal processing unit 134, and extracts sounds from the processed target mixed signal, thereby generating utilization extraction signals indicating the extracted sounds.

FIG. 6 is a block diagram schematically showing the hardware configuration of the sound source separation device 130.
The sound source separation device 130 can be configured by a computer 160 that includes a storage device 161, a memory 162, a processor 163, a communication I/F 164, and an acoustic interface (hereinafter, acoustic I/F) 165.

The storage device 161 stores the programs and data necessary for the processing performed by the sound source separation device 130.
The memory 162 provides a work area in which the processor 163 operates.
The processor 163 loads the programs and data stored in the storage device 161 into the memory 162 and executes the processing.
The communication I/F 164 communicates with the sound source separation model learning device 110.
The acoustic I/F 165 receives input of the target mixed signal. The target mixed signal may be generated by an acoustic device that collects sound including the target sounds and generates the target mixed signal.

For example, the utilization-side signal processing unit 134, the utilization-side feature quantity extraction unit 135, the utilization-side model inference unit 136, the utilization-side signal extraction unit 137, and the utilization-side output unit 138 can be realized by the processor 163 loading the programs and data stored in the storage device 161 into the memory 162 and executing the programs.
The utilization-side sound source separation model storage unit 132 can be realized by the storage device 161.
The utilization-side input unit 133 can be realized by the acoustic I/F 165.
The utilization-side communication unit 131 can be realized by the communication I/F 164.

Such programs may be provided over a network or may be provided recorded on a recording medium. That is, such programs may be provided, for example, as a program product.
The sound source separation device 130 may be realized by programs as described above, or may be realized by configuring a circuit for each function executed by the sound source separation device 130 and connecting those circuits.
In other words, the sound source separation device 130 can also be realized by processing circuitry.

Next, the operation is described. First, the operation of the sound source separation model learning device 110 is described.
FIG. 7 is a flowchart showing the operation of the sound source separation model learning device 110.

First, the mixed signal generation unit 112 creates, from the learning signals, a learning mixed signal, which is the mixed signal used for learning (S10). The learning mixed signal is created to simulate the target mixed signal input to the utilization-side signal processing unit 134 of the sound source separation device 130. The learning mixed signal may be generated, for example, by simply adding the signals of the plurality of target sounds and the signals of the non-target sounds contained in the learning signals. Alternatively, to simulate recording with a microphone array, the learning mixed signal may be generated by convolving each signal obtained from the learning signals with an impulse response of the microphone array and then adding the resulting signals.
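The two ways of forming the learning mixed signal in step S10 can be sketched as below. The helper names `convolve` and `mix` are hypothetical, and the sketch assumes a single output channel with one impulse response per source; a real microphone array would need one impulse response per source and microphone pair.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


def mix(signals, impulse_responses=None):
    """Add source signals, optionally convolving each with its impulse response first."""
    if impulse_responses is not None:
        signals = [convolve(s, h) for s, h in zip(signals, impulse_responses)]
    mixture = [0.0] * max(len(s) for s in signals)
    for s in signals:
        for t, x in enumerate(s):
            mixture[t] += x
    return mixture


target_sound = [1.0, 2.0, 3.0]
non_target_sound = [10.0, 20.0, 30.0]
simple_mixture = mix([target_sound, non_target_sound])
# Hypothetical impulse responses: a short echo for the first source, a pass-through for the second.
simulated_mixture = mix([target_sound, non_target_sound], [[1.0, 0.5], [1.0]])
```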

Next, the learning-side signal processing unit 113 applies various kinds of signal processing to the learning mixed signal provided from the mixed signal generation unit 112 (S11). The processing here is the same as that of the utilization-side signal processing unit 134 of the sound source separation device 130.

Next, the signal transformation unit 118 converts each target sound obtained from the learning signals into a form that imitates the target sound contained in the processed learning mixed signal provided from the learning-side signal processing unit 113, thereby generating a modified target sound signal for each target sound (S12). Details of the processing in step S12 are described later.

Next, the learning-side feature quantity extraction unit 114 extracts learning acoustic feature quantities, which are acoustic feature quantities, from the processed learning mixed signal provided from the learning-side signal processing unit 113 and arranges them as time-series data, thereby generating learning feature data (S13). As the acoustic feature quantity, for example, the complex spectrum obtained by applying the FFT to the processed learning mixed signal from the learning-side signal processing unit 113 is used. The processing here is the same as that of the utilization-side feature quantity extraction unit 135 of the sound source separation device 130.

Next, the learning-side model inference unit 116 uses the learning sound source separation model to extract, from the acoustic feature quantities extracted by the learning-side feature quantity extraction unit 114, the learning separation feature quantities, which are the separation feature quantities needed to separate and synthesize each sound source signal, and generates a mask, which is time-series data of the learning separation feature quantities (S14). A mask is generated for each sound source signal, in other words, for each target sound. The processing here is the same as that of the utilization-side model inference unit 136 of the sound source separation device 130.

Next, the learning-side signal extraction unit 117 uses the acoustic feature quantities extracted by the learning-side feature quantity extraction unit 114 and the learning separation feature quantities extracted by the learning-side model inference unit 116 to extract learning extraction signals, which are signals of sounds obtained by processing the target sounds contained in the learning mixed signal (S15). For example, the learning-side signal extraction unit 117 multiplies the learning separation feature quantities and the learning acoustic feature quantities component by component and then applies an inverse Fourier transform, thereby restoring, for each target sound, a learning extraction signal, which is a signal of the extracted sound derived from the desired target sound. The processing here is the same as that of the utilization-side signal extraction unit 137 of the sound source separation device 130.
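The component-wise product followed by an inverse Fourier transform can be sketched for a single frame as follows. The function name `extract_with_mask` is hypothetical, and the hand-built binary mask merely stands in for the mask that the separation model would output; the text describes a time series of such frames, which this sketch does not cover.

```python
import numpy as np


def extract_with_mask(frame, mask):
    """Apply a per-frequency mask to one frame and return the time-domain result."""
    spectrum = np.fft.rfft(frame)       # complex spectrum used as the acoustic feature
    masked = mask * spectrum            # component-wise product with the mask
    return np.fft.irfft(masked, n=len(frame))


n = 64
t = np.arange(n)
target = np.sin(2 * np.pi * 4 * t / n)        # falls exactly on frequency bin 4
interferer = np.sin(2 * np.pi * 12 * t / n)   # falls exactly on frequency bin 12
mixture = target + interferer

mask = np.zeros(n // 2 + 1)
mask[4] = 1.0                                  # keep only the target's bin (stand-in for a model output)
estimate = extract_with_mask(mixture, mask)
```

With one mask per target sound, calling the function once per mask yields one extraction signal per target sound.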

Next, the model update unit 119 calculates the error between the plurality of modified target sound signals provided from the signal transformation unit 118 and the plurality of learning extraction signals provided from the learning-side signal extraction unit 117, and then updates the weight parameters of the learning sound source separation model so as to reduce that error (S16).

Next, the operation of the signal transformation unit 118 is described.
FIG. 8 is a flowchart showing the operation of the signal transformation unit 118 in Embodiment 1.
First, the mixed signal block division unit 118a generates mixed block signals by dividing the processed learning mixed signal provided from the learning-side signal processing unit 113 into one or more blocks on the time axis (S20).

Next, the learning signal block division unit 118b generates target sound block signals by dividing the learning signal provided from the learning-side input unit 111 into one or more blocks on the time axis (S21). The division method used by the learning signal block division unit 118b is the same as that used by the mixed signal block division unit 118a in step S20.

Next, the filter estimation unit 118c estimates the filters (S22).
Here, a case is described as an example in which the processed learning mixed signal and the learning signals are all single-channel acoustic signals, and the mixed signal generation unit 112 has acquired signals indicating n target sounds as the learning signals and created the mixed signal. Here, n is an integer of 1 or more.

Let $y(t)$ be the mixed block signal obtained from the mixed signal block division unit 118a. Here, $t$ is an integer satisfying $t = 0, \ldots, T-1$ ($T$ is an integer of 2 or more).
Let $s_i(t)$ be the target sound block signal of the i-th target sound obtained from the learning signal block division unit 118b. Here, $i$ is an integer satisfying $1 \le i \le n$.
Furthermore, when the filter calculated by the filter estimation unit 118c is an FIR filter of length $L$, let $h_i(\tau)$ be the coefficients of the FIR filter for the i-th target sound. Here, $\tau$ is an integer satisfying $\tau = 0, \ldots, L-1$.
The mixed block signal $y(t)$ is then approximated by equation (1) below.

$$y(t) \approx \sum_{i=1}^{n} \sum_{\tau=0}^{L-1} h_i(\tau)\, s_i(t-\tau) \tag{1}$$

Now consider the case where the approximation of equation (1) holds best under the squared-error criterion.
That is, consider the case where $h_i(\tau)$ minimizes the error function of equation (2) below.

$$E = \sum_{t=L-1}^{T-1} \left( y(t) - \sum_{i=1}^{n} \sum_{\tau=0}^{L-1} h_i(\tau)\, s_i(t-\tau) \right)^{2} \tag{2}$$

このようなｈ_ｉ（τ）を求めるための手段として、まず、下記の（３）式に示されている行列Ｓ_ｉ∈Ｒ^{（（Ｔ－Ｌ＋１）×Ｌ）}を定義する。

As a means of obtaining such h_i(τ), first, the matrix S_i ∈ R^{((T−L+1)×L)} shown in the following equation (3) is defined.

このとき、（２）式は、下記の（４）式で示す行列形式で表現することができる。

At this time, equation (2) can be expressed in a matrix form as shown in equation (4) below.

ここで、ｙは下記の（５）式、ｈ_ｉは下記の（６）式、Ｓは下記の（７）式、ｈは、下記の（８）式で表せる。

Here, y is given by the following equation (5), h_i by the following equation (6), S by the following equation (7), and h by the following equation (8).
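Equations (3) to (8) are likewise missing from this text. A reconstruction that is consistent with the stated size of S_i and with a least-squares reading of equation (2) is sketched below; the exact row ordering and the stacking order of S and h are assumptions.

```latex
\begin{align}
S_i &= \begin{bmatrix}
  s_i(L-1) & s_i(L-2) & \cdots & s_i(0) \\
  s_i(L)   & s_i(L-1) & \cdots & s_i(1) \\
  \vdots   & \vdots   & \ddots & \vdots \\
  s_i(T-1) & s_i(T-2) & \cdots & s_i(T-L)
\end{bmatrix} \in \mathbb{R}^{(T-L+1)\times L} \tag{3} \\
E &= \lVert y - S h \rVert_2^{2} \tag{4} \\
y &= \bigl[\, y(L-1),\; y(L),\; \ldots,\; y(T-1) \,\bigr]^{\mathsf{T}} \tag{5} \\
h_i &= \bigl[\, h_i(0),\; h_i(1),\; \ldots,\; h_i(L-1) \,\bigr]^{\mathsf{T}} \tag{6} \\
S &= \bigl[\, S_1,\; S_2,\; \ldots,\; S_n \,\bigr] \tag{7} \\
h &= \bigl[\, h_1^{\mathsf{T}},\; h_2^{\mathsf{T}},\; \ldots,\; h_n^{\mathsf{T}} \,\bigr]^{\mathsf{T}} \tag{8}
\end{align}
```

With this layout, the row of S_i for time t multiplied by h_i gives the inner sum over τ for the i-th target sound, so S h reproduces the double sum over i and τ.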

このとき、ｙを最小二乗誤差規範で最も良く近似するフィルタｈ_ｉは、下記の（９）式で示される最適化問題の解となる。

At this time, the filter h_i that best approximates y under the least-squares error criterion is the solution of the optimization problem expressed by the following equation (9).

そして、（９）式の最適化問題の解は、下記の（１０）式で示される。

The solution of the optimization problem of formula (9) is given by formula (10) below.
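Equations (9) and (10) are not reproduced in this text. Assuming the problem is the ordinary least-squares fit of S h to y, they would take the standard form below; this is a reconstruction, not the published equations.

```latex
\begin{align}
\hat{h} &= \operatorname*{arg\,min}_{h} \; \lVert y - S h \rVert_2^{2} \tag{9} \\
\hat{h} &= \bigl( S^{\mathsf{T}} S \bigr)^{-1} S^{\mathsf{T}} y \tag{10}
\end{align}
```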

このような手順により、ｙ（ｔ）をよく近似するＦＩＲフィルタの係数ｈ_ｉ（ｔ）が求められる。 By such a procedure, the FIR filter coefficients h_i(t) that closely approximate y(t) are obtained.
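The estimation procedure above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the rows of each S_i are taken to cover t = L−1, ..., T−1, and `np.linalg.lstsq` is used in place of forming the explicit inverse of S^T S.

```python
import numpy as np

def build_convolution_matrix(s, L):
    """Return the (T-L+1) x L matrix whose row for time t is [s(t), s(t-1), ..., s(t-L+1)]."""
    s = np.asarray(s, dtype=float)
    T = len(s)
    # Column tau holds s(t - tau) for t = L-1, ..., T-1.
    return np.stack([s[L - 1 - tau: T - tau] for tau in range(L)], axis=1)

def estimate_fir_filters(y, targets, L):
    """Least-squares estimate of one length-L FIR filter per target sound (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    # S = [S_1, ..., S_n]: one convolution matrix per target sound, side by side.
    S = np.hstack([build_convolution_matrix(s, L) for s in targets])
    h, *_ = np.linalg.lstsq(S, y[L - 1:], rcond=None)
    return h.reshape(len(targets), L)

# Synthetic check: two target sounds, each passed through a known 3-tap filter.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(400), rng.standard_normal(400)
h_true = np.array([[0.8, -0.3, 0.1], [0.5, 0.2, -0.4]])
y = np.convolve(s1, h_true[0])[:400] + np.convolve(s2, h_true[1])[:400]
h_est = estimate_fir_filters(y, [s1, s2], L=3)
```

In this noise-free synthetic case the estimated coefficients match the filters used to build y; with real processed mixtures the fit is only approximate, since signal processing is generally not an exact FIR operation.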

なお、行列Ｓ^ＴＳは、しばしば条件数が大きく、数値安定的に最適化問題の解を得られない可能性がある。このため、下記の（１１）式に示されているように、修正した最適化問題が解かれてもよい。

Note that the matrix S^T S often has a large condition number, so the solution of the optimization problem may not be obtained in a numerically stable manner. Therefore, a modified optimization problem may be solved instead, as shown in equation (11) below.

（１１）式で示される最適化問題の解は、下記の（１２）式で示される。

The solution of the optimization problem represented by equation (11) is represented by equation (12) below.
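Equations (11) and (12) are not reproduced in this text. Given the description of λ and of the matrix S^T S + λI_NL, they are presumably the usual L2-regularized (ridge) least-squares problem and its closed-form solution. The reconstruction below assumes that form, and also assumes that N in I_NL equals the number of target sounds n.

```latex
\begin{align}
\hat{h} &= \operatorname*{arg\,min}_{h} \; \lVert y - S h \rVert_2^{2} + \lambda \lVert h \rVert_2^{2} \tag{11} \\
\hat{h} &= \bigl( S^{\mathsf{T}} S + \lambda I_{NL} \bigr)^{-1} S^{\mathsf{T}} y \tag{12}
\end{align}
```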

ここで、λは、任意に定めるハイパーパラメタであり、Ｉ_ＮＬは、サイズＮＬの単位行列である。
行列Ｓ^ＴＳと、Ｓ^ＴＳ＋λＩ_ＮＬとを比較すると、後者の方はより条件数が小さく、安定的に逆行列を計算することができる。 Here, λ is an arbitrarily chosen hyperparameter, and I_NL is the identity matrix of size NL.
Comparing the matrix S^T S with the matrix S^T S + λI_NL, the latter has a smaller condition number, so its inverse can be computed stably.
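The effect of the λ term on the condition number can be checked numerically. The sketch below uses a deliberately ill-conditioned, made-up matrix S with two nearly identical columns; the function name and the value of λ are illustrative assumptions.

```python
import numpy as np

def ridge_fir_solution(S, y, lam):
    """Solve (S^T S + lam * I) h = S^T y, the regularized normal equations."""
    gram = S.T @ S
    return np.linalg.solve(gram + lam * np.eye(gram.shape[0]), S.T @ y)

# Hypothetical ill-conditioned case: the second column almost duplicates the first.
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
S = np.column_stack([base, base + 1e-6 * rng.standard_normal(200)])
y = S @ np.array([1.0, 1.0])

lam = 1e-3
cond_plain = np.linalg.cond(S.T @ S)
cond_ridge = np.linalg.cond(S.T @ S + lam * np.eye(2))
h = ridge_fir_solution(S, y, lam)
```

Adding λ to every eigenvalue of the positive semi-definite matrix S^T S raises the smallest eigenvalue proportionally more than the largest, which is why `cond_ridge` comes out smaller than `cond_plain`. The price is a small bias in the estimated coefficients, which grows with λ.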

なお、上記ではｙ（ｔ）及びｓ_ｉ（τ）は、学習用信号及び処理済学習用混合信号が、例えば、１つのマイクロホンのような単一の音響装置から取得された信号のように単一チャネルの信号であることを仮定していたが、実施の形態１はこのような例に限定されない。
例えば、学習用信号及び処理済学習用混合信号が、複数のマイクロホンを備えたマイクロホンアレイを用いて取得された多チャネルの信号であってもよい。この場合、フィルタ推定部１１８ｃが、多チャネルの目的音ブロック信号を受け取った場合には、代表的なチャネルの目的音ブロック信号を選択して、上記のフィルタ係数の計算を行えばよい。また、フィルタ推定部１１８ｃが、多チャンネルの混合ブロック信号を受け取った場合でも、代表的な混合ブロック信号を選択して、上記のフィルタ係数の計算を行えば良い。 Note that the above description assumes that y(t) and s_i(τ) are single-channel signals, that is, that the learning signal and the processed learning mixed signal are signals acquired from a single acoustic device such as one microphone; however, Embodiment 1 is not limited to such an example.
For example, the learning signal and the processed learning mixed signal may be multi-channel signals acquired using a microphone array having a plurality of microphones. In this case, when the filter estimation unit 118c receives multi-channel target sound block signals, it may select the target sound block signal of a representative channel and calculate the above filter coefficients. Likewise, when the filter estimation unit 118c receives multi-channel mixed block signals, it may select a representative mixed block signal and calculate the above filter coefficients.

次に、フィルタ適用部１１８ｄは、ステップＳ２２でブロック毎に推定されたフィルタを、ステップＳ２０で生成された目的音ブロック信号に適用することで、変形ブロック信号を生成する（Ｓ２３）。 Next, the filter application unit 118d applies the filter estimated for each block in step S22 to the target sound block signal generated in step S20, thereby generating a modified block signal (S23).

最後に、ブロック結合部１１８ｅは、ブロック毎に分割された状態の変形ブロック信号を接合して、変形目的音信号を生成する（Ｓ２４）。 Finally, the block combining unit 118e joins the modified block signals, which are divided into blocks, to generate a modified target sound signal (S24).
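Steps S20 to S24 can be sketched end to end as follows. This is an illustrative sketch only: it assumes non-overlapping blocks of equal length (the actual division method is defined elsewhere in the description), a signal length that is a multiple of the block length, and hypothetical function names.

```python
import numpy as np

def conv_matrix(s, L):
    """(T-L+1) x L matrix whose row for time t is [s(t), s(t-1), ..., s(t-L+1)]."""
    return np.stack([s[L - 1 - tau: len(s) - tau] for tau in range(L)], axis=1)

def transform_targets(y, targets, block_len, L):
    """Fit and apply one FIR filter per target sound in each block, then rejoin the blocks."""
    y = np.asarray(y, dtype=float)
    targets = [np.asarray(s, dtype=float) for s in targets]
    if len(y) % block_len or block_len < L:
        raise ValueError("sketch assumes len(y) is a multiple of block_len and block_len >= L")
    n = len(targets)
    pieces = [[] for _ in range(n)]
    for k in range(0, len(y), block_len):
        yb = y[k:k + block_len]                              # S20: mixed block signal
        sbs = [s[k:k + block_len] for s in targets]          # S21: target sound block signals
        S = np.hstack([conv_matrix(sb, L) for sb in sbs])
        h, *_ = np.linalg.lstsq(S, yb[L - 1:], rcond=None)   # S22: per-block filter estimate
        for i, (sb, h_i) in enumerate(zip(sbs, h.reshape(n, L))):
            pieces[i].append(np.convolve(sb, h_i)[:block_len])  # S23: modified block signal
    return [np.concatenate(p) for p in pieces]               # S24: modified target sound signals
```

Because a separate filter is fitted in every block, the modified target sound signal can follow changes in the signal processing from one block to the next, at the cost of ignoring filter state across block boundaries in this simplified version.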

図９は、音源分離装置１３０の動作を示すフローチャートである。
まず、活用側信号処理部１３４が、入力された対象混合信号に対し、各種の信号処理を適用して処理済対象混合信号を生成する（Ｓ３０）。 FIG. 9 is a flowchart showing the operation of the sound source separation device 130.
First, the utilization-side signal processing unit 134 applies various types of signal processing to the input target mixed signal to generate a processed target mixed signal (S30).

次に、活用側特徴量抽出部１３５は、活用側信号処理部１３４から与えられる処理済対象混合信号から音響特徴量を抽出し、抽出された音響特徴量の時系列データである活用特徴データを生成する（Ｓ３１）。 Next, the utilization-side feature amount extraction unit 135 extracts acoustic feature amounts from the processed target mixed signal given from the utilization-side signal processing unit 134, and generates utilization feature data, which is time-series data of the extracted acoustic feature amounts (S31).

次に、活用側モデル推論部１３６は、活用音源分離モデルを用いて、活用側特徴量抽出部１３５にて抽出された音響特徴量から、各音源信号を分離合成するために必要となる分離用特徴量の時系列データである活用マスクを、目的音毎に生成する（Ｓ３２）。 Next, the utilization-side model inference unit 136 uses the utilization sound source separation model to generate, for each target sound, a utilization mask from the acoustic feature amounts extracted by the utilization-side feature amount extraction unit 135; the utilization mask is time-series data of the separation feature amounts needed to separate and synthesize each sound source signal (S32).

次に、活用側信号抽出部１３７が、活用側特徴量抽出部１３５にて抽出された活用音響特徴量と、活用側モデル推論部１３６にて抽出された分離用特徴量とを用いて、対象混合信号の中に含まれる目的音の信号である出力信号を、目的音毎に生成する（Ｓ３３）。 Next, the utilization-side signal extraction unit 137 uses the utilization acoustic feature amounts extracted by the utilization-side feature amount extraction unit 135 and the separation feature amounts extracted by the utilization-side model inference unit 136 to generate, for each target sound, an output signal that is the signal of that target sound contained in the target mixed signal (S33).
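Steps S31 to S33 can be illustrated with a small mask-based sketch. Everything specific here is an assumption made for illustration: the features are taken to be FFTs of non-overlapping frames, the mask is applied by element-wise multiplication, and `mask_fn` is a hypothetical stand-in for the trained sound source separation model. The signal processing of step S30 is omitted.

```python
import numpy as np

def extract_features(x, frame_len):
    """S31 (assumed form): split x into non-overlapping frames and take the real FFT of each."""
    usable = len(x) // frame_len * frame_len
    frames = np.asarray(x[:usable], dtype=float).reshape(-1, frame_len)
    return np.fft.rfft(frames, axis=1)

def separate(x, mask_fn, frame_len=256):
    """Return one time-domain output signal per target sound."""
    feats = extract_features(x, frame_len)
    masks = mask_fn(feats)                      # S32: one mask per target sound (hypothetical model)
    outputs = []
    for mask in masks:
        masked = mask * feats                   # S33: apply the mask to the mixture features
        frames = np.fft.irfft(masked, n=frame_len, axis=1)
        outputs.append(frames.reshape(-1))
    return outputs
```

For example, a `mask_fn` that returns two constant masks summing to one splits the input into two scaled copies whose sum reconstructs the (frame-aligned) input. A trained model would instead produce time- and frequency-dependent masks.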

次に、音源分離モデル学習装置１１０の動作例について述べる。
図１０は、音源分離モデル学習装置１１０の動作を示す概念図である。
第１の信号１７０は、学習用信号から取得された第１の目的音を示す信号、第２の信号１７１は、学習用信号から取得された第２の目的音を示す信号であり、第３の信号１７２は、学習用信号から取得された非目的音を示す信号である。Next, an operation example of the sound source separation model learning device 110 will be described.
FIG. 10 is a conceptual diagram showing the operation of the sound source separation model learning device 110.
A first signal 170 is a signal indicating the first target sound obtained from the learning signal, a second signal 171 is a signal indicating the second target sound obtained from the learning signal, and a third signal 172 is a signal indicating the non-target sound obtained from the learning signal.

混合信号生成部１１２は、例えば、第１の信号１７０、第２の信号１７１及び第３の信号１７２を単純加算することで、疑似的な学習用混合信号１７３を作成する。
学習用混合信号１７３には、第１の信号１７０に由来する第１の成分１７０＃１、第２の信号１７１に由来する第２の成分１７１＃１、及び、第３の信号１７２に由来する第３の成分１７２＃１が含まれる。The mixed signal generator 112 creates a pseudo learning mixed signal 173 by simply adding the first signal 170 , the second signal 171 and the third signal 172 , for example.
The learning mixed signal 173 includes a first component 170#1 derived from the first signal 170, a second component 171#1 derived from the second signal 171, and a third component 172#1 derived from the third signal 172.

学習用混合信号１７３が学習側信号処理部１１３を通過することで、処理済学習用混合信号１７３＃が得られる。この際、第１の目的音に由来する第１の成分１７０＃１は、第４の成分１７０＃２のように、第２の目的音に由来する第２の成分１７１＃１は、第５の成分１７１＃２のように、非目的音に由来する第３の成分１７２＃１は、第６の成分１７２＃２のように、処理済学習用混合信号１７３＃の中で現れる。 When the learning mixed signal 173 passes through the learning-side signal processing unit 113, the processed learning mixed signal 173# is obtained. At this time, in the processed learning mixed signal 173#, the first component 170#1 derived from the first target sound appears as the fourth component 170#2, the second component 171#1 derived from the second target sound appears as the fifth component 171#2, and the third component 172#1 derived from the non-target sound appears as the sixth component 172#2.

処理済学習用混合信号１７３＃に対して、第１の目的音及び第２の目的音に対応する音を抽出するために、学習側特徴量抽出部１１４、学習側モデル推論部１１６及び学習側信号抽出部１１７での処理を適用することで、第１の目的音に対応する第１の学習用抽出信号１７４及び第２の目的音に対応する第２の学習用抽出信号１７５が得られる。 In order to extract sounds corresponding to the first target sound and the second target sound from the processed learning mixed signal 173#, the learning side feature quantity extraction unit 114, the learning side model inference unit 116, and the learning side By applying the processing in the signal extraction unit 117, a first extraction signal for learning 174 corresponding to the first target sound and a second extraction signal for learning 175 corresponding to the second target sound are obtained.

さらに、信号変形部１１８は、第１の信号１７０及び第２の信号１７１と、処理済学習用混合信号１７３＃とにより、第１の信号１７０を第４の成分１７０＃２へ変化させるフィルタ及び第２の信号１７１を第５の成分１７１＃２へ変化させるフィルタを推定する。そして、信号変形部１１８は、第１の信号１７０及び第２の信号１７１にそれぞれのフィルタを適用して、第１の変形目的音信号１７６及び第２の変形目的音信号１７７を生成する。 Further, the signal transformation unit 118 uses the first signal 170, the second signal 171, and the processed learning mixed signal 173# to estimate a filter that changes the first signal 170 into the fourth component 170#2 and a filter that changes the second signal 171 into the fifth component 171#2. Then, the signal transformation unit 118 applies the respective filters to the first signal 170 and the second signal 171 to generate a first modified target sound signal 176 and a second modified target sound signal 177.

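モデル更新部１１９は、第１の学習用抽出信号１７４及び第２の学習用抽出信号１７５の組が、第１の変形目的音信号１７６及び第２の変形目的音信号１７７の組に近づくよう、学習用音源分離モデルのパラメタを更新する。 The model update unit 119 updates the parameters of the learning sound source separation model so that the pair of the first learning extraction signal 174 and the second learning extraction signal 175 approaches the pair of the first modified target sound signal 176 and the second modified target sound signal 177.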

次に、音源分離モデル学習装置１１０により学習された音源分離モデルを用いる際の、音源分離装置１３０の動作例について述べる。
図１１（Ａ）及び（Ｂ）は、音源分離装置１３０の動作例を説明するための概略図である。Next, an operation example of the sound source separation device 130 when using the sound source separation model learned by the sound source separation model learning device 110 will be described.
FIGS. 11(A) and 11(B) are schematic diagrams for explaining an operation example of the sound source separation device 130.

図１１（Ａ）は、音源分離装置１３０により、入力された対象混合信号の波形がどのように変化するかを示す概念図である。
図１１（Ａ）に示されている対象混合信号１８０には、第１の目的音に由来する第１の成分１８１、第２の目的音に由来する第２の成分１８２、及び、非目的音に由来する第３の成分１８３が含まれる。 FIG. 11(A) is a conceptual diagram showing how the waveform of the input target mixed signal is changed by the sound source separation device 130.
The target mixed signal 180 shown in FIG. 11(A) includes a first component 181 derived from the first target sound, a second component 182 derived from the second target sound, and a third component 183 derived from the non-target sound.

対象混合信号１８０が活用側信号処理部１３４を通過すると、処理済対象混合信号１８０＃が得られる。処理済対象混合信号１８０＃には、第１の成分１８１に由来する第４の成分１８１＃、第２の成分に由来する第５の成分１８２＃、及び、第３の成分１８３に由来する第６の成分１８３＃が含まれる。 When the target mixed signal 180 passes through the utilization-side signal processing unit 134, the processed target mixed signal 180# is obtained. The processed target mixed signal 180# includes a fourth component 181# derived from the first component 181, a fifth component 182# derived from the second component, and a sixth component 183# derived from the third component 183.

活用側信号処理部１３４では、非目的音を抑圧する処理が行われることから、非目的音に由来する第３の成分１８３と比較して、第６の成分１８３＃の音量が下がっている。また、第１の目的音に由来する第１の成分１８１及び第２の目的音に由来する第２の成分１８２と比較して、第４の成分１８１＃及び第５の成分１８２＃は、強調されている。さらに、第４の成分１８１＃及び第５の成分１８２＃は、信号処理に伴って、音量及び波形の形状（周波数特性）等が変化しているほか、活用側信号処理部１３４にて生じる遅延に伴い、対象混合信号１８０と、処理済対象混合信号１８０＃との間で時刻の同期がずれた状態となる。 Because the utilization-side signal processing unit 134 performs processing that suppresses the non-target sound, the volume of the sixth component 183# is lower than that of the third component 183 derived from the non-target sound. Also, compared to the first component 181 derived from the first target sound and the second component 182 derived from the second target sound, the fourth component 181# and the fifth component 182# are emphasized. Further, the volume and waveform shape (frequency characteristics) of the fourth component 181# and the fifth component 182# change as a result of the signal processing, and, because of the delay introduced by the utilization-side signal processing unit 134, the target mixed signal 180 and the processed target mixed signal 180# are no longer synchronized in time.

処理済対象混合信号１８０＃に対して、活用側特徴量抽出部１３５、活用側モデル推論部１３６及び活用側信号抽出部１３７での処理を適用することにより、第１の出力信号１８４及び第２の出力信号１８５が得られる。第１の出力信号１８４は、第１の目的音に対応する成分を、第２の出力信号１８５は、第２の目的音に対応する成分を、それぞれ抽出したものである。 By applying the processing of the utilization-side feature amount extraction unit 135, the utilization-side model inference unit 136, and the utilization-side signal extraction unit 137 to the processed target mixed signal 180#, a first output signal 184 and a second output signal 185 are obtained. The first output signal 184 is obtained by extracting the component corresponding to the first target sound, and the second output signal 185 is obtained by extracting the component corresponding to the second target sound.

図１１（Ｂ）は、対象混合信号１８０とは異なる対象混合信号１８６に対し、同様の信号処理を適用した場合について示した概念図である。
処理済対象混合信号１８０＃と、処理済対象混合信号１８６＃とを比較すると、波形の変化及び音量の変化が異なっている。このため、第１の出力信号１８７及び第２の出力信号１８８の波形及び音量も、第１の出力信号１８４及び第２の出力信号１８５とは異なっている。 FIG. 11(B) is a conceptual diagram showing a case where similar signal processing is applied to a target mixed signal 186 different from the target mixed signal 180.
Comparing the processed target mixed signal 180# and the processed target mixed signal 186#, changes in waveform and volume are different. Therefore, the waveform and volume of the first output signal 187 and the second output signal 188 are also different from the first output signal 184 and the second output signal 185 .

このように、活用側信号処理部１３４へ入力される対象混合信号の特徴、活用側信号処理部１３４の処理内容の変化等によって、処理済対象混合信号の特徴にも変動があり、信号処理後の状態を考慮して生成された学習モデルを用いることで、音源を精度よく分離できる。 In this way, the characteristics of the processed target mixed signal vary with the characteristics of the target mixed signal input to the utilization-side signal processing unit 134, changes in the processing performed by the utilization-side signal processing unit 134, and the like; by using a learning model generated in consideration of the state after signal processing, the sound sources can be separated accurately.

なお、音源分離モデル学習装置１１０において、学習側信号処理部１１３を省略し、信号変形部１１８において学習用信号の変形を行わない構成とする場合を考えることができる。このような音源分離モデル学習装置及び学習方法は、従来から知られている。 In addition, in the sound source separation model learning device 110, a configuration can be considered in which the learning-side signal processing unit 113 is omitted and the signal transformation unit 118 does not transform the learning signal. Such a sound source separation model learning device and learning method are conventionally known.

この場合、学習側モデル推論部１１６は、図１０に示されている学習用混合信号１７３より抽出された特徴量から、第１の目的音の第１の信号１７０及び第２の目的音の第２の信号１７１を分離するための分離用特徴量が得られるように学習を行う。
しかしながら、音源分離装置１３０を動作させる場合、図１１（Ａ）に示されているように、活用側モデル推論部１３６には処理済対象混合信号１８０＃より抽出された特徴量が入力される。
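学習用混合信号１７３から抽出される特徴量と、処理済対象混合信号１８０＃から抽出される特徴量では、種々の特性が異なっている。音源分離モデルは、処理済対象混合信号１８０＃から抽出される特徴量が入力されることを前提に学習されていないため、分離性能の悪化が生じる。 In this case, the learning-side model inference unit 116 is trained so that separation feature amounts for separating the first signal 170 of the first target sound and the second signal 171 of the second target sound are obtained from the feature amounts extracted from the learning mixed signal 173 shown in FIG. 10.
However, when the sound source separation device 130 is operated, as shown in FIG. 11(A), the feature amounts extracted from the processed target mixed signal 180# are input to the utilization-side model inference unit 136.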
Various characteristics are different between the feature amount extracted from the learning mixed signal 173 and the feature amount extracted from the processed target mixed signal 180#. Since the sound source separation model is not trained on the assumption that the feature amount extracted from the processed target mixed signal 180# is input, the separation performance deteriorates.

また、音源分離モデル学習装置１１０において、学習側信号処理部１１３を省略しないものの、信号変形部１１８において学習用信号の変形を行わない構成をとることも考えられる。
この場合、学習側モデル推論部１１６は、処理済学習用混合信号１７３＃より抽出された特徴量から、第１の目的音の第１の信号１７０及び第２の目的音の第２の信号１７１を分離するための分離用特徴量が得られるように学習される。そして、学習側モデル推論部１１６は、処理済学習用混合信号１７３＃から抽出される特徴量が入力されることを前提として音源分離モデルを学習させるため、上記で述べたような問題を解決できる。Further, in the sound source separation model learning apparatus 110, it is conceivable to adopt a configuration in which the learning signal processing unit 113 is not omitted, but the signal transforming unit 118 does not transform the learning signal.
In this case, the learning-side model inference unit 116 is trained so that separation feature amounts for separating the first signal 170 of the first target sound and the second signal 171 of the second target sound are obtained from the feature amounts extracted from the processed learning mixed signal 173#. Because the learning-side model inference unit 116 trains the sound source separation model on the premise that the feature amounts extracted from the processed learning mixed signal 173# are input, the problem described above can be solved.

しかしながら、図１１（Ａ）に示されている、理済対象混合信号１８０＃に含まれる第４の成分１８１＃及び第５の成分１８２＃と、図１０に示されている象混合信号１８０に含まれている第１の成分１８１及び第２の成分１８２とでは、音量、周波数特性及び遅延等の特性が異なっている。
このため、音源分離モデルは、このような多様な特性の変化を打ち消して元の信号を出力するように学習される。しかし、上述のように、このような特性の変化は、どのような信号が活用側信号処理部１３４に入力されるかによって、又は、時間が経過するにつれて、変化するものである。そのような多様な特性変化を吸収できるように音源分離モデルを学習させることは難しい。 However, the fourth component 181# and the fifth component 182# included in the processed target mixed signal 180# shown in FIG. 11(A) differ from the first component 181 and the second component 182 included in the target mixed signal 180 shown in FIG. 10 in characteristics such as volume, frequency characteristics, and delay.
For this reason, the sound source separation model is learned so as to cancel such changes in various characteristics and output the original signal. However, as described above, such changes in characteristics change depending on what kind of signal is input to the utilization-side signal processing unit 134 or as time passes. It is difficult to train a sound source separation model so as to absorb such various characteristic changes.

音源分離モデル学習装置１１０において、学習側信号処理部１１３と、信号変形部１１８とを共に機能させ、音源分離モデルが第１の変形目的音信号１７６及び第２の変形目的音信号１７７を分離するための特徴量を出力するように学習させることで、音源分離モデルは特性変化を加味した結果を出力するように学習すればよくなる。
信号変形部１１８において学習用信号の変形を行わない構成の場合には、特性変化を打ち消して元に戻した結果を出力するように音源分離モデルを学習させる必要があったところ、このような条件とすることで、特性変化を打ち消した結果を出力するように学習させる必要がなくなるため、学習が簡単になり、結果として音源分離出力の品質が向上する。 In the sound source separation model learning device 110, the learning-side signal processing unit 113 and the signal transformation unit 118 are both made to function, and the sound source separation model is trained to output feature amounts for separating the first modified target sound signal 176 and the second modified target sound signal 177. The sound source separation model then only has to learn to output a result that includes the characteristic changes.
In the case of a configuration in which the signal transformation unit 118 does not transform the learning signal, it was necessary to train the sound source separation model so as to cancel out the characteristic change and output the restored result. This eliminates the need for learning to output the result of canceling the characteristic change, which simplifies the learning process and, as a result, improves the quality of the sound source separation output.

図１２は、音源分離装置１３０の利用例を示す概略図である。
図１２は、車両１９０に設置されたマイクロホン１９１Ａ、１９１Ｂ、１９１Ｃにおいて、運転席話者１９２が発する音声、助手席話者１９３が発する音声、及び、車両走行音又はカーステレオ等から発せられる騒音１９４が同時に観測される状況を表している。このとき、音源分離装置１３０を用いて、運転席話者１９２の発した音声と、助手席話者１９３の発した音声とを、それぞれ取り出す場合について説明する。 FIG. 12 is a schematic diagram showing a usage example of the sound source separation device 130.
FIG. 12 shows a situation in which the voice uttered by a driver's seat speaker 192, the voice uttered by a passenger's seat speaker 193, and noise 194 such as vehicle running sound or sound from a car stereo are observed at the same time by microphones 191A, 191B, and 191C installed in a vehicle 190. The case where the sound source separation device 130 is used to extract the voice uttered by the driver's seat speaker 192 and the voice uttered by the passenger's seat speaker 193 separately is described below.

運転席話者１９２の発した音声が、図１１（Ａ）に示されている第１の目的音の第１の成分１８１に、助手席話者１９３の発した音声が、第２の目的音の第２の成分１８２に、各種騒音１９４が、非目的音の第３の成分１８３に相当する。また、マイクロホン１９１Ａ、１９１Ｂ、１９１Ｃで収録された信号が、対象混合信号１８０に相当する。
音源分離装置１３０において、活用側信号処理部１３４の出力する処理済対象混合信号１８０＃では、騒音１９４に相当する第６の成分１８３＃が抑圧されている。 The voice uttered by the driver's seat speaker 192 corresponds to the first component 181 of the first target sound shown in FIG. 11(A), the voice uttered by the passenger's seat speaker 193 corresponds to the second component 182 of the second target sound, and the various noises 194 correspond to the third component 183 of the non-target sound. The signals recorded by the microphones 191A, 191B, and 191C correspond to the target mixed signal 180.
In the sound source separation device 130, the sixth component 183# corresponding to the noise 194 is suppressed in the processed target mixed signal 180# output from the utilization-side signal processing unit 134.

活用側音源分離モデルを適用後、活用側信号抽出部１３７において抽出された結果が、第１の出力信号１８４及び第２の出力信号１８５に対応する。これらの信号では、運転席及び助手席の各音声が強調されている。 After applying the utilization-side sound source separation model, the results extracted by the utilization-side signal extraction unit 137 correspond to the first output signal 184 and the second output signal 185 . In these signals, the voices of the driver's seat and passenger's seat are emphasized.

活用側音源分離モデルは、音源分離モデル学習装置１１０によって、運転席側と助手席側の話者のそれぞれの音声について、騒音１９４を抑制するような信号処理を行った際の変形された第１の変形目的音信号１７６及び第２の変形目的音信号１７７を考慮して生成されているため、実際に運転席の音声、助手席の音声及び騒音１９４が混合した状態から、運転席と助手席とに座った２人の話者の音声を適切に分離することができる。 The utilization-side sound source separation model is generated by the sound source separation model learning device 110 in consideration of the first modified target sound signal 176 and the second modified target sound signal 177, that is, the voices of the driver's-seat and passenger's-seat speakers as modified by signal processing that suppresses the noise 194. Therefore, the voices of the two speakers sitting in the driver's seat and the passenger's seat can be properly separated from an actual mixture of the driver's-seat voice, the passenger's-seat voice, and the noise 194.

また、車両内に限らず、会議中の録音記憶から出席者の発言を取り出す場合であっても、音源分離モデル学習装置で出席者の音声について学習して音源分離モデルを生成すれば、会議と関係ない周辺の雑音を除去する信号処理を行った上で当該音源分離モデルを用いれば、各出席者の音声を分離することができる。 The same applies outside a vehicle. For example, when extracting attendees' remarks from a recording of a meeting, if the sound source separation model learning device learns the attendees' voices to generate a sound source separation model, the voice of each attendee can be separated by applying that model after signal processing that removes surrounding noise unrelated to the meeting.

以上のように、実施の形態１によれば、音源分離装置１３０が音源分離モデルを用いて音源分離を実施する際に、活用側信号処理部１３４に伴って生じる音響的特性の変化に音源分離モデルが対応し、この結果として音源分離装置１３０から出力される分離音の品質が向上する。 As described above, according to Embodiment 1, when the sound source separation device 130 performs sound source separation using the sound source separation model, the model accommodates the changes in acoustic characteristics caused by the utilization-side signal processing unit 134, and as a result the quality of the separated sounds output from the sound source separation device 130 improves.

また、混合信号ブロック分割部１１８ａ、学習用信号ブロック分割部１１８ｂ及びブロック結合部１１８ｅを設けることによる効果として、ブロック毎に異なるフィルタのパラメタを出力することにより、時系列的な変化に対応できるようになる。 Further, as an effect of providing the mixed signal block dividing unit 118a, the learning signal block dividing unit 118b, and the block combining unit 118e, by outputting different filter parameters for each block, it is possible to cope with time-series changes. become.

実施の形態２．
実施の形態１では、混合信号ブロック分割部１１８ａ及び学習用信号ブロック分割部１１８ｂで分割したブロック毎に、フィルタ推定部１１８ｃがフィルタを推定している。実施の形態２では、ブロック毎ではなく、１つのブロック内の時刻毎に異なるフィルタを推定する、言い換えると、フィルタを逐次的に更新することによって、ブロック内の時系列的な変化に対応できるようにする。Embodiment 2.
In Embodiment 1, the filter estimation unit 118c estimates a filter for each block divided by the mixed signal block division unit 118a and the learning signal block division unit 118b. In Embodiment 2, a different filter is estimated not for each block but for each time instant within a block; in other words, the filter is updated sequentially so that time-series changes within a block can be handled.

図１に示されているように、実施の形態２に係る音源分離システム２００は、音源分離モデル学習装置２１０と、音源分離装置１３０とを備える。
実施の形態２における音源分離装置１３０は、実施の形態１における音源分離装置１３０と同様である。As shown in FIG. 1 , a sound source separation system 200 according to Embodiment 2 includes a sound source separation model learning device 210 and a sound source separation device 130 .
The sound source separation device 130 according to the second embodiment is the same as the sound source separation device 130 according to the first embodiment.

図２に示されているように、実施の形態２における音源分離モデル学習装置２１０は、学習側入力部１１１と、混合信号生成部１１２と、学習側信号処理部１１３と、学習側特徴量抽出部１１４と、学習側音源分離モデル記憶部１１５と、学習側モデル推論部１１６と、学習側信号抽出部１１７と、信号変形部２１８と、モデル更新部１１９と、学習側通信部１２０とを備える。
実施の形態２における学習側入力部１１１、混合信号生成部１１２、学習側信号処理部１１３、学習側特徴量抽出部１１４、学習側音源分離モデル記憶部１１５、学習側モデル推論部１１６、学習側信号抽出部１１７、モデル更新部１１９及び学習側通信部１２０は、実施の形態１における学習側入力部１１１、混合信号生成部１１２、学習側信号処理部１１３、学習側特徴量抽出部１１４、学習側音源分離モデル記憶部１１５、学習側モデル推論部１１６、学習側信号抽出部１１７、モデル更新部１１９及び学習側通信部１２０と同様である。 As shown in FIG. 2, the sound source separation model learning device 210 according to Embodiment 2 includes a learning-side input unit 111, a mixed signal generation unit 112, a learning-side signal processing unit 113, a learning-side feature amount extraction unit 114, a learning-side sound source separation model storage unit 115, a learning-side model inference unit 116, a learning-side signal extraction unit 117, a signal transformation unit 218, a model update unit 119, and a learning-side communication unit 120.
The learning-side input unit 111, mixed signal generation unit 112, learning-side signal processing unit 113, learning-side feature amount extraction unit 114, learning-side sound source separation model storage unit 115, learning-side model inference unit 116, learning-side signal extraction unit 117, model update unit 119, and learning-side communication unit 120 in Embodiment 2 are the same as the units with the same names and reference numerals in Embodiment 1.

図１３は、実施の形態２における信号変形部２１８の構成を概略的に示すブロック図である。
信号変形部２１８は、混合信号ブロック分割部１１８ａと、学習用信号ブロック分割部１１８ｂと、フィルタ適用部２１８ｄと、ブロック結合部１１８ｅと、フィルタパラメタ記憶部２１８ｆと、フィルタ更新部２１８ｇとを備える。FIG. 13 is a block diagram schematically showing the configuration of signal transforming section 218 according to the second embodiment.
The signal transformation unit 218 includes a mixed signal block division unit 118a, a learning signal block division unit 118b, a filter application unit 218d, a block combination unit 118e, a filter parameter storage unit 218f, and a filter update unit 218g.

実施の形態２における混合信号ブロック分割部１１８ａ、学習用信号ブロック分割部１１８ｂ及びブロック結合部１１８ｅは、実施の形態１における混合信号ブロック分割部１１８ａ、学習用信号ブロック分割部１１８ｂ及びブロック結合部１１８ｅと同様である。 The mixed signal block division unit 118a, the learning signal block division unit 118b, and the block combining unit 118e in Embodiment 2 are the same as the mixed signal block division unit 118a, the learning signal block division unit 118b, and the block combining unit 118e in Embodiment 1.

フィルタパラメタ記憶部２１８ｆは、フィルタ適用部２１８ｄで使用するフィルタパラメタを記憶する。
例えば、フィルタパラメタ記憶部２１８ｆは、予め定められた期間に対応するサンプル毎にフィルタパラメタを記憶する。The filter parameter storage unit 218f stores filter parameters used by the filter application unit 218d.
For example, the filter parameter storage unit 218f stores filter parameters for each sample corresponding to a predetermined period.

フィルタ適用部２１８ｄは、複数の目的音ブロック信号に対して、フィルタパラメタ記憶部２１８ｆに記憶されているフィルタパラメタを適用することで、フィルタパラメタの対応する時刻における処理済サンプル信号を生成する。処理済サンプル信号は、フィルタ更新部２１８ｇに与えられる。言い換えると、フィルタ適用部２１８ｄは、サンプル毎に、複数の目的音ブロック信号から選択された部分にフィルタパラメタを適用することで処理済みサンプル信号を生成する。 The filter application unit 218d applies the filter parameters stored in the filter parameter storage unit 218f to a plurality of target sound block signals, thereby generating processed sample signals at times corresponding to the filter parameters. The processed sample signal is provided to filter updater 218g. In other words, the filter application unit 218d generates the processed sample signal by applying the filter parameter to the portion selected from the plurality of target sound block signals for each sample.

また、フィルタ適用部２１８ｄは、生成された処理済サンプル信号を、複数の目的音ブロック信号の各々で結合することで、複数の変形ブロック信号を生成する。複数の変形ブロック信号は、ブロック結合部１１８ｅに与えられる。 Further, the filter applying unit 218d generates a plurality of modified block signals by combining the generated processed sample signals with each of the plurality of target sound block signals. A plurality of modified block signals are provided to the block combiner 118e.

フィルタ更新部２１８ｇは、フィルタ適用部２１８ｄから与えられる処理済サンプル信号を、処理済学習用混合信号の対応する部分に近づけるように、フィルタパラメタ記憶部２１８ｆに記憶されているフィルタパラメタを更新する。 The filter updating unit 218g updates the filter parameters stored in the filter parameter storage unit 218f so that the processed sample signal provided from the filter applying unit 218d approaches the corresponding portion of the processed mixed signal for learning.

図１４は、実施の形態２における信号変形部２１８の動作を示すフローチャートである。
なお、図１４に示されているフローチャートに含まれているステップの内、図８に示されているフローチャートに含まれているステップの処理と同様の処理を行うステップには、図８に示されているフローチャートに含まれているステップと同じ符号を付している。FIG. 14 is a flow chart showing the operation of signal transforming section 218 according to the second embodiment.
Among the steps included in the flowchart shown in FIG. 14, steps that perform the same processing as steps included in the flowchart shown in FIG. 8 are given the same reference numerals as the corresponding steps in FIG. 8.

図１４に示されているフローチャートに含まれているステップＳ２０及びＳ２１での処理は、図８に示されているフローチャートに含まれているステップＳ２０及びＳ２１での処理と同様である。但し、図１４においては、ステップＳ２１の処理の後は、処理はステップＳ４０に進む。 The processes in steps S20 and S21 included in the flowchart shown in FIG. 14 are the same as the processes in steps S20 and S21 included in the flowchart shown in FIG. However, in FIG. 14, after the process of step S21, the process proceeds to step S40.

ステップＳ４０では、フィルタ適用部２１８ｄは、学習用信号ブロック分割部１１８ｂから受け取った複数の目的音ブロック信号から、未選択の１つの目的音ブロック信号を選択する。 In step S40, the filter applying unit 218d selects one unselected target sound block signal from the plurality of target sound block signals received from the learning signal block dividing unit 118b.

次に、フィルタ更新部２１８ｇは、フィルタパラメタの初期値を決定して、その初期値をフィルタパラメタ記憶部２１８ｆに記憶する（Ｓ４１）。フィルタ適用部２１８ｄで使用されるフィルタがＦＩＲフィルタである場合、フィルタ更新部２１８ｇは、例えば、図８に示されているフローチャートのステップＳ２２での処理と同様の処理を行うことで、フィルタパラメタの初期値を推定すればよい。 Next, the filter update unit 218g determines the initial values of the filter parameters and stores them in the filter parameter storage unit 218f (S41). When the filter used by the filter application unit 218d is an FIR filter, the filter update unit 218g may estimate the initial values of the filter parameters by, for example, performing the same processing as in step S22 of the flowchart shown in FIG. 8.

次に、フィルタ適用部２１８ｄは、ステップＳ４０で選択された目的音ブロック信号の内、処理済サンプル信号が未だ生成されていないサンプルの中で先頭に位置するサンプルを選択する（Ｓ４２）。 Next, the filter application unit 218d selects the top sample among the samples for which processed sample signals have not yet been generated, among the target sound block signals selected in step S40 (S42).

次に、フィルタ適用部２１８ｄは、フィルタパラメタ記憶部２１８ｆに記憶されているフィルタパラメタを読み出して、読み出されたフィルタパラメタを、目的音ブロック信号の内の選択されたサンプルに対応する部分に適用することで、処理済サンプル信号を生成する（Ｓ４３）。生成された処理済サンプル信号は、フィルタ更新部２１８ｇに与えられる。 Next, the filter application unit 218d reads the filter parameters stored in the filter parameter storage unit 218f, and applies the read filter parameters to the portion corresponding to the selected sample in the target sound block signal. By doing so, a processed sample signal is generated (S43). The generated processed sample signal is provided to the filter updating section 218g.

次に、フィルタ更新部２１８ｇは、フィルタ適用部２１８ｄからの処理済サンプル信号、混合信号ブロック分割部１１８ａからの混合ブロック信号、及び、学習用信号ブロック分割部１１８ｂからの目的音ブロック信号を用いて、フィルタパラメタ記憶部２１８ｆに記憶されているフィルタパラメタを更新する（Ｓ４４）。例えば、フィルタがＦＩＲフィルタである場合、フィルタパラメタの更新方法として、公知のＮＬＭＳ（ＮｏｒｍａｌｉｚｅｄＬｅａｓｔＭｅａｎＳｑｕａｒｅ）アルゴリズム、又は、ＲＬＳ（ＲｅｃｕｒｓｉｖｅＬｅａｓｔＳｑｕａｒｅ）アルゴリズム等が使用できる。なお、フィルタ更新部２１８ｇが更新を行なう際に、フィルタ適用部２１８ｄでの処理が必要となる場合がある。 Next, the filter updating unit 218g uses the processed sample signal from the filter applying unit 218d, the mixed block signal from the mixed signal block dividing unit 118a, and the target sound block signal from the learning signal block dividing unit 118b. , the filter parameters stored in the filter parameter storage unit 218f are updated (S44). For example, when the filter is an FIR filter, a known NLMS (Normalized Least Mean Square) algorithm, RLS (Recursive Least Square) algorithm, or the like can be used as a method for updating the filter parameters. When the filter updating unit 218g performs updating, there are cases where processing by the filter applying unit 218d is required.
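As one concrete instance of the update in step S44, the textbook NLMS recursion is written out below in the notation of Embodiment 1. The stacked input vector x(t), the step size μ, and the small regularization constant ε are notational assumptions introduced here; the description only names the algorithm.

```latex
\begin{align}
x(t) &= \bigl[\, s_1(t), \ldots, s_1(t-L+1),\; \ldots,\; s_n(t), \ldots, s_n(t-L+1) \,\bigr]^{\mathsf{T}} \\
e(t) &= y(t) - h(t)^{\mathsf{T}} x(t) \\
h(t+1) &= h(t) + \mu \, \frac{e(t)\, x(t)}{\varepsilon + \lVert x(t) \rVert_2^{2}}
\end{align}
```

Here h(t) is the stacked coefficient vector held in the filter parameter storage unit at sample t, and 0 < μ < 2 is the usual stability range for the step size.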

Next, the filter application unit 218d determines whether processed sample signals have been generated from all the samples included in the selected target sound block signal (S45). If processed sample signals have been generated from all the samples (Yes in S45), the process proceeds to step S46; if samples remain for which no processed sample signal has been generated (No in S45), the process returns to step S42.

In step S46, the filter application unit 218d generates a modified block signal by concatenating the processed sample signals generated for the individual samples. The modified block signal is given to the block combining unit 118e.

Next, the filter application unit 218d determines whether all the target sound block signals given from the learning signal block dividing unit 118b have been selected (S47). If all the target sound block signals have been selected (Yes in S47), the process proceeds to step S24; if target sound block signals that have not yet been selected remain (No in S47), the process returns to step S40.

Then, as in step S24 of FIG. 8, the block combining unit 118e joins the modified block signals, which are divided block by block, to generate a modified target sound signal (S24).
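The overall flow of steps S40 to S47 and S24 can be sketched as a loop over blocks with an inner loop over samples. The sketch below uses a standard RLS update for the per-sample step. The function name `rls_transform`, the zero initial filter (the disclosure instead estimates initial values as in step S22), the forgetting factor `lam`, the initialization constant `delta`, and the toy signals are all assumptions made for illustration.

```python
import numpy as np


def rls_transform(target, mixed, block_len, taps=4, lam=0.999, delta=1e-2):
    """Transform `target` toward `mixed` sample by sample, block by block."""
    out_blocks = []
    for start in range(0, len(target), block_len):      # block selection (S40, S47)
        x_blk = target[start:start + block_len]
        d_blk = mixed[start:start + block_len]
        w = np.zeros(taps)                              # initial parameters (S41, assumed zero)
        P = np.eye(taps) / delta                        # RLS inverse-correlation estimate
        y_blk = np.empty(len(x_blk))
        for n in range(len(x_blk)):                     # sample selection (S42, S45)
            x = np.zeros(taps)
            k_max = min(taps, n + 1)
            x[:k_max] = x_blk[n::-1][:k_max]            # newest sample first, zero-padded
            y_blk[n] = w @ x                            # processed sample (S43)
            g = P @ x / (lam + x @ P @ x)               # RLS gain vector
            w = w + g * (d_blk[n] - y_blk[n])           # parameter update (S44)
            P = (P - np.outer(g, x @ P)) / lam
        out_blocks.append(y_blk)                        # modified block signal (S46)
    return np.concatenate(out_blocks)                   # modified target sound signal (S24)


# Toy usage (hypothetical data): the reference is the target scaled by a fixed gain.
rng = np.random.default_rng(0)
target = rng.standard_normal(400)
mixed = 0.5 * target
modified = rls_transform(target, mixed, block_len=100)
```

Because each block here restarts from a zero filter, the first few samples of every block are poorly matched. Carrying the parameters over from the previous block would be one way to avoid that, though this is a design choice of the sketch rather than something stated in the text.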

As described above, according to Embodiment 2, the filter is updated sequentially, so that even when the learning-side signal processing unit 113 and the utilization-side signal processing unit 134 perform adaptive processing, the time-series changes in the learning-side signal processing unit 113 and the utilization-side signal processing unit 134 can be followed.

In Embodiment 2, the filter updating unit 218g and the filter application unit 218d update the filter and generate the modified block signal sample by sample, so the mixed signal block dividing unit 118a, the learning signal block dividing unit 118b, and the block combining unit 118e need not be provided.
In such a case, the filter application unit 218d applies the filter parameters stored in the filter parameter storage unit 218f to the signal indicating the target sound to be extracted, thereby generating a processed sample signal at the time to which each filter parameter corresponds.
The filter updating unit 218g updates the filter parameters so that the processed sample signal approaches the corresponding portion of the processed learning mixed signal.
Then, the filter application unit 218d combines the generated processed sample signals to generate a modified target sound signal.

On the other hand, providing the mixed signal block dividing unit 118a, the learning signal block dividing unit 118b, and the block combining unit 118e makes it possible to improve processing speed by performing the filter application processing in parallel block by block, or to improve parameter extraction speed by creating a candidate group of filter parameters for each block and searching that group for parameters when extracting the parameters for each sample.
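The block-parallel filter application mentioned above can be sketched as follows; the candidate-group search is not covered. This is only an illustration under assumptions: each block is taken to already have its own FIR filter, the function names `apply_fir` and `transform_blocks_parallel` are hypothetical, and a thread pool is just one possible way to run the blocks concurrently (any actual speed-up depends on the runtime and the block sizes).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def apply_fir(block, h):
    """Apply FIR coefficients h to one block, keeping the block length."""
    return np.convolve(block, h)[: len(block)]


def transform_blocks_parallel(blocks, filters, max_workers=4):
    """Filter each block with its own filter concurrently, then join the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the blocks stay aligned.
        modified = list(pool.map(apply_fir, blocks, filters))
    return np.concatenate(modified)


# Toy usage (hypothetical data): four blocks, each with a random 8-tap filter.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1024)
blocks = np.split(signal, 4)
filters = [rng.standard_normal(8) for _ in blocks]

parallel = transform_blocks_parallel(blocks, filters)
sequential = np.concatenate([apply_fir(b, h) for b, h in zip(blocks, filters)])
```

The parallel and sequential results should agree, since the blocks are processed independently and only joined at the end.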

For example, when an FIR filter is used, the length of each block must be set longer than the length of the filter at block division in order to estimate the filter. Therefore, when a filter is estimated for each block as in Embodiment 1, time-series changes in the learning-side signal processing unit 113 and the utilization-side signal processing unit 134 can be followed only in time units of at least the length of the FIR filter. In contrast, when a filter is estimated for each sample as in Embodiment 2, those time-series changes can be followed more finely, in time units of one sample.
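The block-length requirement can be seen from a least-squares formulation of per-block FIR estimation. The sketch below assumes that formulation, which the text does not prescribe. The function name `estimate_fir` and the toy signals are hypothetical. When the block is shorter than the filter, the system has fewer equations than unknown coefficients and the filter is not uniquely determined.

```python
import numpy as np


def estimate_fir(target_blk, mixed_blk, taps):
    """Least-squares FIR estimate mapping target_blk toward mixed_blk.

    Returns the coefficients and the rank of the convolution matrix.
    """
    n = len(target_blk)
    # Column k of X holds the target block delayed by k samples.
    X = np.zeros((n, taps))
    for k in range(min(taps, n)):
        X[k:, k] = target_blk[: n - k]
    h, _, rank, _ = np.linalg.lstsq(X, mixed_blk, rcond=None)
    return h, rank


# Toy data (hypothetical): the reference is the target filtered by 16 known taps.
rng = np.random.default_rng(2)
true_h = rng.standard_normal(16)
x = rng.standard_normal(64)
d = np.convolve(x, true_h)[: len(x)]

h_long, rank_long = estimate_fir(x, d, taps=16)             # block longer than the filter
h_short, rank_short = estimate_fir(x[:8], d[:8], taps=16)   # block shorter than the filter
```

With the 64-sample block the matrix has full column rank and the known filter is recovered. With the 8-sample block the rank is below the number of taps, so `h_short` is only one of infinitely many solutions.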

Further, when the filter parameter storage unit 218f is provided as in Embodiment 2, the filter updating unit 218g can hold the immediately preceding filter estimation result in the filter parameter storage unit 218f and, when a new sample is obtained, apply the filter parameters recorded in the filter parameter storage unit 218f after modifying them slightly according to the selected sample.

The sound source separation model learning devices 110 and 210 described above have the effect of promoting the learning of the sound source separation model and improving sound source separation performance when configuring a sound source separation device 130 that combines an NN-based sound source separation technique with a signal processing technique based on classical signal processing, processing using machine learning, unknown signal processing, or the like. Therefore, they can be used, for example, in a device that recognizes speech in a noisy environment, to extract the speech uttered by a target speaker by combining classical signal processing with NN-based sound source separation. Note that the unknown signal processing may include classical signal processing or processing using machine learning.

Embodiments 1 and 2 described above are each composed of two devices, namely the sound source separation model learning device 110 or 210 and the sound source separation device 130, but Embodiments 1 and 2 are not limited to such an example. For example, the sound source separation model learning device 110 or 210 and the sound source separation device 130 may be configured as a single device, for example, a single sound source separation learning device. In such a case, the learning-side communication unit 120 and the utilization-side communication unit 131 are unnecessary, and the learning-side sound source separation model storage unit 115 and the utilization-side sound source separation model storage unit 132 can be integrated into a single sound source separation model storage unit that stores the sound source separation model.

100, 200: sound source separation system; 110, 210: sound source separation model learning device; 111: learning-side input unit; 112: mixed signal generation unit; 113: learning-side signal processing unit; 114: learning-side feature quantity extraction unit; 115: learning-side sound source separation model storage unit; 116: learning-side model inference unit; 117: learning-side signal extraction unit; 118, 218: signal transformation unit; 118a: mixed signal block dividing unit; 118b: learning signal block dividing unit; 118c: filter estimation unit; 118d, 218d: filter application unit; 118e: block combining unit; 218f: filter parameter storage unit; 218g: filter updating unit; 119: model updating unit; 120: learning-side communication unit; 130: sound source separation device; 131: utilization-side communication unit; 132: utilization-side sound source separation model storage unit; 133: utilization-side input unit; 134: utilization-side signal processing unit; 135: utilization-side feature quantity extraction unit; 136: utilization-side model inference unit; 137: utilization-side signal extraction unit; 138: utilization-side output unit.

Claims

A sound source separation model learning device comprising:
a learning-side signal processing unit that performs predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
a learning-side model inference unit that extracts sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals each indicating an extracted sound and each corresponding to one of the plurality of processed target sounds;
a signal transformation unit that performs, on a signal indicating one target sound among the plurality of target sounds, transformation processing for bringing the one target sound closer to the one processed target sound, among the plurality of processed target sounds, that corresponds to the one target sound, thereby generating a plurality of modified target sound signals respectively indicating a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds; and
a model updating unit that updates the learning-side sound source separation model using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

A sound source separation model learning device comprising:
a learning-side signal processing unit that performs predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
a learning-side feature quantity extraction unit that extracts, from the processed learning mixed signal, learning acoustic feature quantities, which are predetermined acoustic feature quantities, in a plurality of components, thereby generating learning feature data that is time-series data of the extracted learning acoustic feature quantities;
a learning-side model inference unit that generates, using a learning-side sound source separation model indicating a weight for each of the plurality of components for extracting the plurality of processed target sounds, a plurality of learning masks each for extracting one of the plurality of processed target sounds from the learning feature data;
a learning-side signal extraction unit that extracts sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals each indicating an extracted sound and each corresponding to one of the plurality of processed target sounds;
a signal transformation unit that performs, on a signal indicating one target sound among the plurality of target sounds, transformation processing for bringing the one target sound closer to the one processed target sound, among the plurality of processed target sounds, that corresponds to the one target sound, thereby generating a plurality of modified target sound signals respectively indicating a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds; and
a model updating unit that updates the learning-side sound source separation model using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

3. The sound source separation model learning device according to claim 1, wherein the predetermined process is a process for facilitating extraction of the plurality of target sounds.

The sound source separation model learning device according to any one of claims 1 to 3, wherein the predetermined process is a process of emphasizing the plurality of target sounds.

5. The sound source separation model learning device according to any one of claims 1 to 4, wherein the signal transformation unit comprises:
a filter estimation unit that estimates a filter for approximating the one target sound to the one processed target sound; and
a filter application unit that generates the modified target sound signal by applying the filter to the signal indicating the one target sound.

The sound source separation model learning device according to any one of claims 1 to 4, wherein the signal transformation unit comprises:
a first block dividing unit that generates a plurality of mixed block signals by dividing the processed learning mixed signal into a plurality of blocks;
a second block dividing unit that generates a plurality of target sound block signals by dividing the signal indicating the one target sound into a plurality of blocks;
a filter estimation unit that estimates a plurality of filters, each for approximating the sound indicated by one of the plurality of target sound block signals to the sound corresponding to the one target sound among the sounds indicated by the corresponding one of the plurality of mixed block signals;
a filter application unit that generates a plurality of modified block signals by applying each of the plurality of filters to the corresponding one of the plurality of target sound block signals; and
a block combining unit that generates the modified target sound signal by combining the plurality of modified block signals.

The signal transforming unit
a filter parameter storage unit that stores filter parameters for each sample corresponding to a predetermined period;
applying the filter parameters to selected portions of the signal representing the one target sound for each of the samples to generate a processed sample signal; a filter application unit that generates a sound signal;
5. The filter updating unit that updates the filter parameter so that the processed sample signal approaches the corresponding part of the processed mixed signal for learning. The sound source separation model learning device according to the item.

The sound source separation model learning device according to any one of claims 1 to 4, wherein the signal transformation unit comprises:
a first block dividing unit that generates a plurality of mixed block signals by dividing the processed learning mixed signal into a plurality of blocks;
a second block dividing unit that generates a plurality of target sound block signals by dividing the signal indicating the one target sound into a plurality of blocks;
a filter parameter storage unit that stores filter parameters for each sample corresponding to a predetermined period;
a filter application unit that generates a processed sample signal by applying the filter parameters to a selected portion of the signal indicating the one target sound for each of the samples, and generates a plurality of modified block signals by combining the processed sample signals for each of the plurality of target sound block signals;
a filter updating unit that updates the filter parameters so that the processed sample signal approaches the corresponding portion of the processed learning mixed signal; and
a block combining unit that generates the modified target sound signal by combining the plurality of modified block signals.

9. The model updating unit updates the learning-side sound source separation model so that a difference between the plurality of learning extraction signals and the plurality of deformation target sound signals becomes smaller. The sound source separation model learning device according to any one of 1.

A utilization side that generates a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds by performing predetermined processing on a target mixed signal representing at least a plurality of target sounds. a signal processing unit;
A processed mixed signal indicating at least a plurality of target sounds is subjected to predetermined processing, thereby indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the mixed learning signal. generating a training mixture signal and extracting sounds from the processed training mixture signal using a training-side source separation model for extracting a plurality of processed target sounds indicated by the processed training mixture signal; a plurality of learning extraction signals each representing the extracted sound and corresponding to each of a plurality of processed target sounds indicated by the processed learning mixed signal; for a signal indicating one target sound out of the plurality of target sounds indicated, the one target sound is converted to the one target sound out of the plurality of processed target sounds indicated by the processed mixed signal for learning A plurality of deformed target sounds each representing a plurality of deformed target sounds each derived from each of the plurality of target sounds represented by the learning mixed signal by performing deformation processing for approximating one processed target sound corresponding to the sound. and using the plurality of learning extraction signals and the plurality of deformation target sound signals, the learning-side sound source is adjusted so that the extracted sound approaches the plurality of deformation target sounds Extracting sounds from the processed target mixed signal using a utilization-side source separation model for extracting a plurality of processed target sounds indicated by the processed target mixed signal generated by updating the separation model. 
By doing so, the utilization side generates a plurality of utilization extraction signals each representing a sound extracted from the processed target mixed signal and corresponding to each of the plurality of processed target sounds indicated by the processed target mixed signal. A sound source separation device comprising: a model inference unit;

A utilization side that generates a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds by performing predetermined processing on a target mixed signal representing at least a plurality of target sounds. a signal processing unit;
Generating utilized feature data, which is time-series data of the extracted utilized acoustic features, by extracting utilized acoustic features, which are predetermined acoustic features, from the processed target mixed signal in a plurality of components. a utilization-side feature quantity extraction unit that
A processed mixed signal indicating at least a plurality of target sounds is subjected to predetermined processing, thereby indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the mixed learning signal. generating a learning mixed signal, and extracting a learning acoustic feature that is a predetermined acoustic feature from the processed learning mixed signal in a plurality of components, thereby obtaining the extracted learning acoustic feature to generate learning feature data that is time-series data of and indicate weights for each of a plurality of components in the learning feature data for extracting a plurality of processed target sounds indicated by the processed learning mixed signal generating a plurality of learning masks for extracting each of a plurality of processed target sounds indicated by the processed mixed learning signal from the learning feature data using the learning-side sound source separation model; By extracting sounds from the learning feature data using a plurality of learning masks, the extracted sounds are indicated, and each of the plurality of processed target sounds indicated by the processed learning mixed signal generates a plurality of learning extraction signals corresponding to the one target sound for a signal indicating one of a plurality of target sounds represented by the learning mixed signal, the one target sound to the processed learning By performing deformation processing to approximate one of the processed target sounds indicated by the mixed signal for learning to one of the processed target sounds corresponding to the one target sound, the plurality of target sounds indicated by the mixed signal for learning generating a plurality of modified target sound signals each representing a plurality of modified target sounds each derived from each of the target sounds, and using the plurality of learning extraction signals and the plurality of modified 
target sound signals, for extracting a plurality of processed target sounds represented by the processed target mixed signal generated by updating the learning-side sound source separation model so that the sounds approach the plurality of deformed target sounds Each of the plurality of processed target sounds represented by the processed target mixed signal is extracted from the utilized feature data using a utilized sound source separation model indicating a weight for each of the plurality of components in the utilized feature data. an exploiting-side model inference unit that generates a plurality of exploitation masks for
By extracting sounds from the utilization feature data using the plurality of utilization masks, at least the sounds extracted from the utilization feature data are represented, and a plurality of processed target sounds indicated by the processed target mixed signal are obtained. and a utilization-side signal extraction unit that generates a plurality of utilization extraction signals corresponding to each of the utilization-side signal extraction units.

A program causing a computer to function as:
a learning-side signal processing unit that performs predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
a learning-side model inference unit that extracts sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals each indicating an extracted sound and each corresponding to one of the plurality of processed target sounds;
a signal transformation unit that performs, on a signal indicating one target sound among the plurality of target sounds, transformation processing for bringing the one target sound closer to the one processed target sound, among the plurality of processed target sounds, that corresponds to the one target sound, thereby generating a plurality of modified target sound signals respectively indicating a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds; and
a model updating unit that updates the learning-side sound source separation model using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

A program causing a computer to function as:
a learning-side signal processing unit that performs predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
a learning-side feature quantity extraction unit that extracts, from the processed learning mixed signal, learning acoustic feature quantities, which are predetermined acoustic feature quantities, in a plurality of components, thereby generating learning feature data that is time-series data of the extracted learning acoustic feature quantities;
a learning-side model inference unit that generates, using a learning-side sound source separation model indicating a weight for each of the plurality of components for extracting the plurality of processed target sounds, a plurality of learning masks each for extracting one of the plurality of processed target sounds from the learning feature data;
a learning-side signal extraction unit that extracts sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals each indicating an extracted sound and each corresponding to one of the plurality of processed target sounds;
a signal transformation unit that performs, on a signal indicating one target sound among the plurality of target sounds, transformation processing for bringing the one target sound closer to the one processed target sound, among the plurality of processed target sounds, that corresponds to the one target sound, thereby generating a plurality of modified target sound signals respectively indicating a plurality of modified target sounds, each derived from a respective one of the plurality of target sounds; and
a model updating unit that updates the learning-side sound source separation model using the plurality of learning extraction signals and the plurality of modified target sound signals so that the extracted sounds approach the plurality of modified target sounds.

the computer,
A utilization side that generates a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds by performing predetermined processing on a target mixed signal representing at least a plurality of target sounds. a signal processing unit, and
A processed mixed signal indicating at least a plurality of target sounds is subjected to predetermined processing, thereby indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the mixed learning signal. generating a training mixture signal and extracting sounds from the processed training mixture signal using a training-side source separation model for extracting a plurality of processed target sounds indicated by the processed training mixture signal; a plurality of learning extraction signals each representing the extracted sound and corresponding to each of a plurality of processed target sounds indicated by the processed learning mixed signal; for a signal indicating one target sound out of the plurality of target sounds indicated, the one target sound is converted to the one target sound out of the plurality of processed target sounds indicated by the processed mixed signal for learning A plurality of deformed target sounds each representing a plurality of deformed target sounds each derived from each of the plurality of target sounds represented by the learning mixed signal by performing deformation processing for approximating one processed target sound corresponding to the sound. and using the plurality of learning extraction signals and the plurality of deformation target sound signals, the learning-side sound source is adjusted so that the extracted sound approaches the plurality of deformation target sounds Extracting sounds from the processed target mixed signal using a utilization-side source separation model for extracting a plurality of processed target sounds indicated by the processed target mixed signal generated by updating the separation model. 
By doing so, the utilization side generates a plurality of utilization extraction signals each representing a sound extracted from the processed target mixed signal and corresponding to each of the plurality of processed target sounds indicated by the processed target mixed signal. A program characterized by functioning as a model inference part.

the computer,
A utilization side that generates a processed target mixed signal representing at least a plurality of processed target sounds derived from the plurality of target sounds by performing predetermined processing on a target mixed signal representing at least a plurality of target sounds. signal processor,
Generating utilized feature data, which is time-series data of the extracted utilized acoustic features, by extracting utilized acoustic features, which are predetermined acoustic features, from the processed target mixed signal in a plurality of components. Utilization side feature value extraction unit,
A processed mixed signal indicating at least a plurality of target sounds is subjected to predetermined processing, thereby indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the mixed learning signal. generating a learning mixed signal, and extracting a learning acoustic feature that is a predetermined acoustic feature from the processed learning mixed signal in a plurality of components, thereby obtaining the extracted learning acoustic feature to generate learning feature data that is time-series data of and indicate weights for each of a plurality of components in the learning feature data for extracting a plurality of processed target sounds indicated by the processed learning mixed signal generating a plurality of learning masks for extracting each of a plurality of processed target sounds indicated by the processed mixed learning signal from the learning feature data using the learning-side sound source separation model; By extracting sounds from the learning feature data using a plurality of learning masks, the extracted sounds are indicated, and each of the plurality of processed target sounds indicated by the processed learning mixed signal generates a plurality of learning extraction signals corresponding to the one target sound for a signal indicating one of a plurality of target sounds represented by the learning mixed signal, the one target sound to the processed learning By performing deformation processing to approximate one of the processed target sounds indicated by the mixed signal for learning to one of the processed target sounds corresponding to the one target sound, the plurality of target sounds indicated by the mixed signal for learning generating a plurality of modified target sound signals each representing a plurality of modified target sounds each derived from each of the target sounds, and using the plurality of learning extraction signals and the plurality of modified 
target sound signals, for extracting a plurality of processed target sounds represented by the processed target mixed signal generated by updating the learning-side sound source separation model so that the sounds approach the plurality of deformed target sounds Each of the plurality of processed target sounds represented by the processed target mixed signal is extracted from the utilized feature data using a utilized sound source separation model indicating a weight for each of the plurality of components in the utilized feature data. a exploiting-side model inference unit that generates a plurality of exploitation masks for
By extracting sounds from the utilization feature data using the plurality of utilization masks, at least the sounds extracted from the utilization feature data are represented, and a plurality of processed target sounds indicated by the processed target mixed signal are obtained. A program characterized by functioning as a utilization-side signal extraction unit that generates a plurality of utilization extraction signals corresponding to each.

performing predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
extracting sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds, thereby generating a plurality of learning extraction signals, each representing an extracted sound and each corresponding to one of the plurality of processed target sounds;
performing, on each signal indicating one target sound among the plurality of target sounds, deformation processing that brings the one target sound closer to the processed target sound corresponding to it among the plurality of processed target sounds, thereby generating a plurality of deformed target sound signals, each indicating a deformed target sound derived from one of the plurality of target sounds; and
updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of deformed target sound signals, so that the extracted sounds approach the plurality of deformed target sounds.
A sound source separation model learning method characterized by the above steps.
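As an illustrative aid only, the role of the deformation processing in the learning method above can be sketched as follows. The sketch assumes that the predetermined processing is a fixed linear FIR filter and that the deformation processing applies the same filter to each clean target sound; both are assumptions made for illustration, and the names `process`, `deform_targets`, `extraction_loss`, and the filter taps are hypothetical and do not come from the disclosure.

```python
import numpy as np

def process(signal, kernel):
    """Assumed 'predetermined processing': a fixed FIR filter (hypothetical choice)."""
    return np.convolve(signal, kernel)[: len(signal)]

def deform_targets(targets, kernel):
    """Assumed 'deformation processing': apply the same filter to each clean target."""
    return [process(t, kernel) for t in targets]

def extraction_loss(extracted, deformed):
    """Mean squared error between extracted sounds and deformed target sounds."""
    return float(np.mean([(e - d) ** 2 for e, d in zip(extracted, deformed)]))

rng = np.random.default_rng(0)
targets = [rng.standard_normal(256), rng.standard_normal(256)]  # clean target sounds
kernel = np.array([0.6, 0.3, 0.1])                              # hypothetical filter taps

processed_mixture = process(targets[0] + targets[1], kernel)
deformed = deform_targets(targets, kernel)

# An ideal separator applied to the processed mixture would output the
# processed (deformed) targets, so they are the reference for the update.
loss_vs_deformed = extraction_loss(deformed, deformed)
loss_vs_clean = extraction_loss(deformed, targets)
```

Under these assumptions the processed mixture is exactly the sum of the deformed targets, so an ideal separation has zero error against the deformed targets but a nonzero error against the unprocessed ones.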

performing predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
extracting learning acoustic features, which are predetermined acoustic features, in a plurality of components from the processed learning mixed signal, thereby generating learning feature data that is time-series data of the extracted learning acoustic features;
generating, using a learning-side sound source separation model that indicates a weight for each of the plurality of components, a plurality of learning masks for extracting each of the plurality of processed target sounds from the learning feature data;
extracting sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals, each representing an extracted sound and each corresponding to one of the plurality of processed target sounds;
performing, on each signal indicating one target sound among the plurality of target sounds, deformation processing that brings the one target sound closer to the processed target sound corresponding to it among the plurality of processed target sounds, thereby generating a plurality of deformed target sound signals, each indicating a deformed target sound derived from one of the plurality of target sounds; and
updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of deformed target sound signals, so that the extracted sounds approach the plurality of deformed target sounds.
A sound source separation model learning method characterized by the above steps.
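The mask-based learning steps above can be illustrated with a minimal sketch. It assumes magnitude short-time spectra as the acoustic features, takes the predetermined processing to be the identity (so the deformed targets equal the clean targets), and replaces the neural network with a toy two-source model that holds one sigmoid weight per frequency component. The functions `stft_mag`, `masks_from_model`, and `loss_and_grad`, the frame sizes, and the step size are all hypothetical and are not taken from the disclosure.

```python
import numpy as np

def stft_mag(signal, frame=64, hop=32):
    """Feature data: rows are time frames, columns are frequency components."""
    window = np.hanning(frame)
    starts = range(0, len(signal) - frame + 1, hop)
    frames = np.stack([signal[s:s + frame] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

def masks_from_model(bias):
    """Toy stand-in for the model: one weight per component, two sources."""
    m1 = 1.0 / (1.0 + np.exp(-bias))
    return np.stack([m1, 1.0 - m1])            # shape (2, n_freq)

def loss_and_grad(bias, mix_feat, target_feats):
    """Squared error of masked features vs. target features, and d(loss)/d(bias)."""
    masks = masks_from_model(bias)
    extracted = masks[:, None, :] * mix_feat[None, :, :]   # (2, frames, n_freq)
    err = extracted - target_feats
    loss = np.mean(err ** 2)
    d_m1 = 2.0 * np.sum((err[0] - err[1]) * mix_feat, axis=0) / err.size
    return loss, d_m1 * masks[0] * masks[1]    # chain rule through the sigmoid

t = np.arange(1024)
targets = [np.sin(2 * np.pi * 4 / 64 * t), np.sin(2 * np.pi * 12 / 64 * t)]
mix_feat = stft_mag(targets[0] + targets[1])   # processing assumed to be identity
target_feats = np.stack([stft_mag(s) for s in targets])

bias = np.zeros(mix_feat.shape[1])
loss_before, grad = loss_and_grad(bias, mix_feat, target_feats)
bias = bias - 0.5 * grad                       # one gradient-descent update
loss_after, _ = loss_and_grad(bias, mix_feat, target_feats)
```

A single update moves each mask toward the source that dominates its frequency component, which lowers the error between the extracted features and the target features. A practical model would predict the masks from the feature data itself rather than hold fixed per-component weights.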

performing predetermined processing on a target mixed signal indicating at least a plurality of target sounds to generate a processed target mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
extracting sounds from the processed target mixed signal using a utilization-side sound source separation model for extracting the plurality of processed target sounds indicated by the processed target mixed signal, thereby generating a plurality of utilization extraction signals, each representing a sound extracted from the processed target mixed signal and each corresponding to one of the plurality of processed target sounds indicated by the processed target mixed signal, the utilization-side sound source separation model being generated by:
performing predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the learning mixed signal;
extracting sounds from the processed learning mixed signal using a learning-side sound source separation model for extracting the plurality of processed target sounds indicated by the processed learning mixed signal, thereby generating a plurality of learning extraction signals, each representing an extracted sound and each corresponding to one of the plurality of processed target sounds indicated by the processed learning mixed signal;
performing, on each signal indicating one target sound among the plurality of target sounds indicated by the learning mixed signal, deformation processing that brings the one target sound closer to the processed target sound corresponding to it among the plurality of processed target sounds indicated by the processed learning mixed signal, thereby generating a plurality of deformed target sound signals, each indicating a deformed target sound derived from one of the plurality of target sounds indicated by the learning mixed signal; and
updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of deformed target sound signals, so that the extracted sounds approach the plurality of deformed target sounds.
A sound source separation method characterized by the above steps.

performing predetermined processing on a target mixed signal indicating at least a plurality of target sounds to generate a processed target mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds;
extracting utilization acoustic features, which are predetermined acoustic features, in a plurality of components from the processed target mixed signal, thereby generating utilization feature data that is time-series data of the extracted utilization acoustic features;
generating, using a utilization-side sound source separation model indicating a weight for each of the plurality of components in the utilization feature data, a plurality of utilization masks for extracting, from the utilization feature data, each of the plurality of processed target sounds indicated by the processed target mixed signal, the utilization-side sound source separation model being generated by:
performing predetermined processing on a learning mixed signal indicating at least a plurality of target sounds, thereby generating a processed learning mixed signal indicating at least a plurality of processed target sounds derived from the plurality of target sounds indicated by the learning mixed signal;
extracting learning acoustic features, which are predetermined acoustic features, in a plurality of components from the processed learning mixed signal, thereby generating learning feature data that is time-series data of the extracted learning acoustic features;
generating, using a learning-side sound source separation model that indicates a weight for each of the plurality of components in the learning feature data, a plurality of learning masks for extracting, from the learning feature data, each of the plurality of processed target sounds indicated by the processed learning mixed signal;
extracting sounds from the learning feature data using the plurality of learning masks, thereby generating a plurality of learning extraction signals, each representing an extracted sound and each corresponding to one of the plurality of processed target sounds indicated by the processed learning mixed signal;
performing, on each signal indicating one target sound among the plurality of target sounds indicated by the learning mixed signal, deformation processing that brings the one target sound closer to the processed target sound corresponding to it among the plurality of processed target sounds indicated by the processed learning mixed signal, thereby generating a plurality of deformed target sound signals, each indicating a deformed target sound derived from one of the plurality of target sounds indicated by the learning mixed signal; and
updating the learning-side sound source separation model, using the plurality of learning extraction signals and the plurality of deformed target sound signals, so that the extracted sounds approach the plurality of deformed target sounds; and
extracting sounds from the utilization feature data using the plurality of utilization masks, thereby generating a plurality of utilization extraction signals, each representing at least a sound extracted from the utilization feature data and each corresponding to one of the plurality of processed target sounds indicated by the processed target mixed signal.
A sound source separation method characterized by the above steps.
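For illustration only, the utilization-side steps (feature extraction, mask application, and generation of one extraction signal per target sound) might look like the sketch below. It assumes complex short-time spectra as the utilization feature data and uses fixed low-band and high-band masks in place of the output of a trained model. The names `frame_spectra` and `separate`, the frame sizes, and the band split are hypothetical and are not part of the disclosure.

```python
import numpy as np

def frame_spectra(signal, frame=64, hop=32):
    """Complex short-time spectra: rows are time frames, columns are frequency components."""
    window = np.hanning(frame)
    starts = range(0, len(signal) - frame + 1, hop)
    frames = np.stack([signal[s:s + frame] * window for s in starts])
    return np.fft.rfft(frames, axis=1)

def separate(processed_mixture, masks):
    """Weight every component of the feature data with each mask in turn."""
    spectra = frame_spectra(processed_mixture)
    return [mask * spectra for mask in masks]   # one extraction per mask

# Hypothetical masks standing in for a trained model's output.
n_freq = 33                                     # 64-sample frames give 33 rfft bins
low_band = (np.arange(n_freq) < 8).astype(float)
masks = [low_band, 1.0 - low_band]

t = np.arange(1024)
mixture = np.sin(2 * np.pi * 4 / 64 * t) + np.sin(2 * np.pi * 12 / 64 * t)
extracted = separate(mixture, masks)
```

Because the two masks sum to one in every component, the extractions add back up to the feature data of the mixture, and each extraction keeps the energy of the tone that falls in its band.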