JP2022186212A

JP2022186212A - Extraction device, extraction method, learning device, learning method, and program

Info

Publication number: JP2022186212A
Application number: JP2021094322A
Authority: JP
Inventors: Marc Delcroix; Tsubasa Ochiai; Keisuke Kinoshita; Tomohiro Nakatani; Morikoba Katerina
Original assignee: Brno Univ Of Technology; Nippon Telegraph and Telephone Corp
Current assignee: Brno Univ Of Technology; Nippon Telegraph and Telephone Corp
Priority date: 2021-06-04
Filing date: 2021-06-04
Publication date: 2022-12-15

Abstract

To extract a target voice from a mixed voice accurately and easily. SOLUTION: A generation unit 131a of an extraction device 10 generates data of a predetermined format from a mixed sound and activity information representing the times at which a target sound source emits sound. For example, the generation unit 131a generates a weighted sum of the output obtained by inputting the mixed sound into an auxiliary NN, weighted by the activity information, or a concatenated vector obtained by concatenating the activity information and the mixed sound. An extraction unit 131b extracts the sound of the target sound source from the mixed sound using the data generated by the generation unit 131a and an extraction network. SELECTED DRAWING: Figure 1

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested. arXiv website: https://arxiv.org/ (top page), https://arxiv.org/abs/2101.05516 (paper page), https://arxiv.org/pdf/2101.05516.pdf (paper PDF). Posted on the website on January 14, 2021.

The present invention relates to an extraction device, an extraction method, a learning device, a learning method, and a program.

SpeakerBeam is known as a technique for extracting the speech of a target speaker from a mixed speech signal containing the speech of a plurality of speakers (see, for example, Non-Patent Document 1).

For example, the method described in Non-Patent Document 1 has a main NN (neural network) that converts a mixed speech signal into the time domain and extracts the speech of the target speaker from the time-domain mixed speech signal, and an auxiliary NN that extracts features from the speech signal of the target speaker. By inputting the output of the auxiliary NN to an adaptation layer provided in the intermediate part of the main NN, the method estimates and outputs the speech signal of the target speaker contained in the time-domain mixed speech signal.

FIG. 13 is a diagram illustrating the conventional SpeakerBeam. As shown in FIG. 13, in the conventional SpeakerBeam, the mixed speech y_t is input to the extraction network, and the target speaker's speech a_t is input to the auxiliary network.

Furthermore, the time average e of the output of the auxiliary network (equation (1)) is input between the first extraction block and the second extraction block of the extraction network. Then, the target speech ^x_t is extracted from the mixed speech y_t by the mask ^m_t (^ directly above m) output from the extraction network.

Marc Delcroix, et al., “Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam”, https://arxiv.org/pdf/2001.08378.pdf

However, the conventional method has the problem that it may be unable to extract the target speech from the mixed speech accurately and easily.

For example, the method described in Non-Patent Document 1 requires the voice of the target speaker to be registered in advance. In addition, the characteristics of the target speaker's voice may diverge from those of the pre-registered voice, for example because of fatigue during a long meeting or a change in physical condition.

In order to solve the above-described problems and achieve the object, an extraction device includes: a generation unit that generates data in a predetermined format from a mixed sound and activity information representing the times at which a target sound source emitted sound; and an extraction unit that extracts the sound of the target sound source from the mixed sound using the data generated by the generation unit and an NN (neural network) for extraction.

A learning device includes: a generation unit that generates data in a predetermined format from a mixed sound and activity information representing the times at which a target sound source emitted sound; an extraction unit that extracts the sound of the target sound source from the mixed sound using the data generated by the generation unit and an NN (neural network) for extraction; and an update unit that updates the parameters of the NN for extraction so that a loss function calculated based on the sound extracted by the extraction unit is optimized.

According to the present invention, the target speech can be extracted from mixed speech accurately and easily.

FIG. 1 is a diagram showing a configuration example of an extraction device according to the first embodiment.
FIG. 2 is a diagram for explaining diarization.
FIG. 3 is a diagram for explaining activity information.
FIG. 4 is a diagram showing a configuration example of a model.
FIG. 5 is a diagram showing a configuration example of a model.
FIG. 6 is a diagram showing a configuration example of a model.
FIG. 7 is a diagram showing a configuration example of a learning device according to the first embodiment.
FIG. 8 is a flowchart showing the processing flow of the extraction device according to the first embodiment.
FIG. 9 is a flowchart showing the processing flow of the learning device according to the first embodiment.
FIG. 10 is a diagram showing experimental results.
FIG. 11 is a diagram showing experimental results.
FIG. 12 is a diagram showing an example of a computer that executes a program.
FIG. 13 is a diagram illustrating the conventional SpeakerBeam.

Embodiments of an extraction device, an extraction method, a learning device, a learning method, and a program according to the present application are described below in detail with reference to the drawings. The present invention is not limited to the embodiments described below.

[First Embodiment]
FIG. 1 is a diagram showing a configuration example of an extraction device according to the first embodiment. As shown in FIG. 1, the extraction device 10 has an interface unit 11, a storage unit 12, and a control unit 13.

The extraction device 10 accepts input of mixed speech containing speech from a plurality of sound sources. The extraction device 10 then extracts the speech of the target sound source from the mixed speech and outputs it.

In this embodiment, the sound source is assumed to be a speaker. In this case, the mixed speech is a mixture of speech uttered by a plurality of speakers. For example, the mixed speech is obtained by recording, with a microphone, a meeting in which a plurality of speakers participate. "Sound source" in the following description may be read as "speaker" where appropriate.

Note that this embodiment can handle not only the voice uttered by a speaker but also sound from any sound source. For example, the extraction device 10 can accept input of a mixed sound whose sources are acoustic events such as the sound of a musical instrument or a car siren, and extract and output the sound of the target sound source. "Speech" in the following description may likewise be read as "sound" where appropriate.

The interface unit 11 is an interface for inputting and outputting data. For example, the interface unit 11 is a NIC (Network Interface Card). The interface unit 11 may also be connected to an output device such as a display and an input device such as a keyboard.

The storage unit 12 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. The storage unit 12 may also be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 12 stores an OS (Operating System) and various programs executed by the extraction device 10.

As shown in FIG. 1, the storage unit 12 stores model information 121. The model information 121 consists of parameters and the like for constructing a model. For example, the model information 121 includes the weights and biases for constructing each of the neural networks described later.

The control unit 13 controls the extraction device 10 as a whole. The control unit 13 is, for example, an electronic circuit such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a GPU (Graphics Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 13 also has an internal memory for storing programs that define various processing procedures and control data, and executes each process using the internal memory.

The control unit 13 functions as various processing units when various programs run. For example, the control unit 13 has a signal processing unit 131.

The signal processing unit 131 uses a model constructed from the model information 121 to extract the target speech from the mixed speech. The model constructed from the model information 121 is assumed to have been trained by a learning device. The signal processing unit 131 has a generation unit 131a and an extraction unit 131b.

The generation unit 131a generates data in a predetermined format from a mixed sound and activity information representing the times at which the target sound source emitted sound.

The activity information is described here. The activity information is information representing the times at which the target sound source emitted sound.

For example, the activity information can be obtained by known techniques such as visual-based voice activity detection (VAD) (Reference 1), personal VAD (Reference 2), and diarization (References 3 to 5).
Reference 1: P. Liu and Z. Wang, “Voice activity detection using visual information,” in Proc. of ICASSP'04, 2004, vol. 1, pp. I-609.
Reference 2: S. Ding, Q. Wang, S.-Y. Chang, L. Wan, and I. Lopez Moreno, “Personal VAD: Speaker-Conditioned Voice Activity Detection,” in Proc. of Odyssey'20, 2020, pp. 433-439.
Reference 3: D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” Proc. of ICASSP'17, pp. 4930-4934, 2017.
Reference 4: Z. Huang, S. Watanabe, Y. Fujita, P. Garcia, Y. Shao, D. Povey, and S. Khudanpur, “Speaker diarization with region proposal network,” Proc. of ICASSP'20, pp. 6514-6518, 2020.
Reference 5: I. Medennikov, M. Korenevsky, T. Prisyach, Y. Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. V. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, “Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario,” ArXiv, vol. abs/2005.07272, 2020.

For example, visual-based VAD is effective when video of the target speaker is available, and personal VAD is effective when audio of the target speaker is available. When neither video nor audio of the target speaker is available, diarization is used.

In particular, combining recent high-accuracy diarization with this embodiment makes it possible to improve speech recognition accuracy in a practical and simple manner in situations such as meetings where overlap occurs.

Here, overlap is a state in which, in the mixed speech, the speech of the target speaker and the speech of someone other than the target speaker coincide. A time interval in which the activity information marks the target speaker as active may or may not contain overlap.

FIG. 2 is a diagram for explaining diarization. As shown in FIG. 2, diarization makes it possible to identify the time periods in which each of speaker A, speaker B, and speaker C spoke.

FIG. 3 is a diagram for explaining activity information. As shown in FIG. 3, the activity information may take the value 1 in time intervals in which the sound source is emitting sound and the value 0 in time intervals in which it is not. In this case, p_t ∈ {0, 1}.
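As a minimal sketch of the 0/1 representation described above, the following Python function turns a list of time segments (such as those a diarization system might output) into a frame-level activity sequence p_t. The function name, the (start, end) segment format in seconds, and the frame_shift parameter are assumptions made for illustration; none of them appear in the source.

```python
def segments_to_activity(segments, num_frames, frame_shift):
    """Convert (start_sec, end_sec) segments into a 0/1 activity sequence.

    frame_shift is the frame hop in seconds (an assumed parameter).
    """
    activity = [0] * num_frames
    for start, end in segments:
        # Map segment boundaries to frame indices and clip to the signal length.
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            activity[t] = 1
    return activity


# Hypothetical segments for one speaker: active in 0.2-0.5 s and 0.8-1.0 s.
p = segments_to_activity([(0.2, 0.5), (0.8, 1.0)], num_frames=10, frame_shift=0.1)
```

Overlapping or out-of-range segments are handled by clipping and by simply setting the same frame to 1 more than once.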

The extraction unit 131b uses the data generated by the generation unit 131a and the extraction network to extract the sound of the target sound source from the mixed sound.

Here, the model constructed based on the model information 121 may be called ADEnet (a speaker activity driven speech extraction neural network). The generation unit 131a and the extraction unit 131b perform processing using ADEnet. Several variations of ADEnet are described below.

Although the ADEnet of this embodiment takes activity information as input, the method of acquiring the activity information is not limited to a specific method, and any method may be used.

[ADEnet-auxiliary]
In ADEnet-auxiliary, the generation unit 131a generates a weighted sum of the output obtained by inputting the mixed sound to the auxiliary network and the activity information. The extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the mixed sound and the weighted sum to the extraction network.

FIG. 4 is a diagram showing a configuration example of a model. The model in FIG. 4 (ADEnet-auxiliary) uses an extraction network and an auxiliary network. Both the extraction network and the auxiliary network are neural networks.

As shown in FIG. 4, the generation unit 131a inputs the mixed speech y_t to the auxiliary network. The generation unit 131a then calculates, by equation (2), the weighted sum e of the output of the auxiliary network, using the activity information p_t as the weights.

Furthermore, the extraction unit 131b inputs the mixed speech y_t and the weighted sum e to the extraction network to obtain the mask ^m_t. The extraction unit 131b then calculates the element-wise product of the mixed speech y_t and the mask ^m_t as the target speech ^x_t.
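The two computations in this variant can be sketched in Python as follows. Equation (2) itself is not reproduced in this text, so the normalization by the sum of the activity values is an assumption, as are the function names and the representation of signals as lists of per-frame feature lists. The auxiliary and extraction networks are not modeled here; the sketch only shows the weighted sum and the mask application.

```python
def weighted_sum(aux_outputs, activity):
    """Activity-weighted average of the per-frame auxiliary outputs.

    Dividing by the sum of the activity values is an assumption;
    equation (2) is not available in this text.
    """
    total = sum(activity)
    if total == 0:
        raise ValueError("the target source is never active")
    dim = len(aux_outputs[0])
    return [
        sum(p * z[d] for p, z in zip(activity, aux_outputs)) / total
        for d in range(dim)
    ]


def apply_mask(mixture, mask):
    """Element-wise product of the mixture y_t and the mask ^m_t."""
    return [[y * m for y, m in zip(y_t, m_t)] for y_t, m_t in zip(mixture, mask)]


# Hypothetical auxiliary outputs for three frames; the source is active in frames 0 and 2.
e = weighted_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 0, 1])
x_hat = apply_mask([[2.0, 4.0]], [[0.5, 0.25]])
```

With binary activity, the weighted sum reduces to an average over only the frames in which the target source is active.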

[ADEnet-input]
In ADEnet-input, the generation unit 131a generates a concatenated vector by concatenating the activity information and the mixed sound, both of which are represented as vectors. The extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the concatenated vector to the extraction network.

FIG. 5 is a diagram showing a configuration example of a model. The model in FIG. 5 (ADEnet-input) uses an extraction network.

As shown in FIG. 5, the generation unit 131a concatenates the mixed speech y_t and the activity information p_t to obtain [y_t^T, p_t]^T.

The extraction unit 131b then inputs [y_t^T, p_t]^T to the extraction network to obtain the mask ^m_t, and calculates the element-wise product of the mixed speech y_t and the mask ^m_t as the target speech ^x_t.
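The concatenation step can be sketched as below, assuming each frame of the mixed speech is a list of feature values and the activity is one scalar per frame. The function name is hypothetical.

```python
def concat_activity(mixture, activity):
    """Append the activity value p_t to each frame y_t, giving [y_t^T, p_t]^T."""
    return [list(y_t) + [float(p_t)] for y_t, p_t in zip(mixture, activity)]


# Two frames with two features each; the source is active only in the first frame.
network_input = concat_activity([[0.1, 0.2], [0.3, 0.4]], [1, 0])
```

Each frame grows by one dimension, so the input layer of the extraction network would need to accept one more feature than the mixture alone provides.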

[ADEnet-mix]
In ADEnet-mix, the generation unit 131a generates, in addition to the concatenated vector, a weighted sum of the output obtained by inputting the mixed sound to the auxiliary network and the activity information. The extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the concatenated vector and the weighted sum to the extraction network.

FIG. 6 is a diagram showing a configuration example of a model. The model in FIG. 6 (ADEnet-mix) uses an extraction network and an auxiliary network.

As shown in FIG. 6, the generation unit 131a concatenates the mixed speech y_t and the activity information p_t to obtain [y_t^T, p_t]^T.

The extraction unit 131b then inputs the weighted sum e and [y_t^T, p_t]^T to the extraction network to obtain the mask ^m_t, and calculates the element-wise product of the mixed speech y_t and the mask ^m_t as the target speech ^x_t.
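The data flow of this variant can be sketched as one Python function that takes the two networks as callables. Everything here is illustrative: the function and argument names are hypothetical, the normalization of the weighted sum is an assumption, and the two stand-in "networks" are trivial placeholders rather than trained models.

```python
def adenet_mix(mixture, activity, aux_net, extraction_net):
    """Sketch of the ADEnet-mix data flow with caller-supplied networks."""
    # Concatenated input [y_t^T, p_t]^T for every frame.
    concat = [list(y_t) + [float(p_t)] for y_t, p_t in zip(mixture, activity)]
    # Activity-weighted sum e of the auxiliary outputs (normalization assumed;
    # the source must be active in at least one frame).
    aux_out = aux_net(mixture)
    total = sum(activity)
    e = [
        sum(p * z[d] for p, z in zip(activity, aux_out)) / total
        for d in range(len(aux_out[0]))
    ]
    # The extraction network maps (concat, e) to a mask shaped like the mixture.
    mask = extraction_net(concat, e)
    # Element-wise product of the mixture and the mask gives the estimate.
    return [[y * m for y, m in zip(y_t, m_t)] for y_t, m_t in zip(mixture, mask)]


def identity_aux(frames):
    """Placeholder auxiliary network: returns its input unchanged."""
    return frames


def gate_by_activity(concat, e):
    """Placeholder extraction network: masks each frame by its activity flag."""
    return [[frame[-1]] * (len(frame) - 1) for frame in concat]


x_hat = adenet_mix([[1.0, 2.0], [3.0, 4.0]], [1, 0], identity_aux, gate_by_activity)
```

Passing the networks as arguments keeps the sketch independent of any particular architecture; the placeholder extraction network ignores e, which a real one would not.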

As described here, the extraction unit 131b can calculate the target speech using a mask. Alternatively, the extraction unit 131b may calculate the target speech directly with the extraction network or the like, without using a mask.

The configuration of the learning device is now described with reference to FIG. 7. FIG. 7 is a diagram showing a configuration example of a learning device according to the first embodiment. As shown in FIG. 7, the learning device 20 has an interface unit 21, a storage unit 22, and a control unit 23.

The interface unit 21 is an interface for inputting and outputting data. For example, the interface unit 21 is a NIC. The interface unit 21 may also be connected to an output device such as a display and an input device such as a keyboard.

The storage unit 22 is a storage device such as an HDD, an SSD, or an optical disc. The storage unit 22 may also be a rewritable semiconductor memory such as a RAM, a flash memory, or an NVSRAM. The storage unit 22 stores the OS and various programs executed by the learning device 20.

As shown in FIG. 7, the storage unit 22 stores model information 221. The model information 221 consists of parameters and the like for constructing a model. For example, the model information 221 includes the weights and biases for constructing each neural network.

The model information 221 may be updated by the learning device 20 and passed to the extraction device 10. The extraction device 10 stores the received model information 221 as the model information 121.

The control unit 23 controls the learning device 20 as a whole. The control unit 23 is, for example, an electronic circuit such as a CPU, an MPU, or a GPU, or an integrated circuit such as an ASIC or an FPGA. The control unit 23 also has an internal memory for storing programs that define various processing procedures and control data, and executes each process using the internal memory.

The control unit 23 functions as various processing units when various programs run. For example, the control unit 23 has a signal processing unit 231, a loss calculation unit 232, and an update unit 233.

The signal processing unit 231 uses a model constructed from the model information 221 to extract the target speech from the mixed speech. The signal processing unit 231 has a generation unit 231a and an extraction unit 231b.

The generation unit 231a generates data in a predetermined format from a mixed sound and activity information representing the times at which the target sound source emitted sound. The extraction unit 231b uses the data generated by the generation unit 231a and the extraction network to extract the sound of the target sound source from the mixed sound.

The generation unit 231a and the extraction unit 231b perform the same processing as the generation unit 131a and the extraction unit 131b, respectively.

The loss calculation unit 232 calculates a loss function based on the sound extracted by the extraction unit 231b. For example, the loss function is the signal-to-noise ratio (SiSNR) between the sound ^x_t extracted by the extraction unit 231b and the ground truth x_t included in the training data. The loss function is not limited to the signal-to-noise ratio and may be, for example, the signal-to-distortion ratio (SDR) or the mean square error (MSE).
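The following Python sketch shows one way such a loss could be computed, assuming that SiSNR denotes the scale-invariant signal-to-noise ratio as it is commonly defined (projection of the zero-mean estimate onto the zero-mean reference). The source does not give the formula, so the definition, the function name, and the eps stabilizer are assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(reference)
    # Remove the mean of each signal.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [v - est_mean for v in estimate]
    ref = [v - ref_mean for v in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(est, ref))
    ref_energy = sum(b * b for b in ref) + eps
    target = [dot / ref_energy * b for b in ref]
    # Whatever is left after the projection is treated as noise.
    noise = [a - t for a, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


reference = [1.0, -1.0, 1.0, -1.0]
noisy = [1.0, -0.5, 1.0, -1.0]
# A higher SiSNR is better, so the negative value is used as the loss to minimize.
loss = -si_snr(noisy, reference)
```

Because of the projection, rescaling the estimate leaves the value essentially unchanged, which is the property that distinguishes this measure from a plain SNR.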

The update unit 233 updates the parameters of the NN for extraction so that the loss function calculated based on the sound extracted by the extraction unit 231b is optimized. The update unit 233 can update the parameters by a known technique such as error backpropagation.

[Processing Flow of the First Embodiment]
FIG. 8 is a flowchart showing the processing flow of the extraction device according to the first embodiment. As shown in FIG. 8, the extraction device 10 first generates data in a predetermined format from the mixed speech and the activity information (step S101).

Next, the extraction device 10 extracts the target speech from the mixed speech using the generated data and the extraction network (step S102).

FIG. 9 is a flowchart showing the processing flow of the learning device according to the first embodiment. As shown in FIG. 9, the learning device 20 first generates data in a predetermined format from the mixed speech and the activity information (step S201).

Next, the learning device 20 extracts the target speech from the mixed speech using the generated data and the extraction network (step S202).

The learning device 20 then calculates the loss function used to optimize the network (step S203), and updates the parameters of the network so that the loss function is optimized (step S204).

When the learning device 20 determines that the parameters have converged (step S205, Yes), it ends the processing. When the learning device 20 determines that the parameters have not converged (step S205, No), it returns to step S201 and repeats the processing.
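The loop formed by steps S201 to S205 can be illustrated with a deliberately tiny stand-in model. In this sketch the "network" is a single hypothetical gain parameter w applied to the active frames, the loss is the MSE (one of the alternatives the text mentions), the update is plain gradient descent rather than backpropagation through a real network, and convergence is judged by the size of the parameter change. All names and default values are assumptions.

```python
def train(mixture, activity, target, lr=0.1, tol=1e-6, max_iters=1000):
    """Toy training loop mirroring steps S201-S205 with one parameter."""
    w = 0.0
    loss = float("inf")
    for _ in range(max_iters):
        # S201: build the model input from the mixture and the activity.
        inputs = list(zip(mixture, activity))
        # S202: "extract" the target with the toy model x_hat_t = w * p_t * y_t.
        estimate = [w * p * y for y, p in inputs]
        # S203: mean square error between the estimate and the ground truth.
        loss = sum((xh - x) ** 2 for xh, x in zip(estimate, target)) / len(target)
        # S204: gradient descent step on w.
        grad = sum(
            2 * (xh - x) * p * y for xh, x, (y, p) in zip(estimate, target, inputs)
        ) / len(target)
        new_w = w - lr * grad
        # S205: stop once the parameter change is negligible.
        if abs(new_w - w) < tol:
            return new_w, loss
        w = new_w
    return w, loss


# Synthetic data in which the target is half of the mixture on active frames.
w, final_loss = train([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1], [0.5, 0.0, 1.5, 2.0])
```

The returned loss is the value computed just before the final update, so it corresponds to the parameter from the previous iteration.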

[Effects of the First Embodiment]
As described above, the generation unit 131a generates data in a predetermined format from a mixed sound and activity information representing the times at which the target sound source emitted sound. The extraction unit 131b uses the data generated by the generation unit 131a and the extraction network to extract the sound of the target sound source from the mixed sound.

Thus, the extraction device 10 extracts the target sound using the activity information. Therefore, for example, the extraction processing by the extraction device 10 does not require the target speaker's voice to be registered in advance. In addition, the extraction processing is less affected by changes in the characteristics of the target speaker's voice. As a result, according to this embodiment, the target speech can be extracted from the mixed speech accurately and easily.

Conventionally, the data used as clues for speaker separation have taken various forms. For example, about 10 seconds of the speaker's voice, video of the speaker's face, or a combination of audio and video is used as the clue, and a model has to be prepared for each form.

In contrast, in this embodiment, it suffices to prepare a model that handles activity information, regardless of the form of the data that serves as the clue.

生成部１３１ａは、混合音を補助ネットワークに入力して得られる出力とアクティビティ情報との重み付き和を生成する。また、抽出部１３１ｂは、混合音と重み付き和とを抽出用ネットワークに入力することによって、混合音から目的音源の音を抽出する。この方法は、図４のADEnet-auxiliaryに相当する。 The generation unit 131a generates a weighted sum of the output obtained by inputting the mixed sound to the auxiliary network and the activity information. Also, the extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the mixed sound and the weighted sum to the extraction network. This method corresponds to ADEnet-auxiliary in FIG.

ADEnet-auxiliaryは、特にオーバーラップを除外できれば、スピーカービームにおける混合音声と事前に登録される目的話者の音声との齟齬を取り除いたのと同等の性能を得ることができる。 ADEnet-auxiliary can obtain the same performance as removing the discrepancy between the mixed speech in the speaker beam and the pre-registered target speaker's speech, especially if the overlap can be eliminated.

The generation unit 131a generates a combined vector by concatenating the activity information and the mixed sound, both of which are represented as vectors. The extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the combined vector to the extraction network. This method corresponds to ADEnet-input in FIG. 5.

Because ADEnet-input does not require an auxiliary network, it allows a simpler configuration than ADEnet-auxiliary.
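The combined vector of ADEnet-input can be sketched as a per-frame concatenation. The sketch assumes that the mixed sound is given as a matrix of frame-level features and that the activity information is one value per frame, appended as an extra feature dimension. The function name `combine_activity_and_mixture` and the shapes are hypothetical.

```python
import numpy as np

def combine_activity_and_mixture(mixture_feats, activity):
    """Append the activity value of each frame to that frame's features.

    mixture_feats: array of shape (T, F), features of the mixed sound.
    activity:      array of shape (T,), activity of the target source.
    Returns an array of shape (T, F + 1).
    """
    mixture_feats = np.asarray(mixture_feats, dtype=float)
    activity = np.asarray(activity, dtype=float).reshape(-1, 1)
    if mixture_feats.shape[0] != activity.shape[0]:
        raise ValueError("mixture_feats and activity must cover the same frames")
    # Assumption: activity is appended as the last feature column.
    return np.concatenate([mixture_feats, activity], axis=1)
```

The resulting matrix would then be the only input to the extraction network, which is why no auxiliary network is needed in this configuration.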

The generation unit 131a further generates a weighted sum of the activity information and the output obtained by inputting the mixed sound to the auxiliary network. The extraction unit 131b extracts the sound of the target sound source from the mixed sound by inputting the combined vector and the weighted sum to the extraction network. This method corresponds to ADEnet-mix in FIG. 6.

ADEnet-mix can reflect the characteristics of both the combined vector and the weighted sum in the extraction result.

[Experimental results]
Experiments conducted using this embodiment are described below. In the experiments, activity information for training was prepared by adding noise to the oracle speaker activity estimated from the teacher data. The oracle speaker activity is the activity of the target speech estimated (extracted) from the ground-truth speech data (teacher data) using a voice activity detection method (for example, Reference 6).
Reference 6: "https://github.com/wiseman/py-webrtcvad"

Specifically, overlapping time intervals were removed from the oracle speaker activity, and the start and end points of each time interval in which the target speaker was active were shifted by values sampled from the range of -1 second to 1 second.
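The noise procedure described above can be sketched as follows. The sketch assumes that activity is represented as a list of (start, end) intervals in seconds, that the shifts are drawn from a uniform distribution, and that shifted intervals are clipped at time zero and dropped if they become empty. None of these details are stated in this section, and the function names are hypothetical.

```python
import random

def remove_overlaps(target_segments, overlap_segments):
    """Cut every overlap interval out of the target speaker's segments."""
    result = []
    for start, end in target_segments:
        pieces = [(start, end)]
        for o_start, o_end in overlap_segments:
            remaining = []
            for p_start, p_end in pieces:
                if o_end <= p_start or o_start >= p_end:
                    remaining.append((p_start, p_end))  # no intersection
                    continue
                if p_start < o_start:
                    remaining.append((p_start, o_start))  # part before overlap
                if o_end < p_end:
                    remaining.append((o_end, p_end))  # part after overlap
            pieces = remaining
        result.extend(pieces)
    return result

def perturb_boundaries(segments, max_shift=1.0, rng=None):
    """Shift each start and end point by a value in [-max_shift, max_shift]."""
    rng = rng or random.Random()
    noisy = []
    for start, end in segments:
        # Assumption: uniform sampling; the source only gives the range.
        new_start = max(0.0, start + rng.uniform(-max_shift, max_shift))
        new_end = end + rng.uniform(-max_shift, max_shift)
        if new_end > new_start:  # assumption: drop intervals that collapse
            noisy.append((new_start, new_end))
    return noisy
```

For example, `perturb_boundaries(remove_overlaps(segments, overlaps))` would first exclude the overlapping regions and then jitter the remaining boundaries.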

First, FIG. 10 compares the SDR (signal-to-distortion ratio) obtained when the oracle speaker activity is used as is with the SDR obtained when noise is added. FIG. 10 is a diagram showing experimental results.
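For readers unfamiliar with the metric, a signal-to-distortion ratio can be sketched in its simplest form as the energy of the reference signal divided by the energy of the residual, expressed in decibels. This section does not state which SDR variant was used in the experiments, so the definition below and the function name `sdr_db` are illustrative assumptions only.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simple SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    reference: ground-truth target signal s.
    estimate:  extracted signal s_hat, same length as reference.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")
    signal_power = np.sum(reference ** 2)
    distortion_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((signal_power + eps) / (distortion_power + eps))
```

A higher value means that the extracted signal is closer to the reference.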

"Noisy activity training" indicates whether noise was added to the training data. "Activity signal at test time" gives the SDR according to whether overlap was present at extraction time and whether noise was added (Oracle or +Noise).

As shown in FIG. 10, some models of this embodiment exceed the SDR of SpeakerBeam in some conditions. The SDR of SpeakerBeam is 9.4. The experiment of FIG. 10 uses the LibriSpeech corpus (V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. of ICASSP'15, 2015, pp. 5206-5210).

FIG. 10 also shows that adding noise to the training data (Noisy activity training checked) improves the robustness of the model.

Next, FIG. 11 shows the results of an experiment using data simulating a meeting scenario (Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, and J. Li, "Continuous speech separation: Dataset and analysis," Proc. of ICASSP'20, pp. 7284-7288, 2020). FIG. 11 is a diagram showing experimental results.

In the experiment of FIG. 11, activity information obtained by TS-VAD, described in Reference 5, was used. FIG. 11 shows the cpWER (concatenated minimum-permutation word error rate), that is, an evaluation value that includes errors due to diarization.

FIG. 11 shows that, as the overlap ratio increases, using ADEnet becomes more advantageous than not using ADEnet.

[System configuration, etc.]
Each component of each illustrated device is a functional concept and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed by each device can be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or as hardware using wired logic.

Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
As one embodiment, the extraction device 10 and the learning device 20 can be implemented by installing, on a desired computer, a program that executes the above-described audio signal extraction processing or learning processing as packaged software or online software. For example, by causing an information processing device to execute the program for the extraction processing described above, the information processing device can be made to function as the extraction device 10. The information processing device referred to here includes desktop and notebook personal computers. The category of information processing devices also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The extraction device 10 and the learning device 20 can also be implemented as a server device that treats a terminal device used by a user as a client and provides the client with a service related to the above-described audio signal extraction processing or learning processing. For example, the server device is implemented as a server device that provides a service that takes a mixed speech signal as input and extracts the speech signal of the target speaker. In this case, the server device may be implemented as a web server, or as a cloud that provides the service through outsourcing.

FIG. 12 is a diagram illustrating an example of a computer that executes the program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the extraction device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored, for example, in the hard disk drive 1090. For example, the program module 1093 for executing processing equivalent to the functional configuration of the extraction device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090. They may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like), and may be read from that computer by the CPU 1020 via the network interface 1070.

10 Extraction device
20 Learning device
11, 21 Interface unit
12, 22 Storage unit
13, 23 Control unit
121, 221 Model information
131, 231 Signal processing unit
131a, 231a Generation unit
131b, 231b Extraction unit

Claims

1. An extraction device comprising:
a generation unit that generates data in a predetermined format from a mixed sound and activity information representing the time at which a target sound source emitted sound; and
an extraction unit that extracts the sound of the target sound source from the mixed sound using the data generated by the generation unit and a neural network (NN) for extraction.

2. The extraction device according to claim 1, wherein
the generation unit generates a weighted sum of the activity information and an output obtained by inputting the mixed sound to an auxiliary NN, and
the extraction unit extracts the sound of the target sound source from the mixed sound by inputting the mixed sound and the weighted sum to the NN for extraction.

3. The extraction device according to claim 1, wherein
the generation unit generates a combined vector combining the activity information and the mixed sound, both of which are represented by vectors, and
the extraction unit extracts the sound of the target sound source from the mixed sound by inputting the combined vector to the NN for extraction.

4. The extraction device according to claim 3, wherein
the generation unit further generates a weighted sum of the activity information and an output obtained by inputting the mixed sound to an auxiliary NN, and
the extraction unit extracts the sound of the target sound source from the mixed sound by inputting the combined vector and the weighted sum to the NN for extraction.

5. An extraction method performed by an extraction device, the method comprising:
a generation step of generating data in a predetermined format from a mixed sound and activity information representing the time at which a target sound source emitted sound; and
an extraction step of extracting the sound of the target sound source from the mixed sound using the data generated in the generation step and a neural network (NN) for extraction.

6. A learning device comprising:
a generation unit that generates data in a predetermined format from a mixed sound and activity information representing the time at which a target sound source emitted sound;
an extraction unit that extracts the sound of the target sound source from the mixed sound using the data generated by the generation unit and a neural network (NN) for extraction; and
an update unit that updates parameters of the NN for extraction so that a loss function calculated based on the sound extracted by the extraction unit is optimized.

7. A learning method performed by a learning device, the method comprising:
a generation step of generating data in a predetermined format from a mixed sound and activity information representing the time at which a target sound source emitted sound;
an extraction step of extracting the sound of the target sound source from the mixed sound using the data generated in the generation step and a neural network (NN) for extraction; and
an update step of updating parameters of the NN for extraction so that a loss function calculated based on the sound extracted in the extraction step is optimized.

8. A program for causing a computer to function as the extraction device according to any one of claims 1 to 4 or the learning device according to claim 6.