JP7453733B2

JP7453733B2 - Method and system for improving multi-device speaker diarization performance

Info

Publication number: JP7453733B2
Application number: JP2023001000A
Authority: JP
Inventors: ヒスホ; ハンヨンカン; ユジンキム; ハンギュキム; ソンギュムン; ボンジンイ; ジョンフンチャン; ジュンソンチョン; イクサンハン; ジェソンホ
Original assignee: Line Works; Naver Corp
Current assignee: Line Works; Naver Corp
Priority date: 2020-06-02
Filing date: 2023-01-06
Publication date: 2024-03-21
Anticipated expiration: 2040-12-09
Also published as: KR20210149336A; JP2023026657A; KR102396136B1; JP2021189424A

Description

The following description relates to speaker diarization technology.

Speaker diarization is a technique that segments an audio file containing the recorded utterances of multiple speakers into per-speaker utterance sections.

Speaker diarization detects speaker boundary sections in audio data, and is divided into distance-based and model-based methods according to whether prior knowledge about the speakers is used.

For example, Patent Document 1 (registered February 23, 2018) discloses a technique for generating a speaker recognition model that can distinguish speakers based on their voices without being affected by changes in the environment in which the voice is recognized or by the speaker's speaking state.

Such speaker diarization technology covers the various techniques that split utterances by speaker and record them automatically in situations where multiple speakers talk in no fixed order, such as meetings, interviews, transactions, and trials. It is used for purposes such as automatically creating meeting minutes.

Patent Document 1: Korean Registered Patent No. 10-1833731

Provided are a method and system that can improve speaker diarization performance by using multiple devices.

Provided are a method and system that can perform speaker diarization in a multi-device environment that uses the personal devices owned by each user.

Provided are a method and system that can estimate the number of speakers (number of clusters) based on reliability.

Provided is a speaker diarization method performed by a computer system, wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory, the speaker diarization method including: receiving, by the at least one processor, from a plurality of electronic devices, an audio file recorded by each electronic device; estimating, by the at least one processor, a number of candidate clusters based on an embedding matrix calculated for the audio file of each electronic device; determining, by the at least one processor, a final number of clusters using the numbers of candidate clusters of the electronic devices; and performing, by the at least one processor, speaker diarization clustering using the final number of clusters.

According to one aspect, the receiving may include performing endpoint detection (EPD) on the audio file of each electronic device, and generating an EPD union by integrating the EPD results of the electronic devices.

According to another aspect, the estimating may include calculating a similarity matrix (affinity matrix) by extracting embeddings from the EPD result of the audio file of each electronic device, and calculating the number of candidate clusters and a reliability value of the similarity matrix using the similarity matrix of each electronic device.
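
As an illustration of this aspect, the sketch below builds a similarity (affinity) matrix from per-segment embeddings. The use of cosine similarity, the function name `cosine_affinity`, and the toy two-dimensional embeddings are assumptions made for the example; the source does not specify the similarity measure or the embedding extractor.

```python
import numpy as np


def cosine_affinity(embeddings: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix of pairwise cosine similarities.

    `embeddings` holds one row per speech segment (for example, one row
    per EPD segment). Cosine similarity is an assumed choice.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Hypothetical embeddings: segments 0 and 1 are close, segment 2 is different.
segments = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
affinity = cosine_affinity(segments)
```

The resulting matrix is symmetric with ones on the diagonal, which is what makes the eigenvalue-based analysis described in the following aspects applicable.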

According to still another aspect, calculating the number of candidate clusters and the reliability value of the similarity matrix may include performing eigendecomposition on the similarity matrix to extract eigenvalues, and, after sorting the extracted eigenvalues, calculating the number of candidate clusters and the reliability value of the similarity matrix based on the differences between adjacent eigenvalues.

According to still another aspect, calculating the number of candidate clusters and the reliability value of the similarity matrix may include performing eigendecomposition on the similarity matrix to extract eigenvalues; after sorting the extracted eigenvalues, determining as the number of candidate clusters the number of eigenvalues selected on the basis of the differences between adjacent eigenvalues; and calculating the reliability value using the eigenvalues that remain unselected in the process of determining the number of candidate clusters.

According to still another aspect, calculating the reliability value using the remaining eigenvalues may determine the largest of the remaining eigenvalues as the reliability value of the similarity matrix.

According to still another aspect, calculating the reliability value using the remaining eigenvalues may determine the mean of the remaining eigenvalues as the reliability value of the similarity matrix.
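
The eigenvalue-based aspects above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: sorting in descending order, cutting at the single largest gap between adjacent eigenvalues, and the names `estimate_clusters` and `mode` are all choices made for the example. The two `mode` values correspond to the two reliability variants described above (largest remaining eigenvalue, or mean of the remaining eigenvalues).

```python
import numpy as np


def estimate_clusters(affinity: np.ndarray, mode: str = "max") -> tuple[int, float]:
    """Return (candidate cluster count, reliability value) for one matrix.

    Assumes a symmetric matrix with at least two rows.
    """
    # eigvalsh returns ascending eigenvalues; reverse to descending order.
    eigenvalues = np.linalg.eigvalsh(affinity)[::-1]
    gaps = eigenvalues[:-1] - eigenvalues[1:]  # differences between neighbours
    k = int(np.argmax(gaps)) + 1               # eigenvalues before the largest gap
    remaining = eigenvalues[k:]                # eigenvalues that were not selected
    reliability = remaining.max() if mode == "max" else remaining.mean()
    return k, float(reliability)


# Hypothetical 4-segment matrix with two clearly separated speakers.
toy = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
k_max, rel_max = estimate_clusters(toy, mode="max")
k_mean, rel_mean = estimate_clusters(toy, mode="mean")
```

For this toy matrix the eigenvalues are 2.1, 1.7, 0.1, and 0.1, so the largest gap falls after the second eigenvalue and both variants select two clusters.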

According to still another aspect, the estimating may further include applying a weighted sum to the similarity matrix based on weights learned for the EPD result of the audio file.

According to still another aspect, the determining may determine, as the final number of clusters, the number of candidate clusters estimated from the similarity matrix with the largest reliability value.

According to yet another aspect, the performing may include calculating a similarity matrix by extracting embeddings from the EPD result of the audio file of each electronic device, and averaging the similarity matrices of the electronic devices and performing the speaker diarization clustering based on the average similarity matrix and the final number of clusters.

Provided is a computer program recorded on a non-transitory computer-readable recording medium to cause the computer system to execute the speaker diarization method.

Provided is a non-transitory computer-readable recording medium on which a program for causing a computer to execute the speaker diarization method is recorded.

Provided is a computer system including at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor handles: a process of receiving, from a plurality of electronic devices, an audio file recorded by each electronic device; a process of estimating a number of candidate clusters based on an embedding matrix calculated for the audio file of each electronic device; a process of determining a final number of clusters using the numbers of candidate clusters of the electronic devices; and a process of performing speaker diarization clustering using the final number of clusters.

According to embodiments of the present invention, speaker diarization performance can be improved by using multiple devices.

According to embodiments of the present invention, speaker diarization can be performed in a multi-device environment that uses the personal devices owned by each user, without requiring additional equipment.

According to embodiments of the present invention, the number of speakers (number of clusters) can be estimated more accurately based on reliability.

FIG. 1 is a diagram illustrating an example of a network environment in an embodiment of the present invention.

FIG. 2 is a block diagram illustrating an example of the internal configuration of a computer system in an embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of components that a processor of the computer system may include in an embodiment of the present invention.

FIG. 4 is a flowchart illustrating an example of a speaker diarization method that the computer system can perform in an embodiment of the present invention.

FIG. 5 is a diagram illustrating an example of the overall process for speaker diarization in an embodiment of the present invention.

FIG. 6 is an exemplary diagram for explaining a process of merging speech regions recognized in individual audio files in an embodiment of the present invention.

FIG. 7 is an exemplary diagram for explaining a process of merging speech regions recognized in individual audio files in an embodiment of the present invention.

FIG. 8 is an exemplary diagram for explaining a process of determining the number of clusters in an embodiment of the present invention.

FIG. 9 is an exemplary diagram for explaining a process of performing speaker diarization clustering in an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Embodiments of the present invention relate to speaker diarization technology for detecting speaker boundary sections in audio data.

Embodiments including the matters specifically disclosed herein can improve speaker diarization performance by performing speaker diarization with multiple devices, and can reduce system construction costs by using the personal devices owned by each user.

FIG. 1 is a diagram illustrating an example of a network environment in an embodiment of the present invention. The network environment of FIG. 1 includes a plurality of electronic devices 110, 120, 130, and 140, a server 150, and a network 160. FIG. 1 is only an example for explaining the invention, and the number of electronic devices and the number of servers are not limited to those shown in FIG. 1.

The plurality of electronic devices 110, 120, 130, and 140 may be fixed or mobile terminals implemented by computer systems. Examples include smartphones, mobile phones, navigation devices, personal computers (PCs), notebook PCs, digital broadcast terminals, personal digital assistants (PDAs), portable multimedia players (PMPs), tablets, game consoles, wearable devices, Internet of Things (IoT) devices, virtual reality (VR) devices, and augmented reality (AR) devices. Although FIG. 1 shows a smartphone as an example of electronic device 110, in embodiments of the present invention electronic device 110 may be any of a variety of physical computer systems capable of communicating with the other electronic devices 120, 130, and 140 and/or server 150 via network 160 using a wireless or wired communication method.

The communication method is not limited, and may include not only communication methods that use the communication networks that network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, a broadcasting network, or a satellite network) but also short-range wireless communication between devices. For example, network 160 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. Network 160 may also include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

Server 150 may be implemented by one or more computer devices that communicate with electronic devices 110, 120, 130, and 140 via network 160 to provide instructions, code, files, content, services, and the like. For example, server 150 may be a system that provides a target service to the electronic devices 110, 120, 130, and 140 connected via network 160. As a more specific example, server 150 may provide the service targeted by an application (for example, a speech-recognition-based artificial intelligence meeting-minutes service) to electronic devices 110, 120, 130, and 140 through that application, which is a computer program installed and run on those devices.

FIG. 2 is a block diagram illustrating an example of a computer system in an embodiment of the present invention. Server 150 described with reference to FIG. 1 may be implemented by computer system 200 configured as shown in FIG. 2.

As shown in FIG. 2, computer system 200 may include a memory 210, a processor 220, a communication interface 230, and an input/output interface 240 as components for performing the speaker diarization method according to embodiments of the present invention.

Memory 210 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. A permanent mass storage device such as ROM or a disk drive may instead be included in computer system 200 as a permanent storage device separate from memory 210. Memory 210 may store an operating system and at least one program code. Such software components may be loaded into memory 210 from a computer-readable recording medium separate from memory 210, such as a floppy (registered trademark) drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. In other embodiments, software components may be loaded into memory 210 through communication interface 230 rather than from a computer-readable recording medium. For example, software components may be loaded into memory 210 of computer system 200 based on a computer program installed from files received via network 160.

Processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to processor 220 by memory 210 or communication interface 230. For example, processor 220 may be configured to execute instructions received according to program code recorded in a storage device such as memory 210.

Communication interface 230 may provide a function for computer system 200 to communicate with other devices via network 160. For example, requests, instructions, data, files, and the like generated by processor 220 of computer system 200 according to program code recorded in a storage device such as memory 210 may be transmitted to other devices via network 160 under the control of communication interface 230. Conversely, signals, instructions, data, files, and the like from other devices may be received by computer system 200 through communication interface 230 via network 160. Signals, instructions, data, and the like received through communication interface 230 may be passed to processor 220 or memory 210, and files and the like may be stored in a recording medium (the permanent storage device described above) that computer system 200 may further include.

The communication method is not limited, and may include not only communication methods that use the communication networks that network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcasting network) but also short-range wired/wireless communication between devices. For example, network 160 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. Network 160 may also include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

Input/output interface 240 may be a means for interfacing with input/output device 250. For example, input devices may include a microphone, a keyboard, a camera, and a mouse, and output devices may include a display and a speaker. As another example, input/output interface 240 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. Input/output device 250 may be configured as a single device together with computer system 200.

In other embodiments, computer system 200 may include fewer or more components than those shown in FIG. 2. However, most prior-art components need not be explicitly illustrated. For example, computer system 200 may be implemented to include at least some of the input/output devices 250 described above, or may further include other components such as a transceiver, a camera, various sensors, and a database.

Specific embodiments of a method and system for improving speaker diarization performance with multiple devices are described below.

FIG. 3 is a block diagram illustrating an example of components that a processor of the server may include in an embodiment of the present invention, and FIG. 4 is a flowchart illustrating an example of a method that the server can perform in an embodiment of the present invention.

Server 150 according to this embodiment serves as a service platform that provides an artificial intelligence service capable of organizing meeting-minutes audio files into documents through speaker diarization.

A speaker diarization system implemented by computer system 200 may be configured in server 150. Server 150 serves the plurality of electronic devices 110, 120, 130, and 140 as clients, and may provide the speech-recognition-based artificial intelligence meeting-minutes service through a dedicated application installed on electronic devices 110, 120, 130, and 140 or through access to a web/mobile site associated with server 150.

In particular, server 150 can improve speaker diarization performance by using multiple devices consisting of the personal devices owned by each user.

As shown in FIG. 3, processor 220 of server 150 may include a speech integration unit 310, a cluster determination unit 320, and a clustering execution unit 330 as components for performing the speaker diarization method of FIG. 4.

Depending on the embodiment, components of processor 220 may be selectively included in or excluded from processor 220. Also depending on the embodiment, components of processor 220 may be separated or merged to express the functions of processor 220.

Processor 220 and its components may control server 150 to perform steps 410 to 430 of the speaker diarization method of FIG. 4. For example, processor 220 and its components may be implemented to execute instructions according to the operating system code and at least one program code contained in memory 210.

Here, the components of processor 220 may be representations of different functions performed by processor 220 according to instructions provided by program code recorded in server 150. For example, speech integration unit 310 may be used as a functional representation of processor 220 that controls server 150 according to those instructions so that server 150 integrates the speech regions recognized for each device.

Processor 220 may read the necessary instructions from memory 210, into which instructions related to the control of server 150 have been loaded. In this case, the instructions read may include instructions for controlling processor 220 to perform steps 410 to 430 described below.

Steps 410 to 430 described below may be performed in an order different from that shown in FIG. 4, some of steps 410 to 430 may be omitted, and additional processes may be included.

Referring to FIG. 4, in step 410, speech integration unit 310 may receive, from each of the plurality of electronic devices 110, 120, 130, and 140, an audio file recorded by that device (hereinafter an "individual audio file"), and may integrate the speech regions recognized in the individual audio files.

This embodiment performs speaker diarization in a multi-device environment and may use, for example, a plurality of electronic devices 110, 120, 130, and 140 consisting of the personal devices owned by the users participating in a meeting.

A dedicated application or web/mobile site associated with server 150 may include a start button for starting participation in a meeting and an end button for ending participation, and may include a function that transmits the audio recorded by the device to server 150 in real time as soon as the start button is pressed.

This embodiment requires no additional equipment for recording meeting audio and transmitting it to server 150; it can use personal devices, such as smartphones and tablets, that meeting participants carry during the meeting. In particular, to improve speaker diarization performance, the equipment that records the meeting audio and transmits it to server 150 may be not a single device but multiple devices consisting of the personal devices of several participants.

After receiving the individual audio files from electronic devices 110, 120, 130, and 140, speech integration unit 310 integrates the speech sections extracted from each individual audio file. Because the detected speech regions can differ from device to device, the speech sections of all devices are integrated so that regions missed by a particular device are added and no section is left out.
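
The integration of per-device speech sections described above can be pictured as a union of time intervals. The sketch below assumes that each device's detection result is a list of `(start, end)` times in seconds on a timeline shared by all devices; the function name `merge_speech_regions` and the sample intervals are hypothetical, and the source does not describe how the recordings are time-aligned.

```python
def merge_speech_regions(
    per_device: list[list[tuple[float, float]]],
) -> list[tuple[float, float]]:
    """Return the union of all devices' speech intervals, sorted by start time."""
    intervals = sorted(iv for device in per_device for iv in device)
    merged: list[tuple[float, float]] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical detections: device B hears a region that device A missed.
device_a = [(0.0, 2.0), (5.0, 7.0)]
device_b = [(1.5, 3.0), (9.0, 10.0)]
union = merge_speech_regions([device_a, device_b])
```

In this example the region from 9.0 to 10.0 seconds, detected only by the second device, is preserved in the union, which is the gap-filling behavior the paragraph describes.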

In step 420, cluster determination unit 320 may estimate, for each individual audio file, a number of candidate clusters based on the embedding matrix calculated for that file (hereinafter an "individual embedding matrix"), and may then determine the final number of clusters based on the reliability of the individual embedding matrices.

Cluster determination unit 320 may estimate the number of clusters independently for each individual audio file and then determine the final number of clusters from among the estimated numbers.

In particular, in the process of estimating the number of candidate clusters for each individual audio file, cluster determination unit 320 may also calculate a reliability, and may determine as the final number of clusters the number of candidate clusters estimated from the individual audio file with the highest reliability.
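
The selection rule in the preceding paragraph amounts to an arg-max over the per-file reliabilities. A minimal sketch follows; the device labels, the candidate counts, and the reliability values are invented for illustration, and `pick_final_cluster_count` is a hypothetical name.

```python
def pick_final_cluster_count(estimates: dict[str, tuple[int, float]]) -> int:
    """Return the candidate count from the file with the highest reliability.

    `estimates` maps a device label to (candidate cluster count, reliability).
    """
    best_device = max(estimates, key=lambda device: estimates[device][1])
    return estimates[best_device][0]


# Hypothetical per-device estimates: (candidate clusters, reliability).
estimates = {
    "device_110": (3, 0.42),
    "device_120": (4, 0.61),
    "device_130": (3, 0.17),
}
final_k = pick_final_cluster_count(estimates)
```

Note that the rule is not a majority vote: here two devices suggest three clusters, but the count from the most reliable file is the one that is kept.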

クラスタ数を決定する具体的な過程については、以下でさらに詳しく説明する。 The specific process of determining the number of clusters will be described in more detail below.

段階４３０で、クラスタリング実行部３３０は、段階４１０で統合された音声領域に対して計算された埋め込み行列と、段階４２０で決定された最終クラスタ数を利用して、話者ダイアライゼーションのためのクラスタリングを実行してよい。 In step 430, the clustering execution unit 330 may perform clustering for speaker diarization using the embedding matrix calculated for the speech regions integrated in step 410 and the final number of clusters determined in step 420.

クラスタリング実行部３３０は、各機器の音声ファイルに対する個別埋め込み行列を平均した平均埋め込み行列を求めてよく、平均埋め込み行列と最終クラスタ数に基づいて話者ダイアライゼーションクラスタリングを実行してよい。 The clustering execution unit 330 may obtain an average embedding matrix by averaging the individual embedding matrices for the audio files of each device, and may perform speaker diarization clustering based on the average embedding matrix and the final number of clusters.

したがって、本実施形態では、クラスタ数の推定と話者ダイアライゼーションクラスタリングを、同じ埋め込み行列ではなく別の埋め込み行列に基づいて実行することができ、クラスタ数の推定は個別埋め込み行列を利用し、話者ダイアライゼーションクラスタリングは平均埋め込み行列を利用することができる。 Therefore, in this embodiment, the estimation of the number of clusters and the speaker diarization clustering can be performed based on different embedding matrices rather than the same one: the estimation of the number of clusters uses the individual embedding matrices, while the speaker diarization clustering uses the average embedding matrix.

図５は、本発明の一実施形態における、話者ダイアライゼーションの全体的な過程の一例を示した図である。 FIG. 5 is a diagram illustrating an example of the overall process of speaker diarization in an embodiment of the present invention.

図５を参照すると、話者ダイアライゼーション過程は、各電子機器１１０、１２０、１３０、１４０から受信した個別音声ファイルごとに独立的に実行される独立過程と、個別音声ファイルを統合して実行される統合過程とで構成されてよい。 Referring to FIG. 5, the speaker diarization process may consist of an independent process, performed separately for each individual audio file received from the electronic devices 110, 120, 130, and 140, and an integrated process, performed by combining the individual audio files.

音声統合部３１０は、会議中に会議に参加する複数の参加者の個人機器である電子機器１１０、１２０、１３０、１４０から、会議参加者の位置で録音された音声ファイル（個別音声ファイル）を受信する（Ｓ５１）。 The audio integration unit 310 receives, from the electronic devices 110, 120, 130, and 140, which are the personal devices of the multiple participants in a conference, the audio files (individual audio files) recorded during the conference at each participant's location (S51).

音声統合部３１０は、それぞれの個別音声ファイルに対して独立的にＥＰＤ（ｅｎｄｐｏｉｎｔｄｅｔｅｃｔｉｏｎ）過程を実行する（Ｓ５２）。ＥＰＤとは、無音区間に該当するフレームから音響特徴を取り除いた後に、各フレームのエネルギーを測定することによって音声／無音を区分した発声の始めと終わりを探索することを意味する。言い換えれば、音声統合部３１０は、個別音声ファイルで音声のある領域を探索するＥＰＤを実行する。 The audio integration unit 310 independently performs an EPD (endpoint detection) process on each individual audio file (S52). EPD means finding the beginning and end of an utterance by removing acoustic features from frames that correspond to silent sections and then measuring the energy of each frame to distinguish speech from silence. In other words, the audio integration unit 310 performs EPD to find the regions of each individual audio file that contain speech.
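
As an illustration only, energy-based endpoint detection can be sketched as follows. The description does not specify a frame length, an energy threshold, or a feature-removal procedure, so the function name `detect_speech_regions` and the parameters `frame_ms` and `threshold` are hypothetical, and the sketch simply thresholds the mean energy of fixed-length frames.

```python
import numpy as np

def detect_speech_regions(signal, sample_rate, frame_ms=25, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold.

    frame_ms and threshold are illustrative assumptions, not values from the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)   # mean energy per frame
    active = energy > threshold           # True where the frame counts as speech

    regions = []
    start = None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i                     # a speech region begins
        elif not is_speech and start is not None:
            regions.append((start * frame_len / sample_rate,
                            i * frame_len / sample_rate))
            start = None                  # the speech region has ended
    if start is not None:                 # speech runs to the end of the file
        regions.append((start * frame_len / sample_rate,
                        n_frames * frame_len / sample_rate))
    return regions
```

Each device's file would be passed through such a function separately, giving one list of speech regions per device.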

例えば、図６に示すように、音声統合部３１０は、会議参加者の各機器からＥＰＤ結果として検出された音声領域６０１を取得してよい。会議に参加する参加者ごとに位置が異なるため、それぞれ検出される音声領域６０１も異なるようになる。 For example, as shown in FIG. 6, the audio integration unit 310 may acquire the audio area 601 detected as the EPD result from each device of the conference participants. Since the positions of each participant in the conference are different, the detected audio areas 601 are also different.

再び図５を参照すると、音声統合部３１０は、会議参加者の各機器のＥＰＤ結果を統合してＥＰＤユニオン（ｕｎｉｏｎ）を生成してよい（Ｓ５３）。 Referring again to FIG. 5, the audio integration unit 310 may integrate the EPD results of each device of the conference participants to generate an EPD union (S53).

図７に示すように、会議参加者の各機器から検出される音声領域６０１はすべて異なるため、区間の漏れが発生しないように、各機器のＥＰＤ結果を統合してＥＰＤユニオン７０２を生成してよい。 As shown in FIG. 7, because the audio regions 601 detected from the devices of the conference participants all differ, the EPD results of the devices may be integrated to generate an EPD union 702 so that no section is omitted.

言い換えれば、音声統合部３１０は、会議参加者の各機器から受信した各個別音声ファイルの各個別ＥＰＤ結果を１つのＥＰＤ結果として統合するのである。 In other words, the audio integration unit 310 integrates each individual EPD result of each individual audio file received from each device of a conference participant as one EPD result.
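
A minimal sketch of forming such a union, assuming each device's EPD result is a list of `(start_sec, end_sec)` tuples on a shared timeline; the function name `epd_union` and this data layout are assumptions made for illustration.

```python
def epd_union(per_device_regions):
    """Merge the speech regions of all devices into one sorted list of
    non-overlapping (start, end) spans."""
    all_regions = sorted(
        region for regions in per_device_regions for region in regions
    )
    merged = []
    for start, end in all_regions:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

A region detected by only one device still appears in the result, which is the point of the union: no device's missed detection removes a section.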

再び図５を参照すると、クラスタ決定部３２０は、各機器のＥＰＤ結果に対して独立的に埋め込み抽出過程を実行する（Ｓ５４）。 Referring again to FIG. 5, the cluster determining unit 320 independently performs the embedding extraction process on the EPD results of each device (S54).

クラスタ決定部３２０は、各機器のＥＰＤ結果から埋め込み抽出をすることで個別類似度行列（ａｆｆｉｎｉｔｙｍａｔｒｉｘ）を計算した後、各機器の個別類似度行列を利用してクラスタ数を計算する（Ｓ５５）。 The cluster determining unit 320 calculates an individual similarity matrix (affinity matrix) by performing embedding extraction from the EPD results of each device, and then calculates the number of clusters using the individual similarity matrix of each device (S55).
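
The description does not state how the similarity between two segment embeddings is measured. As one possible sketch, the following assumes cosine similarity and assumes that an embedding extractor (not shown) has already produced one vector per speech segment; the name `affinity_matrix` is hypothetical.

```python
import numpy as np

def affinity_matrix(embeddings):
    """Build an (m, m) matrix of pairwise cosine similarities from m
    segment embeddings (cosine similarity is an assumption of this sketch)."""
    e = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    unit = e / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T
```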

このとき、クラスタ決定部３２０は、クラスタ数とともに、個別類似度行列の信頼度を計算してよい。 At this time, the cluster determining unit 320 may calculate the reliability of the individual similarity matrix as well as the number of clusters.

図８を参照すると、クラスタ決定部３２０は、各機器の個別音声ファイルごとに計算された個別類似度行列８０３に対して固有値分解（ｅｉｇｅｎｄｅｃｏｍｐｏｓｉｔｉｏｎ）を実行して固有値（ｅｉｇｅｎｖａｌｕｅ）と固有ベクトル（ｅｉｇｅｎｖｅｃｔｏｒ）を抽出してよい。 Referring to FIG. 8, the cluster determining unit 320 may perform eigenvalue decomposition on the individual similarity matrix 803 calculated for each individual audio file of each device to extract eigenvalues and eigenvectors.

このとき、クラスタ決定部３２０は、個別類似度行列８０３から抽出された固有値を固有値の大きさ順に整列し、整列された固有値に基づいてクラスタ数８０４と信頼度値８０５を決定してよい。 At this time, the cluster determining unit 320 may arrange the eigenvalues extracted from the individual similarity matrix 803 in order of the size of the eigenvalues, and determine the number of clusters 804 and the reliability value 805 based on the arranged eigenvalues.

クラスタ決定部３２０は、整列された固有値に隣接する固有値の差を基準に、有効な主成分に該当する固有値の個数をクラスタ数８０４として決定してよい。固有値が高いということは個別類似度行列８０３で影響力が大きいことを意味し、すなわち、個別音声ファイル内の音声領域に対して個別類似度行列８０３を構成するときに、発声がある話者のうちで発声の比重が高いことを意味する。 The cluster determining unit 320 may determine, as the number of clusters 804, the number of eigenvalues that correspond to valid principal components, based on the differences between adjacent eigenvalues in the sorted list. A high eigenvalue means a large influence on the individual similarity matrix 803; that is, when the individual similarity matrix 803 is constructed for the audio regions in the individual audio file, the corresponding speaker accounts for a large share of the speech among the speakers who spoke.

言い換えれば、クラスタ決定部３２０は、整列された固有値のうちから十分な大きさの値を有する固有値を選択し、選択された固有値の個数を、話者数を示すクラスタ数８０４として決定してよい。 In other words, the cluster determining unit 320 may select, from among the sorted eigenvalues, those having sufficiently large values, and may determine the number of selected eigenvalues as the number of clusters 804, which indicates the number of speakers.
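
The text says only that the count is based on differences between adjacent sorted eigenvalues. One common reading, assumed here, is the largest-eigengap heuristic: sort the eigenvalues in descending order and cut where the drop between neighbours is greatest. The function name `estimate_num_clusters` and the optional cap `max_clusters` are hypothetical, and the sketch assumes a symmetric affinity matrix with at least two rows.

```python
import numpy as np

def estimate_num_clusters(affinity, max_clusters=None):
    """Return (num_clusters, sorted_eigenvalues) using the largest gap
    between adjacent eigenvalues (an assumed selection rule)."""
    eigvals = np.sort(np.linalg.eigvalsh(np.asarray(affinity, dtype=float)))[::-1]
    limit = len(eigvals) - 1
    if max_clusters is not None:
        limit = min(limit, max_clusters)
    gaps = eigvals[:limit] - eigvals[1:limit + 1]  # drop between neighbours
    num_clusters = int(np.argmax(gaps)) + 1        # eigenvalues before the largest drop
    return num_clusters, eigvals
```

The eigenvalues before the largest drop are treated as the valid principal components; the ones after it are the leftover values that the description regards as noise.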

クラスタ数８０４の決定過程で選択されなかった固有値は、個別類似度行列８０３に含まれるノイズとして見なされてよく、選択されなかった固有値が小さいほど個別類似度行列８０３の計算が正確であると判断され、結果的には個別類似度行列８０３の信頼度が高いと判断されてよい。 Eigenvalues that are not selected in the process of determining the number of clusters 804 may be regarded as noise included in the individual similarity matrix 803, and the smaller the eigenvalues that are not selected, the more accurate the calculation of the individual similarity matrix 803 is. As a result, it may be determined that the reliability of the individual similarity matrix 803 is high.

クラスタ決定部３２０は、整列された固有値のうち、クラスタ数８０４の決定過程で選択されずにノイズとして残った固有値を利用して信頼度値８０５を計算してよい。 The cluster determining unit 320 may calculate the reliability value 805 using the eigenvalues that were not selected in the process of determining the number of clusters 804 and remained as noise among the sorted eigenvalues.

一例として、クラスタ決定部３２０は、クラスタ数８０４の決定過程で選択されなかった固有値のうち、最も大きい固有値を信頼度値８０５として活用してよい。例えば、整列された固有値のうち、値が高い４つの固有値が有効な主成分の数、すなわち、クラスタ数８０４として決定された場合、５番目の固有値を信頼度値８０５として活用してよい。 As an example, the cluster determining unit 320 may use the largest eigenvalue among the eigenvalues not selected in the process of determining the number of clusters 804 as the reliability value 805. For example, when the four highest eigenvalues among the sorted eigenvalues are determined as the number of effective principal components, that is, the number of clusters 804, the fifth eigenvalue may be used as the reliability value 805.

他の例として、クラスタ決定部３２０は、クラスタ数８０４の決定過程で選択されなかったすべての固有値の平均を計算した平均固有値を信頼度値８０５として活用してよい。 As another example, the cluster determining unit 320 may use, as the reliability value 805, an average eigenvalue obtained by calculating the average of all eigenvalues that were not selected in the process of determining the number of clusters 804.
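
Both variants of the reliability value can be sketched in a few lines. The function name `reliability_value` and the `mode` parameter are hypothetical; the input is assumed to be the eigenvalues already sorted in descending order.

```python
def reliability_value(sorted_eigvals, num_clusters, mode="max"):
    """Compute the reliability value from the eigenvalues left unselected
    after the first num_clusters were chosen.

    mode="max"  -> largest leftover eigenvalue (first example in the text)
    mode="mean" -> average of all leftover eigenvalues (second example)
    """
    rest = list(sorted_eigvals)[num_clusters:]
    if not rest:
        return 0.0  # nothing left over; treated as no noise in this sketch
    if mode == "max":
        return float(max(rest))
    return float(sum(rest) / len(rest))
```

With sorted eigenvalues `[5, 4, 3, 2, 0.5, 0.1]` and four clusters, the "max" variant yields the fifth eigenvalue, 0.5, matching the example in the text.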

会議参加者の各機器から検出される音声領域６０１は異なるという点において、これから計算された個別類似度行列８０３もすべて異なることがあり、話者数を示すクラスタ数８０４の結果も異なることがある。 Because the audio regions 601 detected from the devices of the conference participants differ, the individual similarity matrices 803 calculated from them may also all differ, and the resulting numbers of clusters 804, which indicate the number of speakers, may differ as well.

機器１の個別音声ファイルでは４人の話者が推定され、機器２の個別音声ファイルでは５人の話者が推定される場合、このように異なる結果を統合するために信頼度を活用するのである。 For example, if four speakers are estimated from the individual audio file of device 1 and five speakers from the individual audio file of device 2, the reliability is used to reconcile these differing results.

クラスタ決定部３２０は、各機器の個別類似度行列８０３を平均した平均類似度行列を利用してクラスタ数８０４を決定することも可能である。しかし、平均類似度行列を利用する場合には、クラスタ数８０４を誤って推定するというエラーが発生することがある。 The cluster determining unit 320 can also determine the number of clusters 804 using an average similarity matrix obtained by averaging the individual similarity matrices 803 of each device. However, when using the average similarity matrix, an error may occur in which the number of clusters 804 is incorrectly estimated.

類似度行列から計算された固有値のうちから有効な主成分の数を類推してクラスタ数８０４を推定するため、類似度行列のシャープネス（ｓｈａｒｐｎｅｓｓ）が下がれば性能が下落することもある。 Because the number of clusters 804 is estimated by inferring the number of valid principal components from the eigenvalues calculated from the similarity matrix, performance may degrade if the sharpness of the similarity matrix decreases.

したがって、クラスタ数８０４を決定するあたり、場合によっては、音声ファイルをスムージング（ｓｍｏｏｔｈｉｎｇ）した結果（平均類似度行列）よりはシャープネスした結果（各機器の個別類似度行列）を利用する方が、より正確な結果が得られる可能性がある。 Therefore, in determining the number of clusters 804, using the sharper result (the individual similarity matrix of each device) may in some cases yield more accurate results than using the result of smoothing the audio files (the average similarity matrix).

実施形態によっては、個別類似度行列８０３の加重和（ｗｅｉｇｈｔｅｄｓｕｍ）を適用してよい。 In some embodiments, a weighted sum of individual similarity matrices 803 may be applied.

個別類似度行列８０３の区間ごとに信頼度が異なることがあるという点を考慮した上で、個別類似度行列８０３のすべての区間に同じ加重値を適用して固有値分解を実行するのではなく、ＥＰＤとして検出されなかった領域の加重値を低める方向などによって加重値を学習して適用してよい。 Considering that the reliability may differ from section to section of the individual similarity matrix 803, rather than performing eigenvalue decomposition with the same weight applied to all sections, weights may be learned and applied, for example so as to lower the weight of regions not detected by EPD.

一例として、個別類似度行列８０３の区間ごとに加重値をランダムに適用して行列を統合した後、固有値を計算して信頼度を高める方向によって加重値を学習してよい。 For example, after integrating the matrices by randomly applying weight values to each section of the individual similarity matrix 803, the weight values may be learned in a direction that increases reliability by calculating eigenvalues.
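
The description leaves the weighting scheme and its learning procedure open. Purely as an illustration of the idea of down-weighting sections that a device's own EPD did not detect, the sketch below uses fixed (not learned) weights: each segment gets weight 1.0 if detected and `low_weight` otherwise, each matrix entry is weighted by the product of its two segment weights, and the matrices are combined as a normalized weighted sum. The function name, the `detected` mask layout, and `low_weight` are all assumptions.

```python
import numpy as np

def weighted_affinity_sum(matrices, detected, low_weight=0.1):
    """Combine per-device affinity matrices, down-weighting undetected segments.

    matrices: list of (m, m) arrays, one per device.
    detected: list of length-m boolean sequences; True where that device's
              own EPD found speech in the segment.
    low_weight: fixed stand-in for a learned weight (assumption).
    """
    total = np.zeros_like(np.asarray(matrices[0], dtype=float))
    weight_sum = np.zeros_like(total)
    for matrix, mask in zip(matrices, detected):
        seg_w = np.where(np.asarray(mask, dtype=bool), 1.0, low_weight)
        pair_w = np.outer(seg_w, seg_w)  # weight for each (i, j) entry
        total += pair_w * np.asarray(matrix, dtype=float)
        weight_sum += pair_w
    return total / np.clip(weight_sum, 1e-12, None)
```

In a learned variant, `low_weight` (or a full weight vector per device) would be adjusted so that the reliability computed from the resulting eigenvalues improves, as the text suggests.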

再び図５を参照すると、クラスタ決定部３２０は、各機器の各個別音声ファイルに対して推定されたクラスタ数と信頼度値を統合した後、信頼度に基づいてクラスタ数を最終的に決定してよい（Ｓ５６）。 Referring again to FIG. 5, the cluster determining unit 320 may combine the numbers of clusters and the reliability values estimated for the individual audio files of the devices, and may then finally determine the number of clusters based on the reliability (S56).

クラスタ決定部３２０は、各機器の各個別音声ファイルに対して計算された個別類似度行列のうちで信頼度値が最も高い個別類似度行列として計算されたクラスタ数を、最終クラスタ数として決定してよい。 The cluster determining unit 320 may determine, as the final number of clusters, the number of clusters calculated from the individual similarity matrix having the highest reliability value among the individual similarity matrices calculated for the individual audio files of the devices.
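
The selection step can be sketched as below. The description states both that smaller leftover eigenvalues indicate higher reliability and that the matrix with the highest reliability is chosen, without fixing how the score is oriented. This sketch assumes the caller supplies a score for which larger means more reliable (for example, the negated leftover eigenvalue); that convention and the name `choose_final_num_clusters` are assumptions.

```python
def choose_final_num_clusters(candidates):
    """Pick the cluster count from the most reliable device.

    candidates: list of (num_clusters, reliability) pairs, one per device,
                where a larger reliability means a more trustworthy estimate
                (an assumed convention).
    """
    best_count, _ = max(candidates, key=lambda pair: pair[1])
    return best_count
```

For instance, if device 1 estimates four speakers with a score of -0.8 and device 2 estimates five speakers with a score of -0.2, the final number of clusters is five.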

クラスタリング実行部３３０は、各機器のＥＰＤ結果を統合した結果であるＥＰＤユニオンを利用して、独立的にそれぞれ埋め込み抽出をすることで各機器の個別類似度行列を計算してよい（Ｓ５７）。 The clustering execution unit 330 may calculate an individual similarity matrix for each device by performing embedding extraction independently for each device over the EPD union, which is the result of integrating the EPD results of the devices (S57).

クラスタリング実行部３３０は、各機器に対して独立的に計算された個別類似度行列を平均して平均類似度行列を計算した後、平均類似度行列とともに、段階Ｓ５６で信頼度に基づいて決定されたクラスタ数を利用して話者ダイアライゼーションクラスタリングを実行してよい（Ｓ５８）。 The clustering execution unit 330 may calculate an average similarity matrix by averaging the individual similarity matrices calculated independently for each device, and may then perform speaker diarization clustering using the average similarity matrix together with the number of clusters determined based on the reliability in step S56 (S58).

図９に示すように、クラスタリング実行部３３０は、各機器に対して独立的に計算された個別類似度行列９０１を平均した平均類似度行列９０２を計算してよい。 As shown in FIG. 9, the clustering execution unit 330 may calculate an average similarity matrix 902 by averaging individual similarity matrices 901 calculated independently for each device.

一例として、クラスタリング実行部３３０は、各機器に対して計算された個別類似度行列９０１に対して行列算術演算（ｅｌｅｍｅｎｔ－ｗｉｓｅ）を実行して平均類似度行列９０２を計算してよい。 For example, the clustering execution unit 330 may calculate the average similarity matrix 902 by performing element-wise matrix arithmetic operations on the individual similarity matrix 901 calculated for each device.
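
A minimal sketch of the element-wise average, assuming every device's matrix has the same shape because all of them are computed over the same EPD union segments; the name `average_affinity` is hypothetical.

```python
import numpy as np

def average_affinity(matrices):
    """Element-wise mean of same-shaped per-device affinity matrices."""
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    return stacked.mean(axis=0)
```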

続いて、クラスタリング実行部３３０は、平均類似度行列９０２に対して固有値分解を実行し、固有値順に整列された固有ベクトルに基づいてクラスタリングを実行してよい。 Next, the clustering execution unit 330 may perform eigenvalue decomposition on the average similarity matrix 902 and perform clustering based on the eigenvectors arranged in order of eigenvalues.

１つの個別音声ファイルからｍ個の音声区間が抽出される場合、ｍ×ｍ個のエレメントを含む行列が生成されるが、このとき、各エレメントを示すｖ_ｉ、ｊは、ｉ番目の音声区間からｊ番目の音声区間までの距離を意味する。 When m speech sections are extracted from one individual audio file, a matrix containing m × m elements is generated, where each element v_{i,j} denotes the distance from the i-th speech section to the j-th speech section.

このとき、クラスタリング実行部３３０は、信頼度に基づいて決定されたクラスタ数だけ固有ベクトルを選択する方式によって話者ダイアライゼーションクラスタリングを実行してよい。 At this time, the clustering execution unit 330 may perform speaker diarization clustering by selecting eigenvectors by the number of clusters determined based on reliability.
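
The description says that as many eigenvectors as the determined number of clusters are selected, but does not name the algorithm that groups the segments afterwards. The sketch below assumes a plain k-means step on the selected eigenvectors, with a deterministic farthest-point initialization; both choices, the function name `spectral_cluster`, and the parameter `n_iter` are assumptions.

```python
import numpy as np

def spectral_cluster(affinity, num_clusters, n_iter=50):
    """Assign a cluster label to each segment using the top eigenvectors
    of the affinity matrix followed by k-means (an assumed grouping step)."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(affinity, dtype=float))
    order = np.argsort(eigvals)[::-1][:num_clusters]  # largest eigenvalues first
    features = eigvecs[:, order]                      # one row per segment

    # Farthest-point initialization keeps the sketch deterministic.
    centers = [features[0]]
    for _ in range(1, num_clusters):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centers], axis=0
        )
        centers.append(features[int(np.argmax(dists))])
    centers = np.array(centers)

    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                  # nearest center per segment
        for k in range(num_clusters):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels
```

Here `affinity` would be the average similarity matrix and `num_clusters` the reliability-based final count; each returned label identifies the speaker cluster of one segment.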

話者ダイアライゼーションのための全体過程は、会議中に複数の個人機器で同時に録音された音声ファイルを受信し、各機器の音声ファイルに対してＥＰＤを実行し、ＥＰＤが実行されたセグメント（音声領域）単位で埋め込みを抽出してクラスタ数（話者数）を推定した後、推定されたクラスタ数に基づいてクラスタリングを実行する。 The overall process for speaker diarization receives audio files recorded simultaneously on multiple personal devices during a conference, performs EPD on the audio file of each device, extracts embeddings in units of the segments (speech regions) obtained by EPD to estimate the number of clusters (the number of speakers), and then performs clustering based on the estimated number of clusters.

本実施形態において、話者ダイアライゼーション性能を改善するための過程としては、各機器の個別音声ファイルを利用してＥＰＤユニオンを生成すること、各機器の個別音声ファイルに対して計算された個別埋め込み行列を利用してクラスタ数を推定した後に信頼度に基づいて最終クラスタ数を決定すること、信頼度に基づくクラスタ数と平均類似度行列を利用して話者ダイアライゼーションクラスタリングを実行することが含まれてよい。 In this embodiment, the process for improving speaker diarization performance may include: generating an EPD union from the individual audio files of the devices; estimating the number of clusters using the individual embedding matrix calculated for the individual audio file of each device and then determining the final number of clusters based on the reliability; and performing speaker diarization clustering using the reliability-based number of clusters and the average similarity matrix.

このように、本発明の実施形態によると、追加の装備は必要とせず、複数の会議参加者が所持している個人機器を活用しながら、マルチデバイスによる話者ダイアライゼーションを実行することができる。 Thus, according to embodiments of the present invention, multi-device speaker diarization can be performed without additional equipment by using the personal devices carried by the multiple conference participants.

本発明の実施形態によると、各機器の音声ファイルからクラスタ数を推定した後、これに対する信頼度に基づいて最終クラスタ数を決定することにより、正確に推定されたクラスタ数によって話者ダイアライゼーション性能を向上させることができる。 According to an embodiment of the present invention, after estimating the number of clusters from the audio file of each device, the final number of clusters is determined based on the confidence level, thereby improving speaker diarization performance based on the accurately estimated number of clusters. can be improved.

このように、本実施形態では、マルチデバイスによる話者ダイアライゼーションという新たなタスクを定義することができ、会議参加者それぞれが保有している個人機器を活用するためシステム構築費用を節減することができ、会議を行うための空間をより広い範囲で効率的にカバーすることができる。 In this way, in this embodiment, a new task of multi-device speaker diarization can be defined, and system construction costs can be reduced by utilizing personal devices owned by each conference participant. This allows for a wider range of meeting spaces to be efficiently covered.

新たなタスクに合うようにモデルを学習することが最も一般的な接近方式ではあるが、新たなモデルの学習のためには、データの収集、適用する実際の環境、一般化性能などを考慮する必要がある。この反面、本実施形態は、従来の話者ダイアライゼーションモデルをそのまま使用することができ、既にサービスされている話者ダイアライゼーションシステムの場合であっても、モデルを再学習する必要なく、マルチデバイスから会議音声を受信する機能を追加するだけで話者ダイアライゼーション性能を向上させることができる。 Training a model to suit a new task is the most common approach, but training a new model requires consideration of data collection, the actual environment in which it will be applied, generalization performance, and so on. In contrast, this embodiment can use a conventional speaker diarization model as it is; even for a speaker diarization system that is already in service, speaker diarization performance can be improved simply by adding a function for receiving conference audio from multiple devices, without retraining the model.

上述した装置は、ハードウェア構成要素、ソフトウェア構成要素、および／またはハードウェア構成要素とソフトウェア構成要素との組み合わせによって実現されてよい。例えば、実施形態で説明された装置および構成要素は、プロセッサ、コントローラ、ＡＬＵ（ａｒｉｔｈｍｅｔｉｃｌｏｇｉｃｕｎｉｔ）、デジタル信号プロセッサ、マイクロコンピュータ、ＦＰＧＡ（ｆｉｅｌｄｐｒｏｇｒａｍｍａｂｌｅｇａｔｅａｒｒａｙ）、ＰＬＵ（ｐｒｏｇｒａｍｍａｂｌｅｌｏｇｉｃｕｎｉｔ）、マイクロプロセッサ、または命令を実行して応答することができる様々な装置のように、１つ以上の汎用コンピュータまたは特殊目的コンピュータを利用して実現されてよい。処理装置は、オペレーティングシステム（ＯＳ）およびＯＳ上で実行される１つ以上のソフトウェアアプリケーションを実行してよい。また、処理装置は、ソフトウェアの実行に応答し、データにアクセスし、データを記録、操作、処理、および生成してもよい。理解の便宜のために、１つの処理装置が使用されるとして説明される場合もあるが、当業者は、処理装置が複数個の処理要素および／または複数種類の処理要素を含んでもよいことが理解できるであろう。例えば、処理装置は、複数個のプロセッサまたは１つのプロセッサおよび１つのコントローラを含んでよい。また、並列プロセッサのような、他の処理構成も可能である。 The apparatus described above may be realized by hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may execute an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, record, manipulate, process, and generate data in response to execution of the software. Although a single processing device is sometimes described for ease of understanding, those skilled in the art will understand that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

ソフトウェアは、コンピュータプログラム、コード、命令、またはこれらのうちの１つ以上の組み合わせを含んでもよく、思うままに動作するように処理装置を構成したり、独立的または集合的に処理装置に命令したりしてよい。ソフトウェアおよび／またはデータは、処理装置に基づいて解釈されたり、処理装置に命令またはデータを提供したりするために、いかなる種類の機械、コンポーネント、物理装置、コンピュータ記録媒体または装置に具現化されてよい。ソフトウェアは、ネットワークによって接続されたコンピュータシステム上に分散され、分散された状態で記録されても実行されてもよい。ソフトウェアおよびデータは、１つ以上のコンピュータ読み取り可能な記録媒体に記録されてよい。 Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. The software may be distributed over computer systems connected by a network and may be stored or executed in a distributed manner. The software and data may be recorded on one or more computer-readable recording media.

実施形態に係る方法は、多様なコンピュータ手段によって実行可能なプログラム命令の形態で実現されてコンピュータ読み取り可能な媒体に記録されてよい。ここで、媒体は、コンピュータ実行可能なプログラムを継続して記録するものであっても、実行またはダウンロードのために一時記録するものであってもよい。また、媒体は、単一または複数のハードウェアが結合した形態の多様な記録手段または格納手段であってよく、あるコンピュータシステムに直接接続する媒体に限定されることはなく、ネットワーク上に分散して存在するものであってもよい。媒体の例としては、ハードディスク、フロッピー（登録商標）ディスク、および磁気テープのような磁気媒体、ＣＤ－ＲＯＭ、ＤＶＤのような光媒体、フロプティカルディスク（ｆｌｏｐｔｉｃａｌｄｉｓｋ）のような光磁気媒体、およびＲＯＭ、ＲＡＭ、フラッシュメモリなどを含み、プログラム命令が記録されるように構成されたものであってよい。また、媒体の他の例として、アプリケーションを配布するアプリケーションストアやその他の多様なソフトウェアを供給または配布するサイト、サーバなどで管理する記録媒体または格納媒体が挙げられる。 Methods according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. Here, the medium may store a computer-executable program continuously, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined, and is not limited to a medium directly connected to a particular computer system but may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like configured to store program instructions. Other examples of the medium include recording or storage media managed by an application store that distributes applications, or by a site or server that supplies or distributes various other software.

以上のように、実施形態を、限定された実施形態および図面に基づいて説明したが、当業者であれば、上述した記載から多様な修正および変形が可能であろう。例えば、説明された技術が、説明された方法とは異なる順序で実行されたり、かつ／あるいは、説明されたシステム、構造、装置、回路などの構成要素が、説明された方法とは異なる形態で結合されたりまたは組み合わされたり、他の構成要素または均等物によって対置されたり置換されたとしても、適切な結果を達成することができる。 As mentioned above, although the embodiments have been described based on limited embodiments and drawings, those skilled in the art will be able to make various modifications and variations based on the above description. For example, the techniques described may be performed in a different order than in the manner described, and/or components of the systems, structures, devices, circuits, etc. described may be performed in a different form than in the manner described. Even when combined or combined, opposed or replaced by other components or equivalents, suitable results can be achieved.

したがって、異なる実施形態であっても、特許請求の範囲と均等なものであれば、添付される特許請求の範囲に属する。 Therefore, even if the embodiments are different, if they are equivalent to the scope of the claims, they fall within the scope of the appended claims.

次の付記を記す。
（付記１）コンピュータシステムが実行する話者ダイアライゼーション方法であって、
前記コンピュータシステムは、メモリに含まれるコンピュータ読み取り可能な命令を実行するように構成された少なくとも１つのプロセッサを含み、
前記話者ダイアライゼーション方法は、
前記少なくとも１つのプロセッサにより、複数の電子機器から各電子機器で録音された音声ファイルを受信する段階、
前記少なくとも１つのプロセッサにより、前記各電子機器の前記音声ファイルに対して計算された埋め込み行列に基づいて候補クラスタ数を推定する段階、
前記少なくとも１つのプロセッサにより、前記各電子機器の候補クラスタ数を利用して最終クラスタ数を決定する段階、および
前記少なくとも１つのプロセッサにより、前記最終クラスタ数を利用して話者ダイアライゼーションクラスタリングを実行する段階
を含む、話者ダイアライゼーション方法。
（付記２）前記受信する段階は、
前記各電子機器の前記音声ファイルに対してエンドポイント検出（ＥＰＤ）を実行する段階、および
前記各電子機器のＥＰＤ結果を統合してＥＰＤユニオンを生成する段階
を含む、付記１に記載の話者ダイアライゼーション方法。
（付記３）前記推定する段階は、
前記各電子機器の前記音声ファイルのＥＰＤ結果から埋め込み抽出をすることで類似度行列を計算する段階、および
前記各電子機器の前記類似度行列を利用して前記候補クラスタ数と前記類似度行列の信頼度値を計算する段階
を含む、付記１に記載の話者ダイアライゼーション方法。
（付記４）前記候補クラスタ数と前記類似度行列の信頼度値を計算する段階は、
前記類似度行列に対して固有値分解を実行して固有値を抽出する段階、および
前記抽出された固有値を整列した後、隣接する固有値の差に基づいて前記候補クラスタ数と前記類似度行列の信頼度値を計算する段階
を含む、付記３に記載の話者ダイアライゼーション方法。
（付記５）前記候補クラスタ数と前記類似度行列の信頼度値を計算する段階は、
前記類似度行列に対して固有値分解を実行して固有値を抽出する段階、
前記抽出された固有値を整列した後、隣接する固有値の差を基準として選択された固有値の個数を前記候補クラスタ数として決定する段階、および
前記候補クラスタ数の決定過程で選択されずに残った固有値を利用して前記信頼度値を計算する段階
を含む、付記３に記載の話者ダイアライゼーション方法。
（付記６）前記残った固有値を利用して前記信頼度値を計算する段階は、
前記残った固有値のうちで最も大きい固有値を前記類似度行列の信頼度値として決定すること
を特徴とする、付記５に記載の話者ダイアライゼーション方法。
（付記７）前記残った固有値を利用して前記信頼度値を計算する段階は、
前記残った固有値の平均を計算した平均値を前記類似度行列の信頼度値として決定すること
を特徴とする、付記５に記載の話者ダイアライゼーション方法。
（付記８）前記推定する段階は、
前記音声ファイルのＥＰＤ結果に対して学習された加重値に基づいて前記類似度行列に対する加重和を適用する段階
をさらに含む、付記３に記載の話者ダイアライゼーション方法。
（付記９）前記決定する段階は、
前記信頼度値が最も大きい類似度行列で推定された候補クラスタ数を前記最終クラスタ数として決定すること
を特徴とする、付記３に記載の話者ダイアライゼーション方法。
（付記１０）前記実行する段階は、
前記各電子機器の前記音声ファイルのＥＰＤ結果から埋め込み抽出をすることで類似度行列を計算する段階、および
前記各電子機器の類似度行列を平均し、平均類似度行列と前記最終クラスタ数に基づいて前記話者ダイアライゼーションクラスタリングを実行する段階
を含む、付記１に記載の話者ダイアライゼーション方法。
（付記１１）付記１～１０のうちのいずれか一つに記載の話者ダイアライゼーション方法を前記コンピュータシステムに実行させる、コンピュータプログラム。
（付記１２）付記１～１０のうちのいずれか一つに記載の話者ダイアライゼーション方法をコンピュータに実行させるためのプログラムが記録されている、非一時的なコンピュータ読み取り可能な記録媒体。
（付記１３）コンピュータシステムであって、
メモリに含まれるコンピュータ読み取り可能な命令を実行するように構成された少なくとも１つのプロセッサ
を含み、
前記少なくとも１つのプロセッサは、
複数の電子機器から各電子機器で録音された音声ファイルを受信する過程、
前記各電子機器の前記音声ファイルに対して計算された埋め込み行列に基づいて候補クラスタ数を推定する過程、
前記各電子機器の候補クラスタ数を利用して最終クラスタ数を決定する過程、および
前記最終クラスタ数を利用して話者ダイアライゼーションクラスタリングを実行する過程
を処理する、コンピュータシステム。
（付記１４）前記受信する過程は、
前記各電子機器の前記音声ファイルに対してＥＰＤを実行する過程、および
前記各電子機器のＥＰＤ結果を統合してＥＰＤユニオンを生成する過程
を含む、付記１３に記載のコンピュータシステム。
（付記１５）前記推定する過程は、
前記各電子機器の前記音声ファイルのＥＰＤ結果から埋め込み抽出をすることで類似度行列を計算する過程、および
前記各電子機器の前記類似度行列を利用して前記候補クラスタ数と前記類似度行列の信頼度値を計算する過程
を含む、付記１３に記載のコンピュータシステム。
（付記１６）前記候補クラスタ数と前記類似度行列の信頼度値を計算する過程は、
前記類似度行列に対して固有値分解を実行して固有値を抽出する過程、
前記抽出された固有値を整列した後、隣接する固有値の差を基準として選択された固有値の個数を前記候補クラスタ数として決定する過程、および
前記候補クラスタ数の決定過程で選択されずに残った固有値を利用して前記信頼度値を計算する過程
を含む、付記１５に記載のコンピュータシステム。
（付記１７）前記残った固有値を利用して前記信頼度値を計算する過程は、
前記残った固有値のうちで最も大きい固有値を前記類似度行列の信頼度値として決定すること
を特徴とする、付記１６に記載のコンピュータシステム。
（付記１８）前記推定する過程は、
前記音声ファイルのＥＰＤ結果に対して学習された加重値に基づいて前記類似度行列に対する加重和を適用する過程
をさらに含む、付記１５に記載のコンピュータシステム。
（付記１９）前記決定する過程は、
前記信頼度値が最も大きい類似度行列で推定された候補クラスタ数を前記最終クラスタ数として決定すること
を特徴とする、付記１５に記載のコンピュータシステム。
（付記２０）前記実行する過程は、
前記各電子機器の前記音声ファイルのＥＰＤ結果から埋め込み抽出をすることで類似度行列を計算する過程、および
前記各電子機器の類似度行列を平均し、平均類似度行列と前記最終クラスタ数に基づいて前記話者ダイアライゼーションクラスタリングを実行する過程
を含む、付記１３に記載のコンピュータシステム。
The following additional notes are provided.
(Additional Note 1) A speaker diarization method executed by a computer system,
wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory,
the speaker diarization method comprising:
receiving, by the at least one processor, from a plurality of electronic devices, audio files recorded by the respective electronic devices;
estimating, by the at least one processor, a number of candidate clusters based on an embedding matrix calculated for the audio file of each electronic device;
determining, by the at least one processor, a final number of clusters using the numbers of candidate clusters of the electronic devices; and
performing, by the at least one processor, speaker diarization clustering using the final number of clusters.
(Additional Note 2) The speaker diarization method according to Additional Note 1, wherein the receiving includes:
performing endpoint detection (EPD) on the audio file of each of the electronic devices; and
integrating the EPD results of the electronic devices to generate an EPD union.
(Additional Note 3) The speaker diarization method according to Additional Note 1, wherein the estimating includes:
calculating a similarity matrix by performing embedding extraction from the EPD result of the audio file of each of the electronic devices; and
calculating the number of candidate clusters and a reliability value of the similarity matrix using the similarity matrix of each of the electronic devices.
(Additional Note 4) The speaker diarization method according to Additional Note 3, wherein calculating the number of candidate clusters and the reliability value of the similarity matrix includes:
extracting eigenvalues by performing eigenvalue decomposition on the similarity matrix; and
sorting the extracted eigenvalues and then calculating the number of candidate clusters and the reliability value of the similarity matrix based on differences between adjacent eigenvalues.
(Additional Note 5) The speaker diarization method according to Additional Note 3, wherein calculating the number of candidate clusters and the reliability value of the similarity matrix includes:
extracting eigenvalues by performing eigenvalue decomposition on the similarity matrix;
sorting the extracted eigenvalues and then determining, as the number of candidate clusters, the number of eigenvalues selected based on differences between adjacent eigenvalues; and
calculating the reliability value using the eigenvalues left unselected in determining the number of candidate clusters.
(Additional Note 6) The speaker diarization method according to Additional Note 5, wherein calculating the reliability value using the remaining eigenvalues includes determining the largest of the remaining eigenvalues as the reliability value of the similarity matrix.
(Additional Note 7) The speaker diarization method according to Additional Note 5, wherein calculating the reliability value using the remaining eigenvalues includes determining the average of the remaining eigenvalues as the reliability value of the similarity matrix.
(Additional Note 8) The speaker diarization method according to Additional Note 3, wherein the estimating further includes applying a weighted sum to the similarity matrices based on weights learned for the EPD results of the audio files.
(Additional Note 9) The speaker diarization method according to Additional Note 3, wherein the determining includes determining, as the final number of clusters, the number of candidate clusters estimated from the similarity matrix having the largest reliability value.
(Additional Note 10) The speaker diarization method according to Additional Note 1, wherein the performing includes:
calculating a similarity matrix by performing embedding extraction from the EPD result of the audio file of each of the electronic devices; and
averaging the similarity matrices of the electronic devices and performing the speaker diarization clustering based on the average similarity matrix and the final number of clusters.
(Additional Note 11) A computer program that causes the computer system to execute the speaker diarization method according to any one of Additional Notes 1 to 10.
(Additional Note 12) A non-transitory computer-readable recording medium on which a program for causing a computer to execute the speaker diarization method according to any one of Additional Notes 1 to 10 is recorded.
(Additional note 13) A computer system,
at least one processor configured to execute computer-readable instructions contained in the memory;
The at least one processor includes:
a process of receiving audio files recorded by each electronic device from multiple electronic devices;
estimating the number of candidate clusters based on the embedding matrix calculated for the audio file of each electronic device;
A computer system that processes: determining a final number of clusters using the number of candidate clusters of each electronic device; and performing speaker diarization clustering using the final number of clusters.
(Additional Note 14) The receiving process includes:
14. The computer system according to appendix 13, comprising: executing EPD on the audio files of each of the electronic devices; and generating an EPD union by integrating EPD results of each of the electronic devices.
(Additional Note 15) The computer system according to appendix 13, wherein the estimation process includes:
calculating a similarity matrix by performing embedding extraction from the EPD results of the audio files of each of the electronic devices; and
calculating the number of candidate clusters and a reliability value of the similarity matrix using the similarity matrix of each of the electronic devices.
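As an illustration of how a similarity matrix can be obtained from extracted embeddings, the sketch below uses cosine similarity between per-segment embedding vectors. The choice of cosine similarity and the name `similarity_matrix` are assumptions made for this example; this section does not state which similarity measure is used.

```python
import numpy as np


def similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n_segments x n_segments) cosine similarity matrix.

    `embeddings` holds one row per speech segment (an assumed layout).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Three toy 2-dimensional embeddings: the first two point the same way.
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = similarity_matrix(emb)
```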
(Additional Note 16) The computer system according to appendix 15, wherein the process of calculating the number of candidate clusters and the reliability value of the similarity matrix includes:
extracting eigenvalues by performing eigenvalue decomposition on the similarity matrix;
sorting the extracted eigenvalues and then determining, as the number of candidate clusters, the number of eigenvalues selected based on the difference between adjacent eigenvalues; and
calculating the reliability value using the eigenvalues that remain unselected in the process of determining the number of candidate clusters.
(Additional Note 17) The computer system according to appendix 16, wherein, in the process of calculating the reliability value using the remaining eigenvalues, the largest eigenvalue among the remaining eigenvalues is determined as the reliability value of the similarity matrix.
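The eigenvalue-based estimation in Additional Notes 16 and 17 can be sketched as follows. This is one possible reading, not the definitive procedure: it assumes the eigenvalues are sorted in descending order and that the eigenvalues "selected" are those lying before the largest gap between adjacent eigenvalues. The function name `estimate_clusters` is hypothetical.

```python
import numpy as np


def estimate_clusters(sim: np.ndarray):
    """Return (candidate cluster count, reliability value) for one similarity matrix."""
    # The matrix is symmetric, so eigvalsh applies; it returns ascending eigenvalues.
    eigvals = np.linalg.eigvalsh(sim)[::-1]   # reorder to descending
    gaps = eigvals[:-1] - eigvals[1:]         # differences between adjacent eigenvalues
    k = int(np.argmax(gaps)) + 1              # assumed rule: keep those before the largest gap
    remaining = eigvals[k:]                   # eigenvalues left unselected
    reliability = float(remaining.max())      # largest remaining eigenvalue
    return k, reliability


# Toy similarity matrix with two clean blocks (3 segments and 2 segments).
sim = np.zeros((5, 5))
sim[:3, :3] = 1.0
sim[3:, 3:] = 1.0
k, reliability = estimate_clusters(sim)
```

For this block-structured toy matrix the two large eigenvalues correspond to the two blocks, so the estimated number of candidate clusters is two.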
(Additional Note 18) The computer system according to appendix 15, wherein the estimation process further includes:
applying a weighted sum to the similarity matrix based on weights learned for the EPD result of the audio file.
(Additional Note 19) The computer system according to appendix 15, wherein, in the determining process, the number of candidate clusters estimated using the similarity matrix with the largest reliability value is determined as the final number of clusters.
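The selection rule in Additional Note 19 amounts to taking the candidate count from the device whose similarity matrix has the largest reliability value. A minimal sketch, assuming each device contributes a (candidate count, reliability value) pair; the name `final_cluster_count` and the numbers are made up for illustration.

```python
from typing import List, Tuple


def final_cluster_count(estimates: List[Tuple[int, float]]) -> int:
    """Pick the candidate count whose similarity matrix has the largest reliability value."""
    best_count, _ = max(estimates, key=lambda item: item[1])
    return best_count


# (candidate cluster count, reliability value) per device; illustrative values only.
per_device = [(3, 0.42), (4, 0.87), (3, 0.55)]
final_k = final_cluster_count(per_device)
```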
(Additional Note 20) The computer system according to appendix 13, wherein the executing process includes:
calculating a similarity matrix by performing embedding extraction from the EPD results of the audio files of each of the electronic devices; and
averaging the similarity matrices of the electronic devices and performing the speaker diarization clustering based on the averaged similarity matrix and the final number of clusters.
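The averaging and clustering step can be sketched as below. The sketch assumes that every device's similarity matrix covers the same segments and therefore has the same shape. The clustering algorithm is not specified in this section, so a spectral embedding followed by a small k-means loop with deterministic initialisation is swapped in purely for illustration; `cluster_average` and `block` are hypothetical names.

```python
import numpy as np


def cluster_average(sims, k, n_iter=20):
    """Average per-device similarity matrices and split the segments into k clusters."""
    avg = np.mean(np.stack(sims), axis=0)
    # Spectral embedding: eigenvectors belonging to the k largest eigenvalues.
    _, vecs = np.linalg.eigh(avg)
    feats = vecs[:, -k:]
    feats = feats / np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12, None)
    # Deterministic farthest-point initialisation for k-means (an assumed choice).
    centers = [feats[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((feats - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(feats[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return avg, labels


def block(within, across):
    """Toy 5x5 similarity matrix: segments 0-2 form one speaker, 3-4 another."""
    m = np.full((5, 5), across)
    m[:3, :3] = within
    m[3:, 3:] = within
    np.fill_diagonal(m, 1.0)
    return m


sim_a = block(1.0, 0.1)   # hypothetical matrix from device A
sim_b = block(0.8, 0.2)   # hypothetical matrix from device B
avg, labels = cluster_average([sim_a, sim_b], k=2)
```

Averaging before clustering means a segment that is noisy on one device can still be placed correctly if the other devices captured it cleanly.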

220: Processor
310: Audio integration unit
320: Cluster determination unit
330: Clustering execution unit

Claims

A speaker diarization method performed by a computer system,
wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory,
the speaker diarization method comprising:
receiving, by the at least one processor, individual audio files recorded by each electronic device from a plurality of electronic devices;
performing endpoint detection (EPD) on the individual audio files of each electronic device;
integrating the individual EPD results of each electronic device to generate an EPD union;
estimating, by the at least one processor, the number of candidate clusters based on the individual embedding matrices calculated for the individual audio files of each electronic device, the estimating including: calculating an individual embedding matrix for each electronic device by performing embedding extraction using the individual EPD results; and calculating the number of candidate clusters and a reliability value of the individual embedding matrix using the individual embedding matrix of each electronic device;
determining, by the at least one processor, a final number of clusters using the number of candidate clusters of each electronic device based on the reliability value; and
performing, by the at least one processor, speaker diarization clustering based on the final number of clusters and an average similarity matrix obtained by averaging the individual similarity matrices of each electronic device calculated by performing embedding extraction using the EPD union.

The speaker diarization method according to claim 1, wherein calculating the number of candidate clusters and the reliability value of the individual embedding matrix comprises:
performing eigenvalue decomposition on the individual embedding matrix to extract eigenvalues; and
after sorting the extracted eigenvalues, calculating the number of candidate clusters and the reliability value of the individual embedding matrix based on the difference between adjacent eigenvalues.

The speaker diarization method according to claim 1, wherein calculating the number of candidate clusters and the reliability value of the individual embedding matrix comprises:
performing eigenvalue decomposition on the individual embedding matrix to extract eigenvalues;
after sorting the extracted eigenvalues, determining, as the number of candidate clusters, the number of eigenvalues selected based on the difference between adjacent eigenvalues; and
calculating the reliability value using the eigenvalues that remain unselected in the process of determining the number of candidate clusters.

The speaker diarization method according to claim 3, wherein, in calculating the reliability value using the remaining eigenvalues, the largest eigenvalue among the remaining eigenvalues is determined as the reliability value of the individual embedding matrix.

The speaker diarization method according to claim 3, wherein, in calculating the reliability value using the remaining eigenvalues, the average of the remaining eigenvalues is determined as the reliability value of the individual embedding matrix.
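The two ways of deriving a reliability value from the remaining eigenvalues (largest value, or average value) differ only in the final reduction. The sketch below contrasts them on a made-up, already sorted eigenvalue list; the function name `reliability_from_remaining` and the `mode` parameter are assumptions for illustration.

```python
import numpy as np


def reliability_from_remaining(eigvals_desc: np.ndarray, k: int, mode: str = "max") -> float:
    """Reduce the eigenvalues left after selecting the first k to one reliability value."""
    remaining = eigvals_desc[k:]
    if mode == "max":
        return float(remaining.max())    # largest remaining eigenvalue
    if mode == "mean":
        return float(remaining.mean())   # average of the remaining eigenvalues
    raise ValueError(f"unknown mode: {mode}")


# Illustrative eigenvalues in descending order, with k = 2 already selected.
eigvals = np.array([3.0, 2.0, 0.4, 0.2, 0.0])
r_max = reliability_from_remaining(eigvals, k=2, mode="max")
r_mean = reliability_from_remaining(eigvals, k=2, mode="mean")
```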

The speaker diarization method according to any one of claims 1 to 5, wherein the estimating step further includes:
applying a weighted sum to the individual embedding matrix based on weight values learned for the individual EPD results.

The speaker diarization method according to any one of claims 1 to 6, wherein, in the determining step, the number of candidate clusters estimated from the embedding matrix with the largest reliability value is determined as the final number of clusters.

A computer program that causes the computer system to execute the speaker diarization method according to any one of claims 1 to 7.

A non-transitory computer-readable recording medium on which a program for causing a computer to execute the speaker diarization method according to any one of claims 1 to 7 is recorded.

A computer system comprising:
at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor performs:
a process of receiving individual audio files recorded by each electronic device from a plurality of electronic devices;
a process of performing endpoint detection (EPD) on the individual audio files of each electronic device;
a process of integrating the individual EPD results of each electronic device to generate an EPD union;
a process of estimating the number of candidate clusters based on an individual embedding matrix calculated for the individual audio file of each electronic device, the process including: calculating an individual embedding matrix for each electronic device by performing embedding extraction using the individual EPD results; and calculating the number of candidate clusters and a reliability value of the individual embedding matrix using the individual embedding matrix of each electronic device;
a process of determining a final number of clusters using the number of candidate clusters of each electronic device based on the reliability value; and
a process of performing speaker diarization clustering based on the final number of clusters and an average similarity matrix obtained by averaging the individual similarity matrices of each electronic device calculated by performing embedding extraction using the EPD union.

The computer system according to claim 10, wherein the process of calculating the number of candidate clusters and the reliability value of the individual embedding matrix includes:
extracting eigenvalues by performing eigenvalue decomposition on the individual embedding matrix;
sorting the extracted eigenvalues and then determining, as the number of candidate clusters, the number of eigenvalues selected based on the difference between adjacent eigenvalues; and
calculating the reliability value using the eigenvalues that remain unselected in the process of determining the number of candidate clusters.

The computer system according to claim 11, wherein, in the process of calculating the reliability value using the remaining eigenvalues, the largest eigenvalue among the remaining eigenvalues is determined as the reliability value of the individual embedding matrix.

The computer system according to any one of claims 10 to 12, wherein the estimating process further includes:
applying a weighted sum to the individual embedding matrix based on weight values learned for the individual EPD results.

The computer system according to any one of claims 10 to 13, wherein, in the determining process, the number of candidate clusters estimated from the embedding matrix with the largest reliability value is determined as the final number of clusters.