JP7273078B2

JP7273078B2 - Speaker Diarization Method, System, and Computer Program Using Voice Activity Detection Based on Speaker Embedding

Info

Publication number: JP7273078B2
Application number: JP2021014192A
Authority: JP
Inventors: ヨンギクォン; ヒスホ; ジュンソンチョン; ボンジンイ; イクサンハン
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2020-11-30
Filing date: 2021-02-01
Publication date: 2023-05-12
Anticipated expiration: 2041-02-01
Also published as: KR102482827B1; KR20220075550A; JP2022086961A

Description

The following description relates to speaker diarization technology.

Speaker diarization refers to a technique for dividing an audio file, in which utterances by multiple speakers are recorded, into speech intervals for each speaker.

Speaker diarization technology detects speaker boundary sections in speech data and is divided into distance-based methods and model-based methods according to whether prior knowledge about the speakers is used.

For example, Patent Literature 1 (registered February 23, 2018) discloses a technique for generating a speaker recognition model that can distinguish speakers based on their voices without being affected by changes in the environment in which a speaker's voice is recognized or by the speaker's utterance state.

Such speaker diarization technology covers the various techniques for automatically recording utterances, divided by speaker, in situations where many speakers speak in no fixed order, such as meetings, interviews, transactions, and trials, and it is used for applications such as the automatic creation of meeting minutes.

Patent Literature 1: Korean Patent No. 10-1833731

Provided are a method and system for detecting speech intervals, which are speech activity regions, based on speaker embeddings.

Provided are a method and system that perform voice activity detection and speaker embedding extraction using a single model, namely a speaker recognition model, without using a separate model for detecting voice activity.

Provided is a speaker diarization method executed by a computer system, the computer system including at least one processor configured to execute computer-readable instructions contained in a memory, the speaker diarization method including: extracting, by the at least one processor, a speaker embedding for each speech frame of a given audio file; and detecting, by the at least one processor, speech intervals, which are speech activity regions, based on the speaker embeddings.

According to one aspect, the speaker diarization method may use a single model, namely a speaker recognition model, to perform both the extracting of the speaker embeddings and the detecting of the speech intervals.

According to another aspect, detecting the speech intervals may include: obtaining a norm value of the speaker embedding vector of each speech frame; and determining speech frames whose embedding norm value is greater than or equal to a threshold to be speech intervals and speech frames whose embedding norm value is below the threshold to be non-speech intervals.

According to yet another aspect, the speaker diarization method may further include adaptively setting, by the at least one processor, the threshold for classifying speech and non-speech according to the given audio file.

According to yet another aspect, the speaker diarization method may further include setting, by the at least one processor, the threshold to a value estimated for the audio file by a Gaussian mixture model.

According to yet another aspect, the speaker diarization method may further include setting, by the at least one processor, the threshold for classifying speech and non-speech to a fixed value determined by experiment.

According to yet another aspect, extracting the speaker embeddings may include extracting the speaker embedding for each speech frame using a sliding window method.

According to yet another aspect, extracting the speaker embeddings may include extracting them with a speaker recognition model trained using a combination of a classification loss and a hard negative mining loss.

According to yet another aspect, extracting the speaker embeddings may include obtaining utterance-level embeddings by aggregating the output of the speaker recognition model over time using a temporal average pooling layer and then passing it through a projection layer.

According to yet another aspect, detecting the speech intervals may include obtaining speech activity labels by passing the output of the speaker recognition model through the projection layer without aggregation over time.
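The aspect above in which the threshold is estimated by a Gaussian mixture model can be illustrated with a minimal sketch. It assumes, hypothetically, that the per-frame embedding norms of one audio file form two groups (non-speech and speech), fits a two-component one-dimensional mixture with plain EM, and takes the midpoint of the two component means as the threshold. The function name `gmm_threshold`, the quartile initialization, and the midpoint rule are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def gmm_threshold(norms, n_iter=100, tol=1e-8):
    """Fit a 2-component 1-D Gaussian mixture to embedding norms with EM.

    Returns a threshold between the low-norm and high-norm components.
    Assumption: the midpoint of the two fitted means is used as the threshold.
    """
    x = np.asarray(norms, dtype=float)
    # Initialise the two means from the lower and upper quartiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each norm value.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        new_mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / x.size
        shift = np.abs(new_mu - mu).max()
        mu = new_mu
        if shift < tol:
            break
    low, high = np.sort(mu)
    return 0.5 * (low + high)
```

Because the threshold is re-estimated from the norms of each file, it adapts to recordings whose overall norm scale differs, which is the motivation the text gives for an adaptive setting.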

Provided is a computer program recorded on a non-transitory computer-readable recording medium for causing the computer system to execute the speaker diarization method.

Provided is a computer system including at least one processor configured to execute computer-readable instructions contained in a memory, the at least one processor including: a speaker embedding unit that extracts a speaker embedding for each speech frame of a given audio file; and a speech interval detection unit that detects speech intervals, which are speech activity regions, based on the speaker embeddings.

According to embodiments of the present invention, detecting speech intervals, which are speech activity regions, based on speaker embeddings makes it possible to detect only the intervals in which the speaker can be clearly recognized, which can improve speaker diarization performance.

According to embodiments of the present invention, using for voice activity detection the same speaker recognition model that is used to extract speaker embeddings makes it possible to perform voice activity detection and speaker embedding extraction with a single model, which can simplify the speaker diarization pipeline.

FIG. 1 is a diagram illustrating an example of a network environment in one embodiment of the present invention.

FIG. 2 is a block diagram for explaining an example of the internal configuration of a computer system in one embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of components that a processor of the computer system may include in one embodiment of the present invention.

FIG. 4 is a flowchart illustrating an example of a speaker diarization method that the computer system may execute in one embodiment of the present invention.

FIG. 5 is a flowchart illustrating the overall process for speaker diarization in one embodiment of the present invention.

FIG. 6 is a diagram illustrating an example of a model for extracting speaker embeddings in one embodiment of the present invention.

FIG. 7 is a diagram showing experimental results for speaker diarization performance using the speech interval detection method based on speaker embeddings in one embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a speaker diarization technique for detecting speaker boundary sections in speech data.

Embodiments including the matters specifically disclosed in this specification can improve speaker diarization performance and simplify the speaker diarization pipeline by detecting speech intervals, which are speech activity regions, based on speaker embeddings.

FIG. 1 is a diagram illustrating an example of a network environment in one embodiment of the present invention. The network environment of FIG. 1 shows an example including a plurality of electronic devices 110, 120, 130, 140, a server 150, and a network 160. FIG. 1 is merely an example for explaining the invention, and the number of electronic devices and the number of servers are not limited to those shown in FIG. 1.

The plurality of electronic devices 110, 120, 130, 140 may be fixed terminals or mobile terminals implemented by computer systems. Examples of the plurality of electronic devices 110, 120, 130, 140 include smartphones, mobile phones, navigation devices, PCs (personal computers), notebook PCs, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), tablets, game consoles, wearable devices, IoT (Internet of Things) devices, VR (virtual reality) devices, and AR (augmented reality) devices. As an example, FIG. 1 shows a smartphone as the electronic device 110, but in embodiments of the present invention the electronic device 110 may mean any one of various physical computer systems capable of communicating with the other electronic devices 120, 130, 140 and/or the server 150 via the network 160 using a wireless or wired communication method.

The communication method is not limited and may include not only methods that use the communication networks that the network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, a broadcast network, and a satellite network) but also short-range wireless communication between devices. For example, the network 160 may include any one or more of networks such as a PAN (personal area network), LAN (local area network), CAN (campus area network), MAN (metropolitan area network), WAN (wide area network), BBN (broadband network), and the Internet. Further, the network 160 may include any one or more of network topologies including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited to these.

The server 150 may be implemented by one or more computer devices that communicate with the plurality of electronic devices 110, 120, 130, 140 via the network 160 to provide instructions, code, files, content, services, and the like. For example, the server 150 may be a system that provides a target service to the plurality of electronic devices 110, 120, 130, 140 connected via the network 160. As a more specific example, the server 150 may provide the plurality of electronic devices 110, 120, 130, 140 with the service targeted by an application (for example, an artificial intelligence meeting-minutes service based on speech recognition) through that application, which is a computer program installed and executed on the plurality of electronic devices 110, 120, 130, 140.

FIG. 2 is a block diagram illustrating an example of a computer system in one embodiment of the present invention. The server 150 described with reference to FIG. 1 may be implemented by a computer system 200 configured as shown in FIG. 2.

As shown in FIG. 2, the computer system 200 may include a memory 210, a processor 220, a communication interface 230, and an input/output interface 240 as components for executing the speaker diarization method according to embodiments of the present invention.

The memory 210 is a computer-readable recording medium and may include RAM (random access memory), ROM (read only memory), and a permanent mass storage device such as a disk drive. Here, a permanent mass storage device such as a ROM or disk drive may be included in the computer system 200 as a permanent storage device separate from the memory 210. An operating system and at least one program code may be recorded in the memory 210. Such software components may be loaded into the memory 210 from a computer-readable recording medium separate from the memory 210. Such a separate computer-readable recording medium may include a floppy (registered trademark) drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In other embodiments, the software components may be loaded into the memory 210 through the communication interface 230 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 210 of the computer system 200 based on a computer program installed from files received via the network 160.

The processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 220 by the memory 210 or the communication interface 230. For example, the processor 220 may be configured to execute instructions received according to program code recorded in a storage device such as the memory 210.

The communication interface 230 may provide a function for the computer system 200 to communicate with other devices via the network 160. As an example, requests, instructions, data, files, and the like that the processor 220 of the computer system 200 generates according to program code recorded in a storage device such as the memory 210 may be transmitted to other devices via the network 160 under the control of the communication interface 230. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer system 200 via the network 160 through the communication interface 230 of the computer system 200. Signals, instructions, data, and the like received through the communication interface 230 may be passed to the processor 220 or the memory 210, and files and the like may be recorded on a recording medium (the permanent storage device described above) that the computer system 200 may further include.

The communication method is not limited and may include not only methods that use the communication networks that the network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, and a broadcast network) but also wired or wireless communication between devices. For example, the network 160 may include any one or more of networks such as a PAN (personal area network), LAN (local area network), CAN (campus area network), MAN (metropolitan area network), WAN (wide area network), BBN (broadband network), and the Internet. Further, the network 160 may include any one or more of network topologies including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited to these.

The input/output interface 240 may be a means for interfacing with an input/output device 250. For example, input devices may include a microphone, a keyboard, a camera, or a mouse, and output devices may include a display, a speaker, and the like. As another example, the input/output interface 240 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. The input/output device 250 may be configured as a single device together with the computer system 200.

In other embodiments, the computer system 200 may include fewer or more components than those shown in FIG. 2. However, there is no need to clearly illustrate most components of the related art. For example, the computer system 200 may be implemented to include at least some of the input/output devices 250 described above, or may further include other components such as a transceiver, a camera, various sensors, and a database.

Specific embodiments of a speaker diarization method and system that detect voice activity based on speaker embeddings are described below.

FIG. 3 is a block diagram illustrating an example of components that the processor of the server may include in one embodiment of the present invention, and FIG. 4 is a flowchart illustrating an example of a method that the server may execute in one embodiment of the present invention.

The server 150 according to this embodiment serves as a service platform that provides an artificial intelligence service that takes an audio file in which utterances by multiple speakers are recorded, divides it into speech intervals for each speaker, and organizes the result as a document.

A speaker diarization system implemented by the computer system 200 may be configured in the server 150. As an example, the server 150 may provide an artificial intelligence meeting-minutes service based on speech recognition to the plurality of electronic devices 110, 120, 130, 140, which are clients, through a dedicated application installed on the electronic devices 110, 120, 130, 140 or through access to a web/mobile site associated with the server 150.

In particular, the server 150 may detect speech intervals, which are speech activity regions, based on speaker embeddings.

As shown in FIG. 3, the processor 220 of the server 150 may include a speaker embedding unit 310, a speech interval detection unit 320, and a clustering execution unit 330 as components for executing the speaker diarization method of FIG. 4.

Depending on the embodiment, components of the processor 220 may be selectively included in or excluded from the processor 220. Also, depending on the embodiment, components of the processor 220 may be separated or merged to represent the functions of the processor 220.

The processor 220 and its components may control the server 150 to perform steps 410 to 430 included in the speaker diarization method of FIG. 4. For example, the processor 220 and its components may be implemented to execute instructions according to the code of the operating system and the code of at least one program contained in the memory 210.

Here, the components of the processor 220 may be representations of different functions performed by the processor 220 according to instructions provided by program code recorded in the server 150. For example, the speaker embedding unit 310 may be used as a functional representation of the processor 220 that controls the server 150, according to the instructions described above, so that the server 150 extracts speaker embeddings.

The processor 220 may read the necessary instructions from the memory 210, into which instructions related to the control of the server 150 are loaded. In this case, the read instructions may include instructions for controlling the processor 220 to perform steps 410 to 430 described below.

Steps 410 to 430 described below may be performed in an order different from that shown in FIG. 4, some of steps 410 to 430 may be omitted, and additional steps may be included.

Referring to FIG. 4, in step 410, when given an audio file in which utterances by multiple speakers are recorded, the speaker embedding unit 310 may use a speaker recognition model to extract a speaker embedding for each speech frame of the given audio file. As an example, the speaker embedding unit 310 may extract the speaker embedding for each speech frame using a sliding window method.
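Sliding-window extraction in step 410 can be sketched as follows. The window length, hop length, and the callable `embed_fn` (a stand-in for the speaker recognition model) are hypothetical, since the text does not specify them for inference; the sketch only shows how fixed-length windows are stepped across the waveform and one embedding is collected per window.

```python
import numpy as np

def sliding_windows(num_samples, sample_rate, win_sec=2.0, hop_sec=0.5):
    """Return (start, end) sample indices of fixed-length windows.

    win_sec and hop_sec are illustrative values, not taken from the patent.
    """
    win = int(round(win_sec * sample_rate))
    hop = int(round(hop_sec * sample_rate))
    if num_samples < win:
        # Audio shorter than one window: use it as a single frame.
        return [(0, num_samples)]
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

def extract_embeddings(waveform, sample_rate, embed_fn, win_sec=2.0, hop_sec=0.5):
    """Apply embed_fn (hypothetical model wrapper) to every window."""
    spans = sliding_windows(len(waveform), sample_rate, win_sec, hop_sec)
    embeddings = np.stack([embed_fn(waveform[s:e]) for s, e in spans])
    return spans, embeddings
```

The returned spans keep the time position of each embedding, which later steps need in order to map frame-level decisions back to intervals in the audio file.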

In step 420, the speech interval detection unit 320 may detect speech intervals, which are speech activity regions, based on the speaker embeddings. A speaker recognition model for extracting speaker embeddings (for example, SpeakerNet) produces embeddings with a high norm value for speech and a low norm value for non-speech. As an example, the speech interval detection unit 320 may obtain the norm value of the speaker embedding vector of each speech frame, determine speech frames whose embedding norm value is greater than or equal to a threshold to be speech intervals, and determine speech frames whose embedding norm value is below the threshold to be non-speech intervals.
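The norm-based decision in step 420 can be sketched as follows. The sketch assumes the L2 norm (the text says only "norm") and an array of per-frame embeddings of shape `(num_frames, dim)`; the helper names are hypothetical.

```python
import numpy as np

def detect_speech(embeddings, threshold):
    """Label each frame as speech when its embedding norm reaches the threshold."""
    norms = np.linalg.norm(embeddings, axis=1)  # assumption: L2 norm
    return norms, norms >= threshold

def mask_to_intervals(mask):
    """Merge consecutive speech frames into (start, end) frame-index intervals."""
    intervals, start = [], None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(mask)))
    return intervals
```

For example, frames with norms of roughly 0.14, 5, 10, 0.2, and 13 and a threshold of 5 yield the speech intervals `(1, 3)` and `(4, 5)` in frame indices, with the end index exclusive.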

In step 430, the clustering execution unit 330 may perform speaker diarization clustering on the speech intervals detected in step 420 by grouping the speaker embeddings. The clustering execution unit 330 may calculate an affinity matrix for the speaker embeddings and then determine the number of clusters based on the affinity matrix. In doing so, the clustering execution unit 330 may perform eigendecomposition on the affinity matrix to extract eigenvalues, sort the extracted eigenvalues by magnitude, and, based on the differences between adjacent eigenvalues in the sorted list, determine the number of eigenvalues corresponding to valid principal components as the number of clusters. A large eigenvalue means a large influence in the affinity matrix; that is, when the affinity matrix is constructed for the speech intervals in the audio file, it corresponds to a speaker who accounts for a large share of the utterances among the speakers who spoke. In other words, the clustering execution unit 330 may select the eigenvalues that are sufficiently large from among the sorted eigenvalues and determine the number of selected eigenvalues as the number of clusters, which indicates the number of speakers. The clustering execution unit 330 may perform speaker diarization clustering by mapping the speech intervals based on the determined number of clusters.
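The cluster-count estimate in step 430 can be sketched as follows. The text says only that the count is determined from the differences between adjacent sorted eigenvalues; this sketch assumes a cosine-similarity affinity matrix and picks the position of the largest gap, and both choices, along with the `max_speakers` cap, are illustrative assumptions.

```python
import numpy as np

def cosine_affinity(embeddings):
    """Affinity matrix of pairwise cosine similarities (assumed affinity measure)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def estimate_num_speakers(affinity, max_speakers=10):
    """Count the eigenvalues that precede the largest gap in the sorted spectrum."""
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(affinity)[::-1][: max_speakers + 1]
    if eigvals.size < 2:
        return 1
    gaps = eigvals[:-1] - eigvals[1:]  # differences between adjacent eigenvalues
    return int(np.argmax(gaps)) + 1
```

With embeddings from three well-separated speakers, the affinity matrix is close to block diagonal, so three eigenvalues are large and the rest are near zero, and the largest gap falls after the third eigenvalue.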

As shown in FIG. 5, the overall process 50 for speaker diarization may include a speech region detection step 51, a speaker embedding extraction step 52, and a speaker diarization clustering step 53.

Conventionally, speech intervals were detected by measuring the energy of each frame to distinguish speech from non-speech, and the model used for speech interval detection was an independent model, different from the model used to extract speaker embeddings. When speech intervals are detected based on energy, some of the detected intervals may be intervals in which speaker recognition is difficult. Because such intervals are of a type that the speaker recognition model was not trained on, the quality of the speaker embeddings degrades. As a result, the quality of the detected speech intervals determines the performance of speaker diarization.

In this embodiment, the processor 220 does not use a separate model for detecting voice activity and instead uses a single model, the speaker recognition model, to perform voice activity detection and speaker embedding extraction. In other words, the present invention can perform the speech region detection step 51 and the speaker embedding extraction step 52 with the embedding model alone.

The core architecture applied to the speaker diarization system according to the present invention is described below.

Obtaining speaker representations that the speaker recognition model can recognize well is at the core of the speaker diarization problem. The following describes a method of learning and extracting speaker embeddings with a deep neural network.

The input representation may be implemented by dividing the frequency range into bins spaced linearly on the mel scale. The processor 220 extracts a spectrogram from each utterance using windows of a fixed size (for example, a width of 25 ms and a stride of 10 ms). A 64-dimensional mel filterbank is used as the input to the network. Mean and variance normalization (MVN) is performed, using instance normalization, on all frequency bins of the spectrum and filterbank at the utterance level.
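The framing arithmetic and the utterance-level normalization described above can be sketched as follows. The 16 kHz sample rate, the absence of padding, and the small `eps` constant are assumptions that do not appear in the text; the 25 ms width, 10 ms stride, and per-bin normalization over the utterance follow the description.

```python
import numpy as np

def num_frames(num_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of analysis windows that fit in the signal without padding."""
    win = sample_rate * win_ms // 1000   # 400 samples at the assumed 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at the assumed 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

def mvn(features, eps=1e-5):
    """Mean and variance normalization of each frequency bin over the utterance.

    features has shape (n_mels, n_frames), e.g. (64, T) for a 64-dim filterbank.
    """
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)
```

Under these assumptions a 2-second segment (32,000 samples) gives 198 frames, so the network input for that segment would be a 64 x 198 array.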

話者埋め込み抽出モデルは、話者認識モデルの１つであるＲｅｓＮｅｔ（Ｒｅｓｉｄｕａｌｎｅｔｗｏｒｋｓ）が使用されてよい。例えば、基本アーキテクチャとして、予備活性化残差ユニット（ｐｒｅ－ａｃｔｉｖａｔｉｏｎｒｅｓｉｄｕａｌｕｎｉｔｓ）を除いたＲｅｓＮｅｔ－３４を適用してよい。ＲｅｓＮｅｔ－３４アーキテクチャの例は、図６に示すとおりである。 The speaker embedding extraction model may use ResNet (Residual networks), which is one of speaker recognition models. For example, ResNet-34 without pre-activation residual units may be applied as the basic architecture. An example of the ResNet-34 architecture is shown in FIG.

The output of the speaker embedding extraction model may be aggregated over time using a temporal average pooling layer and then passed through a linear projection layer to obtain utterance-level embeddings.
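The pooling and projection just described can be sketched as follows. This is only an illustration in plain Python: the per-frame outputs, the projection weights, and the bias are hypothetical values, and the function names are not taken from the description.

```python
def temporal_average_pool(frame_outputs):
    """Average per-frame feature vectors over the time axis."""
    num_frames = len(frame_outputs)
    dim = len(frame_outputs[0])
    return [sum(f[d] for f in frame_outputs) / num_frames for d in range(dim)]

def linear_projection(vector, weights, bias):
    """Apply y = W v + b, with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def utterance_embedding(frame_outputs, weights, bias):
    """Pool over time first, then project: one embedding per utterance."""
    pooled = temporal_average_pool(frame_outputs)
    return linear_projection(pooled, weights, bias)
```

Applying `linear_projection` to each frame separately, without the pooling step, would instead give one embedding per frame.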

The processor 220 trains the speaker embedding extraction model using a combination of a classification loss and a hard negative mining loss as the objective function.

The classification loss L_CE is defined as in Equation (1), and the hard negative mining loss L_H is defined as in Equation (2).

Here, N is the batch size, and x_i and W_yi denote the embedding vector of the i-th utterance and the basis of the corresponding speaker, respectively. H_i denotes the set of the top H speaker bases with the largest similarity values to x_i. The speaker basis for a particular speaker is the row vector of the weight matrix of the output layer that corresponds to that speaker. The hard set H_i for each sample is selected for every mini-batch based on the cosine similarity between the sample x_i and all speaker bases in the training set. The classification loss L_CE, which is a categorical cross-entropy loss, and the hard negative mining loss L_H are combined with equal weights.
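Equations (1) and (2) are not reproduced in this text. Assuming a standard softmax formulation over the inner products between embeddings and speaker bases, the two losses might take the following form. This is a hedged reconstruction, not the authors' exact equations: the symbol C (the number of training speakers) is introduced here for illustration, and the source does not say whether the denominator of the hard negative term also contains the true-speaker term.

```latex
% Assumed form of Equation (1): categorical cross-entropy over all C speakers
L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp\left(x_i^{\top} W_{y_i}\right)}
           {\sum_{j=1}^{C}\exp\left(x_i^{\top} W_j\right)}

% Assumed form of Equation (2): the same ratio restricted to the hard set H_i
L_{H} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp\left(x_i^{\top} W_{y_i}\right)}
           {\sum_{h \in H_i}\exp\left(x_i^{\top} W_h\right)}

% Combination with equal weights, as stated in the description
L = L_{CE} + L_{H}
```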

The processor 220 trains the speaker embedding extraction model using a training dataset (for example, VoxCeleb2) generated by extracting and verifying the voices of celebrities. In doing so, the processor 220 may train the model on temporal segments of fixed length (for example, 2 seconds) extracted at random from each utterance.

The speaker recognition model used in the speaker embedding extraction step 52, which extracts speaker embeddings representing speaker information from the frames selected in the speech activity detection step 51, may also be used in the speech activity detection step 51. Because speaker embeddings can distinguish one person's voice from another's, they can also distinguish speech from non-speech.

The embedding norm is correlated with the model's confidence on the target task. When the embedding vector is classified by a linear classifier, such as an output layer activated by the softmax function, a high norm means that there is a large margin between the embedding vector and the hyperplane, that is, that the model's confidence score is high.

Because the speaker recognition model is trained only on human speech, non-speech input, which is not a training target, yields a low embedding norm and a very low confidence score. The speaker recognition model can therefore be used in the speech activity detection step 51 without a separate module and without modifying the model.

To obtain fine-grained speech activity labels, all outputs of the speaker embedding extraction model are taken and passed through the projection layer without temporal aggregation. This contrasts with the speaker representation, which uses embeddings aggregated by temporal average pooling over a window of fixed size (for example, 2 seconds).

The processor 220 computes the norm of the speaker embedding vector of each speech frame, determines frames whose embedding norm is greater than or equal to a threshold to be speech intervals, and determines frames whose embedding norm is less than the threshold to be non-speech intervals.
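The thresholding rule above can be sketched in a few lines of Python. The description only refers to a "norm value", so the use of the L2 norm here is an assumption, as are the function names and the example threshold.

```python
import math

def l2_norm(vector):
    """Euclidean length of an embedding vector (assumed norm type)."""
    return math.sqrt(sum(v * v for v in vector))

def detect_speech_frames(frame_embeddings, threshold):
    """Return one boolean per frame: True for speech, False for non-speech.

    A frame counts as speech when its embedding norm is greater than
    or equal to the threshold, mirroring the rule in the description.
    """
    return [l2_norm(e) >= threshold for e in frame_embeddings]
```

For instance, with a hypothetical threshold of 5.0, a frame whose embedding is `[3.0, 4.0]` (norm 5.0) would be labeled speech, while `[0.1, 0.2]` would be labeled non-speech.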

As one example, the processor 220 may set the threshold for classifying speech and non-speech to a fixed, experimentally determined value. The threshold for the embedding norm may be set manually using a development set, by running experiments and finding the value within a range of thresholds that gives the best result. The processor 220 may set a single threshold for all datasets.

As another example, the processor 220 may automatically set the optimal threshold for a given audio file. In this case, the processor 220 may use a Gaussian mixture model (GMM) to estimate the optimal threshold for each utterance. To do so, a Gaussian mixture model with two mixture components is trained on the distribution of norm values within one utterance. The two mixture components represent a speech cluster and a non-speech cluster. After the Gaussian mixture model is trained, the threshold may be estimated by Equation (4).

Here, μ₀ and μ₁ are the means of the respective mixture components, and α is the weighting coefficient applied to the two means.
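Equation (4) is not reproduced in this text. One reading consistent with the description of α as a weighting coefficient of the two means is a convex combination, threshold = α·μ₀ + (1 − α)·μ₁, and the sketch below assumes that form. The plain EM routine, its initialization, the variance floor, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math

def fit_two_component_gmm(values, iterations=50):
    """Fit a 1-D, two-component Gaussian mixture with plain EM.

    Returns (means, variances, weights), with components sorted by mean.
    """
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = max((hi - lo) ** 2 / 16.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [
                weights[k]
                * math.exp(-((v - means[k]) ** 2) / (2.0 * variances[k]))
                / math.sqrt(2.0 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            mass = sum(r[k] for r in resp) or 1e-300
            weights[k] = mass / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / mass
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / mass,
                1e-6,  # variance floor keeps the densities finite
            )
    order = sorted(range(2), key=lambda k: means[k])
    return ([means[k] for k in order],
            [variances[k] for k in order],
            [weights[k] for k in order])

def adaptive_threshold(norms, alpha=0.5):
    """Assumed form of Equation (4): alpha * mu0 + (1 - alpha) * mu1."""
    (mu0, mu1), _, _ = fit_two_component_gmm(norms)
    return alpha * mu0 + (1.0 - alpha) * mu1
```

Here `mu0` is taken to be the lower mean (the non-speech cluster) and `mu1` the higher one (the speech cluster), which is also an assumption about how the components are ordered.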

By adaptively estimating the threshold for classifying speech and non-speech from the audio data, the processor 220 can set thresholds that are robust across diverse dataset domains.

Based on the result of the speaker-embedding-based speech activity detection step 51, the processor 220 may divide each session of audio data into a plurality of speech activity segments.

To compensate for excessively abrupt changes in the speech activity detection result, the processor 220 performs an EPD (endpoint detection) process. EPD is a process of finding only the start and end of an utterance that separate speech from non-speech. As one example, the processor 220 detects the start and end by sliding a window of fixed size. For example, a point at which the proportion of speech activity frames exceeds 70% may be identified as a start point, and an end point may be identified by applying the same rule to non-speech frames.
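The sliding-window rule above can be sketched as follows. The window length of 10 frames, the choice to place each boundary at the first frame of the triggering window, and the function name are assumptions for illustration; the description fixes only the 70% example ratio.

```python
def detect_endpoints(is_speech, window=10, min_percent=70):
    """Smooth frame-level decisions into (start, end) frame segments.

    A segment starts at the first window in which more than `min_percent`
    of the frames are speech, and ends at the first later window in which
    more than `min_percent` of the frames are non-speech. `end` is exclusive.
    """
    segments = []
    start = None
    for pos in range(len(is_speech) - window + 1):
        speech = sum(1 for flag in is_speech[pos:pos + window] if flag)
        nonspeech = window - speech
        # Integer comparison avoids floating-point issues at exactly 70%.
        if start is None and speech * 100 > min_percent * window:
            start = pos
        elif start is not None and nonspeech * 100 > min_percent * window:
            segments.append((start, pos))
            start = None
    if start is not None:
        # Speech was still active when the input ended.
        segments.append((start, len(is_speech)))
    return segments
```

With this rule, an isolated burst of a few speech-labeled frames inside a long silence never reaches the 70% ratio and so does not open a segment.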

The processor 220 may group the speaker embeddings using an agglomerative hierarchical clustering (AHC) algorithm. The AHC algorithm may cluster speaker representations by a distance threshold or by a number of clusters. Across a plurality of different domains, the processor 220 may automatically select the optimal number of clusters C (2 ≤ C ≤ 10) for each session or audio file (or video containing audio) based on the silhouette score.

The silhouette score is an interpretation of consistency within clusters of data and may be regarded as a measure of confidence. The silhouette score may be defined using the mean intra-cluster distance, as in Equation (5).

The mean nearest-cluster distance may be defined for each sample as in Equation (6).

In particular, the silhouette score s(i) of a sample may be defined as in Equation (7).
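Equations (5) to (7) are not reproduced in this text. The standard definition of the silhouette score matches the quantities named above, so the equations plausibly take the following form. The symbols a(i), b(i), d(i, j), and C_i are introduced here for illustration and are assumptions about the original notation.

```latex
% Assumed form of Equation (5): mean intra-cluster distance of sample i,
% where C_i is the cluster containing i and d(i, j) is a distance measure
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i, j)

% Assumed form of Equation (6): mean distance to the nearest other cluster
b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)

% Assumed form of Equation (7): silhouette score of sample i
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```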

Unlike methods that manually tune a threshold for each dataset, the clustering method based on the silhouette score does not require parameter optimization.
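A minimal sketch of selecting the number of clusters by silhouette score is given below. It assumes cosine distance between non-zero embeddings and average linkage for AHC, neither of which is specified in the description, and it uses the standard silhouette definition. All function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance; inputs must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def agglomerative_labels(dist, num_clusters):
    """Average-linkage AHC on a precomputed distance matrix (assumed linkage)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def silhouette_score(dist, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all samples."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes s(i) = 0
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in groups.items() if other != label)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)

def choose_num_speakers(embeddings, min_c=2, max_c=10):
    """Try each cluster count in [min_c, max_c] and keep the best silhouette."""
    dist = [[cosine_distance(u, v) for v in embeddings] for u in embeddings]
    best_c, best_score, best_labels = None, -2.0, None
    for c in range(min_c, min(max_c, len(embeddings) - 1) + 1):
        labels = agglomerative_labels(dist, c)
        score = silhouette_score(dist, labels)
        if score > best_score:
            best_c, best_score, best_labels = c, score, labels
    return best_c, best_labels
```

The search needs no per-dataset threshold: the only fixed choice is the range of candidate cluster counts. With fewer than three embeddings the range is empty and the function returns `(None, None)`.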

In this embodiment, detecting speech activity regions (that is, speech intervals) based on speaker embeddings is a very simple and effective way to improve speaker diarization performance.

FIG. 7 shows experimental results for the speaker diarization performance of the speaker-embedding-based speech activity detection method according to the present invention.

The experiment uses DIHARD as the speaker diarization challenge dataset. The baseline is a pipeline-style speaker diarization method in which the model for detecting voice activity and the model for extracting speaker embeddings are completely separate. SE (speech enhancement) denotes a configuration that includes a denoising process for the speech.

MS (missed speech) is the proportion of speech not included in the result, FA (false alarm) is the proportion of non-speech included in the result, and SC (speaker confusion) is the proportion of mapping errors included in the result (speech mapped to the wrong speaker ID). DER (diarization error rate) is the sum of MS, FA, and SC. That is, a lower DER means better speaker diarization performance.
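The relationship between the four metrics can be sketched as follows. Expressing each component as a percentage of the total reference speech time follows common evaluation practice and is an assumption here; the description states only that DER is the sum of MS, FA, and SC. The function name is hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return (MS, FA, SC, DER) as percentages of the total speech time.

    All four inputs are durations in the same unit (for example, seconds).
    """
    ms = 100.0 * missed / total_speech
    fa = 100.0 * false_alarm / total_speech
    sc = 100.0 * confusion / total_speech
    return ms, fa, sc, ms + fa + sc
```

As a made-up example unrelated to FIG. 7, 30 s of missed speech, 20 s of false alarm, and 50 s of speaker confusion over 1000 s of speech give MS = 3%, FA = 2%, SC = 5%, and DER = 10%.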

Comparing the baseline with the speaker diarization of the present invention, which performs voice activity detection and speaker embedding extraction with a single model, both the method with a fixed threshold for classifying speech and non-speech (Ours w/ SpeakerNet SAD Fixed) and the method with an adaptively and automatically set threshold (Ours w/ SpeakerNet SAD GMM) outperform the baseline.

Thus, according to embodiments of the present invention, detecting speech intervals, which are speech activity regions, based on speaker embeddings makes it possible to detect only the intervals in which the speaker can be clearly recognized, which improves speaker diarization performance. In addition, according to embodiments of the present invention, using the speaker recognition model that extracts speaker embeddings to also detect voice activity allows a single model to perform both voice activity detection and speaker embedding extraction, which simplifies the speaker diarization pipeline.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over computer systems connected by a network and may be stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may store the computer-executable program persistently or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware. It is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art will be able to make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that of the described method, and/or components such as the described systems, structures, devices, and circuits are coupled or combined in a form different from that of the described method, or are replaced or substituted by other components or equivalents.

Accordingly, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

220: Processor
310: Speaker embedding unit
320: Speech interval detection unit
330: Clustering execution unit

Claims

1. A speaker diarization method performed by a computer system, wherein the computer system comprises at least one processor configured to execute computer-readable instructions contained in a memory, the speaker diarization method comprising:
extracting, by the at least one processor, speaker embeddings for each speech frame of a given audio file using a speaker recognition model trained using a combination of a classification loss and a hard negative mining loss; and
detecting, by the at least one processor, speech intervals, which are speech activity regions, based on the speaker embeddings.

2. The speaker diarization method of claim 1, wherein the extracting of the speaker embeddings and the detecting of the speech intervals are performed using a single model, which is the speaker recognition model.

3. The speaker diarization method of claim 1, wherein the detecting of the speech intervals comprises:
obtaining a norm value of the speaker embedding vector of each speech frame; and
determining speech frames whose embedding norm value is greater than or equal to a threshold to be the speech intervals, and determining speech frames whose embedding norm value is less than the threshold to be non-speech intervals.

4. The speaker diarization method of claim 3, further comprising adaptively setting, by the at least one processor, the threshold for classifying speech and non-speech according to the given audio file.

5. The speaker diarization method of claim 3, further comprising setting, by the at least one processor, the threshold estimated by a Gaussian mixture model for the audio file.

6. The speaker diarization method of claim 3, further comprising setting, by the at least one processor, the threshold for classifying speech and non-speech to an experimentally determined fixed value.

7. The speaker diarization method of claim 1, wherein the extracting of the speaker embeddings comprises extracting the speaker embeddings for each speech frame using a sliding window scheme.

8. A speaker diarization method performed by a computer system, wherein the computer system comprises at least one processor configured to execute computer-readable instructions contained in a memory, the speaker diarization method comprising:
extracting, by the at least one processor, speaker embeddings for each speech frame of a given audio file, wherein the output of a speaker recognition model is aggregated over time using a temporal average pooling layer and then passed through a projection layer to obtain utterance-level embeddings; and
detecting, by the at least one processor, speech intervals, which are speech activity regions, based on the speaker embeddings.

9. The speaker diarization method of claim 8, wherein the detecting of the speech intervals comprises obtaining speech activity labels by passing the output of the speaker recognition model through the projection layer without aggregation over time.

10. A computer program that causes a computer system to perform the speaker diarization method of any one of claims 1 to 9.

11. A computer system comprising at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor comprises:
a speaker embedding unit that extracts speaker embeddings for each speech frame of a given audio file using a speaker recognition model trained using a combination of a classification loss and a hard negative mining loss; and
a speech interval detection unit that detects speech intervals, which are speech activity regions, based on the speaker embeddings.

12. The computer system of claim 11, wherein the at least one processor extracts the speaker embeddings and detects the speech intervals using a single model, which is the speaker recognition model.

13. The computer system of claim 11, wherein the speech interval detection unit obtains a norm value of the speaker embedding vector of each speech frame, determines speech frames whose embedding norm value is greater than or equal to a threshold to be the speech intervals, and determines speech frames whose embedding norm value is less than the threshold to be non-speech intervals.

14. The computer system of claim 13, wherein the at least one processor adaptively sets the threshold for classifying speech and non-speech according to the given audio file.

15. The computer system of claim 13, wherein the at least one processor sets the threshold estimated by a Gaussian mixture model for the audio file.

16. The computer system of claim 11, wherein the speaker embedding unit extracts the speaker embeddings for each speech frame using a sliding window scheme.

17. A computer system comprising at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor comprises:
a speaker embedding unit that extracts speaker embeddings for each speech frame of a given audio file, the speaker embedding unit obtaining utterance-level embeddings by aggregating the output of a speaker recognition model over time using a temporal average pooling layer and then passing it through a projection layer; and
a speech interval detection unit that detects speech intervals, which are speech activity regions, based on the speaker embeddings.

18. The computer system of claim 17, wherein the speech interval detection unit obtains speech activity labels by passing the output of the speaker recognition model through the projection layer without aggregation over time.