JP7377736B2

JP7377736B2 - Online speaker sequential discrimination method, online speaker sequential discrimination device, and online speaker sequential discrimination system

Info

Publication number: JP7377736B2
Application number: JP2020028305A
Authority: JP
Inventors: 雅文薛; 慶華孫; 翔太堀口
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2023-11-10
Anticipated expiration: 2040-02-21
Also published as: JP2021131524A

Description

The present invention relates to an online speaker sequential discrimination method, an online speaker sequential discrimination device, and an online speaker sequential discrimination system.

In recent years, recordings of speech from broadcasts, voicemail, call center calls, meetings, and the like have been increasing. To create indexes and run search tasks automatically, efficiently, and effectively in this situation, it is important not only to transcribe the spoken content but also to be able to extract various kinds of non-linguistic information. One example of such non-linguistic information is metadata, which includes information such as the order of speakers, speaker characteristics (gender, age), and changes of sound source.

Speech sequential discrimination processing is used to identify the sound sources in a spoken document and to label them so that metadata can be defined. This processing finds homogeneous regions within audio segments and labels them consistently by speaker, gender, music, noise, and so on. The main part of the processing is speaker sequential discrimination, that is, speaker segmentation and clustering. In other words, it is the task of finding out "who spoke when."

Most conventional speech sequential discrimination means assume that the dialogue to be analyzed (a meeting, a telephone call, etc.) has already finished and that the entire audio data is available for analysis. Such a system is referred to as an "offline speech sequential discrimination system." However, an offline speech sequential discrimination system cannot analyze audio data in real time and provide discrimination results with low latency.

Therefore, in recent years, online speech sequential discrimination means that can analyze audio data in real time and provide discrimination results with low latency have been proposed. One example is Patent Document 1.
Patent Document 1 describes a technique in which "the speaker discrimination system 30 includes: a storage unit 42 that stores speaker GMMs 74-78; a voice activity detection unit 30 that segments audio data; a novelty determination unit 34 that determines whether the current segment belongs to none of the speaker GMMs 74-78; a new model generation unit 40 that, when the current segment belongs to none of the speaker GMMs 74-78, generates a new speaker GMM and labels the current segment with the new speaker GMM; a speaker identification unit 44 that, when the current segment belongs to one of the speaker GMMs 74-78, identifies the speaker and labels the current segment with that speaker; a training unit 48 that trains the speaker GMM using the current segment; and a merging unit 46 that merges segment labels according to the sequence of segments output by the voice activity detection unit 30."

Patent Document 1: Japanese Patent Application Publication No. 2009-109712

Patent Document 1 discloses the following technique. First, a GMM (Gaussian Mixture Model) is generated and stored for an input audio data segment. Next, when another audio data segment is input, that segment is compared with the stored GMMs. If the segment belongs to an already stored GMM (that is, the speaker is the same person), the speaker is identified. If the segment does not belong to any stored GMM (that is, the speaker is a different person), a new GMM is generated.
This makes it possible to analyze audio data in real time and provide discrimination results with low latency.
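The loop described for Patent Document 1 can be illustrated with a small sketch. It is not the implementation of that document: each speaker GMM is replaced by a single one-dimensional Gaussian, each segment is a short list of scalar features, and the names SpeakerModel, label_segments, and novelty_threshold are hypothetical, chosen only for illustration.

```python
import math


class SpeakerModel:
    """Single 1-D Gaussian standing in for a speaker GMM (a simplification)."""

    def __init__(self, frames):
        self.frames = list(frames)
        self._fit()

    def _fit(self):
        n = len(self.frames)
        self.mean = sum(self.frames) / n
        var = sum((x - self.mean) ** 2 for x in self.frames) / n
        self.var = max(var, 1e-3)  # variance floor keeps the likelihood finite

    def avg_log_likelihood(self, frames):
        total = 0.0
        for x in frames:
            total += -0.5 * (math.log(2 * math.pi * self.var)
                             + (x - self.mean) ** 2 / self.var)
        return total / len(frames)

    def update(self, frames):
        # "Training" here is simply refitting on all frames seen so far.
        self.frames.extend(frames)
        self._fit()


def label_segments(segments, novelty_threshold=-5.0):
    """Label each incoming segment with a speaker index, one segment at a time."""
    models, labels = [], []
    for seg in segments:
        scores = [m.avg_log_likelihood(seg) for m in models]
        if not scores or max(scores) < novelty_threshold:
            # Novelty: no stored model explains the segment, so create one.
            models.append(SpeakerModel(seg))
            labels.append(len(models) - 1)
        else:
            best = scores.index(max(scores))
            models[best].update(seg)
            labels.append(best)
    return labels
```

Because every segment receives exactly one label, a sketch of this kind has no way to mark two speakers as active in the same segment.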

However, the means described in Patent Document 1 assumes that one audio data segment corresponds to a single speaker and does not anticipate multiple speakers talking at the same time. It therefore cannot provide accurate discrimination results for audio data segments in which the utterances of several different users overlap.

Therefore, an object of the present invention is to provide an online speaker sequential discrimination method that uses an utterance information buffer, which stores important information extracted from previous audio data segments (information useful for determining who spoke when), and can thereby generate accurate discrimination results in real time even for audio data in which the utterances of several different users overlap.

To solve the above problem, one representative online speaker sequential discrimination method of the present invention includes: a step of receiving a first audio data segment corresponding to a user dialogue and speaker count data indicating the number of speakers in the user dialogue; a step of calculating, based on the first audio data segment and the speaker count data, for a first set of time positions in the first audio data segment, a first set of probability values indicating the probability that a particular speaker is speaking at a particular time position; a step of selecting, based on a predetermined selection method, a first subset of time positions that is at least part of the first set of time positions and a first subset of probability values that is at least part of the first set of probability values and corresponds to the first subset of time positions, and storing them in an utterance information buffer; a step of receiving a second audio data segment corresponding to the user dialogue; a step of extracting, from the first audio data segment, audio features corresponding to the first subset of time positions stored in the utterance information buffer, and combining them with the second audio data segment to generate a combined audio data segment; a step of calculating, based on the combined audio data segment and the speaker count data, for a second set of time positions in the combined audio data segment, a second set of probability values indicating the probability that a particular speaker is speaking at a particular time position; and a step of generating, based on the first subset of probability values and the second set of probability values, a speaker sequential discrimination result that identifies each speaker of the user dialogue.
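The sequence of steps above can be sketched roughly as follows. Everything in the sketch is an assumption made for illustration: the selection rule (keep the frames whose probabilities lie farthest from 0.5), the way the stored and new probabilities are combined (choosing the speaker ordering that best matches the buffered values), and the toy_model stand-in for the probability calculation. This passage of the document does not specify any of them.

```python
from itertools import permutations


def select_for_buffer(positions, probs, buffer_size):
    """Hypothetical selection rule: keep the most confident frames."""
    ranked = sorted(range(len(positions)),
                    key=lambda i: -sum(abs(p - 0.5) for p in probs[i]))
    keep = sorted(ranked[:buffer_size])  # restore chronological order
    return [positions[i] for i in keep], [probs[i] for i in keep]


def align_speakers(stored, new):
    """Assumed combination step: find the speaker ordering of `new` that is
    closest to the probabilities stored in the buffer."""
    n = len(stored[0])

    def cost(perm):
        return sum(abs(s[c] - r[perm[c]])
                   for s, r in zip(stored, new) for c in range(n))

    return min(permutations(range(n)), key=cost)


def process_two_segments(model, seg1, seg2, num_speakers, buffer_size):
    probs1 = model(seg1, num_speakers)            # first set of probability values
    positions1 = list(range(len(seg1)))           # first set of time positions
    buf_pos, buf_probs = select_for_buffer(positions1, probs1, buffer_size)
    # Features of the buffered frames are prepended to the second segment.
    combined = [seg1[t] for t in buf_pos] + list(seg2)
    probs2 = model(combined, num_speakers)        # second set of probability values
    perm = align_speakers(buf_probs, probs2[:len(buf_pos)])
    aligned = [[row[perm[c]] for c in range(num_speakers)]
               for row in probs2[len(buf_pos):]]
    return probs1, aligned


def toy_model(frames, num_speakers):
    """Stand-in for the probability calculation. It echoes the features and
    reverses the speaker order on odd-length inputs to mimic an arbitrary
    output ordering between calls."""
    flip = len(frames) % 2 == 1
    return [list(reversed(f)) if flip else list(f) for f in frames]
```

In this toy setting the second call sees five frames and returns the speakers in reversed order; matching against the buffered probabilities restores the ordering used for the first segment.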

According to the present invention, by using an utterance information buffer that stores important information extracted from previous audio data segments (information useful for determining who spoke when), it is possible to provide an online speaker sequential discrimination method that can generate accurate discrimination results in real time even for audio data in which the utterances of several different users overlap.

FIG. 1 is a diagram showing a computer system for implementing an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the configuration of an online speaker sequential discrimination system according to an embodiment of the present invention.
FIG. 3 is a diagram showing an overview of an online speaker sequential discrimination method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the online speaker sequential discrimination method.
FIG. 5 is a diagram showing a first audio data segment according to an embodiment of the present invention.
FIG. 6 is a diagram showing a combined audio data segment according to an embodiment of the present invention.
FIG. 7 is a diagram showing an example of the flow of the online speaker sequential discrimination method according to an embodiment of the present invention.
FIG. 8 is a diagram, continuing from FIG. 7, showing an example of the flow of the online speaker sequential discrimination method according to an embodiment of the present invention.
FIG. 9 is a diagram showing an example of a method for updating the utterance information buffer according to an embodiment of the present invention.

The background on which the present invention is based and embodiments of the present invention are described below with reference to the drawings. The present invention is not limited to these embodiments. In the drawings, identical parts are denoted by identical reference numerals.

(Background)
Most conventional speaker sequential discrimination systems perform several key subtasks: speech detection, speaker change detection, gender classification, and speaker clustering. In some cases, cluster recombination and re-separation are also used to improve performance.

The purpose of speech detection is to find the regions of the audio that consist only of speech. The most common technique for this task is maximum likelihood classification using acoustic Gaussian mixture models (GMMs). The models are usually trained on some labeled data; in the simplest case there are two models, one for speech data and one for non-speech data. Some systems use several models that depend on the speaker's gender and the channel type. Another approach performs single-pass or multi-pass Viterbi segmentation of the audio stream. For broadcast news data, the typical error rate of speech detection is 2% to 3%.

After the speech segments have been identified, speaker change detection can be used to find any speaker changes that occur within each segment. If a speaker change is detected, the segment is further divided into smaller segments, each belonging to a single speaker.

There are two main techniques for change detection. The first uses the Bayesian information criterion (BIC) to find potential change points within a window by determining whether the data are modeled better by two distributions than by one. The second measures a distance, most often the Gaussian divergence or the generalized likelihood ratio, between two fixed-length windows that are each represented by a single Gaussian. In this case, peaks of the distance that exceed a certain threshold are regarded as change points.
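As an illustration of the BIC-based technique, the following sketch scores every candidate split of a window of scalar features. It assumes one-dimensional Gaussian models, so each model has two free parameters (mean and variance); the function names, the variance floor, and the penalty weight lam are hypothetical choices, not values taken from this document.

```python
import math


def _variance(xs):
    mean = sum(xs) / len(xs)
    # Small floor so the logarithm stays finite for constant input.
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-6)


def delta_bic(window, i, lam=1.0):
    """Positive values favor two Gaussians (split at i) over a single one."""
    left, right = window[:i], window[i:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(_variance(window))
                  - n1 * math.log(_variance(left))
                  - n2 * math.log(_variance(right)))
    # The split model has 2 extra parameters (mean, variance) in 1-D.
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty


def find_change_point(window, margin=2, lam=1.0):
    """Return the split index with the largest positive delta BIC, or None."""
    best_index, best_score = None, 0.0
    for i in range(margin, len(window) - margin + 1):
        score = delta_bic(window, i, lam)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With multidimensional features the variances would become covariance determinants and the parameter count would grow accordingly; the one-dimensional form is used here only to keep the sketch short.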

Gender classification is used to divide the segments into two groups (male and female), which reduces the load of the subsequent clustering and provides more information about the speakers. Usually two GMMs, one per gender, are trained in advance, and maximum likelihood is used as the decision criterion. The reported error rate of gender classification is 1% to 2%.

The final subtask, speaker clustering, assigns the correct speaker label to each segment. This is done by clustering the segments into sets that correspond to speakers. The most widely used approach is hierarchical agglomerative clustering with a BIC stopping criterion.

Each cluster is represented by a single Gaussian, and the generalized likelihood ratio (GLR) is commonly used to measure the distance between clusters. Variants of this method have been proposed, but they are still based on the same kind of bottom-up clustering.

Embodiments of the present invention may use an end-to-end speaker sequential discrimination network (End-to-End Neural Diarization Network; EEND). This network is a neural network configured to determine, without using speaker embeddings, the probability that a particular speaker is speaking at a particular time position (who spoke when).

This end-to-end speaker sequential discrimination network can produce good discrimination results even when audio data segments contain overlapping utterances from several speakers. When an audio data segment x_t is input, the network calculates, for each time position t and for each of a predetermined number of speakers, the probability y that the speaker was speaking. For example, suppose there are two speakers c1 and c2 and the network outputs y_{t,c1} = 1 and y_{t,c2} = 0 as the discrimination result. This means that at time position t speaker c1 is speaking and speaker c2 is not. If y_{t,c1} = 0 and y_{t,c2} = 0 are output, neither c1 nor c2 is speaking at time position t. If y_{t,c1} = 1 and y_{t,c2} = 1 are output, both c1 and c2 are speaking at time position t.
In this way, the end-to-end speaker sequential discrimination network can generate a discrimination result that identifies each speaker of the user dialogue.
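The three cases above can be reproduced with a short sketch that turns per-frame probabilities y_{t,c} into the set of speakers active at each time position. The 0.5 threshold and the function name active_speakers are assumptions for illustration; this passage of the document does not state how the probabilities are thresholded.

```python
def active_speakers(probs, speaker_ids, threshold=0.5):
    """Return, for each time position, the set of speakers whose
    probability exceeds the threshold."""
    return [
        {spk for spk, p in zip(speaker_ids, row) if p > threshold}
        for row in probs
    ]


# Rows are time positions, columns are speakers c1 and c2.
example = active_speakers([[1, 0], [0, 0], [1, 1]], ["c1", "c2"])
```

Because each speaker is decided independently, overlapping speech is simply a frame whose set contains more than one speaker.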

However, to generate a discrimination result for a particular audio data segment, the end-to-end speaker sequential discrimination network described above needs not only that segment but also the audio data segments before and after it. It has therefore been used only in offline speech sequential discrimination systems, in which the user dialogue has already finished and the entire audio data is available from the start, and it is not designed for online (real-time) use.
Accordingly, the present invention uses an utterance information buffer that stores information useful for speaker sequential discrimination, which allows the end-to-end speaker sequential discrimination network to be applied in an online setting and to generate good discrimination results in real time.

(Hardware configuration)
Next, a computer system 300 for implementing an embodiment of the present disclosure is described with reference to FIG. 1. The mechanisms and devices of the various embodiments disclosed herein may be applied to any suitable computing system. The main components of the computer system 300 include one or more processors 302, a memory 304, a terminal interface 312, a storage interface 314, an I/O (input/output) device interface 316, and a network interface 318. These components may be interconnected via a memory bus 306, an I/O bus 308, a bus interface unit 309, and an I/O bus interface unit 310.

The computer system 300 may include one or more general-purpose programmable central processing units (CPUs) 302A and 302B, collectively referred to as the processor 302. In some embodiments the computer system 300 may have multiple processors, and in other embodiments it may be a single-CPU system. Each processor 302 executes instructions stored in the memory 304 and may include an on-board cache.

In some embodiments, the memory 304 may include random-access semiconductor memory, storage devices, or storage media (either volatile or non-volatile) for storing data and programs. The memory 304 may store all or part of the programs, modules, and data structures that implement the functions described herein. For example, the memory 304 may store a speaker sequential discrimination application 350. In some embodiments, the speaker sequential discrimination application 350 may include instructions or statements that execute the functions described below on the processor 302.

In some embodiments, the speaker sequential discrimination application 350 may be implemented in hardware via semiconductor devices, chips, logic gates, circuits, circuit cards, and/or other physical hardware devices, instead of or in addition to a processor-based system. In some embodiments, the speaker sequential discrimination application 350 may include data other than instructions or statements. In some embodiments, a camera, sensor, or other data input device (not shown) may be provided to communicate directly with the bus interface unit 309, the processor 302, or other hardware of the computer system 300.

The computer system 300 may include a bus interface unit 309 that handles communication among the processor 302, the memory 304, a display system 324, and the I/O bus interface unit 310. The I/O bus interface unit 310 may be coupled to the I/O bus 308 for transferring data to and from the various I/O units. Via the I/O bus 308, the I/O bus interface unit 310 may communicate with the I/O interface units 312, 314, 316, and 318, which are also known as I/O processors (IOPs) or I/O adapters (IOAs).

The display system 324 may include a display controller, a display memory, or both. The display controller can provide video data, audio data, or both to a display device 326. The computer system 300 may also include devices, such as one or more sensors, configured to collect data and provide that data to the processor 302.

For example, the computer system 300 may include biometric sensors that collect heart rate data, stress level data, and the like; environmental sensors that collect humidity data, temperature data, pressure data, and the like; and motion sensors that collect acceleration data, movement data, and the like. Other types of sensors can also be used. The display system 324 may be connected to a display device 326 such as a standalone display screen, a television, a tablet, or a handheld device.

The I/O interface units support communication with various storage and I/O devices. For example, the terminal interface unit 312 allows attachment of user I/O devices 320, which may include user output devices such as video display devices, speakers, and televisions, and user input devices such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing device. By operating a user input device through a user interface, a user may enter input data and instructions into the user I/O device 320 and the computer system 300 and receive output data from the computer system 300. The user interface may, for example, be displayed on a display device, played through a speaker, or printed by a printer via the user I/O device 320.

The storage interface 314 allows attachment of one or more disk drives or direct access storage devices 322 (typically magnetic disk drive storage devices, although they may be an array of disk drives or other storage devices configured to appear as a single disk drive). In some embodiments, the storage device 322 may be implemented as any secondary storage device. The contents of the memory 304 may be stored in the storage device 322 and read from it as needed. The I/O device interface 316 may provide an interface to other I/O devices such as printers and fax machines. The network interface 318 may provide a communication path so that the computer system 300 and other devices can communicate with each other. This communication path may be, for example, a network 330.

In some embodiments, the computer system 300 may be a multi-user mainframe computer system, a single-user system, or a device such as a server computer that has no direct user interface and receives requests from other computer systems (clients). In other embodiments, the computer system 300 may be a desktop computer, portable computer, laptop, tablet computer, pocket computer, telephone, smartphone, or any other suitable electronic device.

Next, an example of the configuration of the online speaker sequential discrimination system according to an embodiment of the present invention is described with reference to FIG. 2.

FIG. 2 is a diagram showing an example of the configuration of an online speaker sequential discrimination system 360 according to an embodiment of the present invention. As shown in FIG. 2, the online speaker sequential discrimination system 360 mainly consists of a client terminal 375, a communication network 370, an audio data acquisition device 365, and a speaker sequential discrimination device 380. The client terminal 375, the audio data acquisition device 365, and the speaker sequential discrimination device 380 are connected to one another via the communication network 370.
The communication network 370 may include, for example, a local area network (LAN), a wide area network (WAN), a satellite network, a cable network, a WiFi network, or any combination of these. The connections among the client terminal 375, the audio data acquisition device 365, and the speaker sequential discrimination device 380 may be wired or wireless.

The audio data acquisition device 365 is a device for acquiring the audio data segments to be processed by the speaker sequential discrimination method described later. The audio data acquisition device 365 may be, for example, a computing device equipped with a microphone, such as a smartphone or personal computer, or a recorder.

The client terminal 375 is a terminal that transmits the audio data segments acquired by the audio data acquisition device 365, together with speaker count data indicating the number of speakers in the user dialogue, to the speaker sequential discrimination device 380 via the communication network 370. The client terminal 375 may be a terminal used by an individual or a terminal shared within an organization such as a private company. The client terminal 375 may be any device, for example a desktop computer, a laptop, a tablet, or a smartphone.

The data storage unit 372 is a storage unit for storing the audio data segments and the speaker count data that are transmitted from the client terminal 375 via the communication network 370 and are to be processed by the speaker sequential discrimination method. The data storage unit 372 may be local storage such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), or a cloud storage area accessible to the speaker sequential discrimination device 380.

The speaker sequential discrimination device 380 is a device for carrying out the processing of the speaker sequential discrimination method according to an embodiment of the present invention. By processing the audio data segments and the speaker count data stored in the data storage unit 372, the speaker sequential discrimination device 380 generates a speaker sequential discrimination result that identifies each speaker of the user dialogue.

As shown in FIG. 2, to carry out the speaker sequential discrimination method according to an embodiment of the present invention, the speaker sequential discrimination device 380 includes a data input unit 382, a speaker probability determination unit 384, an utterance information buffer management unit 386, an audio data combining unit 388, and a speaker sequential discrimination unit 390.

The data input unit 382 is a functional unit that receives audio data segments (for example, a first audio data segment, a second audio data segment, and so on) input from the client terminal 375 (or the audio data acquisition device 365), together with speaker count data indicating the number of speakers in the user dialogue.

The speaker probability determination unit 384 is a functional unit that, based on the input audio data segment and the speaker count data, calculates for the time positions in the audio data segment (for example, a first set of time positions) the probability that a specific speaker is speaking at a specific time position (for example, a first set of probability values or a second set of probability values). The speaker probability determination unit 384 may be, for example, an End-to-End Neural Diarization (EEND) network using self-attention.

The utterance information buffer management unit 386 is a functional unit that determines the information to be stored in the utterance information buffer according to the embodiment of the present invention (a first subset of time positions and a first subset of probability values corresponding to the first subset of time positions) and stores it in the utterance information buffer. The information to be stored in the utterance information buffer may be determined based on a predetermined selection method.
The details of these selection methods will be described later.

The audio data combining unit 388 is a functional unit that generates a combined audio data segment by combining audio features extracted from the first audio data segment with the second audio data segment. The audio features may be extracted from the first audio data segment based on the time positions stored in the utterance information buffer.

The speaker sequential discrimination unit 390 is a functional unit that generates a speaker sequential discrimination result identifying each speaker in the user dialogue, based on the probability values calculated by the speaker probability determination unit 384 (for example, the first subset of probability values and the second set of probability values).

Each functional unit included in the online speaker sequential discrimination system 360 may be a software module constituting the speaker sequential discrimination application 350 shown in FIG. 1, or may be an independent dedicated hardware device. The functional units may be implemented in a single computing environment or distributed across multiple computing environments. For example, the speaker probability determination unit 384 may be implemented on a remote server or on the client terminal 375, with the remaining functional units implemented in the speaker sequential discrimination device 380.

Furthermore, embodiments of the present invention are not limited to the configuration of the online speaker sequential discrimination system 360 described with reference to FIG. 2. Appropriately modified configurations are also possible, such as a configuration in which the audio data acquisition device 365 transmits audio data segments directly to the speaker sequential discrimination device 380 without going through the client terminal 375, or a configuration in which the audio data acquisition device 365 and the client terminal 375 are integrated.

Next, an overview of the online speaker sequential discrimination method according to the embodiment of the present invention will be described with reference to FIG. 3.

FIG. 3 is a diagram showing an overview of the online speaker sequential discrimination method according to the embodiment of the present invention. As shown in FIG. 3, a plurality of speakers 391 and 392 are having a dialogue, and this dialogue includes the utterances 391a and 392b of the respective speakers 391 and 392. In general, a dialogue according to the embodiment of the present invention includes at least one utterance from one or more speakers and may be, for example, a telephone call, a discussion in a conference room, or broadcast audio. FIG. 3 illustrates, as an example, a dialogue between two speakers 391 and 392.
As described above, the online speaker sequential discrimination method according to the embodiment of the present invention can generate good speaker sequential discrimination results even for overlapping audio data segments (that is, data containing audio in which the speakers 391 and 392 talk at the same time and their voices overlap). The utterances 391a and 392b of the speakers 391 and 392 may therefore overlap in this dialogue.

The utterances 391a and 392b of the speakers in the dialogue are recorded (for example, by the audio data acquisition device 365 shown in FIG. 2) and transmitted in real time to the speaker sequential discrimination device 380 as a series of audio data segments of fixed duration (10 milliseconds, 1 second, and so on). The speaker sequential discrimination device 380 processes the incoming audio data segments using the data input unit 382, the speaker probability determination unit 384, the utterance information buffer management unit 386, the audio data combining unit 388, and the speaker sequential discrimination unit 390 described with reference to FIG. 2, and can thereby generate, with low latency, a speaker sequential discrimination result that identifies each speaker in the dialogue.

Next, the flow of the online speaker sequential discrimination method according to the embodiment of the present invention will be described with reference to FIG. 4.

FIG. 4 is a flowchart showing the flow of the online speaker sequential discrimination method 400 according to the embodiment of the present invention. The online speaker sequential discrimination method 400 shown in FIG. 4 uses an utterance information buffer that stores important information extracted from audio data segments (information useful for determining who spoke when). As a result, it can generate accurate speech sequential discrimination results in real time even when the dialogue contains audio data segments in which the utterances of different users overlap. The method may be executed, for example, by the speaker sequential discrimination device 380 described with reference to FIG. 2.

First, in step S410, a first audio data segment corresponding to a user dialogue and speaker count data indicating the number of speakers in that user dialogue are received. An audio data segment here means a portion of the user dialogue recorded over a predetermined length of time. For example, the first audio data segment may be audio data obtained by recording one second (or 10 milliseconds, 2 seconds, 10 seconds, 1 minute, and so on) of a dialogue taking place in real time. The speaker count data is information indicating the number of speakers participating in the user dialogue (hereinafter, the "number of speakers").

The first audio data segment may, for example, be recorded by the audio data acquisition device 365 described with reference to FIG. 2, transmitted to the speaker sequential discrimination device 380 either directly or via the client terminal 375, and input to the data input unit 382. The speaker count data may, for example, be attached to the first audio data segment as metadata when the first audio data segment is recorded, or may be created by preprocessing performed by the client terminal 375 and transmitted to the speaker sequential discrimination device 380.

Next, in step S420, based on the first audio data segment and the speaker count data input in step S410, a first set of probability values is calculated for a first set of time positions in the first audio data segment. Each probability value indicates the probability that a specific speaker is speaking at a specific time position.
A time position here is a label that designates the sub-segment at a specific time when an audio data segment is divided into sub-segments of an arbitrary time unit (1 millisecond, 10 milliseconds, 1 second, and so on), and a set of time positions is the collection of such labels. For example, dividing a 1-second audio data segment into 100-millisecond sub-segments yields ten 100-millisecond sub-segments, each corresponding to a different time position (1, 2, 3, ..., 10).
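The division into sub-segments can be illustrated with a minimal Python sketch. The function name `split_into_subsegments`, the 16 kHz sampling rate, and the choice to drop any incomplete trailing sub-segment are assumptions made for illustration and are not taken from the specification.

```python
import numpy as np

def split_into_subsegments(samples, sample_rate, sub_ms):
    """Split a 1-D audio array into equal sub-segments of sub_ms milliseconds.

    Returns (time_positions, subsegments). Time positions are 1-based labels.
    Trailing samples that do not fill a whole sub-segment are dropped
    (an assumption of this sketch).
    """
    samples = np.asarray(samples)
    sub_len = int(sample_rate * sub_ms / 1000)
    count = len(samples) // sub_len
    subsegments = samples[: count * sub_len].reshape(count, sub_len)
    time_positions = list(range(1, count + 1))
    return time_positions, subsegments

# One second of audio at an assumed 16 kHz rate, cut into 100 ms pieces.
segment = np.zeros(16000)
positions, subs = split_into_subsegments(segment, 16000, 100)
print(positions)   # labels 1 through 10
print(subs.shape)  # (10, 1600)
```

With a 1-second segment and 100-millisecond sub-segments this yields the ten time positions of the example in the text.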

Next, for the sub-segment corresponding to each time position in the first audio data segment, a first set of probability values is calculated indicating the probability that a specific speaker is speaking (that is, whether or not that speaker is talking). This calculation may be performed, for example, by the EEND network described above, based on the input first audio data segment and the speaker count data. The EEND network used here may be pre-trained.
For each sub-segment corresponding to a time position in the first audio data segment, the first set of probability values gives, for each of as many speakers as specified in the speaker count data, the probability that the speaker is talking. Each probability may be expressed, for example, as a value from 0 to 1.
FIG. 5 is a diagram showing a first audio data segment 500 according to the embodiment of the present invention. As shown in FIG. 5, the first audio data segment 500 is divided into ten sub-segments, each assigned to a different time position in the first set of time positions 510. For each time position in the first set of time positions 510, a first set of probability values 530 has been calculated, indicating the probability that each of the two speakers (c1, c2) 520 spoke. That is, for each of the ten time positions included in the first set of time positions 510, the probability that each speaker (c1, c2) 520 spoke is recorded as a value from 0 to 1.
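One way to picture the set of probability values is as a table with one row per time position and one column per speaker. The sketch below assumes, for illustration only, that the network emits an unbounded score per speaker and time position and that each score is squashed independently with a logistic function; the scores themselves are made up, and `speaker_probabilities` is a hypothetical helper rather than part of the described device.

```python
import numpy as np

def speaker_probabilities(scores):
    """Map per-position, per-speaker scores to independent values in (0, 1)."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-scores))

# Rows are time positions, columns are speakers c1 and c2 (made-up scores).
scores = np.array([
    [4.0, -4.0],   # only c1 is likely speaking
    [3.0, 2.5],    # both are likely speaking (overlap)
    [-5.0, -5.0],  # neither is likely speaking (silence)
])
probs = speaker_probabilities(scores)
print(probs.round(2))
```

Because each speaker gets an independent probability, a row may sum to more than 1 when voices overlap, which is what allows overlapping speech to be represented.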

Next, in step S430, a first subset of time positions, which is at least part of the first set of time positions, and a first subset of probability values, which is at least part of the first set of probability values and corresponds to the first subset of time positions, are selected based on a predetermined selection method and stored in the utterance information buffer. The utterance information buffer is a storage area that temporarily holds information from earlier (past) audio data segments that is useful for the speaker sequential discrimination processing. More specifically, the information stored in the utterance information buffer consists of time position information for an audio data segment (for example, the first audio data segment) and the probability information calculated for those time positions.
However, the size of the utterance information buffer is limited. To limit the amount of information stored, it is desirable to store only part of the information, selected by the predetermined selection method (that is, the first subset of time positions, which is at least part of the first set of time positions, and the first subset of probability values, which is at least part of the first set of probability values and corresponds to the first subset of time positions), rather than all of the time position and probability information for the audio data segment. This makes it possible to generate good speaker sequential discrimination results while keeping the utterance information buffer small.

The predetermined selection method here is a method for selecting, from the time position and probability information for an audio data segment, the part of that information to be stored in the utterance information buffer (for example, information useful for the speaker sequential discrimination processing by the EEND network described above). The selection method is not particularly limited; several examples are described below.

In one selection method for choosing the information to be stored in the utterance information buffer, probability values that satisfy a predetermined probability criterion (the first subset of probability values) and the time positions corresponding to those probability values (the first subset of time positions) are selected from the first set of probability values as the information to be stored in the utterance information buffer. The predetermined criterion may be, for example, "probability values of 0.8 or more" or "the three highest probability values".
In one embodiment, it is also possible to calculate, for each time position in the first set of time positions, the absolute value of the difference between the speakers' probabilities at that time position, and to store in the utterance information buffer the n time positions with the largest absolute values together with the probability values corresponding to those time positions.
The absolute value a of the difference in probability between speakers is obtained from Equation 1 below.

Here, y_{t, c} represents the probability that an arbitrary speaker c spoke at a specific time position t, and y_{t, c1} represents the probability that a different speaker c2 spoke at the same time position t.
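The two probability-based criteria can be sketched as follows. The sketch assumes two speakers and takes the score at time position t to be a_t = |y_{t,c1} - y_{t,c2}|, which is one reading of the absolute difference described above rather than a quotation of Equation 1. The function names, the 0-based indices, and the sample values are illustrative.

```python
import numpy as np

def select_by_threshold(probs, threshold=0.8):
    """Indices of time positions where some speaker's probability >= threshold."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(probs.max(axis=1) >= threshold)

def select_by_abs_difference(probs, n):
    """Indices of the n time positions with the largest |p(c1) - p(c2)|."""
    probs = np.asarray(probs, dtype=float)
    diff = np.abs(probs[:, 0] - probs[:, 1])
    order = np.argsort(-diff, kind="stable")  # largest difference first
    return np.sort(order[:n])                 # restore chronological order

probs = np.array([[0.9, 0.1], [0.5, 0.4], [0.2, 0.85], [0.6, 0.6]])
buffer_positions = select_by_abs_difference(probs, 2)
buffer_probs = probs[buffer_positions]  # values stored next to the positions
print(buffer_positions, select_by_threshold(probs))
```

A large absolute difference marks a time position where one speaker clearly dominates, which is plausibly why such positions are good anchors for later matching of speakers.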

In another selection method for choosing the information to be stored in the utterance information buffer, probability values that satisfy a predetermined attention weight criterion (the first subset of probability values) and the time positions corresponding to those probability values (the first subset of time positions) are selected from the first set of probability values as the information to be stored in the utterance information buffer. An attention weight here is a quantity that expresses how much influence a specific input has on the final output of a neural network.
For each value in the first set of probability values, the attention weight assigned in the neural network is calculated. The probability values that satisfy the predetermined attention weight criterion (the top 10%, the top 20%, and so on) and their corresponding time positions are then selected and stored in the utterance information buffer, which allows good speaker sequential discrimination results to be generated more efficiently.
The method of calculating the attention weight is not particularly limited, and an appropriate calculation method may be used depending on the structure of the neural network.
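Since the calculation of the attention weight is left open, the following sketch makes one concrete assumption purely for illustration: the network exposes a square self-attention matrix whose entry (i, j) is the weight that output position i places on input position j, and the weight of a time position is the average attention it receives. The function name and the `top_fraction` parameter are hypothetical.

```python
import numpy as np

def select_by_attention(attention, top_fraction=0.2):
    """Indices of the time positions that receive the most attention.

    attention[i, j] is assumed to be the weight output i puts on input j.
    """
    attention = np.asarray(attention, dtype=float)
    received = attention.mean(axis=0)              # average weight per input
    k = max(1, int(round(top_fraction * len(received))))
    order = np.argsort(-received, kind="stable")   # most attended first
    return np.sort(order[:k])

# Toy 5x5 attention matrix in which every output attends mostly to input 3.
attention = np.full((5, 5), 0.1)
attention[:, 3] = 0.6
print(select_by_attention(attention, top_fraction=0.2))
```

A real network typically has several heads and layers, so how their weights are pooled into one score per time position is a further design choice not settled by the text.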

In yet another selection method for choosing the information to be stored in the utterance information buffer, random time positions and the probability values corresponding to those time positions may be selected from the first set of time positions and stored in the utterance information buffer. For example, which time positions (and corresponding probability values) from the first set of time positions are placed in the utterance information buffer may be determined based on pseudo-random numbers generated by any method.
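Random selection is the simplest of the criteria to sketch. The example below uses NumPy's pseudo-random generator as one possible source of pseudo-random numbers; the function name, the buffer size of 3, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def select_random(num_positions, k, seed=None):
    """Pick k distinct time-position indices uniformly at random."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(num_positions, size=k, replace=False)
    return np.sort(chosen)  # keep the buffer in chronological order

probs = np.linspace(0.0, 1.0, 20).reshape(10, 2)  # placeholder probabilities
positions = select_random(len(probs), 3, seed=0)
buffer_probs = probs[positions]
print(positions, buffer_probs.shape)
```

Fixing the seed makes the selection reproducible, which can help when comparing this baseline against the probability-based and attention-based criteria.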

The utterance information buffer according to the embodiment of the present invention may be a single buffer or may be divided into a plurality of buffers. For example, in one embodiment of the present invention, the utterance information buffer may be divided into a plurality of buffers according to the number of speakers n. In this case, the number of buffers is obtained from Equation 2 below.

In this case, the utterance information buffers may also be divided into different types according to the information they store. For example, the utterance information buffer may include a speaker buffer that stores information for each speaker, an overlap buffer that stores information about overlapping audio data segments, a silence buffer for time positions at which no speaker is talking, and so on.
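Routing time positions into typed buffers might look like the sketch below. It assumes a speaker counts as active at a time position when the probability is at least 0.5, and it creates one buffer per speaker plus one overlap buffer and one silence buffer. Both the threshold and this particular layout are assumptions of the sketch and are not meant to reproduce Equation 2.

```python
import numpy as np

def route_to_buffers(probs, threshold=0.5):
    """Assign each time-position index to a speaker, overlap, or silence buffer."""
    probs = np.asarray(probs, dtype=float)
    buffers = {"silence": [], "overlap": []}
    for c in range(probs.shape[1]):
        buffers[f"speaker_{c + 1}"] = []
    for t, row in enumerate(probs):
        active = np.flatnonzero(row >= threshold)
        if len(active) == 0:
            buffers["silence"].append(t)
        elif len(active) == 1:
            buffers[f"speaker_{int(active[0]) + 1}"].append(t)
        else:
            buffers["overlap"].append(t)
    return buffers

probs = np.array([[0.9, 0.1], [0.1, 0.1], [0.8, 0.7], [0.2, 0.9]])
buffers = route_to_buffers(probs)
print(buffers)
```

Keeping separate buffers makes it possible to cap each type independently, so that long stretches of silence or of a single dominant speaker cannot crowd out the others.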

Next, in step S440, a second audio data segment corresponding to the user dialogue is received. Like the first audio data segment described above, the second audio data segment is audio data from the user dialogue that is recorded by the audio data acquisition device 365 described with reference to FIG. 2, transmitted to the speaker sequential discrimination device 380 either directly or via the client terminal 375, and input to the data input unit 382. The second audio data segment may be, for example, the audio data segment that follows the first audio data segment input in step S410 in the user dialogue.

Next, in step S450, audio features corresponding to the first subset of time positions stored in the utterance information buffer are extracted from the first audio data segment and combined with the second audio data segment to generate a combined audio data segment. The combined audio data segment brings together, in a single audio data file, the audio information corresponding to the first subset of time positions stored in the utterance information buffer (that is, the time positions useful for the speaker sequential discrimination processing) and the second audio data segment input in step S440. The audio features here may be, for example, Mel-frequency cepstral coefficients (MFCC), but are not particularly limited.

Mel-frequency cepstral coefficients are a representation of the power spectrum of a sound, based on a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency.
The extraction of the audio features and their combination with the second audio data segment may be performed by any existing means.
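The combination step can be sketched on feature matrices. The example assumes that both segments have already been converted to per-position feature vectors (random numbers stand in for 13-dimensional MFCCs here), that buffered time positions are 0-based row indices, and that the buffered rows are placed ahead of the new segment. All three points, and the function name, are assumptions for illustration.

```python
import numpy as np

def build_combined_segment(first_features, buffer_positions, second_features):
    """Prepend the buffered rows of the first segment to the second segment."""
    first_features = np.asarray(first_features)
    kept = first_features[np.asarray(buffer_positions, dtype=int)]
    return np.concatenate([kept, np.asarray(second_features)], axis=0)

rng = np.random.default_rng(0)
first = rng.normal(size=(10, 13))   # 10 time positions, 13 stand-in features
second = rng.normal(size=(10, 13))  # next segment, same layout
combined = build_combined_segment(first, [0, 2, 4, 6, 8], second)
print(combined.shape)  # (15, 13)
```

With five buffered positions and a ten-position second segment, the combined segment has fifteen time positions, the same count as the example shown in FIG. 6.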

Next, in step S460, based on the combined audio data segment generated in step S450 and the speaker count data input in step S410, a second set of probability values is calculated for a second set of time positions in the combined audio data segment, indicating the probability that a specific speaker is speaking at a specific time position. As with the first set of probability values, this calculation may be performed by the EEND network described above.
FIG. 6 is a diagram showing a combined audio data segment 600 according to the embodiment of the present invention. As shown in FIG. 6, the combined audio data segment 600 is divided into fifteen sub-segments, each assigned to a different time position in the second set of time positions 610. For each time position in the second set of time positions 610, a second set of probability values 630 has been calculated, indicating the probability that each of the two speakers (c3, c4) 620 spoke. That is, for each of the fifteen time positions included in the second set of time positions 610, the probability that each speaker (c3, c4) 620 spoke is recorded as a value from 0 to 1.

Because the second set of probability values 630 generated in step S460 is calculated from a combined audio data segment that joins the audio features extracted from the first audio data segment (based on the first subset of time positions stored in the utterance information buffer) with the second audio data segment, it includes a first group of probability values 622, corresponding to the audio features of the first audio data segment, and a second group of probability values 624, corresponding to the second audio data segment.
Note that the first subset of probability values stored in the utterance information buffer and the first group of probability values 622 calculated in step S460 are both calculated by the same EEND network from the same audio, and therefore take substantially similar (highly similar) values.

Next, in step S470, a speaker sequential discrimination result identifying each speaker in the user dialogue is generated based on the first subset of probability values (that is, the probability values stored in the utterance information buffer in step S430) and the second set of probability values calculated in step S460. This result indicates which speaker spoke at each time position in the second set of time positions in the combined audio data segment. By repeating steps S450 to S470 each time a new audio data segment is received, speaker sequential discrimination results can be generated for all of the audio data segments that make up the user dialogue.
The details of the processing for generating this speaker sequential discrimination result will be described later.

With the online speaker sequential discrimination method according to the embodiment of the present invention described above, using an utterance information buffer that stores important information extracted from earlier audio data segments (the time positions selected by the predetermined selection method and the probability values corresponding to those time positions) makes it possible to generate accurate speech sequential discrimination results in real time, even when the dialogue contains audio data segments in which the utterances of different users overlap.
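The overall loop of method 400 can be summarized in a compact, self-contained sketch. Several things here are assumptions and not the patented implementation: `predict` is a hypothetical stand-in for the trained network, speakers are matched by trying every column order and keeping the one closest to the buffered values, and after every segment the buffer is refreshed from the frames with the largest spread between speaker probabilities.

```python
from itertools import permutations
import numpy as np

def run_online(segments, num_speakers, predict, buffer_size=5):
    """Process feature segments in order; return speaker-consistent probabilities."""
    buf_feats = buf_probs = None
    results = []
    for feats in segments:
        feats = np.asarray(feats, dtype=float)
        if buf_feats is None:
            # First segment: nothing buffered yet.
            all_feats = feats
            all_probs = np.asarray(predict(all_feats, num_speakers), dtype=float)
            probs = all_probs
        else:
            k = len(buf_feats)
            all_feats = np.concatenate([buf_feats, feats], axis=0)
            raw = np.asarray(predict(all_feats, num_speakers), dtype=float)
            # Choose the speaker order that best reproduces the buffered values.
            perm = min(
                permutations(range(num_speakers)),
                key=lambda p: float(np.sum((raw[:k, list(p)] - buf_probs) ** 2)),
            )
            all_probs = raw[:, list(perm)]
            probs = all_probs[k:]
        results.append(probs)
        # Refresh the buffer with the most clear-cut frames (assumed policy).
        spread = all_probs.max(axis=1) - all_probs.min(axis=1)
        keep = np.sort(np.argsort(-spread, kind="stable")[:buffer_size])
        buf_feats, buf_probs = all_feats[keep], all_probs[keep]
    return results

# Toy predictor: features already are probabilities, but the speaker columns
# come back swapped on every call after the first.
calls = {"count": 0}

def fake_predict(features, num_speakers):
    calls["count"] += 1
    return features if calls["count"] == 1 else features[:, ::-1]

seg1 = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
seg2 = np.array([[0.8, 0.2], [0.2, 0.8]])
results = run_online([seg1, seg2], 2, fake_predict)
print(results[1])
```

In this toy run the second call returns swapped columns, and the buffered frames let the loop restore the original speaker order for the new segment.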

Next, an example of the flow of the online speaker sequential discrimination method according to the embodiment of the present invention will be described with reference to FIGS. 7 and 8.

FIG. 7 is a diagram showing an example of the flow of the online speaker sequential discrimination method according to the embodiment of the present invention.
As described above, when the first audio data segment 500 is input to the speaker probability determination unit, a first set of probability values 530 is calculated for the first set of time positions in the first audio data segment 500, based on the first audio data segment 500 and the speaker count data. The first set of probability values 530 indicates the probability that a specific speaker 520 is speaking at a specific time position.

After the first set of probability values 530 has been calculated, a first subset of time positions 752 and a first subset of probability values 754 are selected from the first set of time positions 510 and the first set of probability values 530 by one of the selection methods described above (the absolute value criterion, the attention weight criterion, random selection, and so on) and stored in the utterance information buffer 750.

Next, based on the first set of probability values calculated by the speaker probability determination unit, a first speaker sequential discrimination result 770 is generated, indicating which speaker spoke at each time position in the first set of time positions in the first audio data segment 500.
Because the first speaker sequential discrimination result 770 is calculated from the first set of probability values corresponding to the first audio data segment, the information stored in the utterance information buffer is not used at this point. That information is used when generating the second speaker sequential discrimination result for the second audio data segment that follows the first audio data segment.
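Turning a set of probability values into a statement of who spoke at each time position can be sketched as a simple threshold. The 0.5 threshold and the list-of-lists output format are assumptions chosen for illustration; the text does not fix a decision rule at this point.

```python
import numpy as np

def discrimination_result(probs, threshold=0.5):
    """For each time position, list the 1-based speakers judged to be speaking."""
    probs = np.asarray(probs, dtype=float)
    return [(np.flatnonzero(row >= threshold) + 1).tolist() for row in probs]

probs = np.array([[0.9, 0.1], [0.7, 0.8], [0.1, 0.2]])
result = discrimination_result(probs)
print(result)  # [[1], [1, 2], []]
```

An empty list marks silence and a list with two entries marks overlapping speech, so the same representation covers every case discussed in the text.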

FIG. 8 is a diagram showing an example of the flow of the online speaker sequential discrimination method according to the embodiment of the present invention, continuing from FIG. 7.

First, as described above, a combined audio data segment 600 is generated by combining the audio features extracted from the first audio data segment 500, based on the first subset of time positions 752 stored in the utterance information buffer 750 shown in FIG. 7, with the second audio data segment. When the combined audio data segment 600 is input to the speaker probability determination unit, a second set of probability values 630 is calculated for the second set of time positions 610 in the combined audio data segment 600, based on the combined audio data segment 600 and the speaker count data. The second set of probability values 630 indicates the probability that a specific speaker 620 is speaking at a specific time position.

As described above, the second set of probability values 630 calculated here is based on the combined audio data segment 600, which combines the audio features extracted from the first audio data segment 500 (based on the first subset of time positions 752 stored in the utterance information buffer 750) with the second audio data segment. It therefore includes a first group of probability values 622, which are the probability values corresponding to the audio features of the first audio data segment 500, and a second group of probability values 624, which are the probability values corresponding to the second audio data segment.
However, since the second set of probability values 630 is generated independently of the first set of probability values 530 described above, the correspondence between the speakers 620 shown in FIG. 8 and the speakers 520 described with reference to FIG. 7 is unknown. In other words, it is unclear whether speaker c3 of the combined audio data segment 600 corresponds to speaker c1 or speaker c2 of the first audio data segment 500. Similarly, it is unclear whether speaker c4 of the combined audio data segment 600 corresponds to speaker c1 or speaker c2 of the first audio data segment 500.
This is known as the so-called "permutation problem" in neural networks.

Therefore, in order to generate a speaker sequential discrimination result that is consistent with the first speaker sequential discrimination result 770 described above, the speakers of the combined audio data segment 600 must be associated with the speakers of the first audio data segment 500. The utterance information buffer according to the embodiment of the present invention is used to make this association and thereby solve the permutation problem described above. Speakers can be associated by comparing the probability values calculated for the previous audio data segment, which are stored in the utterance information buffer, with the probability values calculated for the audio data segment that combines a part of the previous audio data segment with the new audio data segment.

More specifically, after the second set of probability values 630 is calculated, permutations are generated between the second set of probability values calculated for each of the speakers c3 and c4 of the combined audio data segment 600 and the first subset of probability values calculated for each of the speakers c1 and c2 of the first audio data segment 500 and stored in the utterance information buffer 750 (c1×c3, c1×c4, c2×c3, c2×c4).
Then, for each generated permutation, a correlation score is calculated that indicates the similarity between the first subset of probability values stored in the utterance information buffer 750 and the first group of probability values 622. The correlation score may be calculated by, for example, an existing correlation coefficient or similarity calculation method, and is not particularly limited.

After the correlation scores for all permutations are calculated, the speakers of a permutation that satisfies a predetermined correlation score criterion (for example, the highest correlation score) are regarded as the same person. As one example, suppose the probability values of the first group of probability values 622 calculated for speaker c3 (see FIG. 8) match the first subset of probability values calculated for speaker c1 and stored in the utterance information buffer 750 (see FIG. 7), and the probability values of the first group of probability values 622 calculated for speaker c4 (see FIG. 8) match the first subset of probability values calculated for speaker c2 and stored in the utterance information buffer 750 (see FIG. 7). In that case, speaker c1 and speaker c3 are regarded as the same person, and speaker c2 and speaker c4 are regarded as the same person.
By associating the speakers of the combined audio data segment 600 with the speakers of the first audio data segment 500 in this way, a second speaker sequential discrimination result 870 can be generated that is consistent with the first speaker sequential discrimination result 770 described above and that indicates which speaker spoke at each of the second set of time positions in the combined audio data segment 600.
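The speaker association described above can be sketched as follows. This is only an illustration under stated assumptions: the description leaves the correlation score open, so the Pearson correlation coefficient is used here as one "existing correlation coefficient"; the criterion is taken to be the assignment with the highest total score; and the names `pearson` and `match_speakers` and the sample probability values are hypothetical.

```python
from itertools import permutations
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def match_speakers(buffered: Dict[str, Sequence[float]],
                   first_group: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Map each speaker label of the combined segment to a buffered speaker label."""
    old_speakers = list(buffered)
    # Correlation score for every pairing (c1 x c3, c1 x c4, c2 x c3, c2 x c4).
    scores = {(o, n): pearson(buffered[o], first_group[n])
              for o in old_speakers for n in first_group}
    best_total, best_mapping = float("-inf"), {}
    # Assumed criterion: keep the assignment with the highest summed score.
    for perm in permutations(first_group, len(old_speakers)):
        total = sum(scores[(o, n)] for o, n in zip(old_speakers, perm))
        if total > best_total:
            best_total, best_mapping = total, dict(zip(perm, old_speakers))
    return best_mapping


# Probability values stored in the buffer for the first segment (illustrative).
buffered = {"c1": [0.9, 0.8, 0.1, 0.2], "c2": [0.1, 0.3, 0.9, 0.7]}
# First group of probability values from the combined segment (illustrative).
first_group = {"c3": [0.85, 0.75, 0.15, 0.25], "c4": [0.2, 0.25, 0.8, 0.75]}
mapping = match_speakers(buffered, first_group)
```

For this sample data the result maps c3 to c1 and c4 to c2, mirroring the example in the text. Enumerating all assignments grows factorially with the number of speakers, which is acceptable only for the small speaker counts assumed here.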

After the second set of probability values 630 is calculated, a second subset of time positions 852 and a second subset of probability values 854 are selected from the second set of time positions 610 and the second set of probability values 630 by one of the selection methods described above (the absolute value criterion, the Attention Weight criterion, random selection, and so on), and are stored in the utterance information buffer 850.
By repeating the processing described with reference to FIG. 8 each time a new audio data segment is received, speaker sequential discrimination results can be generated for all of the audio data segments that constitute the user interaction.
The utterance information buffer 750 shown in FIG. 7 and the utterance information buffer 850 shown in FIG. 8 may be the same storage area (that is, a common buffer) or different storage areas (that is, independent buffers). The buffer configuration may be determined as appropriate based on, for example, the available computational resources.
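One of the selection methods mentioned above, the absolute value criterion, can be sketched as follows for the two-speaker case. The document only states that the absolute difference in probability between speakers must satisfy a predetermined criterion, so the rule used here (keep the `k` time positions with the largest difference) is an assumption, as are the function name `select_by_abs_difference` and the sample values.

```python
from typing import List, Tuple


def select_by_abs_difference(probs: List[Tuple[float, float]],
                             k: int) -> Tuple[List[int], List[Tuple[float, float]]]:
    """Pick the k time positions where the two speakers' probabilities differ most."""
    # (absolute difference, time position) for every time position.
    diffs = [(abs(a - b), t) for t, (a, b) in enumerate(probs)]
    # Largest difference first; ties are broken by the earlier time position.
    top = sorted(diffs, key=lambda d: (-d[0], d[1]))[:k]
    positions = sorted(t for _, t in top)
    return positions, [probs[t] for t in positions]


# Per-time-position probabilities for two speakers (illustrative values).
probs = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.7), (0.6, 0.4)]
positions, values = select_by_abs_difference(probs, k=2)
```

The returned pair corresponds to a subset of time positions and the matching subset of probability values, which are the two items stored in the utterance information buffer.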

Next, a method for updating the utterance information buffer according to the embodiment of the present invention will be described with reference to FIG. 9.

FIG. 9 is a diagram illustrating an example of a method for updating the utterance information buffer according to the embodiment of the present invention.
As described above, the utterance information buffer according to the embodiment of the present invention is a storage area that temporarily stores information useful for speaker sequential discrimination processing, and its size is limited. Therefore, in order to store new information in the buffer, information already stored in the buffer must be deleted. This processing is called a buffer update.

As one method for updating the buffer, the so-called first in, first out (FIFO) method may be used. More specifically, when there is a request to write new data to the utterance information buffer, the data already stored in the utterance information buffer is overwritten with the new data, starting from the data with the oldest write time. As a result, information about older audio data segments is deleted and information about newer audio data segments is stored.
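The FIFO update can be sketched with Python's standard `collections.deque`, which discards the oldest entry once its `maxlen` is reached. The capacity of three entries and the `(time position, probability values)` entry layout are assumptions chosen for illustration.

```python
from collections import deque

# Fixed-size buffer; a capacity of 3 entries is an arbitrary illustrative value.
utterance_buffer = deque(maxlen=3)

# Each entry: (time position, probability values per speaker), assumed layout.
entries = [
    (0, (0.9, 0.1)),
    (1, (0.2, 0.8)),
    (2, (0.7, 0.3)),
    (3, (0.1, 0.9)),  # writing this entry evicts the oldest one (time position 0)
]
for entry in entries:
    utterance_buffer.append(entry)

stored_positions = [t for t, _ in utterance_buffer]
```

After the fourth write, only the three most recently written entries remain in the buffer.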

As another method for updating the buffer, a buffer update based on the Attention Weight is also possible. As described above, the Attention Weight quantitatively represents the influence that a specific input has on the final output of a neural network. By calculating the Attention Weight for the sub-segment (that is, the input value) specified at each time position in an audio data segment, an Attention Weight can be obtained that quantitatively indicates the influence each sub-segment has on the output of the neural network.
For example, as shown in FIG. 9, an Attention Weight Matrix 900 consisting of input values 910 and output values 920 is generated and a predetermined Attention Weight calculation is performed, which yields an Attention Weight 930 representing the influence of each input value on each output value 920. This Attention Weight 930 is represented by the value of each element of the Attention Weight Matrix 900. The higher the Attention Weight, the greater the influence of the corresponding input value on the corresponding output value.

After the Attention Weight is calculated for each input value, data already stored in the utterance information buffer that does not satisfy a predetermined Attention Weight criterion (for example, 0.7) may be deleted, and new data that satisfies the predetermined Attention Weight criterion may be stored.
By preferentially storing information with a higher Attention Weight in the utterance information buffer in this way, a better speaker sequential discrimination result can be obtained.
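The Attention Weight based update can be sketched as follows. The threshold of 0.7 follows the example in the text; everything else is an assumption made for illustration: each entry is taken to carry a single precomputed Attention Weight, the buffer capacity is four entries, and the function name `update_buffer` is hypothetical. How a per-input weight is derived from the Attention Weight Matrix is not specified here and is left outside the sketch.

```python
from typing import List, Tuple

Entry = Tuple[int, float]  # (time position, Attention Weight), assumed layout


def update_buffer(stored: List[Entry], candidates: List[Entry],
                  threshold: float = 0.7, capacity: int = 4) -> List[Entry]:
    """Drop stored entries below the threshold and admit candidates that meet it."""
    kept = [e for e in stored if e[1] >= threshold]
    admitted = [e for e in candidates if e[1] >= threshold]
    # Assumption: if space runs out, entries with higher weights take priority.
    ranked = sorted(kept + admitted, key=lambda e: e[1], reverse=True)
    return ranked[:capacity]


stored = [(0, 0.9), (1, 0.4), (2, 0.75)]
candidates = [(10, 0.8), (11, 0.3), (12, 0.95)]
updated = update_buffer(stored, candidates)
```

In this example the stored entry with weight 0.4 and the candidate with weight 0.3 are excluded, and the remaining four entries fill the buffer in descending order of weight.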

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications can be made without departing from the gist of the present invention.

360 Online speaker sequential discrimination system
365 Audio data acquisition device
370 Communication network
372 Data storage unit
375 Client terminal
380 Speaker sequential discrimination device
382 Data input unit
384 Speaker probability determination unit
386 Utterance information buffer management unit
388 Audio data combination unit
390 Speaker sequential discrimination unit

Claims

An online speaker sequential discrimination method, comprising:
receiving a first audio data segment corresponding to a user interaction and speaker count data indicating the number of speakers in the user interaction;
calculating, based on the first audio data segment and the speaker count data, a first set of probability values indicating, for a first set of time positions in the first audio data segment, a probability that a particular speaker is speaking at a particular time position;
selecting, based on a predetermined selection method, a first subset of time positions that is at least part of the first set of time positions and a first subset of probability values that is at least part of the first set of probability values and corresponds to the first subset of time positions, and storing the selected subsets in an utterance information buffer;
receiving a second audio data segment corresponding to the user interaction;
generating a combined audio data segment by extracting, from the first audio data segment, audio features corresponding to the first subset of time positions stored in the utterance information buffer and combining them with the second audio data segment;
calculating, based on the combined audio data segment and the speaker count data, a second set of probability values indicating, for a second set of time positions in the combined audio data segment, a probability that a particular speaker is speaking at a particular time position; and
generating, based on the first subset of probability values and the second set of probability values, a speaker sequential discrimination result that identifies each of the speakers of the user interaction.

The online speaker sequential discrimination method according to claim 1, wherein
the second set of probability values includes a first group of probability values corresponding to the audio features extracted from the first audio data segment and a second group of probability values corresponding to the second audio data segment, and
the step of generating the speaker sequential discrimination result includes:
generating permutations of the first subset of probability values stored in the utterance information buffer and the second set of probability values, and calculating, for each of the permutations, a correlation score between the first subset of probability values and the first group of probability values; and
generating the speaker sequential discrimination result based on the permutations that satisfy a predetermined correlation score criterion.

The online speaker sequential discrimination method according to claim 1, wherein
the predetermined selection method includes calculating, for each time position in the first set of time positions, the absolute value of the difference in probability between speakers, and selecting the probability values for which the absolute value satisfies a predetermined criterion as the first subset of probability values.

The online speaker sequential discrimination method according to claim 1, wherein
the predetermined selection method includes calculating an Attention Weight value for each time position in the first set of time positions, and selecting the time positions that satisfy a predetermined Attention Weight criterion as the first subset of time positions.

The online speaker sequential discrimination method according to claim 1, wherein
the predetermined selection method includes selecting random time positions from the first set of time positions as the first subset of time positions.

The online speaker sequential discrimination method according to claim 1, wherein,
when there is a request to write new data to the utterance information buffer,
the data already stored in the utterance information buffer is overwritten with the new data, starting from the data with the oldest write time.

The online speaker sequential discrimination method according to claim 1, wherein,
when there is a request to write new data to the utterance information buffer,
the data already stored in the utterance information buffer is overwritten with the new data in order of decreasing Attention Weight.

An online speaker sequential discrimination device, comprising:
a first data input unit that receives a first audio data segment corresponding to a user interaction and speaker count data indicating the number of speakers in the user interaction;
a first speaker probability determination unit that calculates, based on the first audio data segment and the speaker count data, a first set of probability values indicating, for a first set of time positions in the first audio data segment, a probability that a particular speaker is speaking at a particular time position;
an utterance information buffer management unit that selects, based on a predetermined selection method, a first subset of time positions that is at least part of the first set of time positions and a first subset of probability values that is at least part of the first set of probability values and corresponds to the first subset of time positions, and stores the selected subsets in an utterance information buffer;
a second data input unit that receives a second audio data segment corresponding to the user interaction;
an audio data combination unit that generates a combined audio data segment by extracting, from the first audio data segment, audio features corresponding to the first subset of time positions stored in the utterance information buffer and combining them with the second audio data segment;
a second speaker probability determination unit that calculates, based on the combined audio data segment and the speaker count data, a second set of probability values indicating, for a second set of time positions in the combined audio data segment, a probability that a particular speaker is speaking at a particular time position; and
a speaker sequential discrimination unit that generates, based on the first subset of probability values and the second set of probability values, a speaker sequential discrimination result that identifies each of the speakers of the user interaction.

An online speaker sequential discrimination system, wherein
a client terminal that acquires an audio data segment corresponding to a user interaction and speaker count data indicating the number of speakers in the user interaction, and an online speaker sequential discrimination device that generates a speaker sequential discrimination result identifying each of the speakers of the user interaction, are connected via a communication network, and
the online speaker sequential discrimination device includes:
a first data input unit that receives, from the client terminal, a first audio data segment corresponding to the user interaction and speaker count data indicating the number of speakers in the user interaction;
a first speaker probability determination unit that calculates, based on the first audio data segment and the speaker count data, a first set of probability values indicating, for a first set of time positions in the first audio data segment, a probability that a particular speaker is speaking at a particular time position;
an utterance information buffer management unit that selects, based on a predetermined selection method, a first subset of time positions that is at least part of the first set of time positions and a first subset of probability values that is at least part of the first set of probability values and corresponds to the first subset of time positions, and stores the selected subsets in an utterance information buffer;
a second data input unit that receives a second audio data segment corresponding to the user interaction;
an audio data combination unit that generates a combined audio data segment by extracting, from the first audio data segment, audio features corresponding to the first subset of time positions stored in the utterance information buffer and combining them with the second audio data segment;
a second speaker probability determination unit that calculates, based on the combined audio data segment and the speaker count data, a second set of probability values indicating, for a second set of time positions in the combined audio data segment, a probability that a particular speaker is speaking at a particular time position; and
a speaker sequential discrimination unit that generates, based on the first subset of probability values and the second set of probability values, a speaker sequential discrimination result that identifies each of the speakers of the user interaction, and transmits the result to the client terminal.