JP2020187346A

JP2020187346A - Speaker diarization method and apparatus based on audio-visual data

Info

Publication number: JP2020187346A
Application number: JP2020071403A
Authority: JP
Inventors: ジュンソンチョン; Joon Son Chung; ボンジンイ; Bong Jin Lee; イクサンハン; Icksang Han
Original assignee: Line Corp; Naver Corp
Current assignee: Z Intermediate Global Corp; Naver Corp
Priority date: 2019-05-10
Filing date: 2020-04-13
Publication date: 2020-11-19
Anticipated expiration: 2040-04-13
Also published as: KR102230667B1; JP6999734B2; KR20200129934A

Abstract

To provide a speaker diarization method and apparatus based on audio-visual data. SOLUTION: Provided is a speaker diarization method including the steps of: calculating a correlation between the mouth shape of each speaker included in video data in which a plurality of speakers are captured and each speech segment from the speakers included in audio data; constructing a speaker model for each speaker based on the calculated correlation; and identifying, based on the constructed speaker models, the speaker who utters speech included in the audio data. SELECTED DRAWING: Figure 1

Description

The following description relates to a technique for separating speakers using video data and audio data of a plurality of speakers and, more specifically, to a technique for identifying a speaker (that is, speaker diarization) based on a speaker model constructed from the correlation between the mouth shape of each speaker in the video data and the speech segments in the audio data.

In recent years, there has been growing demand for recording and retrieving human communication (for example, meetings) in machine-readable formats. As large datasets have become more available and deep learning frameworks more accessible, automatic speech recognition for recording such communication has advanced considerably. As a result, it has become important to go beyond simply listing the words of a transcript and to add information about "when" and "by whom" each sentence was uttered.

For example, Patent Document 1 (published May 26, 2010) relates to a method of measuring the reliability of each speaker identification result. It discloses a technique that measures the degree of contribution of the speaker identification result of each frame, measures the reliability of the overall speaker identification result based on those per-frame contributions, and uses this reliability to judge the authenticity of a speaker. In this way, the authenticity of a speaker presented during speaker verification can be judged accurately, and the accuracy of speaker identification in a multi-channel environment can be improved.

The information described above is provided merely to aid understanding. It may include content that does not form part of the prior art, and it may not include content that the prior art could present to a person of ordinary skill in the art.

Korean Patent Application Publication No. 10-2010-0055168

Provided is a speaker diarization method that calculates the correlation between the mouth shape of each speaker included in video data of a plurality of speakers and each speech segment included in audio data, constructs a speaker model for each speaker based on the calculated correlation, and identifies, based on the constructed speaker models, the speaker who utters speech included in the audio data.

Also provided is a method that enables communication between speakers to be recorded by extracting the speech uttered by an identified speaker from the audio data, converting it into text, and recording the converted text in association with time information indicating when the speech was uttered.

In one aspect, there is provided a speaker diarization method, executed by a computer system, for separating speakers using video data and audio data of a plurality of speakers, the method including: calculating a correlation between the mouth shape of each of the speakers included in the video data and each of the speech segments from the speakers included in the audio data; constructing a speaker model for each of the speakers based on the calculated correlation; and identifying, based on the constructed speaker models, which of the speakers utters the speech included in the audio data.

In constructing the speaker model, for each speaker, the top N speech segments having the highest correlation with that speaker's mouth shape among the speech segments may be used to construct the speaker model for that speaker, where N is a natural number.

Of the N speech segments, only those whose correlation with the speaker's mouth shape is equal to or greater than a predetermined threshold may be used to construct the speaker model for each speaker.

Speech in the audio data that overlaps because two or more of the speakers talk at the same time may be separated into the speech of each speaker who uttered it by extracting the pronunciation of each such speaker based on that speaker's mouth shape at the time of the overlapping speech, generating a mask for the portion of the audio data corresponding to the extracted pronunciation, and filtering the overlapping speech.

The mouth shape of each speaker may be identified by recognizing each speaker's face in the video data and tracking the recognized face.

Identifying the speaker may include: detecting the face of each speaker in the video data and tracking the detected face, thereby calculating a correlation between each speaker's mouth shape and uttered speech included in the audio data; and identifying the speaker who uttered the speech using the calculated correlation and the constructed speaker models.

In identifying the speaker who uttered the speech, when the correlation between a specific speaker and the uttered speech in the audio data cannot be calculated because the face or the mouth of that speaker is occluded in the video data, the correlation between that speaker and the uttered speech may be regarded as 0.

Identifying the speaker may include: determining information on the position of the speaker who uttered speech included in the audio data; and identifying the speaker who uttered the speech using the determined position information and the constructed speaker models.

The information on the position of the speaker who uttered the speech may include information on the direction of the uttered speech, and the direction information may include azimuth information associated with the uttered speech.

Among the plurality of speakers, a speaker whose face is recognized in the video data but who is recognized as not speaking at all may be ignored.

In constructing the speaker model, for a specific speaker among the plurality of speakers for whom no speaker model is constructed, either because the face or mouth shape is not recognized in the video data or because no correlation between a mouth shape and a speech segment is available for constructing the speaker model, a speaker model may be constructed by clustering the embeddings corresponding to speech segments whose correlation, as calculated in the correlation calculation step, is less than a predetermined value.

The video data and the audio data may be data included in video in which the speakers are captured in real time.

The speaker diarization method may further include: extracting the speech uttered by the identified speaker from the audio data; converting the extracted speech into text; and recording the converted text in association with time information indicating when the extracted speech was uttered.
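The recording step can be pictured with a minimal Python sketch. The `Utterance` record, its field names, and the output layout of `format_minutes` are hypothetical choices made for illustration; the source does not specify a record format, and the speech-to-text conversion itself is assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One recognized utterance (hypothetical record layout)."""
    speaker: str    # identity produced by the diarization step
    start_s: float  # utterance start time in seconds
    end_s: float    # utterance end time in seconds
    text: str       # output of a speech-to-text step (not shown here)


def format_minutes(utterances: List[Utterance]) -> str:
    """Render utterances in time order, one line per utterance."""
    lines = []
    for u in sorted(utterances, key=lambda u: u.start_s):
        lines.append(f"[{u.start_s:07.2f}-{u.end_s:07.2f}] {u.speaker}: {u.text}")
    return "\n".join(lines)
```

Sorting by start time keeps the record readable even if utterances are appended out of order, for example when overlapping speech is separated after the fact.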

The speaker diarization method may further include filtering noise outside the human voice range from the audio data using a bandpass filter.

In another aspect, there is provided a computer system for separating speakers using video data and audio data of a plurality of speakers, the computer system including a memory and at least one processor coupled to the memory and configured to execute computer-readable instructions contained in the memory, wherein the at least one processor calculates a correlation between the mouth shape of each of the speakers included in the video data and each of the speech segments from the speakers included in the audio data, constructs a speaker model for each of the speakers based on the calculated correlation, and identifies, based on the constructed speaker models, which of the speakers is speaking in the audio data.

Because the speaker model for each speaker, constructed based on the correlation between the mouth shape of each speaker included in the video data and each speech segment from the speakers included in the audio data, is used to identify the speaker who utters speech, the utterance of a specific speaker can be identified accurately even when that speaker is hidden in the video data.

By using, in addition to the constructed speaker model for each speaker, the correlation between each speaker's mouth shape and the uttered speech and/or information on the position from which the speech is uttered, the speaker who utters the speech can be identified more accurately.

By extracting the speech uttered by an identified speaker from the audio data, converting it into text, and recording the converted text in association with time information indicating when the speech was uttered, communication between speakers, such as a meeting, can be recorded automatically.

A diagram briefly showing the pipeline of a speaker diarization system according to an embodiment.
A diagram showing, as an example, a still image of video data used for testing.
A diagram showing, as an example of video data used for testing, a still image of the public AMI meeting data.
A diagram showing, as an example of video data used for testing, a still image of the public AMI meeting data.
A diagram showing speaker diarization of uttered speech and a method of recording utterances from the separated (identified) speakers according to an embodiment.
A block diagram illustrating an example of the internal configuration of a computer system according to an embodiment.
A diagram showing, as an example, minutes recording a meeting between speakers.
A flowchart showing a method of constructing speaker models and a method of separating the speaker who is talking according to an embodiment.
A flowchart showing, as an example, a method of identifying the speaker who is talking.
A flowchart showing, as an example, a method of identifying the speaker who is talking.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The embodiments described below relate to techniques for determining "who spoke" (that is, the speaker who uttered the speech) in a real meeting. The method of the embodiments takes video (for example, surround-view video captured with a 360-degree camera) and single-channel or multi-channel audio as input, and produces robust speaker diarization output from them.

To achieve this, the present disclosure proposes a new iterative approach that first enrolls speaker models using audio-visual correspondence and then determines the active speaker (that is, the speaker who is talking) using the enrolled models and visual information.

The method of the embodiments shows excellent quantitative and qualitative performance on datasets of real meetings. When evaluated on a public dataset, it outperformed all comparable methods (see the test results described later). In addition, when multi-channel audio is available, beamforming is used together with the video to extract the position and/or direction of the speech.

The background and outline of the method of the embodiments are described below.

Speaker diarization, the task of breaking multi-speaker audio into single-speaker segments, has been an active area of research for several years. It is often treated as a single-modality problem in which only audio is used, but it may also be handled with an additional modality such as video. Speaker diarization techniques for both audio and audio-visual data fall into the following two categories.

The first is based on speaker modeling (SM), which rests on the assumption that each individual has distinct voice characteristics.

As an example, a speaker model may consist of Gaussian mixture models (GMMs) and i-vectors. Because deep learning has been shown to be effective for speaker modeling, speaker models may also be built through deep learning.

In many systems, speaker models are pre-trained for the target speakers, so they may not be applicable to unknown participants. Other algorithms use generic models and clustering so that they can adapt to unseen speakers. There is also a considerable body of audio-visual work based on feature clustering.

The second uses sound source localization (SSL). With powerful beamforming methods such as SRP-PHAT, it can achieve better performance than SM-based approaches. However, SSL-based methods are effective only when the speakers' positions are fixed or known. SSL is therefore used as part of audio-visual methods, for example where visual information can be used to track speaker positions. Such approaches depend heavily on the ability to track participants effectively. In the embodiments of the present disclosure, SSL can be combined with a visual analysis module that measures motion and mouth movement.

SM and SSL approaches may be combined using independent models for each type of observation, and the resulting information may be fused within a probabilistic framework based on the Viterbi algorithm or Bayesian filtering.

The present disclosure presents a system that uses audio-visual data to handle speaker movement and occlusions and thereby achieves robust speaker diarization. For this system, a state-of-the-art deep audio-visual synchronization network may be used to detect each participant's speech when the movement of the mouth or lips (that is, the mouth shape) is clearly visible.

This information may be used to enroll a speaker model for each participant, and the enrolled speaker models make it possible to determine who is speaking even when a participant is occluded. By generating a speaker model for each participant, the task can be recast from an unsupervised learning (clustering) problem into a supervised classification problem that estimates the probability that a speech segment belongs to each of the participants. Unlike techniques that compute the likelihood for each type of observation before multimodal fusion, the audio-visual synchronization of the present disclosure may be used in the speaker enrollment process.

In addition, when a multi-channel microphone is available, beamforming may be applied to estimate the position of the sound source, and spatial cues from both modalities may be used to improve system performance.

A multimodal speaker diarization system (audio-visual system) according to an embodiment of the present invention is described below.

The audio processing portion of the audio-visual system may be built from the methods of well-known audio processing systems. For example, a long short-term memory (LSTM)-based denoising model trained on simulated training data may be used as the speech enhancement system. A pre-trained x-vector model may be used to extract speaker embeddings (speaker models). The x-vector extractor and the PLDA parameters may be trained on a dataset with data augmentation (additive noise).

According to embodiments of the present invention, at least three types of information (such as audio-to-video correlation, speaker models, and audio direction) may be used to determine the current speaker in the video.

FIG. 1 is a diagram briefly showing the pipeline of a speaker diarization system according to an embodiment.

In the preprocessing stage, face regions may be detected in the video data (face detection) and tracked (face tracking) to obtain a face track, that is, a video of each face region. In addition, face recognition using profile images may be performed to determine whose face each face track shows.

For face detection and face tracking, a CNN face detector based on SSD (Single Shot MultiBox Detector), for example, may be used to detect faces in every frame of the video. Such a detector may track faces across a variety of poses and lighting conditions. A position-based face tracker may be used to group the individual face detections into face tracks.
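One way a position-based tracker could group per-frame detections is greedy overlap matching, sketched below in Python. The intersection-over-union criterion, the `min_iou` value, and the function names are assumptions made for illustration; the source states only that the tracker is position-based.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def group_into_tracks(frames: List[List[Box]],
                      min_iou: float = 0.5) -> List[List[Tuple[int, Box]]]:
    """Attach each detection to the track that overlaps it most in the previous frame.

    `frames[t]` holds the face boxes detected in frame t. A detection that
    overlaps no live track by at least `min_iou` starts a new track.
    """
    tracks: List[List[Tuple[int, Box]]] = []
    for t, boxes in enumerate(frames):
        for box in boxes:
            best, best_iou = None, min_iou
            for k, track in enumerate(tracks):
                last_t, last_box = track[-1]
                if last_t != t - 1:  # track ended earlier or was already extended
                    continue
                score = iou(last_box, box)
                if score >= best_iou:
                    best, best_iou = k, score
            if best is None:
                tracks.append([(t, box)])
            else:
                tracks[best].append((t, box))
    return tracks
```

A track that receives no detection in a frame is simply closed in this sketch; a practical tracker would likely tolerate short gaps.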

Face recognition requires a face image of each participant, which makes it possible to identify and track faces in the meeting room regardless of where the participants are. The face images may be supplied by user input or taken from profile images. The face images of all participants may be represented and stored as commonly known face recognition features, for example, embeddings from a VGGFace2 network.

As shown in FIG. 2, face regions are detected in a video in which a plurality of speakers converse (shown as rectangular regions), and the corresponding face tracks are named Face track 1, Face track 2, ..., Face track n in order. Face track 1, for example, may be compared with one or more pre-registered profile images to determine which speaker it corresponds to and thereby confirm that speaker's identity.
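The comparison between a face track and the registered profiles can be sketched as a nearest-neighbour search over embeddings. This is a minimal Python illustration: cosine similarity, the `min_similarity` cut-off, and the toy three-dimensional vectors are assumptions, and the embeddings are presumed to come from some face recognition network that is not shown.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify_track(track_embedding: List[float],
                   profiles: Dict[str, List[float]],
                   min_similarity: float = 0.5) -> Optional[str]:
    """Return the profile name most similar to the face track, or None if none is close enough."""
    best_name, best_score = None, min_similarity
    for name, profile_embedding in profiles.items():
        score = cosine_similarity(track_embedding, profile_embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Usage with toy embeddings (real face embeddings have many more dimensions).
profiles = {"speaker_a": [1.0, 0.0, 0.0], "speaker_b": [0.0, 1.0, 0.0]}
print(identify_track([0.9, 0.1, 0.0], profiles))
```

Returning `None` for a poor match leaves room for participants who have no registered profile image.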

A method of preprocessing the audio data (not shown) may include passing the audio data through a bandpass filter covering 200 to 7000 Hz to reduce noise outside the human voice range. In addition, a voice activity detector, for example, may be used to determine whether speech is present in the audio.
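A 200 to 7000 Hz bandpass can be approximated by cascading a high-pass and a low-pass section, as in the Python sketch below. Only the passband comes from the text; the 16 kHz sample rate, the second-order Butterworth sections (standard "audio EQ cookbook" biquad formulas), and all function names are assumptions.

```python
import math
from typing import List, Tuple


def butterworth_biquad(kind: str, cutoff_hz: float,
                       sample_rate: float) -> Tuple[List[float], List[float]]:
    """Second-order Butterworth section (Q = 1/sqrt(2)), normalized so a[0] == 1."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q) with Q = 1/sqrt(2)
    if kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    else:
        raise ValueError("kind must be 'lowpass' or 'highpass'")
    a0 = 1 + alpha
    return [x / a0 for x in b], [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]


def apply_biquad(samples: List[float], b: List[float], a: List[float]) -> List[float]:
    """Direct-form I filtering of a sample sequence."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def bandpass(samples: List[float], sample_rate: float = 16000.0,
             low_hz: float = 200.0, high_hz: float = 7000.0) -> List[float]:
    """High-pass at low_hz followed by low-pass at high_hz."""
    b, a = butterworth_biquad("highpass", low_hz, sample_rate)
    samples = apply_biquad(samples, b, a)
    b, a = butterworth_biquad("lowpass", high_hz, sample_rate)
    return apply_biquad(samples, b, a)


def rms(samples: List[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz: float, sample_rate: float = 16000.0, seconds: float = 0.5) -> List[float]:
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

In this sketch a 50 Hz hum is strongly attenuated while a 1 kHz tone passes almost unchanged. A production pipeline would more likely use a higher-order filter from a signal processing library.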

In Phase 1 of FIG. 1, speaker models are enrolled from the audio and face track data using the audio-to-video correlation (AV correlation).

In Phase 2 of FIG. 1, the enrolled speaker models, the audio, and the face track data may be used to determine who the speaker of the current utterance is. Specifically, 1) speaker verification is performed using the audio and the speaker models, 2) the utterance direction is calculated from the audio, and 3) the AV correlation is calculated from the audio and the face tracks. The results of 1) to 3) may then be used to make the final decision on the speaker.
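How the three results are combined is not spelled out at this point, so the following Python sketch simply assumes a weighted sum of per-speaker scores. The weights, the score dictionaries, and the function name are hypothetical. The only rule taken from the text is that the AV correlation of a speaker whose face or mouth is occluded is treated as 0.

```python
from typing import Dict, Optional, Tuple


def decide_speaker(speaker_scores: Dict[str, float],
                   direction_scores: Dict[str, float],
                   av_correlations: Dict[str, Optional[float]],
                   weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> str:
    """Pick the speaker with the highest combined score (assumed weighted sum)."""
    w_model, w_direction, w_av = weights
    totals = {}
    for name, model_score in speaker_scores.items():
        av = av_correlations.get(name)
        if av is None:
            av = 0.0  # face or mouth occluded: correlation regarded as 0
        totals[name] = (w_model * model_score
                        + w_direction * direction_scores.get(name, 0.0)
                        + w_av * av)
    return max(totals, key=totals.get)


# Speaker A is occluded in the video but still wins on the audio-only cues.
print(decide_speaker(
    speaker_scores={"A": 0.9, "B": 0.2},
    direction_scores={"A": 0.8, "B": 0.1},
    av_correlations={"A": None, "B": 0.3},
))
```

The example shows why enrolled speaker models matter: the occluded speaker contributes no visual evidence, yet the audio-only cues are enough to attribute the utterance.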

One embodiment of the present invention performs Phase 1 on the entire audio and video data to enroll speaker models before performing Phase 2, and the description below is based on this embodiment. However, an embodiment is also possible in which speaker diarization is performed while speaker models are being enrolled for new speakers who have not yet been enrolled when Phase 2 is performed.

The audio-to-video correlation (AV correlation) of the audio-visual system of the embodiments is described below. Cross-modal embeddings of the audio and the mouth movement may be used to represent the respective signals. A strategy for training such a joint embedding is, for example, as follows.

The network may consist of two streams: an audio stream that encodes MFCC (Mel-frequency cepstral coefficient) input into a 512-dimensional vector, and a video stream that encodes cropped face images into a 512-dimensional vector. The network may be trained on a multi-way matching task between one video clip and N audio clips. The Euclidean distance between the video feature and each audio feature is computed, yielding N distances. The network may be trained with a cross-entropy loss on the reciprocals of these distances after they pass through a softmax layer, so that the similarity of a matching pair becomes greater than that of non-matching pairs.
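The training objective described above can be written out for a single training example as a small Python sketch. The feature vectors are toy stand-ins for the 512-dimensional stream outputs, and the `eps` guard against division by zero and the log-sum-exp formulation are implementation choices of this sketch rather than details from the source.

```python
import math
from typing import List


def euclidean(u: List[float], v: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def multiway_matching_loss(video_feature: List[float],
                           audio_features: List[List[float]],
                           target_index: int,
                           eps: float = 1e-6) -> float:
    """Cross-entropy of softmax(1 / distance) against the matching audio clip."""
    logits = [1.0 / (euclidean(video_feature, a) + eps) for a in audio_features]
    # Numerically stable -log(softmax(logits)[target_index]).
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


# One video clip against N = 3 audio clips; clip 0 is the true match.
video = [0.0, 0.0]
audios = [[0.1, 0.0], [2.0, 0.0], [3.0, 0.0]]
print(multiway_matching_loss(video, audios, target_index=0))
```

The loss is small when the matching audio clip is the closest to the video feature and grows when a non-matching clip is closer, which is what pushes matching pairs together during training.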

The cosine distance between the two embeddings may be used to measure the correspondence between the two inputs. A small distance between the features is therefore expected when the face image corresponds to the current speaker and is in sync with the audio, and a large distance is expected otherwise. Because the video comes from a single continuous source, the AV offset may be assumed to be fixed throughout the session. The embedding distances are smoothed over time with a median filter to remove outliers.
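The distance measure and the smoothing step can be illustrated with a short Python sketch. The kernel size and the shortened windows at the sequence edges are assumptions; the source specifies only that a median filter is applied over time.

```python
import math
from statistics import median
from typing import List


def cosine_distance(u: List[float], v: List[float]) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def median_smooth(values: List[float], kernel: int = 5) -> List[float]:
    """Sliding median; windows are truncated at the start and end of the sequence."""
    half = kernel // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


# A single spurious spike in the per-frame distances is removed.
print(median_smooth([0.2, 0.2, 0.9, 0.2, 0.2], kernel=3))
```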

The speaker models of the audio-visual system of the embodiments and the method of enrolling them are described below. Because the AV correlation can be used only when the mouth movement is clearly visible, the present disclosure enrolls a speaker model for each speaker so that the active speaker (that is, the speaker who is talking) can be determined even when audio-visual synchronization is unavailable, for example because of occlusion, or when the availability of such signals is low.

In the embodiments of the present invention, a speaker model (speaker embedding) for a speaker may be enrolled using audio data in which the speaker of the corresponding audio interval has been identified by the AV correlation.

In one embodiment, the audio data is divided into speech segments of a fixed length (for example, 1.5 seconds or 2 seconds each). The analysis is run over the entire video in advance to find the speech segments that can be confidently attributed to each speaker, and these segments are used to obtain the speaker models.

In the embodiments, N = 10 (or 3) may be used as an example. When fewer than N confident segments exceed the AV correlation threshold, only the segments whose correlation exceeds the threshold may be used to enroll the speaker model.
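The selection of enrollment segments for one speaker can be sketched in a few lines of Python. N = 10 comes from the text; the threshold value of 0.5 and the function name are placeholders, since no specific threshold is stated.

```python
from typing import List


def select_enrollment_segments(correlations: List[float],
                               n: int = 10,
                               threshold: float = 0.5) -> List[int]:
    """Indices of the top-n segments by AV correlation, keeping only those at or above the threshold.

    `correlations[i]` is the AV correlation between this speaker's mouth
    shape and speech segment i.
    """
    ranked = sorted(range(len(correlations)),
                    key=lambda i: correlations[i], reverse=True)
    return [i for i in ranked[:n] if correlations[i] >= threshold]


# With n = 3, segment 3 (0.4) is ranked out and segment 1 (0.1) never qualifies.
print(select_enrollment_segments([0.9, 0.1, 0.7, 0.4, 0.8], n=3))
```

If the returned list is empty, no model can be enrolled for that speaker from AV correlation alone; the text handles such speakers separately by clustering low-correlation segments.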

To extract a speaker model from the segments, conventional models such as x-vector or ResNet may be used, although the choice is not limited to these. Because deeper models can generalize better to difficult datasets than a small x-vector model, it is preferable to use a trained deep ResNet-50 model.

For example, the speaker model may be extracted by computing features over a 1.5-second window and, as in the baseline system, advancing 0.75 seconds at a time (or one frame at a time). By comparing the speaker model at each time step with the enrolled speaker models, the likelihood that a speech segment belongs to any given speaker may be estimated. Even when no visual information is available at inference time, this turns the task into a supervised classification problem, which is generally more reliable than unsupervised clustering. That is, the task becomes a classification or pairwise verification problem, which generally yields much stronger performance than clustering.
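
As a rough sketch of the windowing and comparison step, the following computes the start times of 1.5-second windows with a 0.75-second hop and scores one window embedding against enrolled speaker embeddings. The text does not state which similarity measure is used; cosine similarity is an assumption here, as are all function names and the toy two-dimensional embeddings.

```python
import math
from typing import Dict, List, Sequence


def window_starts(duration_s: float, window_s: float = 1.5, hop_s: float = 0.75) -> List[float]:
    """Start times of all full windows that fit inside the recording."""
    starts = []
    k = 0
    while k * hop_s + window_s <= duration_s + 1e-9:
        starts.append(k * hop_s)
        k += 1
    return starts


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_against_enrolled(
    embedding: Sequence[float], enrolled: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Score one window embedding against every enrolled speaker model."""
    return {name: cosine_similarity(embedding, model) for name, model in enrolled.items()}
```

Because every window is scored against a fixed set of enrolled models, the decision per time step is a classification over known speakers rather than a clustering of unlabeled embeddings.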

The following describes the sound source localization (SSL) used by the audio-visual system of the embodiment. Like the speaker model, the direction of the sound source is a useful cue as to who is speaking. To determine it, for example, the direction of the audio source is estimated for every audio sample of the recording from the camera's 4-channel microphone, and the direction for each video frame may be determined by building a histogram, with a bin size of 10 degrees, of all azimuth (θ) values within a ±0.5-second period.
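
The per-frame histogram can be sketched as below. This assumes the caller has already gathered the per-sample azimuth estimates that fall within ±0.5 seconds of the video frame; the function name `dominant_azimuth`, the choice to return the bin center, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def dominant_azimuth(azimuths_deg: Iterable[float], bin_size_deg: float = 10.0) -> Optional[float]:
    """Return the center (in degrees) of the most populated azimuth bin.

    azimuths_deg: per-sample azimuth estimates for one video frame's window.
    Angles are wrapped into [0, 360). Ties go to the lowest bin. Returns None
    when no estimates are given.
    """
    n_bins = int(round(360.0 / bin_size_deg))
    counts = Counter(int((theta % 360.0) // bin_size_deg) % n_bins for theta in azimuths_deg)
    if not counts:
        return None
    best_bin = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
    return (best_bin + 0.5) * bin_size_deg
```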

The likelihood that the audio at a given time belongs to a particular speaker may correlate with the angle between the estimated audio source and the face detection for that speaker in the video.

The following describes multimodal fusion in the audio-visual system according to the embodiment.

The three sources of information (AV correlation, speaker model, and audio direction) may each provide a confidence score for every speaker and time step. As described below, these scores may be combined into a single overall confidence score (C_{overall}) for every speaker and time step using a simple weighted fusion. Here, C_{sm} is the confidence score from the speaker model, C_{avc} is the score from the AV correspondence, and θ^* and φ are, respectively, the angle (position) of the face and the estimated DoA (direction of arrival, i.e., the direction from which the audio is heard) of the audio. α and β are predetermined weights whose values may be adjusted according to the importance of each score. In general, the values that give the best performance on the training data are used.

In Equation (1) below, when a particular speaker is not visible to the camera, the second and third terms may be set to 0.

C_{overall} = C_{sm} + α · C_{avc} + β · cos(φ − θ^*)   ... (1)

Based on the computed confidence score (C_{overall}), the speaker who produced the speech uttered at the relevant time step may be identified.
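
Equation (1) and the final decision can be sketched as follows. The function names, the use of degrees for the angles, and the weight values in the usage note are assumptions for illustration; the text only says that α and β are tuned on data.

```python
import math
from typing import Dict


def overall_confidence(
    c_sm: float,
    c_avc: float,
    face_angle_deg: float,  # theta^*: angle of the speaker's face
    doa_deg: float,         # phi: estimated direction of arrival of the audio
    alpha: float,
    beta: float,
    visible: bool = True,
) -> float:
    """Weighted fusion of Equation (1) for one speaker at one time step.

    When the speaker is not visible to the camera, the AV-correspondence and
    direction terms are set to 0, so only the speaker-model score remains.
    """
    if not visible:
        return c_sm
    direction_term = math.cos(math.radians(doa_deg - face_angle_deg))
    return c_sm + alpha * c_avc + beta * direction_term


def pick_speaker(scores: Dict[str, float]) -> str:
    """Return the speaker with the highest overall confidence."""
    return max(scores, key=scores.get)
```

For instance, with the hypothetical weights α = 0.5 and β = 0.2, a visible speaker whose face lies exactly in the estimated audio direction receives the full β bonus, because cos(0) = 1.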

The following describes experiments comparing the performance of a conventional audio-only system with that of the audio-visual system of the embodiment. In the present disclosure, the systems were evaluated on two independent datasets: an internal dataset of meetings recorded in 360 degrees, and the publicly available AMI meeting corpus. Each is described in more detail below.

The internal meeting dataset consists of audio-visual recordings of regular meetings, with no special instructions given to the participants regarding the video recording. The meetings formed part of the daily discussions in the workplace and were not set up with the speaker diarisation task in mind. A substantial part of the dataset consists of very short utterances with frequent speaker changes, which is an extremely challenging condition for speaker diarisation. The video was recorded with a GoPro (registered trademark) Fusion camera, which has two fisheye lenses and captures 360-degree video of the meeting. The video is 25 frames per second, and the two views may be stitched into a single surround-view video at a resolution of 5228 × 2624. The audio may be recorded by a 4-channel microphone at 48 kHz. FIG. 2 shows a still image from this dataset, that is, a still image of the video data used for the experiments according to one embodiment.

The dataset may include a validation set of approximately 3 hours and a carefully annotated test set of 40 minutes. There are nine speakers in the test video. Where speech overlaps, only the ID of the main (loudest) speaker was annotated. The embedding extractor and the AV synchronization network are trained on external datasets, and the validation set may be used only to tune the AHC (agglomerative hierarchical clustering) threshold of the baseline system and the fusion weights of the system of the embodiment.

The AMI corpus consists of 100 hours of video recorded at multiple sites. The system of the embodiment was evaluated on the meetings in the ES and IS categories, which contain approximately 30 hours and 17 hours of video, respectively. The image quality is relatively low, with a video resolution of 288 × 352 pixels. The audio was recorded with an 8-element circular equispaced microphone array with a diameter of 20 cm; in most of the experiments, however, only one microphone of the array may be used. The video was recorded by four cameras that provide a close-up view of each meeting participant and, unlike the internal dataset described above, the images are not stitched together. The ES videos may be used as a validation set for tuning the thresholds. FIGS. 3a and 3b show still images of the public AMI meeting data used as video data for the experiments according to one example. FIG. 3a is a still image of video data showing close-up views of the individual meeting participants, and FIG. 3b is a still image of video data from a wide shot capturing the meeting participants and the whiteboard.

For each detected face crop, a face embedding is extracted using VGGFace2 and compared with each of the N enrolled face-recognition features (embeddings), so that the crop may be classified as one of the N speakers. A constraint was applied that face crops occurring at the same time cannot be assigned to the same speaker.
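
The text does not say how the one-face-per-speaker constraint is enforced. One simple possibility, shown here purely as an assumption, is a greedy one-to-one assignment over the similarity scores of the faces visible at the same instant; the function name and the nested-dictionary input format are hypothetical.

```python
from typing import Dict


def assign_faces(similarity: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Greedily assign simultaneous face tracks to distinct speakers.

    similarity[track][speaker] is the similarity between a face crop's
    embedding and an enrolled speaker's face embedding. Pairs are visited from
    most to least similar, and a pair is accepted only if neither the track
    nor the speaker has been used already.
    """
    pairs = sorted(
        ((score, track, speaker)
         for track, row in similarity.items()
         for speaker, score in row.items()),
        reverse=True,
    )
    assigned: Dict[str, str] = {}
    used_speakers = set()
    for _score, track, speaker in pairs:
        if track in assigned or speaker in used_speakers:
            continue
        assigned[track] = speaker
        used_speakers.add(speaker)
    return assigned
```

Without the constraint, two tracks that both resemble the same speaker most closely would receive the same label; with it, the weaker match falls back to its next-best speaker.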

Table 1 below shows the speaker diarisation results from the experiments comparing the baseline system with the audio-visual system of the embodiment. Lower values indicate better performance. All rows except the last four show results on the AMI dataset. WB: whiteboard; NWB: no whiteboard; Xch+V: X-channel audio plus video; SM: speaker modelling; AVC: audio-visual correspondence; SSL: sound source localization; MS: missed speech; FA: false alarm; SPKE: speaker error; DER: diarisation error rate.

As the evaluation metric, DER was used. DER may be decomposed into three components: missed speech (MS, speech present in the reference but not in the hypothesis), false alarm (FA, speech present in the hypothesis but not in the reference), and speaker error (SPKE, speech assigned to the wrong speaker ID).
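
The decomposition above can be written as a short calculation. The text does not spell out the normalization; the sketch below assumes the usual convention of dividing the summed error durations by the total duration of reference speech. The function name and the example numbers are illustrative only.

```python
def diarisation_error_rate(
    missed_s: float,
    false_alarm_s: float,
    speaker_error_s: float,
    total_reference_speech_s: float,
) -> float:
    """DER as the sum of MS, FA and SPKE durations over total reference speech time."""
    if total_reference_speech_s <= 0:
        raise ValueError("total reference speech time must be positive")
    return (missed_s + false_alarm_s + speaker_error_s) / total_reference_speech_s
```

With a fixed voice activity detector, the missed-speech and false-alarm durations stay the same across systems, so any difference in DER between two systems comes from the speaker-error term alone.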

The tool used to evaluate the systems was developed by NIST for the RT speaker diarisation evaluations, and includes a tolerance margin of 250 ms to compensate for human error in the reference annotations.

See Table 1 above for the results on the AMI corpus. Figures for meetings in which the whiteboard is used are reported separately so that the results can be compared.

Because the same VAD (voice activity detection) system was used in all experiments, the missed speech and false alarm rates are identical across the different models for each dataset. The speaker error rate (SPKE) is therefore the only metric affected by the speaker diarisation system.

The speaker-model-only system (SM) may use visual information only to find when to enroll the speaker models, and uses only audio during inference. When the experiments share the same audio processing pipeline and embedding extractor, the performance gain comes from recasting the clustering problem as a classification problem. This alone yields relative improvements in speaker error of 48% and 26% on the ES and IS sets, respectively.

The results in Table 1 confirm that adding AV correlation (AVC) and sound source localization (SSL) at inference time clearly improves performance. Depending on the test set, the relative contributions of these modalities to overall performance are 20 to 40% and 19 to 39%, respectively. These results substantially outperform the prior art under all test conditions.

On the internal meeting dataset the speaker error rate is markedly worse, owing to the challenging nature of the dataset and the large number of speakers. The results in Table 1 confirm that the baseline system does not generalize to such a dataset, whereas the multimodal system of the embodiment achieves comparatively good performance even on such "real-world data".

These results confirm that the embodiments of the present disclosure, by adopting a multimodal system that exploits audio-visual correlation to enroll speaker models, achieve a substantial advantage over the clustering methods commonly used for speaker diarisation.

In addition, the following describes how to handle speakers who are not enrolled as speaker models. Unenrolled speakers may take part in a session (meeting) for various reasons, for example when (1) someone in the meeting does not speak at all, (2) no AV correlation is available because of occlusion, or (3) someone joins the meeting by telephone.

Two possible solutions are proposed for this. First, in case (1), a person who never speaks may simply be ignored. In cases (2) and (3), after Phase 2 of FIG. 1 has been performed, the segments that have low confidence, for both AV correlation and speaker recognition, under any of the active models may be clustered, and the speakers corresponding to (2) and (3) may be enrolled (i.e., speaker models may be built for them).

The more specific structure and implementation of the system of the embodiment are described below with reference to FIGS. 4 to 9.

FIG. 4 shows a method, according to one embodiment, of diarising the speakers who utter speech and of recording the utterances of the separated (identified) speakers.

With reference to FIG. 4, the following describes a method of separating the speakers who utter speech using the audio-visual system (multimodal system) of the embodiment described above, and of recording the speech of the separated speakers.

In the illustrated example, video data and audio data of the speakers (speakers A to D) are acquired by a 360-degree camera and a microphone (not shown), respectively. The audio-visual system of the embodiment may calculate the correlation between the mouth shape of each of speakers A to D in the video data (e.g., changes in the shape of the mouth or lips) and the speech segments from each of speakers A to D in the audio data, and may build (enroll) a speaker model for each of speakers A to D on that basis. Based on the speaker models thus built, the audio-visual system may identify which of speakers A to D utters the speech contained in the audio data. That is, the audio-visual system may identify the utterances of each of speakers A to D from the audio data.

For example, speakers A to D may be participants in a meeting, and the audio-visual system may identify who is speaking during the meeting.

The audio-visual system may extract from the audio data the speech uttered by an identified speaker, convert the extracted speech to text, and record the converted text in association with time information indicating when the extracted speech was uttered. For example, the audio-visual system may record the text in association with the time information as the minutes of a meeting in which speakers A to D participate.

FIG. 6 shows an example of minutes recording a meeting between speakers.

As illustrated, for each of speakers A to D identified by the audio-visual system, the speech uttered by that speaker is converted to text, and the speech may be recorded as the minutes 600 together with time information indicating when it was uttered. Although not illustrated, the minutes 600 may further include an image associated with each speaker (e.g., a face photograph or thumbnail of the speaker).

A more specific speaker identification method, and the configuration and operation of the audio-visual system, are described in more detail with reference to FIG. 5 and FIGS. 7 to 9.

The technical features described above with reference to FIGS. 1 to 3 apply equally to FIGS. 4 and 6, and redundant description is therefore omitted.

FIG. 5 is a block diagram illustrating an example of the internal configuration of a computer system according to one embodiment.

For example, a speaker diarisation apparatus for identifying the speaker who utters speech according to an embodiment of the present invention may be implemented by the computer system 500 of FIG. 5. The computer system 500 may correspond to the audio-visual system described above. As shown in FIG. 5, the computer system 500 may include, as components for executing the speaker diarisation method that identifies the speaker who utters speech, a processor 510, a memory 520, a persistent storage device 530, a bus 540, an input/output interface 550, and a network interface 560. Unlike what is illustrated, the computer system 500 may consist of a plurality of computer systems. The computer system 500 may be, for example, a system for recording communication such as a meeting among a plurality of speakers, or a part of such a system. The computer system 500 may be included in a device that contains the camera capturing the speakers, or may be (or be part of) a computer or other server that communicates with that device over a wired and/or wireless connection.

The processor 510 may include, or be part of, any device capable of processing a sequence of instructions, as a component for implementing the speaker diarisation method that identifies the speaker who utters speech. The processor 510 may include, for example, a computer processor, or a processor and/or digital processor in a mobile device or other electronic device. The processor 510 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, or a content platform. The processor 510 may be connected to the memory 520 via the bus 540.

The memory 520 may include volatile memory, persistent memory, virtual memory, or other memory for storing information used or output by the computer system 500. The memory 520 may include, for example, RAM (random access memory) and/or DRAM (dynamic RAM). The memory 520 may be used to store arbitrary information such as state information of the computer system 500. The memory 520 may also be used to store instructions of the computer system 500, including, for example, instructions for executing the speaker diarisation method that identifies the speaker who utters speech. The computer system 500 may include one or more processors 510 where necessary or appropriate.

The bus 540 may include a communication infrastructure that enables interaction among the various components of the computer system 500. The bus 540 may carry data between components of the computer system 500, for example between the processor 510 and the memory 520. The bus 540 may include wireless and/or wired communication media between the components of the computer system 500, and may include parallel, serial, or other topology arrangements.

The persistent storage device 530 may include components such as memory or other persistent storage used by the computer system 500 to store data for a predetermined extended period (e.g., longer than the memory 520). The persistent storage device 530 may include non-volatile main memory such as that used by the processor 510 in the computer system 500. The persistent storage device 530 may include, for example, flash memory, a hard disk, an optical disc, or another computer-readable medium. The persistent storage device 530 may store, for example, the minutes 600 described above or data associated with the minutes 600.

The input/output interface 550 may include interfaces to a keyboard, a mouse, voice command input, a display, or other input/output devices. Commands and/or inputs related to the speaker diarisation method that identifies the speaker who utters speech may be received via the input/output interface 550.

The network interface 560 may include one or more interfaces to networks such as a short-range network or the Internet. The network interface 560 may include interfaces for wired or wireless connections. Commands and/or inputs related to the speaker diarisation method that identifies the speaker who utters speech may be received via the network interface 560.

In other embodiments, the computer system 500 may include more components than those shown in FIG. 5. However, most conventional components need not be explicitly illustrated. For example, the computer system 500 may be implemented to include at least some of the input/output devices connected to the input/output interface 550 described above, or may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera (e.g., a 360-degree camera for capturing the speakers), a microphone (e.g., at least one microphone for recording the speakers' voices), various sensors, and a database.

The embodiments implemented by such a computer system 500 may provide a speaker diarisation method that calculates the correlation between the mouth shape of each speaker in the video data and each of the speech segments from the speakers in the audio data, builds a speaker model for each speaker based on the calculated correlation, and identifies, based on the speaker models thus built, the speaker who utters the speech contained in the audio data.

The technical features described above with reference to FIGS. 1 to 4 apply equally to FIG. 5, and redundant description is therefore omitted.

FIG. 7 is a flowchart showing a method of building speaker models and a method of separating the speakers who utter speech, according to one embodiment.

With reference to FIG. 7, the speaker diarisation method that is executed by the computer system 500 to identify the speaker who utters speech is described in detail.

Video data and audio data of a plurality of speakers may be used to separate the speakers. The video data may be footage of the plurality of speakers captured by a camera, and the audio data may correspond to that video data. That is, the video data and the audio data may together constitute a single video. For example, the video data and the audio data may be data contained in a video capturing the speakers in real time, or data contained in a previously recorded video of the speakers.

In step 720, the processor 510 may calculate the correlation between the mouth shape of each speaker in the video data (e.g., changes in mouth shape) and each of the speech segments in the audio data. A speech segment is a portion of a speaker's speech contained in the audio data and may correspond to the speech within a predetermined duration (e.g., 1.5 seconds). The mouth shape of each speaker may be identified by the processor 510 recognizing each speaker's face in the video data and tracking the recognized face.

In step 730, the processor 510 may build and enroll a speaker model for each speaker based on the calculated correlation. In other words, step 730 may be a step of enrolling each speaker (speaker model) involved in the communication (e.g., each participant in the meeting).

As one example, for each speaker, the processor 510 may use the top N speech segments in the audio data that correlate most strongly with that speaker's mouth shape to build the speaker model for that speaker. N may be a natural number, for example 10. Further, among the N speech segments, the processor 510 may use only those whose correlation with the speaker's mouth shape is at or above a predetermined threshold to build (and enroll) the speaker model for that speaker.

In step 740, the processor 510 may identify, based on the speaker models thus built, which of the speakers utters the speech contained in the audio data. For example, the processor 510 may determine which speaker uttered the speech heard when the audio data (or a video containing it) is played back.

In another embodiment, even when the audio data is real-time audio data (i.e., audio data contained in a video capturing a meeting between speakers in real time), the processor 510 may determine in real time (or near real time) which speaker uttered the speech. In this case, for the video being captured in real time, the processor 510 may build (enroll) a model for each speaker in real time (or near real time), or may update the models of previously built (enrolled) speakers.

When the audio data contains overlapping speech from two or more speakers, the overlapping speech may be separated and processed. For example, based on the mouth shape (lip movement) of a speaker at the time the overlapping speech was uttered, the processor 510 may extract the pronunciation of that speaker, generate a mask for the portion of the audio data corresponding to the extracted pronunciation, and filter the overlapping speech. In this way, speech that overlaps because several speakers talked at the same time may be separated into the speech of each speaker and recorded.

As shown in step 710, the video data and/or the audio data may be preprocessed before step 720 is performed. For example, the processor 510 may use a band-pass filter (for example, with a pass band of 200 to 7000 Hz) to remove noise outside the range of the human voice from the audio data. The processor 510 may also perform preprocessing for recognizing the face of a specific person in the video data. The processor 510 may recognize the face of a specific person by detecting faces in the video data and tracking the detected faces. Profile image(s) of speakers registered (recorded) in advance may additionally be used for this identification.
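As a non-limiting illustration of the audio preprocessing, the following sketch removes components outside a 200 to 7000 Hz pass band by zeroing discrete Fourier transform bins. The description only states that a band-pass filter may be used; the frequency-domain approach, the 16 kHz sample rate, and the function names are assumptions made for this example, and a practical system would more likely use an FFT-based or IIR/FIR filter design.

```python
import cmath
import math
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def bandpass(samples: List[float], sample_rate: int,
             low_hz: float = 200.0, high_hz: float = 7000.0) -> List[float]:
    """Zero every frequency bin outside [low_hz, high_hz] and transform back."""
    n = len(samples)
    spectrum = dft([complex(s) for s in samples])
    for k in range(n):
        # Bins k and n - k carry the same physical frequency for a real signal.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return [v.real for v in idft(spectrum)]


SAMPLE_RATE = 16000  # assumed sample rate
N = 320              # 20 ms of audio, 50 Hz bin spacing

# A 1000 Hz tone (inside the pass band) plus a 50 Hz hum (outside it).
speech_like = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
hum = [0.5 * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE) for t in range(N)]
noisy = [a + b for a, b in zip(speech_like, hum)]
filtered = bandpass(noisy, SAMPLE_RATE)
```

In this toy signal both tones fall exactly on DFT bins, so the hum is removed and the 1000 Hz component passes through essentially unchanged.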

In step 750, the processor 510 may extract, from the audio data, the speech uttered by the identified speaker.

In step 760, the processor 510 may convert the extracted speech into text. A suitable speech-to-text (STT) technology may be used to convert the speech into text. For example, the speech may be converted into text by a module implemented with artificial intelligence, deep learning, or another neural network.

In step 770, the processor 510 may record the converted text in association with time information indicating when the extracted speech was uttered. For example, the processor 510 may record what each speaker said in association with the time at which it was said, as in the minutes 600 shown in FIG. 6. In this way, the computer system 500 can automatically record communication between speakers, separated by speaker.
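As a non-limiting illustration of steps 750 to 770, the following sketch stores each converted utterance together with its speaker and time information and renders the records as minutes. The `Utterance` record, its field names, and the output format are assumptions made for this example; the layout of the minutes 600 in FIG. 6 is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One recognized utterance (hypothetical record layout)."""
    speaker: str      # label of the identified speaker
    start_sec: float  # time at which the speech starts
    end_sec: float    # time at which the speech ends
    text: str         # speech-to-text result


def format_minutes(utterances: List[Utterance]) -> List[str]:
    """Render utterances in chronological order as '[MM:SS] speaker: text' lines."""
    lines = []
    for u in sorted(utterances, key=lambda item: item.start_sec):
        minutes, seconds = divmod(int(u.start_sec), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {u.speaker}: {u.text}")
    return lines


records = [
    Utterance("Speaker B", 75.2, 78.0, "I agree."),
    Utterance("Speaker A", 3.0, 6.5, "Let's begin."),
]
minutes = format_minutes(records)
```

Keeping the start and end times on each record, rather than only in the rendered text, leaves the transcript searchable by both speaker and time.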

In the embodiments, even when a specific speaker speaks while the mouth of that speaker is hidden in the video data, it is possible to identify, based on the constructed speaker model, whether the utterance was made by that specific speaker.

Meanwhile, among the plurality of speakers, a speaker whose face is recognized in the video data but who is recognized as not speaking at all may be ignored. That is, a speaker model need not be constructed for such a speaker. Alternatively, because the face is recognized in the video data, the computer system 500 may record information indicating that the speaker was present as a participant in the meeting but did not speak at all.

In constructing the speaker models in step 730, there may be a specific speaker among the plurality of speakers for whom no speaker model is constructed (that is, for whom a speaker model cannot be constructed), either because the face or mouth shape of the speaker is not recognized in the video data (for example, when the speaker joins the meeting by telephone) or because no correlation between mouth shape and speech segments is available for constructing the model. For such a speaker, the processor 510 may construct a speaker model by clustering the embeddings (or feature values) corresponding to the speech segments whose correlation, as calculated in step 720, is less than a predetermined value. In this way, a speaker model can be constructed even for a speaker for whom no model was constructed through the correlation analysis using the video data.
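As a non-limiting illustration, the following sketch builds models for speakers who are not visible in the video by clustering the embeddings of low-correlation speech segments. The description does not name a clustering algorithm; the greedy cosine-similarity clustering, the two threshold values, and the use of a cluster mean as the speaker model are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_unseen_speakers(
    segments: List[Tuple[List[float], float]],
    corr_threshold: float = 0.3,  # assumed "predetermined value"
    sim_threshold: float = 0.8,   # assumed similarity needed to join a cluster
) -> List[List[float]]:
    """Return one model (cluster mean) per group of low-correlation segments.

    Each item of `segments` is (embedding, highest audio-visual correlation
    that the segment reached with any visible speaker).
    """
    clusters: List[List[List[float]]] = []
    for embedding, correlation in segments:
        if correlation >= corr_threshold:
            continue  # already explained by a visible speaker
        for members in clusters:
            if cosine(embedding, mean_vector(members)) >= sim_threshold:
                members.append(list(embedding))
                break
        else:
            clusters.append([list(embedding)])
    return [mean_vector(members) for members in clusters]


segments = [
    ([1.0, 0.0], 0.9),    # high correlation: a visible speaker, skipped
    ([0.0, 1.0], 0.1),    # low correlation: starts the first cluster
    ([0.1, 0.9], 0.05),   # similar to the first cluster
    ([-1.0, 0.1], 0.2),   # dissimilar: starts a second cluster
]
unseen_models = cluster_unseen_speakers(segments)
```

With this toy data the three low-correlation segments form two clusters, corresponding to two speakers who never appear in the video.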

The descriptions of the technical features given above with reference to FIGS. 1 to 6 apply equally to FIG. 7, and redundant description is therefore omitted.

FIGS. 8 and 9 are flowcharts showing methods of identifying the speaker who is speaking, according to an example.

The method of identifying the speaker who utters the speech in step 740 is described in detail below with reference to FIGS. 8 and 9.

In step 810, the processor 510 may detect the face of each speaker in the video data, track the detected faces, and calculate the correlation between the mouth shape of each speaker and the speech contained in the audio data. In step 820, the processor 510 may identify the speaker who uttered the speech in the relevant section by using the calculated correlation and the constructed speaker models. Using the correlation between mouth shape and speech calculated from the video data not only when constructing the speaker models but also when identifying the speaker can further improve the accuracy of speaker diarization.

Here, when the face or mouth of a specific speaker among the speakers is hidden (occluded) in the video data, so that the correlation between that speaker and the speech uttered in the audio data cannot be calculated, the correlation between that speaker and the uttered speech may be regarded as 0. Accordingly, when the mouth of a specific speaker is hidden, the processor 510 may, for that speaker, use only the constructed speaker model to identify the speaker of the uttered speech.
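As a non-limiting illustration of steps 810 and 820, the following sketch combines a speaker-model score with the audio-visual correlation and treats an occluded speaker's correlation as 0. The description does not specify how the two quantities are combined; the weighted sum, the `av_weight` parameter, and the score dictionaries are assumptions made for this example.

```python
from typing import Dict, Optional


def identify_speaker(
    model_scores: Dict[str, float],
    av_correlations: Dict[str, Optional[float]],
    av_weight: float = 1.0,  # assumed relative weight of the visual cue
) -> str:
    """Pick the speaker with the highest fused score for one speech section.

    `model_scores` holds the similarity of the speech to each speaker model.
    `av_correlations` holds the mouth-shape/speech correlation per speaker,
    or None when the face or mouth is occluded and no value can be computed.
    """
    fused = {}
    for speaker, model_score in model_scores.items():
        correlation = av_correlations.get(speaker)
        if correlation is None:
            correlation = 0.0  # occluded: the correlation is regarded as 0
        fused[speaker] = model_score + av_weight * correlation
    return max(fused, key=fused.get)


# Speaker A is occluded, so only the speaker model contributes for A.
occluded_case = identify_speaker(
    {"A": 0.8, "B": 0.3},
    {"A": None, "B": 0.2},
)

# Both mouths are visible; the visual cue favors B.
visible_case = identify_speaker(
    {"A": 0.5, "B": 0.45},
    {"A": 0.1, "B": 0.6},
)
```

In the first case the occluded speaker A is still selected on the strength of the speaker model alone, which is the behavior described for hidden mouths.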

In step 910, the processor 510 may determine information on the position of the speaker who utters the speech contained in the audio data. The information on the position of the speaker may include information on the direction of the uttered speech. The information on the direction may include azimuth information and/or elevation information associated with the uttered speech. Alternatively, the information on the position may include data indicating the positional relationship between the microphone used to record the audio data and the sound source corresponding to the speaker.

In step 920, the processor 510 may identify the speaker who uttered the speech by using the determined information on the position of the speaker together with the constructed speaker models. Using not only the constructed speaker models but also the information on the position of the speaker (that is, information on the direction of the speech signal) can further improve the accuracy of speaker diarization.
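As a non-limiting illustration of steps 910 and 920, the following sketch combines a speaker-model score with how closely the observed azimuth of the speech matches each speaker's azimuth. The description does not specify how position information is combined with the speaker model or how each speaker's azimuth is obtained; the per-speaker azimuth table, the linear closeness term, and `position_weight` are assumptions made for this example.

```python
from typing import Dict


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def identify_with_position(
    model_scores: Dict[str, float],
    speaker_azimuths: Dict[str, float],  # assumed known azimuth per speaker
    observed_azimuth: float,             # estimated direction of the speech
    position_weight: float = 1.0,        # assumed relative weight of position
) -> str:
    """Pick the speaker whose model score plus direction closeness is highest."""
    fused = {}
    for speaker, model_score in model_scores.items():
        distance = angular_distance(speaker_azimuths[speaker], observed_azimuth)
        closeness = 1.0 - distance / 180.0  # 1.0 same direction, 0.0 opposite
        fused[speaker] = model_score + position_weight * closeness
    return max(fused, key=fused.get)


# The speaker models alone slightly favor A, but the speech arrives from B's side.
chosen = identify_with_position(
    model_scores={"A": 0.55, "B": 0.5},
    speaker_azimuths={"A": 30.0, "B": 200.0},
    observed_azimuth=195.0,
)
```

Here the direction of arrival resolves a near tie between the two speaker models in favor of the speaker located where the speech came from.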

As shown in the test results described above, additionally using, alongside the speaker model, the correlation between the mouth shape of a speaker and the speech and/or the information on the position of the speaker can further improve the accuracy of speaker diarization.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware components and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, record, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, computer recording medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network and may be recorded or executed in a distributed manner. The software and data may be recorded on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may continuously store a computer-executable program, or may temporarily store it for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware, and is not limited to a medium directly connected to a particular computer system; it may also be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by an application store that distributes applications, or by a site or server that supplies or distributes various other software.

Although the embodiments have been described above with reference to a limited number of embodiments and drawings, those skilled in the art will be able to make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from that of the described method, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that of the described method, or are substituted or replaced by other components or equivalents.

Accordingly, other embodiments also fall within the scope of the appended claims, provided they are equivalent to the claims.

500: Computer system
510: Processor
520: Memory
530: Persistent recording device
540: Bus
550: Input/output interface
560: Network interface

Claims

A speaker diarization method performed by a computer system that separates speakers by using video data and audio data of a plurality of speakers, the method comprising:
a step of calculating a correlation between the mouth shape of each of the speakers included in the video data and each of the speech segments from the speakers included in the audio data;
a step of constructing a speaker model for each of the speakers based on the calculated correlation; and
a step of identifying, based on the constructed speaker models, the speaker among the speakers who utters the speech included in the audio data.

The speaker diarization method according to claim 1, wherein the step of constructing the speaker model uses, for each of the speakers, the top N speech segments having the highest correlation with the mouth shape of that speaker among the speech segments to construct the speaker model for that speaker, N being a natural number.

The speaker diarization method according to claim 2, wherein, in constructing the speaker model for each speaker, only those of the N speech segments whose correlation with the mouth shape of that speaker is equal to or greater than a predetermined threshold are used.

The speaker diarization method according to claim 1, wherein overlapped speech included in the audio data, in which the speech of two or more of the speakers overlaps, is separated into the speech of each speaker who uttered the overlapped speech by extracting the pronunciation of the speaker who uttered the overlapped speech based on the mouth shape of that speaker at the time the overlapped speech is uttered, generating a mask for the portion of the audio data corresponding to the extracted pronunciation, and filtering the overlapped speech.

The speaker diarization method according to claim 1, wherein the mouth shape of each speaker is identified by recognizing the face of each speaker in the video data and tracking the recognized face.

The speaker diarization method according to claim 1, wherein the step of identifying the speaker includes:
a step of detecting the face of each of the speakers in the video data, tracking the detected face, and calculating a correlation between the mouth shape of each speaker and the uttered speech included in the audio data; and
a step of identifying the speaker who uttered the speech by using the calculated correlation and the constructed speaker models.

The speaker diarization method according to claim 6, wherein, in the step of identifying the speaker who uttered the speech, when the correlation between a specific speaker among the speakers and the uttered speech of the audio data cannot be calculated because the face or mouth of the specific speaker is hidden in the video data, the correlation between the specific speaker and the uttered speech is regarded as 0.

The speaker diarization method according to claim 1, wherein the step of identifying the speaker includes:
a step of determining information on the position of the speaker who uttered the speech included in the audio data; and
a step of identifying the speaker who uttered the speech by using the determined information on the position of the speaker and the constructed speaker models.

The speaker diarization method according to claim 8, wherein the information on the position of the speaker who uttered the speech includes information on the direction of the uttered speech, and the information on the direction includes azimuth information associated with the uttered speech.

The speaker diarization method according to claim 1, wherein, among the plurality of speakers, a speaker whose face is recognized in the video data but who is recognized as not speaking at all is ignored.

The speaker diarization method according to claim 1, wherein the step of constructing the speaker model constructs, for a specific speaker among the plurality of speakers for whom no speaker model is constructed because the face or mouth shape is not recognized in the video data or because no correlation between mouth shape and speech segments is available for constructing the speaker model, a speaker model for the specific speaker by clustering the embeddings corresponding to the speech segments whose correlation, as calculated in the step of calculating the correlation, is less than a predetermined value.

The speaker diarization method according to claim 1, wherein the video data and the audio data are data included in a video of the speakers captured in real time.

The speaker diarization method according to claim 1, further comprising:
a step of extracting, from the audio data, the speech uttered by the identified speaker;
a step of converting the extracted speech into text; and
a step of recording the converted text in association with time information indicating when the extracted speech was uttered.

A computer program that causes a computer to execute the speaker diarization method according to any one of claims 1 to 13.

A computer-readable recording medium on which a program for executing the speaker diarization method according to any one of claims 1 to 13 is recorded.

A computer system that separates speakers by using video data and audio data of a plurality of speakers, the computer system comprising:
a memory; and
at least one processor coupled to the memory and configured to execute computer-readable instructions contained in the memory,
wherein the at least one processor calculates a correlation between the mouth shape of each of the speakers included in the video data and each of the speech segments from the speakers included in the audio data, constructs a speaker model for each of the speakers based on the calculated correlation, and identifies, based on the constructed speaker models, the speaker among the speakers who utters the speech included in the audio data.

The computer system according to claim 16, wherein the at least one processor detects the face of each of the speakers in the video data, tracks the detected face, calculates a correlation between the mouth shape of each speaker and the uttered speech included in the audio data, and identifies the speaker who uttered the speech by using the calculated correlation and the constructed speaker models.

The computer system according to claim 16, wherein the at least one processor determines information on the position of the speaker who uttered the speech included in the audio data, and identifies the speaker who uttered the speech by using the determined information on the position of the speaker and the constructed speaker models.

The computer system according to claim 16, wherein the at least one processor constructs, for a specific speaker among the plurality of speakers for whom no speaker model is constructed because the face or mouth shape is not recognized in the video data or because no correlation between mouth shape and speech segments is available for constructing the speaker model, a speaker model for the specific speaker by clustering the embeddings corresponding to the speech segments whose calculated correlation is less than a predetermined value.