JP7507528B1 - Speech information extraction device and program

Info

Publication number: JP7507528B1
Application number: JP2024008068A
Authority: JP
Inventors: 圭阿久澤; 陸荒川; 静真久保; 大騎村上; 健三田村
Original assignee: 株式会社Aces
Priority date: 2024-01-23
Filing date: 2024-01-23
Publication date: 2024-06-28
Anticipated expiration: 2044-01-23

Abstract

【課題】音声データから発話者の発話を抽出する際に、話者とマイクとの関係を明らかにして処理精度の向上を実現する発話情報抽出装置及びプログラムを提供する。【解決手段】発話情報抽出装置１は、音声データ２を入手する音声データ入手手段３と、音声データ２において使用されたマイクの数と発話者の数を入手する情報入手手段４と、音声データ２を発話毎に分割して分割発話とする発話分割手段５と、分割発話と発話者とを紐付ける紐付け手段６と、発話者と紐付けられた紐付け結果を表示させる表示手段７と、を備える。マイクの数と発話者の数が同数であった場合、音声データ２のうち、各マイクで録音された分割発話を、単一の発話者と紐付ける処理を行う。【選択図】図１[Problem] To provide an utterance information extraction device and program that improves processing accuracy by clarifying the relationship between the speaker and the microphone when extracting a speaker's utterance from audio data. [Solution] The utterance information extraction device 1 comprises audio data acquisition means 3 for acquiring audio data 2, information acquisition means 4 for acquiring the number of microphones and the number of speakers used in the audio data 2, utterance division means 5 for dividing the audio data 2 into divided utterances by utterance, linking means 6 for linking the divided utterances with the speakers, and display means 7 for displaying the linking results linked with the speakers. When the number of microphones and the number of speakers are the same, a process is performed to link the divided utterances recorded at each microphone in the audio data 2 with a single speaker. [Selected Figure] Figure 1

Description

本発明は、オンラインミーティング、或いは講演会等の音声データについて、発話者を認識して発話を抽出する装置及びそのプログラムに関する。 The present invention relates to a device and a program for recognizing speakers and extracting speech from audio data of online meetings, lectures, etc.

オンラインミーティング、或いは講演会等の音声データについて、いつ誰が話したかを推定して抽出する技術は、話者ダイアライゼーション技術と呼ばれている。例えば、オンラインミーティングの音声データについて、発話者を認識してその発話を抽出することにより、事後的にそのオンラインミーティングにおける発話の有効性の分析等を行うことができる。 The technology to estimate and extract who spoke and when from audio data of online meetings or lectures is called speaker diarization technology. For example, by recognizing the speaker from the audio data of an online meeting and extracting their speech, it is possible to analyze the effectiveness of the speech in that online meeting after the fact.

特許文献１には、音声ファイルに対して行う話者ダイアライゼーション方法であって、話者の基準音声を利用して音声ファイルから基準音声の発話者を識別し、識別されなかった残りの発話区間に対してクラスタリングを利用して話者の識別を行う話者ダイアライゼーション方法が開示されている。 Patent document 1 discloses a speaker diarization method for audio files, in which a reference voice of the speaker is used to identify the speaker of the reference voice from the audio file, and the speaker is identified for the remaining speech segments that have not been identified by using clustering.

特許文献１の方法では、音声ファイルを録音したマイク（マイクロフォン）についてどのように取り扱うかの記載がないため、単に音声ファイルを発話毎に切り分けて、切り分けられた発話がどの発話者の発話であるかを識別している。 The method in Patent Document 1 does not describe how to handle the microphone that recorded the audio file, so it simply divides the audio file into individual utterances and identifies which speaker each utterance belongs to.

特開２０２２－１０９８６７号公報JP 2022-109867 A

特許文献１の方法では、音声ファイルを録音したマイクが単数であるか複数であるかに関わらず同じ処理を行っているが、本願発明者等は、話者ダイアライゼーションを行う際に、話者とマイクとの関係が処理の精度に影響を与えていることを知見した。 In the method of Patent Document 1, the same processing is performed regardless of whether the audio file was recorded using a single microphone or multiple microphones. However, the inventors of the present application have discovered that when performing speaker diarization, the relationship between the speaker and the microphone affects the accuracy of the processing.

本発明は、音声データから発話者の発話を抽出する際に、話者とマイクとの関係を明らかにして処理精度の向上を実現する発話情報抽出装置及びプログラムを提供することを目的とする。 The present invention aims to provide a speech information extraction device and program that clarify the relationship between the speaker and the microphone when extracting the speaker's speech from audio data, thereby improving processing accuracy.

上記目的を達成するために、本発明の発話情報抽出装置は、音声データから発話者の発話を抽出する発話情報抽出装置であって、前記音声データを入手する音声データ入手手段と、前記音声データにおいて使用されたマイクの数と前記発話者の数を入手する情報入手手段と、前記音声データを発話毎に分割して分割発話とする発話分割手段と、前記分割発話と前記発話者とを紐付ける紐付け手段と、前記発話者と紐付けられた紐付け結果を表示させる表示手段とを備え、前記紐付け手段は、前記マイクの数が前記発話者の数と同数であるときは、前記マイク毎の前記分割発話を単一の前記発話者に紐付けし、前記マイクの数が前記発話者の数より少ないときは、前記分割発話毎に前記発話者との紐付けを行うことを特徴とする。 In order to achieve the above object, the speech information extraction device of the present invention is an utterance information extraction device that extracts speaker utterances from audio data, and includes an audio data acquisition means for acquiring the audio data, an information acquisition means for acquiring the number of microphones and the number of speakers used in the audio data, an utterance division means for dividing the audio data into divided utterances for each utterance, a linking means for linking the divided utterances to the speakers, and a display means for displaying the linking results linked to the speakers, and is characterized in that when the number of microphones is the same as the number of speakers, the linking means links the divided utterances for each microphone to a single speaker, and when the number of microphones is less than the number of speakers, it links each divided utterance to the speaker.

本発明の発話情報抽出装置は、情報入手手段によってマイクの数と発話者の数を入手し、マイクの数が発話者の数より少ないときは、各マイクによって録音がなされた音声データには複数の発話者の発話が含まれることになるので、分割発話毎に発話者との紐付けを行う。一方で、マイクの数が発話者の数と同数であるときは、各マイクによって録音がなされた音声データは、単独の発話者による発話のみとなるため、各マイクの音声データについて、一度どの発話者の発話かを判定できれば、その後は他の発話者か否かの判断をする必要がない。よって、マイクの数が発話者の数と同数であるときは不要な処理を行わないため誤った処理がなされることがなく、マイクの数が発話者の数より少ないときのみ紐付け処理を行えばよく、処理効率が向上すると共に処理精度が向上する。 The speech information extraction device of the present invention obtains the number of microphones and the number of speakers through the information acquisition means. When the number of microphones is less than the number of speakers, the audio data recorded by each microphone contains the speech of multiple speakers, so each divided utterance is linked to a speaker individually. On the other hand, when the number of microphones is the same as the number of speakers, the audio data recorded by each microphone contains only the speech of a single speaker, so once the speaker for a microphone's audio data has been determined, there is no need to judge afterwards whether an utterance belongs to another speaker. Therefore, when the number of microphones is the same as the number of speakers, no unnecessary processing is performed and so no erroneous linking arises from it, and per-utterance linking needs to be performed only when the number of microphones is less than the number of speakers, which improves both processing efficiency and processing accuracy.
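The branching described above can be sketched in Python as follows. This is a minimal illustration, not the implementation in the specification: the function name `choose_linking_strategy` and the returned labels are hypothetical.

```python
def choose_linking_strategy(num_mics: int, num_speakers: int) -> str:
    """Pick a linking strategy from the microphone and speaker counts.

    The returned labels are hypothetical names used only in this sketch.
    """
    if num_mics <= 0 or num_speakers <= 0:
        raise ValueError("counts must be positive")
    if num_mics == num_speakers:
        # One speaker per microphone: link each microphone's utterances
        # to a single speaker, with no per-utterance judgment.
        return "per_microphone"
    if num_mics < num_speakers:
        # At least one microphone is shared: judge every divided utterance.
        return "per_utterance"
    # More microphones than speakers is treated separately in the description.
    return "more_microphones_than_speakers"
```

For example, three microphones and three speakers select `"per_microphone"`, while two microphones and four speakers select `"per_utterance"`.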

本発明の発話情報抽出装置において、前記表示手段は、前記マイクの数が前記発話者の数と同数であるときは、前記マイク毎に紐付けられた発話者を表示し、前記マイクに紐付けられた前記発話者が複数あるときは、主たる前記発話者を表示すると共に前記発話者が複数であることを示す複数表示を行い、前記複数表示の開示指示があったときは、前記マイクに紐付けられた複数の前記発話者のすべてを表示させるようにしてもよい。 In the speech information extraction device of the present invention, the display means may display the speaker linked to each microphone when the number of microphones is the same as the number of speakers; when multiple speakers are linked to a microphone, it may display the main speaker together with a multiple display indicating that there are multiple speakers; and when a disclosure instruction is given for the multiple display, it may display all of the multiple speakers linked to that microphone.

当該構成によれば、あるマイクによって録音された音声データが、複数の発話者の発話を含む場合に、ユーザは主たる発話者と、発話者が複数であることを一目で確認することができる。また、ユーザがそのマイクを使用した他の発話者が誰なのかを知りたい場合は、開示指示を行うことによりその発話者を知ることができる。一方で、他の発話者の情報を知る必要がない場合、或いは、発話者が誤って分割された場合は、開示指示を解除することができる。ここで、主たる発話者とは、そのマイクで録音された音声データの発話時間が最も長い発話者としてもよく、マイクの所有者を主たる発話者としてもよい。 According to this configuration, when audio data recorded by a microphone includes speech from multiple speakers, the user can immediately identify the main speaker and that there are multiple speakers. Furthermore, if the user wants to know who the other speakers who used the microphone are, the user can find out who those speakers are by issuing a disclosure instruction. On the other hand, if there is no need to know information about the other speakers, or if the speakers have been mistakenly divided, the disclosure instruction can be cancelled. Here, the main speaker may be the speaker who has the longest speaking time in the audio data recorded by that microphone, or the owner of the microphone may be the main speaker.
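The two ways of choosing the main speaker mentioned above (longest speaking time, or the microphone's owner) can be sketched as follows. The function name `main_speaker` and the `(speaker, start_sec, end_sec)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def main_speaker(utterances, owner=None):
    """Return the main speaker for one microphone.

    utterances: list of (speaker, start_sec, end_sec) tuples (hypothetical layout).
    owner: if given, the microphone owner is used as the main speaker.
    """
    if owner is not None:
        return owner
    totals = defaultdict(float)
    for speaker, start, end in utterances:
        totals[speaker] += end - start
    if not totals:
        raise ValueError("no utterances for this microphone")
    # On a tie, the speaker encountered first is returned.
    return max(totals, key=totals.get)
```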

本発明の発話情報抽出装置において、前記紐付け手段は、複数の発話者の基準音声によって学習された発話者ＤＢ（ＤＢ＝データベース。以下同じ。）を参照し、前記発話者ＤＢの学習を行う発話者ＤＢ学習手段をさらに備え、前記発話者ＤＢ学習手段は前記発話者に紐付く前記基準音声として、いずれの音声データを前記発話者ＤＢの学習に用いるかを示す情報を、パラメータとして管理画面によって編集可能としてもよい。 In the speech information extraction device of the present invention, the linking means may refer to a speaker DB (DB = database; the same applies below) trained on reference voices of a plurality of speakers, the device may further include a speaker DB learning means for training the speaker DB, and the speaker DB learning means may allow information indicating which voice data is used to train the speaker DB as the reference voice linked to a speaker to be edited as a parameter via a management screen.

当該構成によれば、例えば、発話者ＤＢに記憶されているある発話者のデータについて、パラメータとして記録された会議の記録を見て、特定日の会議における発話が、別の発話者が出席していた会議である等、発話者の特徴量として不適切であると判断した場合等に、その特定日の会議のデータを削除することができる。当該編集により、発話者ＤＢにおけるデータが、本来の発話者におけるデータとなるため、発話者ＤＢのデータの信頼性を向上させることができる。 According to this configuration, for example, when a user reviews the meeting records stored as parameters for a certain speaker in the speaker DB and judges that the speech from a meeting on a specific day is inappropriate as a source of that speaker's features, for example because a different speaker actually attended that meeting, the data of that meeting can be deleted. Through this editing, the data in the speaker DB comes to reflect only the actual speaker, which improves the reliability of the data in the speaker DB.

当該構成において、前記パラメータは、前記基準音声が録音されたイベントであってもよい。ここで、イベントとは、オンラインミーティングや講演会、パネルディスカッション等が含まれる。このように、パラメータをイベントとすると、発話者ＤＢにおいて基準音声とすべきでないイベントがあった場合に、そのイベント毎に削除等の編集を行うことができる。当該編集は、ユーザが行ってもよく、所定のルールやアルゴリズム等を用いて自動で行ってもよい。 In this configuration, the parameter may be the event at which the reference voice was recorded. Here, events include online meetings, lectures, panel discussions, and the like. In this way, when the parameter is an event, if there is an event in the speaker DB that should not be used as the reference voice, editing such as deletion can be performed for each event. The editing may be performed by a user, or may be performed automatically using predetermined rules, algorithms, etc.
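Event-level editing of a speaker's reference voices could look like the sketch below. The data layout (`ReferenceVoice`, `SpeakerEntry`, `event_id`, `audio_path`) is hypothetical; the description does not define a schema for the speaker DB.

```python
from dataclasses import dataclass, field

@dataclass
class ReferenceVoice:
    event_id: str    # the event (meeting, lecture, ...) where it was recorded
    audio_path: str  # hypothetical location of the recorded audio

@dataclass
class SpeakerEntry:
    name: str
    references: list = field(default_factory=list)

    def remove_event(self, event_id: str) -> int:
        """Drop every reference voice recorded at the given event.

        Returns the number of reference voices removed.
        """
        before = len(self.references)
        self.references = [r for r in self.references if r.event_id != event_id]
        return before - len(self.references)
```

After such an edit, the speaker DB would be retrained from the remaining reference voices; how that retraining is done is outside this sketch.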

また、当該構成において、前記発話者ＤＢは、前記発話者の基準音声を解析した結果の発話の特徴量を含んでいてもよい。当該構成によれば、発話者ＤＢのデータとして発話者の特徴量を含むため、音声データから正確に発話者の発話を抽出することができる。ここで、基準音声とは、発話者の特徴を抽出する際に基準とすべき音声であり、平常の会話の他、種々の感情を伴った会話等を含む。 In addition, in this configuration, the speaker DB may contain speech features resulting from an analysis of the reference voice of the speaker. According to this configuration, since the speaker DB data contains the features of the speaker, the speaker's speech can be accurately extracted from the voice data. Here, the reference voice is a voice that should be used as a reference when extracting the speaker's features, and includes not only normal conversation but also conversations with various emotions.

本発明の発話情報抽出装置において、前記紐付け手段の紐付け結果に対してユーザが修正可能な修正手段を備え、前記発話者ＤＢ学習手段が、ユーザによって修正された修正結果を前記発話者ＤＢが学習し、前記紐付け手段が、学習済の前記発話者ＤＢを参照して紐付けを行うようにしてもよい。 The speech information extraction device of the present invention may include a correction means that allows a user to correct the linking result of the linking means, and the speaker DB learning means may have the speaker DB learn the correction result corrected by the user, and the linking means may perform linking by referring to the learned speaker DB.

当該構成によれば、紐付け手段による処理に誤りがあった場合であっても、ユーザによりその誤りを修正し、発話者ＤＢに反映することができる。従って、例えば、ユーザが１箇所の紐付けの誤りを修正した場合であっても、発話者ＤＢが更新されるため、他の同様の誤りも自動的に修正されるものとなる。 According to this configuration, even if the linking means makes an error, the user can correct it and have the correction reflected in the speaker DB. Therefore, for example, even if the user corrects only one linking error, the speaker DB is updated, so that other similar errors are also corrected automatically.
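One way to picture how a single correction can fix other similar errors is the nearest-centroid sketch below. The description does not say how the speaker DB learns from corrections, so the centroid model, the function names, and the two-dimensional feature vectors are all assumptions for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_speaker(db, feature):
    """Return the speaker whose centroid is closest (squared Euclidean distance)."""
    def dist(name):
        c = centroid(db[name])
        return sum((a - b) ** 2 for a, b in zip(c, feature))
    return min(db, key=dist)

def apply_correction(db, speaker, feature):
    """Record a user correction by adding the feature to the corrected speaker."""
    db.setdefault(speaker, []).append(list(feature))

# Toy speaker DB: speaker name -> list of feature vectors.
db = {"Yamada": [[1.0, 0.0]], "Suzuki": [[0.0, 1.0]]}
first, similar = [0.45, 0.55], [0.5, 0.6]
before = nearest_speaker(db, similar)   # linked to Suzuki at this point
apply_correction(db, "Yamada", first)   # user corrects one utterance
after = nearest_speaker(db, similar)    # now linked to Yamada
```

In this toy setup the second utterance changes its link without being corrected by hand, because the corrected sample moves the stored centroid for that speaker.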

本発明によれば、音声データから発話者の発話を抽出する際に、話者とマイクとの関係を明らかにして処理精度の向上を実現する発話情報抽出装置を提供することができる。 The present invention provides a speech information extraction device that clarifies the relationship between the speaker and the microphone when extracting a speaker's speech from audio data, thereby improving processing accuracy.

本発明の実施形態の一例である発話情報抽出装置の機能的構成を示す説明図。 FIG. 1 is an explanatory diagram showing the functional configuration of a speech information extraction device according to an embodiment of the present invention.

（Ａ）及び（Ｂ）は、オンラインミーティングのマイクと発話者との関係を示す説明図。 FIGS. 2(A) and 2(B) are explanatory diagrams showing the relationship between microphones and speakers in an online meeting.

（Ａ）～（Ｄ）は、本実施形態の発話情報抽出装置の結果表示画面を示す説明図。 FIGS. 3(A) to 3(D) are explanatory diagrams showing result display screens of the speech information extraction device of the present embodiment.

本実施形態の発話情報抽出装置の作動を示すフローチャート。 FIG. 4 is a flowchart showing the operation of the speech information extraction device of the present embodiment.

本実施形態の発話情報抽出装置の管理画面を示す説明図。 FIG. 5 is an explanatory diagram showing the management screen of the speech information extraction device of the present embodiment.

本実施形態の発話情報抽出装置の結果表示画面での修正の状態を示す説明図。 FIG. 6 is an explanatory diagram showing a state of correction on the result display screen of the speech information extraction device of the present embodiment.

次に、図１～図６を参照して、本発明の実施形態である発話情報抽出装置、及び発話情報抽出プログラムについて説明する。本実施形態の発話情報抽出装置１は、例えば、オンラインミーティング等で録画されたミーティングの音声である音声データ２を、発話者Ｓ毎に抽出して表示させる装置である。 Next, an utterance information extraction device and an utterance information extraction program according to an embodiment of the present invention will be described with reference to Figs. 1 to 6. The utterance information extraction device 1 of this embodiment is a device that extracts and displays audio data 2, which is the audio of a meeting recorded during an online meeting or the like, for each speaker S.

本実施形態の発話情報抽出装置１は、機能的構成として、図１に示すように、音声データ２を入手する音声データ入手手段３と、音声データ２において使用されたマイクＭの数と発話者Ｓの数を入手する情報入手手段４と、音声データ２を発話毎に分割して分割発話１０とする発話分割手段５と、分割発話１０と発話者Ｓとを紐付ける紐付け手段６と、発話者Ｓと紐付けられた紐付け結果１１を表示させる表示手段７とを備えている。 As shown in FIG. 1, the functional configuration of the speech information extraction device 1 of this embodiment includes a voice data acquisition means 3 for acquiring voice data 2, an information acquisition means 4 for acquiring the number of microphones M and the number of speakers S used in the voice data 2, an utterance division means 5 for dividing the voice data 2 into divided utterances 10 by utterance, a linking means 6 for linking the divided utterances 10 with the speakers S, and a display means 7 for displaying the linking results 11 linked with the speakers S.

また、本実施形態の発話情報抽出装置１は、紐付け手段６が発話者ＤＢ８と接続されている。発話者ＤＢ８には、発話者Ｓの基準音声の学習を行う発話者ＤＢ学習手段９が設けられている。発話者ＤＢ８は、ネットワーク等を介して紐付け手段６と接続されていてもよく、発話情報抽出装置１の内部の記憶手段に記憶されていてもよい。 In addition, in the speech information extraction device 1 of this embodiment, the linking means 6 is connected to the speaker DB 8. The speaker DB 8 is provided with a speaker DB learning means 9 that learns the reference voice of the speaker S. The speaker DB 8 may be connected to the linking means 6 via a network or the like, or may be stored in a storage means inside the speech information extraction device 1.

音声データ２は、例えば、オンラインミーティング等で録画されたミーティングの音声である。その他、複数のパネリスト（発話者）が参加するパネルディスカッション等において録音されたデータ等も含まれる。 The audio data 2 is, for example, the audio of a meeting recorded during an online meeting or the like. It also includes data recorded during a panel discussion or the like in which multiple panelists (speakers) participate.

音声データ入手手段３は、発話情報抽出装置１に音声データ２を入手する機能部である。音声データ２のデータ形式は、音声のみのデータであってもよく、動画データの音声トラックであってもよい。音声データ入手手段３は、例えば、発話情報抽出装置１となっているコンピュータに、音声データ２をアップロードすることにより入手することができる。音声データ２のアップロードは、ネットワーク等を介して行ってもよく、ＳＤカード等の記録媒体を介して行ってもよい。 The audio data acquisition means 3 is a functional unit that acquires the audio data 2 to the speech information extraction device 1. The data format of the audio data 2 may be audio-only data or an audio track of video data. The audio data acquisition means 3 can acquire the audio data 2, for example, by uploading the audio data 2 to the computer that serves as the speech information extraction device 1. The audio data 2 may be uploaded via a network or the like, or via a recording medium such as an SD card.

情報入手手段４は、音声データ２において使用されたマイクＭの数と前記発話者の数を入手する手段である。例えば、図２（Ｂ）のようなネットワーク１４を介したオンラインミーティングがあった場合、拠点（ｄ）においては、ミーティング装置１２ｄの前で発話者Ｓｄがミーティングを行うものである。発話者Ｓｄはヘッドセット１３ｄを装着しており、マイクＭｄは発話者Ｓｄと１対１の関係となる。一方で、拠点（ｅ）では、ミーティング装置１２ｅの前に１個のマイクＭｅが設置されており、１個のマイクＭｅを用いて複数の発話者Ｓｅ１～Ｓｅ３が発話を行う。 The information acquisition means 4 is a means for acquiring the number of microphones M used in the voice data 2 and the number of speakers. For example, in the case of an online meeting via a network 14 as shown in FIG. 2(B), at site (d), a speaker Sd holds the meeting in front of a meeting device 12d. The speaker Sd wears a headset 13d, and the microphone Md has a one-to-one relationship with the speaker Sd. Meanwhile, at site (e), one microphone Me is installed in front of a meeting device 12e, and multiple speakers Se1 to Se3 speak using the single microphone Me.

このようなミーティングの音声データ２について発話者情報の抽出を行う際には、情報入手手段４は、音声データ２に付随するデータ、例えばマイクＩＤや発話者ＩＤ等のデータを基礎として、マイクＭの数と発話者Ｓの数の情報を入手する。具体的には、オンライン会議ツールで会議に参加したユーザのＩＤを入手する手法や、出願人が提供するＡＩによる議事録自動化システムであるＡＣＥＳＭｅｅｔを用いてユーザ自身に情報を入力させる手法を用いることができる。 When extracting speaker information from audio data 2 of such a meeting, the information acquisition means 4 acquires information on the number of microphones M and the number of speakers S based on data accompanying the audio data 2, such as microphone IDs and speaker IDs. Specifically, a method of acquiring the IDs of users who participated in the meeting using an online conference tool, or a method of having users input information themselves using ACESMeet, an AI-based minutes automation system provided by the applicant, can be used.
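Deriving the two counts from metadata that accompanies the audio can be sketched as below. The record layout with `mic_id` and `speaker_id` keys is a hypothetical stand-in for whatever the conference tool or user input actually provides.

```python
def count_mics_and_speakers(records):
    """Count distinct microphones and speakers in participant metadata.

    records: iterable of dicts with "mic_id" and "speaker_id" keys (assumed layout).
    Returns (number_of_microphones, number_of_speakers).
    """
    mics = {r["mic_id"] for r in records}
    speakers = {r["speaker_id"] for r in records}
    return len(mics), len(speakers)

# The meeting of FIG. 2(B): Sd on headset microphone Md, Se1 to Se3 sharing Me.
fig_2b = [
    {"mic_id": "Md", "speaker_id": "Sd"},
    {"mic_id": "Me", "speaker_id": "Se1"},
    {"mic_id": "Me", "speaker_id": "Se2"},
    {"mic_id": "Me", "speaker_id": "Se3"},
]
counts = count_mics_and_speakers(fig_2b)  # (2, 4)
```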

発話分割手段５は、音声データ２を発話毎に分割して分割発話１０とする機能部である。分割の手法としては、例えば、音声データ２のパワーの閾値に基づく手法を用いることができる。この手法では、一定の時間（例えば０．２５ミリ秒）毎に音声を分割し、その分割されたフレーム内のパワーの総和が閾値以上の場合はそのフレームは音声フレームとし、閾値未満の場合は非音声フレームとする。そして、音声フレームが連続している区間をまとめて分割発話１０とする。なお、音声データ２の分割の手法は、当該手法には限られず、公知の他の手法を用いてもよい。 The speech division means 5 is a functional unit that divides the voice data 2 into individual utterances to generate divided utterances 10. As a division method, for example, a method based on a power threshold of the voice data 2 can be used. In this method, the voice is divided at regular intervals (for example, 0.25 milliseconds), and if the sum of the power in the divided frames is equal to or greater than the threshold, the frame is regarded as a voice frame, and if it is less than the threshold, it is regarded as a non-voice frame. Then, a section in which voice frames are consecutive is collected as divided utterances 10. Note that the method for dividing the voice data 2 is not limited to this method, and other known methods may be used.
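The power-threshold division described above can be sketched in plain Python. The function name and parameters are illustrative; the frame length is left as a parameter, and the threshold value would have to be tuned for real recordings.

```python
def split_utterances(samples, sample_rate, frame_sec, threshold):
    """Split audio into utterances using a per-frame power threshold.

    samples: sequence of amplitude values for one channel.
    Returns a list of (start_sec, end_sec) pairs, one per divided utterance.
    """
    frame_len = max(1, int(round(sample_rate * frame_sec)))
    segments = []
    start = None  # sample index where the current run of voice frames began
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        power = sum(x * x for x in frame)  # total power within the frame
        if power >= threshold:
            if start is None:
                start = i  # a voice frame opens a new utterance
        elif start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # a non-voice frame closes the utterance
    if start is not None:
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments
```

Consecutive voice frames are merged into one divided utterance, and any non-voice frame ends the current one.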

紐付け手段６は、発話分割手段５によって分割された分割発話１０と、発話者Ｓとを紐付ける機能部である。例えば、図２（Ａ）のようなオンラインミーティングがあった場合、拠点（ａ）～（ｃ）においては、各発話者Ｓａ～Ｓｃにそれぞれ１台のマイクＭａ～Ｍｃという構成となっている。このような場合は、各マイクＭａ～Ｍｃの音声データ２についての分割発話１０は、そのマイクＭａ～Ｍｃのユーザである発話者Ｓａ～Ｓｃの音声のみとなっているので、そのマイクＭａ～Ｍｃの分割発話１０について、発話者Ｓａ～Ｓｃ以外の発話者を紐付けることはしない。 The linking means 6 is a functional unit that links the divided utterances 10 divided by the utterance dividing means 5 with the speaker S. For example, in the case of an online meeting as shown in FIG. 2(A), in locations (a) to (c), each speaker Sa to Sc is provided with one microphone Ma to Mc. In such a case, the divided utterances 10 for the voice data 2 of each microphone Ma to Mc contain only the voices of the speakers Sa to Sc who are the users of the microphones Ma to Mc, so the divided utterances 10 of the microphones Ma to Mc are not linked to speakers other than the speakers Sa to Sc.

一方で、図２（Ｂ）のようなオンラインミーティングがあった場合は、拠点（ｅ）において、マイクＭｅを複数の発話者Ｓｅ１～Ｓｅ３が使用しているため、マイクＭｅの分割発話１０について、どの発話者の発話であるかの紐付けを行う。 On the other hand, in an online meeting like that shown in FIG. 2(B), multiple speakers Se1 to Se3 use the microphone Me at site (e), so each divided utterance 10 from the microphone Me is linked to the speaker who made it.

表示手段７は、発話者Ｓと紐付けられた分割発話１０という紐付け結果１１を表示させる機能部である。紐付け結果１１は、例えば図３（Ａ）に示すように、結果表示画面２０において、マイクＭ毎に、そのマイクＭで発話を行っている発話者Ｓの氏名２１、及び各マイクにおいて発話者が行った発話を示す分割発話１０が表示される。図３（Ａ）は、図２（Ａ）のオンラインミーティングの際の紐付け結果１１を示している。 The display means 7 is a functional unit that displays the linking result 11, which is the divided utterances 10 linked to the speaker S. As shown in FIG. 3(A), for example, the linking result 11 is displayed on a result display screen 20 for each microphone M, showing the name 21 of the speaker S who is speaking at that microphone M, and the divided utterances 10 showing the utterances made by the speaker at each microphone. FIG. 3(A) shows the linking result 11 for the online meeting of FIG. 2(A).

図２（Ｂ）のオンラインミーティングの際の紐付け結果１１は、図３（Ｂ）のとおりとなる。図３（Ｂ）においては、マイクの数分だけ、そのマイクの主たる発話者の氏名２１及び分割発話１０が表示されている。また、図３（Ｂ）においては、上段のマイクの氏名２１の表示の左側に、そのマイクを使用している発話者が複数であることを示す複数表示２２が表示される。 The linking result 11 for the online meeting in FIG. 2(B) is as shown in FIG. 3(B). In FIG. 3(B), one row is displayed per microphone, showing the name 21 of that microphone's main speaker and the divided utterances 10. In addition, to the left of the name 21 in the upper microphone row, a multiple display 22 is shown to indicate that multiple speakers are using that microphone.

このように、本実施形態の発話情報抽出装置１では、表示手段７は、当該複数表示２２を行って、ユーザにマイクに紐付けられた発話者が複数であることを報知する。また、この複数表示２２は、ユーザがタッチ操作等を行って開示指示を行うことにより、そのマイクに紐付けられた複数の発話者のすべてを表示させることができる。 In this way, in the speech information extraction device 1 of this embodiment, the display means 7 displays the multiple display 22 to inform the user that there are multiple speakers linked to the microphone. In addition, the multiple display 22 can display all of the multiple speakers linked to the microphone when the user issues a disclosure instruction by performing a touch operation or the like.

発話者ＤＢ８は、複数の発話者の基準音声によって学習された結果が保存されているデータベースである。発話者ＤＢ８に記憶されているデータとしては、発話者の氏名、発話者の複数の基準音声、及び基準音声を解析した結果の発話の特徴量（ベクトル）等である。 The speaker DB 8 is a database that stores the results of learning from the reference voices of multiple speakers. The data stored in the speaker DB 8 includes the speaker's name, multiple reference voices of the speaker, and speech features (vectors) obtained by analyzing the reference voices.

発話者ＤＢ８としては、例えば、ディープニューラルネットワークを使ったテキスト依存の話者認証として知られている「d-vector」や「i-vector」等を用いることができる。また、発話者ＤＢ８としては、その他に、Pythonによるオープンソースフレームワークである「Pyannote.audio」等、他のデータベースを用いてもよい。 For example, "d-vector" or "i-vector", which are known as text-dependent speaker authentication using a deep neural network, can be used as the speaker DB 8. In addition, other databases such as "Pyannote.audio", an open-source Python framework, may also be used as the speaker DB 8.

発話者ＤＢ学習手段９は、図５に示す管理画面２３によってパラメータを編集可能となっており、パラメータを変更することにより発話者ＤＢ８の学習を行う機能部である。管理画面２３は、図５に示すように、パラメータとして、発話者「○○さんのデータベースに登録されている会議一覧」の項目において、現在登録されている会議（イベント）が表示される。この場合、パラメータとなっているイベントは会議となる。 The speaker DB learning means 9 is a functional unit whose parameters can be edited using the management screen 23 shown in FIG. 5, and which learns the speaker DB 8 by changing the parameters. As shown in FIG. 5, the management screen 23 displays the currently registered conference (event) as a parameter in the item "List of conferences registered in the database for the speaker XX". In this case, the event that is the parameter is a conference.

その会議の横には、削除ボタン２４が設けられており、その会議を削除したい場合に利用される。また、会議が並んでいる最終行には、追加ボタン２５が設けられている。また、管理画面２３の下方には、編集内容を登録する登録ボタン２６及びキャンセルを行うキャンセルボタン２７が設けられている。本実施形態では、管理画面２３において、発話者ＤＢ８の学習のためのパラメータとして会議のデータを表示しているが、これに限らず、会議中の特定の発話、例えば１時間の音声データのうち特定の区間のみを指定することもでき、ユーザが自身でアップロードした音声ファイル等をパラメータとすることができる。 Next to the conference, a delete button 24 is provided, which is used when deleting the conference. An add button 25 is provided at the bottom of the conference list. A register button 26 for registering edited content and a cancel button 27 for canceling are provided below the management screen 23. In this embodiment, the management screen 23 displays conference data as parameters for learning the speaker DB 8, but this is not limiting, and it is also possible to specify a specific utterance during the conference, for example, only a specific section of one hour of audio data, and an audio file uploaded by the user himself can be used as a parameter.

紐付け手段６は、この発話者ＤＢ８に記憶されているデータに基づいて、ある分割発話１０がどの発話者の発話であるかを判定する。判定の方法としては、紐付け手段６においてＡＩ（Artificial Intelligence）を利用して判定を行ってもよい。ＡＩによる判定を行う場合は、ＡＩに対する入力が分割発話１０であり、出力が分割発話１０の発話を行ったと推定される発話者Ｓである。 Based on the data stored in the speaker DB 8, the linking means 6 determines which speaker produced a given divided utterance 10. The linking means 6 may use AI (Artificial Intelligence) for this determination. In that case, the input to the AI is the divided utterance 10, and the output is the speaker S estimated to have spoken it.

その他の判定方法としては、分割発話１０の音声を解析して発話の特徴量を算出し、発話者ＤＢ８に記憶されている特徴量に関するデータとの対比を行って、特徴量が近い発話を行っている発話者を判定結果とすることもできる。また、判定方法として、「話者分類（speaker classification）」や「話者認証（speaker verification）」の研究領域における公知の方法を採用してもよい。 As another method of determination, the voice of the divided utterance 10 is analyzed to calculate the features of the utterance, and the features are compared with data on the features stored in the speaker DB 8 to determine the speaker who has made an utterance with similar features. Also, as a method of determination, a publicly known method in the research field of "speaker classification" or "speaker verification" may be adopted.
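The feature-comparison approach can be illustrated with cosine similarity. The description only says that speakers with "close" features are selected, so the choice of cosine similarity, the function names, and the `min_similarity` cutoff are assumptions for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify_speaker(feature, speaker_db, min_similarity=0.0):
    """Return the speaker whose stored feature vector is most similar.

    speaker_db: dict mapping speaker name -> reference feature vector.
    Returns None when no speaker scores above min_similarity.
    """
    best_name, best_score = None, min_similarity
    for name, reference in speaker_db.items():
        score = cosine_similarity(feature, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` below the cutoff leaves room for an "unlinked" result instead of forcing every utterance onto a registered speaker.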

本実施形態の発話情報抽出装置１は、ユーザが使用するコンピュータ（ユーザ端末１Ｕ）等を用いて、発話情報抽出プログラム１Ｐを作動させることにより実現される。この発話情報抽出プログラム１Ｐは、パーソナルコンピュータにインストールされているものでもよく、サーバにインストールされてクライアント端末で使用できるものであってもよい。また、ＣＤロムやＤＶＤロム等に記憶された状態であってもよく、サーバ上にアップされてネットワークを通じてダウンロード可能となっていてもよい。 The speech information extraction device 1 of this embodiment is realized by running the speech information extraction program 1P using a computer (user terminal 1U) used by a user. This speech information extraction program 1P may be installed on a personal computer, or may be installed on a server and used on a client terminal. It may also be stored on a CD-ROM, DVD-ROM, etc., or uploaded to a server and available for download over a network.

発話情報抽出プログラム１Ｐが実行されるコンピュータは、ＣＰＵ（中央演算処理装置）、ＧＰＵ（画像処理装置）等のプロセッサ、ハードディスク、メモリ等の記憶手段、及び各種ネットワークとの接続手段、キーボード、マウス、及びディスプレイ等を備えている（図示省略）。 The computer on which the speech information extraction program 1P is executed is equipped with a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a hard disk, storage means such as memory, and means for connecting to various networks, a keyboard, a mouse, a display, etc. (not shown).

次に、本実施形態の発話情報抽出装置１の作動について、図１～図６を参照して説明する。ユーザがユーザ端末１Ｕを起動させ、発話情報抽出プログラム１Ｐを起動させると、ユーザ端末１Ｕの画面に、ユーザを認識するための初期画面が表示される（図示省略）。ユーザが初期画面に対して必要な事項を入力すると、登録済のオンライン会議ツールがある場合は、音声データ入手手段３によって、そのオンライン会議ツールの動画データから音声データ２が入力される。 Next, the operation of the speech information extraction device 1 of this embodiment will be described with reference to Figs. 1 to 6. When a user starts the user terminal 1U and launches the speech information extraction program 1P, an initial screen for identifying the user is displayed on the screen of the user terminal 1U (not shown). When the user enters the required information on the initial screen, if there is a registered online conference tool, the audio data acquisition means 3 inputs audio data 2 from the video data of that online conference tool.

次に、ユーザが、参照すべき発話者ＤＢ８を指定し、実行指示を行うと、まず、発話分割手段５が音声データ２を分割して分割発話１０とする分割処理が行われる（ＳＴＥＰ１）。分割の手法としては、上述の音声データ２のパワーの閾値に基づく手法を用いる。 Next, when the user designates the speaker DB 8 to be referenced and issues an execution instruction, the speech division means 5 first performs a division process to divide the voice data 2 into divided utterances 10 (STEP 1). The division method used is a method based on the power threshold of the voice data 2 described above.

次に、紐付け手段６によって、分割発話１０について、発話者Ｓとの紐付けを行う紐付け処理が行われる。この紐付け処理では、まず、紐付け手段６が、音声データ２に付随するデータから、マイクＭの数と発話者Ｓの数、及びマイクＭの所有者等の情報を入手する（ＳＴＥＰ２）。 Next, the linking means 6 performs a linking process to link the divided utterances 10 with the speakers S. In this linking process, the linking means 6 first obtains information such as the number of microphones M and the number of speakers S, and the owners of the microphones M, from the data accompanying the voice data 2 (STEP 2).

ここで、マイクＭの数と発話者Ｓの数が同数であるか否かの判定を行う（ＳＴＥＰ３）。マイクＭの数と発話者Ｓの数が同数であった場合（ＳＴＥＰ３でＹ）、音声データ２のうち、各マイクＭで録音された分割発話１０を、単一の発話者と紐付ける処理を行う（ＳＴＥＰ４）。 Here, it is determined whether the number of microphones M is the same as the number of speakers S (STEP 3). If the number of microphones M and the number of speakers S are the same (Y in STEP 3), a process is performed in which the divided utterances 10 recorded by each microphone M in the voice data 2 are linked to a single speaker (STEP 4).

具体的には、各マイクＭで録音された分割発話１０について、ＳＴＥＰ２で入手したマイクＭの所有者の情報から、各マイクＭと発話者Ｓの紐付けを行う。表示手段７は、この紐付けの結果を結果表示画面２０に表示させる（ＳＴＥＰ５）。 Specifically, for the divided speech 10 recorded by each microphone M, each microphone M is linked to the speaker S based on the information on the owner of the microphone M obtained in STEP 2. The display means 7 displays the results of this linking on the result display screen 20 (STEP 5).

一方で、マイクＭの数が発話者Ｓの数よりも少ないときは（ＳＴＥＰ３でＮ）、各分割発話１０毎に発話者Ｓの紐付けを行う（ＳＴＥＰ６）。紐付け処理は、音声データ２から生成された分割発話１０を判定ＡＩの入力とし、出力として、発話者に紐付いた分割発話１０の情報を得て、表示手段７により表示を行う（ＳＴＥＰ７）。 On the other hand, when the number of microphones M is less than the number of speakers S (N in STEP 3), each divided utterance 10 is linked to a speaker S (STEP 6). In the linking process, the divided utterances 10 generated from the voice data 2 are input to the decision AI, and as an output, information on the divided utterances 10 linked to the speakers is obtained and displayed by the display means 7 (STEP 7).

ここで、音声データ入手手段３によって入手された音声データ２において、各マイクＭ毎の音声データとなっておらず、１つのファイルとなっている場合、音声データ２のもととなった動画データから各マイクＭの発話者Ｓが判明する場合は、各マイクＭ毎に発話者Ｓとの紐付けを行う。一方で、各マイクＭの発話者Ｓが判明しない場合は、１つのマイクＭにおいて複数の発話者Ｓが発話しているものとして処理を行う。 Here, the audio data 2 acquired by the audio data acquisition means 3 may be a single file rather than separate audio data for each microphone M. In that case, if the speaker S of each microphone M can be identified from the video data from which the audio data 2 was taken, the speaker S is linked to each microphone M. If the speaker S of each microphone M cannot be identified, processing proceeds on the assumption that multiple speakers S are speaking into one microphone M.

なお、まれにマイクＭの数が発話者Ｓの数よりも多いときがあるが、この場合は、発話分割手段５による分割処理の際に、パワーの低いマイクＭを特定し、そのマイクＭは使用されていないものと判定して、そのマイクＭを除いて以降の処理を行ってもよい。或いは、各マイクＭの分割処理を行った後に、通常通り紐付け処理を行ってもよい。 In rare cases, the number of microphones M may be greater than the number of speakers S. In this case, during the splitting process by the speech splitting means 5, a microphone M with low power may be identified, and it may be determined that the microphone M is not in use, and subsequent processing may be performed excluding that microphone M. Alternatively, after splitting processing for each microphone M, linking processing may be performed as usual.
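Excluding unused microphones when there are more microphones than speakers might be sketched as follows. The description only states that a low-power microphone may be identified and excluded; ranking by mean power and keeping as many microphones as there are speakers is an assumption made here.

```python
def drop_unused_microphones(mic_samples, num_speakers):
    """Keep the num_speakers microphones with the highest mean power.

    mic_samples: dict mapping microphone id -> list of amplitude samples.
    Returns the input unchanged (as a copy) when there is no surplus microphone.
    """
    if len(mic_samples) <= num_speakers:
        return dict(mic_samples)

    def mean_power(samples):
        return sum(x * x for x in samples) / len(samples) if samples else 0.0

    ranked = sorted(mic_samples, key=lambda m: mean_power(mic_samples[m]), reverse=True)
    keep = set(ranked[:num_speakers])
    # Preserve the original microphone order in the result.
    return {m: s for m, s in mic_samples.items() if m in keep}
```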

表示手段７による表示は、紐付け手段６による紐付けの結果をユーザ端末１Ｕに表示させることにより行う（ＳＴＥＰ５，７）。紐付けの結果は、例えば、図３に示す構成とすることができる。 The display by the display means 7 is performed by displaying the results of the linking by the linking means 6 on the user terminal 1U (STEPs 5 and 7). The results of the linking can be, for example, as shown in FIG. 3.

マイクＭの数と発話者Ｓの数が同数であった場合（ＳＴＥＰ３でＹ）の表示は、表示手段７は、図３（Ａ）に示す結果表示画面２０を表示させる（ＳＴＥＰ５）。図３（Ａ）においては、紐付けの結果として、各マイクにおける主たる発話者Ｓの名前（名字）と、そのマイクにより得られた音声データ２における発話者Ｓが発話を行った分割発話１０が表示されている。この分割発話１０は、その音声データ２において記憶されている時刻を横軸に時系列で表示される。図３（Ａ）の例では、発話者Ｓである山田氏が最初に発話を行い、その後、鈴木氏、佐藤氏の順で発話が行われたことがわかる。 When the number of microphones M and the number of speakers S are the same (Y in STEP 3), the display means 7 displays the result display screen 20 shown in FIG. 3(A) (STEP 5). In FIG. 3(A), as a result of the linking, the name (surname) of the main speaker S at each microphone and the divided utterances 10 spoken by the speaker S in the voice data 2 obtained by that microphone are displayed. The divided utterances 10 are displayed in chronological order with the time stored in the voice data 2 on the horizontal axis. In the example of FIG. 3(A), it can be seen that Yamada, the speaker S, spoke first, followed by Suzuki and Sato.

一方で、マイクＭの数が発話者Ｓの数よりも少ないとき（ＳＴＥＰ３でＮ）の表示は、図３（Ｂ）に示す結果表示画面２０となる。図３（Ｂ）においては、紐付けの結果として、各マイクの主たる発話者Ｓが表示されている。具体的には１行目のマイクには山田氏の名前が表示され、２行目のマイクには佐藤氏の名前が表示されている。 On the other hand, when the number of microphones M is less than the number of speakers S (N in STEP 3), the display becomes the result display screen 20 shown in FIG. 3(B). In FIG. 3(B), the main speaker S of each microphone is displayed as a result of the linking. Specifically, Mr. Yamada's name is displayed on the microphone in the first row, and Mr. Sato's name is displayed on the microphone in the second row.

また、図３（Ｂ）の結果表示画面２０では、１行目のマイクの表示の左横に、そのマイクＭを使用している発話者Ｓが複数あることを示す複数表示２２が表示されている。具体的には、複数表示２２は、三角形で一つの頂点が発話者Ｓの山田氏に向けて表示された状態となっている。 In addition, on the result display screen 20 in FIG. 3(B), to the left of the microphone display in the first row, a multiple display 22 is displayed, indicating that there are multiple speakers S using that microphone M. Specifically, the multiple display 22 is a triangle with one apex pointing toward speaker S, Mr. Yamada.

ユーザは、この複数表示２２をユーザ端末１Ｕのマウスでクリック、或いは画面にタッチする等の操作を行うことにより、複数表示の開示指示を行うことができる。ユーザによる開示指示を受けたときは（ＳＴＥＰ８でＹ）、表示手段７は、マイクＭに紐付けられた複数の発話者のすべてを表示させる（ＳＴＥＰ９）。このとき、複数表示２２は、三角形の一つの頂点が下向きとなるように変更される。図３（Ｃ）においては、山田氏のマイクＭにおいて紐付けられた発話者が、山田氏以外に２名いることがわかる。 The user can issue a disclosure instruction for the multiple display 22 by clicking it with the mouse of the user terminal 1U, touching it on the screen, or performing a similar operation. When a disclosure instruction from the user is received (Y in STEP 8), the display means 7 displays all of the speakers linked to the microphone M (STEP 9). At this time, the multiple display 22 is changed so that one apex of the triangle points downward. FIG. 3(C) shows that two speakers other than Mr. Yamada are linked to Mr. Yamada's microphone M.

図３（Ｃ）においては、山田氏のマイクＭに紐付けられた発話者は、「山田＿１」と鈴木氏となっている。「山田＿１」の表示は、山田氏のマイクＭに紐付けられた分割発話１０であって、紐付け手段６によっては紐付けができなかった状態を示している。なお、この状態で、ユーザが再度複数表示２２をクリックすれば、表示されていた山田氏以外の分割発話１０の表示を折り畳んで、図３（Ｂ）の状態に戻るようになっている。 In FIG. 3(C), the speakers linked to Yamada's microphone M are "Yamada_1" and Suzuki. The display of "Yamada_1" shows a split utterance 10 linked to Yamada's microphone M, which could not be linked by the linking means 6. If the user clicks on the multiple display 22 again in this state, the display of the split utterances 10 other than Yamada's that were displayed will be collapsed, and the state will return to that of FIG. 3(B).
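
The collapsed and expanded states of the multiple display 22 could be modeled roughly as follows. This is a minimal sketch: the function name `display_rows`, the list-of-labels input, and the use of the characters ▶ and ▼ to stand for the triangle are assumptions for illustration, not details given in the specification.

```python
def display_rows(labels, expanded):
    """Return (marker, visible_labels) for one microphone row.

    labels: speaker labels linked to the microphone, main speaker first.
    expanded: True after the user has issued a disclosure instruction.
    The marker is empty when only one speaker is linked to the microphone.
    """
    if len(labels) <= 1:
        return ("", labels[:1])
    if expanded:
        # Triangle apex points downward; every linked speaker is shown.
        return ("▼", list(labels))
    # Collapsed: only the main speaker is shown next to the marker.
    return ("▶", labels[:1])


yamada_mic = ["Yamada", "Yamada_1", "Suzuki"]
collapsed = display_rows(yamada_mic, expanded=False)
opened = display_rows(yamada_mic, expanded=True)
```

Calling the function again with `expanded=False` corresponds to the second click that folds the row back to the state of FIG. 3(B).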

紐付けができなかった発話者Ｓについては、ユーザはその発話者の分割発話１０を指定して再生指示をすることにより、分割発話１０の内容を再生させて確認することができる。本実施形態では、このような場合に、紐付けが誤っている分割発話１０について、結果表示画面２０で修正可能としている（本発明における修正手段）。 For a speaker S who could not be linked, the user can specify the split utterance 10 of that speaker and instruct it to be played back, thereby playing back and checking the contents of the split utterance 10. In this embodiment, in such a case, the split utterance 10 that is incorrectly linked can be corrected on the result display screen 20 (correction means in the present invention).

ここで、ユーザが当該分割発話１０について、発話者Ｓを修正する場合は（ＳＴＥＰ１０でＹ）、図３（Ｄ）に示すように、分割発話１０の表示をユーザが修正した内容に変更することができる（ＳＴＥＰ１１）。ここでは、「山田＿１」を斉藤氏に修正している。 If the user wants to modify the speaker S for the segmented utterance 10 (Y in STEP 10), the display of the segmented utterance 10 can be changed to the content modified by the user (STEP 11), as shown in FIG. 3(D). Here, "Yamada_1" is modified to Mr. Saito.

このように、ユーザによって分割発話１０の紐付けの修正がなされた場合は、今回発話者の抽出を行った音声データ２が、発話者ＤＢ８において発話者の紐付けに使用されるデータとして登録される（ＳＴＥＰ１２）。これにより、次回以降の処理において当該音声データ２を発話者の紐付けに使用できるようになる。 In this way, when the linking of the split utterances 10 is modified by the user, the voice data 2 from which the speaker was extracted this time is registered in the speaker DB 8 as data to be used for linking the speaker (STEP 12). This makes it possible to use the voice data 2 for linking the speaker in the next and subsequent processes.
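
The correction and registration flow of STEPs 10 to 12 could look roughly like this. The data layout is an assumption: the specification does not describe how the speaker DB 8 stores its entries, so the dictionaries, the `conference_id` string, and the function name `apply_correction` are hypothetical.

```python
def apply_correction(links, utterance_id, new_speaker, speaker_db, conference_id):
    """Re-link one divided utterance and register the recording for the speaker.

    links: dict mapping utterance ID -> speaker label (modified in place).
    speaker_db: dict mapping speaker -> set of conference IDs whose audio
        may be used for linking that speaker (modified in place).
    Returns the label the utterance had before the correction.
    """
    previous = links[utterance_id]
    links[utterance_id] = new_speaker
    # Register this recording as usable reference data for the speaker.
    speaker_db.setdefault(new_speaker, set()).add(conference_id)
    return previous


links = {"u1": "Yamada_1", "u2": "Suzuki"}
speaker_db = {}
previous = apply_correction(links, "u1", "Saito", speaker_db, "meeting-A")
```

Here the unlinked label "Yamada_1" is replaced by Saito, and the recording is added to Saito's entry so that later runs can refer to it.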

次に、図５を参照して、ユーザが発話者ＤＢ学習手段９を利用し、管理画面２３によってパラメータを変更し、発話者ＤＢ８の学習を行う場合について説明する。図５においては、発話者である「○○さん」のデータベースに登録されている会議が一覧で表示されている。ユーザが、この会議一覧を見て、例えば、会議２が学習を行うのにふさわしくないと判断したときは、会議２の右横の削除ボタン２４をクリックし、会議２の削除を行うことができる。 Next, referring to Figure 5, we will explain the case where a user uses the speaker DB learning means 9 to change parameters on the management screen 23 and learn the speaker DB 8. In Figure 5, a list of conferences registered in the database of the speaker "Mr. XX" is displayed. When the user looks at this list of conferences and decides that, for example, Conference 2 is not suitable for learning, he or she can click the delete button 24 to the right of Conference 2 to delete Conference 2.

また、ユーザが、別の会議を登録したいと判断したときは、追加ボタン２５をクリックする。すると、登録したい会議の指定を促す画面（図示省略）が表示されるので、ユーザは、登録が必要と考える会議の音声データ２を指定して発話者ＤＢ８に登録することができる。 Furthermore, if the user decides that he or she wishes to register another conference, he or she clicks the Add button 25. This causes a screen (not shown) to be displayed prompting the user to specify the conference that he or she wishes to register, and the user can then specify the voice data 2 of the conference that he or she believes should be registered, and register it in the speaker DB 8.

次に、紐付け処理がなされた分割発話１０について誤りがあった際の修正について、図６を参照して説明する。例えば、図６（Ａ）に示した状態で、山田氏のマイクＭにおいて録音された３個ある分割発話１０のうち、鈴木の右端にある分割発話１０をユーザが確認した際に、鈴木氏ではなく佐藤氏の発話であったことが判明した。 Next, the correction of errors in divided utterances 10 that have undergone the linking process will be described with reference to FIG. 6. For example, suppose that, in the state shown in FIG. 6(A), the user checks the rightmost divided utterance 10 in Mr. Suzuki's row, one of the three divided utterances 10 recorded by Mr. Yamada's microphone M, and finds that it was spoken by Mr. Sato rather than Mr. Suzuki.

この場合、ユーザは、修正が必要であるので（ＳＴＥＰ１０でＹ）、マウスやタッチによる操作等で右端の分割発話１０を鈴木氏から佐藤氏に移動させ、修正することができる（ＳＴＥＰ１１）。この修正を確定させると、修正内容が発話者ＤＢ８に登録される（ＳＴＥＰ１２）。 In this case, since a correction is necessary (Y in STEP 10), the user can correct the divided utterance 10 on the right side by moving it from Mr. Suzuki to Mr. Sato using a mouse or touch operation (STEP 11). When the user confirms this correction, the correction is registered in the speaker DB 8 (STEP 12).

ここで、ユーザが再度の紐付け処理を行いたい場合は、再度処理を実行させることで、紐付け手段６が、修正が行われた後の学習済の発話者ＤＢ８を参照して、再度紐付け処理を行うことができる。その結果、例えば、図６における鈴木氏の左から２個目の分割発話１０も佐藤氏の発話であることが判明したときは、表示手段７は、図６（Ｂ）に示すように、修正後の分割発話１０の表示を行う。 If the user wishes to perform the linking process again, the process can be executed again, and the linking means 6 can refer to the trained speaker DB 8 after the correction and perform the linking process again. As a result, for example, when it is determined that the second split utterance 10 from the left of Mr. Suzuki in FIG. 6 is also an utterance of Mr. Sato, the display means 7 displays the corrected split utterance 10 as shown in FIG. 6 (B).

以上の通り、本実施形態の発話情報抽出装置１によれば、マイクＭの数が発話者Ｓの数と同数であるときは、各マイクＭによって録音がなされた音声データ２は、単独の発話者による発話のみとなるため、各マイクＭからの音声データ２について、どの発話者Ｓの発話かを判定する処理を行う必要がない。よって、この場合は不要な処理を行わないため誤った処理がなされることがなく、マイクＭの数が発話者の数より少ないときのみ紐付け処理を行えばよく、処理効率が向上すると共に処理精度が向上する。 As described above, according to the speech information extraction device 1 of this embodiment, when the number of microphones M is the same as the number of speakers S, the audio data 2 recorded by each microphone M contains only the speech of a single speaker, so there is no need to perform processing to determine which speaker S is speaking for the audio data 2 from each microphone M. Therefore, in this case, unnecessary processing is not performed, so erroneous processing is not performed, and linking processing only needs to be performed when the number of microphones M is less than the number of speakers, improving processing efficiency and processing accuracy.
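
The branch summarized above, in which speaker identification is skipped when the counts match, can be sketched as follows. The tuple layout, the `mic_owner` mapping, and the `identify` callback are assumptions for illustration; `identify` stands in for whatever linking process refers to the speaker DB 8.

```python
def link_utterances(utterances, mic_owner, num_speakers, identify):
    """Link each divided utterance to a speaker.

    utterances: list of (mic_id, start_sec, end_sec) tuples.
    mic_owner: dict mapping mic_id -> speaker name, one entry per microphone.
    identify: callable(utterance) -> speaker name; called only when the
        number of microphones differs from the number of speakers.
    """
    if len(mic_owner) == num_speakers:
        # One speaker per microphone: no per-utterance identification needed.
        return [(utt, mic_owner[utt[0]]) for utt in utterances]
    # Otherwise each divided utterance is identified individually.
    return [(utt, identify(utt)) for utt in utterances]


def _must_not_run(utt):
    raise AssertionError("identify should not be called when counts match")


utts = [("M1", 0.0, 2.0), ("M2", 2.0, 5.0)]
owners = {"M1": "Yamada", "M2": "Suzuki"}
same_count = link_utterances(utts, owners, 2, _must_not_run)
fewer_mics = link_utterances(
    utts, owners, 3, lambda utt: "Sato" if utt[0] == "M2" else "Yamada"
)
```

In the equal-count call the identification callback is never invoked, which is the source of the efficiency gain the paragraph describes.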

また、あるマイクＭによって録音された音声データ２が、複数の発話者Ｓの発話を含む場合に、結果表示画面２０において、ユーザは主たる発話者Ｓと、発話者Ｓが複数であることを一目で確認することができる。また、ユーザがそのマイクＭを使用した他の発話者Ｓが誰なのかを知りたい場合は、開示指示を行うことによりその発話者Ｓを知ることができる。 In addition, when the audio data 2 recorded by a certain microphone M includes speech from multiple speakers S, the user can see at a glance on the result display screen 20 which speaker S is the main speaker, and that there are multiple speakers S. In addition, if the user wants to know who the other speakers S are who used that microphone M, they can find out about those speakers S by issuing a disclosure instruction.

一方で、ユーザが複数表示２２によって、マイクＭを使用した他の発話者Ｓを確認した際に、実際は発話者Ｓが単独であるにも関わらず、２人以上が発話を行っていると誤認識されていることが発見される場合がある。この場合は、ユーザが再度複数表示２２をクリックすれば、表示されていた複数の分割発話１０の表示を折り畳んで、図３（Ｂ）の状態に戻すことができる。当該構成により、発話者Ｓの認識が誤っている場合であっても、誤った分割発話１０を非表示にすることができる。 On the other hand, when the user checks the other speaker S using the microphone M through the multiple display 22, he or she may discover that the speaker S has been mistakenly recognized as two or more people speaking, when in fact it is only one speaker S. In this case, the user can collapse the display of the multiple split utterances 10 that were displayed by clicking on the multiple display 22 again, returning to the state shown in FIG. 3(B). With this configuration, even if the speaker S has been recognized incorrectly, the incorrect split utterances 10 can be hidden.

また、発話者ＤＢ８に記憶されているある発話者Ｓのデータについて、パラメータとして記録された会議の記録を見て、学習データとしてふさわしくないと判断した場合に、その特定日の会議のデータを削除する等の編集をすることができる。当該編集により、発話者ＤＢ８のデータの信頼性を向上させることができる。 In addition, when data on a certain speaker S stored in speaker DB8 is judged to be unsuitable as learning data by looking at the conference records recorded as parameters, the data on the conference on that particular day can be edited, such as by deleting it. This editing can improve the reliability of the data in speaker DB8.

また、発話者ＤＢ８のパラメータを編集可能とすることにより、一人の発話者Ｓに対して複数の会議を紐付けることが可能となる。同一の発話者Ｓであっても、その声の特徴は使用しているマイクの種類や、発話者の体調、発話のバリエーション（笑い声、ささやき声等）、マイクと発話者との距離、部屋の大きさと反響に関する特性などの音響環境に影響を受ける。このため、発話者Ｓについて、特定の会議のみから特徴量を推定すると、他の音響環境における発話者Ｓの判定の精度が低くなるが、複数の音響環境の会議を紐付けることで、判定精度の向上を図ることができる。 In addition, by making the parameters of the speaker DB8 editable, it becomes possible to link multiple conferences to one speaker S. Even for the same speaker S, the voice characteristics are affected by the acoustic environment, such as the type of microphone used, the speaker's physical condition, speech variation (laughing, whispering, etc.), the distance between the microphone and the speaker, and the size of the room and characteristics related to reverberation. For this reason, if the features of speaker S are estimated only from a specific conference, the accuracy of the determination of speaker S in other acoustic environments will be low, but by linking conferences in multiple acoustic environments, the accuracy of the determination can be improved.
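
A toy example may help show why linking conferences from several acoustic environments can improve the determination. It assumes, purely for illustration, that the speaker DB holds one feature vector per conference and that matching uses cosine similarity; the specification names neither the feature type nor the similarity measure, and `cosine` and `best_match` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, speaker_db):
    """Pick the speaker whose best per-conference vector is closest to query.

    speaker_db: dict mapping speaker -> dict mapping conference -> vector.
    Speakers with no registered conference are ignored.
    """
    scores = {
        speaker: max(cosine(query, vec) for vec in confs.values())
        for speaker, confs in speaker_db.items()
        if confs
    }
    return max(scores, key=scores.get)


query = [0.5, 0.8]  # Yamada recorded in a different room (made-up numbers)
one_conf = {"Yamada": {"conf1": [1.0, 0.0]}, "Sato": {"conf1": [0.0, 1.0]}}
two_conf = {
    "Yamada": {"conf1": [1.0, 0.0], "conf2": [0.6, 0.8]},
    "Sato": {"conf1": [0.0, 1.0]},
}
```

With only one conference registered for Yamada, the query is matched to Sato; once a second conference from a closer acoustic environment is linked, the same query is matched to Yamada.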

さらに、紐付け処理がなされた分割発話１０について誤りがあった場合であっても、ユーザが修正を行うことができ、その修正結果が発話者ＤＢ８に反映されるので、１箇所の修正を行えば、その他の同様の修正を自動で行うことができる。 Furthermore, even if there is an error in the split utterance 10 that has been linked, the user can make corrections and the correction results are reflected in the speaker DB 8, so that once one correction is made, other similar corrections can be made automatically.

なお、上記実施形態においては、結果表示画面２０において、マイクＭで録音された音声データ２が複数の発話者Ｓである場合に、主たる発話者Ｓを、マイクＭで録音された音声データ２の発話時間が最も長い発話者Ｓとしているが、これに限らず、マイクＭの所有者として登録されている発話者Ｓを主たる発話者Ｓとしてもよい。 In the above embodiment, when the voice data 2 recorded by the microphone M includes multiple speakers S, the main speaker S is the speaker S who has the longest speaking time in the voice data 2 recorded by the microphone M on the result display screen 20. However, this is not limited to this, and the speaker S registered as the owner of the microphone M may be the main speaker S.
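
The two ways of choosing the main speaker mentioned above can be sketched as follows. The tuple layout and the function name `main_speaker` are assumptions for illustration.

```python
from collections import defaultdict


def main_speaker(linked, owner=None):
    """Choose the main speaker for one microphone.

    linked: list of (speaker, start_sec, end_sec) tuples for that microphone.
    owner: registered owner of the microphone, if that variant is used.
    Without an owner, the speaker with the longest total speaking time wins;
    on a tie, the speaker encountered first is returned.
    """
    if owner is not None:
        return owner
    totals = defaultdict(float)
    for speaker, start, end in linked:
        totals[speaker] += end - start
    return max(totals, key=totals.get)


linked = [
    ("Yamada", 0.0, 10.0),
    ("Suzuki", 10.0, 14.0),
    ("Yamada", 20.0, 25.0),
    ("Yamada_1", 25.0, 30.0),
]
by_duration = main_speaker(linked)
by_owner = main_speaker(linked, owner="Suzuki")
```

The duration-based choice follows the embodiment, while passing `owner` corresponds to the alternative in which the registered microphone owner is shown as the main speaker.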

また、上記実施形態では、音声データ入手手段３による音声データ２の入力を、登録されたオンライン会議ツールの動画データから自動的に行っているが、これに限らず、ユーザ端末１Ｕに入力画面（図示省略）を表示させて、その入力画面から音声データ２を入力するようにしてもよい。 In addition, in the above embodiment, the voice data 2 is automatically input by the voice data acquisition means 3 from the video data of the registered online conference tool, but this is not limited to the above. An input screen (not shown) may be displayed on the user terminal 1U, and the voice data 2 may be input from the input screen.

また、上記実施形態では、複数表示２２は、三角形の表示となっているが、これに限らず、複数あることが認識できれば、他の表示方法を用いてもよい。例えば、矢印等の図形や、何らかの図形を点滅表示させ、展開させた際に点滅を停止する等、任意の表示とすることができる。 In the above embodiment, the multiple displays 22 are triangular displays, but this is not limiting and other display methods may be used as long as it is possible to recognize that there are multiple displays. For example, any display may be used, such as a flashing arrow or other figure, or a flashing figure that stops flashing when expanded.

また、上記実施形態では、ＳＴＥＰ４において、音声データ２のうち、各マイクＭで録音された分割発話１０を、単一の発話者と紐付ける処理を行う際に、各マイクＭで録音された分割発話１０について、ＳＴＥＰ２で入手したマイクＭの所有者の情報から、各マイクＭと発話者Ｓの紐付けを行っているが、発話者ＤＢ８を参照して判定を行う判定ＡＩに入力し、出力として発話をしたと推定される発話者を得るようにしてもよい。 In the above embodiment, in STEP 4, when the process of linking the divided utterances 10 recorded by each microphone M in the voice data 2 to a single speaker is performed, the divided utterances 10 recorded by each microphone M are linked to each microphone M and speaker S based on the information on the owner of the microphone M obtained in STEP 2, but the information may also be input to a judgment AI that makes a judgment by referring to the speaker DB 8, and the speaker who is estimated to have made the utterance is obtained as the output.

また、上記実施形態では、発話者ＤＢ学習手段９として、図５に示す管理画面２３によってパラメータを編集可能としているが、これに限らず、自動で学習を行うようにしてもよい。例えば、ＳＴＥＰ４において、各マイクＭで録音された分割発話１０を、単一の発話者と紐付ける処理を行った際に、発話者ＤＢ８において追加学習を行ってデータベースを更新するようにしてもよい。また、別個に学習管理画面（図示省略）を設けて、ＳＴＥＰ４の紐付け処理の結果を学習させるか否かを設定できるようにしてもよい。 In the above embodiment, the speaker DB learning means 9 is configured to allow parameters to be edited using the management screen 23 shown in FIG. 5, but this is not limiting and learning may be performed automatically. For example, when the split utterances 10 recorded by each microphone M are linked to a single speaker in STEP 4, additional learning may be performed in the speaker DB 8 to update the database. A separate learning management screen (not shown) may also be provided to allow the user to set whether or not to learn the results of the linking process in STEP 4.

また、上記実施形態では、複数表示２２は、ユーザがタッチ操作等を行って開示指示を行うことにより、そのマイクに紐付けられた複数の発話者のすべてを表示させるようにしているが、これに限らず、複数表示を行うか否かの設定画面（図示省略）を設けておき、その設定に従って複数のすべての表示を行うか否かを設定できるようにしてもよい。 In addition, in the above embodiment, the multiple display 22 displays all of the multiple speakers linked to that microphone when the user issues a disclosure instruction by performing a touch operation or the like, but this is not limited to this. A setting screen (not shown) for determining whether or not to display multiple displays may be provided, and it may be possible to set whether or not to display all of the multiple displays according to that setting.

１…発話情報抽出装置
１Ｐ…発話情報抽出プログラム
２…音声データ
３…音声データ入手手段
４…情報入手手段
５…発話分割手段
６…紐付け手段
７…表示手段
８…発話者ＤＢ
９…発話者ＤＢ学習手段
１０…分割発話
１１…紐付け結果
２０…結果表示画面
２２…複数表示
２３…管理画面
Ｍ…マイク
Ｓ…発話者

Reference Signs List
1…Utterance information extraction device
1P…Utterance information extraction program
2…Voice data
3…Voice data acquisition means
4…Information acquisition means
5…Utterance division means
6…Linking means
7…Display means
8…Speaker DB
9…Speaker DB learning means
10…Divided utterance
11…Linking result
20…Result display screen
22…Multiple display
23…Management screen
M…Microphone
S…Speaker

Claims

An utterance information extraction device that extracts a speaker's utterance from audio data, comprising:
a voice data acquisition means for acquiring the voice data;
an information acquiring means for acquiring the number of microphones used in the voice data and the number of speakers;
an utterance division means for dividing the voice data into divided utterances for each utterance;
a linking means for linking the divided utterances with the speakers; and
a display means for displaying the linking results in which the divided utterances are linked with the speakers,
wherein, when the number of the microphones is the same as the number of the speakers, the linking means links the divided utterances for each of the microphones to a single speaker, and
when the number of the microphones is smaller than the number of the speakers, the linking means links each divided utterance with a speaker.

The speech information extraction device according to claim 1,
wherein the display means:
displays, when the number of the microphones is equal to the number of the speakers, the speaker associated with each of the microphones; and
displays, when multiple speakers are linked to a microphone, the main speaker together with a multiple display indicating that there are multiple speakers, and, when an instruction to disclose the multiple display is given, displays all of the speakers linked to that microphone.

The speech information extraction device according to claim 1,
The linking means refers to a speaker DB trained using reference voices of a plurality of speakers,
the device further comprises a speaker DB learning means for training the speaker DB, and
in the speaker DB learning means, information indicating which voice data is to be used for training the speaker DB as the reference voice linked to a speaker is editable as a parameter via a management screen.

The speech information extraction device according to claim 3,
The speech information extraction device, wherein the parameter is an event at which the reference speech was recorded.

The speech information extraction device according to claim 3,
The speech information extraction device, wherein the speaker DB includes speech features resulting from analysis of a reference voice of the speaker.

The speech information extraction device according to claim 3,
further comprising a correction means for allowing a user to correct the result of the linking by the linking means,
wherein the speaker DB learning means trains the speaker DB on the result of the correction made by the user, and
the linking means performs the linking by referring to the trained speaker DB.

An utterance information extraction program for causing a computer to operate as the utterance information extraction device according to any one of claims 1 to 6.