JP2020190669A

JP2020190669A - Speaker identification device, speaker identification method, and speaker identification program

Info

Publication number: JP2020190669A
Application number: JP2019096691A
Authority: JP
Inventors: フックタットダトウェン; Hook Tat Dat Wen; 信介菅谷; Shinsuke Sugaya; ▲栄▼傑陳; rong jie Chen
Original assignee: BizReach Inc
Current assignee: BizReach Inc
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2020-11-26

Abstract

PROBLEM TO BE SOLVED: To identify a speaker more easily, without requiring prior work. SOLUTION: A speaker identification device 50 identifies a speaker by using a plurality of voice data acquired by a plurality of microphones provided in correspondence with a plurality of target persons. The speaker identification device 50 includes: a representative value acquisition unit 52 which acquires, for each voice data, a representative value indicating a characteristic of loudness for each predetermined first period; a comparison unit 54 which compares, for each first period, two representative values acquired from two different voice data and obtains a comparison result for each first period; and an identification unit 55 which identifies the speaker in a second period longer than the first period, using a plurality of comparison results obtained in the second period. SELECTED DRAWING: Figure 4

Description

The present invention relates to a speaker identification device, a speaker identification method, and a speaker identification program.

In recent years, with advances in speech recognition technology, systems have been developed that automatically convert a speaker's speech in an interview, meeting, or the like into text data and automatically produce minutes. When performing such conversion to text, it is necessary to identify who uttered the voice data acquired by a microphone.

As a method for identifying a speaker, the method disclosed in Patent Document 1 is known, for example. Patent Document 1 discloses a method in which feature values of the voice uttered by a speaker are stored in advance, and the speaker is identified by comparing feature values of the voice data acquired by a microphone with the stored feature values of the speaker.

Japanese Unexamined Patent Publication No. 2004-145161

However, the invention disclosed in Patent Document 1 requires the speaker's feature values to be stored in advance, which is inconvenient. In addition, a speaker whose feature values have not been stored cannot be identified, so the method lacks versatility.
Furthermore, when speakers have similar voices, they are difficult to distinguish by comparing feature values, and accuracy is low.

The present invention has been made in view of these circumstances, and its object is to provide a speaker identification device, a speaker identification method, and a speaker identification program that require no prior work and can identify a speaker more easily.

A first aspect of the present invention is a speaker identification device that identifies a speaker by using a plurality of voice data acquired by a plurality of microphones provided in correspondence with a plurality of subjects, the device comprising: a representative value acquisition unit that acquires, for each voice data, a representative value indicating a characteristic of loudness for each predetermined first period; a comparison unit that compares, for each first period, two representative values acquired from two different voice data and obtains a comparison result for each first period; and an identification unit that identifies the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.

According to this speaker identification device, the representative value acquisition unit acquires a representative value of each voice data for each first period, and the comparison unit compares, for each first period, two representative values acquired from two different voice data to obtain a comparison result. The identification unit then identifies the speaker in the second period by using a plurality of comparison results obtained in the second period, which is longer than the first period. Because a representative value is acquired for each first period instead of using all of the sampled voice data, the noise contained in the voice data can be reduced and the amount of data to be processed can also be reduced.
Further, according to this aspect, representative values acquired from different voice data are compared for each first period, and the speaker is identified by using a plurality of comparison results within the second period, which is longer than the first period. Because voice data contains noise, identifying the speaker for each first period, for example, could cause the identified speaker to switch frequently under the influence of noise. In contrast, this aspect identifies the speaker in the second period by using a plurality of comparison results obtained within the second period, so frequent switching of the speaker can be suppressed and the accuracy of speaker identification can be improved.

A second aspect of the present invention is a speaker identification method for identifying a speaker by using a plurality of voice data acquired by a plurality of microphones provided in correspondence with a plurality of subjects, in which a computer executes: a representative value acquisition step of acquiring, for each voice data, a representative value indicating a characteristic of loudness for each predetermined first period; a comparison step of comparing, for each first period, two representative values acquired from two different voice data and obtaining a comparison result for each first period; and an identification step of identifying the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.

A third aspect of the present invention is a speaker identification program for identifying a speaker by using a plurality of voice data acquired by a plurality of microphones provided in correspondence with a plurality of subjects, the program causing a computer to execute: representative value acquisition processing of acquiring, for each voice data, a representative value indicating a characteristic of loudness for each predetermined first period; comparison processing of comparing, for each first period, two representative values acquired from two different voice data and obtaining a comparison result for each first period; and identification processing of identifying the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.

According to the present invention, a speaker can be identified more easily, without requiring prior work.

A diagram schematically showing the configuration of a speaker identification system according to an embodiment of the present invention, assuming two subjects.
A diagram explaining the correspondence between microphones and subjects in the speaker identification system according to an embodiment of the present invention, assuming two subjects.
A schematic configuration diagram showing an example of the hardware configuration of a speaker identification device according to an embodiment of the present invention.
A functional block diagram schematically showing the functions of the speaker identification device according to an embodiment of the present invention.
A diagram schematically showing an example of the waveforms of voice data acquired by the microphones according to an embodiment of the present invention.
A diagram explaining the processing performed by the representative value acquisition unit according to an embodiment of the present invention.
A diagram explaining the processing performed by the identification unit according to an embodiment of the present invention.
A flowchart showing an example of the procedure of the speaker identification method according to an embodiment of the present invention.
A diagram schematically showing the configuration of the speaker identification system according to an embodiment of the present invention, assuming three subjects.
A diagram explaining the correspondence between microphones and subjects in the speaker identification system according to an embodiment of the present invention, assuming three subjects.
A diagram explaining the processing performed by the identification unit according to an embodiment of the present invention.

An embodiment of the speaker identification device, speaker identification method, and speaker identification program according to the present invention is described below with reference to the drawings. For convenience of explanation, the case of two persons is described first, followed by the case of three persons.

FIG. 1 is a diagram schematically showing the configuration of a speaker identification system 1 according to an embodiment of the present invention. As shown in FIG. 1, the speaker identification system 1 according to the present embodiment includes a plurality of microphones 10 and 20 and a speaker identification device 50 that identifies a speaker based on the voice data acquired by the microphones 10 and 20.

The microphones 10 and 20 are provided in correspondence with persons A and B, for example as shown in FIG. 2. In FIG. 2, the microphone 10 is provided for person A and the microphone 20 is provided for person B. The microphone 10 is placed at a position where its distance La10 from person A differs from its distance Lb10 from person B. Similarly, the microphone 20 is placed at a position where its distance La20 from person A differs from its distance Lb20 from person B. The microphones 10 and 20 may also be worn by persons A and B, respectively.

In FIG. 2, the microphone 10 is placed closer to person A than to person B, and the microphone 20 is placed closer to person B than to person A. That is, the distances from persons A and B to the microphones 10 and 20 satisfy the relationships shown in expressions (1) and (2) below.

La10 < Lb10 (1)
Lb20 < La20 (2)

FIG. 3 is a schematic configuration diagram showing an example of the hardware configuration of the speaker identification device 50. As shown in FIG. 3, the speaker identification device 50 includes, for example, a CPU 11; an auxiliary storage device 12 for storing programs executed by the CPU 11 and data referenced by those programs; a main storage device 13 that functions as a work area when each program is executed; and at least one communication interface 14 for connecting to external devices (for example, the microphones 10 and 20) and to a network. The speaker identification device 50 may also include an input unit 15 that functions as a user interface, such as a keyboard, mouse, or pointing device, and a display unit 16 such as a liquid crystal display.
These units are connected, for example, via a bus 18. Examples of the auxiliary storage device 12 include magnetic disks such as an HDD (Hard Disk Drive), magneto-optical disks, and semiconductor memories such as an SSD (Solid State Drive).

The series of processes for realizing the various functions described later is stored in the auxiliary storage device 12 in the form of a program (for example, the speaker identification program). The CPU 11 reads this program into the main storage device 13 and executes information processing and arithmetic operations, thereby realizing the various functions. The program may be preinstalled in the auxiliary storage device 12, provided stored on another computer-readable storage medium, or distributed via wired or wireless communication means. Computer-readable storage media include magnetic disks, magneto-optical disks, CD-ROMs, DVD-ROMs, semiconductor memories, and the like.

FIG. 4 is a functional block diagram schematically showing the functions of the speaker identification device 50 according to the present embodiment. As shown in FIG. 4, the speaker identification device 50 includes, for example, a voice data storage unit 51, a representative value acquisition unit 52, a standardization processing unit 53, a comparison unit 54, and an identification unit 55.

The voice data storage unit 51 stores the voice data D1 and D2 acquired by the microphones 10 and 20.
For example, when the microphones 10 and 20 are connected to the speaker identification device 50 via a wired or wireless line, the speaker identification device 50 acquires the voice data output from the microphones 10 and 20 over that line and stores it in the voice data storage unit 51. The speaker identification device 50 and the microphones 10 and 20 may also be connected indirectly via another terminal or the like.

Alternatively, the voice data acquired by the microphones 10 and 20 may first be stored on a storage medium connected to the microphones 10 and 20 and then, at a predetermined timing, transferred from the storage medium to the speaker identification device 50 via a communication medium. In this case, the microphones 10 and 20 and the storage medium may be provided separately, or they may form a single device in which the voice acquisition function and the storage medium are integrated, such as a voice recorder. The speaker identification device 50 may also reside in the cloud.

The above are merely examples, and a wide variety of known techniques can be used for the speaker identification device 50 to obtain the voice data acquired by the microphones 10 and 20.
As described above, the voice data storage unit 51 stores the voice data D1 acquired by the microphone 10 and the voice data D2 acquired by the microphone 20.
FIG. 5 is a diagram schematically showing an example of the waveforms of the voice data D1 acquired by the microphone 10 and the voice data D2 acquired by the microphone 20.

The representative value acquisition unit 52 acquires, for each of the voice data D1 and D2 stored in the voice data storage unit 51, a representative value indicating a characteristic of loudness for each predetermined first period. FIG. 6 shows the voice data D1 of FIG. 5 with the time axis enlarged. As shown in FIG. 6, the voice data D1 is data that includes at least voice intensity associated with the time at which the voice was acquired.

The representative value acquisition unit 52 acquires, for example, a representative value Ui from the voice data D1 for each predetermined first period TWi (i = 1 to n). The first period TWi is set, for example, to a period longer than one cycle of the sound waveform based on the voice data. As an example, when the sampling frequency of the voice data is 44.1 kHz, the first period TWi is set within the range of 1 ms to 30 ms, preferably within the range of 5 ms to 20 ms.

Setting the first period TWi within such a range makes it possible to obtain appropriate representative values. Note that the number of sampled data points included in the first period TWi varies with the pitch of the speaker's voice: the higher the voice, the more sampled data points the first period TWi contains. Therefore, when speaker identification is performed after the fact, for example, the first period TWi may be changed dynamically according to the frequency of the acquired voice data D1 and D2 so that the number of data points included in each first period TWi is approximately equal.

The representative value is, for example, the maximum value in each first period TWi. Alternatively, the representative value may be the maximum amplitude value, or the difference between the minimum intensity value and the maximum intensity value. As a further example, values at or above a predetermined threshold may be extracted from the absolute values of the voice intensities, and the average of the extracted values may be used as the representative value. In short, the representative value need only be a feature value that characterizes the loudness in each first period TWi, and the specific method of determining it can be chosen as appropriate. For convenience of explanation, the following description uses the maximum value as the representative value.

As described above, the representative value acquisition unit 52 acquires, for the voice data D1, the maximum value Ui in each first period TWi as the representative value. Likewise, for the voice data D2, the representative value acquisition unit 52 acquires the maximum value Vi in each first period TWi (i = 1 to n) as the representative value.
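The windowed maximum described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `representative_values`, the use of a fixed window length in samples, and the choice to drop a trailing partial window are all assumptions made for the example.

```python
from typing import List, Sequence


def representative_values(samples: Sequence[float], window: int) -> List[float]:
    """Return the maximum sample value in each consecutive first period.

    `window` is the first-period length in samples (an assumed parameter).
    A trailing partial window is dropped (an assumption of this sketch).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(samples) // window  # number of complete first periods
    return [max(samples[i * window:(i + 1) * window]) for i in range(n)]


# Two complete windows of three samples each; the last sample is dropped.
print(representative_values([0.1, 0.5, -0.2, 0.3, 0.9, 0.0, 0.4], 3))
```

Replacing `max(...)` with the maximum of the absolute values would give the "maximum amplitude" variant that the text mentions as an alternative.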

The standardization processing unit 53 standardizes the representative values Ui and Vi acquired by the representative value acquisition unit 52. For example, when the speakers talk at different volumes, the intensities of the sound picked up by the microphones 10 and 20 become unbalanced. The standardization processing unit 53 performs standardization to remove this imbalance in voice intensity caused by differences in how loudly the speakers talk.

The standardization is performed, for example, as follows.
First, the standardization processing unit 53 calculates the average value X of the representative values Ui (i = 1 to n) of the voice data D1 and the average value Y of the representative values Vi (i = 1 to n) of the voice data D2.
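The formulas for these averages are not reproduced in this text. Assuming X and Y are ordinary arithmetic means over the n representative values, and assuming X carries equation number (3) and Y carries (4), they would read:

```latex
X = \frac{1}{n}\sum_{i=1}^{n} U_i \qquad (3)
\qquad\qquad
Y = \frac{1}{n}\sum_{i=1}^{n} V_i \qquad (4)
```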

In equations (3) and (4), n is the number of representative values in the voice data D1 and D2, in other words, the number of first periods TWi.

Next, using the average values X and Y of the representative values, the standardization processing unit 53 calculates standardized representative values Ui_st (i = 1 to n) and Vi_st (i = 1 to n) by standardizing the representative values Ui (i = 1 to n) and Vi (i = 1 to n). The formulas for the standardization processing are as follows.
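The standardization formulas themselves do not appear in this text, so the following is only one plausible reading and not the patent's stated formula. It assumes that each representative value is divided by the mean of its own channel (Ui_st = Ui / X, Vi_st = Vi / Y), which would remove a constant loudness difference between speakers. The function name `standardize` is hypothetical.

```python
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Scale representative values by their own mean.

    ASSUMPTION: division by the channel mean. The source text does not
    reproduce its standardization formulas, so this is illustrative only.
    """
    if not values:
        return []
    mean = sum(values) / len(values)
    if mean == 0:
        # Degenerate channel: avoid division by zero.
        return [0.0 for _ in values]
    return [v / mean for v in values]


u_st = standardize([1.0, 2.0, 3.0])  # mean X = 2.0
print(u_st)
```

Under this assumption a quiet speaker and a loud speaker both end up with standardized values centred on 1, so the later difference reflects which microphone is relatively louder rather than who talks louder overall.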

The comparison unit 54 compares, for each first period TWi, the two standardized representative values acquired from the different voice data D1 and D2, and obtains a comparison result for each first period TWi.
Specifically, the comparison unit 54 calculates the difference di (i = 1 to n) between the standardized representative values Ui_st and Vi_st obtained in the same first period TWi (i = 1 to n). The calculation is expressed by equation (7) below.

di = Ui_st - Vi_st (7)
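Equation (7) is an element-wise subtraction over first periods. A minimal sketch follows; the function name `differences` and the length check are assumptions of this example.

```python
from typing import List, Sequence


def differences(u_st: Sequence[float], v_st: Sequence[float]) -> List[float]:
    """Compute di = Ui_st - Vi_st for each first period, as in equation (7)."""
    if len(u_st) != len(v_st):
        raise ValueError("both channels must cover the same first periods")
    return [u - v for u, v in zip(u_st, v_st)]


# Positive where channel D1 is relatively louder, negative where D2 is.
print(differences([1.5, 0.5], [0.5, 1.0]))
```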

The identification unit 55 identifies the speaker for each second period TYj (j = 1 to m), which is longer than the first period TWi, by using the plurality of comparison results di obtained within that second period TYj.
For example, the identification unit 55 calculates, for each second period TYj, a difference average value dj_ave, which is the average of the differences di.

For example, the second period TYj is set to at least five times the first period TWi. In the present embodiment, as an example, the second period TYj is set to 250 ms. If the first period TWi is set to 10 ms, each second period TYj contains 25 differences di, and the average dj_ave of those 25 differences is calculated.
The value of 250 ms is only one example of the second period TYj; it need only be set within a range in which the speaker can be identified with the required accuracy. Specifically, the second period TYj is set to at least five times the first period.
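The averaging over each second period can be sketched as a block mean. In this illustration `block_size` is the number of first periods per second period (25 for TWi = 10 ms and TYj = 250 ms); the function name `block_means` and the dropping of a trailing partial block are assumptions.

```python
from typing import List, Sequence


def block_means(diffs: Sequence[float], block_size: int) -> List[float]:
    """Average the differences di over each consecutive second period.

    `block_size` is the number of first periods per second period.
    A trailing partial block is dropped (an assumption of this sketch).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    m = len(diffs) // block_size  # number of complete second periods
    return [
        sum(diffs[j * block_size:(j + 1) * block_size]) / block_size
        for j in range(m)
    ]


# With TWi = 10 ms and TYj = 250 ms, block_size would be 250 // 10 = 25.
print(block_means([1.0, 3.0, -2.0, -4.0, 5.0], 2))
```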

Next, the identification unit 55 determines whether the difference average value dj_ave is positive (dj_ave > 0) or negative (dj_ave < 0). When the absolute value of the difference average value is less than or equal to a predetermined threshold, it is treated as zero; in that case, it is determined that nobody is speaking or that the signal is noise. Then, as shown in FIG. 7, when the same result (positive, negative, or zero) continues for a predetermined number of times (six in FIG. 7) or for a predetermined period or longer, the person corresponding to that sign is identified as the speaker.
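The three-way sign decision can be sketched as below. The function name `classify` and the encoding of the results as 1, -1, and 0 are assumptions of this example; the threshold value is a tuning parameter that the text does not specify.

```python
def classify(dj_ave: float, threshold: float) -> int:
    """Map a difference average to 1 (positive), -1 (negative), or 0.

    Values whose absolute value is at or below `threshold` are treated
    as zero, meaning no speech or noise.
    """
    if abs(dj_ave) <= threshold:
        return 0
    return 1 if dj_ave > 0 else -1


print([classify(x, 0.1) for x in (0.3, -0.3, 0.05, -0.1)])
```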

For example, as shown in FIG. 7, suppose that the signs in successive second periods TYj appear as follows: positive six times, negative twice, zero twice, negative once, positive seven times, negative once, positive twice, negative eight times, positive once, and zero six times. In this case, after positive appears six times, the speaker is identified as person A until negative or zero appears six times in a row. When the negative sign then continues six times, it is determined that the speaker has switched from person A to person B, and the speaker is identified as person B from the point at which that negative sign first appeared. From then on, the speaker is identified as person B unless a positive sign or zero continues six or more times. When zero continues six times, it is determined that neither speaker is speaking.
As a result, speaker A, speaker B, and no utterance are identified as illustrated in FIG. 7.
By judging that the speaker has switched, or that neither speaker is speaking, only when a positive, negative, or zero result continues for a predetermined number of times or for a predetermined period or longer, erroneous identification of the speaker due to noise can be suppressed.
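The run-length rule in this example can be sketched as follows. This is an illustrative reading, not the patent's implementation: the function name `label_speakers`, the encoding of signs as 1, -1, and 0, the labels "A" and "B", the use of `None` for "no utterance", and the choice of `None` as the state before the first qualifying run are all assumptions.

```python
from itertools import groupby
from typing import List, Optional, Sequence


def label_speakers(signs: Sequence[int], min_run: int = 6) -> List[Optional[str]]:
    """Assign a speaker label to every second period.

    The label changes only when the same sign repeats at least `min_run`
    times in a row; the change then applies from the first period of
    that run. Shorter runs keep the previous label.
    """
    labels = {1: "A", -1: "B", 0: None}  # None means no utterance
    current: Optional[str] = None  # assumed state before any qualifying run
    out: List[Optional[str]] = []
    for sign, group in groupby(signs):
        run = len(list(group))
        if run >= min_run:
            current = labels[sign]
        out.extend([current] * run)
    return out


# The sign sequence described for FIG. 7.
signs = ([1] * 6 + [-1] * 2 + [0] * 2 + [-1] + [1] * 7 + [-1]
         + [1] * 2 + [-1] * 8 + [1] + [0] * 6)
result = label_speakers(signs)
print(result.count("A"), result.count("B"), result.count(None))  # 21 9 6
```

For this sequence the sketch labels the first 21 second periods as person A, the next 9 as person B (the run of eight negatives plus the isolated positive that follows), and the final 6 as no utterance, which matches the behaviour described in the text.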

Next, the speaker identification method according to the present embodiment described above is briefly explained with reference to FIG. 8.
First, the voice data D1 and D2 of the utterances of persons A and B are acquired by the microphones 10 and 20 and stored in the voice data storage unit 51 of the speaker identification device 50 (SA1).
Next, for each of the voice data D1 and D2, the representative values Ui and Vi are acquired for each first period TWi (SA2), and the standardized representative values Ui_st and Vi_st are calculated by standardizing the representative values Ui and Vi (SA3).
Next, the difference di between the standardized representative values is calculated for each first period TWi (SA4), and the difference average value dj_ave is calculated for each second period TYj, which is longer than the first period TWi (SA5).

Next, the sign of the difference average value dj_ave is determined for each second period TYj (SA6). Depending on how the signs continue, for example when the same sign occurs a predetermined number of times (for example, six) in succession, it is determined that the speaker corresponding to that sign is speaking (SA7). In the present embodiment, as shown in equation (7) above, the difference di is calculated by subtracting the standardized representative value Vi_st of the voice data D2 from the standardized representative value Ui_st of the voice data D1. Therefore, when the sign is positive, it is determined that person A, who corresponds to the voice data D1, is speaking, and when the sign is negative, it is determined that person B, who corresponds to the voice data D2, is speaking.

As described above, according to the speaker identification device, the speaker identification method, and the speaker identification program of the present embodiment, the representative value acquisition unit 52 acquires the representative values Ui and Vi of the voice data D1 and D2 for each first period TWi, and the standardization processing unit 53 standardizes the representative values Ui and Vi. The comparison unit 54 then compares, for each first period TWi, the two standardized representative values Ui_st and Vi_st acquired from the two different voice data D1 and D2, and obtains a comparison result. The identification unit 55 then identifies the speaker in the second period TYj, which is longer than the first period TWi, by using the plurality of comparison results obtained in that second period TYj. Because the representative values Ui and Vi in each first period TWi are used to identify the speaker, rather than all of the sampled voice data D1 and D2, the noise contained in the voice data D1 and D2 can be reduced and the amount of data to be processed can also be reduced.

Further, since the representative values Ui and Vi are standardized, an imbalance in voice intensity between the voice data can be suppressed.
In addition, the speaker in the second period TYj is identified by using the difference average value dj_ave, which is obtained by averaging, over the second period TYj, the differences di between the two standardized representative values Ui_st and Vi_st obtained by the comparison unit 54 for each first period TWi. This suppresses frequent switching of the identified speaker caused by noise contained in the voice data D1 and D2, and improves the accuracy of speaker identification.

Further, according to the present embodiment, the identification unit 55 determines the sign of the difference average value dj_ave and, when the same sign continues a predetermined number of times or for a predetermined period, determines that the speaker corresponding to that sign is speaking, so frequent switching of the speaker can be suppressed further. In the present embodiment, the second period TYj is set to 250 ms, but it is not realistic for speakers to switch in units of 250 ms. According to the present embodiment, it is determined that the person corresponding to a sign is speaking only when the same sign continues a predetermined number of times or for a predetermined period or more in the difference average values dj_ave obtained for each second period TYj, so the accuracy of speaker identification can be improved further. Note that this identification method is an example. For instance, the speaker may instead be identified for each second period TYj, although the identification accuracy is then lower.
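As a small worked example of the timing involved, the following Python snippet computes how long a sign must persist before a speaker is confirmed, and how many samples one second period contains. The 250 ms second period and the count of six consecutive periods are the example values given in the description, and 44.1 kHz is the sampling frequency named in the claims; the function and constant names are hypothetical.

```python
SECOND_PERIOD_MS = 250      # second period TYj set in the embodiment
REQUIRED_RUN = 6            # example number of consecutive same-sign periods
SAMPLING_RATE_HZ = 44_100   # sampling frequency named in the claims


def confirmation_delay_ms(period_ms=SECOND_PERIOD_MS, run=REQUIRED_RUN):
    """Time the same sign must persist before the speaker is confirmed."""
    return period_ms * run


def samples_per_period(period_ms, rate_hz=SAMPLING_RATE_HZ):
    """Number of samples contained in a period of the given length."""
    return rate_hz * period_ms // 1000


print(confirmation_delay_ms())               # 1500 (ms)
print(samples_per_period(SECOND_PERIOD_MS))  # 11025 (samples)
```

Under these example values, a change of speaker is confirmed only after about 1.5 seconds of consistent evidence.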

In the above embodiment, the comparison result is obtained by using the standardized representative values Ui_st and Vi_st. However, standardization is not essential, and the comparison unit 54 may obtain the difference di by using the unstandardized representative values Ui and Vi.

Next, the speaker identification device, the speaker identification method, and the speaker identification program according to an embodiment of the present invention will be described for the case of three persons. In the following, points in common with the two-speaker case described above are omitted, and the differences are mainly described.

FIG. 9 is a diagram schematically showing the configuration of the speaker identification system 1' for the case of three speakers. As shown in FIG. 9, in the speaker identification system 1', a microphone 30 corresponding to person C is added to the configuration shown in FIG. 1, and the voice data D3 acquired by the microphone 30 is input to the speaker identification device 50.

FIG. 10 is a diagram showing the arrangement of persons A to C and the microphones 10, 20, and 30. In FIG. 10, the microphone 30 is provided for person C. The microphone 10 is arranged at a position where the distance La10 from person A, the distance Lb10 from person B, and the distance Lc10 from person C differ from one another. Similarly, the microphone 20 is arranged at a position where the distance La20 from person A, the distance Lb20 from person B, and the distance Lc20 from person C differ from one another. The microphone 30 is likewise arranged at a position where the distance La30 from person A, the distance Lb30 from person B, and the distance Lc30 from person C differ from one another.
The microphones 10, 20, and 30 may be worn by persons A to C, respectively.

In FIG. 10, the microphone 10 is arranged closer to person A than to either person B or person C, the microphone 20 is arranged closer to person B than to either person A or person C, and the microphone 30 is arranged closer to person C than to either person A or person B. That is, the distances from persons A, B, and C to the microphones 10, 20, and 30 satisfy the relationships shown in equations (8) to (13) below.

La10 < Lb10 (8)
La10 < Lc10 (9)
Lb20 < La20 (10)
Lb20 < Lc20 (11)
Lc30 < La30 (12)
Lc30 < Lb30 (13)
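The placement condition of equations (8) to (13) can be checked programmatically. The following Python sketch assumes a hypothetical nested mapping from microphone to person to distance; the data layout, the names, and the example distances are illustrative only.

```python
# Hypothetical mapping from each microphone to the person it is provided for.
OWNER = {"10": "A", "20": "B", "30": "C"}


def placement_is_valid(distances, owner=OWNER):
    """Return True when every microphone is strictly closest to its own person.

    `distances[mic][person]` is the distance from `person` to `mic`,
    e.g. distances["10"]["B"] corresponds to Lb10.
    """
    for mic, person in owner.items():
        own = distances[mic][person]
        others = [d for p, d in distances[mic].items() if p != person]
        if any(own >= d for d in others):
            return False
    return True


# Example distances in metres (made-up values for illustration).
layout = {
    "10": {"A": 0.3, "B": 1.2, "C": 1.5},
    "20": {"A": 1.1, "B": 0.4, "C": 1.3},
    "30": {"A": 1.6, "B": 1.2, "C": 0.35},
}
print(placement_is_valid(layout))  # True
```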

Next, a method of identifying the speaker based on the voice data D1 to D3 acquired by the microphones 10, 20, and 30 will be described. The functional blocks of the speaker identification device 50 are the same as those shown in FIG. 4, except that each unit also processes the voice data D3 in the same manner and that the speaker identification method used by the identification unit 55 differs slightly.

First, the voice data storage unit 51 stores the voice data D1 acquired by the microphone 10, the voice data D2 acquired by the microphone 20, and the voice data D3 acquired by the microphone 30.

The representative value acquisition unit 52 acquires a representative value for each first period TWi from each of the voice data D1 to D3, using the same method as in the two-speaker case described above. For example, the representative value acquisition unit 52 acquires the maximum value Ui in each first period TWi (i = 1 to n; the same applies hereinafter) as the representative value for the voice data D1, the maximum value Vi in each first period TWi as the representative value for the voice data D2, and the maximum value Wi in each first period TWi as the representative value for the voice data D3.

The standardization processing unit 53 standardizes the representative values Ui, Vi, and Wi acquired by the representative value acquisition unit 52 and obtains the standardized representative values Ui_st_uv, Ui_st_uw, Vi_st_uv, Vi_st_vw, Wi_st_uw, and Wi_st_vw. For example, the standardization process is performed as follows.
First, the standardization processing unit 53 calculates the average value X of the representative values Ui (i = 1 to n) of the voice data D1, the average value Y of the representative values Vi (i = 1 to n) of the voice data D2, and the average value Z of the representative values Wi (i = 1 to n) of the voice data D3.

In equations (14) to (16), n is the number of representative values in each of the voice data D1, D2, and D3, in other words, the number of first periods TWi.
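Given the definitions above (X, Y, and Z are the averages of the n representative values Ui, Vi, and Wi), the averages referred to as equations (14) to (16) presumably take the form of ordinary arithmetic means. The following is a reconstruction under that assumption; the assignment of the numbers (14), (15), and (16) to X, Y, and Z in this order is also assumed.

```latex
X = \frac{1}{n}\sum_{i=1}^{n} U_i \tag{14}
\qquad
Y = \frac{1}{n}\sum_{i=1}^{n} V_i \tag{15}
\qquad
Z = \frac{1}{n}\sum_{i=1}^{n} W_i \tag{16}
```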

Subsequently, the standardization processing unit 53 uses the average values X, Y, and Z of the representative values to calculate standardized representative values for each of three combinations: the voice data D1 and D2, the voice data D1 and D3, and the voice data D2 and D3.

For example, data standardization for the combination of the voice data D1 and the voice data D2 is as follows.

For example, data standardization for the combination of the voice data D1 and the voice data D3 is as follows.

For example, data standardization for the combination of the voice data D2 and the voice data D3 is as follows.

Next, for each first period TWi, the comparison unit 54 uses the standardized representative values acquired from the voice data D1, D2, and D3 to calculate the difference di_uv between the standardized representative values of the voice data D1 and D2, the difference di_uw between those of the voice data D1 and D3, and the difference di_vw between those of the voice data D2 and D3. The equations are as follows.

di_uv = Ui_st_uv - Vi_st_uv (23)
di_uw = Ui_st_uw - Wi_st_uw (24)
di_vw = Vi_st_vw - Wi_st_vw (25)
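Equations (23) to (25) apply the same subtraction to each pair of voice data. A minimal Python sketch follows, assuming the pair-specific standardized series have already been computed and are supplied in a hypothetical dictionary keyed by pair; the function names and data layout are not from the source.

```python
from itertools import combinations


def channel_pairs(channels):
    """Generate every combination of two different channels."""
    return list(combinations(channels, 2))


def pairwise_differences(standardized):
    """Compute the per-first-period differences for every pair.

    `standardized[(a, b)]` holds two equally long lists: the standardized
    representative values of channel a and of channel b for that pair,
    e.g. (Ui_st_uv, Vi_st_uv) for the pair ("u", "v").
    """
    return {pair: [x - y for x, y in zip(first, second)]
            for pair, (first, second) in standardized.items()}


print(channel_pairs(["u", "v", "w"]))
# [('u', 'v'), ('u', 'w'), ('v', 'w')]
```

Keying the input by pair reflects the fact that the standardized value of one channel can differ between pairs (Ui_st_uv versus Ui_st_uw).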

The identification unit 55 identifies the speaker for each second period TYj, which is longer than the first period TWi, by using the plurality of comparison results obtained within that second period TYj.
For example, the identification unit 55 calculates, for each second period TYj, the difference average values dj_uv_ave, dj_uw_ave, and dj_vw_ave, which are the averages of the differences di_uv, di_uw, and di_vw, respectively.

Subsequently, the identification unit 55 determines whether each of the difference average values dj_uv_ave, dj_uw_ave, and dj_vw_ave is positive or negative. When the absolute value of a difference average value is equal to or less than a predetermined threshold, the value is judged to be zero. That is, in this case, it is determined that a speaker other than the speakers corresponding to the voice data used to calculate that difference is speaking.

Even when there are three persons, as shown in FIG. 11, when the same sign occurs a predetermined number of times or more in succession (or continuously for a predetermined period or more), the person corresponding to that sign is identified as the speaker. Further, when zero occurs a predetermined number of times or more in succession in every combination (dj_uv_ave, dj_uw_ave, dj_vw_ave), it is determined that no person is speaking. As a result, as shown in FIG. 11, it is possible to identify whether the speaker is person A, person B, or person C, or that no person is speaking (no utterance).
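One way to combine the three signs for a single second period is sketched below in Python. The description states only that the positive, negative, and zero states are judged together, so the specific decision rule, the function names, and the "undetermined" result for inconsistent sign patterns are assumptions. The run-length condition over successive second periods is omitted for brevity.

```python
def sign_with_deadband(value, threshold):
    """Return 0 when |value| <= threshold, otherwise +1 or -1."""
    if abs(value) <= threshold:
        return 0
    return 1 if value > 0 else -1


def decide_speaker(dj_uv, dj_uw, dj_vw, threshold):
    """Pick a speaker for one second period from the three average signs.

    Assumed rule: a person is chosen when both differences that involve
    that person's voice data point to that person.
    """
    s_uv = sign_with_deadband(dj_uv, threshold)
    s_uw = sign_with_deadband(dj_uw, threshold)
    s_vw = sign_with_deadband(dj_vw, threshold)
    if s_uv == 0 and s_uw == 0 and s_vw == 0:
        return None                  # no utterance
    if s_uv > 0 and s_uw > 0:
        return "A"                   # D1 louder than both D2 and D3
    if s_uv < 0 and s_vw > 0:
        return "B"                   # D2 louder than both D1 and D3
    if s_uw < 0 and s_vw < 0:
        return "C"                   # D3 louder than both D1 and D2
    return "undetermined"            # hypothetical fallback, not in the source
```

For instance, `decide_speaker(0.8, 0.7, 0.01, 0.05)` selects person A: both differences involving D1 are positive, while the D2/D3 difference falls within the threshold and is treated as zero.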

As described above, when there are three or more persons, the comparison unit 54 generates combinations of two different voice data and calculates, for each generated combination, the difference between the standardized representative values of the voice data. The identification unit 55 then calculates, for each combination, a difference average value by averaging the calculated differences over each second period TYj, and identifies the speaker by judging the positive, negative, and zero states of the difference average values together.

Thus, even when three or more persons are present, the speaker can be identified by creating a plurality of combinations of two different voice data, calculating the difference average value for each combination, and judging the signs of those difference average values together.

Although the present invention has been described above using embodiments, the technical scope of the present invention is not limited to the scope described in those embodiments. Various changes or improvements can be made to the above embodiments without departing from the gist of the invention, and forms to which such changes or improvements are made are also included in the technical scope of the present invention. The above embodiments may also be combined as appropriate.
The flow of the speaker identification process described in the above embodiments is likewise an example. Unnecessary steps may be deleted, new steps may be added, and the processing order may be changed without departing from the gist of the present invention.

1, 1': Speaker identification system
10: Microphone
11: CPU
12: Auxiliary storage device
13: Main storage device
14: Communication interface
15: Input unit
16: Display unit
18: Bus
20: Microphone
30: Microphone
50: Speaker identification device
51: Voice data storage unit
52: Representative value acquisition unit
53: Standardization processing unit
54: Comparison unit
55: Identification unit

Claims

A speaker identification device that identifies a speaker by using a plurality of voice data acquired by a plurality of microphones provided corresponding to a plurality of target persons, the device comprising:
a representative value acquisition unit that acquires, for each voice data, a representative value indicating a feature regarding loudness for each predetermined first period;
a comparison unit that compares two representative values acquired from two different voice data for each first period and obtains a comparison result for each first period; and
an identification unit that identifies the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.

The speaker identification device according to claim 1, further comprising a standardization processing unit that calculates standardized representative values by standardizing the representative values between the voice data,
wherein the comparison unit obtains the comparison result by using the standardized representative values as the representative values.

The speaker identification device according to claim 1 or 2, wherein the comparison unit calculates, for each first period, the difference between the two representative values acquired from the two different voice data, and
the identification unit calculates a difference average value by averaging the differences over each second period and identifies the speaker based on the sign of the difference average value.

The speaker identification device according to claim 3, wherein the identification unit determines the sign of the difference average value and, when the same sign continues a predetermined number of times or for a predetermined period, determines that the speaker corresponding to that sign is speaking.

The speaker identification device according to any one of claims 1 to 4, wherein, when three or more of the microphones are installed, the comparison unit generates combinations of two different microphones and, for each generated combination, compares the representative values obtained from the two corresponding voice data for each first period to obtain the comparison result, and
the identification unit identifies the speaker in the second period by using the plurality of comparison results obtained in the second period for each of the combinations.

The speaker identification device according to any one of claims 1 to 5, wherein the first period is set to a period longer than one cycle of the sound waveform based on the voice data.

The speaker identification device according to any one of claims 1 to 6, wherein the sampling frequency of the voice data is 44.1 kHz, and
the first period is set within a range of 1 ms or more and 30 ms or less.

The speaker identification device according to any one of claims 1 to 7, wherein the second period is set to a period at least five times as long as the first period.

A speaker identification method for identifying a speaker by using a plurality of voice data acquired by a plurality of microphones provided corresponding to a plurality of target persons, in which a computer executes:
a representative value acquisition step of acquiring, for each voice data, a representative value indicating a feature regarding loudness for each predetermined first period;
a comparison step of comparing two representative values acquired from two different voice data for each first period and obtaining a comparison result for each first period; and
an identification step of identifying the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.

A speaker identification program for identifying a speaker by using a plurality of voice data acquired by a plurality of microphones provided corresponding to a plurality of target persons, the program causing a computer to execute:
a representative value acquisition process of acquiring, for each voice data, a representative value indicating a feature regarding loudness for each predetermined first period;
a comparison process of comparing two representative values acquired from two different voice data for each first period and obtaining a comparison result for each first period; and
an identification process of identifying the speaker in a second period longer than the first period by using a plurality of the comparison results obtained in the second period.