JP2020154061A

JP2020154061A - Speaker identification apparatus, speaker identification method and program

Info

Publication number: JP2020154061A
Application number: JP2019050705A
Authority: JP
Inventors: Yasutaka Urakawa (浦川 康孝); Masahito Toya (斗谷 優仁)
Original assignee: Fuetrek Co Ltd
Current assignee: Fuetrek Co Ltd
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2020-09-24

Abstract

PROBLEM TO BE SOLVED: To shorten the time required for identification even when many identification targets are registered for speaker identification. SOLUTION: A speaker identification apparatus comprises: a storage section 15 that stores registered data, which is the voice feature data of all registered persons, classified into groups clustered by similarity among the registered data and associated with identification codes assigned to the respective registered persons; a feature extraction section 11 that extracts collation data, which is voice feature data, from input voice data; a classification determination section 13 that determines into which group the collation data should be classified; and a speaker identification section 16 that determines, as the speaker of the collation data, the registered person whose identification code corresponds to the registered data that, among the registered data classified into that group, has the highest similarity to the collation data and whose similarity to the collation data exceeds a threshold. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speaker identification device, a speaker identification method, and a program that identify a speaker by collating the speaker's voice data with registered voice data.

As a method of authenticating an individual, voice authentication technology, which authenticates a speaker by collating the speaker's utterance with registered speech, has been put into practical use. For example, Patent Document 1 and Patent Document 2 describe voice-based personal authentication systems.

The personal authentication system of Patent Document 1 identifies an individual from voice input using a computer and comprises: (1) means for receiving, by voice, the words to be authenticated and analyzing the input voice to create encoded voice data for authentication; and (2) means for extracting words from the encoded voice data for authentication by speech recognition analysis using an acoustic model, a language model, and a word dictionary, retrieving from the voiceprint database, which is classified by word, only the data corresponding to the extracted words, and performing voiceprint matching against the encoded voice data for recognition.

The personal authentication system of Patent Document 2 identifies an individual by having the person to be authenticated speak, and comprises means for: (1) presenting to that person a plurality of words that include both words subject to authentication, registered in advance in a voiceprint database, and words not subject to authentication; and (2) capturing the voice data spoken by that person, performing authentication on the words subject to authentication, and registering the words not subject to authentication in the voiceprint database.

Patent Document 3 describes a technique for improving the accuracy of speaker determination. The speaker determination device of Patent Document 3 includes: a similarity calculation unit that calculates the similarity between the speaker feature of each divided speech segment, obtained by dividing the speech segments of a voice signal into segments of a predetermined length, and speaker features generated in advance for each counter staff member; a primary speaker determination unit that generates, from the similarities, primary determination information representing the speaker ID of each divided speech segment; a secondary speaker determination unit that generates secondary determination information by taking the speaker ID of the neighboring speaker as the secondary determination information of a given divided speech segment when the similarity between that segment's speaker feature and the speaker feature of the neighboring speaker, that is, the speaker who best fits a predetermined number of divided speech segments before or after that segment, satisfies a predetermined condition; and a speaker clustering unit that clusters the set of customer speaker features, that is, the speaker features of the divided speech segments whose secondary determination information indicates a customer, to generate customer speaker IDs and tertiary determination information.

Patent Document 1: Japanese Unexamined Patent Publication No. 2003-323197
Patent Document 2: Japanese Unexamined Patent Publication No. 2003-302999
Patent Document 3: Japanese Unexamined Patent Publication No. 2019-8131

In voice-based personal authentication, as in the speaker determination device of Patent Document 3, an individual has basically been identified by comparing the collation voice against every registered voice. As the number of registered speakers grows, the time needed to obtain a result increases in proportion to the number of registrations whose similarity must be evaluated. The personal authentication system of Patent Document 1 compares only against the data corresponding to the spoken words, but the same problem arises as the amount of registered data for the same words increases.

The personal authentication system of Patent Document 2 presupposes that the ID of the person to be authenticated has already been established, for example by a credit card, and that voice authentication is used in place of a password; the registered data whose similarity is evaluated is therefore limited by the ID. When the speaker, including the ID, is to be identified by voice alone, the collation data must be compared with all registered data.

The present invention has been made in view of the above circumstances, and its object is to shorten the time required for identification even when many identification targets are registered for speaker identification.

A speaker identification device according to a first aspect of the present invention comprises: a storage unit that stores registered data, which is the voice feature data of each of all registered persons, classified into groups clustered by similarity among the registered data and associated with identification codes assigned to the respective registered persons; a feature extraction unit that extracts collation data, which is voice feature data, from input voice data; a classification determination unit that determines into which of the groups the collation data should be classified; and a speaker identification unit that determines, as the speaker of the collation data, the registered person assigned the identification code associated with the registered data that, among the registered data classified into the group into which the collation data should be classified, has the highest similarity to the collation data and whose similarity to the collation data exceeds a threshold.

According to the present invention, the device determines into which of the groups, into which the registered data has been clustered in advance, the collation data should be classified, and limits the registered data subject to similarity evaluation to that group. The time required for identification can therefore be made shorter than when the collation data is compared with all registered data.
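The group-restricted search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: cosine similarity is assumed as the similarity measure, and the function names `cosine` and `identify`, the record layout, and the threshold value 0.7 are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query, group_id, records, threshold):
    """Return the registrant ID of the best match inside one group, or None.

    records: iterable of (registrant_id, group_id, feature_vector).
    Only records whose group equals `group_id` are scored.
    """
    best_id, best_sim = None, threshold
    for reg_id, gid, vec in records:
        if gid != group_id:
            continue  # registered data outside the target group is never scored
        sim = cosine(query, vec)
        if sim > best_sim:  # must exceed the threshold and the best score so far
            best_id, best_sim = reg_id, sim
    return best_id

# Hypothetical toy data: two registrants in group 0, one in group 1.
records = [
    ("A", 0, np.array([1.0, 0.0])),
    ("B", 0, np.array([0.8, 0.6])),
    ("C", 1, np.array([0.0, 1.0])),
]
result = identify(np.array([0.9, 0.1]), 0, records, threshold=0.7)
missed = identify(np.array([0.0, 1.0]), 0, records, threshold=0.7)
```

In this toy data the second query coincides with registrant C, yet `missed` is `None` because C lies outside group 0. The shorter search therefore relies on the group determination being correct.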

Preferably, the storage unit stores a phoneme string, syllable string, or character string indicating the utterance content of each item of registered data in association with that registered data; the speaker identification device further comprises a voice recognition unit that extracts, from the input voice data, a phoneme string, syllable string, or character string indicating the utterance content; and the speaker identification unit determines, as the speaker of the collation data, the registered person assigned the identification code associated with the registered data that, among the registered data classified into the group into which the collation data should be classified and having the same utterance content as the collation data, has the highest similarity to the collation data and whose similarity to the collation data exceeds the threshold.

In that case, the comparison targets are limited to the registered data that, within the group into which the collation data should be classified, has the same utterance content as the collation data, so the time required for identification can be shortened further. Moreover, because comparison is restricted to registered data with the same utterance content, the possibility of misidentification can be reduced.
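The additional narrowing by utterance content can be illustrated with a short sketch. The dictionary keys, the function name `select_candidates`, and the sample utterances are hypothetical; the sketch only shows the candidate selection, after which similarity scoring would run on the remaining records.

```python
def select_candidates(records, group_id, utterance):
    """Keep only records in the target group whose utterance content matches."""
    return [
        r for r in records
        if r["group_id"] == group_id and r["utterance"] == utterance
    ]

# Hypothetical registered data: the utterance is stored as a character string.
records = [
    {"registrant_id": "A", "group_id": 0, "utterance": "open sesame"},
    {"registrant_id": "A", "group_id": 0, "utterance": "good morning"},
    {"registrant_id": "B", "group_id": 0, "utterance": "open sesame"},
    {"registrant_id": "C", "group_id": 1, "utterance": "open sesame"},
]
candidates = select_candidates(records, group_id=0, utterance="open sesame")
candidate_ids = [r["registrant_id"] for r in candidates]
```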

Preferably, the classification determination unit includes a trained neural network model obtained by machine learning that uses, as training data, the voice feature data of the registered persons classified into the groups by the clustering.

When the classification is determined by a trained neural network model, growth in the time required for identification can be suppressed even if the number of groups into which the registered data is classified increases.

A speaker identification method according to a second aspect of the present invention comprises: a feature extraction step of extracting collation data, which is voice feature data, from input voice data; a classification determination step of determining into which group the collation data should be classified, among groups obtained by clustering registered data, which is the voice feature data of each of all registered persons, by similarity among the registered data; and a speaker identification step of determining, as the speaker of the collation data, the registered person assigned the identification code associated with the registered data that, among the registered data stored classified into the clustered groups and in association with identification codes assigned to the respective registered persons, is classified into the group into which the collation data should be classified, has the highest similarity to the collation data, and whose similarity to the collation data exceeds a threshold.

A program according to a third aspect of the present invention causes a computer to function as: a storage unit that stores registered data, which is the voice feature data of each of all registered persons, classified into groups clustered by similarity among the registered data and associated with identification codes assigned to the respective registered persons; a feature extraction unit that extracts collation data, which is the voice feature data to be identified, from input voice data; a classification determination unit that determines into which of the groups the collation data should be classified; and a speaker identification unit that determines, as the speaker of the collation data, the registered person corresponding to the identification code of the registered data that, among the registered data classified into the group into which the collation data should be classified, has the highest similarity to the collation data and whose similarity to the collation data exceeds a threshold.

According to the present invention, the time required for identification can be shortened even when many identification targets are registered for speaker identification.

Block diagram showing the configuration of a speaker identification device according to an embodiment of the present invention
Block diagram showing the configuration of a voice registration device according to the embodiment
Diagram showing an example of a voice feature database according to the embodiment
Diagram showing an example of the neural network of the classification determination unit according to the embodiment
Flowchart showing an example of the speaker identification operation according to the embodiment
Block diagram of a speaker identification device according to a modification of the embodiment
Flowchart showing an example of the speaker identification operation according to the modification
Block diagram showing an example of the hardware configuration of the speaker identification device according to the embodiment

Embodiments of the present invention are described in detail below with reference to the drawings. In the drawings, identical or corresponding parts are given the same reference numerals.

Embodiment.
FIG. 1 is a block diagram showing the configuration of a speaker identification device according to an embodiment of the present invention. The speaker identification device 1 compares collation data, which is voice feature data extracted from speech input through the microphone 21, with registered data, which is voice feature data stored in the storage unit 15, and identifies which of the registered persons the speaker is. The speaker identification device 1 includes a voice input unit 10, a feature extraction unit 11, a voice recognition unit 12, a classification determination unit 13, a data extraction unit 14, a storage unit 15, and a speaker identification unit 16. The registered data is classified into groups clustered by similarity among the registered data and is stored in the storage unit 15 as a voice feature database in which it is associated with the identification code assigned to each registered person. Each item of registered data is associated with a phoneme string, syllable string, or character string indicating its utterance content.

(Voice registration)
Prior to speaker identification, registered data, which is the voice feature data of the registered persons, is stored in advance in the storage unit 15 of the speaker identification device 1 by a voice registration device. FIG. 2 is a block diagram showing the configuration of the voice registration device according to the embodiment. The voice registration device 2 includes a voice input unit 10, a feature extraction unit 11, a voice recognition unit 12, a clustering unit 17, an input unit 18, a data registration unit 19, and a storage unit 15. The same physical device may serve as both the speaker identification device 1 and the voice registration device 2. The voice input unit 10, feature extraction unit 11, voice recognition unit 12, and storage unit 15 are the same in the speaker identification device 1 and the voice registration device 2; when the two are the same device, these units are shared.

The voice input unit 10 samples the voice signal input by the registered person through the microphone 21 at a predetermined frequency and A-D converts it to generate voice data. The voice input unit 10 sends the voice data to the feature extraction unit 11 and the voice recognition unit 12.

The feature extraction unit 11 extracts registered data, which is voice feature data, from the voice data. The voice feature data is, for example, a GMM supervector (Gaussian Mixture Model supervector), an i-vector, or a speaker information representation based on tensor decomposition.

A GMM supervector (GMM-SV) is a feature obtained by modeling speech with a Gaussian mixture model (GMM) and concatenating the mean vectors of the Gaussian components of the GMM into a single vector. A GMM is a probability distribution expressed as a weighted linear sum of multiple Gaussian distributions (W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support Vector Machines using GMM Supervectors for Speaker Verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006).
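As a small illustration of the concatenation step only, the sketch below builds a supervector from a matrix of component means. The mean values are made up, and fitting or adapting the GMM to an utterance is assumed to have been done elsewhere.

```python
import numpy as np

# Hypothetical GMM with 3 Gaussian components over 2-dimensional features.
# Row c holds the mean vector of component c (values are illustrative only).
means = np.array([
    [0.0, 1.0],
    [2.0, 3.0],
    [4.0, 5.0],
])

# The supervector is the component means laid end to end in one vector.
supervector = means.reshape(-1)
```

With C components and D-dimensional features the supervector has C x D elements, here 6.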

An i-vector is a feature obtained by reducing the dimensionality of a GMM supervector (GMM-SV) by factor analysis. The GMM-SV M extracted from a single utterance is decomposed as M = m + Tw, where m is the GMM-SV of a universal background model (UBM) that is independent of speaker and language, and T is a projection matrix onto a low-dimensional space that models the variability of speech caused by changes in utterance content, speaker, and recording environment. The vector w is the i-vector (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011).
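The decomposition M = m + Tw can be made concrete with a toy numerical sketch. It is a deliberate simplification: practical i-vector extractors estimate w as a posterior mean from Baum-Welch statistics, whereas this example assumes M, m, and T are given and recovers w by ordinary least squares. All sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim, iv_dim = 12, 3                   # toy sizes; real systems are far larger

m = rng.normal(size=sv_dim)              # UBM supervector m
T = rng.normal(size=(sv_dim, iv_dim))    # projection matrix T
w_true = np.array([0.5, -1.0, 2.0])      # low-dimensional vector w
M = m + T @ w_true                       # utterance supervector M = m + Tw

# Simplified recovery of w: solve T w = M - m in the least-squares sense.
w_est, *_ = np.linalg.lstsq(T, M - m, rcond=None)
```

The point of the sketch is the dimensionality: a 12-element supervector is summarized by a 3-element vector.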

The speaker information representation based on tensor decomposition extends the i-vector approach. A single utterance is represented by a matrix whose rows and columns correspond to the GMM components and the mean vectors, respectively; the matrices for many speakers are treated as a tensor, and speaker information is represented by applying tensor analysis (Chin Tuan Tou, Daisuke Saito, Nobuaki Minematsu, and Keikichi Hirose, "Examination of Speaker Identification Using Speaker Information Expression Based on Tensor Decomposition," Proceedings of the Spring Meeting of the Acoustical Society of Japan, pp. 217-220, 2005).

Here, the case where i-vectors are used as the voice feature data is described as an example. The feature extraction unit 11 sends the extracted registered data, which is an i-vector, to the clustering unit 17 and the data registration unit 19.

The voice recognition unit 12 extracts acoustic features from the voice data, searches for the utterance content that best matches those features, and extracts a phoneme string, syllable string, or character string indicating the utterance content. The voice recognition unit 12 sends the obtained phoneme string, syllable string, or character string to the data registration unit 19.

The input unit 18 accepts input of an identification code that identifies the registered person who has input speech through the microphone 21, and sends the input identification code to the data registration unit 19. The input unit 18 includes, for example, a keyboard, a pointing device such as a mouse or touch panel together with a display, an IC card reader, a barcode reader, or a two-dimensional code reader. The identification code is a code entered by keyboard operation, selected with a pointing device, or read from an IC card, barcode, or two-dimensional code. Alternatively, the input unit 18 may read biometric information such as a fingerprint, an iris, or the veins of a finger or hand, collate it with biometric information registered in advance, and send the identification code of the matched registered person to the data registration unit 19.

The data registration unit 19 associates the registered data, which is the voice feature data sent from the feature extraction unit 11, with the identification code of the registered person sent from the input unit 18 and the phoneme string, syllable string, or character string sent from the voice recognition unit 12, and stores them in the storage unit 15.

After the voice feature data of all registered persons has been stored in the storage unit 15, the clustering unit 17 clusters the registered data, which is the voice feature data of each registered person, by similarity among the registered data and classifies it into groups. The similarity is, for example, cosine similarity or a log-likelihood ratio. For clustering, the k-means++ method, the k-means method, or Ward's method can be used, for example.
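One way to cluster feature vectors by cosine similarity is sketched below. It is an assumption-laden illustration rather than the method of the embodiment: it uses a plain k-means loop on length-normalized vectors (often called spherical k-means) with random initialization instead of k-means++, and the function names, iteration count, and toy data are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cluster_by_cosine(features, k, iters=20, seed=0):
    """Assign each feature vector to one of k groups by cosine similarity."""
    x = normalize(np.asarray(features, dtype=float))
    rng = np.random.default_rng(seed)
    # Start from k distinct registered vectors chosen at random.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centroids.T, axis=1)  # most similar centroid
        for j in range(k):
            members = x[labels == j]
            if len(members):                         # keep old centroid if empty
                centroids[j] = members.mean(axis=0)
        centroids = normalize(centroids)
    labels = np.argmax(x @ centroids.T, axis=1)      # final assignment
    return labels, centroids

# Toy registered data: two vectors pointing along each axis.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = cluster_by_cosine(features, k=2)
```

The returned `labels` play the role of the group numbers that are stored with each item of registered data, and `centroids` can later be reused to decide the group of new collation data.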

The clustering unit 17 sends the number of the group into which each item of registered data is classified to the data registration unit 19. The data registration unit 19 associates each item of registered data with its assigned group number and stores it in the storage unit 15. That is, the registered data is classified into groups clustered by similarity among the registered data and is stored in association with the identification code assigned to each registered person and the utterance content of the registered data.

FIG. 3 is a diagram showing an example of the voice feature database according to the embodiment. The voice feature database is stored in the storage unit 15. Each record of the voice feature database consists of a registrant ID, a group ID, utterance content, and voice feature data. The registrant ID is the identification code of the registered person. The group ID is the number of the group into which the item of registered data is classified. The utterance content is the phoneme string, syllable string, or character string representing what was uttered in the registered data. The voice feature data field may hold the voice feature data itself or a pointer to a file or the like in which the voice feature data is stored.
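The record layout described above could be modeled as in the following sketch. The class name, field names, and sample values are hypothetical; the four fields mirror the registrant ID, group ID, utterance content, and voice feature data of the database record.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceFeatureRecord:
    registrant_id: str     # identification code of the registered person
    group_id: int          # number of the group assigned by clustering
    utterance: str         # phoneme, syllable, or character string
    feature: List[float]   # voice feature data (could instead be a file path)

# One registrant may have several records with different utterance content.
database = [
    VoiceFeatureRecord("U001", 2, "ohayou", [0.12, -0.40, 0.88]),
    VoiceFeatureRecord("U001", 2, "konnichiwa", [0.10, -0.35, 0.91]),
    VoiceFeatureRecord("U002", 5, "ohayou", [-0.70, 0.22, 0.05]),
]
ids_for_ohayou = [r.registrant_id for r in database if r.utterance == "ohayou"]
```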

The voice feature database may contain registered data with different registrant IDs and the same utterance content. It may also contain multiple items of registered data with different utterance content for the same registered person, that is, with the same registrant ID. Registered data sharing a registrant ID normally also shares a group ID.

When the speaker identification device 1 and the voice registration device 2 are separate devices, the voice registration device 2 transfers to the speaker identification device 1 the voice feature database stored in its storage unit 15, in which the registered data of all registered persons is classified into groups and associated with each registered person's identification code and utterance content. When the two are the same device, the storage unit 15 of the voice registration device 2 is used as is as the storage unit 15 of the speaker identification device 1. Alternatively, the voice registration device 2 and the speaker identification device 1 may be configured to access a common storage unit 15. This completes voice registration.

(Speaker identification)
The speaker identification device 1 shown in FIG. 1 compares collation data, which is voice feature data extracted from speech input through the microphone 21, with registered data, which is voice feature data created by the voice registration device 2 and stored in the storage unit 15, and identifies which of the registered persons the speaker is. The voice input unit 10, feature extraction unit 11, and voice recognition unit 12 of the speaker identification device 1, and the voice feature database stored in its storage unit 15, are the same as those of the voice registration device 2.

The voice input unit 10 samples the voice signal input through the microphone 21 by the speaker, who is the person to be identified, at a predetermined frequency and A-D converts it to generate voice data. The voice input unit 10 sends the voice data to the feature extraction unit 11 and the voice recognition unit 12.

The feature extraction unit 11 extracts collation data, which is the voice feature data of the person to be identified, from the voice data. The voice feature data is, for example, a GMM (Gaussian Mixture Model) supervector, an i-vector, or a speaker information representation based on tensor decomposition. The collation data uses the same type of voice feature data as the registered data. That is, if the registered data is an i-vector, for example, the collation data is also an i-vector.

Here, the case where i-vectors are used as the voice feature data is described as an example. The feature extraction unit 11 sends the extracted collation data, which is an i-vector, to the classification determination unit 13.

The classification determination unit 13 determines into which group the collation data should be classified, among the groups obtained by clustering the registered data, which is the voice feature data stored in the storage unit 15, according to the similarity between the registered data. The registered data has been classified in advance by the voice registration device 2 into groups clustered according to the similarity between the registered data. The similarity is, for example, a cosine similarity or a log-likelihood ratio.

For the classification determination, for example, an artificial neural network (hereinafter simply referred to as a neural network) trained by machine learning with the registered data classified into the groups as training data can be used. Alternatively, the classification determination unit 13 may compare the centroid vector of each group with the collation data and determine the group whose centroid has the highest similarity to be the group into which the collation data should be classified. The classification determination unit 13 sends the number of the group into which the collation data should be classified, that is, the most probable group to which the collation data belongs, to the data extraction unit 14.
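The centroid-based alternative can be illustrated with a minimal Python sketch. It assumes cosine similarity, non-zero vectors, and tiny 3-dimensional centroids; the function names `cosine_similarity` and `classify_by_centroid` and all values are hypothetical and do not come from the patent.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_by_centroid(collation, centroids):
    """Return the number of the group whose centroid is most similar
    to the collation data. `centroids` maps group number -> vector."""
    return max(centroids,
               key=lambda g: cosine_similarity(collation, centroids[g]))

# Hypothetical centroids for two groups (real i-vectors are far longer).
centroids = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 1.0]}
group = classify_by_centroid([0.1, 0.9, 0.8], centroids)
```

Note that this approach computes one similarity per group, so its cost grows with the number of groups.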

FIG. 4 is a diagram showing an example of the neural network of the classification determination unit according to the embodiment. The neural network is composed of an input layer, intermediate layers, and an output layer, each containing nodes made up of artificial neurons (hereinafter simply referred to as neurons), and edges that connect the nodes of adjacent layers to each other. The intermediate layers comprise n layers, where n is one or more. Each element xi of the voice feature data is input to the corresponding node i (i = 1...k) of the input layer. In each intermediate layer, the outputs of the preceding layer are combined and the result computed by the activation function is passed to the following layer, until the result is finally output to the output layer. The output layer has as many nodes as the number of groups into which the registered data is clustered, and each node j (j = 1...M) outputs the probability yj that the collation data is classified into group j.

The neural network is trained by machine learning in advance: the registered data classified into the groups is input to the neural network, and each parameter is adjusted by backpropagating the difference between the group into which the registered data is classified and the output of the neural network.
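As a rough illustration of this training step, the following Python sketch fits a classifier by propagating the output error back onto the weights. It is a deliberate simplification and not the patent's network: it assumes no intermediate layer, a softmax output, zero-initialized weights, and groups numbered from 0; the names `train` and `predict`, the learning rate, and the sample data are all hypothetical.

```python
import math

def softmax(z):
    """Convert raw output-layer values into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def train(samples, labels, num_groups, lr=0.5, epochs=200):
    """Fit weights w[j][i] and biases b[j] by per-sample gradient descent.

    With no intermediate layer, backpropagation reduces to applying the
    output error (y_j - target_j) directly to each weight and bias.
    """
    k = len(samples[0])
    w = [[0.0] * k for _ in range(num_groups)]
    b = [0.0] * num_groups
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            y = softmax([sum(wj[i] * x[i] for i in range(k)) + bj
                         for wj, bj in zip(w, b)])
            for j in range(num_groups):
                err = y[j] - (1.0 if j == label else 0.0)  # output minus target
                for i in range(k):
                    w[j][i] -= lr * err * x[i]
                b[j] -= lr * err
    return w, b

def predict(w, b, x):
    """Return the index (0-based) of the group with the largest output."""
    scores = [sum(wi * xi for wi, xi in zip(wj, x)) + bj
              for wj, bj in zip(w, b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical 2-dimensional registered data already clustered into 2 groups.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
w, b = train(samples, labels, num_groups=2)
```

A network with intermediate layers would apply the chain rule through each layer in the same spirit; a practical system would more likely rely on an existing machine learning framework.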

The classification determination unit 13 inputs the collation data to the trained model of the neural network and takes the number of the node with the largest output in the output layer, that is, the number of the most probable group, as the number of the group into which the collation data should be classified. The classification determination unit 13 sends the number of the group into which the collation data should be classified to the data extraction unit 14.
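The inference step can be sketched in Python as a forward pass followed by picking the output node with the largest value. The ReLU activation, the softmax output, and the tiny hand-written weights are assumptions made for illustration; the patent does not specify them, and the names `dense` and `classify` are hypothetical.

```python
import math

def relu(values):
    """Activation function assumed here for the intermediate layers."""
    return [max(0.0, v) for v in values]

def softmax(z):
    """Turn output-layer values into probabilities y_1..y_M."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, biases, x):
    """Combine the previous layer's outputs: one weighted sum per node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def classify(x, intermediate_layers, output_layer):
    """Return (group number from 1 to M, list of probabilities)."""
    h = x
    for weights, biases in intermediate_layers:
        h = relu(dense(weights, biases, h))
    out_weights, out_biases = output_layer
    y = softmax(dense(out_weights, out_biases, h))
    group = max(range(len(y)), key=y.__getitem__) + 1
    return group, y

# Hypothetical network: k = 2 inputs, one intermediate layer, M = 3 groups.
intermediate = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output = ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0])
group, probabilities = classify([0.2, 0.7], intermediate, output)
```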

When the classification of the collation data is determined by the neural network, the amount of computation for the determination is largely determined by the number of nodes in the input layer and the number of intermediate layers, and is not proportional to the number of nodes in the output layer. Therefore, compared with the method of determining the group from the similarity to the centroid of each group, the amount of computation grows less as the number of groups increases. As a result, even if the number of groups increases, the increase in the time required for identification can be suppressed.

The voice recognition unit 12 extracts voice features from the voice data, searches for the utterance content that best matches those voice features, and extracts a phoneme string, a syllable string, or a character string indicating the utterance content. The voice recognition unit 12 sends the obtained phoneme string, syllable string, or character string to the data extraction unit 14.

The data extraction unit 14 reads from the storage unit 15, among the registered data belonging to the group whose number was sent from the classification determination unit 13, the registered data whose utterance content is the same as the utterance content sent from the voice recognition unit 12, and sends it to the speaker identification unit 16.
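The filtering performed by the data extraction unit 14 can be pictured with a small Python sketch. The record layout (`RegisteredData` and its field names) and the in-memory list standing in for the storage unit 15 are hypothetical; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass

@dataclass
class RegisteredData:
    registrant_id: str   # identification code of the registered person
    group: int           # number of the cluster the data belongs to
    utterance: str       # phoneme, syllable, or character string
    features: list       # voice feature data, e.g. an i-vector

def extract(database, group, utterance):
    """Keep only the registered data in the given group whose utterance
    content equals the recognized utterance content."""
    return [r for r in database
            if r.group == group and r.utterance == utterance]

# Hypothetical voice feature database.
database = [
    RegisteredData("ID001", 1, "open", [0.9, 0.1]),
    RegisteredData("ID001", 1, "hello", [0.8, 0.3]),
    RegisteredData("ID002", 1, "open", [0.7, 0.2]),
    RegisteredData("ID003", 2, "open", [0.1, 0.9]),
]
candidates = extract(database, group=1, utterance="open")
```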

The speaker identification unit 16 compares each piece of the registered data sent from the data extraction unit 14 with the collation data and, when the highest similarity exceeds a predetermined threshold, determines that the registered person of the registered data with that highest similarity is the person to be identified, that is, the speaker. In other words, among the registered data that is classified into the group into which the collation data should be classified and whose utterance content is the same as that of the collation data, the unit finds the registered data that has the highest similarity to the collation data and whose similarity exceeds the threshold, and determines that the registered person to whom the identification code associated with that registered data is assigned is the speaker of the collation data.

The type of similarity used to compare the registered data with the collation data is the same as the type of similarity used when the registered data was clustered. For example, when the registered data was clustered by cosine similarity, the similarity between the registered data and the collation data is calculated as a cosine similarity.

Based on the result of determining that the person to be identified is one of the registered persons, a device connected to the speaker identification device 1 can be made to perform an operation permitted for that registered person. For example, it can unlock a door, permit access to information specific to the registered person, or have an AI speaker respond in a manner suited to the registered person. The speaker identification device 1 may be incorporated in, for example, a building security system, a customer information management device, or an AI speaker.

FIG. 5 is a flowchart showing an example of the speaker identification operation according to the embodiment. When voice is input from the speaker who is the person to be identified, the speaker identification device 1 generates voice data from the voice signal (step S10). The feature extraction unit 11 extracts the collation data from the voice data and sends it to the classification determination unit 13 (step S11). The voice recognition unit 12 recognizes the utterance content from the voice data and sends it to the data extraction unit 14 (step S12). The classification determination unit 13 determines the group into which the collation data should be classified and sends it to the data extraction unit 14 (step S13).

The data extraction unit 14 reads from the storage unit 15, among the registered data belonging to the group into which the collation data should be classified, the registered data whose utterance content is the same as that of the collation data, and sends it to the speaker identification unit 16. The speaker identification unit 16 then selects one piece of the read registered data (step S14) and calculates the similarity between the collation data and the selected registered data (step S15). If there is registered data that has not yet been selected (step S16; Y), it again selects one piece of the unselected registered data (step S14) and calculates the similarity (step S15).

When the speaker identification unit 16 has calculated the similarity to the collation data for all the registered data sent from the data extraction unit 14 (step S16; N), it selects the maximum of the calculated similarities (step S17). If the maximum similarity is greater than the threshold (step S18; Y), it determines that the registered person to whom the identification code associated with the registered data corresponding to that similarity is assigned is the speaker of the collation data (step S19). If the maximum is equal to or less than the threshold (step S18; N), it determines that the speaker of the collation data is none of the registered persons (step S20).
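Steps S14 to S20 can be summarized in a short Python sketch. It assumes cosine similarity and non-zero feature vectors; the function name `identify`, the candidate format, and the threshold values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(collation, candidates, threshold):
    """candidates: list of (identification_code, feature_vector) pairs
    already limited to the relevant group and utterance content.
    Returns the identification code of the speaker, or None."""
    best_code, best_sim = None, None
    for code, features in candidates:                  # steps S14 and S16
        sim = cosine_similarity(collation, features)   # step S15
        if best_sim is None or sim > best_sim:         # step S17: keep maximum
            best_code, best_sim = code, sim
    if best_sim is not None and best_sim > threshold:  # step S18
        return best_code                               # step S19
    return None                                        # step S20

# Hypothetical 2-dimensional registered data.
candidates = [("ID001", [1.0, 0.0]), ("ID002", [0.0, 1.0])]
speaker = identify([0.9, 0.1], candidates, threshold=0.8)
```

Returning `None` covers both the case where no similarity exceeds the threshold and the case where the candidate list is empty.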

When the speaker of the collation data is identified as one of the registered persons, the speaker identification device 1 can cause a connected device to perform an operation permitted for that registered person. When it is determined that the speaker of the collation data is none of the registered persons, the person to be identified can be prompted to speak again. The person to be identified can also be prompted to speak again when the registered data stored in the storage unit 15 and belonging to the group into which the collation data should be classified contains no registered data matching the utterance content recognized by the voice recognition unit 12.

The speaker identification unit 16 may, without comparing the maximum similarity with the threshold, simply determine that the registered person to whom the identification code associated with the registered data having the maximum similarity is assigned is the speaker of the collation data. In that case, the threshold can be regarded as the minimum value that the similarity can take.

As described above, the speaker identification device 1 according to the embodiment determines into which of the groups obtained by clustering the registered data in advance the collation data should be classified, and limits the registered data subject to similarity determination to that group, so the time required for identification can be made shorter than when comparing with all the registered data. Furthermore, within the group into which the collation data should be classified, the comparison targets are limited to the registered data whose utterance content is the same as that of the collation data, so the time required for identification can be shortened further. In addition, because the utterance content, which is unrelated to the characteristics of the speaker, is used to restrict the comparison to registered data with the same utterance content, the pieces of registered data being compared are less likely to resemble one another, and the possibility of misidentifying the speaker is reduced.

Note that, when the classification determination unit 13 determines into which group the collation data should be classified, if the largest output of the output layer of the neural network, that is, the maximum probability of classification into a group, is smaller than a reference value, the unit may judge that the collation data is not classified into any group and determine that the speaker of the collation data is none of the registered persons. The reference value in this case may be set according to the number of pieces of registered data and the number of groups. When the maximum probability of classification into a group is smaller than the reference value, the similarity of every piece of registered data in that group is presumed to be smaller than the threshold. In this case, the speaker is judged not to be a registered person without calculating the similarity to the registered data, so the time required for speaker identification can be shortened further.
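This early rejection can be sketched in Python as follows. The function name `classify_or_reject` and the reference value of 0.5 are hypothetical; the patent only states that the reference value may depend on the number of pieces of registered data and the number of groups.

```python
def classify_or_reject(probabilities, reference):
    """probabilities[j - 1] is the output y_j for group j (j = 1..M).
    Returns the group number, or None when even the most probable group
    falls below the reference value (speaker treated as unregistered)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < reference:
        return None          # skip all similarity calculations
    return best + 1

accepted = classify_or_reject([0.1, 0.7, 0.2], reference=0.5)
rejected = classify_or_reject([0.4, 0.3, 0.3], reference=0.5)
```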

Modified example.
FIG. 6 is a block diagram of a speaker identification device according to a modified example of the embodiment. In the modified example, the voice recognition unit 12 is not provided and voice recognition is not performed. The other configurations are the same as in the embodiment.

In the modified example, the data extraction unit 14 reads the registered data belonging to the group whose number was sent from the classification determination unit 13 from the storage unit 15 and sends it to the speaker identification unit 16. The registered data sent to the speaker identification unit 16 includes voice feature data with different utterance contents.

In the modified example, the voice feature database stored in the storage unit 15 may associate utterance content with the registered data, as in the embodiment, or may omit the utterance content. Even when the utterance content is not included, there may be multiple pieces of registered data with different utterance contents for the same registered person, sharing a common registrant ID.

The speaker identification unit 16 compares each piece of the registered data sent from the data extraction unit 14 with the collation data and, when the highest similarity exceeds a predetermined threshold, determines that the registered person of the registered data with that highest similarity is the person to be identified, that is, the speaker. In the modified example, among the registered data classified into the group into which the collation data should be classified, the unit finds the registered data that has the highest similarity to the collation data and whose similarity exceeds the threshold, and determines that the registered person to whom the identification code associated with that registered data is assigned is the speaker of the collation data.

FIG. 7 is a flowchart showing an example of the speaker identification operation according to the modified example. In the modified example, the voice recognition step S12 of the operation of the embodiment in FIG. 5 is omitted. In addition, because the data extraction unit 14 reads the registered data belonging to the group into which the collation data should be classified from the storage unit 15 and sends it to the speaker identification unit 16, the speaker identification unit 16 selects one piece of the registered data belonging to the group into which the collation data should be classified (step S14'). The other operations are the same as in the flowchart of FIG. 5.

In the modified example, voice recognition is not performed, so the collation data is also compared with registered data having different utterance contents. However, because the comparison is limited to the group into which the collation data should be classified, the time required for identification can be made shorter than when comparing with all the registered data. In addition, the processing time is shorter by the amount that voice recognition would have taken.

FIG. 8 is a block diagram showing an example of the hardware configuration of the speaker identification device according to the embodiment. As shown in FIG. 8, the speaker identification device 1 includes a control unit 41, a main storage unit 42, an external storage unit 43, an operation unit 44, a display unit 45, an input/output unit 46, and a transmission/reception unit 47. The main storage unit 42, the external storage unit 43, the operation unit 44, the display unit 45, the input/output unit 46, and the transmission/reception unit 47 are all connected to the control unit 41 via an internal bus 40.

The control unit 41 is composed of a CPU (Central Processing Unit) or the like and, in accordance with the control program 50 stored in the external storage unit 43, executes the processes of the voice input unit 10, the feature extraction unit 11, the voice recognition unit 12, the classification determination unit 13, the data extraction unit 14, the storage unit 15, and the speaker identification unit 16 of the speaker identification device 1.

The main storage unit 42 is composed of a RAM (Random-Access Memory) or the like, loads the control program 50 stored in the external storage unit 43, and is used as the work area of the control unit 41.

The external storage unit 43 is composed of a non-volatile memory such as a flash memory, a hard disk, a DVD-RAM (Digital Versatile Disc Random-Access Memory), or a DVD-RW (Digital Versatile Disc ReWritable). It stores in advance a program for causing the control unit 41 to perform the processing of the speaker identification device 1, supplies data stored by this program to the control unit 41 in accordance with instructions from the control unit 41, and stores data supplied from the control unit 41.

The operation unit 44 is composed of a keyboard, a pointing device such as a mouse, and the like, and an interface device that connects the keyboard, the pointing device, and the like to the internal bus 40. Instructions such as the selection of a voice recognition result are input via the operation unit 44 and supplied to the control unit 41.

The display unit 45 is composed of an LCD (Liquid Crystal Display), an organic EL display, or the like, and displays the speaker identification result, the character string of the recognized speech content, and the like.

The input/output unit 46 is composed of a serial interface or a parallel interface. The microphone 21 is connected to the input/output unit 46 to input the voice signal. A loudspeaker (not shown) is also connected to it to play back, for example, a message prompting the person to be identified to input voice.

The transmission/reception unit 47 is composed of a network termination device or a wireless communication device that connects to a network, and a serial interface or a LAN (Local Area Network) interface that connects to them. The transmission/reception unit 47 exchanges data via the network with, for example, a device that uses the speaker recognition result.

The processes of the voice input unit 10, the feature extraction unit 11, the voice recognition unit 12, the classification determination unit 13, the data extraction unit 14, the storage unit 15, and the speaker identification unit 16 of the speaker identification device 1 shown in FIG. 1 are executed by the control program 50 using the control unit 41, the main storage unit 42, the external storage unit 43, the operation unit 44, the display unit 45, the input/output unit 46, the transmission/reception unit 47, and the like as resources.

The configuration of the speaker identification device 1 described in each embodiment is an example and can be changed and modified as desired. The configurations shown in the embodiments are not exhaustive, and the speaker identification device 1 is not limited to them. For example, as described in the embodiment, the speaker identification device 1 and the voice registration device 2 may share the same device. Alternatively, the storage unit 15 may be placed on a network and accessed from the speaker identification device 1 and the voice registration device 2 via the network.

In addition, the hardware configuration and flowcharts described above are examples and can be changed and modified as desired.

The central part that performs the speaker identification processing of the speaker identification device 1, which is composed of the voice input unit 10, the feature extraction unit 11, the voice recognition unit 12, the classification determination unit 13, the data extraction unit 14, the storage unit 15, the speaker identification unit 16, and the like, can be realized with an ordinary computer system rather than a dedicated system. For example, the speaker identification device 1 that executes the above processing may be configured by storing a computer program for executing the above operations on a computer-readable recording medium (USB memory, CD-ROM, DVD-ROM, etc.), distributing it, and installing the computer program on a computer. Alternatively, the speaker identification device 1 may be configured by storing the computer program in a storage device of a server device on a communication network such as the Internet and having an ordinary computer system download it.

When the speaker identification device 1 is realized by dividing functions between an OS (operating system) and an application program, or by cooperation between the OS and the application program, only the application program part may be stored on the recording medium or in the storage device.

It is also possible to superimpose the computer program on a carrier wave and distribute it via a communication network. For example, the computer program may be posted on a bulletin board system (BBS) on a communication network and distributed via the network. The above processing may then be executed by starting this computer program and running it under the control of the OS in the same way as other application programs.

1 Speaker identification device
2 Voice registration device
10 Voice input unit
11 Feature extraction unit
12 Voice recognition unit
13 Classification determination unit
14 Data extraction unit
15 Storage unit
16 Speaker identification unit
17 Clustering unit
18 Input unit
19 Data registration unit
21 Microphone
40 Internal bus
41 Control unit
42 Main storage unit
43 External storage unit
44 Operation unit
45 Display unit
46 Input/output unit
47 Transmission/reception unit
50 Control program

Claims

A speaker identification device comprising:
a storage unit that classifies registered data, which is the voice feature data of all registered persons, into groups clustered according to the similarity between the registered data, and stores the registered data in association with the identification code assigned to each of the registered persons;
a feature extraction unit that extracts collation data, which is voice feature data, from input voice data;
a classification determination unit that determines into which of the groups the collation data should be classified; and
a speaker identification unit that determines, as the speaker of the collation data, the registered person to whom is assigned the identification code associated with the registered data that, among the registered data classified into the group into which the collation data should be classified, has the highest similarity to the collation data and whose similarity to the collation data exceeds a threshold.

The speaker identification device according to claim 1, wherein
the storage unit stores a phoneme string, a syllable string, or a character string indicating the utterance content of each piece of the registered data in association with that registered data,
the speaker identification device further comprises a voice recognition unit that extracts a phoneme string, a syllable string, or a character string indicating the utterance content from the input voice data, and
the speaker identification unit determines, as the speaker of the collation data, the registered person to whom is assigned the identification code associated with the registered data that, among the registered data classified into the group into which the collation data should be classified and having the same utterance content as the collation data, has the highest similarity to the collation data and whose similarity to the collation data exceeds the threshold.

The speaker identification device according to claim 1 or 2, wherein the classification determination unit includes a trained model of a neural network that has been machine-learned using, as training data, the voice feature data of the registered persons classified into the groups by the clustering.

A speaker identification method comprising:
a feature extraction step of extracting collation data, which is voice feature data, from input voice data;
a classification determination step of determining into which group the collation data should be classified, among groups obtained by clustering registered data, which is the voice feature data of all registered persons, according to the similarity between the registered data; and
a speaker identification step of determining, as the speaker of the collation data, the registered person to whom is assigned the identification code associated with the registered data that, among the registered data classified into the clustered groups and stored in association with the identification code assigned to each of the registered persons, is classified into the group into which the collation data should be classified, has the highest similarity to the collation data, and whose similarity to the collation data exceeds a threshold.

A program that causes a computer to function as:
a storage unit that classifies registered data, which is the voice feature data of all registered persons, into groups clustered according to the similarity between the registered data, and stores the registered data in association with the identification code assigned to each of the registered persons;
a feature extraction unit that extracts, from input voice data, collation data that is the voice feature data to be identified;
a classification determination unit that determines into which of the groups the collation data should be classified; and
a speaker identification unit that determines, as the speaker of the collation data, the registered person corresponding to the identification code of the registered data that, among the registered data classified into the group into which the collation data should be classified, has the highest similarity to the collation data and whose similarity to the collation data exceeds a threshold.