JP2017097188A

JP2017097188A - Speaker-likeness evaluation device, speaker identification device, speaker collation device, speaker-likeness evaluation method, and program

Info

Publication number: JP2017097188A
Application number: JP2015229670A
Authority: JP
Inventors: 厚志安藤; Atsushi Ando; 太一浅見; Taichi Asami; 義和山口; Yoshikazu Yamaguchi
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-25
Filing date: 2015-11-25
Publication date: 2017-06-01

Abstract

PROBLEM TO BE SOLVED: To provide a speaker-likeness evaluation device that calculates a similarity between speaker feature vectors with which a speaker can be recognized robustly even when the registered utterance and the recognition utterance differ greatly in length.

SOLUTION: The speaker-likeness evaluation device includes: a divided registered utterance speaker feature vector calculation unit 1001 that generates, from a registered utterance, a divided registered utterance covering a section shorter than the section of the registered utterance, and calculates a divided registered utterance speaker feature vector from the divided registered utterance; a recognition utterance speaker feature vector calculation unit 1002 that calculates a recognition utterance speaker feature vector from a recognition utterance; and a similarity calculation unit 150 that calculates the similarity between the recognition utterance and the divided registered utterance using the recognition utterance speaker feature vector and the divided registered utterance speaker feature vector.

SELECTED DRAWING: Figure 10

Description

The present invention relates to speaker recognition by voice, and more particularly to a technique that recognizes a speaker using the similarity between speaker feature vectors.

Speaker recognition by voice (hereinafter simply "speaker recognition") is broadly divided into speaker identification and speaker verification. Speaker identification determines which of the pre-registered speakers produced an input utterance, and is used, for example, to search voice recordings for the voice of a criminal. Speaker verification determines whether the speaker of an input utterance is a pre-registered speaker, and is used, for example, for identity verification. The two can also be combined: the system first determines whether the speaker of the input utterance is among the pre-registered speakers (verification) and, if so, determines which registered speaker it is (identification). In either case, one or more utterances must be registered in the system in advance for each speaker.

Speaker recognition is either text-dependent or text-independent. In the text-dependent type, the user must utter predetermined words at recognition time. In the text-independent type, the user may utter any words at recognition time.

Speaker recognition uses a technique that calculates a single speaker feature vector from the entire input speech signal. The speaker feature vector is obtained as follows. The input speech signal is divided into acoustic analysis frames of several tens of milliseconds, and the acoustic feature vectors extracted from the frames are arranged in time order to form an acoustic feature vector sequence. From this sequence, one speaker feature vector is obtained using a pre-trained speaker feature extraction model together with the zeroth-order and first-order statistics calculated against a predetermined Gaussian mixture model (the UBM model). These procedures are disclosed in, for example, Non-Patent Document 1. The speaker feature vector has two main advantages: the speaker feature extraction model is easy to train because no speaker labels are needed for the individual speech signals used in pre-training, and because a single speaker feature vector is calculated from any speech signal, it can be used in text-independent speaker recognition, where the words spoken and their length vary. For these reasons it is widely used in speaker recognition.
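The framing and statistics steps described above can be sketched as follows. This is only an illustration under stated assumptions: the UBM is taken to be a diagonal-covariance Gaussian mixture, the function names `frame_signal` and `baum_welch_stats` are hypothetical, and the extraction of acoustic features (for example MFCCs) from each frame as well as the final i-vector estimation of Non-Patent Document 1 are omitted.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping analysis frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])


def baum_welch_stats(features, weights, means, variances):
    """Zeroth- and first-order statistics of a feature sequence under a UBM.

    features:  (T, D) acoustic feature vectors in time order
    weights:   (C,)   mixture weights of the UBM (assumed diagonal-covariance GMM)
    means:     (C, D) component means
    variances: (C, D) component variances
    """
    # Log density of every frame under every Gaussian component: shape (T, C).
    diff = features[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    # Component posteriors per frame, normalized in the log domain for stability.
    log_post = np.log(weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)

    zeroth = post.sum(axis=0)   # (C,)   soft frame count per component
    first = post.T @ features   # (C, D) posterior-weighted feature sums
    return zeroth, first
```

With a 16 kHz signal, `frame_signal(signal, 400, 160)` would give 25 ms frames with a 10 ms shift, which is one common choice within the "several tens of milliseconds" range; the source does not fix these values.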

Once a speaker feature vector can be calculated from each input utterance, speaker recognition is easily realized with existing classification or outlier detection techniques. For example, speaker identification can be realized by computing the cosine similarity between each registered speaker feature vector and the speaker feature vector of the input utterance, and returning the speaker name of the registered vector with the highest similarity. Speaker verification can be realized by computing the same cosine similarities and judging the speaker to be a registered speaker if the maximum similarity is greater than or equal to a threshold.

An outline of a prior-art speaker identification device is described below with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of a prior-art speaker identification device 800. FIG. 2 is a flowchart showing the operation of the speaker identification device 800. FIG. 3 shows an example of how the speaker identification device 800 estimates the speaker of an identification utterance. As shown in FIG. 1, the speaker identification device 800 includes an acoustic analysis unit 820-1, a speaker feature vector calculation unit 830-1, a speaker registration unit 840, a registered utterance recording unit 803, an acoustic analysis unit 820-2, a speaker feature vector calculation unit 830-2, a similarity calculation unit 850, and a speaker identification unit 870. The speaker identification device 800 is connected to a UBM model recording unit 801 and a speaker feature extraction model recording unit 802.

The i-vector described in Non-Patent Document 1 is used as the speaker feature vector. As described there, calculating an i-vector requires a UBM model and a speaker feature extraction model. These models are trained in advance and recorded in the UBM model recording unit 801 and the speaker feature extraction model recording unit 802, respectively. The UBM model and the speaker feature extraction model correspond to T and Σ of Non-Patent Document 1, respectively.

First, the process of recording registered utterances in advance is described. A registered utterance is the speech signal of an utterance by a speaker who is to be registered in advance, and is used to identify the speaker of an identification utterance. There are one or more registered utterances, and together they are called the registered utterance set. The acoustic analysis unit 820-1 calculates the acoustic feature vector sequence of a registered utterance from the registered utterance (S820-1). As described above, the sequence is obtained by dividing the registered utterance into acoustic analysis frames of several tens of milliseconds and arranging the acoustic feature vectors extracted from the frames in time order. The speaker feature vector calculation unit 830-1 calculates the speaker feature vector of the registered utterance from the acoustic feature vector sequence output by the acoustic analysis unit 820-1, using the UBM model and the speaker feature extraction model recorded in the UBM model recording unit 801 and the speaker feature extraction model recording unit 802 (S830-1). The calculation procedure is as described above. The speaker registration unit 840 pairs the speaker feature vector output by the speaker feature vector calculation unit 830-1 with the speaker name corresponding to the registered utterance and registers the pair in the registered utterance database of the registered utterance recording unit 803 (S840). The speaker name is information that identifies the speaker of the registered utterance and is supplied manually.

Next, the process of identifying an identification utterance is described. An identification utterance is the speech signal of an utterance by the speaker to be identified. The acoustic analysis unit 820-2 calculates the acoustic feature vector sequence of the identification utterance from the identification utterance (S820-2). The calculation procedure is the same as in S820-1. The speaker feature vector calculation unit 830-2 calculates the speaker feature vector of the identification utterance from the acoustic feature vector sequence output by the acoustic analysis unit 820-2, using the UBM model and the speaker feature extraction model recorded in the UBM model recording unit 801 and the speaker feature extraction model recording unit 802 (S830-2). The calculation procedure is the same as in S830-1. The similarity calculation unit 850 calculates the similarity between the speaker feature vector of the identification utterance output by the speaker feature vector calculation unit 830-2 and the speaker feature vector of each registered utterance in the registered utterance database (S850). The cosine similarity described in Non-Patent Document 1, for example, may be used. Letting w_1 and w_2 be the speaker feature vector of the identification utterance and the speaker feature vector of a registered utterance, respectively, the cosine similarity c is given by the following equation.

$$c = \frac{w_1^{t} w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}$$

Here, ^t denotes transposition. The similarity thus serves as an index of speaker likeness: when the cosine similarity is used, the larger the similarity, the more the utterance resembles the speaker being compared. The similarity calculation unit 850 outputs each calculated similarity paired with the speaker name of the corresponding registered utterance. The speaker identification unit 870 selects the speaker name of the registered utterance whose similarity is the largest among those output by the similarity calculation unit 850, and outputs it as the identification result (S870).
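As an illustration of steps S850 and S870, the following minimal sketch computes the cosine similarity between an identification-utterance vector and every registered vector and returns the best-scoring speaker name. The names `cosine_similarity` and `identify`, and the representation of the registered utterance database as a list of (speaker name, vector) pairs, are assumptions made for this example; the speaker feature vectors are taken as already computed.

```python
import numpy as np


def cosine_similarity(w1, w2):
    """Cosine similarity between two speaker feature vectors."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))


def identify(probe, database):
    """Return (speaker_name, similarity) of the most similar registered utterance.

    probe:    speaker feature vector of the identification utterance
    database: list of (speaker_name, speaker_feature_vector) pairs
    """
    scored = [(cosine_similarity(probe, vec), name) for name, vec in database]
    best_score, best_name = max(scored, key=lambda pair: pair[0])
    return best_name, best_score
```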

Next, an outline of a prior-art speaker verification device is described with reference to FIGS. 4 to 6. Components that have the same function as components of the speaker identification device 800 are given the same reference numbers, and duplicate description is omitted. FIG. 4 is a block diagram showing the configuration of a prior-art speaker verification device 900. FIG. 5 is a flowchart showing the operation of the speaker verification device 900. FIG. 6 shows an example of how the speaker verification device 900 verifies a verification utterance. As shown in FIG. 4, the speaker verification device 900 includes an acoustic analysis unit 820-1, a speaker feature vector calculation unit 830-1, a speaker registration unit 940, a registered utterance recording unit 903, an acoustic analysis unit 820-2, a speaker feature vector calculation unit 830-2, a similarity calculation unit 950, and a speaker verification unit 970. Like the speaker identification device 800, the speaker verification device 900 is connected to the UBM model recording unit 801 and the speaker feature extraction model recording unit 802.

First, the process of recording registered utterances in advance is described. The acoustic analysis unit 820-1 and the speaker feature vector calculation unit 830-1 perform the same processing as in the speaker identification device 800 (S820-1, S830-1). The speaker registration unit 940 registers the speaker feature vector of the registered utterance, output by the speaker feature vector calculation unit 830-1, in the registered utterance database of the registered utterance recording unit 903 (S940).

Next, the process of verifying a verification utterance is described. A verification utterance is the speech signal of an utterance by the speaker to be verified, and is checked against the registered utterances. The acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 also perform the same processing as in the speaker identification device 800 (S820-2, S830-2). The similarity calculation unit 950 calculates the similarity between the speaker feature vector of the verification utterance output by the speaker feature vector calculation unit 830-2 and the speaker feature vector of each registered utterance in the registered utterance database (S950). Unlike the similarity calculation unit 850, the similarity calculation unit 950 outputs only the similarities. When the largest of the similarities output by the similarity calculation unit 950 is greater than a threshold (or greater than or equal to the threshold), the speaker verification unit 970 generates and outputs a verification result stating that the speaker is the speaker of a registered utterance (S970). Alternatively, each similarity output by the similarity calculation unit 950 may be compared with the threshold, and the same verification result generated and output if at least one similarity is greater than (or greater than or equal to) the threshold (S970).
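The decision in S970 can be sketched as a threshold test on the maximum similarity. The function name `verify` and the use of a plain list of registered vectors are assumptions for this example, and the threshold value is left to the caller because the source does not specify one.

```python
import numpy as np


def cosine_similarity(w1, w2):
    """Cosine similarity between two speaker feature vectors."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))


def verify(probe, registered_vectors, threshold):
    """Accept the speaker if the best similarity reaches the threshold.

    probe:              speaker feature vector of the verification utterance
    registered_vectors: speaker feature vectors of the registered utterances
    threshold:          decision threshold (value chosen by the caller)
    """
    scores = [cosine_similarity(probe, vec) for vec in registered_vectors]
    return max(scores) >= threshold
```

The alternative described in the text, comparing every similarity with the threshold and accepting if any one passes, yields the same decision, since some similarity reaches the threshold exactly when the maximum does.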

Non-Patent Document 1: Tetsuji Ogawa, Sayaka Shiota, "Speaker Recognition Using i-vector", Journal of the Acoustical Society of Japan, Vol. 70, No. 6, pp. 332-339, June 2014.

The speaker feature vector is known to carry speaker information more strongly as the utterance becomes longer. Because of this property, even for speech from the same speaker, the similarity between speaker feature vectors tends to decrease as the difference in utterance length increases.

In speaker recognition, the registered utterance is often a read sentence, whereas the recognition utterance (identification utterance or verification utterance) is often a read word. The registered utterance is therefore long, for example 10 seconds or more, while the recognition utterance is short, for example 1.5 seconds or less, so the recognition utterance is extremely short compared with the registered utterance. This difference in utterance length lowers the similarity between the speaker feature vectors of the two utterances. When speaker recognition is performed with such speaker feature vectors, the similarity is low for every speaker, differences in similarity between speakers become hard to observe, and recognition accuracy falls.

Accordingly, an object of the present invention is to provide a speaker-likeness evaluation device that calculates a similarity between speaker feature vectors with which a speaker can be recognized robustly even when the registered utterance and the recognition utterance differ extremely in length.

One aspect of the present invention is a speaker-likeness evaluation device in which the speech of a speaker who is the target of speaker identification or speaker verification is called a recognition utterance and the speech used to recognize that speaker is called a registered utterance, the device comprising: a divided registered utterance speaker feature vector calculation unit that generates, from the registered utterance, a divided registered utterance covering a section shorter than the section of the registered utterance and calculates a divided registered utterance speaker feature vector from the divided registered utterance; a recognition utterance speaker feature vector calculation unit that calculates a recognition utterance speaker feature vector from the recognition utterance; and a similarity calculation unit that calculates the similarity between the recognition utterance and the divided registered utterance using the recognition utterance speaker feature vector and the divided registered utterance speaker feature vector.

According to the present invention, a similarity between speaker feature vectors that allows a speaker to be recognized accurately can be calculated even when the registered utterance and the recognition utterance differ extremely in length.

FIG. 1 is a block diagram showing the configuration of the prior-art speaker identification device 800.
FIG. 2 is a flowchart showing the operation of the prior-art speaker identification device 800.
FIG. 3 shows an example of how the prior-art speaker identification device 800 estimates the speaker of an identification utterance.
FIG. 4 is a block diagram showing the configuration of the prior-art speaker verification device 900.
FIG. 5 is a flowchart showing the operation of the prior-art speaker verification device 900.
FIG. 6 shows an example of how the prior-art speaker verification device 900 verifies a verification utterance.
FIG. 7 is a block diagram showing the configuration of the speaker identification device 100 of Example 1.
FIG. 8 is a flowchart showing the detailed operation of the speaker identification device 100 of Example 1.
FIG. 9 shows an example of how the speaker identification device 100 of Example 1 estimates the speaker of an identification utterance.
FIG. 10 is a block diagram showing the configuration of the speaker identification device 200 of Example 2.
FIG. 11 is a flowchart showing the detailed operation of the speaker identification device 200 of Example 2.
FIG. 12 is a block diagram showing the configuration of the speaker identification device 300 of Example 3.
FIG. 13 is a flowchart showing the detailed operation of the speaker identification device 300 of Example 3.
FIG. 14 is a block diagram showing the configuration of the speaker identification device 300' of Example 3 (modification).
FIG. 15 is a block diagram showing the configuration of the speaker verification device 400 of Example 4.
FIG. 16 is a flowchart showing the detailed operation of the speaker verification device 400 of Example 4.
FIG. 17 shows an example of how the speaker verification device 400 of Example 4 verifies a verification utterance.
FIG. 18 is a block diagram showing the configuration of the speaker verification device 500 of Example 5.
FIG. 19 is a flowchart showing the detailed operation of the speaker verification device 500 of Example 5.

Embodiments of the present invention are described in detail below. Components that have the same function are given the same reference numbers, and duplicate description is omitted.

In the following description as well, an identification utterance or a verification utterance is called a recognition utterance, and a set whose elements are one or more registered utterances is called a registered utterance set.

<Key points of the present invention>
In addition to carrying speaker information more strongly as the utterance becomes longer, the speaker feature vector is known to depend not only on speaker information but also on the words contained in the utterance. Because of this property, even between utterances of similar length, the similarity between speaker feature vectors tends to decrease as the words in the utterances differ more. For example, the speaker feature vectors of the utterances "otona" (adult) and "otoko" (man) tend to have high similarity, whereas those of "otona" (adult) and "kodomo" (child) tend to have low similarity.

The present invention addresses the drop in similarity caused by these properties of the speaker feature vector with two points: (1) at registration, the registered utterance is divided so that each piece has about the same length as a recognition utterance, and a speaker feature vector is calculated for each resulting divided registered utterance; (2) at recognition, the criterion is not the maximum similarity between speaker feature vectors but the maximum of the per-speaker average similarities. Point (1) prevents the drop in similarity caused by a difference in utterance length. Point (2) amounts to judging the speaker by taking into account all similarities between the recognition utterance and the registered utterances, which contain a variety of words, and thereby reduces the effect on speaker recognition of similarity changes caused by differences in the words contained in the registered utterances.

The speaker identification device of Example 1 is described below with reference to FIGS. 7 to 9. FIG. 7 is a block diagram showing the configuration of the speaker identification device 100 of Example 1. FIG. 8 is a flowchart showing the operation of the speaker identification device 100 of Example 1. FIG. 9 shows how the speaker identification device 100 estimates the speaker of an identification utterance. As shown in FIG. 7, the speaker identification device 100 includes an utterance division unit 110, an acoustic analysis unit 820-1, a speaker feature vector calculation unit 830-1, a speaker registration unit 840, a registered utterance recording unit 803, an acoustic analysis unit 820-2, a speaker feature vector calculation unit 830-2, a normalized similarity calculation unit 150, a similarity averaging unit 160, a speaker identification unit 870, an acoustic analysis unit 820-3, a speaker feature vector calculation unit 830-3, a speaker feature vector normalization matrix learning unit 180, and a speaker feature vector normalization matrix recording unit 106. The speaker identification device 100 is connected to the UBM model recording unit 801, the speaker feature extraction model recording unit 802, a normalization matrix learning utterance recording unit 104, and a normalization matrix learning speaker name recording unit 105.

As indicated by the dotted lines in FIG. 7, the utterance division unit 110, the acoustic analysis unit 820-1, and the speaker feature vector calculation unit 830-1 are collectively called the divided registered utterance speaker feature vector calculation unit 1001, and the acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 are collectively called the recognition utterance speaker feature vector calculation unit 1002.

The speaker identification device 100 calculates similarity using a speaker feature vector normalization matrix. The technique that uses this normalization matrix is called WCCN and is known to reduce the variation in similarity caused by differences in the words contained in utterances (Reference Non-Patent Document 2). Compared with using the cosine similarity of Non-Patent Document 1, using WCCN makes the result less sensitive to differences between the words in the identification utterance and the registered utterance, which improves speaker identification accuracy.
(Reference Non-Patent Document 2: A. O. Hatch, S. Kajarekar, A. Stolcke, "Within-Class Covariance Normalization for SVM-based Speaker Recognition", Proc. Interspeech 2006, pp. 1471-1474, 2006.)

Processing in the speaker identification device 100 is broadly divided into three parts: registration of registered utterances, learning of the speaker feature vector normalization matrix, and identification of an identification utterance.

The registration flow for a registered utterance is outlined below.
1. The speech of the registered utterance is divided into short segments (preferably about the same length as an identification utterance), and a speaker feature vector is calculated for each segment, called a divided registered utterance (S110, S820-1, S830-1).
2. Each speaker feature vector is paired with the speaker name of the registered utterance and registered in the registered utterance database (S840).

話者特徴ベクトル正規化行列の学習処理フローの概略は以下の通りである。
１．正規化行列学習用発話の音声に対して話者特徴ベクトルを算出する（Ｓ８２０−３、Ｓ８３０−３）。
２．話者ごとに話者特徴ベクトルの分散共分散行列を求め、これらの行列の平均をとることで話者特徴ベクトル正規化行列を学習する（Ｓ１８０）。 The outline of the learning process flow of the speaker feature vector normalization matrix is as follows.
1. A speaker feature vector is calculated for the speech of the normalization matrix learning utterance (S820-3, S830-3).
2. A variance / covariance matrix of speaker feature vectors is obtained for each speaker, and a speaker feature vector normalization matrix is learned by taking the average of these matrices (S180).

なお、話者特徴ベクトル正規化行列の学習に用いる正規化行列学習用発話、正規化行列学習用話者名は、学習開始前に正規化行列学習用発話記録部１０４、正規化行列学習用話者名記録部１０５にそれぞれ記録しておく。 Note that the normalization matrix learning utterances and the normalization matrix learning speaker names used to learn the speaker feature vector normalization matrix are recorded in the normalization matrix learning utterance recording unit 104 and the normalization matrix learning speaker name recording unit 105, respectively, before learning starts.

識別発話の識別処理フローの概略は以下の通りである。
１．識別発話の音声に対して話者特徴ベクトルを算出する（Ｓ８２０−２、Ｓ８３０−２）。
２．識別発話の話者特徴ベクトルと各分割済登録発話の話者特徴ベクトルとの類似度を計算する（Ｓ１５０）。
３．２．で求めた類似度を話者ごとに平均化し、話者ごとの平均類似度を計算する（Ｓ１６０）。
４．話者ごとの平均類似度の最大値に対応する話者名を識別発話の話者（識別結果）として返す（Ｓ８７０）。 The outline of the identification utterance processing flow is as follows.
1. A speaker feature vector is calculated for the voice of the identified utterance (S820-2, S830-2).
2. The similarity between the speaker feature vector of the identified utterance and the speaker feature vector of each divided registered utterance is calculated (S150).
3. The similarities obtained in step 2 are averaged for each speaker to give the average similarity for each speaker (S160).
4. The speaker name corresponding to the maximum average similarity is returned as the speaker of the identification utterance (identification result) (S870).

以下、構成部ごとに入力、出力、動作について説明する。
＜発話分割部１１０＞
入力：登録発話
出力：分割済登録発話
登録発話集合の各登録発話を短時間ごとに分割し、分割済登録発話を生成する（Ｓ１１０）。分割時、区間の重複は許すものとする。つまり、図９にあるように分割済登録発話の音声には重なりがある。分割時の窓幅は、話者識別の利用時に想定される識別発話と同程度となるようにし、例えば１．５秒とする。シフト幅は例えば０．５秒とする。 Hereinafter, input, output, and operation will be described for each component.
<Utterance division unit 110>
Input: Registered utterance
Output: Divided registered utterances
Each registered utterance in the registered utterance set is divided into short segments to generate divided registered utterances (S110). The segments are allowed to overlap; that is, as shown in FIG. 9, the audio of adjacent divided registered utterances overlaps. The window width for division is set to about the length of the identification utterances expected when speaker identification is used, for example 1.5 seconds. The shift width is, for example, 0.5 seconds.
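
The division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_utterance`, the representation of an utterance as a list of samples, and the 16 kHz sampling rate are assumptions introduced here, while the 1.5 second window and 0.5 second shift are the example values from the text. How a trailing remainder shorter than the window is handled is not stated in the source; this sketch simply drops it.

```python
from typing import List, Sequence


def split_utterance(samples: Sequence[float],
                    sample_rate: int = 16000,   # assumed; not given in the source
                    window_sec: float = 1.5,    # example window width from the text
                    shift_sec: float = 0.5      # example shift width from the text
                    ) -> List[Sequence[float]]:
    """Split one registered utterance into overlapping fixed-length segments."""
    window = int(round(window_sec * sample_rate))
    shift = int(round(shift_sec * sample_rate))
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    # Assumption: an utterance no longer than the window is kept whole.
    if len(samples) <= window:
        return [samples]
    # Segments start every `shift` samples, so consecutive segments overlap
    # whenever shift < window.
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, shift)]
```

With these defaults, a 3 second utterance yields four overlapping segments starting at 0.0, 0.5, 1.0, and 1.5 seconds.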

区間の重複を許すことにより、より多くの分割済登録発話が生成されることとなり、本願発明のポイント（２）の効果、言葉の違いに起因する影響をより抑制することが可能となる。 Allowing the segments to overlap produces more divided registered utterances, which strengthens the effect of point (2) of the present invention, namely suppressing the influence of differences in the words spoken.

なお、ここでの処理は、典型的には、識別発話と同程度の長さの分割済登録発話を複数生成することになるが、登録発話の長さより少なくとも短くなる分割済登録発話を１つだけ生成するのでもよい。 Note that this processing typically generates a plurality of divided registered utterances, each about as long as the identification utterance; however, it may instead generate only one divided registered utterance, provided that it is at least shorter than the registered utterance.

＜音響分析部８２０−１、８２０−２、８２０−３＞
入力：分割済登録発話、認識発話、正規化行列学習用発話
出力：音響特徴量ベクトル系列
各発話から音響特徴量ベクトル系列を算出する（Ｓ８２０−１、Ｓ８２０−２、Ｓ８２０−３）。算出した音響特徴量ベクトル系列をそれぞれ分割済登録発話音響特徴量ベクトル系列、認識発話音響特徴量ベクトル系列、正規化行列学習用発話音響特徴量ベクトル系列という。話者識別装置の音響分析部８２０−２に識別発話が入力されること、話者照合装置の音響分析部８２０−２に照合発話が入力されることに対応して、認識発話音響特徴量ベクトル系列のことをそれぞれ識別発話音響特徴量ベクトル系列、照合発話音響特徴量ベクトル系列という。 <Acoustic analysis unit 820-1, 820-2, 820-3>
Input: Divided registered utterance, recognized utterance, normalization matrix learning utterance
Output: Acoustic feature vector sequence
An acoustic feature vector sequence is calculated from each utterance (S820-1, S820-2, S820-3). The calculated sequences are referred to as the divided registered utterance acoustic feature vector sequence, the recognized utterance acoustic feature vector sequence, and the normalization matrix learning utterance acoustic feature vector sequence, respectively. Because an identification utterance is input to the acoustic analysis unit 820-2 of a speaker identification device and a verification utterance is input to the acoustic analysis unit 820-2 of a speaker verification device, the recognized utterance acoustic feature vector sequence is called the identification utterance acoustic feature vector sequence or the verification utterance acoustic feature vector sequence, respectively.

ここでは、音響特徴量としてＭＦＣＣを利用する。ＭＦＣＣは短時間ごとのスペクトル包絡を表現し、音声認識を始めとする音声関連技術において広く利用されている。ＭＦＣＣの各次元の値をベクトル表記したものを音響特徴量ベクトルとし、ＭＦＣＣベクトルを時間方向に並べたものを音響特徴量ベクトル系列とする。ＭＦＣＣの抽出方法は参考非特許文献３に記載されている。ＭＦＣＣ抽出のフレーム幅は例えば２５ｍｓとし、シフト幅は例えば１０ｍｓとする。また、ＭＦＣＣの動的特徴量も音響特徴量ベクトルに含める。
（参考非特許文献３：鹿野清宏、伊藤克亘、河原達也、武田一哉、山本幹雄、“ＩＴＴｅｘｔ音声認識システム”、pp.13-14、オーム社、2001） Here, MFCC is used as the acoustic feature. MFCC represents the spectral envelope of each short time frame and is widely used in speech-related technologies such as speech recognition. The vector formed from the values of the MFCC dimensions is the acoustic feature vector, and the MFCC vectors arranged in time order form the acoustic feature vector sequence. A method for extracting MFCC is described in Reference Non-Patent Document 3. The frame width for MFCC extraction is, for example, 25 ms, and the shift width is, for example, 10 ms. The dynamic features of the MFCC are also included in the acoustic feature vector.
(Reference Non-Patent Document 3: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, “IT Text Speech Recognition System”, pp.13-14, Ohmsha, 2001)

＜話者特徴ベクトル算出部８３０−１、８３０−２、８３０−３＞
入力：音響特徴量ベクトル系列、ＵＢＭモデル、話者特徴量抽出モデル
出力：話者特徴ベクトル
各音響特徴量ベクトルからＵＢＭモデル、話者特徴量抽出モデルを用いて話者特徴ベクトルを算出する（Ｓ８３０−１、Ｓ８３０−２、Ｓ８３０−３）。算出した話者特徴ベクトルをそれぞれ分割済登録発話話者特徴ベクトル、認識発話話者特徴ベクトル、正規化行列学習用発話話者特徴ベクトルという。話者特徴ベクトルとして、話者識別装置８００と同じく、ｉ−ｖｅｃｔｏｒを用いる。音響分析部８２０−２と同様、話者特徴ベクトル算出部８３０−２の入力が識別発話音響特徴量ベクトル系列、照合発話音響特徴量ベクトル系列であることに対応して、認識発話話者特徴ベクトルのことをそれぞれ識別発話話者特徴ベクトル、照合発話話者特徴ベクトルという。 <Speaker feature vector calculation units 830-1, 830-2, 830-3>
Input: Acoustic feature vector sequence, UBM model, speaker feature extraction model
Output: Speaker feature vector
A speaker feature vector is calculated from each acoustic feature vector sequence using the UBM model and the speaker feature extraction model (S830-1, S830-2, S830-3). The calculated speaker feature vectors are referred to as the divided registered utterance speaker feature vector, the recognized utterance speaker feature vector, and the normalization matrix learning utterance speaker feature vector, respectively. As in the speaker identification device 800, an i-vector is used as the speaker feature vector. As with the acoustic analysis unit 820-2, because the input to the speaker feature vector calculation unit 830-2 is an identification utterance acoustic feature vector sequence or a verification utterance acoustic feature vector sequence, the recognized utterance speaker feature vector is called the identification utterance speaker feature vector or the verification utterance speaker feature vector, respectively.

＜話者登録部８４０＞
入力：分割済登録発話話者特徴ベクトル、話者名
出力先：登録発話データベース
分割済登録発話話者特徴ベクトルとそれに対応する話者名（分割元となった登録発話の話者名）を組とし、登録発話記録部８０３の登録発話データベースへ追加する（Ｓ８４０）。話者名は、先述の通り、人手で与えるものとする。つまり、登録発話データベースには登録発話集合の各登録発話に対して１つ以上の分割済登録発話話者特徴ベクトルが登録されることとなる。 <Speaker registration unit 840>
Input: Divided registered utterance speaker feature vector, speaker name
Output destination: Registered utterance database
Each divided registered utterance speaker feature vector is paired with its speaker name (the speaker name of the registered utterance from which it was divided) and added to the registered utterance database of the registered utterance recording unit 803 (S840). As described earlier, the speaker name is assigned manually. Consequently, the registered utterance database holds one or more divided registered utterance speaker feature vectors for each registered utterance in the registered utterance set.

＜話者特徴ベクトル正規化行列学習部１８０＞
入力：正規化行列学習用話者特徴ベクトル、正規化行列学習用話者名
出力：話者特徴ベクトル正規化行列
話者特徴ベクトル正規化行列の学習を行う（Ｓ１８０）。話者特徴ベクトル正規化行列は、話者ごとに（つまり、正規化行列学習用話者名が同一の正規化行列学習用話者特徴ベクトル群から）話者特徴ベクトルの分散共分散行列を求め、求めた分散共分散行列を全話者で平均化することで得られる。 <Speaker feature vector normalization matrix learning unit 180>
Input: Normalization matrix learning speaker feature vectors, normalization matrix learning speaker names
Output: Speaker feature vector normalization matrix
The speaker feature vector normalization matrix is learned (S180). It is obtained by computing the variance-covariance matrix of the speaker feature vectors for each speaker (that is, from the group of normalization matrix learning speaker feature vectors that share the same normalization matrix learning speaker name) and averaging these matrices over all speakers.
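
The learning step can be illustrated with a short sketch in plain Python. The function names are hypothetical, and the covariance is normalized by the number of vectors per speaker (the population form), which is an assumption: the source only says that a variance-covariance matrix is computed per speaker and that these matrices are averaged.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def covariance(vectors: Sequence[Sequence[float]]) -> Matrix:
    """Variance-covariance matrix of one speaker's feature vectors (divides by n)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        dev = [v[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += dev[i] * dev[j] / n
    return cov


def learn_normalization_matrix(vectors: Sequence[Sequence[float]],
                               speaker_names: Sequence[str]) -> Matrix:
    """Average the per-speaker covariance matrices over all speakers."""
    by_speaker: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vec, name in zip(vectors, speaker_names):
        by_speaker[name].append(vec)  # group vectors that share a speaker name
    covs = [covariance(vs) for vs in by_speaker.values()]
    dim = len(covs[0])
    return [[sum(c[i][j] for c in covs) / len(covs) for j in range(dim)]
            for i in range(dim)]
```

In practice i-vectors have hundreds of dimensions and a numerical library would be used; the nested loops here only make the averaging explicit.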

＜正規化あり類似度計算部１５０＞
入力：話者特徴ベクトル正規化行列、識別発話話者特徴ベクトル、分割済登録発話話者特徴ベクトル、話者名
出力：類似度、話者名
識別発話話者特徴ベクトルと登録発話データベースに含まれるすべての分割済登録発話話者特徴ベクトルとの類似度を計算する（Ｓ１５０）。計算した類似度は、分割済登録発話話者特徴ベクトルに対応する話者名と組にして出力される。 <Normalization similarity calculation unit 150>
Input: Speaker feature vector normalization matrix, identification utterance speaker feature vector, divided registered utterance speaker feature vectors, speaker names
Output: Similarity, speaker name
The similarity between the identification utterance speaker feature vector and every divided registered utterance speaker feature vector in the registered utterance database is calculated (S150). Each calculated similarity is output together with the speaker name corresponding to the divided registered utterance speaker feature vector.

ベクトルの類似度として、話者識別装置８００と同じく、コサイン類似度を用いる。ただしここでは、コサイン類似度計算の際に話者特徴ベクトル正規化行列を利用する。具体的には、コサイン類似度ｃは以下の式により与えられる。 Similar to the speaker identification device 800, the cosine similarity is used as the vector similarity. Here, however, a speaker feature vector normalization matrix is used when calculating the cosine similarity. Specifically, the cosine similarity c is given by the following equation.
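
The equation itself did not survive extraction. A form consistent with the symbols defined in the next paragraph (w_1, w_2, W, transpose ^t, inverse ^-1) and with the usual WCCN-normalized cosine scoring would be the following; it is a reconstruction under that assumption, not the patent's verbatim formula.

```latex
c = \frac{w_1^{t}\, W^{-1}\, w_2}
         {\sqrt{w_1^{t}\, W^{-1}\, w_1}\;\sqrt{w_2^{t}\, W^{-1}\, w_2}}
```

When W is the identity matrix, this reduces to the ordinary cosine similarity between w_1 and w_2.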

ここで、ｗ_１, ｗ_２はそれぞれ識別発話話者特徴ベクトル、分割済登録発話話者特徴ベクトル、Ｗは話者特徴ベクトル正規化行列を表す。なお、^ｔは転置、^−１は逆行列を表す。 Here, w ₁ and w ₂ represent an identification speaker feature vector, a divided registered speaker feature vector, and W a speaker feature vector normalization matrix, respectively. In addition, ^t represents transposition and ^-1 represents an inverse matrix.

＜類似度平均化部１６０＞
入力：類似度、話者名
出力：話者ごとの平均類似度、話者名
類似度を話者ごとに平均化し、話者ごとの平均類似度を求める（Ｓ１６０）。計算した平均類似度は、分割済登録発話話者特徴ベクトルに対応する話者名と組にして出力される。話者ごとの平均をとるために、例えば、分割元を同じくする分割済登録発話話者特徴ベクトルとの類似度の範囲で平均をとるなどすればよい。また、同一話者による複数の登録発話が話者識別装置１００に入力され、当該登録発話から算出される分割済登録発話話者特徴ベクトルが登録データベースに登録されている場合は、同一の話者名と組になっているこれらの分割済登録発話話者特徴ベクトルのすべてあるいは一部を用いて平均類似度を求めるようにしてもよい。 <Similarity averaging unit 160>
Input: Similarity, speaker name
Output: Average similarity for each speaker, speaker name
The similarities are averaged for each speaker to obtain the average similarity for each speaker (S160). Each average similarity is output together with the corresponding speaker name. To average per speaker, one may, for example, average the similarities over the divided registered utterance speaker feature vectors that share the same source registered utterance. When a plurality of registered utterances by the same speaker have been input to the speaker identification device 100 and the divided registered utterance speaker feature vectors calculated from them are registered in the registered utterance database, the average similarity may be obtained using all or some of the divided registered utterance speaker feature vectors paired with that speaker name.
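
As a rough sketch of this averaging, assuming the similarity calculation unit emits (speaker name, similarity) pairs; the function name and the pair layout are illustrative choices, not taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_similarity_by_speaker(
        scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the similarities of all divided registered utterances per speaker."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for name, similarity in scored:
        totals[name] += similarity
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Using the speaker name as the grouping key corresponds to the case where all divided registered utterances of a speaker are averaged together; grouping by a source-utterance identifier instead would give the per-utterance variant mentioned in the text.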

なお、発話分割部１１０の処理において分割済登録発話が１つだけ生成される場合は、話者識別装置１００は類似度平均化部１６０を持たない形で構成することになる。 When only one divided registered utterance is generated in the process of the utterance dividing unit 110, the speaker identification device 100 is configured without the similarity averaging unit 160.

＜話者識別部８７０＞
入力：話者ごとの平均類似度、話者名
出力：識別結果（話者名）
話者ごとの平均類似度のうち、最大となる平均類似度に対応する話者名（つまり、分割元となった登録発話の話者名）を選択、識別結果として返す（Ｓ８７０）。 <Speaker identification unit 870>
Input: Average similarity for each speaker, speaker name
Output: Identification result (speaker name)
Among the average similarities for the speakers, the speaker name corresponding to the maximum average similarity (that is, the speaker name of the source registered utterance) is selected and returned as the identification result (S870).

また、分割済登録発話話者特徴ベクトル算出部１００１、認識発話話者特徴ベクトル算出部１００２の入力、出力、動作としてまとめると、以下のようになる。 Further, the input, output, and operation of the divided registered speaker feature vector calculation unit 1001 and the recognized speaker feature vector calculation unit 1002 are summarized as follows.

＜分割済登録発話話者特徴ベクトル算出部１００１＞
入力：登録発話、ＵＢＭモデル、話者特徴量抽出モデル
出力：分割済登録発話話者特徴ベクトル
登録発話から生成した分割済登録発話から、ＵＢＭモデルと話者特徴量抽出モデルを用いて分割済登録発話話者特徴ベクトルを算出する（Ｓ１１０、Ｓ８２０−１、Ｓ８３０−１）。 <Divided Registered Speaker Speaker Feature Vector Calculation Unit 1001>
Input: Registered utterance, UBM model, speaker feature extraction model
Output: Divided registered utterance speaker feature vector
Divided registered utterance speaker feature vectors are calculated from the divided registered utterances generated from the registered utterance, using the UBM model and the speaker feature extraction model (S110, S820-1, S830-1).

＜認識発話話者特徴ベクトル算出部１００２＞
入力：認識発話、ＵＢＭモデル、話者特徴量抽出モデル
出力：認識発話話者特徴ベクトル
認識発話から、ＵＢＭモデルと話者特徴量抽出モデルを用いて認識発話話者特徴ベクトルを算出する（Ｓ８２０−２、Ｓ８３０−２）。 <Recognized utterer feature vector calculation unit 1002>
Input: Recognized utterance, UBM model, speaker feature extraction model
Output: Recognized utterance speaker feature vector
A recognized utterance speaker feature vector is calculated from the recognized utterance using the UBM model and the speaker feature extraction model (S820-2, S830-2).

登録発話に比べて識別発話が極端に短い場合でも正しく話者識別を行うため、同一話者であれば話者特徴ベクトルの類似度を上げ、別話者であれば類似度を下げる必要がある。発話長の違いへの最も単純な対処方法として、例えば登録発話から識別発話と同程度の発話長となるような一部区間を抽出し、その区間のみから話者特徴ベクトルを算出して類似度を求める方法が考えられる。しかしこの方法では、抽出する区間の言葉の情報の影響を受けて類似度が変化するため、たまたま登録発話と識別発話が似た言葉を含む区間を登録発話から抽出した場合には、異なる話者でも類似度が高くなってしまう。 To identify the speaker correctly even when the identification utterance is extremely short compared with the registered utterance, the similarity of the speaker feature vectors must be raised for the same speaker and lowered for different speakers. The simplest way to handle the difference in utterance length would be, for example, to extract from the registered utterance a partial section about as long as the identification utterance, calculate the speaker feature vector from that section alone, and compute the similarity. With this method, however, the similarity varies with the words contained in the extracted section, so if the section extracted from the registered utterance happens to contain words similar to those in the identification utterance, the similarity becomes high even for a different speaker.

したがって、発話長の違いへの対処と同時に、発話に含まれる言葉の影響による話者特徴ベクトルの変化も考慮する必要がある。実施例１の発明では、発話分割部１１０を備えることにより発話長の違いへ対処し、類似度平均化部１６０を備えることにより発話に含まれる言葉の影響に対処する。 Therefore, it is necessary to consider changes in the speaker feature vector due to the influence of words included in the utterance as well as dealing with the difference in utterance length. In the invention of the first embodiment, the utterance dividing unit 110 is provided to deal with the difference in utterance length, and the similarity averaging unit 160 is provided to deal with the influence of words included in the utterance.

これにより、登録発話に比べて識別発話が極端に短い場合でも、話者識別精度が向上する。また、テキスト依存型、テキスト非依存型のいずれに対しても話者識別精度が向上する。 This improves the speaker identification accuracy even when the identification utterance is extremely short compared to the registered utterance. Also, speaker identification accuracy is improved for both the text-dependent type and the text-independent type.

実施例１では、話者特徴ベクトル正規化行列を用いて類似度を計算したが、話者識別装置８００と同様、非特許文献１に記載の方法で類似度を計算してもよい。 In the first embodiment, the similarity is calculated using the speaker feature vector normalization matrix. However, similar to the speaker identification device 800, the similarity may be calculated by the method described in Non-Patent Document 1.

以下、図１０〜図１１を参照して実施例２の話者識別装置を説明する。図１０は、実施例２の話者識別装置２００の構成を示すブロック図である。図１１は、実施例２の話者識別装置２００の動作を示すフローチャートである。実施例１の話者識別装置１００との違いは、話者特徴ベクトル正規化行列の学習に関係する構成部がないこと、正規化あり類似度計算部１５０に替えて類似度計算部８５０が追加されていることである。 Hereinafter, the speaker identification device according to the second embodiment will be described with reference to FIGS. 10 and 11. FIG. 10 is a block diagram illustrating the configuration of the speaker identification device 200 according to the second embodiment. FIG. 11 is a flowchart illustrating the operation of the speaker identification device 200 according to the second embodiment. The differences from the speaker identification device 100 of the first embodiment are that there are no components related to learning the speaker feature vector normalization matrix and that a similarity calculation unit 850 replaces the normalized similarity calculation unit 150.

なお、図１０に点線で図示する通り、実施例１と同様、発話分割部１１０、音響分析部８２０−１、話者特徴ベクトル算出部８３０−１をまとめて分割済登録発話話者特徴ベクトル算出部１００１と、音響分析部８２０−２、話者特徴ベクトル算出部８３０−２をまとめて認識発話話者特徴ベクトル算出部１００２という。 Note that, as indicated by the dotted lines in FIG. 10 and as in the first embodiment, the utterance division unit 110, the acoustic analysis unit 820-1, and the speaker feature vector calculation unit 830-1 are collectively referred to as the divided registered utterance speaker feature vector calculation unit 1001, and the acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 are collectively referred to as the recognized utterance speaker feature vector calculation unit 1002.

以下、実施例１と相違する類似度計算部８５０の入力、出力、動作について説明する。 Hereinafter, input, output, and operation of the similarity calculation unit 850 different from the first embodiment will be described.

＜類似度計算部８５０＞
入力：識別発話話者特徴ベクトル、分割済登録発話話者特徴ベクトル、話者名
出力：類似度、話者名
識別発話話者特徴ベクトルと登録発話データベースに含まれるすべての分割済登録発話話者特徴ベクトルとの類似度を計算する（Ｓ８５０）。類似度は非特許文献１のコサイン類似度とする。計算した類似度は、分割済登録発話話者特徴ベクトルに対応する話者名と組にして出力される。 <Similarity calculation unit 850>
Input: Identification utterance speaker feature vector, divided registered utterance speaker feature vectors, speaker names
Output: Similarity, speaker name
The similarity between the identification utterance speaker feature vector and every divided registered utterance speaker feature vector in the registered utterance database is calculated (S850). The similarity is the cosine similarity of Non-Patent Document 1. Each calculated similarity is output together with the speaker name corresponding to the divided registered utterance speaker feature vector.
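
A minimal sketch of this unnormalized scoring follows. It uses the standard definition of cosine similarity; the function names and the representation of the database as (vector, speaker name) pairs are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(w1: Sequence[float], w2: Sequence[float]) -> float:
    """Standard cosine similarity between two speaker feature vectors."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def score_against_database(
        query: Sequence[float],
        database: Sequence[Tuple[Sequence[float], str]]
) -> List[Tuple[str, float]]:
    """Score the query against every divided registered utterance vector."""
    return [(name, cosine_similarity(query, vec)) for vec, name in database]
```

The resulting (speaker name, similarity) pairs are what the similarity averaging unit 160 would then consume.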

実施例１の発明と同様、登録発話に比べて識別発話が極端に短い場合でも、話者識別精度が向上する。また、テキスト依存型、テキスト非依存型のいずれに対しても話者識別精度が向上する。 Similar to the first embodiment, the speaker identification accuracy is improved even when the identification utterance is extremely short compared to the registered utterance. Also, speaker identification accuracy is improved for both the text-dependent type and the text-independent type.

実施例３では、実施例１または実施例２の話者識別装置での処理に加えて、閾値を用いた登録外話者判定処理（閾値処理）を行う。識別発話の話者が登録発話データベースにいる場合（閾値処理の結果、所定の要件を満たすと判断される場合）は実施例１または実施例２の話者識別装置での処理結果である話者名を識別結果として出力し、識別発話の話者が登録発話データベースにない場合（閾値処理の結果、所定の要件を満たされないと判断される場合）は登録外話者と判定する。ここで、登録外話者とは、登録発話データベースに話者名が登録されていない話者をいう。 In the third embodiment, in addition to the processing of the speaker identification device of the first or second embodiment, unregistered speaker determination processing using a threshold (threshold processing) is performed. When the speaker of the identification utterance is in the registered utterance database (when the threshold processing determines that a predetermined requirement is satisfied), the speaker name produced by the speaker identification device of the first or second embodiment is output as the identification result. When the speaker of the identification utterance is not in the registered utterance database (when the threshold processing determines that the predetermined requirement is not satisfied), the speaker is determined to be an unregistered speaker. Here, an unregistered speaker is a speaker whose speaker name is not registered in the registered utterance database.

以下、図１２〜図１４を参照して実施例３の話者識別装置を説明する。図１２は、実施例３の話者照合装置３００の構成を示すブロック図である。図１３は、実施例３の話者照合装置３００の動作を示すフローチャートである。図１４は、実施例３（変形）の話者照合装置３００’の構成を示すブロック図である。話者照合装置３００が実施例１の話者照合装置１００をベースにしたもの、話者照合装置３００’が実施例２の話者照合装置２００をベースにしたものである。話者照合装置３００及び話者照合装置３００’では、登録外話者判定部３１０が追加される。 Hereinafter, the speaker identification device according to the third embodiment will be described with reference to FIGS. 12 to 14. FIG. 12 is a block diagram illustrating the configuration of the speaker verification device 300 according to the third embodiment. FIG. 13 is a flowchart illustrating the operation of the speaker verification device 300 according to the third embodiment. FIG. 14 is a block diagram illustrating the configuration of the speaker verification device 300' according to a modification of the third embodiment. The speaker verification device 300 is based on the speaker verification device 100 of the first embodiment, and the speaker verification device 300' is based on the speaker verification device 200 of the second embodiment. In both the speaker verification device 300 and the speaker verification device 300', an unregistered speaker determination unit 310 is added.

なお、図１２及び図１４に点線で図示する通り、実施例１と同様、発話分割部１１０、音響分析部８２０−１、話者特徴ベクトル算出部８３０−１をまとめて分割済登録発話話者特徴ベクトル算出部１００１と、音響分析部８２０−２、話者特徴ベクトル算出部８３０−２をまとめて認識発話話者特徴ベクトル算出部１００２という。 Note that, as indicated by the dotted lines in FIGS. 12 and 14 and as in the first embodiment, the utterance division unit 110, the acoustic analysis unit 820-1, and the speaker feature vector calculation unit 830-1 are collectively referred to as the divided registered utterance speaker feature vector calculation unit 1001, and the acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 are collectively referred to as the recognized utterance speaker feature vector calculation unit 1002.

以下、実施例１または実施例２と相違する話者識別部８７５、登録外話者判定部３１０の入力、出力、動作について説明する。 Hereinafter, input, output, and operation of the speaker identification unit 875 and the non-registered speaker determination unit 310 that are different from those in the first or second embodiment will be described.

＜話者識別部８７５＞
入力：話者ごとの平均類似度、話者名
出力：平均類似度の最大値に対応する話者名、平均類似度の最大値
話者ごとの平均類似度のうち、最大となる平均類似度に対応する話者名を選択、選択した話者名に対応する平均類似度（つまり、平均類似度の最大値）を出力する（Ｓ８７５）。話者識別部８７０では平均類似度の最大値に対応する話者名を識別結果として出力したが、話者識別部８７５では平均類似度の最大値もあわせて出力する。 <Speaker identification unit 875>
Input: Average similarity for each speaker, speaker name
Output: Speaker name corresponding to the maximum average similarity, maximum average similarity
Among the average similarities for the speakers, the speaker name corresponding to the maximum average similarity is selected, and it is output together with that average similarity (that is, the maximum average similarity) (S875). Whereas the speaker identification unit 870 outputs only the speaker name corresponding to the maximum average similarity as the identification result, the speaker identification unit 875 also outputs the maximum average similarity.

＜登録外話者判定部３１０＞
入力：平均類似度の最大値に対応する話者名、平均類似度の最大値、閾値
出力：識別結果
識別発話の話者が登録外話者かを判定し、識別発話の話者が登録発話データベースにいる場合は話者名を、識別発話の話者が登録発話データベースにない場合は登録外話者である旨を識別結果として出力する（Ｓ３１０）。登録外話者の判定は、平均類似度の最大値の閾値処理により実現する。閾値は事前に設定されているものとする。 <Registered outside speaker determination unit 310>
Input: Speaker name corresponding to the maximum average similarity, maximum average similarity, threshold
Output: Identification result
It is determined whether the speaker of the identification utterance is an unregistered speaker. If the speaker is in the registered utterance database, the speaker name is output as the identification result; if not, an indication that the speaker is an unregistered speaker is output (S310). The determination is made by threshold processing of the maximum average similarity. The threshold is assumed to be set in advance.

一般に、登録話者本人であるかを判定する場合、類似度を閾値処理する方法、すなわち類似度が閾値よりも大きい場合は登録話者本人であるとみなし、閾値よりも小さい場合は登録話者でないとみなす方法を用いる。複数の話者の登録発話が登録発話データベースに登録されている場合、すべての登録話者に対して閾値処理を行い、類似度が閾値よりも大きい登録話者が一人でもいれば登録話者、一人もいないのであれば登録外話者と判定することも可能である。 In general, whether a speaker is the registered speaker is determined by thresholding the similarity: if the similarity is greater than the threshold, the speaker is regarded as the registered speaker, and if it is smaller, the speaker is regarded as not being the registered speaker. When registered utterances of a plurality of speakers are in the registered utterance database, it is also possible to apply the threshold to every registered speaker and to determine that the speaker is a registered speaker if the similarity exceeds the threshold for at least one registered speaker, and an unregistered speaker if it exceeds the threshold for none.

しかし、平均類似度の最大値のみを閾値処理することでも、登録外話者の判定を実現することができる。平均類似度の最大値が閾値よりも大きい場合は少なくとも一人以上が登録話者であると判定されるが、閾値よりも小さい場合はその他の話者も平均類似度が閾値よりも小さいことから、すべての登録話者に対して本人でない（登録外話者である）と判定されることとなる。このことから、ここでは、平均類似度の最大値と閾値を用いて、平均類似度の最大値が閾値よりも大きい場合は登録外話者でないと判定し、平均類似度の最大値に対応する話者名を、平均類似度の最大値が閾値よりも小さい場合は登録外話者と判定し、登録外話者であるという情報を識別結果として返す。 However, unregistered speaker determination can also be achieved by thresholding only the maximum average similarity. If the maximum average similarity is greater than the threshold, at least one registered speaker is judged to match; if it is smaller than the threshold, the average similarities of all other speakers are also smaller than the threshold, so the speaker is judged not to be any of the registered speakers (that is, an unregistered speaker). Accordingly, using the maximum average similarity and the threshold, when the maximum is greater than the threshold the speaker is determined not to be an unregistered speaker and the speaker name corresponding to the maximum average similarity is returned as the identification result; when the maximum is smaller than the threshold the speaker is determined to be an unregistered speaker and information to that effect is returned as the identification result.
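
The decision rule in this paragraph can be sketched as below. Returning `None` to signal an unregistered speaker, and the function name itself, are choices made for this illustration; the source only says that information indicating an unregistered speaker is returned.

```python
from typing import Dict, Optional


def identify_with_rejection(average_similarity: Dict[str, float],
                            threshold: float) -> Optional[str]:
    """Return the best-matching speaker name, or None for an unregistered speaker."""
    if not average_similarity:
        return None  # nothing registered, so nobody can match
    # Speaker identification unit 875: pick the name with the maximum average.
    best_name = max(average_similarity, key=lambda name: average_similarity[name])
    best_value = average_similarity[best_name]
    # Unregistered speaker determination unit 310: threshold only the maximum.
    if best_value > threshold:
        return best_name
    return None
```

Because every other speaker's average is no larger than the maximum, checking the maximum alone gives the same answer as checking each registered speaker against the threshold.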

なお、閾値以上であるか否かにより閾値処理を行ってもよい。また、平均類似度の最大値があらかじめ設定した閾値より大きいことまたは閾値以上であることを、平均類似度の最大値が大きいことを示す所定の範囲にあるという。 Note that the threshold processing may instead test whether the value is greater than or equal to the threshold. The maximum average similarity being greater than a preset threshold, or greater than or equal to it, is referred to as being within a predetermined range indicating that the maximum average similarity is large.

以下、図１５〜図１７を参照して実施例４の話者照合装置を説明する。図１５は、実施例４の話者照合装置４００の構成を示すブロック図である。図１６は、実施例４の話者照合装置４００の動作を示すフローチャートである。図１７は、話者照合装置４００による照合発話の照合の様子を示す図である。図１５に示すように話者照合装置４００は、発話分割部１１０と、音響分析部８２０−１と、話者特徴ベクトル算出部８３０−１と、話者登録部９４０と、登録発話記録部９０３と、音響分析部８２０−２と、話者特徴ベクトル算出部８３０−２と、正規化あり類似度計算部４５０と、類似度平均化部４６０と、話者照合部９７０と、音響分析部８２０−３と、話者特徴ベクトル算出部８３０−３と、話者特徴ベクトル正規化行列学習部１８０と、話者特徴ベクトル正規化行列記録部１０６を含む。話者照合装置４００は、ＵＢＭモデル記録部８０１と、話者特徴量抽出モデル記録部８０２と、正規化行列学習用発話記録部１０４と、正規化行列学習用話者名記録部１０５とに接続している。 Hereinafter, the speaker verification device according to the fourth embodiment will be described with reference to FIGS. 15 to 17. FIG. 15 is a block diagram illustrating the configuration of the speaker verification device 400 according to the fourth embodiment. FIG. 16 is a flowchart illustrating the operation of the speaker verification device 400 according to the fourth embodiment. FIG. 17 is a diagram showing how a verification utterance is verified by the speaker verification device 400. As shown in FIG. 15, the speaker verification device 400 includes an utterance division unit 110, an acoustic analysis unit 820-1, a speaker feature vector calculation unit 830-1, a speaker registration unit 940, a registered utterance recording unit 903, an acoustic analysis unit 820-2, a speaker feature vector calculation unit 830-2, a normalized similarity calculation unit 450, a similarity averaging unit 460, a speaker verification unit 970, an acoustic analysis unit 820-3, a speaker feature vector calculation unit 830-3, a speaker feature vector normalization matrix learning unit 180, and a speaker feature vector normalization matrix recording unit 106. The speaker verification device 400 is connected to the UBM model recording unit 801, the speaker feature extraction model recording unit 802, the normalization matrix learning utterance recording unit 104, and the normalization matrix learning speaker name recording unit 105.

なお、図１５に点線で図示する通り、実施例１と同様、発話分割部１１０、音響分析部８２０−１、話者特徴ベクトル算出部８３０−１をまとめて分割済登録発話話者特徴ベクトル算出部１００１と、音響分析部８２０−２、話者特徴ベクトル算出部８３０−２をまとめて認識発話話者特徴ベクトル算出部１００２という。 Note that, as indicated by the dotted lines in FIG. 15 and as in the first embodiment, the utterance division unit 110, the acoustic analysis unit 820-1, and the speaker feature vector calculation unit 830-1 are collectively referred to as the divided registered utterance speaker feature vector calculation unit 1001, and the acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 are collectively referred to as the recognized utterance speaker feature vector calculation unit 1002.

話者照合装置４００では、実施例１と同様、話者特徴ベクトル正規化行列を用いて類似度計算を行う。 In the speaker verification device 400, similar to the first embodiment, the similarity calculation is performed using the speaker feature vector normalization matrix.

話者照合装置４００での処理は大きく登録発話の登録、話者特徴ベクトル正規化行列の学習、照合発話の照合の３つに分かれる。 The processing in the speaker verification device 400 is roughly divided into three processes: registration of registered utterance, learning of speaker feature vector normalization matrix, and verification of verification utterance.

登録発話の登録処理フローの概略は以下の通りである。
１．登録発話の音声を短時間ごと（識別発話と同程度の長さが好ましい）に分割し、分割後の各音声（分割済登録発話という）に対して話者特徴ベクトルを算出する（Ｓ１１０、Ｓ８２０−１、Ｓ８３０−１）。
２．話者特徴ベクトルを登録発話データベースに登録する（Ｓ９４０）。 The outline of the registration processing flow of the registration utterance is as follows.
1. The registered utterance is divided into short segments (preferably about the same length as the identification utterance), and a speaker feature vector is calculated for each resulting segment (referred to as a divided registered utterance) (S110, S820-1, S830-1).
2. The speaker feature vector is registered in the registered utterance database (S940).

話者特徴ベクトル正規化行列の学習処理フローは実施例１で説明した通りであるのでここでは省略する。 Since the learning process flow of the speaker feature vector normalization matrix is as described in the first embodiment, it is omitted here.

照合発話の照合処理フローの概略は以下の通りである。
１．照合発話の音声に対して話者特徴ベクトルを算出する（Ｓ８２０−２、Ｓ８３０−２）。
２．照合発話の話者特徴ベクトルと各分割済登録発話の話者特徴ベクトルとの類似度を計算する（Ｓ４５０）。
３．２．で求めた類似度を登録発話ごとに平均化し、登録発話ごとの平均類似度を計算する（Ｓ４６０）。
４．平均類似度の最大値が閾値より大きいかに基づき、登録話者本人であるかを判定し、照合結果を返す（Ｓ９７０）。 The outline of the verification processing flow for verification utterance is as follows.
1. A speaker feature vector is calculated for the voice of the verification utterance (S820-2, S830-2).
2. The similarity between the speaker feature vector of the verification utterance and the speaker feature vector of each divided registered utterance is calculated (S450).
3. The similarities obtained in step 2 are averaged for each registered utterance to give the average similarity for each registered utterance (S460).
4. Whether the speaker is the registered speaker is determined based on whether the maximum average similarity is greater than the threshold, and the verification result is returned (S970).

The input, output, and operation of each component are described below.

<Speaker registration unit 940>
Input: divided registered utterance speaker feature vectors
Output destination: registered utterance database
The divided registered utterance speaker feature vectors are added to the registered utterance database of the registered utterance recording unit 903 (S940). As in Embodiment 1, one or more divided registered utterance speaker feature vectors are registered in the registered utterance database for each registered utterance in the registered utterance set.

Note that, so that the similarity averaging unit 460 can perform its processing (averaging of similarities; see FIG. 17), each divided registered utterance speaker feature vector must carry information that can be used to determine whether vectors derive from utterances of the same speaker. For example, an identifier indicating the registered utterance from which the segment was divided may be used. Of course, as in the speaker identification device, speaker names may be registered in the registered utterance database and used as the key that specifies the range over which the average is calculated.

<Normalized similarity calculation unit 450>
Input: speaker feature vector normalization matrix, verification utterance speaker feature vector, divided registered utterance speaker feature vectors
Output: similarities
The similarity between the verification utterance speaker feature vector and every divided registered utterance speaker feature vector in the registered utterance database is calculated (S450). The difference from the normalized similarity calculation unit 150 is that speaker names are neither input nor output.

<Similarity averaging unit 460>
Input: similarities
Output: average similarities
The similarities are averaged for each registered utterance to obtain an average similarity per registered utterance (S460). The difference from the similarity averaging unit 160 is that speaker names are neither input nor output. Instead of averaging per registered utterance, the average may be calculated over the range specified by the information that indicates whether vectors derive from utterances of the same speaker.
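The averaging step (S460) amounts to a group-by mean. The following Python sketch is illustrative only; the function name and the `(key, similarity)` pair layout are assumptions, where the key may be a registered-utterance identifier or any other grouping information such as a speaker name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def average_similarities(scored: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Average the similarities that share the same grouping key."""
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for key, similarity in scored:
        totals[key] += similarity
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Two segments came from registered utterance "utt-A", one from "utt-B".
averages = average_similarities([("utt-A", 0.5), ("utt-A", 0.25), ("utt-B", 0.125)])
print(averages)
```

Averaging over several segments is what dampens the effect of any single segment that happens to contain words similar to the verification utterance.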

<Speaker verification unit 970>
Input: average similarities, threshold
Output: verification result
If the maximum average similarity is greater than the threshold, the speaker is judged to be the same as the speaker of the registered utterance; if it is smaller than the threshold, the speaker is judged to be a different speaker. The judgment is returned as the verification result (S970). Alternatively, each similarity output by the similarity calculation unit 950 may be compared with the threshold, and a verification result indicating that the speaker is the speaker of the registered utterance may be generated and output if at least one similarity exceeds the threshold (S970). As with the non-registered speaker determination unit 310, the threshold test may instead check whether the value is greater than or equal to the threshold.
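The decision rule (S970) can be sketched in a few lines of Python. The function name and the `inclusive` flag are assumptions for illustration. Checking whether at least one score exceeds the threshold is equivalent to checking whether the maximum score exceeds it, so the same function covers both the variant that takes average similarities and the variant that takes the individual similarities.

```python
from typing import Iterable


def verify(scores: Iterable[float], threshold: float, inclusive: bool = False) -> bool:
    """Return True when the best score passes the threshold test.

    scores may be per-utterance average similarities or raw per-segment
    similarities. inclusive=True switches from "greater than" to
    "greater than or equal to".
    """
    best = max(scores, default=None)
    if best is None:
        # Nothing registered to compare against: treat as not verified.
        return False
    return best >= threshold if inclusive else best > threshold


print(verify([0.2, 0.7], threshold=0.6))
print(verify([0.2, 0.6], threshold=0.6, inclusive=True))
```

Returning False for an empty score list is a choice made in this sketch; the text does not say how that case is handled.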

When speaker feature vectors are used in text-independent speaker verification, the properties of speaker feature vectors described above cause the similarity to vary with the length of the verification utterance and the words it contains, which lowers verification accuracy. For example, a verification utterance that contains a certain word may always be judged to come from a registered speaker. The simplest way to make speaker verification robust to differences in utterance length and words would be, for example, to extract a partial section of the registered utterance whose length is comparable to that of the verification utterance, calculate a speaker feature vector from that section alone, and obtain the similarity. With this method, however, the similarity is affected by the words in the extracted section. If the section extracted from the registered utterance happens to contain words similar to those in the verification utterance, the similarity becomes high even for a different speaker.

Therefore, in addition to handling differences in utterance length, it is necessary to account for changes in the speaker feature vector caused by the words contained in the utterance. The invention of Embodiment 4 handles differences in utterance length by providing the utterance division unit 110, and handles the influence of the words contained in the utterance by providing the similarity averaging unit 460.

As a result, threshold-based speaker verification that is robust to changes in utterance length and words becomes possible even in text-independent speaker verification.

In Embodiment 4, the similarity is calculated using the speaker feature vector normalization matrix. However, as in the speaker verification device 900, the similarity may instead be calculated by the method described in Non-Patent Document 1.

The speaker verification device of Embodiment 5 is described below with reference to FIGS. 18 and 19. FIG. 18 is a block diagram showing the configuration of the speaker verification device 500 of Embodiment 5. FIG. 19 is a flowchart showing the operation of the speaker verification device 500 of Embodiment 5. It differs from the speaker verification device 400 of Embodiment 4 in that it has no components related to learning the speaker feature vector normalization matrix, and in that the similarity calculation unit 950 replaces the normalized similarity calculation unit 450.

As indicated by the dotted lines in FIG. 18, and as in Embodiment 1, the utterance division unit 110, the acoustic analysis unit 820-1, and the speaker feature vector calculation unit 830-1 are collectively referred to as the divided registered utterance speaker feature vector calculation unit 1001, and the acoustic analysis unit 820-2 and the speaker feature vector calculation unit 830-2 are collectively referred to as the recognized utterance speaker feature vector calculation unit 1002.

The input, output, and operation of the similarity calculation unit 950, which differs from Embodiment 4, are described below.

<Similarity calculation unit 950>
Input: verification utterance speaker feature vector, divided registered utterance speaker feature vectors
Output: similarities
The similarity between the verification utterance speaker feature vector and every divided registered utterance speaker feature vector in the registered utterance database is calculated (S950). The similarity is the cosine similarity of Non-Patent Document 1. The difference from the similarity calculation unit 850 is that speaker names are neither input nor output.
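As a rough illustration of S950, the sketch below computes the standard cosine similarity between one query vector and every stored vector. It uses only the textbook definition of cosine similarity; any preprocessing specific to Non-Patent Document 1 is not reproduced, and the function names and record layout (`utterance_id`, `vector`) are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def score_against_database(
    query: Sequence[float], records: List[Dict[str, object]]
) -> List[Tuple[str, float]]:
    """Pair each stored vector's source utterance ID with its similarity to the query."""
    return [(r["utterance_id"], cosine_similarity(query, r["vector"])) for r in records]


records = [
    {"utterance_id": "utt-A", "vector": [3.0, 4.0]},
    {"utterance_id": "utt-B", "vector": [0.0, 1.0]},
]
print(score_against_database([1.0, 0.0], records))
```

Because cosine similarity depends only on direction, vectors that differ in magnitude but point the same way score 1.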

As with the invention of Embodiment 4, threshold-based speaker verification that is robust to changes in utterance length and words becomes possible even in text-independent speaker verification.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with such hardware resources.

The external storage device of the hardware entity stores the programs needed to realize the functions described above and the data needed to process those programs. (Storage is not limited to the external storage device; for example, the programs may be stored in ROM, which is a read-only storage device.) Data obtained by processing these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed to process it are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this form includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this form, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A speaker-likeness evaluation device in which the speech of a speaker subject to speaker identification or speaker verification is a recognized utterance and the speech used to recognize the speaker is a registered utterance, the device comprising:
a divided registered utterance speaker feature vector calculation unit that generates, from the registered utterance, a divided registered utterance whose section is shorter than the section of the registered utterance, and calculates a divided registered utterance speaker feature vector from the divided registered utterance;
a recognized utterance speaker feature vector calculation unit that calculates a recognized utterance speaker feature vector from the recognized utterance; and
a similarity calculation unit that calculates a similarity between the recognized utterance and the divided registered utterance using the recognized utterance speaker feature vector and the divided registered utterance speaker feature vector.

The speaker-likeness evaluation device according to claim 1, wherein
the divided registered utterance speaker feature vector calculation unit generates two or more divided registered utterances,
the similarity calculation unit calculates the similarity for each of the two or more divided registered utterances, and
the device further comprises
a similarity averaging unit that calculates an average similarity, which is the average value of the similarities.

A speaker identification device in which a set whose elements are one or more registered utterances is a registered utterance set,
the speaker identification device generating an identification result for an identification utterance from the similarity or the average similarity calculated, using the speaker-likeness evaluation device according to claim 1 or 2, for each pair of the identification utterance and a registered utterance of the registered utterance set,
the speaker identification device comprising: a speaker identification unit that selects the speaker name of the registered utterance corresponding to the maximum value of the similarity or the average similarity and uses the speaker name as the identification result.

A speaker identification device in which a set whose elements are one or more registered utterances is a registered utterance set,
the speaker identification device generating an identification result for an identification utterance from the similarity or the average similarity calculated, using the speaker-likeness evaluation device according to claim 1 or 2, for each pair of the identification utterance and a registered utterance of the registered utterance set,
the speaker identification device comprising: a speaker identification unit that selects the maximum value of the similarity or the average similarity and the speaker name of the registered utterance corresponding to the maximum value; and
a non-registered speaker determination unit that uses the speaker name as the identification result when the maximum value is within a predetermined range indicating that the maximum value is large.

A speaker verification device in which a set whose elements are one or more registered utterances is a registered utterance set,
the speaker verification device generating a verification result for a verification utterance from the similarity or the average similarity calculated, using the speaker-likeness evaluation device according to claim 1 or 2, for each pair of the verification utterance and a registered utterance of the registered utterance set,
the speaker verification device comprising: a speaker verification unit that generates a verification result indicating that the speaker is the speaker of the registered utterance when the maximum value of the similarity or the average similarity is within a predetermined range indicating that the maximum value is large.

A speaker-likeness evaluation method in which the speech of a speaker subject to speaker identification or speaker verification is a recognized utterance and the speech used to recognize the speaker is a registered utterance, the method executing:
a divided registered utterance speaker feature vector calculation step of generating, from the registered utterance, a divided registered utterance whose section is shorter than the section of the registered utterance, and calculating a divided registered utterance speaker feature vector from the divided registered utterance;
a recognized utterance speaker feature vector calculation step of calculating a recognized utterance speaker feature vector from the recognized utterance; and
a similarity calculation step of calculating a similarity between the recognized utterance and the divided registered utterance using the recognized utterance speaker feature vector and the divided registered utterance speaker feature vector.

The speaker-likeness evaluation method according to claim 6, wherein
the divided registered utterance speaker feature vector calculation step generates two or more divided registered utterances,
the similarity calculation step calculates the similarity for each of the two or more divided registered utterances, and
the method further executes
a similarity averaging step of calculating an average similarity, which is the average value of the similarities.

A program for causing a computer to execute the speaker-likeness evaluation method according to claim 6 or 7.