JPH11338492A - Speaker recognition unit - Google Patents

Speaker recognition unit

Info

Publication number
JPH11338492A
JPH11338492A (application JP10147083A / JP14708398A)
Authority
JP
Japan
Prior art keywords
speaker
model
phoneme
unit
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP10147083A
Other languages
Japanese (ja)
Inventor
Tadanao Tokuda
肇直 徳田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP10147083A priority Critical patent/JPH11338492A/en
Publication of JPH11338492A publication Critical patent/JPH11338492A/en
Pending legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To improve speaker recognition performance in a text-prompted (utterance-content-specified) system. SOLUTION: A text-prompted speaker recognition unit is provided with a speaker-characteristic weighting-coefficient addition part 1-5 and a weighted similarity calculation part 1-8. The distance between each phoneme model of the target speaker and the corresponding model of many unspecified speakers is computed, and a weighting coefficient proportional to that distance is attached to the phoneme model, on the assumption that a model with a large distance strongly expresses the speaker's individuality. At recognition time the weighted phoneme models are concatenated into a word model of the specified utterance, the similarity between this model and the feature-vector time series of the uttered speech is calculated by the weighted similarity calculation part 1-8, and the speaker recognition result is determined from the calculated value.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speaker recognition device that acoustically analyzes the voice uttered by a speaker, compares the resulting voice feature pattern with speaker feature patterns registered in advance, and from the comparison determines who the speaker is, and whether the speaker is the person he or she claims to be.

[0002]

2. Description of the Related Art

In speaker recognition, a text-dependent method is common, in which the voice pattern of a specific speaker uttering a password is stored as a reference pattern. At recognition time, the distance between the input speech and the registered person's reference pattern after time alignment is computed, and the speaker is judged to be the registered person or not by comparing that value with a fixed threshold. In the text-dependent method, however, there is a risk that if the registered person's voice is recorded with a tape recorder or the like and played back in front of the recognition device, the device will accept it.

[0003]

In recent years, therefore, text-prompted speaker recognition methods have been proposed, in which the device can specify an arbitrary text at every recognition attempt. In this method, the input is accepted only when it can be judged that the registered person has correctly uttered the specified text. Since the utterance content differs every time, spoofing with recorded speech can be prevented.

[0004]

In the text-prompted method, in order to judge whether the person has correctly uttered the specified text, a model of the speech the person would produce when uttering that text must be generated. For the system to be able to specify arbitrary text, all basic phoneme models are trained for each speaker and then concatenated to generate a speech model of the specified text.

[0005]

To create the phoneme models of each speaker, it would be ideal to train the HMM (hidden Markov model) parameters of each phoneme model directly from a very large amount of speech data (several hundred sentences) uttered by that one speaker, but in practice the user cannot be forced to produce such a large amount of speech. Therefore, speaker-independent phoneme models created from the data of many speakers are used as initial models, and these initial models are adapted to each speaker using about 100 words of training data from that speaker. In this way the speaker's phoneme models are generated.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

In general, the characteristics of a speaker's voice do not appear over all speech segments; they are observed as partial features of the speech, such as the average pitch of vowel segments or the nasal spectrum. In the conventional method, however, all phonemes of a registered speaker are evaluated equally when computing the word similarity used for speaker recognition. Because segments that express the speaker's characteristics well and segments that do not are evaluated equally, good speaker recognition performance is not always obtained.

[0007]

Accordingly, an object of the present invention is to provide a speaker recognition device that improves speaker recognition by strongly reflecting the registered speaker's characteristic utterance segments in the recognition result.

[0008]

SUMMARY OF THE INVENTION

The present invention is a text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result. To this text-prompted speaker recognition device are added a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker, and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

[0009]

With this configuration, a speaker recognition device can be realized that improves speaker recognition by strongly reflecting the registered speaker's characteristic utterance segments in the recognition result.

[0010]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention according to claim 1 is a text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result. To this device are added a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker, and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

[0011]

In this configuration, a weighting coefficient proportional to the displacement of the mean feature vector at speaker adaptation is attached to each phoneme model and stored. At speaker recognition time, the weighted phoneme models are concatenated to create a word model of the specified text; in the similarity calculation, the partial distance of each phoneme model is weighted by that coefficient, and the cumulative distance obtained by summing these weighted distances over the utterance interval is compared with the registered speaker's decision threshold to decide the speaker recognition result. Because the similarity of the registered speaker's characteristic utterance segments is thereby strongly reflected in speaker recognition, an improvement in speaker recognition accuracy can be expected.
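
Purely as an illustration of the concatenation step (the model representation, the dictionary format, and all names below are assumptions for exposition, not part of the patent), a minimal Python sketch:

```python
# Minimal sketch: build a word model for the prompted text by concatenating
# the registered speaker's weighted phoneme models (representation assumed).

def synthesize_word_model(text, dictionary, registered_models):
    """dictionary: word -> phoneme sequence, e.g. {"hai": ["h", "a", "i"]};
    registered_models: phoneme -> {"mean": feature vector, "weight": w}."""
    return [registered_models[ph] for ph in dictionary[text]]
```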

[0012]

The invention according to claim 2 provides a speaker-characteristic weighting-coefficient addition unit for the case where sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting speaker-independent phoneme data, or where the registered speaker's phoneme models have been prepared in advance; this unit computes the distance between the mean feature vectors of the same phoneme model for the unspecified speakers and for the registered speaker, and attaches a weighting coefficient proportional to that value to the phoneme model.

[0013]

With this configuration, when sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting speaker-independent phoneme data, the distance between mean feature vectors is computed for the same phoneme in the speaker-independent and registered-speaker phoneme models, and a weighting coefficient is attached to the registered speaker's phoneme model. The weighted similarity is then computed at speaker recognition time. Thus, even when speaker adaptation is not performed, weighting coefficients can be attached to the registered speaker's characteristic phoneme models.

[0014]

The invention according to claim 3 provides a speaker-characteristic weighting-coefficient addition unit for the case where word models are used in a text-dependent system; this unit puts the same-phoneme portions of the word models of the unspecified speakers and of the specific speaker into correspondence, computes the distances between the individual mean vectors, and applies weights proportional to those values to the corresponding parts of the word model.

[0015]

With this configuration, when word models are used for speaker recognition, time-axis matching is performed between the word model of the unspecified speakers and that of the specific speaker, distances are computed from the mean vectors of the corresponding phoneme portions, and weighting coefficients proportional to those values are attached to the word model phoneme by phoneme. This yields the same kind of word model as would be obtained by concatenating weighted phoneme models. By computing the weighted similarity at speaker recognition time, speaker recognition that emphasizes the registered speaker's characteristic phonemes can be realized even on a conventional fixed-text speaker recognition device.

[0016]

An embodiment of the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speaker recognition device according to an embodiment of the present invention, FIG. 2 is a circuit block diagram of the speaker recognition device, FIG. 3 is a conceptual diagram of the speaker-characteristic weighting of the speaker recognition device, FIG. 4 is a flowchart of the speaker registration operation of the speaker recognition device, and FIG. 5 is a flowchart of the speaker recognition operation of the speaker recognition device.

[0017]

In FIG. 1, reference numeral 1-1 denotes a speech input unit that captures the speaker's utterance; 1-2 denotes a feature vector calculation unit that computes a feature-vector sequence from the speech signal; 1-3 denotes a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; 1-4 denotes a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; 1-5 denotes a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker; 1-6 denotes a registered-speaker phoneme model storage unit that stores the adapted phoneme models of the registered speaker; 1-7 denotes a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; 1-8 denotes a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction; 1-9 denotes a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and 1-10 denotes a recognition result output unit that outputs the speaker recognition result.

[0018]

FIG. 2 is a circuit block diagram of the speaker recognition device, in which 2-1 is a microphone, 2-2 is a loudspeaker, 2-3 is a central processing unit (CPU), 2-4 is a writable memory (RAM), and 2-5 is a read-only memory (ROM).

[0019]

Of the blocks in the configuration diagram of FIG. 1, the speech input unit 1-1 is realized by the microphone 2-1, the speaker-independent phoneme model storage unit 1-3 by the ROM 2-5, and the registered-speaker phoneme model storage unit 1-6 by the RAM 2-4. The speaker adaptation unit 1-4, the speaker-characteristic weighting-coefficient addition unit 1-5, the specified-word model synthesis unit 1-7, the weighted similarity calculation unit 1-8, and the recognition result judgment unit 1-9 are realized by the CPU 2-3 executing a program written in the ROM 2-5 while accessing the ROM 2-5 and the RAM 2-4, and the recognition result output unit 1-10 is realized by the loudspeaker 2-2 or other output means.

[0020]

The operation of the speaker recognition device configured as described above will be explained separately for the speaker registration operation and for the speaker recognition process.

[0021]

1. Speaker registration operation

The operation of creating the registered speaker's weighted phoneme models will be explained with reference to the flowchart of FIG. 4. This process is executed when a speaker who is to become a recognition target registers his or her own voice pattern in the device.

[0022]

Step (1): A speech signal uttered by the user is captured through the speech input unit 1-1.

[0023]

Step (2): The speech signal is converted into a digital signal by the feature vector calculation unit 1-2, and speech feature vectors are computed.

[0024]

Step (3): The phoneme models of unspecified speakers are adapted to the specific speaker (the user) using the user's speech feature vectors. Specifically, in the speaker adaptation unit 1-4, the mean feature vector of each unspecified-speaker phoneme model stored in the speaker-independent phoneme model storage unit 1-3 is moved toward the mean feature vector of the same phoneme of the registering speaker, thereby creating the registered speaker's phoneme models.

[0025]

Step (4): The displacement of the mean feature vector caused by adapting each unspecified-speaker phoneme model to the registered speaker is computed by the speaker-characteristic weighting-coefficient calculation unit.

[0026]

Step (5): The speaker-characteristic weighting-coefficient addition unit 1-5 attaches to the registered speaker's phoneme model a distance-proportional weighting coefficient obtained by multiplying the displacement by a constant.

[0027]

Step (6): The phoneme models with the attached weighting coefficients are recorded in the registered-speaker phoneme model storage unit 1-6.

[0028]

In claim 2, when speaker adaptation is not performed because the specific speaker's recognition models are obtained directly, the vector distance between the mean feature vectors of the same phoneme model of the registered speaker and of the unspecified speakers is computed instead of the vector displacement at speaker adaptation, and a weighting coefficient proportional to that value is calculated. The phoneme models with the attached weighting coefficients are then recorded in the registered-speaker phoneme model storage unit 1-6.

[0029]

In claim 3, when word-level speech models are available from the start rather than phoneme-level recognition models, the time axes of the two word models are normalized, the distances between the mean feature vectors of the corresponding same-phoneme portions are computed, and weighting coefficients proportional to those values are attached to the individual phoneme sections of the registered speaker's word model.
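
A sketch of this claim-3 variant, under the simplifying assumption that time-axis normalization (for example, DTW) has already put the two word models into one-to-one per-phoneme correspondence; all names and the constant alpha are illustrative:

```python
import numpy as np

def weight_word_model(si_segments, spk_segments, alpha=1.0):
    """si_segments / spk_segments: lists of (phoneme, mean_vector) pairs
    describing the same word, assumed already aligned on a common time axis.
    Returns the speaker's word model with a weight attached per phoneme."""
    weighted = []
    for (ph, si_mean), (ph_chk, spk_mean) in zip(si_segments, spk_segments):
        assert ph == ph_chk, "segments must correspond after alignment"
        dist = np.linalg.norm(spk_mean - si_mean)  # distance of mean vectors
        weighted.append({"phoneme": ph, "mean": spk_mean,
                         "weight": alpha * dist})
    return weighted
```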

[0030]

2. Speaker recognition process

The operation during the speaker recognition process will be explained with reference to the flowchart of FIG. 5.

[0031]

Step (1): The device specifies to the speaker the word to be uttered.

Step (2): The speaker's weighted phoneme models are concatenated according to a dictionary to create a word model of the specified word.

[0032]

Step (3): The speech signal of the word uttered by the speaker is captured through the speech input unit 1-1.

[0033]

Step (4): The feature vector calculation unit 1-2 computes the feature-vector time series of the input speech.

[0034]

Step (5): The weighted similarity calculation unit 1-8 performs time-axis matching between the time series of speech feature vectors and the word model, and obtains the correspondence between the feature-vector time series and each phoneme of the word model.

[0035]

Step (6): The weighted cumulative distance is computed along the correspondence obtained in step (5).

[0036]

Specifically, the distance between the time series of speech feature vectors and the corresponding partial phoneme model within the word model is computed, and the phoneme-unit vector distances are accumulated over the utterance interval. In doing so, each distance is multiplied by the weighting coefficient of its phoneme model, which emphasizes the speaker's characteristics.

[0037]

Cumulative distance = sum, over the utterance interval, of (phoneme-unit distance × weighting coefficient).

Alternatively, by dividing this cumulative distance by the sum of each phoneme model's weighting coefficient multiplied by its number of occurrences, the cumulative distance is normalized, and the decision threshold for the recognition result can be used unchanged whether or not weighting is applied.

[0038]

Normalized cumulative distance = cumulative distance ÷ (sum, over the utterance interval, of (weighting coefficient × occurrence count of the phoneme model)).

This weighting emphasizes the phoneme portions that strongly reflect the speaker's characteristics. As a result, the cumulative distance when the speaker is the registered person is affected little, while the cumulative distance when the speaker is an impostor increases; the contrast between the registered person and an impostor becomes clearer, and an improvement in speaker recognition accuracy can be expected.
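
The two formulas above can be read as the following Python sketch. It assumes a Euclidean frame-to-model distance and per-frame accumulation (so the denominator equals the sum of weight × occurrence count), and it assumes the alignment from step (5) is given as a frame-to-phoneme index map; these choices and all names are illustrative, not from the patent.

```python
import numpy as np

def weighted_cumulative_distance(frames, alignment, word_model):
    """frames:     feature-vector time series of the input utterance
    alignment:  alignment[t] = index of the word-model phoneme that frame t
                is matched to (result of the time-axis matching in step (5))
    word_model: list of {"mean": vector, "weight": w} per phoneme
    Returns the normalized cumulative distance of paragraph [0038]."""
    num, den = 0.0, 0.0
    for t, ph_idx in enumerate(alignment):
        m = word_model[ph_idx]
        d = np.linalg.norm(frames[t] - m["mean"])  # phoneme-unit distance
        num += m["weight"] * d   # weighted distance, accumulated over time
        den += m["weight"]       # weight x occurrence count, for normalization
    return num / den
```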

[0039]

When discrete-distribution HMMs (hidden Markov models) are used as the phoneme models, the weighting coefficient is computed from the distance between the mean vectors of each state of the phoneme model before and after adaptation, and is attached to the model. A probabilistic distance measure such as the Kullback-Leibler divergence may also be used as the distance between HMM models. The weighting coefficient Zj of state j is given by Equation 1.

[0040]

(Equation 1)

[0041]

At speaker recognition time, the optimal state sequence is first obtained by the Viterbi algorithm with the weighting coefficients ignored, and the output likelihoods along that path are then weighted and accumulated. For the feature-vector time series y1, ..., yT, the weighted cumulative likelihood P(y) is given by Equation 2. To normalize this cumulative likelihood, it is divided by the sum, over the optimal state sequence, of the products of each state's duration and its weighting coefficient.

[0042]

(Equation 2)
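
Since Equations 1 and 2 are shown as images in the original publication and are not reproduced here, only the prose of paragraph [0041] can be sketched. Assuming per-frame output log-likelihoods along the Viterbi best path and the normalization described above (both assumptions), a Python rendering might be:

```python
def weighted_path_likelihood(log_out, path, state_weights):
    """log_out[t]: output log-likelihood of frame t in state path[t], computed
    along the Viterbi best path found with the weights ignored.
    path[t]:      state index visited at frame t.
    state_weights[j]: weighting coefficient Zj of state j.
    Returns the weighted cumulative log-likelihood, normalized by the sum of
    (state duration x weighting coefficient) over the path."""
    num = sum(state_weights[j] * lp for j, lp in zip(path, log_out))
    den = sum(state_weights[j] for j in path)  # each frame adds its state's weight
    return num / den
```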

[0043]

This yields the normalized weighted likelihood for the case of HMM phoneme models.

Step (7): The recognition result judgment unit 1-9 compares the cumulative distance with the threshold for judging whether the speaker is the registered person, and determines the recognition result.

[0044]

EFFECTS OF THE INVENTION

According to the present invention, the inter-speaker distances of the phonemes that best express the registered speaker's voice characteristics are emphasized, so an improvement in the recognition accuracy for registered speakers can be expected. Moreover, since this weighting process uses no prior phonetic knowledge about the speaker's characteristics, it can flexibly accommodate individual differences between speakers.

[0045]

According to the invention of claim 2, when phoneme models can be created directly from the speech data of a specific speaker, the speaker's characteristics are analyzed by comparison with general (unspecified) speakers and the phoneme models are weighted accordingly, so speaker recognition can be performed with higher accuracy.

[0046]

According to the invention of claim 3, even in a conventional text-dependent speaker recognition device, the word models of the registered speaker and of general (unspecified) speakers are compared and weights are attached to the phoneme portions characteristic of the speaker, so highly accurate speaker recognition that emphasizes the speaker's characteristics can be performed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the configuration of a speaker recognition device according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of the speaker recognition device according to the embodiment of the present invention.

FIG. 3 is a conceptual diagram of the speaker-characteristic weighting of the speaker recognition device according to the embodiment of the present invention.

FIG. 4 is a flowchart of the speaker registration operation of the speaker recognition device according to the embodiment of the present invention.

FIG. 5 is a flowchart of the speaker recognition operation of the speaker recognition device according to the embodiment of the present invention.

EXPLANATION OF REFERENCE NUMERALS

1-1 Speech input unit
1-2 Feature vector calculation unit
1-3 Speaker-independent phoneme model storage unit
1-4 Speaker adaptation unit
1-5 Speaker-characteristic weighting-coefficient addition unit
1-6 Registered-speaker phoneme model storage unit
1-7 Specified-word model synthesis unit
1-8 Weighted similarity calculation unit
1-9 Recognition result judgment unit
1-10 Recognition result output unit
2-1 Microphone
2-2 Loudspeaker
2-3 Central processing unit (CPU)
2-4 Writable memory (RAM)
2-5 Read-only memory (ROM)

Claims (3)

[Claims]

1. A text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the phoneme models of unspecified speakers to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result; the device further comprising: a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the unspecified-speaker phoneme model to the registered speaker; and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

2. The speaker recognition device according to claim 1, comprising a speaker-characteristic weighting-coefficient addition unit that, when sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting unspecified-speaker phoneme data, or when the registered speaker's phoneme models have been prepared in advance, computes the distance between the mean feature vectors of the same phoneme model of the unspecified speakers and of the registered speaker, and attaches a weighting coefficient proportional to that value to the phoneme model.

3. The speaker recognition device according to claim 1, comprising a speaker-characteristic weighting-coefficient addition unit that, when word models are used in a text-dependent manner, puts the same-phoneme portions of the word models of the unspecified speakers and of the specific speaker into correspondence, computes the distances between the individual mean vectors, and partially applies weights proportional to those values to the word model.
JP10147083A 1998-05-28 1998-05-28 Speaker recognition unit Pending JPH11338492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP10147083A JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP10147083A JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Publications (1)

Publication Number Publication Date
JPH11338492A (en) 1999-12-10

Family

ID=15422094

Family Applications (1)

Application Number Title Priority Date Filing Date
JP10147083A Pending JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Country Status (1)

Country Link
JP (1) JPH11338492A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10572812B2 (en) 2015-03-19 2020-02-25 Kabushiki Kaisha Toshiba Detection apparatus, detection method, and computer program product
CN109313902A (en) * 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface
CN113793615A (en) * 2021-09-15 2021-12-14 北京百度网讯科技有限公司 Speaker recognition method, model training method, device, equipment and storage medium
CN113793615B (en) * 2021-09-15 2024-02-27 北京百度网讯科技有限公司 Speaker recognition method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
AU707355B2 (en) Speech recognition
US6029124A (en) Sequential, nonparametric speech recognition and speaker identification
US5946654A (en) Speaker identification using unsupervised speech models
Masuko et al. Imposture using synthetic speech against speaker verification based on spectrum and pitch
US20060206326A1 (en) Speech recognition method
WO2007046267A1 (en) Voice judging system, voice judging method, and program for voice judgment
JPH075892A (en) Voice recognition method
JP2006171750A (en) Feature vector extracting method for speech recognition
JP2001166789A (en) Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end
US7072750B2 (en) Method and apparatus for rejection of speech recognition results in accordance with confidence level
JP3130524B2 (en) Speech signal recognition method and apparatus for implementing the method
JPH11175082A (en) Voice interaction device and voice synthesizing method for voice interaction
KR20220134347A (en) Speech synthesis method and apparatus based on multiple speaker training dataset
JP2002236494A (en) Speech section discriminator, speech recognizer, program and recording medium
JP4461557B2 (en) Speech recognition method and speech recognition apparatus
JP2003177779A (en) Speaker learning method for speech recognition
JPH11338492A (en) Speaker recognition unit
JP4391179B2 (en) Speaker recognition system and method
JP3090119B2 (en) Speaker verification device, method and storage medium
JP2001255887A (en) Speech recognition device, speech recognition method and medium recorded with the method
JP2003044078A (en) Voice recognizing device using uttering speed normalization analysis
JP4749990B2 (en) Voice recognition device
JP2006010739A (en) Speech recognition device
JP3036509B2 (en) Method and apparatus for determining threshold in speaker verification
JP4449380B2 (en) Speaker normalization method and speech recognition apparatus using the same