JPH11338492A - Speaker recognition unit - Google Patents

Speaker recognition unit

Info

Publication number
JPH11338492A
JPH11338492A (application JP10147083A / JP14708398A)
Authority
JP
Japan
Prior art keywords
speaker
model
phoneme
unit
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP10147083A
Other languages
Japanese (ja)
Inventor
Tadanao Tokuda
肇直 徳田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP10147083A priority Critical patent/JPH11338492A/en
Publication of JPH11338492A publication Critical patent/JPH11338492A/en
Pending legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To improve speaker recognition performance in a text-prompted (utterance-content-specified) system. SOLUTION: A text-prompted speaker recognition unit is provided with a speaker-characteristic weighting-coefficient addition part 1-5 and a weighted similarity calculation part 1-8. The distance between each phoneme model of the target speaker and the corresponding model of many unspecified speakers is computed, and a weighting coefficient proportional to that distance is attached to the phoneme model, on the assumption that a model with a large distance strongly expresses the speaker's individuality. At recognition time the weighted phoneme models are concatenated into a word model of the specified utterance, the similarity between this model and the feature-vector time series of the uttered speech is calculated by the weighted similarity calculation part 1-8, and the speaker recognition result is determined from the calculated value.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speaker recognition device that acoustically analyzes the voice uttered by a speaker, compares the resulting voice feature pattern with speaker feature patterns registered in advance, and from the comparison determines who the speaker is, and whether the speaker is the person he or she claims to be.

[0002]

2. Description of the Related Art

In speaker recognition, a text-dependent method is common, in which the voice pattern of a specific speaker uttering a password is stored as a reference pattern. At recognition time, the distance between the input speech and the registered person's reference pattern after time alignment is computed, and the speaker is judged to be the registered person or not by comparing that value with a fixed threshold. In the text-dependent method, however, there is a risk that if the registered person's voice is recorded with a tape recorder or the like and played back in front of the recognition device, the device will accept it.

[0003]

In recent years, therefore, text-prompted speaker recognition methods have been proposed, in which the device can specify an arbitrary text at every recognition attempt. In this method, the input is accepted only when it can be judged that the registered person has correctly uttered the specified text. Since the utterance content differs every time, spoofing with recorded speech can be prevented.

[0004]

In the text-prompted method, in order to judge whether the person has correctly uttered the specified text, a model of the speech the person would produce when uttering that text must be generated. For the system to be able to specify arbitrary text, all basic phoneme models are trained for each speaker and then concatenated to generate a speech model of the specified text.

[0005]

To create the phoneme models of each speaker, it would be ideal to train the HMM (hidden Markov model) parameters of each phoneme model directly from a very large amount of speech data (several hundred sentences) uttered by that one speaker, but in practice the user cannot be forced to produce such a large amount of speech. Therefore, speaker-independent phoneme models created from the data of many speakers are used as initial models, and these initial models are adapted to each speaker using about 100 words of training data from that speaker. In this way the speaker's phoneme models are generated.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

In general, the characteristics of a speaker's voice do not appear over all speech segments; they are observed as partial features of the speech, such as the average pitch of vowel segments or the nasal spectrum. In the conventional method, however, all phonemes of a registered speaker are evaluated equally when computing the word similarity used for speaker recognition. Because segments that express the speaker's characteristics well and segments that do not are evaluated equally, good speaker recognition performance is not always obtained.

[0007]

Accordingly, an object of the present invention is to provide a speaker recognition device that improves speaker recognition by strongly reflecting the registered speaker's characteristic utterance segments in the recognition result.

[0008]

SUMMARY OF THE INVENTION

The present invention is a text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result. To this text-prompted speaker recognition device are added a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker, and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

[0009]

With this configuration, a speaker recognition device can be realized that improves speaker recognition by strongly reflecting the registered speaker's characteristic utterance segments in the recognition result.

[0010]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention according to claim 1 is a text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result. To this device are added a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker, and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

[0011]

In this configuration, a weighting coefficient proportional to the displacement of the mean feature vector at speaker adaptation is attached to each phoneme model and stored. At speaker recognition time, the weighted phoneme models are concatenated to create a word model of the specified text; in the similarity calculation, the partial distance of each phoneme model is weighted by that coefficient, and the cumulative distance obtained by summing these weighted distances over the utterance interval is compared with the registered speaker's decision threshold to decide the speaker recognition result. Because the similarity of the registered speaker's characteristic utterance segments is thereby strongly reflected in speaker recognition, an improvement in speaker recognition accuracy can be expected.
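
Purely as an illustration of the concatenation step (the model representation, the dictionary format, and all names below are assumptions for exposition, not part of the patent), a minimal Python sketch:

```python
# Minimal sketch: build a word model for the prompted text by concatenating
# the registered speaker's weighted phoneme models (representation assumed).

def synthesize_word_model(text, dictionary, registered_models):
    """dictionary: word -> phoneme sequence, e.g. {"hai": ["h", "a", "i"]};
    registered_models: phoneme -> {"mean": feature vector, "weight": w}."""
    return [registered_models[ph] for ph in dictionary[text]]
```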

[0012]

The invention according to claim 2 provides a speaker-characteristic weighting-coefficient addition unit for the case where sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting speaker-independent phoneme data, or where the registered speaker's phoneme models have been prepared in advance; this unit computes the distance between the mean feature vectors of the same phoneme model for the unspecified speakers and for the registered speaker, and attaches a weighting coefficient proportional to that value to the phoneme model.

[0013]

With this configuration, when sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting speaker-independent phoneme data, the distance between mean feature vectors is computed for the same phoneme in the speaker-independent and registered-speaker phoneme models, and a weighting coefficient is attached to the registered speaker's phoneme model. The weighted similarity is then computed at speaker recognition time. Thus, even when speaker adaptation is not performed, weighting coefficients can be attached to the registered speaker's characteristic phoneme models.

[0014]

The invention according to claim 3 provides a speaker-characteristic weighting-coefficient addition unit for the case where word models are used in a text-dependent system; this unit puts the same-phoneme portions of the word models of the unspecified speakers and of the specific speaker into correspondence, computes the distances between the individual mean vectors, and applies weights proportional to those values to the corresponding parts of the word model.

[0015]

With this configuration, when word models are used for speaker recognition, time-axis matching is performed between the word model of the unspecified speakers and that of the specific speaker, distances are computed from the mean vectors of the corresponding phoneme portions, and weighting coefficients proportional to those values are attached to the word model phoneme by phoneme. This yields the same kind of word model as would be obtained by concatenating weighted phoneme models. By computing the weighted similarity at speaker recognition time, speaker recognition that emphasizes the registered speaker's characteristic phonemes can be realized even on a conventional fixed-text speaker recognition device.

[0016]

An embodiment of the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speaker recognition device according to an embodiment of the present invention, FIG. 2 is a circuit block diagram of the speaker recognition device, FIG. 3 is a conceptual diagram of the speaker-characteristic weighting of the speaker recognition device, FIG. 4 is a flowchart of the speaker registration operation of the speaker recognition device, and FIG. 5 is a flowchart of the speaker recognition operation of the speaker recognition device.

[0017]

In FIG. 1, reference numeral 1-1 denotes a speech input unit that captures the speaker's utterance; 1-2 denotes a feature vector calculation unit that computes a feature-vector sequence from the speech signal; 1-3 denotes a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; 1-4 denotes a speaker adaptation unit that adapts the speaker-independent phoneme models to the registered speaker during the speaker registration operation; 1-5 denotes a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the speaker-independent phoneme model to the registered speaker; 1-6 denotes a registered-speaker phoneme model storage unit that stores the adapted phoneme models of the registered speaker; 1-7 denotes a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; 1-8 denotes a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction; 1-9 denotes a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and 1-10 denotes a recognition result output unit that outputs the speaker recognition result.

[0018]

FIG. 2 is a circuit block diagram of the speaker recognition device, in which 2-1 is a microphone, 2-2 is a loudspeaker, 2-3 is a central processing unit (CPU), 2-4 is a writable memory (RAM), and 2-5 is a read-only memory (ROM).

[0019]

Of the blocks in the configuration diagram of FIG. 1, the speech input unit 1-1 is realized by the microphone 2-1, the speaker-independent phoneme model storage unit 1-3 by the ROM 2-5, and the registered-speaker phoneme model storage unit 1-6 by the RAM 2-4. The speaker adaptation unit 1-4, the speaker-characteristic weighting-coefficient addition unit 1-5, the specified-word model synthesis unit 1-7, the weighted similarity calculation unit 1-8, and the recognition result judgment unit 1-9 are realized by the CPU 2-3 executing a program written in the ROM 2-5 while accessing the ROM 2-5 and the RAM 2-4, and the recognition result output unit 1-10 is realized by the loudspeaker 2-2 or other output means.

[0020]

The operation of the speaker recognition device configured as described above will be explained separately for the speaker registration operation and for the speaker recognition process.

[0021]

1. Speaker registration operation

The operation of creating the registered speaker's weighted phoneme models will be explained with reference to the flowchart of FIG. 4. This process is executed when a speaker who is to become a recognition target registers his or her own voice pattern in the device.

[0022]

Step (1): A speech signal uttered by the user is captured through the speech input unit 1-1.

[0023]

Step (2): The speech signal is converted into a digital signal by the feature vector calculation unit 1-2, and speech feature vectors are computed.

[0024]

Step (3): The phoneme models of unspecified speakers are adapted to the specific speaker (the user) using the user's speech feature vectors. Specifically, in the speaker adaptation unit 1-4, the mean feature vector of each unspecified-speaker phoneme model stored in the speaker-independent phoneme model storage unit 1-3 is moved toward the mean feature vector of the same phoneme of the registering speaker, thereby creating the registered speaker's phoneme models.

[0025]

Step (4): The displacement of the mean feature vector caused by adapting each unspecified-speaker phoneme model to the registered speaker is computed by the speaker-characteristic weighting-coefficient calculation unit.

[0026]

Step (5): The speaker-characteristic weighting-coefficient addition unit 1-5 attaches to the registered speaker's phoneme model a distance-proportional weighting coefficient obtained by multiplying the displacement by a constant.

[0027]

Step (6): The phoneme models with the attached weighting coefficients are recorded in the registered-speaker phoneme model storage unit 1-6.

[0028]

In claim 2, when speaker adaptation is not performed because the specific speaker's recognition models are obtained directly, the vector distance between the mean feature vectors of the same phoneme model of the registered speaker and of the unspecified speakers is computed instead of the vector displacement at speaker adaptation, and a weighting coefficient proportional to that value is calculated. The phoneme models with the attached weighting coefficients are then recorded in the registered-speaker phoneme model storage unit 1-6.

[0029]

In claim 3, when word-level speech models are available from the start rather than phoneme-level recognition models, the time axes of the two word models are normalized, the distances between the mean feature vectors of the corresponding same-phoneme portions are computed, and weighting coefficients proportional to those values are attached to the individual phoneme sections of the registered speaker's word model.
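
A sketch of this claim-3 variant, under the simplifying assumption that time-axis normalization (for example, DTW) has already put the two word models into one-to-one per-phoneme correspondence; all names and the constant alpha are illustrative:

```python
import numpy as np

def weight_word_model(si_segments, spk_segments, alpha=1.0):
    """si_segments / spk_segments: lists of (phoneme, mean_vector) pairs
    describing the same word, assumed already aligned on a common time axis.
    Returns the speaker's word model with a weight attached per phoneme."""
    weighted = []
    for (ph, si_mean), (ph_chk, spk_mean) in zip(si_segments, spk_segments):
        assert ph == ph_chk, "segments must correspond after alignment"
        dist = np.linalg.norm(spk_mean - si_mean)  # distance of mean vectors
        weighted.append({"phoneme": ph, "mean": spk_mean,
                         "weight": alpha * dist})
    return weighted
```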

[0030]

2. Speaker recognition process

The operation during the speaker recognition process will be explained with reference to the flowchart of FIG. 5.

[0031]

Step (1): The device specifies to the speaker the word to be uttered.

Step (2): The speaker's weighted phoneme models are concatenated according to a dictionary to create a word model of the specified word.

[0032]

Step (3): The speech signal of the word uttered by the speaker is captured through the speech input unit 1-1.

[0033]

Step (4): The feature vector calculation unit 1-2 computes the feature-vector time series of the input speech.

[0034]

Step (5): The weighted similarity calculation unit 1-8 performs time-axis matching between the time series of speech feature vectors and the word model, and obtains the correspondence between the feature-vector time series and each phoneme of the word model.

[0035]

Step (6): The weighted cumulative distance is computed along the correspondence obtained in step (5).

[0036]

Specifically, the distance between the time series of speech feature vectors and the corresponding partial phoneme model within the word model is computed, and the phoneme-unit vector distances are accumulated over the utterance interval. In doing so, each distance is multiplied by the weighting coefficient of its phoneme model, which emphasizes the speaker's characteristics.

[0037]

Cumulative distance = sum, over the utterance interval, of (phoneme-unit distance × weighting coefficient).

Alternatively, by dividing this cumulative distance by the sum of each phoneme model's weighting coefficient multiplied by its number of occurrences, the cumulative distance is normalized, and the decision threshold for the recognition result can be used unchanged whether or not weighting is applied.

[0038]

Normalized cumulative distance = cumulative distance ÷ (sum, over the utterance interval, of (weighting coefficient × occurrence count of the phoneme model)).

This weighting emphasizes the phoneme portions that strongly reflect the speaker's characteristics. As a result, the cumulative distance when the speaker is the registered person is affected little, while the cumulative distance when the speaker is an impostor increases; the contrast between the registered person and an impostor becomes clearer, and an improvement in speaker recognition accuracy can be expected.
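
The two formulas above can be read as the following Python sketch. It assumes a Euclidean frame-to-model distance and per-frame accumulation (so the denominator equals the sum of weight × occurrence count), and it assumes the alignment from step (5) is given as a frame-to-phoneme index map; these choices and all names are illustrative, not from the patent.

```python
import numpy as np

def weighted_cumulative_distance(frames, alignment, word_model):
    """frames:     feature-vector time series of the input utterance
    alignment:  alignment[t] = index of the word-model phoneme that frame t
                is matched to (result of the time-axis matching in step (5))
    word_model: list of {"mean": vector, "weight": w} per phoneme
    Returns the normalized cumulative distance of paragraph [0038]."""
    num, den = 0.0, 0.0
    for t, ph_idx in enumerate(alignment):
        m = word_model[ph_idx]
        d = np.linalg.norm(frames[t] - m["mean"])  # phoneme-unit distance
        num += m["weight"] * d   # weighted distance, accumulated over time
        den += m["weight"]       # weight x occurrence count, for normalization
    return num / den
```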

[0039]

When discrete-distribution HMMs (hidden Markov models) are used as the phoneme models, the weighting coefficient is computed from the distance between the mean vectors of each state of the phoneme model before and after adaptation, and is attached to the model. A probabilistic distance measure such as the Kullback-Leibler divergence may also be used as the distance between HMM models. The weighting coefficient Zj of state j is given by Equation 1.

[0040]

(Equation 1)

[0041]

At speaker recognition time, the optimal state sequence is first obtained by the Viterbi algorithm with the weighting coefficients ignored, and the output likelihoods along that path are then weighted and accumulated. For the feature-vector time series y1, ..., yT, the weighted cumulative likelihood P(y) is given by Equation 2. To normalize this cumulative likelihood, it is divided by the sum, over the optimal state sequence, of the products of each state's duration and its weighting coefficient.

[0042]

(Equation 2)
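
Since Equations 1 and 2 are shown as images in the original publication and are not reproduced here, only the prose of paragraph [0041] can be sketched. Assuming per-frame output log-likelihoods along the Viterbi best path and the normalization described above (both assumptions), a Python rendering might be:

```python
def weighted_path_likelihood(log_out, path, state_weights):
    """log_out[t]: output log-likelihood of frame t in state path[t], computed
    along the Viterbi best path found with the weights ignored.
    path[t]:      state index visited at frame t.
    state_weights[j]: weighting coefficient Zj of state j.
    Returns the weighted cumulative log-likelihood, normalized by the sum of
    (state duration x weighting coefficient) over the path."""
    num = sum(state_weights[j] * lp for j, lp in zip(path, log_out))
    den = sum(state_weights[j] for j in path)  # each frame adds its state's weight
    return num / den
```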

[0043]

This yields the normalized weighted likelihood for the case of HMM phoneme models.

Step (7): The recognition result judgment unit 1-9 compares the cumulative distance with the threshold for judging whether the speaker is the registered person, and determines the recognition result.

[0044]

EFFECTS OF THE INVENTION

According to the present invention, the inter-speaker distances of the phonemes that best express the registered speaker's voice characteristics are emphasized, so an improvement in the recognition accuracy for registered speakers can be expected. Moreover, since this weighting process uses no prior phonetic knowledge about the speaker's characteristics, it can flexibly accommodate individual differences between speakers.

[0045]

According to the invention of claim 2, when phoneme models can be created directly from the speech data of a specific speaker, the speaker's characteristics are analyzed by comparison with general (unspecified) speakers and the phoneme models are weighted accordingly, so speaker recognition can be performed with higher accuracy.

[0046]

According to the invention of claim 3, even in a conventional text-dependent speaker recognition device, the word models of the registered speaker and of general (unspecified) speakers are compared and weights are attached to the phoneme portions characteristic of the speaker, so highly accurate speaker recognition that emphasizes the speaker's characteristics can be performed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the configuration of a speaker recognition device according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of the speaker recognition device according to the embodiment of the present invention.

FIG. 3 is a conceptual diagram of the speaker-characteristic weighting of the speaker recognition device according to the embodiment of the present invention.

FIG. 4 is a flowchart of the speaker registration operation of the speaker recognition device according to the embodiment of the present invention.

FIG. 5 is a flowchart of the speaker recognition operation of the speaker recognition device according to the embodiment of the present invention.

EXPLANATION OF REFERENCE NUMERALS

1-1 Speech input unit
1-2 Feature vector calculation unit
1-3 Speaker-independent phoneme model storage unit
1-4 Speaker adaptation unit
1-5 Speaker-characteristic weighting-coefficient addition unit
1-6 Registered-speaker phoneme model storage unit
1-7 Specified-word model synthesis unit
1-8 Weighted similarity calculation unit
1-9 Recognition result judgment unit
1-10 Recognition result output unit
2-1 Microphone
2-2 Loudspeaker
2-3 Central processing unit (CPU)
2-4 Writable memory (RAM)
2-5 Read-only memory (ROM)

Claims (3)

[Claims]

1. A text-prompted speaker recognition device that judges whether a speaker is the registered person by having the speaker utter a word the device specifies each time, comprising: a speech input unit that captures the speaker's utterance; a feature vector calculation unit that computes a feature-vector sequence from the speech signal; a speaker-independent phoneme model storage unit that stores the phoneme models of unspecified speakers; a speaker adaptation unit that adapts the phoneme models of unspecified speakers to the registered speaker during the speaker registration operation; a registered-speaker phoneme model storage unit that stores the registered speaker's phoneme models; a specified-word model synthesis unit that synthesizes a word model of the specified text during the speaker recognition operation; a recognition result judgment unit that decides the speaker recognition result by comparing the similarity between the speaker's voice and the word model with a decision threshold; and a recognition result output unit that outputs the speaker recognition result; the device further comprising: a speaker-characteristic weighting-coefficient addition unit that attaches to each phoneme model a weighting coefficient proportional to the displacement of its mean feature vector when the speaker adaptation unit adapted the unspecified-speaker phoneme model to the registered speaker; and a weighted similarity calculation unit that computes the similarity between the feature-vector sequence of the input speech and the word model as a cumulative distance obtained by weighting the partial distance of each phoneme-model unit and accumulating the weighted values in the time-series direction.

2. The speaker recognition device according to claim 1, comprising a speaker-characteristic weighting-coefficient addition unit that, when sufficient speech data of the specific speaker is available and the specific speaker's phoneme models can be created directly without adapting unspecified-speaker phoneme data, or when the registered speaker's phoneme models have been prepared in advance, computes the distance between the mean feature vectors of the same phoneme model of the unspecified speakers and of the registered speaker, and attaches a weighting coefficient proportional to that value to the phoneme model.

3. The speaker recognition device according to claim 1, comprising a speaker-characteristic weighting-coefficient addition unit that, when word models are used in a text-dependent manner, puts the same-phoneme portions of the word models of the unspecified speakers and of the specific speaker into correspondence, computes the distances between the individual mean vectors, and partially applies weights proportional to those values to the word model.
JP10147083A 1998-05-28 1998-05-28 Speaker recognition unit Pending JPH11338492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP10147083A JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP10147083A JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Publications (1)

Publication Number Publication Date
JPH11338492A (en) 1999-12-10

Family

ID=15422094

Family Applications (1)

Application Number Title Priority Date Filing Date
JP10147083A Pending JPH11338492A (en) 1998-05-28 1998-05-28 Speaker recognition unit

Country Status (1)

Country Link
JP (1) JPH11338492A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10572812B2 (en) 2015-03-19 2020-02-25 Kabushiki Kaisha Toshiba Detection apparatus, detection method, and computer program product
CN109313902A (en) * 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface
CN113793615A (en) * 2021-09-15 2021-12-14 北京百度网讯科技有限公司 Speaker recognition method, model training method, device, equipment and storage medium
CN113793615B (en) * 2021-09-15 2024-02-27 北京百度网讯科技有限公司 Speaker recognition method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
AU707355B2 (en) Speech recognition
US6029124A (en) Sequential, nonparametric speech recognition and speaker identification
US5946654A (en) Speaker identification using unsupervised speech models
Masuko et al. Imposture using synthetic speech against speaker verification based on spectrum and pitch
US20060206326A1 (en) Speech recognition method
WO2007046267A1 (en) Voice judging system, voice judging method, and program for voice judgment
JPH075892A (en) Voice recognition method
JP2006171750A (en) Feature vector extracting method for speech recognition
JP2001166789A (en) Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end
US7072750B2 (en) Method and apparatus for rejection of speech recognition results in accordance with confidence level
JP3130524B2 (en) Speech signal recognition method and apparatus for implementing the method
JPH11175082A (en) Voice interaction device and voice synthesizing method for voice interaction
KR20220134347A (en) Speech synthesis method and apparatus based on multiple speaker training dataset
JP2002236494A (en) Speech section discriminator, speech recognizer, program and recording medium
JP4461557B2 (en) Speech recognition method and speech recognition apparatus
JP2003177779A (en) Speaker learning method for speech recognition
JPH11338492A (en) Speaker recognition unit
JP4391179B2 (en) Speaker recognition system and method
JP3090119B2 (en) Speaker verification device, method and storage medium
JP2001255887A (en) Speech recognition device, speech recognition method and medium recorded with the method
JP2003044078A (en) Voice recognizing device using uttering speed normalization analysis
JP4749990B2 (en) Voice recognition device
JP2006010739A (en) Speech recognition device
JP3036509B2 (en) Method and apparatus for determining threshold in speaker verification
JP4449380B2 (en) Speaker normalization method and speech recognition apparatus using the same