JPH05323990A - Talker recognizing method

Info

Publication number: JPH05323990A
Application number: JP4344586A
Authority: JP
Inventors: Tomoko Matsui (松井知子); Sadaoki Furui (古井貞煕)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1992-03-19
Filing date: 1992-12-24
Publication date: 1993-12-07

Abstract

PURPOSE: To prevent impersonation in which a recording of another person's voice is played back, a weakness of text-dependent (fixed utterance content) methods. CONSTITUTION: For each speaker to be registered, training speech data and a kana (Japanese syllabary) or phonetic-symbol sequence representing its utterance content are input to a supervised speaker adaptation unit 1. Using these, the phoneme/syllable models from a speaker-independent phoneme/syllable model storage unit 2 are adapted to the speaker and stored in a storage unit 3 together with the speaker's ID. For recognition, a recognition sentence/word generation unit 4 generates a new sentence or word at each recognition and presents it to the speaker, who utters it, and a feature parameter extraction unit 5 converts the speech into a feature parameter sequence. A speech model generation unit 6 concatenates the adapted phoneme/syllable models in the storage unit 3 according to the generated sentence or word, a calculation unit 7 computes the similarity between that model and the feature parameter sequence of the input speech, and the speaker is recognized on the basis of the result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speaker recognition method used, for example, to recognize who a visitor is from speech over an intercom, or to confirm from input speech that the speaker is the same person who presented a personal identification number. The method converts input speech into a representation using feature parameters, computes the similarity between the input speech in that representation and voice characteristics generated in advance for each speaker in the same representation, and recognizes the speaker who uttered the input speech. The invention also relates to a speaker recognition method that effectively normalizes the similarity when judging whether a prompted sentence or word was uttered correctly.

[0002]

[Prior Art] Conventional speaker recognition methods can be classified into text-dependent methods, in which the utterance content is fixed in advance, and text-independent methods, which can use speech with arbitrary utterance content. In a text-dependent method, the sentence or set of words to be uttered is usually decided in advance. A widely used technique converts the speech uttered by the speaker into a time series of feature parameters (for example, cepstrum and pitch), computes its similarity to each registered speaker's feature parameter time series for the same sentence or words, and decides who the speaker is from that value.

[0003] Because this method uses the similarity between feature parameter time series of speech for the same sentence or words, speakers can be recognized with high accuracy relatively easily. It has a major drawback, however: once a registered speaker's voice has been recorded, anyone can easily impersonate that speaker by playing back the recording. In a text-independent method, on the other hand, a technique such as vector quantization is usually used to obtain the distribution of the feature parameters contained in the speech (a sentence, for example) uttered by the speaker, and the decision is based on its similarity to the distribution of each registered speaker's feature parameters.

[0004] With this method the sentence or words uttered for speaker recognition can be changed every time. Because any utterance content is accepted, however, this method also cannot avoid the drawback that recorded speech can be used to impersonate another person.

[0005]

[Means for Solving the Problems] According to the invention of claim 1, phoneme or syllable models adapted to the voice of each speaker to be recognized are created and registered. The similarity is computed between the speech model of a sentence or word, generated by concatenating those models, and the input speech converted into a representation using feature parameters, and the speaker who uttered the input speech is recognized.

[0006] According to the invention of claim 3 or 4, phoneme-group or syllable-group models adapted to the voice of each speaker to be recognized are created and registered. The similarity is computed between the speech model of a sentence or word, generated by concatenating those models, and the input speech converted into a representation using feature parameters, and the speaker who uttered the input speech is recognized. In the invention of claim 3, the distribution of feature parameters in the training speech is approximated by a plurality of Gaussian distributions, and each phoneme-group or syllable-group model approximates the feature parameter distribution of that group by a weighted sum of these Gaussian distributions. In the invention of claim 4, the distribution of feature parameters in the training speech is approximated by a codebook, and each phoneme-group or syllable-group model approximates the feature parameter distribution of that group by a weighted sum of the codebook elements.

[0007] According to the invention of claim 6, the input speech from which the speaker is to be recognized is converted into a representation using feature parameters. Two kinds of similarity are then computed for the input speech in that representation: its similarity to each speaker's voice characteristics, which do not depend on the utterance content (the words spoken), and its similarity to the words that should be uttered, both expressed in the same representation. These similarities are combined to recognize the speaker who uttered the input speech.

[0008] According to the invention of claim 8, a speech model independent of utterance content is created from the speaker's training speech data, and the similarity between this model and the input speech converted into a representation using feature parameters is computed. This value is subtracted from the similarity between the same input speech and the sentence or word speech model formed by concatenating phoneme or syllable (group) models. The subtraction normalizes the variation in similarity values caused by differences in the content of the sentence or word, the recording time, the transmission system, the microphone, and other test conditions.
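
The subtraction described for claim 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: similarities are taken to be log-likelihood-style scores, so that a change in recording conditions shifts both scores by a comparable amount. The function name and all numeric scores are hypothetical and do not come from the patent.

```python
def normalized_similarity(sentence_model_score: float,
                          text_independent_score: float) -> float:
    """Subtract the text-independent model score from the sentence-model score.

    Both arguments are assumed to be similarity scores (for example,
    per-utterance log-likelihoods) computed on the same input speech.
    """
    return sentence_model_score - text_independent_score


# Hypothetical raw scores for the same true speaker in two sessions.
# Session B is assumed to use a different microphone, lowering both scores.
session_a = normalized_similarity(-42.0, -45.0)
session_b = normalized_similarity(-51.5, -54.0)

print(session_a, session_b)
```

In this made-up example the raw sentence-model scores differ by 9.5 between sessions while the normalized scores differ by 0.5, which is the kind of stabilization the paragraph describes.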

[0009] The form of each speaker's utterance-content-independent speech model depends on the invention. For the invention of claim 1, it takes the same form as the model of each phoneme or syllable: speech represented as a time series of feature parameters, or that representation expressed as a hidden Markov model, for example. For the invention of claim 3, it is the distribution of feature parameters in the training speech approximated by a plurality of Gaussian distributions. For the invention of claim 4, it is that distribution approximated by a codebook. For the invention of claim 6, it is that distribution approximated either by a codebook or by a plurality of Gaussian distributions.

[0010] In each of the above inventions, the sentence or word to be uttered is indicated to the speaker at each recognition, and recognition is performed on the resulting input speech.

[0011]

[Embodiments] The invention is now described in detail with reference to the drawings. FIG. 1 shows an embodiment of the inventions of claims 1 and 2. In the speaker registration stage, training speech data for each speaker to be registered is input to the supervised speaker adaptation unit 1, which creates phoneme or syllable models adapted to that speaker. That is, the utterance content of the training speech data is decided in advance, and that content, written as a sequence of kana or phonetic symbols, is input to the supervised speaker adaptation unit 1 together with the speech. The speaker-independent phoneme/syllable model storage unit 2 holds models of each phoneme or syllable created in advance from the speech of many speakers. Each phoneme or syllable model represents speech as a time series of feature parameters, or expresses that representation as a hidden Markov model, for example. Any of various established methods can be used to adapt these phoneme/syllable models to the speaker being registered, for example the method described in Jean-Luc Gauvain and Chin-Hui Lee, "Improved Acoustic Modeling With Bayesian Learning," Proc. IEEE ICASSP 92, 1992. The phoneme or syllable models adapted to the speaker are registered in the adapted phoneme/syllable model storage unit 3.

[0012] To allow the speaker's identity to be confirmed, an identification number unique to the speaker (a personal identification number, hereinafter called the speaker ID) is entered by key or similar means, and this ID is stored in the storage unit 3 together with the speaker's adapted phoneme or syllable models. Next, in the speaker recognition stage, the sentence or word to be uttered is either decided in advance or newly generated by the recognition sentence/word generation unit 4 and shown to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts it into a representation using feature parameters such as cepstrum and pitch. Meanwhile, the sentence/word speech model generation unit 6 concatenates the phoneme or syllable models registered in the adapted phoneme/syllable model storage unit 3 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and so generates a speech model of that sentence or word. For speaker verification, that is, deciding whether the voice is that of the claimed person, the speaker enters his or her ID by key or similar means, and the phoneme or syllable models adapted to the registered speaker corresponding to that ID are input to the sentence/word speech model generation unit 6. For speaker identification, that is, deciding which previously registered speaker the voice belongs to, the phoneme or syllable models of all registered speakers are input to the sentence/word speech model generation unit 6 one after another.
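
The concatenation step and the difference between verification and identification can be sketched as follows. This is a minimal illustration, not the patent's implementation: the storage layout, the representation of a model as a list of state labels, and the names `build_sentence_model` and `candidate_speakers` are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical storage layout: speaker ID -> phoneme -> adapted model.
# A model is shown as a list of state labels; a real system would hold
# HMM states or feature-parameter templates instead.
PhonemeModel = List[str]
SpeakerModels = Dict[str, PhonemeModel]


def build_sentence_model(phonemes: List[str],
                         models: SpeakerModels) -> List[str]:
    """Concatenate one speaker's adapted phoneme models in prompt order."""
    sentence_model: List[str] = []
    for phoneme in phonemes:
        sentence_model.extend(models[phoneme])
    return sentence_model


def candidate_speakers(storage: Dict[str, SpeakerModels],
                       claimed_id: Optional[str] = None) -> List[str]:
    """Verification uses only the claimed speaker; identification uses all."""
    if claimed_id is not None:
        return [claimed_id]
    return list(storage)


storage = {
    "1234": {"a": ["a1@1234", "a2@1234"], "k": ["k1@1234"]},
    "5678": {"a": ["a1@5678", "a2@5678"], "k": ["k1@5678"]},
}
prompt = ["a", "k", "a"]  # phoneme sequence of the prompted word

for speaker_id in candidate_speakers(storage, claimed_id="1234"):
    print(speaker_id, build_sentence_model(prompt, storage[speaker_id]))
```

Because the sentence model is assembled at recognition time from phoneme-level models, the prompt can change at every attempt without new training speech.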

[0013] The feature parameter time series obtained by the feature parameter extraction unit 5 and the sentence or word speech model generated by the sentence/word speech model generation unit 6 are input to the similarity calculation unit 7, which computes their degree of similarity. A concrete method is described, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," Section 3.1.3, pp. 40-50, Institute of Electronics, Information and Communication Engineers, 1988. The computed similarity value is held temporarily in the similarity storage unit 8 and then sent to the speaker recognition decision unit 9. For speaker verification, the decision unit 9 uses the speaker ID to read from the threshold storage unit 10 a threshold indicating the range of similarity variation that can be regarded as the claimed person's voice. It compares this threshold with the similarity, computed as described above, between the input speech of the recognition sentence or word and the speech model of that sentence or word formed by concatenating the claimed person's phoneme or syllable models. If the similarity is greater than the threshold, the speech is judged to be that person's; if it is smaller, the speech is judged to be another person's. For speaker identification, the similarities between the input speech and the sentence or word speech models formed from the phoneme or syllable models adapted to each registered speaker are all compared, the speaker with the greatest similarity is selected, and the speech is judged to have been uttered by that speaker.
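
The scoring and decision steps can be sketched as below. The patent defers the similarity computation to a cited textbook, so the scoring here is an assumed stand-in: a best-path (Viterbi-style) log-likelihood through a left-to-right chain of unit-variance Gaussian states, with no transition probabilities. The state means, the threshold, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]


def state_log_likelihood(frame: Frame, mean: Frame) -> float:
    """Log-density of a unit-variance spherical Gaussian centred on `mean`."""
    squared = sum((x - m) ** 2 for x, m in zip(frame, mean))
    return -0.5 * squared - 0.5 * len(frame) * math.log(2.0 * math.pi)


def sentence_similarity(frames: List[Frame], state_means: List[Frame]) -> float:
    """Per-frame log-likelihood of the best left-to-right state path.

    Each frame either stays in the current state or advances to the next;
    the path must start in the first state and end in the last one.
    """
    neg_inf = float("-inf")
    n_states = len(state_means)
    best = [neg_inf] * n_states
    best[0] = state_log_likelihood(frames[0], state_means[0])
    for frame in frames[1:]:
        updated = [neg_inf] * n_states
        for s in range(n_states):
            incoming = best[s] if s == 0 else max(best[s], best[s - 1])
            if incoming != neg_inf:
                updated[s] = incoming + state_log_likelihood(frame, state_means[s])
        best = updated
    return best[-1] / len(frames)


def verify(similarity: float, threshold: float) -> bool:
    """Speaker verification: accept only if the similarity exceeds the threshold."""
    return similarity > threshold


def identify(similarities: Dict[str, float]) -> str:
    """Speaker identification: pick the registered speaker with the top score."""
    return max(similarities, key=similarities.get)


# Hypothetical one-dimensional features and two registered speakers.
frames = [[0.0], [0.0], [1.0], [1.0]]
sentence_models = {"A": [[0.0], [1.0]], "B": [[1.0], [0.0]]}
scores = {sid: sentence_similarity(frames, m) for sid, m in sentence_models.items()}

print(identify(scores), verify(scores["A"], threshold=-1.0))
```

Dividing by the number of frames keeps scores comparable across prompts of different length, which matters when a per-speaker threshold is reused for newly generated sentences; that design choice is an assumption of this sketch.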

[0014] Next, an embodiment of the invention of claim 3 is described with reference to FIG. 2. In the speaker registration stage, training speech data for each speaker to be registered is input to the feature parameter extraction unit 11. The utterance content of the training speech data is decided in advance. The feature parameter extraction unit 11 converts the input speech into a representation using feature parameters such as cepstrum and pitch. A plurality of Gaussian distributions is then used to represent the characteristics of the speaker's voice. That is, the training speech data, converted into a time series of feature parameters, is input to the Gaussian mixture creation unit 12, and the distribution of the feature parameters contained in the training speech data is represented by a combination of several Gaussian distributions (hereinafter called a "Gaussian mixture").
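
How a Gaussian mixture scores a feature vector can be sketched as follows. The sketch assumes diagonal covariances and strictly positive mixture weights, neither of which the patent specifies, and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def diagonal_gaussian_logpdf(x: Vector, mean: Vector, variance: Vector) -> float:
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, variance))


def mixture_log_likelihood(x: Vector, weights: List[float],
                           means: List[Vector], variances: List[Vector]) -> float:
    """log sum_k w_k N(x; mean_k, var_k), computed with the log-sum-exp trick."""
    terms = [math.log(w) + diagonal_gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component mixture over two-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = mixture_log_likelihood([0.1, -0.1], weights, means, variances)
far = mixture_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)
```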

[0015] Next, the training speech data, a text giving its utterance content as a sequence of kana or phonetic symbols, and the Gaussian mixture are input to the phoneme-group model creation unit 13. For each phoneme group, which collects several phonemes with similar feature parameter values, the phoneme-group model creation unit 13 creates a phoneme-group model that represents the group's feature parameter distribution as a weighted sum of the elements of the Gaussian mixture. It determines the weight of each element so that, when the phoneme-group models are concatenated in time according to the text of the training speech data, they represent the feature parameter time series of the training speech data as accurately as possible. For this, the method described in Yasuhiro Minami, Tatsuo Matsuoka and Kiyohiro Shikano, "Evaluation of concatenated-training HMMs using a speaker-independent continuous speech database," Institute of Electronics, Information and Communication Engineers, Speech Research Group materials, SP91-113, 1992, can be used, for example. For the phoneme-group models created in this way, an allowable range is estimated for the speaker ID that identifies the speaker who uttered the training speech, and that value is stored, as a threshold on the similarity, in the threshold storage unit 8 together with the speaker ID.
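
The idea of giving each phoneme group its own weights over a shared set of Gaussians can be sketched as a single weight re-estimation step. This is a simplification and an assumption: it presumes the frames belonging to one phoneme group are already known, whereas the patent determines the weights jointly over the concatenated models using the cited training method. The function names and data are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def diagonal_gaussian_logpdf(x: Vector, mean: Vector, variance: Vector) -> float:
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, variance))


def update_group_weights(frames: List[Vector], weights: List[float],
                         means: List[Vector],
                         variances: List[Vector]) -> List[float]:
    """One EM-style update of a group's weights; the Gaussians stay fixed.

    The new weight of each component is its average posterior probability
    over the frames assigned to the phoneme group.
    """
    totals = [0.0] * len(weights)
    for x in frames:
        joint = [w * math.exp(diagonal_gaussian_logpdf(x, m, v))
                 for w, m, v in zip(weights, means, variances)]
        evidence = sum(joint)
        for k, value in enumerate(joint):
            totals[k] += value / evidence
    return [t / len(frames) for t in totals]


# Shared (speaker-level) Gaussians and frames assumed to belong to one group.
means = [[0.0], [5.0]]
variances = [[1.0], [1.0]]
group_frames = [[0.1], [-0.2], [0.0], [5.1]]

new_weights = update_group_weights(group_frames, [0.5, 0.5], means, variances)
print(new_weights)
```

Three of the four frames lie near the first Gaussian, so the update moves most of the weight to that component while the weights still sum to one.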

[0016] Next, in the speaker recognition stage, the sentence or word to be uttered is either decided in advance or newly generated by the recognition sentence/word generation unit 4 and shown to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts it into the same representation as that used by the feature parameter extraction unit 11. At the same time, the sentence/word speech model generation unit 6 concatenates the phoneme-group models stored in the phoneme-group model storage unit 3 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and generates a speech model of that sentence or word. For speaker verification, that is, deciding whether the voice is that of the claimed person, the speaker enters his or her ID by key or similar means, and the phoneme-group models adapted to the registered speaker corresponding to that ID are input to the sentence/word speech model generation unit 11. For speaker identification, that is, deciding which previously registered speaker the voice belongs to, the phoneme or syllable models of all registered speakers are input to the sentence/word speech model generation unit 11 one after another.

[0017] The feature parameter time series obtained by the feature parameter extraction unit 5 and the sentence or word speech model generated by the sentence/word speech model generation unit 6 are then input to the similarity calculation unit 7, which computes the similarity, and the speaker recognition decision unit 9 identifies or recognizes the speaker in the same way as in FIG. 1. FIG. 3 shows an embodiment of the invention of claim 4. In the embodiment of FIG. 2 the distribution of the feature parameters contained in the training speech data is represented by a Gaussian mixture. In the embodiment of FIG. 3, the feature parameter time series of the training speech data is instead input to the codebook creation unit 14, and the distribution is represented by a vector quantization codebook. This codebook is input to the phoneme-group model creation unit 13 in place of the Gaussian mixture, and phoneme models are created that represent the feature parameter distribution as a weighted sum of codebook elements rather than of Gaussian mixture elements. The rest is the same as in FIG. 2. The Gaussian mixture and the codebook can be created by, for example, the method described in Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," Institute of Electronics, Information and Communication Engineers, Speech Research Group materials, SP91-89, 1991.
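
A weighted sum over codebook elements can be pictured as a discrete distribution over codewords. The sketch below estimates such weights by quantizing each frame to its nearest codeword and counting relative frequencies. That estimator, the squared Euclidean distance, and the toy codebook are assumptions for illustration; the patent obtains the weights through the cited training method.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(frame: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook element closest to the frame."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(frame, codebook[k]))


def codeword_weights(frames: List[Vector], codebook: List[Vector]) -> List[float]:
    """Relative frequency of each codeword among the quantized frames."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    return [c / len(frames) for c in counts]


# Hypothetical speaker codebook and frames assumed to belong to one phoneme group.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
group_frames = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]]

print(codeword_weights(group_frames, codebook))
```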

[0018] In the embodiments of FIGS. 2 and 3, syllable-group models may be used instead of phoneme-group models. Next, an embodiment of the invention of claim 6 is described with reference to FIG. 4. In the speaker registration stage, training speech data for each speaker to be registered is input to the feature parameter extraction unit 11 and converted into a time series of feature parameters (for example, cepstrum and pitch). This time series is input to the speaker feature calculation unit 15, which extracts the characteristics of the speaker's voice that are independent of the utterance content. These characteristics can be extracted by, for example, the method described in Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," Institute of Electronics, Information and Communication Engineers, Speech Research Group materials, SP91-89, 1991, that is, by representing the distribution of the feature parameters with a vector quantization codebook or with a plurality of Gaussian distributions. The extracted voice characteristics are stored in the speaker feature storage unit 16 together with the speaker's ID.
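
One common way to turn a vector quantization codebook into a text-independent speaker score is the average quantization distortion of the input frames against that speaker's codebook. The sketch below uses the negated distortion as the similarity. This particular scoring rule is an assumption for illustration, not a detail stated in the patent, and the codebooks are made up.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(frames: List[Vector], codebook: List[Vector]) -> float:
    """Mean squared distance from each frame to its nearest codeword."""
    total = sum(min(squared_distance(f, c) for c in codebook) for f in frames)
    return total / len(frames)


def speaker_similarity(frames: List[Vector], codebook: List[Vector]) -> float:
    """Higher is more similar: the negated average distortion."""
    return -average_distortion(frames, codebook)


# Hypothetical codebooks stored with each speaker ID.
speaker_codebooks: Dict[str, List[Vector]] = {
    "A": [[0.0], [2.0]],
    "B": [[5.0], [6.0]],
}
input_frames = [[0.0], [1.0], [2.0]]

similarities = {sid: speaker_similarity(input_frames, cb)
                for sid, cb in speaker_codebooks.items()}
print(similarities)
```

Because no frame order or prompt text enters the computation, the score depends only on how the speaker's feature vectors are distributed, which is what makes it independent of utterance content.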

[0019] Separately, speech data from many speakers is input to the feature parameter extraction unit 17 and converted into a time series of feature parameters, which is then input to the phoneme/syllable model creation unit 18 to create phoneme/syllable models that are common to many speakers, that is, speaker-independent. For this, the method described in Yasuhiro Minami, Tatsuo Matsuoka and Kiyohiro Shikano, "Evaluation of concatenated-training HMMs using a speaker-independent continuous speech database," Institute of Electronics, Information and Communication Engineers, Speech Research Group materials, SP91-113, 1992, can be used, for example. That is, the utterance content of the training speech data is decided in advance, and that content, written as a sequence of kana or phonetic symbols, is input to the phoneme/syllable model creation unit 18 together with the speech. Each phoneme or syllable model represents speech as a time series of feature parameters, or expresses that representation as a hidden Markov model, for example. The phoneme/syllable models created are stored in the phoneme/syllable model storage unit 2.

[0020] Next, in the speaker recognition stage, the sentence or word to be uttered is either decided in advance or newly generated by the recognition sentence/word generation unit 4 and shown to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts it into a representation using the same feature parameters as the feature parameter extraction unit 11. Meanwhile, the sentence/word speech model generation unit 6 concatenates the phoneme or syllable models registered in the phoneme/syllable model storage unit 2 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and generates a speech model of that sentence or word.

[0021] The feature parameter time series obtained by the feature parameter extraction unit 5 and the sentence or word speech model generated by the sentence/word speech model generation unit 6 are input to the utterance content similarity calculation unit 19, which computes their degree of similarity. A concrete method is described, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," Section 3.1.3, pp. 40-50, Institute of Electronics, Information and Communication Engineers, 1988. The computed similarity value is held temporarily in the utterance content similarity storage unit 21.

[0022] Meanwhile, the feature parameter time series obtained by the feature parameter extraction unit 5 and the voice characteristics of each speaker held in the speaker feature storage unit 16 are sent to the speaker similarity calculation unit 22. For speaker verification, the speaker similarity calculation unit 22 uses the speaker ID to compute the similarity to that speaker's voice characteristics. For speaker identification, it computes the similarity to the voice characteristics of every speaker held in the speaker feature storage unit 16. The computed similarity values are held temporarily in the speaker similarity storage unit 23.

【００２３】Next, the utterance content similarity stored in the utterance content similarity storage unit 21 and the similarity with the speaker's voice features stored in the speaker similarity storage unit 23 are input to the similarity combination unit 24, which calculates an overall similarity. Various methods of calculating the overall similarity are conceivable. For example: (1) calculate a weighted average of the two similarities; or (2) set a threshold on the utterance content similarity in advance, give the overall similarity an extremely small value such as 0 when the utterance content similarity is below the threshold, and use the speaker similarity as the overall similarity only when the utterance content similarity exceeds the threshold. The calculated overall similarity is input to the speaker recognition determination unit 25, which makes the speaker recognition decision. For speaker verification, the threshold indicating the range of similarity variation that can be regarded as the person's own voice is read from the threshold storage unit 10 according to the speaker ID and compared with the overall similarity calculated as described above. If the overall similarity is larger than the threshold read out, the speech is judged to be the person's own voice; if it is smaller, the speech is judged to be another person's voice. For speaker identification, the overall similarities between the input speech and all registered (stored) speakers are compared, and the speaker with the largest overall similarity is selected and judged to be the one who spoke.
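The two combination methods and the two decision rules described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `weight` parameter, and the `floor` default are hypothetical, and the text does not say how a similarity exactly equal to a threshold is handled (here it is treated as not exceeding it).

```python
from typing import Dict


def combine_weighted(content_sim: float, speaker_sim: float, weight: float = 0.5) -> float:
    """Method (1): weighted average of the two similarities.

    `weight` is a hypothetical tuning parameter in [0, 1]; the text gives no value.
    """
    return weight * content_sim + (1.0 - weight) * speaker_sim


def combine_gated(content_sim: float, speaker_sim: float,
                  content_threshold: float, floor: float = 0.0) -> float:
    """Method (2): use the speaker similarity only if the utterance content matched."""
    if content_sim > content_threshold:
        return speaker_sim
    return floor  # "an extremely small value such as 0"


def verify(overall_sim: float, speaker_threshold: float) -> bool:
    """Speaker verification: accept when the overall similarity exceeds the threshold."""
    return overall_sim > speaker_threshold


def identify(overall_by_speaker: Dict[str, float]) -> str:
    """Speaker identification: return the registered speaker with the largest score."""
    return max(overall_by_speaker, key=overall_by_speaker.get)
```

If the similarities are log-likelihoods, which are typically negative, `floor` would need to be set well below any genuine score rather than left at 0.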

【００２４】Next, an embodiment of the invention of claim 8 is described with reference to FIG. 5. First, at the speaker registration stage, training speech data for each speaker to be registered is input to the feature parameter extraction unit 11. The utterance content of the training speech data is determined in advance. The feature parameter extraction unit 11 converts the input speech into a representation using feature parameters such as cepstrum and pitch. Next, the training speech data, converted into a time series of feature parameters, is input to the supervised speaker adaptation unit 1, which creates phoneme or syllable models adapted to that speaker. The speaker-independent phoneme/syllable model storage unit 2 stores models of each phoneme or syllable created in advance from the speech of many speakers. Each phoneme or syllable model may take various forms, such as speech represented as a time series of feature parameters or represented by a hidden Markov model. To adapt these phoneme/syllable models to the speaker to be registered, various established methods can be used, for example the method described in Jean-Luc Gauvain and Chin-Hui Lee, "Improved Acoustic Modeling With Bayesian Learning," Proc. IEEE ICASSP 92, 1992. The phoneme or syllable models adapted to the speaker are registered in the adapted phoneme/syllable model storage unit 3. Furthermore, the allowable range of speech variation is estimated from the error between the concatenated adapted phoneme/syllable models and the feature parameter time series of the training speech data, and this value is stored as the similarity threshold, together with the speaker ID, in the threshold storage unit 10.

【００２５】Here, the training speech data converted into a time series of feature parameters is also input to the utterance-content-independent model creation unit 26, which generates speech models that are independent of the utterance content. These models are stored in the utterance-content-independent model storage unit 27. Next, the allowable range of similarity variation due to differences in utterance content is estimated from two errors: the error between the concatenated adapted phoneme/syllable models supplied by the adapted phoneme/syllable model storage unit 3 and the feature parameter time series of the training speech data, and the error between the utterance-content-independent speech model supplied by the utterance-content-independent model storage unit 27 and the feature parameter time series of the training speech data. This value is stored, together with the speaker ID, in the utterance content determination threshold storage unit 28 as the utterance content determination threshold.

【００２６】Next, at the speaker recognition stage, either the sentence or word to be uttered is determined in advance, or the recognition sentence/word generation unit 4 generates a new sentence or word and presents it to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts the input speech into the same representation as the feature parameter extraction unit 11. Meanwhile, the sentence/word speech model generation unit 6 concatenates the phoneme or syllable models registered in the adapted phoneme/syllable model storage unit 3 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and creates a speech model of that sentence or word. For speaker verification, that is, for judging whether the voice is that of the claimed person, the speaker is asked to enter his or her own ID by keys or other means, and the phoneme or syllable models adapted to the registered speaker corresponding to that ID are input to the sentence/word speech model generation unit 6. For speaker identification, that is, for judging which of the previously registered speakers the voice belongs to, the phoneme or syllable models of all registered speakers are input to the sentence/word speech model generation unit 6.
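The model concatenation step above can be sketched as follows. This is only an illustration under simplifying assumptions: a phoneme model is reduced to an ordered list of state labels, and the names `build_sentence_model`, `models_for_recognition`, and `claimed_id` are hypothetical, not taken from the patent.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in: a phoneme model is an ordered list of state labels.
PhonemeModel = List[str]
SpeakerModels = Dict[str, PhonemeModel]


def build_sentence_model(phonemes: Sequence[str], models: SpeakerModels) -> List[str]:
    """Concatenate one speaker's phoneme models in the order given by the prompt text."""
    sentence: List[str] = []
    for p in phonemes:
        sentence.extend(models[p])  # raises KeyError if a phoneme was never registered
    return sentence


def models_for_recognition(phonemes: Sequence[str],
                           store: Dict[str, SpeakerModels],
                           claimed_id: Optional[str] = None) -> Dict[str, List[str]]:
    """Verification (claimed_id given): build only that speaker's sentence model.

    Identification (claimed_id is None): build one sentence model per registered speaker.
    """
    ids = [claimed_id] if claimed_id is not None else list(store)
    return {sid: build_sentence_model(phonemes, store[sid]) for sid in ids}
```

Because the sentence model is assembled at recognition time from per-phoneme units, a freshly generated prompt needs no sentence-specific training data.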

【００２７】The time series of feature parameters obtained by the feature parameter extraction unit 5 and the speech model of the sentence or word generated by the sentence/word speech model generation unit 6 are input to the similarity calculation unit 7, which calculates the degree of similarity between the two. As a concrete method, for example, the method described in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," Section 3.1.3, pp. 40-50, The Institute of Electronics, Information and Communication Engineers, 1988, can be used. The calculated similarity value is sent to the similarity normalization unit 30 and the speaker recognition determination unit 9. At the same time, the time series of feature parameters obtained by the feature parameter extraction unit 5 is input, together with the utterance-content-independent speech model stored in the utterance-content-independent model storage unit 27, to the normalization similarity calculation unit 29, which calculates a normalization similarity and inputs it to the similarity normalization unit 30. The similarity normalization unit 30 normalizes the similarity by subtracting the normalization similarity value sent from the normalization similarity calculation unit 29 from the similarity value sent from the similarity calculation unit 7. The normalized similarity value is input to the utterance content determination unit 31. The utterance content determination unit 31 compares the normalized similarity with the utterance content determination threshold sent from the utterance content determination threshold storage unit 28. If the normalized similarity is larger than the threshold, the utterance content is judged to be correct; if it is smaller, the utterance content is judged to be incorrect. The result is sent to the speaker recognition determination unit 9. The speaker recognition determination unit 9 works from the similarity value sent from the similarity calculation unit 7. For speaker verification, it uses the speaker ID to read from the threshold storage unit 10 the threshold indicating the range of similarity variation that can be regarded as the person's own voice, and compares it with the similarity, calculated as described above, between the input speech of the recognition sentence or word and the sentence or word speech model built by concatenating that person's adapted phoneme or syllable models. If the similarity is larger than the threshold read out, the speech is judged to be the person's own voice; if it is smaller, the speech is judged to be another person's voice. For speaker identification, the similarities between the input speech and the sentence or word speech models built by concatenating each registered speaker's adapted phoneme or syllable models are all compared, and the speaker with the largest similarity is selected and judged to be the one who spoke. Finally, these results are combined with the judgment sent from the utterance content determination unit 31 to make the overall speaker recognition decision.
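The normalization and the two-part decision described above can be sketched as follows. The function names are hypothetical, and requiring both checks to pass is an assumption: the text says only that the two judgments are combined into an overall decision. If the similarities are log-likelihoods, the subtraction amounts to a log-likelihood ratio between the sentence model and the utterance-content-independent model.

```python
def normalize_similarity(sentence_sim: float, independent_sim: float) -> float:
    """Subtract the content-independent score (unit 29) from the sentence-model score (unit 7)."""
    return sentence_sim - independent_sim


def content_is_correct(normalized_sim: float, content_threshold: float) -> bool:
    """Utterance content check (unit 31): the normalized score must exceed its threshold."""
    return normalized_sim > content_threshold


def verify_with_content_check(sentence_sim: float, independent_sim: float,
                              speaker_threshold: float, content_threshold: float) -> bool:
    """Accept only if the speaker check and the content check both pass (assumed AND rule)."""
    speaker_ok = sentence_sim > speaker_threshold
    normalized = normalize_similarity(sentence_sim, independent_sim)
    return speaker_ok and content_is_correct(normalized, content_threshold)
```

A replayed recording of the true speaker saying some other text would tend to score about as well on the content-independent model as on the sentence model, so its normalized value stays small and the content check rejects it.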

【００２８】Next, an embodiment of the present invention is described with reference to FIG. 6. First, at the speaker registration stage, training speech data for each speaker to be registered is input to the feature parameter extraction unit 11. The utterance content of the training speech data is determined in advance. The feature parameter extraction unit 11 converts the input speech into a representation using feature parameters such as cepstrum and pitch. Next, to represent the features of the speaker's voice, the invention of claim 3 uses a plurality of Gaussian distributions and the invention of claim 4 uses a vector quantization codebook. That is, the training speech data converted into a time series of feature parameters is input to the Gaussian mixture/codebook creation unit 32, and the distribution of the feature parameters contained in the training speech data is represented by a combination of a plurality of Gaussian distributions or by a vector quantization codebook (hereinafter called the "Gaussian mixture/codebook"). To create the combination of Gaussian distributions or the codebook, for example, the method described in Tomoko Matsui and Sadaoki Furui, "Comparison of Text-Independent Speaker Recognition Methods Using VQ and Discrete/Continuous HMMs," IEICE Speech Research Group material, SP91-89, 1991, can be used. The Gaussian mixture/codebook created by the Gaussian mixture/codebook creation unit 32 is stored in the utterance-content-independent model storage unit 27 as a speech model independent of the utterance content.
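How a stored Gaussian mixture or codebook might score an utterance without reference to its text can be sketched as follows. These are the standard textbook scores (average VQ distortion and average log-likelihood under a diagonal-covariance mixture), not necessarily the exact procedure of the cited report. The function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def vq_distortion(frames: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Average squared distance from each frame to its nearest codeword (lower = more similar)."""
    total = 0.0
    for x in frames:
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook)
    return total / len(frames)


def gmm_avg_loglik(frames: Sequence[Vector], weights: Sequence[float],
                   means: Sequence[Vector], variances: Sequence[Vector]) -> float:
    """Average per-frame log-likelihood under a diagonal-covariance Gaussian mixture."""
    total = 0.0
    for x in frames:
        comp = []
        for w, m, v in zip(weights, means, variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, m, v):
                ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(ll)
        mx = max(comp)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total / len(frames)
```

Neither score depends on frame order, which is what makes such a model independent of the utterance content.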

【００２９】Next, the training speech data, a text representing its utterance content as a sequence of kana or phonetic symbols, and the Gaussian mixture/codebook are input to the phoneme group model generation unit 13. For each phoneme group, formed by grouping several phonemes whose feature parameter values are similar, the phoneme group model generation unit 13 creates a phoneme group model that represents the distribution of the feature parameters as a weighted sum of the elements of the Gaussian mixture/codebook. The weight of each element is determined so that, when these phoneme group models are concatenated in time according to the text of the training speech data, the time series of feature parameters of the training speech data is represented as accurately as possible. For this, for example, the method described in Yasuhiro Minami, Tatsuo Matsuoka, and Kiyohiro Shikano, "Evaluation of Concatenated-Training HMMs Using a Speaker-Independent Continuous Speech Database," IEICE Speech Research Group material, SP91-113, 1992, can be used. The phoneme group models created in this way are stored in the phoneme group model storage unit 3 together with the speaker ID identifying the speaker who uttered the training speech. Furthermore, the allowable range of speech variation is estimated from the error between the concatenated phoneme group models and the feature parameter time series of the training speech data, and this value is stored as the similarity threshold, together with the speaker ID, in the threshold storage unit 10. In addition, the allowable range of similarity variation due to differences in utterance content is estimated from the error between the concatenated phoneme group models and the feature parameter time series of the training speech data, and from the error between the utterance-content-independent speech model supplied by the utterance-content-independent model storage unit 27 and the feature parameter time series of the training speech data. This value is stored, together with the speaker ID, in the utterance content determination threshold storage unit 28 as the utterance content determination threshold.
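The idea of fitting only the weights over a fixed, shared set of Gaussian elements can be illustrated with a standard EM re-estimation of mixture weights. This is a swapped-in textbook technique for a single phoneme group with scalar features, not the concatenated training procedure of the cited report. The names `gauss` and `reestimate_weights` and the iteration count are hypothetical.

```python
import math
from typing import List, Sequence


def gauss(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reestimate_weights(frames: Sequence[float], weights: Sequence[float],
                       means: Sequence[float], variances: Sequence[float],
                       iterations: int = 10) -> List[float]:
    """EM updates of the mixture weights only; means and variances stay fixed (shared elements)."""
    w = list(weights)
    for _ in range(iterations):
        acc = [0.0] * len(w)
        for x in frames:
            joint = [wk * gauss(x, m, v) for wk, m, v in zip(w, means, variances)]
            s = sum(joint)
            for k, j in enumerate(joint):
                acc[k] += j / s  # posterior probability of element k for this frame
        w = [a / len(frames) for a in acc]
    return w
```

With frames clustered near the first element's mean, the re-estimated weights concentrate on that element, so each phoneme group ends up described by its own weighting of the same shared elements.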

【００３０】Next, at the speaker recognition stage, either the sentence or word to be uttered is determined in advance, or the recognition sentence/word generation unit 4 generates a new sentence or word and presents it to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts the input speech into the same representation as the feature parameter extraction unit 11. At the same time, the sentence/word speech model generation unit 6 concatenates the phoneme group models stored in the phoneme group model storage unit 3 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and generates a speech model of that sentence or word. For speaker verification, that is, for judging whether the voice is that of the claimed person, the speaker is asked to enter his or her own ID by keys or other input means, and the phoneme group models of the registered speaker corresponding to that ID are input to the sentence/word speech model generation unit 6. For speaker identification, that is, for judging which of the previously registered speakers the voice belongs to, the phoneme or syllable models of all registered speakers are input one after another to the sentence/word speech model generation unit 6.

【００３１】After that, the time series of feature parameters obtained by the feature parameter extraction unit 5 and the speech model of the sentence or word generated by the sentence/word speech model generation unit 6 are input to the similarity calculation unit 7, which calculates the similarity and inputs its value to the similarity normalization unit 30 and the speaker recognition determination unit 9. At the same time, the time series of feature parameters obtained by the feature parameter extraction unit 5 is input, together with the utterance-content-independent speech model stored in the utterance-content-independent model storage unit 27, to the normalization similarity calculation unit 29, which calculates a normalization similarity and inputs its value to the similarity normalization unit 30. The similarity normalization unit 30 normalizes the similarity by subtracting the similarity value sent from the normalization similarity calculation unit 29 from the similarity value sent from the similarity calculation unit 7. The normalized similarity value is then input to the utterance content determination unit 31. The utterance content determination unit 31 compares the normalized similarity with the utterance content determination threshold sent from the utterance content determination threshold storage unit 28. If the normalized similarity is larger than the threshold, the utterance content is judged to be correct; if it is smaller, the utterance content is judged to be incorrect. The result is sent to the speaker recognition determination unit 9. The speaker recognition determination unit 9 works from the similarity value sent from the similarity calculation unit 7. For speaker verification, it uses the speaker ID to read from the threshold storage unit 10 the threshold indicating the range of similarity variation that can be regarded as the person's own voice, and compares it with the similarity, calculated as described above, between the input speech of the recognition sentence or word and the sentence or word speech model built by concatenating that person's phoneme or syllable group models. If the similarity is larger than the threshold read out, the speech is judged to be the person's own voice; if it is smaller, the speech is judged to be another person's voice. For speaker identification, the similarities between the input speech and the sentence or word speech models built by concatenating the phoneme or syllable group models adapted to each registered speaker are all compared, and the speaker with the largest similarity is selected and judged to be the one who spoke. Finally, these results are combined with the judgment sent from the utterance content determination unit 31 to make the overall speaker recognition decision.

【００３２】Next, an embodiment in which the invention of claim 8 is applied to the inventions of claims 6 and 7 is described with reference to FIG. 7. At the speaker registration stage, the training speech data of each speaker to be registered is input to the feature parameter extraction unit 11 and converted into a time series of feature parameters (for example, cepstrum and pitch). It is then input to the speaker feature calculation unit 15, which extracts features of that speaker's voice that are independent of the utterance content. To extract such features, for example, the method described in Tomoko Matsui and Sadaoki Furui, "Comparison of Text-Independent Speaker Recognition Methods Using VQ and Discrete/Continuous HMMs," IEICE Speech Research Group material, SP91-89, 1991, can be used, namely representing the distribution of the feature parameters by a vector quantization codebook or by a plurality of Gaussian distributions. The extracted voice features are stored in the speaker feature storage unit 16 together with the speaker's ID.

【００３３】Meanwhile, speech data from many speakers is input to the feature parameter extraction unit 17 and converted into a time series of feature parameters. It is then input to the phoneme/syllable model creation unit 18, which creates phoneme/syllable models common to many speakers, that is, speaker-independent models. For this, for example, the method described in Yasuhiro Minami, Tatsuo Matsuoka, and Kiyohiro Shikano, "Evaluation of Concatenated-Training HMMs Using a Speaker-Independent Continuous Speech Database," IEICE Speech Research Group material, SP91-113, 1992, can be used. In this case, the utterance content of the training speech data is determined in advance, and that content, expressed as a sequence of kana or phonetic symbols, is input to the phoneme/syllable model creation unit 18 together with the speech. Each phoneme or syllable model may represent speech as a time series of feature parameters or as a hidden Markov model. The created phoneme/syllable models are stored in the phoneme/syllable model storage unit 2.

【００３４】At the same time, the time series of feature parameters extracted by the feature parameter extraction unit 17 is input to the utterance-content-independent model creation unit 26, which generates speech models independent of the utterance content. These models are stored in the utterance-content-independent model storage unit 27. Next, the allowable range of similarity variation due to differences in utterance content is estimated from the phoneme/syllable models sent from the phoneme/syllable model storage unit 2 and the utterance-content-independent models sent from the utterance-content-independent model storage unit 27, and this value is stored in the utterance content determination threshold storage unit 28 as the utterance content determination threshold.

【００３５】At the speaker recognition stage, either the sentence or word to be uttered is determined in advance, or the recognition sentence/word generation unit 4 generates a new sentence or word and presents it to the speaker. The speaker utters the sentence or word, and the speech is input to the feature parameter extraction unit 5, which converts the input speech into a representation using the same feature parameters as the feature parameter extraction unit 11. Meanwhile, the sentence/word speech model generation unit 6 concatenates the phoneme or syllable models registered in the phoneme/syllable model storage unit 2 according to the sentence or word generated by the recognition sentence/word generation unit 4, or according to the predetermined sentence or word, and generates a speech model of that sentence or word. The time series of feature parameters obtained by the feature parameter extraction unit 5 and the speech model of the sentence or word generated by the sentence/word speech model generation unit 6 are input to the utterance content similarity calculation unit 19, which calculates the degree of similarity between the two. As a concrete method, for example, the method described in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," Section 3.1.3, pp. 40-50, The Institute of Electronics, Information and Communication Engineers, 1988, can be used. The calculated similarity value is sent to the similarity normalization unit 30. At the same time, the time series of feature parameters obtained by the feature parameter extraction unit 5 is input, together with the utterance-content-independent speech model stored in the utterance-content-independent model storage unit 27, to the normalization similarity calculation unit 29, which calculates a normalization similarity and inputs its value to the similarity normalization unit 30. The similarity normalization unit 30 normalizes the similarity by subtracting the normalization similarity value sent from the normalization similarity calculation unit 29 from the similarity value sent from the utterance content similarity calculation unit 19. The normalized similarity value is input to the utterance content determination unit 31. The utterance content determination unit 31 compares the normalized similarity with the utterance content determination threshold sent from the utterance content determination threshold storage unit 28. If the normalized similarity is larger than the threshold, the utterance content is judged to be correct; if it is smaller, the utterance content is judged to be incorrect. The result is sent to the speaker recognition determination unit 25.

[0036] Meanwhile, the time series of feature parameters obtained by the feature parameter extraction unit 5 and the voice features of each speaker stored in the speaker feature storage unit 16 are sent to the speaker similarity calculation unit 22. For speaker verification, the speaker similarity calculation unit 22 uses the speaker ID to calculate the similarity with that speaker's voice features. For speaker identification, it calculates the similarity with the voice features of every speaker stored in the speaker feature storage unit 16. The calculated similarity values are temporarily stored in the speaker similarity storage unit 23 and then input to the speaker recognition determination unit 25, which determines the speaker. For speaker verification, a threshold indicating the range of similarity variation that can be regarded as the claimed speaker's own voice is read from the threshold storage unit 10 according to the speaker ID and compared with the similarity sent from the speaker similarity storage unit 23. If the similarity is larger than the threshold, the input is judged to be the voice of the claimed speaker; if it is smaller, the input is judged to be the voice of another person. For speaker identification, the similarities between the input speech and all registered (stored) speakers are compared, the speaker with the highest similarity is selected, and that speaker is judged to have produced the utterance. Finally, these results are combined with the determination result sent from the utterance content determination unit 31 to make an overall speaker recognition decision.
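The two decision modes of the speaker recognition determination unit 25 can be illustrated with the following minimal sketch. The function names, the dictionary of per-speaker similarities, and the example values are hypothetical and are introduced only for illustration.

```python
from typing import Dict


def verify_speaker(similarity: float, threshold: float) -> bool:
    """Speaker verification: accept the claimed identity if the similarity exceeds
    the threshold read out for that speaker ID."""
    return similarity > threshold


def identify_speaker(similarities: Dict[str, float]) -> str:
    """Speaker identification: select the registered speaker with the highest similarity."""
    return max(similarities, key=similarities.get)


# Hypothetical similarity values for three registered speakers.
scores = {"speaker_a": 0.2, "speaker_b": 0.9, "speaker_c": 0.5}
print(identify_speaker(scores))
print(verify_speaker(scores["speaker_b"], threshold=0.6))
```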

【００３７】[0037]

[Effects of the Invention] As described above, in the inventions of claims 1 to 5, phoneme or syllable models adapted to the voice of each speaker to be recognized are created and registered, and the speaker who uttered the input speech is recognized by obtaining the similarity between a sentence or word speech model generated by connecting those models and the input speech converted into a representation using feature parameters. The content to be uttered can therefore be changed at every recognition, and a high similarity cannot be obtained unless that content is actually uttered, so impersonating another person by playing back recorded speech can be prevented.
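The idea of changing the prompted content at every recognition and assembling the matching speech model from registered per-speaker phoneme models can be sketched as below. This is an illustration under stated assumptions: the vocabulary, the pronunciation lexicon, the string placeholders standing in for adapted phoneme models, and both function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def generate_prompt(vocabulary: Sequence[str], n_words: int,
                    rng: random.Random) -> List[str]:
    """Draw a fresh word sequence so that each recognition uses different content."""
    return [rng.choice(vocabulary) for _ in range(n_words)]


def build_sentence_model(prompt: Sequence[str],
                         lexicon: Dict[str, List[str]],
                         speaker_models: Dict[str, str]) -> List[str]:
    """Concatenate the speaker's phoneme models in the order given by the prompt."""
    sentence_model = []
    for word in prompt:
        for phoneme in lexicon[word]:
            sentence_model.append(speaker_models[phoneme])
    return sentence_model


# Hypothetical lexicon and placeholder models for one registered speaker.
lexicon = {"aka": ["a", "k", "a"], "ie": ["i", "e"]}
speaker_models = {p: f"model:{p}" for p in ["a", "k", "i", "e"]}
prompt = generate_prompt(list(lexicon), n_words=2, rng=random.Random())
print(prompt, build_sentence_model(prompt, lexicon, speaker_models))
```

Since the prompt is drawn anew each time, a recording of an earlier session is unlikely to match the sentence model built for the current session.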

[0038] In the invention of claim 6, the input speech whose speaker is to be recognized is converted into a representation using feature parameters, and the speaker who uttered the input speech is recognized by combining two similarities: the similarity between the input speech in that representation and the voice features of each speaker, which are independent of the utterance content (that is, of the words spoken), and the similarity to the words that should be uttered. The content to be uttered can therefore be changed at every recognition, and a high similarity cannot be obtained unless that content is actually uttered, so impersonating another person by playing back recorded speech can be prevented.
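The specification does not give a formula for combining the two similarities. The sketch below shows two possible readings, assumed here purely for illustration: a logical combination (both tests must pass) and a numerical combination (a weighted sum). The function names and the weight parameter are hypothetical.

```python
def combine_logically(speaker_similarity: float, speaker_threshold: float,
                      content_similarity: float, content_threshold: float) -> bool:
    """Accept only if both the voice test and the utterance content test pass."""
    return (speaker_similarity > speaker_threshold
            and content_similarity > content_threshold)


def combine_numerically(speaker_similarity: float, content_similarity: float,
                        weight: float = 0.5) -> float:
    """Blend the two similarities into one score; the weight is an assumed parameter."""
    return weight * speaker_similarity + (1.0 - weight) * content_similarity


# Hypothetical values: the voice matches, but the prompted words were not spoken.
print(combine_logically(0.9, 0.6, 0.1, 0.5))
print(combine_numerically(2.0, 4.0, weight=0.5))
```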

[0039] The invention of claim 8 is the speaker recognition method described in claims 1, 2, 3, 4, 5, 6, and 7 in which a speech model independent of the utterance content is created, and the similarity between this model and the input speech converted into a representation using feature parameters is used to normalize the similarity between that input speech and the sentence or word speech model formed by connecting phoneme or syllable (group) models. This makes it possible to perform speaker recognition that is not easily affected by variations in similarity caused by differences in the content of the sentence or word, recording time, transmission system, microphone, and other test conditions.

[0040] Next, an experimental example is described. The experiment used sentence data (average sentence length 4 seconds) uttered by 23 male and 13 female speakers in three periods (periods A, B, and C) spanning about five months. The speech was converted into a time series of a conventionally used feature, the cepstrum, computed at short time intervals. The cepstrum was extracted with a sampling frequency of 12 kHz, a frame length of 32 ms, a frame period of 8 ms, and an LPC (Linear Predictive Coding) analysis order of 16. Ten sentences uttered in period A were used for training, and in testing, five sentences uttered in periods B and C were used one sentence at a time.
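To make the analysis conditions concrete, the following sketch converts the stated sampling frequency, frame length, and frame period into sample counts and estimates how many frames one sentence yields. The function name is hypothetical, and the assumption that only complete frames are kept (no padding at the end of the signal) is not stated in the specification.

```python
from typing import Tuple


def frame_layout(duration_s: float, sample_rate_hz: int = 12_000,
                 frame_ms: int = 32, shift_ms: int = 8) -> Tuple[int, int, int]:
    """Return (samples per frame, samples per shift, number of complete frames)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    n_samples = int(duration_s * sample_rate_hz)
    if n_samples < frame_len:
        return frame_len, shift, 0
    n_frames = 1 + (n_samples - frame_len) // shift
    return frame_len, shift, n_frames


# An average 4-second sentence under the stated analysis conditions.
print(frame_layout(4.0))  # (384, 96, 497)
```

Under these assumptions, each 32 ms frame covers 384 samples, successive frames start 96 samples apart, and a 4-second sentence produces 497 cepstral vectors.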

[0041] Each speaker's phoneme group model was represented, based on the invention of claim 3, by a combination of 64 Gaussian distributions (Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," Institute of Electronics, Information and Communication Engineers, Speech Research Group Materials, SP91-89, 1991). The number of phoneme groups was 25. The results were evaluated by the rejection rate when the true speaker uttered a sentence different from the designated one. The higher this rejection rate, the more often an utterance is rejected when its content differs from the designated sentence, as happens when recorded speech is played back, even if the voice is that of the true speaker. Here, input speech whose similarity exceeded the threshold was accepted as speech in which the true speaker uttered the designated sentence, and all other input was rejected. The threshold was set a posteriori for each speaker so that no speech in which the true speaker uttered the designated sentence would be rejected. The results are shown in FIG. 7. They show that the rejection error rate of the speaker recognition method of this invention is roughly half that obtained without similarity normalization. This demonstrates the effectiveness of the invention of claim 8.
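The evaluation procedure described above (a per-speaker threshold chosen after the fact so that no correct utterance is rejected, followed by measurement of how often wrong-content utterances are rejected) can be sketched as follows. The function names, the small margin, and the scores are hypothetical. The numbers are not experimental data.

```python
from typing import Sequence


def posterior_threshold(genuine_scores: Sequence[float],
                        margin: float = 1e-9) -> float:
    """Place the threshold just below the lowest correct-utterance score,
    so that every correct utterance is accepted."""
    return min(genuine_scores) - margin


def rejection_rate(wrong_text_scores: Sequence[float], threshold: float) -> float:
    """Fraction of wrong-content utterances whose score does not exceed the threshold."""
    rejected = sum(1 for s in wrong_text_scores if not s > threshold)
    return rejected / len(wrong_text_scores)


# Hypothetical normalized similarities for one speaker.
genuine = [3.0, 2.5, 4.0]          # designated sentence uttered correctly
wrong_text = [1.0, 2.0, 2.8, 0.5]  # a different sentence uttered by the same speaker
threshold = posterior_threshold(genuine)
print(rejection_rate(wrong_text, threshold))  # 0.75
```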

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claims 1 and 2 is applied.

FIG. 2 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claim 3 is applied.

FIG. 3 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claim 4 is applied.

FIG. 4 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claims 6 and 7 is applied.

FIG. 5 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claim 8 relating to claims 1 and 2 is applied.

FIG. 6 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claim 8 relating to claims 3, 4, and 5 is applied.

FIG. 7 is a block diagram showing an example of a recognition device to which an embodiment of the invention of claim 8 relating to claims 6 and 7 is applied.

FIG. 8 is a diagram showing an experiment relating to claim 8.

[Explanation of symbols]

1: speaker adaptation unit
2: phoneme/syllable model storage unit for unspecified speakers
3: adapted phoneme/syllable model storage unit
4: recognition sentence/word generation unit
5: feature parameter extraction unit
6: sentence/word speech model creation unit
7: similarity calculation unit
9: speaker recognition determination unit
10: threshold storage unit
11: feature parameter extraction unit
30: similarity normalization unit

Claims

[Claims]

1. A speaker recognition method comprising: means for creating and registering phoneme or syllable models adapted to the voice of each speaker to be recognized; means for connecting these phoneme or syllable models to generate a speech model of a sentence or word; and means for converting input speech into a representation using feature parameters, wherein the speaker who uttered the input speech is recognized by obtaining the similarity between the input speech converted into that representation and a sentence or word speech model formed by connecting the phoneme or syllable models registered in advance for each speaker.

2. The speaker recognition method according to claim 1, wherein, each time recognition is performed, a sentence or word to be uttered is designated to the speaker, and the speaker who uttered the input speech is recognized by obtaining the similarity between the input speech in which the sentence or word is uttered, converted into a representation using feature parameters, and the speech model of the sentence or word generated by connecting the phoneme or syllable models.

3. A speaker recognition method comprising: means for converting input speech into a representation using feature parameters; means for calculating a plurality of Gaussian distributions that express the features of each speaker's voice independently of utterance content; means for creating phoneme or syllable group models by assigning appropriate weights to those Gaussian distributions for each phoneme or syllable group; and means for creating a speech model of a sentence or word by connecting the phoneme or syllable group models in time, wherein, using the training speech data of each speaker, the distribution of feature parameters in the speech data is approximated by a plurality of Gaussian distributions, phoneme or syllable group models are created that approximate the distribution of feature parameters of each phoneme or syllable group by a weighted sum of those Gaussian distributions, and the speaker is recognized by comparing a sentence or word speech model, formed by connecting the phoneme or syllable group models in time according to the utterance content (sentence or word) of the input speech whose speaker is to be recognized, with the time-series feature parameter representation of that input speech.

4. A speaker recognition method comprising: means for converting input speech into a representation using feature parameters; means for calculating a vector quantization codebook that expresses the features of each speaker's voice independently of utterance content; means for creating phoneme or syllable group models by assigning appropriate weights to the codebook for each phoneme or syllable group; and means for creating a speech model of a sentence or word by connecting the phoneme or syllable group models in time, wherein, using the training speech data of each speaker, the distribution of feature parameters in the speech data is approximated by a codebook, phoneme or syllable group models are created that approximate the distribution of feature parameters of each phoneme or syllable group by a weighted sum of the codebook elements, and the speaker is recognized by comparing a sentence or word speech model, formed by connecting the phoneme or syllable group models in time according to the utterance content (sentence or word) of the input speech whose speaker is to be recognized, with the time-series feature parameter representation of that input speech.

5. The speaker recognition method according to claim 3 or 4, wherein, each time recognition is performed, a sentence or word to be uttered is designated to the speaker, the input speech in which the sentence or word is uttered is converted into a representation using feature parameters, and the speaker who uttered the input speech is recognized by comparing it with the speech model of the sentence or word formed by connecting the phoneme or syllable group models.

6. A speaker recognition method comprising: means for recognizing a speaker by extracting features of the speaker's voice that are independent of utterance content; means for recognizing the utterance content (words) of speech from unspecified speakers; and means for converting input speech into a representation using feature parameters, wherein the similarities between the input speech converted into that representation and, respectively, the voice features of each speaker and the words to be uttered are obtained, and the speaker who uttered the input speech is recognized by combining these similarities logically and numerically.

7. The speaker recognition method according to claim 6, wherein, each time recognition is performed, a sentence or word to be uttered is designated to the speaker, the input speech in which the sentence or word is uttered is converted into a representation using feature parameters, and the speaker who uttered the input speech is recognized by combining the above two types of similarity.

8. The speaker recognition method according to any one of claims 1, 2, 3, 4, 5, 6, and 7, wherein the similarity between the input speech converted into a representation using feature parameters and the sentence or word speech model formed by connecting phoneme or syllable (group) models is normalized against variations caused by the content of the sentence or word, recording time, transmission system, microphone, and other test conditions.