JPH1185181A

JPH1185181A - Generating method of voice model and method and device for speaker recognition using the model

Info

Publication number: JPH1185181A
Application number: JP9245781A
Authority: JP
Inventors: Tomoko Matsui; 知子松井; Kiyoaki Aikawa; 清明相川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-09-10
Filing date: 1997-09-10
Publication date: 1999-03-30

Abstract

PROBLEM TO BE SOLVED: To generate a speech model that maintains discrimination and verification capability at a high level for a speaker's voice that varies with the time of utterance, and to realize speaker recognition that maintains high recognition performance by using the model. SOLUTION: The feature parameter sequence extracted from the most recently input speech by a feature parameter extracting section 1 and the feature parameter sequences previously stored in a feature parameter storage section 2 are input to a model generating section 3. In section 3, the parameter θ^(t) of a hidden Markov model (HMM) is defined as the result of transforming a parameter θ', which is the HMM parameter estimated from the time-independent component of the speech data, by a model transformation function G^(t) that represents the variation dependent on time t. The parameter θ' and the function G^(t) of each time are then estimated from the feature parameter sequences input for each time. The parameter θ' is stored in a model storage section 4 and used for speaker recognition when speech data to be recognized are input.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method for generating a speech model based on a sequence of feature parameters, extracted from input speech, that represent the individuality of a speaker, and to a speaker recognition method and apparatus that use the speech model.

[0002]

[Prior Art] Techniques have long been studied for recognizing who a visitor is from the voice heard over an intercom, or for confirming from input speech that a person is the holder of a given personal identification number. These techniques recognize the speaker of the input speech by extracting from it a sequence of feature parameters that represent the speaker's individuality and computing the similarity between that sequence and a speech model built from feature parameter sequences registered in advance for each speaker. In such speaker recognition, a commonly used method takes the cepstrum, pitch, and the like as the feature parameters, represents the speech model registered for each speaker by a hidden Markov model (hereinafter "HMM"), and makes the decision based on the similarity between the feature parameter sequence contained in speech uttered by the speaker, such as a sentence, and the HMM of each registered speaker.

[0003] In an actual system, to limit the burden of speaking placed on each speaker, the initial speech model used for such speaker recognition is often created from a small amount of data uttered by the speaker at a single point in time. However, a speaker's voice varies considerably, particularly over intervals of two to three months (see, for example, Sadaoki Furui, "Individuality information contained in speech waves," doctoral dissertation, University of Tokyo, 1978), and the direction of the variation is not constant, so high recognition performance cannot be maintained with the initial model alone.

[0004] For this reason, methods have conventionally been studied in which each speaker is asked to speak periodically, the resulting data are accumulated, and the HMM of each speaker is regenerated from the accumulated data.

[0005]

[Problems to Be Solved by the Invention] However, the data accumulated by having each speaker speak periodically contain a variety of utterance variations. The conventional method therefore has the problem that the variance of the feature parameter distributions of the regenerated HMM widens, which lowers its discrimination and verification capability.

[0006] The present invention has been made in view of these circumstances. One object is to provide a method for generating a speech model that maintains discrimination and verification capability at a high level for a speaker's voice that varies with the period of utterance. A further object is to provide a speaker recognition method and apparatus that use such a speech model to maintain high recognition performance.

[0007]

[Means for Solving the Problems] The present invention assumes that:
(1) the speech data uttered by a speaker in a plurality of periods are drawn from a plurality of populations, one for each period; and
(2) these populations share speaker characteristics that are independent of the period.
On this basis, the speech data uttered by each speaker in each period are separated, speaker by speaker, into a variation component that depends on the period of utterance and a component that is independent of the period.

[0008] Let θ^(t) be the HMM parameters estimated from the speech data uttered by a speaker in period t. If θ^(t) is regarded as the result of transforming θ^, the HMM parameters estimated from the period-independent component of the speaker's speech data, by a model transformation function G^(t) that represents the variation dependent on period t, it can be written as in Equation 1 below.

(Equation 1)

$$\theta^{(t)} = G^{(t)}(\hat{\theta})$$

Note that the symbol "^" placed above a character in an equation corresponds to the symbol "^" written after that character in the text of this specification (the same applies to the equations below).

[0009] Because the period-independent parameter θ^ in the above equation is θ^(t) with the period-dependent variation component removed, its variance can be expected to be small. The present invention therefore generates the speech model by estimating such period-independent HMM parameters from the feature parameter sequences extracted from the input speech.
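The expectation that removing the period-dependent component reduces the variance can be illustrated numerically. The following is a minimal sketch, assuming one-dimensional features and a purely additive offset per period; the number of periods, the frame counts, and all variable names are hypothetical and are not taken from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 5, 2000                    # hypothetical: 5 periods, 2000 frames each
common_mean = 1.0                 # period-independent component (assumed)
biases = rng.normal(0.0, 2.0, T)  # period-dependent offsets (assumed)

# Each period's data = common component + period offset + within-period noise.
data = [common_mean + b + rng.normal(0.0, 1.0, N) for b in biases]

# Variance when all periods are simply pooled (the conventional approach).
pooled_var = np.var(np.concatenate(data))

# Variance after removing each period's offset relative to the grand mean.
grand_mean = np.mean(np.concatenate(data))
centered = [x - (x.mean() - grand_mean) for x in data]
compensated_var = np.var(np.concatenate(centered))

print(f"pooled variance:      {pooled_var:.3f}")
print(f"compensated variance: {compensated_var:.3f}")
```

In this sketch the pooled variance contains both the within-period and the between-period spread, while the compensated variance contains only the within-period spread.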

[0010] Assuming that the HMM parameters θ^(t) of each period are expressed in the form of Equation 1, the period-independent HMM parameter θ^ can be estimated by finding the pair consisting of a parameter θ common to all periods and a set of model transformation functions G_set (= (G^(1), G^(2), ..., G^(T))), one for the variation of each period, that maximizes the likelihood of the speech data O^(t) of every period.

[0011] That is, as shown in Equation 2 below, the speech data O^(t) (t = 1, 2, ..., T) of all available periods are used to find simultaneously the parameter θ^ and the set of model transformation functions G_set^ = (G^(1)^, G^(2)^, ..., G^(T)^) that maximize the likelihood of the speech data, and the resulting θ^ is used as the parameter of the speaker's speech model.

(Equation 2)

$$(\hat{\theta}, \hat{G}_{set}) = \arg\max_{\theta,\, G_{set}} \prod_{t=1}^{T} L(O^{(t)}; G^{(t)}, \theta)$$

Here, "L(O^(t); G^(t), θ)" on the right-hand side is the likelihood function of the speech model G^(t)(θ) for the speech data O^(t), and "arg" indicates that the right-hand side denotes the pair (θ^, G_set^) that maximizes the product, over all periods (t = 1 to T), of the likelihood of the speech model for the speech data of each period. The inventions set forth in the claims of this application, which are based on the above theory, are as follows.
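The objective of Equation 2 can be sketched in code by working with the logarithm of the product, which is a sum of per-period log-likelihoods. The sketch below is illustrative only: it assumes a single diagonal Gaussian in place of an HMM and a transformation G^(t) that adds a bias to the mean; the function names `log_likelihood` and `joint_log_likelihood` are hypothetical.

```python
import numpy as np

def log_likelihood(obs, mean, var):
    """Log-likelihood of frames `obs` (N, D) under one diagonal Gaussian."""
    diff = obs - mean
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + diff**2 / var)))

def joint_log_likelihood(obs_per_period, mean, var, biases):
    """Log of the Equation 2 product: the sum over periods t of
    log L(O^(t); G^(t), theta), with G^(t) assumed to add a bias to the mean."""
    return sum(
        log_likelihood(obs, mean + b, var)
        for obs, b in zip(obs_per_period, biases)
    )

# Two hypothetical periods; the second is shifted by 3 in every dimension.
rng = np.random.default_rng(0)
period_1 = rng.normal(0.0, 1.0, size=(200, 3))
period_2 = rng.normal(3.0, 1.0, size=(200, 3))
mean, var = np.zeros(3), np.ones(3)

with_bias = joint_log_likelihood(
    [period_1, period_2], mean, var, [np.zeros(3), np.full(3, 3.0)])
without_bias = joint_log_likelihood(
    [period_1, period_2], mean, var, [np.zeros(3), np.zeros(3)])
print(with_bias, without_bias)
```

A common parameter paired with suitable per-period biases scores higher than the same parameter with no compensation, which is the property the joint maximization exploits.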

[0012] The invention of claim 1 is a speech model generation method that extracts feature parameters representing a speaker's individuality from speech and generates a speech model of the speaker based on those feature parameters, characterized in that feature parameters are extracted from speech uttered in each of a plurality of periods, a parameter extracted from the period-independent component of the speech is estimated from the feature parameters of the respective periods, and a speech model of the speaker who uttered the speech is generated based on that parameter.

[0013] The invention of claim 2 is a speech model generation method that extracts feature parameters representing a speaker's individuality from speech and generates a speech model of the speaker based on those feature parameters, characterized in that feature parameters are extracted from speech uttered in each of a plurality of periods; a parameter common to all periods, extracted from the period-independent component of the speech, and transformation factors that transform that parameter by the variation dependent on each period are estimated from the feature parameters of the respective periods; and a speech model of the speaker who uttered the speech is generated based on the common parameter.

[0014] The invention of claim 3 is the speech model generation method of claim 2, characterized in that, for each estimated pair of a common parameter and transformation factors, the likelihood, with respect to the speech, of the per-period feature parameters estimated by transforming the common parameter with the transformation factors is computed, and the speech model is generated based on the common parameter contained in the pair for which the likelihood is maximized.

[0015] The invention of claim 4 is a speaker recognition method characterized in that a speech model is generated in advance for each speaker by the speech model generation method of any one of claims 1 to 3, feature parameters are extracted from speech uttered by an unspecified speaker, the similarity between those feature parameters and the speech models is computed, and the unspecified speaker is recognized based on the similarity.

[0016] The invention of claim 5 is a speaker recognition apparatus characterized by comprising: feature parameter extraction means for extracting, from input speech, feature parameters representing a speaker's individuality; model generation means for estimating, from the feature parameters extracted from input speech of a plurality of periods, a parameter extracted from the period-independent component of the input speech, and for generating, based on that parameter, a speech model of the speaker who uttered the input speech; model storage means for storing the generated speech model; and recognition means for computing the similarity between feature parameters extracted from input speech uttered by an unspecified speaker and the stored speech model, and for recognizing the unspecified speaker based on the similarity.

[0017] The invention of claim 6 is a speaker recognition apparatus characterized by comprising the feature parameter extraction means, model storage means, and recognition means of claim 5, together with model generation means for estimating, from the feature parameters of each period, a parameter common to all periods that is extracted from the period-independent component of the input speech and transformation factors that transform that parameter by the variation dependent on each period, and for generating a speech model of the speaker who uttered the input speech based on the common parameter.

[0018] The invention of claim 7 is the speaker recognition apparatus of claim 6, characterized in that the model generation means computes, for each estimated pair of a common parameter and transformation factors, the likelihood, with respect to the input speech, of the per-period feature parameters estimated by transforming the common parameter with the transformation factors, and generates the speech model based on the common parameter contained in the pair for which the likelihood is maximized.

[0019] The invention of claim 8 is a recording medium on which is recorded a program for causing a predetermined arithmetic device to execute: a step of extracting feature parameters representing a speaker's individuality from input speech uttered in each of a plurality of periods; a step of estimating, from the extracted feature parameters of each period, a parameter extracted from the period-independent component of the input speech; and a step of generating, based on that parameter, a speech model of the speaker who uttered the speech.

[0020] The invention of claim 9 is a recording medium on which is recorded a program for causing a predetermined arithmetic device to execute: a step of extracting feature parameters representing a speaker's individuality from input speech; a step of estimating, from the feature parameters extracted from input speech of a plurality of periods, a parameter extracted from the period-independent component of the input speech, and generating, based on that parameter, a speech model of the speaker who uttered the input speech; a step of storing the generated speech model; and a step of computing the similarity between feature parameters extracted from input speech uttered by an unspecified speaker and the stored speech model, and recognizing the unspecified speaker based on the similarity.

[0021]

BEST MODE FOR CARRYING OUT THE INVENTION

<Configuration> An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speaker recognition apparatus according to one embodiment of the present invention.

[0022] In the figure, reference numeral 1 denotes a feature parameter extraction unit that converts speech data into feature parameters such as the cepstrum and pitch. It converts the input speech data, one fixed-length interval at a time, into feature parameters and supplies them in order to the feature parameter storage unit 2 and the model generation unit 3, or to the similarity calculation unit 5, so that each of these units receives a feature parameter sequence, that is, the input speech data converted into a time series. The feature parameter storage unit 2 is storage means that holds the feature parameters supplied from the feature parameter extraction unit 1, and it inputs the time series of previously stored feature parameters (feature parameter sequences) to the model generation unit 3 when needed (when a speech model is generated; details are given later).
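The conversion of fixed-length speech intervals into a feature parameter sequence can be sketched as follows. The specification does not state how the cepstrum is computed, so this sketch assumes the FFT-based real cepstrum; the frame length, hop size, number of coefficients, and the function names `real_cepstrum` and `extract_features` are all hypothetical.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=12):
    """Low-order real cepstrum of one frame (assumed feature definition)."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-10)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[1:n_coeffs + 1]           # drop the 0th (energy) term

def extract_features(signal, frame_len=256, hop=128, n_coeffs=12):
    """Convert a waveform into a time series of cepstral vectors."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([
        real_cepstrum(signal[i * hop : i * hop + frame_len], n_coeffs)
        for i in range(n_frames)
    ])

# Hypothetical input: 1024 samples of noise standing in for speech data.
signal = np.random.default_rng(0).normal(size=1024)
features = extract_features(signal)
print(features.shape)
```

Each row of `features` plays the role of one feature parameter vector, and the rows together form the feature parameter sequence passed on to the later units.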

[0023] The model generation unit 3 generates a speech model that is independent of the utterance variation of each period, based on the feature parameter sequence input from the feature parameter extraction unit 1 and the feature parameter sequences input from the feature parameter storage unit 2, and outputs that speech model (the parameters representing it) to the model storage unit 4. Speech model generation by the model generation unit 3 is described in detail in the description of operation below. The model storage unit 4 stores the speech model output from the model generation unit 3 and supplies the stored speech model to the similarity calculation unit 5 when needed (when speaker recognition is performed; details are given later).

[0024] The similarity calculation unit 5 computes the degree of similarity between the feature parameter sequence supplied from the feature parameter extraction unit 1 and the speech model stored in the model storage unit 4, and outputs the resulting similarity value to the speaker recognition decision unit 6. The speaker recognition decision unit 6 reads a predetermined threshold from the threshold storage unit 7 and compares it with the similarity input from the similarity calculation unit 5 to decide on the speaker. Here, the predetermined threshold is a value indicating the range of similarity that can be regarded as the voice of the speaker corresponding to the speech model used in computing the similarity; the threshold storage unit 7 stores such a threshold for each speaker in advance.

[0025] <Operation> (1) Generation of the speech model

Next, the operation of the above configuration is described. First, the operation of generating a speaker's speech model (the operation at the speaker registration stage) is described. In this case, speech data uttered periodically by the speaker to be registered are input in sequence to the feature parameter extraction unit 1. The feature parameter extraction unit 1 converts the speech data of each utterance period into a time series of feature parameters and supplies them in sequence to the feature parameter storage unit 2 and the model generation unit 3, and the speaker's speech model is generated as follows.

[0026] FIG. 2 shows the procedure by which the model generation unit 3 generates a speech model. Here, the HMM parameter θ^ satisfying Equation 1 is estimated by a technique corresponding to Equation 2. As one example of such a technique, the Speaker Adaptive Training (SAT) method introduced in T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A Compact Model for Speaker-Adaptive Training," Proc. ICSLP, pp. 1137-1140, 1996, is used to estimate the parameter θ^ for representing the period-independent speech model by an HMM formed from a weighted sum of multiple Gaussian distributions.

[0027] The feature parameter sequence obtained by converting the most recently input speech data is input to the model generation unit 3 from the feature parameter extraction unit 1, and the feature parameter sequences of the periods extracted previously are input from the feature parameter storage unit 2 (step S1). In the following, the period in which the most recently input speech was uttered is denoted T, and the periods in which the speech underlying the feature parameter sequences input from the feature parameter storage unit 2 was uttered are denoted 1 to T-1.

[0028] The model generation unit 3 then estimates the model transformation function of each period based on the input feature parameters of each period (step S2). It has been confirmed that the mean vectors of the Gaussian distributions of an HMM carry, on average, a bias component that depends on the period. On this basis, for the estimation here, the model transformation function G^(t) of period t is defined as shown in Equation 3 below with respect to the mean vector μ_jk of Gaussian distribution k in state j of the HMM.

(Equation 3)

$$G^{(t)}(\mu_{jk}) = \mu_{jk} + b^{(t)}$$

[0029] In Equation 3, b^(t) is the bias component that depends on period t. That is, by adding the bias component b^(t), the model transformation function G^(t) transforms the mean vector μ_jk^ of Gaussian distribution k, which consists of period-independent components, into the mean vector μ^(t)_jk of the Gaussian distribution that represents the feature parameter sequence of period t. (Because the object being transformed is expressed here as a vector, G^(t) is expressed as a matrix.) Estimating the model transformation function of each period in step S2 therefore amounts, concretely, to finding the bias component b^(t) of each period.

[0030] In this embodiment, the stochastic matching method introduced in A. Sankar and C.-H. Lee, "Robust speech recognition based on stochastic matching," Proc. ICASSP, pp. I-121-124, 1995, is used, and the bias component of each period is obtained according to Equation 4 below so that the product on the right-hand side of Equation 2 is locally maximized. (The bias component obtained in this way is denoted b^(t)^.)

(Equation 4)

In the equation, N_t denotes the length of the data in period t.
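The equation itself is not reproduced in this text. As a hedged reconstruction only, the form below is the closed-form bias estimate usually associated with stochastic matching for Gaussians with diagonal covariance, written with the symbols that paragraphs [0030] to [0032] name (N_t, γ^(t)_jk(n), o^(t)(n), μ_jk, and the variance σ_jk). The element index l is borrowed from paragraph [0034], and σ_jkl is taken to be a variance, as the text calls it; whether this matches the original equation term by term is an assumption.

```latex
\hat{b}^{(t)}_{l} =
\frac{\displaystyle \sum_{n=1}^{N_t} \sum_{j} \sum_{k}
      \gamma^{(t)}_{jk}(n)\,
      \frac{o^{(t)}_{l}(n) - \mu_{jkl}}{\sigma_{jkl}}}
     {\displaystyle \sum_{n=1}^{N_t} \sum_{j} \sum_{k}
      \frac{\gamma^{(t)}_{jk}(n)}{\sigma_{jkl}}}
```

Under this form, each bias element is an occupation-weighted and variance-weighted average of the differences between the observations of period t and the initial-model means.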

[0031] When the stochastic matching method is used, an initial model must first be obtained from the input data, and Equation 4 is written in terms of quantities such as the variances of that initial model. The model generation unit 3 therefore first estimates an initial model, represented by a conventional HMM, by the conventional technique of maximum-likelihood estimation of the HMM parameters from all of the input data (the feature parameter sequences of all periods t = 1 to T).

[0032] Then, using the probability γ^(t)_jk(n) that the observation vector o^(t)(n) occurs in state j, Gaussian distribution k of the initial model, together with the mean vector μ_jk and the variance σ_jk of Gaussian distribution k of that initial model, the bias component b^(t)^ of each period is obtained according to Equation 4. The observation vector o^(t)(n) is a vector whose components are the time series of feature parameters extracted from the speech data of period t.
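The bias estimation of step S2 can be sketched as follows. This is a minimal sketch, assuming the occupation-weighted, variance-weighted average commonly used for a stochastic matching bias with diagonal covariances; it is not confirmed to be identical to Equation 4. The (state, Gaussian) pairs (j, k) are flattened into a single index m, and the function name `estimate_bias` and all array names are hypothetical.

```python
import numpy as np

def estimate_bias(obs, gamma, means, variances):
    """Bias of one period under an assumed stochastic-matching update.

    obs:       (N, D) observation vectors o^(t)(n) of period t
    gamma:     (N, M) occupation probabilities gamma^(t)_jk(n), with the
               (state j, Gaussian k) pairs flattened into M columns
    means:     (M, D) initial-model mean vectors
    variances: (M, D) initial-model variances
    """
    inv_var = 1.0 / variances
    # Weighted residuals between observations and initial-model means.
    residual = (obs[:, None, :] - means[None, :, :]) * inv_var[None, :, :]
    numerator = np.einsum("nm,nmd->d", gamma, residual)
    denominator = np.einsum("nm,md->d", gamma, inv_var)
    return numerator / denominator

# Hypothetical check with a single zero-mean, unit-variance Gaussian:
rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, size=(500, 2)) + np.array([2.0, -1.0])
gamma = np.ones((500, 1))
bias = estimate_bias(obs, gamma, np.zeros((1, 2)), np.ones((1, 2)))
print(bias)
```

With a single zero-mean Gaussian the estimate reduces to the sample mean of the period's observations, which matches the intuition that the bias captures the average shift of that period.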

[0033] Next, based on the model transformation functions estimated in step S2, the mean of each Gaussian distribution forming the HMM is estimated (step S3). Concretely, using the bias components b^(t)^ obtained as described above, the mean vector μ_jk^ of the (uncorrelated) Gaussian distribution k in state j is computed by Equation 5 below.

(Equation 5)

In this equation, γ^(t)_jk(n) is the probability that, in that HMM, the observation vector o^(t)(n) occurs in state j, Gaussian distribution k.
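Because the equation is not reproduced in this text, the following sketch shows one plausible form of the step S3 mean update: an occupation-weighted average of the bias-compensated observations over all periods, as in speaker adaptive training with a bias-only transformation. That form is an assumption, and the function name `estimate_means` and the flattened (state, Gaussian) index are hypothetical.

```python
import numpy as np

def estimate_means(obs_per_period, gamma_per_period, biases):
    """Period-independent means from bias-compensated observations.

    obs_per_period:   list of (N_t, D) arrays, one per period t
    gamma_per_period: list of (N_t, M) occupation probabilities
    biases:           list of (D,) estimated biases, one per period
    """
    numerator, denominator = 0.0, 0.0
    for obs, gamma, bias in zip(obs_per_period, gamma_per_period, biases):
        numerator = numerator + gamma.T @ (obs - bias)   # (M, D)
        denominator = denominator + gamma.sum(axis=0)    # (M,)
    return numerator / denominator[:, None]

# Hypothetical check: two periods, the second shifted by a known bias of 3.
obs_1 = np.full((4, 2), 1.0)
obs_2 = np.full((4, 2), 4.0)
gammas = [np.ones((4, 1)), np.ones((4, 1))]
means = estimate_means([obs_1, obs_2], gammas, [np.zeros(2), np.full(2, 3.0)])
print(means)
```

Subtracting each period's bias before averaging is what keeps the resulting mean free of the period-dependent shift.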

[0034] Then, based on the means of the Gaussian distributions estimated in step S3, the variance of each Gaussian distribution is estimated (step S4). Concretely, using the mean vector μ_jk^ of Gaussian distribution k obtained in step S3, each element σ_jkl^ of the variance vector σ_jk^ (where l is the index identifying the vector element) is computed by Equation 6 below to obtain the variance vector σ_jk^.

(Equation 6)

In the equation, μ_jkl^(t)^ is the l-th element of the mean vector μ_jk^(t)^. Here, the mean vector μ_jk^(t)^ is the mean vector adapted to the data of period t, obtained by transforming the mean vector μ_jk^ of Gaussian distribution k found in step S3 with the model transformation function G^(t) found in step S2 (that is, μ_jk^(t)^ = μ_jk + b^(t)).
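The variance estimation of step S4 can be sketched in the same way. The equation is not reproduced in this text, so the form used here, an occupation-weighted average of squared deviations from the period-adapted mean (the period-independent mean plus the bias of that period), is an assumption consistent with the description; the function name `estimate_variances` is hypothetical.

```python
import numpy as np

def estimate_variances(obs_per_period, gamma_per_period, means, biases):
    """Diagonal variances measured around the period-adapted means.

    means:  (M, D) period-independent mean vectors from step S3
    biases: list of (D,) biases, one per period, from step S2
    """
    numerator, denominator = 0.0, 0.0
    for obs, gamma, bias in zip(obs_per_period, gamma_per_period, biases):
        adapted = means + bias                                # (M, D)
        squared = (obs[:, None, :] - adapted[None, :, :]) ** 2
        numerator = numerator + np.einsum("nm,nmd->md", gamma, squared)
        denominator = denominator + gamma.sum(axis=0)
    return numerator / denominator[:, None]

# Hypothetical check: every frame lies 1 away from its period-adapted mean.
obs_1 = np.array([[0.0], [2.0]])      # period 1, adapted mean 1
obs_2 = np.array([[4.0], [6.0]])      # period 2, adapted mean 1 + 4 = 5
gammas = [np.ones((2, 1)), np.ones((2, 1))]
variances = estimate_variances(
    [obs_1, obs_2], gammas, np.array([[1.0]]), [np.zeros(1), np.full(1, 4.0)])
print(variances)
```

Measuring deviations around the adapted mean of each period, rather than around one pooled mean, is what prevents the between-period shift from inflating the variance.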

[0035] The model generation unit 3 outputs the mean vectors μ_jk^ and variance vectors σ_jk^ obtained in this way to the model storage unit 4 as the parameters of the speaker's HMM (corresponding to the parameter θ^ above) and ends generation of the speaker's speech model (step S5). The model storage unit 4 stores the output parameters as the speaker's speech model. At this time, the Gaussian distributions represented by the elements of the mean vectors μ_jk^ and variance vectors σ_jk^ are weighted as appropriate.

[0036] The above series of processing is performed for each speaker to be registered, and a speech model is generated for each speaker and stored in the model storage unit 4. When time passes and new speech data are to be added, a speech model can be generated in the same way by inputting the new speech data to the feature parameter extraction unit 1.

[0037] (2) Speaker recognition

Next, the operation of recognizing a speaker using the speech model generated as described above is described. At the speaker recognition stage, the speech data of the target speaker are input to the feature parameter extraction unit 1 as speech data for speaker recognition. The feature parameter extraction unit 1 converts the speech data into a time series of feature parameters and supplies it to the similarity calculation unit 5.

【００３８】類似度計算部５では、入力された特徴パラ
メータ列と、モデル蓄積部４に蓄えられた音声モデルと
の類似度が計算される。本実施形態では、特定の話者と
の同一性の判定（話者照合）を行うこととし、話者のＩ
Ｄを併せて入力する等して類似度を計算する音声モデル
を特定することとする。これにより、類似度計算部５
は、その特定された音声モデルをモデル蓄積部４から読
み出し、入力された特徴パラメータ列と読み出した音声
モデルとの類似度を計算し、計算された類似度の値を話
者認識判定部６へ出力する。The similarity calculation unit 5 calculates the similarity between the input feature parameter sequence and a speech model stored in the model storage unit 4. In this embodiment, identity with a specific speaker is judged (speaker verification), and the speech model against which the similarity is calculated is specified by, for example, inputting the speaker's ID together with the speech. The similarity calculation unit 5 reads the specified speech model from the model storage unit 4, calculates the similarity between the input feature parameter sequence and that speech model, and outputs the calculated similarity value to the speaker recognition determination unit 6.

【００３９】続いて、話者認識判定部６が、しきい値蓄
積部７から上記ＩＤに対応する話者の声とみなせる類似
度の変動範囲を示すしきい値を読み出し、類似度計算部
５から入力された類似度と比較する。これにより、当該
類似度の値が読み出されたしきい値よりも大きければ上
記対象話者の音声が上記ＩＤに対応する話者の音声であ
ると判定され、小さければ上記対象話者の音声は上記Ｉ
Ｄに対応する話者以外の他人の音声であると判定され、
その判定結果が出力される。Subsequently, the speaker recognition determination unit 6 reads, from the threshold storage unit 7, a threshold indicating the range of similarity variation that can be regarded as the voice of the speaker corresponding to the ID, and compares it with the similarity input from the similarity calculation unit 5. If the similarity value is greater than the threshold, the target speaker's speech is judged to be that of the speaker corresponding to the ID; if it is smaller, the speech is judged to be that of someone other than the speaker corresponding to the ID. The judgment result is then output.
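The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary standing in for the threshold storage unit 7, and the speaker ID are hypothetical, and treating a similarity exactly equal to the threshold as a rejection is an assumption, since the text only covers the "greater" and "smaller" cases.

```python
def verify_speaker(similarity, threshold):
    """Accept only when the similarity is strictly greater than the threshold."""
    return similarity > threshold


def decide(claimed_id, similarity, thresholds):
    """Look up the threshold for the claimed ID and return the verification result.

    `thresholds` is a hypothetical stand-in for the threshold storage unit 7.
    """
    threshold = thresholds[claimed_id]
    return "accept" if verify_speaker(similarity, threshold) else "reject"


# Illustrative values only; real thresholds depend on the similarity measure used.
example_thresholds = {"spk01": -42.0}
print(decide("spk01", -40.0, example_thresholds))
print(decide("spk01", -45.0, example_thresholds))
```

Keeping one threshold per registered speaker mirrors the text, where the threshold is read out for the ID supplied with the speech.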

【００４０】尚、以上説明した音声モデルの生成と話者
認識は、上述した各動作手順を所定の演算装置に実行さ
せるためのプログラムとして記録媒体に記録しておくこ
ととしてもよく、適宜、その記録媒体を演算装置に読み
取らせることによって、上述した音声モデルの生成や話
者認識を実行することとしてもよい。The speech model generation and speaker recognition described above may also be recorded on a recording medium as a program that causes a predetermined arithmetic device to execute the operation procedures described above, and may be carried out by having the arithmetic device read the recording medium as needed.

【００４１】[0041]

【実施例】次に、上述した実施形態について、実際に音
声データを用いて音声モデルを生成し、話者認識を行っ
た実験に基づく実施例を示す。尚、実験形態には、テキ
スト独立型話者照合実験を例として採り挙げる。[Examples] Next, an example of the above embodiment is described, based on an experiment in which speech models were actually generated from speech data and speaker recognition was performed. A text-independent speaker verification experiment is taken as the example.

【００４２】音声データ使用する音声データとしては、男性２０名が約１６カ月
に亘る７つの時期（それぞれ、時期Ｔ１、Ｔ２、…、Ｔ
７とする）に、静かな部屋で発声した文章データを用い
る。ここで、男性２０名のうち、１０名は登録話者、そ
の他の１０名は詐称者とする。文章データの文章テキス
トはＡＴＲ連続音声テキスト５０３文から抜粋し、各時
期の発声文章数は、音声モデル生成用に５、音声認識の
テスト用に３とする。又、音声モデル生成用の５文章の
テキストは、各時期、各話者毎に異なるものとし、１文
章長は平均約４秒程とする。一方、テスト用の３文章の
テキストは、前記５文章のテキストとは異なるものとす
るが、全話者、全時期で同一とする。Speech data: The speech data consists of sentences uttered in a quiet room by 20 male speakers at seven periods (T1, T2, ..., T7) spanning approximately 16 months. Of the 20 speakers, 10 are registered speakers and the other 10 are impostors. The sentence texts are taken from the ATR 503-sentence continuous speech text set. At each period, 5 sentences are used for speech model generation and 3 for the recognition test. The texts of the 5 model-generation sentences differ for each period and each speaker, and each sentence is about 4 seconds long on average. The texts of the 3 test sentences differ from those of the 5 model-generation sentences but are the same for all speakers and all periods.

【００４３】特徴パラメータ特徴パラメータは、従来から使われている特徴量である
ケプストラムとし、音声データをケプストラムの細かい
時間毎の時系列に変換して用いる。このケプストラム
は、標本化周波数１２ｋＨｚ、フレーム長３２ｍｓ、フ
レーム周期８ｍｓ、ＬＰＣ分析（Linear Predictive Co
ding；線形予測分析）次数１６で抽出する。Feature parameters: The feature parameters are cepstra, a conventionally used feature, and the speech data is converted into a time series of cepstra at short time intervals. The cepstra are extracted with a sampling frequency of 12 kHz, a frame length of 32 ms, a frame period of 8 ms, and an LPC (Linear Predictive Coding) analysis order of 16.
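As a rough illustration of the analysis conditions above, the following sketch splits a signal into overlapping frames using the stated sampling frequency, frame length, and frame period. Only the three numeric settings come from the text; the function name and the choice to drop a trailing partial frame are assumptions, and the LPC cepstrum computation itself is not shown.

```python
import numpy as np

SAMPLE_RATE_HZ = 12_000   # sampling frequency from the text
FRAME_LENGTH_MS = 32      # frame length from the text
FRAME_PERIOD_MS = 8       # frame period (hop) from the text


def frame_signal(signal, sample_rate=SAMPLE_RATE_HZ,
                 frame_length_ms=FRAME_LENGTH_MS, frame_period_ms=FRAME_PERIOD_MS):
    """Return a (n_frames, frame_len) array of overlapping frames.

    A trailing partial frame is dropped (an assumption; the text does not say).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = sample_rate * frame_length_ms // 1000   # 384 samples at 12 kHz
    hop = sample_rate * frame_period_ms // 1000         # 96 samples at 12 kHz
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    index = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[index]


one_second = np.zeros(SAMPLE_RATE_HZ)
print(frame_signal(one_second).shape)
```

Each frame would then go through order-16 LPC analysis to obtain one cepstral vector, so the number of frames equals the length of the feature parameter sequence.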

【００４４】音声モデル音声モデルとしては、１状態、１６無相関混合ガウス分
布の連続型ＨＭＭを用いる。この音声モデルの生成で
は、初めは時期Ｔ１に発声された上記５文章の音声デー
タを用い、その後は各時期に発声された上記５文章の音
声データを逐次加えて用いることにより、逐次学習を行
うこととする。尚、話者のＨＭＭを１６のガウス分布の
重み付き加算によって表すことについては、例えば文献
「松井知子、古井貞煕：“ＶＱ，離散／連続ＨＭＭによ
るテキスト独立型話者認識法の比較検討”，電子情報通
信学会音声研究会資料，SP91-89，1991」にて紹介され
ている。Speech model: The speech model is a continuous HMM with one state and a mixture of 16 uncorrelated (diagonal-covariance) Gaussian distributions. Model generation uses sequential training: initially the speech data of the 5 sentences uttered at period T1 is used, and thereafter the speech data of the 5 sentences uttered at each subsequent period is added in turn. Representing a speaker's HMM as a weighted sum of 16 Gaussian distributions is described in, for example, Tomoko Matsui and Sadaoki Furui, "A comparative study of text-independent speaker recognition methods using VQ and discrete/continuous HMMs," IEICE Speech Technical Group materials, SP91-89, 1991.
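A one-state continuous HMM whose output distribution is a mixture of 16 diagonal-covariance Gaussians scores a frame sequence in the same way as a Gaussian mixture model, because there are no state transitions to choose between. The sketch below computes that per-frame log-likelihood; it is an illustration under this assumption, with hypothetical function names, and it does not include the patent's period-dependent transformation or its training procedure.

```python
import numpy as np


def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian mixture.

    frames: (N, D); weights: (K,); means, variances: (K, D). Returns shape (N,).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)       # (K,)
    log_expo = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (N, K)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_expo
    # Log-sum-exp over mixture components, shifted for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(log_comp - peak).sum(axis=1, keepdims=True)))[:, 0]


def average_loglik(frames, weights, means, variances):
    """Average log-likelihood per frame, usable as a raw similarity score."""
    return float(gmm_frame_loglik(frames, weights, means, variances).mean())


# Toy model: 16 identical unit-variance components in 2 dimensions.
K, D = 16, 2
score = average_loglik(np.zeros((3, D)), np.full(K, 1.0 / K),
                       np.zeros((K, D)), np.ones((K, D)))
print(score)
```

Averaging over frames keeps scores comparable across utterances of different length, which matters when only the first second of each test sentence is scored.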

【００４５】話者認識話者認識のテストには、音声モデルの生成で用いた文章
が発声された時期の次の時期に発声された上記３文章を
用いる。このとき、実際に判定に用いるのは、各３文章
の初めの１秒分ずつとする。Speaker recognition: The speaker recognition test uses the 3 test sentences uttered at the period immediately following the period in which the sentences used for model generation were uttered. Only the first 1 second of each of the 3 sentences is actually used for the decision.

【００４６】以上の条件における学習（音声モデルの生
成）とテストで用いる音声データの発声時期を整理する
と図３に示すようになる。この図において、上段のケー
スＸ、Ａ、Ｂ、Ｃ、Ｄ、Ｅは、それぞれ、中段の“学
習”の欄に示す音声データを用いて音声モデルを生成
し、下段の“テスト”の欄に示す時期に発声された音声
データを用いて話者認識のテストを行う場合に対応す
る。ここで、中段の“学習”の欄では、使用する音声デ
ータの発声時期と文章総数とを併せて示してある。例え
ば、ケースＢでは、時期Ｔ1〜Ｔ3（すなわち、時期Ｔ
1、Ｔ2及びＴ3）に発声された総数１５の音声モデル生
成用の文章データに含まれる音声データを用いて音声モ
デルを生成し、時期Ｔ4に発声されたテスト用の文章デ
ータに含まれる音声データを用いて話者認識のテストを
行う。FIG. 3 summarizes the utterance periods of the speech data used for training (speech model generation) and testing under the above conditions. In this figure, each of the cases X, A, B, C, D, and E in the top row corresponds to generating a speech model from the speech data shown in the middle "training" row and running the speaker recognition test on speech data uttered at the period shown in the bottom "test" row. The "training" row gives both the utterance periods of the speech data used and the total number of sentences. For example, in case B, a speech model is generated from the speech data of the 15 model-generation sentences uttered at periods T1 to T3 (that is, T1, T2, and T3), and the speaker recognition test uses the speech data of the test sentences uttered at period T4.
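The case layout can be written out programmatically. Only case B (training on T1 to T3, 15 sentences, testing on T4) is stated explicitly in the text; the sketch below assumes the other cases follow the same cumulative pattern, with case X training on T1 alone and case E training on T1 to T6. FIG. 3 itself is not reproduced here, so treat the generated table as an inference.

```python
CASES = ["X", "A", "B", "C", "D", "E"]
SENTENCES_PER_PERIOD = 5   # model-generation sentences per period, from the text


def build_schedule(cases=CASES, per_period=SENTENCES_PER_PERIOD):
    """Map each case to its training periods, sentence count, and test period.

    Assumes the i-th case trains on T1..Ti and tests on T(i+1).
    """
    schedule = {}
    for i, case in enumerate(cases, start=1):
        train_periods = [f"T{k}" for k in range(1, i + 1)]
        schedule[case] = {
            "train": train_periods,
            "n_sentences": per_period * len(train_periods),
            "test": f"T{i + 1}",
        }
    return schedule


for case, row in build_schedule().items():
    print(case, row["train"], row["n_sentences"], row["test"])
```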

【００４７】尚、ＨＭＭの尤度は、事後確率に基づく尤
度正規化法（例えば文献「T.Matsuiand S.Furui,"Likel
ihood normalization using a phoneme- and speaker-i
ndependent model for speaker verification",Speech
Communication,Vol.17,No.1-2,pp.109-116,1995.」参
照）によって正規化する。又、ここでは、しきい値はテ
ストデータから計算した詐称者誤り率と本人棄却率とが
等しくなるように設定するものとして、その等誤り率に
より評価を行うこととする。The HMM likelihood is normalized by a likelihood normalization method based on the posterior probability (see, for example, T. Matsui and S. Furui, "Likelihood normalization using a phoneme- and speaker-independent model for speaker verification," Speech Communication, Vol. 17, No. 1-2, pp. 109-116, 1995). The threshold is set so that the false acceptance rate for impostors and the false rejection rate for true speakers, both computed from the test data, are equal, and performance is evaluated by this equal error rate.
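The equal error rate evaluation can be sketched as below. This is one common way to locate the operating point, assumed here for illustration: candidate thresholds are taken from the observed scores, acceptance means "score strictly greater than threshold" as in the verification step, and when the two rates never coincide exactly the threshold with the smallest gap is used and the two rates are averaged. The function name and the toy scores are hypothetical.

```python
import numpy as np


def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) where false acceptance and false rejection are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor > threshold)    # impostors wrongly accepted
        frr = np.mean(genuine <= threshold)    # true speakers wrongly rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return float(best_eer), float(best_threshold)


# Toy scores, not experimental data.
eer, thr = equal_error_rate([1.0, 3.0], [2.0, 0.0])
print(eer, thr)
```

Because the threshold is chosen after the fact from the test scores, the resulting figure describes how separable the two score distributions are rather than the performance of a fixed deployed threshold.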

【００４８】このような条件の下で実験を行うと、図４
に示すような結果が得られる。この図は、各ケースと全
ケース平均の平均話者照合誤り率を示したものである。
図中、“本方法”は上記実施形態にて説明した方法、
“従来法”は各ケースで利用可能なすべてのデータを用
いて話者のＨＭＭを最尤推定する方法、“一時期”は各
ケースで最新のデータ（５文章分の音声データ）のみを
用いて話者のＨＭＭを最尤推定する方法であり、各ケー
スの対応欄はこれらそれぞれの方法による結果を示して
いる。これらの結果により、本方法が従来法と比べて有
効であること、又、複数の時期に発声された音声データ
を用いた方が性能がよいことが分かる。特に、全ケース
平均の誤り率を見てみると、本方法による平均誤り削減
率は、従来法に対しては１２．５％、一時期の方法に対
しては７０．８％となっており、本方法の有効性が顕著
に現れている。Running the experiment under these conditions gives the results shown in FIG. 4. The figure shows the average speaker verification error rate for each case and averaged over all cases. In the figure, "this method" is the method described in the above embodiment; "conventional method" estimates the speaker's HMM by maximum likelihood from all data available in each case; and "single period" estimates the speaker's HMM by maximum likelihood from only the latest data (speech data for 5 sentences) in each case. The column for each case shows the result of each of these methods. The results show that the present method is more effective than the conventional method, and that using speech data uttered at multiple periods gives better performance. In terms of the error rate averaged over all cases, the average error reduction rate of the present method is 12.5% against the conventional method and 70.8% against the single-period method, which clearly demonstrates the effectiveness of the present method.
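The text does not spell out how the error reduction rate is computed. A minimal sketch, assuming it is the usual relative reduction of the baseline error rate, is given below; the function name is hypothetical and the input values are arbitrary placeholders, not figures from the experiment.

```python
def error_reduction_rate(baseline_error, new_error):
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Placeholder error rates in percent, chosen only to show the arithmetic.
print(error_reduction_rate(10.0, 9.0))
```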

【００４９】尚、本方法によって生成されるＨＭＭの分
散については、上記数６に示したが、これは従来法によ
って生成されるＨＭＭの分散よりも小さくなる。従来法
によって生成されるＨＭＭの分散ベクトルσ_jk ^(t)￣
は、状態ｊ、ガウス分布ｋで観測ベクトルｏ^(t)（ｎ）
が出現する確率をγ′^(t) _jk（ｎ）、時期ｔのデータに
適応した平均ベクトルをμ_jk ^(t)￣とすると、The variance of the HMM generated by the present method is given by Equation 6 above, and it is smaller than the variance of the HMM generated by the conventional method. Let γ′^(t)_jk(n) be the probability that the observation vector o^(t)(n) occurs in state j and Gaussian distribution k, and let μ_jk^(t)￣ be the mean vector adapted to the data of period t. With these definitions, the variance vector σ_jk^(t)￣ of the HMM generated by the conventional method is given by Equation 7.

【数７】と表すことができるので（但し、数７は各成分を表
す。）、これに基づいて各ケースにおける分散を求め、
本方法によって生成されるＨＭＭの分散との比較を行う
と図５に示すようになる。(Equation 7 is written per component.) The variance in each case is computed from Equation 7 and compared with the variance of the HMM generated by the present method; the result is shown in FIG. 5.

【００５０】図５は、従来法と本方法、一時期の方法と
本方法それぞれの分散ベクトルの各要素の比の平均を表
している。尚、一時期の方法における分散ベクトルは、
上記数７においてＴ＝１等として最新のデータのみから
生成されるＨＭＭであることを考慮することによって求
めることができる。図示のように、本方法と従来法とを
比べると、従来法の方が分散が大きく、その傾向は利用
するデータの時期数が増えるに従って大きくなる。一
方、本方法と一時期の方法とを比べると、分散値の比は
利用するデータの時期数に依存していない。これは、本
方法によれば、ＨＭＭの分散の広がりを適切に抑えるこ
とができることを意味している。FIG. 5 shows the average, over the elements of the variance vectors, of the variance ratio between the conventional method and the present method, and between the single-period method and the present method. The variance vector for the single-period method can be obtained from Equation 7 by setting, for example, T = 1, reflecting that the HMM is generated from only the latest data. As shown, the conventional method has a larger variance than the present method, and the difference grows as the number of periods of data used increases. In contrast, the variance ratio between the present method and the single-period method does not depend on the number of periods of data used. This means that the present method can appropriately suppress the spread of the HMM variance.

【００５１】上記数６と数７は、共に各時期の分散値を
（確率的に）各データ長に相当する重みを付けて平均す
るものと解釈できる。しかし、本方法は、ある話者が複
数の時期に発声した音声のデータは各時期に対応した複
数の母集団から得られたものであるという考えに立脚す
るものであることから、各時期の分散については、数６
によって求められるものの方が数７のそれよりも、より
一様最小分散不偏推定量に近くなる。このようなことに
起因して、上述のように本方法によってＨＭＭの分散の
広がりを抑えることができるわけである。Equations 6 and 7 can both be interpreted as averaging the per-period variance values with (probabilistic) weights corresponding to the length of each period's data. However, the present method rests on the idea that the speech data uttered by a speaker at multiple periods is drawn from multiple populations, one for each period. For this reason, the per-period variance given by Equation 6 is closer to the uniformly minimum-variance unbiased estimator than that given by Equation 7. This is why the present method can suppress the spread of the HMM variance as described above.
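The multiple-populations argument can be illustrated with a toy calculation. The sketch below is not Equation 6 or Equation 7; it only compares, on synthetic one-dimensional data, a variance computed around a single pooled mean with a data-length-weighted average of variances computed around each period's own mean. The period offsets, sample sizes, and function names are assumptions made for the illustration.

```python
import numpy as np


def pooled_variance(groups):
    """Variance of all samples around one common mean (all periods as one population)."""
    return float(np.var(np.concatenate(groups)))


def within_period_variance(groups):
    """Data-length-weighted average of each period's variance around its own mean."""
    total = sum(len(g) for g in groups)
    return float(sum(len(g) * np.var(g) for g in groups) / total)


# Synthetic data: same spread in every period, but a period-dependent shift of the mean.
rng = np.random.default_rng(0)
period_offsets = [0.0, 0.5, -0.4]
groups = [rng.normal(loc=offset, scale=1.0, size=500) for offset in period_offsets]

print(pooled_variance(groups), within_period_variance(groups))
```

The pooled figure is never smaller than the within-period figure, because it also absorbs the spread between the period means. Adding more periods with shifted means widens the gap, which is consistent with the trend reported for FIG. 5.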

【００５２】[0052]

【発明の効果】以上説明したように本発明によれば、各
時期の特徴パラメータから、話者の音声のうちの時期に
依存しない成分から抽出されたパラメータを推定し、そ
のパラメータに基づいて音声モデルを生成することとし
たので、時期に依存した変動成分が除かれた時期毎の分
散が小さいパラメータに基づく音声モデルの生成が可能
となる。これにより、発声時期に依存して変動する話者
の声に対し、識別能力や照合能力を高水準に維持し得る
音声モデルを生成することができる。ここで、請求項２
記載の発明では、かかる分散の小さいパラメータは、当
該パラメータを各時期に依存した変動相当で変換する変
換因子と共に推定され、請求項３記載の発明では、パラ
メータを変換因子で変換して推定される各時期の特徴パ
ラメータの音声に対する尤度によって推定される。[Effects of the Invention] As described above, according to the present invention, a parameter extracted from the period-independent component of a speaker's speech is estimated from the feature parameters of each period, and a speech model is generated based on that parameter. This makes it possible to generate a speech model based on a parameter whose per-period variance is small, because the period-dependent variation component has been removed. As a result, a speech model can be generated that maintains a high level of discrimination and verification capability against a speaker's voice that varies with the utterance period. In the invention of claim 2, this low-variance parameter is estimated together with a conversion factor that transforms the parameter by the variation specific to each period. In the invention of claim 3, it is estimated using the likelihood, with respect to the speech, of the per-period feature parameters obtained by transforming the parameter with the conversion factor.

【００５３】そして、上記音声モデルを用いて話者認識
を行う請求項４等に記載の発明によれば、複数時期に亘
って高い認識性能を維持することができるという効果が
得られる。According to the invention described in claim 4 and related claims, in which speaker recognition is performed using the above speech model, high recognition performance can be maintained over multiple periods.

[Brief description of the drawings]

【図１】本発明の一実施形態による話者認識装置の構
成を示すブロック図である。FIG. 1 is a block diagram illustrating a configuration of a speaker recognition device according to an embodiment of the present invention.

【図２】図１のモデル生成部３における音声モデルの
生成手順を示す図である。FIG. 2 is a diagram showing the procedure for generating a speech model in the model generation unit 3 of FIG. 1.

【図３】実施例の条件における学習とテストで用いる
音声データの発声時期を示す図である。FIG. 3 is a diagram showing utterance times of voice data used in learning and testing under the conditions of the embodiment.

【図４】実施例の条件下で行った実験の結果を各ケー
スと全ケース平均の平均話者照合誤り率によって示した
図である。FIG. 4 is a diagram illustrating the results of an experiment performed under the conditions of the example, using the average speaker verification error rate of each case and the average of all cases.

【図５】本方法、従来方及び一時期の方法によって生
成されるＨＭＭの分散の比較結果を示す図である。FIG. 5 is a diagram showing a comparison result of variances of HMMs generated by the present method, the conventional method, and the one-time method.

[Explanation of symbols]

１特徴パラメータ抽出部２特徴パラメータ蓄積部３モデル生成部４モデル蓄積部５類似度計算部６話者認識判定部７しきい値蓄積部
1 Feature parameter extraction unit
2 Feature parameter storage unit
3 Model generation unit
4 Model storage unit
5 Similarity calculation unit
6 Speaker recognition determination unit
7 Threshold storage unit

Claims

1. A speech model generation method for extracting a feature parameter representing a speaker's individuality from speech and generating a speech model of the speaker based on the feature parameter, the method comprising: extracting the feature parameter from the speech uttered at each of a plurality of periods; estimating, from the feature parameters of each period, a parameter extracted from a component of the speech that does not depend on the period; and generating, based on that parameter, a speech model of the speaker who uttered the speech.

2. A speech model generation method for extracting a feature parameter representing a speaker's individuality from speech and generating a speech model of the speaker based on the feature parameter, the method comprising: extracting the feature parameter from the speech uttered at each of a plurality of periods; estimating, from the feature parameters of each period, a parameter common to all periods that is extracted from a component of the speech that does not depend on the period, together with a conversion factor that transforms the parameter by the variation corresponding to each period; and generating, based on the parameter common to all periods, a speech model of the speaker who uttered the speech.

3. The speech model generation method according to claim 2, wherein, for each pair of an estimated parameter common to all periods and a conversion factor, the likelihood with respect to the speech of the per-period feature parameters estimated by transforming the common parameter with the conversion factor is obtained, and the speech model is generated based on the common parameter of the pair for which the likelihood is maximized.

4. A speaker recognition method comprising: generating a speech model in advance for each speaker using the speech model generation method according to any one of claims 1 to 3; extracting a feature parameter from speech uttered by an unspecified speaker; obtaining a similarity between the feature parameter and the speech model; and recognizing the unspecified speaker based on the similarity.

5. A speaker recognition device comprising: feature parameter extraction means for extracting a feature parameter representing a speaker's individuality from input speech; model generation means for estimating, from the feature parameters extracted from the input speech of a plurality of periods, a parameter extracted from a period-independent component of the input speech, and generating, based on that parameter, a speech model of the speaker who uttered the input speech; model storage means for storing the generated speech model; and recognition means for obtaining a similarity between a feature parameter extracted from input speech uttered by an unspecified speaker and the stored speech model, and recognizing the unspecified speaker based on the similarity.

6. A speaker recognition device comprising: the feature parameter extraction means, the model storage means, and the recognition means according to claim 5; and model generation means for estimating, from the feature parameters of each period, a parameter common to all periods that is extracted from a period-independent component of the input speech, together with a conversion factor that transforms the parameter by the variation corresponding to each period, and generating, based on the parameter common to all periods, a speech model of the speaker who uttered the input speech.

7. The speaker recognition device according to claim 6, wherein the model generation means obtains, for each pair of an estimated parameter common to all periods and a conversion factor, the likelihood with respect to the input speech of the per-period feature parameters estimated by transforming the common parameter with the conversion factor, and generates the speech model based on the common parameter of the pair having the maximum likelihood.

8. A recording medium storing a program for causing a predetermined arithmetic device to execute: a process of extracting feature parameters representing a speaker's individuality from input speech uttered at each of a plurality of periods; a process of estimating, from the extracted feature parameters of each period, a parameter extracted from a period-independent component of the input speech; and a process of generating, based on that parameter, a speech model of the speaker who uttered the speech.

9. A recording medium storing a program for causing a predetermined arithmetic device to execute: a process of extracting a feature parameter representing a speaker's individuality from input speech; a process of estimating, from the feature parameters extracted from the input speech of a plurality of periods, a parameter extracted from a period-independent component of the input speech, and generating, based on that parameter, a speech model of the speaker who uttered the input speech; a process of storing the generated speech model; and a process of obtaining a similarity between a feature parameter extracted from input speech uttered by an unspecified speaker and the stored speech model, and recognizing the unspecified speaker based on the similarity.