JP2003076390A - Method and system for authenticating speaker

Info

Publication number: JP2003076390A
Application number: JP2001264334A
Authority: JP
Inventors: Shoji Hayakawa (早川 昭二); Chiharu Kawai (河合 千晴)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2001-08-31
Filing date: 2001-08-31
Publication date: 2003-03-14
Anticipated expiration: 2021-08-31
Also published as: JP4440502B2

Abstract

PROBLEM TO BE SOLVED: To provide a speaker authentication method and system for which a minimum duration of registered voice input suffices, without degrading speaker authentication accuracy. SOLUTION: The speaker authentication method, which identifies a speaker from arbitrary utterance content, takes the speaker's voice as input, analyzes it, and extracts and temporarily stores feature parameters. It generates or updates a speaker model of the speaker based on the feature parameters and determines, based on a prescribed criterion, whether the speaker model has been sufficiently trained. If the speaker model is determined not to have been sufficiently trained, the speaker is asked to input additional voice.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speaker authentication system that identifies a speaker by voice. In particular, it relates to speaker registration in a speaker authentication system that can identify the speaker even when the content of the input speech is arbitrary.

[0002]

[Description of the Related Art] With the recent rapid development of computer-related technology, speaker authentication technology, which identifies a speaker from voice input, has also advanced rapidly. As applications of speaker authentication have multiplied, higher authentication accuracy has come to be strongly demanded. Conventionally, accuracy was improved by fixing the utterance content.

[0003] However, some applications do not fix the utterance content, and even then a certain level of speaker authentication accuracy must be maintained. When speaker authentication is based on arbitrary utterance content, it is important for recognition accuracy that the speaker's pre-registered voice contain as many phonemes as possible. Consequently, a longer recording of registered voice is required than for speaker authentication with fixed utterance content.

[0004] For example, Furui's book "Acoustic and Speech Engineering" (音響・音声工学; Kindai Kagaku-sha, 1992, p. 213) notes that speaker authentication with arbitrary utterance content is generally said to require about 10 to 30 seconds of registered voice. On the other hand, it has been reported that, when a general speaker model is created using a probabilistic model, lengthening the recorded registered voice from 30 seconds to 60 seconds halved the identification errors ("Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. on Speech and Audio Process., Vol. 3, No. 1, p. 78 (1995)). It is therefore clear that ensuring sufficient authentication accuracy requires as long a recording of registered voice as possible.

[0005] FIG. 1 shows the principle of a conventional speaker authentication system. As shown in FIG. 1, when voice registration starts, the voice input unit 1 captures the voice, the voice analysis unit 2 converts it into feature parameters, and the feature parameters are temporarily stored in the RAM area 3 or the like.

[0006] A voice registration amount determination unit 4 then determines whether enough feature parameters have been secured to maintain a predetermined authentication accuracy, that is, whether a sufficient amount of voice input has been registered. If it determines that the voice input to be registered is insufficient in quantity, the process returns to the voice input unit 1 and a message prompting the registrant for additional voice input is output. Finally, once the voice input to be registered is determined to be sufficient to maintain the predetermined authentication accuracy, the speaker model generation unit 6 generates a speaker model, which completes the registration.

[0007]

[Problems to Be Solved by the Invention] However, the conventional method described above uses the amount of voice input as the index for determining whether enough feature parameters have been secured to maintain a predetermined authentication accuracy, that is, whether a sufficient amount of voice input has been registered. Consequently, no speaker model is generated until the predetermined amount of voice input has been registered. Whether the speaker model itself is accurate enough for speaker authentication can therefore be verified only after the final speaker model has been generated.

[0008] Moreover, when a sufficient amount of voice input has not been registered, there is no clear index of what kind of speech, and how much of it, must be input for the speaker model to be sufficiently trained. As a result, even when the amount of voice input is sufficient, training for particular phonemes can be insufficient, yielding a speaker model with low authentication accuracy.

[0009] Furthermore, because a large amount of voice input must be registered to give the final speaker model sufficient authentication accuracy, the system as a whole must reserve a large RAM area 3 or the like. Computer resources may therefore be consumed more than necessary.

[0010] In addition, a registering speaker must speak for 30 seconds or more solely for voice input. This is burdensome for the speaker, so it is desirable that voice registration be completed efficiently with as little speaking time as possible.

[0011] To solve the above problems, an object of the present invention is to provide a speaker authentication system and method for which a minimum duration of registered voice input suffices, without lowering speaker authentication accuracy.

[0012]

[Means for Solving the Problems] To achieve the above object, a speaker authentication system according to the present invention identifies a speaker from arbitrary utterance content and comprises: a voice input unit that inputs the speaker's voice; a feature parameter storage unit that analyzes the input voice, extracts feature parameters, and temporarily stores them; a speaker model generation/update unit that generates or updates the speaker's speaker model based on the feature parameters; a speaker model evaluation unit that determines, based on a predetermined criterion, whether the speaker model has been sufficiently trained; and a speaker model storage unit that stores the speaker model in a speaker database. When the speaker model is determined to be insufficiently trained, additional voice input is performed at the voice input unit; when it is determined to be sufficiently trained, the speaker model is stored in the speaker database.

[0013] With this configuration, a speaker model is always generated for each voice input, so the degree of training of the speaker model at the time of input can be ascertained. Because the speaker model is updated with each new voice input, its training advances with every input, and registration can be completed with the minimum voice input needed for the degree of training to reach a predetermined value.

[0014] The speaker authentication system according to the present invention preferably further comprises an utterance content presentation unit that presents to the speaker the utterance content to be input. Because more effective input content can be presented, training of the speaker model can be completed with shorter voice input. The presented content preferably covers as wide a range of phonemes as possible, and preferably includes phonemes that are insufficient or missing in the speaker model generated so far.

[0015] The speaker authentication system according to the present invention preferably further comprises a speech recognition unit that recognizes the speaker's input voice, and an utterance content selection unit that, based on the recognition result of the speech recognition unit, selects the utterance content still lacking for speaker model generation. This allows utterance content that duplicates what has already been recognized to be excluded from re-input.
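
The selection described in paragraphs [0014] and [0015] can be sketched as follows. This is only an illustration under stated assumptions: the phoneme inventory, the candidate prompts with their phoneme annotations, and the function name `select_prompt` are hypothetical, and the greedy scoring rule is not taken from the patent, which does not specify how candidate utterances are ranked.

```python
def select_prompt(candidates, covered, required):
    """Pick the candidate prompt that adds the most still-missing phonemes.

    candidates: dict mapping prompt text -> set of phonemes it contains (assumed given)
    covered:    set of phonemes already recognized in earlier voice inputs
    required:   set of phonemes the speaker model should cover
    Returns the chosen prompt, or None if no candidate adds a missing phoneme.
    """
    missing = required - covered
    best, best_gain = None, 0
    for text, phonemes in candidates.items():
        gain = len(phonemes & missing)  # number of missing phonemes this prompt supplies
        if gain > best_gain:
            best, best_gain = text, gain
    return best


# Hypothetical data for illustration only.
required = {"a", "i", "u", "e", "o", "k", "s", "t"}
covered = {"a", "i", "k"}
candidates = {
    "prompt A": {"a", "k", "i"},  # duplicates what was already recognized
    "prompt B": {"u", "e", "s"},  # supplies three missing phonemes
    "prompt C": {"o", "t"},       # supplies two missing phonemes
}
print(select_prompt(candidates, covered, required))
```

With this toy data the sketch picks "prompt B", and "prompt A" is never chosen because it only repeats phonemes that were already recognized.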

[0016] The speaker authentication system according to the present invention preferably further comprises an unspecified-speaker (speaker-independent) model generated from the voice data of unspecified speakers, and a speaker model temporary storage unit that temporarily stores the speaker model generated or updated by the speaker model generation/update unit. For the first voice input, the speaker model is generated based on the unspecified-speaker model; for the second and subsequent voice inputs, the speaker model is updated based on the speaker model held in the speaker model temporary storage unit; and when the speaker model is determined to be sufficiently trained, the speaker model held in the speaker model temporary storage unit is stored in the speaker database. This secures a predetermined authentication accuracy from the first voice input onward.

[0017] The present invention is also characterized by software that executes the functions of the above speaker authentication system as computer processing steps. Specifically, it is a speaker authentication method that identifies a speaker from arbitrary utterance content, comprising the steps of: inputting the speaker's voice; analyzing the input voice, extracting feature parameters, and temporarily storing them; generating or updating the speaker's speaker model based on the feature parameters; determining, based on a predetermined criterion, whether the speaker model has been sufficiently trained; and storing the speaker model in a speaker database. When the speaker model is determined to be insufficiently trained, additional voice input is performed in the step of inputting the speaker's voice; when it is determined to be sufficiently trained, the speaker model is stored in the speaker database. The invention is further characterized by a computer-executable program that embodies these steps.

[0018] With this configuration, loading and executing the program on a computer realizes a speaker authentication system in which a speaker model is always generated for each voice input, so the degree of training of the speaker model at the time of input can be ascertained. Because the speaker model is updated with each new voice input, its training advances with every input, and registration can be completed with the minimum voice input needed for the degree of training to reach a predetermined value.

[0019]

[Embodiments of the Invention] (Embodiment 1) A speaker authentication system according to Embodiment 1 of the present invention is described below with reference to the drawings. FIG. 2 shows the principle of the speaker authentication system according to Embodiment 1.

[0020] In FIG. 2, reference numeral 1 denotes a voice input unit for inputting the registrant's voice at voice registration, 2 denotes a voice analysis unit that analyzes the input voice and converts it into feature parameters, and 3 denotes a RAM area that temporarily holds the feature parameters.

[0021] Reference numeral 5 denotes a speaker model generation means selection unit that selects, according to how many times registered voice has been input, whether to create or to update the speaker model. For the first registered voice input, the speaker model generation unit 6 generates a speaker model; for the second and subsequent inputs, the speaker model update unit 7 updates it.

[0022] Reference numeral 8 denotes a speaker model evaluation unit that determines, based on a predetermined criterion, whether the speaker model has sufficient authentication accuracy, that is, whether it has been sufficiently trained. If training is determined to be sufficient, the generated speaker model is judged to secure the required authentication accuracy. If training is determined to be insufficient, the generated speaker model is judged not to secure it, and a second or subsequent voice input is made through the voice input unit 1.

[0023] This configuration first makes it possible to create and update the speaker model for each unit of voice input. The first voice input is made and captured by the voice input unit 1. The captured voice is converted into feature parameters by the voice analysis unit 2, and the resulting feature parameters are stored in the RAM area 3. The RAM area 3 therefore needs to hold only the feature parameters of one input unit at a time.

[0024] For the first voice input, the speaker model generation unit 6 receives the feature parameters from the RAM area 3 and generates a speaker model. For the second and subsequent voice inputs, the speaker model update unit 7 receives the feature parameters from the RAM area 3 and updates the speaker model already generated.

[0025] After the speaker model has been generated or updated, the speaker model evaluation unit 8 judges whether it has been sufficiently trained, that is, whether it secures a predetermined recognition accuracy. If training is judged insufficient, the speaker model can be regarded as not yet securing the predetermined recognition accuracy, so the next voice input needed to improve authentication accuracy is captured from the voice input unit 1 as additional voice input.

[0026] If training is determined to be sufficient, the speaker model can be regarded as securing the predetermined recognition accuracy, and no further voice registration is needed.

[0027] Accordingly, determining for each unit of voice input, based on a predetermined criterion, whether the speaker model has been sufficiently trained prevents the registrant from being forced to make unnecessary utterances and also prevents voice registration from ending while the speaker model is still insufficiently trained. In addition, generating and updating the speaker model for each unit of voice input keeps the RAM area needed to hold the feature parameters small, making effective use of computer resources.

[0028] More specifically, the system is configured as shown in FIG. 3, which is a configuration diagram of the speaker authentication system according to Embodiment 1 of the present invention. In FIG. 3, the voice input unit 31 captures the registering speaker's voice and passes it to the voice analysis unit 32, which converts it into feature parameters for speaker authentication.

[0029] The feature parameters extracted by the voice analysis unit 32 are temporarily stored in the RAM 33 or the like for generating or updating the speaker model. They may of course instead be saved in a disk area such as the feature parameter storage unit 34.

[0030] The speaker model generation means selection unit 35 then checks whether a speaker model already exists for the registering speaker. If none exists, that is, if this is the first voice input, the stored feature parameters are passed to the speaker model generation unit 36 to generate a new speaker model. After the speaker model is generated, the speaker model evaluation unit 38 verifies its degree of training.

[0031] If a speaker model already exists for the registering speaker, that is, if this is the second or a subsequent voice input, the speaker model update unit 37 updates the speaker model using the stored feature parameters. The updated speaker model is likewise evaluated by the speaker model evaluation unit 38 to determine whether its degree of training is sufficient.

[0032] If the speaker model evaluation unit 38 judges that training on the input voice is insufficient, that is, that the speaker model does not secure sufficient authentication accuracy, the process returns to the voice input unit 31 and voice is input again. In this case it is preferable to provide a re-input prompting unit 39 that outputs a message prompting the speaker for re-input, because the speaker can then be aware of the speaker model's degree of training.

[0033] If, on the other hand, training is judged sufficient, that is, the speaker model secures sufficient authentication accuracy, the speaker model storage unit 40 saves the model in the speaker model database 41.

[0034] Various kinds of parameters can serve as the feature parameters. Any feature parameter commonly used in voice-based speaker authentication may be used, such as the LPC (Linear Predictive Coding) cepstrum or MFCC (Mel Frequency Cepstral Coefficients).

[0035] Only the feature parameters need to be stored, and about 5 to 15 seconds of captured speech (roughly one to several short sentences) is sufficient as a voice input.

[0036] Various approaches are likewise possible for the speaker model itself and for how it is generated. For example, a probabilistic model such as a GMM (Gaussian Mixture Model) may be used, or a codebook may be created by clustering with the k-means method, the LBG method, or the like. The method of generating the speaker model is not particularly limited; any method that enables speaker authentication by voice input may be used.
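
As one concrete reading of the codebook option in paragraph [0036], the sketch below clusters feature vectors with plain k-means and returns the centroids as the codebook. It is an illustration, not the patent's implementation: the function name `kmeans_codebook`, the initialization from the first `k` frames, and the fixed iteration count are all assumptions, and the input must contain at least `k` frames.

```python
def kmeans_codebook(frames, k, iterations=10):
    """Build a k-entry codebook from feature vectors (lists of floats) by k-means.

    Assumes len(frames) >= k. Initialization from the first k frames is a
    simplification chosen to keep the sketch deterministic.
    """
    centroids = [list(f) for f in frames[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in frames:
            # Assign each frame to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
            )
            clusters[nearest].append(x)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy one-dimensional "features" for illustration only.
codebook = kmeans_codebook([[0.0], [10.0], [0.2], [9.8]], k=2)
print(codebook)
```

In practice the frames would be vectors such as cepstral coefficients rather than the one-dimensional toy values used here.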

[0037] Similarly, various methods are possible for updating the speaker model. Commonly, a model update algorithm such as the MAP (maximum a posteriori) method or Bayesian adaptation is used. The method of updating the speaker model is also not particularly limited; any method that enables speaker authentication by voice input may be used.
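
Paragraph [0037] names MAP adaptation only as one possible update method and gives no formulas. The sketch below shows a common mean-only variant for a diagonal-covariance GMM, in which each mean is pulled toward the new frames in proportion to the soft count of frames the component explains. The function names, the `relevance` constant and its default value, and the decision to leave weights and variances unchanged are assumptions made for illustration.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return MAP-adapted means; weights and variances are left unchanged.

    relevance is an assumed tuning constant (> 0): larger values keep the
    adapted means closer to the prior means.
    """
    k, dim = len(means), len(means[0])
    counts = [0.0] * k                      # soft frame count per component
    sums = [[0.0] * dim for _ in range(k)]  # posterior-weighted sums of frames
    for x in frames:
        logp = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        mx = max(logp)
        post = [math.exp(lp - mx) for lp in logp]  # unnormalized posteriors
        z = sum(post)
        for j in range(k):
            g = post[j] / z
            counts[j] += g
            for d in range(dim):
                sums[j][d] += g * x[d]
    # (sum + r * prior) / (count + r) equals alpha * data_mean + (1 - alpha) * prior
    # with alpha = count / (count + r), and avoids dividing by a zero count.
    return [
        [(sums[j][d] + relevance * means[j][d]) / (counts[j] + relevance)
         for d in range(dim)]
        for j in range(k)
    ]
```

A component that explains few of the new frames barely moves, so a short utterance cannot overwrite what earlier inputs contributed; that property is what makes this kind of update suitable for input-by-input enrollment.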

[0038] In Embodiment 1, whether to generate a new speaker model or update an existing one is decided by the number of voice inputs (whether or not it is the first), but the decision is not limited to this and may instead be based on whether a speaker model has already been generated for the registrant. Furthermore, even when a speaker model has already been generated, a new one may be generated if the speaker wishes to redo the registration itself or if the authentication accuracy of the speaker model is judged not to have reached a predetermined level.

[0039] Whether the generated or updated speaker model has been sufficiently trained is determined using the following criterion.

[0040] First, the difference in vector-space distance, or the difference in likelihood, for the input voice before and after the speaker model is updated is obtained. If the change in this distance difference or likelihood difference is small, training of the speaker model can be judged to have progressed far enough, and training is terminated.
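
For a GMM speaker model, the likelihood difference in paragraph [0040] could be computed as sketched below. The model representation (a tuple of weights, means, and diagonal variances), the function names, and the use of a per-frame average are assumptions; the patent does not state whether the likelihood is averaged or summed over the input.

```python
import math


def avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = [
            math.log(w) + sum(
                -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mu, var)
            )
            for w, mu, var in zip(weights, means, variances)
        ]
        mx = max(comp)
        # log-sum-exp over components, shifted by the maximum for stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total / len(frames)


def log_likelihood_difference(frames, model_before, model_after):
    """Log-likelihood of the same input after the update minus before the update."""
    return avg_log_likelihood(frames, *model_after) - avg_log_likelihood(frames, *model_before)


# Toy single-component models: (weights, means, variances). Illustration only.
before = ([1.0], [[0.0]], [[1.0]])
after = ([1.0], [[1.0]], [[1.0]])
print(log_likelihood_difference([[1.0]], before, after))
```

A value near zero means the update barely changed how well the model explains the new input, which is the situation the paragraph treats as a sign that training has progressed.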

[0041] For example, FIG. 4 shows the log-likelihood difference before and after speaker model updates. Here the initial speaker model was generated from four sentences and then updated one sentence at a time, and the log-likelihood difference before and after each update is plotted. The horizontal axis shows the total number of sentences used to train the speaker model.

[0042] As FIG. 4 shows, the log-likelihood difference before and after an update tends markedly to shrink as training progresses. The speaker model can therefore be judged sufficiently trained once the decrease in the log-likelihood difference has saturated. In FIG. 4, the threshold is taken to be the point at which two voice inputs have yielded a log-likelihood difference of 3 or less before and after the update.
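
The stopping rule of paragraph [0042] (a log-likelihood difference of 3 or less observed for two voice inputs) can be written as a small predicate. The function name and parameter names are hypothetical, and the sketch counts any two qualifying inputs because the text does not say whether they must be consecutive.

```python
def is_sufficiently_trained(loglik_diffs, threshold=3.0, required_hits=2):
    """True once `required_hits` updates changed the log-likelihood by at most `threshold`.

    loglik_diffs: log-likelihood differences observed so far, one per model update.
    The defaults mirror the values quoted for FIG. 4; other values are possible.
    """
    hits = sum(1 for d in loglik_diffs if abs(d) <= threshold)
    return hits >= required_hits


# Made-up differences for illustration; they are not read off FIG. 4.
history = []
for diff in [9.5, 6.1, 2.8, 3.4, 2.1]:
    history.append(diff)
    if is_sufficiently_trained(history):
        break
print(len(history))
```

With these made-up values the loop stops after the fifth update, because only the third and fifth differences are at or below the threshold.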

[0043] The criterion for determining whether the speaker model has been sufficiently trained is not limited to the method described above. For example, when the speaker model is a GMM, the variance values may be used as the index. When a variance value is excessively small, the feature parameters are considered not to have been sufficiently extracted, so the user is prompted for additional voice input.
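
A minimal sketch of the variance-based check in paragraph [0043] follows. The `floor` value and both function names are assumptions; the patent says only that an excessively small variance indicates insufficient training and does not give a numeric limit.

```python
def components_with_small_variance(variances, floor=1e-3):
    """Return indices of GMM components with any per-dimension variance below `floor`.

    variances: list of per-component lists of diagonal variances.
    floor:     assumed lower limit; a suitable value depends on the feature scale.
    """
    return [j for j, comp in enumerate(variances) if any(v < floor for v in comp)]


def needs_more_speech(variances, floor=1e-3):
    """True if at least one component looks under-trained by the variance test."""
    return bool(components_with_small_variance(variances, floor))


# Toy variances for a two-component, two-dimensional model.
print(components_with_small_variance([[0.5, 0.2], [1e-5, 0.3]]))
```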

[0044] When the speaker model is a codebook, the number of samples assigned to each centroid may be used as the index. When few samples are assigned to a centroid, the sample point chosen as the representative point is more likely to be an outlier, so the user is prompted for additional voice input.
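
The codebook check of paragraph [0044] can be sketched by counting how many frames fall nearest to each centroid and flagging centroids that received too few. The function names and the `min_samples` value are assumptions made for illustration; the patent gives no specific count.

```python
def assignment_counts(frames, codebook):
    """Count how many frames are nearest to each codebook centroid."""
    counts = [0] * len(codebook)
    for x in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])),
        )
        counts[nearest] += 1
    return counts


def underpopulated_centroids(counts, min_samples=5):
    """Return indices of centroids supported by fewer than `min_samples` frames."""
    return [j for j, c in enumerate(counts) if c < min_samples]


# Toy data: three frames near the first centroid, one near the second.
counts = assignment_counts([[0.0], [0.1], [0.2], [5.0]], [[0.0], [5.0]])
print(counts, underpopulated_centroids(counts, min_samples=2))
```

If the returned list is non-empty, the registration flow would prompt for additional voice input instead of saving the model.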

[0045] The feature parameters and speaker models are preferably created per user. By registering a user identifier together with the speaker at registration and generating independent feature parameters and an independent speaker model for each user identifier, multiple speakers can be distinguished.

[0046] Next, the processing flow of a program that implements the speaker authentication system according to Embodiment 1 of the present invention is described. FIG. 5 is a flowchart of the processing of that program.

[0047] In FIG. 5, the speaker's voice data is first input (step S501), and feature parameters are extracted from the input voice data (step S502). If this is the first voice input (step S503: Yes), a new speaker model is generated (step S504); if it is the second or a subsequent voice input (step S503: No), the speaker model is updated (step S505).

【００４８】次に、話者モデルの学習が十分であるか否
かを、学習度合を示す判断基準に基づいて判断し（ステ
ップＳ５０６）、話者モデルの学習が不十分であると判
断された場合には（ステップＳ５０６：Ｎｏ）、追加入
力促進メッセージを出力して（ステップＳ５０７）、話
者が再度音声を入力することになる（ステップＳ５０
１）。Next, whether the speaker model has been sufficiently learned is judged based on a criterion indicating the degree of learning (step S506). If the learning is judged insufficient (step S506: No), a message prompting additional input is output (step S507), and the speaker inputs voice again (step S501).

【００４９】一方、話者モデルの学習が十分であると判
断された場合には（ステップＳ５０６：Ｙｅｓ）、生成
・更新された話者モデルを新たな話者モデルとしてデー
タベースに保存することになる（ステップＳ５０８）。On the other hand, if the learning of the speaker model is judged sufficient (step S506: Yes), the generated or updated speaker model is stored in the database as a new speaker model (step S508).
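The flow of steps S501 to S508 can be sketched as a loop. This is a minimal sketch under stated assumptions: the function `enroll` and its callback parameters (`extract`, `generate`, `update`, `is_sufficient`, `prompt`) are hypothetical placeholders, and the toy model used in the example merely counts frames.

```python
from typing import Any, Callable, Dict, Iterable, List


def enroll(
    utterances: Iterable[List[float]],
    extract: Callable[[List[float]], List[float]],
    generate: Callable[[List[float]], Any],
    update: Callable[[Any, List[float]], Any],
    is_sufficient: Callable[[Any], bool],
    prompt: Callable[[str], None],
    database: Dict[str, Any],
    user_id: str,
) -> bool:
    """Run the registration loop; return True once a model has been saved."""
    model = None
    for voice in utterances:               # S501: input voice data
        feats = extract(voice)             # S502: extract feature parameters
        if model is None:                  # S503: first input?
            model = generate(feats)        # S504: generate a new model
        else:
            model = update(model, feats)   # S505: update the existing model
        if is_sufficient(model):           # S506: learned sufficiently?
            database[user_id] = model      # S508: save to the database
            return True
        prompt("Please provide additional speech.")  # S507
    return False


# Toy stand-ins: the "model" only tracks how many frames it has seen.
prompts: List[str] = []
db: Dict[str, Any] = {}
ok = enroll(
    utterances=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    extract=lambda voice: voice,
    generate=lambda f: {"n": len(f), "sum": sum(f)},
    update=lambda m, f: {"n": m["n"] + len(f), "sum": m["sum"] + sum(f)},
    is_sufficient=lambda m: m["n"] >= 5,
    prompt=prompts.append,
    database=db,
    user_id="alice",
)
print(ok, db["alice"], len(prompts))  # True {'n': 6, 'sum': 21.0} 2
```

In this toy run the model is judged sufficient on the third utterance, so two prompts are issued before it is saved.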

【００５０】以上のように本実施の形態１によれば、音
声入力単位ごとに話者モデルとして十分に学習されてい
るか否かを判定することにより、登録者に不必要な音声
入力を強いることを未然に防止するとともに、話者モデ
ルとして学習が不十分である状態で音声登録が終了する
ことも回避することが可能となる。さらに、音声入力単
位ごとに話者モデルを生成・更新することで、特徴パラ
メタを保管しておくために必要な記憶領域を小さくして
おくことができ、計算機資源の有効利用を図ることが可
能となる。As described above, according to the first embodiment, determining for each voice input unit whether the speaker model has been sufficiently learned prevents the registrant from being forced to make unnecessary voice input, and also prevents voice registration from ending while the speaker model is still insufficiently learned. Furthermore, generating or updating the speaker model for each voice input unit reduces the storage area needed to hold the feature parameters, allowing effective use of computer resources.
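The storage saving can be illustrated with an incremental update that keeps only running totals instead of all past feature vectors. The patent does not specify the update rule; the sketch below assumes a single diagonal Gaussian as the model, and the class name `RunningGaussian` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RunningGaussian:
    """Per-dimension count, sum, and sum of squares; frames are not retained."""
    n: int
    total: List[float]
    total_sq: List[float]

    @classmethod
    def empty(cls, dim: int) -> "RunningGaussian":
        return cls(0, [0.0] * dim, [0.0] * dim)

    def update(self, frames: Sequence[Sequence[float]]) -> None:
        """Fold one utterance into the totals; the frames can then be discarded."""
        for frame in frames:
            self.n += 1
            for d, x in enumerate(frame):
                self.total[d] += x
                self.total_sq[d] += x * x

    def mean(self) -> List[float]:
        return [s / self.n for s in self.total]

    def variance(self) -> List[float]:
        """Population variance per dimension, computed from the totals."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.total, self.total_sq)]


model = RunningGaussian.empty(dim=1)
model.update([[1.0], [3.0]])   # first utterance
model.update([[5.0]])          # second utterance
print(model.mean())            # [3.0]
print(round(model.variance()[0], 4))  # 2.6667
```

Memory use stays fixed at three values per dimension regardless of how many utterances are added.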

【００５１】（実施の形態２）以下、本発明の実施の形
態２にかかる話者認証システムについて、図面を参照し
ながら説明する。まず、本発明の実施の形態２にかかる
話者認証システムの構成図を図６に示す。(Second Embodiment) Hereinafter, a speaker authentication system according to a second embodiment of the present invention will be described with reference to the drawings. First, FIG. 6 shows a block diagram of a speaker authentication system according to a second exemplary embodiment of the present invention.

【００５２】図６において、実施の形態１にかかる話者
認証システムと異なる点は、登録のために入力すべき発
声内容を登録者に提示する発声内容提示部６１及び発声
内容制御部６２を備えている点である。すなわち、発声
内容提示部６１において、利用者が入力すべき発声内容
を提示するとともに、発声内容提示部６１に提示する内
容については、発声内容制御部６２によって制御するこ
とになる。In FIG. 6, the difference from the speaker authentication system according to the first embodiment is the addition of an utterance content presentation unit 61, which presents to the registrant the utterance content to be input for registration, and an utterance content control unit 62. That is, the utterance content presentation unit 61 presents the utterance content that the user should input, and the utterance content control unit 62 controls what the utterance content presentation unit 61 presents.

【００５３】まず、発声内容提示部６１において提示さ
れる発声内容は、音素がバランス良く配分されたテキス
トデータを保存しているテキストデータベース６３から
選択して提示する。提示方法は特に限定されるものでは
なく、例えばディスプレイ等の表示装置上で表示出力す
るものであっても良いし、電話回線等を用いて合成音声
によって出力提示するものであっても良い。First, the utterance content presented by the utterance content presentation unit 61 is selected from a text database 63 that stores phonetically balanced text data. The presentation method is not particularly limited: the content may be shown on a display device such as a display, or presented as synthesized speech over a telephone line or the like.

【００５４】次に、発声内容制御部６２においては、既
に生成されている話者モデルに基づいて、過去に提示し
たテキストデータ及び当該テキストデータに対応した音
声入力データの音声認識結果を解析することによって、
既に話者モデルに含まれている音素と含まれていない、
あるいはわずかしか含まれていない音素を明確に認識す
ることになる。そして、話者モデルに含まれていない音
素、あるいはわずかしか含まれていない音素を幅広く含
んでいるテキストデータをテキストデータベース６３か
ら選択することによって、発声内容提示部６１における
提示内容を制御することになる。Next, based on the speaker model already generated, the utterance content control unit 62 analyzes the text data presented in the past and the voice recognition results of the voice input data corresponding to that text data, and thereby clearly distinguishes the phonemes already included in the speaker model from those that are not included or are included only slightly. It then controls the content presented by the utterance content presentation unit 61 by selecting, from the text database 63, text data that broadly covers the phonemes that are missing from, or only slightly represented in, the speaker model.
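The distinction between phonemes that are absent and phonemes that are only slightly represented can be sketched by counting phonemes in the recognition results. This is an illustrative assumption: the function `phoneme_gaps`, the toy phoneme inventory, and the threshold `min_count` are hypothetical and not specified in the patent.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def phoneme_gaps(
    recognized: Iterable[List[str]],
    inventory: Set[str],
    min_count: int = 2,  # hypothetical cutoff for "only slightly included"
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, scarce) phonemes relative to the given inventory."""
    counts = Counter(p for utt in recognized for p in utt if p in inventory)
    missing = {p for p in inventory if counts[p] == 0}
    scarce = {p for p in inventory if 0 < counts[p] < min_count}
    return missing, scarce


inventory = {"a", "i", "u", "k", "s"}
recognized = [["k", "a", "s", "a"], ["a", "k", "i"]]
missing, scarce = phoneme_gaps(recognized, inventory)
print(sorted(missing))  # ['u']
print(sorted(scarce))   # ['i', 's']
```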

【００５５】また、図７に示すように、登録された音声
入力を音声認識する音声認識部７１をさらに備え、発声
内容制御部６２において音声認識部７１における音声認
識結果を用いることで、話者モデル作成のために不足し
ている発声内容を選択するよう制御することも考えられ
る。Further, as shown in FIG. 7, a voice recognition unit 71 that performs voice recognition on the registered voice input may additionally be provided, and the utterance content control unit 62 may use the recognition result of the voice recognition unit 71 to select the utterance content that is still lacking for creating the speaker model.

【００５６】すなわち図７において、１回目の入力音声
は、音声入力部３１で取り込まれ、音声分析部３２で特
徴パラメタが抽出され、話者モデル生成部３６又は話者
モデル更新部３７において音声入力ごとに話者モデルが
生成又は更新される。そして、話者モデルが生成又は更
新された後、入力された音声データに対して音声認識部
７１で音声認識を行い、認識結果を発声内容制御部６２
に送る。That is, in FIG. 7, the first input voice is captured by the voice input unit 31, feature parameters are extracted by the voice analysis unit 32, and the speaker model is generated or updated for each voice input by the speaker model generation unit 36 or the speaker model update unit 37. After the speaker model has been generated or updated, the voice recognition unit 71 performs voice recognition on the input voice data and sends the recognition result to the utterance content control unit 62.

【００５７】発声内容制御部６２では、受け取った音声
認識の結果から必要な音素を含むテキストをテキストデ
ータベース６３より選択し、発声内容提示部６１に送
る。そして、発声内容提示部６１で認証精度を向上させ
るのに最も効果的なテキストデータを発声内容として表
示し、２回目の音声入力を行う。Based on the received voice recognition result, the utterance content control unit 62 selects text containing the needed phonemes from the text database 63 and sends it to the utterance content presentation unit 61. The utterance content presentation unit 61 then displays, as the utterance content, the text data most effective for improving authentication accuracy, and the second voice input is performed.

【００５８】すなわち、既に生成されている話者モデル
は、以前に入力された音声入力に基づいて生成されてい
ることから、まず以前に入力された音声入力について音
声認識を行い、テキストデータベース６３に準備されて
いるテキストデータと照合して、合致度の小さいテキス
トデータを選択するとともに、特定の言語、例えば日本
語における全ての音素を網羅して取り込むことができる
ようにテキストデータを選択することによって、話者モ
デルに含まれていない音素をより多く含むテキストデー
タを提示することが可能となる。That is, because the speaker model already generated was built from previously input voice, voice recognition is first performed on that previous input, and the result is compared against the text data prepared in the text database 63. Text data with a low degree of match is selected, and the text data is also chosen so that all phonemes of a specific language, for example Japanese, are comprehensively covered. This makes it possible to present text data containing more of the phonemes not yet included in the speaker model.
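The selection of text data that adds the most not-yet-covered phonemes can be sketched as follows. This is only one possible scoring rule, assumed for illustration: each candidate is scored by the number of distinct phonemes it contains that are not yet covered. The function name `select_next_text` and the toy database are hypothetical.

```python
from typing import Dict, List, Set


def select_next_text(candidates: Dict[str, List[str]], covered: Set[str]) -> str:
    """Pick the candidate whose phonemes add the most not-yet-covered phonemes.

    Ties are resolved in favor of the candidate listed first.
    """
    return max(candidates, key=lambda text: len(set(candidates[text]) - covered))


# Toy text database: each text is paired with its phoneme sequence.
text_db = {
    "kasa": ["k", "a", "s", "a"],
    "umi": ["u", "m", "i"],
    "suki": ["s", "u", "k", "i"],
}
covered = {"k", "a", "s"}  # phonemes recognized in earlier input

print(select_next_text(text_db, covered))  # umi
```

With `covered` as above, "umi" contributes three new phonemes, "suki" two, and "kasa" none, so "umi" is presented next.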

【００５９】このようにすることで、話者にとって無駄
のない入力作業をすることができ、最小限の発声入力時
間で最大限の効果を期待することが可能となる。This lets the speaker perform input without wasted effort, so that the maximum effect can be expected from the minimum utterance input time.

【００６０】次に、本発明の実施の形態２にかかる話者
認証システムを実現するプログラムの処理の流れについ
て説明する。図８に本発明の実施の形態にかかる話者認
証システムを実現するプログラムの処理の流れ図を示
す。Next, the flow of processing of the program that realizes the speaker authentication system according to the second embodiment of the present invention will be described. FIG. 8 shows a flow chart of processing of a program that realizes the speaker authentication system according to the exemplary embodiment of the present invention.

【００６１】図８において、まず話者の音声データを入
力し（ステップＳ８０１）、入力された音声データから
特徴パラメタを抽出する（ステップＳ８０２）。特徴パ
ラメタの抽出と平行して、入力された音声データに対し
て音声認識を行い（ステップＳ８０３）、認識結果を発
声内容制御部へ渡す。In FIG. 8, first, voice data of a speaker is input (step S801), and characteristic parameters are extracted from the input voice data (step S802). In parallel with the extraction of the characteristic parameter, voice recognition is performed on the input voice data (step S803), and the recognition result is passed to the utterance content control unit.

【００６２】そして、音声入力が１回目である場合には
（ステップＳ８０４：Ｙｅｓ）、話者モデルを新規に生
成し（ステップＳ８０５）、音声入力が２回目以降であ
る場合には（ステップＳ８０４：Ｎｏ）、話者モデルを
更新する（ステップＳ８０６）。When the voice input is the first time (step S804: Yes), a speaker model is newly generated (step S805), and when the voice input is the second time or later (step S804: No), the speaker model is updated (step S806).

【００６３】次に、話者モデルの学習が十分であるか否
かを、学習度合を示す判断基準に基づいて判断し（ステ
ップＳ８０７）、話者モデルの学習が不十分であると判
断された場合には（ステップＳ８０７：Ｎｏ）、音声認
識の内容に基づいて発声内容として事前に準備されてい
るテキストデータベース６３をサーチする（ステップＳ
８０８）。Next, whether the speaker model has been sufficiently learned is judged based on a criterion indicating the degree of learning (step S807). If the learning is judged insufficient (step S807: No), the text database 63, prepared in advance as utterance content, is searched based on the voice recognition result (step S808).

【００６４】そして、音声認識内容と最も一致度の低い
テキストデータ（特定の言語の全音素を最も網羅してい
るテキストデータ）を選択して（ステップＳ８０９）、
当該テキストデータを次の発声内容として話者に提示し
（ステップＳ８１０）、話者は当該テキストデータの内
容を再度音声入力することになる（ステップＳ８０
１）。Then, the text data with the lowest degree of match to the voice recognition result (that is, the text data that most comprehensively covers all phonemes of the specific language) is selected (step S809) and presented to the speaker as the next utterance content (step S810), and the speaker inputs the content of that text data by voice (step S801).

【００６５】一方、話者モデルの学習が十分であると判
断された場合には（ステップＳ８０７：Ｙｅｓ）、生成
・更新された話者モデルを新たな話者モデルとしてデー
タベースに保存することになる（ステップＳ８１１）。On the other hand, when it is determined that the learning of the speaker model is sufficient (step S807: Yes), the generated / updated speaker model is stored in the database as a new speaker model. (Step S811).

【００６６】以上のように本実施の形態２によれば、登
録のために入力すべき発声内容を登録者に提示する手段
を備えることによって、登録者は入力する発声内容を考
えて発声する必要がなく、音声入力時の負担を最小限に
することができる。As described above, according to the second embodiment, providing a means for presenting to the registrant the utterance content to be input for registration means that the registrant does not have to think about what to say, so the burden of voice input can be minimized.

【００６７】また、提示する発声内容により多くの音素
を含むように制御することができることから、音声入力
時間が短時間であっても最も効率的に登録者の発声する
音素を話者モデルに取り込むことが可能となる。Further, because the presented utterance content can be controlled to include more phonemes, the phonemes uttered by the registrant can be incorporated into the speaker model efficiently even when the voice input time is short.

【００６８】さらに、１回目の登録音声入力時に発声さ
れた音声を音声認識する手段を備えており、音声認識結
果を用いて話者モデル作成のために不足している発声内
容を選択し、次回の登録音声入力時に提示することによ
り、短い登録時間で音声登録を終了することを可能にし
ている。Furthermore, a means for performing voice recognition on the voice uttered at the first registration input is provided. The voice recognition result is used to select the utterance content that is still lacking for creating the speaker model, and that content is presented at the next registration input, so that voice registration can be completed in a short registration time.

【００６９】（実施の形態３）以下、本発明の実施の形
態３にかかる話者認証システムについて、図面を参照し
ながら説明する。まず、本発明の実施の形態３にかかる
話者認証システムの構成図を図９に示す。(Third Embodiment) A speaker authentication system according to a third embodiment of the present invention will be described below with reference to the drawings. First, FIG. 9 shows a block diagram of a speaker authentication system according to a third exemplary embodiment of the present invention.

【００７０】図９において、実施の形態１にかかる話者
認証システムと異なる点は、１回目の音声入力におい
て、不特定話者の音声に基づいて事前に生成されている
話者モデルに基づいて話者モデルを生成する点にある。
すなわち、話者モデル生成部３６において事前に生成さ
れている不特定話者モデル９１を参照して、ＲＡＭ３３
に一時記憶されている特徴パラメタに基づいて不特定話
者モデル９１を更新することで新たな話者モデルを生成
することになる。In FIG. 9, the difference from the speaker authentication system according to the first embodiment is that, at the first voice input, the speaker model is generated from a speaker model created in advance from the voices of unspecified speakers. That is, the speaker model generation unit 36 refers to the unspecified speaker model 91 generated in advance and generates a new speaker model by updating the unspecified speaker model 91 based on the feature parameters temporarily stored in the RAM 33.

【００７１】まず、不特定話者モデル９１の生成時にお
いては、１００〜１０００人以上の大量の話者の音声デ
ータを入力し、前述したＧＭＭやコードブック等のモデ
ルを生成することになる。First, when the unspecified speaker model 91 is generated, voice data from a large number of speakers, on the order of 100 to 1,000 or more, is input, and a model such as the GMM or codebook described above is generated.

【００７２】そして、図９に示すように、１回目の音声
入力は音声入力部３１で取り込まれ、音声分析部３２で
特徴パラメタが抽出され、話者モデル生成部３６におい
て、不特定話者モデル９１と抽出された特徴パラメタに
基づいて、音声入力ごとの一時的な話者モデルが生成さ
れる。そして、生成された話者モデルは話者モデル一時
保存部９２において一時的に保存される。Then, as shown in FIG. 9, the first voice input is captured by the voice input unit 31, feature parameters are extracted by the voice analysis unit 32, and the speaker model generation unit 36 generates a temporary speaker model for each voice input based on the unspecified speaker model 91 and the extracted feature parameters. The generated speaker model is temporarily stored in the speaker model temporary storage unit 92.
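The patent does not state how the unspecified speaker model is updated with the extracted feature parameters. As one commonly used possibility, and purely as an assumption here, the sketch below applies relevance-factor MAP adaptation to a single mean vector: the adapted mean interpolates between the prior mean and the sample mean, with the weight growing as more frames are observed. The function name `adapt_mean` and the parameter `relevance` are hypothetical.

```python
from typing import List, Sequence


def adapt_mean(
    prior_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    relevance: float = 16.0,  # assumed relevance factor, not from the patent
) -> List[float]:
    """Shift a speaker-independent mean toward the enrollment data."""
    n = len(frames)
    if n == 0:
        return list(prior_mean)
    dim = len(prior_mean)
    sample_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    alpha = n / (n + relevance)  # more data gives more weight to the speaker
    return [alpha * m + (1.0 - alpha) * p for m, p in zip(sample_mean, prior_mean)]


prior = [0.0, 0.0]                  # mean taken from the unspecified speaker model
frames = [[4.0, 8.0]] * 16          # 16 identical enrollment frames (toy data)
print(adapt_mean(prior, frames))    # [2.0, 4.0]
```

With 16 frames and a relevance factor of 16, the weight is 0.5, so the adapted mean lies halfway between the prior and the speaker's data. This behavior is consistent with the stated aim of obtaining usable accuracy from the first input.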

【００７３】２回目以降の入力音声についても同様に特
徴パラメタが抽出されるが、話者モデル更新部３７にお
いて、話者モデル一時保存部９２に保存されている話者
モデルを更新することで、新たな話者モデルに更新され
ることになる。話者モデルが更新されると、更新された
話者モデルについても話者モデル評価部３８において、
学習度合が十分か否かについて判定することになる。Feature parameters are extracted in the same way from the second and subsequent input voices, but in this case the speaker model update unit 37 updates the speaker model stored in the speaker model temporary storage unit 92 to obtain a new speaker model. Once the speaker model has been updated, the speaker model evaluation unit 38 determines whether the degree of learning of the updated speaker model is sufficient.

【００７４】そして、実施の形態１と同様に、話者モデ
ル評価部３８において、入力音声による学習が不十分で
ある、すなわち十分な認証精度が確保されていない話者
モデルであると判断された場合には、音声入力部３１に
戻って、再度音声入力を行うことになる。この場合、話
者に再入力を促すメッセージを出力する再入力促進部３
９を設けることが好ましい。話者モデルの学習度合を話
者自身が認識できるからである。Then, as in the first embodiment, if the speaker model evaluation unit 38 determines that learning from the input voice is insufficient, that is, that the speaker model does not yet provide sufficient authentication accuracy, processing returns to the voice input unit 31 and voice input is performed again. In this case, it is preferable to provide a re-input prompting unit 39 that outputs a message prompting the speaker to input voice again, because this lets the speaker recognize the degree of learning of the speaker model.

【００７５】一方、学習が十分である、すなわち十分な
認証精度が確保されている話者モデルであると判断され
た場合には、話者モデル保存部４０において、話者モデ
ルデータベース４１として保存することになる。On the other hand, if it is determined that the learning is sufficient, that is, that the speaker model provides sufficient authentication accuracy, the speaker model storage unit 40 stores the speaker model in the speaker model database 41.

【００７６】このようにすることで、１回目の音声入力
時からある程度の認証精度が期待できるとともに、追加
入力についても最小限の発声内容で最大限の効果を期待
することが可能となる。By doing so, it is possible to expect a certain degree of authentication accuracy from the first voice input, and it is possible to expect the maximum effect with respect to additional input with the minimum utterance content.

【００７７】なお、本実施の形態３では、一時的な話者
モデルを話者モデル一時保存部９２に保存しているが、
話者モデルデータベース４１に直接話者モデルを生成
し、認証精度が所定の水準まで確保することができるよ
うになるまで繰り返し音声入力しながら、話者モデルを
更新するようにしても良い。In the third embodiment, the temporary speaker model is stored in the speaker model temporary storage unit 92. Alternatively, the speaker model may be generated directly in the speaker model database 41 and updated through repeated voice input until the authentication accuracy reaches a predetermined level.

【００７８】なお、本発明の実施の形態にかかる話者認
証システムを実現するプログラムは、図１０に示すよう
に、ＣＤ−ＲＯＭ１０２−１やフレキシブルディスク１
０２−２等の可搬型記録媒体１０２だけでなく、通信回
線の先に備えられた他の記憶装置１０１や、コンピュー
タ１０３のハードディスクやＲＡＭ等の記録媒体１０４
のいずれに記憶されるものであっても良く、プログラム
実行時には、プログラムはＤＳＰ上にダウンロードされ
て実行される。The program that realizes the speaker authentication system according to the embodiments of the present invention may, as shown in FIG. 10, be stored not only on a portable recording medium 102 such as a CD-ROM 102-1 or a flexible disk 102-2, but also in another storage device 101 located at the far end of a communication line, or on a recording medium 104 such as a hard disk or RAM of a computer 103. At execution time, the program is downloaded onto a DSP and executed.

【００７９】また、本発明の実施の形態にかかる話者認
証システムにより生成された話者モデル等についても、
図１０に示すように、ＣＤ−ＲＯＭ１０２−１やフレキ
シブルディスク１０２−２等の可搬型記録媒体１０２だ
けでなく、通信回線の先に備えられた他の記憶装置１０
１や、コンピュータ１０３のハードディスクやＲＡＭ、
あるいはフラッシュメモリ等の不揮発性メモリ等に代表
される記録媒体１０４のいずれに記憶されるものであっ
ても良く、例えば本発明にかかる話者認証システムを利
用する際にコンピュータ１０３により読み取られる。Likewise, the speaker models and other data generated by the speaker authentication system according to the embodiments of the present invention may, as shown in FIG. 10, be stored not only on a portable recording medium 102 such as a CD-ROM 102-1 or a flexible disk 102-2, but also in another storage device 101 located at the far end of a communication line, or on a recording medium 104 typified by a hard disk or RAM of the computer 103 or by non-volatile memory such as flash memory. They are read by the computer 103 when, for example, the speaker authentication system according to the present invention is used.

【００８０】[0080]

【発明の効果】以上のように本発明にかかる話者認証シ
ステムによれば、音声入力単位ごとに話者モデルとして
十分に学習されているか否かを判定することにより、登
録者に不必要な音声入力を強いることを未然に防止する
とともに、話者モデルとして学習が不十分である状態で
音声登録が終了することも回避することが可能となる。As described above, according to the speaker authentication system of the present invention, determining for each voice input unit whether the speaker model has been sufficiently learned prevents the registrant from being forced to make unnecessary voice input, and also prevents voice registration from ending while the speaker model is still insufficiently learned.

【００８１】また、本発明にかかる話者認証システムに
よれば、音声入力単位ごとに話者モデルを生成・更新す
ることで、特徴パラメタを保管しておくために必要な記
憶領域を小さくしておくことができ、計算機資源の有効
利用を図ることが可能となる。Further, according to the speaker authentication system of the present invention, generating or updating the speaker model for each voice input unit reduces the storage area needed to hold the feature parameters, allowing effective use of computer resources.

[Brief description of drawings]

【図１】従来の話者認証システムの原理図FIG. 1 is a principle diagram of a conventional speaker authentication system.

【図２】本発明の実施の形態１にかかる話者認証シス
テムの原理図FIG. 2 is a principle diagram of a speaker authentication system according to the first embodiment of the present invention.

【図３】本発明の実施の形態１にかかる話者認証シス
テムの構成図FIG. 3 is a configuration diagram of a speaker authentication system according to the first exemplary embodiment of the present invention.

【図４】本発明の実施の形態１にかかる話者認証シス
テムにおける話者モデル更新前後の対数尤度差を示す図FIG. 4 is a diagram showing a log likelihood difference before and after a speaker model update in the speaker authentication system according to the first exemplary embodiment of the present invention.

【図５】本発明の実施の形態１にかかる話者認証シス
テムにおける処理の流れ図FIG. 5 is a flowchart of processing in the speaker authentication system according to the first exemplary embodiment of the present invention.

【図６】本発明の実施の形態２にかかる話者認証シス
テムの構成図FIG. 6 is a configuration diagram of a speaker authentication system according to a second embodiment of the present invention.

【図７】本発明の実施の形態２にかかる話者認証シス
テムの構成図FIG. 7 is a configuration diagram of a speaker authentication system according to a second embodiment of the present invention.

【図８】本発明の実施の形態２にかかる話者認証シス
テムにおける処理の流れ図FIG. 8 is a flowchart of processing in the speaker authentication system according to the second exemplary embodiment of the present invention.

【図９】本発明の実施の形態３にかかる話者認証シス
テムの構成図FIG. 9 is a configuration diagram of a speaker authentication system according to a third embodiment of the present invention.

【図１０】コンピュータ環境の例示図FIG. 10 is an exemplary diagram of a computer environment.

[Explanation of symbols]

１、３１ 音声入力部 1, 31: Voice input unit
２、３２ 音声分析部 2, 32: Voice analysis unit
３ ＲＡＭ領域 3: RAM area
４ 音声登録量判定部 4: Voice registration amount determination unit
５、３５ 話者モデル生成手段選択部 5, 35: Speaker model generation means selection unit
６、３６ 話者モデル生成部 6, 36: Speaker model generation unit
７、３７ 話者モデル更新部 7, 37: Speaker model update unit
８、３８ 話者モデル評価部 8, 38: Speaker model evaluation unit
３３ ＲＡＭ 33: RAM
３４ 特徴パラメタ記憶部 34: Feature parameter storage unit
３９ 再入力促進部 39: Re-input prompting unit
４０ 話者モデル保存部 40: Speaker model storage unit
４１ 話者モデルデータベース 41: Speaker model database
６１ 発声内容提示部 61: Utterance content presentation unit
６２ 発声内容制御部 62: Utterance content control unit
６３ テキストデータベース 63: Text database
７１ 音声認識部 71: Voice recognition unit
９１ 不特定話者モデル 91: Unspecified speaker model
９２ 話者モデル一時保存部 92: Speaker model temporary storage unit
１０１ 回線先の記憶装置 101: Storage device at the far end of a communication line
１０２ ＣＤ−ＲＯＭやフレキシブルディスク等の可搬型記録媒体 102: Portable recording medium such as a CD-ROM or flexible disk
１０２−１ ＣＤ−ＲＯＭ 102-1: CD-ROM
１０２−２ フレキシブルディスク 102-2: Flexible disk
１０３ コンピュータ 103: Computer
１０４ コンピュータ上のＲＡＭ／ハードディスク等の記録媒体 104: Recording medium such as RAM or a hard disk on the computer

フロントページの続き (72)発明者 河合 千晴 神奈川県川崎市中原区上小田中４丁目１番１号 富士通株式会社内 Ｆターム(参考) 5D015 AA03 GG06 Continuation of front page: (72) Inventor Chiharu Kawai, c/o Fujitsu Limited, 4-1-1 Kamiodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa. F-term (reference): 5D015 AA03 GG06

Claims

[Claims]

1. A speaker authentication system for identifying a speaker with arbitrary utterance content, comprising: a voice input unit that inputs the voice of the speaker; a feature parameter storage unit that analyzes the input voice of the speaker and extracts and temporarily stores feature parameters; a speaker model generation/update unit that generates or updates a speaker model of the speaker based on the feature parameters; a speaker model evaluation unit that determines, based on a predetermined criterion, whether the speaker model has been sufficiently learned; and a speaker model storage unit that stores the speaker model as a speaker database, wherein additional voice input is performed at the voice input unit when the learning of the speaker model is determined to be insufficient, and the speaker model is stored in the speaker database when the learning of the speaker model is determined to be sufficient.

2. The speaker authentication system according to claim 1, further comprising an utterance content presentation unit that presents to the speaker the utterance content to be input.

3. The speaker authentication system according to claim 1, further comprising: a voice recognition unit that recognizes the input voice of the speaker; and an utterance content selection unit that selects, based on a recognition result of the voice recognition unit, utterance content that is lacking for generating the speaker model.

4. The speaker authentication system according to claim 1, further comprising: an unspecified speaker model generated based on voice data of unspecified speakers; and a speaker model temporary storage unit that temporarily stores the speaker model generated or updated by the speaker model generation/update unit, wherein the speaker model is generated based on the unspecified speaker model in the case of the first voice input, the speaker model is updated based on the speaker model stored in the speaker model temporary storage unit in the case of the second and subsequent voice inputs, and, when the learning of the speaker model is determined to be sufficient, the speaker model stored in the speaker model temporary storage unit is stored in the speaker database.

5. A speaker authentication method for identifying a speaker with arbitrary utterance content, comprising the steps of: inputting the voice of a speaker; analyzing the input voice of the speaker, and extracting and temporarily storing feature parameters; generating or updating a speaker model of the speaker based on the feature parameters; determining, based on a predetermined criterion, whether the speaker model has been sufficiently learned; and storing the speaker model as a speaker database, wherein additional voice input is performed in the step of inputting the voice of the speaker when the learning of the speaker model is determined to be insufficient, and the speaker model is stored in the speaker database when the learning of the speaker model is determined to be sufficient.

6. A computer-executable program embodying a speaker authentication method for identifying a speaker with arbitrary utterance content, the program causing a computer to execute the steps of: inputting the voice of a speaker; analyzing the input voice of the speaker, and extracting and temporarily storing feature parameters; generating or updating a speaker model of the speaker based on the feature parameters; determining, based on a predetermined criterion, whether the speaker model has been sufficiently learned; and storing the speaker model as a speaker database, wherein additional voice input is performed in the step of inputting the voice of the speaker when the learning of the speaker model is determined to be insufficient, and the speaker model is stored in the speaker database when the learning of the speaker model is determined to be sufficient.