JP3522421B2 - Speaker recognition system and speaker recognition method

Info

Publication number: JP3522421B2
Application number: JP30655695A
Authority: JP
Inventors: 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-10-31
Filing date: 1995-10-31
Publication date: 2004-04-26
Anticipated expiration: 2015-10-31
Also published as: JPH09127973A

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speaker recognition system and a speaker recognition method for performing speaker recognition.

【０００２】[0002]

[Prior Art] Conventionally, when a user uses an application such as deposit/withdrawal or balance inquiry at a bank, for example, the user is asked to enter a personal identification number (PIN) or the like to confirm that the user is the legitimate user. Computers likewise confirm identity by having the user enter a secret character string, called a password, that serves the same purpose as a PIN. However, confirmation that relies on entering a PIN or password can be defeated without difficulty by anyone who learns the PIN or password. Moreover, PINs and passwords are often derived from the birth date, an anniversary, the telephone number, or the spelling of the name of the person who registered them, so it is not very difficult for another person to guess them.

【０００３】[0003]

To avoid these drawbacks of PINs and passwords, so-called speaker recognition, which determines from a person's voice whether the person is who he or she claims to be, has attracted attention in recent years. Speaker recognition determines (recognizes) whether a speaker is the claimed person by checking whether the feature quantity (feature pattern) of speech uttered by the speaker matches the standard speech pattern registered in advance for that speaker. That is, the similarity between the feature quantity (feature pattern) extracted from the speaker's speech and the speaker's standard speech pattern is calculated, and the speaker is accepted or rejected according to how high the similarity is. Because this approach relies on a physical characteristic of the person, a voice is harder for another person to imitate than a PIN or password, and misuse by others can therefore be prevented more effectively.

【０００４】[0004]

[Problems to Be Solved by the Invention] However, such speaker recognition systems have conventionally been installed at, for example, a bank counter, so the user has had to visit the bank counter or a similar location every time speaker recognition is to be performed.

【０００５】[0005]

An object of the present invention is to provide a speaker recognition system and a speaker recognition method that allow speaker recognition to be performed without the user visiting, for example, a bank counter, so that the user can use applications provided by a bank or the like, such as deposit/withdrawal and balance inquiry.

【０００６】[0006]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is characterized in that: at least one terminal and a central device are provided so that they can transmit and receive information; the terminal is provided with voice input means for receiving a speaker's voice and converting it into a speech signal, feature extraction means for extracting a feature quantity of the speech signal, and similarity calculation means for calculating the similarity between the feature quantity of the speaker's speech and a speech feature quantity serving as speaker recognition information; the central device is provided with speaker recognition management means for managing the speaker recognition information and determination means for determining the speaker on the basis of the similarity from the similarity calculation means of the terminal; the speaker recognition information used for the similarity calculation in the terminal is transferred from the central device to the terminal; and the similarity calculated by the similarity calculation means of the terminal is transferred from the terminal to the central device.
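The division of labor in claim 1 can be illustrated with a minimal Python sketch. It assumes cosine similarity as the similarity measure and an arbitrary threshold of 0.9; neither is specified in the claim, and the names `CentralDevice`, `Terminal`, `send_template`, and `judge` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Similarity in [-1, 1]; stands in for the unspecified similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class CentralDevice:
    """Manages speaker recognition information and makes the final decision."""
    def __init__(self, templates, threshold=0.9):
        self.templates = templates      # speaker id -> registered feature vector
        self.threshold = threshold      # assumed value, not taken from the patent

    def send_template(self, speaker_id):
        return self.templates[speaker_id]

    def judge(self, similarity):
        return similarity >= self.threshold

class Terminal:
    """Holds the features extracted locally and computes the similarity."""
    def __init__(self, central):
        self.central = central

    def authenticate(self, speaker_id, features):
        template = self.central.send_template(speaker_id)   # central -> terminal
        similarity = cosine_similarity(features, template)
        return self.central.judge(similarity)               # terminal -> central

central = CentralDevice({"D": [1.0, 2.0, 3.0]})
terminal = Terminal(central)
accepted = terminal.authenticate("D", [1.0, 2.1, 2.9])   # close to D's template
rejected = terminal.authenticate("D", [3.0, -1.0, 0.5])  # far from D's template
```

In this sketch only the registered pattern and a single similarity value cross the link between terminal and central device; the raw speech stays on the terminal.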

【０００７】[0007]

【０００８】[0008]

The invention according to claim 2 is characterized in that: at least one terminal and a central device are provided so that they can transmit and receive information; the terminal is provided with voice input means for receiving a speaker's voice and converting it into a speech signal, feature extraction means for extracting a feature quantity of the speech signal, similarity calculation means for calculating the similarity between the feature quantity of the speaker's speech and a speech feature quantity serving as speaker recognition information, and first determination means for determining the speaker on the basis of the similarity from the similarity calculation means; the central device is provided with second determination means for determining the speaker on the basis of the similarity from the similarity calculation means of the terminal; and the similarity calculated by the similarity calculation means of the terminal is either given to the first determination means of the terminal or transferred from the terminal to the second determination means of the central device.

【０００９】[0009]

The invention according to claim 3 is characterized in that, in the speaker recognition system according to claim 1 or claim 2, predetermined information is further transferred from the central device to the terminal, and the feature extraction means provided in the terminal converts the input speech into a feature quantity on the basis of the information provided by the central device.

【００１０】[0010]

The invention according to claim 4 is characterized in that, in the speaker recognition system according to claim 1, the speaker recognition information managed by the speaker recognition management means is changed or corrected only with information from a designated terminal.

【００１１】[0011]

The invention according to claim 5 is characterized in that, in the speaker recognition system according to claim 1, the speaker recognition information for any one person can be supplied from the central device to only one terminal at a time.
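One way to read the restriction in claim 5 is as an exclusive checkout of each speaker's recognition information. The following Python sketch is only an illustration under that assumption: the class `TemplateServer`, its `request` and `release` methods, and the idea of an explicit release step are all hypothetical and do not appear in the patent text.

```python
class TemplateServer:
    """Central-device side: lends each speaker's pattern to one terminal at a time."""
    def __init__(self, templates):
        self.templates = templates
        self.checked_out = {}           # speaker id -> id of the terminal holding it

    def request(self, speaker_id, terminal_id):
        holder = self.checked_out.get(speaker_id)
        if holder is not None and holder != terminal_id:
            return None                 # already supplied to another terminal
        self.checked_out[speaker_id] = terminal_id
        return self.templates[speaker_id]

    def release(self, speaker_id, terminal_id):
        # Only the terminal that holds the pattern can give it back.
        if self.checked_out.get(speaker_id) == terminal_id:
            del self.checked_out[speaker_id]

server = TemplateServer({"D": [0.2, 0.7]})
first = server.request("D", "terminal-1")
second = server.request("D", "terminal-2")   # refused while terminal-1 holds it
server.release("D", "terminal-1")
third = server.request("D", "terminal-2")    # succeeds after the release
```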

【００１２】[0012]

【００１３】[0013]

The invention according to claim 6 is characterized in that: at least one terminal and a central device are provided so that they can transmit and receive information; when a speaker's voice is input to the terminal, the terminal extracts a feature quantity of the speech signal and calculates the similarity between the feature quantity of the speaker's speech and a speech feature quantity serving as speaker recognition information; the speaker recognition information used for the similarity calculation in the terminal is transferred from the central device to the terminal; the similarity calculated in the terminal is transferred from the terminal to the central device; and the central device determines the speaker on the basis of the transferred similarity.

【００１４】[0014]

The invention according to claim 7 is characterized in that: at least one terminal and a central device are provided so that they can transmit and receive information; when a speaker's voice is input to the terminal, the terminal extracts a feature quantity of the speech signal and calculates the similarity between the feature quantity of the speaker's speech and a speech feature quantity serving as speaker recognition information; the similarity calculated in the terminal is either used for speaker determination in the terminal or transferred from the terminal to the central device and used for speaker determination in the central device; and the threshold for the speaker determination is changed according to whether the terminal performs processing only within the terminal or transmits and receives information to and from an external device.
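The mode-dependent threshold of claim 7 can be sketched in a few lines of Python. The claim states only that the threshold changes with the mode; the two numeric values below, and the choice to make the external mode the stricter one, are assumptions made for illustration.

```python
LOCAL_THRESHOLD = 0.80      # assumed value for processing confined to the terminal
EXTERNAL_THRESHOLD = 0.95   # assumed value when an external device is involved

def decision_threshold(uses_external_device):
    """Pick the acceptance threshold according to the processing mode."""
    return EXTERNAL_THRESHOLD if uses_external_device else LOCAL_THRESHOLD

def is_accepted(similarity, uses_external_device):
    return similarity >= decision_threshold(uses_external_device)

# The same similarity can pass locally and fail for an external transaction.
local_result = is_accepted(0.90, uses_external_device=False)
external_result = is_accepted(0.90, uses_external_device=True)
```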

【００１５】[0015]

【００１６】[0016]

[Embodiments of the Invention] FIG. 1 is a diagram showing a configuration example of a general speaker recognition system. Referring to FIG. 1, this speaker recognition system is used to confirm a person's identity by speaker recognition at, for example, a bank. It comprises: voice input means (for example, a microphone) 1 for inputting the user's voice; designation means (for example, a keyboard) 2 for having the user enter predetermined designation information; a speech section detection unit 3 that detects, as a speech section, only the portion of the signal from the voice input means 1 that contains the speaker's speech; a feature extraction unit 4 that extracts a feature quantity (feature pattern) from the speech signal within the speech section detected by the speech section detection unit 3; a registration unit 6 that, before speaker recognition is performed, registers a standard feature quantity (feature pattern) of the speaker's speech in advance as a standard pattern in a speaker recognition information storage unit 5; a speaker recognition unit 7 that compares the feature quantity (feature pattern) of the user's (speaker's) speech with the standard pattern registered in the speaker recognition information storage unit 5 and performs speaker recognition on the basis of their similarity; and a switching unit (for example, a switch) 8 that switches between a registration mode, in which a standard pattern is registered, and a recognition mode, in which speaker recognition is performed.

【００１７】[0017]

Here, the feature extraction unit 4 may convert the speech signal into a spectrum or into an LPC cepstrum as the feature quantity (feature pattern); the type of feature quantity is not particularly limited. An FFT is suitable for conversion into a spectrum, and LPC analysis or the like is suitable for conversion into an LPC cepstrum.
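As an illustration of the spectrum option, the following Python sketch computes the magnitude spectrum of one frame. It uses a direct discrete Fourier transform from the standard library for self-containment; an FFT produces the same values more efficiently. The function name `magnitude_spectrum` and the 16-sample test frame are assumptions, and windowing and framing of real speech are omitted.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Return |X[k]| for k = 0 .. N/2 using a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum

# A 16-sample frame containing a single cosine at frequency bin 3.
frame = [math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The resulting list of magnitudes is one possible feature pattern for the frame; the energy of the test signal concentrates in bin 3.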

【００１８】[0018]

When a standard pattern is registered (in the registration mode), the registration unit 6 registers the feature quantity (feature pattern) that the feature extraction unit 4 extracted from speech uttered by a speaker as a standard pattern in the speaker recognition information storage unit 5. In doing so, as shown in FIG. 2, it can register the standard pattern in association with designation information entered by that speaker through the designation means 2 (for example, the speaker's name, date of birth, or PIN). In other words, the speaker recognition information needed for speaker recognition is registered in the speaker recognition information storage unit 5, and speaker recognition information for a plurality of speakers (for example, users A, B, C, D, ...) can be registered there.

【００１９】[0019]

Depending on how the speaker recognition system is used, the standard speech pattern registered in the speaker recognition information storage unit 5 may be obtained by having each user (speaker) utter a predetermined word, or by having each user freely utter a word of his or her own choosing.

【００２０】[0020]

The speaker recognition unit 7 may use the speaker identification method, as described in, for example, Furui, "Digital Speech Processing" (Tokai Publishing), which determines which of the standard patterns of the plurality of speakers registered in the speaker recognition information storage unit 5 the feature pattern of the current speaker's speech most resembles, and thereby identifies one speaker among the registered speakers. Alternatively, it may use the speaker verification method, which retrieves the standard pattern corresponding to the current speaker from the standard patterns of the plurality of speakers registered in the speaker recognition information storage unit 5, compares it with the current speaker's feature pattern, and determines whether the current speaker is the legitimate speaker according to whether the similarity is higher or lower than a predetermined reference value (threshold).
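The difference between the two methods can be sketched in Python as follows. The similarity measure (negative Euclidean distance), the two-dimensional patterns, and the threshold of -0.5 are assumptions chosen for illustration; the function names `identify` and `verify` are hypothetical.

```python
import math

def similarity(a, b):
    """Negative Euclidean distance: larger means more similar (an assumed measure)."""
    return -math.dist(a, b)

def identify(features, templates):
    """Speaker identification: pick the registered speaker whose pattern is closest."""
    return max(templates, key=lambda name: similarity(features, templates[name]))

def verify(features, claimed, templates, threshold):
    """Speaker verification: compare only against the claimed speaker's pattern."""
    return similarity(features, templates[claimed]) >= threshold

templates = {"A": [0.0, 0.0], "B": [1.0, 1.0], "C": [4.0, 0.0]}
utterance = [0.9, 1.2]

best_match = identify(utterance, templates)            # searches all speakers
claim_b = verify(utterance, "B", templates, -0.5)      # checks one claimed identity
claim_a = verify(utterance, "A", templates, -0.5)
```

Identification always returns some registered speaker, whereas verification can reject the claim outright, which is why verification needs the designation information to select the pattern to compare against.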

【００２１】[0021]

Furthermore, when the standard speech patterns registered in the speaker recognition information storage unit 5 were obtained by having each user (speaker) utter a predetermined word, the speaker recognition unit 7 can be configured to perform recognition suited to that case; when they were obtained by having each user freely utter a word of his or her own choosing, it can be configured to perform recognition suited to that case. When speaker recognition is performed by having each user utter a predetermined word, the similarity criterion (threshold) can be set to the same fixed value for all speakers. When it is performed by having each user utter a word of his or her own choosing, the similarity criterion (threshold) can also be made different for each speaker.

【００２２】[0022]

In the following, for convenience of explanation, it is assumed that the speaker recognition system has each user (speaker) utter a predetermined word (a specific word) and that the speaker recognition unit 7 performs speaker recognition by the speaker verification method. When the speaker recognition unit 7 uses the speaker verification method, the user (speaker) must, at recognition time, enter through the designation means 2 the same designation information that was entered in the registration mode. This allows the speaker recognition unit 7 to retrieve the standard pattern corresponding to the current speaker from the standard patterns of the plurality of speakers registered in the speaker recognition information storage unit 5 and to compare it with the feature pattern of the current speaker's speech.

【００２３】[0023]

When a user (for example, D) uses a speaker recognition system of this configuration for the first time, user (speaker) D must first register his or her own speech as a standard pattern. To do so, user D operates the switching unit (for example, a switch) 8 to connect the feature extraction unit 4 to the registration unit 6, setting the system to the registration mode.

【００２４】[0024]

Next, user (speaker) D enters predetermined designation information, for example (user D), through the designation means 2, and utters the predetermined specific word. The speech is input through the voice input means 1, converted into a feature quantity (feature pattern) by the speech section detection unit 3 and the feature extraction unit 4, and given to the registration unit 6 as the standard pattern of this speaker's speech.

【００２５】[0025]

The registration unit 6 then registers the standard pattern of user (speaker) D's speech in the speaker recognition information storage unit 5 in association with the designation information entered through the designation means 2. For example, if a plurality of other users A, B, and C have previously registered their own speech as standard patterns in the speaker recognition information storage unit 5 and the current user D registers his or her speech as described above, the standard pattern is stored (registered) in the speaker recognition information storage unit 5 as shown in FIG. 2.
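The association between designation information and standard patterns described here behaves like a keyed table. A minimal Python sketch, assuming a plain dictionary and made-up two-element patterns (the class `SpeakerInfoStore` and its methods are hypothetical stand-ins for storage unit 5 and registration unit 6):

```python
class SpeakerInfoStore:
    """Stand-in for storage unit 5: designation information -> standard pattern."""
    def __init__(self):
        self._patterns = {}

    def register(self, designation, pattern):
        # Registering again under the same designation replaces the old pattern.
        self._patterns[designation] = list(pattern)

    def lookup(self, designation):
        # Returns None when nothing is registered under this designation.
        return self._patterns.get(designation)

store = SpeakerInfoStore()
for name, pattern in [("A", [0.1, 0.4]), ("B", [0.3, 0.2]), ("C", [0.9, 0.5])]:
    store.register(name, pattern)
store.register("D", [0.6, 0.8])     # user D registers after A, B, and C
```

At recognition time, `store.lookup("D")` would return the pattern to compare against, which is why the same designation information has to be entered again.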

【００２６】[0026]

Once the standard speech pattern has been stored in the speaker recognition information storage unit 5 in this way, user D can have the speaker recognition system perform speaker recognition for user D. That is, the system can be used to determine whether the person currently using it is in fact user D.

【００２７】[0027]

Specifically, when user D subsequently uses the system, user D operates the switching unit 8 to connect the feature extraction unit 4 to the speaker recognition unit 7, setting the system to the recognition mode.

【００２８】[0028]

Next, user D enters predetermined designation information, for example (user D), through the designation means 2, and utters the predetermined specific word. The speech is input through the voice input means 1, converted into a feature quantity (feature pattern) by the speech section detection unit 3 and the feature extraction unit 4, and given to the speaker recognition unit 7.

【００２９】[0029]

The speaker recognition unit 7 then retrieves from the speaker recognition information storage unit 5 the standard pattern registered under the designation information (user D) entered through the designation means 2, compares it with the feature pattern from the feature extraction unit 4 to calculate their similarity, and determines whether the similarity is higher or lower than a predetermined reference value. If the similarity is judged to be low, the unit determines that the user is not the legitimate speaker D and rejects use by this user. If the similarity is judged to be high, it determines that the user is the legitimate speaker D and permits use; that is, the user is permitted to use the application (for example, processing such as deposit/withdrawal or balance inquiry).

【００３０】[0030]

However, a speaker recognition system such as that of FIG. 1 has conventionally been installed at, for example, a bank counter, so the user has had to visit the bank counter or a similar location every time speaker recognition is to be performed.

【００３１】[0031]

The present invention is intended to avoid this problem by allowing the user to perform speaker recognition at home, at the user's workplace, or elsewhere, and thereby to use applications of a bank or the like (such as deposit/withdrawal and balance inquiry).

【００３２】[0032]

FIG. 3 is a diagram showing a first configuration example of a speaker recognition system according to the present invention. In this first configuration example, at least one terminal 31-1 to 31-n and a central device 32 are provided so that they can transmit and receive information (for example, they can communicate through wired or wireless communication means 33-1 to 33-n). For convenience of explanation, the terminals 31-1 to 31-n are assumed to have the same configuration.

【００３３】[0033]

In the configuration example of FIG. 3, the voice input means 1 (and, optionally, the speech section detection unit 3), the feature extraction unit 4, and the speaker recognition unit 7 of FIG. 1 are provided on the terminal side, and speaker recognition management means 10 is provided in the central device 32. The speaker recognition management means 10 has the functions of the speaker recognition information storage unit 5 and the registration unit 6 of FIG. 1, and may additionally be given functions for managing and controlling speaker recognition as a whole. The function of the switching unit 8 of FIG. 1 may be provided on either the terminal side or the central device 32 side; in the following, for convenience, it is assumed to be provided on the terminal side.

【００３４】[0034]

FIG. 4 is a diagram showing a specific example of the speaker recognition system of FIG. 3. For simplicity, only one terminal 31-1 is shown in FIG. 4, but the other terminals 31-2 to 31-n are assumed to have the same configuration as the terminal 31-1. Referring to FIG. 4, the terminal 31-1 is provided with the voice input means 1, the designation means 2, the speech section detection unit 3, the feature extraction unit 4, the speaker recognition unit 7, and the switching unit 8, and with a transmission/reception interface unit 34-1 for exchanging information with the central device 32 through communication means (for example, a telephone line or a wireless link) 33-1. The central device 32 is provided with the speaker recognition information storage unit 5 and the registration unit 6, and with a transmission/reception interface unit 35 for exchanging information with each of the terminals 31-1 to 31-n.

【００３５】[0035]

Each of the terminals 31-1 to 31-n can be, for example, a personal computer (one equipped with audio capture functions such as a microphone and A/D conversion), and the user of each terminal can keep the terminal at home or at the office, for example. More specifically, an existing personal computer (one with PC communication capability) can be used for each of the terminals 31-1 to 31-n. In that case, in the configuration example of FIG. 4, the transmission/reception interface unit 34-1 of the terminal 31-1 is implemented as, for example, a modem built into the personal computer; the voice input means 1 is implemented by the personal computer's microphone; the designation means 2 is implemented by the personal computer's console; and the speech section detection unit 3, the feature extraction unit 4, and the speaker recognition unit 7 are implemented as software installed on the personal computer, for example speech section detection software, feature extraction software, and speaker recognition software.

【００３６】[0036]

In the configuration example of FIG. 4, an exchange, for example, can be used as the transmission/reception interface 35 of the central device 32. The registration unit 6 is implemented as registration software installed on the central device, and memory provided in the central device 32 can be used as the speaker recognition information storage unit 5.

【００３７】[0037]

In the configuration example of FIG. 4, the switching unit 8 provided on the terminal side can be configured as, for example, a switch operated by the user at the terminal.

【００３８】[0038]

Here, the speaker recognition information storage unit 5 stores, as speaker recognition information, standard patterns in association with designation information, for example as shown in FIG. 2. In this case, when speaker recognition is to be performed on the terminal side and an instruction to that effect is transferred from the terminal to the central device 32, the central device 32 reads the speaker recognition information stored in the speaker recognition information storage unit 5 and transmits it to the terminal. The speaker recognition unit 7 on the terminal side can then compare the feature pattern of a speaker's speech with the standard pattern in the speaker recognition information transmitted from the central device 32, obtain the similarity between the speaker's feature pattern and the standard pattern, and perform speaker recognition. More specifically, when speaker recognition by, for example, the speaker verification method is performed on the terminal side, the central device 32 can transmit to the terminal, as speaker recognition information, the standard pattern corresponding to the designation information from the designation means 2 of that terminal.

【００３９】[0039]

In such a speaker recognition system, to register a standard pattern (or to change or update one) and to perform speaker recognition, the user operates a terminal installed at the user's home, office, or the like. The user can thereby carry out the standard pattern registration operation and the speaker recognition operation, in the same manner as described above, with respect to a central device (for example, a speaker recognition device unit) installed at, for example, a bank counter.

【００４０】For example, to register a standard pattern, the user operates the switching unit 8 of his or her own terminal, for example terminal 31-1, so that the feature extraction unit 4 is connected directly to the transmission/reception interface 34-1. When the user then inputs predetermined designation information, for example (user D), from the designation means 2, this designation information is conveyed to the central device 32 via the communication means 33-1. At this time, user D also utters a predetermined specific word. The voice is input through the voice input means 1, output as a voice signal from, for example, the voice section detection unit 3, converted into a feature amount by the feature extraction unit 4, and transmitted to the central device 32 via the communication means 33-1. The registration unit 6 of the central device 32 can then register the transmitted feature amount (feature pattern) signal in the speaker recognition information storage unit 5 as a standard pattern, in association with the transmitted designation information.

【００４１】To perform speaker recognition in this system, the user sets the switching unit 8 of his or her own terminal, for example terminal 31-1, to the speaker recognition unit 7 side. When the user then inputs predetermined designation information, for example (user D), from the designation means 2, this designation information is conveyed to the central device 32 via the communication means 33-1. In response, the standard pattern corresponding to this designation information, for example the standard pattern of user D, is read out from the speaker recognition information storage unit 5 and transmitted from the central device 32 to the terminal as speaker recognition information. User D at the terminal then utters a predetermined specific word. The voice is input through the voice input means 1, output as a voice signal from, for example, the voice section detection unit 3, converted into a feature amount (feature pattern) by the feature extraction unit 4, and supplied to the speaker recognition unit 7.

【００４２】The speaker recognition unit 7 of the terminal then collates the speaker recognition information transmitted from the central device 32 (the standard pattern in the above example) with the feature pattern from the feature extraction unit 4, calculates their similarity, determines whether this similarity is higher or lower than a predetermined reference value (threshold), that is, whether the user is the legitimate user, and transmits the determination result to the central device 32. The central device 32 decides, based on the determination result from the terminal, whether to permit use of the application.

【００４３】That is, when a determination result indicating low similarity is transmitted, the central device 32 judges that the user is not the legitimate speaker D and rejects use by this user. When a determination result indicating high similarity is transmitted, the central device 32 judges that the user is the legitimate speaker D and permits the user to use the application (for example, processing such as deposits, withdrawals, and balance inquiries).
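As an illustration only, the exchange described above can be sketched as follows. The specification does not fix a similarity measure, so cosine similarity over numeric feature vectors is an assumption here, as are the function names and the choice to treat a similarity equal to the threshold as a match.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors (assumed measure, not from the text)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def terminal_verify(feature_pattern, standard_pattern, threshold):
    """Terminal side: compare the patterns and return only the verdict."""
    return cosine_similarity(feature_pattern, standard_pattern) >= threshold


def central_decide(verdict):
    """Central device side: permit or reject use of the application."""
    return "permit" if verdict else "reject"
```

Only the boolean verdict crosses the communication link in this sketch, which matches the description of the determination result being transmitted to the central device.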

【００４４】Thus, in this speaker recognition system, speaker recognition is performed on a terminal (for example, a personal computer) installed at the user's home, office, or the like, and the determination result is transmitted to a central device (for example, a speaker recognition device unit) installed at a bank or the like. Once the central device has confirmed the user's identity based on this determination result, the user can use applications such as deposits, withdrawals, and balance inquiries. In other words, the user can have a terminal at home or at the office perform speaker recognition and use the applications of a bank or the like without visiting the bank each time.

【００４５】In this configuration example, an existing personal computer (one equipped with a PC communication function) can be used as the user-side terminal.

【００４６】Furthermore, in this configuration example, the determination result from the speaker recognition unit 7 is transmitted to the central device 32 via the communication means (for example, a telephone line or radio). The determination result signal is therefore hardly affected even when the quality of the communication means or the communication environment is somewhat poor, so the central device 32 can correctly decide, based on the transmitted determination result, whether to let the user use the application. In addition, the determination result signal carries very little data, so the transmission time can be shortened considerably.

【００４７】Furthermore, because the voice section detection unit 3, the feature extraction unit 4, and the speaker recognition unit 7 are provided on the terminal side in this configuration example, the user can manage the characteristics of the voice section detection unit 3, the feature extraction unit 4, and so on to suit his or her own voice. For example, the user can adjust the sensitivity of voice section detection (the loudness threshold) to match the volume and quality of his or her own voice.

【００４８】In general, the speaker recognition unit 7 has a function of calculating the similarity between the feature pattern and the standard pattern, and a function of determining whether the calculated similarity is higher or lower than the threshold. These functions can be implemented as a single block, or as separate blocks (software), namely a similarity calculation unit and a determination unit.

【００４９】In the latter case, for example, as shown in FIG. 5, a similarity calculation unit 60 can be provided on the terminal 31-1 side and a determination unit 62 on the central device 32 side. The similarity between the feature pattern and the standard pattern calculated by the similarity calculation unit 60 of the terminal is transmitted via the communication means 33 to the determination unit 62 of the central device, which determines the speaker. In this case too, as in the configuration examples of FIGS. 3 and 4, the speaker recognition information used for the similarity calculation at the terminal can be transferred from the central device to the terminal.

【００５０】Alternatively, as shown in FIG. 6, a similarity calculation unit 60 and a determination unit 61 can be provided on the terminal 31-1 side, and a determination unit 62, separate from the determination unit 61 of the terminal, on the central device 32 side. The similarity between the feature pattern and the standard pattern calculated by the similarity calculation unit 60 of the terminal is then supplied, depending on the case, either to the determination unit 61 of the terminal 31-1, so that the speaker is determined on the terminal side, or to the determination unit 62 of the central device 32, so that the speaker is determined on the central device side. That is, in the configuration example of FIG. 6, the similarity calculation unit 60 and the determination unit 61 within the terminal 31-1 constitute a first speaker recognition unit, while the similarity calculation unit 60 of the terminal and the determination unit 62 of the central device constitute a second speaker recognition unit, and it is possible to select whether speaker recognition is performed by the first or the second speaker recognition unit.
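The selection between the two determination units can be pictured with a small sketch. This is a hypothetical illustration: the function names, the boolean `use_central` switch, and the threshold values in the usage note are assumptions, not part of the specification.

```python
def judge(similarity, threshold):
    """Determination step: True when the similarity reaches the threshold."""
    return similarity >= threshold


def recognize(similarity, use_central, local_threshold, central_threshold):
    """Route a terminal-computed similarity to one of two determination units.

    use_central=False models determination unit 61 (terminal side);
    use_central=True models determination unit 62 (central device side).
    Returns which side decided and the decision itself.
    """
    if use_central:
        return ("central", judge(similarity, central_threshold))
    return ("terminal", judge(similarity, local_threshold))
```

For instance, `recognize(0.8, False, 0.7, 0.9)` is decided on the terminal side, while the same similarity passed with `use_central=True` is decided against the central threshold.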

【００５１】In the configuration example of FIG. 6 as well, the speaker recognition information (standard patterns and the like) can be managed collectively by the speaker recognition management means 10 of the central device 32 alone, and transferred from the central device 32 even when processing is performed only within the terminal. Alternatively, speaker recognition information (standard patterns and the like) separate from that held in the central device 32 can also be prepared in the terminal, so that when speaker recognition is performed only within the terminal, the speaker recognition information provided in the terminal is used without having a standard pattern transmitted from the central device.

【００５２】In general, there is a time gap between the feature amount (feature pattern) of the voice that the legitimate user utters for speaker recognition and the feature amount (standard pattern) of the voice that the same user registered in advance. Even for the same speaker, the feature pattern therefore rarely matches the standard pattern exactly and usually differs from it to some degree. Accordingly, the threshold that the speaker recognition unit 7 of the terminal applies to the similarity between the feature pattern and the standard pattern must be set to an appropriate value.

【００５３】However, setting this threshold high improves the accuracy of speaker recognition but also raises the probability that the legitimate user's own voice is judged not to be the legitimate user's, making the system harder for that user to use. Conversely, setting the threshold low raises the probability that the voice of someone other than the legitimate user is mistaken for the legitimate user's voice, increasing the risk that the legitimate user's information is stolen and misused by others. It can therefore be difficult to perform speaker recognition with the threshold always fixed at a single appropriate value.
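The trade-off can be made concrete with a small numerical sketch. The similarity scores below are invented purely for illustration, and the terms false rejection rate and false acceptance rate are standard verification terminology rather than wording from the specification.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold.

    A genuine score below the threshold is a false rejection;
    an impostor score at or above it is a false acceptance.
    """
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far


# Hypothetical similarity scores, not measured data.
genuine = [0.95, 0.90, 0.82, 0.78]
impostor = [0.85, 0.70, 0.60, 0.40]

strict = error_rates(genuine, impostor, 0.9)   # more genuine users rejected
lenient = error_rates(genuine, impostor, 0.5)  # more impostors accepted
```

With these made-up scores, raising the threshold removes false acceptances at the cost of rejecting half of the genuine attempts, and lowering it does the opposite.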

【００５４】The present invention is further intended to provide a speaker recognition system and a speaker recognition method that resolve these conflicting problems. To that end, in the present invention speaker recognition is performed with the determination threshold, that is, the recognition accuracy, set variably depending on the case.

【００５５】More specifically, when the speaker recognition function installed in the terminal (for example, a personal computer) is used only within that terminal (for example, for an application confined to the terminal, such as starting up the terminal), there is little risk of theft by others via a communication line or the like. The determination threshold is therefore set low (the recognition accuracy is lowered) to raise the probability that the legitimate user's voice is judged to be that user's voice.

【００５６】When the speaker recognition function installed in the terminal (for example, a personal computer) is used in association or cooperation with another device (for example, another terminal or the central device), as in external use, there is a risk of theft by others via a communication line or the like. The determination threshold is therefore set high (the recognition accuracy is raised) to prevent the legitimate user's information from being stolen by others.

【００５７】In the configuration examples of FIGS. 3 and 4, this function of the present invention (the function of varying the recognition accuracy) can be realized, for example, by preparing a plurality of thresholds within one terminal, for example terminal 31-1, and having the speaker recognition unit 7 select the most suitable one according to the processing of the terminal (for example, according to whether the terminal performs processing only within itself or exchanges information with an external device).
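A minimal sketch of this selection, assuming just two operating modes and illustrative threshold values (the numbers and names are hypothetical):

```python
# Hypothetical thresholds held in the terminal: lenient for processing
# confined to the terminal, strict when an external device is involved.
THRESHOLDS = {"local": 0.70, "external": 0.90}


def select_threshold(uses_external_device):
    """Pick the threshold that suits the kind of processing."""
    return THRESHOLDS["external" if uses_external_device else "local"]


def is_legitimate(similarity, uses_external_device):
    """Judge a similarity against the threshold selected for this processing."""
    return similarity >= select_threshold(uses_external_device)
```

The same similarity of 0.8 would be accepted for terminal-only processing and rejected when the terminal is working with an external device.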

【００５８】Alternatively, in the configuration examples of FIGS. 3 and 4, instead of preparing all the thresholds within the terminal, the threshold for processing confined to the terminal can be prepared in the terminal, while the threshold for processing with an external device can be transmitted, for example from the speaker recognition management means 10 of the central device, when the terminal is connected for communication with the external device (for example, the central device).

【００５９】For example, when speaker recognition is performed at the terminal and the standard pattern for the user of the terminal is transferred from the speaker recognition management means 10 of the central device, the threshold for processing between the terminal and the central device can be transferred together with the standard pattern as accompanying information.

【００６０】In this case, when a speaker, for example user D, inputs from, for example, terminal 31-1 that he or she is (user D), this is conveyed to the speaker recognition management means 10 of the central device via a communication means such as a telephone line or radio. The speaker recognition management means 10 then returns to terminal 31-1 the voice standard pattern corresponding to that speaker together with a similarity threshold as accompanying information, and the speaker recognition unit 7 of terminal 31-1 can use the threshold sent from the management means 10 of the central device 32 when determining whether the speaker is the genuine person. In this way, the speaker recognition management means 10 of the central device 32 can also control the recognition accuracy of terminal 31-1.
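The following sketch illustrates the idea of returning the threshold alongside the standard pattern. The record layout, the stored values, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionInfo:
    """Speaker recognition information returned to the terminal."""
    standard_pattern: tuple
    threshold: float  # accompanying information chosen by the central device


# Hypothetical contents of the central device's storage.
STORAGE = {"user D": RecognitionInfo((0.2, 0.7, 0.1), 0.9)}


def central_lookup(designation):
    """Central device side: return pattern and threshold for the designation."""
    return STORAGE[designation]


def terminal_judge(similarity, info):
    """Terminal side: judge using the threshold supplied by the central device."""
    return similarity >= info.threshold
```

Because the threshold travels with the pattern, changing the stored value on the central side changes how strictly the terminal judges, without any change at the terminal.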

【００６１】Thus, in the configuration examples of FIGS. 3 and 4, the threshold is changed depending on the case (for example, a lower similarity threshold is used for processing confined to the terminal and a higher one for processing with an external device). As a result, even when the same personal computer is used as the terminal, speaker recognition can be performed with the lenient criterion set in the personal computer when recognition is confined to that computer, and with a strict criterion when it is connected to another device.

【００６２】In the configuration example of FIG. 6, the function of varying the recognition accuracy can be realized, for example, by preparing within the terminal the threshold used by the determination unit 61 of the terminal, and preparing within the central device 32 the threshold used by the determination unit 62 of the central device 32.

【００６３】That is, in the configuration of FIG. 6, when the user of the terminal wishes to perform speaker recognition for an application confined to the terminal, the user instructs the terminal to supply the similarity calculated by the similarity calculation unit 60 to the determination unit 61 within the terminal for determination.

【００６４】In this case, when the user of the terminal utters a voice for speaker recognition, the similarity calculation unit 60 of the terminal calculates the similarity between the feature amount (feature pattern) of this voice and a standard pattern (for example, a standard pattern prepared in advance within the terminal), and supplies this similarity to the determination unit 61 of the speaker recognition unit 7 of the terminal. The determination unit 61 can perform speaker recognition by determining whether the similarity is higher or lower than a predetermined threshold prepared in advance within the terminal.

【００６５】On the other hand, when the user of the terminal wishes to perform speaker recognition in order to use an application of the central device 32, the terminal gives a predetermined instruction to the central device 32 (speaker recognition management means 10). In accordance with this instruction, the central device 32 sends, for example, a standard pattern to the terminal. When the user of the terminal then utters a voice for speaker recognition, the similarity calculation unit 60 of the terminal calculates the similarity between the feature amount (feature pattern) of this voice and the standard pattern (for example, the standard pattern transmitted from the central device). This similarity is transmitted to the central device 32 and supplied to its determination unit 62, which determines whether the speaker is the genuine person according to whether the similarity is higher or lower than a predetermined threshold prepared in advance in the central device 32.

【００６６】Thus, in the configuration example of FIG. 6, speaker recognition (determination) can be performed by the determination unit 61 in the terminal when an application within the terminal is used, and by the determination unit 62 of the central device when an application of the central device is used. Because the determination criterion (threshold) of the determination unit 61 of the terminal and that of the determination unit 62 of the central device 32 can be set independently at the terminal and at the central device 32 (for example, a lenient criterion (low threshold) for the determination unit 61 and a strict criterion (high threshold) for the determination unit 62), the recognition accuracy can be made to differ depending on the case. That is, as in the configuration examples of FIGS. 3 and 4, even when the same personal computer is used as the terminal, speaker recognition can be performed with the lenient criterion set in the personal computer when recognition is confined to that computer, and with a strict criterion when it is connected to another device.

【００６７】Thus, according to the present invention, even when the same personal computer is used as the terminal, speaker recognition can be performed with the lenient criterion set in the personal computer when recognition is confined to that computer, and with a strict criterion when it is connected to another device. In other words, speaker recognition can be realized in which the recognition accuracy differs between recognition confined to the personal computer and recognition performed while connected to another device, even on the same personal computer.

【００６８】In each of the above configuration examples, the threshold may be stored together with the speaker recognition information. For example, in the configuration examples of FIGS. 3 and 4, the speaker recognition information storage unit 5 of the central device 32 may store, as shown in FIG. 7, various kinds of accompanying information such as a similarity threshold in addition to the standard patterns. In that case, for example, when the speaker recognition unit 7 on the terminal side collates the feature pattern of a speaker's voice with the standard pattern read out from the speaker recognition information storage unit 5 of the central device 32 and transmitted, and obtains the similarity between the two, it can determine whether the speaker is the genuine person according to whether this similarity is higher or lower than the accompanying information, that is, the threshold, read out from the storage unit 5 and transmitted together with the standard pattern.

【００６９】In the configuration examples of FIGS. 3 to 5, the thresholds stored as accompanying information in, for example, the speaker recognition information storage unit 5 as shown in FIG. 7 can be set to a different value for each registered standard pattern, as in FIG. 7, or to the same (constant) value for all registered standard patterns. When the same (constant) threshold is used for all registered standard patterns, a single threshold can be stored in the speaker recognition information storage unit 5 (one memory) and used in common. Similarly, in the configuration example of FIG. 6, each of the threshold set on the terminal side and the threshold set on the central device side can be set to a different value for each registered standard pattern or to the same (constant) value for all registered standard patterns.

【００７０】In each of the above configuration examples, the user can also use terminals other than his or her own, but this in turn increases the possibility that someone else uses the system without the user's knowledge. In speaker recognition, what matters most for determining whether the speaker is the genuine person is the speaker recognition information (in particular the voice standard pattern). If it is rewritten maliciously, the genuine person may no longer be able to use the system, or the person's information may be misused by others.

【００７１】The speaker recognition system can therefore be configured so that the speaker recognition information is changed or modified only by information from a designated terminal. For example, the user's home terminal can be designated so that the voice standard pattern can be rewritten or updated only from that home terminal. This prevents others from changing or modifying (rewriting) the speaker recognition information.
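One way to picture this restriction is a store that remembers which terminal is designated for each user and refuses updates from any other. How a terminal is identified is not stated in the text, so the `terminal_id` string and the class below are hypothetical.

```python
class PatternStore:
    """Standard patterns that only a designated terminal may rewrite."""

    def __init__(self):
        self._patterns = {}
        self._designated_terminal = {}

    def register(self, user, terminal_id, pattern):
        """Register a pattern and designate the terminal allowed to update it."""
        self._patterns[user] = pattern
        self._designated_terminal[user] = terminal_id

    def update(self, user, terminal_id, new_pattern):
        """Rewrite the pattern only when the request comes from the designated terminal."""
        designated = self._designated_terminal.get(user)
        if designated is None or designated != terminal_id:
            return False
        self._patterns[user] = new_pattern
        return True

    def get(self, user):
        return self._patterns[user]
```

An update attempt from any terminal other than the designated one leaves the stored pattern unchanged and returns `False`.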

【００７２】In each of the above configuration examples, the speaker recognition information in the speaker recognition information storage unit 5 can also be managed with flags. For example, a flag of "0" or "1" can be set on each item of speaker recognition information (standard pattern) to distinguish items currently in use from items not in use. With this flag management, one person's speaker recognition information (standard pattern) can be supplied to only one place at a time, so that information in use cannot be used by anyone else. This prevents others from using the information while the genuine person is using it. Moreover, if the genuine person tries to use it while someone else is using it, the genuine person learns that someone is using his or her voice standard pattern and can take countermeasures promptly.
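The flag management can be sketched as a simple check-out and release scheme. The class and method names are assumptions, and the sketch is single-threaded; a real system would need to make the flag test and update atomic.

```python
class FlaggedStore:
    """Standard patterns guarded by an in-use flag (0 = free, 1 = in use)."""

    def __init__(self, patterns):
        self._patterns = dict(patterns)
        self._in_use = {user: 0 for user in self._patterns}

    def checkout(self, user):
        """Supply the pattern to one place only; return None if already in use."""
        if self._in_use[user] == 1:
            return None
        self._in_use[user] = 1
        return self._patterns[user]

    def release(self, user):
        """Clear the flag once recognition has finished."""
        self._in_use[user] = 0
```

A refused check-out is the signal, described above, that someone else is currently using the same standard pattern.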

【００７３】In the configuration examples of FIGS. 5 and 6, for example, providing the voice section detection unit 3, the feature extraction unit 4, and the similarity calculation unit 60 on the terminal side has the advantage that the user can manage the characteristics of the voice section detection unit 3 and the feature extraction unit 4 to suit his or her own voice. On the other hand, processing such as voice section detection, feature extraction, and similarity calculation and the speaker determination (threshold determination) processing are, or may be, distributed between the terminal side and the central device side. Consequently, when a so-called misrecognition occurs, in which a legitimate user is not accepted as legitimate or another person is recognized as the legitimate user, the administrator on the central device side may be unable to manage its cause from the central device side alone. For example, it may be impossible to tell from the central device side alone whether the cause lies in the terminal's voice section detection, feature extraction, or similarity calculation, or in the central device's speaker determination (threshold).

【００７４】One cause of erroneous recognition is, for example, that the beginning or end of a word uttered by the speaker is weak, so that voice section detection does not work properly and that part is missing from the feature pattern. The speaker may not notice this, and repeating the utterance any number of times gives the same result. In that case, correct recognition cannot be obtained no matter how many times the speaker tries again.

【００７５】To avoid such problems, in each of the above configuration examples, the central device 32 (for example, the speaker recognition management means 10) can provide predetermined information to a terminal, for example terminal 31-1. The voice section detection unit 3, the feature extraction unit 4, and so on of terminal 31-1 can then perform voice section detection, feature amount conversion, and the like based on the information provided from the central device 32.

【００７６】For example, the central device 32 side can provide sensitivity instruction information for voice section detection as this information.

【００７７】In this case, on the terminal side, when the user first utters a voice and its voice section is detected, data of, for example, about 0.5 seconds is added before and after the span determined to be the voice section, and the voice signal within the voice section including this added data is saved as it is to a file (not shown) on the terminal side. After the user's voice signal has been saved to the file on the terminal side in this way, it is converted into a feature amount (feature pattern), the similarity calculation unit 60 calculates the similarity, and the determination unit 62 of the central device determines the speaker according to whether the similarity is higher or lower than the threshold. If, as a result, the speaker is determined not to be the user, the central device 32 side, for example, raises the voice section detection sensitivity (a voice level threshold or the like) stored as accompanying information in the speaker recognition information storage unit 5, provides (transmits) it to the terminal side as information, and has the terminal side perform voice section detection again on the voice signal saved in the file. The central device 32 side also lowers the voice section detection sensitivity stored as accompanying information in the speaker recognition information storage unit 5, provides (transmits) it to the terminal side as information, and has the terminal side perform voice section detection again on the voice signal saved in the file.
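
The padded voice section detection described in this paragraph might look like the sketch below. It assumes a simple amplitude threshold as the detector, which the specification does not prescribe; the function name `detect_voice_section`, the parameter names, and the toy signal are all hypothetical. In this sketch a higher sensitivity is modeled as a lower `level_threshold`.

```python
from typing import List, Optional, Tuple


def detect_voice_section(
    samples: List[float],
    sample_rate: int,
    level_threshold: float,
    padding_sec: float = 0.5,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the padded voice section, or None.

    A sample counts as voice when its absolute level reaches level_threshold
    (an assumed, deliberately simple detector). About padding_sec of signal is
    kept on each side so that a later re-detection with a different
    sensitivity still has the surrounding data available.
    """
    active = [i for i, s in enumerate(samples) if abs(s) >= level_threshold]
    if not active:
        return None
    pad = int(padding_sec * sample_rate)
    start = max(0, active[0] - pad)
    end = min(len(samples), active[-1] + 1 + pad)  # end index is exclusive
    return start, end


# Toy signal at 10 samples per second: silence, a weak onset, a loud core,
# a weak ending, then silence again.
signal = [0.0] * 10 + [0.2] + [0.9] * 5 + [0.2] + [0.0] * 10
strict = detect_voice_section(signal, sample_rate=10, level_threshold=0.5)
sensitive = detect_voice_section(signal, sample_rate=10, level_threshold=0.1)
stored = signal[strict[0]:strict[1]]  # what the terminal would save to its file
```

With the stricter threshold the weak onset and ending fall outside the detected core, but the padding keeps them in `stored`, so a second pass at higher sensitivity can still recover them.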

【００７８】In this way, speaker recognition is performed both with the voice section detection sensitivity raised and with it lowered, and if the speaker is determined to be the correct speaker in either case, the user can be recognized as the correct speaker. Raising the sensitivity of voice section detection remedies the situation in which the voice section is not detected correctly (detection omissions occur), for example when the speaker's voice is quiet. Lowering the sensitivity remedies the situation in which the voice section detection unit detects a voice section longer than the actual one, for example when noise occurs before or after the speaker's voice.
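
The retry logic of the two preceding paragraphs can be outlined as follows. This is a hedged sketch: `recognize` is a hypothetical callback standing in for the whole chain of voice section detection, feature conversion, similarity calculation, and threshold determination, and the numeric sensitivity scale and `step` size are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence


def recognize_with_retries(
    stored_signal: Sequence[float],
    base_sensitivity: float,
    step: float,
    recognize: Callable[[Sequence[float], float], bool],
    tried: List[float],
) -> bool:
    """Accept the speaker if any attempt on the stored signal succeeds.

    The first attempt uses the base sensitivity. On failure the stored signal
    is re-processed once with the sensitivity raised and once with it lowered.
    Every sensitivity that was actually tried is appended to `tried`.
    """
    for sensitivity in (base_sensitivity, base_sensitivity + step, base_sensitivity - step):
        tried.append(sensitivity)
        if recognize(stored_signal, sensitivity):
            return True
    return False


def quiet_speaker(signal: Sequence[float], sensitivity: float) -> bool:
    """Toy recognizer: succeeds only once the detector is sensitive enough."""
    return sensitivity > 0.6


def impostor(signal: Sequence[float], sensitivity: float) -> bool:
    """Toy recognizer: never succeeds, whatever the sensitivity."""
    return False


attempts_ok: List[float] = []
attempts_fail: List[float] = []
accepted = recognize_with_retries([0.1, 0.4, 0.1], 0.5, 0.2, quiet_speaker, attempts_ok)
rejected = recognize_with_retries([0.1, 0.4, 0.1], 0.5, 0.2, impostor, attempts_fail)
```

Because the signal is re-read from the terminal's file, the user does not have to speak again for the second and third attempts.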

【００７９】In the above example, the sensitivity of voice section detection was taken as an example of the information provided from the central device 32 side to the terminal side, but the sampling frequency used for feature amount conversion can also be provided to the terminal side as information. In that case, the terminal side can change the sampling frequency for feature amount conversion according to the information from the central device 32. Various other kinds of information can also be provided from the central device 32 side to the terminal side.

【００８０】By giving predetermined information from the central device to the terminal in this way, voice section detection, feature extraction, speaker recognition, and the like can be managed and controlled from the central device side as needed.

【００８１】The above description of each configuration example dealt with the case of performing speaker recognition, but new registration, modification, and updating of a standard pattern can be performed from the terminal side in the same way. If the central device side has a function that automatically updates the standard pattern using the data used for speaker recognition, the standard pattern can be updated automatically on the central device side without any operation from the terminal side.

【００８２】In each of the above configuration examples, the similarity between the feature pattern and the standard pattern is treated as the degree to which they resemble each other, but it can also be treated as the degree to which they differ. When the similarity is treated as a degree of difference, the determination is reversed relative to the case where it is treated as a degree of resemblance: when the similarity (degree of difference) is higher than the threshold, the speaker is determined not to be the genuine speaker, and when it is lower than the threshold, the speaker is determined to be the genuine speaker.
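
The reversal of the determination can be written as a single comparison whose direction depends on how the score is interpreted. The sketch below is illustrative only: the function name and the `score_is_difference` parameter are hypothetical, and treating a score exactly equal to the threshold as a rejection is an assumption, since the text does not say how equality is handled.

```python
def is_genuine_speaker(score: float, threshold: float, score_is_difference: bool = False) -> bool:
    """Threshold determination for either interpretation of the score.

    score_is_difference=False: the score measures resemblance, so higher is better.
    score_is_difference=True:  the score measures difference, so lower is better.
    A score equal to the threshold is rejected in both cases (an assumption).
    """
    if score_is_difference:
        return score < threshold
    return score > threshold


as_resemblance = is_genuine_speaker(0.9, 0.8)                           # high resemblance
as_difference = is_genuine_speaker(0.9, 0.8, score_is_difference=True)  # large difference
small_difference = is_genuine_speaker(0.3, 0.8, score_is_difference=True)
```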

【００８３】In the above example, the similarity threshold is transferred together with the standard pattern when the standard pattern is transferred from the central device, but the two can also be transferred separately (at different times).

【００８４】Each of the above configuration examples was described for the case where information is exchanged between one terminal and the central device 32. However, when, for example, one user has two terminals, the standard pattern registration function may be given to, for example, terminal 31-1 and the speaker recognition function to, for example, terminal 31-2. Information such as the standard pattern is then registered at terminal 31-1, and information such as the standard pattern in use is sent from terminal 31-1 to terminal 31-2 and used for recognition at terminal 31-2.

【００８５】In each of the above configuration examples, the feature extraction unit 4 is provided after the voice section detection unit 3, but the feature extraction unit 4 can instead be provided before the voice section detection unit 3 if necessary.

【００８６】[0086]

[Effects of the Invention] As described above, according to the inventions of claims 1 to 7, speaker recognition can be performed using a terminal installed, for example, at the user's home or office, and applications such as deposits, withdrawals, and balance inquiries at a bank can be used. In addition, even when the same personal computer is used as the terminal, speaker recognition can be carried out with a lenient determination set within the personal computer when recognition is performed only within that personal computer, and with a strict determination when it is connected to another device.
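
The lenient versus strict determination mentioned above amounts to choosing the threshold according to whether the terminal is working alone or exchanging information with an external device. The following is a minimal sketch under that reading. The two threshold values and all names are hypothetical, and the specification gives no concrete numbers.

```python
LOCAL_THRESHOLD = 0.60     # hypothetical lenient value for processing inside the terminal only
EXTERNAL_THRESHOLD = 0.85  # hypothetical strict value when connected to another device


def select_threshold(connected_to_external_device: bool) -> float:
    """Pick the similarity threshold according to the terminal's operating mode."""
    return EXTERNAL_THRESHOLD if connected_to_external_device else LOCAL_THRESHOLD


def determine_speaker(similarity: float, connected_to_external_device: bool) -> bool:
    """Accept the speaker when the similarity exceeds the mode-dependent threshold."""
    return similarity > select_threshold(connected_to_external_device)


local_result = determine_speaker(0.7, connected_to_external_device=False)
remote_result = determine_speaker(0.7, connected_to_external_device=True)
```

The same similarity of 0.7 is accepted for local use and rejected once an external device is involved, which is the behavior the paragraph describes.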

[Brief description of drawings]

FIG. 1 is a diagram showing a configuration example of a general speaker recognition system.

FIG. 2 is a diagram showing a configuration example of a speaker recognition information storage unit.

FIG. 3 is a diagram showing a configuration example of a speaker recognition system according to the present invention.

FIG. 4 is a diagram showing a specific example of the speaker recognition system of FIG. 3.

FIG. 5 is a diagram showing another configuration example of the speaker recognition system according to the present invention.

FIG. 6 is a diagram showing another configuration example of the speaker recognition system according to the present invention.

FIG. 7 is a diagram showing a configuration example of a speaker recognition information storage unit.

[Explanation of symbols]

1 Voice input means
2 Instruction means
3 Voice section detection unit
4 Feature extraction unit
5 Speaker recognition information storage unit
6 Registration unit
7 Speaker recognition unit
8 Switching unit
10 Speaker recognition management means
31 Terminal
32 Central device
33 Communication means
60 Similarity calculation unit
61 Determination unit
62 Determination unit

Claims

(57) [Claims]

1. A speaker recognition system in which at least one terminal and a central device are provided so as to be able to transmit and receive information, wherein the terminal is provided with voice input means for inputting a speaker's voice and converting it into a voice signal, feature extraction means for extracting a feature amount of the voice signal, and similarity calculation means for calculating the similarity between the feature amount of the speaker's voice and a voice feature amount serving as speaker recognition information; the central device is provided with speaker recognition management means for managing the speaker recognition information and determination means for determining the speaker based on the similarity from the similarity calculation means of the terminal; the speaker recognition information used for the similarity calculation in the terminal is transferred from the central device to the terminal; and the similarity calculated by the similarity calculation means of the terminal is transferred from the terminal to the central device.

2. A speaker recognition system in which at least one terminal and a central device are provided so as to be able to transmit and receive information, wherein the terminal is provided with voice input means for inputting a speaker's voice and converting it into a voice signal, feature extraction means for extracting a feature amount of the voice signal, similarity calculation means for calculating the similarity between the feature amount of the speaker's voice and a voice feature amount serving as speaker recognition information, and first determination means for determining the speaker based on the similarity from the similarity calculation means; the central device is provided with second determination means for determining the speaker based on the similarity from the similarity calculation means of the terminal; and the similarity calculated by the similarity calculation means of the terminal is given to the first determination means of the terminal or is transferred from the terminal to the second determination means of the central device.

3. The speaker recognition system according to claim 1 or 2, wherein predetermined information is transferred from the central device to the terminal, and the feature extraction means provided in the terminal converts the input voice into a feature amount based on the information provided from the central device.

4. The speaker recognition system according to claim 1, wherein changes or corrections to the speaker recognition information managed by the speaker recognition management means are made based on information from a predetermined terminal.

5. The speaker recognition system according to claim 1, wherein one piece of speaker recognition information from the central device can be supplied to only one terminal at a time.

6. A speaker recognition method in which at least one terminal and a central device are provided so as to be able to transmit and receive information, and in which, when a speaker's voice is input, the terminal extracts a feature amount of the voice signal and calculates the similarity between the feature amount of the speaker's voice and a voice feature amount serving as speaker recognition information, wherein the speaker recognition information used for the similarity calculation in the terminal is transferred from the central device to the terminal, the similarity calculated in the terminal is transferred from the terminal to the central device, and the central device determines the speaker based on the transferred similarity.

7. A speaker recognition method in which at least one terminal and a central device are provided so as to be able to transmit and receive information, and in which, when a speaker's voice is input, the terminal extracts a feature amount of the voice signal and calculates the similarity between the feature amount of the speaker's voice and a voice feature amount serving as speaker recognition information, wherein the similarity calculated in the terminal is used for determination of the speaker in the terminal, or is transferred from the terminal to the central device and used for determination of the speaker in the central device, and the threshold for determining the speaker is changed according to whether the terminal performs processing only within the terminal or transmits and receives information to and from an external device.