JP4440414B2

JP4440414B2 - Speaker verification apparatus and method

Info

Publication number: JP4440414B2
Application number: JP2000081328A
Authority: JP
Inventors: 将治原田; 昭二早川; 晃鈴木
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-03-23
Filing date: 2000-03-23
Publication date: 2010-03-24
Anticipated expiration: 2020-03-23
Also published as: JP2001265387A

Description

【０００１】
【発明の属する技術分野】
本発明は、事前に登録してある音声データの特徴量に基づいて、利用者本人か否かを音声によって判定する話者照合装置又は方法に関する。
【０００２】
【従来の技術】
近年、コンピュータ技術の発展とともに、急速に通信環境についても整備されつつある。このような通信環境の整備に伴い、電話によるコンピュータアクセス（Computer Telephony Integration）が一般の家庭においても普通に行うことが可能になってきた。
【０００３】
かかる電話によるコンピュータアクセス分野においては、プライバシーに関する情報や秘密保持義務を有する情報等に代表される、本人や特定の個人以外に知らせてはならない情報に対するアクセスを行う場合に問題がある。すなわち、例えばプッシュホンを用いる場合においては、パスワードを電話のボタン操作によって入力することで当該情報へのアクセス権限を取得することが可能であるが、パスワードを他人に知られてしまうと、本人でないにもかかわらず、当該情報に容易にアクセスできてしまうという問題である。そのため、本人に固有である音声を用いて、本人あるいは特定の個人であるか否かについて照合を行うことの必要性が高まってきている。
【０００４】
【発明が解決しようとする課題】
しかし、音声合成技術についても近年急速な進歩を遂げており、かかる技術を駆使することによって、話者の個人性をも表現することも不可能ではなくなっている。
【０００５】
すなわち、従来の技術においては、話者照合のための入力として人間の肉声を想定しており、特定の人間の声を正確に音声合成するためには、当該人間の音声波形データ等を大量に収録して編集しなければならないことから、実現性に乏しかった。
【０００６】
しかしながら、昨今では本人の音声を少しだけ収録することで個人性を反映させた音声合成を実現することができるようになっており、容易に他人の声を真似ることが可能となってきている。
【０００７】
このような話者の個人性をも表現できる合成音声装置を用いることで、第三者が特定の個人になりすますことができ、話者照合システム自体が悪用されるおそれがあるという大きな問題点が生じている。
【０００８】
本発明は、上記問題点を解決すべく、話者の個人性をも表現している合成音声を用いた場合であっても、話者照合を的確におこなうことができる話者照合装置及び方法を提供することを目的とする。
【０００９】
【課題を解決するための手段】
上記目的を達成するために本発明にかかる話者照合装置は、音声入力する話者の音声が、予め登録された登録話者の音声と一致するか否かを判定する話者照合装置であって、入力する発声内容について話者に指示を与える話者入力指示部と、話者の音声を一又は二以上入力する音声入力部と、音声入力部で入力された音声を分析する音声分析部と、入力された同一の発声内容である二以上の音声について、相互間の類似度を算出する入力音声類似度算出部とを含み、算出された類似度が完全一致に近い所定レベル以上の類似度である場合、類似度が一致するという情報も用いて話者を照合することを特徴とする。
【００１０】
かかる構成により、本人であるか否かの判断と共に、人工的に生成された合成音声については人間の音声が本来有するべき揺らぎが全くないものとして識別することができるようになることから、本人と全く関係のない第三者が音声合成装置等を用いて本人になりすます行為を未然に防止することが可能となる。
【００１１】
また、本発明にかかる話者照合装置は、類似度の判断を、登録話者モデルに対する照合過程が同一か否かに基づいて行うことが好ましい。人間の音声においては、発声の長さやスペクトルが発声の都度相違するために照合過程が一致することがあり得ないことから、照合過程を比較することで、本人と全く関係のない第三者が音声合成装置等を用いて本人になりすます行為を未然に防止することが可能となる。
【００１２】
また、本発明にかかる話者照合装置は、音声入力部で少なくとも二以上の音声が入力された場合であって、少なくとも１つの音声について変換処理が行われている場合には、音声入力部で入力された少なくとも二以上の音声のうち、変換処理が行われていない音声について信号処理を施し、あるいは入力された少なくとも二以上の音声について正規化処理を施すことが好ましい。複数回音声を入力する場合に、二回目以降の入力音声に何らかのフィルタ等を掛けて変換処理を行うことで、音声入力における自然な揺らぎを人工的に生成し、合成音声でないと認識させる行為についても未然に防止するためである。
【００１３】
また、本発明にかかる話者照合装置は、類似度が一致すると判断された場合には、本人の音声入力ではないものと判断して入力を棄却することが好ましい。繰り返し発声された音声データがほぼ完全に一致した場合には録音物等の疑いがあるものとして、本人であるとは判断しないようにするためである。
【００１４】
また、本発明は、上記のような話者照合装置の機能をコンピュータの処理ステップとして実行するソフトウェアを特徴とするものであり、具体的には、音声入力する話者の音声が、予め登録された登録話者の音声と一致するか否かを判定する話者照合方法であって、入力する発声内容について話者に指示を与える工程と、話者の音声を一又は二以上入力する工程と、入力された音声を分析する工程と、入力された同一の発声内容である二以上の音声について、相互間の類似度を算出する工程とを含み、算出された類似度が完全一致に近い所定レベル以上の類似度である場合、類似度が一致するという情報も用いて話者を照合する話者照合方法並びにそのような工程をプログラムとして記録したコンピュータ読み取り可能な記録媒体であることを特徴とする。
【００１５】
かかる構成により、コンピュータ上へ当該プログラムをロードさせ実行することで、本人であるか否かを判断できると共に、人工的に生成された合成音声については人間の音声が本来有するべき揺らぎが全くないものとして識別することができるようになることから、本人と全く関係のない第三者が音声合成装置等を用いて本人になりすます行為を未然に防止することができる話者照合装置を実現することが可能となる。
【００１６】
【発明の実施の形態】
（実施の形態１）
以下、本発明の実施の形態１にかかる話者照合装置について、図面を参照しながら説明する。図１は本発明の実施の形態１にかかる話者照合装置の構成図である。
【００１７】
図１において、１は個人ＩＤ入力部を示し、話者照合時に個人ＩＤを入力するものである。２は個人別音声情報登録部を示し、個人ＩＤごとに音声情報を事前にデータベース化しておくものである。ここでは、音声波形データのみならず、音声データを解析した特徴量についても事前に登録しておく。
【００１８】
次に、３は音声入力指示部を示し、話者照合時に利用者が入力すべき音声について指示を出すものである。４は音声入力部を示し、マイク等の入力媒体を通じて、利用者が実際に発声して音声データを入力するものである。
【００１９】
音声入力部４では、音声入力指示部３の指示に従って、音声を入力することになる。この場合、同じ発声内容を含む比較的長い音声を一回だけ入力するものであっても良いし、同じ発声内容を二回以上繰り返すものであっても良い。また、同じ発声内容を含んでいる異なる発声内容を入力するものであっても良い。例えば、「前川さん」と「早川さん」と発声させることで、「かわさん」の部分が同一発声内容となることで、比較を行うことが可能となる。かかる入力方法では、利用者が同一音声の照合を行っていると気づきにくく、比較的精度良く照合を行うことが期待できる。
【００２０】
したがって、例えば図２に示すように入力音声格納部２１を設けることで、前回に入力していた音声データに基づいて発声内容の照合を行うことも考えられる。人間で有れば、時と場所を変えて入力した場合に音声の揺らぎが生じることが自然であることから、揺らぎのほとんど見られない入力について合成音声あるいは録音音声等であるものと判断できるからである。
【００２１】
また、５は音声分析部を示し、入力された音声データを分析して、その音声波形データの物理的な特徴量を求めるものである。求まった特徴量に基づいて、登録音声類似度算出部６では個人別音声情報登録部２に登録されている音声データの特徴量と入力音声の音声データの特徴量との第１の類似度を算出し、入力音声類似度算出部７では同一内容の入力音声について音声データの特徴量の第２の類似度を算出する。
【００２２】
一般に、従来の話者照合においては、音声の特徴量等に基づいて入力音声と登録音声との第１の類似度を算出することのみで類否判断を行っている。しかし、人間が発声する場合には、その時々の状態や環境に応じて音声に揺らぎが生じ、全く同一の音声として発声することは不可能であることから、一定の許容範囲を定めて、第１の類似度が当該範囲内であれば同一人であるものと判断する等の方法を採用している。
【００２３】
したがって、音声合成装置等を用いて、第１の類似度がかかる許容範囲内となるように調整した合成音声を生成することで、第三者が容易に本人になりすますことが可能となる。
【００２４】
一方、音声合成装置等で人工的に生成された合成音声については、揺らぎが生じることが無く、何度入力しても同一の音声を入力することができる。したがって、従来の話者照合に加えて、複数回同一の音声を入力しても入力音声間の類似度である第２の類似度が毎回同じ値として算出されるものについても、人間の音声ではなく合成音声のような不自然な音声であると判断することができる。
【００２５】
具体的に、類似度の判断基準を音声データ間の照合距離とした場合について、図３を用いて説明する。図３は、音声データ間の照合距離の頻度分布を示すものであり、照合距離が短いほど類似度が高いと判断するものである。
【００２６】
図３において、領域Ａは個人別音声情報登録部２に登録されている音声データの特徴量と入力された本人の音声データの特徴量との距離の分布を示す領域である。領域Ｂは同一内容の音声部分における音声データの特徴量に関する照合距離の分布、例えば一回目と二回目の入力音声間における照合距離の分布を示している。領域Ｃは個人別音声情報登録部２に登録されている音声データの特徴量と詐称者の入力音声の音声データの特徴量との間の照合距離の分布を示している。
【００２７】
すなわち、領域Ａ及び領域Ｃは、個人別音声情報登録部２に登録されている音声データとの照合距離の分布であるのに対し、領域Ｂは入力された音声データ間の照合距離の分布である点で大きく相違する。
【００２８】
まず従来の方法においては、領域Ａ及び領域Ｃの間で入力された音声が本人の音声であるか否かについて判断していた。すなわち、入力された音声データの照合距離が所定のしきい値であるしきい値Ｉよりも小さい場合には、入力された音声の類似度が高いものと判断して入力音声が本人の音声であるものと判断する。
【００２９】
一方、領域Ａ及び領域Ｂの間では、入力された音声が自然音声であるか合成音声で有るかを判断することになる。すなわち、入力された音声データ間の照合距離が所定のしきい値であるしきい値IIよりも小さい場合には、入力された音声に人間本来の自然な揺らぎがないものと判断して、入力音声が合成音声や録音音声等の不自然な音声であるものと判断する。
【００３０】
次に、登録音声類似度算出部６及び入力音声類似度算出部７における類似度の算出方法について説明する。まず、特定の個人ＩＤに対応する音声データの特徴量と入力された音声データの特徴量が類似しているものと判断するためのしきい値としては、従来から固定した一定の値が用いられることが多い。例えば、図４に示すように、入力された音声と事前に登録されている音声との間で照合距離を計算し、あらかじめ設定したしきい値と比較して、当該しきい値よりも照合距離が同じ若しくは短い場合（図４の“−”）には本人であると、長い場合（図４の“＋”）には他人であると判断するものである。
【００３１】
かかるしきい値の設定には、以下に示すような方法を用いることが多い。図５は、類似度判断の指標として照合距離を用いた場合において、照合距離を横軸として、本人ではないと棄却する判断が誤りであった場合の確率である本人拒否率ＦＲＲ（False Rejection error Rate）を縦軸にとったものである。一方、同じく照合距離を横軸として、詐称者であるとする判断が誤りであった場合の確率である他人受入率ＦＡＲ（False Acceptance error Rate）も縦軸にとる。
【００３２】
しきい値を小さな値にすると、詐称者を誤って受理してしまう率ＦＡＲは減るが、本人を誤って棄却してしまう率ＦＲＲが高くなる。逆にしきい値を大きな値とすると、本人を誤って棄却してしまう率ＦＲＲは小さくなるが、詐称者を誤って受理してしまう率ＦＡＲは大きくなる。よって、かかる２つの誤り率の重要度に応じて、しきい値を適切な値に設定するのが望ましい。
【００３３】
実験的には事後的にかかる２つの誤り率が等しくなる値をしきい値として評価するのが一般的である。本実施の形態１においては、図３におけるしきい値Ｉとしては、人間の音声による実験値から、しきい値IIとしては音声合成装置により生成された合成音声による実験値から、それぞれＦＲＲとＦＡＲが一致する値をしきい値としている。すなわち、所定のしきい値を定めた場合において、本人同士の音声間距離と本人・他人間の音声間距離の頻度分布曲線（図３）のうち、定めたしきい値からはみ出た部分の面積がＦＲＲ、ＦＡＲを示すことになる。
【００３４】
また、入力音声が合成音声等であるか否かを判定するための照合距離の算出方法についても、同様に様々な方法が考えられる。本実施の形態１においては、音声データの特徴量をｎ次元の特徴パラメータとし、ｎ次空間内における空間内距離として当該照合距離を求めている。ただし、特にこの方法に限定されるものではなく、当該照合距離の算出方法として、ＤＰマッチングを用いることも考えられる。ここで、ＤＰとは動的計画法（Dynamic Programming）を意味している。
【００３５】
例えば図６は、同時期に同一に発声された内容に含まれる単語発声に対する同一話者内の距離の頻度分布をＤＰマッチングを用いて算出したものである。かかる方法によっても判断の対象となる距離分布を求めることが可能である。
【００３６】
図７は、ＤＰマッチングを用いた場合におけるＤＰパスの例示図である。ここで、ＤＰパスとは時間対応付けを行った場合における最小値を選択することを意味する。なお、図７の横軸は同一音声に関する１回目の音声入力に基づいた音声データの特徴パラメータ系列を、縦軸には同一音声に関する２回目の音声入力に基づいた音声データの特徴パラメータ系列を、それぞれ示し、ｉ、ｊはそれぞれフレーム数を示している。
【００３７】
同一の発声部分に関する一回目と二回目の発声について、ＤＰマッチング等を用いて時間対応付け（時間正規化）を行い、時間正規化後の距離を用いて判断する。その距離が極端に小さい場合や極端に大きい場合については、不自然な発声であるものとして棄却する。かかる判断には、ＤＰパスの結果を用いるとより容易に判断することができる。
【００３８】
すなわち図７において、人間の自然な発声の場合には、７１に示すように一回目と二回目の発声において局所的なＤＰパスの揺れが生じ、特徴パラメータが完全に一致するということはあり得ない。しかし、音声合成装置等によって人工的に生成された合成音声等の場合には何度入力してもその特徴パラメータは一致していることから、７２に示すように一回目と二回目の特徴パラメータは完全に一致する。かかる不自然な発声を検出することで本人になりすますことを防止することが可能となる。
【００３９】
そこで、合成音声等であるか否かの照合方法として、登録話者モデルに対する照合過程が同一か否かを調べることも考えられる。図８は照合過程の同一性判断を適用した本発明の一実施例にかかる話者照合装置の構成図である。図８では、類似度算出過程比較部８１を入力音声類似度算出部７の代わりに設けている点に特徴を有する。
【００４０】
類似度算出過程比較部８１では、例えばＤＰマッチングを用いたので有ればＤＰパスを、ビタービアルゴリズム（Viterbi algorithm）を用いたＨＭＭ（Hidden Markov Model）である場合には、状態遷移をバックトレースした結果を、それぞれの入力音声について調査し比較する。一般に人間の発声の場合においては、発声の長さやスペクトルが異なるために、照合過程が一致することは起こり得ないのに対して、合成音声や録音音声の場合には、登録話者の音声情報に対する照合過程が何回入力しても一致してしまうため、かかる不正入力を検出することが可能となる。
【００４１】
そして、総合判断部８においては、登録音声類似度算出部６で算出された個人別音声情報登録部２に登録されている音声データの特徴量と入力音声の音声データの特徴量との類似度と、入力音声類似度算出部７で算出された同一内容の入力音声に関する音声データの特徴量の類似度とに基づいて、総合的に入力された音声が本人のものであるか否かについて判断する。
【００４２】
まず、登録音声類似度算出部６で算出された個人別音声情報登録部２に登録されている音声データの特徴量と入力音声の音声データの特徴量との類似度が所定のしきい値よりも小さい、すなわち上述したような方法により求めた照合距離のしきい値Ｉよりも大きい場合には、人間が本来有する音声発声時の揺らぎの範囲を超えているものとして、入力された音声が本人のものではないと判断される。類似度が所定のしきい値以上、すなわち上述したような方法により求めた照合距離のしきい値Ｉ以下である場合には、人間が本来有する音声発声時の揺らぎの範囲内であると判断され、以下の判断に移る。
【００４３】
次に、入力音声類似度算出部７で算出された同一内容の入力音声に関する音声データの特徴量の類似度が所定のしきい値以上、すなわち上述したような方法により求めた照合距離のしきい値II以下である場合には、人間が本来有する音声発声時の揺らぎすら生じていない不自然な音声で有るものとして、入力された音声が本人のものではないと判断される。類似度が所定のしきい値より小さい、すなわち上述したような方法により求めた照合距離のしきい値IIより大きい場合には、人間が本来有する音声発声時の揺らぎが生じていると判断され、入力された音声が本人のものであると判断される。
【００４４】
最後に、入力された音声が本人のものであるか否かについての判断結果を判断結果出力部９において出力する。出力方法としては、表示装置等へ表示するものであっても良いし、判断結果に応じて稼働するアプリケーション等へファイルとして渡すものであっても良い。
【００４５】
次に、人工的な合成音声に対して、一回目と二回目で異なる信号処理を施すことで、人工的に合成音声に揺らぎを付加することで、上述したような合成音声の棄却条件を回避することも考えられる。かかる回避を防止するために、図９に示すように信号処理部を音声入力部４の後処理として設けることで対処する。
【００４６】
図９は、かかる方法を実現する本発明の一実施例にかかる話者照合装置の構成図である。図１に比して、音声入力部４の前処理として信号処理部９１が付加されている点に特徴を有している。
【００４７】
信号処理部９１は、音声入力部４から入力された音声すべてについて処理するものではない。対比する音声入力の少なくとも一つについて、想定される信号処理を施し、入力音声と信号処理後の音声について上述したような方法で類似度を判別することで、特徴パラメータが実際には一致している合成音声を擬似的に異なる音声であるものと見せかけた入力音声についても、合成音声であることを検出することができ、さらなるセキュリティ性能の向上に寄与できる。
【００４８】
また、音声入力環境は、時間や場所といった周囲の状況によって変動し、同一の音声を同一人が入力した場合であっても周囲の環境が同一であることは考えられないことから、周囲の環境変動による話者の誤認を最小限に止めるべく、入力されてきた音声に適当な信号処理を行うのにも利用可能である。
【００４９】
また、信号処理部９１において、信号に対する変換処理ではなく、正規化処理を行うことも考えられる。正規化処理としては、音声区間全域に渡って平均化したケプストラム（Cepstrum）の値を、各フレームにおけるケプストラムの値から差し引くことで行うＣＭＮ法（Cepstral Mean Normalization）等を用いることが考えられる。正規化処理を行うことで、類似度算出対象となる入力音声を同一環境における音声であるものとして扱うことができ、判断の精度向上が期待できる。なお、正規化処理の手法は特にＣＭＮ法に限定されるものではない。
【００５０】
以上のように本実施の形態によれば、人工的に生成された合成音声については人間の音声が本来有するべき揺らぎが全くないものとして識別することができるようになることから、本人と全く関係のない第三者が音声合成装置等を用いて本人になりすます行為を未然に防止することが可能となる。
【００５１】
次に、本発明の実施の形態にかかる話者照合装置を実現するプログラムの処理の流れについて説明する。図１０に本発明の実施の形態にかかる話者照合装置を実現するプログラムの処理の流れ図を示す。
【００５２】
図１０において、まず事前に登録されている音声情報を引き出すために、照合対象となる利用者の個人ＩＤを入力し、登録されている音声情報を抽出する（ステップＳ１０１）。
【００５３】
次に、音声の類似度を算出するために、どのような音声を入力するのか指示を出し（ステップＳ１０２）、同一の内容について少なくとも二回以上含まれている一又は二以上の音声を入力する（ステップＳ１０３）。
【００５４】
そして、まず抽出された登録音声と入力された音声との間の第１の類似度を算出する（ステップＳ１０４）。算出された第１の類似度が所定のしきい値より小さい場合には（ステップＳ１０５：Ｎｏ）、人間の有する自然な揺らぎ以上の相違を有するものと判断して、詐称者による音声であると判断する（ステップＳ１０９）。
【００５５】
次に、算出された第１の類似度が所定のしきい値以上である場合には（ステップＳ１０５：Ｙｅｓ）、二以上入力されている入力音声同士の間における第２の類似度を算出する（ステップＳ１０６）。算出された第２の類似度が所定のしきい値以上である場合には（ステップＳ１０７：Ｙｅｓ）、人間の有する自然な揺らぎすら有しない不自然な音声であるものと判断して、詐称者による音声であると判断する（ステップＳ１０９）。
【００５６】
算出された第２の類似度が所定のしきい値よりも小さい場合には（ステップＳ１０７：Ｎｏ）、自然な音声による入力であるものと判断して、本人による音声入力であると判断する（ステップＳ１０８）。
【００５７】
なお、本発明の実施の形態にかかる話者照合装置を実現するプログラムを記憶した記録媒体は、図１１に示す記録媒体の例に示すように、ＣＤ−ＲＯＭ１１２−１やフロッピーディスク１１２−２等の可搬型記録媒体１１２だけでなく、通信回線の先に備えられた他の記憶装置１１１や、コンピュータ１１３のハードディスクやＲＡＭ等の記録媒体１１４のいずれでも良く、プログラム実行時には、プログラムはローディングされ、主メモリ上で実行される。
【００５８】
また、本発明の実施の形態にかかる話者照合装置により生成された個人別音声情報等を記録した記録媒体も、図１１に示す記録媒体の例に示すように、ＣＤ−ＲＯＭ１１２−１やフロッピーディスク１１２−２等の可搬型記録媒体１１２だけでなく、通信回線の先に備えられた他の記憶装置１１１や、コンピュータ１１３のハードディスクやＲＡＭ等の記録媒体１１４のいずれでも良く、例えば本発明にかかる話者照合装置を利用する際にコンピュータ１１３により読み取られる。
【００５９】
【発明の効果】
以上のように本発明にかかる話者照合装置によれば、人工的に生成された合成音声については人間の音声が本来有するべき揺らぎが全くないものとして識別することができるようになることから、本人と全く関係のない第三者が音声合成装置等を用いて本人になりすます行為を未然に防止することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態にかかる話者照合装置の構成図
【図２】本発明の一実施例にかかる話者照合装置の構成図
【図３】本発明の実施の形態にかかる話者照合装置の原理説明図
【図４】本発明の実施の形態にかかる話者照合装置におけるしきい値概念説明図
【図５】本発明の実施の形態にかかる話者照合装置におけるしきい値概念説明図
【図６】本発明の実施の形態にかかる話者照合装置におけるＤＰマッチング結果の例示図
【図７】本発明の実施の形態にかかる話者照合装置におけるＤＰマッチングの概念説明図
【図８】本発明の一実施例にかかる話者照合装置の構成図
【図９】本発明の一実施例にかかる話者照合装置の構成図
【図１０】本発明の実施の形態にかかる話者照合装置における処理の流れ図
【図１１】記録媒体の例示図
【符号の説明】
１個人ＩＤ入力部
２個人別音声情報登録部
３音声入力指示部
４音声入力部
５音声分析部
６登録音声類似度算出部
７入力音声類似度算出部
８総合判定部
９判定結果出力部
２１入力音声格納部
８１類似度算出過程比較部
９１信号処理部
１１１回線先の記憶装置
１１２ＣＤ−ＲＯＭやフロッピーディスク等の可搬型記録媒体
１１２−１ＣＤ−ＲＯＭ
１１２−２フロッピーディスク
１１３コンピュータ
１１４コンピュータ上のＲＡＭ／ハードディスク等の記録媒体
[0001]
[Technical field of the invention]
The present invention relates to a speaker verification apparatus or method that determines by voice whether a user is the genuine registered person, based on feature amounts of voice data registered in advance.
[0002]
[Prior art]
In recent years, with the development of computer technology, the communication environment has been rapidly improved. With the development of such a communication environment, it has become possible to perform computer access (Computer Telephony Integration) by telephone even in ordinary homes.
[0003]
In the field of computer access by telephone, a problem arises when accessing information that must not be disclosed to anyone other than the person concerned or a specific individual, such as private information or information subject to a confidentiality obligation. For example, with a touch-tone phone, the authority to access such information can be obtained by entering a password with the telephone buttons; however, once the password becomes known to someone else, that person can easily access the information despite not being the authorized user. For this reason, there is a growing need to verify, using the voice that is unique to each person, whether the user is the person in question or a specific individual.
[0004]
[Problems to be solved by the invention]
However, speech synthesis technology has also advanced rapidly in recent years, and by making full use of it, reproducing even the individual characteristics of a speaker's voice is no longer impossible.
[0005]
That is, conventional techniques assume a natural human voice as the input for speaker verification. Accurately synthesizing a specific person's voice used to require recording and editing a large amount of that person's speech waveform data, so such synthesis was hardly feasible.
[0006]
However, it has recently become possible to synthesize speech that reflects a person's individual voice characteristics from only a small recording of that person's voice, so imitating another person's voice has become easy.
[0007]
Using a speech synthesizer that can reproduce even a speaker's individual characteristics, a third party can impersonate a specific individual, which raises the serious problem that the speaker verification system itself may be abused.
[0008]
To solve the above problem, an object of the present invention is to provide a speaker verification apparatus and method that can verify a speaker accurately even when synthesized speech that reproduces the speaker's individual characteristics is used.
[0009]
[Means for Solving the Problems]
To achieve the above object, a speaker verification apparatus according to the present invention determines whether the voice of a speaker who provides voice input matches the voice of a registered speaker registered in advance. The apparatus includes a speaker input instruction unit that instructs the speaker about the utterance content to be input, a voice input unit that receives one or more voice inputs from the speaker, a voice analysis unit that analyzes the voice received by the voice input unit, and an input voice similarity calculation unit that calculates the mutual similarity between two or more input voices having the same utterance content. When the calculated similarity is at or above a predetermined level close to a perfect match, the apparatus also uses the information that the input voices match in verifying the speaker.
[0010]
With this configuration, in addition to judging whether the speaker is the genuine person, artificially generated synthesized speech can be identified as lacking the fluctuation that a human voice inherently has. This makes it possible to prevent a completely unrelated third party from impersonating the person using a speech synthesizer or the like.
[0011]
In the speaker verification apparatus according to the present invention, the similarity is preferably judged based on whether the matching processes against the registered speaker model are identical. In human speech, the length and spectrum of an utterance differ each time it is spoken, so the matching processes can never coincide. Comparing the matching processes therefore makes it possible to prevent a completely unrelated third party from impersonating the person using a speech synthesizer or the like.
[0012]
In the speaker verification apparatus according to the present invention, when two or more voices are input through the voice input unit and conversion processing has been applied to at least one of them, it is preferable either to apply signal processing to the input voices that have not undergone the conversion processing, or to apply normalization processing to the two or more input voices. This prevents an attack in which, when voice is input several times, the second and subsequent inputs are passed through a filter or similar conversion so as to artificially create natural-looking fluctuation and have the input recognized as not being synthesized speech.
[0013]
Further, when the similarities are judged to match, the speaker verification apparatus according to the present invention preferably judges that the input is not a voice input by the genuine person and rejects it. This is so that, when repeatedly uttered voice data match almost perfectly, the input is suspected of being a recording or the like and is not judged to be from the genuine person.
[0014]
The present invention is also characterized by software that executes the functions of the speaker verification apparatus described above as processing steps of a computer. Specifically, it is a speaker verification method for determining whether the voice of a speaker who provides voice input matches the voice of a registered speaker registered in advance, the method including a step of instructing the speaker about the utterance content to be input, a step of receiving one or more voice inputs from the speaker, a step of analyzing the input voice, and a step of calculating the mutual similarity between two or more input voices having the same utterance content, wherein, when the calculated similarity is at or above a predetermined level close to a perfect match, the information that the input voices match is also used in verifying the speaker. It is likewise a computer-readable recording medium on which these steps are recorded as a program.
[0015]
With this configuration, loading and executing the program on a computer makes it possible to judge whether the speaker is the genuine person and to identify artificially generated synthesized speech as lacking the fluctuation that a human voice inherently has. A speaker verification apparatus can thus be realized that prevents a completely unrelated third party from impersonating the person using a speech synthesizer or the like.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
(Embodiment 1)
Hereinafter, a speaker verification device according to a first exemplary embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a configuration diagram of a speaker verification device according to a first exemplary embodiment of the present invention.
[0017]
In FIG. 1, reference numeral 1 denotes a personal ID input unit for inputting a personal ID at the time of speaker verification. Reference numeral 2 denotes an individual voice information registration unit, which stores voice information in a database for each individual ID in advance. Here, not only the speech waveform data but also the feature value obtained by analyzing the speech data is registered in advance.
[0018]
Next, reference numeral 3 denotes a voice input instructing unit for giving an instruction regarding voice to be input by the user at the time of speaker verification. Reference numeral 4 denotes a voice input unit, which is used by a user to actually speak and input voice data through an input medium such as a microphone.
[0019]
The voice input unit 4 receives voice in accordance with instructions from the voice input instruction unit 3. The input may be a single, relatively long utterance that contains the same utterance content more than once, or the same utterance content repeated two or more times. Different utterances that share a common portion may also be used. For example, having the user say "Maekawa-san" and "Hayakawa-san" makes the "kawa-san" portion the same utterance content, so the two can be compared. With this input method, the user is unlikely to notice that identical speech is being compared, and relatively accurate verification can be expected.
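As a rough illustration of how the shared portion of two different prompts could be located, the sketch below finds the longest common substring of two prompt strings written in kana. The function name and the use of text rather than audio alignment are assumptions made for illustration; the patent does not specify how the common portion is identified.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# The two prompts from the example above, written in hiragana.
shared = longest_common_substring("まえかわさん", "はやかわさん")
```

In a real system the comparison would be made on the audio frames corresponding to this shared portion, not on the text itself.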
[0020]
Accordingly, as shown in FIG. 2, for example, an input voice storage unit 21 may be provided so that the utterance is compared against voice data input on a previous occasion. When a human provides input at a different time and place, fluctuation in the voice is natural, so an input that shows almost no fluctuation can be judged to be synthesized speech, recorded speech, or the like.
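A minimal sketch of how such a storage unit might be used follows. It keeps the most recent feature vector per personal ID and returns its distance to the new input. The class name `InputVoiceStore`, the method name, and the pluggable `distance` function are all hypothetical; the patent only states that previously input voice data may be used for comparison.

```python
from typing import Callable, Dict, List, Optional, Sequence

Features = Sequence[float]


class InputVoiceStore:
    """Hypothetical stand-in for the input voice storage unit 21."""

    def __init__(self) -> None:
        self._last: Dict[str, List[float]] = {}

    def compare_and_store(
        self,
        personal_id: str,
        features: Features,
        distance: Callable[[Features, Features], float],
    ) -> Optional[float]:
        """Return the distance to the previous input for this ID, or None on first use."""
        previous = self._last.get(personal_id)
        self._last[personal_id] = list(features)  # remember the newest input
        if previous is None:
            return None
        return distance(previous, features)


def abs_distance(a: Features, b: Features) -> float:
    """Illustrative distance only: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A returned distance at or near zero across sessions would be the suspicious case, since a human speaker is not expected to reproduce an utterance exactly.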
[0021]
Reference numeral 5 denotes a voice analysis unit, which analyzes input voice data and obtains physical feature quantities of the voice waveform data. Based on the obtained feature amount, the registered speech similarity calculation unit 6 obtains the first similarity between the feature amount of the speech data registered in the individual speech information registration unit 2 and the feature amount of the speech data of the input speech. The input voice similarity calculation unit 7 calculates the second similarity of the feature amount of the voice data for the input voice having the same content.
[0022]
In general, conventional speaker verification judges similarity solely by calculating the first similarity between the input voice and the registered voice based on voice feature amounts or the like. Because a human voice fluctuates with the speaker's condition and environment at the time, and uttering exactly the same voice twice is impossible, a certain tolerance is defined, and the speaker is judged to be the same person if the first similarity falls within that tolerance.
[0023]
Therefore, by using a speech synthesizer or the like to generate synthesized speech adjusted so that the first similarity is within the allowable range, a third party can easily impersonate the person.
[0024]
In contrast, synthesized speech artificially generated by a speech synthesizer or the like has no fluctuation, so the same voice can be input any number of times. Therefore, in addition to conventional speaker verification, an input for which the second similarity, that is, the similarity between input voices, comes out the same every time the same speech is input can be judged to be unnatural speech such as synthesized speech rather than a human voice.
[0025]
Specifically, a case where the similarity determination criterion is a collation distance between audio data will be described with reference to FIG. FIG. 3 shows the frequency distribution of the collation distance between the audio data, and it is judged that the similarity is higher as the collation distance is shorter.
[0026]
In FIG. 3, region A shows the distribution of distances between the feature amounts of the voice data registered in the individual voice information registration unit 2 and those of voice data input by the genuine person. Region B shows the distribution of collation distances between feature amounts of voice portions having the same content, for example between the first and second input voices. Region C shows the distribution of collation distances between the feature amounts of the voice data registered in the individual voice information registration unit 2 and those of voice data input by an impostor.
[0027]
That is, regions A and C are distributions of collation distances to the voice data registered in the individual voice information registration unit 2, whereas region B is a distribution of collation distances between input voice data, which is a major difference.
[0028]
In the conventional method, the decision as to whether the input voice is that of the genuine person is made between region A and region C. That is, when the collation distance of the input voice data is smaller than a predetermined threshold I, the similarity of the input voice is judged to be high, and the input voice is judged to be that of the genuine person.
[0029]
On the other hand, between region A and region B, it is determined whether the input sound is natural sound or synthesized sound. That is, when the collation distance between the input voice data is smaller than the threshold value II which is a predetermined threshold value, it is determined that the input voice does not have a natural fluctuation inherent in humans. It is determined that the voice is an unnatural voice such as a synthesized voice or a recorded voice.
[0030]
Next, the similarity calculation method used in the registered voice similarity calculation unit 6 and the input voice similarity calculation unit 7 will be described. Conventionally, a fixed constant value is often used as the threshold for judging that the feature amounts of the voice data corresponding to a specific personal ID are similar to those of the input voice data. For example, as shown in FIG. 4, the collation distance between the input voice and the pre-registered voice is calculated and compared with a preset threshold. If the collation distance is equal to or shorter than the threshold ("-" in FIG. 4), the speaker is judged to be the genuine person; if it is longer ("+" in FIG. 4), the speaker is judged to be someone else.
[0031]
The following method is often used to set such a threshold. FIG. 5 uses the collation distance as the index for judging similarity and plots, with the collation distance on the horizontal axis, the false rejection rate FRR (False Rejection error Rate) on the vertical axis, which is the probability that the genuine person is erroneously rejected. With the collation distance again on the horizontal axis, the false acceptance rate FAR (False Acceptance error Rate), which is the probability that an impostor is erroneously accepted, is also plotted on the vertical axis.
[0032]
If the threshold is set to a small value, the rate FAR of erroneously accepting an impostor decreases, but the rate FRR of erroneously rejecting the genuine person increases. Conversely, if the threshold is set to a large value, FRR decreases but FAR increases. It is therefore desirable to set the threshold to an appropriate value according to the relative importance of these two error rates.
[0033]
In experiments, it is common to evaluate performance after the fact using, as the threshold, the value at which the two error rates are equal. In the first embodiment, threshold I in FIG. 3 is taken from experimental values obtained with human speech, and threshold II from experimental values obtained with synthesized speech generated by a speech synthesizer, each being the value at which FRR and FAR coincide. That is, once a threshold is set, the areas of the portions of the frequency distribution curves (FIG. 3) for the genuine-to-genuine distance and the genuine-to-other distance that extend beyond the threshold represent FRR and FAR, respectively.
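The threshold selection described above can be sketched as follows, assuming lists of collation distances measured for genuine speakers and for impostors are available. The function names and the choice to search only over observed distances are assumptions for illustration, not part of the patent.

```python
from typing import Sequence, Tuple


def error_rates(
    genuine: Sequence[float], impostor: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (FRR, FAR) when inputs with distance <= threshold are accepted."""
    if not genuine or not impostor:
        raise ValueError("both distance lists must be non-empty")
    # FRR: genuine trials whose distance exceeds the threshold are rejected.
    frr = sum(d > threshold for d in genuine) / len(genuine)
    # FAR: impostor trials whose distance is within the threshold are accepted.
    far = sum(d <= threshold for d in impostor) / len(impostor)
    return frr, far


def equal_error_threshold(
    genuine: Sequence[float], impostor: Sequence[float]
) -> float:
    """Pick the observed distance at which FRR and FAR are closest to equal."""
    candidates = sorted(set(genuine) | set(impostor))

    def gap(t: float) -> float:
        frr, far = error_rates(genuine, impostor, t)
        return abs(frr - far)

    return min(candidates, key=gap)


# Made-up distances, for illustration only.
genuine_distances = [1.0, 2.0, 3.0, 4.0]
impostor_distances = [3.5, 5.0, 6.0, 7.0]
threshold = equal_error_threshold(genuine_distances, impostor_distances)
```

Lowering the chosen threshold trades a lower FAR for a higher FRR, and raising it does the opposite, which matches the trade-off described in the preceding paragraph.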
[0034]
Various methods are likewise conceivable for calculating the collation distance used to judge whether the input voice is synthesized speech or the like. In the first embodiment, the feature amounts of the voice data are treated as n-dimensional feature parameters, and the collation distance is obtained as a distance in n-dimensional space. The method is not limited to this, however, and DP matching may also be used to calculate the collation distance. Here, DP stands for dynamic programming.
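As a minimal sketch of a distance in n-dimensional feature space, the function below uses the Euclidean distance. The patent does not name a particular metric, so the Euclidean choice and the function name are assumptions.

```python
import math
from typing import Sequence


def collation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two n-dimensional feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Identical feature vectors give a distance of exactly zero, which is the case the second-similarity check is designed to catch.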
[0035]
For example, FIG. 6 shows the frequency distribution, calculated using DP matching, of within-speaker distances for word utterances contained in identical content uttered in the same period. The distance distribution on which the judgment is based can also be obtained in this way.
[0036]
FIG. 7 illustrates examples of DP paths obtained with DP matching. Here, the DP path is the path that selects the minimum value when time alignment is performed. In FIG. 7, the horizontal axis is the feature parameter sequence of the voice data from the first input of the same speech, the vertical axis is the feature parameter sequence of the voice data from the second input of the same speech, and i and j denote the respective frame numbers.
[0037]
For the first and second utterances of the same utterance portion, time alignment (time normalization) is performed using DP matching or the like, and the judgment is made using the distance after time normalization. If that distance is extremely small or extremely large, the input is rejected as an unnatural utterance. Using the resulting DP path makes this judgment easier.
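The time alignment step can be sketched with a basic dynamic time warping routine. This is a generic textbook formulation, not the patent's specific DP matching: the symmetric step pattern, the Euclidean frame distance, and normalization by path length are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dp_match(
    x: Sequence[Frame], y: Sequence[Frame]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (time-normalized distance, DP path) for two non-empty frame sequences."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrace the minimum-cost path from the end to the start.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step, preferred on ties
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m] / len(path), path


first = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [0.0], [1.0], [2.0]]  # same content, first frame held longer
distance, path = dp_match(first, stretched)
```

The returned distance would then be compared against lower and upper limits, rejecting inputs that are either too close or too far apart.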
[0038]
That is, in FIG. 7, for natural human utterances, local fluctuations of the DP path occur between the first and second utterances, as shown by 71, and the feature parameters can never match perfectly. For synthesized speech or the like artificially generated by a speech synthesizer, however, the feature parameters are the same no matter how many times the speech is input, so the feature parameters of the first and second inputs match perfectly, as shown by 72. Detecting such unnatural utterances makes it possible to prevent impersonation.
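One simple way to quantify how far a DP path wanders from a perfectly straight alignment is sketched below. The measure (mean absolute difference between the two frame indices) and the zero-deviation cutoff are assumptions chosen for illustration; the patent only describes the qualitative difference between paths 71 and 72.

```python
from typing import List, Tuple


def path_deviation(path: List[Tuple[int, int]]) -> float:
    """Mean |i - j| over the DP path; 0.0 means a perfectly diagonal path."""
    if not path:
        raise ValueError("path must be non-empty")
    return sum(abs(i - j) for i, j in path) / len(path)


def looks_synthetic(path: List[Tuple[int, int]], min_deviation: float = 0.0) -> bool:
    """Flag a path whose deviation does not exceed the assumed minimum for humans."""
    return path_deviation(path) <= min_deviation


straight = [(0, 0), (1, 1), (2, 2)]        # like path 72: identical inputs
wobbly = [(0, 0), (0, 1), (1, 2), (2, 3)]  # like path 71: local timing differences
```

This measure only makes sense when both inputs have similar lengths; a practical check would also look at the frame-level distances along the path.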
[0039]
Therefore, as a method for checking whether or not the voice is a synthesized speech or the like, it is possible to check whether or not the checking process for the registered speaker model is the same. FIG. 8 is a block diagram of a speaker verification device according to an embodiment of the present invention to which the matching process identity determination is applied. FIG. 8 is characterized in that a similarity calculation process comparison unit 81 is provided instead of the input speech similarity calculation unit 7.
[0040]
The similarity calculation process comparison unit 81 examines and compares, for each input voice, the DP path if DP matching is used, or the result of backtracing the state transitions if an HMM (Hidden Markov Model) with the Viterbi algorithm is used. For human utterances, the length and spectrum of each utterance differ, so the matching processes can never coincide. For synthesized or recorded speech, by contrast, the matching process against the registered speaker's voice information is the same no matter how many times the speech is input, so such fraudulent input can be detected.
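The comparison of matching processes can be sketched as a comparison of two backtraced state sequences, one per input voice. The function names, the representation of a trace as a list of integer state indices, and the default requirement of full agreement are assumptions; the traces themselves would come from a Viterbi decoder or DP matcher that is not shown here.

```python
from typing import Sequence


def alignment_agreement(trace_a: Sequence[int], trace_b: Sequence[int]) -> float:
    """Fraction of frames whose backtraced state coincides in both traces.

    Frames present in only one trace count as disagreement, so traces of
    different length can never reach full agreement.
    """
    longest = max(len(trace_a), len(trace_b))
    if longest == 0:
        raise ValueError("traces must not both be empty")
    same = sum(1 for a, b in zip(trace_a, trace_b) if a == b)
    return same / longest


def is_replayed(
    trace_a: Sequence[int], trace_b: Sequence[int], agreement_threshold: float = 1.0
) -> bool:
    """Treat inputs whose matching processes coincide as synthesized or recorded."""
    return alignment_agreement(trace_a, trace_b) >= agreement_threshold
```

For example, two inputs that both backtrace to the states `[0, 0, 1, 2]` would be flagged, whereas `[0, 0, 1, 2]` and `[0, 1, 1, 2, 2]` would not.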
[0041]
The overall judgment unit 8 then judges comprehensively whether the input voice is that of the genuine person, based on the similarity, calculated by the registered voice similarity calculation unit 6, between the feature amounts of the voice data registered in the individual voice information registration unit 2 and those of the input voice, and on the similarity, calculated by the input voice similarity calculation unit 7, between the feature amounts of input voices having the same content.
[0042]
First, if the similarity calculated by the registered voice similarity calculation unit 6 between the feature amounts of the voice data registered in the individual voice information registration unit 2 and those of the input voice is smaller than a predetermined threshold, that is, if the collation distance is larger than threshold I obtained as described above, the input is regarded as exceeding the range of fluctuation inherent in human utterance, and the input voice is judged not to be that of the genuine person. If the similarity is equal to or greater than the predetermined threshold, that is, if the collation distance is equal to or smaller than threshold I, the input is judged to be within the range of fluctuation inherent in human utterance, and processing proceeds to the next judgment.
[0043]
Next, when the similarity between the feature amounts of the voice data of the input voices having the same content, calculated by the input voice similarity calculation unit 7, is equal to or greater than a predetermined threshold, that is, when the collation distance obtained by the method described above is equal to or smaller than threshold II, the input is regarded as an unnatural voice that lacks even the fluctuation inherent in human utterances, and the input voice is judged not to be that of the user. When the similarity is smaller than the predetermined threshold, that is, when the collation distance is larger than threshold II, it is judged that the fluctuation inherent in human utterances is present, and the input voice is determined to belong to the user.
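The two judgments above can be summarized as a single acceptance condition on the collation distances. The symbols are introduced here only for illustration and do not appear in the original: $D_{\mathrm{reg}}$ is the collation distance between the registered voice and the input voice, $D_{\mathrm{in}}$ is the collation distance between the input voices of the same content, and $\theta_{\mathrm{I}}$, $\theta_{\mathrm{II}}$ are thresholds I and II.

```latex
\text{accept} \iff \left( D_{\mathrm{reg}} \le \theta_{\mathrm{I}} \right) \;\wedge\; \left( D_{\mathrm{in}} > \theta_{\mathrm{II}} \right)
```

In words, the input must be close enough to the registered voice, yet the repeated inputs must not be too close to each other.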
[0044]
Finally, the determination result output unit 9 outputs a determination result as to whether or not the input voice belongs to the user. As an output method, it may be displayed on a display device or the like, or may be passed as a file to an application or the like that operates according to the determination result.
[0045]
It is also conceivable that an attacker applies different signal processing to the first and second inputs of artificial speech, artificially adding fluctuation to the synthesized speech in order to evade the synthetic speech rejection condition described above. To prevent such circumvention, a signal processing unit is provided as a post-process of the voice input unit 4, as shown in FIG. 9.
[0046]
FIG. 9 is a block diagram of a speaker verification apparatus according to an embodiment of the present invention that implements such a method. Compared to FIG. 1, a signal processing unit 91 is added as post-processing of the voice input unit 4.
[0047]
The signal processing unit 91 does not process all the voices input from the voice input unit 4. At least one of the input voices to be compared is subjected to the assumed signal processing, and the similarity between the input voice and the voice after signal processing is determined by the method described above. In this way, an input that disguises synthesized speech as apparently different utterances, although its feature parameters actually match, can be detected as synthesized speech, which contributes to a further improvement in security.
[0048]
Also, the voice input environment varies with surrounding conditions such as time and place, and even when the same person inputs the same voice, the surrounding environment is unlikely to be identical. The signal processing unit 91 can therefore also be used to apply appropriate signal processing to the input voice in order to minimize misidentification of the speaker caused by such environmental variation.
[0049]
It is also conceivable that the signal processing unit 91 performs normalization processing instead of signal conversion processing. One candidate for the normalization processing is the CMN (Cepstral Mean Normalization) method, which subtracts the cepstrum value averaged over the entire speech section from the cepstrum value in each frame. The normalization makes it possible to treat the input voices subject to similarity calculation as voices recorded in the same environment, so an improvement in determination accuracy can be expected. Note that the normalization processing method is not limited to the CMN method.
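The CMN step described above can be sketched as follows. This is an illustrative sketch that assumes the cepstra are already available as a list of equal-length frames; the function name `cepstral_mean_normalize` and the sample values are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Cepstrum averaged over the entire speech section, one value per coefficient.
    mean = [sum(frame[k] for frame in frames) / n_frames for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in frames]


# Two hypothetical recordings of the same cepstra; the second carries a constant
# per-coefficient offset (+0.5, -1.0), as a different channel might introduce.
clean = [[1.0, 4.0], [3.0, 8.0]]
shifted = [[1.5, 3.0], [3.5, 7.0]]

normalized_clean = cepstral_mean_normalize(clean)
normalized_shifted = cepstral_mean_normalize(shifted)
```

Both recordings normalize to the same frames, which illustrates why CMN lets inputs from different environments be compared as if they were recorded under the same conditions.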
[0050]
As described above, according to the present embodiment, artificially generated synthesized speech can be identified as lacking the fluctuation that human speech should naturally have. This makes it possible to prevent an unauthorized third party from impersonating the user by means of a speech synthesizer or the like.
[0051]
Next, a processing flow of a program that realizes the speaker verification apparatus according to the embodiment of the present invention will be described. FIG. 10 shows a flowchart of the processing of a program that realizes the speaker verification apparatus according to the embodiment of the present invention.
[0052]
In FIG. 10, first, in order to extract voice information registered in advance, the personal ID of the user to be collated is input, and the registered voice information is extracted (step S101).
[0053]
Next, in order to calculate the voice similarities, the user is instructed as to what kind of voice to input (step S102), and one or more voices that contain the same content at least twice are input (step S103).
[0054]
First, a first similarity between the extracted registered voice and the input voice is calculated (step S104). If the calculated first similarity is smaller than the predetermined threshold (step S105: No), the difference is regarded as exceeding natural human fluctuation, and the voice is judged to be that of an impersonator (step S109).
[0055]
Next, when the calculated first similarity is greater than or equal to the predetermined threshold (step S105: Yes), a second similarity between the two or more input voices is calculated (step S106). If the calculated second similarity is greater than or equal to a predetermined threshold (step S107: Yes), the voice is regarded as an unnatural voice that lacks even the natural fluctuation of human speech, and is judged to be that of an impersonator (step S109).
[0056]
If the calculated second similarity is smaller than the predetermined threshold (step S107: No), the input is regarded as natural voice input and is determined to be voice input by the user (step S108).
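The decision flow of steps S105 to S109 can be sketched as below. This is a minimal sketch that assumes the two similarities have already been computed in steps S104 and S106 (the flowchart itself computes the second similarity only after the first check passes). The function name, the result labels, and the numeric values are hypothetical and are not taken from the embodiment.

```python
def verify_speaker(first_similarity, second_similarity, threshold_1, threshold_2):
    """Two-stage decision following the S105 / S107 branches of the flowchart."""
    # S105: too dissimilar from the registered voice, beyond natural fluctuation.
    if first_similarity < threshold_1:
        return "impersonator"  # S109
    # S107: repeated inputs are nearly identical, so natural fluctuation is missing.
    if second_similarity >= threshold_2:
        return "impersonator"  # S109
    return "user"  # S108


# Hypothetical similarity scores on a 0..1 scale.
result_stranger = verify_speaker(0.40, 0.70, threshold_1=0.80, threshold_2=0.98)
result_replay = verify_speaker(0.95, 0.999, threshold_1=0.80, threshold_2=0.98)
result_genuine = verify_speaker(0.90, 0.85, threshold_1=0.80, threshold_2=0.98)
```

The second threshold sits close to a perfect match, so only inputs that repeat almost exactly, such as synthesized or recorded speech, are rejected at the second stage.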
[0057]
As shown in the example of recording media in FIG. 11, the recording medium storing the program that realizes the speaker verification device according to the embodiment of the present invention may be not only a portable recording medium 112 such as the CD-ROM 112-1 or the floppy disk 112-2, but also another storage device 111 provided at the end of a communication line, or a recording medium 114 such as a hard disk or RAM of the computer 113. When the program is executed, it is loaded and run in main memory.
[0058]
Likewise, as shown in the example of recording media in FIG. 11, the recording medium on which the individual voice information generated by the speaker verification device according to the embodiment of the present invention is recorded may be not only a portable recording medium 112 such as the CD-ROM 112-1 or the floppy disk 112-2, but also another storage device 111 provided at the end of a communication line, or a recording medium 114 such as a hard disk or RAM of the computer 113. The information is read by the computer 113, for example, when the speaker verification device according to the present invention is used.
[0059]
[Effect of the invention]
As described above, according to the speaker verification device of the present invention, artificially generated synthesized speech can be identified as lacking the fluctuation that human speech should naturally have. This makes it possible to prevent a third party who has nothing to do with the user from impersonating the user by means of a speech synthesizer or the like.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a speaker verification device according to an embodiment of the present invention.
FIG. 2 is a block diagram of a speaker verification device according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the principle of the speaker verification device according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram of a threshold value concept in the speaker verification device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram of a threshold value concept in the speaker verification device according to the embodiment of the present invention.
FIG. 6 is a view showing an example of a DP matching result in the speaker verification apparatus according to the embodiment of the present invention.
FIG. 7 is a conceptual explanatory diagram of DP matching in the speaker verification device according to the embodiment of the present invention.
FIG. 8 is a block diagram of a speaker verification device according to an embodiment of the present invention.
FIG. 9 is a block diagram of a speaker verification device according to an embodiment of the present invention.
FIG. 10 is a flowchart of processing in the speaker verification apparatus according to the embodiment of the present invention.
FIG. 11 is an exemplary diagram of a recording medium.
[Explanation of symbols]
1 Personal ID input unit
2 Individual voice information registration unit
3 Voice input instruction unit
4 Voice input unit
5 Voice analysis unit
6 Registered voice similarity calculation unit
7 Input voice similarity calculation unit
8 Overall judgment unit
9 Determination result output unit
21 Input voice storage
81 Similarity calculation process comparison unit
91 Signal processing unit
111 Storage device at the end of a communication line
112 Portable recording media such as CD-ROM and floppy disk
112-1 CD-ROM
112-2 Floppy disk
113 Computer
114 Recording medium such as a hard disk or RAM of the computer

Claims

A speaker verification device for determining whether or not a voice of a speaker for voice input matches a voice of a registered speaker registered in advance,
A speaker input instruction unit that gives instructions to the speaker about the utterance content to be input;
A voice input unit that inputs one or more voices of the speaker that include voice parts of the same content;
A voice analysis unit for analyzing the voice input by the voice input unit;
An input speech similarity calculation unit that calculates, as a first similarity, the similarity between the voice input by the voice input unit and the registered voice of the speaker registered in advance, and calculates, as a second similarity, the mutual similarity between two or more voice parts having the same utterance content within the one or more input voices; and
A determination unit that determines that the input is not the voice input of the person himself / herself when the calculated first similarity is smaller than a predetermined threshold, and that, even when the first similarity is equal to or higher than the predetermined threshold, determines that the input is not the voice input of the person himself / herself when the second similarity is equal to or higher than a predetermined level close to a perfect match.

The speaker verification apparatus according to claim 1, wherein the second similarity is determined based on whether the verification process for the registered speaker model is the same.

The speaker verification apparatus according to claim 1, wherein, when two or more voices are input in the voice input unit, normalization processing is performed on the two or more voices input by the voice input unit.

The speaker verification device according to claim 1, wherein the determination unit determines that the voice is that of an impersonator when the calculated first similarity is smaller than a predetermined threshold, and, even when the first similarity is equal to or higher than the predetermined threshold, determines that the voice is that of an impersonator and rejects the input when the second similarity is equal to or higher than a predetermined level close to a perfect match.

A speaker verification method for determining whether or not a voice of a speaker for voice input matches a voice of a registered speaker registered in advance,
Giving instructions to the speaker about the utterance content to be entered;
Inputting one or more voices of the speaker that include voice parts of the same content;
Analyzing the input speech;
Calculating, as a first similarity, the similarity between the voice input by the voice input unit and the registered voice of the speaker registered in advance, and calculating, as a second similarity, the mutual similarity between two or more voice parts having the same utterance content within the one or more input voices; and
Determining that the input is not the voice input of the person himself / herself when the calculated first similarity is smaller than a predetermined threshold, and, even when the first similarity is equal to or higher than the predetermined threshold, determining that the input is not the voice input of the person himself / herself when the second similarity is equal to or higher than a predetermined level close to a perfect match.

A computer-readable recording medium that records a program for realizing a speaker verification method for determining whether or not a voice of a speaker to be input matches a voice of a registered speaker registered in advance,
Giving instructions to the speaker about the utterance content to be entered;
Inputting one or more voices of the speaker that include voice parts of the same content;
Analyzing the input speech;
Calculating, as a first similarity, the similarity between the voice input by the voice input unit and the registered voice of the speaker registered in advance, and calculating, as a second similarity, the mutual similarity between two or more voice parts having the same utterance content within the one or more input voices; and
Determining that the input is not the voice input of the person himself / herself when the first similarity is smaller than a predetermined threshold, and, even when the first similarity is equal to or higher than the predetermined threshold, determining that the input is not the voice input of the person himself / herself when the second similarity is equal to or higher than a predetermined level close to a perfect match; the recording medium being a computer-readable recording medium on which is recorded a program that causes a computer to execute these steps.