JP2006078654A

JP2006078654A - Voice authenticating system, method, and program

Info

Publication number: JP2006078654A
Application number: JP2004261005A
Authority: JP
Inventors: Naohisa Komatsu; 尚久小松; Yusuke Fujita; 雄介藤田; Kenichi Sato; 研一佐藤; Kazuhiro Tsurumaru; 和宏鶴丸
Original assignee: EMBEDDED SYSTEM KK; Waseda University
Current assignee: EMBEDDED SYSTEM KK; Waseda University
Priority date: 2004-09-08
Filing date: 2004-09-08
Publication date: 2006-03-23

Abstract

PROBLEM TO BE SOLVED: To provide a voice authentication system, method, and program that markedly improve noise resistance and reliability.
SOLUTION: The voice authentication system successively generates and outputs prescribed feature parameters that represent the individuality of a speaker on the basis of an input voice signal, detects the sound-present sections and the voiced sections of the voice signal, and extracts, from the generated feature parameters, those belonging to sections of the voice signal that are both sound-present and voiced. Speaker authentication is performed on the basis of the extracted feature parameters.
COPYRIGHT: (C)2006, JPO & NCIPI

Description

The present invention relates to a voice authentication apparatus, method, and program, and is suitable for application to, for example, a voice authentication apparatus that performs speaker authentication based on parameters obtained by encoding with the CELP (Code Excited Linear Prediction) coding scheme.

In recent years, as network services such as electronic commerce and online banking have expanded and become widespread, the need for personal authentication that confirms the legitimacy of users has grown from the standpoint of information security and privacy protection. Against this background, personal authentication based on something the user possesses or knows, such as an IC card or a password, has conventionally been used as a simple and inexpensive method.

However, such conventional methods carry risks such as theft, loss, or forgetting. Attention has therefore turned in recent years to so-called biometric authentication, which uses an individual's physical or behavioral characteristics such as voice, handwriting, fingerprints, or facial expressions.

For example, speech is known to carry both phonological information and individuality information; the phonological information can be used for speech recognition, and the individuality information for speaker recognition. In digital voice communication over mobile phones, speech is encoded with the CELP coding scheme for transmission, and the encoded speech preserves both kinds of information. It has therefore been proposed to perform personal authentication using parameters called LSPs (Line Spectrum Pairs) obtained by such encoding (see, for example, Non-Patent Document 1).
Non-Patent Document 1: Takefumi Mogaki and Naohisa Komatsu, "Text-prompted speaker verification method using the CELP method", Computer Security Symposium '98, October 1998, pp. 201-206

However, personal authentication using the individuality information contained in speech (hereinafter called voice authentication or speaker authentication) is easily affected by environmental noise, and authentication accuracy degrades markedly in noisy conditions.

In fact, an experiment was conducted to evaluate the reliability of speaker authentication in noisy environments. Ideal, noise-free speech recorded in the ATR Japanese speech database for research was CELP-encoded, and the resulting LSPs were used for speaker authentication. Separately, "car", "crowd", "station", or "street" (intersection) noise recorded in the Denshikyo (JEIDA) noise database was added to that speech, the noisy speech was CELP-encoded, and the resulting LSPs were used for speaker authentication. The results shown in FIG. 15 were obtained.

As FIG. 15 shows, the error rate was 5.13% with no noise added, whereas it was 30.80% with "car" noise, 15.87% with "crowd" noise, 14.52% with "station" noise, and 9.47% with "street" noise. Speaker authentication thus becomes less reliable in noisy environments.

This evaluation uses the EER (Equal Error Rate), which, as shown in FIG. 16, is the error rate at the point where the FRR (False Rejection Rate), defined by equation (1), and the FAR (False Acceptance Rate), defined by equation (2), take the same value. The specifications of each noise data set are shown in FIG. 17.
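
The defining equations for FRR and FAR do not survive in this text. As a rough illustration only, the sketch below assumes the conventional definitions (FRR is the fraction of genuine attempts that are rejected, FAR is the fraction of impostor attempts that are accepted) and assumes that each attempt yields a distance score that is accepted when it does not exceed a threshold. The function names and the threshold scan are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple


def frr_far(genuine: List[float], impostor: List[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FRR, FAR) for distance scores; a score <= threshold is accepted."""
    # Assumed conventional definitions, not the patent's own equations.
    frr = sum(1 for s in genuine if s > threshold) / len(genuine)
    far = sum(1 for s in impostor if s <= threshold) / len(impostor)
    return frr, far


def equal_error_rate(genuine: List[float], impostor: List[float]) -> float:
    """Scan the observed scores as thresholds and return the error rate
    at the threshold where FRR and FAR are closest to each other."""
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        frr, far = frr_far(genuine, impostor, t)
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer


genuine_scores = [0.1, 0.2, 0.3, 0.6]   # same-speaker distances (made up)
impostor_scores = [0.4, 0.5, 0.7, 0.8]  # different-speaker distances (made up)
print(equal_error_rate(genuine_scores, impostor_scores))
```

With real data the FRR and FAR curves rarely cross exactly at an observed score, which is why the sketch averages the two rates at the closest point.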

However, speaker authentication may actually be used in a wide variety of environments. When building such a voice authentication apparatus, it is therefore desirable to improve noise resistance so that practically sufficient authentication accuracy is obtained under various noise conditions.

The present invention has been made in view of the above points, and proposes a voice authentication apparatus, method, and program that can markedly improve noise resistance and reliability.

To solve this problem, the voice authentication apparatus of the present invention comprises: feature parameter generation means for sequentially generating and outputting prescribed feature parameters that represent the individuality of a speaker on the basis of an input voice signal; sound-present section detection means for detecting the sound-present sections of the voice signal; voiced section detection means for detecting the voiced sections of the voice signal; feature parameter extraction means for extracting, on the basis of the detection results of the sound-present section detection means and the voiced section detection means, the feature parameters of sections of the voice signal that are both sound-present and voiced from among the feature parameters sequentially output by the feature parameter generation means; and speaker authentication means for performing speaker authentication on the basis of the feature parameters extracted by the feature parameter extraction means.

As a result, this voice authentication apparatus can perform speaker registration and speaker verification using only feature parameters that are stable against environmental noise, so that both registration and verification are less susceptible to environmental noise.

The voice authentication method of the present invention comprises: a first step of sequentially generating and outputting prescribed feature parameters that represent the individuality of a speaker on the basis of an input voice signal, and of detecting the sound-present sections and the voiced sections of the voice signal; a second step of extracting, from the generated feature parameters, the feature parameters of sections of the voice signal that are both sound-present and voiced; and a third step of performing speaker authentication on the basis of the extracted feature parameters.

As a result, this voice authentication method can perform speaker registration and speaker verification using only feature parameters that are stable against environmental noise, so that both registration and verification are less susceptible to environmental noise.

Furthermore, the program of the present invention causes a computer to execute processing comprising: a first step of sequentially generating and outputting prescribed feature parameters that represent the individuality of a speaker on the basis of an input voice signal, and of detecting the sound-present sections and the voiced sections of the voice signal; a second step of extracting, from the generated feature parameters, the feature parameters of sections of the voice signal that are both sound-present and voiced; and a third step of performing speaker authentication on the basis of the extracted feature parameters.

As a result, this program can perform speaker registration and speaker verification using only feature parameters that are stable against environmental noise, so that both registration and verification are less susceptible to environmental noise.

According to the present invention, the voice authentication apparatus, method, and program sequentially generate and output prescribed feature parameters that represent the individuality of a speaker on the basis of an input voice signal, detect the sound-present sections and the voiced sections of the voice signal, extract from the generated feature parameters those of sections that are both sound-present and voiced, and perform speaker authentication on the basis of the extracted feature parameters. Speaker registration and speaker verification can therefore be performed using only feature parameters that are stable against environmental noise, which makes both less susceptible to environmental noise. A voice authentication apparatus, method, and program that markedly improve noise resistance and reliability can thus be realized.

An embodiment of the present invention is described in detail below with reference to the drawings.

(1) First Embodiment
(1-1) Principle
One cause of the degradation of speaker authentication accuracy in noisy environments is that speech information from sections containing no speech, that is, sections carrying no individuality information about the speaker, is used for speaker authentication.

It is therefore considered effective, as a noise-robustness measure in a voice authentication apparatus that performs speaker authentication using speech information encoded with, for example, the CELP coding scheme, to selectively extract only the LSPs of frames that are stable against environmental noise and to use those LSPs for registration and verification processing.

In this connection, ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) has standardized CS-ACELP (Conjugate Structure Algebraic CELP), a kind of CELP coding scheme, as G.729. It has further standardized, as G.729 Annex B, a scheme that lowers the bit rate by compressing silent sections so that CS-ACELP can be applied to VoIP and the like.

G.729 Annex B adopts a VAD (Voice Activity Decision) algorithm for detecting silent sections. It is therefore considered that extracting only the LSPs of frames in speech sections with this VAD algorithm, and using those LSPs for speaker registration and speaker verification, can reduce the influence of environmental noise.

Meanwhile, the pitch (pitch delay), one of the parameters obtained by CS-ACELP encoding, represents the fundamental period of the excitation signal and corresponds to how high or low a voiced sound is. Voiced sounds have more power than unvoiced sounds, and the LSPs of voiced sounds stably express the characteristics of the speaker's vocal tract. Using the pitch to extract the LSPs of frames in voiced sections can therefore be expected to yield stable LSPs that change little even in noisy environments.

When the applicant examined the behavior of sound-present and silent frames under the VAD algorithm and its relationship to the pitch and to voiced and unvoiced frames, the following facts emerged:
(1) In frames where the pitch is stable, the change in the LSPs caused by noise is small.
(2) The pitch is almost never stable in unvoiced frames and is stable in voiced frames.
(3) The pitch is sometimes stable in some silent frames.

From these facts, combining the VAD algorithm with a technique for extracting sections of stable pitch should make it possible to extract voiced frames accurately. That is, by selecting frames in sections that are determined to be sound-present by the VAD algorithm and also determined to be voiced on the basis of the pitch, only the LSPs of frames that are stable against environmental noise can be extracted. Using these LSPs for speaker registration and speaker verification is expected to reduce the influence of environmental noise and enable highly reliable speaker authentication. A voice authentication apparatus applying this principle is described below.

(1-2) Configuration of the Voice Authentication Apparatus According to the First Embodiment
In FIG. 1, reference numeral 10 denotes the voice authentication apparatus according to this embodiment as a whole. A voice signal S1, obtained by picking up the speaker's voice with a microphone (not shown), is input to the CELP encoding unit 2.

The CELP encoding unit 2 applies CS-ACELP encoding to the supplied voice signal S1 and sends the quantized LSPs obtained for each frame (10 ms) to the frame selection processing unit 3. The CELP encoding unit 2 also sends the pitch (pitch delay) obtained for each subframe (5 ms) by the CS-ACELP encoding to the voiced/unvoiced determination unit 4 as pitch information.

On the basis of the supplied pitch information, the voiced/unvoiced determination unit 4 determines, for each subframe, whether the frame under consideration belongs to a voiced section or an unvoiced section.

Specifically, following the voiced/unvoiced determination procedure RT1 shown in FIG. 2, the voiced/unvoiced determination unit 4 first extracts the pitch values at N (for example, 10) points centered on the frame under consideration (step SP1). It then fits a straight line to these 10 pitch values by the least-squares method and calculates the resulting mean squared error D (step SP2).

Next, the voiced/unvoiced determination unit 4 determines whether the mean squared error D obtained in this way is less than a first threshold D_th predetermined for D, and whether the value P of the pitch currently under consideration is less than a second threshold P_th predetermined for P (step SP3).

If the result of this determination is affirmative, the voiced/unvoiced determination unit 4 determines that the frame belongs to a voiced section with stable pitch (step SP4). If the result is negative, it determines that the frame belongs to an unvoiced section with unstable pitch (step SP5). It sends the result to the frame selection processing unit 3 as a voiced/unvoiced determination signal S2 (step SP6).

The voiced/unvoiced determination unit 4 then repeats the same processing for each subframe (steps SP1 to SP6). In this way, for each subframe, it sequentially sends the result of determining whether the frame under consideration belongs to a voiced section or an unvoiced section to the frame selection processing unit 3 as the voiced/unvoiced determination signal S2.
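
The pitch-stability test of procedure RT1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the threshold values `d_th` and `p_th` are placeholders (the text gives no numbers for D_th and P_th), and the way the 10-point window is clamped at the ends of the sequence is an assumption.

```python
from typing import Sequence


def line_fit_mse(values: Sequence[float]) -> float:
    """Mean squared residual of a least-squares straight line fitted to
    values sampled at x = 0, 1, 2, ... (needs at least two values)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in enumerate(values)) / n


def is_voiced(pitches: Sequence[float], center: int, n_points: int = 10,
              d_th: float = 4.0, p_th: float = 100.0) -> bool:
    """Steps SP1 to SP5: voiced if the pitch track around `center` is close
    to a straight line (D < d_th) and the current pitch is below p_th.
    d_th and p_th are hypothetical values chosen for illustration."""
    if len(pitches) < n_points:
        raise ValueError("need at least n_points pitch values")
    # Assumed edge handling: shift the window so it stays inside the track.
    start = max(0, min(center - n_points // 2, len(pitches) - n_points))
    window = pitches[start:start + n_points]
    d = line_fit_mse(window)
    return d < d_th and pitches[center] < p_th


steady = [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]         # smooth pitch track
erratic = [20, 140, 35, 120, 60, 25, 130, 40, 110, 30]    # jumping pitch track
print(is_voiced(steady, 5), is_voiced(erratic, 5))
```

A straight-line fit tolerates a pitch that rises or falls steadily, which a plain variance test would penalize.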

Meanwhile, the voice signal S1 from the microphone is also supplied to the sound/silence determination unit 5. Using the VAD algorithm, the sound/silence determination unit 5 determines, for each frame of the supplied voice signal S1, whether the frame belongs to a sound-present section or a silent section.

In practice, following the sound/silence determination procedure RT2 based on the VAD algorithm shown in FIG. 3, the sound/silence determination unit 5 first obtains, for each frame (10 ms), the full-band energy, the low-band energy, and the zero-crossing rate. It also obtains the LSPs in the same manner as the CELP encoding unit 2 and calculates the spectrum from those LSPs (step SP11).

The sound/silence determination unit 5 then takes the difference between each of these four parameters (the spectrum obtained from the LSPs, the full-band energy, the low-band energy, and the zero-crossing rate) and the corresponding average value for environmental noise, described later, thereby generating four difference parameters (step SP12).

Next, the sound/silence determination unit 5 determines whether the full-band energy obtained in step SP11 is 15 dB or less (step SP13). If so, it determines that the sound in the frame is "noise" (step SP14). If not, it makes an initial determination of whether the sound in the frame is "speech" or "noise" from the position, in a four-dimensional vector space, of the vector formed by the four difference parameters (step SP15).
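
Part of steps SP11 to SP14 can be sketched as below. The sketch computes only two of the four parameters (full-band energy and zero-crossing rate) and omits the low-band energy and the LSP-derived spectrum, which need filtering and LPC analysis. The energy formula (10 log10 of the mean squared sample value), the 80-sample frame (10 ms at 8 kHz), and all function names are assumptions made for illustration; they are not taken from the patent or from the G.729 Annex B reference code.

```python
import math
from typing import Dict, Sequence


def full_band_energy_db(frame: Sequence[float]) -> float:
    """Frame energy in dB, assumed here to be 10*log10(mean of x^2)."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))  # floor avoids log10(0)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def difference_parameters(features: Dict[str, float],
                          noise_means: Dict[str, float]) -> Dict[str, float]:
    """Step SP12: subtract the running noise averages from the frame values."""
    return {name: features[name] - noise_means[name] for name in features}


def is_noise_by_energy(energy_db: float, floor_db: float = 15.0) -> bool:
    """Steps SP13/SP14: frames at or below 15 dB are declared noise outright."""
    return energy_db <= floor_db


loud = [1000.0 if i % 2 == 0 else -1000.0 for i in range(80)]
quiet = [1.0] * 80
print(full_band_energy_db(loud), zero_crossing_rate(loud))
print(is_noise_by_energy(full_band_energy_db(quiet)))
```

Frames that pass the energy gate would then go to the initial speech/noise decision of step SP15, which is not sketched here because the text does not give the decision boundaries.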

The sound/silence determination unit 5 then smooths the determination on the basis of the initial determination results for the preceding several frames, and thereby makes the final determination of whether the frame is "speech" or "noise" (step SP16). For example, if the frame two before and the frame immediately before were both "speech", the initial determination for the current frame is "noise", and the energy difference from the preceding frame is smaller than a certain threshold, the speech is judged to continue in the current frame, and the final determination is changed from "noise" to "speech".
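
The single smoothing rule given as an example above might look like the following. This is a sketch of that one rule only; the 3 dB step limit is a placeholder, and using the final (already smoothed) decisions of the two preceding frames rather than their initial decisions is an assumption.

```python
from typing import List


def smooth_decisions(initial: List[bool], energies_db: List[float],
                     max_energy_step_db: float = 3.0) -> List[bool]:
    """Step SP16 (one rule only): flip an initial "noise" decision to
    "speech" when the two preceding frames ended up as speech and the
    energy changed by less than max_energy_step_db (hypothetical value).
    True means "speech", False means "noise"."""
    final: List[bool] = []
    for i, is_speech in enumerate(initial):
        if (not is_speech and i >= 2 and final[i - 1] and final[i - 2]
                and abs(energies_db[i] - energies_db[i - 1]) < max_energy_step_db):
            is_speech = True  # speech is judged to continue
        final.append(is_speech)
    return final


initial_flags = [True, True, False, False]
energies = [40.0, 41.0, 40.0, 20.0]
print(smooth_decisions(initial_flags, energies))
```

Because the rule feeds on its own output, a slowly decaying tail can stay labeled as speech for many frames; a practical detector would normally cap this extension, which the sketch does not do.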

On the basis of the final determination obtained in this way, the sound/silence determination unit 5 sends the result, "sound-present section" if the frame was determined to be "speech" and "silent section" if it was determined to be "noise", to the frame selection processing unit 3 as a sound/silence determination signal S3 (step SP17).

The sound/silence determination unit 5 also determines whether the frame was finally determined to be "noise" (step SP18). If not, it moves on to the next frame (step SP11). If so, it uses the spectrum obtained from the LSPs extracted for that frame, the full-band energy, the low-band energy, and the zero-crossing rate to update the average values of these four parameters for the background noise (step SP19).

The sound/silence determination unit 5 then repeats the same processing for each frame (steps SP11 to SP19). In this way, for each frame, it sequentially sends the determination result, "sound-present section" or "silent section", to the frame selection processing unit 3 as the sound/silence determination signal S3.

On the basis of the supplied voiced/unvoiced determination signal S2 and sound/silence determination signal S3, the frame selection processing unit 3 determines, for each frame, whether the frame belongs to a section that is both sound-present and voiced. The sound/silence determination obtained from signal S3 is made per frame (10 ms), whereas the voiced/unvoiced determination obtained from signal S2 is made per subframe (5 ms). The frame selection processing unit 3 therefore decides whether a frame is voiced on the basis of the determination result for the odd-numbered subframe, as shown in FIG. 4.

On the basis of this determination, the frame selection processing unit 3 extracts, from the LSPs of the frames sequentially supplied by the CELP encoding unit 2, only the LSPs of frames determined to be both sound-present and voiced. In registration mode, the frame selection processing unit 3 sends the LSPs of the sound-present and voiced frames obtained in this way to the weighting processing unit 6.
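
The frame selection can be sketched as a simple filter over the per-frame and per-subframe flags. The sketch assumes that subframes are numbered from 1 and that each 10 ms frame consists of two consecutive 5 ms subframes, so that the odd-numbered subframe is the first subframe of each frame (index `2 * i` when counting from 0). FIG. 4 is not available here, so this alignment is an assumption, as are the function and parameter names.

```python
from typing import List, Sequence


def select_frames(lsps: Sequence[List[float]],
                  sound_flags: Sequence[bool],
                  voiced_subframe_flags: Sequence[bool]) -> List[List[float]]:
    """Keep the LSP vector of frame i only if the frame is sound-present
    (per-frame flag from signal S3) and its odd-numbered subframe is voiced
    (per-subframe flag from signal S2)."""
    selected = []
    for i, (lsp, has_sound) in enumerate(zip(lsps, sound_flags)):
        voiced = voiced_subframe_flags[2 * i]  # assumed: first subframe of frame i
        if has_sound and voiced:
            selected.append(lsp)
    return selected


frame_lsps = [[0.1], [0.2], [0.3]]          # toy 1-dimensional "LSPs"
frame_sound = [True, True, False]           # one flag per 10 ms frame
subframe_voiced = [True, False, False, True, True, True]  # one flag per 5 ms subframe
print(select_frames(frame_lsps, frame_sound, subframe_voiced))
```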

The weighting processing unit 6 applies weighting to each LSP supplied from the frame selection processing unit 3 by adding to it the average of the LSPs (hereinafter, the average LSP) multiplied by a predetermined weighting coefficient w. A speaker's average LSP is known to express individuality, so this weighting emphasizes individuality and, as shown in FIG. 5, separates the per-speaker distributions in the 10-dimensional Euclidean space formed by the LSPs. The weighting processing unit 6 then sends the weighted LSPs to the clustering unit 7.
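
The weighting amounts to shifting every LSP vector by w times the average LSP vector. A minimal sketch follows, assuming the average is taken over the selected frames passed in, and using 2-dimensional toy vectors instead of 10-dimensional LSPs; the function name and the value of w are illustrative only.

```python
from typing import List, Sequence


def weight_lsps(lsps: Sequence[Sequence[float]], w: float) -> List[List[float]]:
    """Add w times the average LSP vector to every LSP vector."""
    n = len(lsps)
    dim = len(lsps[0])
    # Average LSP, computed component by component over all given frames.
    mean = [sum(v[d] for v in lsps) / n for d in range(dim)]
    return [[v[d] + w * mean[d] for d in range(dim)] for v in lsps]


toy_lsps = [[1.0, 2.0], [3.0, 4.0]]
print(weight_lsps(toy_lsps, 0.5))
```

Since the same offset is added to every frame of a speaker, the spread of that speaker's vectors is unchanged while the distance between speakers with different average LSPs grows by a factor of (1 + w).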

As shown in FIG. 6, the clustering unit 7 applies clustering by the LBG+splitting algorithm to the weighted LSPs supplied from the weighting processing unit 6.

The LBG algorithm is a design algorithm that starts from a suitable initial codebook $C_N^{(0)}$ and converges to a good codebook by repeatedly applying the partition condition and the representative-point condition to a training sequence $T$. Clustering by the LBG algorithm follows the clustering procedure RT3 shown in FIG. 7.

In clustering by the LBG algorithm, let $K$ be the number of dimensions and $N$ the number of levels, and suppose that an initial codebook $C_N^{(0)}$ consisting of $N$ initial quantization representative vectors $y_1^{(0)}, y_2^{(0)}, \ldots, y_N^{(0)}$, a training sequence $T$ consisting of $L$ $K$-dimensional training vectors $x_1, x_2, \ldots, x_L$, and a convergence threshold $\varepsilon$ are given in advance. First, $m$ is set to 0 and the initial distortion $D^{(-1)}$ is set to infinity ($m = 0$, $D^{(-1)} = \infty$) (step SP21).

Next, given the codebook $C_N^{(m)}$ consisting of the quantization representative vectors $y_1^{(m)}, y_2^{(m)}, \ldots, y_N^{(m)}$, the partition $P_N^{(m)}$ of the training sequence $T$ into $N$ regions $P_1^{(m)}, P_2^{(m)}, \ldots, P_N^{(m)}$ that minimizes the average distortion $D^{(m)}$ is determined by applying the partition condition. That is, the region $P_i^{(m)}$ corresponding to the representative vector $y_i^{(m)}$ is the set of training vectors whose distortion with respect to $y_i^{(m)}$ is the smallest among the $N$ representative vectors. The $L$ training vectors are thus divided into $N$ regions. The average distortion $D^{(m)}$ incurred when the training vectors in each region are replaced by that region's representative vector is also calculated (step SP22).

Next, it is determined whether the average distortion $D^{(m)}$ obtained in this way satisfies the convergence condition of equation (3) (step SP23). If it does not, then, given the partition $P_N^{(m)}$ of the training sequence $T$ into the $N$ regions $P_1^{(m)}, P_2^{(m)}, \ldots, P_N^{(m)}$, the codebook $C_N$ consisting of the $N$ quantization representative vectors $y_1, y_2, \ldots, y_N$ that minimizes the average distortion for $T$ is determined by applying the representative-point condition: the centroid, given as the mean of the training vectors belonging to region $P_i^{(m)}$, becomes the representative vector $y_i$ (step SP24). Then $m$ is incremented, $C_N$ is taken as the codebook $C_N^{(m)}$ (step SP25), and the same processing is repeated.

When the average distortion $D^{(m)}$ eventually satisfies equation (3), processing stops, and the codebook $C_N^{(m)}$ at that point is adopted as the final $N$-level codebook (step SP26). An $N$-level codebook that has converged to a suitable state is thus obtained.
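
The iteration of procedure RT3 can be sketched as follows. Equation (3) does not survive in this text, so the sketch assumes the convergence test commonly used with the LBG algorithm, namely that the relative decrease in average distortion, $(D^{(m-1)} - D^{(m)}) / D^{(m)}$, is at most $\varepsilon$. The squared Euclidean distance as the distortion measure, the iteration cap, and keeping the old vector when a region is empty are likewise assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, used here as the distortion measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lbg(training: Sequence[Vector], codebook: Sequence[Vector],
        eps: float = 1e-3, max_iter: int = 100) -> List[Vector]:
    """Refine an initial codebook by alternating the partition condition
    and the representative-point (centroid) condition."""
    book = [list(y) for y in codebook]
    prev_distortion = float("inf")              # D^(-1) = infinity
    for _ in range(max_iter):
        # Partition condition: assign each training vector to the nearest
        # representative vector and accumulate the distortion.
        regions: List[List[Vector]] = [[] for _ in book]
        total = 0.0
        for x in training:
            dists = [sq_dist(x, y) for y in book]
            i = dists.index(min(dists))
            regions[i].append(x)
            total += dists[i]
        distortion = total / len(training)      # average distortion D^(m)
        # Assumed convergence test standing in for equation (3).
        if distortion == 0.0 or (prev_distortion - distortion) / distortion <= eps:
            break
        prev_distortion = distortion
        # Representative-point condition: move each vector to its centroid.
        for i, members in enumerate(regions):
            if members:                         # empty region: keep old vector
                book[i] = [sum(col) / len(members) for col in zip(*members)]
    return book


train = [[0.0], [1.0], [10.0], [11.0]]
print(lbg(train, [[0.0], [5.0]]))
```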

The quality of a codebook designed by the LBG algorithm, however, depends strongly on the choice of the initial codebook $C_N^{(0)}$ and of the training sequence $T$. The initial codebook $C_N^{(0)}$ should cover the expected distribution range of the input vectors. The splitting algorithm is known as a method of generating an initial codebook that satisfies this condition to some extent.

The splitting algorithm splits each of the quantized representative vectors $y_1, y_2, \ldots, y_N$ of an N-level codebook $C_N$ into two nearby vectors using a small vector $\delta$, thereby generating a 2N-level initial codebook $C_{2N}^{(0)}$ consisting of the quantized representative vectors $y_1, y_2, \ldots, y_{2N}$. By combining the splitting algorithm with the LBG algorithm, codebooks of 2, 4, 8, ... levels can be designed in succession, starting from a 1-level codebook.
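A minimal sketch of the splitting step follows. The splitting expression is not reproduced in this text, so the form used here, replacing each vector $y_i$ by $y_i + \delta$ and $y_i - \delta$, is an assumption based on the usual formulation, and the function name `split_codebook` is hypothetical.

```python
def split_codebook(codebook, delta):
    """Double a codebook by perturbing each vector by +delta and -delta (assumed form)."""
    doubled = []
    for y in codebook:
        doubled.append([c + d for c, d in zip(y, delta)])
        doubled.append([c - d for c, d in zip(y, delta)])
    return doubled
```

Starting from a 1-level codebook holding the mean of the training vectors, calling this function and then refining the result repeatedly yields 2, 4, 8, ... levels.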

The clustering unit 7 thus performs clustering on the weighted LSPs supplied from the weighting processing unit 6 using the LBG+splitting algorithm, which combines the LBG algorithm with the splitting algorithm, and obtains N (for example, 16) quantized representative vectors specific to the speaker, each 10-dimensional like the LSPs. It stores these as the speaker's feature codebook CB (FIG. 6) in the storage unit 8, which is, for example, a flash memory or a hard disk.

In verification mode, on the other hand, the weighted LSPs output from the weighting processing unit 6 are supplied to the vector quantization unit 10 of the verification unit 9. The feature codebook CB of the target registered speaker is also read from the storage unit 8 and supplied to the vector quantization unit 10.

As shown in FIG. 8, the vector quantization unit 10 vector-quantizes, with the feature codebook CB, each LSP of the frames that are both sound-present and voiced, supplied sequentially from the weighting processing unit 6, and thereby calculates the quantization error of each frame. That is, for each LSP (a 10-dimensional vector), the vector quantization unit 10 calculates in turn the distance to each of the 16 quantized representative vectors (10-dimensional vectors) making up the feature codebook CB.

The vector quantization unit 10 then sends the smallest of the 16 distances obtained for the frame (hereinafter, the minimum quantization error) to the determination unit 11 as the quantization error detection signal S4.
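The per-frame computation can be sketched as below. Euclidean distance is an assumption, since this passage only says "distance", and the function name `min_quantization_error` is hypothetical.

```python
import math


def min_quantization_error(lsp, codebook):
    """Smallest distance from one LSP vector to any representative vector in the codebook."""
    return min(math.dist(lsp, y) for y in codebook)
```

For a 16-entry codebook of 10-dimensional vectors this evaluates 16 distances per frame and keeps only the smallest.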

The determination unit 11 compares the minimum quantization error of each frame, obtained from the quantization error detection signal S4, with a predetermined third threshold set for the quantization error. The comparison is made for a prescribed number of frames, set according to the time predetermined as the length of the verification speech. The determination unit 11 counts the frames whose minimum quantization error is smaller than the third threshold and, when this count is equal to or greater than a preset fourth threshold, determines that the speaker is the registered person, outputting the result as the determination signal S5.
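The counting rule of the determination unit 11 can be written compactly. In this sketch the parameter names `error_threshold` (the third threshold) and `count_threshold` (the fourth threshold) and the function name are hypothetical, and the input list is assumed to hold exactly the prescribed number of frames.

```python
def is_registered_speaker(min_errors, error_threshold, count_threshold):
    """Accept when enough frames have a minimum quantization error below the threshold."""
    accepted_frames = sum(1 for e in min_errors if e < error_threshold)
    return accepted_frames >= count_threshold
```

The actual threshold values are not given in this passage, so any values passed in are illustrative.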

FIG. 9 shows the specific configuration of the CELP encoding unit 2. In the CELP encoding unit 2, the supplied speech signal S1 is input to a preprocessing unit 20 consisting of a high-pass filter. The preprocessing unit 20 filters the speech signal S1 with a cutoff frequency of 140 Hz and inputs the resulting speech signal S10, from which noise components have been removed, to an LPC (Linear Predictor Coefficients) analysis unit 21 and a subtraction unit 22.

The LPC analysis unit 21 performs linear predictive analysis on the supplied speech signal S10 frame by frame (10 ms per frame), determines the filter coefficients (LPC) of the synthesis filter for each frame, and sets them in the synthesis filter 23. The LPC analysis unit 21 then converts the filter coefficients (LPC) into LSPs, quantizes the LSPs, and sends them to the frame selection processing unit 3 (FIG. 1) as described above.
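This passage does not say how the linear predictive analysis is carried out. One common approach, shown here purely as an assumed illustration, is the autocorrelation method solved with the Levinson-Durbin recursion. The function names are hypothetical, windowing is omitted, and the conversion from LPC to LSP is not shown.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order].

    The predictor convention is x[t] ~ sum_k a[k] * x[t - k].
    Returns the coefficients and the final prediction error energy.
    r[0] must be non-zero (a silent frame needs separate handling).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err
```

For a 10-dimensional LSP vector the order would be 10, with one such analysis per 10 ms frame.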

Meanwhile, for each subframe, the synthesis filter 23 filters the excitation signal S11 supplied from the addition unit 29 (described later) with the filter coefficients (LPC) set at that time, generates a synthesized speech signal S12, and sends it to the subtraction unit 22.

The subtraction unit 22 subtracts the synthesized speech signal S12 from the supplied speech signal S10 to obtain a distortion detection signal S13 representing the waveform distortion between the input speech and the synthesized speech, and sends it to the perceptual weighting unit 24.

At this time the perceptual weighting unit 24 is given the current filter coefficients (LPC) of the synthesis filter 23 by the LPC analysis unit 21. The perceptual weighting unit 24 applies to the distortion detection signal S13 a predetermined filtering process, whose coefficients are determined from the filter coefficients (LPC) of the synthesis filter 23, so that perceptual distortion is minimized, and sends the resulting perceptually weighted distortion detection signal S14 to the pitch analysis unit 25 and the fixed codebook search unit 26.

The pitch analysis unit 25 determines, by A-b-S (analysis-by-synthesis) and on the basis of the supplied perceptually weighted distortion detection signal S14, the pitch $L$ of the synthesized speech and the gain $G_P$ for its pitch component. Specifically, on the basis of the perceptually weighted distortion detection signal S14 from the perceptual weighting unit 24 and the excitation signal S11 from the addition unit 29 (described later), the pitch analysis unit 25 selects the pitch $L$ that minimizes the signal level of the perceptually weighted distortion detection signal S14 and adopts it as the pitch $L$ of the current subframe. It then sends the pitch $L$ to the voiced/unvoiced sound section determination unit 4 (FIG. 1) as described above.

Once the pitch $L$ is determined, the gain $G_P$ is determined accordingly. The code vector corresponding to this result is output from the adaptive codebook 27, amplified by the gain $G_P$ in the amplification unit 28, and supplied to the addition unit 29.

At the same time, the fixed codebook search unit 26 determines, by A-b-S, the excitation signal S11 for the residual speech obtained by removing the synthesized speech produced by the adaptive codebook 27 from the input speech. Specifically, the fixed codebook search unit 26 causes the fixed codebook 30 to output one of the code vectors prepared in advance. One gain $G_C$ is determined for each code vector. The code vector output from the fixed codebook 30 is then amplified by the gain $G_C$ in the amplification unit 31 and supplied to the addition unit 29.

The addition unit 29 adds the code vector from the adaptive codebook 27, amplified by the gain $G_P$, and the code vector from the fixed codebook 30, amplified by the gain $G_C$, to generate the excitation signal S11, which corresponds to human vocal cord vibration in the speech production model. It sends this signal to the synthesis filter 23, which corresponds to human articulatory movement, that is, the vocal tract transfer function in that model, and to the pitch analysis unit 25 described above.

The excitation signal S11 is then filtered by the synthesis filter 23 as described above to generate synthesized speech, and the distortion between this synthesized speech and the input speech is filtered by the perceptual weighting unit 24 and supplied to the fixed codebook search unit 26. The fixed codebook search unit 26 repeats the same operation for all code vectors and selects the code vector giving the smallest distortion as the code vector at that time.
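The analysis-by-synthesis search described above can be sketched in simplified form. This is not the CS-ACELP procedure itself: the sketch assumes zero initial filter state, omits perceptual weighting, uses plain squared error, and picks each gain by least squares. The names `synthesize` and `search_codebook` are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[t] = e[t] + sum_k lpc[k-1] * s[t-k], zero initial state."""
    out = []
    for t, e in enumerate(excitation):
        feedback = sum(a * out[t - k] for k, a in enumerate(lpc, start=1) if t - k >= 0)
        out.append(e + feedback)
    return out


def search_codebook(target, code_vectors, lpc):
    """Try every code vector and return (index, gain, error) of the best match."""
    best = None
    for index, vec in enumerate(code_vectors):
        synth = synthesize(vec, lpc)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero vector cannot match anything
        # Least-squares gain for this code vector (assumed gain rule).
        gain = sum(t * v for t, v in zip(target, synth)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

The exhaustive loop mirrors the "same operation for all code vectors" in the text; practical coders use structured codebooks to avoid the full search.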

In this way the CELP encoding unit 2 obtains the LSPs and the pitch from the input speech signal S1.

(1-3) Operation and Effects of the Present Embodiment
In the above configuration, the voice authentication apparatus 1 detects voiced sound sections from the pitch information obtained by CS-ACELP encoding and detects sound-present sections with the VAD algorithm. From the per-frame LSPs obtained by CS-ACELP encoding, it extracts only the LSPs of frames that are both sound-present and voiced, and uses these LSPs for speaker registration and speaker verification.
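The frame selection amounts to a logical AND of two per-frame flags. In the sketch below the flag lists are assumed to be already aligned frame by frame with the LSPs, and the function name `select_frames` is hypothetical.

```python
def select_frames(lsps, sound_present, voiced):
    """Keep only the LSPs of frames flagged as both sound-present and voiced."""
    return [lsp for lsp, s, v in zip(lsps, sound_present, voiced) if s and v]
```

Frames that fail either test are simply dropped, so both registration and verification see only the selected frames.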

The voice authentication apparatus 1 can therefore perform speaker registration and speaker verification using only the LSPs of frames whose features are stable against environmental noise, which makes it less susceptible to environmental noise during registration and verification.

In practice, an experiment was conducted to evaluate the reliability of speaker authentication with and without the noise-robust processing based on the frame selection described above. Noise of the types "car", "crowd", "station", and "street" described above, recorded in the Denshikyo noise database (電子協騒音データベース), was added to ideal noise-free speech from the ATR Japanese speech database for research, with the mixing ratio varied so that the SN ratio was 10 dB, 15 dB, and 20 dB. The results shown in FIG. 10 and FIG. 11 were obtained. Here too, the evaluation uses the EER.
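How the noise was scaled is not detailed here. One straightforward way to reach a target SN ratio, given as an assumed sketch, is to scale the noise so that the ratio of mean speech power to mean noise power matches the target in decibels. The function name `mix_at_snr` is hypothetical, and both signals are assumed to have the same length and non-zero power.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to the requested SN ratio in dB."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Solve 10*log10(speech_power / (scale**2 * noise_power)) = snr_db for scale.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Calling it with 10, 15, and 20 dB in turn gives the three noise conditions mentioned in the text.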

As these results show, although accuracy differs somewhat with the type of noise, the error rate is lower overall when the noise-robust processing based on frame selection is performed (FIG. 10) than when it is not (FIG. 11). Performing this processing therefore enables highly reliable speaker authentication.

Separately, an experiment was conducted to evaluate the relationship between speech length and verification accuracy, with the speech length (utterance time) set to 5, 10, 20, and 30 seconds. The results shown in FIG. 12 and FIG. 13 were obtained.

As these results show, verification accuracy improves with speech length both when the noise-robust processing based on frame selection is performed (FIG. 12) and when it is not (FIG. 13), but with the processing the same verification accuracy is reached in a shorter time than without it. Performing this processing therefore enables highly reliable speaker authentication in a short time.

According to the above configuration, speaker registration and speaker verification use only the LSPs of frames that are both sound-present and voiced, which reduces the influence of environmental noise during registration and verification. A voice verification apparatus with high noise robustness that can perform highly reliable speaker authentication is thus realized.

(2) Second Embodiment
(2-1) Principle
The voice authentication apparatus 1 according to the first embodiment, described above with reference to FIG. 1, places importance on the individuality that appears in the average LSP. Its weighting processing unit 6 performs weighting in which the average LSP multiplied by a predetermined weighting coefficient $w$ is added to the LSP of each sound-present and voiced frame output from the frame selection processing unit 3.
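The weighting of the first embodiment can be sketched as follows. For illustration the average LSP is taken as the mean of the frames passed in, which is an assumption about how the average is accumulated, and the function name `weight_lsps` is hypothetical.

```python
def weight_lsps(lsps, w):
    """Add w times the average LSP vector to every LSP vector."""
    count = len(lsps)
    mean = [sum(column) / count for column in zip(*lsps)]
    return [[x + w * m for x, m in zip(lsp, mean)] for lsp in lsps]
```

Because the same vector is added to every frame, the step shifts all frames toward the speaker's long-term average without changing their spread.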

With this method, however, the average LSP takes considerable time to stabilize. In addition, because the average LSP also contains noise components, it is easily affected by additive noise.

The quantization error calculated by the vector quantization unit 10 of the verification unit 9, described above with reference to FIG. 1, represents the distance between each LSP derived from the verification speech and the quantized representative vectors, which are parameters specific to the registered speaker derived from the registration speech. The smaller this value, the more likely it is that the speaker being verified is the registered speaker.

It is therefore expected that practically sufficient verification accuracy can be obtained even when the decision on whether the speaker being verified is the registered speaker is based on the average of the minimum quantization errors, that is, the minimum values of these quantization errors, output from the vector quantization unit 10.

With this method, the value is expected to be consistently small when the speaker being verified is the registered speaker, so the average of the minimum quantization errors should not take long to stabilize. Moreover, unlike the weighting in the weighting processing unit 6, no average of the additive noise is added, so the influence of additive noise is also expected to be small.

The second embodiment therefore proposes a speaker authentication algorithm based on the average of the minimum quantization errors. A voice authentication apparatus according to the second embodiment, to which this algorithm is applied, is described below.

(2-2) Configuration of the Voice Authentication Apparatus According to the Second Embodiment
FIG. 14, in which parts corresponding to those in FIG. 1 are given the same reference numerals, shows a voice authentication apparatus 40 according to the second embodiment. It is configured in the same way as the voice authentication apparatus 1 according to the first embodiment (FIG. 1), except that the weighting processing unit 6 (FIG. 1) is omitted and an average distance calculation unit 42 is provided in the verification unit 41 instead.

In the voice authentication apparatus 40, the LSP of each sound-present and voiced frame output from the frame selection processing unit 3 is supplied to the clustering unit 7. The clustering unit 7 performs clustering on these LSPs with the LBG+splitting algorithm, which combines the LBG algorithm and the splitting algorithm described above, to obtain N (for example, 16) quantized representative vectors specific to the speaker, each 10-dimensional like the LSPs. In registration mode, it stores these in the storage unit 8 as the speaker's feature codebook CB (FIG. 6).

In verification mode, the LSP of each sound-present and voiced frame, obtained in the same way as in registration mode, is supplied from the frame selection processing unit 3 to the vector quantization unit 10 of the verification unit 41. The vector quantization unit 10 is also supplied with the feature codebook CB of the target registered speaker from among the feature codebooks CB stored in the storage unit 8.

As in the first embodiment, the vector quantization unit 10 calculates in turn, for each LSP supplied sequentially from the frame selection processing unit 3, the distance to each of the 16 quantized representative vectors making up the feature codebook CB, and sends the minimum quantization error among the 16 distances to the average distance calculation unit 42 as the quantization error detection signal S4.

The average distance calculation unit 42 calculates the average distance between the quantized representative vectors of the feature codebook CB and the LSPs derived from the verification speech, on the basis of the minimum quantization error of each sound-present and voiced frame obtained from the supplied quantization error detection signal S4.

In practice, the average distance calculation unit 42 calculates the average of the per-frame minimum quantization errors of the sound-present and voiced frames, obtained from the quantization error detection signal S4 supplied by the vector quantization unit 10. It computes this average by a moving average method, averaging the minimum quantization errors of a predetermined number of past frames, and sends the result to the determination unit 43 as the quantization error average value signal S20.
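A moving average over the most recent frames can be kept with a bounded queue. In this sketch the class name `MovingAverage` and the parameter `window` (the predetermined number of past frames) are hypothetical, and the average is taken over however many values are available until the window fills.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` values."""

    def __init__(self, window):
        # A deque with maxlen discards the oldest value automatically.
        self.values = deque(maxlen=window)

    def update(self, value):
        """Add one new minimum quantization error and return the current average."""
        self.values.append(value)
        return sum(self.values) / len(self.values)
```

Each returned value would then be compared with the decision threshold, so a decision is available as soon as enough frames have arrived.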

As in the first embodiment, the determination unit 43 compares the average of the minimum quantization errors for the frame concerned, obtained from the quantization error average value signal S20, with a fifth threshold predetermined for this average. When the average is smaller than the fifth threshold, it determines that the speaker is the registered person; when the average is larger than the fifth threshold, it determines that the speaker is someone else. It outputs the result as the determination signal S21.

(2-3) Operation and Effects of the Present Embodiment
In the above configuration, the voice authentication apparatus 40 performs speaker verification on the basis of the average of the minimum quantization errors of the sound-present and voiced frames.

Compared with the voice authentication apparatus 1 according to the first embodiment (FIG. 1), which performs weighting using the average LSP, the voice authentication apparatus 40 can therefore perform speaker verification of the same accuracy in a markedly shorter time, enabling speaker authentication in nearly real time.

In a simulation, the voice authentication apparatus 1 according to the first embodiment required about 15 seconds to reach an authentication accuracy corresponding to an error rate of about 30%, whereas the voice authentication apparatus 40 was confirmed to reach the same accuracy in about 1 to 2 seconds.

The voice authentication apparatus 40 is also less affected by environmental noise than the voice authentication apparatus 1 according to the first embodiment, which performs weighting using the average LSP, and its speaker authentication is correspondingly more reliable.

Furthermore, because the voice authentication apparatus 40 calculates the average of the per-frame minimum quantization errors of the sound-present and voiced frames by a moving average over a predetermined number of past frames, the average is always computed from the most recent information only, which makes speaker authentication correspondingly more reliable.

According to the above configuration, speaker verification is based on the average of the minimum quantization errors of the sound-present and voiced frames, so a voice authentication apparatus that performs more reliable speaker authentication in a shorter time is realized.

(3) Other Embodiments
In the first and second embodiments described above, pitch information obtained by CS-ACELP encoding is used to detect voiced sound sections in the speech signal S1. The present invention is not limited to this, and a wide variety of methods suited to the encoding scheme or other features of the voice authentication apparatus can be used to detect voiced sound sections in the speech signal S1.

Likewise, in the first and second embodiments the VAD algorithm is used to detect sound-present sections in the speech signal S1, but the present invention is not limited to this, and a wide variety of other methods can be used.

However, when pitch information obtained by CS-ACELP encoding is used to detect voiced sound sections and the VAD algorithm is used to detect sound-present sections, as in the first and second embodiments, sections that are both sound-present and voiced can be detected from parameters already available in existing mobile phones. This has the advantage that the voice authentication function of the present invention can be added to an existing mobile phone with only minor changes.

Furthermore, in the second embodiment described above, only the minimum quantization error among the 16 distances calculated by the vector quantization unit 10 is used to calculate the average distance between the quantized representative vectors of the feature codebook CB and the LSPs derived from the verification speech. The present invention is not limited to this, and the average may instead be calculated using all 16 distances calculated by the vector quantization unit 10.

The present invention can be applied widely, not only to voice authentication apparatuses that perform speaker authentication based on parameters obtained by CELP encoding, but also to voice authentication apparatuses that perform speaker authentication based on various other kinds of speech information.

FIG. 1 is a block diagram showing the configuration of the voice authentication apparatus according to the first embodiment.
FIG. 2 is a flowchart showing the voiced/unvoiced sound section determination procedure.
FIG. 3 is a flowchart showing the sound/silence determination procedure.
FIG. 4 is a chart explaining how the frame selection processing unit determines voiced/unvoiced sound sections from the voiced/unvoiced sound signal.
FIG. 5 is a conceptual diagram explaining the weighting performed by the weighting processing unit.
FIG. 6 is a conceptual diagram explaining the registration process in the voice authentication apparatus.
FIG. 7 is a flowchart showing the clustering procedure.
FIG. 8 is a conceptual diagram explaining the verification process in the voice authentication apparatus.
FIG. 9 is a block diagram showing the configuration of the CELP encoding unit.
FIG. 10 is a chart showing EER measurements under various noise environments with the noise-robust processing of the present embodiment.
FIG. 11 is a chart showing EER measurements under various noise environments without the noise-robust processing of the present embodiment.
FIG. 12 is a chart showing the relationship between speech length and error rate with the noise-robust processing of the present embodiment.
FIG. 13 is a chart showing the relationship between speech length and error rate without the noise-robust processing of the present embodiment.
FIG. 14 is a block diagram showing the configuration of the voice authentication apparatus according to the second embodiment.
FIG. 15 is a chart showing EER measurements under various noise environments.
FIG. 16 is a characteristic curve diagram explaining FRR, FAR, and EER.
FIG. 17 is a chart showing the specifications of the various noise data.

Explanation of symbols

1, 40 ... voice authentication apparatus, 2 ... CELP encoding unit, 3 ... frame selection unit, 4 ... voiced/unvoiced sound determination unit, 5 ... sound/silence determination unit, 6 ... weighting processing unit, 7 ... clustering unit, 8 ... storage unit, 9, 41 ... collation unit, 10 ... vector quantization unit, 11, 43 ... determination unit, 42 ... average distance calculation unit, S1 ... voice signal, S2 ... voiced/unvoiced sound determination signal, S3 ... sound/silence determination signal, S4 ... quantization error detection signal, S5, S21 ... determination signal, S20 ... quantization error average value signal.

Claims

A voice authentication apparatus comprising:
feature parameter generating means for sequentially generating and outputting predetermined feature parameters representing the individuality of a speaker based on an input voice signal;
voiced section detecting means for detecting a voiced section of the voice signal;
voiced sound section detecting means for detecting a voiced sound section of the voice signal;
feature parameter extraction means for extracting, based on the detection results of the voiced section detecting means and the voiced sound section detecting means, the feature parameters of the voiced section and the voiced sound section of the voice signal from the feature parameters sequentially output from the feature parameter generating means; and
speaker authentication means for performing speaker authentication based on the feature parameters extracted by the feature parameter extraction means.
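
As an illustration of the feature parameter extraction recited above, the following minimal Python sketch keeps only the per-frame feature vectors whose frames were flagged by both detectors. The function name `select_frames`, the representation of the detection results as one boolean flag per frame, and the sample values are assumptions made for illustration; none of them appear in the specification.

```python
from typing import List, Sequence


def select_frames(
    features: Sequence[Sequence[float]],
    in_voiced_section: Sequence[bool],
    in_voiced_sound_section: Sequence[bool],
) -> List[List[float]]:
    """Return the feature vectors of frames flagged by both detectors.

    features                 -- one feature vector per frame (hypothetical layout)
    in_voiced_section        -- per-frame result of the voiced section detector
    in_voiced_sound_section  -- per-frame result of the voiced sound section detector
    """
    if not (len(features) == len(in_voiced_section) == len(in_voiced_sound_section)):
        raise ValueError("each detector must supply exactly one flag per frame")
    return [
        list(vector)
        for vector, sound_present, voiced_sound in zip(
            features, in_voiced_section, in_voiced_sound_section
        )
        if sound_present and voiced_sound
    ]


# Illustrative data only: four frames with two-dimensional features.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sound_flags = [True, True, False, True]
voiced_sound_flags = [True, False, False, True]
selected = select_frames(frames, sound_flags, voiced_sound_flags)
```

Only the first and last frames pass both checks here, so only their feature vectors would be handed on to speaker authentication.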

The voice authentication apparatus according to claim 1, wherein:
the feature parameter generation means encodes the voice signal by a CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear Prediction) encoding method to generate an LSP (Line Spectrum Pair) as the feature parameter in units of frames;
the voiced sound section detecting means detects the voiced sound section of the voice signal based on pitch information obtained when the feature parameter generation means encodes the voice signal; and
the voiced section detecting means detects the voiced section of the voice signal using a VAD (Voice Activity Decision) algorithm.

The voice authentication apparatus according to claim 1, wherein the speaker authentication means comprises:
feature codebook generation means for generating, in the registration mode, a feature codebook composed of a predetermined number of representative vectors unique to the speaker by performing a predetermined clustering process on the extracted feature parameters supplied from the feature parameter extraction means;
storage means for storing and holding the feature codebook generated by the feature codebook generation means;
distance calculating means for calculating, in the collation mode, the distance between the extracted feature parameter supplied from the feature parameter extraction means and each representative vector constituting the feature codebook read from the storage means;
average distance calculating means for calculating an average value of the distances between the feature parameter and the representative vectors calculated by the distance calculating means; and
determination means for performing speaker collation determination based on the calculation result of the average distance calculating means.
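
The registration and collation flow recited above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: plain k-means stands in for the unspecified "predetermined clustering process", the distance is taken as the Euclidean distance to the nearest representative vector (a quantization error), and acceptance uses a fixed threshold. All function names, the threshold, and the sample data are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_codebook(features: Sequence[Vector], size: int, iterations: int = 20) -> List[List[float]]:
    """Registration mode: cluster the features into `size` representative vectors.

    Assumption: plain k-means with evenly spaced initial centroids.
    """
    codebook = [list(features[i * len(features) // size]) for i in range(size)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(size)]
        for vector in features:
            nearest = min(range(size), key=lambda k: squared_distance(vector, codebook[k]))
            clusters[nearest].append(vector)
        for k, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                codebook[k] = [sum(column) / len(members) for column in zip(*members)]
    return codebook


def quantization_error(vector: Vector, codebook: Sequence[Vector]) -> float:
    """Distance from a feature vector to its nearest representative vector."""
    return math.sqrt(min(squared_distance(vector, rep) for rep in codebook))


def average_distance(features: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    return sum(quantization_error(v, codebook) for v in features) / len(features)


def verify(features: Sequence[Vector], codebook: Sequence[Vector], threshold: float) -> bool:
    """Collation mode: accept when the average distance is within the threshold (assumed rule)."""
    return average_distance(features, codebook) <= threshold


# Illustrative data only: enrollment features forming two tight groups.
enrolled = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
codebook = build_codebook(enrolled, size=2)
genuine_ok = verify([[0.05, 0.05], [1.05, 1.05]], codebook, threshold=0.2)
impostor_ok = verify([[5.0, 5.0], [6.0, 6.0]], codebook, threshold=0.2)
```

In this toy data the claimant whose features lie near the enrolled groups is accepted and the distant one is rejected; in practice the threshold would have to be tuned against measured error rates.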

The voice authentication apparatus, wherein
the average distance calculation means calculates the average value of the distance between the feature parameter and the representative vector calculated by the distance calculation means by a moving average method using a predetermined number of past distances.
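
A moving average over a predetermined number of past distances could be sketched as below. The class name `MovingAverageDistance`, the window length of 3, and the sample distances are assumptions for illustration only; the claim does not fix the window size or the data structure.

```python
from collections import deque


class MovingAverageDistance:
    """Average of the most recent `window` distances (simple moving average)."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be a positive number of distances")
        # A bounded deque discards the oldest distance automatically.
        self._recent = deque(maxlen=window)

    def update(self, distance: float) -> float:
        """Add the newest per-frame distance and return the current average."""
        self._recent.append(distance)
        return sum(self._recent) / len(self._recent)


# Illustrative distances only.
tracker = MovingAverageDistance(window=3)
averages = [tracker.update(d) for d in [0.5, 1.0, 1.5, 2.0]]
```

Until the window fills, the average covers only the distances seen so far; from the fourth value onward the oldest distance drops out, so the result follows recent frames rather than the whole utterance.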

A voice authentication method comprising:
a first step of sequentially generating and outputting predetermined feature parameters representing the individuality of a speaker based on an input voice signal, and detecting a voiced section and a voiced sound section of the voice signal;
a second step of extracting, from the generated feature parameters, the feature parameters of the voiced section and the voiced sound section of the voice signal; and
a third step of performing speaker authentication based on the extracted feature parameters.

In the first step,
the voice signal is encoded by a CS-ACELP (Conjugate Structure-Algebraic Code Excited Linear Prediction) encoding method to generate an LSP (Line Spectrum Pair) as the feature parameter in units of frames, the voiced sound section of the voice signal is detected based on pitch information obtained by the encoding, and the voiced section of the voice signal is detected using a VAD (Voice Activity Decision) algorithm. The voice authentication method described.

The voice authentication method according to claim 5, wherein the third step comprises:
a feature codebook generation step of generating, in the registration mode, a feature codebook composed of a predetermined number of representative vectors unique to the speaker by performing a predetermined clustering process on the feature parameters extracted in the second step;
a storing step of storing and holding the generated feature codebook;
a distance calculating step of calculating, in the collation mode, the distance between the extracted feature parameter of the voiced section and the voiced sound section of the voice signal and each representative vector constituting the stored feature codebook;
an average distance calculating step of calculating an average value of the distances between the feature parameter and the representative vectors calculated in the distance calculating step; and
a determination step of performing speaker collation determination based on the calculation result of the average distance calculating step.

The voice authentication method according to claim 7, wherein, in the average distance calculation step,
the average value of the distance between the feature parameter and the representative vector calculated in the distance calculation step is calculated by a moving average method using a predetermined number of past distances.

A program for causing a computer to execute a process comprising:
a first step of sequentially generating and outputting predetermined feature parameters representing the individuality of a speaker based on an input voice signal, and detecting a voiced section and a voiced sound section of the voice signal;
a second step of extracting, from the generated feature parameters, the feature parameters of the voiced section and the voiced sound section of the voice signal; and
a third step of performing speaker authentication based on the extracted feature parameters.