JP2005148640A - Device, method and program of voice recognition

Info

Publication number: JP2005148640A
Application number: JP2003389665A
Authority: JP
Inventors: Kiyoshi Honda; 清志本多; Tatsuya Kitamura; 達也北村; Satoru Fujita; 覚藤田; Hironori Takemoto; 浩典竹本
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-11-19
Filing date: 2003-11-19
Publication date: 2005-06-09
Anticipated expiration: 2023-11-19
Also published as: JP4049732B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that can recognize the speaker of an input voice by using, in combination, the correspondences between a plurality of voice spectral features and the parts of the vocal tract that give rise to them.
SOLUTION: A feature extraction part 202 determines the shape parameters of a vocal tract model based on the voice input from a recognition object. The vocal tract model consists of first and second acoustic tubes corresponding to the oral cavity and the pharyngeal cavity respectively, a connected small acoustic tube corresponding to the laryngeal cavity, and at least one conical tube corresponding to the piriform fossa. At the time of recognition, the feature extraction part 202 determines the shape parameters of the vocal tract model as recognition shape parameters based on the voice input from the speaker. A threshold comparison part 222 recognizes whether or not the speaker is a registered recognition object, based on the result of the comparison between the recognition shape parameters and the registered shape parameters made by a similarity calculation part 220.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a voice authentication device, a voice authentication method, and a voice authentication program for authenticating individuals by means of speaker recognition, which automatically determines the speaker from individual differences in voice.

Situations that call for “personal authentication technology” are increasing: restricting outsiders' entry to important facilities and rooms, access management to prevent attacks on a system from outside and unauthorized access from inside, and preventing fraud such as so-called “spoofing” in electronic commerce.

Conventionally, such personal authentication has used schemes based on a combination of a “user ID” and a “password”, or on a “secret key” in a public-key cryptosystem.

Furthermore, to make personal authentication more reliable, so-called “biometric” authentication techniques, which use a person's physical or behavioral characteristics such as fingerprints and irises, are also widely used.

Meanwhile, expectations are also rising for voice-based personal authentication, which is one kind of “biometrics”. With recent advances in speech processing technology, if the voice of the person to be authenticated, a feature that existing communication systems can carry as they are, can be used for personal authentication, communication-related systems are expected to be easy to realize (see, for example, Non-Patent Document 1).

However, although voice-based personal authentication (hereinafter “voice authentication”) has the advantages above, conventional methods tie it less closely to an individual's physical characteristics than fingerprints or irises, and its accuracy must be improved further before it can be used as a personal authentication technique.

There has been an attempt to model the lower structure of the vocal tract from its correspondence with three-dimensional MRI video data (see, for example, Non-Patent Document 2), but how voice authentication should be carried out has not necessarily been clear.
Non-Patent Document 1: Furui, “Authentication by Voice, Part 1: Mechanism and Technology Trends of Voice Authentication”, Information Processing, Vol. 40, No. 11, November 1999
Non-Patent Document 2: Takemoto, Honda, Masaki, Shimada, Fujimoto, “Modeling of the Lower Vocal Tract Structure Based on 3D MRI Video Data”, Proceedings of the Acoustical Society of Japan, pp. 281-282, September 2003

The present invention has been made to solve the problems described above. Its object is to provide a voice authentication device, a voice authentication method, and a voice authentication program that can identify the speaker of an input voice by using, in combination, the correspondences between a plurality of features of the voice spectrum and the parts of the vocal tract that give rise to them.

To achieve this object, according to one aspect of the present invention, a voice authentication device comprises feature extraction means for determining shape parameters of a vocal tract model based on voice input from a person to be authenticated. The vocal tract model includes a first acoustic tube portion corresponding to the oral cavity; a second acoustic tube portion connected to the first acoustic tube portion and corresponding to the pharyngeal cavity; a connected small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity; and at least one conical tube connected to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa. The device further comprises storage means for storing, at the time of learning, the shape parameters determined by the feature extraction means as registered shape parameters in association with the person to be authenticated. At the time of authentication, the feature extraction means determines the shape parameters of the vocal tract model as authentication shape parameters based on voice input from a speaker. The device further comprises similarity comparison means for comparing the authentication shape parameters with the registered shape parameters in order to determine whether or not the speaker is a registered person to be authenticated.

Preferably, the feature extraction means includes initial value determination means for determining initial values of the shape parameters based on the voice input, and correction means for correcting the shape parameters so as to minimize the difference between the transfer function of the vocal tract model based on the initial values and the input spectrum of the voice input.
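The "initial value, then correction" scheme described above can be illustrated with a minimal sketch. It is not the patent's procedure: the search method (a coordinate pattern search), the function names `refine_parameters` and `toy_model`, and the toy model itself (a uniform tube whose only parameter is its length, with a sound speed of 350 m/s) are all assumptions chosen to keep the example self-contained. A real system would compare a full vocal tract transfer function with the input spectrum.

```python
from typing import Callable, List, Sequence, Tuple


def refine_parameters(
    initial: Sequence[float],
    model: Callable[[Sequence[float]], List[float]],
    target: Sequence[float],
    step: float = 0.01,
    tol: float = 1e-6,
    max_iter: int = 10000,
) -> Tuple[List[float], float]:
    """Adjust parameters to reduce the squared error between model and target.

    Hypothetical helper: a simple pattern search that tries +/- step on each
    parameter and halves the step when no trial improves the error.
    """
    params = list(initial)

    def error(p: Sequence[float]) -> float:
        return sum((m - t) ** 2 for m, t in zip(model(p), target))

    best = error(params)
    iterations = 0
    while step > tol and iterations < max_iter:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                e = error(trial)
                if e < best:
                    params, best, improved = trial, e, True
        if not improved:
            step /= 2  # refine the search once the current step stops helping
        iterations += 1
    return params, best


def toy_model(p: Sequence[float]) -> List[float]:
    """Stand-in for a transfer function: first three resonances (Hz) of a
    uniform tube of length p[0] metres, closed at one end (assumption)."""
    sound_speed = 350.0  # m/s, assumed
    return [(2 * n - 1) * sound_speed / (4 * p[0]) for n in (1, 2, 3)]


target = toy_model([0.17])  # pretend this came from the input voice
fitted, err = refine_parameters([0.15], toy_model, target)
```

Starting from the initial length 0.15 m, the search moves toward the length that generated the target, which mirrors the roles of the initial value determination means and the correction means.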

Preferably, the first acoustic tube portion includes a plurality of first acoustic tubes connected to each other, and the second acoustic tube portion includes a plurality of second acoustic tubes connected to each other.

According to another aspect of the present invention, a voice authentication method comprises a step of determining, at the time of learning, shape parameters of a vocal tract model based on voice input from a person to be authenticated. The vocal tract model includes a first acoustic tube portion corresponding to the oral cavity; a second acoustic tube portion connected to the first acoustic tube portion and corresponding to the pharyngeal cavity; a connected small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity; and at least one conical tube connected to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa. The method further comprises the steps of: storing, in a storage device, the shape parameters determined at the time of learning as registered shape parameters in association with the person to be authenticated; determining, at the time of authentication, the shape parameters of the vocal tract model as authentication shape parameters based on voice input from a speaker; and determining whether or not the speaker is a registered person to be authenticated, based on the result of comparing the authentication shape parameters with the registered shape parameters.
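The learning and authentication steps above amount to an enroll-then-verify flow, which can be sketched as follows. This is an illustration only: the class name `ShapeParameterVerifier`, the use of Euclidean distance as the similarity measure, and the threshold value are assumptions, since the text does not specify how the comparison is computed.

```python
import math
from typing import Dict, List, Sequence


class ShapeParameterVerifier:
    """Hypothetical store of registered shape parameters with a threshold test."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # maximum accepted distance (assumed measure)
        self.registered: Dict[str, List[float]] = {}

    def enroll(self, person_id: str, shape_params: Sequence[float]) -> None:
        """Learning time: store the parameters for this person."""
        self.registered[person_id] = list(shape_params)

    def verify(self, claimed_id: str, shape_params: Sequence[float]) -> bool:
        """Authentication time: compare against the registered parameters."""
        reference = self.registered.get(claimed_id)
        if reference is None or len(reference) != len(shape_params):
            return False
        distance = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(reference, shape_params))
        )
        return distance <= self.threshold


verifier = ShapeParameterVerifier(threshold=0.5)
verifier.enroll("alice", [17.0, 3.2, 1.1])  # illustrative values, not measured data
accepted = verifier.verify("alice", [17.1, 3.1, 1.1])
rejected = verifier.verify("alice", [15.0, 4.0, 2.0])
```

In practice the parameters would need scaling before a distance is taken, because lengths, areas, and volumes have different units and ranges.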

Preferably, the step of determining the shape parameters of the vocal tract model includes a step of determining initial values of the shape parameters based on the voice input, and a step of correcting the shape parameters so as to minimize the difference between the transfer function of the vocal tract model based on the initial values and the input spectrum of the voice input.

According to still another aspect of the present invention, a voice authentication program causes a computer to execute voice authentication processing. The voice authentication processing comprises a step of determining, at the time of learning, shape parameters of a vocal tract model based on voice input from a person to be authenticated. The vocal tract model includes a first acoustic tube portion corresponding to the oral cavity; a second acoustic tube portion connected to the first acoustic tube portion and corresponding to the pharyngeal cavity; a connected small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity; and at least one conical tube connected to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa. The processing further comprises the steps of: storing the shape parameters determined at the time of learning as registered shape parameters in association with the person to be authenticated; determining, at the time of authentication, the shape parameters of the vocal tract model as authentication shape parameters based on voice input from a speaker; and determining whether or not the speaker is a registered person to be authenticated, based on the result of comparing the authentication shape parameters with the registered shape parameters.

The voice authentication device, voice authentication method, and voice authentication program according to the present invention tie voice authentication more closely to an individual's physical characteristics, and can thereby improve the accuracy of voice authentication.

Embodiments of the present invention will be described below with reference to the drawings.
[Hardware configuration]
FIG. 1 is a conceptual diagram showing an example of a voice authentication system 1000 that uses a voice authentication device to which the voice authentication method and voice authentication program of the present invention are applied.

Referring to FIG. 1, the voice authentication system 1000 includes a computer 100 that determines, based on an utterance by a person to be authenticated 2, whether or not to grant that person access.

That is, the following description takes as an example the case where the voice authentication method of the present invention is applied to the management of access rights.

Referring to FIG. 1, the computer 100 includes: a computer main body 102 provided with a CD-ROM drive 108 for reading information from a CD-ROM (Compact Disc Read-Only Memory) and an FD drive 106 for reading and writing information on a flexible disk (hereinafter, FD) 116; a display 104 connected to the computer main body 102 as a display device; a keyboard 110 and a mouse 112 likewise connected to the computer main body 102 as input devices; a microphone 132 as a voice input device; and a speaker 134 as an audio output device.

When the voice authentication method of the present invention is applied to entry management or the like, the computer 100 operates as part of an entry management system and, when the person is authenticated, performs processing such as unlocking a gate. When the voice authentication method of the present invention is applied to electronic commerce or the like, the voice input from the microphone 132 is converted into a format suitable for communication and then transmitted over a network 310 to the other party's computer system 300. The computer system 300 then performs the voice authentication processing described below and authenticates the person to be authenticated 2.

FIG. 2 is a block diagram showing the hardware configuration of the computer 100.

As shown in FIG. 2, the computer main body 102 of the computer 100 includes, in addition to the CD-ROM drive 108 and the FD drive 106, the following components, each connected to a bus BS: a CPU (Central Processing Unit) 120; a memory 122 including ROM (Read Only Memory) and RAM (Random Access Memory); a direct access memory device such as a hard disk 124; and an interface 128 for exchanging data with the microphone 132 or the speaker 134. A CD-ROM 118 is loaded into the CD-ROM drive 108, and the FD 116 is loaded into the FD drive 106.

The interface 128 can also be used, for example, to communicate with the other party's computer system 300.

As described later, the database that stores the information on which the voice authentication program of the present invention operates is assumed, in this description, to be stored on the hard disk 124.

The CD-ROM 118 may be replaced by any other medium capable of recording information such as a program to be installed on the computer main body, for example a DVD-ROM (Digital Versatile Disc) or a memory card. In that case, the computer main body 102 is provided with a drive device capable of reading that medium.

The main part of the voice authentication device of the present invention consists of computer hardware and software executed by the CPU 120. Such software is generally distributed on a storage medium such as the CD-ROM 118 or the FD 116, read from the medium by the CD-ROM drive 108, the FD drive 106, or the like, and stored on the hard disk 124. Alternatively, when the device is connected to the network 310, the software is copied from a server on the network to the hard disk 124. It is then read from the hard disk 124 into the RAM in the memory 122 and executed by the CPU 120. When the device is connected to a network, the software may instead be loaded directly into the RAM and executed without being stored on the hard disk 124.

The computer hardware shown in FIGS. 1 and 2 and its operating principle are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 116, the CD-ROM 118, or the hard disk 124.

As a general trend, it is common to provide various program modules as part of a computer's operating system and to have application programs call these modules in a predetermined sequence as needed. In that case, the software that implements the voice authentication device does not itself include those modules, and the voice authentication device is realized only when that software works together with the operating system on the computer. However, as long as a common platform is used, there is no need to distribute software that includes such modules; the software without those modules and a recording medium on which it is recorded (and the data signal when the software is distributed over a network) can be regarded as constituting the embodiment.

[Voice authentication based on the factors that generate individuality]
FIG. 3 is a diagram showing the correspondence between features of the voice spectrum and parts of the vocal tract.

As described below, the voice authentication device and voice authentication method of the present invention make it possible to identify the speaker of an input voice by using, in combination, the correspondences between a plurality of features of the voice spectrum and the parts of the vocal tract that give rise to them.

In particular, the present invention performs voice authentication by extracting the factors behind a speaker's individuality from the input voice (vowels). The voice authentication of the present invention can be applied to any of the following: “text-dependent” authentication, in which the utterance content (keyword) used for authentication is fixed in advance; “text-independent” authentication, in which any words may be uttered; and “text-prompted” authentication, in which the device specifies a new keyword to the person to be authenticated each time it is used.

FIG. 4 is a conceptual diagram showing a midsagittal section of the speech production system.

Referring to FIGS. 3 and 4, the distribution pattern of the poles (local maxima) of the voice spectrum corresponds to the “vocal tract length”. The “vocal tract” is the space that runs from the glottis through the pharyngeal cavity and the oral cavity to the lips.

The lower formants correspond to the “relationship among the lengths, cross-sectional areas, and volumes of the pharyngeal cavity and the oral cavity”. The fourth formant, which is the fourth local maximum counted from the low-frequency side, corresponds to the “shape of the laryngeal cavity”.

Furthermore, as described in detail later, the number, frequencies, and bandwidths of the zeros (local minima) in the high-frequency band, and their energy differences relative to the neighboring poles, correspond to the “shape of the piriform fossae in the lower vocal tract”.

The shape of the vocal tract is the main factor that determines each person's voice quality, that is, the individuality of the voice. In other words, the main source of vocal individuality is individual differences in vocal tract shape.

The parts of the vocal tract shown in FIG. 3 are described in more detail below.

(Vocal tract length)
“Vocal tract length” is the length from the glottis to the lips. It correlates strongly with age, sex, and body size. The longer an acoustic tube is, the lower its resonance frequencies are, so the vocal tract length corresponds to the distribution pattern of the poles of the voice spectrum. The speaker's vocal tract length can therefore be estimated from the voice.
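The inverse relation between tube length and resonance frequency can be made concrete with the textbook case of a lossless uniform tube closed at one end (the glottis) and open at the other (the lips), whose resonances lie at F_n = (2n - 1)c / (4L). The sketch below is an idealization, not the vocal tract model of the invention; the function names and the sound speed of 350 m/s are assumptions.

```python
from typing import List


def uniform_tube_resonances(
    length_m: float, count: int = 4, sound_speed: float = 350.0
) -> List[float]:
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * sound_speed / (4 * length_m) for n in range(1, count + 1)]


def length_from_first_resonance(f1_hz: float, sound_speed: float = 350.0) -> float:
    """Invert F1 = c / (4L) to estimate the tube length in metres."""
    return sound_speed / (4 * f1_hz)


long_tube = uniform_tube_resonances(0.175)   # about 500, 1500, 2500, 3500 Hz
short_tube = uniform_tube_resonances(0.140)  # every resonance is higher
estimated_length = length_from_first_resonance(long_tube[0])
```

Shortening the tube from 17.5 cm to 14 cm raises every resonance by the same ratio, which is why the overall spacing of the spectral poles carries information about length.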

(Relationship among the lengths, cross-sectional areas, and volumes of the pharyngeal cavity and the oral cavity)
The relationship among the lengths, cross-sectional areas, and volumes of the pharyngeal cavity and the oral cavity determines the lower formants.

FIG. 5 shows the theoretical relationship between changes in the cross-sectional areas of the pharyngeal and oral cavities and the lower formants in a two-section model of the vocal tract.

FIG. 5(a) shows, for a two-section model in which the vocal tract is represented by an acoustic tube made up of two sections, the case where the pharyngeal cavity has a larger cross-sectional area than the oral cavity and the case where it has a smaller one. FIG. 5(b) shows the change in formant frequencies for each of the two cases of FIG. 5(a).

First, when the cross-sectional area of the pharyngeal cavity increases, as in the upper part of FIG. 5(a), the first formant (F1) falls and the second formant (F2) rises, as shown in the upper part of FIG. 5(b).

In contrast, when the cross-sectional area of the oral cavity increases, as in the lower part of FIG. 5(a), the first formant (F1) rises and the second formant (F2) falls.
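The direction of these formant shifts can be reproduced with a small calculation. For a lossless two-section tube closed at the glottis and open at the lips, the resonance condition is tan(k·l_p)·tan(k·l_m) = A_m / A_p, where k = 2πf/c and the subscripts p and m denote the pharyngeal and oral sections. The sketch below additionally assumes that the two sections have equal length, which gives a closed form; the function name, the 17.5 cm total length, and the 350 m/s sound speed are illustrative assumptions, not values from the source.

```python
import math
from typing import Tuple


def two_tube_formants(
    pharynx_area: float,
    mouth_area: float,
    total_length_m: float = 0.175,
    sound_speed: float = 350.0,
) -> Tuple[float, float]:
    """F1 and F2 (Hz) of a two-section tube with equal section lengths.

    With equal lengths, tan^2(pi * f * L / c) = mouth_area / pharynx_area,
    so the first two roots follow directly from the arctangent.
    """
    theta = math.atan(math.sqrt(mouth_area / pharynx_area))
    scale = sound_speed / (math.pi * total_length_m)
    return scale * theta, scale * (math.pi - theta)


uniform = two_tube_formants(3.0, 3.0)       # about (500, 1500) Hz
wide_pharynx = two_tube_formants(6.0, 3.0)  # F1 falls, F2 rises
wide_mouth = two_tube_formants(3.0, 6.0)    # F1 rises, F2 falls
```

Only the area ratio matters in this idealization, so the units of the two areas are irrelevant as long as they match.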

The relationship shown in FIG. 5 is also observed between measurements based on magnetic resonance imaging (MRI) and recorded speech: the first formant frequency correlates with the cross-sectional area of the pharyngeal cavity, and the second formant frequency with that of the oral cavity.

FIG. 6 shows the correlation between the mean area of the pharyngeal cavity and the first formant frequency, using values measured by MRI. FIG. 6(a) shows, for a plurality of subjects, the relationship between the mean area of the pharyngeal cavity measured by MRI and the measured first formant frequency of the vowel /a/. FIG. 6(b) shows the same relationship for the vowel /e/.

A negative correlation is seen between the measured mean area of the pharyngeal cavity and the measured first formant frequency of the vowels.

FIG. 7 shows the correlation between the mean area of the oral cavity and the first formant frequency, using values measured by MRI. FIG. 7(a) shows, for a plurality of subjects, the relationship between the mean area of the oral cavity measured by MRI and the measured first formant frequency of the vowel /a/. FIG. 7(b) shows the same relationship for the vowel /e/.

A positive correlation is seen between the measured mean area of the oral cavity and the measured first formant frequency of the vowels.

Using these relationships, the approximate shapes of the pharyngeal cavity and the oral cavity can be estimated from the lower formants.

(Shape of the laryngeal cavity)
The “laryngeal cavity” is a narrow tube that forms part of the hypopharyngeal cavity.

FIG. 8 illustrates the shape of the laryngeal cavity. FIG. 8(a) shows an MRI image with the laryngeal cavity outlined in white. FIG. 8(b) shows, as a wireframe, the three-dimensional shape of the hypopharyngeal cavity obtained from MRI images; the laryngeal cavity portion is drawn with thick lines and darker shading. As FIG. 8(b) shows, the hypopharyngeal cavity contains the laryngeal cavity and, as a rule, the piriform fossae described later, located behind the laryngeal cavity on both sides.

FIG. 9 shows the three-dimensional shape of the hypopharyngeal cavity of three people. FIGS. 9(a1) to 9(a3) are front views of wireframes representing the three-dimensional shape of the hypopharyngeal cavity obtained from MRI images of three subjects, and FIGS. 9(b1) to 9(b3) are views of the same wireframes from the left.

As these figures show, the shape and size of the laryngeal cavity differ between individuals.

FIG. 10 shows, for the three people of FIG. 9, the cross-sectional shape of each part of the hypopharyngeal cavity while each vowel (/a/, /i/, /u/, /e/, /o/) is being uttered, with the distance from the glottis as a parameter. FIGS. 10(c1) to 10(c3) correspond to the individuals shown in FIGS. 9(a1) to 9(a3), respectively.

As FIG. 10 shows, for each individual the shape changes very little even when the vowel being uttered changes.

FIG. 11 shows the voice spectrum of the vowel /e/. In FIG. 11, the fourth formant is marked with an arrow.

The laryngeal cavity is acoustically independent within the vocal tract and acts as a Helmholtz resonator. Its shape and size determine the frequency, bandwidth, and energy of the fourth formant of the voice spectrum. In other words, individual differences in the morphology of the laryngeal cavity appear in the fourth formant.

FIG. 12 shows the fourth formant frequencies of speakers ア through コ.

As FIG. 12 shows, the fourth formant frequency differs between individuals. The laryngeal cavity can therefore be regarded as one factor that generates the individuality of a voice.

By finding the shape of the Helmholtz resonator that corresponds to the fourth formant, the shape of the speaker's laryngeal cavity can be obtained.
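As a rough illustration of how a resonator's geometry and its resonance frequency constrain each other, the sketch below uses the standard lumped formula for an ideal Helmholtz resonator, f = (c / 2π)·sqrt(A / (V·L)), with neck area A, neck length L, and cavity volume V. End corrections and losses are ignored, and the function names, the sound speed, and the numeric dimensions are assumptions for illustration, not measurements from the source.

```python
import math


def helmholtz_frequency(
    neck_area_m2: float,
    neck_length_m: float,
    cavity_volume_m3: float,
    sound_speed: float = 350.0,
) -> float:
    """Resonance frequency (Hz) of an ideal Helmholtz resonator."""
    return (sound_speed / (2 * math.pi)) * math.sqrt(
        neck_area_m2 / (cavity_volume_m3 * neck_length_m)
    )


def cavity_volume_for_frequency(
    frequency_hz: float,
    neck_area_m2: float,
    neck_length_m: float,
    sound_speed: float = 350.0,
) -> float:
    """Invert the formula: the cavity volume that yields a given resonance."""
    omega_over_c = 2 * math.pi * frequency_hz / sound_speed
    return neck_area_m2 / (neck_length_m * omega_over_c ** 2)


# Illustrative dimensions only (assumed).
f_res = helmholtz_frequency(0.5e-4, 0.01, 1.5e-6)
volume_back = cavity_volume_for_frequency(f_res, 0.5e-4, 0.01)
```

A single frequency fixes only one combination of the three dimensions, so the bandwidth and energy of the fourth formant, which the text also attributes to the laryngeal cavity, would be needed to narrow the shape further.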

FIG. 12 also lists, for each speaker, the frequency of the zero caused by the piriform fossae; this is described later.

（梨状窩の形状）
図１３は、下咽頭腔における梨状窩の位置を示す図である。図１３は、ＭＲＩ画像から得られた下咽頭腔の３次元形状を正面から見てワイヤフレームで示しており、梨状窩部分は、ワイヤフレームを太線で示すとともにグレースケールを濃くして示してある。 (Piriform shape)
FIG. 13 is a diagram showing the position of the piriform fossa in the hypopharyngeal cavity. FIG. 13 shows the three-dimensional shape of the hypopharyngeal cavity obtained from the MRI image as a wire frame when viewed from the front, and the piriform fossa portion is indicated by a thick line and a dark gray scale. is there.

梨状窩は下咽頭腔に、原則として左右１つずつ存在する分岐管である。前面から見ると梨状窩は、図１３のような形状をしているので、この形状は円錐で近似することができる。 The piriform fossa is a branch duct that exists in the hypopharyngeal cavity in principle, one on each side. When viewed from the front, the piriform fossa has a shape as shown in FIG. 13, and this shape can be approximated by a cone.

図１０に示したとおり、喉頭腔と同様に、梨状窩の形状、長さ、大きさには個人差があり、なおかつ発声する母音が変わってもその形状変化が極めて小さい。 As shown in FIG. 10, like the laryngeal cavity, the shape, length, and size of the piriform fossa vary among individuals, and even if the vowel to be uttered changes, the shape change is extremely small.

梨状窩は声道内の分岐管であるため、音声スペクトル上で零点（極小点）を発生させる。梨状窩の形状、長さ、大きさは音声スペクトルの高周波数帯域に現れる零点の数、周波数、バンド幅、その零点の周辺の極との相対的エネルギー差を決定する。 Since the piriform fossa is a branch pipe in the vocal tract, a zero point (minimum point) is generated on the voice spectrum. The shape, length, and size of the piriform fossa determine the number of zeros that appear in the high frequency band of the speech spectrum, the frequency, the bandwidth, and the relative energy difference from the poles around the zeros.

FIG. 14 shows the positions of the zeros caused by the piriform fossae in the speech spectrum of the vowel "e" (え).

In FIG. 14, arrows mark the zeros caused by the piriform fossae.

As noted above, FIG. 12 also shows the frequencies of the piriform fossa zeros for the ten speakers ア through コ.

FIG. 12 shows that the frequency of the piriform fossa zero differs between individuals. This frequency reflects individual differences in piriform fossa morphology. The piriform fossa is therefore also a factor that gives speech its individuality.

When the two piriform fossae differ in shape, length, or size, two zeros appear; when they are equal or nearly equal, only one zero appears. Most people have two piriform fossae, left and right, but some, like the subject shown in FIG. 9 (a3), have a piriform fossa on one side only. In that case, too, only one zero appears.

It follows that the shape, length, and size of a speaker's piriform fossae can be obtained from information about the zeros they produce.

Note that the frequency band in which the piriform fossae affect the speech spectrum lies above the band of fixed-line telephones (4 kHz and below). To apply this method to telephony, it is therefore necessary to target mobile phones or IP phones, which have a wider frequency band.
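The band-limit remark can be illustrated with a rough calculation. The sketch below uses the quarter-wavelength rule for a uniform, lossless side branch closed at its far end, which is a simplification of the conical shape used in the source. The speed of sound, the 2 cm depth, and the 8 kHz wideband limit are illustrative assumptions, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, humid air in the vocal tract


def side_branch_zero_hz(depth_m: float, c: float = SPEED_OF_SOUND) -> float:
    """Lowest zero of a uniform closed side branch: f = c / (4 * depth)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return c / (4.0 * depth_m)


def visible_in_band(zero_hz: float, band_limit_hz: float) -> bool:
    """True if the zero falls inside a channel that passes up to band_limit_hz."""
    return zero_hz <= band_limit_hz


# Illustrative depth of 2 cm (an assumption, not a value from the source).
zero = side_branch_zero_hz(0.02)
in_narrowband = visible_in_band(zero, 4000.0)  # fixed-line band from the text
in_wideband = visible_in_band(zero, 8000.0)    # assumed wider channel
```

With these assumed numbers the zero lies above 4 kHz, which is consistent with the statement that a fixed-line channel would not carry it.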

[Speaker registration and authentication by optimizing the shape parameters of the vocal tract model]
The present invention registers and authenticates individuals by combining the individuality factors described above.

Inverse estimation of the vocal tract area function from speech is a difficult problem. One reason is that conventional speech production models do not account for the resonances of the piriform fossae and the laryngeal cavity described above, so the complexity of the high-frequency spectrum cannot be incorporated into the inverse estimation.

FIG. 15 is a conceptual diagram of the speech production model of the present invention.

That is, the present invention models speech production as follows: the sound from the source is shaped by both the resonance of the main vocal tract and the resonance of the hypopharyngeal cavity, and the result is emitted as speech.

Using the model shown in FIG. 15 makes it possible to incorporate the complexity of the high-frequency spectrum into the inverse estimation.

Whereas conventional speech production models represent speech as a linear combination of a sound source and the vocal tract, the speech production model of the present invention adds hypopharyngeal cavity resonance to the source and the main vocal tract resonance. Based on this model, removing the hypopharyngeal resonance component from the speech spectrum allows the area function of the main vocal tract to be estimated accurately.

Specifically, either of the following methods can be used to determine the individuality parameters.

(First method of determining the individuality parameters)
In the first method, the shape parameters of the pharyngeal and oral cavities and the shape parameters of the laryngeal cavity and piriform fossae are obtained from the spectrum of the input speech and adopted directly as the individuality parameters.

(Second method of determining the individuality parameters)
Alternatively, the above parameters are applied to the vocal tract model and optimized so that the transfer function computed from the model matches the spectrum of the input speech; the resulting vocal tract model parameters are adopted as the individuality parameters.

These two methods are described in more detail below.

[Details of the first method of determining the individuality parameters]
Because the main vocal tract resonance and the hypopharyngeal cavity resonance are not linearly related but interact, extracting individuality factors from speech requires finding the individuality parameters by minimizing the error between the transfer function obtained from the vocal tract model and the spectrum of the input speech. Any general error-minimization technique can be used for this optimization.

The method of optimizing the shape parameters of the vocal tract model is described below.

FIG. 16 shows the external shape of the vocal tract model composed of the vocal tract parts described with reference to FIG. 15. In this model, the oral cavity and the pharyngeal cavity are each approximated by an acoustic tube, and the two tubes are connected. Connected to the bottom of the pharyngeal acoustic tube are the piriform fossae, represented by two cones, and the laryngeal cavity, approximated by two small acoustic tubes joined together. Sound from the source enters the model at the bottom of the laryngeal cavity.

FIG. 17 shows the parameters that specify the shape of the three-dimensional vocal tract model shown in FIG. 16.

As shown in FIG. 17, the acoustic tube corresponding to the oral cavity is a cylinder of length Lor and cross-sectional radius Ror whose upper (mouth) end is open. The acoustic tube corresponding to the pharyngeal cavity is a cylinder of length Lph and cross-sectional radius Rph whose upper end is connected to the lower opening of the oral tube. On the lower face of the pharyngeal tube, the connected small acoustic tubes corresponding to the laryngeal cavity are attached at the center, with two conical tubes corresponding to the piriform fossae attached on either side of them. The laryngeal tubes consist of a first cylindrical small acoustic tube of cross-sectional radius Rla1 and length Lla1, connected to the lower face of the pharyngeal tube, and a second cylindrical small acoustic tube of cross-sectional radius Rla2 and length Lla2, connected to the lower face of the first. The lower end of the second small tube is open, and the source sound enters the vocal tract model there.

(Functional configuration of the voice authentication system)
FIG. 18 is a functional block diagram of the voice authentication system 1000, which is implemented by software running on the computer 100.

The basic configuration of the voice authentication system shown in FIG. 18 is the same as that described in Non-Patent Document 1 cited above. In the present invention, however, as explained below, the speaker model is represented by a combination of the parameters of the vocal tract model shown in FIGS. 16 and 17.

The functional configuration of the voice authentication system 1000 is briefly described below.

Referring to FIG. 18, the input speech waveform is first converted to a spectrum by the speech analysis unit 200 at short intervals of about 20 milliseconds. The spectral representation is not particularly limited; cepstrum parameters, for example, can be used. Below, parameters that represent the speech spectrum, such as cepstrum parameters, are called "speech parameters".
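The short-time analysis described above can be sketched as follows. This is a minimal illustration using NumPy, assuming non-overlapping 20 ms frames, a Hamming window, and the real cepstrum (inverse FFT of the log magnitude spectrum); the source does not specify any of these choices, and the function names are hypothetical.

```python
import numpy as np


def frame_signal(x, sample_rate, frame_ms=20.0):
    """Cut a waveform into consecutive, non-overlapping frames of frame_ms."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)


def real_cepstrum(frames, n_coeffs=16):
    """Real cepstrum per frame: inverse FFT of the log magnitude spectrum."""
    window = np.hamming(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(frames * window, axis=1))
    log_magnitude = np.log(magnitude + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_magnitude, axis=1)
    return cepstrum[:, :n_coeffs]  # keep the low-quefrency coefficients
```

Each row of the result is one frame's speech parameters, so a one-second recording at 16 kHz yields a time series of 50 parameter vectors.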

In speaker model registration (learning), the switching unit 204 is set to route processing from the feature extraction unit 202 to the speaker model creation unit 206.

From the time series of speech parameters, the feature extraction unit 202 extracts the values of parameters that express the speaker's characteristics, namely the parameters that define the shape of the vocal tract model described above (hereinafter "vocal tract model shape parameters").

The speaker model creation unit 206 registers each speaker in association with the corresponding vocal tract model shape parameters in a storage device such as the hard disk 124.

The threshold setting unit 210 then examines in advance, from multiple input utterances by the same speaker, how much each speaker's voice varies, and determines the acceptance threshold for judging a voice to be that speaker's own.

In authentication, by contrast, the switching unit 204 is set to route processing from the feature extraction unit 202 to the similarity calculation unit 220.

Thus, in authentication as in learning, the speech analysis unit 200 and the feature extraction unit 202 extract the vocal tract model shape parameters corresponding to the input speech.

The similarity calculation unit 220 compares the vocal tract model shape parameters extracted by the feature extraction unit 202 with each registered speaker model and computes the degree of similarity, for example the distance between the two. If the degree of similarity exceeds the preset threshold, the threshold comparison unit 222 judges the voice to be the registered speaker's and outputs an authentication result accepting it; otherwise, it judges the voice to be someone else's and outputs an authentication result rejecting it.
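The registration, threshold setting, and comparison steps can be sketched as below. The source does not give the similarity measure or the rule for choosing the threshold, so everything here is an assumption: similarity is taken as 1 / (1 + Euclidean distance), the speaker model is the mean of the enrollment vectors, and the threshold is the mean enrollment similarity minus a margin times its standard deviation. All names are hypothetical.

```python
import numpy as np


def similarity(a, b):
    """Map Euclidean distance to a similarity in (0, 1]; larger means closer."""
    d = np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return 1.0 / (1.0 + d)


def enroll(utterance_params, margin=2.0):
    """Build a speaker model and an acceptance threshold from several
    utterances of the same speaker (assumed rule, not from the source)."""
    params = np.asarray(utterance_params, dtype=float)
    model = params.mean(axis=0)
    sims = np.array([similarity(p, model) for p in params])
    threshold = sims.mean() - margin * sims.std()
    return model, threshold


def verify(params, model, threshold):
    """Accept when the similarity exceeds the preset threshold."""
    return bool(similarity(params, model) > threshold)
```

In practice the shape parameters have different units and ranges, so some normalization before computing the distance would likely be needed; that is left out of this sketch.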

That is, as described above, the present invention determines the shape parameters of this vocal tract model from the speaker's voice and uses them for authentication. The shape parameters of the vocal tract model for a given speaker's voice are determined as follows.

(Determining the shape parameters of the vocal tract model for a speaker's voice)
FIG. 19 is a flowchart of the speaker model registration procedure, in which the shape parameters of the vocal tract model for a speaker's voice are determined by the first method and registered.

Referring to FIG. 19, when speaker model registration starts, the vocal tract length is determined first (step S100). The speech analysis unit 200 performs spectral analysis of the speech, and the feature extraction unit 202 determines the vocal tract length from the number of poles that appear in a given frequency band, referring to standard vocal tract lengths obtained in advance by MRI measurement.
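One way to relate a pole count to a vocal tract length is sketched below. It assumes a uniform lossless tube closed at the glottis and open at the lips, whose resonances lie at (2n - 1) c / (4 L); the source does not state which relation it uses. Clipping an MRI-based reference length into the admissible range is likewise only one possible reading of "referring to standard vocal tract lengths". Names and numbers are hypothetical.

```python
def vocal_tract_length_range(n_poles, band_hz, c=350.0):
    """Lengths (m) for which exactly n_poles uniform-tube resonances
    fall at or below band_hz."""
    if n_poles < 1 or band_hz <= 0:
        raise ValueError("need at least one pole and a positive band limit")
    shortest = (2 * n_poles - 1) * c / (4.0 * band_hz)
    longest = (2 * n_poles + 1) * c / (4.0 * band_hz)
    return shortest, longest


def pick_length(n_poles, band_hz, reference_m, c=350.0):
    """Clip a reference length (e.g. an MRI-based standard value) into the
    range that is consistent with the observed pole count."""
    shortest, longest = vocal_tract_length_range(n_poles, band_hz, c)
    return min(max(reference_m, shortest), longest)
```

For example, four poles below 4 kHz admit lengths between roughly 15.3 cm and 19.7 cm under these assumptions, so a 17 cm reference would be kept unchanged.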

Next, the feature extraction unit 202 determines whether the speaker is male or female by thresholding the vocal tract length and the fundamental frequency (step S102). The thresholds are determined experimentally in advance.

Next, the feature extraction unit 202 determines the shape parameters of the laryngeal tube (step S104). The parameters Lla1, Lla2, and Rla2 in FIG. 17 are set with reference to standard values from MRI measurement. Because the laryngeal tube can be regarded as a Helmholtz resonator, Rla1 can be determined from these three parameters and the fourth formant frequency.
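Step S104 can be sketched with the ideal Helmholtz formula f = (c / 2 pi) sqrt(A / (V l)). The mapping of the two small tubes onto the resonator is an assumption made here for illustration: the first tube (Rla1, Lla1) is treated as the neck and the second tube (Rla2, Lla2) as the cavity. End corrections and losses are ignored, and the numeric values in the usage note are placeholders, not MRI data from the source.

```python
import math


def helmholtz_frequency(rla1, lla1, rla2, lla2, c=350.0):
    """Resonance (Hz) of an ideal Helmholtz resonator whose neck is the first
    small tube and whose cavity is the second small tube (assumed mapping)."""
    neck_area = math.pi * rla1 ** 2
    cavity_volume = math.pi * rla2 ** 2 * lla2
    return c / (2.0 * math.pi) * math.sqrt(neck_area / (cavity_volume * lla1))


def solve_rla1(f4_hz, lla1, lla2, rla2, c=350.0):
    """Invert the formula above for the neck radius Rla1, given F4."""
    wavenumber = 2.0 * math.pi * f4_hz / c
    return wavenumber * rla2 * math.sqrt(lla1 * lla2)
```

With placeholder values Lla1 = Lla2 = 1 cm, Rla2 = 8 mm, and F4 = 3500 Hz, the solved neck radius is about 5 mm, and feeding it back into `helmholtz_frequency` reproduces the 3500 Hz input.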

Next, the feature extraction unit 202 determines the shape parameters of the piriform fossae (step S106).

As described above, each piriform fossa is approximated by a cone. To determine its shape parameters, a table is prepared in advance that relates, for example, the base radius and height of a cone to the frequency and bandwidth of the zero that the cone produces. Next, the number of zeros appearing in the speech spectrum at 4 kHz and above is counted; one cone is used if there is one zero, and two cones if there are two. The base radius (parameters Rpr1 and Rpr2 in FIG. 17) and height (Lpr1 and Lpr2 in FIG. 17) of each cone are then determined by inverse table lookup from the frequency and bandwidth of the corresponding zero.
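The inverse table lookup of step S106 might look like the sketch below. The table entries are made-up placeholder values chosen only to show the mechanics; the source does not publish its table. Nearest-neighbor matching on normalized frequency and bandwidth is also an assumption, as are all names.

```python
import numpy as np

# Hypothetical table, one row per cone:
# (base_radius_m, height_m, zero_frequency_hz, zero_bandwidth_hz).
# Placeholder numbers for illustration only.
CONE_TABLE = np.array([
    [0.004, 0.015, 5200.0, 180.0],
    [0.005, 0.018, 4600.0, 220.0],
    [0.006, 0.020, 4200.0, 260.0],
    [0.005, 0.022, 4000.0, 240.0],
])


def lookup_cone(zero_freq_hz, zero_bw_hz, table=CONE_TABLE):
    """Return (base_radius, height) of the table row whose zero is closest
    to the observed one, after scaling each acoustic column by its spread."""
    acoustic = table[:, 2:4]
    scale = acoustic.std(axis=0)
    diff = (acoustic - np.array([zero_freq_hz, zero_bw_hz])) / scale
    row = table[np.argmin(np.sum(diff ** 2, axis=1))]
    return float(row[0]), float(row[1])


def piriform_parameters(zeros):
    """One cone per detected zero at 4 kHz and above, at most two cones."""
    high = [(f, bw) for f, bw in zeros if f >= 4000.0]
    return [lookup_cone(f, bw) for f, bw in high[:2]]
```

A denser table with interpolation between rows would give smoother estimates; nearest-neighbor matching is used here only to keep the example short.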

Next, the feature extraction unit 202 determines the shape parameters of the oral and pharyngeal cavities (step S108).

Here, the vocal tract length is divided into two equal parts to form a two-section vocal tract model consisting of the pharyngeal cavity and the oral cavity, and the cross-sectional areas of the two cavities are obtained by analyzing the lower formants.
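As a worked illustration of how a low formant constrains the two-section model, the sketch below uses the standard lossless two-tube result. For a pharyngeal tube closed at the glottis and an oral tube open at the lips, each of length l, the resonances satisfy tan^2(2 pi f l / c) = A_oral / A_pharynx. This relation fixes only the area ratio, and the source does not say that it uses this exact formula, so it should be read as an assumption. Function names are hypothetical.

```python
import math


def area_ratio_from_f1(f1_hz, tract_length_m, c=350.0):
    """Oral-to-pharyngeal area ratio implied by F1 in an equal-length,
    lossless two-tube model."""
    half = tract_length_m / 2.0
    phase = 2.0 * math.pi * f1_hz * half / c
    if not 0.0 < phase < math.pi / 2.0:
        raise ValueError("F1 is outside the range this model can represent")
    return math.tan(phase) ** 2


def f1_from_area_ratio(ratio, tract_length_m, c=350.0):
    """Inverse of area_ratio_from_f1: F1 for a given area ratio."""
    half = tract_length_m / 2.0
    return c / (2.0 * math.pi * half) * math.atan(math.sqrt(ratio))
```

For a 17.5 cm tract, F1 = 500 Hz gives a ratio of 1 (a uniform tube), while a higher F1 implies an oral section wider than the pharyngeal one. Absolute areas need further information beyond this ratio.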

The feature extraction unit 202 then completes the vocal tract model by adding the hypopharyngeal cavity to the two-section model obtained in step S108 (step S110).

Next, the speaker model creation unit 206 registers the shape parameters of the vocal tract model obtained in step S110 in the storage device as the individuality parameters of the speaker (step S112).

This completes speaker model registration based on the first method of determining the individuality parameters.

[Details of the second method of determining the individuality parameters]
Next, the second method of determining the individuality parameters, and speaker model registration based on it, are described.

(First example of the second method of determining the individuality parameters)
Starting from the vocal tract model obtained in step S108 of FIG. 19, a four-section vocal tract model is created by dividing the pharyngeal cavity and the oral cavity each into two equal parts. As initial values, each half has the same cross-sectional area as the section it was split from. The shape parameters of the four sections and of the hypopharyngeal cavity are then optimized to minimize the difference between the transfer function of this four-section model and the input spectrum. If necessary, the number of divisions can be increased further to give an eight-section model. This optimization determines the cross-sectional area of each section individually. The resulting vocal tract model shape parameters are used as the individuality parameters of the speaker. This method of determining the shape parameters is used both at registration (learning) and at authentication.

The number of sections into which the pharyngeal and oral acoustic tubes are divided is not limited to the two or four described above. Each tube may be divided into any other number of sections, as long as the resulting degrees of freedom still allow the difference between the transfer function and the input spectrum to be minimized computationally.

(Second example of the second method of determining the individuality parameters)
FIG. 20 is a flowchart showing the procedure of the second example of the second method of determining the individuality parameters.

First, the feature extraction unit 202 obtains the vocal tract area function from the speech (step S200). This can be done, for example, by PARCOR analysis.
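A minimal sketch of PARCOR-based area estimation is given below. It computes reflection (PARCOR) coefficients with the Levinson-Durbin recursion and converts them to relative areas with A[i+1] = A[i] (1 - k[i]) / (1 + k[i]). Sign conventions and the direction of the recursion (lips to glottis or the reverse) differ between textbooks, so the convention used here is an assumption, and only area ratios are meaningful. Pre-emphasis and source or radiation compensation are omitted, and the function names are hypothetical.

```python
import numpy as np


def autocorrelation(frame, order):
    """Autocorrelation lags 0..order of a Hamming-windowed frame."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])


def parcor(r, order):
    """Reflection coefficients from autocorrelation via Levinson-Durbin."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k[i - 1] * prev[i - 1:0:-1]
        a[i] = k[i - 1]
        err *= 1.0 - k[i - 1] ** 2
    return k


def area_function(k, first_area=1.0):
    """Relative section areas from reflection coefficients (assumed sign)."""
    areas = [first_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)
```

The analysis order sets the number of tube sections, so a higher order yields a finer area function.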

Next, the feature extraction unit 202 obtains the shape parameters of the laryngeal cavity from the speech spectrum in the same way as in step S104 of FIG. 19 (step S202).

The feature extraction unit 202 then corrects the portions of the vocal tract area function obtained in step S200 that correspond to the oral and pharyngeal cavities, using the correlations between the lower formants and the mean oral cavity area and between the lower formants and the mean pharyngeal cavity area (step S204).

The vocal tract area function obtained in step S200 contains no side branches. The feature extraction unit 202 therefore obtains the shape parameters of the piriform fossae, approximated by cones, from the speech spectrum in the same way as in step S106 of FIG. 19 (step S206).

Using the shape parameters of the laryngeal cavity, oral cavity, pharyngeal cavity, and piriform fossae obtained above as initial values, the feature extraction unit 202 creates a vocal tract model like that shown in FIG. 17 (step S208). The number of sections into which the oral and pharyngeal cavities are divided, and hence the precision, depends on the order of the PARCOR analysis.

The feature extraction unit 202 then computes the transfer function of this vocal tract model and adjusts the model's shape parameters until the error between the transfer function and the speech spectrum is minimized (step S210).
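The compute-and-adjust loop of step S210 can be sketched for a simplified model. The code below covers only a chain of cylindrical sections (glottis to lips) with an ideal open end at the lips, using acoustic chain (ABCD) matrices. The piriform side branches and the laryngeal tubes of FIG. 17 are left out, and the small constant loss term, the log-spectral error, and the coordinate search are all assumptions standing in for whatever error-minimization technique is actually used. Names are hypothetical.

```python
import numpy as np

RHO_C = 1.2 * 350.0  # air density times sound speed (assumed values)


def transfer_gain(sections, freqs_hz, c=350.0, alpha=0.5):
    """|U_lips / U_glottis| for chained (length_m, area_m2) tube sections.
    alpha is a small assumed loss that keeps the resonance peaks finite."""
    k = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / c - 1j * alpha
    row_c = np.zeros_like(k)  # lower row of the accumulated chain matrix
    row_d = np.ones_like(k)
    for length, area in sections:
        z = RHO_C / area
        cs, sn = np.cos(k * length), np.sin(k * length)
        row_c, row_d = (row_c * cs + row_d * (1j * sn / z),
                        row_c * (1j * z * sn) + row_d * cs)
    return 1.0 / np.abs(row_d)


def log_spectral_error(sections, freqs_hz, target_gain):
    """Mean squared difference of log gains between model and target."""
    model = transfer_gain(sections, freqs_hz)
    return float(np.mean((np.log(model) - np.log(target_gain)) ** 2))


def fit_areas(sections, freqs_hz, target_gain, step=0.2, rounds=20):
    """Coordinate search: scale one area at a time, keep only improvements,
    and halve the step when a full pass brings no improvement."""
    sections = [tuple(s) for s in sections]
    best = log_spectral_error(sections, freqs_hz, target_gain)
    for _ in range(rounds):
        improved = False
        for i in range(len(sections)):
            length, area = sections[i]
            for factor in (1.0 + step, 1.0 / (1.0 + step)):
                trial = list(sections)
                trial[i] = (length, area * factor)
                err = log_spectral_error(trial, freqs_hz, target_gain)
                if err < best:
                    sections, best, improved = trial, err, True
                    break
        if not improved:
            step *= 0.5
    return sections, best
```

In this simplified chain only the area ratios between neighboring sections affect the gain, so the fitted areas are determined up to a common scale factor.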

The shape parameters of the vocal tract model obtained in step S210 are registered in the storage device as the individuality parameters of the speaker (step S212).

This method of determining the shape parameters is likewise used not only at registration (learning) but also at authentication.

In this way, when a speaker is registered in a voice authentication system such as that shown in FIG. 18, the individuality parameters of the speaker are determined from the speech by the first or second method and registered. When a speaker is verified, the individuality parameters are determined from the input speech and compared with those of the registered speakers to identify the speaker of the input speech.

With this configuration, voice authentication can be tied more closely to an individual's physical characteristics, which can improve its accuracy.

The embodiments disclosed here should be regarded as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

FIG. 1 is a conceptual diagram showing an example of the voice authentication device 1000 for carrying out the voice authentication method of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the computer 100.
FIG. 3 shows the correspondence between features of the speech spectrum and parts of the vocal tract.
FIG. 4 is a conceptual diagram showing a midsagittal section of the speech production system.
FIG. 5 shows the changes in the cross-sectional areas of the pharyngeal and oral cavities in a two-section model of the vocal tract and their theoretical relationship to the lower formants.
FIG. 6 shows the correlation between the mean pharyngeal cavity area and the first formant frequency, using values measured by MRI.
FIG. 7 shows the correlation between the mean oral cavity area and the first formant frequency, using values measured by MRI.
FIG. 8 illustrates the shape of the laryngeal cavity.
FIG. 9 shows the three-dimensional shapes of the hypopharyngeal cavities of three subjects.
FIG. 10 shows, for the three subjects of FIG. 9, the cross-sectional shape of each part of the hypopharyngeal cavity during utterance of each vowel, with distance from the glottis as a parameter.
FIG. 11 shows the speech spectrum of the vowel "e" (え).
FIG. 12 shows the fourth formant frequencies of speakers ア through コ.
FIG. 13 shows the position of the piriform fossae in the hypopharyngeal cavity.
FIG. 14 shows the positions of the zeros caused by the piriform fossae in the speech spectrum of the vowel "e" (え).
FIG. 15 is a conceptual diagram of the speech production model of the present invention.
FIG. 16 shows the external shape of the vocal tract model composed of the vocal tract parts described with reference to FIG. 15.
FIG. 17 shows the parameters that specify the shape of the three-dimensional vocal tract model shown in FIG. 16.
FIG. 18 is a functional block diagram of the voice authentication system 1000 implemented by software running on the computer 100.
FIG. 19 is a flowchart of the speaker model registration procedure for determining and registering the shape parameters of the vocal tract model for a speaker's voice.
FIG. 20 is a flowchart showing the procedure of the second example of the second method of determining the individuality parameters.

Explanation of symbols

100 computer, 102 computer main body, 104 display, 106 FD drive, 108 CD-ROM drive, 110 keyboard, 112 mouse, 116 flexible disk, 118 CD-ROM, 120 CPU, 122 memory, 124 hard disk, 128 communication interface, 132 microphone, 134 speaker, 300 partner computer, 310 network, 1000 voice authentication system.

Claims

Feature extraction means for determining a shape parameter of a vocal tract model based on a voice input from a person to be authenticated,
The vocal tract model is
A first acoustic tube portion corresponding to the oral cavity;
A second acoustic tube portion coupled to the first acoustic tube portion and corresponding to the pharyngeal cavity;
A small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity;
At least one conical tube coupled to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa;
At the time of learning, it further comprises storage means for storing the shape parameter determined by the feature extraction means in association with the person to be authenticated as a registered shape parameter,
The feature extraction means determines a shape parameter of the vocal tract model as an authentication shape parameter based on a voice input from a speaker at the time of authentication,
A voice authentication device further comprising similarity comparison means for comparing the authentication shape parameter with the registered shape parameter in order to specify whether or not the speaker is the registered person to be authenticated.

The feature extraction means includes
An initial value determining means for determining an initial value of the shape parameter based on the voice input;
The voice authentication device according to claim 1, further comprising a correction unit that corrects the shape parameter so as to minimize a difference between a transfer function of the vocal tract model based on the initial value and an input spectrum of the voice input.

The first acoustic tube portion includes a plurality of first acoustic tubes connected to each other,
The voice authentication device according to claim 2, wherein the second acoustic tube portion includes a plurality of second acoustic tubes connected to each other.

A voice authentication method comprising the step of determining, at the time of learning, a shape parameter of a vocal tract model based on a voice input from a person to be authenticated,
wherein the vocal tract model includes:
A first acoustic tube portion corresponding to the oral cavity;
A second acoustic tube portion coupled to the first acoustic tube portion and corresponding to the pharyngeal cavity;
A small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity; and
At least one conical tube coupled to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa,
the voice authentication method further comprising the steps of:
Storing the shape parameter determined at the time of learning in a storage device as a registered shape parameter in association with the person to be authenticated;
Determining a shape parameter of the vocal tract model as an authentication shape parameter based on a voice input from a speaker at the time of authentication; and
Specifying whether or not the speaker is the registered person to be authenticated based on a comparison result between the authentication shape parameter and the registered shape parameter.
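
The storing, determining, and specifying steps of the method can be sketched as follows, assuming the shape parameters have already been extracted as plain lists of numbers. The similarity measure (one over one plus the Euclidean distance), the threshold value of 0.8, and all class and function names are hypothetical choices for this sketch; the claims only require that the decision be based on a comparison result.

```python
import math
from typing import Dict, List, Optional, Sequence


class RegisteredShapeStore:
    """In-memory stand-in for the storage device (hypothetical)."""

    def __init__(self) -> None:
        self._registered: Dict[str, List[float]] = {}

    def register(self, person_id: str, shape_params: Sequence[float]) -> None:
        # Learning time: keep the parameters in association with the person.
        self._registered[person_id] = list(shape_params)

    def get(self, person_id: str) -> Optional[List[float]]:
        return self._registered.get(person_id)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure in (0, 1]; 1.0 means identical vectors."""
    if len(a) != len(b):
        raise ValueError("shape parameter vectors must have the same length")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def authenticate(store: RegisteredShapeStore,
                 claimed_id: str,
                 authentication_params: Sequence[float],
                 threshold: float = 0.8) -> bool:
    """Authentication time: accept only if similarity reaches the threshold.

    The threshold is an arbitrary illustrative value, not one from the source.
    """
    registered = store.get(claimed_id)
    if registered is None:
        return False  # nobody was registered under this identifier
    return similarity(authentication_params, registered) >= threshold


store = RegisteredShapeStore()
store.register("alice", [1.0, 2.0, 3.0])                       # learning
accepted = authenticate(store, "alice", [1.0, 2.0, 3.0])       # same parameters
rejected = authenticate(store, "alice", [5.0, 2.0, 3.0])       # distant parameters
```

In practice the threshold would be tuned on enrollment data to balance false acceptances against false rejections.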

The voice authentication method according to claim 4, wherein the step of determining the shape parameter of the vocal tract model includes the steps of:
Determining an initial value of the shape parameter based on the voice input; and
Correcting the shape parameter so as to minimize a difference between a transfer function of the vocal tract model based on the initial value and an input spectrum of the voice input.

The voice authentication method according to claim 5, wherein the first acoustic tube portion includes a plurality of first acoustic tubes connected to each other, and
the second acoustic tube portion includes a plurality of second acoustic tubes connected to each other.

A voice authentication program for causing a computer to execute voice authentication processing,
wherein the voice authentication processing includes the step of determining, at the time of learning, a shape parameter of a vocal tract model based on a voice input from a person to be authenticated,
the vocal tract model includes:
A first acoustic tube portion corresponding to the oral cavity;
A second acoustic tube portion coupled to the first acoustic tube portion and corresponding to the pharyngeal cavity;
A small acoustic tube connected to the bottom surface of the second acoustic tube portion and corresponding to the laryngeal cavity; and
At least one conical tube coupled to the bottom surface of the second acoustic tube portion and corresponding to the piriform fossa,
and the voice authentication processing further includes the steps of:
Storing the shape parameter determined at the time of learning in a storage device as a registered shape parameter in association with the person to be authenticated;
Determining a shape parameter of the vocal tract model as an authentication shape parameter based on a voice input from a speaker at the time of authentication; and
Specifying whether or not the speaker is the registered person to be authenticated based on a comparison result between the authentication shape parameter and the registered shape parameter.

The voice authentication program according to claim 7, wherein the step of determining the shape parameter of the vocal tract model includes the steps of:
Determining an initial value of the shape parameter based on the voice input; and
Correcting the shape parameter so as to minimize a difference between a transfer function of the vocal tract model based on the initial value and an input spectrum of the voice input.

The voice authentication program according to claim 8, wherein the first acoustic tube portion includes a plurality of first acoustic tubes connected to each other, and
the second acoustic tube portion includes a plurality of second acoustic tubes connected to each other.